[review] Mitigate the Gap: Improving Cross-Modal Alignment in CLIP (ICLR 2025)
This paper addresses the modality gap issue in CLIP (Contrastive Language-Image Pretraining). CLIP aims to embed images and texts into a shared representation space, but studies have shown that the two modalities tend to be segregated within this space. This hinders performance in cross-modal tasks such as image-text retrieval and zero-shot learning.
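To make the gap concrete, here is a minimal sketch of how it can be quantified: the average cosine similarity between matched image-text pairs, and the distance between the two modality centroids. The PyTorch code and function name are illustrative choices for this review, not the paper's code.

```python
import torch
import torch.nn.functional as F

def modality_gap_stats(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Quantify the modality gap for matched (image, text) embedding pairs.

    img_emb, txt_emb: (N, D) tensors; row i of each belongs to the same pair.
    """
    # L2-normalize so cosine similarity reduces to a dot product.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Average cosine similarity of matched pairs: higher = better aligned.
    pair_cos = (img * txt).sum(dim=-1).mean()

    # Distance between modality centroids: a large value means the two
    # modalities occupy separate regions of the shared space.
    centroid_gap = (img.mean(dim=0) - txt.mean(dim=0)).norm()

    return pair_cos.item(), centroid_gap.item()
```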
The paper proposes AlignCLIP as a solution to reduce the modality gap and improve performance. AlignCLIP introduces two key approaches:
- Parameter Sharing: Using a single transformer encoder and projection layer for both the vision and language inputs, so that both modalities are mapped by the same weights.
- Intra-Modality Separation (IMSep): Spreading out embeddings within the same modality (image-image, text-text) so that semantically distinct samples are pushed apart instead of collapsing together.
Problems and Solutions
1. Problem: Modality Gap in CLIP
- CLIP struggles to embed images and texts in a truly shared space, as the two modalities cluster in separate subregions.
- This leads to a performance drop in cross-modal retrieval tasks.
- The root cause is a combination of model initialization, which places each encoder's outputs in a narrow cone of the embedding space from the start, and contrastive loss optimization, which preserves that separation during training.
Solution
- AlignCLIP introduces parameter sharing, where the transformer and projection layers are shared between vision and language encoders.
- Result: Increased cross-modal alignment, as evidenced by improved average cosine similarity.
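To illustrate the idea, here is a minimal sketch of a shared-encoder design, assuming only the input embeddings stay modality-specific; the class name, layer sizes, and mean pooling are simplifications for this review, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderCLIP(nn.Module):
    """One transformer and one projection serve BOTH modalities.

    Only the input embedding layers are modality-specific; every
    downstream weight is updated by image and text batches alike.
    """
    def __init__(self, dim=512, depth=6, heads=8, vocab=49408):
        super().__init__()
        # Modality-specific input embeddings.
        self.patch_embed = nn.Linear(768, dim)       # flattened image patches -> dim
        self.token_embed = nn.Embedding(vocab, dim)  # text token ids -> dim
        # Shared weights: a single encoder and a single projection.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, dim, bias=False)

    def encode_image(self, patches):   # patches: (B, num_patches, 768)
        x = self.encoder(self.patch_embed(patches))
        return self.proj(x.mean(dim=1))  # mean-pooled image embedding

    def encode_text(self, tokens):     # tokens: (B, L) int ids
        x = self.encoder(self.token_embed(tokens))
        return self.proj(x.mean(dim=1))  # mean-pooled text embedding
```

Because the same weights process both modalities, their outputs are biased toward a common region of the space, which is what narrows the gap.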
2. Problem: Over-concentration of Intra-Modality Embeddings
- CLIP not only separates images and texts but also over-concentrates embeddings within each modality.
- This results in semantically different samples being placed too close together, reducing fine-grained retrieval accuracy.
Solution
- IMSep (Intra-Modality Separation) is applied to push apart semantically different embeddings within the same modality.
- Result: More balanced intra-modality embedding distributions, improving cross-modal retrieval performance.
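One plausible formulation of such a loss is sketched below; the weighting scheme (down-weighting the repulsion between semantically similar pairs, with similarity scores taken from, e.g., a frozen sentence encoder over the captions) is an assumption of this review and may differ from the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def imsep_loss(emb: torch.Tensor, sem_sim: torch.Tensor) -> torch.Tensor:
    """Intra-modality separation: push apart embeddings of the SAME modality.

    emb:     (N, D) embeddings from one modality (all images, or all texts).
    sem_sim: (N, N) semantic similarity in [0, 1] between samples
             (assumed here to come from a frozen sentence encoder over
             the captions; the paper's weighting may differ).
    """
    z = F.normalize(emb, dim=-1)
    cos = z @ z.t()                                   # (N, N) pairwise cosine
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    # Penalize similarity between DIFFERENT samples, scaled by semantic
    # dissimilarity: unrelated pairs are pushed apart hard, while
    # genuinely related pairs are left mostly alone.
    weights = (1.0 - sem_sim)[off_diag]
    return (weights * cos[off_diag]).mean()
```

The full objective would then add IMSep terms for each modality to the standard CLIP contrastive loss, e.g. `loss = clip_loss + lam * (imsep_loss(img_emb, S) + imsep_loss(txt_emb, S))`, where `lam` is a hypothetical weighting hyperparameter.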
3. Problem: Limitations of Previous Approaches
- Earlier research attempted to reduce the modality gap with simple geometric transformations, such as shifting every embedding of one modality by a constant offset, but such distance-based corrections distort the meaningful structure of the embedding space.
Solution
- AlignCLIP uses a semantic-based alignment approach rather than simple distance-based transformations.
- Result: Maintains the natural embedding structure while improving alignment.
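For contrast, here is a sketch of the kind of post-hoc, distance-based fix that prior work explored and AlignCLIP avoids: translating all embeddings of one modality by the centroid-difference vector. It closes the measured gap while leaving intra-modality semantics untouched.

```python
import torch
import torch.nn.functional as F

def naive_gap_shift(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Close the gap by a constant translation (the fix AlignCLIP avoids).

    Every text embedding moves by the same vector, so wrong neighbors
    inside each modality stay wrong, and re-normalizing after the shift
    distorts the embedding geometry instead of improving it.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    gap = img.mean(dim=0) - txt.mean(dim=0)      # centroid-difference vector
    shifted_txt = F.normalize(txt + gap, dim=-1)
    return img, shifted_txt
```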
Experimental Results
The paper compares CLIP, SharedCLIP (which applies only parameter sharing), and AlignCLIP (which adds IMSep on top of parameter sharing). A sketch of the standard evaluation protocols follows the results below.
- Cross-Modal Alignment
- AlignCLIP increased average cosine similarity (0.42 → 0.64).
- Significant improvements in cross-modal retrieval performance (higher R@1 scores).
- Zero-Shot Classification Performance
- AlignCLIP improved CIFAR-10 accuracy by 8%.
- Minor improvements in ImageNet-1K.
- Linear Probing Classification Performance
- 1.5% improvement on ImageNet-1K.
- 4.8% improvement on CIFAR-100.
- Robustness to Distribution Shift
- AlignCLIP outperformed both CLIP and SharedCLIP on ImageNet-R, ImageNet-A, and other datasets.
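For reference, a minimal sketch of the two headline evaluation protocols used above, written as standard CLIP-style evaluation rather than the paper's own code: image-to-text R@1 over matched pairs, and zero-shot classification by nearest class-prompt embedding.

```python
import torch
import torch.nn.functional as F

def image_to_text_r_at_1(img_emb, txt_emb):
    """R@1 for matched pairs: row i of img_emb corresponds to row i of txt_emb."""
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
    top1 = sims.argmax(dim=1)  # index of the best-matching text per image
    return (top1 == torch.arange(len(sims), device=top1.device)).float().mean().item()

def zero_shot_accuracy(img_emb, class_emb, labels):
    """Zero-shot classification: predict the nearest class-prompt embedding.

    class_emb: (C, D) text embeddings of prompts like "a photo of a {class}".
    labels:    (N,) ground-truth class indices.
    """
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(class_emb, dim=-1).t()
    return (sims.argmax(dim=1) == labels).float().mean().item()
```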
Conclusion
AlignCLIP successfully reduces the modality gap while maintaining or improving downstream performance. Specifically:
- Parameter sharing strengthens cross-modal alignment.
- IMSep enhances intra-modality distinction, leading to more meaningful semantic organization.
- Unlike naive transformations, AlignCLIP preserves the natural structure of the embedding space, ensuring robust improvements.
This study suggests that semantic-based structural modifications are more effective than simple distance-based corrections in mitigating the modality gap, setting a new direction for improving CLIP and similar models.