[review] Mitigate the Gap: Improving Cross-Modal Alignment in CLIP (ICLR 2025)
This paper addresses the modality gap issue in CLIP (Contrastive Language-Image Pretraining). CLIP aims to embed images and texts into a shared representation space, but studies have shown that the two modalities tend to be segregated within this space. This hinders performance in cross-modal tasks such as image-text retrieval and zero-shot learning.
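To make the gap concrete, here is a minimal sketch of how it can be quantified: the average cosine similarity between matched image-text pairs, and the distance between the two modality centroids. The PyTorch code and function name are illustrative choices for this review, not the paper's code.

```python
import torch
import torch.nn.functional as F

def modality_gap_stats(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Quantify the modality gap for matched (image, text) embedding pairs.

    img_emb, txt_emb: (N, D) tensors; row i of each belongs to the same pair.
    """
    # L2-normalize so cosine similarity reduces to a dot product.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Average cosine similarity of matched pairs: higher = better aligned.
    pair_cos = (img * txt).sum(dim=-1).mean()

    # Distance between modality centroids: a large value means the two
    # modalities occupy separate regions of the shared space.
    centroid_gap = (img.mean(dim=0) - txt.mean(dim=0)).norm()

    return pair_cos.item(), centroid_gap.item()
```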
The paper proposes AlignCLIP as a solution to reduce the modality gap and improve performance. AlignCLIP introduces two key approaches:
- Parameter Sharing: Using a single transformer encoder and projection layer for both the vision and language inputs, so that both modalities are mapped by the same weights.
- Intra-Modality Separation (IMSep): Spreading out embeddings within the same modality (image-image, text-text) so that semantically distinct samples are pushed apart instead of collapsing together.
Problems and Solutions
1. Problem: Modality Gap in CLIP
- CLIP struggles to embed images and texts in a truly shared space, as the two modalities cluster in separate subregions.
- This leads to a performance drop in cross-modal retrieval tasks.
- The root cause is a combination of model initialization, which places each encoder's outputs in a narrow cone of the embedding space from the start, and contrastive loss optimization, which preserves that separation during training.
Solution
- AlignCLIP introduces parameter sharing, where the transformer and projection layers are shared between vision and language encoders.
- Result: Increased cross-modal alignment, as evidenced by improved average cosine similarity.
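To illustrate the idea, here is a minimal sketch of a shared-encoder design, assuming only the input embeddings stay modality-specific; the class name, layer sizes, and mean pooling are simplifications for this review, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderCLIP(nn.Module):
    """One transformer and one projection serve BOTH modalities.

    Only the input embedding layers are modality-specific; every
    downstream weight is updated by image and text batches alike.
    """
    def __init__(self, dim=512, depth=6, heads=8, vocab=49408):
        super().__init__()
        # Modality-specific input embeddings.
        self.patch_embed = nn.Linear(768, dim)       # flattened image patches -> dim
        self.token_embed = nn.Embedding(vocab, dim)  # text token ids -> dim
        # Shared weights: a single encoder and a single projection.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, dim, bias=False)

    def encode_image(self, patches):   # patches: (B, num_patches, 768)
        x = self.encoder(self.patch_embed(patches))
        return self.proj(x.mean(dim=1))  # mean-pooled image embedding

    def encode_text(self, tokens):     # tokens: (B, L) int ids
        x = self.encoder(self.token_embed(tokens))
        return self.proj(x.mean(dim=1))  # mean-pooled text embedding
```

Because the same weights process both modalities, their outputs are biased toward a common region of the space, which is what narrows the gap.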
2. Problem: Over-concentration of Intra-Modality Embeddings
- CLIP not only separates images and texts but also over-concentrates embeddings within each modality.
- This results in semantically different samples being placed too close together, reducing fine-grained retrieval accuracy.
Solution
- IMSep (Intra-Modality Separation) is applied to push apart semantically different embeddings within the same modality.
- Result: More balanced intra-modality embedding distributions, improving cross-modal retrieval performance.
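One plausible formulation of such a loss is sketched below; the weighting scheme (down-weighting the repulsion between semantically similar pairs, with similarity scores taken from, e.g., a frozen sentence encoder over the captions) is an assumption of this review and may differ from the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def imsep_loss(emb: torch.Tensor, sem_sim: torch.Tensor) -> torch.Tensor:
    """Intra-modality separation: push apart embeddings of the SAME modality.

    emb:     (N, D) embeddings from one modality (all images, or all texts).
    sem_sim: (N, N) semantic similarity in [0, 1] between samples
             (assumed here to come from a frozen sentence encoder over
             the captions; the paper's weighting may differ).
    """
    z = F.normalize(emb, dim=-1)
    cos = z @ z.t()                                   # (N, N) pairwise cosine
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    # Penalize similarity between DIFFERENT samples, scaled by semantic
    # dissimilarity: unrelated pairs are pushed apart hard, while
    # genuinely related pairs are left mostly alone.
    weights = (1.0 - sem_sim)[off_diag]
    return (weights * cos[off_diag]).mean()
```

The full objective would then add IMSep terms for each modality to the standard CLIP contrastive loss, e.g. `loss = clip_loss + lam * (imsep_loss(img_emb, S) + imsep_loss(txt_emb, S))`, where `lam` is a hypothetical weighting hyperparameter.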
3. Problem: Limitations of Previous Approaches
- Earlier research attempted to reduce the modality gap with simple geometric transformations, such as shifting every embedding of one modality by a constant offset, but such distance-based corrections distort the meaningful structure of the embedding space.
Solution
- AlignCLIP uses a semantic-based alignment approach rather than simple distance-based transformations.
- Result: Maintains the natural embedding structure while improving alignment.
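For contrast, here is a sketch of the kind of post-hoc, distance-based fix that prior work explored and AlignCLIP avoids: translating all embeddings of one modality by the centroid-difference vector. It closes the measured gap while leaving intra-modality semantics untouched.

```python
import torch
import torch.nn.functional as F

def naive_gap_shift(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Close the gap by a constant translation (the fix AlignCLIP avoids).

    Every text embedding moves by the same vector, so wrong neighbors
    inside each modality stay wrong, and re-normalizing after the shift
    distorts the embedding geometry instead of improving it.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    gap = img.mean(dim=0) - txt.mean(dim=0)      # centroid-difference vector
    shifted_txt = F.normalize(txt + gap, dim=-1)
    return img, shifted_txt
```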
Experimental Results
The paper compares CLIP, SharedCLIP (which applies only parameter sharing), and AlignCLIP (which adds IMSep on top of parameter sharing). A sketch of the standard evaluation protocols follows the results below.
- Cross-Modal Alignment
- AlignCLIP increased average cosine similarity (0.42 → 0.64).
- Significant improvements in cross-modal retrieval performance (higher R@1 scores).
- Zero-Shot Classification Performance
- AlignCLIP improved CIFAR-10 accuracy by 8%.
- Minor improvements in ImageNet-1K.
- Linear Probing Classification Performance
- 1.5% improvement on ImageNet-1K.
- 4.8% improvement on CIFAR-100.
- Robustness to Distribution Shift
- AlignCLIP outperformed both CLIP and SharedCLIP on ImageNet-R, ImageNet-A, and other datasets.
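For reference, a minimal sketch of the two headline evaluation protocols used above, written as standard CLIP-style evaluation rather than the paper's own code: image-to-text R@1 over matched pairs, and zero-shot classification by nearest class-prompt embedding.

```python
import torch
import torch.nn.functional as F

def image_to_text_r_at_1(img_emb, txt_emb):
    """R@1 for matched pairs: row i of img_emb corresponds to row i of txt_emb."""
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
    top1 = sims.argmax(dim=1)  # index of the best-matching text per image
    return (top1 == torch.arange(len(sims), device=top1.device)).float().mean().item()

def zero_shot_accuracy(img_emb, class_emb, labels):
    """Zero-shot classification: predict the nearest class-prompt embedding.

    class_emb: (C, D) text embeddings of prompts like "a photo of a {class}".
    labels:    (N,) ground-truth class indices.
    """
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(class_emb, dim=-1).t()
    return (sims.argmax(dim=1) == labels).float().mean().item()
```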
Conclusion
AlignCLIP successfully reduces the modality gap while maintaining or improving downstream performance. Specifically:
- Parameter sharing strengthens cross-modal alignment.
- IMSep enhances intra-modality distinction, leading to more meaningful semantic organization.
- Unlike naive transformations, AlignCLIP preserves the natural structure of the embedding space, ensuring robust improvements.
This study suggests that semantic-based structural modifications are more effective than simple distance-based corrections in mitigating the modality gap, setting a new direction for improving CLIP and similar models.