1 minute read

Mitigate the Gap: Improving Cross-Modal Alignment in CLIP (ICLR 2025)

This paper addresses the modality gap issue in CLIP (Contrastive Language-Image Pretraining). CLIP aims to embed images and texts into a shared representation space, but studies have shown that the two modalities tend to be segregated within this space. This hinders performance in cross-modal tasks such as image-text retrieval and zero-shot learning.

The paper proposes AlignCLIP as a solution to reduce the modality gap and improve performance. AlignCLIP introduces two key approaches:

  1. Parameter Sharing: Sharing the transformer encoder and projection layer between vision and language encoders to increase consistency.
  2. Intra-Modality Separation (IMSep): Expanding the embeddings within the same modality (image-image, text-text) to ensure semantically distinct samples are separated.

Problems and Solutions

1. Problem: Modality Gap in CLIP

  • CLIP struggles to embed images and texts in a truly shared space, as the two modalities cluster in separate subregions.
  • This leads to a performance drop in cross-modal retrieval tasks.
  • The root cause is a combination of model initialization and contrastive loss optimization.

Solution

  • AlignCLIP introduces parameter sharing, where the transformer and projection layers are shared between vision and language encoders.
  • Result: Increased cross-modal alignment, as evidenced by improved average cosine similarity.

2. Problem: Over-concentration of Intra-Modality Embeddings

  • CLIP not only separates images and texts but also over-concentrates embeddings within each modality.
  • This results in semantically different samples being placed too close together, reducing fine-grained retrieval accuracy.

Solution

  • IMSep (Intra-Modality Separation) is applied to push apart semantically different embeddings within the same modality.
  • Result: More balanced intra-modality embedding distributions, improving cross-modal retrieval performance.

3. Problem: Limitations of Previous Approaches

  • Earlier research attempted to reduce the modality gap through simple isomorphic transformations, but this distorts the meaningful structure of the embedding space.

Solution

  • AlignCLIP uses a semantic-based alignment approach rather than simple distance-based transformations.
  • Result: Maintains the natural embedding structure while improving alignment.

Experimental Results

The paper compares CLIP, SharedCLIP (which applies only parameter sharing), and AlignCLIP (which includes both parameter sharing and IMSep).

  • Cross-Modal Alignment
    • AlignCLIP increased average cosine similarity (0.42 → 0.64).
    • Significant improvements in cross-modal retrieval performance (higher R@1 scores).
  • Zero-Shot Classification Performance
    • AlignCLIP improved CIFAR-10 accuracy by 8%.
    • Minor improvements in ImageNet-1K.
  • Linear Probing Classification Performance
    • 1.5% improvement on ImageNet-1K.
    • 4.8% improvement on CIFAR-100.
  • Robustness to Distribution Shift
    • AlignCLIP outperformed both CLIP and SharedCLIP on ImageNet-R, ImageNet-A, and other datasets.

Conclusion

AlignCLIP successfully reduces the modality gap while maintaining or improving downstream performance. Specifically:

  1. Parameter sharing strengthens cross-modal alignment.
  2. IMSep enhances intra-modality distinction, leading to more meaningful semantic organization.
  3. Unlike naive transformations, AlignCLIP preserves the natural structure of the embedding space, ensuring robust improvements.

This study suggests that semantic-based structural modifications are more effective than simple distance-based corrections in mitigating the modality gap, setting a new direction for improving CLIP and similar models.