Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (MM 2024)

1. Abstract

This paper proposes a novel approach to Scene Text Retrieval (STR). Existing methods rely on Optical Character Recognition (OCR)-based pipelines, which are often inefficient and lack flexibility. To overcome these challenges, we leverage the potential of Contrastive Language-Image Pre-training (CLIP) to develop an OCR-free STR method.

We identify two major challenges when applying CLIP to STR:

  1. Limited text perceptual scale: The low image resolution in CLIP makes it difficult to recognize small text within scenes.
  2. Entangled visual-semantic concepts: CLIP tends to confuse visual elements with linguistic semantics.

To address these issues, we introduce the FDP (Focus, Distinguish, and Prompt) model, which consists of three key steps:

  • Focus: Encouraging CLIP to concentrate on text regions within images.
  • Distinguish: Semantically distinguishing the target text (e.g., content words vs. function words).
  • Prompt: Enhancing retrieval performance using context-aware prompts.

FDP achieves a 4x speed improvement while increasing retrieval accuracy by 4.37% compared to conventional STR methods.


2.1 Scene Text Retrieval (STR)

STR involves finding images that contain a specific query text. Traditional STR methods typically follow an OCR-based pipeline:

  1. Text Detection: Identifying text regions within an image.
  2. Text Recognition: Converting detected text into characters for retrieval.

However, OCR-based STR is slow and struggles with multilingual text, diverse fonts, and varying styles.
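
As a rough illustration of the two-stage pipeline above, the sketch below runs detection and recognition on every image before matching the query. The function `retrieve_with_ocr` and the toy `detect`/`recognize` stand-ins are hypothetical placeholders, not any real OCR library, and the "images" are just lists of words so that the example runs.

```python
from typing import Callable, Dict, List

def retrieve_with_ocr(query: str,
                      images: Dict[str, list],
                      detect: Callable,
                      recognize: Callable) -> List[str]:
    """Return ids of images in which any recognized word equals the query."""
    hits = []
    for image_id, image in images.items():
        # Stage 1: detect text regions. Stage 2: recognize each region.
        words = {recognize(image, box).lower() for box in detect(image)}
        if query.lower() in words:
            hits.append(image_id)
    return hits

# Toy data (assumption): each "image" is simply the list of words printed in it.
toy_images = {
    "img1": ["Coffee", "Shop"],
    "img2": ["Hotel"],
    "img3": ["coffee", "and", "tea"],
}
toy_detect = lambda image: range(len(image))   # one "box" per word
toy_recognize = lambda image, box: image[box]  # pretend recognition is perfect

print(retrieve_with_ocr("coffee", toy_images, toy_detect, toy_recognize))
# prints ['img1', 'img3']
```

Every image passes through both stages before it can be matched, which is one reason such pipelines tend to be slow.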

2.2 Leveraging CLIP for OCR-Free STR

Recent advances in large-scale multimodal models like CLIP suggest potential applications for STR:

  • CLIP inherently includes some OCR capabilities through text-image alignment learning.
  • However, CLIP primarily learns semantic relationships rather than directly extracting text.
  • Thus, CLIP’s limitations must be addressed for effective STR performance.
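
To make the OCR-free idea concrete, the following minimal sketch ranks images by cosine similarity between a query embedding and image embeddings. The vectors are invented toy values standing in for the outputs of CLIP's text and image encoders, and `rank_images` is a hypothetical helper, not part of FDP or CLIP.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(query_vec: List[float],
                image_vecs: Dict[str, List[float]]) -> List[str]:
    """Sort image ids from most to least similar to the query embedding."""
    return sorted(image_vecs,
                  key=lambda image_id: cosine(query_vec, image_vecs[image_id]),
                  reverse=True)

# Toy embeddings (assumption): real ones would come from the encoders.
toy_query = [1.0, 0.0, 0.0]
toy_image_vecs = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
ranking = rank_images(toy_query, toy_image_vecs)
print(ranking)  # prints ['a', 'c', 'b']
```

No detection or recognition step is needed at query time: retrieval reduces to one similarity computation per image.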

3. FDP Method

FDP optimizes CLIP for text retrieval through three key processes: Focus, Distinguish, and Prompt.

3.1 Focus (Enhancing CLIP’s Text Attention)

Standard CLIP models analyze the entire image without explicitly focusing on text. FDP refines this with two components:

  • Dynamic Attention Shift: A technique that highlights text regions within images.
  • Text Knowledge Probing: Using prompts like “scene text” to enhance CLIP’s text perception.
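
One way to picture the Focus step is to score each image patch against a probe embedding (standing in for a prompt such as “scene text”) and then pool the patch features with those scores as weights. The sketch below is only an illustration under that assumption; `focus_pool`, the temperature value, and the toy vectors are hypothetical and do not reproduce the paper's actual modules.

```python
import math
from typing import List, Tuple

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def focus_pool(patch_feats: List[List[float]],
               probe_vec: List[float],
               temperature: float = 0.1) -> Tuple[List[float], List[float]]:
    """Weight patches by similarity to the probe, then pool the features."""
    scores = [sum(a * b for a, b in zip(patch, probe_vec)) / temperature
              for patch in patch_feats]
    weights = softmax(scores)
    dim = len(probe_vec)
    pooled = [sum(w * patch[d] for w, patch in zip(weights, patch_feats))
              for d in range(dim)]
    return weights, pooled

# Toy features (assumption): patches 0 and 2 resemble the probe, patch 1 does not.
toy_patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_probe = [1.0, 0.0]
weights, pooled = focus_pool(toy_patches, toy_probe)
print([round(w, 3) for w in weights])  # patches 0 and 2 receive almost all the weight
```

The pooled feature is then dominated by the patches most similar to the probe, which is the intuition behind shifting attention toward text regions.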

3.2 Distinguish (Semantic Text Differentiation)

Query words are categorized into content words and function words:

  • For example, content words such as “coffee” and “hotel” carry distinct meanings, whereas function words such as “and” and “with” carry little meaning of their own and may be ambiguous for CLIP.
  • Unsupervised clustering is used to separate these groups, improving retrieval clarity.
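
The clustering idea can be sketched with a tiny two-cluster k-means over word embeddings. This summary does not say which clustering algorithm FDP uses, so k-means is an assumption here, and the two-dimensional embeddings are invented purely for illustration.

```python
from typing import List

def sq_dist(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points: List[List[float]]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def two_means(points: List[List[float]], iters: int = 10) -> List[int]:
    """Cluster points into two groups; returns a 0/1 label per point."""
    # Deterministic init: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean(members)
    return labels

# Toy embeddings (assumption): real ones would come from CLIP's text encoder.
toy_words = ["coffee", "hotel", "and", "with"]
toy_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = two_means(toy_vecs)
print(dict(zip(toy_words, labels)))
# prints {'coffee': 0, 'hotel': 0, 'and': 1, 'with': 1}
```

The labels themselves are arbitrary; what matters is that the content words and the function words end up in different groups without any supervision.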

3.3 Prompt (Prompt Engineering for Enhanced Retrieval)

  • Semantic-aware Prompting: Transforming query words into learnable prompts.
  • Distracted Queries Assistance: A strategy to differentiate similar words (e.g., “advice” vs. “advise”).
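
To give a feel for what counts as a "similar word", the sketch below picks near-miss words from a vocabulary by edit distance. How FDP actually builds its distracted queries is not described in this summary, so the edit-distance criterion, the `distractors` helper, and the toy vocabulary are all assumptions.

```python
from typing import List

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def distractors(query: str, vocabulary: List[str], max_dist: int = 1) -> List[str]:
    """Words within max_dist edits of the query, excluding the query itself."""
    return [w for w in vocabulary
            if w != query and edit_distance(query, w) <= max_dist]

toy_vocab = ["advise", "advice", "device", "hotel"]
print(distractors("advice", toy_vocab))  # prints ['advise']
```

Words selected this way could serve as hard negatives, so that the model learns to separate a query from its closest look-alikes.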

4. Experiments and Results

4.1 Datasets

We evaluate FDP on several datasets:

  • IIIT-STR (10,000 images, 50 query words)
  • SVT (Street View images, 249 test images)
  • TotalText (1,255 images with diverse fonts and styles)
  • PSTR (phrase-level STR dataset; 36 query phrases for assessing phrase-based retrieval)

4.2 Experimental Results

  • FDP achieves faster and more accurate retrieval than existing STR models.
  • On IIIT-STR, the best FDP model (FDP-L) raises mAP from 77.40% (previous SOTA) to 89.46%, a gain of 12.06 percentage points.
  • PSTR experiments show that FDP performs well even for long-phrase retrieval tasks.
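
Since the results are reported as mAP, a short sketch of the standard mean average precision computation may help when reading the numbers. The rankings and relevance sets are toy values, and the exact evaluation protocol used in the paper is not stated in this summary.

```python
from typing import List, Set, Tuple

def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precisions = []
    for rank, image_id in enumerate(ranked_ids, 1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average the per-query AP over all queries."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Toy queries (assumption): (ranking returned by the system, ground-truth set).
toy_runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["b", "a"], {"a"}),                 # AP = 1/2
]
print(round(mean_average_precision(toy_runs), 4))  # prints 0.6667
```

A higher mAP means that images containing the query text tend to appear nearer the top of the ranking, averaged over all queries.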

4.3 Performance Comparison

Method          mAP (IIIT-STR)   mAP (SVT)   FPS
CLIP-RN50       52.93%           65.07%      76.32
FDP-S (Ours)    81.77%           82.56%      45.11
FDP-B (Ours)    86.65%           86.64%      31.43
FDP-L (Ours)    89.46%           89.63%      11.82

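  • Every FDP variant improves substantially on the CLIP-RN50 baseline in mAP (for example, from 52.93% to 81.77% on IIIT-STR for FDP-S), at the cost of lower FPS.
  • Within the FDP family, accuracy rises from FDP-S to FDP-L while throughput falls from 45.11 to 11.82 FPS.
  • Compared with traditional OCR-based methods, FDP is reported to be both faster and more accurate, which the authors attribute to its OCR-free design.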

5. Conclusion

This study introduces FDP, an OCR-free Scene Text Retrieval (STR) method powered by CLIP.

  • The “Focus, Distinguish, and Prompt” strategy adapts CLIP for text retrieval.
  • FDP improves both speed and accuracy while enabling flexible retrieval scenarios (word-level, phrase-level, and attribute-based retrieval).

➡ Practical Applications:

  • Product image search
  • Book/document search systems
  • Scene text retrieval in videos

Furthermore, according to the authors, FDP is the first work to directly adapt CLIP for STR, laying a foundation for future CLIP-based text recognition studies.
The source code will be released, and further research will explore the scalability of OCR-free methods.


Summary

  1. Traditional STR methods rely on OCR, making them complex and slow.
  2. OCR-free STR using CLIP is possible but has inherent text recognition limitations.
  3. FDP (Focus, Distinguish, Prompt) optimizes CLIP for STR.
  4. FDP achieves higher accuracy (+4.37%) and 4x faster retrieval than previous methods.
  5. FDP also supports phrase-based and attribute-based retrieval.

FDP introduces a new paradigm for STR and can be applied to various search systems.