[review] Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (MM 2024)
Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (MM 2024)
1. Abstract
This paper proposes a novel approach to Scene Text Retrieval (STR). Existing methods rely on Optical Character Recognition (OCR)-based pipelines, which are often inefficient and lack flexibility. To overcome these challenges, we leverage the potential of Contrastive Language-Image Pre-training (CLIP) to develop an OCR-free STR method.
We identify two major challenges when applying CLIP to STR:
- Limited text perceptual scale: The low image resolution in CLIP makes it difficult to recognize small text within scenes.
- Entangled visual-semantic concepts: CLIP tends to confuse visual elements with linguistic semantics.
To address these issues, we introduce the FDP (Focus, Distinguish, and Prompt) model, which consists of three key steps:
- Focus: Encouraging CLIP to concentrate on text regions within images.
- Distinguish: Semantically distinguishing the target text (e.g., content words vs. function words).
- Prompt: Enhancing retrieval performance using context-aware prompts.
FDP achieves a 4x speed improvement while increasing retrieval accuracy by 4.37% compared to conventional STR methods.
2. Related Work
2.1 Scene Text Retrieval (STR)
STR involves finding images that contain a specific query text. Traditional STR methods typically follow an OCR-based pipeline:
- Text Detection: Identifying text regions within an image.
- Text Recognition: Converting detected text into characters for retrieval.
However, OCR-based STR is slow and struggles with multilingual text, diverse fonts, and varying styles.
2.2 Leveraging CLIP for OCR-Free STR
Recent advances in large-scale multimodal models like CLIP suggest potential applications for STR:
- CLIP inherently includes some OCR capabilities through text-image alignment learning.
- However, CLIP primarily learns semantic relationships rather than directly extracting text.
- Thus, CLIP’s limitations must be addressed for effective STR performance.
3. FDP Method
FDP optimizes CLIP for text retrieval through three key processes: Focus, Distinguish, and Prompt.
3.1 Focus (Enhancing CLIP’s Text Attention)
Standard CLIP models analyze the entire image without explicitly focusing on text. To refine this process:
- Dynamic Attention Shift: A technique that highlights text regions within images.
- Text Knowledge Probing: Using prompts like “scene text” to enhance CLIP’s text perception.
3.2 Distinguish (Semantic Text Differentiation)
Query words are categorized into content words and function words:
- E.g., “coffee” and “hotel” have distinct meanings, whereas “and” and “with” may be ambiguous in CLIP.
- Unsupervised clustering is used to separate these groups, improving retrieval clarity.
3.3 Prompt (Prompt Engineering for Enhanced Retrieval)
- Semantic-aware Prompting: Transforming query words into learnable prompts.
- Distracted Queries Assistance: A strategy to differentiate similar words (e.g., “advice” vs. “advise”).
4. Experiments and Results
4.1 Datasets
We evaluate FDP on several datasets:
- IIIT-STR (10,000 images, 50 query words)
- SVT (Street View images, 249 test images)
- TotalText (1,255 images with diverse fonts and styles)
- PSTR (Phrase-level STR dataset): 36 phrases used to assess sentence-based retrieval.
4.2 Experimental Results
- FDP achieves faster and more accurate retrieval than existing STR models.
- On IIIT-STR, the best FDP model (FDP-L) improves mAP from 77.40% (SOTA) to 89.46% (+12.06%).
- PSTR experiments show that FDP performs well even for long-phrase retrieval tasks.
4.3 Performance Comparison
Method | mAP (IIIT-STR) | mAP (SVT) | FPS |
---|---|---|---|
CLIP-RN50 | 52.93% | 65.07% | 76.32 |
FDP-S (Ours) | 81.77% | 82.56% | 45.11 |
FDP-B (Ours) | 86.65% | 86.64% | 31.43 |
FDP-L (Ours) | 89.46% | 89.63% | 11.82 |
- FDP surpasses traditional OCR-based methods in both speed and accuracy.
- As an OCR-free method, FDP is significantly more efficient than existing techniques.
5. Conclusion
This study introduces FDP, an OCR-free Scene Text Retrieval (STR) method powered by CLIP.
- “Focus, Distinguish, and Prompt” strategy optimizes CLIP for text retrieval.
- FDP improves both speed and accuracy while enabling flexible retrieval scenarios (word-level, phrase-level, and attribute-based retrieval).
➡ Practical Applications:
- Product image search
- Book/document search systems
- Scene text retrieval in videos
Furthermore, FDP is the first research to directly adapt CLIP for STR, laying the foundation for future CLIP-based text recognition studies.
The source code will be released, and further research will explore the scalability of OCR-free methods.
Summary
- Traditional STR methods rely on OCR, making them complex and slow.
- OCR-free STR using CLIP is possible but has inherent text recognition limitations.
- FDP (Focus, Distinguish, Prompt) optimizes CLIP for STR.
- FDP achieves higher accuracy (+4.37%) and 4x faster retrieval than previous methods.
- Supports phrase-based and attribute-based retrieval.
➡ FDP introduces a new paradigm for STR and can be applied to various search systems.