[review] Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (MM 2024)

February 25, 2025 2 minute read

Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (MM 2024)

1. Abstract

This paper proposes a novel approach to Scene Text Retrieval (STR). Existing methods rely on Optical Character Recognition (OCR)-based pipelines, which are often inefficient and lack flexibility. To overcome these challenges, we leverage the potential of Contrastive Language-Image Pre-training (CLIP) to develop an OCR-free STR method.

We identify two major challenges when applying CLIP to STR:

Limited text perceptual scale: The low image resolution in CLIP makes it difficult to recognize small text within scenes.
Entangled visual-semantic concepts: CLIP tends to confuse visual elements with linguistic semantics.

To address these issues, we introduce the FDP (Focus, Distinguish, and Prompt) model, which consists of three key steps:

Focus: Encouraging CLIP to concentrate on text regions within images.
Distinguish: Semantically distinguishing the target text (e.g., content words vs. function words).
Prompt: Enhancing retrieval performance using context-aware prompts.

FDP achieves a 4x speed improvement while increasing retrieval accuracy by 4.37% compared to conventional STR methods.

2.1 Scene Text Retrieval (STR)

STR involves finding images that contain a specific query text. Traditional STR methods typically follow an OCR-based pipeline:

Text Detection: Identifying text regions within an image.
Text Recognition: Converting detected text into characters for retrieval.

However, OCR-based STR is slow and struggles with multilingual text, diverse fonts, and varying styles.

2.2 Leveraging CLIP for OCR-Free STR

Recent advances in large-scale multimodal models like CLIP suggest potential applications for STR:

CLIP inherently includes some OCR capabilities through text-image alignment learning.
However, CLIP primarily learns semantic relationships rather than directly extracting text.
Thus, CLIP’s limitations must be addressed for effective STR performance.

3. FDP Method

FDP optimizes CLIP for text retrieval through three key processes: Focus, Distinguish, and Prompt.

3.1 Focus (Enhancing CLIP’s Text Attention)

Standard CLIP models analyze the entire image without explicitly focusing on text. To refine this process:

Dynamic Attention Shift: A technique that highlights text regions within images.
Text Knowledge Probing: Using prompts like “scene text” to enhance CLIP’s text perception.

3.2 Distinguish (Semantic Text Differentiation)

Query words are categorized into content words and function words:

E.g., “coffee” and “hotel” have distinct meanings, whereas “and” and “with” may be ambiguous in CLIP.
Unsupervised clustering is used to separate these groups, improving retrieval clarity.

3.3 Prompt (Prompt Engineering for Enhanced Retrieval)

Semantic-aware Prompting: Transforming query words into learnable prompts.
Distracted Queries Assistance: A strategy to differentiate similar words (e.g., “advice” vs. “advise”).

4. Experiments and Results

4.1 Datasets

We evaluate FDP on several datasets:

IIIT-STR (10,000 images, 50 query words)
SVT (Street View images, 249 test images)
TotalText (1,255 images with diverse fonts and styles)
PSTR (Phrase-level STR dataset): 36 phrases used to assess sentence-based retrieval.

4.2 Experimental Results

FDP achieves faster and more accurate retrieval than existing STR models.
On IIIT-STR, the best FDP model (FDP-L) improves mAP from 77.40% (SOTA) to 89.46% (+12.06%).
PSTR experiments show that FDP performs well even for long-phrase retrieval tasks.

4.3 Performance Comparison

Method	mAP (IIIT-STR)	mAP (SVT)	FPS
CLIP-RN50	52.93%	65.07%	76.32
FDP-S (Ours)	81.77%	82.56%	45.11
FDP-B (Ours)	86.65%	86.64%	31.43
FDP-L (Ours)	89.46%	89.63%	11.82

FDP surpasses traditional OCR-based methods in both speed and accuracy.
As an OCR-free method, FDP is significantly more efficient than existing techniques.

5. Conclusion

This study introduces FDP, an OCR-free Scene Text Retrieval (STR) method powered by CLIP.

“Focus, Distinguish, and Prompt” strategy optimizes CLIP for text retrieval.
FDP improves both speed and accuracy while enabling flexible retrieval scenarios (word-level, phrase-level, and attribute-based retrieval).

➡ Practical Applications:

Product image search
Book/document search systems
Scene text retrieval in videos

Furthermore, FDP is the first research to directly adapt CLIP for STR, laying the foundation for future CLIP-based text recognition studies.
The source code will be released, and further research will explore the scalability of OCR-free methods.

Summary

Traditional STR methods rely on OCR, making them complex and slow.
OCR-free STR using CLIP is possible but has inherent text recognition limitations.
FDP (Focus, Distinguish, Prompt) optimizes CLIP for STR.
FDP achieves higher accuracy (+4.37%) and 4x faster retrieval than previous methods.
Supports phrase-based and attribute-based retrieval.

➡ FDP introduces a new paradigm for STR and can be applied to various search systems.

Share on

Twitter Facebook LinkedIn