Instruction-Guided Scene Text Recognition

Yongkun Du; Zhineng Chen; Yuchen Su; Caiyan Jia; Yu-Gang Jiang

doi:10.1109/tpami.2025.3525526

journal article Apr 01, 2025

IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 47 No. 4 pp. 2723-2738 · Institute of Electrical and Electronics Engineers (IEEE)

Metrics

Citations: 17
References: 84

Details

Published: Apr 01, 2025
Vol/Issue: 47(4)
Pages: 2723-2738

Authors

Yongkun Du

School of Computer Science, Fudan University, Shanghai, China

Zhineng Chen

School of Computer Science, Fudan University, Shanghai, China

Yuchen Su

School of Computer Science, Fudan University, Shanghai, China

Caiyan Jia

School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China

Yu-Gang Jiang

School of Computer Science, Fudan University, Shanghai, China

Funding

National Natural Science Foundation of China Award: 32341012

National Key R&D Program of China Award: 2022YFB3104703

Cite This Article

Yongkun Du, Zhineng Chen, Yuchen Su, et al. (2025). Instruction-Guided Scene Text Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4), 2723-2738. https://doi.org/10.1109/tpami.2025.3525526
