2305.14014
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
23 May 2023
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
Papers citing
"CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model"
50 / 57 papers shown
Title
AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding
Feng Xiao
Hongbin Xu
Guocan Zhao
Wenxiong Kang
214
0
0
07 May 2025
Scene Text Recognition Models Explainability Using Local Features
M. Ty
Rowel Atienza
63
1
0
14 Oct 2023
Data Filtering Networks
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
96
144
0
29 Sep 2023
LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition
Changxu Cheng
Peng Wang
Cheng Da
Qi Zheng
Cong Yao
73
15
0
24 Aug 2023
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLM
MLLM
CLIP
200
3,493
0
16 Oct 2022
Scene Text Recognition with Permuted Autoregressive Sequence Models
Darwin Bautista
Rowel Atienza
102
173
0
14 Jul 2022
Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition
Mingkun Yang
Minghui Liao
Pu Lu
Jing Wang
Shenggao Zhu
Hualin Luo
Qingzhen Tian
X. Bai
SSL
98
59
0
01 Jul 2022
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
Yi-Lin Sung
Jaemin Cho
Joey Tianyi Zhou
VLM
97
244
0
13 Jun 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
169
1,307
0
04 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
418
3,602
0
29 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
105
128
0
12 Apr 2022
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song
Li Dong
Weinan Zhang
Ting Liu
Furu Wei
VLM
CLIP
81
139
0
14 Mar 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
154
880
0
07 Feb 2022
Language-driven Semantic Segmentation
Boyi Li
Kilian Q. Weinberger
Serge Belongie
V. Koltun
René Ranftl
VLM
124
625
0
10 Jan 2022
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
Yi-Lin Sung
Jaemin Cho
Joey Tianyi Zhou
VLM
VPVLM
112
356
0
13 Dec 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
141
908
0
22 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
108
642
0
09 Nov 2021
Towards artificial general intelligence via a multimodal foundation model
Nanyi Fei
Zhiwu Lu
Yizhao Gao
Guoxing Yang
Yuqi Huo
...
Ruihua Song
Xin Gao
Tao Xiang
Haoran Sun
Jiling Wen
AI4CE
LRM
84
227
0
27 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM
CLIP
152
458
0
11 Oct 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao
Shijie Geng
Renrui Zhang
Teli Ma
Rongyao Fang
Yongfeng Zhang
Hongsheng Li
Yu Qiao
VLM
CLIP
309
1,045
0
09 Oct 2021
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
Gwanghyun Kim
Taesung Kwon
Jong Chul Ye
DiffM
200
655
0
06 Oct 2021
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Minghao Li
Tengchao Lv
Jingye Chen
Lei Cui
Yijuan Lu
D. Florêncio
Cha Zhang
Zhoujun Li
Furu Wei
ViT
240
370
0
21 Sep 2021
Learning to Prompt for Vision-Language Models
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
VPVLM
CLIP
VLM
505
2,409
0
02 Sep 2021
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network
Yuxin Wang
Hongtao Xie
Shancheng Fang
Jing Wang
Shenggao Zhu
Yongdong Zhang
VLM
85
154
0
22 Aug 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
221
1,972
0
16 Jul 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Joey Tianyi Zhou
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP
VLM
MLLM
257
410
0
13 Jul 2021
Open Images V5 Text Annotation and Yet Another Mask Text Spotter
Ilya Krylov
S. Nosov
V. Sovrasov
VLM
68
54
0
23 Jun 2021
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao
Li Dong
Songhao Piao
Furu Wei
ViT
289
2,841
0
15 Jun 2021
Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition
Meng Cui
Wei Wang
Jinjin Zhang
Liang Wang
3DV
76
12
0
13 Jun 2021
TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text
Amanpreet Singh
Guan Pang
Mandy Toh
Jing Huang
Wojciech Galuba
Tal Hassner
64
174
0
12 May 2021
ImageNet-21K Pretraining for the Masses
T. Ridnik
Emanuel Ben-Baruch
Asaf Noy
Lihi Zelnik-Manor
SSeg
VLM
CLIP
324
711
0
22 Apr 2021
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel
Ari Holtzman
Maxwell Forbes
Ronan Le Bras
Yejin Choi
CLIP
153
1,584
0
18 Apr 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP
VLM
129
1,209
0
31 Mar 2021
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
Shancheng Fang
Hongtao Xie
Yuxin Wang
Zhendong Mao
Yongdong Zhang
78
306
0
11 Mar 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
459
3,893
0
11 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
131
1,757
0
05 Feb 2021
SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition
Zhi Qiao
Yu Zhou
Dongbao Yang
Yucan Zhou
Weiping Wang
71
228
0
22 May 2020
Towards Accurate Scene Text Recognition with Semantic Reasoning Networks
Deli Yu
Xuan Li
Chengquan Zhang
Junyu Han
Jingtuo Liu
Errui Ding
95
287
0
27 Mar 2020
TextScanner: Reading Characters in Order for Robust Scene Text Recognition
Zhaoyi Wan
Minghang He
Haoran Chen
X. Bai
Cong Yao
67
139
0
28 Dec 2019
ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard
Xi Liu
Rui Zhang
Yongsheng Zhou
Qianyi Jiang
Qi Song
...
X. Bai
Baoguang Shi
Dimosthenis Karatzas
Shijian Lu
C. V. Jawahar
3DV
52
160
0
20 Dec 2019
RandAugment: Practical automated data augmentation with a reduced search space
E. D. Cubuk
Barret Zoph
Jonathon Shlens
Quoc V. Le
MQ
258
3,502
0
30 Sep 2019
ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling -- RRC-LSVT
Yipeng Sun
Zihan Ni
Chee-Kheng Chng
Yuliang Liu
Canjie Luo
...
Errui Ding
Jingtuo Liu
Dimosthenis Karatzas
Chee Seng Chan
Lianwen Jin
3DV
100
158
0
17 Sep 2019
ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT)
Chee-Kheng Chng
Yuliang Liu
Yipeng Sun
Chun Chet Ng
Canjie Luo
...
Errui Ding
Jingtuo Liu
Dimosthenis Karatzas
Chee Seng Chan
Lianwen Jin
3DV
92
215
0
16 Sep 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
677
24,541
0
26 Jul 2019
ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019
Nibal Nayef
Yash J. Patel
M. Busta
Pinaki Nath Chowdhury
Dimosthenis Karatzas
...
Jiří Matas
Umapada Pal
J. Burie
Cheng-Lin Liu
J. Ogier
3DV
78
251
0
01 Jul 2019
Scene Text Detection and Recognition: The Deep Learning Era
Shangbang Long
Xin He
Cong Yao
VLM
113
398
0
10 Nov 2018
Ray: A Distributed Framework for Emerging AI Applications
Philipp Moritz
Robert Nishihara
Stephanie Wang
Alexey Tumanov
Richard Liaw
...
Melih Elibol
Zongheng Yang
William Paul
Michael I. Jordan
Ion Stoica
GNN
107
1,267
0
16 Dec 2017
Focusing Attention: Towards Accurate Text Recognition in Natural Images
Zhanzhan Cheng
Fan Bai
Yunlu Xu
Gang Zheng
Shiliang Pu
Shuigeng Zhou
56
449
0
07 Sep 2017
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal
Piotr Dollár
Ross B. Girshick
P. Noordhuis
Lukasz Wesolowski
Aapo Kyrola
Andrew Tulloch
Yangqing Jia
Kaiming He
3DH
128
3,685
0
08 Jun 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju
Michael Cogswell
Abhishek Das
Ramakrishna Vedantam
Devi Parikh
Dhruv Batra
FAtt
325
20,086
0
07 Oct 2016