2103.16553
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
30 March 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
Papers citing
"Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers"
29 / 29 papers shown
Title
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
165
0
0
21 Feb 2025
Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval
Fan Jiang
Qiongkai Xu
Tom Drummond
Trevor Cohn
19
2
0
27 Nov 2023
HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
Yinghui He
Yufan Wu
Yilin Jia
Rada Mihalcea
Yulong Chen
Naihao Deng
LRM
LLMAG
30
21
0
25 Oct 2023
Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment
Jiamin Zhuang
Jing Yu
Yang Ding
Xiangyang Qu
Yue Hu
24
9
0
27 Aug 2023
Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer
Elinor Poole-Dayan
Vikram S. Voleti
Christopher Pal
Siva Reddy
37
13
0
25 May 2023
ZeroSearch: Local Image Search from Text with Zero Shot Learning
Jatin Nainani
A. Mazumdar
Viraj Sheth
20
0
0
01 May 2023
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime
Rhydian Windsor
A. Jamaludin
T. Kadir
Andrew Zisserman
VLM
27
11
0
30 Mar 2023
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
Ding Jiang
Mang Ye
30
140
0
22 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
42
47
0
21 Mar 2023
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
24
4
0
05 Jan 2023
Automating Nearest Neighbor Search Configuration with Constrained Optimization
Philip Sun
Ruiqi Guo
Surinder Kumar
23
7
0
04 Jan 2023
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Dongwon Kim
Nam-Won Kim
Suha Kwak
24
37
0
30 Nov 2022
Cross-Modal Adapter for Text-Video Retrieval
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Jie Zhou
S. Song
Gao Huang
45
36
0
17 Nov 2022
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
Abhra Chaudhuri
Massimiliano Mancini
Yanbei Chen
Zeynep Akata
Anjan Dutta
18
5
0
19 Oct 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
20
18
0
01 Aug 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
50
525
0
13 Jun 2022
Egocentric Video-Language Pretraining
Kevin Qinghong Lin
Alex Jinpeng Wang
Mattia Soldan
Michael Wray
Rui Yan
...
Hongfa Wang
Dima Damen
Bernard Ghanem
Wei Liu
Mike Zheng Shou
VLM
EgoV
46
188
0
03 Jun 2022
A CLIP-Hitchhiker's Guide to Long Video Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
CLIP
126
62
0
17 May 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
22
60
0
31 Mar 2022
Image Retrieval from Contextual Descriptions
Benno Krojer
Vaibhav Adlakha
Vibhav Vineet
Yash Goyal
E. Ponti
Siva Reddy
13
29
0
29 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
31
148
0
28 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
29
74
0
18 Mar 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
22
103
0
16 Jan 2022
Cross Modal Retrieval with Querybank Normalisation
Simion-Vlad Bogolin
Ioana Croitoru
Hailin Jin
Yang Liu
Samuel Albanie
27
84
0
23 Dec 2021
Cascaded Fast and Slow Models for Efficient Semantic Code Search
Akhilesh Deepak Gotmare
Junnan Li
Shafiq R. Joty
S. Hoi
33
10
0
15 Oct 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
28
56
0
13 Sep 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
252
927
0
24 Sep 2019
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Vijay Badrinarayanan
Alex Kendall
R. Cipolla
SSeg
446
15,637
0
02 Nov 2015
A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics
Yunchao Gong
Qifa Ke
Michael Isard
Svetlana Lazebnik
3DV
68
584
0
18 Dec 2012