1602.07332
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
Papers citing "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"
50 / 1,654 papers shown
Title
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
89
35
0
10 May 2022
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
110
22
0
10 May 2022
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Chia-Wen Kuo
Z. Kira
97
56
0
09 May 2022
Scene Graph Expansion for Semantics-Guided Image Outpainting
Chiao-An Yang
C. Tan
Wanshu Fan
Cheng Yang
Meng-Lin Wu
Yu-Chiang Frank Wang
97
18
0
05 May 2022
What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning
Jae Hee Lee
Matthias Kerzel
Kyra Ahrens
C. Weber
S. Wermter
70
9
0
05 May 2022
All You May Need for VQA are Image Captions
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut
101
76
0
04 May 2022
Visual Commonsense in Pretrained Unimodal and Multimodal Models
Chenyu Zhang
Benjamin Van Durme
Zhuowan Li
Elias Stengel-Eskin
VLM
SSL
79
41
0
04 May 2022
RU-Net: Regularized Unrolling Network for Scene Graph Generation
Xin Lin
Changxing Ding
Jing Zhang
Yibing Zhan
Dacheng Tao
81
34
0
03 May 2022
Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation
Idris Abdulmumin
S. Dash
Musa Abdullahi Dawud
Shantipriya Parida
Shamsuddeen Hassan Muhammad
Ibrahim Said Ahmad
Subhadarshi Panda
Ondrej Bojar
B. Galadanci
Bello Shehu Bello
59
18
0
02 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
77
16
0
02 May 2022
Visual Spatial Reasoning
Fangyu Liu
Guy Edward Toh Emerson
Nigel Collier
ReLM
131
185
0
30 Apr 2022
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead
Suzanne Petryk
Vedaad Shakib
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
105
56
0
28 Apr 2022
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
Xiaojian Ma
Weili Nie
Zhiding Yu
Huaizu Jiang
Chaowei Xiao
Yuke Zhu
Song-Chun Zhu
Anima Anandkumar
ViT
LRM
138
19
0
24 Apr 2022
Reinforced Causal Explainer for Graph Neural Networks
Xiang Wang
Y. Wu
An Zhang
Fuli Feng
Xiangnan He
Tat-Seng Chua
CML
126
47
0
23 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
68
9
0
23 Apr 2022
Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension
Peihan Miao
Wei Su
Gaoang Wang
Xuewei Li
Xi Li
ObjD
82
10
0
21 Apr 2022
Reinforced Structured State-Evolution for Vision-Language Navigation
Jinyu Chen
Chen Gao
Erli Meng
Qiong Zhang
Si Liu
LM&Ro
60
42
0
20 Apr 2022
Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations
Leila Pishdad
Ran Zhang
Konstantinos G. Derpanis
Allan D. Jepson
Afsaneh Fazly
41
2
0
20 Apr 2022
Active Learning Helps Pretrained Models Learn the Intended Task
Alex Tamkin
Dat Nguyen
Salil Deshpande
Jesse Mu
Noah D. Goodman
81
39
0
18 Apr 2022
A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension
Gen Luo
Yiyi Zhou
Jiamu Sun
Xiaoshuai Sun
Rongrong Ji
ObjD
78
10
0
17 Apr 2022
Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang
Tai Le Quy
LRM
82
0
0
17 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
64
47
0
16 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
96
56
0
15 Apr 2022
Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning
Feilong Chen
Xiuyi Chen
Shuang Xu
Bo Xu
VLM
92
19
0
15 Apr 2022
Measuring Compositional Consistency for Video Question Answering
Mona Gandhi
Mustafa Omer Gul
Eva Prakash
Madeleine Grunde-McLaughlin
Ranjay Krishna
Maneesh Agrawala
CoGe
87
16
0
14 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
114
128
0
12 Apr 2022
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
Zhaowei Cai
Gukyeong Kwon
Avinash Ravichandran
Erhan Bas
Zhuowen Tu
Rahul Bhotika
Stefano Soatto
ObjD
MLLM
VLM
67
50
0
12 Apr 2022
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
Shunyu Zhang
X. Jiang
Zequn Yang
T. Wan
Zengchang Qin
62
12
0
10 Apr 2022
Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention
Yu Yang
Seung Wook Kim
Jungseock Joo
FAtt
61
17
0
10 Apr 2022
Adapting CLIP For Phrase Localization Without Further Training
Jiahao Li
G. Shakhnarovich
Raymond A. Yeh
VLM
CLIP
90
25
0
07 Apr 2022
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Sanghyuk Chun
Wonjae Kim
Song Park
Minsuk Chang
Seong Joon Oh
VLM
559
46
0
07 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
153
429
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
138
43
0
06 Apr 2022
KNN-Diffusion: Image Generation via Large-Scale Retrieval
Shelly Sheynin
Oron Ashual
Adam Polyak
Uriel Singer
Oran Gafni
Eliya Nachmani
Yaniv Taigman
VLM
SyDa
DiffM
85
124
0
06 Apr 2022
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
90
86
0
01 Apr 2022
SimVQA: Exploring Simulated Environments for Visual Question Answering
Paola Cascante-Bonilla
Hui Wu
Letao Wang
Rogerio Feris
Vicente Ordonez
84
7
0
31 Mar 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
105
62
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
111
95
0
30 Mar 2022
End-to-End Transformer Based Model for Image Captioning
Yiyu Wang
Jungang Xu
Yingfei Sun
VLM
ViT
64
125
0
29 Mar 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
125
17
0
27 Mar 2022
Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Chao Lou
Wenjuan Han
Yuh-Chen Lin
Zilong Zheng
CoGe
84
10
0
27 Mar 2022
4D-OR: Semantic Scene Graphs for OR Domain Modeling
Ege Özsoy
Evin Pınar Örnek
U. Eck
Tobias Czempiel
F. Tombari
Nassir Navab
82
38
0
22 Mar 2022
A Broad Study of Pre-training for Domain Generalization and Adaptation
Donghyun Kim
Kaihong Wang
Stan Sclaroff
Kate Saenko
OOD
AI4CE
105
81
0
22 Mar 2022
Fine-Grained Scene Graph Generation with Data Transfer
Ao Zhang
Yuan Yao
Qián Chen
Wei Ji
Zhiyuan Liu
Maosong Sun
Tat-Seng Chua
119
94
0
22 Mar 2022
Relationformer: A Unified Framework for Image-to-Graph Generation
Suprosanna Shit
Rajat Koner
Bastian Wittmann
Johannes C. Paetzold
Ivan Ezhov
...
Jia Pan
Sahand Sharifzadeh
Georgios Kaissis
Volker Tresp
Bjoern Menze
ViT
84
62
0
19 Mar 2022
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation
Xingning Dong
Tian Gan
Xuemeng Song
Jianlong Wu
Yuan Cheng
Liqiang Nie
115
96
0
18 Mar 2022
Context-Dependent Anomaly Detection with Knowledge Graph Embedding Models
Nathan Vaska
Kevin J. Leahy
Victoria Helus
31
1
0
17 Mar 2022
Finding Structural Knowledge in Multimodal-BERT
Victor Milewski
Miryam de Lhoneux
Marie-Francine Moens
72
10
0
17 Mar 2022
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Yang Ding
Jing Yu
Bangchang Liu
Yue Hu
Mingxin Cui
Qi Wu
58
64
0
17 Mar 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
51
22
0
17 Mar 2022