arXiv:1602.07332
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li-Jia Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
Papers citing
"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"
50 / 1,142 papers shown
Title
Learning Imbalanced Datasets with Maximum Margin Loss
Haeyong Kang
Thang Vu
Chang D. Yoo
21
18
0
11 Jun 2022
Referring Image Matting
Jizhizi Li
Jing Zhang
Dacheng Tao
ObjD
VLM
29
23
0
10 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
34
66
0
09 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
24
111
0
07 Jun 2022
Scan2Part: Fine-grained and Hierarchical Part-level Understanding of Real-World 3D Scans
A. Notchenko
Vladislav Ishimtsev
Alexey Artemov
V. Selyutin
Emil Bogomolov
Evgeny Burnaev
3DPC
3DV
35
3
0
06 Jun 2022
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
16
507
0
03 Jun 2022
Modeling Image Composition for Complex Scene Generation
Zuopeng Yang
Daqing Liu
Chaoyue Wang
J. Yang
Dacheng Tao
ViT
36
50
0
02 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
32
13
0
30 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
61
529
0
27 May 2022
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
Huaizu Jiang
Xiaojian Ma
Weili Nie
Zhiding Yu
Yuke Zhu
Song-Chun Zhu
Anima Anandkumar
VLM
28
36
0
27 May 2022
A Survey on Long-Tailed Visual Recognition
Lu Yang
He Jiang
Q. Song
Jun Guo
27
123
0
27 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
21
12
0
25 May 2022
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
Ashish V. Thapliyal
Jordi Pont-Tuset
Xi Chen
Radu Soricut
VGen
90
72
0
25 May 2022
The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training
Gi-Cheon Kang
Sungdong Kim
Jin-Hwa Kim
Donghyun Kwak
Byoung-Tak Zhang
34
10
0
25 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
36
6
0
24 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
31
4
0
24 May 2022
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
Yanan Wang
Michihiro Yasunaga
Hongyu Ren
Shinya Wada
J. Leskovec
29
17
0
23 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
29
38
0
23 May 2022
Heterogeneous Semantic Transfer for Multi-label Recognition with Partial Labels
Tianshui Chen
Tao Pu
Lingbo Liu
Yukai Shi
Zhijing Yang
Liang Lin
54
17
0
23 May 2022
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs
Jiarui Zhang
Filip Ilievski
Kaixin Ma
Jonathan M Francis
A. Oltramari
SSL
23
5
0
21 May 2022
Let's Talk! Striking Up Conversations via Conversational Visual Question Generation
Shih-Han Chan
Tsai-Lun Yang
Yun-Wei Chu
Chi-Yang Hsu
Ting-Hao 'Kenneth' Huang
Yu-Shian Chiu
Lun-Wei Ku
21
1
0
19 May 2022
GraphMapper: Efficient Visual Navigation by Scene Graph Generation
Zachary Seymour
Niluthpol Chowdhury Mithun
Han-Pang Chiu
S. Samarasekera
Rakesh Kumar
30
8
0
17 May 2022
Importance Weighted Structure Learning for Scene Graph Generation
Daqing Liu
M. Bober
J. Kittler
29
5
0
14 May 2022
Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer
A. Gritsenko
Austin Stone
Maxim Neumann
Dirk Weissenborn
...
Zhuoran Shen
Tianlin Li
Xiaohua Zhai
Thomas Kipf
N. Houlsby
ObjD
CLIP
VLM
ViT
OCL
34
307
0
12 May 2022
Learning to Answer Visual Questions from Web Videos
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
39
33
0
10 May 2022
Weakly-supervised segmentation of referring expressions
Robin Strudel
Ivan Laptev
Cordelia Schmid
22
21
0
10 May 2022
Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
Chia-Wen Kuo
Z. Kira
27
52
0
09 May 2022
Scene Graph Expansion for Semantics-Guided Image Outpainting
Chiao-An Yang
C. Tan
Wanshu Fan
Cheng Yang
Meng-Lin Wu
Yu-Chiang Frank Wang
42
17
0
05 May 2022
What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning
Jae Hee Lee
Matthias Kerzel
Kyra Ahrens
C. Weber
S. Wermter
40
9
0
05 May 2022
All You May Need for VQA are Image Captions
Soravit Changpinyo
Doron Kukliansky
Idan Szpektor
Xi Chen
Nan Ding
Radu Soricut
32
70
0
04 May 2022
Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation
Idris Abdulmumin
S. Dash
Musa Abdullahi Dawud
Shantipriya Parida
Shamsuddeen Hassan Muhammad
I. Ahmad
Subhadarshi Panda
Ondrej Bojar
B. Galadanci
Bello Shehu Bello
21
17
0
02 May 2022
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni
Wei Li
Weicheng Kuo
M. Saffar
Fred Bertsch
A. Angelova
17
16
0
02 May 2022
Visual Spatial Reasoning
Fangyu Liu
Guy Edward Toh Emerson
Nigel Collier
ReLM
44
160
0
30 Apr 2022
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
Xiaojian Ma
Weili Nie
Zhiding Yu
Huaizu Jiang
Chaowei Xiao
Yuke Zhu
Song-Chun Zhu
Anima Anandkumar
ViT
LRM
30
19
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
20
8
0
23 Apr 2022
Attention Mechanism based Cognition-level Scene Understanding
Xuejiao Tang
Tai Le Quy
LRM
32
0
0
17 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
27
55
0
15 Apr 2022
Measuring Compositional Consistency for Video Question Answering
Mona Gandhi
Mustafa Omer Gul
Eva Prakash
Madeleine Grunde-McLaughlin
Ranjay Krishna
Maneesh Agrawala
CoGe
40
15
0
14 Apr 2022
Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness
Paul Pu Liang
30
4
0
14 Apr 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
Sanjay Subramanian
William Merrill
Trevor Darrell
Matt Gardner
Sameer Singh
Anna Rohrbach
ObjD
44
126
0
12 Apr 2022
Adapting CLIP For Phrase Localization Without Further Training
Jiahao Li
G. Shakhnarovich
Raymond A. Yeh
VLM
CLIP
30
25
0
07 Apr 2022
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Sanghyuk Chun
Wonjae Kim
Song Park
Minsuk Chang
Seong Joon Oh
VLM
375
43
0
07 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush
Ryan Jiang
Max Bartolo
Amanpreet Singh
Adina Williams
Douwe Kiela
Candace Ross
CoGe
36
404
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
54
39
0
06 Apr 2022
DT2I: Dense Text-to-Image Generation from Region Descriptions
Stanislav Frolov
Prateek Bansal
Jörn Hees
Andreas Dengel
VLM
27
5
0
05 Apr 2022
Co-VQA : Answering by Interactive Sub Question Sequence
Ruonan Wang
Yuxi Qian
Fangxiang Feng
Xiaojie Wang
Huixing Jiang
LRM
29
16
0
02 Apr 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
36
60
0
31 Mar 2022
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo
Yiran Luo
Pratyay Banerjee
Tejas Gokhale
Yezhou Yang
Chitta Baral
29
4
0
30 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
30
94
0
30 Mar 2022
End-to-End Transformer Based Model for Image Captioning
Yiyu Wang
Jungang Xu
Yingfei Sun
VLM
ViT
26
117
0
29 Mar 2022