1602.07332
Cited By
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li-Jia Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
Papers citing
"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"
50 / 1,062 papers shown
Title
Learning to Discover and Detect Objects
V. Fomenko
Ismail Elezi
Deva Ramanan
Laura Leal-Taixé
Aljosa Osep
ObjD
33
10
0
19 Oct 2022
Dense but Efficient VideoQA for Intricate Compositional Reasoning
Jihyeon Janel Lee
Wooyoung Kang
Eun-Sol Kim
CoGe
19
3
0
19 Oct 2022
Commonsense Knowledge from Scene Graphs for Textual Environments
Tsunehiko Tanaka
Daiki Kimura
Michiaki Tatsubori
20
2
0
19 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
30
43
0
17 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
32
62
0
14 Oct 2022
Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets
Anurag Roy
David Johnson Ekka
Saptarshi Ghosh
Abir Das
23
1
0
13 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
20
68
0
12 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
40
4
0
10 Oct 2022
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Yan Gong
Georgina Cosma
27
11
0
10 Oct 2022
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Zijia Zhao
Longteng Guo
Xingjian He
Shuai Shao
Zehuan Yuan
Jing Liu
21
8
0
09 Oct 2022
LOCL: Learning Object-Attribute Composition using Localization
Satish Kumar
A S M Iftekhar
Ekta Prashnani
B. S. Manjunath
19
3
0
07 Oct 2022
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Xu Yang
Hanwang Zhang
Chongyang Gao
Jianfei Cai
MLLM
40
10
0
04 Oct 2022
Unbiased Scene Graph Generation using Predicate Similarities
Misaki Ohashi
Yusuke Matsui
32
1
0
03 Oct 2022
Data Poisoning Attacks Against Multimodal Encoders
Ziqing Yang
Xinlei He
Zheng Li
Michael Backes
Mathias Humbert
Pascal Berrang
Yang Zhang
AAML
116
45
0
30 Sep 2022
DRAMA: Joint Risk Localization and Captioning in Driving
Srikanth Malla
Chiho Choi
Isht Dwivedi
Joonhyang Choi
Jiachen Li
107
87
0
22 Sep 2022
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering
Hao Li
Jinfa Huang
Peng Jin
Guoli Song
Qi Wu
Jie Chen
39
21
0
21 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
211
1,113
0
20 Sep 2022
The Ability of Image-Language Explainable Models to Resemble Domain Expertise
P. Werner
Anna Zapaishchykova
Ujjwal Ratan
48
2
0
19 Sep 2022
3D VSG: Long-term Semantic Scene Change Prediction through 3D Variable Scene Graphs
Sam Looper
Javier Rodriguez Puigvert
Roland Siegwart
Cesar Cadena
L. Schmid
3DPC
13
22
0
16 Sep 2022
VIPHY: Probing "Visible" Physical Commonsense Knowledge
Shikhar Singh
Ehsan Qasemi
Muhao Chen
46
6
0
15 Sep 2022
Combining Metric Learning and Attention Heads For Accurate and Efficient Multilabel Image Classification
K. Prokofiev
V. Sovrasov
VLM
26
9
0
14 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
25
68
0
14 Sep 2022
Towards explainable evaluation of language models on the semantic similarity of visual concepts
Maria Lymperaiou
George Manoliadis
Orfeas Menis Mastromichalakis
Edmund Dervakos
Giorgos Stamou
AAML
24
5
0
08 Sep 2022
Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering
Jiong Wang
Zhou Zhao
Weike Jin
18
0
0
08 Sep 2022
VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph
Yanzeng Li
Zilong Zheng
Wenjuan Han
Lei Zou
34
2
0
07 Sep 2022
Scalable Regularization of Scene Graph Generation Models using Symbolic Theories
Davide Buffelli
Efthymia Tsamoura
25
2
0
06 Sep 2022
Design of the topology for contrastive visual-textual alignment
Zhun Sun
30
1
0
05 Sep 2022
Interactive Question Answering Systems: Literature Review
Giovanni Maria Biancofiore
Yashar Deldjoo
Tommaso Di Noia
E. Sciascio
Fedelucio Narducci
34
13
0
04 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
32
64
0
04 Sep 2022
Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis
Wanshu Fan
Yen-Chun Chen
Dongdong Chen
Yu Cheng
Lu Yuan
Yu-Chiang Frank Wang
DiffM
34
90
0
29 Aug 2022
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
44
3
0
24 Aug 2022
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang
A. Mahjoubfar
Anupama Joshi
29
4
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
41
0
0
23 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
54
629
0
22 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
34
11
0
19 Aug 2022
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
Xiujun Shu
Wei Wen
Haoqian Wu
Keyun Chen
Yi-Zhe Song
Ruizhi Qiao
Bohan Ren
Xiao Wang
27
91
0
18 Aug 2022
Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning
Tao He
Lianli Gao
Jingkuan Song
Yuan-Fang Li
VLM
34
50
0
17 Aug 2022
Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation
Liguang Zhou
Yuhongze Zhou
Tin Lun Lam
Yangsheng Xu
EDL
MoE
28
2
0
15 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
23
11
0
08 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
36
67
0
03 Aug 2022
Rethinking the Evaluation of Unbiased Scene Graph Generation
Xingchen Li
Long Chen
Jian Shao
Shaoning Xiao
Songyang Zhang
Jun Xiao
42
12
0
03 Aug 2022
Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation
Xingchen Li
Long Chen
Wenbo Ma
Yi Yang
Jun Xiao
21
26
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
25
1
0
31 Jul 2022
Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection
Xiaoqian Wu
Yong-Lu Li
Xinpeng Liu
Junyi Zhang
Yuzhe Wu
Cewu Lu
29
37
0
28 Jul 2022
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
Li Xu
Haoxuan Qu
Jason Kuen
Jiuxiang Gu
Jun Liu
CML
31
27
0
23 Jul 2022
Panoptic Scene Graph Generation
Jingkang Yang
Yi Zhe Ang
Zujin Guo
Kaiyang Zhou
Wayne Zhang
Ziwei Liu
47
106
0
22 Jul 2022
Human-centric Image Cropping with Partition-aware and Content-preserving Features
Bo Zhang
Li Niu
Xing Zhao
Liqing Zhang
21
5
0
21 Jul 2022
Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
ViT
37
27
0
20 Jul 2022
ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network
Nikolaos Gkalelis
Dimitrios Daskalakis
Vasileios Mezaris
19
10
0
20 Jul 2022
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
ViT
36
106
0
20 Jul 2022