1602.07332
Cited By
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li
Papers citing
"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"
50 / 1,650 papers shown
VindLU: A Recipe for Effective Video-and-Language Pretraining
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
125
81
0
09 Dec 2022
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Jinze Bai
Rui Men
Han Yang
Xuancheng Ren
Kai Dang
...
Wenhang Ge
Jianxin Ma
Junyang Lin
Jingren Zhou
Chang Zhou
88
16
0
08 Dec 2022
Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
Ukyo Honda
Taro Watanabe
Yuji Matsumoto
63
9
0
06 Dec 2022
Semantic-Conditional Diffusion Networks for Image Captioning
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Jianlin Feng
Hongyang Chao
Tao Mei
DiffM
91
74
0
06 Dec 2022
Beyond Object Recognition: A New Benchmark towards Object Concept Learning
Yong-Lu Li
Yue Xu
Xinyu Xu
Xiaohan Mao
Yuan Yao
Siqi Liu
Cewu Lu
OCL
148
9
0
06 Dec 2022
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
61
24
0
04 Dec 2022
Named Entity and Relation Extraction with Multi-Modal Retrieval
Xinyu Wang
Jiong Cai
Yong Jiang
Pengjun Xie
Kewei Tu
Wei Lu
90
52
0
03 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
64
2
0
02 Dec 2022
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset
Sidra Hanif
Longin Jan Latecki
90
0
0
01 Dec 2022
Scaling Language-Image Pre-training via Masking
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
111
330
0
01 Dec 2022
Multimodal Query-guided Object Localization
Aditay Tripathi
Rajath R Dani
Anand Mishra
Anirban Chakraborty
62
0
0
01 Dec 2022
Hyperbolic Contrastive Learning for Visual Representations beyond Objects
Songwei Ge
Shlok Kumar Mishra
Simon Kornblith
Chun-Liang Li
David Jacobs
OCL
SSL
129
57
0
01 Dec 2022
GRiT: A Generative Region-to-text Transformer for Object Understanding
Jialian Wu
Jianfeng Wang
Zhengyuan Yang
Zhe Gan
Zicheng Liu
Junsong Yuan
Lijuan Wang
ObjD
VLM
81
119
0
01 Dec 2022
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning
Zhuowan Li
Xingrui Wang
Elias Stengel-Eskin
Adam Kortylewski
Wufei Ma
Benjamin Van Durme
Max Planck Institute for Informatics
OOD
LRM
102
70
0
01 Dec 2022
Iterative Scene Graph Generation with Generative Transformers
Sanjoy Kundu
Sathyanarayanan N. Aakur
ViT
85
28
0
30 Nov 2022
Abstract Visual Reasoning with Tangram Shapes
Anya Ji
Noriyuki Kojima
N. Rush
Alane Suhr
Wai Keen Vong
Robert D. Hawkins
Yoav Artzi
LRM
77
40
0
29 Nov 2022
Neuro-Symbolic Spatio-Temporal Reasoning
Pascal Hitzler
Michael Sioutis
Md Kamruzzaman Sarker
Marjan Alirezaie
Aaron Eberhart
Stefan Wermter
NAI
85
0
0
28 Nov 2022
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Siyi Liu
Yaoyuan Liang
Feng Li
Shijia Huang
Hao Zhang
Hang Su
Jun Zhu
Lei Zhang
ObjD
105
28
0
28 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Qi Zhang
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD
VLM
66
1
0
28 Nov 2022
ILSGAN: Independent Layer Synthesis for Unsupervised Foreground-Background Segmentation
Qiran Zou
Yu Yang
Wing Yin Cheung
Chang-rui Liu
Xiang Ji
GAN
143
4
0
25 Nov 2022
TPA-Net: Generate A Dataset for Text to Physics-based Animation
Yuxing Qiu
Feng Gao
Minchen Li
Govind Thattai
Yin Yang
Chenfanfu Jiang
PINN
DiffM
VGen
58
0
0
25 Nov 2022
ComCLIP: Training-Free Compositional Image and Text Matching
Kenan Jiang
Xuehai He
Ruize Xu
Xinze Wang
VLM
CLIP
CoGe
106
20
0
25 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
78
15
0
24 Nov 2022
Open-vocabulary Attribute Detection
M. A. Bravo
Sudhanshu Mittal
Simon Ging
Thomas Brox
VLM
ObjD
92
31
0
23 Nov 2022
Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference
E. Mitchell
Joseph J. Noh
Siyan Li
William S. Armstrong
Ananth Agarwal
Patrick Liu
Chelsea Finn
Christopher D. Manning
84
35
0
21 Nov 2022
Teaching Structured Vision&Language Concepts to Vision&Language Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Yikang Shen
Roei Herzig
...
Donghyun Kim
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
126
72
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
100
24
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
70
4
0
21 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
109
15
0
21 Nov 2022
Intelligent Computing: The Latest Advances, Challenges and Future
Shiqiang Zhu
Ting Yu
Tao Xu
Hongyang Chen
Schahram Dustdar
...
Tariq S. Durrani
Huaimin Wang
Jiangxing Wu
Tongyi Zhang
Yunhe Pan
AI4CE
87
130
0
21 Nov 2022
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
Ling Yang
Zhilin Huang
Yang Song
Shenda Hong
Ge Li
Wentao Zhang
Tengjiao Wang
Guohao Li
Ming-Hsuan Yang
104
57
0
21 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
78
11
0
20 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
161
15
0
19 Nov 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Hao Li
Jinguo Zhu
Xiaohu Jiang
Xizhou Zhu
Hongsheng Li
...
Xiaohua Wang
Yu Qiao
Xiaogang Wang
Wenhai Wang
Jifeng Dai
MLLM
87
58
0
17 Nov 2022
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
James Smith
Paola Cascante-Bonilla
Assaf Arbelle
Donghyun Kim
Yikang Shen
David D. Cox
Diyi Yang
Z. Kira
Rogerio Feris
Leonid Karlinsky
VLM
146
23
0
17 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
99
88
0
17 Nov 2022
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning
Pengpeng Zeng
Jinkuan Zhu
Jingkuan Song
Lianli Gao
VLM
63
30
0
17 Nov 2022
MapQA: A Dataset for Question Answering on Choropleth Maps
Shuaichen Chang
David Palzer
Jialin Li
Eric Fosler-Lussier
N. Xiao
59
48
0
15 Nov 2022
Visually Grounded VQA by Lattice-based Retrieval
Daniel Reich
F. Putze
Tanja Schultz
47
2
0
15 Nov 2022
A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation
Shijia Huang
Feng Li
Hao Zhang
Siyi Liu
Lei Zhang
Liwei Wang
64
5
0
15 Nov 2022
Category-Adaptive Label Discovery and Noise Rejection for Multi-label Image Recognition with Partial Positive Labels
Tao Pu
Q. Lao
Hefeng Wu
Tianshui Chen
Liang Lin
75
2
0
15 Nov 2022
Probabilistic Debiasing of Scene Graphs
Bashirul Azam Biswas
Qian Ji
69
12
0
11 Nov 2022
SSGVS: Semantic Scene Graph-to-Video Synthesis
Yuren Cong
Jinhui Yi
Bodo Rosenhahn
M. Yang
135
8
0
11 Nov 2022
Watching the News: Towards VideoQA Models that can Read
Soumya Jahagirdar
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
93
20
0
10 Nov 2022
Towards Reasoning-Aware Explainable VQA
Rakesh Vaideeswaran
Feng Gao
Abhinav Mathur
Govind Thattai
LRM
83
3
0
09 Nov 2022
OSIC: A New One-Stage Image Captioner Coined
Bo Wang
Zhao Zhang
Ming Zhao
Xiaojie Jin
Mingliang Xu
Meng Wang
VLM
85
4
0
04 Nov 2022
Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing
Aditay Tripathi
Anand Mishra
Anirban Chakraborty
51
2
0
03 Nov 2022
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji
Seungjun Nah
Xun Huang
Arash Vahdat
Jiaming Song
...
Timo Aila
S. Laine
Bryan Catanzaro
Tero Karras
Xuan Li
VLM
MoE
217
832
0
02 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
53
2
0
01 Nov 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Chandu
A. Geramifard
72
3
0
30 Oct 2022