Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

23 February 2016
Ranjay Krishna
Yuke Zhu
Oliver Groth
Justin Johnson
Kenji Hata
Joshua Kravitz
Stephanie Chen
Yannis Kalantidis
Li Li
David A. Shamma
Michael S. Bernstein
Fei-Fei Li

Papers citing "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations"

50 / 1,644 papers shown
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD, VOS
329
9
0
17 Apr 2025
Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Zhanglin Wu
Tengfei Song
Ning Xie
Mengli Zhu
Weidong Zhang
...
Pengfei Li
Chong Li
Junhao Zhu
Hao Yang
Shiliang Sun
119
2
0
16 Apr 2025
Mutual Understanding between People and Systems via Neurosymbolic AI and Knowledge Graphs
I. Celino
Mario Scrocca
Agnese Chiatti
73
0
0
15 Apr 2025
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Iñigo Pikabea
Iñaki Lacunza
Oriol Pareras
Carlos Escolano
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
VLM
205
1
0
28 Mar 2025
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Didolkar
Andrii Zadaianchuk
Rabiul Awal
Maximilian Seitzer
E. Gavves
Aishwarya Agrawal
OCL, VLM
178
3
0
27 Mar 2025
A Causal Adjustment Module for Debiasing Scene Graph Generation
Li Liu
Shuzhou Sun
Shuaifeng Zhi
Fan Shi
Zhen Liu
J. Heikkilä
Yongxiang Liu
CML
83
2
0
22 Mar 2025
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Haochen Zhang
Nader Zantout
Pujith Kachana
Ji Zhang
Wenshan Wang
VGen
83
0
0
20 Mar 2025
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
Shengqiong Wu
Hao Fei
Jingkang Yang
Xiaochen Li
Juncheng Li
Hao Zhang
Tat-Seng Chua
102
1
0
19 Mar 2025
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
Sayak Nag
Udita Ghosh
Sarosij Bose
Calvin-Khang Ta
Jiachen Li
Amit K. Roy-Chowdhury
221
0
0
18 Mar 2025
Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic
Monika Shah
Somdeb Sarkhel
Deepak Venugopal
MLLM, BDL, VLM
127
0
0
18 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
142
8
0
18 Mar 2025
Grounded Chain-of-Thought for Multimodal Large Language Models
Qiong Wu
Xiangcong Yang
Yiyi Zhou
Chenxin Fang
Baiyang Song
Xiaoshuai Sun
Rongrong Ji
LRM
192
3
0
17 Mar 2025
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Haoqi Yuan
Yu Bai
Yuhui Fu
Bohan Zhou
Yicheng Feng
Xinrun Xu
Yi Zhan
Börje F. Karlsson
Zongqing Lu
LM&Ro
203
1
0
16 Mar 2025
Salient Temporal Encoding for Dynamic Scene Graph Generation
Zhihao Zhu
91
0
0
15 Mar 2025
Referring to Any Person
Qing Jiang
Lin Wu
Zhaoyang Zeng
Tianhe Ren
Yuda Xiong
Yihao Chen
Qin Liu
Lei Zhang
500
2
0
11 Mar 2025
MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification
Xiangyan Qu
Jing Yu
Jiamin Zhuang
Gaopeng Gou
Gang Xiong
Qi Wu
VLM
134
0
0
10 Mar 2025
Towards Fine-Grained Video Question Answering
Wei Dai
Alan Luo
Zane Durante
Debadutta Dash
Arnold Milstein
Kevin Schulman
Ehsan Adeli
L. Fei-Fei
109
1
0
10 Mar 2025
Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Wei-Yao Wang
Zhao Wang
Helen Suzuki
Yoshiyuki Kobayashi
LRM
105
1
0
04 Mar 2025
Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Tianjie Ju
Yi Hua
Hao Fei
Zhenyu Shao
Yubin Zheng
Haodong Zhao
Mong Li Lee
Wynne Hsu
Zhuosheng Zhang
Gongshen Liu
146
0
0
03 Mar 2025
HalCECE: A Framework for Explainable Hallucination Detection through Conceptual Counterfactuals in Image Captioning
Maria Lymperaiou
Giorgos Filandrianos
Angeliki Dimitriou
Athanasios Voulodimos
Giorgos Stamou
MLLM
56
0
0
01 Mar 2025
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
E. Ghaleb
Bulat Khaertdinov
Aslı Özyürek
Raquel Fernández
114
0
0
27 Feb 2025
Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation
Shaharukh Khan
Ayush Tarun
Ali Faraz
Palash Kamble
Vivek Dahiya
Praveen Kumar Pokala
Ashish Kulkarni
Chandra Khatri
Abhinav Ravi
Shubham Agarwal
439
1
0
27 Feb 2025
Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
Sanghyeok Chu
Seonguk Seo
Bohyung Han
114
1
0
23 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
165
9
0
21 Feb 2025
Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
Kyungmin Min
Minbeom Kim
Kang-il Lee
Dongryeol Lee
Kyomin Jung
MLLM
184
7
0
20 Feb 2025
Contrastive Localized Language-Image Pre-Training
Hong-You Chen
Zhengfeng Lai
Hao Zhang
Xiang Wang
Marcin Eichner
Keen You
Meng Cao
Bowen Zhang
Yue Yang
Zhe Gan
CLIP, VLM
124
10
0
20 Feb 2025
Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments
Luca Barsellotti
Roberto Bigazzi
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
231
1
0
20 Feb 2025
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
M. Drozdzal
Adriana Romero Soriano
OCL, VLM, CoGe
161
3
0
19 Feb 2025
A Comprehensive Survey on Composed Image Retrieval
Xuemeng Song
Haoqiang Lin
Haokun Wen
Bohan Hou
Mingzhu Xu
Liqiang Nie
131
3
0
19 Feb 2025
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding
Weikang Qiu
Zheng Huang
Haoyu Hu
Aosong Feng
Yujun Yan
Rex Ying
99
0
0
18 Feb 2025
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Hugh Mee Wong
Rick Nouwen
Albert Gatt
155
0
0
17 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
Xinsong Zhang
Jingyu Wang
Wenbing Tao
127
10
0
17 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begüm Demir
Ioannis Papoutsis
VLM
196
1
0
13 Feb 2025
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Dexian Cai
Xiaocui Yang
Yongkang Liu
Daling Wang
Shi Feng
Yifei Zhang
Soujanya Poria
LRM
113
1
0
13 Feb 2025
Color Universal Design Neural Network for the Color Vision Deficiencies
Sunyong Seo
Jinho Park
104
0
0
12 Feb 2025
Large Multimodal Models for Low-Resource Languages: A Survey
Marian Lupascu
Ana-Cristina Rogoz
Mihai-Sorin Stupariu
Radu Tudor Ionescu
181
2
0
08 Feb 2025
Learn from the Past: Language-conditioned Object Rearrangement with Large Language Models
Guanqun Cao
Ryan Mckenna
Erich Graf
John Oyekan
LM&Ro
231
0
0
30 Jan 2025
Mirage in the Eyes: Hallucination Attack on Multi-modal Large Language Models with Only Attention Sink
Yining Wang
Mi Zhang
Junjie Sun
Chenyue Wang
Min Yang
Hui Xue
Jialing Tao
Ranjie Duan
Qingbin Liu
65
2
0
28 Jan 2025
Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
Jielong Tang
Zhenxing Wang
Ziyang Gong
Jianxing Yu
Shuang Wang
Jian Yin
156
0
0
28 Jan 2025
Making Reliable and Flexible Decisions in Long-tailed Classification
Bolian Li
Ruqi Zhang
457
0
0
23 Jan 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
182
51
0
21 Jan 2025
ComplexVAD: Detecting Interaction Anomalies in Video
Furkan Mumcu
Michael J. Jones
Yasin Yilmaz
A. Cherian
107
0
0
17 Jan 2025
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Ahmad Süleyman
Göksel Biricik
87
2
0
15 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD, VLM
276
3
0
14 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
254
134
0
10 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
171
15
0
06 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM, VLM, LRM
360
59
0
03 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
188
42
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
284
5
0
31 Dec 2024
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue
Quan Deng
Fei Yu
Yanhao Wang
Jun Wang
Yongqian Li
VLM
129
5
0
31 Dec 2024