Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh
CoGe
arXiv: 1612.00837 (abs · PDF · HTML)

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

Showing 50 of 2,037 citing papers.
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
  Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu
  20 Jun 2025 · MLLM, VLM
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
  Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, X. Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
  20 Jun 2025
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
  Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin
  17 Jun 2025 · LRM
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
  Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
  16 Jun 2025 · MLLM, AuLLM, VLM
FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design
  Kai Lan, Jiayong Zhu, Jiangtong Li, Dawei Cheng, Guang-Sheng Chen, Changjun Jiang
  16 Jun 2025 · LRM
Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence
  Yibo Yang, Sihao Liu, Chuan Rao, Bang An, Tiancheng Shen, Philip Torr, Ming-Hsuan Yang, Bernard Ghanem
  16 Jun 2025
LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning
  Haotian Zhang, Liu Liu, Baosheng Yu, Jiayan Qiu, Yanwei Ren, Xianglong Liu
  14 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
  Xiao Xu, L. Qin, Wanxiang Che, Min-Yen Kan
  13 Jun 2025 · MoE, VLM
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
  Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, ..., Di Jin, Michihiro Yasunaga, Lili Yu, Xi Lin, Shaoliang Nie
  12 Jun 2025
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
  Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
  12 Jun 2025 · VLM
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
  Benjamin Z. Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
  11 Jun 2025
TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
  Ayush Gupta, A. Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha
  11 Jun 2025
A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs
  Benno Krojer, Mojtaba Komeili, Candace Ross, Q. Garrido, Koustuv Sinha, Nicolas Ballas, Mahmoud Assran
  11 Jun 2025
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
  Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
  10 Jun 2025
An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models
  Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, Jaewoo Song, Harshvardhan Sikka
  10 Jun 2025
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
  Zheda Mai, A. Chowdhury, Zihe Wang, Sooyoung Jeon, Jingyan Bai, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao
  10 Jun 2025 · CoGe
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
  Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov
  10 Jun 2025
Synthetic Visual Genome
  J. S. Park, Zixian Ma, Linjie Li, Chenhao Zheng, Cheng-Yu Hsieh, ..., Quan Kong, Norimasa Kobori, Ali Farhadi, Yejin Choi, Ranjay Krishna
  09 Jun 2025
Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests
  Arnau Igualde Sáez, Lamyae Rhomrasi, Yusef Ahsini, Ricardo Vinuesa, S. Hoyas, Jose P. García Sabater, Marius J. Fullana i Alfonso, J. Alberto Conejero
  09 Jun 2025 · LRM
Language-Vision Planner and Executor for Text-to-Visual Reasoning
  Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
  09 Jun 2025 · LRM, VLM
A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
  Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, J. Jitsev, Samuel Albanie, Matthias Bethge
  09 Jun 2025 · CoGe
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
  Tianyi Bai, Yuxuan Fan, Jiantao Qiu, Fupeng Sun, Jiayi Song, Junlin Han, Zichen Liu, Conghui He, Wentao Zhang, Binhang Yuan
  08 Jun 2025 · MLLM, VLM
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
  Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Junjie Fei, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
  08 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
  Divya J. Bajpai, M. Hanawal
  07 Jun 2025 · VLM
Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
  Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang
  06 Jun 2025 · ViT
CoMemo: LVLMs Need Image Context with Image Memory
  Shi-Qi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
  06 Jun 2025 · VLM
Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models
  Z. Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
  06 Jun 2025
TextVidBench: A Benchmark for Long Video Scene Text Understanding
  Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
  05 Jun 2025
CIVET: Systematic Evaluation of Understanding in VLMs
  Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
  05 Jun 2025
Coordinated Robustness Evaluation Framework for Vision-Language Models
  Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar
  05 Jun 2025 · AAML
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
  Jiahui Wang, Z. Liu, Yongming Rao, Jiwen Lu
  05 Jun 2025 · VLM, LRM
Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization
  Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, J. Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang
  04 Jun 2025 · MLLM
Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
  Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, ..., Yufan Chen, R. Liu, Yitian Shi, M. Sarfraz, Rainer Stiefelhagen
  03 Jun 2025
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
  Yicheng Xiao, Lin Song, Rui Yang, Cheng Cheng, Zunnan Xu, Zhaoyang Zhang, Yixiao Ge, Xiu Li, Ying Shan
  03 Jun 2025
Is Extending Modality The Right Path Towards Omni-Modality?
  Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su
  02 Jun 2025 · VLM
NavBench: Probing Multimodal Large Language Models for Embodied Navigation
  Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, Qi Wu
  01 Jun 2025 · LM&Ro
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
  Shivam Chandhok, Qian Yang, Oscar Manas, Kanishk Jain, Leonid Sigal, Aishwarya Agrawal
  01 Jun 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping
  Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu
  01 Jun 2025
Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs
  Yudong Zhang, Ruobing Xie, Yiqing Huang, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Di Wang, Yu Wang
  01 Jun 2025 · AAML
Enhancing Multimodal Continual Instruction Tuning with BranchLoRA
  Duzhen Zhang, Yong Ren, Zhong-Zhi Li, Yahan Yu, Jiahua Dong, Chenxing Li, Zhilong Ji, Jinfeng Bai
  31 May 2025 · CLL
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
  Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, ..., Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
  30 May 2025
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
  Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
  30 May 2025 · LLMAG, LRM, VLM
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
  Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, ..., Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, X. Zhu
  30 May 2025 · LM&Ro
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
  Chen Huang, Skyler Seto, Hadi Pouransari, Mehrdad Farajtabar, Raviteja Vemulapalli, Fartash Faghri, Oncel Tuzel, B. Theobald, Josh Susskind
  30 May 2025 · CLL
Benchmarking Foundation Models for Zero-Shot Biometric Tasks
  Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, Arun Ross
  30 May 2025 · CVBM, VLM
ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
  Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
  29 May 2025
Synthetic Document Question Answering in Hungarian
  Jonathan Li, Zoltan Csaki, Nidhi Hiremath, Etash Guha, Fenglu Hong, Edward Ma, Urmish Thakker
  29 May 2025
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
  Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, ..., Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
  29 May 2025
Multi-Sourced Compositional Generalization in Visual Question Answering
  Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
  29 May 2025 · CoGe
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
  Linglin Jing, Yuting Gao, Zhigang Wang, Wang Lan, Yiwen Tang, Wenhai Wang, Kaipeng Zhang, Qingpei Guo
  28 May 2025 · MoE