ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1505.00468
  4. Cited By
VQA: Visual Question Answering
v1v2v3v4v5v6v7 (latest)

VQA: Visual Question Answering

3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
    CoGe
ArXiv (abs)PDFHTML

Papers citing "VQA: Visual Question Answering"

50 / 2,957 papers shown
Title
Semantic Scene Difference Detection in Daily Life Patroling by Mobile
  Robots using Pre-Trained Large-Scale Vision-Language Model
Semantic Scene Difference Detection in Daily Life Patroling by Mobile Robots using Pre-Trained Large-Scale Vision-Language Model
Yoshiki Obinata
Kento Kawaharazuka
Naoaki Kanazawa
N. Yamaguchi
Naoto Tsukamoto
Iori Yanokura
Shingo Kitagawa
Koki Shinjo
K. Okada
Masayuki Inaba
LM&Ro
65
6
0
28 Sep 2023
Toloka Visual Question Answering Benchmark
Toloka Visual Question Answering Benchmark
Mert Pilanci
Nikita Pavlichenko
Sergey Koshelev
Daniil Likhobaba
Alisa Smirnova
81
4
0
28 Sep 2023
Social Media Fashion Knowledge Extraction as Captioning
Social Media Fashion Knowledge Extraction as Captioning
Yifei Yuan
Wenxuan Zhang
Yang Deng
Wai Lam
54
1
0
28 Sep 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
69
3
0
28 Sep 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Avamarie Brueggeman
Andrea Madotto
Zhaojiang Lin
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
89
94
0
27 Sep 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced
  Text-image Comprehension and Composition
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang
Xiaoyi Wang
Bin Wang
Yuhang Cao
Chao Xu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Da Lin
Jiaqi Wang
MLLM
191
241
0
26 Sep 2023
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level
  Vision
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
Haoning Wu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
Liang Liao
...
Chunyi Li
Wenxiu Sun
Qiong Yan
Guangtao Zhai
Weisi Lin
VLM
139
157
0
25 Sep 2023
A Survey on Image-text Multimodal Models
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
128
7
0
23 Sep 2023
Multi-modal Domain Adaptation for REG via Relation Transfer
Multi-modal Domain Adaptation for REG via Relation Transfer
Yifan Ding
Liqiang Wang
Boqing Gong
63
0
0
23 Sep 2023
Contextual Emotion Estimation from Image Captions
Contextual Emotion Estimation from Image Captions
V. Yang
Archita Srivastava
Yasaman Etesam
Chuxuan Zhang
Angelica Lim
73
3
0
22 Sep 2023
Towards Answering Health-related Questions from Medical Videos: Datasets
  and Approaches
Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches
Deepak Gupta
Kush Attal
Dina Demner-Fushman
LM&MA
54
1
0
21 Sep 2023
Sentence Attention Blocks for Answer Grounding
Sentence Attention Blocks for Answer Grounding
Seyedalireza Khoshsirat
Chandra Kambhamettu
75
8
0
20 Sep 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
111
199
0
20 Sep 2023
Visual Question Answering in the Medical Domain
Visual Question Answering in the Medical Domain
Louisa Canepa
Sonit Singh
Arcot Sowmya
58
14
0
20 Sep 2023
Investigating the Catastrophic Forgetting in Multimodal Large Language
  Models
Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai
Shengbang Tong
Xiao Li
Mu Cai
Qing Qu
Yong Jae Lee
Yi Ma
VLMMLLMCLL
171
88
0
19 Sep 2023
Syntax Tree Constrained Graph Network for Visual Question Answering
Syntax Tree Constrained Graph Network for Visual Question Answering
Xiangrui Su
Qi Zhang
Chongyang Shi
Jiachang Liu
Liang Hu
GNNNAI
68
3
0
17 Sep 2023
D3: Data Diversity Design for Systematic Generalization in Visual
  Question Answering
D3: Data Diversity Design for Systematic Generalization in Visual Question Answering
Amir Rahimi
Vanessa D’Amario
Moyuru Yamada
Kentaro Takemoto
Tomotake Sasaki
Xavier Boix
68
1
0
15 Sep 2023
PoseFix: Correcting 3D Human Poses with Natural Language
PoseFix: Correcting 3D Human Poses with Natural Language
Ginger Delmas
Philippe Weinzaepfel
Francesc Moreno-Noguer
Grégory Rogez
99
24
0
15 Sep 2023
Learning by Self-Explaining
Learning by Self-Explaining
Wolfgang Stammer
Felix Friedrich
David Steinmann
Manuel Brack
Hikaru Shindo
Kristian Kersting
138
12
0
15 Sep 2023
A Data Source for Reasoning Embodied Agents
A Data Source for Reasoning Embodied Agents
Jack Lanchantin
Sainbayar Sukhbaatar
Gabriel Synnaeve
Yuxuan Sun
Kavya Srinet
Arthur Szlam
LM&RoLRM
57
5
0
14 Sep 2023
MMICL: Empowering Vision-language Model with Multi-Modal In-Context
  Learning
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Haozhe Zhao
Zefan Cai
Shuzheng Si
Xiaojian Ma
Kaikai An
Liang Chen
Zixuan Liu
Sheng Wang
Wenjuan Han
Baobao Chang
MLLMVLM
130
143
0
14 Sep 2023
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness
  and Ethics
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu
Bingchen Zhao
Chen Wei
Cihang Xie
MLLM
73
15
0
13 Sep 2023
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
Palaash Agrawal
Haidi Azaman
Cheston Tan
160
3
0
13 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection
  and Segmentation
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
77
14
0
12 Sep 2023
Language Models as Black-Box Optimizers for Vision-Language Models
Language Models as Black-Box Optimizers for Vision-Language Models
Shihong Liu
Zhiqiu Lin
Samuel Yu
Ryan Lee
Tiffany Ling
Deepak Pathak
Deva Ramanan
VLM
126
30
0
12 Sep 2023
MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering
  over Text, Tables and Images
MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images
Weihao Liu
Fangyu Lei
Tongxu Luo
Jiahe Lei
Shizhu He
Jun Zhao
Kang Liu
LMTD
61
11
0
09 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex
  Visually-Grounded Referencing using Determiners
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
Clarence Lee
M Ganesh Kumar
Cheston Tan
76
3
0
07 Sep 2023
A Joint Study of Phrase Grounding and Task Performance in Vision and
  Language Models
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima
Hadar Averbuch-Elor
Yoav Artzi
76
2
0
06 Sep 2023
CIEM: Contrastive Instruction Evaluation Method for Better Instruction
  Tuning
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning
Hongyu Hu
Jiyuan Zhang
Minyi Zhao
Zhenbang Sun
MLLM
82
49
0
05 Sep 2023
ATM: Action Temporality Modeling for Video Question Answering
ATM: Action Temporality Modeling for Video Question Answering
Junwen Chen
Jie Zhu
Yu Kong
69
1
0
05 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical
  Learning
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
68
7
0
05 Sep 2023
A Survey on Interpretable Cross-modal Reasoning
A Survey on Interpretable Cross-modal Reasoning
Dizhan Xue
Shengsheng Qian
Zuyi Zhou
Changsheng Xu
LRM
107
4
0
05 Sep 2023
Multimodal Contrastive Learning with Hard Negative Sampling for Human
  Activity Recognition
Multimodal Contrastive Learning with Hard Negative Sampling for Human Activity Recognition
Hyeongju Choi
Apoorva Beedu
Irfan Essa
SSL
73
3
0
03 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual
  Augmentation
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Separate and Locate: Rethink the Text in Text-based Visual Question
  Answering
Separate and Locate: Rethink the Text in Text-based Visual Question Answering
Chengyang Fang
Jiangnan Li
Liang Li
Can Ma
Dayong Hu
83
13
0
31 Aug 2023
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes
  in Product Images for e-commerce Vision-Language Applications
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Wenyi Wu
Karim Bouyarmane
Ismail B. Tutar
33
2
0
30 Aug 2023
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning
  Based on Visually Grounded Conversations
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
Kilichbek Haydarov
Xiaoqian Shen
Avinash Madasu
Mahmoud Salem
Jia Li
Gamaleldin F. Elsayed
Mohamed Elhoseiny
67
4
0
30 Aug 2023
On the Potential of CLIP for Compositional Logical Reasoning
On the Potential of CLIP for Compositional Logical Reasoning
Justin D. Brody
LRM
48
4
0
30 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
  Detection
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
80
5
0
30 Aug 2023
Explaining Vision and Language through Graphs of Events in Space and
  Time
Explaining Vision and Language through Graphs of Events in Space and Time
Mihai Masala
Nicolae Cudlenco
Traian Rebedea
Marius Leordeanu
VLM
93
2
0
29 Aug 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
84
29
0
28 Aug 2023
Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on
  Language, Multimodal, and Scientific GPT Models
Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models
Kaiyuan Gao
Su He
Zhenyu He
Jiacheng Lin
Qizhi Pei
Jie Shao
Wei Zhang
LM&MASyDa
64
5
0
27 Aug 2023
Synergizing Contrastive Learning and Optimal Transport for 3D Point
  Cloud Domain Adaptation
Synergizing Contrastive Learning and Optimal Transport for 3D Point Cloud Domain Adaptation
Siddharth Katageri
Arkadipta De
Chaitanya Devaguptapu
Vssv Prasad
Charu Sharma
Manohar Kaul
3DPC
44
4
0
27 Aug 2023
Towards Fast and Accurate Image-Text Retrieval with Self-Supervised
  Fine-Grained Alignment
Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment
Jiamin Zhuang
Jing Yu
Yang Ding
Xiangyang Qu
Yue Hu
83
9
0
27 Aug 2023
ORES: Open-vocabulary Responsible Visual Synthesis
ORES: Open-vocabulary Responsible Visual Synthesis
Minheng Ni
Chenfei Wu
Xiaodong Wang
Sheng-Siang Yin
Lijuan Wang
Zicheng Liu
Nan Duan
DiffM
75
9
0
26 Aug 2023
DLIP: Distilling Language-Image Pre-training
DLIP: Distilling Language-Image Pre-training
Huafeng Kuang
Jie Wu
Xiawu Zheng
Ming Li
Xuefeng Xiao
Rui Wang
Min Zheng
Rongrong Ji
VLM
70
4
0
24 Aug 2023
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across
  Languages
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu
Yuan Yao
Chong Wang
Shanonan Wang
Yinxu Pan
...
Yankai Lin
Jiao Xue
Dahai Li
Zhiyuan Liu
Maosong Sun
MLLMVLM
120
56
0
23 Aug 2023
GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised
  Learning
GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning
Mainak Singha
Ankit Jha
Biplab Banerjee
VLM
77
4
0
22 Aug 2023
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive
  Language-Image Pre-training
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Xi Deng
Han Shi
Runhu Huang
Changlin Li
Hang Xu
Jianhua Han
James T. Kwok
Shen Zhao
Wei Zhang
Xiaodan Liang
CLIPVLM
91
3
0
22 Aug 2023
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal
  Heterogeneous Federated Learning
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
Haokun Chen
Yao Zhang
Denis Krompass
Jindong Gu
Volker Tresp
FedML
114
55
0
21 Aug 2023
Previous
123...171819...585960
Next