1802.08218
VizWiz Grand Challenge: Answering Visual Questions from Blind People
22 February 2018
Danna Gurari
Qing Li
Abigale Stangl
Anhong Guo
Chi Lin
Kristen Grauman
Jiebo Luo
Jeffrey P. Bigham
CoGe
Papers citing
"VizWiz Grand Challenge: Answering Visual Questions from Blind People"
50 / 573 papers shown
Title
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
89
7
0
15 Oct 2024
When Does Perceptual Alignment Benefit Vision Representations?
Shobhita Sundaram
Stephanie Fu
Lukas Muttenthaler
Netanel Y. Tamir
Lucy Chai
Simon Kornblith
Trevor Darrell
Phillip Isola
113
8
1
14 Oct 2024
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Peng Xia
Siwei Han
Shi Qiu
Yiyang Zhou
Zhaoyang Wang
...
Chenhang Cui
Mingyu Ding
Linjie Li
Lijuan Wang
Huaxiu Yao
163
16
0
14 Oct 2024
Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
Sungkyung Kim
Adam Lee
Junyoung Park
Andrew Chung
Jusang Oh
Jay-Yoon Lee
45
3
0
12 Oct 2024
Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
Shengcao Cao
Liang-Yan Gui
Yu-Xiong Wang
85
3
0
10 Oct 2024
Q-VLM: Post-training Quantization for Large Vision-Language Models
Changyuan Wang
Ziwei Wang
Xiuwei Xu
Yansong Tang
Jie Zhou
Jiwen Lu
MQ
115
7
0
10 Oct 2024
VHELM: A Holistic Evaluation of Vision Language Models
Tony Lee
Haoqin Tu
Chi Heem Wong
Wenhao Zheng
Yiyang Zhou
...
Josselin Somerville Roberts
Michihiro Yasunaga
Huaxiu Yao
Cihang Xie
Percy Liang
VLM
95
16
0
09 Oct 2024
To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models
Junyan Lin
Haoran Chen
Dawei Zhu
Xiaoyu Shen
45
2
0
09 Oct 2024
NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People
Jun Yu
Yifan Zhang
Badrinadh Aila
V. Namboodiri
106
1
0
08 Oct 2024
EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Yifei Xing
Xiangyuan Lan
Ruiping Wang
D. Jiang
Wenjun Huang
Qingfang Zheng
Yaowei Wang
Mamba
121
0
0
08 Oct 2024
ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models
Fanhu Zeng
Fei Zhu
Haiyang Guo
Xu-Yao Zhang
Cheng-Lin Liu
VLM
CLL
80
12
0
08 Oct 2024
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta
Shreyas Verma
Ujjwala Anantheswaran
Kevin Scaria
Mihir Parmar
Swaroop Mishra
Chitta Baral
ReLM
LRM
74
8
0
06 Oct 2024
Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni
Yutao Fan
Lei Zhang
Wangmeng Zuo
LRM
AI4CE
68
12
0
04 Oct 2024
Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
Yufang Liu
Tao Ji
Changzhi Sun
Yuanbin Wu
Aimin Zhou
VLM
MLLM
90
3
0
04 Oct 2024
Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models
Xin Zou
Yizhou Wang
Yibo Yan
Yuanhuiyi Lyu
Kening Zheng
...
Junkai Chen
Peijie Jiang
Qingbin Liu
Chang Tang
Xuming Hu
165
8
0
04 Oct 2024
BadCM: Invisible Backdoor Attack Against Cross-Modal Learning
Zheng Zhang
Xu Yuan
Lei Zhu
Jingkuan Song
Liqiang Nie
AAML
85
12
0
03 Oct 2024
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang
Zilong Xie
Yicheng Feng
Yijiang Li
Xingrun Xing
Sipeng Zheng
Zongqing Lu
MLLM
122
1
0
03 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
Prashanth Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
81
2
0
02 Oct 2024
Addition is All You Need for Energy-efficient Language Models
Hongyin Luo
Wei Sun
30
7
0
01 Oct 2024
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan
Wilbert Pumacay
Nishanth Kumar
Yi Ru Wang
Shulin Tian
Wentao Yuan
Ranjay Krishna
Dieter Fox
Ajay Mandlekar
Yijie Guo
VLM
LRM
117
29
0
01 Oct 2024
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
Shitian Zhao
Renrui Zhang
Xu Luo
Yan Wang
Shanghang Zhang
Peng Gao
91
0
0
01 Oct 2024
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference
Yejin Lee
Anna Y. Sun
Basil Hosmer
Bilge Acun
Can Balioglu
...
Ram Pasunuru
Scott Yih
Sravya Popuri
Xing Liu
Carole-Jean Wu
175
2
0
30 Sep 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
Xiaoye Qu
Tong Zhu
Yu Cheng
124
9
0
28 Sep 2024
DARE: Diverse Visual Question Answering with Robustness Evaluation
Hannah Sterz
Jonas Pfeiffer
Ivan Vulić
OOD
VLM
41
2
0
26 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
175
12
0
26 Sep 2024
Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization
Minyi Zhao
Jie Wang
Zerui Li
Jiyuan Zhang
Zhenbang Sun
Shuigeng Zhou
MLLM
VLM
138
0
0
22 Sep 2024
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
Xin Jiang
Junwei Zheng
Ruiping Liu
Jiahang Li
Jiaming Zhang
Sven Matthiesen
Rainer Stiefelhagen
VLM
55
1
0
21 Sep 2024
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan
Liqiang Niu
Fandong Meng
Wenbo Li
Jie Zhou
Jinsong Su
VLM
60
3
0
20 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM
VLM
LRM
123
73
0
17 Sep 2024
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan
Diego Frassinelli
Barbara Plank
52
1
0
17 Sep 2024
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye
Qiong Wu
Wenhao Lin
Yiyi Zhou
VLM
117
13
0
16 Sep 2024
Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding
Xiaoyu Liang
Jiayuan Yu
Lianrui Mu
Jiedong Zhuang
Jiaqi Hu
Yuchen Yang
Jiangnan Ye
Lu Lu
Jian Chen
Haoji Hu
VLM
68
3
0
10 Sep 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM
VLM
113
5
0
06 Sep 2024
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang
Wengang Zhou
Hao Feng
Houqiang Li
VLM
70
1
0
30 Aug 2024
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
157
12
0
29 Aug 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu
Yue Liao
Le Zhuo
Chenning Xu
Guanghao Zhang
...
Bolin Li
Zhelun Yu
Si Liu
Hongsheng Li
Hao Jiang
VLM
MoE
70
18
0
28 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
118
26
0
28 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
155
68
0
28 Aug 2024
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Aishik Nagar
Shantanu Jaiswal
Cheston Tan
ReLM
LRM
60
12
0
27 Aug 2024
Identifying Crucial Objects in Blind and Low-Vision Individuals' Navigation
Md Touhidul Islam
Imran Kabir
Elena Ariel Pearce
Md. Alimoor Reza
Syed Masum Billah
35
3
0
23 Aug 2024
Multimodal Contrastive In-Context Learning
Yosuke Miyanishi
Minh Le Nguyen
76
2
0
23 Aug 2024
CT-AGRG: Automated Abnormality-Guided Report Generation from 3D Chest CT Volumes
Theo Di Piazza
131
1
0
21 Aug 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
Yaqi Zhao
Yajie Zhang
Ke Lin
Jiahao Wang
Xin Tao
Pengfei Wan
Di Zhang
Baoqun Yin
Wentao Zhang
LRM
111
9
0
21 Aug 2024
ProgramAlly: Creating Custom Visual Access Programs via Multi-Modal End-User Programming
Jaylin Herskovitz
Andi Xu
Rahaf Alharbi
Anhong Guo
39
2
0
20 Aug 2024
Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs
Jinming Liu
Yuntao Wei
Junyan Lin
Shengyang Zhao
Heming Sun
Zhibo Chen
Wenjun Zeng
Xin Jin
137
2
0
16 Aug 2024
Misfitting With AI: How Blind People Verify and Contest AI Errors
Rahaf Alharbi
P. Lor
Jaylin Herskovitz
S. Schoenebeck
Robin Brewer
76
14
0
13 Aug 2024
How Well Can Vision Language Models See Image Details?
Chenhui Gou
Abdulwahab Felemban
Faizan Farooq Khan
Deyao Zhu
Jianfei Cai
Hamid Rezatofighi
Mohamed Elhoseiny
VLM
MLLM
100
5
0
07 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
171
865
0
06 Aug 2024
Towards Flexible Evaluation for Generative Visual Question Answering
Huishan Ji
Q. Si
Zheng Lin
Weiping Wang
90
1
0
01 Aug 2024
ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
Wenjun Huang
Jiakai Pan
Jiahao Tang
Yanyu Ding
Yifei Xing
Yuhe Wang
Zhengzhuo Wang
Jianguo Hu
Mamba
107
7
0
29 Jul 2024