Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2310.03744
Cited By
Improved Baselines with Visual Instruction Tuning
5 October 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Improved Baselines with Visual Instruction Tuning"
50 / 483 papers shown
Title
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang
Chi Chen
Fuwen Luo
Yurui Dong
Yuanchi Zhang
Yuzhuang Xu
Xiaolong Wang
Peng Li
Yang Liu
LRM
40
3
0
07 Oct 2024
Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava
Mehdi Azarafza
Fatima Idrees
Ali Ehteshami Bejnordi
Charles Steinmetz
Stefan Henkler
A. Rettberg
39
0
0
07 Oct 2024
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou
Ruohan Wang
Boyuan Zheng
Yanan Xie
Cheng Chang
Yiheng Shu
Huan Sun
Yu Su
LM&Ro
LLMAG
76
49
0
07 Oct 2024
Geometric Analysis of Reasoning Trajectories: A Phase Space Approach to Understanding Valid and Invalid Multi-Hop Reasoning in LLMs
Javier Marin
LRM
85
0
0
06 Oct 2024
Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks
Jiayi He
Hehai Lin
Q. Wang
Yi Ren Fung
Heng Ji
ReLM
LRM
103
4
0
05 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
84
26
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
69
5
0
04 Oct 2024
RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions
Ziyao Zeng
Yangchao Wu
Hyoungseob Park
Daniel Wang
Fengyu Yang
Stefano Soatto
Dong Lao
Byung-Woo Hong
Alex Wong
MDE
25
7
0
03 Oct 2024
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
Kaizhi Zheng
Xiaotong Chen
Xuehai He
Jing Gu
Linjie Li
Zhengyuan Yang
Kevin Qinghong Lin
Jianfeng Wang
Lijuan Wang
Xin Eric Wang
KELM
DiffM
40
0
0
03 Oct 2024
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang
Zilong Xie
Yicheng Feng
Yijiang Li
Xingrun Xing
Sipeng Zheng
Zongqing Lu
MLLM
30
0
0
03 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang
Anish Kachinthaya
Suzie Petryk
Yossi Gandelsman
VLM
34
15
0
03 Oct 2024
PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation
Mike Ranzinger
Jon Barker
Greg Heinrich
Pavlo Molchanov
Bryan Catanzaro
Andrew Tao
42
5
0
02 Oct 2024
The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
Hong Li
Nanxi Li
Yuanjie Chen
Jianbin Zhu
Qinlu Guo
Cewu Lu
Yong-Lu Li
MLLM
39
1
0
02 Oct 2024
Probing Mechanical Reasoning in Large Vision Language Models
Haoran Sun
Qingying Gao
Haiyun Lyu
Dezhi Luo
Yijiang Li
Hokin Deng
LRM
44
2
0
01 Oct 2024
Vision Language Models Know Law of Conservation without Understanding More-or-Less
Dezhi Luo
Haiyun Lyu
Qingying Gao
Haoran Sun
Yijiang Li
Hokin Deng
17
1
0
01 Oct 2024
Vision Language Models See What You Want but not What You See
Qingying Gao
Yijiang Li
Haiyun Lyu
Haoran Sun
Dezhi Luo
Hokin Deng
LRM
VLM
34
3
0
01 Oct 2024
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
Zicheng Zhang
Ziheng Jia
H. Wu
Chunyi Li
Zijian Chen
...
Wei Sun
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
32
7
0
30 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
60
11
0
26 Sep 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
134
32
0
26 Sep 2024
CloudTrack: Scalable UAV Tracking with Cloud Semantics
Yannik Blei
Michael Krawez
Nisarga Nilavadi
Tanja Katharina Kaiser
Wolfram Burgard
44
1
0
24 Sep 2024
With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models
Tyler Loakman
Yucheng Li
Chenghua Lin
VLM
37
1
0
23 Sep 2024
Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models
Patrick Amadeus Irawan
Genta Indra Winata
Samuel Cahyawijaya
Ayu Purwarianti
34
0
0
23 Sep 2024
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
Jing Hao
Yuxiang Zhao
Song Chen
Yanpeng Sun
Qiang Chen
Gang Zhang
Kun Yao
Errui Ding
Jingdong Wang
VLM
VGen
MLLM
48
5
0
20 Sep 2024
Guided Profile Generation Improves Personalization with LLMs
Jiarui Zhang
34
4
0
19 Sep 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
86
54
0
19 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
36
9
0
18 Sep 2024
AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots
Zhaxizhuoma
Pengan Chen
Ziniu Wu
Jiawei Sun
Dong Wang
Peng Zhou
Nieqing Cao
Yan Ding
Bin Zhao
Xuelong Li
46
4
0
18 Sep 2024
MotIF: Motion Instruction Fine-tuning
Minyoung Hwang
Joey Hejna
Dorsa Sadigh
Yonatan Bisk
54
1
0
16 Sep 2024
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye
Qiong Wu
Wenhao Lin
Yiyi Zhou
VLM
41
10
0
16 Sep 2024
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
Yuan-Hong Liao
Rafid Mahmood
Sanja Fidler
David Acuna
ReLM
LRM
39
9
0
15 Sep 2024
Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings
Tanisha Hisariya
Huan Zhang
Jinhua Liang
29
3
0
12 Sep 2024
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain
Ahmed Imteaj
AAML
VLM
43
3
0
11 Sep 2024
Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering
Larissa Pusch
Tim O. F. Conrad
38
0
0
06 Sep 2024
Generating Faithful and Salient Text from Multimodal Data
Tahsina Hashem
Weiqing Wang
Derry Tanti Wijaya
Mohammed Eunus Ali
Yuan-Fang Li
31
0
0
06 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
115
1
0
04 Sep 2024
EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding
Muye Huang
Han Lai
Xinyu Zhang
Wenjun Wu
Jie Ma
Lingling Zhang
Jun Liu
39
4
0
03 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead
Jacob Phillips
Sean Hendryx
31
0
0
30 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
30
53
0
28 Aug 2024
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
...
Qingsong Wen
Zhang Zhang
L. Wang
Rong Jin
Tieniu Tan
OffRL
69
36
0
23 Aug 2024
IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities
Bin Wang
Chunyu Xie
Dawei Leng
Yuhui Yin
MLLM
54
1
0
23 Aug 2024
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
Feipeng Ma
Yizhou Zhou
Hebei Li
Zilong He
Siying Wu
Fengyun Rao
Siying Wu
Fengyun Rao
Yueyi Zhang
Xiaoyan Sun
33
3
0
21 Aug 2024
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs
Eui Jun Hwang
Sukmin Cho
Junmyeong Lee
Jong C. Park
SLR
76
4
0
20 Aug 2024
Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant
Guofeng Mei
Luigi Riz
Yiming Wang
Fabio Poiesi
ISeg
VLM
64
3
0
20 Aug 2024
Visual Agents as Fast and Slow Thinkers
Guangyan Sun
Mingyu Jin
Zhenting Wang
Cheng-Long Wang
Siqi Ma
Qifan Wang
Ying Nian Wu
Ying Nian Wu
Dongfang Liu
Dongfang Liu
LLMAG
LRM
79
13
0
16 Aug 2024
HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes
Xuanyu Su
Yansong Li
Diana Inkpen
Nathalie Japkowicz
VLM
81
2
0
11 Aug 2024
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Fushuo Huo
Wenchao Xu
Zhong Zhang
Yining Qi
Zhicheng Chen
Peilin Zhao
VLM
MLLM
66
19
0
04 Aug 2024
ExpertAF: Expert Actionable Feedback from Video
Kumar Ashutosh
Tushar Nagarajan
Georgios Pavlakos
Kris M. Kitani
Kristen Grauman
VGen
44
2
0
01 Aug 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu
Xinyue Cai
Jiayi Ji
Jiale Li
Oucheng Huang
Gen Luo
Hao Fei
Xiaoshuai Sun
Rongrong Ji
MLLM
49
7
0
31 Jul 2024
Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection
Jinfa Huang
Jinsheng Pan
Zhongwei Wan
Hanjia Lyu
Jiebo Luo
58
4
0
30 Jul 2024
OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance
Yongqiang Yao
Jingru Tan
Jiahao Hu
Feizhao Zhang
Xin Jin
...
Ruihao Gong
Pengfei Liu
Pengfei Liu
Dahua Lin
Ningyi Xu
VLM
52
1
0
30 Jul 2024
Previous
1
2
3
...
10
5
6
7
8
9
Next