Papers
Communities
Organizations
Events
Blog
Pricing
Search
Open menu
Home
Papers
2412.12932
Cited By
v1
v2
v3 (latest)
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models
17 December 2024
Zihui Cheng
Qiguang Chen
Jin Zhang
Hao Fei
Xiaocheng Feng
Wanxiang Che
Min Li
L. Qin
VLM
MLLM
LRM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models"
40 / 40 papers shown
Title
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Zeyuan Yang
Xueyang Yu
Delin Chen
Maohao Shen
Chuang Gan
LRM
31
0
0
20 Jun 2025
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
Zeyu Liu
Y. Liu
Guanghao Zhu
C. Xie
Zhen Li
...
Qing Li
Shing-Chi Cheung
Shengyu Zhang
Fei Wu
Hongxia Yang
ReLM
LRM
115
0
0
29 May 2025
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
Xuecheng Wu
Jiaxing Liu
Danlei Huang
Xiaoyu Li
Yifan Wang
Chen Chen
Liya Ma
Xuezhi Cao
Junxiao Xue
LRM
126
0
0
20 May 2025
Visual Planning: Let's Think Only with Images
Yi Xu
Chengzu Li
Han Zhou
Xingchen Wan
Caiqi Zhang
Anna Korhonen
Ivan Vulić
LM&Ro
LRM
170
1
0
16 May 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yansen Wang
Shengqiong Wu
Yize Zhang
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
225
31
0
16 Mar 2025
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei
Shengqiong Wu
Wei Ji
Hao Zhang
Hao Fei
Mong Li Lee
Wynne Hsu
LRM
VGen
156
83
0
08 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
204
42
0
31 Dec 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
L. Qin
Qiguang Chen
Hao Fei
Zhi Chen
Min Li
Wanxiang Che
94
11
0
27 Oct 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Hao Fei
Tat-Seng Chua
Shuicheng Yan
AI4TS
130
43
0
27 Jun 2024
Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee
Yequan Wang
Jing Li
Min Zhang
102
23
0
04 Jun 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
127
7
0
31 May 2024
Faithful Logical Reasoning via Symbolic Chain-of-Thought
Jundong Xu
Hao Fei
Liangming Pan
Qian Liu
Mong Li Lee
Wynne Hsu
OffRL
LRM
LLMAG
156
65
0
28 May 2024
Large Language Models Meet NLP: A Survey
Libo Qin
Qiguang Chen
Xiachong Feng
Yang Wu
Yongheng Zhang
Hai-Tao Zheng
Min Li
Wanxiang Che
Philip S. Yu
ALM
LM&MA
ELM
LRM
123
59
0
21 May 2024
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu
Wen Liu
Bo Zhang
Bing-Li Wang
Kai Dong
...
Yaofeng Sun
Chengqi Deng
Hanwei Xu
Zhenda Xie
Chong Ruan
VLM
135
374
0
08 Mar 2024
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan
Junqi Dai
Jiasheng Ye
Yunhua Zhou
Dong Zhang
...
Jie Fu
Tao Gui
Tianxiang Sun
Yugang Jiang
Xipeng Qiu
MLLM
100
136
0
19 Feb 2024
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Debjyoti Mondal
Suraj Modi
Subhadarshi Panda
Rituraj Singh
Godawari Sudhakar Rao
LRM
80
46
0
23 Jan 2024
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
Daoan Zhang
Junming Yang
Hanjia Lyu
Zijian Jin
Yuan Yao
Mingkai Chen
Jiebo Luo
104
40
0
05 Jan 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
324
1,217
0
21 Dec 2023
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He
Zuchao Li
Xiantao Cai
Ping Wang
LRM
87
25
0
14 Dec 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
480
960
0
27 Nov 2023
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task
Yifan Wu
Pengchuan Zhang
Wenhan Xiong
Barlas Oğuz
James C. Gee
Yixin Nie
LRM
77
20
0
15 Nov 2023
Chain of Images for Intuitively Reasoning
Fanxu Meng
Haotong Yang
Yiding Wang
Muhan Zhang
LRM
83
10
0
09 Nov 2023
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
Ge Zheng
Bin Yang
Jiajin Tang
Hong-Yu Zhou
Sibei Yang
LRM
MLLM
100
117
0
25 Oct 2023
Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning across Languages
Libo Qin
Qiguang Chen
Fuxuan Wei
Shijue Huang
Wanxiang Che
LRM
119
95
0
23 Oct 2023
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
131
507
0
11 Sep 2023
Generating Images with Multimodal Language Models
Jing Yu Koh
Daniel Fried
Ruslan Salakhutdinov
MLLM
181
259
0
26 May 2023
Reasoning Implicit Sentiment with Chain-of-Thought Prompting
Hao Fei
Bobo Li
Qian Liu
Lidong Bing
Fei Li
Tat-Seng Chua
ReLM
LRM
127
104
0
18 May 2023
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai
Junnan Li
Dongxu Li
A. M. H. Tiong
Junqi Zhao
Weisheng Wang
Boyang Albert Li
Pascale Fung
Steven C. H. Hoi
MLLM
VLM
291
2,105
0
11 May 2023
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
Lei Wang
Yilang Hu
Jiabang He
Xingdong Xu
Ning Liu
Hui-juan Liu
Hengtao Shen
LRM
MLLM
133
48
0
05 May 2023
Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
169
553
0
03 Apr 2023
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang
Aston Zhang
Mu Li
Hai Zhao
George Karypis
Alexander J. Smola
LRM
178
466
0
02 Feb 2023
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
O. Yu. Golovneva
Moya Chen
Spencer Poff
Martin Corredor
Luke Zettlemoyer
Maryam Fazel-Zarandi
Asli Celikyilmaz
ReLM
LRM
122
152
0
15 Dec 2022
Abstract Visual Reasoning with Tangram Shapes
Anya Ji
Noriyuki Kojima
N. Rush
Alane Suhr
Wai Keen Vong
Robert D. Hawkins
Yoav Artzi
LRM
82
40
0
29 Nov 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
304
1,303
0
20 Sep 2022
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
100
556
0
03 Jun 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
651
4,077
0
24 May 2022
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel
Ari Holtzman
Maxwell Forbes
Ronan Le Bras
Yejin Choi
CLIP
311
1,597
0
18 Apr 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
1.1K
30,116
0
26 Feb 2021
JHU-CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method
Vishwanath A. Sindagi
R. Yasarla
Vishal M. Patel
108
230
0
07 Apr 2020
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers
Yonatan Bisk
Ali Farhadi
Yejin Choi
LRM
BDL
OCL
ReLM
282
885
0
27 Nov 2018
1