Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning

24 May 2025
Ye Mo
Zirui Shao
Kai Ye
Xianwei Mao
Bo Zhang
Hangdi Xing
Peng Ye
Gang Huang
Kehan Chen
Zhou Huan
Zixu Yan
Sheng Zhou
    LRM

Papers citing "Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning"

40 / 40 papers shown
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yansen Wang
Shengqiong Wu
Yize Zhang
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
208
31
0
16 Mar 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
ReLM, VLM, OffRL, AI4TS, LRM
384
2,022
0
22 Jan 2025
Object-level Visual Prompts for Compositional Image Generation
Gaurav Parmar
Or Patashnik
Kuan-Chieh Wang
Daniil Ostashev
Srinivasa Narasimhan
Jun-Yan Zhu
Daniel Cohen-Or
Kfir Aberman
DiffM
52
6
0
03 Jan 2025
C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness
Yu Kang
Xianghui Sun
Liangyu Chen
Wei Zou
LRM
195
55
0
16 Dec 2024
Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
Hai-Ming Xu
Qi Chen
Lei Wang
Lingqiao Liu
99
3
0
14 Dec 2024
DOGE: Towards Versatile Visual Document Grounding and Referring
Yinan Zhou
Yuxin Chen
Haokun Lin
Shuyu Yang
Li Zhu
Zhongang Qi
Chen Ma
Ying Shan
ObjD
150
4
0
26 Nov 2024
MinerU: An Open-Source Solution for Precise Document Content Extraction
Bin Wang
Chao Xu
Xiaomeng Zhao
Linke Ouyang
Fan Wu
...
Wei Li
Botian Shi
Yu Qiao
Dahua Lin
Conghui He
60
47
0
27 Sep 2024
Attention Prompting on Image for Large Vision-Language Models
Runpeng Yu
Weihao Yu
Xinchao Wang
VLM
102
11
0
25 Sep 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM, SyDa, VLM
166
865
0
06 Aug 2024
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Renshan Zhang
Yibo Lyu
Rui Shao
Gongwei Chen
Weili Guan
Liqiang Nie
76
10
0
19 Jul 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM, VLM
145
644
0
25 Apr 2024
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
Bozhi Luan
Hao Feng
Hong Chen
Yonghui Wang
Wen-gang Zhou
Houqiang Li
MLLM
107
17
0
15 Apr 2024
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Chuwei Luo
Yufan Shen
Zhaoqing Zhu
Qi Zheng
Zhi Yu
Cong Yao
111
49
0
08 Apr 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
117
125
0
19 Mar 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM, MLLM
266
1,216
0
21 Dec 2023
Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Liqi He
Zuchao Li
Xiantao Cai
Ping Wang
LRM
83
25
0
14 Dec 2023
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
H. Cao
Changcun Bao
Chaohu Liu
Huang-wei Chen
Kun Yin
Hao Liu
Yinsong Liu
Deqiang Jiang
Xing Sun
63
14
0
03 Sep 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM, VLM, ObjD
187
945
0
24 Aug 2023
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
162
1,059
0
12 Jul 2023
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Jiabo Ye
Anwen Hu
Haiyang Xu
Qinghao Ye
Mingshi Yan
...
Chenliang Li
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
VLM, MLLM
87
128
0
04 Jul 2023
Fine-Grained Visual Prompting
Lingfeng Yang
Yueze Wang
Xiang Li
Xinlong Wang
Jian Yang
ObjD, VLM
115
68
0
07 Jun 2023
Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
Wenjin Wang
Yunhao Li
Yixin Ou
Yin Zhang
VLM
129
26
0
01 Jun 2023
Document Understanding Dataset and Evaluation (DUDE)
Jordy Van Landeghem
Rubèn Pérez Tito
Łukasz Borchmann
Michal Pietruszka
Pawel Józiak
...
Bertrand Ackaert
Ernest Valveny
Matthew Blaschko
Sien Moens
Tomasz Stanislawek
VGen
93
66
0
15 May 2023
Structured Chain-of-Thought Prompting for Code Generation
Jia Li
Ge Li
Yongming Li
Zhi Jin
LRM
111
138
0
11 May 2023
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering
Lei Wang
Yilang Hu
Jiabang He
Xingdong Xu
Ning Liu
Hui-juan Liu
Hengtao Shen
LRM, MLLM
114
48
0
05 May 2023
GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
Chuwei Luo
Changxu Cheng
Qi Zheng
Cong Yao
78
49
0
21 Apr 2023
Progressive Visual Prompt Learning with Contrastive Feature Re-formation
C. Xu
Yuhan Zhu
Haocheng Shen
Fengyuan Shi
Boheng Chen
Yixuan Liao
Xiaoxin Chen
Limin Wang
VLM
100
21
0
17 Apr 2023
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Aleksandar Shtedritski
Christian Rupprecht
Andrea Vedaldi
VLM, MLLM
106
162
0
13 Apr 2023
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang
Aston Zhang
Mu Li
Hai Zhao
George Karypis
Alexander J. Smola
LRM
140
466
0
02 Feb 2023
VRDU: A Benchmark for Visually-rich Document Understanding
Zilong Wang
Yichao Zhou
Wei Wei
Chen-Yu Lee
Sandeep Tata
58
17
0
15 Nov 2022
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
Chuwei Luo
Guozhi Tang
Qi Zheng
Cong Yao
Lianwen Jin
Chenliang Li
Yang Xue
Luo Si
84
18
0
27 Jun 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro, LRM, AI4CE, ReLM
937
9,784
0
28 Jan 2022
Document AI: Benchmarks, Models and Applications
Lei Cui
Yiheng Xu
Tengchao Lv
Furu Wei
VLM
89
74
0
16 Nov 2021
FeTaQA: Free-form Table Question Answering
Linyong Nan
Chia-Hsuan Hsieh
Ziming Mao
Xi Lin
Neha Verma
...
Isabel Trindade
Renusree Bandaru
Jacob Cunningham
Caiming Xiong
Dragomir R. Radev
LMTD
146
167
0
01 Apr 2021
ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction
Zheng Huang
Kai Chen
Jianhua He
X. Bai
Dimosthenis Karatzas
Shijian Lu
C. V. Jawahar
81
321
0
18 Mar 2021
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew
Dimosthenis Karatzas
C. V. Jawahar
161
748
0
01 Jul 2020
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Yiheng Xu
Minghao Li
Lei Cui
Shaohan Huang
Furu Wei
Ming Zhou
155
718
0
31 Dec 2019
PubLayNet: largest dataset ever for document layout analysis
Xu Zhong
Jianbin Tang
Antonio Jimeno Yepes
54
464
0
16 Aug 2019
ICDAR 2019 Competition on Scene Text Visual Question Answering
Ali Furkan Biten
Rubèn Pérez Tito
Andrés Mafla
Lluís Gómez
Marçal Rusiñol
Minesh Mathew
C. V. Jawahar
Ernest Valveny
Dimosthenis Karatzas
74
76
0
30 Jun 2019
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
Guillaume Jaume
H. K. Ekenel
Jean-Philippe Thiran
183
372
0
27 May 2019