ResearchTrend.AI
arXiv: 2403.05525
DeepSeek-VL: Towards Real-World Vision-Language Understanding

8 March 2024
Haoyu Lu
Wen Liu
Bo Zhang
Bing-Li Wang
Kai Dong
Bo Liu
Jingxiang Sun
Zhaolin Ren
Zhuoshu Li
Hao Yang
Yaofeng Sun
Chengqi Deng
Hanwei Xu
Zhenda Xie
Chong Ruan
    VLM

Papers citing "DeepSeek-VL: Towards Real-World Vision-Language Understanding"

50 / 109 papers shown
Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks
Dong Nguyen Tien
Dung D. Le
AAML
10
0
0
19 Jun 2025
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
Shuo Xing
Lanqing Guo
Hongyuan Hua
Seoyoung Lee
Peiran Li
Yufei Wang
Zhangyang Wang
Zhengzhong Tu
VLM
38
0
0
18 Jun 2025
3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
Yuke Xing
Jiarui Wang
Peizhi Niu
Wenjie Huang
Guangtao Zhai
Yiling Xu
3DGS
23
0
0
17 Jun 2025
RationalVLA: A Rational Vision-Language-Action Model with Dual System
Wenxuan Song
Jiayi Chen
Wenxue Li
Xu He
Han Zhao
...
Xinhu Zheng
Zhe Liu
Hesheng Wang
Yunhui Liu
Haoang Li
LM&Ro
161
1
0
12 Jun 2025
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
Yitong Zhang
Jia Li
L. Cai
Ge Li
VLM
43
0
0
11 Jun 2025
Reinforcing Multimodal Understanding and Generation with Dual Self-rewards
Jixiang Hong
Yiran Zhang
Guanzhong Wang
Yi Liu
Ji-Rong Wen
Rui Yan
LRM
26
0
0
09 Jun 2025
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
Yi Zhao
Siqi Wang
Jing Li
54
0
0
04 Jun 2025
MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection
Juntong Li
Lingwei Dang
Yukun Su
Yun Hao
Qingxin Xiao
Yongwei Nie
Qingyao Wu
64
0
0
03 Jun 2025
DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models
Jiarui Wang
Huiyu Duan
Juntong Wang
Ziheng Jia
Woo Yi Yang
...
Yu Zhao
Jiaying Qian
Yuke Xing
Guangtao Zhai
Xiongkuo Min
59
0
0
03 Jun 2025
Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation
Yichi Zhang
Zhuo Chen
Lingbing Guo
Yajing Xu
M. Zhang
Wen Zhang
H. Chen
58
0
0
02 Jun 2025
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
Yuyuan Liu
Yuanhong Chen
Chong Wang
Junlin Han
Junde Wu
Can Peng
Jingkun Chen
Yu Tian
Gustavo Carneiro
VLM
49
0
0
01 Jun 2025
Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times
Olga Loginova
Sofía Ortega Loguinova
LRM
33
0
0
01 Jun 2025
Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts
Xin He
Xumeng Han
Longhui Wei
Lingxi Xie
Qi Tian
MoE
34
0
0
30 May 2025
Benchmarking Foundation Models for Zero-Shot Biometric Tasks
Redwan Sony
Parisa Farmanifard
Hamzeh Alzwairy
Nitish Shukla
Arun Ross
CVBM, VLM
49
0
0
30 May 2025
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Chenbin Pan
Wenbin He
Zhengzhong Tu
Liu Ren
LRM, VLM
68
0
0
29 May 2025
VModA: An Effective Framework for Adaptive NSFW Image Moderation
Han Bao
Qinying Wang
Zhi Chen
Qingming Li
Xuhong Zhang
Changjiang Li
Zonghui Wang
Shouling Ji
Wenzhi Chen
25
0
0
29 May 2025
NegVQA: Can Vision Language Models Understand Negation?
Yuhui Zhang
Yuchang Su
Yiming Liu
Serena Yeung-Levy
MLLM, CoGe
40
0
0
28 May 2025
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
Linglin Jing
Yuting Gao
Zhigang Wang
Wang Lan
Yiwen Tang
Wenhai Wang
Kaipeng Zhang
Qingpei Guo
MoE
30
0
0
28 May 2025
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
Jin Wang
Yao Lai
Aoxue Li
Shifeng Zhang
Jiacheng Sun
Ning Kang
Chengyue Wu
Zhenguo Li
Ping Luo
67
2
0
26 May 2025
Medical Large Vision Language Models with Multi-Image Visual Ability
Xikai Yang
Juzheng Miao
Yuchen Yuan
Jiaze Wang
Qi Dou
Jinpeng Li
Pheng Ann Heng
34
0
0
25 May 2025
VISTA: Vision-Language Inference for Training-Free Stock Time-Series Analysis
Tina Khezresmaeilzadeh
Parsa Razmara
Seyedarmin Azizi
Mohammad Erfan Sadeghi
Erfan Baghaei Portaghloo
AI4TS
276
0
0
24 May 2025
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
Ziwei Zhou
Rui Wang
Zuxuan Wu
AuLLM, VGen
80
0
0
23 May 2025
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Zebin You
Shen Nie
Xiaolu Zhang
Jun Hu
Jun Zhou
Zhiwu Lu
J. Wen
Chongxuan Li
MLLM, VLM
112
2
0
22 May 2025
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Song Dai
Yibo Yan
Jiamin Su
Dongfang Zihao
Yubo Gao
...
Jungang Li
Junyan Zhang
Sicheng Tao
Zhuoran Gao
Xuming Hu
LRM, AI4CE
61
0
0
21 May 2025
From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems
Xiuchao Sui
Daiying Tian
Qi Sun
Ruirui Chen
Dongkyu Choi
Kenneth Kwok
Soujanya Poria
LM&Ro
113
0
0
21 May 2025
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Maoyuan Ye
Jing Zhang
Juhua Liu
Bo Du
Dacheng Tao
LRM
180
0
0
18 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li
Zicheng Zhang
Songhua Liu
Weihao Yu
Xinchao Wang
VLM
142
0
0
17 May 2025
OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
Fanqi Lin
Ruiqian Nai
Yingdong Hu
Jiacheng You
Junming Zhao
Yang Gao
LRM
97
0
0
17 May 2025
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
Pengju Xu
Yan Wang
Shuyuan Zhang
Xuan Zhou
Xin Li
...
Fengzhao Li
Shuigeng Zhou
Xingyu Wang
Yi Zhang
Haiying Zhao
VLM
128
1
0
16 May 2025
Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
Biao Gong
Cheng Zou
Dandan Zheng
Hu Yu
Jingdong Chen
...
Qingpei Guo
Rui Liu
Weilong Chai
Xinyu Xiao
Ziyuan Huang
MLLM
216
3
0
05 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
Jieneng Chen
LRM
118
1
0
01 May 2025
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Yuanchen Wu
Lu Zhang
Hang Yao
Junlong Du
Ke Yan
Shouhong Ding
Yunsheng Wu
Xuzhao Li
MLLM
241
0
0
29 Apr 2025
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang
Wenliang Zheng
Aashrith Madasu
Peng Shi
Ryo Kamoi
...
Ranran Haoran Zhang
Avitej Iyer
Renze Lou
Wenpeng Yin
Rui Zhang
306
0
0
25 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xiang Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL, AI4TS, SyDa, LRM, VLM
152
9
0
23 Apr 2025
Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi R. Fung
135
0
0
23 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
122
6
0
20 Apr 2025
Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Zhanglin Wu
Tengfei Song
Ning Xie
Mengli Zhu
Weidong Zhang
...
Pengfei Li
Chong Li
Junhao Zhu
Hao Yang
Shiliang Sun
114
2
0
16 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
Jian Wu
Hao Yang
Xinhua Zeng
Guibing He
Zhe Chen
Zhu Li
Xinming Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
385
1
0
12 Apr 2025
A Survey of Large Language Models in Mental Health Disorder Detection on Social Media
Zhuohan Ge
Nicole Hu
Darian Li
Yubo Wang
Shihao Qi
Yuming Xu
Han Shi
Junxuan Zhang
AI4MH
119
0
0
03 Apr 2025
Efficient Adaptation For Remote Sensing Visual Grounding
Hasan Moughnieh
Mohamad Chalhoub
Hasan Nasrallah
Cristiano Nattero
Paolo Campanella
Giovanni Nico
A. Ghandour
109
0
0
29 Mar 2025
Shape and Texture Recognition in Large Vision-Language Models
Sagi Eppel
Mor Bismut
Alona Faktor
3DV, VLM
97
2
0
29 Mar 2025
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu
Yuyao Sun
Zilu Zhang
Leping Huang
Jianliang Zeng
Mao Shu
Huo Cao
140
4
0
27 Mar 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu
Wentao Zhang
Lumin Xu
Sheng Jin
Zhonghua Wu
Qingyi Tao
Wentao Liu
Wei Li
Chen Change Loy
VGen
459
6
0
27 Mar 2025
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark
Ivan Sviridov
Amina Miftakhova
Artemiy Tereshchenko
Galina Zubkova
Pavel Blinov
Andrey Savchenko
LM&MA
95
0
0
26 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
219
2
0
26 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
211
1
0
24 Mar 2025
Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Yize Zhang
Chunwang Zou
Bo Wang
Jing Qin
137
0
0
24 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
142
8
0
18 Mar 2025
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Wenzhuo Xu
Zhipeng Wei
Xiongtao Sun
Deyue Zhang
Dongdong Yang
Quanchen Zou
Xinming Zhang
AAML
90
0
0
10 Mar 2025
Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios
Chenglu Pan
Xiaogang Xu
Ganggui Ding
Yunke Zhang
Wenbo Li
Jiarong Xu
Qingbiao Wu
142
0
0
10 Mar 2025