arXiv:2312.07533
VILA: On Pre-training for Visual Language Models

12 December 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM, VLM

Papers citing "VILA: On Pre-training for Visual Language Models"

50 / 139 papers shown
Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes
Chao-Yeh Chen
Nobel Dang
Juexiao Zhang
Wenkai Sun
Pengfei Zheng
Xuhang He
Yimeng Ye
Taarun Srinivas
Chen Feng
3DV
33
0
0
20 Jun 2025
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Yi Chen
Yuying Ge
Rui Wang
Yixiao Ge
Junhao Cheng
Ying Shan
Xihui Liu
OffRL, VLM, LRM
32
0
0
19 Jun 2025
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
44
0
0
18 Jun 2025
SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models
Xinyi Zhao
Congjing Zhang
Pei Guo
Wei Li
Lin Chen
Chaoyue Zhao
Shuai Huang
23
0
0
15 Jun 2025
How Visual Representations Map to Language Feature Space in Multimodal LLMs
Constantin Venhoff
Ashkan Khakzar
Sonia Joseph
Philip Torr
Neel Nanda
19
0
0
13 Jun 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu
Y. Wu
Meng Chu
Zhifei Ren
Z. Huang
...
Conghui He
Yu Qiao
Yali Wang
Yi Wang
L. Wang
LRM
119
0
0
12 Jun 2025
Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Brian Gordon
Yonatan Bitton
Andreea Marzoca
Yasumasa Onoe
Xiao Wang
Daniel Cohen-Or
Idan Szpektor
CoGe
24
0
0
09 Jun 2025
CoMemo: LVLMs Need Image Context with Image Memory
Shi-Qi Liu
Weijie Su
Xizhou Zhu
Wenhai Wang
Jifeng Dai
VLM
48
0
0
06 Jun 2025
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
77
0
0
04 Jun 2025
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
Yicheng Xiao
Lin Song
Rui Yang
Cheng Cheng
Zunnan Xu
Zhaoyang Zhang
Yixiao Ge
Xiu Li
Ying Shan
60
2
0
03 Jun 2025
METok: Multi-Stage Event-based Token Compression for Efficient Long Video Understanding
Mengyue Wang
Shuo Chen
Kristian Kersting
Volker Tresp
Yunpu Ma
VLM
62
0
0
03 Jun 2025
Affordance Benchmark for MLLMs
Junying Wang
Wenzhe Li
Yalun Wu
Yingji Liang
Yijin Guo
Chunyi Li
Haodong Duan
Zicheng Zhang
Guangtao Zhai
44
0
0
01 Jun 2025
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng
Shijia Huang
Yanyang Li
Liwei Wang
45
0
0
30 May 2025
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg
Naman D. Singh
Matthias Hein
CoGe, VLM
37
0
0
30 May 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Gen Luo
Ganlin Yang
Ziyang Gong
Guanzhou Chen
Haonan Duan
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Rongrong Ji
X. Zhu
LM&Ro
39
1
0
30 May 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay
Mukul Ranjan
Zhiqiang Shen
Mohamed Elhoseiny
VLM
26
0
0
30 May 2025
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Junyu Luo
Zhizhuo Kou
Liming Yang
Xiao Luo
Jinsheng Huang
...
Jiaming Ji
Xuanzhe Liu
Sirui Han
Ming Zhang
Yike Guo
20
0
0
30 May 2025
A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Shengyuan Liu
Boyun Zheng
Wenting Chen
Zhihao Peng
Zhenfei Yin
Jing Shao
Jiancong Hu
Yixuan Yuan
ELM
84
0
0
29 May 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu
Fangfu Liu
Yi-Hsin Hung
Yueqi Duan
LRM
85
1
0
29 May 2025
NegVQA: Can Vision Language Models Understand Negation?
Yuhui Zhang
Yuchang Su
Yiming Liu
Serena Yeung-Levy
MLLM, CoGe
50
0
0
28 May 2025
QuARI: Query Adaptive Retrieval Improvement
Eric Xing
Abby Stylianou
Robert Pless
Nathan Jacobs
VLM
27
0
0
27 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
67
0
0
27 May 2025
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Fanheng Kong
Jingyuan Zhang
Hongzhi Zhang
Shi Feng
Daling Wang
Linhao Yu
Xingguang Ji
Yu Tian
Qi Wang
Fuzheng Zhang
62
1
0
26 May 2025
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models
Rui Cai
Bangzheng Li
Xiaofei Wen
Muhao Chen
Zhe Zhao
26
0
0
26 May 2025
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Baolin Zheng
Guanlin Chen
Hongqiong Zhong
Qingyang Teng
Yingshui Tan
...
Jincheng Wei
Wenbo Su
Xiaoyong Zhu
Bo Zheng
Kaifu Zhang
ELM
29
0
0
26 May 2025
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs
Xuan Zhang
Cunxiao Du
Sicheng Yu
Jiawei Wu
Fengzhuo Zhang
Wei Gao
Qian Liu
65
0
0
25 May 2025
Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM
Penghao Wu
Lewei Lu
Ziwei Liu
129
0
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
102
0
0
20 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan li
Zicheng Zhang
Songhua Liu
Weihao Yu
Xinchao Wang
VLM
142
0
0
17 May 2025
FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition
Shuai Yuan
Guowen Xu
Hongwei Li
Rui Zhang
Xinyuan Qian
Wenbo Jiang
Hangcheng Cao
Qingchuan Zhao
AAML
122
0
0
17 May 2025
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation
Zihan Wang
Seungjun Lee
Gim Hee Lee
VGen
132
0
0
16 May 2025
Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models
Congcong Zhu
Xiaoyan Xu
Jiayue Han
Jingrun Chen
OOD, AI4CE
153
0
0
16 May 2025
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Yifu Yuan
Haiqin Cui
Yibin Chen
Zibin Dong
Fei Ni
Longxin Kou
Jinyi Liu
Pengyi Li
Yan Zheng
Jianye Hao
153
0
0
13 May 2025
VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
T. Vuong
J. T. Kwak
VGen
99
0
0
07 May 2025
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Shuhang Xun
Sicheng Tao
Jiajun Li
Yibo Shi
Zhixin Lin
...
Shikang Wang
Yang Liu
Hao Zhang
Ying Ma
Xuming Hu
VLM, LRM
100
1
0
04 May 2025
Visual and Textual Prompts in VLLMs for Enhancing Emotion Recognition
Zhifeng Wang
Qixuan Zhang
Peter Zhang
Wenjia Niu
Kaihao Zhang
Ramesh Sankaranarayana
Sabrina Caldwell
Tom Gedeon
92
0
0
24 Apr 2025
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Xinyu Chen
Yunxin Li
Haoyuan Shi
Baotian Hu
Wenhan Luo
Yaowei Wang
Hao Fei
ELM
116
0
0
23 Apr 2025
FaceInsight: A Multimodal Large Language Model for Face Perception
Jingzhi Li
Changjiang Luo
Ruoyu Chen
Hua Zhang
Wenqi Ren
Jianhou Gan
Xiaochun Cao
CVBM, LRM
138
0
0
22 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
124
6
0
20 Apr 2025
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Yikun Ji
Y. Hong
Jiahui Zhan
H. Chen
Jun Lan
Huijia Zhu
Weiqiang Wang
Lefei Zhang
Jianfu Zhang
MLLM, LRM
115
0
0
19 Apr 2025
Aligning Anime Video Generation with Human Feedback
Bingwen Zhu
Yudong Jiang
Baohan Xu
Siqian Yang
Mingyu Yin
Yidi Wu
Huyang Sun
Zuxuan Wu
EGVM, VGen
125
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLM, VLM
221
132
1
14 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
Kai Zhang
Jinahua Han
Lanqing Hong
Hang Xu
Xuelong Li
MLLM, VLM
502
0
0
08 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
209
3
0
03 Apr 2025
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng
Kaixiong Gong
Yangqiu Song
Zonghao Guo
Yibing Wang
Tianshuo Peng
Jian Wu
Xiaoying Zhang
Benyou Wang
Xiangyu Yue
AI4TS, SyDa, LRM
173
62
0
27 Mar 2025
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu
Yuyao Sun
Zilu Zhang
Leping Huang
Jianliang Zeng
Mao Shu
Huo Cao
140
4
0
27 Mar 2025
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Yichao Yan
VLM
242
1
0
26 Mar 2025
ACVUBench: Audio-Centric Video Understanding Benchmark
Yue Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Yongqian Li
P. Li
Yifan Jiang
W. Li
Zejun Ma
Chao Zhang
AuLLM, CoGe
116
0
0
25 Mar 2025
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Kexian Tang
Junyao Gao
Yanhong Zeng
Haodong Duan
Yanan Sun
Zhening Xing
Wenran Liu
Kaifeng Lyu
Kai-xiang Chen
ELM, LRM
146
9
0
25 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
213
1
0
24 Mar 2025