2311.17043
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
28 November 2023
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
Papers citing
"LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models"
50 / 205 papers shown
Title
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Zehua Wang
Yang Liu
Ziwei Sun
Yansen Wang
VLM
237
0
0
13 Mar 2025
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Liqiang Nie
70
1
0
13 Mar 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin Qinghong Lin
Mike Zheng Shou
VGen
234
1
0
12 Mar 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin
Zhiyu Tan
Mengping Yang
Xiaomeng Yang
Hao Li
90
0
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
207
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
231
1
0
12 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
63
0
0
12 Mar 2025
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Haoyu Zhang
Qiaohui Chu
Meng Liu
Yunxiao Wang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Yaowei Wang
Liqiang Nie
EgoV
75
0
0
12 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind People
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV
VLM
70
0
0
11 Mar 2025
RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan
Yunfan Ye
Yuanjing Luo
Qian Wan
Fang Liu
Zhiping Cai
VLM
72
1
0
11 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
75
3
0
10 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
Xiaoyu Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
72
1
0
10 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
211
0
0
08 Mar 2025
Is Your Video Language Model a Reliable Judge?
M. Liu
Wensheng Zhang
67
2
0
07 Mar 2025
SHAPE : Self-Improved Visual Preference Alignment by Iteratively Generating Holistic Winner
Kejia Chen
Jiawen Zhang
Jiacong Hu
Jiazhen Yang
Jian Lou
Zunlei Feng
Mingli Song
71
0
0
06 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
Hairu Wang
Can Huang
B. Li
VLM
63
2
0
06 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
71
2
0
05 Mar 2025
Advancing vision-language models in front-end development via data synthesis
Tong Ge
Yashu Liu
Jieping Ye
Tianyi Li
Chao Wang
78
0
0
03 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding
Louis Mahon
Mirella Lapata
VLM
41
2
0
03 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
76
4
0
01 Mar 2025
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation
Yunpeng Gao
C. Li
Zhongrui You
Jing Liu
Zhen Li
...
Yan Ding
Dong Wang
Zhilin Wang
Bin Zhao
Xuelong Li
52
4
0
25 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
Xinsong Zhang
Jingyu Wang
Wenbing Tao
66
4
0
17 Feb 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Shengsheng Qian
Jiahong Wu
Fan Yang
W. Dong
Changsheng Xu
47
5
0
15 Feb 2025
CoS: Chain-of-Shot Prompting for Long Video Understanding
Jian Hu
Zixu Cheng
Chenyang Si
Wei Li
Shaogang Gong
57
4
0
10 Feb 2025
Survey on AI-Generated Media Detection: From Non-MLLM to MLLM
Yueying Zou
Peipei Li
Zekun Li
Huaibo Huang
Xing Cui
Xuannan Liu
Chenghanyu Zhang
Ran He
DeLMO
132
3
0
07 Feb 2025
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren
Lingrui Xu
Long Xia
Shuaiqiang Wang
Dawei Yin
Chao Huang
VGen
VLM
79
3
0
03 Feb 2025
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
Jiaxing Zhao
Q. Yang
Yixing Peng
Detao Bai
Shimin Yao
...
Xiang Chen
Shenghao Fu
Weixuan Chen
Xihan Wei
Liefeng Bo
VGen
AuLLM
58
5
0
28 Jan 2025
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin
Emily Ruoyu Liu
Joyce Chuyi Chen
Ines Dormoy
Jinyoung Kim
Samar Khanna
Zhuo Zheng
Stefano Ermon
MLLM
VLM
60
5
0
28 Jan 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
93
24
0
21 Jan 2025
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Sitong Gong
Yunzhi Zhuge
Lu Zhang
Zheng Yang
Pingping Zhang
Huchuan Lu
46
0
0
15 Jan 2025
TimeLogic: A Temporal Logic Benchmark for Video QA
S. Swetha
Hilde Kuehne
Mubarak Shah
52
1
0
13 Jan 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
Xianrui Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming Yang
VLM
96
12
0
07 Jan 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM
VLM
79
28
0
07 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
91
12
0
06 Jan 2025
MLVU: Benchmarking Multi-task Long Video Understanding
Yueze Wang
Yan Shu
Bo Zhao
Boya Wu
Junjie Zhou
...
Xi Yang
Y. Xiong
Bo Zhang
Tiejun Huang
Zheng Liu
VLM
63
33
0
03 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
88
7
0
02 Jan 2025
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li
Yi Wang
Jiashuo Yu
Xiangyu Zeng
Yuhan Zhu
...
Yinan He
Chenting Wang
Yu Qiao
Yali Wang
L. Wang
VLM
86
26
0
31 Dec 2024
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
88
6
0
29 Dec 2024
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei
Yujie Zhong
Haoxian Tan
Yingsen Zeng
Yong Liu
Zheng Zhao
Yujiu Yang
MLLM
VLM
VOS
108
2
0
18 Dec 2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
Xinsong Zhang
K. Chen
Yu Qiao
Dahua Lin
Jiaqi Wang
KELM
86
12
0
12 Dec 2024
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Haozhao Wang
Yuxiang Nie
Yongjie Ye
Deng GuanYu
Yanjie Wang
Shuai Li
Haiyang Yu
Jinghui Lu
Can Huang
VLM
MLLM
84
1
0
12 Dec 2024
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong
Chengyao Wang
Yuqi Liu
Senqiao Yang
Longxiang Tang
...
Shaozuo Yu
Sitong Wu
Eric Lo
Shu Liu
Jiaya Jia
AuLLM
111
7
0
12 Dec 2024
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
Sai Bhargav Rongali
M. Cui
Ankit Jha
Neha Bhargava
Saurabh Prasad
Biplab Banerjee
84
0
0
12 Dec 2024
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao
Yizeng Han
Jiasheng Tang
Zechao Li
Yibing Song
Kaidi Wang
Zhangyang Wang
Yang You
91
7
0
04 Dec 2024
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Kaixiong Gong
Kaituo Feng
Yangqiu Song
Yibing Wang
Mofan Cheng
...
Jiaming Han
Benyou Wang
Yutong Bai
Zheng Yang
Xiangyu Yue
MLLM
AuLLM
VLM
91
6
0
03 Dec 2024
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Meng Cao
Haoran Tang
Haoze Zhao
Hangyu Guo
Jing Liu
Ge Zhang
Ruyang Liu
Qiang Sun
Ian Reid
Xiaodan Liang
102
2
0
02 Dec 2024
Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion
Zhuokun Chen
Jinwu Hu
Zeshuai Deng
Yufeng Wang
Bohan Zhuang
Mingkui Tan
71
0
0
02 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
106
2
0
01 Dec 2024
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye
Yukang Gan
Yixiao Ge
Xiao Zhang
Yansong Tang
101
7
0
30 Nov 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
MLLM
87
13
0
27 Nov 2024