ResearchTrend.AI
Video Instruction Tuning With Synthetic Data

3 October 2024
Yuanhan Zhang
Jinming Wu
Wei Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
    SyDa
    VGen

Papers citing "Video Instruction Tuning With Synthetic Data"

50 / 100 papers shown
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Kaisi Guan
Zhengfeng Lai
Yizhou Sun
Peng Zhang
Wei Liu
Kieran Liu
Meng Cao
Ruihua Song
VGen
56
0
0
21 Mar 2025
Improving LLM Video Understanding with 16 Frames Per Second
Yongqian Li
Changli Tang
Jimin Zhuang
Yudong Yang
Guangzhi Sun
W. Li
Z. Ma
Chao Zhang
VLM
83
1
0
18 Mar 2025
HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
Jiahe Zhao
Ruibing Hou
Zejie Tian
Hong Chang
Shiguang Shan
45
0
0
17 Mar 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu
Q. Yang
Yuan-Ming Li
Yi-Xing Peng
Kun-Yu Lin
Xihan Wei
Jian-Fang Hu
Xiaohua Xie
Wei-Shi Zheng
VLM
67
1
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
Binghui Wang
Xin Wu
Jiaheng Liu
65
0
0
17 Mar 2025
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
83
3
0
16 Mar 2025
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen
Guoqiang Gong
Tao He
Yifeng Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
VLM
69
0
0
14 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Wenhu Chen
Mamba
59
3
0
14 Mar 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng
Jian Hu
Ziquan Liu
Chenyang Si
Wei Li
Shaogang Gong
LRM
75
2
0
14 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Y. Chen
DiffM
VGen
53
0
0
13 Mar 2025
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Yuanxin Liu
Rui Zhu
Shuhuai Ren
Jiacong Wang
Haoyuan Guo
Xu Sun
Lu Jiang
133
1
0
13 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Zehua Wang
Yang Liu
Peng Li
Yixuan Wang
VLM
160
0
0
13 Mar 2025
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Liqiang Nie
67
1
0
13 Mar 2025
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Qiji Zhou
Yifan Gong
Guangsheng Bao
Hongjie Qiu
Jinqiang Li
Xiangrong Zhu
Huajian Zhang
Yue Zhang
LRM
44
0
0
12 Mar 2025
Generative Frame Sampler for Long Video Understanding
Linli Yao
Haoning Wu
Kun Ouyang
Yuyao Zhang
Caiming Xiong
Bei Chen
Xu Sun
Junnan Li
VLM
VGen
52
0
0
12 Mar 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin
Zhiyu Tan
Mengping Yang
Xiaomeng Yang
Hao Li
84
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
159
0
0
12 Mar 2025
ReelWave: A Multi-Agent Framework Toward Professional Movie Sound Generation
Zixuan Wang
Chi-Keung Tang
Yu-Wing Tai
DiffM
VGen
63
0
0
10 Mar 2025
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
Jianxiong Gao
Yichang Liu
Baofeng Yang
Jianfeng Feng
Yanwei Fu
VGen
63
1
0
10 Mar 2025
Video Action Differencing
James Burgess
Xiaohan Wang
Yuhui Zhang
Anita Rau
Alejandro Lozano
Lisa Dunlap
Trevor Darrell
Serena Yeung-Levy
VGen
48
0
0
10 Mar 2025
Unified Reward Model for Multimodal Understanding and Generation
Yibin Wang
Yuhang Zang
Hao Li
Cheng Jin
J. Wang
EGVM
68
4
0
07 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
Hairu Wang
Can Huang
B. Li
VLM
63
1
0
06 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Z. Liu
EgoV
51
2
0
05 Mar 2025
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang
Jingyun Hua
Weihong Lin
Yuyao Zhang
Fuzheng Zhang
Jianlong Wu
Di Zhang
Liqiang Nie
VLM
88
0
0
28 Feb 2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
85
0
0
28 Feb 2025
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan
H. Shen
Xin Wang
Junfeng Fang
Zheda Mai
M. Zhang
VLM
65
3
0
24 Feb 2025
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
Hengzhi Li
Megan Tjandrasuwita
Yi R. Fung
Armando Solar-Lezama
Paul Pu Liang
47
1
0
23 Feb 2025
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
Marian Qian
Paul Pu Liang
Louis-Philippe Morency
LRM
157
1
0
21 Feb 2025
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness
Zhihang Liu
Chen-Wei Xie
Bin Wen
Feiwu Yu
Jixuan Chen
...
Pandeng Li
Yun Zheng
Hongtao Xie
VLM
CoGe
100
0
0
19 Feb 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Guangzhi Sun
Yudong Yang
Jimin Zhuang
Changli Tang
Yongqian Li
W. Li
Z. Ma
Chao Zhang
LRM
MLLM
VLM
64
4
0
17 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Junxian Cai
Kai Ma
Xi Chen
Jiaheng Liu
49
0
0
17 Feb 2025
CoS: Chain-of-Shot Prompting for Long Video Understanding
Jian Hu
Zixu Cheng
Chenyang Si
Wei Li
Shaogang Gong
49
4
0
10 Feb 2025
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren
Lingrui Xu
Long Xia
S. Wang
Dawei Yin
Chao Huang
VGen
VLM
76
3
0
03 Feb 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
89
19
0
21 Jan 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Ziyu Liu
...
Haodong Duan
W. Zhang
Kai Chen
Dahua Lin
Jiaqi Wang
VLM
74
19
0
21 Jan 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong
Yean Cheng
Z. Yang
Weihan Wang
Lefan Wang
Xiaotao Gu
Shiyu Huang
Yuxiao Dong
J. Tang
CoGe
VLM
71
4
0
06 Jan 2025
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li
Yi Wang
Jiashuo Yu
Xiangyu Zeng
Yuhan Zhu
...
Yinan He
Chenting Wang
Yu Qiao
Yali Wang
L. Wang
VLM
77
25
0
31 Dec 2024
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
82
6
0
29 Dec 2024
VidCtx: Context-aware Video Question Answering with Image Models
Andreas Goulas
Vasileios Mezaris
Ioannis Patras
159
0
0
23 Dec 2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
Xiaotian Zhang
K. Chen
Yu Qiao
Dahua Lin
Jiaqi Wang
KELM
84
12
0
12 Dec 2024
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
Aditya Chinchure
Sahithya Ravi
R. Ng
Vered Shwartz
Boyang Albert Li
Leonid Sigal
ReLM
LRM
VLM
77
2
0
07 Dec 2024
Progress-Aware Video Frame Captioning
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
100
1
0
03 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
103
2
0
01 Dec 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
MLLM
81
13
0
27 Nov 2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Yueqian Wang
Xiaojun Meng
Y. Wang
Jianxin Liang
Jiansheng Wei
Huishuai Zhang
Dongyan Zhao
VGen
85
8
0
27 Nov 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Xiao Yang
Guilin Li
Haojia Lin
Jinfa Huang
Jiayi Ji
Rongrong Ji
Jiebo Luo
VLM
85
18
0
20 Nov 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
39
14
0
08 Oct 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
134
32
0
26 Sep 2024
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos
Dhruv Verma
Debaditya Roy
Basura Fernando
27
1
0
30 Jul 2024
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Haodong Duan
Junming Yang
Xinyu Fang
Lin Chen
...
Yuhang Zang
Pan Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LM&MA
VLM
39
115
0
16 Jul 2024