arXiv: 2404.12353
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
18 April 2024
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
Papers citing
"V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning"
19 / 19 papers shown
Title
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec
Zhengyuan Liu
Nicholas Asher
Philippe Muller
Nancy F. Chen
VGen
31
0
0
10 May 2025
HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu
Irfan Essa
75
0
0
25 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
37
0
0
07 Apr 2025
Do Language Models Understand Time?
Xi Ding
Lei Wang
178
0
0
18 Dec 2024
Progress-Aware Video Frame Captioning
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
100
1
0
03 Dec 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
Xiaoying Tang
39
14
0
08 Oct 2024
Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
Liu He
Yizhi Song
Hejun Huang
Pinxin Liu
Yunlong Tang
Daniel G. Aliaga
Xin Zhou
DiffM
VGen
90
3
0
19 Aug 2024
VTimeLLM: Empower LLM to Grasp Video Moments
Bin Huang
Xin Wang
Hong Chen
Zihan Song
Wenwu Zhu
MLLM
89
113
0
30 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLM
MLLM
194
591
0
16 Nov 2023
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning
Yunlong Tang
Jinrui Zhang
Xiangchen Wang
Teng Wang
Feng Zheng
VLM
66
9
0
17 Jun 2023
DiffuSum: Generation Enhanced Extractive Summarization with Diffusion
Haopeng Zhang
Xiao Liu
Jiawei Zhang
DiffM
69
40
0
02 May 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
95
54
0
27 Apr 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
270
4,244
0
30 Jan 2023
Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward
Yunlong Tang
Siting Xu
Teng Wang
Qin Lin
Qinglin Lu
Feng Zheng
VOS
64
10
0
25 Sep 2022
Exploiting Context Information for Generic Event Boundary Captioning
Jinrui Zhang
Teng Wang
Feng Zheng
Ran Cheng
Ping Luo
98
5
0
03 Jul 2022
Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning
Jing Bi
Jiebo Luo
Chenliang Xu
76
48
0
05 Oct 2021
GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
Jia-Hong Huang
L. Murn
M. Mrak
M. Worring
ViT
94
37
0
26 Apr 2021
Generic Event Boundary Detection: A Benchmark for Event Segmentation
Mike Zheng Shou
Stan Weixian Lei
Weiyao Wang
Deepti Ghadiyaram
Matt Feiszli
VOS
85
76
0
26 Jan 2021
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
240
4,469
0
23 Jan 2020