Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.04346
Cited By
Koala: Key frame-conditioned long video-LLM
5 April 2024
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Koala: Key frame-conditioned long video-LLM"
46 / 46 papers shown
Title
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
Shuhang Xun
Sicheng Tao
Jiajun Li
Yibo Shi
Zhixin Lin
...
Shikang Wang
Yang Liu
Hao Zhang
Ying Ma
Xuming Hu
VLM
LRM
66
1
0
04 May 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
Pooyan Fazli
73
0
0
18 Apr 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
73
4
0
17 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
Jing Zhang
Haoyang Li
Yiran Chen
VLM
115
2
0
13 Mar 2025
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li
Yi Wang
Jiashuo Yu
Xiangyu Zeng
Yuhan Zhu
...
Yinan He
Chenting Wang
Yu Qiao
Yali Wang
L. Wang
VLM
145
38
0
31 Dec 2024
SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang
Yujia Chen
Wen-Sheng Chu
Vishnu Boddeti
Du Tran
VLM
163
0
0
02 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
161
2
0
01 Dec 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
122
32
0
04 Oct 2024
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
90
289
0
17 Aug 2023
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo
Ziwang Zhao
Min Yang
Junwei Dong
Da Li
Pengcheng Lu
Tao Wang
Linmei Hu
Ming-Hui Qiu
MLLM
100
203
0
12 Jun 2023
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai
Junnan Li
Dongxu Li
A. M. H. Tiong
Junqi Zhao
Weisheng Wang
Boyang Albert Li
Pascale Fung
Steven C. H. Hoi
MLLM
VLM
104
2,049
0
11 May 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
275
948
0
27 Apr 2023
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting
Syed Talal Wasim
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
M. Shah
VLM
VPVLM
84
78
0
06 Apr 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
424
4,539
0
30 Jan 2023
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
115
326
0
06 Dec 2022
Scaling Instruction-Finetuned Language Models
Hyung Won Chung
Le Hou
Shayne Longpre
Barret Zoph
Yi Tay
...
Jacob Devlin
Adam Roberts
Denny Zhou
Quoc V. Le
Jason W. Wei
ReLM
LRM
181
3,117
0
20 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
61
69
0
12 Oct 2022
Expanding Language-Image Pretrained Models for General Video Recognition
Bolin Ni
Houwen Peng
Minghao Chen
Songyang Zhang
Gaofeng Meng
Jianlong Fu
Shiming Xiang
Haibin Ling
VLM
CLIP
ViT
99
325
0
04 Aug 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
135
237
0
16 Jun 2022
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
272
2,474
0
15 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
371
3,535
0
29 Apr 2022
Learning Local and Global Temporal Contexts for Video Semantic Segmentation
Guolei Sun
Yun Liu
Henghui Ding
Min Wu
Luc Van Gool
85
31
0
07 Apr 2022
Continuous Scene Representations for Embodied AI
S. Gadre
Kiana Ehsani
Shuran Song
Roozbeh Mottaghi
84
47
0
31 Mar 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
75
160
0
28 Mar 2022
Prompting Visual-Language Models for Efficient Video Understanding
Chen Ju
Tengda Han
Kunhao Zheng
Ya Zhang
Weidi Xie
VPVLM
VLM
79
376
0
08 Dec 2021
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
Fan Hu
Aozhu Chen
Ziyu Wang
Fangming Zhou
Jianfeng Dong
Xirong Li
53
31
0
03 Dec 2021
DexMV: Imitation Learning for Dexterous Manipulation from Human Videos
Yuzhe Qin
Yueh-hua Wu
Shaowei Liu
Hanwen Jiang
Ruihan Yang
Yang Fu
Xiaolong Wang
166
197
0
12 Aug 2021
Multimodal Few-Shot Learning with Frozen Language Models
Maria Tsimpoukelli
Jacob Menick
Serkan Cabi
S. M. Ali Eslami
Oriol Vinyals
Felix Hill
MLLM
159
778
0
25 Jun 2021
Towards Long-Form Video Understanding
Chaoxia Wu
Philipp Krahenbuhl
VLM
ViT
111
169
0
21 Jun 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
404
10,301
0
17 Jun 2021
Anticipative Video Transformer
Rohit Girdhar
Kristen Grauman
ViT
58
211
0
03 Jun 2021
A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
Christoph Feichtenhofer
Haoqi Fan
Bo Xiong
Ross B. Girshick
Kaiming He
SSL
AI4TS
90
262
0
29 Apr 2021
Multiscale Vision Transformers
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
127
1,258
0
22 Apr 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain
Arsha Nagrani
Gül Varol
Andrew Zisserman
VGen
136
1,173
0
01 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
213
2,149
0
29 Mar 2021
Perceiver: General Perception with Iterative Attention
Andrew Jaegle
Felix Gimeno
Andrew Brock
Andrew Zisserman
Oriol Vinyals
João Carreira
VLM
ViT
MDE
185
1,014
0
04 Mar 2021
Coarse-Fine Networks for Temporal Activity Detection in Videos
Kumara Kahatapitiya
Michael S. Ryoo
AI4TS
77
39
0
01 Mar 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
365
2,048
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
261
432
0
01 Feb 2021
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He
Xiaodong Liu
Jianfeng Gao
Weizhu Chen
AAML
137
2,731
0
05 Jun 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
106
503
0
01 May 2020
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
128
1,019
0
09 Apr 2020
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
153
1,663
0
22 Aug 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
108
1,199
0
07 Jun 2019
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
164
3,273
0
10 Dec 2018
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
229
8,015
0
22 May 2017
1