Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.02858
Cited By
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"
50 / 703 papers shown
Title
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty
Erfan Shayegani
Zikui Cai
Nael B. Abu-Ghazaleh
Ulugbek S. Kamilov
Yue Dong
Amit K. Roy-Chowdhury
Chengyu Song
41
16
0
27 May 2024
Matryoshka Multimodal Models
Mu Cai
Jianwei Yang
Jianfeng Gao
Yong Jae Lee
VLM
60
28
0
27 May 2024
Hawk: Learning to Understand Open-World Video Anomalies
Jiaqi Tang
Hao Lu
Ruizheng Wu
Xiaogang Xu
Ke Ma
Cheng Fang
Bin Guo
Jiangbo Lu
Qifeng Chen
Ying-Cong Chen
VLM
45
9
0
27 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
60
37
0
26 May 2024
C3LLM: Conditional Multimodal Content Generation Using Large Language Models
Zixuan Wang
Qinkai Duan
Yu-Wing Tai
Chi-Keung Tang
38
3
0
25 May 2024
Streaming Long Video Understanding with Large Language Models
Rui Qian
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Shuangrui Ding
Dahua Lin
Jiaqi Wang
VLM
44
41
0
25 May 2024
A Misleading Gallery of Fluid Motion by Generative Artificial Intelligence
Ali Kashefi
VGen
58
5
0
24 May 2024
Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
Jianbiao Mei
Yukai Ma
Xuemeng Yang
Licheng Wen
Xinyu Cai
...
Min Dou
Botian Shi
Liang He
Yong-Jin Liu
Yu Qiao
55
10
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
82
45
0
23 May 2024
Dense Connector for MLLMs
Huanjin Yao
Wenhao Wu
Taojiannan Yang
Yuxin Song
Mengxi Zhang
Haocheng Feng
Yifan Sun
Zhiheng Li
Wanli Ouyang
Jingdong Wang
MLLM
VLM
42
18
0
22 May 2024
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models
Guangzhi Sun
Potsawee Manakul
Adian Liusie
Kunat Pipatanakul
Chao Zhang
P. Woodland
Mark Gales
HILM
MLLM
27
8
0
22 May 2024
An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
Zhiyu Tan
Mengping Yang
Luozheng Qin
Hao Yang
Ye Qian
Qiang-feng Zhou
Cheng Zhang
Hao Li
69
3
0
21 May 2024
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
Zhiyuan Liu
An Zhang
Hao Fei
Enzhi Zhang
Xiang Wang
Kenji Kawaguchi
Tat-Seng Chua
60
18
0
21 May 2024
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Zhenwei Shao
Zhou Yu
Jun Yu
Xuecheng Ouyang
Lihao Zheng
Zhenbiao Gai
Mingyang Wang
Jiajun Ding
23
10
0
20 May 2024
SemEval-2024 Task 3: Multimodal Emotion Cause Analysis in Conversations
Fanfan Wang
Heqing Ma
Jianfei Yu
Rui Xia
Erik Cambria
45
22
0
19 May 2024
Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion
Zeyu Zhang
Yiran Wang
Biao Wu
Shuo Chen
Zhiyuan Zhang
Shiya Huang
Wenbo Zhang
Meng Fang
Ling-Hao Chen
Yang Zhao
VGen
53
6
0
18 May 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min-Ling Zhang
MoE
46
30
0
18 May 2024
Efficient Multimodal Large Language Models: A Survey
Yizhang Jin
Jian Li
Yexin Liu
Tianjun Gu
Kai Wu
...
Xin Tan
Zhenye Gan
Yabiao Wang
Chengjie Wang
Lizhuang Ma
LRM
52
47
0
17 May 2024
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
Yuchen Hu
Chen Chen
Chengwei Qin
Qiushi Zhu
Eng Siong Chng
Ruizhe Li
AuLLM
KELM
56
5
0
16 May 2024
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
Andong Wang
Bo Wu
Sunli Chen
Zhenfang Chen
Haotian Guan
Wei-Ning Lee
Li Erran Li
Chuang Gan
LRM
RALM
37
16
0
15 May 2024
FreeVA: Offline MLLM as Training-Free Video Assistant
Wenhao Wu
VLM
OffRL
40
20
0
13 May 2024
Sakuga-42M Dataset: Scaling Up Cartoon Research
Zhenglin Pan
Yu Zhu
Yuxuan Mu
43
6
0
13 May 2024
A Survey of Large Language Models for Graphs
Xubin Ren
Jiabin Tang
Dawei Yin
Nitesh Chawla
Chao Huang
30
34
0
10 May 2024
Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation
Ryan Wong
Necati Cihan Camgöz
Richard Bowden
SLR
59
22
0
07 May 2024
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Muhammad Uzair Khattak
Muhammad Ferjad Naeem
Jameel Hassan
Muzammal Naseer
Federico Tombari
Fahad Shahbaz Khan
Salman Khan
LRM
ELM
47
10
0
06 May 2024
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
Yuanhan Zhang
Kaichen Zhang
Bo Li
Fanyi Pu
Christopher Arif Setiadharma
Jingkang Yang
Ziwei Liu
VGen
52
8
0
06 May 2024
Octopi: Object Property Reasoning with Large Tactile-Language Models
Samson Yu
Kelvin Lin
Anxing Xiao
Jiafei Duan
Harold Soh
LRM
41
27
0
05 May 2024
MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang
Xuan He
Huaye Zeng
Cong Wei
Max W.F. Ku
Qian Liu
Wenhu Chen
VLM
MLLM
33
104
0
02 May 2024
EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model
Deng Li
Xin Liu
Bohao Xing
Baiqiang Xia
Yuan Zong
Bihan Wen
Heikki Kälviäinen
44
3
0
01 May 2024
MileBench: Benchmarking MLLMs in Long Context
Dingjie Song
Shunian Chen
Guiming Hardy Chen
Fei Yu
Xiang Wan
Benyou Wang
VLM
82
35
0
29 Apr 2024
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Enxin Song
Wenhao Chai
Tianbo Ye
Lei Li
Xi Li
Gaoang Wang
VLM
MLLM
39
30
0
26 Apr 2024
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition
Zheng Lian
Haiyang Sun
Guoying Zhao
Zhuofan Wen
Siyuan Zhang
...
Bin Liu
Min Zhang
Guoying Zhao
Björn W. Schuller
Jianhua Tao
VLM
46
11
0
26 Apr 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM
VLM
38
162
0
25 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
73
11
0
25 Apr 2024
Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations
Shen Zhang
Haojie Zhang
Jing Zhang
Xudong Zhang
Yimeng Zhuang
Jinting Wu
52
2
0
25 Apr 2024
Fake Artificial Intelligence Generated Contents (FAIGC): A Survey of Theories, Detection Methods, and Opportunities
Xiaomin Yu
Yezhaohui Wang
Yanfang Chen
Zhen Tao
Dinghao Xi
Shichao Song
Pengnian Qi
Zhiyu Li
74
8
0
25 Apr 2024
Step Differences in Instructional Video
Tushar Nagarajan
Lorenzo Torresani
VGen
37
5
0
24 Apr 2024
Pegasus-v1 Technical Report
Raehyuk Jung
Hyojun Go
Jaehyuk Yi
Jiho Jang
Daniel Kim
...
Maninder Saini
Meredith Sanders
Soyoung Lee
Sue Kim
Travis Couture
MLLM
VLM
29
5
0
23 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
54
20
0
22 Apr 2024
TAVGBench: Benchmarking Text to Audible-Video Generation
Yuxin Mao
Xuyang Shen
Jing Zhang
Zhen Qin
Jinxing Zhou
Mochu Xiang
Yiran Zhong
Yuchao Dai
48
11
0
22 Apr 2024
Graphic Design with Large Multimodal Model
Yutao Cheng
Zhao Zhang
Maoke Yang
Hui Nie
Chunyuan Li
Xinglong Wu
Jie Shao
59
10
0
22 Apr 2024
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
68
25
0
18 Apr 2024
AccidentBlip: Agent of Accident Warning based on MA-former
Yihua Shao
Hongyi Cai
Xinwei Long
Weiyi Lang
Ziyang Yan
Haoran Wu
Yan Wang
Jiayi Yin
Yang Yang
Yisheng Lv
45
2
0
18 Apr 2024
From Image to Video, what do we need in multimodal LLMs?
Suyuan Huang
Haoxin Zhang
Yan Gao
Honggu Chen
Yan Gao
Yao Hu
Zengchang Qin
VLM
47
8
0
18 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
50
4
0
18 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
54
26
0
11 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
45
24
0
10 Apr 2024
Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness
Xincan Feng
A. Yoshimoto
46
2
0
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
49
21
0
09 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
83
89
0
08 Apr 2024
Previous
1
2
3
...
9
10
11
...
13
14
15
Next