VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset


Papers citing "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset"

(29 of 79 citing papers shown)
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang
22 Jun 2024
