Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

9 November 2023

A. Piergiovanni

Michael S. Ryoo

Papers citing "Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities"

15 / 15 papers shown

| Title | Authors | Tags | Counts | Date |
| --- | --- | --- | --- | --- |
| CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition | Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi | | 257 0 0 | 30 Mar 2025 |
| HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding | Shehreen Azad, Vibhav Vineet, Yogesh S Rawat | VLM | 500 3 0 | 11 Mar 2025 |
| ReWind: Understanding Long Videos with Instructed Learnable Memory | Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, Ioannis Patras | KELM, VLM | 158 1 0 | 23 Nov 2024 |
| Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective | Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma | LRM | 162 3 0 | 29 Oct 2024 |
| xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | Michael S Ryoo, Honglu Zhou, Shrikant B. Kendre, Can Qin, Le Xue, ..., Kanchana Ranasinghe, Caiming Xiong, Ran Xu, Caiming Xiong, Juan Carlos Niebles | VGen | 104 15 0 | 21 Oct 2024 |
| xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations | Can Qin, Congying Xia, Krithika Ramakrishnan, Michael S Ryoo, Lifu Tu, ..., Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong | VGen, DiffM | 145 3 0 | 22 Aug 2024 |
| Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jiawei Wang, Liping Yuan, Yuchen Zhang | | 115 67 0 | 30 Jun 2024 |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang | | 97 35 0 | 22 Jun 2024 |
| Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | Vicky Zayats, Peter Chen, Melissa Ferrari, Dirk Padfield | AI4CE | 79 1 0 | 29 May 2024 |
| Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions | Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha | VLM | 124 33 0 | 20 Feb 2024 |
| VideoPrism: A Foundational Visual Encoder for Video Understanding | Long Zhao, N. B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, ..., Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong | VGen | 131 36 0 | 20 Feb 2024 |
| Memory Consolidation Enables Long-Context Video Understanding | Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff | | 195 27 0 | 08 Feb 2024 |
| CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | Shoubin Yu, Jaehong Yoon, Mohit Bansal | | 183 7 0 | 08 Feb 2024 |
| Knowledge Translation: A New Pathway for Model Compression | Wujie Sun, Defang Chen, Jiawei Chen, Yan Feng, Chun-Yen Chen, Can Wang | | 67 0 0 | 11 Jan 2024 |
| Epic-Sounds: A Large-scale Dataset of Actions That Sound | Jaesung Huh, Jacob Chalk, Evangelos Kazakos, Dima Damen, Andrew Zisserman | EgoV | 100 43 0 | 01 Feb 2023 |