Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

8 January 2024

Papers citing "Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification"

7 / 7 papers shown

Title
Omnivore: A Single Model for Many Visual Modalities Rohit Girdhar Mannat Singh Nikhil Ravi L. V. D. van der Maaten Armand Joulin Ishan Misra 226 226 0 20 Jan 2022
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding Hu Xu Gargi Ghosh Po-Yao (Bernie) Huang Dmytro Okhonko Armen Aghajanyan Florian Metze Luke Zettlemoyer Florian Metze Luke Zettlemoyer Christoph Feichtenhofer CLIP VLM 259 560 0 28 Sep 2021
The Right to Talk: An Audio-Visual Transformer Approach Thanh-Dat Truong C. Duong T. D. Vu H. Pham Bhiksha Raj Ngan Le Khoa Luu 63 36 0 06 Aug 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Hassan Akbari Liangzhe Yuan Rui Qian Wei-Hong Chuang Shih-Fu Chang Huayu Chen Boqing Gong ViT 251 577 0 22 Apr 2021
Is Space-Time Attention All You Need for Video Understanding? Gedas Bertasius Heng Wang Lorenzo Torresani ViT 283 1,984 0 09 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers Lisa Anne Hendricks John F. J. Mellor R. Schneider Jean-Baptiste Alayrac Aida Nematzadeh 79 110 0 31 Jan 2021
Audiovisual SlowFast Networks for Video Recognition Fanyi Xiao Yong Jae Lee Kristen Grauman Jitendra Malik Christoph Feichtenhofer 197 207 0 23 Jan 2020