Zorro: the masked multimodal transformer

23 January 2023

Jean-baptiste Alayrac

Papers citing "Zorro: the masked multimodal transformer"

23 / 23 papers shown

Title
Three things everyone should know about Vision Transformers Hugo Touvron Matthieu Cord Alaaeldin El-Nouby Jakob Verbeek Hervé Jégou ViT 65 121 0 18 Mar 2022
HiP: Hierarchical Perceiver João Carreira Skanda Koppula Daniel Zoran Adrià Recasens Catalin Ionescu ... M. Botvinick Oriol Vinyals Karen Simonyan Andrew Zisserman Andrew Jaegle VLM 76 14 0 22 Feb 2022
Co-training Transformer with Videos and Images Improves Action Recognition Bowen Zhang Jiahui Yu Christopher Fifty Wei Han Andrew M. Dai Ruoming Pang Fei Sha ViT 50 54 0 14 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval Nina Shvetsova Brian Chen Andrew Rouditchenko Samuel Thomas Brian Kingsbury Rogerio Feris David Harwath James R. Glass Hilde Kuehne ViT 60 130 0 08 Dec 2021
Masked-attention Mask Transformer for Universal Image Segmentation Bowen Cheng Ishan Misra Alex Schwing Alexander Kirillov Rohit Girdhar ISeg 188 2,315 0 02 Dec 2021
Swin Transformer V2: Scaling Up Capacity and Resolution Ze Liu Han Hu Yutong Lin Zhuliang Yao Zhenda Xie ... Yue Cao Zheng Zhang Li Dong Furu Wei B. Guo ViT 182 1,783 0 18 Nov 2021
Efficient Training of Audio Transformers with Patchout Khaled Koutini Jan Schluter Hamid Eghbalzadeh Gerhard Widmer ViT 102 256 0 11 Oct 2021
Perceiver IO: A General Architecture for Structured Inputs & Outputs Andrew Jaegle Sebastian Borgeaud Jean-Baptiste Alayrac Carl Doersch Catalin Ionescu ... Olivier J. Hénaff M. Botvinick Andrew Zisserman Oriol Vinyals João Carreira MLLM VLM GNN 48 572 0 30 Jul 2021
Attention Bottlenecks for Multimodal Fusion Arsha Nagrani Shan Yang Anurag Arnab A. Jansen Cordelia Schmid Chen Sun 81 553 0 30 Jun 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Hassan Akbari Liangzhe Yuan Rui Qian Wei-Hong Chuang Shih-Fu Chang Huayu Chen Boqing Gong ViT 277 581 0 22 Apr 2021
ImageNet-21K Pretraining for the Masses T. Ridnik Emanuel Ben-Baruch Asaf Noy Lihi Zelnik-Manor SSeg VLM CLIP 275 692 0 22 Apr 2021
ViViT: A Video Vision Transformer Anurag Arnab Mostafa Dehghani G. Heigold Chen Sun Mario Lucic Cordelia Schmid ViT 149 2,119 0 29 Mar 2021
Slow-Fast Auditory Streams For Audio Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen 58 66 0 05 Mar 2021
Perceiver: General Perception with Iterative Attention Andrew Jaegle Felix Gimeno Andrew Brock Andrew Zisserman Oriol Vinyals João Carreira VLM ViT MDE 138 1,003 0 04 Mar 2021
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning Sangho Lee Jiwan Chung Youngjae Yu Gunhee Kim Thomas Breuel Gal Chechik Yale Song 81 46 0 26 Jan 2021
Audio-Visual Instance Discrimination with Cross-Modal Agreement Pedro Morgado Nuno Vasconcelos Ishan Misra SSL 59 271 0 27 Apr 2020
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Guohao Li Du Tran SSL 76 429 0 28 Nov 2019
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization Bruno Korbar Du Tran Lorenzo Torresani 78 473 0 30 Jun 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 74 529 0 18 Dec 2017
Multimodal Machine Learning: A Survey and Taxonomy T. Baltrušaitis Chaitanya Ahuja Louis-Philippe Morency 66 2,890 0 26 May 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset João Carreira Andrew Zisserman 206 7,961 0 22 May 2017
Convolutional Two-Stream Network Fusion for Video Action Recognition Christoph Feichtenhofer A. Pinz Andrew Zisserman 129 2,606 0 22 Apr 2016
Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman 225 7,518 0 09 Jun 2014