Self-Supervised MultiModal Versatile Networks

29 June 2020

Jean-Baptiste Alayrac

Papers citing "Self-Supervised MultiModal Versatile Networks"

32 / 32 papers shown

Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining Weizhen He Yunfeng Yan Shixiang Tang Yiheng Deng Yangyang Zhong Pengxin Luo Donglian Qi VLM 125 1 0 29 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition Jongseo Lee Joohyun Chang Dongho Lee Jinwoo Choi 117 0 0 30 Mar 2025
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction Huang Huang Fangchen Liu Letian Fu Tingfan Wu Mustafa Mukadam Jitendra Malik Ken Goldberg Pieter Abbeel LM&Ro VLM 96 8 0 05 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention Joe Dhanith Shravan Venkatraman Modigari Narendra Vigya Sharma Santhosh Malarvannan 108 0 0 20 Feb 2025
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field Zijiang Yang Zhongwei Qiu Chang Xu Dongmei Fu 64 2 0 28 Jan 2025
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements M. Arda Aydın Efe Mert Çırpar Elvin Abdinli Gözde B. Ünal Y. Sahin VLM 141 1 0 18 Nov 2024
What to align in multimodal contrastive learning? Benoit Dufumier J. Castillo-Navarro D. Tuia Jean-Philippe Thiran 55 4 0 11 Sep 2024
Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification Weizhen He Yiheng Deng Yunfeng Yan Feng Zhu Yizhou Wang Lei Bai Qingsong Xie Donglian Qi Wanli Ouyang Shixiang Tang 118 2 0 28 May 2024
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures Kun Yuan V. Srivastav Tong Yu Joël L. Lavanchy Pietro Mascagni Nicolas Padoy 57 22 0 27 Jul 2023
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions Weizhen He Yiheng Deng Shixiang Tang Qihao Chen Qingsong Xie ... Feng Zhu Rui Zhao Wanli Ouyang Donglian Qi Yunfeng Yan 89 19 0 13 Jun 2023
A vector quantized masked autoencoder for audiovisual speech emotion recognition Samir Sadok Simon Leglaive Renaud Séguier SSL 94 6 0 05 May 2023
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos Andrew Rouditchenko Angie Boggust David Harwath Brian Chen D. Joshi ... Rogerio Feris Brian Kingsbury M. Picheny Antonio Torralba James R. Glass SSL 37 142 0 16 Jun 2020
What Makes for Good Views for Contrastive Learning? Yonglong Tian Chen Sun Ben Poole Dilip Krishnan Cordelia Schmid Phillip Isola SSL 56 1,313 0 20 May 2020
Audio-Visual Instance Discrimination with Cross-Modal Agreement Pedro Morgado Nuno Vasconcelos Ishan Misra SSL 42 271 0 27 Apr 2020
Evolving Losses for Unsupervised Video Representation Learning A. Piergiovanni A. Angelova Michael S. Ryoo SSL 36 139 0 26 Feb 2020
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition Qiuqiang Kong Yin Cao Turab Iqbal Yuxuan Wang Wenwu Wang Mark D. Plumbley VLM SSL 84 1,068 0 21 Dec 2019
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Guohao Li Du Tran SSL 51 429 0 28 Nov 2019
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision A. Jansen D. Ellis Shawn Hershey R. C. Moore Manoj Plakal Ashok Popat Rif A. Saurous SSL 18 26 0 14 Nov 2019
A Short Note on the Kinetics-700 Human Action Dataset João Carreira Eric Noland Chloe Hillier Andrew Zisserman 35 446 0 15 Jul 2019
Contrastive Multiview Coding Yonglong Tian Dilip Krishnan Phillip Isola SSL 118 2,385 0 13 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Antoine Miech Dimitri Zhukov Jean-Baptiste Alayrac Makarand Tapaswi Ivan Laptev Josef Sivic VGen 83 1,186 0 07 Jun 2019
Learning Representations by Maximizing Mutual Information Across Views Philip Bachman R. Devon Hjelm William Buchwalter SSL 140 1,463 0 03 Jun 2019
Revisiting Self-Supervised Visual Representation Learning Alexander Kolesnikov Xiaohua Zhai Lucas Beyer SSL 110 716 0 25 Jan 2019
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization Bruno Korbar Du Tran Lorenzo Torresani 65 473 0 30 Jun 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 59 529 0 18 Dec 2017
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification Saining Xie Chen Sun Jonathan Huang Zhuowen Tu Kevin Patrick Murphy 3DH 121 1,317 0 13 Dec 2017
Sampling Matters in Deep Embedding Learning Chao-Yuan Wu R. Manmatha Alex Smola Philipp Krähenbühl 77 921 0 23 Jun 2017
One Model To Learn Them All Lukasz Kaiser Aidan Gomez Noam M. Shazeer Ashish Vaswani Niki Parmar Llion Jones Jakob Uszkoreit VLM ViT 45 333 0 16 Jun 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset João Carreira Andrew Zisserman 178 7,961 0 22 May 2017
SGDR: Stochastic Gradient Descent with Warm Restarts I. Loshchilov Frank Hutter ODL 197 8,030 0 13 Aug 2016
Unsupervised Learning from Narrated Instruction Videos Jean-Baptiste Alayrac Piotr Bojanowski Nishant Agrawal Josef Sivic Ivan Laptev Simon Lacoste-Julien SSL 52 289 0 30 Jun 2015
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Bryan A. Plummer Liwei Wang Christopher M. Cervantes Juan C. Caicedo Julia Hockenmaier Svetlana Lazebnik 144 2,033 0 19 May 2015