Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

22 April 2021

Yanbei Chen

Yongqin Xian

A. Sophia Koepke

Ying Shan

Zeynep Akata


Papers citing "Distilling Audio-Visual Knowledge by Compositional Contrastive Learning"


Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment Edson Araujo Andrew Rouditchenko Yuan Gong Saurabhchand Bhati Samuel Thomas Brian Kingsbury Leonid Karlinsky Rogerio Feris James Glass Hilde Kuehne 106 0 0 02 May 2025
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning Rajath Rao Adithya Ganesan Oscar Kjell Jonah Luby Akshay Raghavan ... B. Luft Camilo Ruggero Neville Ryant R. Kotov H. Andrew Schwartz 105 0 0 15 Jan 2025
Contrastive Learning for Unpaired Image-to-Image Translation Taesung Park Alexei A. Efros Richard Y. Zhang Jun-Yan Zhu SSL 89 1,235 0 30 Jul 2020
Multi-modal Transformer for Video Retrieval Valentin Gabeur Chen Sun Karteek Alahari Cordelia Schmid ViT 542 610 0 21 Jul 2020
Heterogeneous Knowledge Distillation using Information Flow Modeling Nikolaos Passalis Maria Tzelepi Anastasios Tefas 75 139 0 02 May 2020
VGGSound: A Large-scale Audio-Visual Dataset Honglie Chen Weidi Xie Andrea Vedaldi Andrew Zisserman 92 583 0 29 Apr 2020
Improved Baselines with Momentum Contrastive Learning Xinlei Chen Haoqi Fan Ross B. Girshick Kaiming He SSL 508 3,449 0 09 Mar 2020
Learning Robust Representations via Multi-View Information Bottleneck Marco Federici Anjan Dutta Patrick Forré Nate Kushman Zeynep Akata SLR 67 258 0 17 Feb 2020
A Simple Framework for Contrastive Learning of Visual Representations Ting Chen Simon Kornblith Mohammad Norouzi Geoffrey E. Hinton SSL 393 18,897 0 13 Feb 2020
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition Qiuqiang Kong Yin Cao Turab Iqbal Yuxuan Wang Wenwu Wang Mark D. Plumbley VLM SSL 199 1,084 0 21 Dec 2019
ASR is all you need: cross-modal distillation for lip reading Triantafyllos Afouras Joon Son Chung Andrew Zisserman 58 135 0 28 Nov 2019
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Guohao Li Du Tran SSL 122 432 0 28 Nov 2019
Vision-Infused Deep Audio Inpainting Hang Zhou Ziwei Liu Lingfeng Guo Ping Luo Dahua Lin 142 88 0 24 Oct 2019
Contrastive Representation Distillation Yonglong Tian Dilip Krishnan Phillip Isola 176 1,054 0 23 Oct 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts Yang Liu Samuel Albanie Arsha Nagrani Andrew Zisserman 84 389 0 31 Jul 2019
Contrastive Multiview Coding Yonglong Tian Dilip Krishnan Phillip Isola SSL 182 2,412 0 13 Jun 2019
What Makes Training Multi-Modal Classification Networks Hard? Weiyao Wang Du Tran Matt Feiszli 154 453 0 29 May 2019
Temporal Cycle-Consistency Learning Debidatta Dwibedi Y. Aytar Jonathan Tompson P. Sermanet Andrew Zisserman SSL AI4TS 92 276 0 16 Apr 2019
Co-Separating Sounds of Visual Objects Ruohan Gao Kristen Grauman 131 210 0 16 Apr 2019
Relational Knowledge Distillation Wonpyo Park Dongju Kim Yan Lu Minsu Cho 89 1,428 0 10 Apr 2019
Correlation Congruence for Knowledge Distillation Baoyun Peng Xiao Jin Jiaheng Liu Shunfeng Zhou Yichao Wu Yu Liu Dongsheng Li Zhaoning Zhang 94 513 0 03 Apr 2019
Learning Correspondence from the Cycle-Consistency of Time Xiaolong Wang Allan Jabri Alexei A. Efros SSL 91 491 0 18 Mar 2019
DistInit: Learning Video Representations Without a Single Labeled Video Rohit Girdhar Du Tran Lorenzo Torresani Deva Ramanan 48 54 0 26 Jan 2019
Composing Text and Image for Image Retrieval - An Empirical Odyssey Nam S. Vo Lu Jiang Chen Sun Kevin Patrick Murphy Li Li Li Fei-Fei James Hays CoGe 68 368 0 18 Dec 2018
SlowFast Networks for Video Recognition Christoph Feichtenhofer Haoqi Fan Jitendra Malik Kaiming He 169 3,286 0 10 Dec 2018
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles Dahun Kim Donghyeon Cho In So Kweon SSL 91 349 0 24 Nov 2018
Deep Audio-Visual Speech Recognition Triantafyllos Afouras Joon Son Chung A. Senior Oriol Vinyals Andrew Zisserman 98 710 0 06 Sep 2018
Learning deep representations by mutual information estimation and maximization R. Devon Hjelm A. Fedorov Samuel Lavoie-Marchildon Karan Grewal Phil Bachman Adam Trischler Yoshua Bengio SSL DRL 352 2,675 0 20 Aug 2018
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild Samuel Albanie Arsha Nagrani Andrea Vedaldi Andrew Zisserman CVBM 75 271 0 16 Aug 2018
Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning U. Büchler Biagio Brattoli Bjorn Ommer OOD SSL 83 114 0 30 Jul 2018
X2Face: A network for controlling face generation by using images, audio, and pose codes Olivia Wiles A. Sophia Koepke Andrew Zisserman CVBM 91 415 0 27 Jul 2018
Representation Learning with Contrastive Predictive Coding Aaron van den Oord Yazhe Li Oriol Vinyals DRL SSL 356 10,369 0 10 Jul 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Andrew Owens Alexei A. Efros SSL 100 754 0 10 Apr 2018
Learning Deep Representations with Probabilistic Knowledge Transfer Nikolaos Passalis Anastasios Tefas 66 413 0 28 Mar 2018
Audio-Visual Event Localization in Unconstrained Videos Yapeng Tian Jing Shi Bochen Li Zhiyao Duan Chenliang Xu 109 439 0 23 Mar 2018
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning Andrew Owens Jiajun Wu Josh H. McDermott William T. Freeman Antonio Torralba SSL 71 176 0 20 Dec 2017
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 116 530 0 18 Dec 2017
A Closer Look at Spatiotemporal Convolutions for Action Recognition Du Tran Heng Wang Lorenzo Torresani Jamie Ray Yann LeCun Manohar Paluri 240 3,033 0 30 Nov 2017
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks Zhaofan Qiu Ting Yao Tao Mei 102 1,663 0 28 Nov 2017
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Kensho Hara Hirokatsu Kataoka Y. Satoh 3DPC 133 1,935 0 27 Nov 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset João Carreira Andrew Zisserman 240 8,045 0 22 May 2017
You said that? Joon Son Chung A. Jamaludin Andrew Zisserman CVBM 74 260 0 08 May 2017
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer Sergey Zagoruyko N. Komodakis 147 2,590 0 12 Dec 2016
SoundNet: Learning Sound Representations from Unlabeled Video Y. Aytar Carl Vondrick Antonio Torralba SSL 135 1,044 0 27 Oct 2016
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao Dahua Lin Xiaoou Tang Luc Van Gool ViT 120 3,841 0 02 Aug 2016
Deep Residual Learning for Image Recognition Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun MedIm 2.3K 194,641 0 10 Dec 2015
Cross Modal Distillation for Supervision Transfer Saurabh Gupta Judy Hoffman Jitendra Malik 127 538 0 02 Jul 2015
Distilling the Knowledge in a Neural Network Geoffrey E. Hinton Oriol Vinyals J. Dean FedML 367 19,745 0 09 Mar 2015
FitNets: Hints for Thin Deep Nets Adriana Romero Nicolas Ballas Samira Ebrahimi Kahou Antoine Chassang C. Gatta Yoshua Bengio FedML 328 3,906 0 19 Dec 2014
Two-Stream Convolutional Networks for Action Recognition in Videos Karen Simonyan Andrew Zisserman 261 7,545 0 09 Jun 2014