ResearchTrend.AI
  • Papers
  • Communities
  • Organizations
  • Events
  • Blog
  • Pricing
  • Feedback
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1807.00230
  4. Cited By
Cooperative Learning of Audio and Video Models from Self-Supervised
  Synchronization
v1v2 (latest)

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

30 June 2018
Bruno Korbar
Du Tran
Lorenzo Torresani
ArXiv (abs)PDFHTML

Papers citing "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization"

50 / 322 papers shown
Title
Learning from Silence and Noise for Visual Sound Source Localization
Learning from Silence and Noise for Visual Sound Source Localization
Xavier Juanola
G. Morais
Magdalena Fuentes
Gloria Haro
SSL
56
0
0
29 Aug 2025
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
Hugo Bohy
M. Tran
Kevin El Haddad
Thierry Dutoit
M. Soleymani
8
2
0
24 Aug 2025
VGGSounder: Audio-Visual Evaluations for Foundation Models
VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev
Thaddäus Wiedemer
Ameya Prabhu
Matthias Bethge
Wieland Brendel
A. Sophia Koepke
AuLLM
28
0
0
11 Aug 2025
Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm
Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm
Lin Zhang
Zefan Cai
Yufan Zhou
Shentong Mo
Jinhong Lin
...
Ruiyi Zhang
Wen Xiao
Tong Sun
Junjie Hu
Pedro Morgado
VGen
59
0
0
05 Aug 2025
From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
Jia Li
Yapeng Tian
VOS
58
0
0
29 Jul 2025
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz
M. Mühlhäuser
Alejandro Sánchez Guinea
144
0
0
16 Jun 2025
Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
Theodore Barfoot
Luis C. Garcia-Peraza-Herrera
Samet Akcay
Ben Glocker
Tom Vercauteren
UQCV
232
0
0
04 Jun 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
212
0
0
02 May 2025
Evolutionary algorithms meet self-supervised learning: a comprehensive survey
Evolutionary algorithms meet self-supervised learning: a comprehensive survey
Adriano Vinhas
João Correia
Penousal Machado
SSLSyDa
158
0
0
09 Apr 2025
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Piyush Bagad
Hazel Doughty
Bernard Ghanem
Cees G. M. Snoek
ViTSSL
175
0
0
08 Apr 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
Akash Kumar
Ashlesha Kumar
Vibhav Vineet
Yogesh S Rawat
SSL
605
2
0
08 Apr 2025
UniSync: A Unified Framework for Audio-Visual Synchronization
UniSync: A Unified Framework for Audio-Visual Synchronization
Tao Feng
Yifan Xie
Xun Guan
Jiyuan Song
Z. Liu
Fei Ma
Fei Richard Yu
154
2
0
20 Mar 2025
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Khang H. N. Vo
D. Q. Nguyen
T. Nguyen
Tho Quan
142
3
0
09 Mar 2025
Scaling 4D Representations
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
...
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
187
10
0
19 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
248
3
0
18 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo
Yibing Song
117
1
0
30 Oct 2024
DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event
  Localization and Detection
DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
Yoto Fujita
Yoshiaki Bando
Keisuke Imoto
Masaki Onishi
Kazuyoshi Yoshii
76
3
0
30 Oct 2024
T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data
T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data
Hugo Thimonier
José Lucas De Melo Costa
Fabrice Popineau
Arpad Rimmel
Bich-Liên Doan
170
5
0
07 Oct 2024
Self-Supervised Audio-Visual Soundscape Stylization
Self-Supervised Audio-Visual Soundscape Stylization
Tingle Li
Renhao Wang
Po-Yao Huang
Andrew Owens
Gopala Anumanchipalli
DiffMSSL
151
7
0
22 Sep 2024
Interpretable Convolutional SyncNet
Interpretable Convolutional SyncNet
Sungjoon Park
Jaesub Yun
Donggeon Lee
Minsik Park
132
1
0
02 Sep 2024
Multi-scale Multi-instance Visual Sound Localization and Segmentation
Multi-scale Multi-instance Visual Sound Localization and Segmentation
Shentong Mo
Haofan Wang
96
3
0
31 Aug 2024
Enhancing Sound Source Localization via False Negative Elimination
Enhancing Sound Source Localization via False Negative Elimination
Zengjie Song
Jiangshe Zhang
Yuxi Wang
Junsong Fan
Zhaoxiang Zhang
127
1
0
29 Aug 2024
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
Mahrukh Awan
Asmar Nadeem
Muhammad Junaid Awan
Armin Mustafa
Syed Sameed Husain
89
2
0
26 Aug 2024
BrewCLIP: A Bifurcated Representation Learning Framework for
  Audio-Visual Retrieval
BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
Zhenyu Lu
Lakshay Sethi
116
0
0
19 Aug 2024
Aligning Sight and Sound: Advanced Sound Source Localization Through
  Audio-Visual Alignment
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
141
8
0
18 Jul 2024
Audio-visual Generalized Zero-shot Learning the Easy Way
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo
Pedro Morgado
102
6
0
18 Jul 2024
Sequential Contrastive Audio-Visual Learning
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
146
4
0
08 Jul 2024
Semantic Grouping Network for Audio Source Separation
Semantic Grouping Network for Audio Source Separation
Shentong Mo
Yapeng Tian
98
4
0
04 Jul 2024
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual
  Transformers
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
Tanvir Mahmud
Shentong Mo
Yapeng Tian
Diana Marculescu
96
6
0
07 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
209
0
0
04 Jun 2024
SemiPL: A Semi-supervised Method for Event Sound Source Localization
SemiPL: A Semi-supervised Method for Event Sound Source Localization
Yue Li
Baiqiao Yin
Jinfu Liu
Jiajun Wen
Jiaying Lin
Mengyuan Liu
86
0
0
30 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large
  Multi-Modal Models
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLMCLIP
86
3
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric
  Videos
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoVSSL
122
8
0
08 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of
  Robustness
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
142
24
0
05 Apr 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
Jongsuk Kim
Hyeongkeun Lee
Kyeongha Rho
Junmo Kim
Joon Son Chung
110
8
0
14 Mar 2024
Text-to-Audio Generation Synchronized with Videos
Text-to-Audio Generation Synchronized with Videos
Shentong Mo
Jing Shi
Yapeng Tian
DiffMVGen
123
22
0
08 Mar 2024
On the Efficacy of Text-Based Input Modalities for Action Anticipation
On the Efficacy of Text-Based Input Modalities for Action Anticipation
Apoorva Beedu
Karan Samel
Irfan Essa
162
2
0
23 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
224
1
0
15 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video
  Classification
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
127
6
0
08 Jan 2024
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
  Interactions through Masked Modeling
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo
Pedro Morgado
106
21
0
02 Dec 2023
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Hanyuan Wang
Majid Mirmehdi
Dima Damen
Toby Perrett
119
2
0
28 Nov 2023
Weakly-Supervised Audio-Visual Segmentation
Weakly-Supervised Audio-Visual Segmentation
Shentong Mo
Bhiksha Raj
VOS
127
16
0
25 Nov 2023
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video
  Parsing
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Yating Xu
Conghui Hu
Gim Hee Lee
67
5
0
14 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and
  contextual modalities
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
178
22
0
09 Nov 2023
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and
  Audio
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Xudong Xu
Dejan Marković
Jacob Sandakly
Todd Keebler
Steven Krenn
Alexander Richard
61
6
0
01 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
132
11
0
25 Oct 2023
Show from Tell: Audio-Visual Modelling in Clinical Settings
Show from Tell: Audio-Visual Modelling in Clinical Settings
Jianbo Jiao
M. Alsharid
L. Drukker
A. Papageorghiou
Andrew Zisserman
J. A. Noble
85
0
0
25 Oct 2023
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal
  Localized Alignment
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Jaewoo Lee
Jaehong Yoon
Wonjae Kim
Yunji Kim
Sung Ju Hwang
CLL
147
1
0
12 Oct 2023
Diffusion Models as Masked Audio-Video Learners
Diffusion Models as Masked Audio-Video Learners
Elvis Nunez
Yanzi Jin
Mohammad Rastegari
Sachin Mehta
Maxwell Horton
88
2
0
05 Oct 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Jiangliu Wang
Jianbo Jiao
Yibing Song
Stephen James
Zhan Tong
Chongjian Ge
Pieter Abbeel
Yunhui Liu
62
0
0
25 Sep 2023
1234567
Next