ResearchTrend.AI
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

30 June 2018
Bruno Korbar
Du Tran
Lorenzo Torresani

Papers citing "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization"

50 / 316 papers shown
Title
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz
M. Mühlhäuser
Alejandro Sánchez Guinea
61
0
0
16 Jun 2025
Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation
Theodore Barfoot
Luis C. Garcia-Peraza-Herrera
Samet Akcay
Ben Glocker
Tom Vercauteren
UQCV
156
0
0
04 Jun 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
134
0
0
02 May 2025
Evolutionary algorithms meet self-supervised learning: a comprehensive survey
Adriano Vinhas
João Correia
Penousal Machado
SSL, SyDa
129
0
0
09 Apr 2025
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Piyush Bagad
Hazel Doughty
Bernard Ghanem
Cees G. M. Snoek
ViT, SSL
124
0
0
08 Apr 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
Akash Kumar
Ashlesha Kumar
Vibhav Vineet
Yogesh S Rawat
SSL
486
0
0
08 Apr 2025
UniSync: A Unified Framework for Audio-Visual Synchronization
Tao Feng
Yifan Xie
Xun Guan
Jiyuan Song
Z. Liu
Fei Ma
Fei Richard Yu
107
1
0
20 Mar 2025
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Khang H. N. Vo
D. Q. Nguyen
T. Nguyen
Tho Quan
131
1
0
09 Mar 2025
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
...
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
150
5
0
19 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
191
0
0
18 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo
Yibing Song
75
0
0
30 Oct 2024
DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
Yoto Fujita
Yoshiaki Bando
Keisuke Imoto
Masaki Onishi
Kazuyoshi Yoshii
69
2
0
30 Oct 2024
T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data
Hugo Thimonier
José Lucas De Melo Costa
Fabrice Popineau
Arpad Rimmel
Bich-Liên Doan
131
2
0
07 Oct 2024
Self-Supervised Audio-Visual Soundscape Stylization
Tingle Li
Renhao Wang
Po-Yao Huang
Andrew Owens
Gopala Anumanchipalli
DiffM, SSL
114
5
0
22 Sep 2024
Interpretable Convolutional SyncNet
Sungjoon Park
Jaesub Yun
Donggeon Lee
Minsik Park
118
0
0
02 Sep 2024
Multi-scale Multi-instance Visual Sound Localization and Segmentation
Shentong Mo
Haofan Wang
82
2
0
31 Aug 2024
Enhancing Sound Source Localization via False Negative Elimination
Zengjie Song
Jiangshe Zhang
Yuxi Wang
Junsong Fan
Zhaoxiang Zhang
97
0
0
29 Aug 2024
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
Mahrukh Awan
Asmar Nadeem
Muhammad Junaid Awan
Armin Mustafa
Syed Sameed Husain
77
1
0
26 Aug 2024
BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval
Zhenyu Lu
Lakshay Sethi
84
0
0
19 Aug 2024
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
115
4
0
18 Jul 2024
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo
Pedro Morgado
79
5
0
18 Jul 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
102
3
0
08 Jul 2024
Semantic Grouping Network for Audio Source Separation
Shentong Mo
Yapeng Tian
80
4
0
04 Jul 2024
MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers
Tanvir Mahmud
Shentong Mo
Yapeng Tian
Diana Marculescu
70
4
0
07 Jun 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
165
0
0
04 Jun 2024
SemiPL: A Semi-supervised Method for Event Sound Source Localization
Yue Li
Baiqiao Yin
Jinfu Liu
Jiajun Wen
Jiaying Lin
Mengyuan Liu
78
0
0
30 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM, CLIP
78
3
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV, SSL
88
7
0
08 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
118
1
0
05 Apr 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
Jongsuk Kim
Hyeongkeun Lee
Kyeongha Rho
Junmo Kim
Joon Son Chung
73
6
0
14 Mar 2024
Text-to-Audio Generation Synchronized with Videos
Shentong Mo
Jing Shi
Yapeng Tian
DiffM, VGen
98
18
0
08 Mar 2024
On the Efficacy of Text-Based Input Modalities for Action Anticipation
Apoorva Beedu
Karan Samel
Irfan Essa
102
2
0
23 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
177
1
0
15 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
98
5
0
08 Jan 2024
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo
Pedro Morgado
82
14
0
02 Dec 2023
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Hanyuan Wang
Majid Mirmehdi
Dima Damen
Toby Perrett
105
2
0
28 Nov 2023
Weakly-Supervised Audio-Visual Segmentation
Shentong Mo
Bhiksha Raj
VOS
104
13
0
25 Nov 2023
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
Yating Xu
Conghui Hu
Gim Hee Lee
55
2
0
14 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
149
21
0
09 Nov 2023
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Xudong Xu
Dejan Marković
Jacob Sandakly
Todd Keebler
Steven Krenn
Alexander Richard
51
5
0
01 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
103
10
0
25 Oct 2023
Show from Tell: Audio-Visual Modelling in Clinical Settings
Jianbo Jiao
M. Alsharid
L. Drukker
A. Papageorghiou
Andrew Zisserman
J. A. Noble
81
0
0
25 Oct 2023
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Jaewoo Lee
Jaehong Yoon
Wonjae Kim
Yunji Kim
Sung Ju Hwang
CLL
120
1
0
12 Oct 2023
Diffusion Models as Masked Audio-Video Learners
Elvis Nunez
Yanzi Jin
Mohammad Rastegari
Sachin Mehta
Maxwell Horton
65
2
0
05 Oct 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Jiangliu Wang
Jianbo Jiao
Yibing Song
Stephen James
Zhan Tong
Chongjian Ge
Pieter Abbeel
Yunhui Liu
58
0
0
25 Sep 2023
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
Meng Liu
K. Liang
Dayu Hu
Hao Yu
Yue Liu
Lingyuan Meng
Wenxuan Tu
Sihang Zhou
Xinwang Liu
89
26
0
21 Sep 2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng
Layne Berry
Yi-Ting Chen
I-Hsiang Chiu
Hsuan-Hao Lin
...
Yu Tsao
Shinji Watanabe
Abdel-rahman Mohamed
Chi-Luen Feng
Hung-yi Lee
VLM, SSL
131
15
0
19 Sep 2023
Sound Source Localization is All about Cross-Modal Alignment
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
97
19
0
19 Sep 2023
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Shentong Mo
Weiguo Pian
Yapeng Tian
CLL, VLM
94
26
0
11 Sep 2023
Text-to-feature diffusion for audio-visual few-shot learning
Otniel-Bogdan Mercea
Thomas Hummel
A. Sophia Koepke
Zeynep Akata
VLM
84
2
0
07 Sep 2023