ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2210.07839
  4. Cited By
Contrastive Audio-Visual Masked Autoencoder

Contrastive Audio-Visual Masked Autoencoder

2 October 2022
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David F. Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
ArXivPDFHTML

Papers citing "Contrastive Audio-Visual Masked Autoencoder"

50 / 90 papers shown
Title
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
Hao Cheng
Zhiwei Zhao
Yichao He
Zhenzhen Hu
Jia Li
M. Wang
Richang Hong
43
0
0
05 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
32
0
0
02 May 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
62
2
0
02 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
51
0
0
30 Mar 2025
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Lee Chae-Yeon
Oh Hyun-Bin
Han EunGi
Kim Sung-Bin
Suekyeong Nam
Tae-Hyun Oh
EGVM
3DH
85
0
1
26 Mar 2025
Structured-Noise Masked Modeling for Video, Audio and Beyond
Structured-Noise Masked Modeling for Video, Audio and Beyond
Aritra Bhowmik
Fida Mohammad Thoker
Carlos Hinojosa
Bernard Ghanem
Cees G. M. Snoek
VGen
59
0
0
20 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Wei Ping
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM
AuLLM
LRM
57
8
0
06 Mar 2025
Smoothing the Shift: Towards Stable Test-Time Adaptation under Complex Multimodal Noises
Zirun Guo
Tao Jin
TTA
84
1
0
04 Mar 2025
When Vision Models Meet Parameter Efficient Look-Aside Adapters Without
  Large-Scale Audio Pretraining
When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining
Juan Yeo
Jinkwan Jang
Kyubyung Chae
Seongkyu Mun
Taesup Kim
VLM
57
0
0
08 Dec 2024
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA
  Benchmark
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
Joseph Heyward
João Carreira
Dima Damen
Andrew Zisserman
Viorica Patraucean
80
2
0
29 Nov 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Vinan
75
0
0
24 Nov 2024
KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder
KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder
Maheswar Bora
Saurabh Atreya
Aritra Mukherjee
Abhijit Das
87
0
0
19 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
42
0
0
18 Nov 2024
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Aligning Audio-Visual Joint Representations with an Agentic Workflow
Shentong Mo
Yibing Song
23
0
0
30 Oct 2024
Analytic Continual Test-Time Adaptation for Multi-Modality Corruption
Analytic Continual Test-Time Adaptation for Multi-Modality Corruption
Yufei Zhang
Yicheng Xu
Hongxin Wei
Zhiping Lin
Huiping Zhuang
TTA
27
0
0
29 Oct 2024
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image
  Captioner using Audiovisual Distribution Alignment
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard
Michel Olvera
Stéphane Lathuilière
S. Essid
VLM
34
0
0
08 Oct 2024
The Solution for Temporal Action Localisation Task of Perception Test
  Challenge 2024
The Solution for Temporal Action Localisation Task of Perception Test Challenge 2024
Yinan Han
Qingyuan Jiang
Hongming Mei
Yang Yang
Jinhui Tang
22
0
0
08 Oct 2024
Solution for Temporal Sound Localisation Task of ECCV Second Perception
  Test Challenge 2024
Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024
Haowei Gu
Weihao Zhu
Yang Yang
32
0
0
29 Sep 2024
From Vision to Audio and Beyond: A Unified Model for Audio-Visual
  Representation and Generation
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Kun Su
Xiulong Liu
Eli Shlizerman
VGen
30
6
0
27 Sep 2024
Measuring Sound Symbolism in Audio-visual Models
Measuring Sound Symbolism in Audio-visual Models
Wei-Cheng Tseng
Yi-Jen Shih
David Harwath
Raymond Mooney
32
0
0
18 Sep 2024
DETECLAP: Enhancing Audio-Visual Representation Learning with Object
  Information
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
Shota Nakada
Taichi Nishimura
Hokuto Munakata
Masayoshi Kondo
Tatsuya Komatsu
CLIP
VLM
30
0
0
18 Sep 2024
Masked Image Modeling: A Survey
Masked Image Modeling: A Survey
Vlad Hondru
Florinel-Alin Croitoru
Shervin Minaee
Radu Tudor Ionescu
N. Sebe
64
6
0
13 Aug 2024
Missing Modality Prediction for Unpaired Multimodal Learning via Joint
  Embedding of Unimodal Models
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
Donggeun Kim
Taesup Kim
24
3
0
17 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
41
9
0
16 Jul 2024
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning
Jongsuk Kim
Jiwon Shin
Junmo Kim
39
1
0
10 Jul 2024
Masked Video and Body-worn IMU Autoencoder for Egocentric Action
  Recognition
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Mingfang Zhang
Yifei Huang
Ruicong Liu
Yoichi Sato
39
4
0
09 Jul 2024
Sequential Contrastive Audio-Visual Learning
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serra
35
2
0
08 Jul 2024
Revealing Vision-Language Integration in the Brain with Multimodal
  Networks
Revealing Vision-Language Integration in the Brain with Multimodal Networks
Vighnesh Subramaniam
C. Conwell
Christopher Wang
Gabriel Kreiman
Boris Katz
Ignacio Cases
Andrei Barbu
32
8
0
20 Jun 2024
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding
  of Sound and Language
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Mark Hamilton
Andrew Zisserman
John R. Hershey
William T. Freeman
VLM
37
5
0
09 Jun 2024
What do MLLMs hear? Examining reasoning with text and sound components
  in Multimodal Large Language Models
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban
Michael I. Mandel
Johanna Devaney
AuLLM
LRM
36
0
0
07 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Y. Guo
VGen
100
16
0
06 Jun 2024
Audio Mamba: Bidirectional State Space Model for Audio Representation
  Learning
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
67
19
0
05 Jun 2024
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection
Trevine Oorloff
Surya Koppisetti
Nicolò Bonettini
Divyaraj Solanki
Ben Colman
Yaser Yacoob
Ali Shahriyari
Gaurav Bharaj
32
20
0
05 Jun 2024
Images that Sound: Composing Images and Sounds on a Single Canvas
Images that Sound: Composing Images and Sounds on a Single Canvas
Ziyang Chen
Daniel Geng
Andrew Owens
DiffM
48
9
0
20 May 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Diptesh Kanojia
Jiangkang Deng
Xiatian Zhu
VOS
37
5
0
21 Mar 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
Jongsuk Kim
Hyeongkeun Lee
Kyeongha Rho
Junmo Kim
Joon Son Chung
23
4
0
14 Mar 2024
Dyadic Interaction Modeling for Social Behavior Generation
Dyadic Interaction Modeling for Social Behavior Generation
Minh Tran
Di Chang
Maksim Siniukov
Mohammad Soleymani
VGen
34
6
0
14 Mar 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual
  Recognition with Visual Modality Missing?
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
32
4
0
14 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
77
4
0
08 Feb 2024
Tuning In: Analysis of Audio Classifier Performance in Clinical Settings
  with Limited Data
Tuning In: Analysis of Audio Classifier Performance in Clinical Settings with Limited Data
Hamza Mahdi
Eptehal Nashnoush
Rami Saab
Arjun Balachandar
Rishit Dagli
Lucas X. Perri
H. Khosravani
16
1
0
07 Feb 2024
A Consistent Lebesgue Measure for Multi-label Learning
A Consistent Lebesgue Measure for Multi-label Learning
Kaan Demir
B. Nguyen
Bing Xue
Mengjie Zhang
28
0
0
01 Feb 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
34
0
0
22 Jan 2024
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised
  Audio-Visual Emotion Recognition
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Licai Sun
Zheng Lian
Bin Liu
Jianhua Tao
51
29
0
11 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
38
6
0
08 Jan 2024
Learning to Embed Time Series Patches Independently
Learning to Embed Time Series Patches Independently
Seunghan Lee
Taeyoung Park
Kibok Lee
SSL
AI4TS
20
27
0
27 Dec 2023
SAIC: Integration of Speech Anonymization and Identity Classification
SAIC: Integration of Speech Anonymization and Identity Classification
Ming Cheng
Xingjian Diao
Shitong Cheng
Wenjun Liu
45
6
0
23 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
19
38
0
11 Dec 2023
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense
  Interactions through Masked Modeling
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo
Pedro Morgado
19
13
0
02 Dec 2023
Multimodal Representation Learning by Alternating Unimodal Adaptation
Multimodal Representation Learning by Alternating Unimodal Adaptation
Xiaohui Zhang
Jaehong Yoon
Mohit Bansal
Huaxiu Yao
26
21
0
17 Nov 2023
12
Next