ResearchTrend.AI
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

5 January 2022
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdel-rahman Mohamed
    SSL

Papers citing "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction"

50 / 207 papers shown
LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models
Danilo de Oliveira
Julius Richter
Tal Peer
Timo Germann
DiffM
22
0
0
16 May 2025
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
Detao Bai
Zhiheng Ma
Xihan Wei
Liefeng Bo
141
0
0
06 May 2025
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
J. Choi
Ji-Hoon Kim
Kim Sung-Bin
Tae-Hyun Oh
Joon Son Chung
DiffM
49
0
0
29 Apr 2025
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
Jinghua Zhao
Yuhang Jia
Shiyao Wang
Jiaming Zhou
Hui Wang
Yong Qin
37
0
0
21 Apr 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li
Mining Tan
Feier Shen
Minyan Luo
Zijiao Yin
Fan Tang
W. Dong
Changsheng Xu
69
0
0
17 Apr 2025
Visual-Aware Speech Recognition for Noisy Scenarios
Lakshmipathi Balaji
Karan Singla
31
0
0
09 Apr 2025
FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency
Shiyan Liu
Rui Qu
Yan Jin
31
0
0
06 Apr 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin
Jeongsoo Choi
Puyuan Peng
Joon Son Chung
Tae-Hyun Oh
David Harwath
VGen
47
1
0
03 Apr 2025
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
57
0
0
28 Mar 2025
VALLR: Visual ASR Language Model for Lip Reading
Marshall Thomas
Edward Fish
Richard Bowden
41
0
0
27 Mar 2025
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Lee Chae-Yeon
Oh Hyun-Bin
Han EunGi
Kim Sung-Bin
Suekyeong Nam
Tae-Hyun Oh
EGVM
3DH
87
0
1
26 Mar 2025
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Ji-Hoon Kim
Jeongsoo Choi
Jaehun Kim
Chaeyoung Jung
Joon Son Chung
CVBM
53
1
0
21 Mar 2025
Multi-modal Time Series Analysis: A Tutorial and Survey
Yushan Jiang
Kanghui Ning
Zijie Pan
Xuyang Shen
Jingchao Ni
Wenchao Yu
Anderson Schneider
Haifeng Chen
Yuriy Nevmyvaka
Dongjin Song
AI4TS
170
0
0
17 Mar 2025
SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization
Xulin Fan
Heting Gao
Ziyi Chen
Peng Chang
Mei Han
Mark Hasegawa-Johnson
DiffM
62
0
0
17 Mar 2025
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo
Hyeongseop Rha
Se Jin Park
Y. Ro
56
0
0
14 Mar 2025
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Sungwoo Cho
J. Choi
Sungnyun Kim
Se-Young Yun
63
0
0
14 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
55
0
0
12 Mar 2025
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Umberto Cappellazzo
Minsu Kim
Stavros Petridis
57
0
0
09 Mar 2025
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
Yifan Liu
Yu Fang
Zhouhan Lin
42
0
0
07 Mar 2025
Note-Level Singing Melody Transcription for Time-Aligned Musical Score Generation
Leekyung Kim
Sungwook Jeon
Wan Heo
Jonghun Park
87
0
0
18 Feb 2025
NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing
Yifan Liang
Fangkun Liu
Andong Li
Xiaodong Li
C. Zheng
49
1
0
17 Feb 2025
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models
Jing-Xuan Zhang
Genshun Wan
Jianqing Gao
Zhen-Hua Ling
49
0
0
09 Feb 2025
Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models
Christopher Simic
Korbinian Riedhammer
Tobias Bocklet
93
0
0
03 Feb 2025
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Andrew Rouditchenko
Saurabhchand Bhati
Samuel Thomas
Hilde Kuehne
Rogerio Feris
116
1
0
03 Feb 2025
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
Sungnyun Kim
Sungwoo Cho
Sangmin Bae
Kangwook Jang
Se-Young Yun
SSL
68
1
0
23 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
43
0
0
03 Jan 2025
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
Jeong Hun Yeo
Chae Won Kim
Hyunjun Kim
Hyeongseop Rha
Seunghee Han
Wen-Huang Cheng
Y. Ro
59
3
0
03 Jan 2025
DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering
Ruohong Yang
Peng Hu
Xi Peng
Xiting Liu
Yunfan Li
39
0
0
25 Dec 2024
DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
Lei Yang
Shaoyang Xu
Deyi Xiong
39
1
0
25 Dec 2024
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Lucas Goncalves
Prashant Mathur
Xing Niu
Brady Houston
Chandrashekhar Lavania
Srikanth Vishnubhotla
Lijia Sun
Anthony Ferritto
72
0
0
21 Dec 2024
Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning
Meng Shen
Yake Wei
Jianxiong Yin
D. Rajan
D. Hu
Simon See
81
0
0
12 Dec 2024
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
Dragos-Alexandru Boldisor
Stefan Smeu
Dan Oneaţă
Elisabeta Oneata
103
1
0
29 Nov 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Vinan
75
0
0
24 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
45
0
0
18 Nov 2024
DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization
C. Koutlis
Symeon Papadopoulos
58
2
0
15 Nov 2024
How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception
Sahibzada Adil Shahzad
Ammarah Hashmi
Yan-Tsung Peng
Yu Tsao
H. Wang
39
1
0
14 Nov 2024
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
M. Pantic
SSL
37
5
0
04 Nov 2024
Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques
David Ortiz-Perez
Manuel Benavent-Lledo
José García Rodríguez
David Tomás
M. Flores Vizcaya-Moreno
34
0
0
24 Oct 2024
Character-aware audio-visual subtitling in context
Jaesung Huh
Andrew Zisserman
41
0
0
14 Oct 2024
Diffusion-based Unsupervised Audio-visual Speech Enhancement
Jean-Eudes Ayilo
Mostafa Sadeghi
Romain Serizel
Xavier Alameda-Pineda
DiffM
20
0
0
04 Oct 2024
Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective
Chen Chen
Xiaolou Li
Zehua Liu
Lantian Li
D. Wang
33
0
0
29 Sep 2024
Video-to-Audio Generation with Fine-grained Temporal Semantics
Yuchen Hu
Yu Gu
Chenxing Li
Rilin Chen
Dong Yu
VGen
DiffM
29
1
0
23 Sep 2024
Measuring Sound Symbolism in Audio-visual Models
Wei-Cheng Tseng
Yi-Jen Shih
David Harwath
Raymond Mooney
37
0
0
18 Sep 2024
Towards Global Localization using Multi-Modal Object-Instance Re-Identification
Aneesh Chavan
Vaibhav Agrawal
Vineeth Bhat
Sarthak Chittawar
Siddharth Srivastava
Chetan Arora
K. M. Krishna
95
0
0
18 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
36
9
0
18 Sep 2024
Self-supervised Multimodal Speech Representations for the Assessment of Schizophrenia Symptoms
Gowtham Premananth
Carol Y. Espy-Wilson
23
1
0
15 Sep 2024
Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
Yiwen Guan
V. Trinh
Vivek Voleti
Jacob Whitehill
42
1
0
13 Sep 2024
RAL: Redundancy-Aware Lipreading Model Based on Differential Learning with Symmetric Views
Zejun Gu
Junxia Jiang
33
0
0
09 Sep 2024
SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing
Lingyu Xiong
Xize Cheng
Jintao Tan
Wenxiong Kang
Xiandong Li
Lei Zhu
Fei Ma
Minglei Li
Huang Xu
Zhihu Hu
34
3
0
05 Sep 2024
Interpretable Convolutional SyncNet
Sungjoon Park
Jaesub Yun
Donggeon Lee
Minsik Park
57
0
0
02 Sep 2024