ResearchTrend.AI
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

5 January 2022
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdel-rahman Mohamed
    SSL

Papers citing "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction"

50 / 207 papers shown
Title
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Huimeng Wang
Zengrui Jin
Mengzhe Geng
Shujie Hu
Guinan Li
Tianzi Wang
Haoning Xu
Xunying Liu
21
10
0
01 Jan 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
Xize Cheng
Rongjie Huang
Linjun Li
Tao Jin
Zehan Wang
Aoxiong Yin
Minglei Li
Xinyu Duan
Changpeng Yang
Zhou Zhao
33
2
0
23 Dec 2023
Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models
Christopher Simic
Tobias Bocklet
34
5
0
21 Dec 2023
LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data
Hendrik Laux
Emil Mededovic
Ahmed Hallawa
Lukas Martin
A. Peine
Anke Schmeink
VLM
26
4
0
15 Dec 2023
Audio-visual fine-tuning of audio-only ASR models
Avner May
Dmitriy Serdyuk
Ankit Parag Shah
Otavio Braga
Olivier Siohan
31
3
0
14 Dec 2023
On Robustness to Missing Video for Audiovisual Speech Recognition
Oscar Chang
Otavio Braga
H. Liao
Dmitriy Serdyuk
Olivier Siohan
45
11
0
13 Dec 2023
Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism
Georgios Milis
P. Filntisis
A. Roussos
Petros Maragos
CVBM
36
2
0
11 Dec 2023
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
J. Choi
Se Jin Park
Minsu Kim
Y. Ro
37
12
0
05 Dec 2023
Stochastic Vision Transformers with Wasserstein Distance-Aware Attention
Franciskus Xaverius Erick
Mina Rezaei
Johanna P. Müller
Bernhard Kainz
23
0
0
30 Nov 2023
Do VSR Models Generalize Beyond LRS3?
Y. A. D. Djilali
Sanath Narayan
Eustache Le Bihan
Haithem Boussaid
Ebtesam Almazrouei
Merouane Debbah
35
4
0
23 Nov 2023
AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection
Sahibzada Adil Shahzad
Ammarah Hashmi
Yan-Tsung Peng
Yu Tsao
Hsin-Min Wang
34
5
0
05 Nov 2023
Detecting Deepfakes Without Seeing Any
Tal Reiss
Bar Cavia
Yedid Hoshen
AAML
31
17
0
02 Nov 2023
MOSEL: Inference Serving Using Dynamic Modality Selection
Bodun Hu
Le Xu
Jeongyoon Moon
N. Yadwadkar
Aditya Akella
13
4
0
27 Oct 2023
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Jeff Hwang
Moto Hira
Caroline Chen
Xiaohui Zhang
Zhaoheng Ni
...
Yumeng Tao
Robin Scheibler
Samuele Cornell
Sean Kim
Stavros Petridis
46
22
0
27 Oct 2023
Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model
Joanna Hong
Se Jin Park
Y. Ro
VLM
16
6
0
23 Oct 2023
AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection
Ammarah Hashmi
Sahibzada Adil Shahzad
Chia-Wen Lin
Yu Tsao
Hsin-Min Wang
ViT
53
6
0
19 Oct 2023
Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio
Antoni Dimitriadis
Siqi Pan
V. Sethu
Beena Ahmed
SSL
28
3
0
17 Oct 2023
Modality-aware Transformer for Financial Time series Forecasting
Hajar Emami
Xuan-Hong Dang
Yousaf Shah
Petros Zerfos
AI4TS
40
0
0
02 Oct 2023
Emotional Listener Portrait: Neural Listener Head Generation with Emotion
Luchuan Song
Guojun Yin
Zhenchao Jin
Xiaoyi Dong
Chenliang Xu
30
11
0
29 Sep 2023
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Andrew Rouditchenko
R. Collobert
Tatiana Likhomanenko
VLM
27
3
0
29 Sep 2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Yuan Tseng
Layne Berry
Yi-Ting Chen
I-Hsiang Chiu
Hsuan-Hao Lin
...
Yu Tsao
Shinji Watanabe
Abdel-rahman Mohamed
Chi-Luen Feng
Hung-yi Lee
VLM
SSL
61
14
0
19 Sep 2023
Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation
Danilo de Oliveira
Timo Gerkmann
VLM
28
3
0
18 Sep 2023
Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition
Haoming Guo
Seth Z. Zhao
Jiachen Lian
Gopala Anumanchipalli
Gerald Friedland
24
2
0
16 Sep 2023
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper
Jeong Hun Yeo
Minsu Kim
Shinji Watanabe
Y. Ro
VLM
34
12
0
15 Sep 2023
AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement
Ju-Chieh Chou
Chung-Ming Chien
Karen Livescu
DiffM
23
4
0
14 Sep 2023
Multimodal Fish Feeding Intensity Assessment in Aquaculture
Meng Cui
Xubo Liu
Haohe Liu
Zhuangzhuang Du
Tao Chen
Guoping Lian
Daoliang Li
Wenwu Wang
34
5
0
10 Sep 2023
Diversified Ensemble of Independent Sub-Networks for Robust Self-Supervised Representation Learning
Amirhossein Vahidi
Lisa Wimmer
H. Gündüz
Bernd Bischl
Eyke Hüllermeier
Mina Rezaei
OOD
UQCV
30
4
0
28 Aug 2023
Diffusion Models for Image Restoration and Enhancement -- A Comprehensive Survey
Xin Li
Yulin Ren
Xin Jin
Cuiling Lan
X. Wang
Wenjun Zeng
Xinchao Wang
Zhibo Chen
43
86
0
18 Aug 2023
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
Minsu Kim
Jeong Hun Yeo
J. Choi
Y. Ro
34
16
0
18 Aug 2023
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding
J. Choi
Joanna Hong
Y. Ro
DiffM
29
19
0
15 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
26
18
0
15 Aug 2023
Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
Y. A. D. Djilali
Sanath Narayan
Haithem Boussaid
Ebtesam Almazrouei
Merouane Debbah
37
10
0
11 Aug 2023
Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation
Minsu Kim
J. Choi
Dahun Kim
Y. Ro
42
12
0
03 Aug 2023
Audio-visual video-to-speech synthesis with synthesized input audio
Triantafyllos Kefalas
Yannis Panagakis
M. Pantic
VGen
DiffM
38
1
0
31 Jul 2023
A Unified Framework for Modality-Agnostic Deepfakes Detection
Cai Yu
Peng-Wen Chen
Jiahe Tian
Jin Liu
Jiao Dai
Xi Wang
Yesheng Chai
Shan Jia
Siwei Lyu
Jizhong Han
32
0
0
26 Jul 2023
Leveraging Visemes for Better Visual Speech Representation and Lip Reading
J. Peymanfard
Vahid Saeedi
Mohammad Reza Mohammadi
Hossein Zeinali
N. Mozayani
39
2
0
19 Jul 2023
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
Adriana Fernandez-Lopez
Honglie Chen
Pingchuan Ma
A. Haliassos
Stavros Petridis
M. Pantic
VLM
33
7
0
10 Jul 2023
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Neha Sahipjohn
Neil Shah
Vishal Tambrahalli
Vineet Gandhi
24
2
0
03 Jul 2023
What Do Self-Supervised Speech Models Know About Words?
Ankita Pasad
C. Chien
Shane Settle
Karen Livescu
SSL
46
26
0
30 Jun 2023
QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge
Hsi-Che Lin
Chien-Yi Wang
Min-Hung Chen
Szu-Wei Fu
Y. Wang
23
2
0
30 Jun 2023
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
Junchen Lu
Berrak Sisman
Mingyang Zhang
Haizhou Li
32
4
0
29 Jun 2023
AudioPaLM: A Large Language Model That Can Speak and Listen
Paul Kishan Rubenstein
Chulayuth Asawaroengchai
D. Nguyen
Ankur Bapna
Zalan Borsos
...
Neil Zeghidour
Yu Zhang
Zhishuai Zhang
Lukás Zilka
Christian Frank
LM&MA
AuLLM
VLM
48
264
0
22 Jun 2023
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Yuchen Hu
Chen Chen
Ruizhe Li
Heqing Zou
Chng Eng Siong
GAN
42
9
0
18 Jun 2023
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Yuchen Hu
Ruizhe Li
Cheng Chen
Chengwei Qin
Qiu-shi Zhu
Eng Siong Chng
39
5
0
18 Jun 2023
Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion
Yung-Lun Chien
Hsin-Hao Chen
Ming-Chi Yen
S. Tsai
Hsin-Min Wang
Yu Tsao
T. Chi
20
1
0
11 Jun 2023
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Xize Cheng
Tao Jin
Lin Li
Wang Lin
Xinyu Duan
Zhou Zhao
VLM
24
15
0
10 Jun 2023
LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading
Yochai Yemini
Aviv Shamsian
Lior Bracha
Sharon Gannot
Ethan Fetaya
DiffM
33
11
0
05 Jun 2023
Audio-Visual Speech Enhancement with Score-Based Generative Models
Julius Richter
Simone Frintrop
Timo Gerkmann
DiffM
26
10
0
02 Jun 2023
Speech inpainting: Context-based speech synthesis guided by video
Juan F. Montesinos
Daniel Michelsanti
G. Haro
Zheng-Hua Tan
Jesper Jensen
21
3
0
01 Jun 2023
Intelligible Lip-to-Speech Synthesis with Speech Units
J. Choi
Minsu Kim
Y. Ro
32
24
0
31 May 2023