ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2504.18539
  4. Cited By
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

23 January 2025
Sungnyun Kim
Sungwoo Cho
Sangmin Bae
Kangwook Jang
Se-Young Yun
    SSL
ArXivPDFHTML

Papers citing "Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation"

48 / 48 papers shown
Title
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust
  Audio-Visual Speech Recognition
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
Sungnyun Kim
Kangwook Jang
Sangmin Bae
Hoirin Kim
Se-Young Yun
80
3
0
04 Jul 2024
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory
  Speech Recognition
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
A. Haliassos
Andreas Zinonos
Rodrigo Mira
Stavros Petridis
Maja Pantic
VLM
SSL
AI4TS
59
13
0
02 Apr 2024
Multichannel AV-wav2vec2: A Framework for Learning Multichannel
  Multi-Modal Speech Representation
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Qiu-shi Zhu
Jie Zhang
Yu Gu
Yuli Hu
Lirong Dai
SSL
63
11
0
07 Jan 2024
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with
  Adversarial Network for Audio-Visual Speech Recognition
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Yuchen Hu
Chen Chen
Ruizhe Li
Heqing Zou
Chng Eng Siong
GAN
65
9
0
18 Jun 2023
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for
  Robust Audio-Visual Speech Recognition
Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition
Yuchen Hu
Ruizhe Li
Cheng Chen
Chengwei Qin
Qiu-shi Zhu
Eng Siong Chng
82
5
0
18 Jun 2023
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head
  Checkpoints
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie
James Lee-Thorp
Michiel de Jong
Yury Zemlyanskiy
Federico Lebrón
Sumit Sanghai
63
657
0
22 May 2023
DinoSR: Self-Distillation and Online Clustering for Self-supervised
  Speech Representation Learning
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Alexander H. Liu
Heng-Jui Chang
Michael Auli
Wei-Ning Hsu
James R. Glass
44
26
0
17 May 2023
Cross-Modal Global Interaction and Local Alignment for Audio-Visual
  Speech Recognition
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Yuchen Hu
Ruizhe Li
Chen Chen
Heqing Zou
Qiu-shi Zhu
Eng Siong Chng
53
8
0
16 May 2023
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
  AV-ASR
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
36
15
0
29 Mar 2023
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Pingchuan Ma
A. Haliassos
Adriana Fernandez-Lopez
Honglie Chen
Stavros Petridis
Maja Pantic
54
114
0
25 Mar 2023
AV-data2vec: Self-supervised Learning of Audio-Visual Speech
  Representations with Contextualized Target Representations
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian
Alexei Baevski
Wei-Ning Hsu
Michael Auli
SSL
84
34
0
10 Feb 2023
Audio-Visual Efficient Conformer for Robust Speech Recognition
Audio-Visual Efficient Conformer for Robust Speech Recognition
Maxime Burchi
Radu Timofte
VLM
40
35
0
04 Jan 2023
Efficient Self-supervised Learning with Contextualized Target
  Representations for Vision, Speech and Language
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Alexei Baevski
Arun Babu
Wei-Ning Hsu
Michael Auli
VLM
SSL
67
96
0
14 Dec 2022
Jointly Learning Visual and Auditory Speech Representations from Raw
  Data
Jointly Learning Visual and Auditory Speech Representations from Raw Data
A. Haliassos
Pingchuan Ma
Rodrigo Mira
Stavros Petridis
Maja Pantic
SSL
64
49
0
12 Dec 2022
Leveraging Modality-specific Representations for Audio-visual Speech
  Recognition via Reinforcement Learning
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Chen Chen
Yuchen Hu
Qiang Zhang
Heqing Zou
Beier Zhu
Eng Siong Chng
58
28
0
10 Dec 2022
Self-Supervised Audio-Visual Speech Representations Learning By
  Multimodal Self-Distillation
Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
Jing-Xuan Zhang
Genshun Wan
Zhenhua Ling
Jia Pan
Jianqing Gao
Cong Liu
SSL
42
13
0
06 Dec 2022
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
  Speech Representation Learning
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiu-shi Zhu
Long Zhou
Zi-Hua Zhang
Shujie Liu
Binxing Jiao
Jie Zhang
Lirong Dai
Daxin Jiang
Jinyu Li
Furu Wei
57
38
0
21 Nov 2022
Weighted Ensemble Self-Supervised Learning
Weighted Ensemble Self-Supervised Learning
Yangjun Ruan
Saurabh Singh
Warren Morningstar
Alexander A. Alemi
Sergey Ioffe
Ian S. Fischer
Joshua V. Dillon
FedML
58
15
0
18 Nov 2022
Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal
  and Multimodal Representations
Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations
Sijie Mai
Ying Zeng
Haifeng Hu
57
70
0
31 Oct 2022
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer
  to Unlabeled Modality
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Wei-Ning Hsu
Bowen Shi
SSL
VLM
75
43
0
14 Jul 2022
Visual Context-driven Audio Feature Enhancement for Robust End-to-End
  Audio-Visual Speech Recognition
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
Joanna Hong
Minsu Kim
Daehun Yoo
Y. Ro
39
21
0
13 Jul 2022
Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets
Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets
Kenny T. R. Voo
Liming Jiang
Chen Change Loy
CVBM
61
18
0
12 May 2022
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual
  Speech Recognition
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Xichen Pan
Peiyu Chen
Yichen Gong
Helong Zhou
Xinbing Wang
Zhouhan Lin
SSL
43
35
0
24 Feb 2022
data2vec: A General Framework for Self-supervised Learning in Speech,
  Vision and Language
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSL
VLM
ViT
89
852
0
07 Feb 2022
Self-supervised Learning with Random-projection Quantizer for Speech
  Recognition
Self-supervised Learning with Random-projection Quantizer for Speech Recognition
Chung-Cheng Chiu
James Qin
Yu Zhang
Jiahui Yu
Yonghui Wu
SSL
71
169
0
03 Feb 2022
Robust Self-Supervised Audio-Visual Speech Recognition
Robust Self-Supervised Audio-Visual Speech Recognition
Bowen Shi
Wei-Ning Hsu
Abdel-rahman Mohamed
51
93
0
05 Jan 2022
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster
  Prediction
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdel-rahman Mohamed
SSL
86
315
0
05 Jan 2022
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction
  and Lip Reading
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
Leyuan Qu
C. Weber
S. Wermter
47
23
0
09 Dec 2021
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
  Processing
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen
Chengyi Wang
Zhengyang Chen
Yu-Huan Wu
Shujie Liu
...
Yao Qian
Jian Wu
Micheal Zeng
Xiangzhan Yu
Furu Wei
SSL
206
1,846
0
26 Oct 2021
LiRA: Learning Visual Speech Representations from Audio through
  Self-supervision
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
Pingchuan Ma
Rodrigo Mira
Stavros Petridis
Björn W. Schuller
Maja Pantic
SSL
44
53
0
16 Jun 2021
HuBERT: Self-Supervised Speech Representation Learning by Masked
  Prediction of Hidden Units
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Wei-Ning Hsu
Benjamin Bolte
Yao-Hung Hubert Tsai
Kushal Lakhotia
Ruslan Salakhutdinov
Abdel-rahman Mohamed
SSL
147
2,939
0
14 Jun 2021
Emerging Properties in Self-Supervised Vision Transformers
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
611
6,029
0
29 Apr 2021
End-to-end Audio-visual Speech Recognition with Conformers
End-to-end Audio-visual Speech Recognition with Conformers
Pingchuan Ma
Stavros Petridis
Maja Pantic
110
231
0
12 Feb 2021
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
  Representations
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Alexei Baevski
Henry Zhou
Abdel-rahman Mohamed
Michael Auli
SSL
234
5,774
0
20 Jun 2020
Conformer: Convolution-augmented Transformer for Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati
James Qin
Chung-Cheng Chiu
Niki Parmar
Yu Zhang
...
Wei Han
Shibo Wang
Zhengdong Zhang
Yonghui Wu
Ruoming Pang
210
3,119
0
16 May 2020
Discriminative Multi-modality Speech Recognition
Discriminative Multi-modality Speech Recognition
Bo Xu
Cheng Lu
Yandong Guo
Jacob Wang
42
99
0
12 May 2020
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
Takaki Makino
H. Liao
Yannis Assael
Brendan Shillingford
Basi García
Otavio Braga
Olivier Siohan
59
129
0
08 Nov 2019
Deep Audio-Visual Speech Recognition
Deep Audio-Visual Speech Recognition
Triantafyllos Afouras
Joon Son Chung
A. Senior
Oriol Vinyals
Andrew Zisserman
69
701
0
06 Sep 2018
LRS3-TED: a large-scale dataset for visual speech recognition
LRS3-TED: a large-scale dataset for visual speech recognition
Triantafyllos Afouras
Joon Son Chung
Andrew Zisserman
62
439
0
03 Sep 2018
VoxCeleb2: Deep Speaker Recognition
VoxCeleb2: Deep Speaker Recognition
Joon Son Chung
Arsha Nagrani
Andrew Zisserman
348
2,274
0
14 Jun 2018
Phoneme-to-viseme mappings: the good, the bad, and the ugly
Phoneme-to-viseme mappings: the good, the bad, and the ugly
Helen L. Bear
R. Harvey
56
62
0
08 May 2018
11K Hands: Gender recognition and biometric identification using a large
  dataset of hand images
11K Hands: Gender recognition and biometric identification using a large dataset of hand images
Mahmoud Afifi
CVBM
45
166
0
12 Nov 2017
Attention Is All You Need
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
640
130,942
0
12 Jun 2017
Lip Reading Sentences in the Wild
Lip Reading Sentences in the Wild
Joon Son Chung
A. Senior
Oriol Vinyals
Andrew Zisserman
250
789
0
16 Nov 2016
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task
  Learning
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning
Suyoun Kim
Takaaki Hori
Shinji Watanabe
74
925
0
21 Sep 2016
MUSAN: A Music, Speech, and Noise Corpus
MUSAN: A Music, Speech, and Noise Corpus
David Snyder
Guoguo Chen
Daniel Povey
75
1,346
0
28 Oct 2015
Adam: A Method for Stochastic Optimization
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
1.5K
149,842
0
22 Dec 2014
Microsoft COCO: Common Objects in Context
Microsoft COCO: Common Objects in Context
Nayeon Lee
Michael Maire
Serge J. Belongie
Lubomir Bourdev
Ross B. Girshick
James Hays
Pietro Perona
Deva Ramanan
C. L. Zitnick
Piotr Dollár
ObjD
377
43,524
0
01 May 2014
1