ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2201.02184
  4. Cited By
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster
  Prediction

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

5 January 2022
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdel-rahman Mohamed
    SSL
ArXivPDFHTML

Papers citing "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction"

50 / 207 papers shown
Title
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Rongjie Huang
Huadai Liu
Xize Cheng
Yi Ren
Lin Li
...
Jinzheng He
Lichao Zhang
Jinglin Liu
Xiaoyue Yin
Zhou Zhao
78
8
0
24 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited
  Modalities
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
48
116
0
18 May 2023
Cross-Modal Global Interaction and Local Alignment for Audio-Visual
  Speech Recognition
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Yuchen Hu
Ruizhe Li
Chen Chen
Heqing Zou
Qiu-shi Zhu
Eng Siong Chng
34
7
0
16 May 2023
Deep Audio-Visual Singing Voice Transcription based on Self-Supervised
  Learning Models
Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models
Xiangming Gu
Weizhen Zeng
Jianan Zhang
Longshen Ou
Ye Wang
37
6
0
24 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
24
44
0
31 Mar 2023
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic
  Supervision
SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Xubo Liu
Egor Lakomkin
Konstantinos Vougioukas
Pingchuan Ma
Honglie Chen
...
Niko Moritz
J. Kolár
Stavros Petridis
M. Pantic
Christian Fuegen
52
19
0
30 Mar 2023
Seeing What You Said: Talking Face Generation Guided by a Lip Reading
  Expert
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
Jiadong Wang
Xinyuan Qian
Malu Zhang
R. Tan
Haizhou Li
EGVM
22
94
0
29 Mar 2023
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels
Pingchuan Ma
A. Haliassos
Adriana Fernandez-Lopez
Honglie Chen
Stavros Petridis
M. Pantic
27
107
0
25 Mar 2023
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture
  and Single-Source Speech
Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech
Maryam Fazel-Zarandi
Wei-Ning Hsu
SSL
24
8
0
20 Mar 2023
Learning Cross-lingual Visual Speech Representations
Learning Cross-lingual Visual Speech Representations
Andreas Zinonos
A. Haliassos
Pingchuan Ma
Stavros Petridis
M. Pantic
SSL
22
8
0
14 Mar 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup
  for Visual Speech Translation and Recognition
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
Xize Cheng
Lin Li
Tao Jin
Rongjie Huang
Wang Lin
Zehan Wang
Huangdai Liu
Yejin Wang
Aoxiong Yin
Zhou Zhao
26
24
0
09 Mar 2023
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition
  and Robust Speech-to-Text Translation
MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Mohamed Anwar
Bowen Shi
Vedanuj Goswami
Wei-Ning Hsu
J. Pino
Changhan Wang
47
37
0
01 Mar 2023
Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and
  English
Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English
Xiaoming Ren
Chao Li
Shenjian Wang
Biao Li
38
0
0
28 Feb 2023
Deep Visual Forced Alignment: Learning to Align Transcription with
  Talking Face Video
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
Minsu Kim
Chae Won Kim
Y. Ro
CVBM
DiffM
38
3
0
27 Feb 2023
Transformadores: Fundamentos teoricos y Aplicaciones
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
78
0
0
18 Feb 2023
Conformers are All You Need for Visual Speech Recognition
Conformers are All You Need for Visual Speech Recognition
Oscar Chang
H. Liao
Dmitriy Serdyuk
Ankit Parag Shah
Olivier Siohan
VLM
50
14
0
17 Feb 2023
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech
  Recognition
Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Minsu Kim
Hyungil Kim
Y. Ro
VLM
18
18
0
16 Feb 2023
AV-data2vec: Self-supervised Learning of Audio-Visual Speech
  Representations with Contextualized Target Representations
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian
Alexei Baevski
Wei-Ning Hsu
Michael Auli
SSL
40
34
0
10 Feb 2023
Multimodality Representation Learning: A Survey on Evolution,
  Pretraining and Its Applications
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
42
26
0
01 Feb 2023
A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech
  Recognition: the Arman-AV Dataset
A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset
J. Peymanfard
Samin Heydarian
Ali Lashini
Hossein Zeinali
Mohammad Reza Mohammadi
N. Mozayani
32
10
0
21 Jan 2023
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for
  Universal and Generalized Speech Enhancement
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement
Wei-Ning Hsu
Tal Remez
Bowen Shi
Jacob Donley
Yossi Adi
DiffM
27
12
0
21 Dec 2022
MAViL: Masked Audio-Video Learners
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
26
51
0
15 Dec 2022
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
34
73
0
15 Dec 2022
Jointly Learning Visual and Auditory Speech Representations from Raw
  Data
Jointly Learning Visual and Auditory Speech Representations from Raw Data
A. Haliassos
Pingchuan Ma
Rodrigo Mira
Stavros Petridis
M. Pantic
SSL
45
49
0
12 Dec 2022
Leveraging Modality-specific Representations for Audio-visual Speech
  Recognition via Reinforcement Learning
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Chen Chen
Yuchen Hu
Qiang Zhang
Heqing Zou
Beier Zhu
Eng Siong Chng
33
26
0
10 Dec 2022
Learning to Dub Movies via Hierarchical Prosody Models
Learning to Dub Movies via Hierarchical Prosody Models
Gaoxiang Cong
Liang Li
Yuankai Qi
Zhengjun Zha
Qi Wu
Wen-yu Wang
Bin Jiang
Ming Yang
Qin Huang
75
25
0
08 Dec 2022
iQuery: Instruments as Queries for Audio-Visual Sound Separation
iQuery: Instruments as Queries for Audio-Visual Sound Separation
Jiaben Chen
Renrui Zhang
Dongze Lian
Jiaqi Yang
Ziyao Zeng
Jianbo Shi
34
27
0
07 Dec 2022
Self-Supervised Audio-Visual Speech Representations Learning By
  Multimodal Self-Distillation
Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation
Jing-Xuan Zhang
Genshun Wan
Zhenhua Ling
Jia-Yu Pan
Jianqing Gao
Cong Liu
SSL
32
13
0
06 Dec 2022
FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video
  Deepfake Detection
FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection
Gil Knafo
Ohad Fried
31
5
0
01 Dec 2022
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
  Speech Representation Learning
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiu-shi Zhu
Long Zhou
Zi-Hua Zhang
Shujie Liu
Binxing Jiao
Jie Zhang
Lirong Dai
Daxin Jiang
Jinyu Li
Furu Wei
33
37
0
21 Nov 2022
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Rodrigo Mira
Buye Xu
Jacob Donley
Anurag Kumar
Stavros Petridis
V. Ithapu
M. Pantic
28
13
0
20 Nov 2022
Comparative layer-wise analysis of self-supervised speech models
Comparative layer-wise analysis of self-supervised speech models
Ankita Pasad
Bowen Shi
Karen Livescu
SSL
35
109
0
08 Nov 2022
Streaming Audio-Visual Speech Recognition with Alignment Regularization
Streaming Audio-Visual Speech Recognition with Alignment Regularization
Pingchuan Ma
Niko Moritz
Stavros Petridis
Christian Fuegen
M. Pantic
37
2
0
03 Nov 2022
Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal
  Self-Supervised Embeddings
Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings
Ethan Chern
Kuo-Hsuan Hung
Yi-Ting Chen
Tassadaq Hussain
M. Gogate
Amir Hussain
Yu Tsao
Jen-Cheng Hou
SSL
18
15
0
31 Oct 2022
Clean Text and Full-Body Transformer: Microsoft's Submission to the
  WMT22 Shared Task on Sign Language Translation
Clean Text and Full-Body Transformer: Microsoft's Submission to the WMT22 Shared Task on Sign Language Translation
S. Dey
Abhilash Pal
Cyrine Chaabani
Oscar Koller
SLR
23
5
0
24 Oct 2022
TVLT: Textless Vision-Language Transformer
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
51
28
0
28 Sep 2022
Relaxed Attention for Transformer Models
Relaxed Attention for Transformer Models
Timo Lohrenz
Björn Möller
Zhengyang Li
Tim Fingscheidt
KELM
29
11
0
20 Sep 2022
Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by
  Human Speech Perception
Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception
Jiadong Wang
Xinyuan Qian
Haizhou Li
41
14
0
05 Sep 2022
Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from
  Videos
Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
P. Filntisis
George Retsinas
Foivos Paraperas-Papantoniou
Athanasios Katsamanis
A. Roussos
Petros Maragos
3DH
26
29
0
22 Jul 2022
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer
  to Unlabeled Modality
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Wei-Ning Hsu
Bowen Shi
SSL
VLM
29
42
0
14 Jul 2022
MM-ALT: A Multimodal Automatic Lyric Transcription System
MM-ALT: A Multimodal Automatic Lyric Transcription System
Xiangming Gu
Longshen Ou
Danielle Ong
Ye Wang
19
13
0
13 Jul 2022
Dual-Path Cross-Modal Attention for better Audio-Visual Speech
  Extraction
Dual-Path Cross-Modal Attention for better Audio-Visual Speech Extraction
Zhongweiyang Xu
Xulin Fan
M. Hasegawa-Johnson
19
2
0
09 Jul 2022
Multimodal Learning with Transformers: A Survey
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
72
530
0
13 Jun 2022
Is Lip Region-of-Interest Sufficient for Lipreading?
Is Lip Region-of-Interest Sufficient for Lipreading?
Jing-Xuan Zhang
Genshun Wan
Jia-Yu Pan
24
6
0
28 May 2022
Self-Supervised Speech Representation Learning: A Review
Self-Supervised Speech Representation Learning: A Review
Abdel-rahman Mohamed
Hung-yi Lee
Lasse Borgholt
Jakob Drachmann Havtorn
Joakim Edin
...
Shang-Wen Li
Karen Livescu
Lars Maaløe
Tara N. Sainath
Shinji Watanabe
SSL
AI4TS
137
352
0
21 May 2022
Content-Context Factorized Representations for Automated Speech
  Recognition
Content-Context Factorized Representations for Automated Speech Recognition
David M. Chan
Shalini Ghosh
36
11
0
19 May 2022
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
Bowen Shi
Abdel-rahman Mohamed
Wei-Ning Hsu
SSL
28
17
0
15 May 2022
SVTS: Scalable Video-to-Speech Synthesis
SVTS: Scalable Video-to-Speech Synthesis
Rodrigo Mira
A. Haliassos
Stavros Petridis
Björn W. Schuller
M. Pantic
22
32
0
04 May 2022
More to Less (M2L): Enhanced Health Recognition in the Wild with Reduced
  Modality of Wearable Sensors
More to Less (M2L): Enhanced Health Recognition in the Wild with Reduced Modality of Wearable Sensors
Huiyuan Yang
Han Yu
K. Sridhar
T. Vaessen
I. Myin‐Germeys
Akane Sano
23
7
0
16 Feb 2022
Learning Contextually Fused Audio-visual Representations for
  Audio-visual Speech Recognition
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
Zitian Zhang
Jie Zhang
Jian-Shu Zhang
Ming Wu
Xin Fang
Lirong Dai
SSL
41
10
0
15 Feb 2022
Previous
12345
Next