WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

26 October 2021
Sanyuan Chen
Chengyi Wang
Zhengyang Chen
Yu-Huan Wu
Shujie Liu
Zhuo Chen
Jinyu Li
Naoyuki Kanda
Takuya Yoshioka
Xiong Xiao
Jian Wu
Long Zhou
Shuo Ren
Y. Qian
Yao Qian
Jian Wu
Michael Zeng
Xiangzhan Yu
Furu Wei
    SSL

Papers citing "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing"

50 / 1,036 papers shown
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation
Jiayu Du
Jinpeng Li
Guoguo Chen
Wei-Qiang Zhang
ELM
37
3
0
13 Mar 2024
SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations
Amit Meghanani
Thomas Hain
41
3
0
10 Mar 2024
A robust audio deepfake detection system via multi-view feature
Yujie Yang
Haochen Qin
Hang Zhou
Chengcheng Wang
Tianyu Guo
Kai Han
Yunhe Wang
40
28
0
04 Mar 2024
IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
Tahir Javed
J. Nawale
E. George
Sakshi Joshi
Kaushal Bhogale
...
M. ManickamK
C. V. Vaijayanthi
Krishnan Srinivasa Raghavan Karunganni
Pratyush Kumar
Mitesh M Khapra
41
16
0
04 Mar 2024
A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement
Ravi Shankar
Ke Tan
Buye Xu
Anurag Kumar
36
0
0
03 Mar 2024
Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow Models
Neta Shaul
Uriel Singer
Ricky T. Q. Chen
Matt Le
Ali K. Thabet
Albert Pumarola
Y. Lipman
DiffM
44
4
0
02 Mar 2024
Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification
Mufan Sang
John H. L. Hansen
49
6
0
01 Mar 2024
Compact Speech Translation Models via Discrete Speech Units Pretraining
Tsz Kin Lam
Alexandra Birch
Barry Haddow
61
2
0
29 Feb 2024
Experimental Study: Enhancing Voice Spoofing Detection Models with wav2vec 2.0
Taein Kang
Soyul Han
Sunmook Choi
Jaejin Seo
Sanghyeok Chung
Seungeun Lee
Seungsang Oh
Il-Youp Kwak
41
8
0
27 Feb 2024
SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning
Luca Zampierin
G. B. Hacene
Bac Nguyen
Mirco Ravanelli
46
2
0
26 Feb 2024
The Effect of Batch Size on Contrastive Self-Supervised Speech Representation Learning
Nik Vaessen
David A. van Leeuwen
35
3
0
21 Feb 2024
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
Haibin Wu
Ho-Lam Chung
Yi-Cheng Lin
Yuan-Kuei Wu
Xuanjun Chen
Yu-Chi Pai
Hsiu-Hsuan Wang
Kai-Wei Chang
Alexander H. Liu
Hung-yi Lee
55
19
0
20 Feb 2024
EMO-SUPERB: An In-depth Look at Speech Emotion Recognition
Haibin Wu
Huang-Cheng Chou
Kai-Wei Chang
Lucas Goncalves
Jiawei Du
Jyh-Shing Roger Jang
Chi-Chun Lee
Hung-Yi Lee
36
11
0
20 Feb 2024
Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation
Wen Wu
Bo-wen Li
C. Zhang
Chung-Cheng Chiu
Qiujia Li
Junwen Bai
Tara N. Sainath
P. Woodland
35
2
0
20 Feb 2024
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
Gaoxiang Cong
Yuankai Qi
Liang-Sheng Li
Amin Beheshti
Zhedong Zhang
Anton Van Den Hengel
Ming-Hsuan Yang
Chenggang Yan
Qingming Huang
46
12
0
20 Feb 2024
Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models
Shengpeng Ji
Minghui Fang
Ziyue Jiang
Siqi Zheng
Qian Chen
Rongjie Huang
Jialung Zuo
Shulei Wang
Zhou Zhao
AuLLM
39
16
0
19 Feb 2024
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Junyi Peng
Marc Delcroix
Tsubasa Ochiai
Oldrich Plchot
Shoko Araki
J. Černocký
42
8
0
17 Feb 2024
Probing Self-supervised Learning Models with Target Speech Extraction
Junyi Peng
Marc Delcroix
Tsubasa Ochiai
Oldrich Plchot
Takanori Ashihara
Shoko Araki
J. Černocký
40
2
0
17 Feb 2024
When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection
Xiangyu Zhang
Hexin Liu
Kaishuai Xu
Qiquan Zhang
Daijiao Liu
Beena Ahmed
Julien Epps
28
8
0
17 Feb 2024
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
Shengpeng Ji
Ziyue Jiang
Hanting Wang
Jia-li Zuo
Zhou Zhao
40
10
0
14 Feb 2024
UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models
Ruchao Fan
Natarajan Balaji Shankar
Abeer Alwan
41
0
0
14 Feb 2024
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Ziyang Ma
Guanrou Yang
Yifan Yang
Zhifu Gao
Jiaming Wang
...
Fan Yu
Qian Chen
Siqi Zheng
Shiliang Zhang
Xie Chen
AuLLM
49
41
0
13 Feb 2024
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Lajszczak
Guillermo Cámbara
Yang Li
Fatih Beyhan
Arent van Korlaar
...
Bartosz Putrycz
Soledad López Gambino
Kayeon Yoo
Elena Sokolova
Thomas Drugman
LM&MA
38
75
0
12 Feb 2024
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Naoyuki Kanda
Xiaofei Wang
Sefik Emre Eskimez
Manthan Thakker
Hemin Yang
...
Yufei Xia
Jinzhu Li
Yanqing Liu
Sheng Zhao
Michael Zeng
35
8
0
12 Feb 2024
SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
Hsuan-Fu Wang
Yi-Jen Shih
Heng-Jui Chang
Layne Berry
Puyuan Peng
Hung-yi Lee
Hsin-Min Wang
David Harwath
VLM
51
2
0
10 Feb 2024
SpiRit-LM: Interleaved Spoken and Written Language Model
Tu Nguyen
Benjamin Muller
Bokai Yu
Marta R. Costa-jussá
Maha Elbayad
...
Itai Gat
Gabriel Synnaeve
Juan Pino
Benoît Sagot
Emmanuel Dupoux
AuLLM
VLM
56
34
0
08 Feb 2024
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Liang-Hsuan Tseng
En-Pei Hu
Cheng-Han Chiang
Yuan Tseng
Hung-yi Lee
Lin-shan Lee
Shao-Hua Sun
61
1
0
06 Feb 2024
Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations
Álvaro Martín-Cortinas
Daniel Sáez-Trigueros
Iván Vallés-Pérez
Biel Tura Vecino
Piotr Bilinski
Mateusz Lajszczak
Grzegorz Beringer
Roberto Barra-Chicote
Jaime Lorenzo-Trueba
21
5
0
05 Feb 2024
Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?
Orchid Chetia Phukan
Gautam Siddharth Kashyap
Arun Balaji Buduru
Rajesh Sharma
29
0
0
02 Feb 2024
Low-Resource Cross-Domain Singing Voice Synthesis via Reduced Self-Supervised Speech Representations
Panos Kakoulidis
Nikolaos Ellinas
G. Vamvoukakis
Myrsini Christidou
Alexandra Vioni
...
Junkwang Oh
Gunu Jho
Inchul Hwang
Pirros Tsiakoulis
Aimilios Chalamandaris
28
1
0
02 Feb 2024
On the Transferability of Large-Scale Self-Supervision to Few-Shot Audio Classification
Calum Heggan
S. Budgett
Timothy M. Hospedales
Mehrdad Yaghoobi
SSL
26
1
0
02 Feb 2024
STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition
Yi Chang
Zhao Ren
Zixing Zhang
Xin Jing
Kun Qian
Xi Shao
Bin Hu
Tanja Schultz
Björn W. Schuller
AAML
38
4
0
02 Feb 2024
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Zakaria Aldeneh
Takuya Higuchi
Jee-weon Jung
Skyler Seto
Tatiana Likhomanenko
Stephen Shum
Ahmed Hussen Abdelaziz
Shinji Watanabe
B. Theobald
SSL
34
2
0
01 Feb 2024
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Takanori Ashihara
Marc Delcroix
Takafumi Moriya
Kohei Matsuura
Taichi Asami
Yusuke Ijima
SSL
24
7
0
31 Jan 2024
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Jee-weon Jung
Wangyou Zhang
Jiatong Shi
Zakaria Aldeneh
Takuya Higuchi
B. Theobald
Ahmed Hussen Abdelaziz
Shinji Watanabe
81
21
0
30 Jan 2024
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
Takaaki Saeki
Soumi Maiti
Shinnosuke Takamichi
Shinji Watanabe
Hiroshi Saruwatari
30
14
0
30 Jan 2024
Speech foundation models on intelligibility prediction for hearing-impaired listeners
Santiago Cuervo
R. Marxer
38
6
0
24 Jan 2024
MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction
Jiajun He
Xiaohan Shi
Xingfeng Li
T. Toda
45
13
0
24 Jan 2024
Towards Hierarchical Spoken Language Dysfluency Modeling
Jiachen Lian
Gopala Anumanchipalli
32
9
0
18 Jan 2024
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Minsu Kim
Jeong Hun Yeo
Se Jin Park
J. Choi
Y. Ro
27
5
0
18 Jan 2024
Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective
Alexander H. Liu
Sung-Lin Yeh
James R. Glass
SSL
27
3
0
16 Jan 2024
An Explainable Proxy Model for Multilabel Audio Segmentation
Théo Mariotte
Antonio Almudévar
Marie Tahon
Alfonso Ortega Giménez
34
1
0
16 Jan 2024
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
Haobin Tang
Xulong Zhang
Ning Cheng
Jing Xiao
Jianzong Wang
28
12
0
16 Jan 2024
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
Yimin Deng
Huaizhen Tang
Xulong Zhang
Ning Cheng
Jing Xiao
Jianzong Wang
DRL
36
1
0
16 Jan 2024
DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment
Hyoung-Seok Oh
Sang-Hoon Lee
Deok-Hyun Cho
Seong-Whan Lee
52
2
0
16 Jan 2024
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Ya-Zhen Song
Zhuo Chen
Xiaofei Wang
Ziyang Ma
Xie Chen
AuLLM
21
37
0
14 Jan 2024
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Guoying Zhao
Zheng Lian
Bin Liu
Jianhua Tao
53
29
0
11 Jan 2024
Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
Kenichi Fujita
Hiroshi Sato
Takanori Ashihara
Hiroki Kanagawa
Marc Delcroix
Takafumi Moriya
Yusuke Ijima
41
8
0
10 Jan 2024
Singer Identity Representation Learning using Self-Supervised Techniques
Bernardo Torres
Stefan Lattner
Gaël Richard
SSL
43
9
0
10 Jan 2024
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen
Yuzhe Liang
Ziyang Ma
Zhisheng Zheng
Xie Chen
ViT
54
18
0
07 Jan 2024