ResearchTrend.AI
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

26 October 2021
Sanyuan Chen
Chengyi Wang
Zhengyang Chen
Yu-Huan Wu
Shujie Liu
Zhuo Chen
Jinyu Li
Naoyuki Kanda
Takuya Yoshioka
Xiong Xiao
Jian Wu
Long Zhou
Shuo Ren
Y. Qian
Yao Qian
Jian Wu
Michael Zeng
Xiangzhan Yu
Furu Wei
    SSL

Papers citing "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing"

50 / 1,036 papers shown
Title
Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness
Sicheng Yang
Zunnan Xu
Haiwei Xue
Yongkang Cheng
Shaoli Huang
Biwei Huang
Zhiyong Wu
DiffM
VGen
39
11
0
07 Jan 2024
Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation
Qiu-shi Zhu
Jie Zhang
Yu Gu
Yuli Hu
Lirong Dai
SSL
43
11
0
07 Jan 2024
MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition
Zheng Lian
Guoying Zhao
Yong Ren
Hao Gu
Haiyang Sun
Lan Chen
Bin Liu
Jianhua Tao
26
12
0
07 Jan 2024
StreamVC: Real-Time Low-Latency Voice Conversion
Yang Yang
Y. Kartynnik
Yunpeng Li
Jiuqiang Tang
Xing Li
George Sung
Matthias Grundmann
30
12
0
05 Jan 2024
Pheme: Efficient and Conversational Speech Generation
Paweł Budzianowski
Taras Sereda
Tomasz Cichy
Ivan Vulić
32
7
0
05 Jan 2024
Self-supervised Reflective Learning through Self-distillation and Online Clustering for Speaker Representation Learning
Danwei Cai
Zexin Cai
Ming Li
35
0
0
03 Jan 2024
Efficient Parallel Audio Generation using Group Masked Language Modeling
Myeonghun Jeong
Minchan Kim
Joun Yeop Lee
Nam Soo Kim
30
5
0
02 Jan 2024
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation
Huimeng Wang
Zengrui Jin
Mengzhe Geng
Shujie Hu
Guinan Li
Tianzi Wang
Haoning Xu
Xunying Liu
19
10
0
01 Jan 2024
Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
Chih-Kai Yang
Kuan-Po Huang
Ke-Han Lu
Chun-Yi Kuan
Chi-Yuan Hsiao
Hung-yi Lee
48
7
0
30 Dec 2023
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Hong-ping Hao
Long Zhou
Shujie Liu
Jinyu Li
Shujie Hu
Rui Wang
Furu Wei
34
18
0
30 Dec 2023
Self-supervised Pretraining for Decision Foundation Model: Formulation, Pipeline and Challenges
Xiaoqian Liu
Jianbin Jiao
Junge Zhang
OffRL
LRM
46
2
0
29 Dec 2023
Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions
H. S. Bovbjerg
Jesper Jensen
Jan Østergaard
Zheng-Hua Tan
VLM
27
3
0
27 Dec 2023
Frame-level emotional state alignment method for speech emotion recognition
Qifei Li
Yingming Gao
Cong Wang
Yayue Deng
Jinlong Xue
Yichen Han
Ya Li
28
2
0
27 Dec 2023
Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition
Chengxin Chen
Pengyuan Zhang
38
5
0
26 Dec 2023
Audiobox: Unified Audio Generation with Natural Language Prompts
Apoorv Vyas
Bowen Shi
Matt Le
Andros Tjandra
Yi-Chiao Wu
...
Chris Summers
Carleigh Wood
Joshua Lane
Mary Williamson
Wei-Ning Hsu
60
77
0
25 Dec 2023
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Ziyang Ma
Zhisheng Zheng
Jiaxin Ye
Jinchao Li
Zhifu Gao
Shiliang Zhang
Xie Chen
MDE
SLR
SSL
25
88
0
23 Dec 2023
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
Cheng Gong
Xin Wang
Erica Cooper
Dan Wells
Longbiao Wang
Jianwu Dang
Korin Richmond
Junichi Yamagishi
31
21
0
22 Dec 2023
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Jiachen Lian
Carly Feng
Naasir Farooqi
Steve Li
Anshul Kashyap
Cheol Jun Cho
Peter Wu
Robin Netzorg
Tingle Li
Gopala Krishna Anumanchipalli
45
13
0
20 Dec 2023
Noise robust distillation of self-supervised speech models via correlation metrics
Fabian Ritter-Gutierrez
Kuan-Po Huang
Dianwen Ng
Jeremy H. M. Wong
Hung-yi Lee
Chng Eng Siong
Nancy F. Chen
24
2
0
19 Dec 2023
Efficiency-oriented approaches for self-supervised speech representation learning
Luis Lugo
Valentin Vielzeuf
SSL
31
1
0
18 Dec 2023
A Survey of Reasoning with Foundation Models
Jiankai Sun
Chuanyang Zheng
E. Xie
Zhengying Liu
Ruihang Chu
...
Xipeng Qiu
Yi-Chen Guo
Hui Xiong
Qun Liu
Zhenguo Li
ReLM
LRM
AI4CE
30
76
0
17 Dec 2023
Seq2seq for Automatic Paraphasia Detection in Aphasic Speech
M. Perez
Duc Le
Amrit Romana
Elise Jones
Keli Licata
E. Provost
28
2
0
16 Dec 2023
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Xueyao Zhang
Liumeng Xue
Yicheng Gu
Yuancheng Wang
Haorui He
...
Mingxuan Wang
Jun Han
Kai Chen
Haizhou Li
Zhizheng Wu
31
28
0
15 Dec 2023
Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies
Bingshen Mu
Pengcheng Guo
Dake Guo
Pan Zhou
Wei Chen
Lei Xie
38
2
0
15 Dec 2023
Fine-Tuned Self-Supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech
Geoffrey T. Frost
Emily Morris
Joshua Jansen van Vüren
T. Niesler
30
2
0
15 Dec 2023
FastInject: Injecting Unpaired Text Data into CTC-based ASR training
Keqi Deng
Phil Woodland
18
2
0
14 Dec 2023
STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models
Kangwook Jang
Sungnyun Kim
Hoi-Rim Kim
36
1
0
14 Dec 2023
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion
Binzhu Sha
Xu Li
Zhiyong Wu
Yin Shan
Helen M. Meng
23
7
0
08 Dec 2023
DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors
Federico Landini
Mireia Díez
Themos Stafylakis
Lukáš Burget
31
11
0
07 Dec 2023
Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization
Huan Zhao
Li Zhang
Yuehong Li
Yannan Wang
Hongji Wang
Wei Rao
Qing Wang
Lei Xie
10
0
0
07 Dec 2023
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
J. Choi
Se Jin Park
Minsu Kim
Y. Ro
37
12
0
05 Dec 2023
Bigger is not Always Better: The Effect of Context Size on Speech Pre-Training
Sean Robertson
Ewan Dunbar
SSL
30
1
0
03 Dec 2023
FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition
Dongning Yang
Wei Wang
Yanmin Qian
13
3
0
29 Nov 2023
Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes
Pavel Korshunov
Haolin Chen
Philip N. Garner
Sébastien Marcel
CVBM
48
4
0
29 Nov 2023
SpeechAct: Towards Generating Whole-body Motion from Speech
Jinsong Zhang
Minjie Zhu
Yuxiang Zhang
Yebin Liu
Kun Li
36
0
0
29 Nov 2023
StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models
Kazuki Yamauchi
Yusuke Ijima
Yuki Saito
35
8
0
28 Nov 2023
A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors
Shuyue Stella Li
Beining Xu
Xiangyu Zhang
Hexin Liu
Wen-Han Chao
Leibny Paola García
SSL
37
4
0
27 Nov 2023
Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice
Yixue Lin
Wen-Hsuan Tseng
Lichin Chen
Ching-Ting Tan
Yu Tsao
9
0
0
27 Nov 2023
ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis
Jungil Kong
Junmo Lee
Jeongmin Kim
Beomjeong Kim
Jihoon Park
Dohee Kong
Changheon Lee
Sangjin Kim
25
1
0
20 Nov 2023
R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces
Heng-Jui Chang
James R. Glass
38
3
0
15 Nov 2023
Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
Hsin-Tien Chiang
Szu-Wei Fu
Hsin-Min Wang
Yu Tsao
John H. L. Hansen
38
2
0
15 Nov 2023
Multi-channel Conversational Speaker Separation via Neural Diarization
H. Taherian
DeLiang Wang
BDL
39
16
0
15 Nov 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu
Jin Xu
Xiaohuan Zhou
Qian Yang
Shiliang Zhang
Zhijie Yan
Chang Zhou
Jingren Zhou
AuLLM
42
274
0
14 Nov 2023
On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition
Xiaohan Shi
Jiajun He
Xingfeng Li
T. Toda
34
4
0
13 Nov 2023
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Fatema Hasan
Yulong Li
James R. Foulds
Shimei Pan
Bishwaranjan Bhattacharjee
31
2
0
13 Nov 2023
Exploring Emotion Expression Recognition in Older Adults Interacting with a Virtual Coach
Cristina Palmero
Mikel de Velasco
Mohamed Amine Hmani
Aymen Mtibaa
Leila Ben Letaifa
...
Anna Esposito
M. El-Yacoubi
Dijana Petrovska-Delacrétaz
M. Inés Torres
Sergio Escalera
21
5
0
09 Nov 2023
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Qian Chen
Wen Wang
Qinglin Zhang
Siqi Zheng
Shiliang Zhang
Chong Deng
Yukun Ma
Hai Yu
Jiaqing Liu
Chong Zhang
21
8
0
08 Nov 2023
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
Yuhao Zhang
Chen Xu
Bei Li
Hao Chen
Tong Xiao
Chunliang Zhang
Jingbo Zhu
26
5
0
07 Nov 2023
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
Sungho Jeon
Ching-Feng Yeh
Hakan Inan
Wei-Ning Hsu
Rashi Rungta
Yashar Mehdad
Daniel M. Bikel
33
0
0
05 Nov 2023
Is one brick enough to break the wall of spoken dialogue state tracking?
Lucas Druart
Valentin Vielzeuf
Yannick Esteve
48
0
0
03 Nov 2023