wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

20 June 2020
Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli · SSL
ArXiv · PDF · HTML

Papers citing "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations"

Showing 50 of 187 citing papers.

  • An End-to-End Approach for Child Reading Assessment in the Xhosa Language
    Sergio Chevtchenko, Nikhil Navas, Rafaella Vale, Franco Ubaudi, Sipumelele Lucwaba, Cally Ardington, Soheil Afshar, Mark Antoniou, Saeed Afshar · 23 May 2025
  • Audio-to-Audio Emotion Conversion With Pitch And Duration Style Transfer
    Soumya Dutta, Avni Jain, Sriram Ganapathy · 23 May 2025
  • X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance
    Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan · 22 May 2025
  • LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
    Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen · 22 May 2025
  • Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection
    Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, ..., Rian Bogley, Lisa Wauters, Zachary Miller, M. G. Tempini, Gopala Anumanchipalli · 22 May 2025
  • "Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding
    Alkis Koudounas, Claudio Savelli, Flavio Giobergia, Elena Baralis · 21 May 2025 · MU
  • Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals
    Yuxin Lin, Yinglin Zheng, Ming Zeng, Wangzheng Shi · 19 May 2025
  • Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning
    Kristin Qi, Jiali Cheng, Youxiang Zhu, Hadi Amiri, Xiaohui Liang · 19 May 2025
  • Exploring the Potential of SSL Models for Sound Event Detection
    Hanfang Cui, Longfei Song, Li Li, Dongxing Xu, Yanhua Long · 17 May 2025
  • Multi-Stage Speaker Diarization for Noisy Classrooms
    Ali Sartaz Khan, Tolulope Ogunremi, Ahmed Adel Attia, Dorottya Demszky · 16 May 2025
  • Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
    Biel Tura Vecino, Adam Gabryś, Daniel Mątwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba · 12 May 2025
  • TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
    Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, J. Černocký · 10 May 2025 · ELM
  • Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
    Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, M. Bacchiani · 07 May 2025 · VLM
  • OT-Talk: Animating 3D Talking Head with Optimal Transportation
    Xinmu Wang, Xiang Gao, Xiyun Song, Heather Yu, Zongfang Lin, Liang Peng, Xianfeng Gu · 03 May 2025
  • Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
    Dongxing Yu · 03 May 2025
  • StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models
    Yeona Hong, Hyewon Han, Woo-Jin Chung, Hong-Goo Kang · 21 Apr 2025 · MQ
  • Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis
    Radek Daněček, Carolin Schmitt, Senya Polikovsky, Michael J. Black · 18 Apr 2025
  • Can Masked Autoencoders Also Listen to Birds?
    Lukas Rauch, Ilyass Moummad, René Heinrich, Alexis Joly, Bernhard Sick, Christoph Scholz · 17 Apr 2025
  • Dysarthria Normalization via Local Lie Group Transformations for Robust ASR
    Mikhail Osipov · 16 Apr 2025
  • TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
    Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-shan Shiu, Hung-yi Lee · 09 Apr 2025 · AuLLM
  • Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift
    Maja J. Hjuler, Line H. Clemmensen, Sneha Das · 07 Apr 2025 · FAtt
  • A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
    Shuyu Li, Shulei Ji, Zihao Wang, Songruoyao Wu, Jiaxing Yu, Kai Zhang · 01 Apr 2025 · MGen, VGen
  • Exploring In-Context Learning Capabilities of ChatGPT for Pathological Speech Detection
    Mahdi Amiri, Hatef Otroshi Shahreza, Ina Kodrasi · 31 Mar 2025
  • Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
    Hyunjong Ok, Suho Yoo, Jaeho Lee · 30 Mar 2025
  • STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
    Zijun Ding, Mingdie Xiong, Congcong Zhu, Jingrun Chen · 29 Mar 2025 · DiffM
  • Dual Audio-Centric Modality Coupling for Talking Head Generation
    Ao Fu, Ziqi Ni, Yi Zhou · 26 Mar 2025
  • MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
    Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, Nikhil Pakhale, L. Sookha, M. A. Ganaie, Abhinav Dhall · 16 Mar 2025
  • Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
    Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu · 15 Mar 2025 · SSL
  • Lightweight Models for Emotional Analysis in Video
    Quoc-Tien Nguyen, H. Nguyen, V. Huynh · 13 Mar 2025
  • Heterogeneous bimodal attention fusion for speech emotion recognition
    Jiachen Luo, Huy Phan, Lin Wang, Joshua Reiss · 09 Mar 2025
  • Bimodal Connection Attention Fusion for Speech Emotion Recognition
    Jiachen Luo, Huy Phan, Lin Wang, Joshua D. Reiss · 08 Mar 2025
  • DGFM: Full Body Dance Generation Driven by Music Foundation Models
    Xinran Liu, Zhenhua Feng, Diptesh Kanojia, Wenwu Wang · 27 Feb 2025 · DiffM
  • TimePFN: Effective Multivariate Time Series Forecasting with Synthetic Data
    Ege Onur Taga, M. E. Ildiz, Samet Oymak · 22 Feb 2025 · AI4TS
  • NEAR: A Training-Free Pre-Estimator of Machine Learning Model Performance
    Raphael T. Husistein, Markus Reiher, Marco Eckhoff · 20 Feb 2025
  • FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
    Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang · 19 Feb 2025 · AuLLM
  • Learn2Mix: Training Neural Networks Using Adaptive Data Integration
    Shyam Venkatasubramanian, Vahid Tarokh · 17 Feb 2025
  • CR-CTC: Consistency regularization on CTC for improved speech recognition
    Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey · 17 Feb 2025
  • Conformal Prediction Sets Can Cause Disparate Impact
    Jesse C. Cresswell, Bhargava Kumar, Yi Sui, Mouloud Belbahri · 17 Feb 2025 · FaML
  • DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
    Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, M. Yang · 16 Feb 2025 · Mamba, AuLLM
  • On the Promise for Assurance of Differentiable Neurosymbolic Reasoning Paradigms
    Luke E. Richards, Jessie Yaros, Jasen Babcock, Coung Ly, Robin Cosbey, Timothy Doster, Cynthia Matuszek · 13 Feb 2025 · NAI
  • Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content
    Girish A. Koushik, Diptesh Kanojia, Helen Treharne · 11 Feb 2025
  • Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
    Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller · 05 Feb 2025
  • Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models
    Christopher Simic, Korbinian Riedhammer, Tobias Bocklet · 03 Feb 2025
  • OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
    Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang · 03 Feb 2025 · DiffM, VGen
  • Summary of the NOTSOFAR-1 Challenge: Highlights and Learnings
    Igor Abramovski, Alon Vinnikov, Shalev Shaer, Naoyuki Kanda, Xiaofei Wang, Amir Ivry, Eyal Krupka · 28 Jan 2025
  • Boli: A dataset for understanding stuttering experience and analyzing stuttered speech
    Ashita Batra, Mannas Narang, Neeraj Kumar Sharma, Pradip K Das · 27 Jan 2025
  • Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
    Shuqi Dai, Yunyun Wang, Roger B. Dannenberg, Zeyu Jin · 23 Jan 2025 · DiffM
  • Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
    Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun · 23 Jan 2025 · SSL
  • LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
    Soumya Dutta, Sriram Ganapathy · 20 Jan 2025
  • How Redundant Is the Transformer Stack in Speech Representation Models?
    Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen · 20 Jan 2025