Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.07447
Cited By
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
14 June 2021
Wei-Ning Hsu
Benjamin Bolte
Yao-Hung Hubert Tsai
Kushal Lakhotia
Ruslan Salakhutdinov
Abdel-rahman Mohamed
SSL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units"
50 / 149 papers shown
Title
VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion
Joon-Seung Choi
Dong-Min Byun
Hyung-Seok Oh
Seong-Whan Lee
58
0
0
27 May 2025
SACM: SEEG-Audio Contrastive Matching for Chinese Speech Decoding
Hongbin Wang
Zhihong Jia
Yuanzhong Shen
Ziwei Wang
Siyang Li
Kai Shu
Feng Hu
Dongrui Wu
69
0
0
26 May 2025
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du
Changfeng Gao
Yuxuan Wang
Fan Yu
Tianyu Zhao
...
Mengzhe Chen
Yafeng Chen
Shiliang Zhang
Wen Wang
Jieping Ye
AuLLM
105
0
0
23 May 2025
An End-to-End Approach for Child Reading Assessment in the Xhosa Language
Sergio Chevtchenko
Nikhil Navas
Rafaella Vale
Franco Ubaudi
Sipumelele Lucwaba
Cally Ardington
Soheil Afshar
Mark Antoniou
Saeed Afshar
45
0
0
23 May 2025
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
Zifan Peng
Yule Liu
Zhen Sun
Mingchen Li
Zeren Luo
...
Xinlei He
Xuechao Wang
Yingjie Xue
Shengmin Xu
Xinyi Huang
AuLLM
AAML
86
1
0
23 May 2025
Audio-to-Audio Emotion Conversion With Pitch And Duration Style Transfer
Soumya Dutta
Avni Jain
Sriram Ganapathy
113
0
0
23 May 2025
EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion
Advait Joglekar
Divyanshu Singh
Rooshil Rohit Bhatia
S. Umesh
61
0
0
22 May 2025
X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance
Junbo Zhang
Heinrich Dinkel
Yadong Niu
Chenyu Liu
Si Cheng
Anbei Zhao
Jian Luan
147
0
0
22 May 2025
"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding
Alkis Koudounas
Claudio Savelli
Flavio Giobergia
Elena Baralis
MU
98
0
0
21 May 2025
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals
Yuxin Lin
Yinglin Zheng
Ming Zeng
Wangzheng Shi
76
0
0
19 May 2025
Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning
Kristin Qi
Jiali Cheng
Youxiang Zhu
Hadi Amiri
Xiaohui Liang
103
0
0
19 May 2025
Exploring the Potential of SSL Models for Sound Event Detection
Hanfang Cui
Longfei Song
Li Li
Dongxing Xu
Yanhua Long
72
0
0
17 May 2025
TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
Junyi Peng
Takanori Ashihara
Marc Delcroix
Tsubasa Ochiai
Oldrich Plchot
Shoko Araki
J. Černocký
ELM
94
0
0
10 May 2025
Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
Shigeki Karita
Yuma Koizumi
Heiga Zen
Haruko Ishikawa
Robin Scheibler
M. Bacchiani
VLM
372
1
0
07 May 2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
Hilde Kuehne
84
0
0
02 May 2025
Quantifying Source Speaker Leakage in One-to-One Voice Conversion
Scott Wellington
Xuechen Liu
Junichi Yamagishi
80
0
0
22 Apr 2025
StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models
Yeona Hong
Hyewon Han
Woo-Jin Chung
Hong-Goo Kang
MQ
95
0
0
21 Apr 2025
Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis
Radek Daněček
Carolin Schmitt
Senya Polikovsky
Michael J. Black
92
0
0
18 Apr 2025
Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion
Na Li
Chuke Wang
Yu Gu
Zhifeng Li
114
0
0
11 Apr 2025
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng
Yi-Chang Chen
Kuan-Yi Lee
Da-shan Shiu
Hung-yi Lee
AuLLM
123
0
0
09 Apr 2025
Exploring Local Interpretable Model-Agnostic Explanations for Speech Emotion Recognition with Distribution-Shift
Maja J. Hjuler
Line H. Clemmensen
Sneha Das
FAtt
110
1
0
07 Apr 2025
SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System
Hyeongju Kim
Jinhyeok Yang
Yechan Yu
Seunghun Ji
Jacob Morton
Frederik Bous
Joon Byun
Juheon Lee
128
0
0
29 Mar 2025
Dual Audio-Centric Modality Coupling for Talking Head Generation
Ao Fu
Ziqi Ni
Yi Zhou
116
1
0
26 Mar 2025
MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network
Vrushank Ahire
Kunal Shah
Mudasir Nazir Khan
Nikhil Pakhale
L. Sookha
M. A. Ganaie
Abhinav Dhall
114
0
0
16 Mar 2025
Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
Xue Jiang
Xiulian Peng
Yuan Zhang
Yan Lu
SSL
126
1
0
15 Mar 2025
FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis
Ziqi Ni
Ao Fu
Yi Zhou
165
0
0
06 Mar 2025
TimePFN: Effective Multivariate Time Series Forecasting with Synthetic Data
Ege Onur Taga
M. E. Ildiz
Samet Oymak
AI4TS
127
3
0
22 Feb 2025
A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond
Shreya Shukla
Jose Torres
Abhijit Mishra
Jacek Gwizdka
Shounak Roychowdhury
94
0
0
20 Feb 2025
Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon
Avishai Elmakies
Yossi Adi
69
3
0
19 Feb 2025
CR-CTC: Consistency regularization on CTC for improved speech recognition
Zengwei Yao
Wei Kang
Xiaoyu Yang
Fangjun Kuang
Liyong Guo
Han Zhu
Zengrui Jin
Zhaoqing Li
Long Lin
Daniel Povey
117
4
0
17 Feb 2025
AudioMiXR: Spatial Audio Object Manipulation with 6DoF for Sound Design in Augmented Reality
Brandon Woodard
Margarita Geleta
Joseph J. LaViola Jr.
Andrea Fanelli
Rhonda Wilson
117
4
0
05 Feb 2025
Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
Yassine El Kheir
Youness Samih
Suraj Maharjan
Tim Polzehl
Sebastian Möller
138
1
0
05 Feb 2025
Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models
Christopher Simic
Korbinian Riedhammer
Tobias Bocklet
140
0
0
03 Feb 2025
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
Sungnyun Kim
Sungwoo Cho
Sangmin Bae
Kangwook Jang
Se-Young Yun
SSL
124
1
0
23 Jan 2025
LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations
Soumya Dutta
Sriram Ganapathy
99
5
0
20 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
269
26
0
17 Jan 2025
USED: Universal Speaker Extraction and Diarization
Junyi Ao
Mehmet Sinan Yildirim
Ruijie Tao
Mengyao Ge
Shuai Wang
Yan-min Qian
Haizhou Li
73
6
0
17 Jan 2025
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning
Rajath Rao
Adithya Ganesan
Oscar Kjell
Jonah Luby
Akshay Raghavan
...
B. Luft
Camilo Ruggero
Neville Ryant
R. Kotov
H. Andrew Schwartz
98
0
0
15 Jan 2025
HAAQI-Net: A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids
Dyah A. M. G. Wisnu
Stefano Rini
Ryandhimas E. Zezario
Hsin-Min Wang
Yu Tsao
104
0
0
10 Jan 2025
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
Samir Sadok
Simon Leglaive
Laurent Girin
Gaël Richard
Xavier Alameda-Pineda
97
3
0
10 Jan 2025
LUPET: Incorporating Hierarchical Information Path into Multilingual ASR
Wei Liu
Jingyong Hou
Dong Yang
Muyong Cao
Tan Lee
122
1
0
10 Jan 2025
Spectral-Aware Low-Rank Adaptation for Speaker Verification
Zhe Li
Man-Wai Mak
Mert Pilanci
Hung-yi Lee
Helen Meng
95
0
0
07 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
86
0
0
03 Jan 2025
DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering
Ruohong Yang
Peng Hu
Xi Peng
Xiting Liu
Yunfan Li
102
0
0
25 Dec 2024
DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
Lei Yang
Shaoyang Xu
Deyi Xiong
78
0
0
25 Dec 2024
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
Chenyu Yang
Shuai Wang
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Yongjun Xu
Yizhi Zhou
Haina Zhu
Haoyang Li
KELM
373
2
0
18 Dec 2024
autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks
Simon Rampp
Andreas Triantafyllopoulos
M. Milling
Björn Schuller
256
0
0
16 Dec 2024
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
Prashanth Gurunath Shivakumar
Aditya Gourav
Yile Gu
Ankur Gandhe
Hung-yi Lee
I. Bulyko
97
9
0
04 Nov 2024
EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
Deok-Hyeon Cho
Hyung-Seok Oh
Seung-Bin Kim
Seong-Whan Lee
111
8
0
04 Nov 2024
Music Foundation Model as Generic Booster for Music Downstream Tasks
Weihsiang Liao
Yuhta Takida
Yukara Ikemiya
Zhi-Wei Zhong
Chieh-Hsin Lai
...
Stefan Uhlich
Taketo Akama
Woosung Choi
Yuichiro Koyama
Yuki Mitsufuji
196
1
0
02 Nov 2024
1
2
3
Next