2005.08100
Cited By
Conformer: Convolution-augmented Transformer for Speech Recognition
16 May 2020
Anmol Gulati
James Qin
Chung-Cheng Chiu
Niki Parmar
Yu Zhang
Jiahui Yu
Wei Han
Shibo Wang
Zhengdong Zhang
Yonghui Wu
Ruoming Pang
Papers citing
"Conformer: Convolution-augmented Transformer for Speech Recognition"
50 / 1,758 papers shown
Title
AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion
Wen-Chin Huang
Kazuhiro Kobayashi
Tomoki Toda
24
2
0
14 Sep 2023
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Yongqiang Wang
Jionghao Bai
Rongjie Huang
Ruiqi Li
Zhiqing Hong
Zhou Zhao
24
3
0
14 Sep 2023
Outlier-aware Inlier Modeling and Multi-scale Scoring for Anomalous Sound Detection via Multitask Learning
Yucong Zhang
Hongbin Suo
Yulong Wan
Ming Li
32
4
0
14 Sep 2023
CPPF: A contextual and post-processing-free model for automatic speech recognition
Lei Zhang
Zhengkun Tian
Xiang Chen
Jiaming Sun
Hongyu Xiang
Ke Ding
Guanglu Wan
39
0
0
14 Sep 2023
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Yifan Yang
Feiyu Shen
Chenpeng Du
Ziyang Ma
K. Yu
Daniel Povey
Xie Chen
43
26
0
14 Sep 2023
Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Zhengyang Chen
Bing Han
Shuai Wang
Yan-min Qian
33
18
0
13 Sep 2023
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
Anna Deichler
Shivam Mehta
Simon Alexanderson
Jonas Beskow
DiffM
25
24
0
11 Sep 2023
Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Jinzuomu Zhong
Yang Li
Hui Huang
Korin Richmond
Jie Liu
Zhiba Su
Jing Guo
Benlai Tang
Fengjie Zhu
23
1
0
11 Sep 2023
SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus
Haoxu Wang
Fan Yu
Xian Shi
Yuezhang Wang
Shiliang Zhang
Ming Li
37
11
0
11 Sep 2023
Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
T. Park
Kunal Dhawan
Nithin Rao Koluguri
Jagadeesh Balam
44
15
0
11 Sep 2023
Leveraging Large Language Models for Exploiting ASR Uncertainty
Pranay Dighe
Yi Su
Shangshang Zheng
Yunshu Liu
Vineet Garg
Xiaochuan Niu
Ahmed H. Tewfik
13
12
0
09 Sep 2023
Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Huaibo Zhao
Yosuke Higuchi
Yusuke Kida
Tetsuji Ogawa
Tetsunori Kobayashi
28
1
0
09 Sep 2023
End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining
Saksham Bassi
Giulio Duregon
Siddhartha Jalagam
David Roth
41
2
0
08 Sep 2023
Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems
Takuma Udagawa
Masayuki Suzuki
Gakuto Kurata
Masayasu Muraoka
G. Saon
46
2
0
07 Sep 2023
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023
Zhihang Xu
Shaofei Zhang
Xi Wang
Jiajun Zhang
Wenning Wei
Lei He
Sheng Zhao
23
2
0
06 Sep 2023
Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition
Patrick Eickhoff
M. Möller
Theresa Pekarek-Rosin
Johannes Twiefel
Stefan Wermter
28
2
0
05 Sep 2023
Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation
Jiaxu Zhu
Weinan Tong
Yaoxun Xu
Chang Song
Zhiyong Wu
Zhao You
Dan Su
Dong Yu
Helen M. Meng
32
0
0
04 Sep 2023
SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge
Jiaxu Zhu
Chang Song
Zhiyong Wu
Helen Meng
VLM
34
0
0
04 Sep 2023
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
Zhichao Wang
Xinsheng Wang
Qicong Xie
Tao Li
Linfu Xie
Qiao Tian
Yuping Wang
34
4
0
03 Sep 2023
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin
Tao Li
Chenxu Hu
Jian Cong
Xinfa Zhu
Jingbei Li
Qiao Tian
Yuping Wang
Linfu Xie
DiffM
57
8
0
02 Sep 2023
CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
Etienne Labbé
Thomas Pellegrini
J. Pinquier
30
12
0
01 Sep 2023
Remixing-based Unsupervised Source Separation from Scratch
Kohei Saijo
Tetsuji Ogawa
18
3
0
01 Sep 2023
RepCodec: A Speech Representation Codec for Speech Tokenization
Zhichao Huang
Chutong Meng
Tom Ko
22
25
0
31 Aug 2023
Improving vision-inspired keyword spotting using dynamic module skipping in streaming conformer encoder
Alexandre Bittar
Paul Dixon
Mohammad Samragh
K. Nishu
Devang Naik
33
3
0
31 Aug 2023
Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer
Kyuhong Shim
Jinkyu Lee
Simyoung Chang
Kyuwoong Hwang
47
2
0
31 Aug 2023
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
Ji-Hoon Kim
Jaehun Kim
Joon Son Chung
42
5
0
29 Aug 2023
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
S. Essid
Mirco Ravanelli
SSL
29
11
0
28 Aug 2023
Decoupled Structure for Improved Adaptability of End-to-End Models
Keqi Deng
P. Woodland
AuLLM
32
2
0
25 Aug 2023
TC-LIF: A Two-Compartment Spiking Neuron Model for Long-Term Sequential Modelling
Shimin Zhang
Qu Yang
Chenxiang Ma
Jibin Wu
Haizhou Li
Kay Chen Tan
35
16
0
25 Aug 2023
Exploiting Time-Frequency Conformers for Music Audio Enhancement
Yunkee Chae
Junghyun Koo
Sungho Lee
Kyogu Lee
40
3
0
24 Aug 2023
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury
Sreyan Ghosh
Subhrajyoti Dasgupta
Anton Ratnarajah
Utkarsh Tyagi
Tianyi Zhou
34
11
0
23 Aug 2023
KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
Antoine Nzeyimana
SSL
30
0
0
23 Aug 2023
Convoifilter: A case study of doing cocktail party speech recognition
Thai-Binh Nguyen
A. Waibel
25
2
0
22 Aug 2023
How Much Temporal Long-Term Context is Needed for Action Segmentation?
Emad Bahrami Rad
Gianpiero Francesca
Juergen Gall
ViT
32
27
0
22 Aug 2023
An Effective Transformer-based Contextual Model and Temporal Gate Pooling for Speaker Identification
Harunori Kawano
Sota Shimizu
30
1
0
22 Aug 2023
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
Jinchuan Tian
Jianwei Yu
Hangting Chen
Brian Yan
Chao Weng
Dong Yu
Shinji Watanabe
42
1
0
19 Aug 2023
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Ye-Xin Lu
Yang Ai
Zhenhua Ling
30
9
0
17 Aug 2023
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
Running Zhao
Jiang-Tao Luca Yu
Haiying Zhao
Edith C.H. Ngai
37
4
0
16 Aug 2023
Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability
Seokhyeon Ha
S. Jung
Jungwook Lee
27
3
0
15 Aug 2023
Improving CTC-AED model with integrated-CTC and auxiliary loss regularization
Daobin Zhu
Xiangdong Su
Hongbin Zhang
21
1
0
15 Aug 2023
O-1: Self-training with Oracle and 1-best Hypothesis
M. Baskar
Andrew Rosenberg
Bhuvana Ramabhadran
Kartik Audhkhasi
VLM
27
0
0
14 Aug 2023
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Shaan Bijwadia
Shuo-yiin Chang
Weiran Wang
Zhong Meng
Hao Zhang
Tara N. Sainath
24
1
0
14 Aug 2023
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Xiaofei Wang
Manthan Thakker
Zhuo Chen
Naoyuki Kanda
Sefik Emre Eskimez
Sanyuan Chen
M. Tang
Shujie Liu
Jinyu Li
Takuya Yoshioka
28
80
0
14 Aug 2023
Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition
Hanjing Zhu
Dongji Gao
Gaofeng Cheng
Daniel Povey
Pengyuan Zhang
Yonghong Yan
NoLa
40
4
0
12 Aug 2023
Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
K. Nishu
Minsik Cho
Paul Dixon
Devang Naik
37
13
0
12 Aug 2023
Improving Joint Speech-Text Representations Without Alignment
Cal Peyser
Zhong Meng
Ke Hu
Rohit Prabhavalkar
Andrew Rosenberg
Tara N. Sainath
M. Picheny
Kyunghyun Cho
VLM
33
4
0
11 Aug 2023
Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
Y. A. D. Djilali
Sanath Narayan
Haithem Boussaid
Ebtesam Almazrouei
Merouane Debbah
42
10
0
11 Aug 2023
Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model
Fan Zhang
Naye Ji
Fuxing Gao
Siyuan Zhao
Zhaohan Wang
Shunman Li
32
0
0
11 Aug 2023
Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio
Yang Zhang
Krishna C. Puvvada
Vitaly Lavrukhin
Boris Ginsburg
40
14
0
09 Aug 2023
Cross-view Semantic Alignment for Livestreaming Product Recognition
Wenjie Yang
Yiyi Chen
Yan Li
Yanhua Cheng
Xudong Liu
Quanming Chen
Han Li
34
2
0
09 Aug 2023