ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.11607
  4. Cited By
Transformers in Speech Processing: A Survey

Transformers in Speech Processing: A Survey

21 March 2023
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
ArXivPDFHTML

Papers citing "Transformers in Speech Processing: A Survey"

50 / 235 papers shown
Title
ViLT: Vision-and-Language Transformer Without Convolution or Region
  Supervision
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
112
1,739
0
05 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal
  Transformers
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
96
114
0
31 Jan 2021
UniSpeech: Unified Speech Representation Learning with Labeled and
  Unlabeled Data
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
Chengyi Wang
Yu-Huan Wu
Yao Qian
K. Kumatani
Shujie Liu
Furu Wei
Michael Zeng
Xuedong Huang
OT
SSL
70
114
0
19 Jan 2021
Transformers in Vision: A Survey
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
270
2,492
0
04 Jan 2021
A Survey on Deep Reinforcement Learning for Audio-Based Applications
A Survey on Deep Reinforcement Learning for Audio-Based Applications
S. Latif
Heriberto Cuayáhuitl
Farrukh Pervez
Fahad Shamshad
Hafiz Shehbaz Ali
Min Zhang
OffRL
94
74
0
01 Jan 2021
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
345
6,731
0
23 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained
  Models
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
63
43
0
15 Dec 2020
DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector
  Quantization
DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization
Shaoshi Ling
Yuzong Liu
58
107
0
11 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation
  Learning
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
82
77
0
08 Dec 2020
s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis
s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis
Xi Wang
Huaiping Ming
Lei He
Frank Soong
31
5
0
17 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
115
420
0
14 Nov 2020
Learning Representations from Audio-Visual Spatial Alignment
Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado
Yi Li
Nuno Vasconcelos
SSL
64
122
0
03 Nov 2020
Transformer in action: a comparative study of transformer-based acoustic
  models for large scale speech recognition applications
Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications
Yongqiang Wang
Yangyang Shi
Frank Zhang
Chunyang Wu
Julian Chan
Ching-Feng Yeh
Alex Xiao
38
21
0
27 Oct 2020
Attention is All You Need in Speech Separation
Attention is All You Need in Speech Separation
Cem Subakan
Mirco Ravanelli
Samuele Cornell
Mirko Bronzi
Jianyuan Zhong
88
557
0
25 Oct 2020
Two-stage Textual Knowledge Distillation for End-to-End Spoken Language
  Understanding
Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding
Seongbin Kim
Gyuwan Kim
Seongjin Shin
Sangmin Lee
VLM
28
20
0
25 Oct 2020
GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech
  Synthesis
GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis
Rui Liu
Berrak Sisman
Haizhou Li
72
25
0
23 Oct 2020
Transformer-based End-to-End Speech Recognition with Local Dense
  Synthesizer Attention
Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention
Menglong Xu
Shengqiang Li
Xiao-Lei Zhang
43
32
0
23 Oct 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at
  Scale
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
530
40,739
0
22 Oct 2020
Developing Real-time Streaming Transformer Transducer for Speech
  Recognition on Large-scale Dataset
Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset
Xie Chen
Yu-Huan Wu
Zhenghao Wang
Shujie Liu
Jinyu Li
104
174
0
22 Oct 2020
TurnGPT: a Transformer-based Language Model for Predicting Turn-taking
  in Spoken Dialog
TurnGPT: a Transformer-based Language Model for Predicting Turn-taking in Spoken Dialog
Erik Ekstedt
Gabriel Skantze
56
58
0
21 Oct 2020
Emformer: Efficient Memory Transformer Based Acoustic Model For Low
  Latency Streaming Speech Recognition
Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition
Yangyang Shi
Yongqiang Wang
Chunyang Wu
Ching-Feng Yeh
Julian Chan
Frank Zhang
Duc Le
M. Seltzer
94
171
0
21 Oct 2020
Pushing the Limits of Semi-Supervised Learning for Automatic Speech
  Recognition
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Yu Zhang
James Qin
Daniel S. Park
Wei Han
Chung-Cheng Chiu
Ruoming Pang
Quoc V. Le
Yonghui Wu
VLM
SSL
176
309
0
20 Oct 2020
fairseq S2T: Fast Speech-to-Text Modeling with fairseq
fairseq S2T: Fast Speech-to-Text Modeling with fairseq
Changhan Wang
Yun Tang
Xutai Ma
Anne Wu
Sravya Popuri
Dmytro Okhonko
J. Pino
VLM
LRM
63
271
0
11 Oct 2020
Rethinking Attention with Performers
Rethinking Attention with Performers
K. Choromanski
Valerii Likhosherstov
David Dohan
Xingyou Song
Andreea Gane
...
Afroz Mohiuddin
Lukasz Kaiser
David Belanger
Lucy J. Colwell
Adrian Weller
167
1,570
0
30 Sep 2020
Learning Knowledge Bases with Parameters for Task-Oriented Dialogue
  Systems
Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems
Andrea Madotto
Samuel Cahyawijaya
Genta Indra Winata
Yan Xu
Zihan Liu
Zhaojiang Lin
Pascale Fung
94
63
0
28 Sep 2020
Hierarchical Pre-training for Sequence Labelling in Spoken Dialog
Hierarchical Pre-training for Sequence Labelling in Spoken Dialog
E. Chapuis
Pierre Colombo
Matteo Manica
Matthieu Labeau
Chloé Clavel
120
59
0
23 Sep 2020
Efficient Transformers: A Survey
Efficient Transformers: A Survey
Yi Tay
Mostafa Dehghani
Dara Bahri
Donald Metzler
VLM
146
1,115
0
14 Sep 2020
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device
  Speech Recognition
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
Quan Wang
Ignacio López Moreno
Mert Saglam
K. Wilson
Alan Chiao
...
Yanzhang He
Wei Li
Jason W. Pelecanos
M. Nika
A. Gruenstein
VLM
51
85
0
09 Sep 2020
Are Neural Open-Domain Dialog Systems Robust to Speech Recognition
  Errors in the Dialog History? An Empirical Study
Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study
Karthik Gopalakrishnan
Behnam Hedayatnia
Longshaokan Wang
Yang Liu
Dilek Z. Hakkani-Tür
28
24
0
18 Aug 2020
Multi-modal Transformer for Video Retrieval
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
518
609
0
21 Jul 2020
Low Rank Fusion based Transformers for Multimodal Sequences
Low Rank Fusion based Transformers for Multimodal Sequences
Saurav Sahay
Eda Okur
Shachi H. Kumar
L. Nachman
ViT
41
66
0
04 Jul 2020
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
  Representations
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Alexei Baevski
Henry Zhou
Abdel-rahman Mohamed
Michael Auli
SSL
228
5,774
0
20 Jun 2020
FastPitch: Parallel Text-to-speech with Pitch Prediction
FastPitch: Parallel Text-to-speech with Pitch Prediction
Adrian Lañcucki
66
340
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation
  Learning
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
56
494
0
11 Jun 2020
MultiSpeech: Multi-Speaker Text to Speech with Transformer
MultiSpeech: Multi-Speaker Text to Speech with Transformer
Mingjian Chen
Xu Tan
Yi Ren
Jin Xu
Hao Sun
Sheng Zhao
Tao Qin
Tie-Yan Liu
60
110
0
08 Jun 2020
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Yi Ren
Chenxu Hu
Xu Tan
Tao Qin
Sheng Zhao
Zhou Zhao
Tie-Yan Liu
105
1,393
0
08 Jun 2020
On the Comparison of Popular End-to-End Models for Large Scale Speech
  Recognition
On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition
Jinyu Li
Yu-Huan Wu
Yashesh Gaur
Chengyi Wang
Rui Zhao
Shujie Liu
39
137
0
28 May 2020
HAT: Hardware-Aware Transformers for Efficient Natural Language
  Processing
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing
Hanrui Wang
Zhanghao Wu
Zhijian Liu
Han Cai
Ligeng Zhu
Chuang Gan
Song Han
82
262
0
28 May 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
624
41,736
0
28 May 2020
Exploring Transformers for Large-Scale Speech Recognition
Exploring Transformers for Large-Scale Speech Recognition
Liang Lu
Changliang Liu
Jinyu Li
Jiawei Liu
43
40
0
19 May 2020
Conformer: Convolution-augmented Transformer for Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati
James Qin
Chung-Cheng Chiu
Niki Parmar
Yu Zhang
...
Wei Han
Shibo Wang
Zhengdong Zhang
Yonghui Wu
Ruoming Pang
210
3,119
0
16 May 2020
Streaming Transformer-based Acoustic Models Using Self-attention with
  Augmented Memory
Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory
Chunyang Wu
Yongqiang Wang
Yangyang Shi
Ching-Feng Yeh
Frank Zhang
RALM
58
63
0
16 May 2020
JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech
  without Explicit Alignment
JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment
D. Lim
Won Jang
Gyeonghwan O
Heayoung Park
Bongwan Kim
Jaesam Yoon
44
37
0
15 May 2020
Recipes for building an open-domain chatbot
Recipes for building an open-domain chatbot
Stephen Roller
Emily Dinan
Naman Goyal
Da Ju
Mary Williamson
...
Myle Ott
Kurt Shuster
Eric Michael Smith
Y-Lan Boureau
Jason Weston
ALM
115
1,006
0
28 Apr 2020
DIET: Lightweight Language Understanding for Dialogue Systems
DIET: Lightweight Language Understanding for Dialogue Systems
Tanja Bunk
Daksh Varshneya
Vladimir Vlasov
Alan Nichol
50
162
0
21 Apr 2020
Understanding the Difficulty of Training Transformers
Understanding the Difficulty of Training Transformers
Liyuan Liu
Xiaodong Liu
Jianfeng Gao
Weizhu Chen
Jiawei Han
AI4CE
55
254
0
17 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
90
1,934
0
13 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
  Transformers
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
124
438
0
02 Apr 2020
Efficient Content-Based Sparse Attention with Routing Transformers
Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy
M. Saffar
Ashish Vaswani
David Grangier
MoE
294
596
0
12 Mar 2020
Unsupervised Style and Content Separation by Minimizing Mutual
  Information for Speech Synthesis
Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
Ting-Yao Hu
A. Shrivastava
Oncel Tuzel
C. Dhir
31
31
0
09 Mar 2020
Previous
12345
Next