Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.11607
Cited By
Transformers in Speech Processing: A Survey
21 March 2023
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Transformers in Speech Processing: A Survey"
50 / 235 papers shown
Title
Fine-grained style control in Transformer-based Text-to-speech Synthesis
Li-Wei Chen
Alexander I. Rudnicky
111
31
0
12 Oct 2021
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training
Sanyuan Chen
Yu Wu
Chengyi Wang
Zhengyang Chen
Zhuo Chen
...
Jian Wu
Yao Qian
Furu Wei
Jinyu Li
Xiangzhan Yu
SSL
53
92
0
12 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLM
CLIP
141
452
0
11 Oct 2021
Universal Paralinguistic Speech Representations Using Self-Supervised Conformers
Joel Shor
A. Jansen
Wei Han
Daniel S. Park
Yu Zhang
SSL
AI4TS
81
59
0
09 Oct 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIP
VLM
309
576
0
28 Sep 2021
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Yu Zhang
Daniel S. Park
Wei Han
James Qin
Anmol Gulati
...
Zhifeng Chen
Quoc V. Le
Chung-Cheng Chiu
Ruoming Pang
Yonghui Wu
SSL
60
175
0
27 Sep 2021
Improving Streaming Transformer Based ASR Under a Framework of Self-supervised Learning
Songjun Cao
Yueteng Kang
Yanzhe Fu
Xiaoshuo Xu
Sining Sun
Yike Zhang
Long Ma
35
16
0
15 Sep 2021
TEASEL: A Transformer-Based Speech-Prefixed Language Model
Mehdi Arjmand
M. Dousti
H. Moradi
55
18
0
12 Sep 2021
Speechformer: Reducing Information Loss in Direct Speech Translation
Sara Papi
Marco Gaido
Matteo Negri
Marco Turchi
101
24
0
09 Sep 2021
A Unified Transformer-based Framework for Duplex Text Normalization
T. Lai
Yang Zhang
Evelina Bakhturina
Boris Ginsburg
Heng Ji
13
12
0
23 Aug 2021
Differentiable Subset Pruning of Transformer Heads
Jiaoda Li
Ryan Cotterell
Mrinmaya Sachan
84
56
0
10 Aug 2021
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Xunlin Zhan
Yangxin Wu
Xiao Dong
Yunchao Wei
Minlong Lu
Yichi Zhang
Hang Xu
Xiaodan Liang
ViT
57
66
0
30 Jul 2021
Translatotron 2: High-quality direct speech-to-speech translation with voice preservation
Ye Jia
Michelle Tadmor Ramanovich
Tal Remez
Roi Pomerantz
44
70
0
19 Jul 2021
Long-Short Transformer: Efficient Transformers for Language and Vision
Chen Zhu
Ming-Yu Liu
Chaowei Xiao
Mohammad Shoeybi
Tom Goldstein
Anima Anandkumar
Bryan Catanzaro
ViT
VLM
69
133
0
05 Jul 2021
CLIP-It! Language-Guided Video Summarization
Medhini Narasimhan
Anna Rohrbach
Trevor Darrell
CLIP
61
117
0
01 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
92
560
0
30 Jun 2021
A Survey on Neural Speech Synthesis
Xu Tan
Tao Qin
Frank Soong
Tie-Yan Liu
AI4TS
73
358
0
29 Jun 2021
DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders
Shuming Ma
Li Dong
Shaohan Huang
Dongdong Zhang
Alexandre Muzio
Saksham Singhal
Hany Awadalla
Xia Song
Furu Wei
SLR
AI4CE
45
81
0
25 Jun 2021
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
68
89
0
25 Jun 2021
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
Han Fang
Pengfei Xiong
Luhui Xu
Yu Chen
CLIP
VLM
90
297
0
21 Jun 2021
Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems
Hsien-Chin Lin
Nurul Lubis
Songbo Hu
Carel van Niekerk
Christian Geishauser
Michael Heck
Shutong Feng
Milica Gavsić
118
30
0
16 Jun 2021
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Wei-Ning Hsu
Benjamin Bolte
Yao-Hung Hubert Tsai
Kushal Lakhotia
Ruslan Salakhutdinov
Abdel-rahman Mohamed
SSL
147
2,939
0
14 Jun 2021
Supervising the Transfer of Reasoning Patterns in VQA
Corentin Kervadec
Christian Wolf
G. Antipov
M. Baccouche
Madiha Nadri Wolf
53
11
0
10 Jun 2021
RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
Xingshan Zeng
Liangyou Li
Qun Liu
55
48
0
09 Jun 2021
A Survey of Transformers
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
114
1,115
0
08 Jun 2021
On the Connection between Local Attention and Dynamic Depth-wise Convolution
Qi Han
Zejia Fan
Qi Dai
Lei-huan Sun
Ming-Ming Cheng
Jiaying Liu
Jingdong Wang
ViT
56
108
0
08 Jun 2021
CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
Tatiana Likhomanenko
Qiantong Xu
Gabriel Synnaeve
R. Collobert
A. Rogozhnikov
OOD
ViT
52
58
0
06 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM
LRM
95
380
0
04 Jun 2021
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Haiyang Xu
Ming Yan
Chenliang Li
Bin Bi
Songfang Huang
Wenming Xiao
Fei Huang
VLM
55
119
0
03 Jun 2021
Segmenter: Transformer for Semantic Segmentation
Robin Strudel
Ricardo Garcia Pinel
Ivan Laptev
Cordelia Schmid
ViT
174
1,460
0
12 May 2021
SUPERB: Speech processing Universal PERformance Benchmark
Shu-Wen Yang
Po-Han Chi
Yung-Sung Chuang
Cheng-I Jeff Lai
Kushal Lakhotia
...
Shuyan Dong
Shang-Wen Li
Shinji Watanabe
Abdel-rahman Mohamed
Hung-yi Lee
SSL
90
929
0
03 May 2021
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Xiuye Gu
Nayeon Lee
Weicheng Kuo
Huayu Chen
VLM
ObjD
263
915
0
28 Apr 2021
DPT-FSNet: Dual-path Transformer Based Full-band and Sub-band Fusion Network for Speech Enhancement
Feng Dang
Hangting Chen
Pengyuan Zhang
91
102
0
27 Apr 2021
Playing Lottery Tickets with Vision and Language
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
133
55
0
23 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
297
587
0
22 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
394
801
0
18 Apr 2021
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction
Stanislav Beliaev
Boris Ginsburg
43
9
0
16 Apr 2021
Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings
L. Pepino
Pablo Riera
Luciana Ferrer
46
360
0
08 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM
VLM
56
91
0
01 Apr 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
47
139
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
56
121
0
30 Mar 2021
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
Chun-Fu Chen
Quanfu Fan
Yikang Shen
ViT
68
1,469
0
27 Mar 2021
TSTNN: Two-stage Transformer based Neural Network for Speech Enhancement in the Time Domain
Kai Wang
Bengbeng He
Weiping Zhu
55
169
0
18 Mar 2021
Causal Attention for Vision-Language Tasks
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
72
154
0
05 Mar 2021
Natural Language Understanding for Argumentative Dialogue Systems in the Opinion Building Domain
W. A. Abro
Annalena Aicher
Niklas Rach
Stefan Ultes
Wolfgang Minker
Guilin Qi
50
33
0
03 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
810
29,167
0
26 Feb 2021
TransMask: A Compact and Fast Speech Separation Model Based on Transformer
Zining Zhang
Bingsheng He
Zhenjie Zhang
48
22
0
19 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
114
661
0
11 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
419
3,826
0
11 Feb 2021
Bayesian Transformer Language Models for Speech Recognition
Boyang Xue
Jianwei Yu
Junhao Xu
Shansong Liu
Shoukang Hu
Zi Ye
Mengzhe Geng
Xunying Liu
Helen Meng
BDL
88
28
0
09 Feb 2021
Previous
1
2
3
4
5
Next