ResearchTrend.AI
arXiv:2104.01778
AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT

Papers citing "AST: Audio Spectrogram Transformer"

50 / 486 papers shown
Audio Retrieval with WavText5K and CLAP Training
Soham Deshmukh
Benjamin Elizalde
Huaming Wang
3DV, CLIP
181
53
0
28 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
137
31
0
28 Sep 2022
InFi: End-to-End Learning to Filter Input for Resource-Efficiency in Mobile-Centric Inference
Mu Yuan
Lan Zhang
Fengxiang He
Xueting Tong
Miao-Hui Song
Zhengyuan Xu
Xiang-Yang Li
60
2
0
28 Sep 2022
UniKW-AT: Unified Keyword Spotting and Audio Tagging
Heinrich Dinkel
Yongqing Wang
Zhiyong Yan
Junbo Zhang
Yujun Wang
62
3
0
23 Sep 2022
An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition
Yang Wu
Pai Peng
Zhenyu Zhang
Yanyan Zhao
Bing Qin
45
1
0
20 Sep 2022
I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization
Dianwen Ng
J. Yip
Tanmay Surana
Zhao Yang
Chong Zhang
Yukun Ma
Chongjia Ni
Chng Eng Siong
B. Ma
91
6
0
14 Sep 2022
Classify Respiratory Abnormality in Lung Sounds Using STFT and a Fine-Tuned ResNet18 Network
Zizhao Chen
Hongliang Wang
Chia-Hui Yeh
Xilin Liu
34
16
0
30 Aug 2022
MuLan: A Joint Embedding of Music Audio and Natural Language
Qingqing Huang
A. Jansen
Joonseok Lee
Ravi Ganti
Judith Yue Li
D. Ellis
143
139
0
26 Aug 2022
Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers
Paul Primus
Gerhard Widmer
VLM
112
5
0
24 Aug 2022
A differentiable short-time Fourier transform with respect to the window length
Maxime Leiber
Axel Barrau
Y. Marnissi
D. Abboud
54
9
0
23 Aug 2022
Self-Supervised Multimodal Fusion Transformer for Passive Activity Recognition
Armand K. Koupai
M. J. Bocus
Raúl Santos-Rodríguez
Robert Piechocki
Ryan McConville
ViT
62
9
0
15 Aug 2022
Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion
Yuanbo Hou
Bo Kang
Dick Botteldooren
66
3
0
03 Aug 2022
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Mireille Fares
Michele Grimaldi
Catherine Pelachaud
Nicolas Obin
63
18
0
03 Aug 2022
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
75
23
0
29 Jul 2022
Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn
Rui Qian
Kimberly Wilber
Hartwig Adam
Oisin Mac Aodha
Serge Belongie
103
10
0
21 Jul 2022
Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval
Daiki Takeuchi
Yasunori Ohishi
Daisuke Niizumi
Noboru Harada
K. Kashino
122
2
0
20 Jul 2022
COVID-19 Detection from Respiratory Sounds with Hierarchical Spectrogram Transformers
Idil Aytekin
Onat Dalmaz
Kaan Gonc
H. Ankishan
E. Saritas
Ulas Bagci
H. Celik
Tolga Çukur
60
12
0
19 Jul 2022
GAFX: A General Audio Feature eXtractor
Zhaoyang Bu
Han Zhang
Xiaohu Zhu
60
0
0
19 Jul 2022
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
A. Shirian
Krishna Somandepalli
Victor Sanchez
T. Guha
61
3
0
16 Jul 2022
Segment-level Metric Learning for Few-shot Bioacoustic Event Detection
Haohe Liu
Xubo Liu
Xinhao Mei
Qiuqiang Kong
Wenwu Wang
Mark D. Plumbley
74
8
0
15 Jul 2022
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian
Yeqing Li
Zheng Xu
Ming-Hsuan Yang
Serge Belongie
Huayu Chen
VLM
74
22
0
15 Jul 2022
Masked Autoencoders that Listen
Po-Yao (Bernie) Huang
Hu Xu
Juncheng Billy Li
Alexei Baevski
Michael Auli
Wojciech Galuba
Florian Metze
Christoph Feichtenhofer
165
290
0
13 Jul 2022
Wayformer: Motion Forecasting via Simple & Efficient Attention Networks
Nigamaa Nayakanti
Rami Al-Rfou
Aurick Zhou
Kratarth Goel
Khaled S. Refaat
Benjamin Sapp
AI4TS
140
259
0
12 Jul 2022
EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use
Jan Schlüter
Gerald Gutenbrunner
VLM
60
13
0
12 Jul 2022
A Multi-tasking Model of Speaker-Keyword Classification for Keeping Human in the Loop of Drone-assisted Inspection
Yu Li
Anisha Parsan
Bill Wang
Penghao Dong
Shanshan Yao
Ruwen Qin
77
7
0
08 Jul 2022
BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization
Sheng Kuang
Kiki van der Heijden
S. Mehrkanoon
33
3
0
08 Jul 2022
Data Augmentation for Dementia Detection in Spoken Language
Anna Hlédiková
Dominika Woszczyk
Alican Acman
Soteris Demetriou
Björn Schuller
70
13
0
26 Jun 2022
Avoid Overfitting User Specific Information in Federated Keyword Spotting
Xin-Chun Li
Jin-Lin Tang
Shaoming Song
Bingshuai Li
Yinchuan Li
Yunfeng Shao
Le Gan
De-Chuan Zhan
FedML, AAML
64
9
0
17 Jun 2022
Event-related data conditioning for acoustic event classification
Yuanbo Hou
Dick Botteldooren
59
3
0
16 Jun 2022
It's Time for Artistic Correspondence in Music and Video
Dídac Surís
Carl Vondrick
Bryan C. Russell
Justin Salamon
64
37
0
14 Jun 2022
PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit
Hui Zhang
Tian Yuan
Junkun Chen
Xintong Li
Renjie Zheng
...
Zeyu Chen
Xiaoguang Hu
Dianhai Yu
Yanjun Ma
Liang Huang
AuLLM
74
28
0
20 May 2022
The AI Mechanic: Acoustic Vehicle Characterization Neural Networks
Adam M. Terwilliger
J. Siegel
67
2
0
19 May 2022
Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
69
6
0
17 May 2022
Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning
Xuenan Xu
Zeyu Xie
Mengyue Wu
K. Yu
84
16
0
11 May 2022
Robustness of Neural Architectures for Audio Event Detection
Juncheng Billy Li
Zheng Wang
Shuhui Qu
Florian Metze
40
1
0
06 May 2022
Pseudo strong labels for large scale weakly supervised audio tagging
Heinrich Dinkel
Zhiyong Yan
Yongqing Wang
Junbo Zhang
Yujun Wang
63
6
0
28 Apr 2022
Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training
Dading Chong
Helin Wang
Peilin Zhou
Qingcheng Zeng
82
68
0
27 Apr 2022
ATST: Audio Representation Learning with Teacher-Student Transformer
Xian Li
Xiaofei Li
ViT
58
22
0
26 Apr 2022
BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
SSL
100
59
0
15 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
146
43
0
06 Apr 2022
MetaAudio: A Few-Shot Audio Classification Benchmark
Calum Heggan
S. Budgett
Timothy M. Hospedales
Mehrdad Yaghoobi
VLM
86
33
0
05 Apr 2022
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
93
86
0
01 Apr 2022
A Temporal-oriented Broadcast ResNet for COVID-19 Detection
Xin Jing
Shuo Liu
Emilia Parada-Cabaleiro
Andreas Triantafyllopoulos
Meishu Song
Zijiang Yang
Björn W. Schuller
84
2
0
31 Mar 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
Alan Baade
Puyuan Peng
David Harwath
84
102
0
30 Mar 2022
DeLoRes: Decorrelating Latent Spaces for Low-Resource Audio Representation Learning
Sreyan Ghosh
Ashish Seth
Deepak Mittal
Maneesh Singh
S. Umesh
SSL
64
6
0
25 Mar 2022
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
Juncheng Billy Li
Shuhui Qu
Po-Yao (Bernie) Huang
Florian Metze
VLM
102
9
0
25 Mar 2022
CT-SAT: Contextual Transformer for Sequential Audio Tagging
Yuanbo Hou
Zhaoyi Liu
Bo Kang
Yun Wang
Dick Botteldooren
ViT
64
5
0
22 Mar 2022
PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
Samuel Yu
Peter Wu
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
LRM
120
16
0
21 Mar 2022
SepTr: Separable Transformer for Audio Spectrogram Processing
Nicolae-Cătălin Ristea
Radu Tudor Ionescu
Fahad Shahbaz Khan
ViT
96
32
0
17 Mar 2022
Learning Audio Representations with MLPs
Mashrur M. Morshed
Ahmad Omar Ahsan
H. Mahmud
Md. Kamrul Hasan
80
4
0
16 Mar 2022