ResearchTrend.AI
AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT

Papers citing "AST: Audio Spectrogram Transformer"

50 / 486 papers shown
WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model
Yi Zhu
Tiago H. Falk
MedIm
84
1
0
26 Jun 2024
A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
Tzu-Yun Hung
Jui-Te Wu
Yu-Chia Kuo
Yo-Wei Hsiao
Ting-Wei Lin
Li Su
68
0
0
26 Jun 2024
Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
Hyunjong Ok
Jegwang Ryu
Jaeho Lee
47
0
0
26 Jun 2024
This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach
Lukas Christ
Shahin Amiriparian
Friederike Hawighorst
Ann-Kathrin Schill
Angelo Boutalikakis
Lorenz Graf-Vlachy
Andreas Konig
Björn W. Schuller
71
1
0
25 Jun 2024
Sound Tagging in Infant-centric Home Soundscapes
Mohammad Nur Hossain Khan
Jialu Li
Nancy L. McElwain
M. Hasegawa-Johnson
Bashima Islam
62
0
0
25 Jun 2024
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking
Yuwei Zhang
Tong Xia
Jing Han
Yu Wu
Georgios Rizos
Yang Liu
Mohammed Mosuily
Jagmohan Chauhan
Cecilia Mascolo
AI4CE
76
12
0
23 Jun 2024
Predefined Prototypes for Intra-Class Separation and Disentanglement
Antonio Almudévar
Théo Mariotte
Alfonso Ortega
Marie Tahon
Luis Vicente
A. Miguel
Eduardo Lleida
74
0
0
23 Jun 2024
LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation
Rebecca Salganik
Xiaohao Liu
Yunshan Ma
Jian Kang
Tat-Seng Chua
CLL
100
2
0
20 Jun 2024
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu
Gang Li
Junbo Zhang
Heinrich Dinkel
Yongqing Wang
Zhiyong Yan
Yujun Wang
Bin Wang
AuLLM
135
5
0
19 Jun 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh
Sonal Kumar
Ashish Seth
Chandra Kiran Reddy Evuru
Utkarsh Tyagi
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM LRM
110
61
0
17 Jun 2024
AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection
Anbai Jiang
Bing Han
Zhiqiang Lv
Yufeng Deng
Wei-Qiang Zhang
Xie Chen
Yanmin Qian
Jia Liu
Pingyi Fan
66
3
0
17 Jun 2024
AVR: Synergizing Foundation Models for Audio-Visual Humor Detection
Sarthak Sharma
Orchid Chetia Phukan
Drishti Singh
Arun Balaji Buduru
Rajesh Sharma
70
0
0
15 Jun 2024
Understanding Pedestrian Movement Using Urban Sensing Technologies: The Promise of Audio-based Sensors
Chaeyeon Han
Pavan Seshadri
Yiwei Ding
Noah Posner
B. Koo
Animesh Agrawal
Alexander Lerch
S. Guhathakurta
57
2
0
14 Jun 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen
Puyuan Peng
Ami Baid
Zihui Xue
Wei-Ning Hsu
David Harwath
Kristen Grauman
VGen
98
8
0
13 Jun 2024
Vision Transformer Segmentation for Visual Bird Sound Denoising
Sahil Kumar
Jialu Li
Youshan Zhang
73
1
0
13 Jun 2024
Towards Multilingual Audio-Visual Question Answering
Orchid Chetia Phukan
Priyabrata Mallick
Swarup Ranjan Behera
Aalekhya Satya Narayani
Arun Balaji Buduru
Rajesh Sharma
103
0
0
13 Jun 2024
3M: Multi-modal Multi-task Multi-teacher Learning for Game Event Detection
Thye Shan Ng
Feiqi Cao
S. Han
48
0
0
13 Jun 2024
Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor
Yongjie Si
Yanxiong Li
Jialong Li
Jiaxin Tan
Qianhua He
159
2
0
12 Jun 2024
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
Kai Wang
Shijian Deng
Jing Shi
Dimitrios Hatzinakos
Yapeng Tian
VGen
124
11
0
11 Jun 2024
FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation
Swarup Ranjan Behera
Abhishek Dhiman
Karthik Gowda
Aalekhya Satya Narayani
85
1
0
11 Jun 2024
MambaLRP: Explaining Selective State Space Sequence Models
F. Jafari
G. Montavon
Klaus-Robert Müller
Oliver Eberle
Mamba
271
11
0
11 Jun 2024
BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification
June-Woo Kim
Miika Toikkanen
Yera Choi
Seoung-Eun Moon
Ho-Young Jung
99
9
0
10 Jun 2024
INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition
Andreas Triantafyllopoulos
A. Batliner
Simon Rampp
M. Milling
Björn Schuller
VLM
67
1
0
10 Jun 2024
Audio-based Step-count Estimation for Running -- Windowing and Neural Network Baselines
Philipp Wagner
Andreas Triantafyllopoulos
Alexander Gebhard
Björn Schuller
71
0
0
10 Jun 2024
Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep
Nikhil Singh
116
1
0
09 Jun 2024
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban
Michael I. Mandel
Johanna Devaney
AuLLM LRM
86
0
0
07 Jun 2024
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
146
26
0
05 Jun 2024
Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
Ohad Cohen
G. Hazan
Sharon Gannot
51
1
0
05 Jun 2024
RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification
Jacob Bitterman
Daniel Levi
H. H. Diamandi
Sharon Gannot
Tal Rosenwein
137
0
0
05 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
102
7
0
04 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
99
19
0
03 Jun 2024
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities
Hao Dong
Yue Zhao
Eleni Chatzi
Olga Fink
OODD
85
18
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
143
5
0
26 May 2024
Planted: a dataset for planted forest identification from multi-satellite time series
L. M. Pazos-Outón
Cristina Nader Vasconcelos
Anton Raichuk
Anurag Arnab
Dan Morris
Maxim Neumann
85
5
0
24 May 2024
LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks
Michelle Halbheer
Dominik J. Mühlematter
Alexander Becker
Dominik Narnhofer
Helge Aasen
Konrad Schindler
Mehmet Özgür Türkoglu
UQCV
127
3
0
23 May 2024
On a time-frequency blurring operator with applications in data augmentation
Simon Halvdansson
ViT
51
0
0
21 May 2024
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Siavash Shams
Sukru Samet Dindar
Xilin Jiang
N. Mesgarani
Mamba
124
23
0
20 May 2024
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding
Jiajie Teng
Huiyu Duan
Yucheng Zhu
Sijing Wu
Guangtao Zhai
72
2
0
15 May 2024
RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
June-Woo Kim
Miika Toikkanen
Sangmin Bae
Minseok Kim
Ho-Young Jung
86
7
0
05 May 2024
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation
Weiquan Huang
Yifei Shen
Yifan Yang
Mamba
72
4
0
30 Apr 2024
Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition
H. Ghaffari
Paul Devos
80
0
0
26 Apr 2024
Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
MedIm
61
2
0
26 Apr 2024
AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
Kin Wai Lau
Yasar Abbas Ur Rehman
L. Po
85
1
0
21 Apr 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
Zhiqing Hong
Rongjie Huang
Xize Cheng
Yongqi Wang
Ruiqi Li
Fuming You
Zhou Zhao
Zhimeng Zhang
70
10
0
14 Apr 2024
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Yiwen Tang
Ray Zhang
Jiaming Liu
Zoey Guo
Dong Wang
...
Bin Zhao
Shanghang Zhang
Peng Gao
Hongsheng Li
Xuelong Li
99
13
0
11 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM CLIP
73
3
0
09 Apr 2024
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
105
15
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV SSL
88
7
0
08 Apr 2024
M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews
Sayed Muddashir Hossain
Jan Alexandersson
Philipp Müller
85
1
0
04 Apr 2024
Audio Simulation for Sound Source Localization in Virtual Evironment
Yidi Yuan
Swee Liang Wong
Jonathan Pan
61
0
0
02 Apr 2024