AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT

Papers citing "AST: Audio Spectrogram Transformer"

50 / 463 papers shown
Audio-based Step-count Estimation for Running -- Windowing and Neural Network Baselines
Philipp Wagner
Andreas Triantafyllopoulos
Alexander Gebhard
Björn Schuller
37
0
0
10 Jun 2024
Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep
Nikhil Singh
40
1
0
09 Jun 2024
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban
Michael I. Mandel
Johanna Devaney
AuLLM
LRM
38
0
0
07 Jun 2024
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
73
19
0
05 Jun 2024
Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
Ohad Cohen
G. Hazan
Sharon Gannot
42
1
0
05 Jun 2024
RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification
Jacob Bitterman
Daniel Levi
H. H. Diamandi
Sharon Gannot
Tal Rosenwein
44
0
0
05 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
38
5
0
04 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
61
18
0
03 Jun 2024
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities
Hao Dong
Yue Zhao
Eleni Chatzi
Olga Fink
OODD
43
11
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
71
5
0
26 May 2024
Planted: a dataset for planted forest identification from multi-satellite time series
L. M. Pazos-Outón
Cristina Nader Vasconcelos
Anton Raichuk
Anurag Arnab
Dan Morris
Maxim Neumann
47
4
0
24 May 2024
LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks
Michelle Halbheer
Dominik J. Mühlematter
Alexander Becker
Dominik Narnhofer
Helge Aasen
Konrad Schindler
Mehmet Özgür Türkoglu
UQCV
38
1
0
23 May 2024
On a time-frequency blurring operator with applications in data augmentation
Simon Halvdansson
ViT
27
0
0
21 May 2024
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model
Siavash Shams
Sukru Samet Dindar
Xilin Jiang
N. Mesgarani
Mamba
69
18
0
20 May 2024
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding
Jiajie Teng
Huiyu Duan
Yucheng Zhu
Sijing Wu
Guangtao Zhai
36
2
0
15 May 2024
RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
June-Woo Kim
Miika Toikkanen
Sangmin Bae
Minseok Kim
Ho-Young Jung
38
5
0
05 May 2024
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation
Weiquan Huang
Yifei Shen
Yifan Yang
Mamba
41
4
0
30 Apr 2024
Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition
H. Ghaffari
Paul Devos
45
0
0
26 Apr 2024
Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
MedIm
21
2
0
26 Apr 2024
AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
Kin Wai Lau
Yasar Abbas Ur Rehman
L. Po
44
1
0
21 Apr 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
Zhiqing Hong
Rongjie Huang
Xize Cheng
Yongqi Wang
Ruiqi Li
Fuming You
Zhou Zhao
Zhimeng Zhang
34
7
0
14 Apr 2024
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
Yiwen Tang
Ray Zhang
Jiaming Liu
Zoey Guo
Dong Wang
...
Bin Zhao
Shanghang Zhang
Peng Gao
Hongsheng Li
Xuelong Li
40
12
0
11 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
33
2
0
09 Apr 2024
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
K. Kashino
42
10
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV
SSL
28
6
0
08 Apr 2024
M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews
Sayed Muddashir Hossain
Jan Alexandersson
Philipp Müller
31
1
0
04 Apr 2024
Audio Simulation for Sound Source Localization in Virtual Evironment
Yidi Yuan
Swee Liang Wong
Jonathan Pan
29
0
0
02 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer
Jeongho Kim
Seung-Ho Lee
35
1
0
26 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
42
47
0
22 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Diptesh Kanojia
Jiangkang Deng
Xiatian Zhu
VOS
43
5
0
21 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
56
12
0
20 Mar 2024
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
Afrina Tabassum
Dung N. Tran
Trung D. Q. Dang
Ismini Lourentzou
K. Koishida
50
0
0
14 Mar 2024
Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning
Lauren Stumpf
B. Kadirvelu
Sigourney Waibel
A. A. Faisal
29
2
0
29 Feb 2024
Beyond Language Models: Byte Models are Digital World Simulators
Shangda Wu
Xu Tan
Zili Wang
Rui Wang
Xiaobing Li
Maosong Sun
35
12
0
29 Feb 2024
Mixer is more than just a model
Qingfeng Ji
Yuxin Wang
Letong Sun
40
0
0
28 Feb 2024
What Do Language Models Hear? Probing for Auditory Representations in Language Models
Jerry Ngo
Yoon Kim
AuLLM
MILM
26
8
0
26 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
52
1
0
23 Feb 2024
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
Haibin Wu
Ho-Lam Chung
Yi-Cheng Lin
Yuan-Kuei Wu
Xuanjun Chen
Yu-Chi Pai
Hsiu-Hsuan Wang
Kai-Wei Chang
Alexander H. Liu
Hung-yi Lee
55
19
0
20 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
40
5
0
14 Feb 2024
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
Masato Hagiwara
Marius Miron
Jen-Yu Liu
26
1
0
05 Feb 2024
Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift
Jisheng Bai
Mou Wang
Haohe Liu
Han Yin
Yafei Jia
...
Woon-Seng Gan
Mark D. Plumbley
S. Rahardja
Bin Xiang
Jianfeng Chen
24
7
0
05 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLM
LM&MA
MLLM
78
74
0
02 Feb 2024
Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters
Umberto Cappellazzo
Daniele Falavigna
Alessio Brutti
MoE
48
2
0
01 Feb 2024
Multimodal Action Quality Assessment
Ling-an Zeng
Wei-Shi Zheng
43
13
0
31 Jan 2024
Masked Audio Modeling with CLAP and Multi-Objective Learning
Yifei Xin
Xiulian Peng
Yan Lu
52
8
0
29 Jan 2024
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA
VLM
49
23
0
27 Jan 2024
Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings
Shreyan Chowdhury
Gerhard Widmer
24
0
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
22
7
0
25 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
39
0
0
22 Jan 2024