arXiv: 2104.01778
AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT

Papers citing "AST: Audio Spectrogram Transformer"

50 / 486 papers shown
Title
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
88
7
0
28 Mar 2024
Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer
Jeongho Kim
Seung-Ho Lee
70
1
0
26 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
91
79
0
22 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Diptesh Kanojia
Jiankang Deng
Xiatian Zhu
VOS
82
6
0
21 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
105
13
0
20 Mar 2024
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
Afrina Tabassum
Dung N. Tran
Trung D. Q. Dang
Ismini Lourentzou
K. Koishida
84
0
0
14 Mar 2024
Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning
Lauren Stumpf
B. Kadirvelu
Sigourney Waibel
A. A. Faisal
46
3
0
29 Feb 2024
Beyond Language Models: Byte Models are Digital World Simulators
Shangda Wu
Xu Tan
Zili Wang
Rui Wang
Xiaobing Li
Maosong Sun
67
13
0
29 Feb 2024
Mixer is more than just a model
Qingfeng Ji
Yuxin Wang
Letong Sun
70
0
0
28 Feb 2024
What Do Language Models Hear? Probing for Auditory Representations in Language Models
Jerry Ngo
Yoon Kim
AuLLM, MILM
66
8
0
26 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
61
2
0
23 Feb 2024
Codec-SUPERB: An In-Depth Analysis of Sound Codec Models
Haibin Wu
Ho-Lam Chung
Yi-Cheng Lin
Yuan-Kuei Wu
Xuanjun Chen
Yu-Chi Pai
Hsiu-Hsuan Wang
Kai-Wei Chang
Alexander H. Liu
Hung-yi Lee
123
29
0
20 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
105
6
0
14 Feb 2024
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
Masato Hagiwara
Marius Miron
Jen-Yu Liu
55
2
0
05 Feb 2024
Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift
Jisheng Bai
Mou Wang
Haohe Liu
Han Yin
Yafei Jia
...
Woon-Seng Gan
Mark D. Plumbley
S. Rahardja
Bin Xiang
Jianfeng Chen
53
7
0
05 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLM, LM&MA, MLLM
172
94
0
02 Feb 2024
Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters
Umberto Cappellazzo
Daniele Falavigna
Alessio Brutti
MoE
59
3
0
01 Feb 2024
Multimodal Action Quality Assessment
Ling-an Zeng
Wei-Shi Zheng
119
16
0
31 Jan 2024
Masked Audio Modeling with CLAP and Multi-Objective Learning
Yifei Xin
Xiulian Peng
Yan Lu
112
8
0
29 Jan 2024
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA, VLM
132
27
0
27 Jan 2024
Expressivity-aware Music Performance Retrieval using Mid-level Perceptual Features and Emotion Word Embeddings
Shreyan Chowdhury
Gerhard Widmer
51
0
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
139
7
0
25 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
82
0
0
22 Jan 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
85
4
0
21 Jan 2024
ASM: Audio Spectrogram Mixer
Qingfeng Ji
Jicun Zhang
Yuxin Wang
48
1
0
20 Jan 2024
LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units
Zeyu Liu
Gourav Datta
Anni Li
Peter A. Beerel
70
10
0
20 Jan 2024
AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks
Yun Liang
Hai Lin
Shaojian Qiu
Yihang Zhang
35
1
0
19 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLM, LLMAG
155
20
0
19 Jan 2024
From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers
Jiu Feng
Mehmet Hamza Erol
Joon Son Chung
Arda Senocak
60
2
0
16 Jan 2024
Cascaded Cross-Modal Transformer for Audio-Textual Classification
Nicolae-Cătălin Ristea
Andrei Anghel
Radu Tudor Ionescu
96
2
0
15 Jan 2024
Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection
Haobo Yue
Zhicheng Zhang
Da Mu
Yonghao Dang
Jianqin Yin
Jin Tang
75
0
0
10 Jan 2024
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen
Yuzhe Liang
Ziyang Ma
Zhisheng Zheng
Xie Chen
ViT
107
22
0
07 Jan 2024
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
Haiyang Liu
Zihao Zhu
Giorgio Becherini
Yichen Peng
Mingyang Su
You Zhou
Xuefei Zhe
Naoya Iwamoto
Bo Zheng
Michael J. Black
SLR
192
36
0
31 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM, MLLM
102
175
0
28 Dec 2023
EnchantDance: Unveiling the Potential of Music-Driven Dance Movement
Bo Han
Yi Ren
Hao Peng
Teng Zhang
Zeyu Ling
Xiang Yin
Feilin Han
49
4
0
26 Dec 2023
Deformable Audio Transformer for Audio Event Detection
Wentao Zhu
78
0
0
24 Dec 2023
Consistent and Relevant: Rethink the Query Embedding in General Sound Separation
Yuanyuan Wang
Hangting Chen
Dongchao Yang
Jianwei Yu
Chao Weng
Zhiyong Wu
Helen M. Meng
51
6
0
24 Dec 2023
On the choice of the optimal temporal support for audio classification with Pre-trained embeddings
Aurian Quélennec
Michel Olvera
Geoffroy Peeters
S. Essid
69
2
0
21 Dec 2023
Stethoscope-guided Supervised Contrastive Learning for Cross-domain Adaptation on Respiratory Sound Classification
June-Woo Kim
Sangmin Bae
Won-Yang Cho
Byungjo Lee
Ho-Young Jung
99
17
0
15 Dec 2023
Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation
Drew Priebe
Burooj Ghani
Dan Stowell
47
5
0
14 Dec 2023
Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI
Kai Huang
Boyuan Yang
Wei Gao
68
1
0
13 Dec 2023
Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
Kiran Chhatre
Radek Daněček
Nikos Athanasiou
Giorgio Becherini
Christopher Peters
Michael J. Black
Timo Bolkart
DiffM
132
22
0
07 Dec 2023
ViT-Lens: Towards Omni-modal Representations
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
99
20
0
27 Nov 2023
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Zhengcong Fei
Mingyuan Fan
Junshi Huang
153
20
0
27 Nov 2023
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
Xiaohan Ding
Yiyuan Zhang
Yixiao Ge
Sijie Zhao
Lin Song
Xiangyu Yue
Ying Shan
VLM, AI4TS, SSL
104
131
0
27 Nov 2023
Spectro-ViT: A Vision Transformer Model for GABA-edited MRS Reconstruction Using Spectrograms
G. Dias
R. Berto
Mateus Oliveira
Lucas Ueda
S. Dertkigil
Paula D. P. Costa
Amirmohammad Shamaei
Roberto Souza
Ashley D. Harris
Letícia Rittner
62
0
0
26 Nov 2023
Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks
Amrit Nagarajan
Anand Raghunathan
VLM, ViT
42
0
0
22 Nov 2023
Unveiling the Power of Self-Attention for Shipping Cost Prediction: The Rate Card Transformer
Aditya Sreekar
Berrin Yanıkoğlu
Varun Madhavan
Abhishek Persad
26
0
0
20 Nov 2023
Multi-View Spectrogram Transformer for Respiratory Sound Classification
Wentao He
Yuchen Yan
Jianfeng Ren
Ruibin Bai
Xudong Jiang
MedIm, ViT
48
10
0
16 Nov 2023
AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance
Yuanbo Hou
Qiaoqiao Ren
Huizhong Zhang
A. Mitchell
F. Aletta
Jian Kang
Dick Botteldooren
65
17
0
15 Nov 2023