arXiv: 2104.01778
AST: Audio Spectrogram Transformer

5 April 2021
Yuan Gong
Yu-An Chung
James R. Glass
    ViT

Papers citing "AST: Audio Spectrogram Transformer"

50 / 486 papers shown
Title
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Boseung Jeong
Jicheol Park
Sungyeon Kim
Suha Kwak
84
0
0
03 Apr 2025
Token Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance
Taehan Lee
Hyukjun Lee
ViT VLM
88
0
0
02 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
257
0
0
30 Mar 2025
Comparative Analysis of Image, Video, and Audio Classifiers for Automated News Video Segmentation
Jonathan Attard
Dylan Seychell
108
0
0
27 Mar 2025
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Suho Yoo
Hyunjong Ok
Jaeho Lee
AuLLM RALM
105
0
0
21 Mar 2025
Structured-Noise Masked Modeling for Video, Audio and Beyond
Aritra Bhowmik
Fida Mohammad Thoker
Carlos Hinojosa
Bernard Ghanem
Cees G. M. Snoek
VGen
108
0
0
20 Mar 2025
A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana
Alba Márquez-Rodríguez
Miguel Ángel Mohedano-Munoz
Manuel J. Marín-Jiménez
Eduardo Santamaría-García
Giulia Bastianelli
Pedro Jordano
Irene Mendoza
83
0
0
19 Mar 2025
Neural Edge Histogram Descriptors for Underwater Acoustic Target Recognition
Atharva Agashe
Davelle Carreiro
A. V. Dine
Joshua Peeples
69
0
0
17 Mar 2025
^RFLAV: Rolling Flow matching for infinite Audio Video generation
Alex Ergasti
Giuseppe Tarollo
Filippo Botti
Tomaso Fontanini
Claudio Ferrari
Massimo Bertozzi
Andrea Prati
VGen
84
0
0
13 Mar 2025
Targeted Data Poisoning for Black-Box Audio Datasets Ownership Verification
Wassim Bouaziz
El-Mahdi El-Mhamdi
Nicolas Usunier
86
0
0
13 Mar 2025
Learning Gentle Grasping Using Vision, Sound, and Touch
Ken Nakahara
Roberto Calandra
106
0
0
11 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Ming-Yu Liu
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM AuLLM LRM
136
21
0
06 Mar 2025
Question-Aware Gaussian Experts for Audio-Visual Question Answering
Hongyeob Kim
Inyoung Jung
Dayoon Suh
Youjia Zhang
Sangmin Lee
Sungeun Hong
132
0
0
06 Mar 2025
JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection
Hyeonuk Nam
Yong-Hwa Park
82
1
0
28 Feb 2025
Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Tianyun Liu
CLIP VLM
105
0
0
26 Feb 2025
Hedge Fund Portfolio Construction Using PolyModel Theory and iTransformer
Siqiao Zhao
Zhikang Dong
Zeyu Cao
Raphael Douady
137
6
0
17 Feb 2025
Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
David Sasu
Zehui Wu
Ziwei Gong
Run Chen
Pengyuan Shi
Lin Ai
Julia Hirschberg
Natalie Schluter
173
3
0
16 Feb 2025
Harnessing Vision Models for Time Series Analysis: A Survey
Jingchao Ni
Ziming Zhao
ChengAo Shen
Hanghang Tong
Dongjin Song
Wei Cheng
Dongsheng Luo
Haifeng Chen
AI4TS
185
6
0
13 Feb 2025
Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala
Vasileios Tsouvalas
N. Meratnia
MoE
118
0
0
10 Feb 2025
Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling
Jakob Poncelet
Hugo Van hamme
151
0
0
05 Feb 2025
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. P. Muñoz
Jinjie Yuan
Nilesh Jain
Mamba
146
2
0
28 Jan 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
172
4
0
28 Jan 2025
Safe Gradient Flow for Bilevel Optimization
Sina Sharifi
Nazanin Abolfazli
Erfan Yazdandoost Hamedani
Mahyar Fazlyab
93
3
0
27 Jan 2025
Hybrid Losses for Hierarchical Embedding Learning
Haokun Tian
Stefan Lattner
Brian McFee
Charalampos Saitis
89
0
0
22 Jan 2025
Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection
Myeonghoon Ryu
June-Woo Kim
Minseok Oh
Suji Lee
Han Park
125
0
0
20 Jan 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM RALM VLM
91
0
0
17 Jan 2025
Preconditioned Sharpness-Aware Minimization: Unifying Analysis and a Novel Learning Algorithm
Yilang Zhang
Bingcong Li
G. Giannakis
AAML
67
0
0
11 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
193
3
0
10 Jan 2025
Contrastive Learning from Exploratory Actions: Leveraging Natural Interactions for Preference Elicitation
N. Dennler
Stefanos Nikolaidis
Maja J. Matarić
466
0
0
03 Jan 2025
FAST: Fast Audio Spectrogram Transformer
Anugunj Naman
Gaibo Zhang
65
0
0
03 Jan 2025
Trainingless Adaptation of Pretrained Models for Environmental Sound Classification
Noriyuki Tonami
Wataru Kohno
Keisuke Imoto
Yoshiyuki Yajima
Sakiko Mishima
Reishi Kondo
Tomoyuki Hino
VLM
164
0
0
23 Dec 2024
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Taein Son
Soo Won Seo
Jisong Kim
S. Lee
Jun Won Choi
VGen
138
0
0
18 Dec 2024
When Vision Models Meet Parameter Efficient Look-Aside Adapters Without Large-Scale Audio Pretraining
Juan Yeo
Jinkwan Jang
Kyubyung Chae
Seongkyu Mun
Taesup Kim
VLM
142
0
0
08 Dec 2024
STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft
Nicholas Lenzen
Amogh Raut
Andrew Melnik
VGen
118
0
0
01 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Viana
200
0
0
24 Nov 2024
State-Space Large Audio Language Models
Saurabhchand Bhati
Yuan Gong
Leonid Karlinsky
Hilde Kuehne
Rogerio Feris
James Glass
153
1
0
24 Nov 2024
How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception
Sahibzada Adil Shahzad
Ammarah Hashmi
Yan-Tsung Peng
Yu Tsao
H. Wang
71
1
0
14 Nov 2024
PSELDNets: Pre-trained Neural Networks on a Large-scale Synthetic Dataset for Sound Event Localization and Detection
Jinbo Hu
Yin Cao
Ming Wu
Fang Kang
Feiran Yang
Wenwu Wang
Mark D. Plumbley
J. Yang
72
1
0
10 Nov 2024
Model and Deep learning based Dynamic Range Compression Inversion
Haoran Sun
Dominique Fourer
Hichem Maaref
21
0
0
07 Nov 2024
Stepping Forward on the Last Mile
Chen Feng
Shaojie Zhuo
Xiaopeng Zhang
R. Ramakrishnan
Zhaocong Yuan
Andrew Zou Li
139
0
0
06 Nov 2024
Angular Distance Distribution Loss for Audio Classification
Antonio Almudévar
Romain Serizel
Alfonso Ortega
58
0
0
31 Oct 2024
EEG-based Multimodal Representation Learning for Emotion Recognition
Kang Yin
Hye-Bin Shin
Dan Li
Seong-Whan Lee
39
4
0
29 Oct 2024
Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques
David Ortiz-Perez
Manuel Benavent-Lledo
José García Rodríguez
David Tomás
M. Flores Vizcaya-Moreno
69
3
0
24 Oct 2024
Learning to rumble: Automated elephant call classification, detection and endpointing using deep architectures
Christiaan M. Geldenhuys
Thomas R. Niesler
34
0
0
15 Oct 2024
GraFPrint: A GNN-Based Approach for Audio Identification
Aditya Bhattacharjee
Shubhr Singh
Emmanouil Benetos
95
1
0
14 Oct 2024
Skipping Computations in Multimodal LLMs
Mustafa Shukor
Matthieu Cord
68
3
0
12 Oct 2024
GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning
Eileen Wang
Caren Han
Josiah Poon
74
0
0
12 Oct 2024
Movie Trailer Genre Classification Using Multimodal Pretrained Features
Serkan Sulun
Paula Viana
M. Davies
CLIP
74
3
0
11 Oct 2024
Music Genre Classification using Large Language Models
Mohamed El Amine Meguenani
Alceu de Souza Britto Jr.
A. L. Koerich
75
0
0
10 Oct 2024
Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow
Cyrile Delestre
Yoann Sola
34
0
0
10 Oct 2024