Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

6 July 2023
Yuan Gong
Sameer Khurana
Leonid Karlinsky
James R. Glass

Papers citing "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers"

Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Haoshuai Zhou
Boxuan Cao
Changgeng Mo
Linkai Li
Shan Xiang Wang
AI4CE
31
0
0
13 May 2025
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey
R. Swaminathan
K V Vijay Girish
Arunasish Sen
Jian Xie
Grant P. Strimel
Andreas Schwarz
140
0
0
12 Apr 2025
Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra
Yang Bai
Priyadarshan Narayanasamy
Nakul Garg
Nirupam Roy
30
0
0
11 Apr 2025
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora
Kai-Wei Chang
Chung-Ming Chien
Yifan Peng
Haibin Wu
Yossi Adi
Emmanuel Dupoux
Hung-yi Lee
Karen Livescu
Shinji Watanabe
52
2
0
11 Apr 2025
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
Junyi Ao
Dekun Chen
Xiaohai Tian
Wenjie Feng
Jingyang Zhang
Lu Lu
Yixuan Wang
Haizhou Li
Zhizheng Wu
AuLLM
66
0
0
19 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
53
0
0
12 Mar 2025
Demographic Attributes Prediction from Speech Using WavLM Embeddings
Yuchen Yang
Thomas Thebaud
Najim Dehak
49
0
0
17 Feb 2025
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Andrew Rouditchenko
Saurabhchand Bhati
Samuel Thomas
Hilde Kuehne
Rogerio Feris
113
1
0
03 Feb 2025
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
Jingyang Zhang
Lu Lu
Yixuan Wang
Haizhou Li
Zhikai Wu
AuLLM
87
17
0
17 Jan 2025
SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR
Pengcheng Guo
Xuankai Chang
Hang Lv
Shinji Watanabe
Lei Xie
66
0
0
07 Dec 2024
Generative Emotion Cause Explanation in Multimodal Conversations
Lin Wang
Xiaocui Yang
Shi Feng
Daling Wang
Yifei Zhang
39
0
0
01 Nov 2024
Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
Wonjune Kang
J. Jia
Chunyang Wu
Wei Zhou
Egor Lakomkin
...
Leda Sari
Suyoun Kim
Ke Li
Jay Mahadeokar
Ozlem Kalinli
AuLLM
31
2
0
02 Oct 2024
Probing mental health information in speech foundation models
Marc de Gennes
Adrien Lesage
Martin Denais
Xuan-Nga Cao
Simon Chang
Pierre Van Remoortere
Cyrille Dakhlia
Rachid Riad
26
0
0
27 Sep 2024
MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events
Xiaoyu Yang
Qiujia Li
Chao Zhang
P. Woodland
24
0
0
25 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
36
9
0
18 Sep 2024
Adaptive Large Language Models By Layerwise Attention Shortcuts
Prateek Verma
Mert Pilanci
KELM
OffRL
52
0
0
17 Sep 2024
Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions
Lingwei Meng
Shujie Hu
Jiawen Kang
Zhaoqing Li
Yuejiao Wang
Wenxuan Wu
Xixin Wu
Xunying Liu
Helen Meng
AuLLM
70
2
0
13 Sep 2024
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Tuomas Virtanen
Björn Schuller
45
4
0
22 Jul 2024
Seal: Advancing Speech Language Models to be Few-Shot Learners
Shuyu Lei
Lingen Liu
Jiaolong Yang
Yasen Jiao
Yuxiang Yang
Yushu Yang
Xiang Guo
VLM
32
0
0
20 Jul 2024
Factor-Conditioned Speaking-Style Captioning
Atsushi Ando
Takafumi Moriya
Shota Horiguchi
Ryo Masumura
35
0
0
27 Jun 2024
Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Jizhong Liu
Gang Li
Junbo Zhang
Heinrich Dinkel
Yongqing Wang
Zhiyong Yan
Yujun Wang
Bin Wang
AuLLM
57
2
0
19 Jun 2024
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
Siddhant Arora
Ankita Pasad
Chung-Ming Chien
Jionghao Han
Roshan S. Sharma
...
William Chen
Suwon Shon
Hung-yi Lee
Karen Livescu
Shinji Watanabe
ELM
48
4
0
14 Jun 2024
Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
Haoyu Wang
Guoqiang Hu
Guodong Lin
Wei-Qiang Zhang
Jian Li
22
1
0
14 Jun 2024
Can Large Language Models Understand Spatial Audio?
Changli Tang
Wenyi Yu
Guangzhi Sun
Xianzhao Chen
Tian Tan
...
Jun Zhang
Lu Lu
Zejun Ma
Yuxuan Wang
Chao Zhang
49
4
0
12 Jun 2024
Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper
Chih-Kai Yang
Kuan Po Huang
Hung-yi Lee
40
3
0
09 Jun 2024
Keyword-Guided Adaptation of Automatic Speech Recognition
Aviv Shamsian
Aviv Navon
Neta Glazer
Gill Hetz
Joseph Keshet
33
1
0
04 Jun 2024
Crossmodal ASR Error Correction with Discrete Speech Units
Yuanchao Li
Pinzhen Chen
Peter Bell
Catherine Lai
36
6
0
26 May 2024
Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition
Chan-Jan Hsu
Yi-Chang Chen
Feng-Ting Liao
Pei-Chen Ho
Yu-Hsiang Wang
Po-Chun Hsu
Da-shan Shiu
31
2
0
23 May 2024
Audio Dialogues: Dialogues dataset for audio and music understanding
Arushi Goel
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
AuLLM
31
4
0
11 Apr 2024
Transfer Learning from Whisper for Microscopic Intelligibility Prediction
Paul Best
Santiago Cuervo
R. Marxer
33
2
0
02 Apr 2024
Project MOSLA: Recording Every Moment of Second Language Acquisition
Masato Hagiwara
Joshua Tanner
35
0
0
26 Mar 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLM
LM&MA
MLLM
74
73
0
02 Feb 2024
Speech foundation models on intelligibility prediction for hearing-impaired listeners
Santiago Cuervo
R. Marxer
30
6
0
24 Jan 2024
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Yuchen Hu
Chen Chen
Chao-Han Huck Yang
Ruizhe Li
Chao Zhang
Pin-Yu Chen
Ensiong Chng
27
20
0
19 Jan 2024
Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms
Anchen Sun
Juan J Londono
Batya Elbaum
Luis Estrada
Roberto Jose Lazo
Laura Vitale
Hugo Gonzalez Villasanti
Riccardo Fusaroli
L. K. Perry
D. Messinger
25
4
0
14 Jan 2024
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Rao Ma
Adian Liusie
Mark J. F. Gales
Kate Knill
34
7
0
15 Nov 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu
Jin Xu
Xiaohuan Zhou
Qian Yang
Shiliang Zhang
Zhijie Yan
Chang Zhou
Jingren Zhou
AuLLM
42
268
0
14 Nov 2023
On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition
Xiaohan Shi
Jiajun He
Xingfeng Li
T. Toda
34
3
0
13 Nov 2023
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang
Wenyi Yu
Guangzhi Sun
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
LM&MA
AuLLM
39
200
0
20 Oct 2023
LLark: A Multimodal Instruction-Following Language Model for Music
Josh Gardner
Simon Durand
Daniel Stoller
Rachel M. Bittner
AuLLM
31
14
0
11 Oct 2023
Joint Audio and Speech Understanding
Yuan Gong
Alexander H. Liu
Hongyin Luo
Leonid Karlinsky
James R. Glass
AuLLM
28
66
0
25 Sep 2023
A Study on Incorporating Whisper for Robust Speech Assessment
Ryandhimas E. Zezario
Yu-Wen Chen
Szu-Wei Fu
Yu Tsao
H. Wang
C. Fuh
27
10
0
22 Sep 2023
Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences
Hugo Malard
Salah Zaiem
Robin Algayres
29
2
0
22 Sep 2023
Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
Chien-yu Huang
Ke-Han Lu
Shi Wang
Chi-Yuan Hsiao
Chun-Yi Kuan
...
Roshan S. Sharma
Shinji Watanabe
Bhiksha Ramakrishnan
Shady Shehata
Hung-yi Lee
AuLLM
34
50
0
18 Sep 2023
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Dianwen Ng
Chong Zhang
Ruixi Zhang
Yukun Ma
Fabian Ritter Gutierrez
Trung Hieu Nguyen
Chongjia Ni
Shengkui Zhao
E. Chng
B. Ma
VLM
32
1
0
18 Sep 2023
Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
Ahmed Attia
Jing Liu
Wei Ai
Dorottya Demszky
Carol Y. Espy-Wilson
20
13
0
12 Sep 2023
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
Dongyang Yu
Shihao Wang
Yuan Fang
Wangpeng An
VGen
33
0
0
08 Aug 2023
UniKW-AT: Unified Keyword Spotting and Audio Tagging
Heinrich Dinkel
Yongqing Wang
Zhiyong Yan
Junbo Zhang
Yujun Wang
42
3
0
23 Sep 2022
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
Qiu-shi Zhu
Jie Zhang
Zi-qiang Zhang
Ming Wu
Xin Fang
Lirong Dai
120
39
0
22 Jan 2022
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation
Yuan Gong
Yu-An Chung
James R. Glass
VLM
104
144
0
02 Feb 2021