ResearchTrend.AI
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

1 March 2023
Max Bain
Jaesung Huh
Tengda Han
Andrew Zisserman

Papers citing "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio"

50 / 118 papers shown
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals
Yuxin Lin
Yinglin Zheng
Ming Zeng
Wangzheng Shi
12
0
0
19 May 2025
Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down
Yingzhi Wang
Anas Alhmoud
Saad Alsahly
Muhammad Alqurishi
Mirco Ravanelli
14
0
0
19 May 2025
KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025
Sai Koneru
Maike Züfle
Thai-Binh Nguyen
Seymanur Akti
Jan Niehues
Alexander Waibel
12
0
0
19 May 2025
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models
Keunwoo Peter Yu
Joyce Chai
MLLM
VLM
12
0
0
16 May 2025
BLAB: Brutally Long Audio Bench
Orevaoghene Ahia
Martijn Bartelds
Kabir Ahuja
Hila Gonen
Valentin Hofmann
...
Noah Bennett
Shinji Watanabe
Noah A. Smith
Yulia Tsvetkov
Sachin Kumar
AuLLM
LM&MA
VLM
63
0
0
05 May 2025
Generating Narrated Lecture Videos from Slides with Synchronized Highlights
Alexander Holmberg
27
0
0
05 May 2025
Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion
Xingqun Qi
Yatian Wang
Hengyuan Zhang
J. Pan
Wei Xue
Shanghang Zhang
Wenhan Luo
Qifeng Liu
Yike Guo
SLR
66
0
0
03 May 2025
Versatile Framework for Song Generation with Prompt-based Control
Wenjie Qu
Wenxiang Guo
Changhao Pan
Zehan Zhu
Ruiqi Li
...
Rongjie Huang
Ruiyuan Zhang
Zhiqing Hong
Ziyue Jiang
Zhou Zhao
77
1
0
27 Apr 2025
Acquisition of high-quality images for camera calibration in robotics applications via speech prompts
Timm Linder
Kadir Yilmaz
David B. Adrian
Bastian Leibe
31
0
0
15 Apr 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin
Jeongsoo Choi
Puyuan Peng
Joon Son Chung
Tae-Hyun Oh
David Harwath
VGen
47
1
0
03 Apr 2025
Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
Soumyya Kanti Datta
Shan Jia
Siwei Lyu
44
0
0
02 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
39
0
0
31 Mar 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
Yubo Zhang
Pedro Botelho
Trevor Gordon
Gil Zussman
I. Kadota
55
0
0
31 Mar 2025
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
57
0
0
28 Mar 2025
VideoMix: Aggregating How-To Videos for Task-Oriented Learning
Saelyne Yang
Anh Truong
Juho Kim
Dingzeyu Li
45
1
0
27 Mar 2025
Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication
Yiwen Xu
Monideep Chakraborti
Tianyi Zhang
Katelyn Eng
Aanchan Mohan
Mirjana Prpa
AuLLM
42
0
0
21 Mar 2025
Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning
Yi He
Lei Yang
Shilin Wang
58
0
0
05 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding
Louis Mahon
Mirella Lapata
VLM
41
2
0
03 Mar 2025
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Keisuke Kamahori
Jungo Kasai
Noriyuki Kojima
Baris Kasikci
37
0
0
27 Feb 2025
Educator Attention: How computational tools can systematically identify the distribution of a key resource for students
Qingyang Zhang
Rose E. Wang
Ana T. Ribeiro
Dora Demszky
Susanna Loeb
46
0
0
27 Feb 2025
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
E. Ghaleb
Bulat Khaertdinov
Aslı Özyürek
Raquel Fernández
41
0
0
27 Feb 2025
PlantPal: Leveraging Precision Agriculture Robots to Facilitate Remote Engagement in Urban Gardening
Albin Zeqiri
Julian Britten
Clara Schramm
Pascal Jansen
Michael Rietzler
Enrico Rukzio
79
0
0
26 Feb 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li
Rithesh Kumar
Zeyu Jin
DiffM
98
0
0
21 Feb 2025
Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models
A. Benazir
Felix Xiaozhu Lin
47
0
0
29 Jan 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Haorui He
Zengqiang Shang
Chaoren Wang
Xuyuan Li
Yicheng Gu
...
Peiyang Shi
Yansen Wang
Kai Chen
Pengyuan Zhang
Zhikai Wu
AuLLM
64
4
0
28 Jan 2025
Applications of Artificial Intelligence for Cross-language Intelligibility Assessment of Dysarthric Speech
Eunjung Yeo
J. Liss
Visar Berisha
David Mortensen
42
0
0
27 Jan 2025
Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data
Jiajie Li
Brian R Quaranto
Chenhui Xu
Ishan Mishra
Ruiyang Qin
Dancheng Liu
Peter C W Kim
Jinjun Xiong
94
0
0
25 Jan 2025
Integrating Pause Information with Word Embeddings in Language Models for Alzheimer's Disease Detection from Spontaneous Speech
Yu Pu
Wei-Qiang Zhang
43
0
0
12 Jan 2025
Towards an optimised evaluation of teachers' discourse: The case of engaging messages
Samuel Falcon
Jaime Leon
75
1
0
18 Dec 2024
Generative Emotion Cause Explanation in Multimodal Conversations
Lin Wang
Xiaocui Yang
Shi Feng
Daling Wang
Yifei Zhang
39
0
0
01 Nov 2024
Moonshine: Speech Recognition for Live Transcription and Voice Commands
Nat Jeffries
Evan King
M. Kudlur
Guy Nicholson
James Wang
Pete Warden
39
5
0
21 Oct 2024
BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
Juntao Li
Zhenxi Song
Jiaqi Wang
Min Zhang
Honghai Liu
Min Zhang
Zhiguo Zhang
31
1
0
19 Oct 2024
ScreenWriter: Automatic Screenplay Generation and Movie Summarisation
Louis Mahon
Mirella Lapata
28
2
0
17 Oct 2024
Characterizing the MrDeepFakes Sexual Deepfake Marketplace
Catherine Han
Anne Li
Deepak Kumar
Zakir Durumeric
32
1
0
14 Oct 2024
Character-aware audio-visual subtitling in context
Jaesung Huh
Andrew Zisserman
41
0
0
14 Oct 2024
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Yingqiang Gao
Lukas Fischer
Alexa Lintner
Sarah Ebling
36
0
0
11 Oct 2024
MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
Anna Deichler
Jim O'Regan
Jonas Beskow
29
0
0
30 Sep 2024
Word-wise intonation model for cross-language TTS systems
Tomilov A. A.
Gromova A. Y.
Svischev A. N.
34
0
0
30 Sep 2024
MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
Gia-Bao Dinh Ho
Chang Wei Tan
Zahra Zamanzadeh Darban
Mahsa Salehi
Gholamreza Haffari
Wray L. Buntine
18
0
0
23 Sep 2024
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper
Iuliia Thorbecke
Juan Zuluaga-Gomez
Esaú Villatoro-Tello
Shashi Kumar
Pradeep Rangappa
Sergio Burdisso
P. Motlícek
Karthik Pandia
A. Ganapathiraju
36
0
0
20 Sep 2024
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
Jee-weon Jung
Yihan Wu
Xin Wang
Ji-Hoon Kim
Soumi Maiti
...
Joon Son Chung
Wangyou Zhang
Seyun Um
Shinnosuke Takamichi
Shinji Watanabe
65
1
0
18 Sep 2024
Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks
Eunice Akani
Benoît Favre
Frederic Bechet
Romain Gemignani
26
0
0
16 Sep 2024
Text-To-Speech Synthesis In The Wild
Jee-weon Jung
Wangyou Zhang
Soumi Maiti
Yihan Wu
Xin Wang
...
Hye-jin Shim
Nicholas W. D. Evans
Joon Son Chung
Shinnosuke Takamichi
Shinji Watanabe
41
1
0
13 Sep 2024
Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment
Tien-Hong Lo
Meng-Ting Tsai
Berlin Chen
32
0
0
11 Sep 2024
A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation
Rodrigo Lima
S. Leal
Arnaldo Candido Junior
S. Aluísio
16
0
0
10 Sep 2024
Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement
Marius Funk
Shogo Okada
Elisabeth André
34
0
0
09 Sep 2024
Focus Agent: LLM-Powered Virtual Focus Group
Taiyu Zhang
Xuesong Zhang
Robbe Cools
Adalberto L. Simeone
LLMAG
29
1
0
03 Sep 2024
Measuring the Accuracy of Automatic Speech Recognition Solutions
Korbinian Kuhn
Verena Kersken
Benedikt Reuter
Niklas Egger
Gottfried Zimmermann
27
19
0
29 Aug 2024
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
Xuanru Zhou
Anshul Kashyap
Steve Li
Ayati Sharma
Brittany Morin
...
Z. Ezzes
Zachary Miller
M. G. Tempini
Jiachen Lian
Gopala Krishna Anumanchipalli
29
6
0
27 Aug 2024
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
Jiajie Li
Garrett C Skinner
Gene Yang
Brian R Quaranto
Steven D. Schwaitzberg
Peter C W Kim
Jinjun Xiong
38
10
0
15 Aug 2024