WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

1 March 2023

Papers citing "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio"

50 / 118 papers shown

Title
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals Yuxin Lin Yinglin Zheng Ming Zeng Wangzheng Shi 12 0 0 19 May 2025
Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down Yingzhi Wang Anas Alhmoud Saad Alsahly Muhammad Alqurishi Mirco Ravanelli 14 0 0 19 May 2025
KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 Sai Koneru Maike Züfle Thai-Binh Nguyen Seymanur Akti Jan Niehues Alexander Waibel 12 0 0 19 May 2025
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models Keunwoo Peter Yu Joyce Chai MLLM VLM 12 0 0 16 May 2025
BLAB: Brutally Long Audio Bench Orevaoghene Ahia Martijn Bartelds Kabir Ahuja Hila Gonen Valentin Hofmann ... Noah Bennett Shinji Watanabe Noah A. Smith Yulia Tsvetkov Sachin Kumar AuLLM LM&MA VLM 63 0 0 05 May 2025
Generating Narrated Lecture Videos from Slides with Synchronized Highlights Alexander Holmberg 27 0 0 05 May 2025
Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion Xingqun Qi Yatian Wang Hengyuan Zhang J. Pan Wei Xue Shanghang Zhang Wenhan Luo Qifeng Liu Yike Guo SLR 66 0 0 03 May 2025
Versatile Framework for Song Generation with Prompt-based Control Wenjie Qu Wenxiang Guo Changhao Pan Zehan Zhu Ruiqi Li ... Rongjie Huang Ruiyuan Zhang Zhiqing Hong Ziyue Jiang Zhou Zhao 77 1 0 27 Apr 2025
Acquisition of high-quality images for camera calibration in robotics applications via speech prompts Timm Linder Kadir Yilmaz David B. Adrian Bastian Leibe 31 0 0 15 Apr 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models Kim Sung-Bin Jeongsoo Choi Puyuan Peng Joon Son Chung Tae-Hyun Oh David Harwath VGen 47 1 0 03 Apr 2025
Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies Soumyya Kanti Datta Shan Jia Siwei Lyu 44 0 0 02 Apr 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs Lucas Ventura Antoine Yang Cordelia Schmid Gül Varol 39 0 0 31 Mar 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning Yubo Zhang Pedro Botelho Trevor Gordon Gil Zussman I. Kadota 55 0 0 31 Mar 2025
Understanding Co-speech Gestures in-the-wild Sindhu B. Hegde KR Prajwal Taein Kwon Andrew Zisserman SLR 57 0 0 28 Mar 2025
VideoMix: Aggregating How-To Videos for Task-Oriented Learning Saelyne Yang Anh Truong Juho Kim Dingzeyu Li 45 1 0 27 Mar 2025
Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication Yiwen Xu Monideep Chakraborti Tianyi Zhang Katelyn Eng Aanchan Mohan Mirjana Prpa AuLLM 42 0 0 21 Mar 2025
Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning Yi He Lei Yang Shilin Wang 58 0 0 05 Mar 2025
Parameter-free Video Segmentation for Vision and Language Understanding Louis Mahon Mirella Lapata VLM 41 2 0 03 Mar 2025
LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation Keisuke Kamahori Jungo Kasai Noriyuki Kojima Baris Kasikci 37 0 0 27 Feb 2025
Educator Attention: How computational tools can systematically identify the distribution of a key resource for students Qingyang Zhang Rose E. Wang Ana T. Ribeiro Dora Demszky Susanna Loeb 46 0 0 27 Feb 2025
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue E. Ghaleb Bulat Khaertdinov Aslı Özyürek Raquel Fernández 41 0 0 27 Feb 2025
PlantPal: Leveraging Precision Agriculture Robots to Facilitate Remote Engagement in Urban Gardening Albin Zeqiri Julian Britten Clara Schramm Pascal Jansen Michael Rietzler Enrico Rukzio 79 0 0 26 Feb 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis Yinghao Aaron Li Rithesh Kumar Zeyu Jin DiffM 98 0 0 21 Feb 2025
Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models A. Benazir Felix Xiaozhu Lin 47 0 0 29 Jan 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Haorui He Zengqiang Shang Chaoren Wang Xuyuan Li Yicheng Gu ... Peiyang Shi Yansen Wang Kai Chen Pengyuan Zhang Zhikai Wu AuLLM 64 4 0 28 Jan 2025
Applications of Artificial Intelligence for Cross-language Intelligibility Assessment of Dysarthric Speech Eunjung Yeo J. Liss Visar Berisha David Mortensen 42 0 0 27 Jan 2025
Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data Jiajie Li Brian R Quaranto Chenhui Xu Ishan Mishra Ruiyang Qin Dancheng Liu Peter C W Kim Jinjun Xiong 94 0 0 25 Jan 2025
Integrating Pause Information with Word Embeddings in Language Models for Alzheimer's Disease Detection from Spontaneous Speech Yu Pu Wei-Qiang Zhang 43 0 0 12 Jan 2025
Towards an optimised evaluation of teachers' discourse: The case of engaging messages Samuel Falcon Jaime Leon 75 1 0 18 Dec 2024
Generative Emotion Cause Explanation in Multimodal Conversations Lin Wang Xiaocui Yang Shi Feng Daling Wang Yifei Zhang 39 0 0 01 Nov 2024
Moonshine: Speech Recognition for Live Transcription and Voice Commands Nat Jeffries Evan King M. Kudlur Guy Nicholson James Wang Pete Warden 39 5 0 21 Oct 2024
BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation Juntao Li Zhenxi Song Jiaqi Wang Min Zhang Honghai Liu Min Zhang Zhiguo Zhang 31 1 0 19 Oct 2024
ScreenWriter: Automatic Screenplay Generation and Movie Summarisation Louis Mahon Mirella Lapata 28 2 0 17 Oct 2024
Characterizing the MrDeepFakes Sexual Deepfake Marketplace Catherine Han Anne Li Deepak Kumar Zakir Durumeric 32 1 0 14 Oct 2024
Character-aware audio-visual subtitling in context Jaesung Huh Andrew Zisserman 41 0 0 14 Oct 2024
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies Yingqiang Gao Lukas Fischer Alexa Lintner Sarah Ebling 36 0 0 11 Oct 2024
MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans Anna Deichler Jim O'Regan Jonas Beskow 29 0 0 30 Sep 2024
Word-wise intonation model for cross-language TTS systems Tomilov A. A. Gromova A. Y. Svischev A. N. 34 0 0 30 Sep 2024
MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations Gia-Bao Dinh Ho Chang Wei Tan Zahra Zamanzadeh Darban Mahsa Salehi Gholamreza Haffari Wray L. Buntine 18 0 0 23 Sep 2024
Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper Iuliia Thorbecke Juan Zuluaga-Gomez Esaú Villatoro-Tello Shashi Kumar Pradeep Rangappa Sergio Burdisso P. Motlícek Karthik Pandia A. Ganapathiraju 36 0 0 20 Sep 2024
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild Jee-weon Jung Yihan Wu Xin Wang Ji-Hoon Kim Soumi Maiti ... Joon Son Chung Wangyou Zhang Seyun Um Shinnosuke Takamichi Shinji Watanabe 65 1 0 18 Sep 2024
Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks Eunice Akani Benoît Favre Frederic Bechet Romain Gemignani 26 0 0 16 Sep 2024
Text-To-Speech Synthesis In The Wild Jee-weon Jung Wangyou Zhang Soumi Maiti Yihan Wu Xin Wang ... Hye-jin Shim Nicholas W. D. Evans Joon Son Chung Shinnosuke Takamichi Shinji Watanabe 41 1 0 13 Sep 2024
Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment Tien-Hong Lo Meng-Ting Tsai Berlin Chen 32 0 0 11 Sep 2024
A Large Dataset of Spontaneous Speech with the Accent Spoken in São Paulo for Automatic Speech Recognition Evaluation Rodrigo Lima S. Leal Arnaldo Candido Junior S. Aluísio 16 0 0 10 Sep 2024
Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement Marius Funk Shogo Okada Elisabeth André 34 0 0 09 Sep 2024
Focus Agent: LLM-Powered Virtual Focus Group Taiyu Zhang Xuesong Zhang Robbe Cools Adalberto L. Simeone LLMAG 29 1 0 03 Sep 2024
Measuring the Accuracy of Automatic Speech Recognition Solutions Korbinian Kuhn Verena Kersken Benedikt Reuter Niklas Egger Gottfried Zimmermann 27 19 0 29 Aug 2024
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection Xuanru Zhou Anshul Kashyap Steve Li Ayati Sharma Brittany Morin ... Z. Ezzes Zachary Miller M. G. Tempini Jiachen Lian Gopala Krishna Anumanchipalli 29 6 0 27 Aug 2024
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning Jiajie Li Garrett C Skinner Gene Yang Brian R Quaranto Steven D. Schwaitzberg Peter C W Kim Jinjun Xiong 38 10 0 15 Aug 2024