ResearchTrend.AI
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

1 March 2023
Max Bain
Jaesung Huh
Tengda Han
Andrew Zisserman

Papers citing "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio"

50 / 120 papers shown
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection
Xuanru Zhou
Anshul Kashyap
Steve Li
Ayati Sharma
Brittany Morin
...
Z. Ezzes
Zachary Miller
M. G. Tempini
Jiachen Lian
Gopala Krishna Anumanchipalli
32
6
0
27 Aug 2024
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning
Jiajie Li
Garrett C Skinner
Gene Yang
Brian R Quaranto
Steven D. Schwaitzberg
Peter C W Kim
Jinjun Xiong
38
10
0
15 Aug 2024
An Investigation Into Explainable Audio Hate Speech Detection
Jinmyeong An
Wonjun Lee
Yejin Jeon
Jungseul Ok
Yunsu Kim
Gary Geunbae Lee
33
2
0
12 Aug 2024
MMSummary: Multimodal Summary Generation for Fetal Ultrasound Video
Xiaoqing Guo
Qianhui Men
J. A. Noble
48
0
0
07 Aug 2024
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
49
5
0
31 Jul 2024
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
41
8
0
22 Jul 2024
dMel: Speech Tokenization made Simple
Richard He Bai
Tatiana Likhomanenko
Ruixiang Zhang
Zijin Gu
Zakaria Aldeneh
Navdeep Jaitly
43
4
0
22 Jul 2024
DISCOVER: A Data-driven Interactive System for Comprehensive Observation, Visualization, and ExploRation of Human Behaviour
Dominik Schiller
Tobias Hallmen
Daksitha Senel Withanage Don
Elisabeth André
Tobias Baur
26
3
0
18 Jul 2024
Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
Haorui He
Zengqiang Shang
Chaoren Wang
Xuyuan Li
Yicheng Gu
...
Peiyang Shi
Yuancheng Wang
Kai Chen
Pengyuan Zhang
Zhizheng Wu
38
37
0
07 Jul 2024
VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features
Tomoki Koriyama
41
0
0
03 Jul 2024
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
Ruiqi Li
Zhiqing Hong
Yongqi Wang
Lichao Zhang
Rongjie Huang
Siqi Zheng
Zhou Zhao
39
6
0
02 Jul 2024
MatchTime: Towards Automatic Soccer Game Commentary Generation
Jiayuan Rao
Haoning Wu
Chang-rui Liu
Yanfeng Wang
Weidi Xie
43
7
0
26 Jun 2024
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment
Maurice Gerczuk
Shahin Amiriparian
Justina Lutz
W. Strube
I. Papazova
Alkomiet Hasan
Björn W. Schuller
9
1
0
26 Jun 2024
FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
Dancheng Liu
Jinjun Xiong
33
0
0
25 Jun 2024
Zero-Shot Long-Form Video Understanding through Screenplay
Yongliang Wu
Bozheng Li
Jiawang Cao
Wenbo Zhu
Yi Lu
...
Chuyun Xie
Haolin Zheng
Ziyue Su
Jay Wu
Xu Yang
48
4
0
25 Jun 2024
PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics
Amir Nassereldine
Dancheng Liu
Chenhui Xu
Jinjun Xiong
44
0
0
21 Jun 2024
The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data
Georgios Paraskevopoulos
Chara Tsoukala
Athanasios Katsamanis
V. Katsouros
OffRL
31
0
0
21 Jun 2024
Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses
David Doukhan
Lena Dodson
Manon Conan
Valentin Pelloin
Aurélien Clamouse
Mélina Lepape
Géraldine Van Hille
Cécile Méadel
Marlene Coulomb-Gully
36
0
0
14 Jun 2024
Reading Miscue Detection in Primary School through Automatic Speech Recognition
Lingyun Gao
Cristian Tejedor-García
H. Strik
C. Cucchiarini
32
0
0
11 Jun 2024
LLM-based speaker diarization correction: A generalizable approach
Georgios Efstathiadis
Vijay Yadav
Anzar Abbas
45
3
0
07 Jun 2024
CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild
Xingqun Qi
Hengyuan Zhang
Yatian Wang
J. Pan
Chen Liu
...
Qixun Zhang
Shanghang Zhang
Wenhan Luo
Qifeng Liu
Qi-fei Liu
DiffM
SLR
115
5
0
27 May 2024
Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts
Garrett Tanzer
Gustaf Ahdritz
Luke Melas-Kyriazi
18
1
0
21 May 2024
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
43
40
0
14 May 2024
Alignment Helps Make the Most of Multimodal Data
Christian Arnold
Andreas Küpfer
46
2
0
14 May 2024
HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech
Zhongren Dong
Zixing Zhang
Weixiang Xu
Jing Han
Jianjun Ou
Björn W. Schuller
40
1
0
07 May 2024
Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment
Aditya Chakravarty
25
0
0
02 May 2024
From Keyboard to Chatbot: An AI-powered Integration Platform with Large-Language Models for Teaching Computational Thinking for Young Children
Changjae Lee
Jinjun Xiong
LM&Ro
LRM
16
0
0
01 May 2024
Speech Technology Services for Oral History Research
Christoph Draxler
H. V. D. Heuvel
A. V. Hessen
P. Ircing
Jan Lehecka
38
0
0
26 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
47
20
0
22 Apr 2024
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Ruizhe Huang
Xiaohui Zhang
Zhaoheng Ni
Li Sun
Moto Hira
...
Vineel Pratap
Matthew Wiesner
Shinji Watanabe
Daniel Povey
Sanjeev Khudanpur
27
4
0
22 Apr 2024
Language Proficiency and F0 Entrainment: A Study of L2 English Imitation in Italian, French, and Slovak Speakers
Zheng Yuan
Štefan Beňuš
Alessandro D'Ausilio
23
0
0
16 Apr 2024
Anatomy of Industrial Scale Multilingual ASR
Francis McCann Ramirez
Luka Chkhetiani
Andrew Ehrenberg
R. McHardy
Rami Botros
...
Ahmed Efty
Daniel McCrystal
Sam Flamini
Domenic Donato
Takuya Yoshioka
42
7
0
15 Apr 2024
Scaling Up Video Summarization Pretraining with Large Language Models
Dawit Mureja Argaw
Seunghyun Yoon
Fabian Caba Heilbron
Hanieh Deilamsalehy
Trung Bui
Zhaowen Wang
Franck Dernoncourt
Joon Son Chung
43
9
0
04 Apr 2024
ART: The Alternating Reading Task Corpus for Speech Entrainment and Imitation
Zheng Yuan
D. D. Jong
Štefan Beňuš
Noël Nguyen
Ruitao Feng
Róbert Sabo
Luciano Fadiga
Alessandro D'Ausilio
15
1
0
03 Apr 2024
A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?
Kahyun Choi
Minje Kim
23
0
0
31 Mar 2024
Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation
Sai Akarsh
Vamshi Raghusimha
Anindita Mondal
Anil Vuppala
44
1
0
07 Mar 2024
AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response
Govind Mittal
Arthur Jakobsson
Kelly O. Marshall
Chinmay Hegde
Nasir D. Memon
40
0
0
28 Feb 2024
Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
Ece Takmaz
Sandro Pezzelle
Raquel Fernández
24
1
0
02 Feb 2024
Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
Bruno Korbar
Jaesung Huh
Andrew Zisserman
13
3
0
22 Jan 2024
Towards Hierarchical Spoken Language Dysfluency Modeling
Jiachen Lian
Gopala Anumanchipalli
32
9
0
18 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
42
6
0
08 Jan 2024
Retrieval-Augmented Egocentric Video Captioning
Jilan Xu
Yifei Huang
Junlin Hou
Guo Chen
Yue Zhang
Rui Feng
Weidi Xie
EgoV
54
29
0
01 Jan 2024
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
Maureen de Seyssel
Antony D'Avirro
Adina Williams
Emmanuel Dupoux
32
3
0
21 Dec 2023
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya Zhang
Yanfeng Wang
Weidi Xie
AI4TS
VGen
43
5
0
21 Dec 2023
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection
Jiachen Lian
Carly Feng
Naasir Farooqi
Steve Li
Anshul Kashyap
Cheol Jun Cho
Peter Wu
Robin Netzorg
Tingle Li
Gopala Krishna Anumanchipalli
40
13
0
20 Dec 2023
Seq2seq for Automatic Paraphasia Detection in Aphasic Speech
M. Perez
Duc Le
Amrit Romana
Elise Jones
Keli Licata
E. Provost
28
2
0
16 Dec 2023
WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words
Lukas Wolf
Greta Tuckute
Klemen Kotar
Eghbal Hosseini
Tamar I. Regev
Ethan Gotlieb Wilcox
Alex Warstadt
45
3
0
05 Dec 2023
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang
K. Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
26
28
0
29 Nov 2023
Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation
Yasumasa Kano
Katsuhito Sudoh
Satoshi Nakamura
25
1
0
24 Nov 2023
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe
Rusiru Thushara
Muhammad Maaz
H. Rasheed
Salman Khan
Mubarak Shah
Fahad Khan
VLM
MLLM
32
34
0
22 Nov 2023