Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens

26 October 2019

Papers citing "Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens"

50 / 78 papers shown

Text-Driven Voice Conversion via Latent State-Space Modeling Wen Li Sofia Martinez Priyanka Shah 53 0 0 26 Mar 2025
EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector Deok-Hyeon Cho Hyung-Seok Oh Seung-Bin Kim Seong-Whan Lee 46 4 0 04 Nov 2024
Disentangling segmental and prosodic factors to non-native speech comprehensibility Waris Quamer Ricardo Gutierrez-Osuna 45 1 0 20 Aug 2024
Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy Ravi Shankar Archana Venkataraman 44 0 0 04 Aug 2024
DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability Hyun Joon Park Jin Sob Kim Wooseok Shin Sung Won Han DiffM 41 2 0 27 Jun 2024
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment Paarth Neekhara Shehzeen Samarah Hussain Subhankar Ghosh Jason Chun Lok Li Rafael Valle Rohan Badlani Boris Ginsburg 58 11 0 25 Jun 2024
Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations Yejin Jeon Yunsu Kim Gary Geunbae Lee 40 2 0 04 Jan 2024
Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning Rishabh Jain Peter Corcoran 28 0 0 07 Nov 2023
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning Tao Li Zhichao Wang Xinfa Zhu Jian Cong Qiao Tian Yuping Wang Lei Xie DiffM 35 3 0 06 Oct 2023
Improving severity preservation of healthy-to-pathological voice conversion with global style tokens B. Halpern Wen-Chin Huang Lester Phillip Violeta R.J.J.H. van Son T. Toda 35 2 0 04 Oct 2023
A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis X. Wei Jia Jia Xiang Li Zhiyong Wu Ziyi Wang 23 1 0 21 Sep 2023
PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion Yimin Deng Huaizhen Tang Xulong Zhang Jianzong Wang Ning Cheng Jing Xiao 29 7 0 21 Aug 2023
Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis Erik Ekstedt Siyang Wang Éva Székely Joakim Gustafson Gabriel Skantze 28 6 0 29 May 2023
Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS Sewade Ogun Vincent Colotte Emmanuel Vincent DiffM 35 4 0 28 May 2023
Controllable Speaking Styles Using a Large Language Model A. Sigurgeirsson Simon King 25 2 0 17 May 2023
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model Kenichi Fujita Takanori Ashihara Hiroki Kanagawa Takafumi Moriya Yusuke Ijima 46 10 0 24 Apr 2023
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models Yinghao Aaron Li Cong Han N. Mesgarani 24 18 0 29 Dec 2022
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis Yinjiao Lei Shan Yang Xinsheng Wang Qicong Xie Jixun Yao Linfu Xie Dan Su DiffM 21 8 0 03 Dec 2022
Controllable speech synthesis by learning discrete phoneme-level prosodic representations Nikolaos Ellinas Myrsini Christidou Alexandra Vioni June Sig Sung Aimilios Chalamandaris Pirros Tsiakoulis P. Mastorocostas 25 7 0 29 Nov 2022
Can we use Common Voice to train a Multi-Speaker TTS system? Sewade Ogun Vincent Colotte Emmanuel Vincent 27 10 0 12 Oct 2022
SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection Piotr Kawa Marcin Plata P. Syga 37 14 0 12 Oct 2022
The Role of Vocal Persona in Natural and Synthesized Speech Camille Noufi Lloyd May J. Berger 27 2 0 06 Sep 2022
Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation Giulia Comini Goeric Huybrechts M. Ribeiro Adam Gabryś Jaime Lorenzo-Trueba 35 5 0 29 Jul 2022
Controllable Data Generation by Deep Learning: A Review Shiyu Wang Yuanqi Du Xiaojie Guo Bo Pan Zhaohui Qin Liang Zhao 33 28 0 19 Jul 2022
NatiQ: An End-to-end Text-to-Speech System for Arabic Ahmed Abdelali Nadir Durrani C. Demiroğlu Fahim Dalvi Hamdy Mubarak Kareem Darwish 23 14 0 15 Jun 2022
StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis Yinghao Aaron Li Cong Han N. Mesgarani 42 38 0 30 May 2022
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Rongjie Huang Yi Ren Jinglin Liu Chenye Cui Zhou Zhao OODD VLM 117 34 0 15 May 2022
Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation Ryo Terashima Ryuichi Yamamoto Eunwoo Song Yuma Shirahata Hyun-Wook Yoon Jae-Min Kim Kentaro Tachibana 11 15 0 21 Apr 2022
Karaoker: Alignment-free singing voice synthesis with speech training data Panos Kakoulidis Nikolaos Ellinas G. Vamvoukakis K. Markopoulos June Sig Sung Gunu Jho Pirros Tsiakoulis Aimilios Chalamandaris 12 3 0 08 Apr 2022
Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis Fan Wang Po-Chun Hsu Da-Rong Liu Hung-yi Lee 18 0 0 01 Apr 2022
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy Shuai Guo Jiatong Shi Tao Qian Shinji Watanabe Qin Jin 33 13 0 31 Mar 2022
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher Heyang Xue Xinsheng Wang Yongmao Zhang Lei Xie Pengcheng Zhu Mengxiao Bi DiffM 33 11 0 30 Mar 2022
A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis Rishabh Jain Mariam Yiwere Dan Bigioi Peter Corcoran H. Cucu 27 14 0 22 Mar 2022
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows Kevin J. Shih Rafael Valle Rohan Badlani J. F. Santos Bryan Catanzaro 36 4 0 03 Mar 2022
MuSE-SVS: Multi-Singer Emotional Singing Voice Synthesizer that Controls Emotional Intensity Sungjae Kim Y.E. Kim Jewoo Jun Injung Kim 31 14 0 02 Mar 2022
Cross-speaker style transfer for text-to-speech using data augmentation M. Ribeiro Julian Roth Giulia Comini Goeric Huybrechts Adam Gabryś Jaime Lorenzo-Trueba 19 21 0 10 Feb 2022
V2C: Visual Voice Cloning Qi Chen Yuanqing Li Yuankai Qi Jiaqiu Zhou Mingkui Tan Qi Wu VGen 33 23 0 25 Nov 2021
Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control Myrsini Christidou Alexandra Vioni Nikolaos Ellinas G. Vamvoukakis K. Markopoulos Panos Kakoulidis June Sig Sung Hyoungmin Park Aimilios Chalamandaris Pirros Tsiakoulis 21 4 0 19 Nov 2021
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control K. Markopoulos Nikolaos Ellinas Alexandra Vioni Myrsini Christidou Panos Kakoulidis ... Georgia Maniati June Sig Sung Hyoungmin Park Pirros Tsiakoulis Aimilios Chalamandaris 11 2 0 17 Nov 2021
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses Shengyuan Xu Wenxiao Zhao Jing Guo 24 12 0 01 Nov 2021
Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning Shijun Wang Dimche Kostadinov Damian Borth 29 11 0 27 Oct 2021
Adapting TTS models For New Speakers using Transfer Learning Paarth Neekhara Jason Chun Lok Li Boris Ginsburg 38 15 0 12 Oct 2021
Pitch Preservation In Singing Voice Synthesis Shujun Liu Hai Zhu Kun Wang Huajun Wang 28 0 0 11 Oct 2021
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models Jen-Hao Rick Chang A. Shrivastava H. Koppula Xiaoshuai Zhang Oncel Tuzel DiffM 51 16 0 06 Oct 2021
GANtron: Emotional Speech Synthesis with Generative Adversarial Networks E. Hortal Rodrigo Brechard Alarcia GAN 26 2 0 06 Oct 2021
Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis Tao Li Xinsheng Wang Qicong Xie Zhichao Wang Linfu Xie 26 42 0 14 Sep 2021
Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis Julian Zaïdi Hugo Seuté Benjamin van Niekerk M. Carbonneau 34 20 0 04 Aug 2021
SurpriseNet: Melody Harmonization Conditioning on User-controlled Surprise Contours Yi-Wei Chen Hung-Shin Lee Yen-Hsing Chen Hsin-Min Wang 24 17 0 01 Aug 2021
Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations Seyun Um Jihyun Kim Jihyun Lee Hong-Goo Kang CVBM 13 4 0 26 Jul 2021
A Deep-Bayesian Framework for Adaptive Speech Duration Modification Ravi Shankar A. Venkataraman 26 0 0 11 Jul 2021