EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

26 September 2024
Kai Chen
Yunhao Gou
Runhui Huang
Zhili Liu
Daxin Tan
Jing Xu
Chunwei Wang
Yi Zhu
Yihan Zeng
Kuo Yang
Dingdong Wang
Kun Xiang
Haoyuan Li
Haoli Bai
Jianhua Han
Xiaohui Li
Weike Jin
Nian Xie
Yu Zhang
James T. Kwok
Hengshuang Zhao
Xiaodan Liang
Dit-Yan Yeung
Xiao Chen
Zhenguo Li
Wei Zhang
Qun Liu
Jun Yao
Lanqing Hong
Lu Hou
Hang Xu
    AuLLM
    MLLM
    VLM
ArXiv · PDF · HTML
Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited, or entirely absent, vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible speech style control, including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks, while also supporting omni-modal spoken dialogue with vivid emotions.
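The abstract names two key ingredients: discrete speech units produced by a speech tokenizer and shared with the LLM vocabulary, and a lightweight style module that steers emotion and pitch. The following is a minimal PyTorch sketch of that data flow, assuming hypothetical module names (OmniModalToy, SpeechStyleModule), vocabulary sizes, and dimensions; it is an illustration of the idea, not the authors' released implementation.

# Toy sketch of an EMOVA-style omni-modal pipeline (hypothetical names and sizes).
# It shows (1) text and discrete speech units sharing one token embedding and
# (2) a lightweight style embedding that would condition speech synthesis.
import torch
import torch.nn as nn


class SpeechStyleModule(nn.Module):
    """Maps a discrete style label (e.g. 'happy', 'high-pitch') to a vector
    that conditions speech generation, without touching the LLM weights."""

    def __init__(self, num_styles: int, dim: int):
        super().__init__()
        self.table = nn.Embedding(num_styles, dim)

    def forward(self, style_id: torch.Tensor) -> torch.Tensor:
        return self.table(style_id)


class OmniModalToy(nn.Module):
    """Toy stand-in for the full model: a shared embedding covers text tokens
    and semantic speech units; vision features are projected into the same
    space and prepended to the token sequence."""

    def __init__(self, vocab: int = 32000, speech_units: int = 1024,
                 vision_dim: int = 768, dim: int = 512, num_styles: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab + speech_units, dim)   # text + speech-unit ids
        self.vision_proj = nn.Linear(vision_dim, dim)           # image features -> LLM space
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab + speech_units)     # predicts text or speech units
        self.style = SpeechStyleModule(num_styles, dim)

    def forward(self, tokens, vision_feats, style_id):
        x = torch.cat([self.vision_proj(vision_feats), self.embed(tokens)], dim=1)
        h = self.backbone(x)
        logits = self.lm_head(h)            # next-token distribution over text + speech units
        style_vec = self.style(style_id)    # would condition a unit-to-waveform decoder
        return logits, style_vec


if __name__ == "__main__":
    model = OmniModalToy()
    tokens = torch.randint(0, 33024, (1, 16))   # mixed text / speech-unit ids
    vision = torch.randn(1, 4, 768)             # e.g. 4 pooled image patch features
    logits, style_vec = model(tokens, vision, torch.tensor([3]))
    print(logits.shape, style_vec.shape)        # torch.Size([1, 20, 33024]) torch.Size([1, 512])

In the full system the predicted speech units would drive a unit-to-waveform decoder conditioned on the style vector; the sketch only traces how the pieces connect.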

View on arXiv
@article{chen2025_2409.18042,
  title={EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions},
  author={Kai Chen and Yunhao Gou and Runhui Huang and Zhili Liu and Daxin Tan and Jing Xu and Chunwei Wang and Yi Zhu and Yihan Zeng and Kuo Yang and Dingdong Wang and Kun Xiang and Haoyuan Li and Haoli Bai and Jianhua Han and Xiaohui Li and Weike Jin and Nian Xie and Yu Zhang and James T. Kwok and Hengshuang Zhao and Xiaodan Liang and Dit-Yan Yeung and Xiao Chen and Zhenguo Li and Wei Zhang and Qun Liu and Jun Yao and Lanqing Hong and Lu Hou and Hang Xu},
  journal={arXiv preprint arXiv:2409.18042},
  year={2025}
}