OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

8 January 2025
Run Luo, Ting-En Lin, Jun Wang, Yuchuan Wu, Xiong Liu, Min Yang, Heng Chang, Longze Chen, Jiaming Li, Lei Zhang, Yushen Chen, Hamid Alinejad-Rokny, Fei Huang
Topics: AuLLM, VLM
Abstract

Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech and outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, enabling real-time emotional speech synthesis with high fidelity. Experiments show that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5x fewer training samples and a smaller model size (7B vs. 7x8B). Additionally, OpenOmni achieves real-time speech generation with under 1 s latency in non-autoregressive mode, reducing inference time by 5x compared to autoregressive methods, and improves emotion classification accuracy by 7.7%.
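The speech-generation phase above trains the lightweight decoder with direct preference optimization (DPO). As background on that technique, the following is a minimal PyTorch sketch of the standard DPO objective (Rafailov et al., 2023). The tensor names and the framing around preferred/dispreferred speech-token sequences are illustrative assumptions, not the paper's actual code.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Each input is a batch of summed log-probabilities that the trainable
    # policy (or the frozen reference model) assigns to a preferred
    # ("chosen") or dispreferred ("rejected") output sequence. Applying
    # this to speech-token sequences is our assumption; DPO is generic.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: push the policy's margin between preferred
    # and dispreferred responses, scaled by beta, through a log-sigmoid.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

Optimizing this preference signal, rather than maximum likelihood alone, is what the abstract credits for high-fidelity emotional synthesis from a small decoder.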

@article{luo2025_2501.04561,
  title={OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis},
  author={Run Luo and Ting-En Lin and Haonan Zhang and Yuchuan Wu and Xiong Liu and Min Yang and Yongbin Li and Longze Chen and Jiaming Li and Lei Zhang and Yangyi Chen and Hamid Alinejad-Rokny and Fei Huang},
  journal={arXiv preprint arXiv:2501.04561},
  year={2025}
}