Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval

17 June 2025
Ruofan Hu, Yan Xia, Minjie Hong, Jieming Zhu, Bo Chen, Xiaoda Yang, Minghui Fang, Tao Jin
Topic: VLM
Main: 4 pages · 2 figures · 5 tables · Bibliography: 1 page
Abstract

Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap across various modalities. We then propose a single-modality training approach, where the model is trained exclusively on text pairs. Our experiments show that Vela outperforms traditional CLAP models in standard text-audio retrieval tasks. Furthermore, we introduce new benchmarks that expose CLAP models' limitations in handling long texts and complex retrieval tasks. In contrast, Vela, by harnessing the capabilities of MLLMs, demonstrates robust performance in these scenarios. Our code will soon be available.
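
The abstract describes prompt-guided embedding extraction from an MLLM followed by text-audio retrieval. As a rough illustration only, and not the paper's released code, the sketch below shows one common way such embeddings are pooled from a model's last hidden layer and compared by cosine similarity; the prompt string, last-token pooling choice, and embedding dimension are assumptions.

import torch
import torch.nn.functional as F

# Assumed embedding prompt appended to each text or audio-caption input;
# the actual prompt and in-context examples used by Vela are not shown here.
PROMPT = "Summarize the above input in one word:"

def pool_last_token(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Use the hidden state of the last non-padded token as the embedding."""
    last_idx = attention_mask.sum(dim=1) - 1            # index of last real token per row
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last_idx]           # (batch, hidden_dim)

def retrieve(text_emb: torch.Tensor, audio_emb: torch.Tensor, k: int = 5):
    """Rank audio candidates for each text query by cosine similarity."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    scores = text_emb @ audio_emb.T                      # (num_texts, num_audios)
    return scores.topk(k, dim=-1)

# Toy usage with random stand-in embeddings; a real run would obtain these
# from the MLLM's final hidden layer after appending PROMPT to each input.
text_emb = torch.randn(4, 1024)
audio_emb = torch.randn(10, 1024)
values, indices = retrieve(text_emb, audio_emb, k=3)
print(indices)  # top-3 audio indices for each text query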

View on arXiv: https://arxiv.org/abs/2506.14445
@article{hu2025_2506.14445,
  title={Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval},
  author={Ruofan Hu and Yan Xia and Minjie Hong and Jieming Zhu and Bo Chen and Xiaoda Yang and Minghui Fang and Tao Jin},
  journal={arXiv preprint arXiv:2506.14445},
  year={2025}
}