Mozualization: Crafting Music and Visual Representation with Multimodal AI

In this work, we introduce Mozualization, a music generation and editing tool that creates music blending multiple styles by integrating diverse inputs such as keywords, images, and sound clips (e.g., segments from various pieces of music, or even a playful cat's meow). Our work is inspired by the ways people express their emotions -- writing mood-descriptive poems or articles, creating drawings with warm or cool tones, or listening to sad or uplifting music. Building on this idea, we developed a tool that transforms these emotional expressions into a cohesive and expressive song, allowing users to seamlessly incorporate their own preferences and inspirations. To evaluate the tool and, more importantly, to gather insights for its improvement, we conducted a user study with nine music enthusiasts. The study assessed user experience, engagement, and the impact of interacting with and listening to the generated music.
@article{xu2025_2504.13891,
  title={Mozualization: Crafting Music and Visual Representation with Multimodal AI},
  author={Wanfang Xu and Lixiang Zhao and Haiwen Song and Xinheng Song and Zhaolin Lu and Yu Liu and Min Chen and Eng Gee Lim and Lingyun Yu},
  journal={arXiv preprint arXiv:2504.13891},
  year={2025}
}