Mozualization: Crafting Music and Visual Representation with Multimodal AI

In this work, we introduce Mozualization, a music generation and editing tool that creates music blending multiple styles by integrating diverse inputs such as keywords, images, and sound clips (e.g., segments from various pieces of music, or even a playful cat's meow). Our work is inspired by the ways people express their emotions -- writing mood-descriptive poems or articles, creating drawings with warm or cool tones, or listening to sad or uplifting music. Building on this idea, we developed a tool that transforms these emotional expressions into a cohesive and expressive song, allowing users to seamlessly incorporate their own preferences and inspirations. To evaluate the tool and, more importantly, to gather insights for its improvement, we conducted a user study with nine music enthusiasts. The study assessed user experience, engagement, and the impact of interacting with and listening to the generated music.
@article{xu2025_2504.13891,
  title={Mozualization: Crafting Music and Visual Representation with Multimodal AI},
  author={Wanfang Xu and Lixiang Zhao and Haiwen Song and Xinheng Song and Zhaolin Lu and Yu Liu and Min Chen and Eng Gee Lim and Lingyun Yu},
  journal={arXiv preprint arXiv:2504.13891},
  year={2025}
}