
MiMo-Audio: Audio Language Models are Few-Shot Learners

Xiaomi LLM-Core Team
Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang
Main: 24 pages, 3 figures, 10 tables; bibliography: 5 pages; appendix: 2 pages
Abstract

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at this https URL.
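To make the few-shot paradigm concrete, the sketch below shows one common way such prompts are constructed for a base model trained on next-token prediction over discretized audio: a handful of in-context (input, output) audio-token pairs are interleaved, followed by the query, and the model's continuation is decoded back to audio. This is a minimal illustration, not MiMo-Audio's actual API; the separator scheme and all names (`build_fewshot_prompt`, `SEP`, `EOS`) are hypothetical.

```python
# Hypothetical sketch of few-shot prompting for an audio LM.
# The token IDs here stand in for discretized audio; in practice an
# audio tokenizer would produce them and a decoder would invert them.

from typing import List, Tuple

SEP = 0  # assumed separator between the input and output of a pair
EOS = 1  # assumed end-of-pair marker

def build_fewshot_prompt(
    examples: List[Tuple[List[int], List[int]]],
    query: List[int],
) -> List[int]:
    """Interleave (source, target) audio-token pairs, then append the query.

    A sufficiently pretrained model is expected to continue the sequence
    with the target tokens for the query, by analogy with the examples,
    e.g. performing voice conversion without any task-specific fine-tuning.
    """
    prompt: List[int] = []
    for src, tgt in examples:
        prompt += src + [SEP] + tgt + [EOS]
    prompt += query + [SEP]
    return prompt

# Usage with made-up token IDs: two in-context pairs, then a query.
pairs = [([5, 6, 7], [8, 9]), ([10, 11], [12, 13, 14])]
prompt = build_fewshot_prompt(pairs, query=[15, 16, 17])
print(prompt)  # this sequence would be fed to the LM for continuation
```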
