
Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Haixing Zhao
Huaque Cairang
Suonan Cairang
Rou Te
Lengben Zhaxi
Gazang Zhaxi
Zhonglin Ye
Yuhui Zheng
Chunyan Peng
Secha Jia
Pema Tashi
Cizhen Jiacuo
Pema Dorjee
Hongkai Liu
Pema Yanggon
Tsehang Dorjee
Jiaxin Han
Qiongying Hu
Jilin Man
Huanke You
Yuqi Ren
Duo La
Deyi Xiong
Main: 6 pages · 1 figure · 4 tables · Bibliography: 3 pages · Appendix: 4 pages
Abstract

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continually pre-train and post-train a multilingual base model to obtain Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the model's Tibetan capabilities, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
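The abstract mentions a data cleaning pipeline tailored for Tibetan but does not detail its steps. Below is a minimal sketch of one plausible cleaning step: filtering documents by the proportion of characters in the Tibetan Unicode block (U+0F00–U+0FFF). The threshold and the heuristic itself are illustrative assumptions, not the pipeline described in the paper.

```python
def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tibetan Unicode block (U+0F00-U+0FFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if 0x0F00 <= ord(c) <= 0x0FFF)
    return tibetan / len(chars)


def filter_documents(docs, min_ratio: float = 0.7):
    """Keep documents that are predominantly Tibetan script (hypothetical threshold)."""
    for doc in docs:
        if tibetan_ratio(doc) >= min_ratio:
            yield doc


if __name__ == "__main__":
    sample = ["བོད་སྐད་ནི་སྐད་ཡིག་ཅིག་ཡིན།", "This line is mostly English."]
    print(list(filter_documents(sample)))  # only the Tibetan-script document survives
```

In practice, such a script-ratio filter would be only one stage of a larger pipeline (deduplication, quality scoring, encoding normalization); the sketch above shows the script-level filtering idea only.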
