PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

26 April 2021
arXiv:2104.12369
Wei Zeng
Xiaozhe Ren
Teng Su
Hui Wang
Yi-Lun Liao
Zhiwei Wang
Xin Jiang
ZhenZhang Yang
Kaisheng Wang
Xiaoda Zhang
Chen Li
Ziyan Gong
Yifan Yao
Xinjing Huang
Jun Wang
Jianfeng Yu
Qiwei Guo
Yue Yu
Yan Zhang
Jin Wang
Heng Tao
Dasen Yan
Z. Yi
Fang Peng
Fan Jiang
Han Zhang
Lingfeng Deng
Yehong Zhang
Zhengping Lin
Chao Zhang
Shaojie Zhang
Mingyue Guo
Shanzhi Gu
Gaojun Fan
Yaowei Wang
Xuefeng Jin
Qun Liu
Yonghong Tian
Abstract

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice of training large-scale autoregressive language models named PanGu-α, with up to 200 billion parameters. PanGu-α is developed under MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-α, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-α in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings.
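
To make the composed parallelism strategy concrete, below is a minimal Python sketch of how several parallelism dimensions could multiply out to the 2048-device cluster mentioned in the abstract. The per-dimension degrees (32 x 8 x 8) are hypothetical illustrations, not values from the paper; the abstract only states that data, op-level model, pipeline, and optimizer model parallelism plus rematerialization are combined.

```python
# Illustrative sketch (assumed numbers, not the paper's configuration):
# composing parallelism dimensions to cover a 2048-device cluster.

TOTAL_DEVICES = 2048      # Ascend 910 processors reported in the abstract
TOTAL_PARAMS = 200e9      # ~200B parameters for the largest PanGu-alpha variant

# Hypothetical per-dimension degrees whose product equals the cluster size.
data_parallel = 32        # replicas, each processing a different slice of the batch
op_model_parallel = 8     # each layer's weight matrices sharded across devices
pipeline_parallel = 8     # layers split into sequential pipeline stages

assert data_parallel * op_model_parallel * pipeline_parallel == TOTAL_DEVICES

# The first three dimensions shard weights and activations across the mesh.
# Optimizer model parallelism additionally shards optimizer states (e.g. Adam
# moments) across the data-parallel replicas instead of replicating them, and
# rematerialization recomputes activations during the backward pass to trade
# extra compute for memory.
params_per_device = TOTAL_PARAMS / (op_model_parallel * pipeline_parallel)
print(f"Weights held per device (before optimizer-state sharding): "
      f"{params_per_device / 1e9:.1f}B parameters")
```

Under these assumed degrees, each device would hold roughly 3.1B of the 200B weights, which is why the memory-oriented dimensions (optimizer sharding and rematerialization) matter in addition to the three mesh dimensions.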
