PAMP: A unified framework boosting low resource automatic speech recognition

5 February 2023

Zeping Min

Qian Ge

Zhong Li

Weinan E

Main:11 Pages

6 Figures

Bibliography:3 Pages

3 Tables

Abstract

We propose a novel text-to-speech (TTS) data augmentation framework for low-resource automatic speech recognition (ASR) tasks, named phoneme audio mix up (PAMP). The PAMP method is highly interpretable and can incorporate prior knowledge of pronunciation rules. Furthermore, PAMP can be deployed easily in almost any language, especially for low-resource ASR tasks. Extensive experiments demonstrate the effectiveness of PAMP on low-resource ASR tasks: we achieve a 10.84% character error rate (CER) on the Common Voice Cantonese ASR task, a relative improvement of about 30% over the previous state of the art, which was obtained by fine-tuning the wav2vec2 pretrained model.
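
For readers unfamiliar with the two quantities reported above, the following is a minimal sketch of how a character error rate and a relative improvement are conventionally computed. It assumes the standard definition of CER (character-level edit distance divided by reference length); the abstract does not state which scoring tool or text normalization the authors used. The function names and the example strings are hypothetical and serve only as illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which `new` reduces the error rate `baseline`."""
    return (baseline - new) / baseline


# Made-up example (not from the paper): one wrong character out of five.
example_cer = cer("今日天氣好", "今日天汽好")

# Made-up error rates (not from the paper): 20% reduced to 14%.
example_gain = relative_improvement(20.0, 14.0)
```

Under this definition, a relative improvement of about 30% means the new error rate is roughly 0.7 times the baseline error rate. The abstract gives only the resulting CER and the approximate relative gain, so the baseline value itself is not reproduced here.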
