In high-noise environments such as factories, subways, and busy streets, capturing clear speech is challenging. Throat microphones can offer a solution because of their inherent noise-suppression capabilities; however, the passage of sound waves through skin and tissue attenuates high-frequency information, reducing speech clarity. Recent deep learning approaches have shown promise in enhancing throat microphone recordings, but further progress is constrained by the lack of a standard dataset. Here, we introduce the Throat and Acoustic Paired Speech (TAPS) dataset, a collection of paired utterances recorded from 60 native Korean speakers using throat and acoustic microphones. Furthermore, an optimal alignment approach was developed and applied to address the inherent signal mismatch between the two microphones. We tested three baseline deep learning models on the TAPS dataset and found mapping-based approaches to be superior for improving speech quality and restoring content. These findings demonstrate the TAPS dataset's utility for speech enhancement tasks and support its potential as a standard resource for advancing research in throat microphone-based applications.
View on arXiv@article{kim2025_2502.11478, title={ TAPS: Throat and Acoustic Paired Speech Dataset for Deep Learning-Based Speech Enhancement }, author={ Yunsik Kim and Yonghun Song and Yoonyoung Chung }, journal={arXiv preprint arXiv:2502.11478}, year={ 2025 } }