Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
Abstract
One of the major challenges for developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data among others. Our results demonstrate the efficacy of the model trained on psuedo-label data for the designed test-set along with publicly-available Bangla datasets. The experimental resources will be publicly available.(https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR)
- Publication:
-
arXiv e-prints
- Pub Date:
- November 2023
- DOI:
- 10.48550/arXiv.2311.03196
- arXiv:
- arXiv:2311.03196
- Bibcode:
- 2023arXiv231103196N
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Artificial Intelligence;
- 68T50;
- F.2.2;
- I.2.7
- E-Print:
- Accepted at BLP-2023 (at EMNLP 2023), ASR, low-resource, out-of-distribution, domain-agnostic