Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa
Abstract
In this paper we address the problem of code-mixing in resource-poor language settings. We examine data consisting of 182k unique questions generated by users of the MomConnect helpdesk, part of a national-scale public health platform in South Africa. We show evidence of code-switching in approximately 10% of this dataset, a level that is likely to pose challenges for future services. We use a natural language processing library (Polyglot) that supports detection of 196 languages and attempt to evaluate its performance at identifying English, isiZulu and code-mixed questions.
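As an illustration only, the kind of prevalence estimate described in the abstract could be sketched as follows. This is not the authors' pipeline: the 0.9 dominance threshold, the function names, and the word-list `toy_detect` stand-in are all assumptions made for the example. In practice `detect` would wrap a real language identifier such as Polyglot and return a per-language confidence share for each question.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical threshold: a question counts as monolingual only if one
# language holds at least this share of the detector's total confidence.
DOMINANT_SHARE = 0.9


def label_question(shares: Dict[str, float],
                   dominant_share: float = DOMINANT_SHARE) -> str:
    """Label a question 'en', 'zu', 'mixed', or 'other' from language shares."""
    total = sum(shares.values())
    if total <= 0:
        return "other"  # detector found nothing usable
    en = shares.get("en", 0.0) / total
    zu = shares.get("zu", 0.0) / total
    if en >= dominant_share:
        return "en"
    if zu >= dominant_share:
        return "zu"
    if en > 0 and zu > 0:
        return "mixed"  # both languages present, neither dominant
    return "other"


def code_mixing_rate(questions: Iterable[str],
                     detect: Callable[[str], Dict[str, float]]) -> float:
    """Fraction of questions labelled 'mixed' under the given detector."""
    labels = Counter(label_question(detect(q)) for q in questions)
    n = sum(labels.values())
    return labels["mixed"] / n if n else 0.0


# Toy stand-in detector (NOT Polyglot): counts tokens found in tiny word lists.
EN_WORDS = {"when", "is", "my", "next", "clinic", "visit"}
ZU_WORDS = {"sawubona", "ngiyabonga"}


def toy_detect(text: str) -> Dict[str, float]:
    tokens = text.lower().split()
    return {
        "en": float(sum(t in EN_WORDS for t in tokens)),
        "zu": float(sum(t in ZU_WORDS for t in tokens)),
    }


sample = [
    "when is my next clinic visit",
    "sawubona ngiyabonga",
    "sawubona when is my clinic visit",
]
rate = code_mixing_rate(sample, toy_detect)
```

Passing the detector in as a function keeps the labelling rule separate from the language identifier, so the same rule could be re-run with different identifiers or thresholds when comparing against human annotations.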
- Publication: arXiv e-prints
- Pub Date: November 2019
- DOI: 10.48550/arXiv.1911.05636
- arXiv: arXiv:1911.05636
- Bibcode: 2019arXiv191105636O
- Keywords: Computer Science - Computation and Language
- E-Print: 3 pages, Presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World