A Breadth-First Catalog of Text Processing, Speech Processing and Multimodal Research in South Asian Languages

doi:10.48550/arXiv.2501.00029

Gupta, Pranav

We review the recent literature (January 2022 to October 2024) on text-based language processing, multimodal models, and speech processing in South Asian languages, and provide a spotlight analysis of 21 low-resource South Asian languages: Saraiki, Assamese, Balochi, Bhojpuri, Bodo, Burmese, Chhattisgarhi, Dhivehi, Gujarati, Kannada, Kashmiri, Konkani, Khasi, Malayalam, Meitei, Nepali, Odia, Pashto, Rajasthani, Sindhi, and Telugu. We identify trends, challenges, and future research directions using a stepwise approach that incorporates relevance classification and clustering based on large language models (LLMs). Our goal is to give NLP researchers interested in South Asian languages a breadth-first overview of recent developments in South Asian language technologies.

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2501.00029

arXiv:

arXiv:2501.00029

Bibcode:

2025arXiv250100029G

Keywords:

Computer Science - Computation and Language;
Computer Science - Information Retrieval;
Computer Science - Machine Learning
