Indexing Portuguese NLP Resources with PT-Pump-Up
Abstract
The recent advances in natural language processing (NLP) are linked to training processes that require vast amounts of corpora. Access to this data is commonly not a trivial process due to resource dispersion and the need to maintain these infrastructures online and up-to-date. New developments in NLP are often compromised due to the scarcity of data or lack of a shared repository that works as an entry point to the community. This is especially true in low and mid-resource languages, such as Portuguese, which lack data and proper resource management infrastructures. In this work, we propose PT-Pump-Up, a set of tools that aim to reduce resource dispersion and improve the accessibility to Portuguese NLP resources. Our proposal is divided into four software components: a) a web platform to list the available resources; b) a client-side Python package to simplify the loading of Portuguese NLP resources; c) an administrative Python package to manage the platform and d) a public GitHub repository to foster future collaboration and contributions. All four components are accessible using: https://linktr.ee/pt_pump_up
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2024
- DOI:
- 10.48550/arXiv.2401.15400
- arXiv:
- arXiv:2401.15400
- Bibcode:
- 2024arXiv240115400A
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Information Retrieval;
- 68P20;
- I.7.1
- E-Print:
- Demo Track, 3 pages