HuSpaCy: an industrial-strength Hungarian natural language processing toolkit
Abstract
Although several open-source language processing pipelines are available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should provide close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition, and word embeddings. Industrial text processing applications must also satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and available under a permissive license. Our system is built upon spaCy's NLP components, resulting in an easily usable, fast, yet accurate application. Experiments confirm that HuSpaCy achieves high accuracy while maintaining resource-efficient prediction capabilities.
- Publication:
- arXiv e-prints
- Pub Date:
- January 2022
- DOI:
- 10.48550/arXiv.2201.01956
- arXiv:
- arXiv:2201.01956
- Bibcode:
- 2022arXiv220101956O
- Keywords:
- Computer Science - Computation and Language;
- Statistics - Machine Learning;
- 68T50;
- I.2.7
- E-Print:
- Camera-ready manuscript: fixed various grammatical errors, restructured the evaluation section, and updated scores in accordance with the v0.4.2 release.