HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

doi:10.48550/arXiv.2201.01956

Although several open-source language processing pipelines are available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should provide close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition, and word embeddings. Industrial text processing applications must also satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and available under a permissive license. Our system is built upon spaCy's NLP components, resulting in an easily usable, fast, yet accurate application. Experiments confirm that HuSpaCy achieves high accuracy while maintaining resource-efficient prediction capabilities.

Publication:

arXiv e-prints

Pub Date:

January 2022

DOI:

10.48550/arXiv.2201.01956

arXiv:

arXiv:2201.01956

Bibcode:

2022arXiv220101956O

Keywords:

Computer Science - Computation and Language;
Statistics - Machine Learning;
68T50;
I.2.7

E-Print:

Camera-ready manuscript: fixed various grammatical errors; restructured the evaluation section; updated scores in accordance with the v0.4.2 release.
