MULTEXT-East

doi:10.48550/arXiv.2003.14026

MULTEXT-East

Erjavec, Tomaž

MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and an annotated multilingual corpora. The parallel corpus, the novel "1984" by George Orwell, is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset is extensively documented, and freely available for research purposes. This case study gives a history of the development of the MULTEXT-East resources, presents their encoding and components, discusses related work and gives some conclusions.

Publication:

arXiv e-prints

Pub Date:

March 2020

DOI:

10.48550/arXiv.2003.14026

arXiv:

arXiv:2003.14026

Bibcode:

2020arXiv200314026E

Keywords:

Computer Science - Computation and Language;
I.2.7

E-Print:

Published in: Nancy Ide, James Pustejovsky, eds. 2007. Handbook of linguistic annotation. pp. 441-462. Springer

NASA/ADS

MULTEXT-East

Abstract