Multilingual Culture-Independent Word Analogy Datasets

doi:10.48550/arXiv.1911.10038

In text processing, deep neural networks mostly use word embeddings as input. Embeddings must ensure that relations between words are reflected in distances in a high-dimensional numeric space. The quality of different text embeddings is typically compared on benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for these languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
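
As an illustration of the word analogy task, the sketch below solves "a is to b as c is to ?" with the common vector-offset approach: it picks the word whose vector is closest, by cosine similarity, to b - a + c. The toy three-dimensional vectors and the function names `solve_analogy` and `analogy_accuracy` are assumptions made for this example. They are not taken from the paper, its datasets, or fastText, and the paper's exact evaluation protocol may differ.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real embeddings (e.g. fastText) have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, embeddings):
    """Return the word d that best completes 'a is to b as c is to d'.

    The target point is b - a + c; the three query words are excluded
    from the candidates, as is usual for this task.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(quadruples, embeddings):
    """Fraction of (a, b, c, d) quadruples whose predicted word equals d."""
    correct = sum(
        solve_analogy(a, b, c, embeddings) == d for a, b, c, d in quadruples
    )
    return correct / len(quadruples)


if __name__ == "__main__":
    print(solve_analogy("man", "woman", "king", TOY_EMBEDDINGS))
```

A benchmark of this kind is a list of such quadruples, and accuracy over the list gives a single score by which embeddings can be compared.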

Publication:

arXiv e-prints

Pub Date:

November 2019

DOI:

10.48550/arXiv.1911.10038

arXiv:

arXiv:1911.10038

Bibcode:

2019arXiv191110038U

Keywords:

Computer Science - Computation and Language;
J.5

E-Print:

7 pages, LREC2020 conference
