DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings
Abstract
Word embeddings are a core component of modern natural language processing systems, making the ability to thoroughly evaluate them a vital task. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embedding. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing and new Arabic word embeddings that we developed. Our benchmark, evaluation code, and new word embedding models will be publicly available.
- Publication:
-
arXiv e-prints
- Pub Date:
- November 2020
- DOI:
- arXiv:
- arXiv:2011.10970
- Bibcode:
- 2020arXiv201110970A
- Keywords:
-
- Computer Science - Artificial Intelligence
- E-Print:
- WANLP2021