Misspelling Oblivious Word Embeddings
Abstract
In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2019
- DOI:
- 10.48550/arXiv.1905.09755
- arXiv:
- arXiv:1905.09755
- Bibcode:
- 2019arXiv190509755E
- Keywords:
-
- Computer Science - Computation and Language;
- Computer Science - Machine Learning
- E-Print:
- 9 Pages