Misspelling Oblivious Word Embeddings

doi:10.48550/arXiv.1905.09755

This paper presents a method for learning word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed text, which contains a non-negligible number of out-of-vocabulary words. The proposed method combines FastText with subwords and a supervised task of learning misspelling patterns, so that misspellings of each word are embedded close to their correct variants. The embeddings are trained on a new, publicly released dataset. Experiments on public test sets show the advantages of this approach on both intrinsic and extrinsic NLP tasks.
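The abstract does not say how subwords help with misspellings, so the following is only a minimal sketch of the general FastText-style idea: a word is represented through its character n-grams, and a misspelling shares most of those n-grams with the correct form. The function names `char_ngrams` and `ngram_overlap` are hypothetical, the n-gram range of 3 to 6 is an assumed default, and the Jaccard score is an illustration only, not the supervised misspelling objective the paper trains.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, FastText-style."""
    # Boundary markers let prefixes and suffixes differ from inner n-grams.
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


if __name__ == "__main__":
    # A misspelling shares many subwords with its correct form,
    # while an unrelated word shares few or none.
    print(ngram_overlap("embedding", "embeding"))
    print(ngram_overlap("embedding", "language"))
```

Because an out-of-vocabulary misspelling can still be composed from n-grams seen during training, it receives a vector even though the full word was never observed; the supervised misspelling task described in the abstract is what then pulls that vector toward the correct variant.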

Publication: arXiv e-prints
Pub Date: May 2019
DOI: 10.48550/arXiv.1905.09755
arXiv: arXiv:1905.09755
Bibcode: 2019arXiv190509755E
Keywords: Computer Science - Computation and Language; Computer Science - Machine Learning
E-Print: 9 pages

NASA/ADS
