A unified multilingual handwriting recognition system using multigrams sub-lexical units
Abstract
We address the design of a unified multilingual system for handwriting recognition. Most multilingual systems rest on specialized models, each trained on a single language, one of which is selected at test time. While some recognition systems are based on a unified optical model, building a unified language model remains a major issue, as traditional language models are generally trained on corpora with a large word lexicon for each language. Here, we propose a solution based on language models built on sub-lexical units, called multigrams. Using multigrams strongly reduces the lexicon size and thus decreases the language model complexity. This makes it possible to design an end-to-end unified multilingual recognition system in which a single optical model and a single language model are both trained on all the languages. We discuss the impact of language unification on each model and show that our system reaches the performance of state-of-the-art methods with a strong reduction in complexity.
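To illustrate why sub-lexical units shrink the lexicon, here is a minimal sketch on synthetic data. It is not the authors' method: the abstract does not describe how multigrams are extracted, so the fixed-length chunking in `segment`, the two toy "languages", and all names below are assumptions made purely for illustration.

```python
from itertools import product

# Hypothetical syllable inventories for two toy "languages" (invented data).
SYLLABLES_A = ["ma", "ta", "ko"]
SYLLABLES_B = ["ta", "ko", "li"]


def build_words(syllables):
    """Build a toy word lexicon: every 2- and 3-syllable concatenation."""
    words = set()
    for n in (2, 3):
        for combo in product(syllables, repeat=n):
            words.add("".join(combo))
    return words


def segment(word, max_len=2):
    """Split a word into consecutive chunks of at most max_len characters.

    Assumption: fixed-length chunking stands in for a real multigram
    segmentation, which would produce variable-length units.
    """
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


# Unified word lexicon: the union of both languages' words.
word_lexicon = build_words(SYLLABLES_A) | build_words(SYLLABLES_B)

# Unified sub-lexical lexicon: every distinct unit seen in any word.
unit_lexicon = set()
for word in word_lexicon:
    unit_lexicon.update(segment(word))

print(len(word_lexicon), "words vs", len(unit_lexicon), "sub-lexical units")
```

In this toy setting the unified word lexicon holds 60 entries while the unit lexicon holds only 4, and units such as `ta` and `ko` are shared across both languages. Real vocabularies will not compress this cleanly, but the same effect is what lets a single language model cover several languages.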
- Publication: Pattern Recognition Letters
- Pub Date: April 2019
- DOI: 10.1016/j.patrec.2018.07.027
- arXiv: arXiv:1808.09183
- Bibcode: 2019PaReL.121...68S
- Keywords: Sub-lexical units; Multilingual; Language model; Handwriting recognition; Multigrams; Computer Science - Computer Vision and Pattern Recognition
- E-Print: preprint