On the Use of Machine Translation-Based Approaches for Vietnamese Diacritic Restoration
Abstract
This paper presents an empirical study of two machine translation-based approaches to the Vietnamese diacritic restoration problem: phrase-based and neural-based machine translation models. This is the first work to apply a neural-based machine translation method to this problem and to give a thorough comparison with the phrase-based machine translation method, which is the current state-of-the-art method for this problem. On a large dataset, the phrase-based approach achieves an accuracy of 97.32%, while the neural-based approach achieves 96.15%. Although the neural-based method has slightly lower accuracy, it is about twice as fast as the phrase-based method at inference. Moreover, the neural-based machine translation method has much room for future improvement, such as incorporating pre-trained word embeddings and collecting more training data.
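As an illustration of how diacritic restoration can be framed as a translation task, the following minimal sketch builds a (source, target) sentence pair by stripping diacritics from correctly written Vietnamese text, and scores a prediction by token-level accuracy. The function names `strip_diacritics`, `make_pair`, and `token_accuracy` are hypothetical, and both the data preparation and the token-level metric are assumptions made for illustration; the abstract does not specify how the authors prepared their data or computed accuracy.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'Tiếng Việt' -> 'Tieng Viet'."""
    # 'đ'/'Đ' have no canonical decomposition, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(sentence: str) -> tuple[str, str]:
    """Build one parallel example: (text without diacritics, original text)."""
    return strip_diacritics(sentence), sentence


def token_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated tokens restored exactly.

    Assumed metric for illustration only; restoration never changes the
    number of tokens, so the two sequences must have equal length.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if len(pred_tokens) != len(ref_tokens):
        raise ValueError("prediction and reference must have the same number of tokens")
    if not ref_tokens:
        return 0.0
    correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / len(ref_tokens)
```

Under these assumptions, a translation model (phrase-based or neural) would be trained on many such pairs, with the stripped text as the source side and the original text as the target side, and `token_accuracy` would compare its output against the original sentence.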
- Publication:
- arXiv e-prints
- Pub Date:
- September 2017
- DOI:
- 10.48550/arXiv.1709.07104
- arXiv:
- arXiv:1709.07104
- Bibcode:
- 2017arXiv170907104P
- Keywords:
- Computer Science - Computation and Language
- E-Print:
- 4 pages, 2 figures, 4 tables, accepted to IALP 2017