Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?

doi:10.48550/arXiv.2007.14351

Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70% of them. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models that differ in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both the multilingual and the cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but asynchronous training results in a lower tone error rate.
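The distinction between a joint phone+tone tier and separate phone and tone tiers can be illustrated with a minimal sketch. This is not the paper's implementation: the example syllables, the tone labels (`T1`, `T2`, `T3`), and the helper names `joint_tier`, `separate_tiers`, and `error_rate` are all hypothetical, and the abstract does not specify how the four models encode their targets. The sketch assumes only that a synchronous setup scores one interleaved label sequence while an asynchronous setup scores each tier on its own, with error rate taken as edit distance divided by reference length.

```python
# Hypothetical data: two syllables, each a list of phones plus one tone label.
# These transcriptions are illustrative and do not come from the paper.
syllables = [(["m", "a"], "T1"), (["m", "a"], "T3")]


def joint_tier(sylls):
    """Assumed synchronous encoding: phones and tone interleaved in one sequence."""
    labels = []
    for phones, tone in sylls:
        labels.extend(phones)
        labels.append(tone)
    return labels


def separate_tiers(sylls):
    """Assumed asynchronous encoding: one phone sequence and one tone sequence."""
    phones = [p for ps, _ in sylls for p in ps]
    tones = [t for _, t in sylls]
    return phones, tones


def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A hypothetical recognizer output that gets the second tone wrong.
    hyp_sylls = [(["m", "a"], "T1"), (["m", "a"], "T2")]
    print("joint phone+tone error rate:",
          error_rate(joint_tier(syllables), joint_tier(hyp_sylls)))
    print("tone error rate:",
          error_rate(separate_tiers(syllables)[1], separate_tiers(hyp_sylls)[1]))
```

With this toy input, one tone substitution counts against six labels in the joint tier but against two labels in the tone tier, which is why the two error rates reported in the abstract are not directly comparable.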

Publication:

arXiv e-prints

Pub Date:

July 2020

DOI:

10.48550/arXiv.2007.14351

arXiv:

arXiv:2007.14351

Bibcode:

2020arXiv200714351L

Keywords:

Electrical Engineering and Systems Science - Audio and Speech Processing;
Computer Science - Sound

E-Print:

Accepted to Interspeech 2020

NASA/ADS