Automatic register identification for the open web using multilingual deep learning
Abstract
This article investigates how well deep learning models can identify web registers -- text varieties such as news reports and discussion forums -- across 16 languages. We introduce the Multilingual CORE corpora, which contain 72,504 documents annotated with a hierarchical taxonomy of 25 registers designed to cover the entire open web. Our multilingual models achieve state-of-the-art results (79% F1 score) using multi-label classification. This performance matches or exceeds previous studies that used simpler classification schemes, showing that models can perform well even with a complex register scheme at a massively multilingual scale. However, we observe a consistent performance ceiling around 77-80% F1 score across all models and configurations. When we remove documents with uncertain labels through data pruning, performance increases to over 90% F1, suggesting that this ceiling stems from inherent ambiguity in web registers rather than model limitations. Analysis of hybrid documents -- texts combining multiple registers -- reveals that the main challenge is not in classifying hybrids themselves, but in distinguishing between hybrid and non-hybrid documents. Multilingual models consistently outperform monolingual ones, particularly helping languages with limited training data. Zero-shot performance drops by an average of 7% on unseen languages, but this decrease varies substantially between languages (from 3% to 20%), indicating that registers share many features across languages but also maintain language-specific characteristics.
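As an illustration of how an F1 score can be computed when a document may carry more than one register label (as hybrid documents do), the following is a minimal sketch in Python. It assumes micro-averaging over label sets, which the abstract does not specify, and the label names are hypothetical placeholders rather than the taxonomy's actual register codes.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Assumption: micro-averaging is used here for illustration only;
    the abstract does not state which averaging the reported F1 uses.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical label names; the third gold document is a hybrid.
gold = [{"news"}, {"forum"}, {"news", "opinion"}]
pred = [{"news"}, {"forum", "opinion"}, {"news"}]
score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2f}")
```

In this toy example each hybrid label counts separately, so predicting a hybrid where the gold label is a single register (or the reverse) is penalised even when one of the labels is correct. This is the kind of hybrid versus non-hybrid confusion the abstract identifies as the main difficulty.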
- Publication: arXiv e-prints
- Pub Date: June 2024
- arXiv: arXiv:2406.19892
- Bibcode: 2024arXiv240619892H
- Keywords: Computer Science - Computation and Language