Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
Abstract
Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. Sufficient amount of annotated data is available for high-resource languages such as English or Chinese. In less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2022
- DOI:
- 10.48550/arXiv.2207.12759
- arXiv:
- arXiv:2207.12759
- Bibcode:
- 2022arXiv220712759D
- Keywords:
-
- Computer Science - Computation and Language