Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

doi:10.48550/arXiv.2207.12759

Dadas, Sławomir

Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. For less widely used languages, multilingual models have to be used, and these offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.
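The abstract does not spell out how paraphrase pairs are extracted from the aligned corpora. As a rough illustration only, the sketch below assumes one simple pivot-based heuristic: two target-language sentences that are aligned to the same pivot-language sentence are treated as candidate paraphrases. The function names `normalize` and `mine_paraphrases` and the toy English-Polish corpus are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def mine_paraphrases(aligned_pairs):
    """Collect candidate paraphrase pairs from a sentence-aligned corpus.

    aligned_pairs: iterable of (pivot_sentence, target_sentence) tuples.
    Assumption (not from the paper): target sentences that share the same
    normalized pivot sentence are candidate paraphrases of each other.
    Returns a sorted list of (target_a, target_b) tuples.
    """
    by_pivot = defaultdict(set)
    for pivot, target in aligned_pairs:
        by_pivot[normalize(pivot)].add(target.strip())

    pairs = set()
    for targets in by_pivot.values():
        # Keep one representative per normalized form, so sentences that
        # differ only in case or spacing are not paired with each other.
        distinct = {}
        for target in sorted(targets):
            distinct.setdefault(normalize(target), target)
        for a, b in combinations(sorted(distinct.values()), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Toy corpus: (English pivot, Polish target). Purely illustrative data.
corpus = [
    ("Thank you very much.", "Dziękuję bardzo."),
    ("Thank you very much.", "Bardzo dziękuję."),
    ("thank you  very much.", "Dziękuję bardzo."),
    ("Good morning.", "Dzień dobry."),
]

pairs = mine_paraphrases(corpus)
print(pairs)
```

A real pipeline would likely need further filtering (for example by length ratio or similarity score) before such pairs are used for fine-tuning; those steps are left out of this sketch.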

Publication: arXiv e-prints

Pub Date: July 2022

DOI: 10.48550/arXiv.2207.12759

arXiv: arXiv:2207.12759

Bibcode: 2022arXiv220712759D

Keywords: Computer Science - Computation and Language
