The Effects of In-domain Corpus Size on pre-training BERT

doi:10.48550/arXiv.2212.07914

Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulty of collecting enough in-domain data might discourage researchers from attempting such pre-training. In this paper, we conducted a series of experiments in which we pre-trained Bidirectional Encoder Representations from Transformers (BERT) on biomedical corpora of different sizes. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with a limited number of training steps can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained on general corpora.

Publication:

arXiv e-prints

Pub Date:

December 2022

DOI:

10.48550/arXiv.2212.07914

arXiv:

arXiv:2212.07914

Bibcode:

2022arXiv221207914S

Keywords:

Computer Science - Computation and Language

E-Print:

6 pages
