Datasheet for the Pile

doi:10.48550/arXiv.2201.07311

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile comprises 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

Publication: arXiv e-prints
Pub Date: January 2022
DOI: 10.48550/arXiv.2201.07311
arXiv: arXiv:2201.07311
Bibcode: 2022arXiv220107311B
Keywords: Computer Science - Computation and Language
E-Print: Accompanies "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" arXiv:2101.00027
