Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution
Abstract
Convolutional neural networks (CNNs) have demonstrated superior capability for extracting information from raw signals in computer vision. Recently, character-level and multi-channel CNNs have exhibited excellent performance on sentence classification tasks. We apply CNNs to large-scale authorship attribution, the task of determining the author of an unknown text from among many candidate authors. Our motivation is that CNNs can process character-level signals, differentiate between a large number of classes, and make fast predictions compared with state-of-the-art approaches. We extensively evaluate CNN-based approaches that leverage word and character channels and compare them against state-of-the-art methods across a wide range of author counts, shedding new light on traditional approaches. We show that character-level CNNs outperform the state of the art on four out of five datasets in different domains. Additionally, we present the first application of authorship attribution to Reddit.
- Publication: arXiv e-prints
- Pub Date: September 2016
- DOI: 10.48550/arXiv.1609.06686
- arXiv: arXiv:1609.06686
- Bibcode: 2016arXiv160906686R
- Keywords: Computer Science - Computation and Language; Computer Science - Machine Learning
- E-Print: 9 pages, 5 figures, 3 tables