A Large Self-Annotated Corpus for Sarcasm
Abstract
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.
- Publication: arXiv e-prints
- Pub Date: April 2017
- DOI: 10.48550/arXiv.1704.05579
- arXiv: arXiv:1704.05579
- Bibcode: 2017arXiv170405579K
- Keywords: Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Machine Learning
- E-Print: 6 pages, 4 figures. To appear in LREC 2018