Plagiarism Detection in arXiv
Abstract
We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents gathered by arXiv.org over a 14-year period and covering several research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.
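The abstract does not name the detection technique, so the following is only a minimal sketch of one common approach to text-overlap screening: comparing sets of word n-grams ("shingles") between a new submission and each document already in the collection. The function names (`shingles`, `containment`, `screen`), the n-gram length of 5, and the 0.2 threshold are all hypothetical choices made for illustration; none of them comes from the paper.

```python
import re


def shingles(text, n=5):
    """Return the set of overlapping word n-grams in text.

    Text is lowercased and punctuation is dropped. n=5 is an arbitrary
    illustrative choice, not a value taken from the paper.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(candidate, reference, n=5):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = shingles(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & shingles(reference, n)) / len(cand)


def screen(submission, corpus, n=5, threshold=0.2):
    """Return (doc_id, score) pairs for corpus documents whose overlap with
    the submission meets the threshold, highest score first.

    corpus is a dict mapping document ids to their full text. The threshold
    is a hypothetical cutoff; a real screen would tune it to limit false
    positives.
    """
    hits = []
    for doc_id, text in corpus.items():
        score = containment(submission, text, n)
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

This sketch recomputes the shingles of every stored document for each submission, which would be far too slow for a collection of this size. A real-time screen would more plausibly index n-gram hashes once and look them up per submission, and would likely also exclude common phrases and legitimately shared text (such as an author's own earlier versions) to limit false positives.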
- Publication: arXiv e-prints
- Pub Date: February 2007
- DOI:
- arXiv: arXiv:cs/0702012
- Bibcode: 2007cs........2012S
- Keywords: Computer Science - Databases; Computer Science - Digital Libraries; Computer Science - Information Retrieval
- E-Print: Sixth International Conference on Data Mining (ICDM'06), Dec 2006