Plagiarism Detection in arXiv

doi:10.48550/arXiv.cs/0702012

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a corpus of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.

Publication:

arXiv e-prints

Pub Date:

February 2007

DOI:

10.48550/arXiv.cs/0702012

arXiv:

arXiv:cs/0702012

Bibcode:

2007cs........2012S

Keywords:

Computer Science - Databases;
Computer Science - Digital Libraries;
Computer Science - Information Retrieval

E-Print:

Sixth International Conference on Data Mining (ICDM'06), Dec 2006
