2. Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
22.09.13 Bachelor Theses Session – July 2012 2
3. Introduction
• Plagiarism: the unauthorized appropriation of another
author's language or ideas, and their presentation as
one's own work, without giving proper credit to the
original author
• The large volume of documents makes automatic
detection necessary
• Information Retrieval
– Stemming (ex. beauty, beautiful, beautifulness => beauti)
– Vector Space Model
– tf-idf weighting, cosine similarity
• Measuring results
– precision, recall, granularity => F-measure
22.09.13 CSCS 2013 – Bucharest, Romania 3
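The tf-idf weighting and cosine similarity named above can be sketched in a few lines of Python. This is a toy illustration over a tiny corpus, not AuthentiCop's own (C++) implementation:

```python
# Minimal sketch of tf-idf weighting and cosine similarity
# over a toy corpus of tokenized documents.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (list of terms) to a term -> tf-idf weight dict."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["canny", "edge", "detector", "gaussian"],
        ["gaussian", "filter", "blurred", "image"],
        ["sorting", "algorithm", "complexity"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared "gaussian" -> nonzero similarity
print(cosine(vecs[0], vecs[2]))  # no shared terms -> 0.0
```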
4. Existing solutions
• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (annotated by
humans; how to find truly plagiarized documents, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on
general texts
• Used corpora:
– PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and
social software misuse” at CLEF)
– Bachelor thesis @ A&C
5. System Architecture
• Web interface for accessing AuthentiCop
– Simple to add documents (text, PDF) and to highlight suspicious
elements
6. System architecture
• Logical separation
– Front-end (PHP, JavaScript + AJAX, jQuery)
– Back-end (C++)
– Cross-Language Communication
• Scalable solution, easy to update
– Web server (front-end) and the plagiarism detection
modules (back-end) may run on different machines
– Plagiarism detection can be distributed on different
machines (distributed workers)
• Several external open-source libraries are used
(e.g. Apache Tika, CLucene)
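The distributed-worker idea above can be illustrated with a small Python sketch. The real back-end is C++; the task function `check_pair` here is purely illustrative (a crude word-overlap check standing in for one detection job farmed out to a worker):

```python
# Sketch of distributing detection jobs over a pool of workers.
# check_pair is a placeholder for one back-end job, not AuthentiCop's API.
from concurrent.futures import ThreadPoolExecutor

def check_pair(pair):
    """Stand-in for one detection job on a (suspicious, source) pair."""
    suspicious, source = pair
    overlap = len(set(suspicious.split()) & set(source.split()))
    return overlap

tasks = [("the quick brown fox", "a quick brown dog"),
         ("completely different text", "another unrelated sentence")]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(check_pair, tasks))
print(results)  # -> [2, 0]
```

In the deployed system the workers would be separate back-end processes, possibly on different machines, rather than threads in one process.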
8. System architecture
• Example: sequence of steps for processing PDF files:
• Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
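The PDF-processing steps can be sketched as a small pipeline. The real system invokes Apache Tika for extraction (e.g. the Tika command-line app in `--text` mode); here that step is a stub so the pipeline shape runs standalone:

```python
# Sketch of the PDF-processing pipeline; the extraction step is a stub
# standing in for Apache Tika.
def extract_text(pdf_path):
    # Stand-in for Apache Tika, which converts the PDF to plain text.
    return "Plain text extracted from %s" % pdf_path

def normalize(text):
    # Lowercase and collapse whitespace before further processing.
    return " ".join(text.lower().split())

def process_pdf(pdf_path):
    return normalize(extract_text(pdf_path))

print(process_pdf("thesis.pdf"))
```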
9. Detection of plagiarism
• Different problems
– Intrinsic plagiarism (analyze only the suspicious
document)
– External plagiarism (also has a reference collection
to check against)
• How large is the collection? Online sources?
• Source identification
• Text alignment
10. Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
– Find pairs of suspicious texts
– Combines source identification with text
alignment
2. Detailed analysis
3. Post-processing
11. Algorithms for candidate selection
• Selection of plausible plagiarism pairs
• Uses stop-word elimination, tf-idf weighting and
cosine similarity
• Initial hypothesis
• "Similarity Search Problem": All-Pairs,
ppjoin (Prefix Filtering with Positional
Information Join)
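The prefix-filtering idea behind All-Pairs/ppjoin can be sketched as follows: two sets with Jaccard similarity at least t must share a token within the first |x| − ⌈t·|x|⌉ + 1 tokens under a fixed global ordering, so an inverted index over those prefixes prunes most pairs. A simplified sketch (candidate generation only, without ppjoin's positional refinements):

```python
# Simplified prefix-filtering candidate generation in the All-Pairs style.
# Tokens are ordered by rising document frequency so prefixes hold rare terms.
import math
from collections import defaultdict

def candidate_pairs(docs, threshold):
    """docs: list of token sets. Returns candidate (i, j) index pairs."""
    df = defaultdict(int)
    for d in docs:
        for t in d:
            df[t] += 1
    ordered = [sorted(d, key=lambda t: (df[t], t)) for d in docs]
    index = defaultdict(set)     # token -> ids of docs with it in prefix
    candidates = set()
    for i, toks in enumerate(ordered):
        # Prefix length guaranteeing overlap for Jaccard >= threshold.
        prefix_len = len(toks) - math.ceil(threshold * len(toks)) + 1
        for t in toks[:prefix_len]:
            for j in index[t]:
                candidates.add((j, i))
            index[t].add(i)
    return candidates

docs = [{"canny", "edge", "gaussian", "filter"},
        {"gaussian", "filter", "blur", "edge"},
        {"sorting", "heap", "complexity", "big-o"}]
print(candidate_pairs(docs, 0.5))  # only the two image-processing docs pair up
```

Surviving candidates would then go to the detailed-analysis stage; an exact similarity check can still discard them.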
13. Algorithms for detailed analysis
• DotPlot: "Sequence Alignment Problem"
• Modified FastDocode
• Extending the analysis to the right and to the left,
starting from common words/passages
• Using passages instead of words as seeds for the
comparison
• tf-idf weighting & cosine similarity
Image source: Wikipedia
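The seed-and-extend step described above can be sketched as follows. This is inspired by, not identical to, the modified FastDocode step: starting from positions of a shared word, the match grows left and right while the tokens keep agreeing (the real system uses passages as seeds and tf-idf/cosine scoring):

```python
# Extend a matching passage left and right from a shared-word seed.
def extend_match(a, b, i, j):
    """a, b: token lists; (i, j): seed positions with a[i] == b[j].
    Returns the longest run of equal tokens around the seed."""
    left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0 \
            and a[i - left - 1] == b[j - left - 1]:
        left += 1
    right = 0
    while i + right + 1 < len(a) and j + right + 1 < len(b) \
            and a[i + right + 1] == b[j + right + 1]:
        right += 1
    return a[i - left : i + right + 1]

a = "the raw image is convolved with a gaussian filter today".split()
b = "because the raw image is convolved with a gaussian kernel".split()
seed = (a.index("convolved"), b.index("convolved"))
print(" ".join(extend_match(a, b, *seed)))
```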
14. Algorithms for post-processing
• Semantic analysis using LSA
– Built a semantic space with papers from Computer
Science (and pages from Wikipedia)
– Gensim framework in Python
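The system builds its semantic space with Gensim; the underlying LSA idea can be shown from scratch with a truncated SVD over a toy term-document matrix (NumPy stands in for Gensim here, and the corpus is invented for illustration):

```python
# Miniature LSA: project terms into a low-rank "semantic space" via SVD,
# so terms that co-occur in similar documents end up close together.
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
terms = ["graph", "tree", "edge", "sonnet", "poem"]
X = np.array([[1, 1, 0, 0],    # graph
              [1, 0, 1, 0],    # tree
              [0, 1, 1, 0],    # edge
              [0, 0, 0, 1],    # sonnet
              [0, 0, 1, 1]],   # poem
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                          # number of latent dimensions kept
term_vecs = U[:, :k] * s[:k]   # terms in the k-dimensional semantic space

def sim(t1, t2):
    u = term_vecs[terms.index(t1)]
    v = term_vecs[terms.index(t2)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# CS-flavored terms land closer to each other than to the poetry terms.
print(sim("graph", "edge"), sim("graph", "sonnet"))
```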
• Smith-Waterman Algorithm
– Dynamic programming
– Similar to the longest common subsequence
– Insert and delete operations may have any cost
(they may be greater than 1)
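The Smith-Waterman step above can be sketched as a standard dynamic-programming local alignment over tokens, with configurable insert/delete (gap) costs as the slide notes. The scoring values here are illustrative, not AuthentiCop's actual parameters:

```python
# Smith-Waterman local alignment over token lists with tunable costs.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between token lists a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

a = "the raw image is convolved with a gaussian filter".split()
b = "the raw image gets convolved with a gaussian filter".split()
print(smith_waterman(a, b))  # 8 matches and 1 mismatch -> 8*2 - 1 = 15
```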
15. Results
• Corpus: PAN 2011 (~ 22k documents)
• Run time on laptop: ~ 20 hours
• Results:
• Official results from PAN 2011:
– Plagdet: 0.221929185084
– Recall: 0.202996955425
– Precision: 0.366482242839
– Granularity: 1.26150173611
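For reference, the PAN plagdet score combines the three measures as F1 divided by log2(1 + granularity); plugging the official numbers above into this formula reproduces the reported score:

```python
# Recompute the PAN plagdet score from its component measures:
# plagdet = F1 / log2(1 + granularity).
import math

precision = 0.366482242839
recall = 0.202996955425
granularity = 1.26150173611

f1 = 2 * precision * recall / (precision + recall)
plagdet = f1 / math.log2(1 + granularity)
print(round(plagdet, 6))  # -> 0.221929, matching the official Plagdet value
```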
16. Results
22.09.13 16
• Specific corpus for CS:
– 940 BSc theses + 8,700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
– 307 BSc theses in English
Plagiarized text:
The Canny edge detector uses a filter based on the first derivative of a
Gaussian, because it is susceptible to noise present on raw unprocessed
image data, so to begin with, the raw image is convolved with a Gaussian
filter. The result is a slightly blurred version of the original which is
not affected by a single noisy pixel to any significant degree.
Original text from Wikipedia:
Because the Canny edge detector is susceptible to noise present in raw
unprocessed image data, it uses a filter based on a Gaussian (bell curve),
where the raw image is convolved with a Gaussian filter. The result is a
slightly blurred version of the original which is not affected by a single
noisy pixel to any significant degree.
• Some elements are incorrectly identified as
plagiarism: quotes, bibliographic references
17. Conclusions
• Improving the corpus
• The system uses several parameters that were
determined empirically => use machine
learning for finding the best values
• Increase the speed of the processing
• Improve the method: “bag of words” +
information about the position of the words
• Need better post-processing for real
documents (like scientific papers or theses)