Automatic Plagiarism Detection System for Specialized Corpora
University Politehnica of Bucharest
Authors
Filip Cristian Buruiană
Adrian Scoică
Traian Rebedea – traian.rebedea@cs.pub.ro
Razvan Rughiniș
Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
22.09.13 Bachelor's Thesis Session - July 2012 2
Introduction
• Plagiarism: unauthorized appropriation of the language or
thoughts of another author and the representation of that
author's work as pertaining to one's own without according
proper credit to the original author
• Lots of documents => automatic detection
needed
• Information Retrieval
– Stemming (e.g. beauty, beautiful, beautifulness => beauti)
– Vector Space Model
– tf-idf weighting, cosine similarity
• Measuring results
– precision, recall, granularity => F-measure
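As a rough illustration of the Vector Space Model bullets above, the sketch below builds tf-idf vectors for tokenized documents and compares them with cosine similarity. The smoothed idf formula, the function names `tfidf_vectors` and `cosine`, and the toy sentences are assumptions made for this example; the slides do not say which weighting variant AuthentiCop uses, and stemming and stop-word removal are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumed variant) so that no term gets a zero weight.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "the canny edge detector uses a gaussian filter".split(),
    "the canny detector uses a filter based on a gaussian".split(),
    "plagiarism detection needs a reference collection".split(),
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

The first pair shares most of its vocabulary and should score clearly higher than the second pair, which only shares a common function word.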
22.09.13 CSCS 2013 – Bucharest, Romania 3
Existing solutions
• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (annotated by humans;
finding plagiarized documents is itself a problem, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on
general texts
• Used corpora:
– PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and
social software misuse” at CLEF)
– Bachelor theses @ A&C
System Architecture
• Web interface for accessing AuthentiCop
– Simple to add documents (text, pdf) and to highlight suspicious
elements
System architecture
• Logical separation
– Front-end (PHP, JavaScript + AJAX, jQuery)
– Back-end (C++)
– Cross-Language Communication
• Scalable solution, easy to update
– Web server (front-end) and the plagiarism detection
modules (back-end) may run on different machines
– Plagiarism detection can be distributed on different
machines (distributed workers)
• Several external open-source libraries are used
(e.g. Apache Tika, CLucene, etc.)
System architecture
• Example: sequence of steps for processing PDF files
– Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
Detection of plagiarism
• Different problems
– Intrinsic plagiarism (analyze only the suspicious
document)
– External plagiarism (also has a reference collection
to check against)
• How large is the collection? Online sources?
• Source identification
• Text alignment
Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
– Find pairs of suspicious texts
– Combines source identification with text alignment
2. Detailed analysis
3. Post-processing
Algorithms for candidate selection
• Selection of the plausible plagiarism pairs
• Using stop-word elimination, tf-idf & cosine
• Initial hypothesis
• “Similarity Search Problem”: All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
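The similarity search algorithms named above rely on prefix filtering to avoid comparing every pair of documents. The sketch below is a simplified take on that idea, not the All-Pairs or ppjoin implementation itself: it assumes Jaccard similarity over token sets (the slides mention tf-idf and cosine), it omits the positional filtering that ppjoin adds, and the name `prefix_filter_pairs` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def prefix_filter_pairs(docs, threshold):
    """Return (i, j, sim) for all pairs of token sets with Jaccard >= threshold."""
    # Global ordering: rarest tokens first, so prefixes are made of rare tokens.
    freq = Counter(tok for doc in docs for tok in doc)
    ordered = [sorted(doc, key=lambda t: (freq[t], t)) for doc in docs]

    index = defaultdict(list)  # token -> ids of documents with it in their prefix
    results = []
    for i, tokens in enumerate(ordered):
        # Two sets with Jaccard >= threshold must share a token within these
        # prefixes; the small epsilon guards against floating-point round-up.
        required = math.ceil(threshold * len(tokens) - 1e-9)
        prefix_len = len(tokens) - required + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Only the candidates are verified with the exact similarity.
        for j in sorted(candidates):
            sim = jaccard(docs[i], docs[j])
            if sim >= threshold:
                results.append((j, i, sim))
    return results


docs = [
    {"canny", "edge", "detector", "gaussian", "filter"},
    {"canny", "edge", "detector", "gaussian", "noise"},
    {"plagiarism", "detection", "corpus", "filter"},
]
pairs = prefix_filter_pairs(docs, 0.6)
print(pairs)
```

In this toy run only the first two sets share a prefix token, so the third document is never compared against the others.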
Algorithms for candidate selection
• FastDocode (presented at PAN 2010)
+ caching + sub-linear merging
• New approach
– Text segments => fingerprints & indexing with Apache CLucene
– Compute the number of inversions
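One plausible reading of the fingerprinting step is sketched below: each text segment is turned into hashed word n-grams and only a fraction of the hashes is retained. Several details are assumptions, since the slides do not specify them: segments are measured in words, the retained hashes are simply the smallest ones, and CRC32 stands in for whatever hash function was used. Indexing with CLucene and the inversion count are not covered here.

```python
import math
import zlib


def fingerprint(words, n=4, retention=0.1):
    """Return a set of hashed word n-grams representing one text segment."""
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
    hashes = sorted({zlib.crc32(g.encode("utf-8")) for g in grams})
    # Assumed selection rule: keep the smallest `retention` fraction of hashes.
    keep = math.ceil(retention * len(hashes))
    return set(hashes[:keep])


def segment_fingerprints(text, segment_size=150, n=4, retention=0.1):
    """Split a text into fixed-size word segments and fingerprint each one."""
    words = text.lower().split()
    segments = [words[i:i + segment_size] for i in range(0, len(words), segment_size)]
    return [fingerprint(seg, n, retention) for seg in segments]


text = "the raw image is convolved with a gaussian filter to reduce noise"
fps = segment_fingerprints(text, segment_size=150, n=4, retention=0.3)
print(len(fps), [len(fp) for fp in fps])
```

Two segments that share retained hashes become candidate pairs; a higher retention rate keeps more hashes per segment, which fits the trade-off between true positives, false positives and run time visible in the parameter table.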
N-gram length   Segment dimension   Retention rate   TP     FP      FN      Time (h)   Plagdet
3               150                 10%              5413   44522   11469   ~ 1        0.162
4               150                 10%              4913   10297   11969   ~ 2        0.306
4               150                 30%              7633   35169   9249    ~ 4.5      0.256
5               150                 20%              5194   6256    11688   ~ 3        0.367

Method (on 1000 documents)   TP    FP     FN     Prec.   Recall   Plagdet
Fingerprinting & indexing    685   494    761    0.581   0.474    0.522
FastDocode#3                 634   4097   812    0.134   0.438    0.205
FastDocode#4                 424   815    1022   0.342   0.293    0.316
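The precision and recall columns follow directly from the TP, FP and FN counts. The short sketch below recomputes them for the "Fingerprinting & indexing" row; the function name `precision_recall_f1` is made up for this example. The F-measure it returns matches the Plagdet column, which is consistent with a granularity of 1 in this document-level comparison (an inference, not something the slides state).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the "Fingerprinting & indexing" row of the table.
p, r, f = precision_recall_f1(685, 494, 761)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.581 0.474 0.522
```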
Algorithms for detailed analysis
• DotPlot: “Sequence Alignment Problem”
• Modified FastDocode
– Extending the analysis to the right and to the left, starting from common words/passages
– Using passages instead of words as seeds for the comparison
– tf-idf weighting & cosine similarity
Image source: Wikipedia
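The seed-and-extend idea from the bullets above can be sketched as follows. This is a deliberately simplified version: it grows a seed only while the neighbouring words are identical, whereas the modified FastDocode described in the slides compares passages with tf-idf weighting and cosine similarity. The function name `extend_seed` and the example sentences are assumptions.

```python
def extend_seed(susp, src, i, j, length):
    """Grow a matching passage around a seed where
    susp[i:i + length] == src[j:j + length].

    Returns the half-open word ranges of the extended passage in the
    suspicious document and in the source document.
    """
    start_i, start_j = i, j
    end_i, end_j = i + length, j + length
    # Extend to the left while the preceding words agree.
    while start_i > 0 and start_j > 0 and susp[start_i - 1] == src[start_j - 1]:
        start_i -= 1
        start_j -= 1
    # Extend to the right while the following words agree.
    while end_i < len(susp) and end_j < len(src) and susp[end_i] == src[end_j]:
        end_i += 1
        end_j += 1
    return (start_i, end_i), (start_j, end_j)


susp = "so the result is a slightly blurred version of the original image".split()
src = "the result is a slightly blurred version of the original which".split()
# Seed: the two-word passage "slightly blurred" found in both texts.
susp_range, src_range = extend_seed(susp, src, 5, 4, 2)
print(" ".join(susp[susp_range[0]:susp_range[1]]))
```

Replacing the equality test with a similarity threshold over a sliding window would tolerate small edits, at the cost of extra parameters to tune.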
Algorithms for post-processing
• Semantic analysis using LSA
– Built a semantic space with papers from Computer
Science (and pages from Wikipedia)
– Gensim framework in Python
• Smith-Waterman Algorithm
– Dynamic programming
– Similar to the longest common subsequence
– Insert and delete operations may have any cost
(they may be greater than 1)
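A minimal word-level Smith-Waterman sketch is given below to make the dynamic programming step concrete. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are illustrative assumptions, not the parameters used in AuthentiCop; the `gap` argument reflects the point above that insert and delete operations can be given any cost.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned index pairs) of the best local alignment
    between two word lists."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    pairs = []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # word of `a` left unaligned (deletion)
        else:
            j -= 1  # word of `b` left unaligned (insertion)
    return best, pairs[::-1]


a = "the raw image is convolved with a gaussian filter".split()
b = "image is first convolved with a gaussian filter today".split()
best_score, aligned = smith_waterman(a, b)
print(best_score, aligned)
```

In this example the inserted word "first" is absorbed as a single gap, so the alignment still spans from "image" to "filter" in both texts.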
Results
• Corpus: PAN 2011 (~ 22k documents)
• Run time on laptop: ~ 20 hours
• Results:
• Official results from PAN 2011:
Plagdet          Recall           Precision        Granularity
0.221929185084   0.202996955425   0.366482242839   1.26150173611
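The four reported values are linked by the plagdet definition used in the PAN evaluation: the F-measure of precision and recall divided by log2(1 + granularity). The sketch below recomputes the score from the other three numbers as a sanity check; the function name `plagdet` is chosen for this example.

```python
import math


def plagdet(precision, recall, granularity):
    """Plagdet score: F-measure penalised by detection granularity."""
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return f1 / math.log2(1 + granularity)


# Precision, recall and granularity as reported in the table.
score = plagdet(0.366482242839, 0.202996955425, 1.26150173611)
print(round(score, 4))
```

With a granularity of 1 (each plagiarized passage detected as exactly one piece) the denominator is 1 and plagdet reduces to the plain F-measure.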
Results
• Specific corpus for CS:
– 940 BSc theses + 8700 articles on CS from Wikipedia
• Detecting theses written in English: TextCat
– 307 BSc theses in English
Plagiarized text:
The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
Original text from Wikipedia:
Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
• Some elements are incorrectly identified as
plagiarism: quotes, bibliographic references
Conclusions
• Improving the corpus
• The system uses several parameters that were
determined empirically => use machine
learning for finding the best values
• Increase the speed of the processing
• Improve the method: “bag of words” +
information about the position of the words
• Need better post-processing for real
documents (like scientific papers or theses)
Thank you!
• Questions?
• Discussion

 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Dernier (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Automatic plagiarism detection system for specialized corpora

  • 1. Automatic Plagiarism Detection System for Specialized Corpora • Authors: Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea (traian.rebedea@cs.pub.ro), Razvan Rughiniș • University Politehnica of Bucharest
  • 2. Overview • Introduction • System architecture • Detection of plagiarism • Algorithms for candidate selection • Algorithms for detailed analysis • Algorithms for post-processing • Results • Conclusions 22.09.13 Bachelor's Thesis Session - July 2012
  • 3. Introduction • Plagiarism: unauthorized appropriation of the language or thoughts of another author and the representation of that author's work as pertaining to one's own without according proper credit to the original author • Lots of documents => automatic detection needed • Information Retrieval – Stemming (e.g. beauty, beautiful, beautifulness => beauti) – Vector Space Model – tf-idf weighting, cosine similarity • Measuring results – precision, recall, granularity => F-measure 22.09.13 CSCS 2013 – Bucharest, Romania
  • 4. Existing solutions • Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.) • They are general solutions, topic independent • No open-source solutions that offer good results • No solutions specialized for Computer Science • Difficult to evaluate: a good corpus is needed (human annotation, finding plagiarized documents, etc.) • AuthentiCop – developed for specialized corpora, also evaluated on general texts • Corpora used: – PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and social software misuse” at CLEF) – Bachelor theses @ A&C
  • 5. System architecture • Web interface for accessing AuthentiCop – Simple to add documents (text, PDF) and to highlight suspicious elements
  • 6. System architecture • Logical separation – Front-end (PHP, JavaScript + AJAX, jQuery) – Back-end (C++) – Cross-language communication • Scalable solution, easy to update – Web server (front-end) and the plagiarism detection modules (back-end) may run on different machines – Plagiarism detection can be distributed across different machines (distributed workers) • Several external open-source libraries are used (e.g. Apache Tika, CLucene, etc.)
  • 7. System architecture
  • 8. System architecture • Example: sequence of steps for processing PDF files • Apache Tika is used for transforming PDFs into text • Automatic build module for the back-end components • Automatic deployment system for the solution
  • 9. Detection of plagiarism • Different problems – Intrinsic plagiarism (analyze only the suspicious document) – External plagiarism (also has a reference collection to check against) • How large is the collection? Online sources? • Source identification • Text alignment
  • 10. Detection of plagiarism • Steps for external plagiarism detection: 1. Candidate selection – Find pairs of suspicious texts – Combines source identification with text alignment 2. Detailed analysis 3. Post-processing
  • 11. Algorithms for candidate selection • Selection of plausible pairs of plagiarized texts • Using stop-word elimination, tf-idf & cosine similarity • Initial hypothesis • “Similarity Search Problem”: All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
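
The candidate selection idea on this slide (stop-word elimination, tf-idf weighting, cosine similarity) can be sketched as follows. This is a minimal illustration, not the AuthentiCop implementation: the stop-word list, the smoothed idf variant, the function names and the `threshold` value are all assumptions, and the brute-force double loop stands in for the scalable All-Pairs / ppjoin joins named above.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (assumption; not the list used by AuthentiCop).
STOP_WORDS = {"a", "an", "and", "by", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf (one common variant): log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(suspicious, sources, threshold=0.2):
    """Return (suspicious_idx, source_idx, similarity) for pairs above threshold."""
    vectors = tfidf_vectors(list(suspicious) + list(sources))
    susp_vecs = vectors[:len(suspicious)]
    src_vecs = vectors[len(suspicious):]
    pairs = []
    for i, u in enumerate(susp_vecs):
        for j, v in enumerate(src_vecs):
            sim = cosine(u, v)
            if sim >= threshold:
                pairs.append((i, j, sim))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])


if __name__ == "__main__":
    sources = [
        "The Canny edge detector uses a Gaussian filter to reduce noise",
        "Stemming reduces words to a common root form",
    ]
    suspicious = ["A Gaussian filter is used by the Canny edge detector against noise"]
    print(candidate_pairs(suspicious, sources))
```

Only pairs that pass the threshold would move on to the more expensive detailed analysis step.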
  • 12. Algorithms for candidate selection • FastDocode (presented at PAN 2010) + caching + sub-linear merging • New approach – Text segments => fingerprints & indexing with Apache CLucene – Compute the number of inversions
    Parameter tuning for the fingerprinting approach:
    N-gram length | Segment dimension | Retention rate | TP   | FP    | FN    | Time (h) | Plagdet
    3             | 150               | 10%            | 5413 | 44522 | 11469 | ~ 1      | 0.162
    4             | 150               | 10%            | 4913 | 10297 | 11969 | ~ 2      | 0.306
    4             | 150               | 30%            | 7633 | 35169 | 9249  | ~ 4.5    | 0.256
    5             | 150               | 20%            | 5194 | 6256  | 11688 | ~ 3      | 0.367
    Comparison of methods (run on 1000 documents):
    Method                    | TP  | FP   | FN   | Prec. | Recall | Plagdet
    Fingerprinting & indexing | 685 | 494  | 761  | 0.581 | 0.474  | 0.522
    FastDocode#3              | 634 | 4097 | 812  | 0.134 | 0.438  | 0.205
    FastDocode#4              | 424 | 815  | 1022 | 0.342 | 0.293  | 0.316
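
A rough sketch of the "text segments => fingerprints & indexing" idea is given below. The slide does not describe how fingerprints are computed or selected, so everything here is an assumption made for illustration: word n-grams are hashed with MD5, a fixed fraction of the hash space is kept to approximate the "retention rate", segments are consecutive word windows, and a plain Python dictionary stands in for the CLucene index. The "number of inversions" step is not covered.

```python
import hashlib
from collections import Counter, defaultdict


def ngram_hashes(words, n):
    """Hash every word n-gram to a 64-bit integer (stable across runs)."""
    hashes = []
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n]).encode("utf-8")
        hashes.append(int.from_bytes(hashlib.md5(gram).digest()[:8], "big"))
    return hashes


def fingerprints(text, n=4, segment_size=150, retention=0.2):
    """Yield (segment_index, set_of_fingerprints) for consecutive word segments.

    Keeping only hashes below a cut-off retains roughly `retention` of all
    n-grams, and the same n-gram is kept or dropped in every document.
    """
    words = text.lower().split()
    keep_below = int(retention * 2**64)
    for seg_idx, start in enumerate(range(0, len(words), segment_size)):
        segment = words[start:start + segment_size]
        yield seg_idx, {h for h in ngram_hashes(segment, n) if h < keep_below}


def build_index(docs, **params):
    """Inverted index: fingerprint -> set of (doc_id, segment_index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for seg_idx, prints in fingerprints(text, **params):
            for h in prints:
                index[h].add((doc_id, seg_idx))
    return index


def query(index, text, **params):
    """Count fingerprints shared between `text` and each indexed segment."""
    hits = Counter()
    for _, prints in fingerprints(text, **params):
        for h in prints:
            for location in index.get(h, ()):
                hits[location] += 1
    return hits


if __name__ == "__main__":
    # retention=1.0 keeps every n-gram, which makes this toy example deterministic.
    params = dict(n=3, segment_size=150, retention=1.0)
    index = build_index({"src": "a b c d e f g h"}, **params)
    print(query(index, "x y c d e f z", **params))
```

Source segments that share many fingerprints with a suspicious document become candidates; the trade-off between retention rate, run time and false positives is what the first table above explores.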
  • 13. Algorithms for detailed analysis • DotPlot: “Sequence Alignment Problem” (image source: Wikipedia) • Modified FastDocode – Extending the analysis to the right and to the left, starting from common words/passages – Using passages instead of words as seeds for the comparison – tf-idf weighting & cosine similarity
  • 14. Algorithms for post-processing • Semantic analysis using LSA – Built a semantic space with papers from Computer Science (and pages from Wikipedia) – Gensim framework in Python • Smith-Waterman algorithm – Dynamic programming – Similar to the longest common subsequence – Insert and delete operations may have any cost (they may be greater than 1)
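
The Smith-Waterman step can be illustrated with a short dynamic-programming sketch over word sequences. The scoring values (`match`, `mismatch`, `gap`) are placeholder assumptions rather than the parameters used in AuthentiCop; the `gap` argument is where an insert/delete cost other than 1 would be set.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of two token sequences.

    Returns (best_score, (start_a, end_a), (start_b, end_b)), where the
    half-open spans mark the best locally aligned region in each sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)


if __name__ == "__main__":
    source = "the raw image is convolved with a gaussian filter".split()
    suspect = "first the raw image is smoothed with a gaussian filter today".split()
    score, span_src, span_susp = smith_waterman(source, suspect)
    print(score, source[span_src[0]:span_src[1]], suspect[span_susp[0]:span_susp[1]])
```

Unlike a plain longest-common-subsequence computation, the score is reset to zero whenever it would go negative, so the alignment is free to start and end anywhere in either text.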
  • 15. Results • Corpus: PAN 2011 (~ 22k documents) • Run time on laptop: ~ 20 hours • Official results from PAN 2011:
    Plagdet        | Recall         | Precision      | Granularity
    0.221929185084 | 0.202996955425 | 0.366482242839 | 1.26150173611
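
For reference, the plagdet score used in the PAN evaluations combines the three other measures: it is the F1 score (harmonic mean of precision and recall) divided by log2(1 + granularity). The sketch below recomputes the reported value from the other three columns; the function names are illustrative only.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def plagdet(precision, recall, granularity=1.0):
    """PAN plagdet score: F1 / log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    # Official PAN 2011 values from the slide.
    print(round(plagdet(0.366482242839, 0.202996955425, 1.26150173611), 3))  # 0.222

    # "Fingerprinting & indexing" row of the candidate selection comparison;
    # its reported Plagdet of 0.522 is consistent with a granularity of 1.
    p, r = precision_recall(tp=685, fp=494, fn=761)
    print(round(p, 3), round(r, 3), round(plagdet(p, r), 3))  # 0.581 0.474 0.522
```

A granularity of 1 means each plagiarized passage is reported as a single detection; values above 1, as in the official run, indicate fragmented detections and lower the final score.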
  • 16. Results • Specific corpus for CS: – 940 BSc theses + 8700 articles on CS from Wikipedia • Detecting theses written in English: TextCat – 307 BSc theses in English • Example of a detected passage:
    Plagiarized text: The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
    Original text from Wikipedia: Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
    • Some elements are incorrectly identified as plagiarism: quotes, bibliographic references
  • 17. Conclusions • Improving the corpus • The system uses several parameters that were determined empirically => use machine learning for finding the best values • Increase the processing speed • Improve the method: “bag of words” + information about the position of the words • Need better post-processing for real documents (like scientific papers or theses)
  • 18. Thank you! • Questions? • Discussion