AN APPROACH TO SOURCE CODE
     PLAGIARISM DETECTION AND
INVESTIGATION USING LATENT SEMANTIC
              ANALYSIS




Authors: Georgina Cosma and Mike Joy

Presented by
Varsha Bhat K(1DS09CS105)
INTRODUCTION

   Source code plagiarism: reusing source code
    authored by someone else and failing to adequately
    acknowledge the fact

   It may occur intentionally or unintentionally


   Used by higher education academics
CHALLENGES INVOLVED
   Detect similar file pairs
   Investigating the similar source code fragments
    within the detected files
   Determine if the similarity is suspicious or innocent
   Burden of proof:
     “Not only do we need to detect instances of
     plagiarism, we must also be able to demonstrate
     beyond reasonable doubt that those instances are
     not chance similarities.”
EXISTING TOOLS
Categories of tools:

   Fingerprint based systems


   String matching systems


   Parameterized matching systems


                   These were identified by Mozgovoy
FINGERPRINT BASED SYSTEMS
   Create a fingerprint for each file
   The fingerprint contains statistical information
   Various metrics are used for detecting
    plagiarism, e.g. Halstead's metrics
   An example of such a system is ITPAD
STRING MATCHING APPROACH
The steps involved are:
 The first stage is tokenization

 The source code is then written as a series of token
  strings
 The token strings are compared to check for similarity

 Example tools are:

      MOSS

      YAP3

      JPLAG
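
The tokenize-then-compare steps can be sketched as below. This is a minimal illustration, not how MOSS, YAP3 or JPlag are implemented: the keyword set is a small hypothetical one, and Python's `difflib.SequenceMatcher` stands in for the matching algorithms those tools actually use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny keyword set; a real tokenizer covers the whole language.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Rewrite source text as a token string: identifiers -> ID, numbers -> NUM."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)  # operators and punctuation are kept as-is
    return tokens


def token_similarity(source_a, source_b):
    """Similarity in [0, 1] between the token strings of two sources."""
    matcher = SequenceMatcher(None, tokenize(source_a), tokenize(source_b), autojunk=False)
    return matcher.ratio()


original = "int total = 0; for (int i = 0; i < n; i++) total += i;"
renamed = "int sum = 0; for (int k = 0; k < size; k++) sum += k;"
different = "while (x) { x = x - 1; }"

print(token_similarity(original, renamed))    # renaming identifiers does not change the token string
print(token_similarity(original, different))
```

Because every identifier collapses to the same `ID` token, renaming variables leaves the token string unchanged, which is why this family of tools is robust to that kind of disguise.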
PARAMETERIZED MATCHING SYSTEMS
 Detects identical and near duplicate sections of
  source code
 Achieved by matching source code sections whose
  identifiers have been systematically substituted
 Ex: DUP tool
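
One way to picture systematic substitution: two fragments match if one can be turned into the other by a consistent, one-to-one renaming of identifiers. The sketch below checks this by numbering identifiers in order of first appearance. It is only an illustration under that assumption; the function names and keyword set are hypothetical, and it is not the algorithm used by DUP.

```python
import re

# Hypothetical keyword set for illustration only.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def canonical_form(fragment):
    """Replace each identifier with P0, P1, ... in order of first appearance."""
    first_seen = {}
    out = []
    for lexeme in TOKEN_RE.findall(fragment):
        is_identifier = (lexeme[0].isalpha() or lexeme[0] == "_") and lexeme not in KEYWORDS
        if is_identifier:
            first_seen.setdefault(lexeme, "P%d" % len(first_seen))
            out.append(first_seen[lexeme])
        else:
            out.append(lexeme)
    return out


def parameterized_match(fragment_a, fragment_b):
    """True if the fragments are identical up to a consistent identifier renaming."""
    return canonical_form(fragment_a) == canonical_form(fragment_b)


print(parameterized_match("x = y + x;", "a = b + a;"))  # consistent renaming
print(parameterized_match("x = y + x;", "a = b + b;"))  # inconsistent renaming
```

Unlike plain tokenization, where all identifiers become the same token, this keeps track of which identifier is which, so an inconsistent renaming is not reported as a match.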


    INFORMATION RETRIEVAL METHODS
 Represents each program as an indexed set of keywords
 Computes the frequency of these keywords

 Then computes the pairwise similarity

 Ex: PDetect
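
The keyword-frequency idea can be sketched as follows. The keyword set and the use of cosine similarity are assumptions made for illustration; the slide does not say which similarity measure PDetect uses.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword index; a real system would index the language's full keyword list.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}


def keyword_frequencies(source):
    """Count how often each indexed keyword occurs in a program."""
    words = re.findall(r"[A-Za-z_]\w*", source)
    return Counter(word for word in words if word in KEYWORDS)


def cosine(freq_a, freq_b):
    """Cosine similarity between two keyword-frequency vectors."""
    dot = sum(freq_a[key] * freq_b[key] for key in freq_a)
    norm_a = math.sqrt(sum(value * value for value in freq_a.values()))
    norm_b = math.sqrt(sum(value * value for value in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(programs):
    """Map each unordered pair of program names to its similarity."""
    freqs = {name: keyword_frequencies(src) for name, src in programs.items()}
    return {(a, b): cosine(freqs[a], freqs[b]) for a, b in combinations(sorted(freqs), 2)}


programs = {
    "a.java": "for (int i = 0; i < n; i++) { if (x) return i; }",
    "b.java": "for (int k = 0; k < m; k++) { if (y) return k; }",
    "c.java": "while (t) { t = t - 1; }",
}
similarities = pairwise_similarity(programs)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```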
PLAGATE
   Detect similar source code files

   Investigate the similar code fragments within them

   The aim of the investigation is to gather evidence for
    proving plagiarism by indicating the contribution levels
    of fragments

   This enhances detection performance of existing
    algorithms

   Uses the technique of Latent Semantic Analysis to
    achieve this
LATENT SEMANTIC ANALYSIS
   It is an information retrieval technique
   Text collection is preprocessed

   Represented as a term-by-file matrix with entries a11, a12, a21, a22, ...

   Matrix transformation is applied
   Singular value decomposition performed
   Thus uncovers latent relationships
   Derives the meaning of terms by approximating the
    structure of term usage across documents using
    SVD
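
The matrix and SVD steps can be sketched with NumPy. The term list, the toy counts and the choice k = 2 are made up for illustration, and no term weighting is applied here.

```python
import numpy as np

# Toy term-by-file matrix A: rows are terms, columns are files, values are raw counts.
terms = ["read", "write", "sort", "swap"]
A = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 2.0],
])


def truncated_svd(matrix, k):
    """Return the k largest singular triplets: U_k, the singular values, and V_k transposed."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


U_k, s_k, Vt_k = truncated_svd(A, k=2)

# Rank-k approximation of A; the discarded dimensions are treated as noise.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Each file is now a point in the k-dimensional space.
file_vectors = (np.diag(s_k) @ Vt_k).T
print(file_vectors.shape)  # one row per file, k columns
```

Keeping only the k largest singular values gives the closest rank-k approximation of A, which is the step that is expected to reduce noise and expose latent relationships between terms and files.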
ADVANTAGES
 Can detect transitive relationships unlike the
  traditional text retrieval systems
 Helps reduce noise in the data

 Overcomes problems of synonymy and polysemy

 Changes to document structure will not affect the
  detection
 Language independent


 DISADVANTAGES
  Also gives relatively high similarity values for
   non-copied programs
SIMILARITY IN SOURCE CODE FILES
   Key factors for judging similarity in files are
      Nature of programming language and the problem
      Variance in solution
      Supporting source code already given
      Assignment requirements


   Fragments under investigation must not be
         Short
         Simple
         Standard
         Trivial
         Limited functionality
         Frequently published
SIMILARITY CATEGORIES
   Source code fragments have varying contribution to
    evidence for plagiarism
   Thus arises the need for a criterion for identifying
    the contributions
   Contribution levels
     1.   Contribution level 0- no contribution
     2.   Contribution level 1- low contribution
     3.   Contribution level 2- high contribution

   Similarity levels
     1.   Level 0- innocent
     2.   Level 1- suspicious
PLAGATE SYSTEM
   Aim: enhance the process of plagiarism detection
    and investigation
   It is integrated with external detection tools as an
    enhancer
   Components
    1.   PlaGate Detection tool (PGDT)
    2.   PlaGate Query tool (PGQT)
FUNCTIONALITY
SYSTEM REPRESENTATION
   File corpus C
             C = {F1, F2, ..., Fn}
   Source code fragment ‘s’ from source code file F
            F ∈ C
            F = {s1, s2, ..., sp}
   Set of source code fragments S


   File length ‘lf’ where


   Source code fragment length ‘ls’
LSA PROCESS IN PLAGATE
   Preprocess the files
   Transform the corpus of files into an m x n matrix
                    A=[        ]
   Term weighting algorithms are applied to the matrix
        value of term in file:




   SVD is performed on the weighted matrix A

   Dimensionality reduction
DETECTION AND CLASSIFICATION PROCESS IN
PLAGATE
 The PGQT component transforms the input file or
  fragment into a query vector ‘q’
 q is then projected onto the k-dimensional space

 Thus we get:




 We now measure the similarity between Q and all the
  source code files in the corpus using a similarity
  measure
 Cosine similarity is the most popular such measure
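
A sketch of the query step, assuming the usual LSA folding-in projection Q = qᵀ U_k Σ_k⁻¹ (the slide's own formula did not survive, so this form is an assumption). The toy matrix and k = 2 are illustrative, and the helper names are hypothetical.

```python
import numpy as np

# Toy term-by-file matrix (rows: terms, columns: files); values are illustrative.
A = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 1.0, 2.0],
])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T  # rows of V_k represent the files


def project_query(q):
    """Project a term-frequency query vector q into the k-dimensional space."""
    return (q @ U_k) / s_k


def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def rank_files(q):
    """Return (file index, similarity) pairs, most similar first."""
    Q = project_query(q)
    scores = [cosine(Q, V_k[j]) for j in range(V_k.shape[0])]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Using the first file itself as the query: it should rank first.
query = A[:, 0]
ranking = rank_files(query)
print(ranking)
```

A fragment is handled the same way as a whole file: it is turned into a term vector over the same terms and projected before the cosine is computed.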
EXPERIMENTATION

 Four corpora consisting of Java source code files
 The corpora were produced by undergraduate students at
  the University of Warwick
 Students were given simple skeleton code to start
  with
                      The Data Sets
PERFORMANCE EVALUATION MEASURES
 sim(Fa,Fb) gives the similarity of two files and is
  computed using a similarity measure
 Recall and Precision are the two most commonly used
  measures for information retrieval systems
 A threshold Ø is selected

 Files that have sim(Fa,Fb) ≥ Ø are detected
   Overall performance is evaluated by combining
    both measures




   The closer the value of F is to 1.00, the better the
    detection performance
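
The evaluation can be sketched as below. The formulas on the slide were lost in extraction, so this assumes the standard information retrieval definitions, with F as the harmonic mean of Precision and Recall. The similarity values, the threshold of 0.70 and the set of truly similar pairs are invented for illustration.

```python
def detect(similarities, threshold):
    """File pairs whose similarity is at or above the threshold."""
    return {pair for pair, sim in similarities.items() if sim >= threshold}


def evaluate(detected, relevant):
    """Return (precision, recall, F) for a set of detected pairs."""
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical similarity values sim(Fa, Fb) for a small corpus.
similarities = {
    ("F1", "F2"): 0.92,
    ("F1", "F3"): 0.75,
    ("F2", "F3"): 0.40,
    ("F2", "F4"): 0.55,
    ("F3", "F4"): 0.81,
}
# Hypothetical ground truth: pairs a human judged to be similar.
relevant = {("F1", "F2"), ("F3", "F4"), ("F2", "F4")}

detected = detect(similarities, threshold=0.70)
precision, recall, f_measure = evaluate(detected, relevant)
print(precision, recall, f_measure)
```

Lowering the threshold detects more pairs, which tends to raise Recall and lower Precision; F summarises the trade-off in one number.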
PLAGATE VS JPLAG AND SHERLOCK
 Performance is evaluated when the tools function alone
  and when they are integrated with PlaGate
 Lists of suspicious files are created
   Results


        Recall increases after integration with PGDT
        This constant increase indicates that PGDT and
         the external tools complement each other
        Further increase seen when both PGDT and
         PGQT are integrated but at the cost of Precision
   JPlag alone had high Precision and low
    Recall in all data sets
   Sherlock and JPlag, both string matching
    algorithms, vary significantly in detection
    performance
   Similarity often occurs in groups containing more
    than 2 files
   JPlag and Sherlock fail to parse some suspicious
    files due to Local Confusion
   Local Confusion occurs when code segments
    shorter than the minimum match length have been
    shuffled in the files; it affects these tools because
    they are string matching algorithms
   PlaGate does not suffer from this sort of local
    confusion as it does not depend on the structure of
    the code
Example
CONCLUSION
 An LSA based technique for plagiarism detection and
  investigation, used as an enhancer of existing tools
 Detection of missed source code files by current
  plagiarism detection tools
 Integration with PlaGate increases Recall at the
  cost of Precision
 Classification of similarity by PlaGate into
  contribution levels
 PlaGate is language independent

 Unlike other tools that find the similarity of two
  files, PlaGate finds the relative similarity
FUTURE WORK
   Automating dimensionality reduction is still a
    problem
   Misclassification of source code fragments
   PlaGate behavior is not as stable as the string
    matching algorithms
An approach to source code plagiarism
An approach to source code plagiarism

More Related Content

What's hot

AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedLidia Pivovarova
 
Behavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code ClonesBehavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code ClonesTELKOMNIKA JOURNAL
 
A Novel Approach for Code Clone Detection Using Hybrid Technique
A Novel Approach for Code Clone Detection Using Hybrid TechniqueA Novel Approach for Code Clone Detection Using Hybrid Technique
A Novel Approach for Code Clone Detection Using Hybrid TechniqueINFOGAIN PUBLICATION
 
Survey of universal authentication protocol for mobile communication
Survey of universal authentication protocol for mobile communicationSurvey of universal authentication protocol for mobile communication
Survey of universal authentication protocol for mobile communicationAhmad Sharifi
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executablesUltraUploader
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsAdrian Paschke
 
Using Clone Detection to Identify Bugs in Concurrent Software
Using Clone Detection to Identify Bugs in Concurrent SoftwareUsing Clone Detection to Identify Bugs in Concurrent Software
Using Clone Detection to Identify Bugs in Concurrent SoftwareICSM 2010
 
A novel approach for clone group mapping
A novel approach for clone group mappingA novel approach for clone group mapping
A novel approach for clone group mappingijseajournal
 
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT IAEME Publication
 
Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Amogh Kawle
 
IRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine LearningIRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine LearningIRJET Journal
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006pogil
 
o-checker : Malicious document file detection tool - Malicious feature can be...
o-checker : Malicious document file detection tool - Malicious feature can be...o-checker : Malicious document file detection tool - Malicious feature can be...
o-checker : Malicious document file detection tool - Malicious feature can be...CODE BLUE
 
Proposed Arabic Text Steganography Method Based on New Coding Technique
Proposed Arabic Text Steganography Method Based on New Coding TechniqueProposed Arabic Text Steganography Method Based on New Coding Technique
Proposed Arabic Text Steganography Method Based on New Coding TechniqueIJERA Editor
 
Butler
ButlerButler
Butleranesah
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus typejins0618
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineeringNakul Sharma
 

What's hot (20)

AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
 
Behavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code ClonesBehavioral Analysis for Detecting Code Clones
Behavioral Analysis for Detecting Code Clones
 
A Novel Approach for Code Clone Detection Using Hybrid Technique
A Novel Approach for Code Clone Detection Using Hybrid TechniqueA Novel Approach for Code Clone Detection Using Hybrid Technique
A Novel Approach for Code Clone Detection Using Hybrid Technique
 
Survey of universal authentication protocol for mobile communication
Survey of universal authentication protocol for mobile communicationSurvey of universal authentication protocol for mobile communication
Survey of universal authentication protocol for mobile communication
 
A hybrid model to detect malicious executables
A hybrid model to detect malicious executablesA hybrid model to detect malicious executables
A hybrid model to detect malicious executables
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and Systems
 
Using Clone Detection to Identify Bugs in Concurrent Software
Using Clone Detection to Identify Bugs in Concurrent SoftwareUsing Clone Detection to Identify Bugs in Concurrent Software
Using Clone Detection to Identify Bugs in Concurrent Software
 
A novel approach for clone group mapping
A novel approach for clone group mappingA novel approach for clone group mapping
A novel approach for clone group mapping
 
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
GENERIC CODE CLONING METHOD FOR DETECTION OF CLONE CODE IN SOFTWARE DEVELOPMENT
 
Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text Named Entity Recognition For Hindi-English code-mixed Twitter Text
Named Entity Recognition For Hindi-English code-mixed Twitter Text
 
IRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine LearningIRJET - Pseudocode to Python Translation using Machine Learning
IRJET - Pseudocode to Python Translation using Machine Learning
 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006
 
o-checker : Malicious document file detection tool - Malicious feature can be...
o-checker : Malicious document file detection tool - Malicious feature can be...o-checker : Malicious document file detection tool - Malicious feature can be...
o-checker : Malicious document file detection tool - Malicious feature can be...
 
Proposed Arabic Text Steganography Method Based on New Coding Technique
Proposed Arabic Text Steganography Method Based on New Coding TechniqueProposed Arabic Text Steganography Method Based on New Coding Technique
Proposed Arabic Text Steganography Method Based on New Coding Technique
 
Malware analysis
Malware analysisMalware analysis
Malware analysis
 
Butler
ButlerButler
Butler
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineering
 

Viewers also liked

AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...Christos Katsanos
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationChristoph Trattner
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Krishna Bollojula
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...Damiano Spina
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Kyunghoon Kim
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)rchbeir
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_wordszukun
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)muzzy4friends
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Ra'Fat Al-Msie'deen
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011aneeshabakharia
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionCory Andrew Henson
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureRakuten Group, Inc.
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML, Inc
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesJinYeong Bak
 
PhD defense Koen Deschacht
PhD defense Koen DeschachtPhD defense Koen Deschacht
PhD defense Koen Deschachtguest1add48f
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
Brain controlled-car-for-disabled
Brain controlled-car-for-disabledBrain controlled-car-for-disabled
Brain controlled-car-for-disabledshahnaazmd
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiSocial Media Camp
 

Viewers also liked (20)

AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
 
Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1
 
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_words
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
Practical Machine Learning
Practical Machine Learning Practical Machine Learning
Practical Machine Learning
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine Perception
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet Mixture
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 Release
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
 
PhD defense Koen Deschacht
PhD defense Koen DeschachtPhD defense Koen Deschacht
PhD defense Koen Deschacht
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
Brain controlled-car-for-disabled
Brain controlled-car-for-disabledBrain controlled-car-for-disabled
Brain controlled-car-for-disabled
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
 

Similar to An approach to source code plagiarism

A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism DetectionKarla Adamson
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYIJDKP
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSIS
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSISCORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSIS
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSISijseajournal
 
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...Ra'Fat Al-Msie'deen
 
36x48_new_modelling_cloud_infrastructure
36x48_new_modelling_cloud_infrastructure36x48_new_modelling_cloud_infrastructure
36x48_new_modelling_cloud_infrastructureWashington Garcia
 
2 column paper
2 column paper2 column paper
2 column paperAksh Gupta
 
2 column paper
2 column paper2 column paper
2 column paperAksh Gupta
 
plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniquesNimisha T
 
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...Improving Intrusion Detection with Deep Packet Inspection and Regular Express...
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...IJCSIS Research Publications
 
A Tool to Detect Plagiarism in Java Source Code.pdf
A Tool to Detect Plagiarism in Java Source Code.pdfA Tool to Detect Plagiarism in Java Source Code.pdf
A Tool to Detect Plagiarism in Java Source Code.pdfKayla Smith
 
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...Parvez Mahbub
 
DoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known SampleDoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known SampleCSCJournals
 
detection and classification of malware.pptx
detection and classification of malware.pptxdetection and classification of malware.pptx
detection and classification of malware.pptxJamesFranklen
 
Security Application for Malicious Code Detection using Data Mining
Security Application for Malicious Code Detection using Data MiningSecurity Application for Malicious Code Detection using Data Mining
Security Application for Malicious Code Detection using Data MiningPravinYalameli
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsShivansh Gaur
 
csmalware_malware
csmalware_malwarecsmalware_malware
csmalware_malwareJoshua Saxe
 

Similar to An approach to source code plagiarism (20)

A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism Detection
 
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYSOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSIS
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSISCORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSIS
CORRELATING FEATURES AND CODE BY DYNAMIC AND SEMANTIC ANALYSIS
 
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...
YamenTrace: Requirements Traceability - Recovering and Visualizing Traceabili...
 
36x48_new_modelling_cloud_infrastructure
36x48_new_modelling_cloud_infrastructure36x48_new_modelling_cloud_infrastructure
36x48_new_modelling_cloud_infrastructure
 
2 column paper
2 column paper2 column paper
2 column paper
 
2 column paper
2 column paper2 column paper
2 column paper
 
plagiarism detection tools and techniques
plagiarism detection tools and techniquesplagiarism detection tools and techniques
plagiarism detection tools and techniques
 
P33077080
P33077080P33077080
P33077080
 
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...Improving Intrusion Detection with Deep Packet Inspection and Regular Express...
Improving Intrusion Detection with Deep Packet Inspection and Regular Express...
 
A Tool to Detect Plagiarism in Java Source Code.pdf
A Tool to Detect Plagiarism in Java Source Code.pdfA Tool to Detect Plagiarism in Java Source Code.pdf
A Tool to Detect Plagiarism in Java Source Code.pdf
 
Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...Author Identification of Source Code Segments Written by Multiple Authors Usi...
Author Identification of Source Code Segments Written by Multiple Authors Usi...
 
DoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known SampleDoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known Sample
 
L1803058388
L1803058388L1803058388
L1803058388
 
20170412 om patri pres 153pdf
20170412 om patri pres 153pdf20170412 om patri pres 153pdf
20170412 om patri pres 153pdf
 
detection and classification of malware.pptx
detection and classification of malware.pptxdetection and classification of malware.pptx
detection and classification of malware.pptx
 
Security Application for Malicious Code Detection using Data Mining
Security Application for Malicious Code Detection using Data MiningSecurity Application for Malicious Code Detection using Data Mining
Security Application for Malicious Code Detection using Data Mining
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
 
csmalware_malware
csmalware_malwarecsmalware_malware
csmalware_malware
 

An approach to source code plagiarism

  • 1. AN APPROACH TO SOURCE CODE PLAGIARISM DETECTION AND INVESTIGATION USING LATENT SEMANTIC ANALYSIS
      Authors: Georgina Cosma and Mike Joy
      Presented by Varsha Bhat K (1DS09CS105)
  • 2. INTRODUCTION
      - Source code plagiarism: reusing source code authored by someone else and failing to adequately acknowledge the fact
      - It may occur intentionally or unintentionally
      - Detection tools are used by academics in higher education
  • 3. CHALLENGES INVOLVED
      - Detect similar file pairs
      - Investigate the similar source code fragments within the detected files
      - Determine whether the similarity is suspicious or innocent
      - Burden of proof: “Not only do we need to detect instances of plagiarism, we must also be able to demonstrate beyond reasonable doubt that those instances are not chance similarities.”
  • 4. EXISTING TOOLS
      Categories of tools, as identified by Mozgovoy:
      - Fingerprint-based systems
      - String matching systems
      - Parameterized matching systems
  • 5. FINGERPRINT-BASED SYSTEMS
      - Create a fingerprint for each file
      - The fingerprint contains statistical information about the file
      - Various metrics are used for detecting plagiarism, e.g. Halstead's metrics
      - An example of such a system is ITPAD
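As a rough illustration of a metric-based fingerprint, the sketch below computes Halstead's basic counts (distinct and total operators and operands) together with vocabulary, length, and volume. The tokenizer and the operator/operand split are simplifying assumptions made for this example (every word or number counts as an operand, every other symbol as an operator); this is not how ITPAD or any specific tool is implemented.

```python
import math
import re

# Assumed, deliberately crude lexer: words, integers, or single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def is_operand(token):
    """Assumption: anything starting with a letter, digit or '_' is an operand."""
    return token[0].isalnum() or token[0] == "_"


def fingerprint(source):
    """Return Halstead-style statistics for one source file."""
    tokens = TOKEN_RE.findall(source)
    operands = [t for t in tokens if is_operand(t)]
    operators = [t for t in tokens if not is_operand(t)]
    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    return {"n1": n1, "n2": n2, "N1": N1, "N2": N2,
            "vocabulary": vocabulary, "length": length, "volume": volume}


fp = fingerprint("a = b + a;")
```

Two files whose fingerprints are numerically close would then be flagged for closer inspection; how "close" is defined is left open here.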
  • 6. STRING MATCHING APPROACH
      The steps involved are:
      - Stage I is tokenization
      - The source code is then written as a series of token strings
      - The token strings are compared to check for similarity
      - Example tools: MOSS, YAP3, JPlag
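The tokenize-then-compare steps can be sketched as follows. The keyword set and regular expression are hypothetical and cover only a toy subset of a C-like language, and Python's `difflib.SequenceMatcher` stands in for the comparison step; the tools named above use their own matching algorithms, which are not reproduced here.

```python
import difflib
import re

# Hypothetical, tiny keyword set used only for this illustration.
KEYWORDS = {"int", "for", "if", "else", "return", "while"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Map source text to abstract tokens: keywords, ID, NUM, or the symbol itself."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")       # every identifier collapses to one token
        elif lexeme.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def token_similarity(a, b):
    """Similarity in [0, 1] between the token strings of two fragments."""
    matcher = difflib.SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


original = "int total = 0; for (int i = 0; i < n; i++) total += i;"
renamed = "int sum = 0; for (int k = 0; k < count; k++) sum += k;"
unrelated = "return x;"
```

Because identifiers are collapsed to a single token, renaming variables does not change the token string, which is the point of comparing tokens rather than raw text.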
  • 7. PARAMETERIZED MATCHING SYSTEMS
      - Detect identical and near-duplicate sections of source code
      - Achieved by matching source code sections whose identifiers have been systematically substituted
      - Example: the DUP tool
      INFORMATION RETRIEVAL METHODS
      - Represent each program as an indexed set of keywords
      - Compute the frequency of these keywords
      - Then compute the pairwise similarity
      - Example: PDetect
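To make the idea of "systematically substituted" identifiers concrete, here is a minimal sketch of a parameterized match: each identifier is replaced by a placeholder numbered in order of first appearance, so two fragments match only if the renaming is consistent. The keyword set, the lexer, and the first-appearance encoding are assumptions for illustration; this is a simplification and not the DUP tool's actual algorithm.

```python
import re

KEYWORDS = {"int", "for", "if", "else", "return", "while"}  # hypothetical subset
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def canonical_form(fragment):
    """Replace each identifier by P<k>, where k is its order of first appearance."""
    names = {}
    out = []
    for lexeme in TOKEN_RE.findall(fragment):
        is_name = lexeme[0].isalpha() or lexeme[0] == "_"
        if is_name and lexeme not in KEYWORDS:
            names.setdefault(lexeme, "P%d" % len(names))
            out.append(names[lexeme])
        else:
            out.append(lexeme)
    return out


def parameterized_match(a, b):
    """True if b can be obtained from a by a consistent renaming of identifiers."""
    return canonical_form(a) == canonical_form(b)


consistent = parameterized_match("x = y + x;", "a = b + a;")
inconsistent = parameterized_match("x = y + x;", "a = b + b;")
```

Unlike collapsing every identifier to one token, this encoding distinguishes a consistent renaming from a fragment that merely has the same shape.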
  • 8. PLAGATE
      - Detects similar source code files
      - Investigates the similar code fragments within them
      - The aim of the investigation is to gather evidence for proving plagiarism by indicating the contribution levels of fragments
      - This enhances the detection performance of existing algorithms
      - Uses Latent Semantic Analysis to achieve this
  • 9. LATENT SEMANTIC ANALYSIS
      - It is an information retrieval technique
      - The text collection is preprocessed
      - It is represented as a term-by-file matrix, e.g. $\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$
      - A matrix transformation is applied
      - Singular value decomposition (SVD) is performed
      - This uncovers latent relationships
      - LSA derives the meaning of terms by approximating the structure of term usage among documents using SVD
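The pipeline on this slide can be sketched with NumPy: build a term-by-file matrix from preprocessed text, run SVD, and keep only the largest singular values. The three toy "files", the whitespace tokenization, and the choice k = 2 are assumptions made purely for illustration.

```python
import numpy as np

# Toy corpus (assumed): each "file" is already preprocessed into terms.
files = {
    "F1": "read file parse token",
    "F2": "read file count token",
    "F3": "draw window paint pixel",
}

# Term-by-file matrix: rows are terms, columns are files, entries are counts.
terms = sorted({t for text in files.values() for t in text.split()})
A = np.array(
    [[text.split().count(t) for text in files.values()] for t in terms],
    dtype=float,
)

# Singular value decomposition: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation keeps the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The reduced matrix `A_k` has the same shape as `A` but lower rank; terms that co-occur across files are pulled together, which is how latent relationships surface.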
  • 10. ADVANTAGES
      - Can detect transitive relationships, unlike traditional text retrieval systems
      - Helps reduce noise in the data
      - Overcomes the problems of synonymy and polysemy
      - Changes to document structure do not affect detection
      - Language independent
      DISADVANTAGES
      - Also gives relatively high similarity values for programs that were not copied
  • 11. SIMILARITY IN SOURCE CODE FILES
      - Key factors for judging similarity in files:
          - Nature of the programming language and the problem
          - Variance in solutions
          - Supporting source code already given
          - Assignment requirements
      - Fragments under investigation must not be:
          - Short
          - Simple
          - Standard
          - Trivial
          - Of limited functionality
          - Frequently published
  • 12. SIMILARITY CATEGORIES
      - Source code fragments contribute to varying degrees to the evidence for plagiarism
      - Hence the need for a criterion for identifying these contributions
      - Contribution levels:
          1. Contribution level 0: no contribution
          2. Contribution level 1: low contribution
          3. Contribution level 2: high contribution
      - Similarity levels:
          1. Level 0: innocent
          2. Level 1: suspicious
  • 13. PLAGATE SYSTEM
      - Aim: enhance the process of plagiarism detection and investigation
      - It is integrated with external detection tools as an enhancer
      - Components:
          1. PlaGate Detection Tool (PGDT)
          2. PlaGate Query Tool (PGQT)
  • 15. SYSTEM REPRESENTATION
      - File corpus $C = \{F_1, F_2, \ldots, F_n\}$
      - A source code fragment $s$ comes from a source code file $F \in C$, where $F = \{s_1, s_2, \ldots, s_p\}$
      - Set of source code fragments $S$
      - File length $l_f$ (its definition is not preserved in this transcript)
      - Source code fragment length $l_s$
  • 16. LSA PROCESS IN PLAGATE
      - Preprocess the files
      - Transform the corpus of files into an m × n term-by-file matrix $A = [a_{ij}]$
      - Term weighting algorithms are applied to the value of each term in each file (the weighting formula is not preserved in this transcript)
      - SVD is performed on the weighted matrix A
      - The dimensionality is reduced
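The weighting and dimensionality reduction steps might look like the sketch below. Since the slide's weighting formula is not preserved, the scheme used here (raw term frequency, a per-term normalizing global weight, then cosine normalization of each file) is an assumption, as are the toy matrix and k = 2.

```python
import numpy as np


def weight_matrix(A):
    """Assumed weighting: tf * per-term normal weight, then unit-length files."""
    A = np.asarray(A, dtype=float)
    # Global weight per term (row): 1 / sqrt(sum of squared frequencies).
    row_norms = np.sqrt((A ** 2).sum(axis=1, keepdims=True))
    row_norms[row_norms == 0] = 1.0   # leave all-zero rows untouched
    W = A / row_norms
    # Document normalization: scale each file (column) to unit length.
    col_norms = np.sqrt((W ** 2).sum(axis=0, keepdims=True))
    col_norms[col_norms == 0] = 1.0
    return W / col_norms


def reduce_dimensions(W, k):
    """Truncated SVD: keep the k largest singular triplets of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


# Toy 4-term by 3-file frequency matrix (assumed data).
A = np.array([[2, 1, 0],
              [1, 1, 0],
              [0, 0, 3],
              [0, 1, 1]])
W = weight_matrix(A)
U_k, s_k, V_k = reduce_dimensions(W, k=2)
```

Each row of `V_k` is then the k-dimensional representation of one file; choosing k is a tuning decision that this sketch does not address.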
  • 17. DETECTION AND CLASSIFICATION PROCESS IN PLAGATE
      - The PGQT component transforms the input file or fragment into a query vector q
      - q is then projected onto the k-dimensional space
      - This gives the projected query vector Q (the formula is not preserved in this transcript)
      - The similarity between Q and every source code file in the corpus is then measured using a similarity measure
      - The cosine similarity measure is the most popular
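A minimal sketch of the query step follows. It assumes the standard LSA fold-in, $Q = q^{T} U_k \Sigma_k^{-1}$, and compares Q against the rows of $V_k$ with cosine similarity; the slide's own formula is missing, so treat this as one common formulation rather than PlaGate's confirmed one. The matrix and k are toy values.

```python
import numpy as np


def project_query(q, U_k, s_k):
    """Fold a term-frequency query vector into the k-dimensional space."""
    return (q @ U_k) / s_k          # q^T U_k Sigma_k^{-1}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy term-by-file matrix (6 terms, 3 files); files 1 and 2 share terms.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

q = A[:, 0]                         # use the first file itself as the query
Q = project_query(q, U_k, s_k)
sims = [cosine(Q, V_k[j]) for j in range(A.shape[1])]
```

When the query is a file already in the corpus, the projection lands exactly on that file's row of `V_k`. In this toy case the second file also scores close to 1 after reduction, which illustrates why similarity values can be high even for files that are not identical.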
  • 18. EXPERIMENTATION
      - Four corpora consisting of Java source code files
      - The corpora were produced by undergraduate students at the University of Warwick
      - Students were given simple skeleton code to start with
      - The Data Sets (table not preserved)
  • 19. PERFORMANCE EVALUATION MEASURES
      - sim(Fa, Fb) gives the similarity of two files and is computed using a similarity measure
      - Recall and Precision are the two most commonly used measures for information retrieval systems
      - A threshold Ø is selected
      - File pairs with sim(Fa, Fb) ≥ Ø are detected
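Thresholding and the two measures can be sketched in a few lines. Precision is taken as the fraction of detected pairs that are truly relevant and Recall as the fraction of relevant pairs that were detected; the similarity values, the set of relevant pairs, and the threshold 0.80 are made-up illustration data.

```python
def detect(similarities, threshold):
    """Return the file pairs whose similarity is at or above the threshold."""
    return {pair for pair, sim in similarities.items() if sim >= threshold}


def precision_recall(detected, relevant):
    """Precision and Recall of a detected set against the known relevant set."""
    true_positives = len(detected & relevant)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical similarity scores and ground truth.
similarities = {
    ("F1", "F2"): 0.92,
    ("F1", "F3"): 0.40,
    ("F2", "F3"): 0.81,
    ("F3", "F4"): 0.75,
}
relevant = {("F1", "F2"), ("F3", "F4"), ("F1", "F4")}

detected = detect(similarities, threshold=0.80)
precision, recall = precision_recall(detected, relevant)
```

Lowering the threshold would pick up ("F3", "F4") and raise Recall, at the risk of admitting more innocent pairs and lowering Precision.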
  • 20. Overall performance is evaluated by combining both measures into a single value F
      - The closer the value of F is to 1.00, the better the detection performance
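The slide does not preserve how F combines the two measures. The sketch below assumes the widely used harmonic-mean form, $F = 2PR / (P + R)$; the presentation may have used a different combination, so read this only as an example of a single score that approaches 1.00 as both Precision and Recall improve.

```python
def f_measure(precision, recall):
    """Harmonic mean of Precision and Recall (assumed definition of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


perfect = f_measure(1.0, 1.0)
unbalanced = f_measure(0.5, 1 / 3)
```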
  • 21. PLAGATE VS JPLAG AND SHERLOCK
      - Performance is evaluated both when the tools function alone and when they are integrated with PlaGate
      - Lists of suspicious files are created
  • 22. RESULTS
      - Recall increases after integration with PGDT
      - This consistent increase indicates that PGDT and the external tools complement each other
      - A further increase is seen when both PGDT and PGQT are integrated, but at the cost of Precision
  • 23. JPlag alone had high Precision and low Recall in all data sets
      - Sherlock and JPlag, both string matching algorithms, vary significantly in detection performance
  • 25. Similarity often occurs in groups containing more than two files
      - JPlag and Sherlock fail to parse some suspicious files due to local confusion
      - Local confusion affects these tools because they are string matching algorithms: it occurs when code segments shorter than the minimum match length have been shuffled within files
      - PlaGate does not suffer from this kind of local confusion, as it does not depend on the structure of the code
  • 27. CONCLUSION
      - An LSA-based technique for plagiarism detection and investigation, used as an enhancer
      - Detects source code files missed by current plagiarism detection tools
      - Integration with PlaGate increases Recall at the cost of Precision
      - PlaGate classifies similarity into contribution levels
      - PlaGate is language independent
      - Unlike other tools, which find the similarity of two files, PlaGate finds the relative similarity
  • 28. FUTURE WORK
      - Automating dimensionality reduction is still a problem
      - Misclassification of source code fragments
      - PlaGate's behavior is not as stable as that of the string matching algorithms