N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp
An Adaptive Image-based
Plagiarism Detection Approach
Norman Meuschke
Information Science Group
University of Konstanz
www.isg.uni.kn
An Adaptive Image-based Plagiarism Detection Approach - Meuschke et al.
University of Konstanz
Map data ©2018 GeoBasis-DE/BKG(©2009), Google
Outline
• Overview of Research on Academic Plagiarism Detection
• Image-based Plagiarism Detection Approach
• Evaluation Results
Academic Plagiarism
“The use of ideas, concepts, words, or structures
without appropriately acknowledging the source to
benefit in a setting where originality is expected.”
Source: Teddi Fishman. 2009. “We know it when we see it” is not good enough: toward a standard definition of plagiarism that
transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity.
Plagiarism Forms
Note: plagiarism forms are not mutually exclusive
Paraphrasing
▪ intentional rewriting
▪ no / insufficient reference to the source
Structural and idea plagiarism
▪ little or no verbatim text overlap
Cross-language plagiarism
▪ manual/automated conversion of text into
other language to hide its origin
Copy & paste
▪ taking content verbatim from other source
Shake & paste
▪ copy & paste of text segments with slight
adjustments, e.g., synonym substitutions
Technical disguise
▪ techniques that exploit weaknesses of
current detection methods
[Figure axis: level of obfuscation, from weak to strong]
Detection Capabilities
Copy & paste, Shake & paste (status: solved)
▪ n-gram fingerprinting
▪ vector space models
▪ text alignment
▪ exhaustive string matching
Technical disguise (status: solvable, no research needed)
▪ encoding checks
▪ checks for textual content
▪ checks for large images
Paraphrasing, Structural and idea plagiarism
▪ synonym expansion (WordNet)
▪ Semantic Role Labeling
▪ Latent Semantic Analysis
▪ POS-aware text matching
Cross-language plagiarism
▪ CL Character N-Gram Comp.
▪ CL Explicit Semantic Analysis
▪ CL Alignment-based Similarity Analysis
Status for paraphrasing, structural and idea plagiarism, and cross-language plagiarism:
- intense research
- methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise)
[Figure axis: level of obfuscation, from weak to strong]
Our Research
[Figure: hybrid plagiarism detection process from start to end (candidate retrieval, detailed comparison, post processing, human inspection, visualization). Heuristics shown: full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity. Legend: completed research, current research, future research.]
• Combine analysis of textual and non-textual content features
Idea of Image-based Plagiarism Detection
• Images in academic documents convey much semantic
information in compressed format independent of the text
• Much research on Content-based Image Retrieval (CBIR)
• Little adaptation of CBIR methods to plagiarism detection (PD)
• exact and cropped image copies
• affinely transformed images (scaling, rotation, projection)
• slight alterations of appearance (blurring, lower resolution, noise)
Research Gap
• Current image-based PD approaches problematic for:
• compound images
• rearranged images
• images mostly containing text (typically tables inserted as figures)
• visually differing, semantically equivalent data visualizations
• Goal: image-based PD process that:
• combines established and new analysis methods to cover
heterogeneous images in academic documents
• adaptively applies suitable analysis steps
• flexibly quantifies suspiciousness
• is extensible in the future
Process
[Figure: processing pipeline. Input doc. → extract image → decompose image → classify image → perceptual hashing, ratio hashing, or OCR followed by k-gram text matching and positional text matching → distance calculation against the reference DB (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images]
Perceptual Hashing
• Efficient CBIR method to reliably find near image copies
• Uses most apparent visual features in images
• Creates non-unique fingerprints that can be compared
• Fingerprints are invariant to:
• scaling
• aspect ratio changes
• changes to brightness, contrast and colors
• We use the Discrete Cosine Transform (DCT) and the Hamming distance
Image Source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
k-gram Text Matching
• To identify tables inserted as figures
and images with little visual similarity
• Text extracted using open source
OCR engine Tesseract
• Granularity:
• character 3-grams
• no chunk selection
• Similarity measure: $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$
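The measure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that ⊖ denotes the symmetric set difference, that the formula compares set cardinalities, and that OCR output is lowercased and whitespace-normalized before the 3-grams are built. The function names are hypothetical.

```python
def char_kgrams(text: str, k: int = 3) -> set[str]:
    """Return the set of character k-grams of a text (no chunk selection)."""
    # Assumption: case and whitespace are normalized before k-gram extraction.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def kgram_distance(text1: str, text2: str, k: int = 3) -> float:
    """d = |K1 symmetric-difference K2| / |K1 intersection K2|."""
    k1, k2 = char_kgrams(text1, k), char_kgrams(text2, k)
    shared = k1 & k2
    if not shared:
        # Assumption: images without any shared k-gram are infinitely distant.
        return float("inf")
    return len(k1 ^ k2) / len(shared)
```

Identical texts yield a distance of 0, and the distance grows as fewer k-grams are shared.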
Position-aware Text Matching
• To account for the typically small amount of text in images, a problem aggravated
by OCR errors
• Process:
• Scale images to same height (here: 800px)
• Define proximity region around identified text (here: 50px circle)
• Project proximity regions of input image to potential source
• Only consider matching characters in projected proximity regions
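$s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
[Figure: input image (height h_1 = 800px, width w_1) and reference image (height h_2 = 800px, width w_2) with proximity regions of radius r = 25px around the characters A, B, C, D and X. Legend: positional character match, positional character mismatch.]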
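The process can be illustrated with the following Python sketch. It makes several assumptions that the slides do not state: OCR output is available as characters with center coordinates, each source character can be matched at most once (greedy matching), and the proximity region is a circle of radius 25px. The `Glyph` type and both function names are hypothetical.

```python
import math
from typing import NamedTuple


class Glyph(NamedTuple):
    """A recognized character with its center position in pixels (hypothetical type)."""
    char: str
    x: float
    y: float


def scale_to_height(glyphs, image_height, target_height=800.0):
    """Rescale glyph positions as if the image had been scaled to target_height."""
    factor = target_height / image_height
    return [Glyph(g.char, g.x * factor, g.y * factor) for g in glyphs]


def positional_similarity(input_glyphs, source_glyphs, radius=25.0):
    """s = (positional matches) / max(|K1|, |K2|)."""
    if not input_glyphs or not source_glyphs:
        return 0.0
    unused = list(source_glyphs)
    matches = 0
    for g in input_glyphs:
        for candidate in unused:
            close = math.hypot(candidate.x - g.x, candidate.y - g.y) <= radius
            if candidate.char == g.char and close:
                matches += 1
                unused.remove(candidate)  # assumption: one-to-one matching
                break
    return matches / max(len(input_glyphs), len(source_glyphs))
```

Characters that match in value but lie outside the projected proximity region do not count, which is what makes the measure tolerant of scattered OCR errors while still requiring a similar layout.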
Ratio Hashing
• First approach to target reuse of data (and its visualization)
• identifies equivalent, yet visually differing bar charts
[Figure: two visually differing bar charts (y-axis from 0 to 900) with identical relative bar heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
$d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
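The worked example above can be reproduced with a short Python sketch. It assumes that a ratio hash is the list of bar heights divided by the tallest bar and sorted in descending order, that the distance is the sum of absolute differences, and that a shorter hash is padded with zeros. These details and the function names are illustrative assumptions, not taken from the slides.

```python
def ratio_hash(bar_heights):
    """Relative bar heights, normalized by the tallest bar, in descending order."""
    if not bar_heights or max(bar_heights) <= 0:
        raise ValueError("at least one bar with positive height is required")
    tallest = max(bar_heights)
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_hash_distance(hash1, hash2):
    """Sum of absolute differences between two ratio hashes."""
    length = max(len(hash1), len(hash2))
    # Assumption: charts with fewer bars are padded with zero-height bars.
    a = list(hash1) + [0.0] * (length - len(hash1))
    b = list(hash2) + [0.0] * (length - len(hash2))
    return sum(abs(x - y) for x, y in zip(a, b))
```

Because only height ratios are compared, two charts that show the same data with a different scale, bar order, or styling receive a distance of zero.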
Outlier Detection
• To quantify suspiciousness of method-specific distance scores
• Two assumptions:
• image only suspicious if comparably high similarity (small distance)
to small set (c=9) of other images
• clear separation of distance scores of the highly similar set of images
[Figure: absolute distance scores D_m (0, 25, 33, 40, 80, 90) and relative deltas of distance scores D'_m (0.3, 0.2, 0.1), with d'_i = (80 − 40) / 40 = 1 at the split. The list is split into D'_{m,1} (outliers considered as potential sources) and D'_{m,2} (images considered as unrelated) where d'_k ≥ 1. Condition for list split: k ≤ c.]
Outlier Detection Continued
• Find outlier group:
• split list of relative distance deltas if a distance is at least twice as
large as its predecessor (3x as large for k-gram matching)
• Score suspicious (𝑠 ≥ 0.5 ) if least similar outlier has distance
margin to collection that is twice as large as outlier’s distance to the
input image
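The list split can be sketched as follows. This is a minimal illustration under stated assumptions: distances are sorted in ascending order, the first position at which a distance is at least `min_ratio` times its predecessor separates outliers from the rest, and the outlier group may hold at most c = 9 images. The function name and parameters are hypothetical, and the suspiciousness score s is not computed because the slides do not give its formula.

```python
def split_outliers(distances, min_ratio=2.0, max_group_size=9):
    """Split distances into (outliers, rest) at the first sufficiently large jump.

    min_ratio=2.0 corresponds to 'at least twice as large as its predecessor';
    use 3.0 for k-gram text matching.
    """
    ordered = sorted(distances)
    for k in range(1, len(ordered)):
        if k > max_group_size:  # condition for list split: k <= c
            break
        prev, cur = ordered[k - 1], ordered[k]
        if cur > prev and cur >= min_ratio * prev:
            return ordered[:k], ordered[k:]
    return [], ordered  # no clear separation: nothing is flagged
```

With the distance values from the figure (25, 33, 40, 80, 90, treating the leading 0 as the axis origin, which is an assumption), the split falls between 40 and 80, so the three closest images are returned as potential sources.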
Evaluation
• Source for test images: VroniPlag collection
• crowd-sourced effort investigating plagiarism allegations
• 196 manually examined academic works (mostly PhD theses)
• most allegations confirmed by responsible universities
• Targeted crawl for all annotated ‘fragments’ containing images
• confirmed by at least two examiners
• Selection of 15 representative cases (mostly from life sciences)
• Cases embedded in 4,500 images obtained from PubMed Central
Example: Near Copies
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
Example: Weak Alteration
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Ry/073
Example: Moderate Alteration
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Ab/017
Example: Strong Alteration
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Ad/068
Results
• Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at
least one of the methods (𝑅 = 0.73)
• Outlier detection effective (𝑃 = 1):
• For all input images with 𝑠 ≥ 0.5, true source image at the top rank
• For all input images with 𝑠 < 0.5, no source image retrieved among
the top-ten most similar images, i.e. no false positives
• Perceptual hashing with sub-image extraction worked best for
near copies and weakly altered images (found 6 of 9 cases)
• Text analysis performed better than perceptual hashing for
moderately and strongly altered images
• if the quality of the image was high enough to perform OCR reliably and
sufficient text content was present.
Results Continued
• Text analysis approaches identified 3 of 4 cases involving tables
• position-aware text matching more robust to low OCR quality
• k-gram matching identified more cases
• combination of approaches allows processing more images
• Dataset contained only one bar chart, for which ratio hashing
yielded extremely suspicious score (𝑠 = 0.92)
Source Image Reused Image
Source: http://de.vroniplag.wikia.com/wiki/Cz/047
Discussion & Conclusion
• Image-based PD promising complement to other methods
• Small test collection, but restrictive outlier detection procedure
will prevent false positives also in larger collections
• if reduced precision is acceptable, threshold can be changed
interactively by user
• Approach well suited for scaling
• Preprocessing in parallel
• Options described to scale analysis methods
• Approach easily extensible with new methods
• New input scores to outlier detection
• Code: www.purl.org/imagepd
Future Work
• More detection methods tailored to specific data visualizations
• Scale the process
• parallelization of preprocessing
• candidate selection for feature descriptors
• Realize hybrid process
Questions?
Norman Meuschke
n@meuschke.org
• Code:
www.purl.org/imagepd
• Contact, publications, other projects:
www.isg.uni.kn
Image Extraction & Decomposition
• Extraction:
• poppler framework
• convert to JPEG
• discard images smaller than 7.5 KB (typically logos)
• Decomposition:
• assume white pixels separate sub-images
• assume rectangular sub-images aligned horizontally or vertically
• tradeoff (images remain analyzable if decomposition fails)
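The two decomposition assumptions (white separators, axis-aligned rectangular sub-images) can be illustrated with a deliberately simplified Python sketch. It does not reproduce the contour-based procedure used in the approach. Instead, it swaps in a plain projection split: columns and rows that contain only white pixels are treated as separators. The input format (rows of 0 for white and 1 for non-white pixels) and the function names are assumptions for illustration.

```python
def runs(flags):
    """Return (start, end) pairs of consecutive True values; end is exclusive."""
    result, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            result.append((start, i))
            start = None
    if start is not None:
        result.append((start, len(flags)))
    return result


def split_sub_images(pixels):
    """Return bounding boxes (left, top, right, bottom) of sub-images.

    pixels: list of rows, each a list of 0 (white) or 1 (non-white).
    """
    if not pixels:
        return []
    width = len(pixels[0])
    col_has_ink = [any(row[x] for row in pixels) for x in range(width)]
    boxes = []
    for left, right in runs(col_has_ink):
        # Within each vertical strip, split again along all-white rows.
        row_has_ink = [any(row[left:right]) for row in pixels]
        for top, bottom in runs(row_has_ink):
            boxes.append((left, top, right, bottom))
    return boxes
```

A real figure contains white gaps inside a single sub-image (for example between text lines), so a practical variant would need a minimum gap width. If no separator is found, the whole image is returned as one box, which mirrors the tradeoff that images remain analyzable when decomposition fails.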
Decomposition
• Process:
• conversion to grayscale to reduce runtime
• padding with white pixels to remove a potential border
• binarization using adaptive thresholding to obtain a b/w image
• dilation to ensure black pixels are connected
• floodfill of white areas with black pixels
• subtract original image
• invert image
• blob detection using the algorithm of Suzuki and Abe [1]
• estimate bounding box by looking for large contours aligned along
the image axes
• crop and store the identified sub-images
[1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by
Border Following. CVGIP 30, 1 (1985).
Image Classification
• Deep CNN realized using Caffe and the AlexNet architecture [2]
• CNN classifies images into:
• photographs (pHash only)
• bar charts (ratio hashing only)
• other image types (pHash and OCR text matching)
• Manual checks of 100 classified images
• Accuracy 0.92 for photographs and 1.00 for bar charts
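The adaptive step, choosing analyses by image class, amounts to a small lookup. The sketch below is illustrative only: the class labels and method names are hypothetical strings, and falling back to the "other" route for unknown labels is an assumption.

```python
# Mapping from predicted image class to the analyses applied to it.
ANALYSES_BY_CLASS = {
    "photograph": ("perceptual hashing",),
    "bar chart": ("ratio hashing",),
    "other": (
        "perceptual hashing",
        "k-gram text matching",
        "position-aware text matching",
    ),
}


def analyses_for(image_class: str) -> tuple[str, ...]:
    """Return the analyses to run for a predicted image class."""
    # Assumption: unknown labels are handled like "other" image types.
    return ANALYSES_BY_CLASS.get(image_class, ANALYSES_BY_CLASS["other"])
```

Routing photographs away from OCR and bar charts away from perceptual hashing avoids running analyses that are unlikely to produce meaningful distances for that image type.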
Perceptual Hashing
• Process:
• Reduce size to 32x32 pixels
• Convert to grayscale
• Compute 32x32 Discrete Cosine Transform (DCT)
• Reduce DCT to 8x8 for lowest frequencies
• Compute average DCT value
• Binarize 64 DCT values (8x8) to a 64-bit integer depending on the mean DCT
value (1 = above mean, 0 = below mean)
• Similarity measure: Hamming distance
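The steps above can be sketched in plain Python. The sketch assumes the image has already been resized to 32x32 and converted to grayscale, so the input is a 32x32 list of intensity values. It uses an unnormalized DCT-II and takes the mean over all 64 retained coefficients, including the constant term, which is one reading of "average DCT value". Function names are hypothetical.

```python
import math


def dct_low_frequencies(pixels, size=32, keep=8):
    """Top-left keep x keep block of the unnormalized 2D DCT-II of a size x size image."""
    cos = [[math.cos((2 * i + 1) * f * math.pi / (2 * size)) for i in range(size)]
           for f in range(keep)]
    block = []
    for v in range(keep):        # vertical frequency
        row = []
        for u in range(keep):    # horizontal frequency
            row.append(sum(pixels[y][x] * cos[u][x] * cos[v][y]
                           for y in range(size) for x in range(size)))
        block.append(row)
    return block


def perceptual_hash(pixels):
    """64-bit fingerprint: one bit per low-frequency coefficient (1 if above the mean)."""
    coeffs = [c for row in dct_low_frequencies(pixels) for c in row]
    mean = sum(coeffs) / len(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming_distance(hash1, hash2):
    """Number of differing bits between two fingerprints."""
    return bin(hash1 ^ hash2).count("1")
```

Scaling all intensities by a constant factor scales every coefficient and the mean alike, so the fingerprint is unchanged. This illustrates why such hashes tolerate changes in contrast.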
Extraction of Bar Heights
• Process:
• convert to grayscale
• binarize using global threshold to obtain b/w image (sharp contours)
• pad image with white pixels to ensure bars can be filled
• clean artifacts of black pixels using a threshold on the relative area
covered by the pixels
• remove image border
• floodfill with black pixels and invert
• find candidates for bars by determining the lengths of all vertical lines
of black pixels
• determine bars by clustering vertical lines
• remove noise from whiskers, labels, and legend entries
• assume the average height of the lines in a cluster as the bar height
• assume the average height of the lines in a cluster as the bar height
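The last steps (vertical lines, clustering, averaging) can be illustrated with a simplified Python sketch. It assumes the preprocessing is done, so the input is a binary grid in which filled bars are 1 and everything else is 0. Clustering is reduced to grouping adjacent non-empty columns, and noise removal to dropping clusters narrower than `min_width`. Both simplifications and the function names are assumptions, not the procedure from the slides.

```python
def column_run_lengths(pixels):
    """Length of the longest vertical run of black (1) pixels in each column."""
    if not pixels:
        return []
    lengths = []
    for x in range(len(pixels[0])):
        best = current = 0
        for row in pixels:
            current = current + 1 if row[x] else 0
            best = max(best, current)
        lengths.append(best)
    return lengths


def bar_heights(pixels, min_width=2):
    """Average run length per cluster of adjacent non-empty columns."""
    heights, cluster = [], []
    # The trailing 0 closes a cluster that reaches the right image edge.
    for length in column_run_lengths(pixels) + [0]:
        if length > 0:
            cluster.append(length)
            continue
        if len(cluster) >= min_width:  # narrower clusters are treated as noise
            heights.append(sum(cluster) / len(cluster))
        cluster = []
    return heights
```

The resulting heights are the kind of input a ratio-based comparison of bar charts would need.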
An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 30

Contenu connexe

Similaire à An Adaptive Image-based Plagiarism Detection Approach

Similaire à An Adaptive Image-based Plagiarism Detection Approach (20)

Multilabel Image Retreval Using Hashing
Multilabel Image Retreval Using HashingMultilabel Image Retreval Using Hashing
Multilabel Image Retreval Using Hashing
 
OBJECT DETECTION AND RECOGNITION: A SURVEY
OBJECT DETECTION AND RECOGNITION: A SURVEYOBJECT DETECTION AND RECOGNITION: A SURVEY
OBJECT DETECTION AND RECOGNITION: A SURVEY
 
B0310408
B0310408B0310408
B0310408
 
APPLICATIONS OF SPATIAL FEATURES IN CBIR : A SURVEY
APPLICATIONS OF SPATIAL FEATURES IN CBIR : A SURVEYAPPLICATIONS OF SPATIAL FEATURES IN CBIR : A SURVEY
APPLICATIONS OF SPATIAL FEATURES IN CBIR : A SURVEY
 
Applications of spatial features in cbir a survey
Applications of spatial features in cbir  a surveyApplications of spatial features in cbir  a survey
Applications of spatial features in cbir a survey
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVALA COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
A COMPARATIVE ANALYSIS OF RETRIEVAL TECHNIQUES IN CONTENT BASED IMAGE RETRIEVAL
 
A comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrievalA comparative analysis of retrieval techniques in content based image retrieval
A comparative analysis of retrieval techniques in content based image retrieval
 
project final ppt.pptx
project final ppt.pptxproject final ppt.pptx
project final ppt.pptx
 
Lecture1
Lecture1Lecture1
Lecture1
 
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
Enhanced Hashing Approach For Image Forgery Detection With Feature Level FusionEnhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
Enhanced Hashing Approach For Image Forgery Detection With Feature Level Fusion
 
JPM1407 Exposing Digital Image Forgeries by Illumination Color Classification
JPM1407   Exposing Digital Image Forgeries by Illumination Color ClassificationJPM1407   Exposing Digital Image Forgeries by Illumination Color Classification
JPM1407 Exposing Digital Image Forgeries by Illumination Color Classification
 
IEEE MultiMedia 2016 Title and Abstract
IEEE MultiMedia 2016 Title and AbstractIEEE MultiMedia 2016 Title and Abstract
IEEE MultiMedia 2016 Title and Abstract
 
Lec10 alignment
Lec10 alignmentLec10 alignment
Lec10 alignment
 
[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp[DL輪読会]ClearGrasp
[DL輪読会]ClearGrasp
 
Brain Maps like Mine
Brain Maps like MineBrain Maps like Mine
Brain Maps like Mine
 
2019 cvpr paper_overview
2019 cvpr paper_overview2019 cvpr paper_overview
2019 cvpr paper_overview
 
2019 cvpr paper overview by Ho Seong Lee
2019 cvpr paper overview by Ho Seong Lee2019 cvpr paper overview by Ho Seong Lee
2019 cvpr paper overview by Ho Seong Lee
 
AUTOMATED IMAGE MOSAICING SYSTEM WITH ANALYSIS OVER VARIOUS IMAGE NOISE
AUTOMATED IMAGE MOSAICING SYSTEM WITH ANALYSIS OVER VARIOUS IMAGE NOISEAUTOMATED IMAGE MOSAICING SYSTEM WITH ANALYSIS OVER VARIOUS IMAGE NOISE
AUTOMATED IMAGE MOSAICING SYSTEM WITH ANALYSIS OVER VARIOUS IMAGE NOISE
 
Template Matching
Template MatchingTemplate Matching
Template Matching
 

Plus de Scientific Information Analytics Group, Prof. Gipp

Plus de Scientific Information Analytics Group, Prof. Gipp (11)

A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-...
A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-...A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-...
A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-...
 
A First Step Towards Content Protecting Plagiarism Detection
A First Step Towards Content Protecting Plagiarism Detection  A First Step Towards Content Protecting Plagiarism Detection
A First Step Towards Content Protecting Plagiarism Detection
 
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
 
Towards Formula Concept Discovery and Recognition
Towards Formula Concept Discovery and RecognitionTowards Formula Concept Discovery and Recognition
Towards Formula Concept Discovery and Recognition
 
Too Late to Collaborate: Challenges to the Discovery of in-progress Research
Too Late to Collaborate:Challenges tothe Discovery ofin-progress ResearchToo Late to Collaborate:Challenges tothe Discovery ofin-progress Research
Too Late to Collaborate: Challenges to the Discovery of in-progress Research
 
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathe...
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathe...Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathe...
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathe...
 
Repurposing Open Source Tools for Open Science: a Practical Guide
Repurposing Open Source Tools for Open Science: a Practical GuideRepurposing Open Source Tools for Open Science: a Practical Guide
Repurposing Open Source Tools for Open Science: a Practical Guide
 
Blockchain based Trusted Timestamping for Research Data and Preprints using O...
Blockchain based Trusted Timestamping for Research Data and Preprints using O...Blockchain based Trusted Timestamping for Research Data and Preprints using O...
Blockchain based Trusted Timestamping for Research Data and Preprints using O...
 
Analyzing Nontextual Content Features to Detect Academic Plagiarism
Analyzing Nontextual Content Features to Detect Academic PlagiarismAnalyzing Nontextual Content Features to Detect Academic Plagiarism
Analyzing Nontextual Content Features to Detect Academic Plagiarism
 
A Semantically Enriched Recommendation & Visualization Approach for Academic ...
A Semantically Enriched Recommendation & Visualization Approach for Academic ...A Semantically Enriched Recommendation & Visualization Approach for Academic ...
A Semantically Enriched Recommendation & Visualization Approach for Academic ...
 
Automatic Mathematical Information Retrieval to Perform Translations up to Co...
Automatic Mathematical Information Retrieval to Perform Translations up to Co...Automatic Mathematical Information Retrieval to Perform Translations up to Co...
Automatic Mathematical Information Retrieval to Perform Translations up to Co...
 

Dernier

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
vexqp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Abortion pills in Riyadh +966572737505 get cytotec
 

Dernier (20)

7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 

An Adaptive Image-based Plagiarism Detection Approach

  • 1. N. Meuschke, C. Gondek, D. Seebacher, C. Breitinger, D. Keim, B. Gipp An Adaptive Image-based Plagiarism Detection Approach Norman Meuschke Information ScienceGroup University of Konstanz www.isg.uni.kn An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 1
  • 2. University of Konstanz An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 2 Map data ©2018 GeoBasis-DE/BKG(©2009), Google Map data ©2018 GeoBasis-DE/BKG(©2009), Google
  • 3. Outline • Overview of Research on Academic PlagiarismDetection • Image-based PlagiarismDetection Approach • Evaluation Results An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 2
  • 4. Academic Plagiarism “The use of ideas, concepts, words, or structures without appropriately acknowledging the source to benefit in a setting where originalityis expected.” An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. Source: Teddi Fishman. 2009. ”Weknow it when we see it”? is not good enough: toward a standard definition of plagiarism that transcends theft, fraud, and copyright. In Proc. Asia Pacific Conf. on Educational Integrity. 3
  • 5. Plagiarism Forms Note: plagiarismformsare not mutually exclusive Paraphrasing ▪ intentional rewriting ▪ no / insufficient reference the source Structural and idea plagiarism ▪ little or no verbatim text overlap Cross-language plagiarism ▪ manual/automated conversion of text into other language to hide its origin Copy & paste ▪ taking content verbatim from other source Shake & paste ▪ copy & paste of text segments with slight adjustments, e.g., synonym substitutions Technical disguise ▪ techniques that exploit weaknesses of current detection methods An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 4 Weak Strong level of obfuscation
  • 6. - intense research - methods limited by text-based candidate retrieval (R approx. 0.8 for moderate disguise) - solvable, no research needed solvedCopy & paste Shake & paste ▪ n-gramfingerprinting ▪ vector space models ▪ text alignment ▪ exhaustive string matching Technical disguise ▪ encoding checks ▪ checks for textual content ▪ checks for large images Detection Capabilities Paraphrasing Structural and idea plagiarism ▪ synonym expansion(WordNet) ▪ Semantic Role Labeling ▪ Latent Semantic Analysis ▪ POS-aware text matching Cross-language plagiarism ▪ CL Character N-Gram Comp. ▪ CL Explicit Semantic Analysis ▪ CL Alignment-based Similarity Analysis Weak Strong An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. level of obfuscation 5
  • 8. Idea of Image-based Plagiarism Detection • Images in academic documents convey much semantic information in compressed format independent of the text • Much research on Content-based Image Retrieval (CBIR) • Little adaption of CBIR methods to plagiarismdetection(PD) • exact and cropped images copies • affinely transformed images (scaling, rotation, projection) • slight alterations of appearance (blurring, lower resolution,noise) An Adaptive Image-based PlagiarismDetection Approach- Meuschkeet al. 7
  • 9. Research Gap
      • Current image-based PD approaches are problematic for:
          ▪ compound images
          ▪ rearranged images
          ▪ images mostly containing text (typically tables inserted as figures)
          ▪ visually differing, semantically equivalent data visualizations
      • Goal: image-based PD process that:
          ▪ combines established and new analysis methods to cover heterogeneous images in academic documents
          ▪ adaptively applies suitable analysis steps
          ▪ flexibly quantifies suspiciousness
          ▪ is extensible in the future
  • 10. Process
      [Figure: process pipeline. Input doc. → extract image → decompose image → classify image → analysis methods (perceptual hashing, ratio hashing, OCR followed by k-gram text matching and positional text matching) compared against a reference DB → distance calculation (D_pHash, D_rHash, D_kTM, D_posTM) → outlier detection: s(D_m) > r → potential source images]
  • 11. Perceptual Hashing
      • Efficient CBIR method to reliably find near image copies
      • Uses the most apparent visual features in images
      • Creates non-unique fingerprints that can be compared
      • Fingerprints are invariant to:
          ▪ scaling
          ▪ aspect ratio changes
          ▪ changes to brightness, contrast and colors
      • We use the Discrete Cosine Transform and Hamming distance
      Image source: https://medium.com/taringa-on-publishing/why-we-built-imageid-and-saved-47-of-the-moderation-effort-b7afb69d068e
  • 12. k-gram Text Matching
      • To identify tables inserted as figures and images with little visual similarity
      • Text extracted using the open-source OCR engine Tesseract
      • Granularity:
          ▪ character 3-grams
          ▪ no chunk selection
      • Similarity measure (expressed as a distance): $d = \frac{|K_1 \ominus K_2|}{|K_1 \cap K_2|}$, where $K_1$ and $K_2$ are the k-gram sets of the two images and $\ominus$ is the symmetric difference
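The distance on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the OCR step (Tesseract) is assumed to have already produced plain strings, and returning infinity when the two texts share no k-gram is an assumption made here because the formula is undefined for an empty intersection.

```python
import math


def char_kgrams(text, k=3):
    """Return the set of all character k-grams of text (no chunk selection)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_distance(text1, text2, k=3):
    """d = |K1 symmetric-difference K2| / |K1 intersection K2|."""
    k1, k2 = char_kgrams(text1, k), char_kgrams(text2, k)
    shared = k1 & k2
    if not shared:
        # Assumption: no shared k-gram is treated as maximal distance.
        return math.inf
    return len(k1 ^ k2) / len(shared)
```

Identical texts yield a distance of 0, and the distance grows as fewer k-grams are shared, which is why OCR errors in short texts affect this measure strongly.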
  • 13. Position-aware Text Matching
      • To account for the typically small amount of text in images, aggravated by OCR errors
      • Process:
          ▪ Scale images to the same height (here: 800 px)
          ▪ Define a proximity region around identified text (here: 50 px circle)
          ▪ Project proximity regions of the input image onto the potential source
          ▪ Only consider matching characters in projected proximity regions
      • Similarity measure: $s = \frac{|K_1 \cap K_2|}{\max(|K_1|, |K_2|)}$
      [Figure: characters of an input image (width w_1) projected onto a reference image (width w_2), both scaled to h_1 = h_2 = 800 px, with proximity radius r = 25 px; the legend distinguishes positional character matches from positional character mismatches]
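The process above might look roughly as follows. This is a sketch under stated assumptions: characters are given as hypothetical `(char, x, y)` tuples in pixel coordinates (as an OCR engine could report them), the 50 px circle is read as a radius of 25 px as in the figure, and each reference character is matched greedily at most once. The slides do not specify the matching strategy.

```python
import math


def scale_chars(chars, height, target_height=800.0):
    """Scale (char, x, y) positions so that the image height becomes target_height."""
    factor = target_height / height
    return [(c, x * factor, y * factor) for c, x, y in chars]


def positional_similarity(input_chars, input_height, ref_chars, ref_height,
                          radius=25.0, target_height=800.0):
    """s = (positional matches) / max(|K1|, |K2|)."""
    a = scale_chars(input_chars, input_height, target_height)
    b = scale_chars(ref_chars, ref_height, target_height)
    if not a or not b:
        return 0.0
    used = set()  # indices of reference characters that are already matched
    matches = 0
    for c, x, y in a:
        for j, (rc, rx, ry) in enumerate(b):
            # A match needs the same character inside the projected proximity region.
            if j not in used and rc == c and math.hypot(x - rx, y - ry) <= radius:
                used.add(j)
                matches += 1
                break
    return matches / max(len(a), len(b))
```

Dividing by the larger character count keeps the score low when one image contains much more text than the other.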
  • 14. Ratio Hashing
      • First approach to target reuse of data (and its visualization)
      • Identifies equivalent, yet visually differing bar charts
      [Figure: two visually differing bar charts (value axes from 0 to 900) whose bars have the same relative heights 1.00, 0.80, 0.61, 0.44, 0.30, 0.07]
      $d = |1.00 - 1.00| + |0.80 - 0.80| + |0.61 - 0.61| + |0.44 - 0.44| + |0.30 - 0.30| + |0.07 - 0.07| = 0.00$
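The worked example on this slide can be reproduced with a short sketch. The assumptions here are not confirmed by the slides: bar heights are normalized by the tallest bar and sorted in descending order, charts with different numbers of bars are padded with zeros, and all heights are positive. The function names are hypothetical.

```python
def ratio_hash(bar_heights):
    """Relative bar heights in descending order; the tallest bar maps to 1.0."""
    tallest = max(bar_heights)  # assumes a non-empty list of positive heights
    return sorted((h / tallest for h in bar_heights), reverse=True)


def ratio_hash_distance(heights1, heights2):
    """Sum of absolute differences between the two lists of relative heights."""
    r1, r2 = ratio_hash(heights1), ratio_hash(heights2)
    n = max(len(r1), len(r2))
    # Assumption: missing bars count as bars of relative height 0.
    r1 += [0.0] * (n - len(r1))
    r2 += [0.0] * (n - len(r2))
    return sum(abs(a - b) for a, b in zip(r1, r2))
```

Because only height ratios enter the hash, rescaling the value axis or reordering the bars leaves the distance unchanged under these assumptions.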
  • 15. Outlier Detection
      • To quantify the suspiciousness of method-specific distance scores
      • Two assumptions:
          ▪ an image is only suspicious if it has comparably high similarity (small distance) to a small set (c = 9) of other images
          ▪ clear separation of the distance scores of the highly similar set of images
      [Figure: absolute distance scores D_m (25, 33, 40, 80, 90) and relative deltas D'_m between neighboring scores (0.3, 0.2, 1, 0.1), e.g. d'_i = (80 − 40) / 40 = 1; the list is split into D'_{m,1} (outliers considered as potential sources) and D'_{m,2} (images considered as unrelated); condition for list split: d'_k ≥ 1 with k ≤ c]
  • 16. Outlier Detection Continued
      • Find outlier group:
          ▪ split the list of relative distance deltas if a distance is at least twice as large as its predecessor (3x as large for k-gram matching)
      • Score is suspicious (𝑠 ≥ 0.5) if the least similar outlier has a distance margin to the collection that is twice as large as the outlier’s distance to the input image
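The list split described on the two outlier detection slides can be sketched as follows. This covers only the split, not the suspiciousness score 𝑠, whose exact formula the slides do not give. The function name is hypothetical, and splitting at the first qualifying position (rather than, say, the largest gap) is an assumption.

```python
def split_outliers(distances, c=9, min_ratio=2.0):
    """Return the distances treated as outliers (potential sources).

    The sorted list is split at the first position k <= c where a distance is
    at least min_ratio times its predecessor. An empty list is returned if no
    such clear separation exists.
    """
    ranked = sorted(distances)
    for k in range(1, min(c, len(ranked) - 1) + 1):
        prev, nxt = ranked[k - 1], ranked[k]
        # nxt > prev guards the case prev == 0 (exact copies at distance 0).
        if nxt > prev and nxt >= min_ratio * prev:
            return ranked[:k]
    return []
```

With the distances from the figure (25, 33, 40, 80, 90) the split falls between 40 and 80, so 25, 33 and 40 form the outlier group. Passing `min_ratio=3.0` corresponds to the stricter setting mentioned for k-gram matching.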
  • 17. Evaluation
      • Source for test images: VroniPlag collection
          ▪ crowd-sourced effort investigating plagiarism allegations
          ▪ 196 manually examined academic works (mostly PhD theses)
          ▪ most allegations confirmed by the responsible universities
      • Targeted crawl for all annotated ‘fragments’ containing images
          ▪ confirmed by at least two examiners
      • Selection of 15 representative cases (mostly from life sciences)
      • Cases embedded in 4,500 images obtained from PubMed Central
  • 18. Example: Near Copies [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Dsa/014
  • 19. Example: Weak Alteration [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ry/073
  • 20. Example: Moderate Alteration [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ab/017
  • 21. Example: Strong Alteration [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Ad/068
  • 22. Results
      • Suspicious scores (𝑠 ≥ 0.5) for 11 of 15 cases computed by at least one of the methods (𝑅 = 0.73)
      • Outlier detection effective (𝑃 = 1):
          ▪ For all input images with 𝑠 ≥ 0.5, the true source image is at the top rank
          ▪ For all input images with 𝑠 < 0.5, no source image is retrieved among the top-ten most similar images, i.e. no false positives
      • Perceptual hashing with sub-image extraction worked best for near copies and weakly altered images (found 6 of 9 cases)
      • Text analysis performed better than perceptual hashing for moderately and strongly altered images
          ▪ if the quality of the image was high enough to perform OCR reliably and sufficient text content is present
  • 23. Results Continued
      • Text analysis approaches identified 3 of 4 cases involving tables
          ▪ position-aware text matching more robust to low OCR quality
          ▪ k-gram matching identified more cases
          ▪ combination of approaches allows processing more images
      • Dataset contained only one bar chart, for which ratio hashing yielded an extremely suspicious score (𝑠 = 0.92)
      [Figure: source image and reused image] Source: http://de.vroniplag.wikia.com/wiki/Cz/047
  • 24. Discussion & Conclusion
      • Image-based PD is a promising complement to other methods
      • Small test collection, but the restrictive outlier detection procedure will prevent false positives also in larger collections
          ▪ if reduced precision is acceptable, the threshold can be changed interactively by the user
      • Approach well suited for scaling
          ▪ Preprocessing in parallel
          ▪ Options described to scale analysis methods
      • Approach easily extensible with new methods
          ▪ New input scores to outlier detection
      • Code: www.purl.org/imagepd
  • 25. Future Work
      • More detection methods tailored to specific data visualizations
      • Scale the process
          ▪ parallelization of preprocessing
          ▪ candidate selection for feature descriptors
      • Realize hybrid process
      [Figure: hybrid detection process from start to end with the stages candidate retrieval, detailed comparison, post processing, and human inspection; labeled components include heuristics, full text similarity, text snippets, citation patterns, fuzzy citation patterns, semantic similarity, cross-lingual similarity, mathematical formulae, image similarity, and visualization; the legend distinguishes future, current, and completed research]
  • 26. Questions? Norman Meuschke, n@meuschke.org • Code: www.purl.org/imagepd • Contact, publications, other projects: www.isg.uni.kn
  • 27. Image Extraction & Decomposition
      • Extraction:
          ▪ poppler framework
          ▪ convert to JPEG
          ▪ discard images smaller than 7.5 KB (typically logos)
      • Decomposition:
          ▪ assume white pixels separate sub-images
          ▪ assume rectangular sub-images aligned horizontally or vertically
          ▪ tradeoff (images remain analyzable if decomposition fails)
  • 28. Decomposition
      • Process:
          ▪ conversion to grayscale to reduce runtime
          ▪ padding with white pixels to remove a potential border
          ▪ binarization using adaptive thresholding to obtain a b/w image
          ▪ dilation to ensure black pixels are connected
          ▪ floodfill of white areas with black pixels
          ▪ subtract original image
          ▪ invert image
          ▪ blob detection using the algorithm of Suzuki and Abe [1]
          ▪ estimate the bounding box by looking for large contours aligned along the image axes
          ▪ crop and store the identified sub-images
      [1] Satoshi Suzuki and Keiichi Abe. 1985. Topological Structural Analysis of Digitized Binary Images by Border Following. CVGIP 30, 1 (1985).
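To illustrate the two decomposition assumptions (white pixels separate sub-images, and sub-images are axis-aligned rectangles), here is a deliberately simplified sketch. It is not the contour-based pipeline listed above: it swaps in a plain projection split that cuts at fully white rows and then at fully white columns within each row band. The grayscale input format (list of rows with values 0 to 255), the whiteness threshold, and the function names are assumptions.

```python
def _bands(is_blank):
    """Return (start, end) index ranges of consecutive non-blank entries."""
    bands, start = [], None
    for i, blank in enumerate(is_blank):
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(is_blank)))
    return bands


def decompose(image, white=250):
    """Return (top, left, bottom, right) boxes of sub-images, end-exclusive."""
    boxes = []
    blank_rows = [all(p >= white for p in row) for row in image]
    for top, bottom in _bands(blank_rows):
        rows = image[top:bottom]
        # Within one horizontal band, cut at columns that are entirely white.
        blank_cols = [all(r[x] >= white for r in rows) for x in range(len(rows[0]))]
        for left, right in _bands(blank_cols):
            boxes.append((top, left, bottom, right))
    return boxes
```

A practical version would at least require a minimum gap width, since this sketch also splits at single white rows or columns inside one figure. If no separating white line exists, the whole image is returned as a single box, which matches the tradeoff that images remain analyzable when decomposition fails.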
  • 29. Image Classification
      • Deep CNN realized using Caffe and the AlexNet architecture [2]
      • CNN classifies images into:
          ▪ photographs (pHash only)
          ▪ bar charts (ratio hashing only)
          ▪ other image types (pHash and OCR text matching)
      • Manual checks of 100 classified images
      • Accuracy 0.92 for photographs and 1.00 for bar charts
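The adaptive step, choosing analysis methods from the predicted image class, amounts to a small lookup. The sketch below uses hypothetical class labels and method names. Treating "OCR text matching" as both k-gram and positional text matching, and falling back to the "other" route for unknown labels, are assumptions.

```python
# Hypothetical labels; the mapping follows the routing listed on the slide.
ANALYSES_BY_CLASS = {
    "photograph": ["perceptual_hashing"],
    "bar_chart": ["ratio_hashing"],
    "other": ["perceptual_hashing", "kgram_text_matching", "positional_text_matching"],
}


def select_analyses(image_class):
    """Return the analysis methods to run for a classified image."""
    # Assumption: unknown labels are handled like the generic "other" class.
    return ANALYSES_BY_CLASS.get(image_class, ANALYSES_BY_CLASS["other"])
```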
  • 30. Perceptual Hashing
      • Process:
          ▪ Reduce size to 32x32 pixels
          ▪ Convert to grayscale
          ▪ Compute 32x32 Discrete Cosine Transform (DCT)
          ▪ Reduce DCT to 8x8 for lowest frequencies
          ▪ Compute average DCT value
          ▪ Binarize 64 pixels (8x8) to a 64-bit integer depending on the mean DCT value (1: above mean, 0: below mean)
      • Similarity measure: Hamming distance
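The hashing steps can be sketched in pure Python. The sketch assumes the image has already been resized to 32x32 and converted to grayscale (both need an image library and are omitted here), uses an unnormalized DCT-II, and includes all 64 low-frequency coefficients in the mean. Some pHash implementations exclude the DC term, and the slides do not say which variant was used.

```python
import math

N = 32  # side length of the downscaled grayscale image

# DCT-II basis: COS[u][x] = cos((2x + 1) * u * pi / (2N))
COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
       for u in range(N)]


def dct_low_frequencies(pixels, size=8):
    """Top-left size x size block of the unnormalized 2D DCT-II of a 32x32 image."""
    # Transform along rows, keeping only the lowest horizontal frequencies.
    rows = [[sum(row[x] * COS[v][x] for x in range(N)) for v in range(size)]
            for row in pixels]
    # Transform along columns, keeping only the lowest vertical frequencies.
    return [[sum(rows[y][v] * COS[u][y] for y in range(N)) for v in range(size)]
            for u in range(size)]


def phash(pixels):
    """64-bit perceptual hash: bit is 1 where a coefficient exceeds the mean."""
    coeffs = [c for row in dct_low_frequencies(pixels) for c in row]
    mean = sum(coeffs) / len(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")
```

Uniformly scaling all pixel values (a contrast change) scales every coefficient and the mean by the same factor, so the hash stays the same. Two images are then compared by the Hamming distance of their hashes.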
  • 31. Extraction of Bar Heights
      • Process:
          ▪ convert to grayscale
          ▪ binarize using a global threshold to obtain a b/w image (sharp contours)
          ▪ pad image with white pixels to ensure bars can be filled
          ▪ clean artifacts of black pixels using a threshold on the relative area covered by the pixels
          ▪ remove image border
          ▪ floodfill with black pixels and invert
          ▪ find candidates for bars by determining the lengths of all vertical lines of black pixels
          ▪ determine bars by clustering vertical lines
          ▪ remove noise from whiskers, labels, and legend entries
          ▪ assume the average height of the lines in a cluster as the bar height
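The last four steps (vertical line lengths, clustering, noise removal, averaging) might be sketched as below. The input is assumed to be an already binarized and filled image given as a list of rows with 1 for black and 0 for white. Clustering adjacent columns whose line lengths differ by at most `tolerance` pixels, and discarding clusters narrower than `min_width` columns as noise, are simplifications chosen here and are not taken from the slides.

```python
def column_run_lengths(binary):
    """Longest vertical run of black pixels (value 1) in each column."""
    height, width = len(binary), len(binary[0])
    lengths = []
    for x in range(width):
        best = run = 0
        for y in range(height):
            run = run + 1 if binary[y][x] else 0
            best = max(best, run)
        lengths.append(best)
    return lengths


def bar_heights(binary, tolerance=2, min_width=3):
    """Average line length of each cluster of adjacent, similarly long lines."""
    clusters, current = [], []
    for length in column_run_lengths(binary):
        if length > 0 and (not current or abs(length - current[-1]) <= tolerance):
            current.append(length)
        else:
            # Narrow clusters are treated as noise (whiskers, labels, legend).
            if len(current) >= min_width:
                clusters.append(current)
            current = [length] if length > 0 else []
    if len(current) >= min_width:
        clusters.append(current)
    return [sum(c) / len(c) for c in clusters]
```

The resulting heights are absolute pixel values; only their ratios matter for a comparison that is independent of axis scaling.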