SlideShare une entreprise Scribd logo
1  sur  17
Télécharger pour lire hors ligne
Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
ICSME 2020
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279
Detecting Java software similarities by using different clustering techniques 2ICSME2020
On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Gather “as much data as possible”
- to prevent bias in the representation of a small sample
- work with a sample as close as the population itself
- showcase the performance of existing or new tools in treating vast amount of
data
Detecting Java software similarities by using different clustering techniques 3ICSME2020
On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Cumulative number of FOSS projects per year Average number of FOSS projects per year
Detecting Java software similarities by using different clustering techniques 4ICSME2020
Similarity of Systems and Empirical Research
insensitive to that
Very few works have clearly stated the
similarity (or differences) between
systems in the interpretation of the
results
- by explicitly proposing explanations based
on application domains
- by sampling the projects to be analysed
from a specific, restricted topic
Detecting Java software similarities by using different clustering techniques 5ICSME2020
Assumptions of this paper
A specific software system might be similar to others to some degree,
and that there are different approaches to defining their similarity
A sample of software systems might get divided into subsets (or
clusters), each containing similar systems, and showing differences with
other clusters
Detecting Java software similarities by using different clustering techniques 6ICSME2020
Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
Detecting Java software similarities by using different clustering techniques 7ICSME2020
Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
s1
s2
s6
s4
s5
s3
s7
s8
Log management
JSON Parsing
DB Management
Detecting Java software similarities by using different clustering techniques 8ICSME2020
Research question
Are OO metrics sensitive to the context of their clusters?
The main goal is to investigate whether experiments in software engineering can
generalize results based on populations under different contexts and how
sensitive are cluster techniques to provide such classification
Detecting Java software similarities by using different clustering techniques 9ICSME2020
Types of clustering techniques used in the paper
CrossSim (Graph-based similarity)
Clustering based on projects descriptions (manually classified)
LDA-informed Clustering
1. We group software systems based on the three different clustering techniques
2. We collect the values of the OO metrics suite in each cluster
3. We then test whether clusters are statistically different between each other,
using the Kolgomorov-Smirnov (KS) hypothesis testing
The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population
Detecting Java software similarities by using different clustering techniques 10ICSME2020
CrossSim
Based on the graph
structure, one can exploit
nodes, links, and the
mutual relationships to
compute similarity using
existing graph similarity
algorithms
Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. An automated approach to assess the similarity of GitHub repositories. Software Quality Journal 2020
Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Massimiliano Di Penta: CrossRec: Supporting software developers by recommending third-party libraries. J. Syst. Softw. 161 (2020)
Detecting Java software similarities by using different clustering techniques 11ICSME2020
Results for CrossSim Clustering
12 projects (6 pairs), from a larger population of 5,000 projects extracted
as part of the CROSSMINER project
The similarity by CrossSim is computed according to libraries, stargazers,
and committers
Result - We cannot conclude that CrossSim clusters are structurally
different from each others
https://www.crossminer.org
Detecting Java software similarities by using different clustering techniques 12ICSME2020
Results from manual classification
Java subset of 520 projects collected out of the 5,000 projects [1,2]
Manually assigned to 12 categories
– e.g, Communications, Database, Software Development, Text Editors, …
Result - The obtained clusters result in pools of attributes that are
structurally different from each other
– Each cluster is a standalone category, with specific (and unique) characteristics
[1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of github repositories, in: 2016 IEEE International Conference on Software Maintenance and
Evolution (ICSME), IEEE, 2016, pp. 334–344.
[2] H. Borges, M.T. Valente, What’s in a github star? understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.
Detecting Java software similarities by using different clustering techniques 13ICSME2020
Results from LDA-informed Clustering
Latent Dirichlet Allocation (LDA) information retrieval method
Detecting Java software similarities by using different clustering techniques 14ICSME2020
Results from LDA-informed Clustering
20 Categories/Domains from SourceForge
100 most starred Java projects from GitHub
Result - Strong evidence to reject null hypothesis based on KS test
• OO attributes are showing differences among the different clusters
Detecting Java software similarities by using different clustering techniques 15ICSME2020
Take-away messages
1. When you cluster software systems in categories you can create
strongly different results
2. The interpretation of software metrics might be more sensitive to
context than reported so far in the literature
– The correlation among OO metrics can be extremely sensitive to application
domains
Detecting Java software similarities by using different clustering techniques 16ICSME2020
Take-away messages
3. We should pay more attention to the application domain of the
studied systems
• e.g. the metrics one should consider to analyse gaming software should be
different from those used to assess the quality of security software
• LOCs are less appropriate for assessing the quality of security software or in general of
mission critical software systems
The empirical findings might need readjustment depending on the
cluster of projects they evaluate
Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279

Contenu connexe

Tendances

Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project final
Craig Cannon
 
Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach  Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach
IJECEIAES
 
Pre-defense_talk
Pre-defense_talkPre-defense_talk
Pre-defense_talk
aphex34
 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models
IJECEIAES
 

Tendances (13)

Cyb 5675 class project final
Cyb 5675   class project finalCyb 5675   class project final
Cyb 5675 class project final
 
20170412 om patri pres 153pdf
20170412 om patri pres 153pdf20170412 om patri pres 153pdf
20170412 om patri pres 153pdf
 
Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach  Concept Drift Identification using Classifier Ensemble Approach
Concept Drift Identification using Classifier Ensemble Approach
 
Distributional Semantics and Unsupervised Clustering for Sensor Relevancy Pre...
Distributional Semantics and Unsupervised Clustering for Sensor Relevancy Pre...Distributional Semantics and Unsupervised Clustering for Sensor Relevancy Pre...
Distributional Semantics and Unsupervised Clustering for Sensor Relevancy Pre...
 
Pre-defense_talk
Pre-defense_talkPre-defense_talk
Pre-defense_talk
 
Event detection and summarization based on social networks and semantic query...
Event detection and summarization based on social networks and semantic query...Event detection and summarization based on social networks and semantic query...
Event detection and summarization based on social networks and semantic query...
 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models
 
Machine reading for cancer biology
Machine reading for cancer biologyMachine reading for cancer biology
Machine reading for cancer biology
 
‘CodeAliker’ - Plagiarism Detection on the Cloud
‘CodeAliker’ - Plagiarism Detection on the Cloud ‘CodeAliker’ - Plagiarism Detection on the Cloud
‘CodeAliker’ - Plagiarism Detection on the Cloud
 
NeXO Web Poster for ISMB 2014 BioVis SIG
NeXO Web Poster for ISMB 2014 BioVis SIGNeXO Web Poster for ISMB 2014 BioVis SIG
NeXO Web Poster for ISMB 2014 BioVis SIG
 
Towards reproducibility and maximally-open data
Towards reproducibility and maximally-open dataTowards reproducibility and maximally-open data
Towards reproducibility and maximally-open data
 
The Systematic Methodology for Accurate Test Packet Generation and Fault Loca...
The Systematic Methodology for Accurate Test Packet Generation and Fault Loca...The Systematic Methodology for Accurate Test Packet Generation and Fault Loca...
The Systematic Methodology for Accurate Test Packet Generation and Fault Loca...
 
DoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known SampleDoS Forensic Exemplar Comparison to a Known Sample
DoS Forensic Exemplar Comparison to a Known Sample
 

Similaire à Detecting java software similarities by using different clustering

Developing Projects & Research
Developing Projects & ResearchDeveloping Projects & Research
Developing Projects & Research
Thomas Mylonas
 

Similaire à Detecting java software similarities by using different clustering (20)

AI Based Student S Assignments Plagiarism Detector
AI Based Student S Assignments Plagiarism DetectorAI Based Student S Assignments Plagiarism Detector
AI Based Student S Assignments Plagiarism Detector
 
Zhao huang deep sim deep learning code functional similarity
Zhao huang deep sim   deep learning code functional similarityZhao huang deep sim   deep learning code functional similarity
Zhao huang deep sim deep learning code functional similarity
 
Final proj 2 (1)
Final proj 2 (1)Final proj 2 (1)
Final proj 2 (1)
 
Presentation at MTSR 2012
Presentation at MTSR 2012Presentation at MTSR 2012
Presentation at MTSR 2012
 
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
M.Sc. Thesis Topics and Proposals @ Polimi Data Science Lab - 2024 - prof. Br...
 
Developing Projects & Research
Developing Projects & ResearchDeveloping Projects & Research
Developing Projects & Research
 
Ph.D. Thesis: A Methodology for the Development of Autonomic and Cognitive In...
Ph.D. Thesis: A Methodology for the Development of Autonomic and Cognitive In...Ph.D. Thesis: A Methodology for the Development of Autonomic and Cognitive In...
Ph.D. Thesis: A Methodology for the Development of Autonomic and Cognitive In...
 
principle of oop’s in cpp
principle of oop’s in cppprinciple of oop’s in cpp
principle of oop’s in cpp
 
IRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using CobwebIRJET - Twitter Spam Detection using Cobweb
IRJET - Twitter Spam Detection using Cobweb
 
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
 
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
 
a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...
a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...
a-novel-web-attack-detection-system-for-internet-of-things-via-ensemble-class...
 
User Identity Linkage: Data Collection, DataSet Biases, Method, Control and A...
User Identity Linkage: Data Collection, DataSet Biases, Method, Control and A...User Identity Linkage: Data Collection, DataSet Biases, Method, Control and A...
User Identity Linkage: Data Collection, DataSet Biases, Method, Control and A...
 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming Assignments
 
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
 
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
A DATA EXTRACTION ALGORITHM FROM OPEN SOURCE SOFTWARE PROJECT REPOSITORIES FO...
 
Nurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the FutureNurturing the Software Ecosystems of the Future
Nurturing the Software Ecosystems of the Future
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
Study on Different Code-Clone Detection Techniques & Approaches to MitigateCo...
 

Plus de Davide Ruscio

Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...
Davide Ruscio
 
Collaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping StudyCollaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping Study
Davide Ruscio
 

Plus de Davide Ruscio (12)

Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...
 
On the way of listening to the crowd for supporting modeling activities
On the way of listening to the crowd for supporting modeling activitiesOn the way of listening to the crowd for supporting modeling activities
On the way of listening to the crowd for supporting modeling activities
 
FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns
FOCUS:  A Recommender System for Mining API Function Calls and  Usage PatternsFOCUS:  A Recommender System for Mining API Function Calls and  Usage Patterns
FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns
 
CrossSim: exploiting mutual relationships to detect similar OSS projects
CrossSim: exploiting mutual relationships to detect similar OSS projectsCrossSim: exploiting mutual relationships to detect similar OSS projects
CrossSim: exploiting mutual relationships to detect similar OSS projects
 
Use of MDE to Analyse Open Source Software
Use of MDE to Analyse Open Source SoftwareUse of MDE to Analyse Open Source Software
Use of MDE to Analyse Open Source Software
 
Consistency Recovery in Interactive Modeling
Consistency Recovery in Interactive ModelingConsistency Recovery in Interactive Modeling
Consistency Recovery in Interactive Modeling
 
Edelta: an approach for defining and applying reusable metamodel refactorings
Edelta: an approach for defining and applying reusable metamodel refactoringsEdelta: an approach for defining and applying reusable metamodel refactorings
Edelta: an approach for defining and applying reusable metamodel refactorings
 
Semantic based model matching with emf compare
Semantic based model matching with emf compareSemantic based model matching with emf compare
Semantic based model matching with emf compare
 
Collaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping StudyCollaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping Study
 
Model repositories: will they become reality?
Model repositories: will they become reality?Model repositories: will they become reality?
Model repositories: will they become reality?
 
Mining Correlations of ATL Transformation and Metamodel Metrics
Mining Correlations of ATL Transformation and Metamodel MetricsMining Correlations of ATL Transformation and Metamodel Metrics
Mining Correlations of ATL Transformation and Metamodel Metrics
 
MDEForge: an extensible Web-based modeling platform
MDEForge: an extensible Web-based modeling platformMDEForge: an extensible Web-based modeling platform
MDEForge: an extensible Web-based modeling platform
 

Dernier

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Dernier (20)

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 

Detecting java software similarities by using different clustering

  • 1. Detecting Java software similarities by using different clustering techniques Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**, Nemitari Ajienka*** ICSME 2020 * Department of Computer Science, University of Groningen, The Netherlands ** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy *** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279
  • 2. Detecting Java software similarities by using different clustering techniques 2ICSME2020 On the need of always larger samples of systems Research on empirical software engineering has increasingly used data made available in online repositories or collective efforts Gather “as much data as possible” - to prevent bias in the representation of a small sample - work with a sample as close as the population itself - showcase the performance of existing or new tools in treating vast amount of data
  • 3. Detecting Java software similarities by using different clustering techniques 3ICSME2020 On the need of always larger samples of systems Research on empirical software engineering has increasingly used data made available in online repositories or collective efforts Cumulative number of FOSS projects per year Average number of FOSS projects per year
  • 4. Detecting Java software similarities by using different clustering techniques 4ICSME2020 Similarity of Systems and Empirical Research insensitive to that Very few works have clearly stated the similarity (or differences) between systems in the interpretation of the results - by explicitly proposing explanations based on application domains - by sampling the projects to be analysed from a specific, restricted topic
  • 5. Detecting Java software similarities by using different clustering techniques 5ICSME2020 Assumptions of this paper A specific software system might be similar to others to some degree, and that there are different approaches to defining their similarity A sample of software systems might get divided into subsets (or clusters), each containing similar systems, and showing differences with other clusters
  • 6. Detecting Java software similarities by using different clustering techniques 6ICSME2020 Reasons for Clustering Clustering is among the fundamental techniques in knowledge mining and information retrieval A clustering algorithm attempts to distribute objects into groups of similar objects so as the similarity between one pair of objects in a cluster is higher than that between one of the objects to any objects in a different cluster “the degree to which two distinct programs are similar is related to how precisely they are alike”
  • 7. Detecting Java software similarities by using different clustering techniques 7ICSME2020 Reasons for Clustering Clustering is among the fundamental techniques in knowledge mining and information retrieval A clustering algorithm attempts to distribute objects into groups of similar objects so as the similarity between one pair of objects in a cluster is higher than that between one of the objects to any objects in a different cluster “the degree to which two distinct programs are similar is related to how precisely they are alike” s1 s2 s6 s4 s5 s3 s7 s8 Log management JSON Parsing DB Management
  • 8. Detecting Java software similarities by using different clustering techniques 8ICSME2020 Research question Are OO metrics sensitive to the context of their clusters? The main goal is to investigate whether experiments in software engineering can generalize results based on populations under different contexts and how sensitive are cluster techniques to provide such classification
  • 9. Detecting Java software similarities by using different clustering techniques 9ICSME2020 Types of clustering techniques used in the paper CrossSim (Graph-based similarity) Clustering based on projects descriptions (manually classified) LDA-informed Clustering 1. We group software systems based on the three different clustering techniques 2. We collect the values of the OO metrics suite in each cluster 3. We then test whether clusters are statistically different between each other, using the Kolgomorov-Smirnov (KS) hypothesis testing The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population
  • 10. Detecting Java software similarities by using different clustering techniques 10ICSME2020 CrossSim Based on the graph structure, one can exploit nodes, links, and the mutual relationships to compute similarity using existing graph similarity algorithms Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. An automated approach to assess the similarity of GitHub repositories. Software Quality Journal 2020 Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Massimiliano Di Penta: CrossRec: Supporting software developers by recommending third-party libraries. J. Syst. Softw. 161 (2020)
  • 11. Detecting Java software similarities by using different clustering techniques 11ICSME2020 Results for CrossSim Clustering 12 projects (6 pairs), from a larger population of 5,000 projects extracted as part of the CROSSMINER project The similarity by CrossSim is computed according to libraries, stargazers, and committers Result - We cannot conclude that CrossSim clusters are structurally different from each others https://www.crossminer.org
  • 12. Detecting Java software similarities by using different clustering techniques 12ICSME2020 Results from manual classification Java subset of 520 projects collected out of the 5,000 projects [1,2] Manually assigned to 12 categories – e.g, Communications, Database, Software Development, Text Editors, … Result - The obtained clusters result in pools of attributes that are structurally different from each other – Each cluster is a standalone category, with specific (and unique) characteristics [1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of github repositories, in: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2016, pp. 334–344. [2] H. Borges, M.T. Valente, What’s in a github star? understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.
  • 13. Detecting Java software similarities by using different clustering techniques 13ICSME2020 Results from LDA-informed Clustering Latent Dirichlet Allocation (LDA) information retrieval method
  • 14. Detecting Java software similarities by using different clustering techniques 14ICSME2020 Results from LDA-informed Clustering 20 Categories/Domains from SourceForge 100 most starred Java projects from GitHub Result - Strong evidence to reject null hypothesis based on KS test • OO attributes are showing differences among the different clusters
  • 15. Detecting Java software similarities by using different clustering techniques 15ICSME2020 Take-away messages 1. When you cluster software systems in categories you can create strongly different results 2. The interpretation of software metrics might be more sensitive to context than reported so far in the literature – The correlation among OO metrics can be extremely sensitive to application domains
  • 16. Detecting Java software similarities by using different clustering techniques 16ICSME2020 Take-away messages 3. We should pay more attention to the application domain of the studied systems • e.g. the metrics one should consider to analyse gaming software should be different from those used to assess the quality of security software • LOCs are less appropriate for assessing the quality of security software or in general of mission critical software systems The empirical findings might need readjustment depending on the cluster of projects they evaluate
  • 17. Detecting Java software similarities by using different clustering techniques Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**, Nemitari Ajienka*** * Department of Computer Science, University of Groningen, The Netherlands ** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy *** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279