Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Detecting java software similarities by using different clustering
1. Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
ICSME 2020
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279
2. Detecting Java software similarities by using different clustering techniques 2ICSME2020
On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Gather “as much data as possible”
- to prevent bias in the representation of a small sample
- work with a sample as close as the population itself
- showcase the performance of existing or new tools in treating vast amount of
data
3. Detecting Java software similarities by using different clustering techniques 3ICSME2020
On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Cumulative number of FOSS projects per year Average number of FOSS projects per year
4. Detecting Java software similarities by using different clustering techniques 4ICSME2020
Similarity of Systems and Empirical Research
insensitive to that
Very few works have clearly stated the
similarity (or differences) between
systems in the interpretation of the
results
- by explicitly proposing explanations based
on application domains
- by sampling the projects to be analysed
from a specific, restricted topic
5. Detecting Java software similarities by using different clustering techniques 5ICSME2020
Assumptions of this paper
A specific software system might be similar to others to some degree,
and that there are different approaches to defining their similarity
A sample of software systems might get divided into subsets (or
clusters), each containing similar systems, and showing differences with
other clusters
6. Detecting Java software similarities by using different clustering techniques 6ICSME2020
Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
7. Detecting Java software similarities by using different clustering techniques 7ICSME2020
Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
s1
s2
s6
s4
s5
s3
s7
s8
Log management
JSON Parsing
DB Management
8. Detecting Java software similarities by using different clustering techniques 8ICSME2020
Research question
Are OO metrics sensitive to the context of their clusters?
The main goal is to investigate whether experiments in software engineering can
generalize results based on populations under different contexts and how
sensitive are cluster techniques to provide such classification
9. Detecting Java software similarities by using different clustering techniques 9ICSME2020
Types of clustering techniques used in the paper
CrossSim (Graph-based similarity)
Clustering based on projects descriptions (manually classified)
LDA-informed Clustering
1. We group software systems based on the three different clustering techniques
2. We collect the values of the OO metrics suite in each cluster
3. We then test whether clusters are statistically different between each other,
using the Kolgomorov-Smirnov (KS) hypothesis testing
The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population
10. Detecting Java software similarities by using different clustering techniques 10ICSME2020
CrossSim
Based on the graph
structure, one can exploit
nodes, links, and the
mutual relationships to
compute similarity using
existing graph similarity
algorithms
Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. An automated approach to assess the similarity of GitHub repositories. Software Quality Journal 2020
Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Massimiliano Di Penta: CrossRec: Supporting software developers by recommending third-party libraries. J. Syst. Softw. 161 (2020)
11. Detecting Java software similarities by using different clustering techniques 11ICSME2020
Results for CrossSim Clustering
12 projects (6 pairs), from a larger population of 5,000 projects extracted
as part of the CROSSMINER project
The similarity by CrossSim is computed according to libraries, stargazers,
and committers
Result - We cannot conclude that CrossSim clusters are structurally
different from each others
https://www.crossminer.org
12. Detecting Java software similarities by using different clustering techniques 12ICSME2020
Results from manual classification
Java subset of 520 projects collected out of the 5,000 projects [1,2]
Manually assigned to 12 categories
– e.g, Communications, Database, Software Development, Text Editors, …
Result - The obtained clusters result in pools of attributes that are
structurally different from each other
– Each cluster is a standalone category, with specific (and unique) characteristics
[1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of github repositories, in: 2016 IEEE International Conference on Software Maintenance and
Evolution (ICSME), IEEE, 2016, pp. 334–344.
[2] H. Borges, M.T. Valente, What’s in a github star? understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.
13. Detecting Java software similarities by using different clustering techniques 13ICSME2020
Results from LDA-informed Clustering
Latent Dirichlet Allocation (LDA) information retrieval method
14. Detecting Java software similarities by using different clustering techniques 14ICSME2020
Results from LDA-informed Clustering
20 Categories/Domains from SourceForge
100 most starred Java projects from GitHub
Result - Strong evidence to reject null hypothesis based on KS test
• OO attributes are showing differences among the different clusters
15. Detecting Java software similarities by using different clustering techniques 15ICSME2020
Take-away messages
1. When you cluster software systems in categories you can create
strongly different results
2. The interpretation of software metrics might be more sensitive to
context than reported so far in the literature
– The correlation among OO metrics can be extremely sensitive to application
domains
16. Detecting Java software similarities by using different clustering techniques 16ICSME2020
Take-away messages
3. We should pay more attention to the application domain of the
studied systems
• e.g. the metrics one should consider to analyse gaming software should be
different from those used to assess the quality of security software
• LOCs are less appropriate for assessing the quality of security software or in general of
mission critical software systems
The empirical findings might need readjustment depending on the
cluster of projects they evaluate
17. Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279