Detecting java software similarities by using different clustering

Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
ICSME 2020
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279

Detecting Java software similarities by using different clustering techniques 2ICSME2020
On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Gather “as much data as possible”
- to prevent bias in the representation of a small sample
- work with a sample as close as the population itself
- showcase the performance of existing or new tools in treating vast amount of
data

On the need of always larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or collective efforts
Cumulative number of FOSS projects per year Average number of FOSS projects per year

Similarity of Systems and Empirical Research
insensitive to that
Very few works have clearly stated the
similarity (or differences) between
systems in the interpretation of the
results
- by explicitly proposing explanations based
on application domains
- by sampling the projects to be analysed
from a specific, restricted topic

Assumptions of this paper
A specific software system might be similar to others to some degree,
and that there are different approaches to defining their similarity
A sample of software systems might get divided into subsets (or
clusters), each containing similar systems, and showing differences with
other clusters

Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”

Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so as the similarity between one pair of objects in a cluster is higher
than that between one of the objects to any objects in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
s1
s2
s6
s4
s5
s3
s7
s8
Log management
JSON Parsing
DB Management

Research question
Are OO metrics sensitive to the context of their clusters?
The main goal is to investigate whether experiments in software engineering can
generalize results based on populations under different contexts and how
sensitive are cluster techniques to provide such classification

Types of clustering techniques used in the paper
CrossSim (Graph-based similarity)
Clustering based on projects descriptions (manually classified)
LDA-informed Clustering
1. We group software systems based on the three different clustering techniques
2. We collect the values of the OO metrics suite in each cluster
3. We then test whether clusters are statistically different between each other,
using the Kolgomorov-Smirnov (KS) hypothesis testing
The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population

CrossSim
Based on the graph
structure, one can exploit
nodes, links, and the
mutual relationships to
compute similarity using
existing graph similarity
algorithms
Nguyen, P.T., Di Rocco, J., Rubei, R., Di Ruscio, D. An automated approach to assess the similarity of GitHub repositories. Software Quality Journal 2020
Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Massimiliano Di Penta: CrossRec: Supporting software developers by recommending third-party libraries. J. Syst. Softw. 161 (2020)

Results for CrossSim Clustering
12 projects (6 pairs), from a larger population of 5,000 projects extracted
as part of the CROSSMINER project
The similarity by CrossSim is computed according to libraries, stargazers,
and committers
Result - We cannot conclude that CrossSim clusters are structurally
different from each others
https://www.crossminer.org

Results from manual classification
Java subset of 520 projects collected out of the 5,000 projects [1,2]
Manually assigned to 12 categories
– e.g, Communications, Database, Software Development, Text Editors, …
Result - The obtained clusters result in pools of attributes that are
structurally different from each other
– Each cluster is a standalone category, with specific (and unique) characteristics
[1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of github repositories, in: 2016 IEEE International Conference on Software Maintenance and
Evolution (ICSME), IEEE, 2016, pp. 334–344.
[2] H. Borges, M.T. Valente, What’s in a github star? understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.

Results from LDA-informed Clustering
Latent Dirichlet Allocation (LDA) information retrieval method

Results from LDA-informed Clustering
20 Categories/Domains from SourceForge
100 most starred Java projects from GitHub
Result - Strong evidence to reject null hypothesis based on KS test
• OO attributes are showing differences among the different clusters

Take-away messages
1. When you cluster software systems in categories you can create
strongly different results
2. The interpretation of software metrics might be more sensitive to
context than reported so far in the literature
– The correlation among OO metrics can be extremely sensitive to application
domains

Take-away messages
3. We should pay more attention to the application domain of the
studied systems
• e.g. the metrics one should consider to analyse gaming software should be
diﬀerent from those used to assess the quality of security software
• LOCs are less appropriate for assessing the quality of security software or in general of
mission critical software systems
The empirical findings might need readjustment depending on the
cluster of projects they evaluate

Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279

Detecting java software similarities by using different clustering

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (13)

Similaire à Detecting java software similarities by using different clustering

Similaire à Detecting java software similarities by using different clustering (20)

Plus de Davide Ruscio

Plus de Davide Ruscio (12)

Dernier

Dernier (20)

Detecting java software similarities by using different clustering