This document discusses aggregation techniques for software metrics. It describes traditional aggregation methods like mean, median, and standard deviation. It also discusses inequality indices like Gini, Theil, and Atkinson. The document outlines available datasets, tools used for analysis including R and Python, and sample results showing correlations between aggregation techniques for different software projects over multiple versions.
1. Aggregation of software metrics
Bogdan Vasilescu
b.n.vasilescu@student.tue.nl
Alexander Serebrenik
a.serebrenik@tue.nl
April 7, 2011
2. Aggregation techniques for software metrics 2/8
Better understand aggregation techniques for software metrics.
[Figure: density plot of source lines of code (SLOC) per class for freecol-0.9.4. x-axis: SLOC per class (0 to 3000); y-axis: density.]
Traditional: mean, sum, median, standard deviation, variance,
skewness, kurtosis.
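As an illustration of the traditional techniques listed above, the following is a minimal sketch using only the Python standard library. The function name `traditional_aggregates` and the sample values are hypothetical, and the sketch assumes population (not sample) moments; the slides do not say which variant the authors used.

```python
import statistics


def traditional_aggregates(values):
    """Summarise a list of class-level metric values (e.g. SLOC per class)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: population variance (divide by n, not n - 1).
    variance = statistics.pvariance(values, mu=mean)
    sd = variance ** 0.5
    # Third and fourth central moments, used for skewness and kurtosis.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "sum": sum(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": variance,
        "sd": sd,
        # Undefined when all values are equal; 0.0 is returned as a convention.
        "skewness": m3 / sd ** 3 if sd > 0 else 0.0,
        "kurtosis": m4 / sd ** 4 if sd > 0 else 0.0,
    }


# Hypothetical SLOC values for five classes; one class is much larger.
sloc = [10, 20, 30, 40, 400]
agg = traditional_aggregates(sloc)
```

With one very large class, the mean (100) lies far above the median (30) and the skewness is positive, which is the typical shape of the SLOC distribution shown in the density plot.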
/ department of mathematics and computer science
3. Aggregation techniques for software metrics 2/8
Better understand aggregation techniques for software metrics.
[Figure: two density plots side by side. Left: household income in Ilocos, the Philippines (1998); x-axis: income, y-axis: density. Right: source lines of code (SLOC) per class for freecol-0.9.4; x-axis: SLOC per class, y-axis: density.]
Traditional: mean, sum, median, standard deviation, variance,
skewness, kurtosis.
Inequality indices: Gini, Theil, Atkinson, Hoover, Kolm.
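The inequality indices can be sketched in a few lines each. The code below uses the standard textbook definitions of Gini, Theil, Atkinson, and Hoover; Kolm is left out for brevity. The function names, the default `epsilon=0.5` for Atkinson, and the sample data are assumptions made for illustration, not values taken from the study (which computed the indices in R).

```python
import math


def gini(xs):
    """Gini index via the sorted-rank formula; 0 means perfect equality."""
    xs = sorted(xs)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def theil(xs):
    """Theil index; terms with x == 0 contribute 0 by convention."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x / mean) * math.log(x / mean) for x in xs if x > 0) / n


def atkinson(xs, epsilon=0.5):
    """Atkinson index; epsilon is the inequality-aversion parameter (assumed)."""
    n = len(xs)
    mean = sum(xs) / n
    if epsilon == 1:
        # Geometric-mean form; requires strictly positive values.
        return 1 - math.exp(sum(math.log(x) for x in xs) / n) / mean
    ede = (sum(x ** (1 - epsilon) for x in xs) / n) ** (1 / (1 - epsilon))
    return 1 - ede / mean


def hoover(xs):
    """Hoover index: share of the total that would need redistributing."""
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / (2 * sum(xs))


# Hypothetical SLOC per class: all equal versus all code in one class.
equal = [5, 5, 5, 5]
concentrated = [0, 0, 0, 10]
```

For `equal` every index is 0 (up to floating-point rounding). For `concentrated`, Gini, Hoover, and Atkinson (with `epsilon=0.5`) all give 0.75 and Theil gives ln 4.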
4. Correlation study 3/8
Aggregate SLOC from class to package level.
Study statistical correlation between pairs of aggregation techniques.
Not enough to measure.
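The two steps above can be sketched as follows: group class-level SLOC by package, aggregate each package with two techniques, then compute the Kendall correlation between the two resulting series. This is a minimal standard-library sketch, not the authors' toolchain (the slides state that the statistical analyses were done in R). The class names, SLOC values, and the helper `kendall_tau_b` are hypothetical; the tau-b variant (which corrects for ties) is an assumption.

```python
from collections import defaultdict
from itertools import combinations
from math import nan, sqrt
from statistics import fmean, median


def kendall_tau_b(x, y):
    """Kendall's tau-b over paired observations, O(n^2) pair comparison."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both series: counted in neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + tied_x_only) * (pairs + tied_y_only))
    return (concordant - discordant) / denom if denom else nan


# Hypothetical class-level SLOC keyed by fully qualified class name.
class_sloc = {
    "net.a.Foo": 300, "net.a.Bar": 30, "net.a.Baz": 45,
    "net.b.Qux": 100, "net.b.Quux": 60,
    "net.c.One": 15, "net.c.Two": 25, "net.c.Three": 35,
    "net.d.Big": 900, "net.d.Small": 10,
}

# Class level -> package level: the package is everything before the last dot.
by_package = defaultdict(list)
for name, sloc in class_sloc.items():
    by_package[name.rpartition(".")[0]].append(sloc)

# One aggregated value per package for each technique.
means = [fmean(v) for v in by_package.values()]
medians = [median(v) for v in by_package.values()]
tau = kendall_tau_b(means, medians)
```

In this toy data the mean and the median rank the packages differently only for `net.a` versus `net.b`, where one large class pulls the mean of `net.a` up, so the correlation is positive but below 1.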
5. Available datasets 4/8
Qualitas Corpus 20101126 r+e.
r (recent): the most recent versions from 106 systems.
e (evolution): all available versions from 13 systems (≥ 10 versions
available), 414 versions in total.
6. Tooling 5/8
Developed and available tooling to analyze the corpus:
Extract metrics: SLOCCount, Understand (still not generic enough)
Compute inequality indices, perform statistical analyses: R (highly
scriptable)
Put everything together: Python toolchain (easily extendable)
[Figure: three plots of Kendall correlation coefficients for SLOC, one per pair of aggregation techniques: Atkinson and skewness, Gini and Theil, mean and kurtosis. y-axis: Kendall correlation coefficient (-1.0 to 1.0).]