This presentation is joint work by Alexandre Decan, Tom Mens and Maelick Claes (Software Engineering Lab, COMPLEXYS research institute, University of Mons). It was presented by Maelick during the International Workshop on Software Ecosystem Architectures (WEA 2016) in Copenhagen, on 29 November 2016.
Abstract of the accompanying paper (DOI 10.1145/1235):
Package-based software ecosystems are composed of thousands of interdependent software packages. Many empirical studies have focused on software packages belonging to a single software ecosystem, and suggest to generalise the results to more ecosystems. We claim that such a generalisation is not always possible, because the technical structure of software ecosystems can be very different, even if these ecosystems belong to the same domain. We confirm this claim through a study of three big and popular package-based programming language ecosystems: R’s CRAN archive network, Python’s PyPI distribution, and JavaScript’s NPM package manager. We study and compare the structure of their package dependency graphs and reveal some important differences that may make it difficult to generalise the findings of one ecosystem to another one.
A follow-up on this work can be found in the SANER 2017 paper by the same authors, entitled "An Empirical Comparison of Dependency Issues in OSS Packaging Ecosystems".
On the topology of package dependency networks: A comparison of programming language ecosystems
1. On the Topology of Package Dependency Networks
A Comparison of Programming Language Ecosystems
Alexandre Decan, Tom Mens, Maëlick Claes
Software Engineering Lab
29 November 2016 – Int’l Workshop Software Ecosystem Architectures (WEA)
3. Previous Work
• A. Decan, T. Mens, M. Claes, P. Grosjean
– IWSECO-WEA 2015: "On the Development and Distribution of R Packages: An Empirical Analysis of the R Ecosystem"
– SANER 2016: "When GitHub Meets CRAN: An Analysis of Inter-Repository Package Dependency Problems"
• A. Serebrenik, T. Mens
– WEA 2015: "Challenges in Software Ecosystems Research"
• Generalizability
• Comparing different ecosystems
4. Software Packaging Ecosystems
• Ecosystem: “a collection of software projects which are developed and evolve together in the same environment” [Lungu]
• Software distributed as packages
– Dependency relationships between
packages
– Package versioning
5. Software Packaging Ecosystems
for programming languages
• Many programming-language specific
package managers
– npm (JavaScript)
– PyPI (Python)
– RubyGems (Ruby)
– CRAN (R)
6. Software Packaging Ecosystems
for programming languages
IEEE Spectrum ranking of most popular programming languages
(http://spectrum.ieee.org/image/Mjc5MjI0Ng.png)
“The real standard library people want is more like what you find in Python
or Ruby, and it’s more batteries included, feature complete, and that is not
in JavaScript. That’s in the NPM world or the larger world.”
7. Ecosystem comparison
| | CRAN | PyPI | NPM |
|---|---|---|---|
| Snapshot date | 2016-04-26 | 2016-02-17 | 2016-06-28 |
| Packages | 9k | 56k | 317k |
| Dependencies | 21k | 53k | 728k |
| New packages in 2015 | 1.6k | 17k | 113k |
| Updates in 2015 | 8k | 131k | 711k |
8. Data extraction
• CRAN: https://github.com/ecos-umons/extractoR
• npm: https://registry.npmjs.org
• PyPI: Missing dependencies information
=> https://kgullikson88.github.io/blog/pypi-analysis.html
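As a rough illustration of the npm side of the extraction, the sketch below pulls the direct dependencies of one package out of a registry document. It assumes that the JSON served at https://registry.npmjs.org/<name> carries a `dist-tags.latest` entry and a `versions` map whose entries may declare a `dependencies` object. The helper names `latestDependencies` and `fetchDependencies` and the sample document are made up for this example; the slides do not describe the authors' actual extraction scripts.

```javascript
// Extract the direct dependencies declared by the latest version of a package
// from an npm registry document (assumed shape: "dist-tags" + "versions").
function latestDependencies(doc) {
  const latest = doc['dist-tags'] && doc['dist-tags'].latest;
  const manifest = latest && doc.versions && doc.versions[latest];
  // A version without a "dependencies" field has no runtime dependencies.
  return Object.keys((manifest && manifest.dependencies) || {}).sort();
}

// Hypothetical helper: download the document for one unscoped package name.
// Needs a runtime with a global fetch (e.g. Node.js 18+); it is not called here.
async function fetchDependencies(name) {
  const res = await fetch('https://registry.npmjs.org/' + name);
  if (!res.ok) throw new Error('registry returned ' + res.status);
  return latestDependencies(await res.json());
}

// Offline example with a hand-written document (made-up package names).
const sampleDoc = {
  name: 'a',
  'dist-tags': { latest: '1.1.0' },
  versions: {
    '1.0.0': { dependencies: { b: '^1.0.0' } },
    '1.1.0': { dependencies: { c: '~2.1.0', b: '^1.0.0' } },
  },
};
console.log(latestDependencies(sampleDoc)); // [ 'b', 'c' ]
```

Repeating this for every package name would yield the edge list of a dependency graph for one snapshot; only the latest version is considered here, which is a simplification.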
9. Terminology
• b is a dependency of a
• a is a reverse dependency of b
• c is a transitive dependency of a
• a is a transitive reverse dependency of c
• {a, b, c, d, e, f} is a (weakly connected) component
• g is an isolated package
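The terminology above can be sketched on a toy graph. The figure from the slide is not reproduced in this transcript, so only the chain a -> b -> c is taken from the bullets; the edges involving d, e and f are assumptions chosen so that {a, b, c, d, e, f} forms one weakly connected component. Here "transitive" is taken to include direct dependencies.

```javascript
// Toy graph: each package maps to its direct dependencies.
// Only a -> b -> c follows the slide; edges involving d, e, f are made up.
const deps = {
  a: ['b'],
  b: ['c'],
  c: [],
  d: ['c'],
  e: ['d'],
  f: ['a'],
  g: [],
};

// Invert the edges: reverse[x] lists the packages that directly depend on x.
function reverseGraph(graph) {
  const reverse = {};
  for (const pkg of Object.keys(graph)) reverse[pkg] = [];
  for (const [pkg, targets] of Object.entries(graph)) {
    for (const t of targets) reverse[t].push(pkg);
  }
  return reverse;
}

// All packages reachable from `start` by following edges (start excluded).
function reachable(graph, start) {
  const seen = new Set();
  const stack = [...graph[start]];
  while (stack.length > 0) {
    const pkg = stack.pop();
    if (seen.has(pkg)) continue;
    seen.add(pkg);
    stack.push(...graph[pkg]);
  }
  seen.delete(start); // only matters if the graph contains a cycle
  return [...seen].sort();
}

const rdeps = reverseGraph(deps);
const isolated = Object.keys(deps).filter(
  p => deps[p].length === 0 && rdeps[p].length === 0
);

console.log(reachable(deps, 'a'));  // transitive dependencies of a: [ 'b', 'c' ]
console.log(reachable(rdeps, 'c')); // transitive reverse dependencies of c
console.log(isolated);              // [ 'g' ]
```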
10. Dependency usage
in programming language ecosystems
PyPI has proportionally more isolated Python packages
(due to its extensive standard library?)
“The real standard library people want is more like what you find in Python or Ruby, and
it’s more batteries included, feature complete, and that is not in JavaScript. That’s in the
NPM world or the larger world.”
11. Topology
of programming language ecosystems
The majority of packages are part of a single huge component
Largest component:
• 76.5% (CRAN), 35.6% (PyPI), 63.8% (npm) of all packages
• 91% (CRAN), 88% (PyPI), 92% (npm) of all non-isolated packages
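The two percentages reported for the largest component can be computed as sketched below. This is a minimal illustration on a made-up graph, not the authors' tooling: it assumes every dependency also appears as a key of the graph, ignores edge direction to obtain weakly connected components, and treats a component of size one as an isolated package.

```javascript
// Group packages into weakly connected components (edge direction ignored).
function weakComponents(graph) {
  // Build an undirected adjacency list.
  const adj = new Map(Object.keys(graph).map(p => [p, new Set()]));
  for (const [pkg, targets] of Object.entries(graph)) {
    for (const t of targets) {
      adj.get(pkg).add(t);
      adj.get(t).add(pkg);
    }
  }
  const seen = new Set();
  const components = [];
  for (const start of adj.keys()) {
    if (seen.has(start)) continue;
    const comp = [];
    const stack = [start];
    seen.add(start);
    while (stack.length > 0) {
      const pkg = stack.pop();
      comp.push(pkg);
      for (const n of adj.get(pkg)) {
        if (!seen.has(n)) { seen.add(n); stack.push(n); }
      }
    }
    components.push(comp.sort());
  }
  // Largest component first.
  return components.sort((x, y) => y.length - x.length);
}

// Made-up example: one component of 4, one of 2, and two isolated packages.
const graph = {
  a: ['b'], b: ['c'], c: [], d: ['c'],
  e: ['f'], f: [],
  g: [], h: [],
};

const comps = weakComponents(graph);
const total = Object.keys(graph).length;
const nonIsolated = comps
  .filter(c => c.length > 1)
  .reduce((n, c) => n + c.length, 0);
const largest = comps[0].length;

console.log(largest / total);       // share of all packages: 0.5
console.log(largest / nonIsolated); // share of non-isolated packages (4 of 6)
```

Counting the components per size instead of keeping only the largest one gives the distribution of component sizes.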
13. Differences in reverse dependencies
between programming language ecosystems
There are proportionally more very popular npm packages
(i.e. higher number of transitive reverse dependencies)
14. Differences in reverse dependencies
between programming language ecosystems
Number of packages required by more than 2% of the ecosystem
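One way to obtain such a count is sketched below, assuming that "required by" means having the package as a direct or indirect dependency and that the share is taken over all packages of the ecosystem. The function names and the five-package ecosystem are made up for illustration.

```javascript
// Count, for every package, how many other packages transitively depend on it.
function transitiveReverseCounts(graph) {
  const counts = {};
  for (const pkg of Object.keys(graph)) counts[pkg] = 0;
  for (const start of Object.keys(graph)) {
    // Walk everything `start` requires, directly or indirectly.
    const seen = new Set();
    const stack = [...graph[start]];
    while (stack.length > 0) {
      const pkg = stack.pop();
      if (pkg === start || seen.has(pkg)) continue;
      seen.add(pkg);
      stack.push(...graph[pkg]);
    }
    for (const pkg of seen) counts[pkg] += 1;
  }
  return counts;
}

// Packages required by more than `fraction` of all packages.
function requiredByMoreThan(graph, fraction) {
  const total = Object.keys(graph).length;
  const counts = transitiveReverseCounts(graph);
  return Object.keys(counts)
    .filter(p => counts[p] / total > fraction)
    .sort();
}

// Made-up ecosystem of 5 packages; the threshold is raised to 50% here
// because a 2% threshold is not informative at this size.
const eco = {
  util: [],
  core: ['util'],
  app1: ['core'],
  app2: ['core'],
  lone: [],
};
console.log(requiredByMoreThan(eco, 0.5)); // [ 'util' ]
```

On a real snapshot the call would be `requiredByMoreThan(graph, 0.02)`. The quadratic traversal is acceptable for a sketch but would need a smarter strategy for a graph with hundreds of thousands of packages.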
15. Possible explanation
micro-packages in npm
“In a lot of JavaScript environments, space is at a premium. [...]
Several larger libraries […] have actually intentionally split
themselves into sub-modules because people usually only ever
load them to use a single merge function.”
Example: isarray
150 direct and 77K transitive reverse dependencies in August 2016
var toString = {}.toString;
module.exports = Array.isArray || function (arr) {
  return toString.call(arr) == '[object Array]';
};
16. Known problems: leftpad
function leftpad (str, len, ch) {
  str = String(str);
  var i = -1;
  if (!ch && ch !== 0) ch = ' ';
  len = len - str.length;
  while (++i < len) { str = ch + str; }
  return str;
}
Its developer removed all his packages from npm:
“This impacted many thousands of projects. [...] We began
observing hundreds of failures per minute, as dependent projects –
and their dependents, and their dependents... – all failed when
requesting the now-unpublished package.”
http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
17. Known problems: leftpad
npm managers un-unpublished leftpad but …
“a number of dependency chains [...] explicitly
requested 0.0.3.”
http://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
18. Conclusion
• Simple metrics can be used to compare the topology of different package-based software ecosystems
• Similarities in the dependency graph structure
– Most non-isolated packages are part of a large weakly connected component
• Differences that can be explained by the specificities of each ecosystem
– Python’s extensive standard library
– CRAN’s particular versioning policy
– npm's abundance of micro-packages
19. Future work
• See our SANER 2017 article
“An empirical comparison of dependency issues in
OSS packaging ecosystems”
• Include RubyGems
• Study the evolution over time
• Frequency of package updates
• Resilience of packages to failures in
dependencies
• Impact of solutions that rely on dependency
constraints and semantic versioning
• Beyond SANER 2017: study the interplay between
social and technical aspects
In this talk I will present an empirical study comparing three different programming language ecosystems
Alexander Serebrenik => you probably all know who he is, since he is ICSME chair
Alexandre Decan first carried out research on formal database theory but I managed to convert him to the more practical side of SE research
Bogdan Vasilescu obtained his PhD with Serebrenik, and after a two-year postdoc at UC Davis has now joined CMU in Pittsburgh.
But before delving into the comparative study itself, let’s start with a little bit of background
I recently finished my PhD on the topic of maintainability issues in software packaging ecosystems
Part of the thesis: previous papers on ecosystems, in particular the R ecosystem
Last year Alexander presented the most important challenges in ecosystem research
One of the future work directions of my own thesis
To begin with: by ecosystem we mean Lungu's definition
In our case, software projects are software packages
Particularity: dependency relationships between packages
Nowadays major open source software libraries are distributed as part of software packaging ecosystems
We selected the packaging ecosystems of three popular interpreted programming languages.
All three have gained popularity recently
R, which we studied previously, is a language originally oriented towards statistics and data analysis
JavaScript is a language mostly oriented towards web applications
Python is a more general-purpose language, but nowadays also used for both web development and data analysis
We built a dependency graph for each month since 2000
This shows how the size of these graphs has grown over time.
They all exhibit an exponential increase.
For R we used tools we previously developed to get data from official sources
For npm we got direct output from the official package API
For PyPI we realized that dependency information was missing for many packages, so we used third-party data
These three languages have an ecosystem with a different philosophy
We will use the following terminology
… Building dependency graphs for those ecosystems (Python/PyPI, R/CRAN, JavaScript/npm)
What we observe:
Python packages tend to be more isolated
Explained by a more complete standard library
JavaScript is on the opposite side
R has a good standard library for data analysis but many packages extend it (e.g. ggplot2)
Explain graphic: number of components vs component size
On the y axis number of components
On the x axis size of component
Far right biggest component of each ecosystem
One of the similarities:
Most non-isolated packages are part of the same large component
We looked at the distribution of the number of transitive dependencies of each package
Differences between npm and the others
Also, npm has more packages on which thousands of other packages depend,
=> npm might be more vulnerable
From our empirical study we saw that many popular npm packages are very fragile.
=> in particular micro-packages
What happened?
- Everything started with a disagreement over the module name “kik”
Its developer unpublished *all* his 272 modules from npm, including leftpad
This caused thousands of dependent projects to break, including Node and Babel
The community stepped in within minutes to fix the problem.
This required the npm managers to go against their own policy by un-unpublishing the module
The observed differences can have an impact on both ecosystem users and developers
=> importance of managing ecosystems with policies and not only tools