Reference classes: a case study with the poweRlaw package

Colin Gillespie
Newcastle University, UK
http://aperiodical.com/2013/01/log-log-whos-there-not-a-power-law/

The power law distribution
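Name        | f(x)                                                              | Notes
Power law   | $x^{-\alpha}$                                                     | Pareto distribution
Log-normal  | $\frac{1}{x} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$ |
Exponential | $e^{-\lambda x}$                                                  |
Power law   | $x^{-\alpha}$                                                     | Zeta distribution
Power law   | $x^{-\alpha}$                                                     | $x = 1, \ldots, n$; Zipf's distribution
Yule        | $\frac{\Gamma(x)}{\Gamma(x + \alpha)}$                            |
Poisson     | $\lambda^x / x!$                                                  |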

Alleged power-law phenomena
The frequency of occurrence of unique words in the novel Moby Dick by Herman Melville
The numbers of customers affected in electrical blackouts in the United States between 1984 and 2002
The number of links to web sites found in a 1997 web crawl of about 200 million web pages
The number of hits on web pages
The number of papers scientists write
The number of citations received by papers
Annual incomes
Sales of books, music; in fact anything that can be sold

Zipf plots
[Figure: six panels (Blackouts, Fires, Flares, Moby Dick, Terrorism, Web links) plotting 1 − P(x) against x on log-log axes]

The power-law distribution is
$p(x) \propto x^{-\alpha}$
where α, the scaling parameter, is constant
The scaling parameter typically lies in the range 2 < α < 3, with occasional exceptions
When α < 2, the mean and all higher moments are infinite
Typically, the entire process doesn't obey a power law
Instead, the power law applies only to values greater than some minimum xmin
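To see why the moments diverge, it may help to work through the continuous case. Assuming the density is normalised on $x \ge x_{\min}$ (an assumption made here for illustration; the slides only state proportionality), the $m$-th moment is:

```latex
p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
\qquad x \ge x_{\min},\ \alpha > 1

\mathbb{E}[X^m]
  = \int_{x_{\min}}^{\infty} x^m \, p(x) \, dx
  = \frac{\alpha - 1}{\alpha - 1 - m} \, x_{\min}^{m}
  \quad \text{if } m < \alpha - 1,
  \qquad \text{and diverges otherwise.}
```

So for 2 < α < 3 the mean is finite but the variance is not, and once α drops to 2 or below even the mean is infinite.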

Power law: PMF & CMF
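For the discrete power law, the PMF is
$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})}$
where $\alpha > 1$, $x_{\min} \ge 1$ and
$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$
is the generalised zeta function
When $x_{\min} = 1$, $\zeta(\alpha, 1)$ is the standard zeta function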
[Figure: PDF and CDF of the discrete power law for x from 0 to 50, with α ranging from 1.50 to 2.50]
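The discrete PMF can be evaluated numerically in base R. The sketch below approximates the generalised zeta function with a truncated sum plus an integral estimate of the remainder. The function names `hurwitz_zeta` and `dpl_sketch` are hypothetical, and this is not how the poweRlaw package computes it.

```r
# Approximate the generalised (Hurwitz) zeta function:
# sum the first n_terms terms exactly, then add an integral
# estimate of the remaining tail (valid for alpha > 1).
hurwitz_zeta <- function(alpha, xmin, n_terms = 1e5) {
  n <- 0:(n_terms - 1)
  head_sum <- sum((n + xmin)^(-alpha))
  tail_est <- (n_terms + xmin)^(1 - alpha) / (alpha - 1)
  head_sum + tail_est
}

# PMF of the discrete power law; zero below xmin.
dpl_sketch <- function(x, alpha, xmin = 1) {
  ifelse(x >= xmin, x^(-alpha) / hurwitz_zeta(alpha, xmin), 0)
}

# With xmin = 1 and alpha = 2 the normalising constant
# should be close to the standard zeta value pi^2 / 6.
zeta2 <- hurwitz_zeta(2, 1)

# The PMF should sum to (very nearly) one over a long range.
total <- sum(dpl_sketch(1:100000, alpha = 2.5))
```

The truncation point `n_terms` trades accuracy for speed; the default here is an arbitrary choice.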

Fitting power laws
The main technique for fitting power laws comes from Clauset et al., 2009
This paper gets around ten new citations a week
Estimating α given xmin is straightforward: just use the maximum likelihood estimator (MLE)
The lower cut-off, xmin, is estimated using a Kolmogorov-Smirnov approach
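As a rough illustration of the two steps, the sketch below uses the continuous power law, where the MLE of α has a closed form, and picks xmin from a set of candidates by minimising the Kolmogorov-Smirnov distance between the tail data and the fitted model. All function names are hypothetical, the candidate grid is an arbitrary choice, and this is a simplified sketch rather than the package's implementation.

```r
# Closed-form MLE of alpha for a continuous power law, given xmin.
alpha_mle <- function(x, xmin) {
  tail_x <- x[x >= xmin]
  1 + length(tail_x) / sum(log(tail_x / xmin))
}

# KS distance between the empirical CDF of the tail and the fitted CDF.
ks_distance <- function(x, xmin) {
  tail_x <- sort(x[x >= xmin])
  n <- length(tail_x)
  alpha <- alpha_mle(x, xmin)
  fitted <- 1 - (tail_x / xmin)^(1 - alpha)
  # Compare with the empirical CDF just before and at each data point.
  max(abs(fitted - (seq_len(n) - 1) / n), abs(fitted - seq_len(n) / n))
}

# Choose the candidate xmin with the smallest KS distance.
estimate_xmin_sketch <- function(x, candidates) {
  d <- vapply(candidates, function(xm) ks_distance(x, xm), numeric(1))
  best <- which.min(d)
  list(xmin = candidates[best],
       alpha = alpha_mle(x, candidates[best]),
       KS = d[best])
}

# Simulate from a continuous power law (alpha = 2.5, xmin = 1) by inversion.
set.seed(1)
x <- (1 - runif(5000))^(-1 / (2.5 - 1))

alpha_hat <- alpha_mle(x, xmin = 1)
candidates <- unname(quantile(x, c(0, 0.1, 0.25, 0.5)))
fit <- estimate_xmin_sketch(x, candidates)
```

Because the simulated data follow a power law everywhere above 1, every candidate cut-off should give an estimate of α near 2.5; with real data the KS distance is what separates a good cut-off from a bad one.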

The poweRlaw package
The package is available on CRAN and at
https://github.com/csgillespie/poweRlaw
Makes power laws easy to fit
Crucially, it makes fitting the tails of the log-normal, exponential and Poisson distributions equally easy
Consistent interface between distributions
Estimate parameter uncertainty
Compare distributions (statistically and visually)

Case study: Moby Dick
R> library("poweRlaw")
R> data("moby")
R> m_pl = displ$new(moby)

R> plot(m_pl)
[Figure: CDF of the Moby Dick word frequencies on log-log axes; x-axis: Words, y-axis: CDF]

R> (est = estimate_xmin(m_pl))
$KS
[1] 0.009229
$xmin
[1] 7
$pars
[1] 1.95
attr(,"class")
[1] "estimate_xmin"

R> est = estimate_xmin(m_pl)
R> m_pl$setXmin(est)

R> lines(m_pl)
[Figure: the same Words/CDF plot with the fitted power law added as a line]

R> m_ln = dislnorm$new(moby)
R> est = estimate_xmin(m_ln)
R> m_ln$setXmin(est)
R> lines(m_ln)
[Figure: the same Words/CDF plot with the fitted power-law and log-normal lines]

Why use objects?
Each distribution is represented by an object:
Parent class: distribution
Power-law: displ, log-normal: dislnorm, ...
Method dispatch on object class:
dist_pdf(m) returns the probability density function based on the class of m
Consistent interface:
Bootstrapping:
R> bootstrap(m)
Model selection:
R> compare_distributions(m1, m2)
Simple interface that makes it easy to add new distributions (currently seven distributions are available to fit)
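The parent-class and dispatch idea can be sketched with base R's methods package. The class and generic names below (`distribution_sketch`, `pl_sketch`, `exp_sketch`, `pdf_sketch`) are hypothetical stand-ins, and the densities are simple continuous forms chosen for illustration; the package's own classes and generics are more involved.

```r
library(methods)

# Hypothetical parent class holding the fields every distribution shares.
distribution_sketch <- setRefClass("distribution_sketch",
  fields = list(dat = "numeric", xmin = "numeric", pars = "numeric"))

# Child classes inherit the fields; one class per distribution.
pl_sketch  <- setRefClass("pl_sketch",  contains = "distribution_sketch")
exp_sketch <- setRefClass("exp_sketch", contains = "distribution_sketch")

# One generic function, with one method per distribution class.
setGeneric("pdf_sketch", function(m, q) standardGeneric("pdf_sketch"))

setMethod("pdf_sketch", "pl_sketch", function(m, q) {
  # Continuous power-law density for q >= xmin.
  (m$pars - 1) / m$xmin * (q / m$xmin)^(-m$pars)
})

setMethod("pdf_sketch", "exp_sketch", function(m, q) {
  # Exponential density shifted to start at xmin.
  m$pars * exp(-m$pars * (q - m$xmin))
})

m1 <- pl_sketch$new(xmin = 1, pars = 2.5)
m2 <- exp_sketch$new(xmin = 1, pars = 0.5)

# The same call works for both objects; the class of m picks the method.
p1 <- pdf_sketch(m1, 2)
p2 <- pdf_sketch(m2, 2)
```

Adding another distribution then amounts to defining one more child class and one more method, without touching the calling code.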

Reference classes
Reference classes behave like classes in C++, Python and many other
languages - not like standard R classes
You can use these classes with ordinary R expressions and functions
An extension to core R (October, 2010)
Big difference - mutable state

Mutable states
R> displ = setRefClass("displ", fields = "xmin")
R> d1 = displ$new(xmin = 1)
R> d1$xmin
[1] 1

Mutable states
R> d1$xmin
[1] 1
R> d2 = d1
R> d2$xmin = 100
R> d2$xmin
[1] 100
R> d1$xmin
[1] 100
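When an independent object is wanted rather than a second name for the same one, reference class objects provide a `copy()` method. A minimal sketch, using a hypothetical class name `displ_sketch` so as not to clash with the package's `displ`:

```r
library(methods)

displ_sketch <- setRefClass("displ_sketch", fields = list(xmin = "numeric"))

d1 <- displ_sketch$new(xmin = 1)

# copy() creates a new object, so changes no longer propagate back.
d2 <- d1$copy()
d2$xmin <- 100

d1$xmin  # 1
d2$xmin  # 100
```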

Mutable states
When estimating xmin, a naive implementation is slow
Efficient caching speeds up calculations 100-fold
For example, the call
R> m_pl$setXmin(10)
updates internal variables that make future calculations quicker
On creation of a distribution object, we make "multiple copies" of the data
R> x
R> cumsum(log(x))
Using reference classes avoids constant copying and speeds up calculations
R> pl_ref$xmin = 10
R> pl_s4@xmin = 10
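One way such caching might look is sketched below: the data are sorted and their cumulative log-sums stored once at creation, so that changing xmin only needs a lookup rather than a fresh pass over the data. The class `cached_pl`, its fields and its methods are hypothetical, and the estimator is the continuous-case MLE; this is an illustration of the idea, not the package's internals.

```r
library(methods)

cached_pl <- setRefClass("cached_pl",
  fields = list(dat = "numeric", cum_log = "numeric",
                xmin = "numeric", n_tail = "numeric", log_sum = "numeric"),
  methods = list(
    initialize = function(x = numeric(0), ...) {
      # Done once: sort descending and cache cumulative sums of log(x).
      dat <<- sort(x, decreasing = TRUE)
      cum_log <<- cumsum(log(dat))
      callSuper(...)
    },
    setXmin = function(value) {
      # Updates the object in place; no copy of the data is made.
      xmin <<- value
      n_tail <<- as.numeric(sum(dat >= value))
      log_sum <<- if (n_tail > 0) cum_log[n_tail] else 0
      invisible(.self)
    },
    alpha_hat = function() {
      # Continuous-case MLE computed from the cached quantities.
      1 + n_tail / (log_sum - n_tail * log(xmin))
    }
  )
)

x <- c(1, 2, 2, 3, 5, 8, 13, 21)
m <- cached_pl$new(x = x)
m$setXmin(2)

a_cached <- m$alpha_hat()
a_direct <- 1 + sum(x >= 2) / sum(log(x[x >= 2] / 2))
```

Because the data are sorted in decreasing order, the values at or above xmin are always the first `n_tail` entries, which is what makes the cumulative-sum lookup valid.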

Comments
Reference classes are still new
Code has now broken twice with R upgrades
roxygen2 and reference classes didn’t play well together
Very few questions on Stack Overflow on reference classes
Structuring code and files
Care has to be taken when using them with parallel computing

References
Clauset, Aaron, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law
distributions in empirical data. SIAM review 51.4 (2009): 661–703.
poweRlaw package
https://github.com/csgillespie/poweRlaw
