SlideShare a Scribd company logo
1 of 43
Download to read offline
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Arthur Charpentier
charpentier.arthur@gmail.com
http ://freakonometrics.hypotheses.org/
Université de Rennes 1, January 2015
Welfare, Inequality & Poverty, # 2
1
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Modeling Income Distribution
Let {x1, · · · , xn} denote some sample. Then
x =
1
n
n
i=1
xi =
n
i=1
1
n
xi
This can be used when we have census data.
1 load ( u r l ( " http : // freakonometrics . f r e e . f r /
income_5. RData" ) )
2 income <− s o r t ( income )
3 plot ( 1 : 5 , income )
income
qq q q q
0 50000 100000 150000 200000 250000
It is possible to use survey data. If πi denote the probability to be drawn, use
weights
ωi ∝
1
nπi
2
Arthur CHARPENTIER - Welfare, Inequality and Poverty
The weighted average is then
xω =
n
i=1
ωi
ω
xi
where ω = ωi. This is an unbaised estimator of the population mean.
Sometime, data are obtained from stratified samples : before sampling, members
of the population are groupes in homogeneous subgroupes (called a strata).
Given S strata, such that the population in strata s is Ns, then
xS =
S
s=1
Ns
N
xs where xs =
1
Ns
i∈Ss
xi
3
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Statistical Tools Used to Describe the Distribution
Consider a sample {x1, · · · , xn}. Usually, the order is not important. So let us
order those values,
x1:n
min{xi}
≤ x2:n ≤ · · · ≤ xn−1:n ≤ xn:n
max{xi}
As usual, assume that xi’s were randomly drawn from an (unknown) distribution
F.
If F denotes the cumulative distribution function, F(x) = P(X ≤ x), one can
prove that
F(xi:n) = P(X ≤ xi:n) ∼
i
n
The quantile function is defined as the inverse of the cumulative distribution
function F,
Q(u) = F−1
(u) or F(Q(u)) = P(X ≤ Q(u)) = u
4
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Lorenz curve
The empirical version of Lorenz curve is
L =



i
n
,
1
nx
j≤i
xj:n



1 > plot ( ( 0 : 5 ) / 5 , c (0 ,cumsum( income ) /sum( income
) ) )
5
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Lorenz curve
p
L(p)
q
q
q
q
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Gini Coefficient
Gini coefficient is defined as the ratio of areas,
A
A + B
.
It can be defined using order statistics as
G =
2
n(n − 1)x
n
i=1
i · xi:n −
n + 1
n − 1
1 > n <− length ( income )
2 > mu <− mean( income )
3 > 2∗sum ( ( 1 : n) ∗ s o r t ( income ) ) / (mu∗n∗ (n−1))−(n
+1)/ (n−1)
4 [ 1 ] 0.5800019
6
q
q
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
p
L(p)
q
q
q
qA
B
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution Fitting
Assume that we now have more observations,
1 > load ( u r l ( " http : // freakonometrics . f r e e . f r /income_500. RData" ) )
We can use some histogram to visualize the distribu-
tion of the income
1 > summary( income )
2 Min . 1 st Qu. Median Mean 3rd Qu.
Max.
3 2191 23830 42750 77010 87430
2003000
4 > s o r t ( income ) [ 4 9 5 : 5 0 0 ]
5 [ 1 ] 465354 489734 512231 539103 627292
2003241
6 > h i s t ( income , breaks=seq (0 ,2005000 , by=5000) )
Histogram of income
income
Frequency
0 500000 1000000 1500000 2000000
010203040
7
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution Fitting
Because of the dispersion, look at the histogram of
the logarithm of the data
1 > h i s t ( log ( income , 1 0 ) , breaks=seq ( 3 , 6 . 5 ,
length =51) )
2 > boxplot ( income , h o r i z o n t a l=TRUE, log=" x " )
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q
2e+03 1e+04 5e+04 2e+05 1e+06
Histogram of log(income, 10)
log(income, 10)
Frequency
3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5
010203040
8
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution Fitting
The cumulative distribution function (on the log of
the income)
1 > u <− s o r t ( income )
2 > v <− ( 1 : 5 0 0 ) /500
3 > plot (u , v , type=" s " , log=" x " )
Income (log scale)
CumulatedProbabilities
2e+03 1e+04 5e+04 2e+05 1e+06
0.00.20.40.60.81.0
9
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution Fitting
If we invert that graph, we have the quantile function
1 > plot (v , u , type=" s " , c o l=" red " , log=" y " )
10
Probabilities
Income(logscale)
0.0 0.2 0.4 0.6 0.8 1.0
2e+031e+045e+042e+051e+06
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution Fitting
On that dataset, Lorenz curve is
1 > plot ( ( 0 : 5 0 0 ) / 500 , c (0 ,cumsum( income ) /sum(
income ) ) )
q
q
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
p
L(p)
11
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Distribution and Confidence Intervals
There are two techniques to get the distribution of an estimator θ,
– a parametric one, based on some assumptions on the underlying distribution,
– a nonparametric one, based on sampling techniques
If Xi’s have a N(µ, σ2
) distribution, then
X ∼ N µ,
σ2
n
But sometimes, distribution can only be obtained as an approximation, because
of asymptotic properties.
From the central limit theorem,
X → N µ,
σ2
n
as n → ∞
In the nonparametric case, the idea is to generate pseudo-samples of size n, by
resampling from the original distribution.
12
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Bootstraping
Consider a sample x = {x1, · · · , xn}. At step b = 1, 2, · · · , B, generate a pseudo
sample xb
by sampling (with replacement) within sample x. Then compute any
statistic θ(xb
)
1 > boot <− function ( sample , f , b=500){
2 + F <− rep (NA, b)
3 + n <− length ( sample )
4 + f o r ( i in 1: b) {
5 + idx <− sample ( 1 : n , s i z e=n , r e p l a c e=TRUE)
6 + F[ i ] <− f ( sample [ idx ] ) }
7 + return (F) }
13
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Bootstraping
Let us generate 10,000 bootstraped sample, and com-
pute Gini index on those
1 >boot_g i n i <− boot ( income , gini ,1 e4 )
To visualize the distribution of the index
1 > h i s t ( boot_gini , p r o b a b i l i t y=TRUE)
2 > u <− seq ( . 4 , . 7 , length =251)
3 > v <− dnorm(u , mean( boot_g i n i ) , sd ( boot_g i n i )
)
4 > l i n e s (u , v , c o l=" red " , l t y =2) boot_gini
Density
0.45 0.50 0.55 0.60
051015
14
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Continuous Versions
The empirical cumulative distribution function
Fn(x) =
1
n
n
i=1
1(xi ≤ x)
Observe that
Fn(xj:n) =
j
n
If F is absolutely continuous,
F(x) =
x
0
f(t)dt i.e. f(x) =
dF(x)
dx
.
Then
P(x ∈ [a, b]) =
b
a
f(t)dt = F(b) − F(a).
15
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Continuous Versions
One can define quantiles as
x = Q(p) = F−1
(p)
The expected value is
µ =
∞
0
xf(x)dx =
∞
0
[1 − F(x)]dx =
1
0
Q(p)dp.
We can compute the average standard of living of the group below z. This is
equivalent to the expectation of a truncated distribution.
µ−
z =
1
F(z)
z
0
xf(x)dx =
∞
0
1 −
F(x)
F(z)
fx
16
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Continuous Versions
Lorenz curve is p → L(p) with
L(p) =
1
µ
Q(p)
0
xf(x)dx
Gastwirth (1971) proved that
L(p) =
1
µ
p
0
Q(u)du =
p
0
Q(u)du
1
0
Q(u)du
The numerator sums the incomes of the bottom p proportion of the population.
The denominator sums the incomes of all the population.
L is a [0, 1] → [0, 1] function, continuous if F is continuous. Observe that L is
increasing, since
dL(p)
dp
=
Q(p)
µ
Further, L is convex
17
Arthur CHARPENTIER - Welfare, Inequality and Poverty
The sample case
L
i
n
=
i
j=1 xj:n
n
j=1 xj:n
The points {i/n, L(i/n)} are then linearly interpolated to complete the
corresponding Lorenz curve.
The continuous distribution case
L(p) =
F −1
(p)
0
ydF(y)
∞
0
ydF(y)
=
1
E(X)
p
0
F−1
(u)du
with p ∈ (0, 1).
Let L be a continuous function on [0, 1], then L is a Lorenz curve if and only if
L(0) = 0, L(1) = 1, L (0+
) ≥ 0 and L (p) ≥ 0 on [0, 1].
18
Arthur CHARPENTIER - Welfare, Inequality and Poverty
From Lorenz to Bonferroni
The Bonferroni curve is
B(p) =
L(p)
p
and the Bonferroni index is
BI = 1 −
1
0
B(p)dp.
Define
Pi =
i
n
and Qi =
1
nx
i
j=1
xj
then
B =
1
n − 1
n−1
i=1
Pi − Qi
Pi
19
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Gini and Pietra indices
The Gini index is defined as twice the area between the egalitarian line and the
Lorenz curve
G = 2
1
0
[p − L(p)]dp = 1 − 2
1
0
L(p)dp
which can also be writen
1 −
1
E(X)
∞
0
[1 − F(x)]2
dx
Pietra index is defined as the maximal vertical deviation between the Lorenz
curve and the egalitarian line
P = max
p∈(0,1)
{p − L(p)} =
E(|X − E(X)|)
2E(X)
if F is strictly increasing (the maximum is reached in p = F(E(X)))
20
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Examples
E.g. consider the uniform distribution
F(x) = min{1,
x − a
b − a
1(x ≥ a)}
Then
L(p) =
2ap + (b − a)2
p2
a + b
and Gini index is
G =
b − a
3(a + b)
E.g. consider a Pareto distribution,
F(x) = 1 −
x0
x
α
, x ≥ x0,
with shape parameter α > 0. Then
F−1
(u) =
x0
(1 − u)
1
α
21
Arthur CHARPENTIER - Welfare, Inequality and Poverty
and
L(p) = 1 − [1 − p]1− 1
α p ∈ (0, 1).
and Gini index is
G =
1
2α − 1
while Pietra index is, if α > 1
P =
(α − 1)α−1
αα
E.g. consider the lognormal distribution,
F(x) = Φ
log x − µ
σ
then
L(p) = Φ(Φ−1
(p) − σ) p ∈ (0, 1).
and Gini index is
G = 2Φ
σ
√
2
− 1
22
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Fitting a Distribution
The standard technique is based on maximum likelihood estimation, provided by
1 > l i b r a r y (MASS)
2 > f i t d i s t r ( income , " lognormal " )
3 meanlog sdlog
4 10.72264538 1.01091329
5 ( 0.04520942) ( 0.03196789)
For other distribution (such as the Gamma distribution), we might have to rescale
1 > ( f i t_g <− f i t d i s t r ( income/1e2 , "gamma" ) )
2 shape rate
3 1.0812757769 0.0014040438
4 (0.0473722529) (0.0000544185)
5 > ( f i t_ln <− f i t d i s t r ( income/1e2 , " lognormal " ) )
6 meanlog sdlog
7 6.11747519 1.01091329
8 (0.04520942) (0.03196789)
23
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Fitting a Distribution
We can compare the densities
1 > u=seq (0 ,2 e5 , length =251)
2 > h i s t ( income , breaks=seq (0 ,2005000 , by=5000) ,
c o l=rgb ( 0 , 0 , 1 , . 5 ) , border=" white " , xlim=c
(0 ,2 e5 ) , p r o b a b i l i t y=TRUE)
3 > v_g <− dgamma(u/1e2 , f i t_g$ estimate [ 1 ] , f i t
_g$ estimate [ 2 ] ) /1e2
4 > v_ln <− dlnorm (u/1e2 , f i t_ln $ estimate [ 1 ] ,
f i t_ln $ estimate [ 2 ] ) /1e2
5 > l i n e s (u , v_g , c o l=" red " , l t y =2)
6 > l i n e s (u , v_ln , c o l=rgb ( 1 , 0 , 0 , . 4 ) )
Income
CumulatedProbabilities
0 50000 100000 150000 200000
0.00.20.40.60.81.0
Gamma
Log Normal
24
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Fitting a Distribution
or the cumuluative distributions
1 x <− s o r t ( income )
2 y <− ( 1 : 5 0 0 ) /500
3 plot (x , y , type=" s " , c o l=" black " )
4 v_g <− pgamma(u/1e2 , f i t_g$ estimate [ 1 ] , f i t_g
$ estimate [ 2 ] )
5 v_ln <− plnorm (u/1e2 , f i t_ln $ estimate [ 1 ] , f i t
_ln $ estimate [ 2 ] )
6 l i n e s (u , v_g , c o l=" red " , l t y =2)
7 l i n e s (u , v_ln , c o l=rgb ( 1 , 0 , 0 , . 4 ) ) income
Density
0 50000 100000 150000 200000
0.0e+005.0e−061.0e−051.5e−05
Gamma
Log Normal
One might consider the parametric version of Lorenz curve, to confirm the
goodness of fit, e.g. a lognormal distribution with σ = 1 since
1 > f i t d i s t r ( income , " lognormal " )
2 meanlog sdlog
3 10.72264538 1.01091329
25
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Fitting a Distribution
We can use functions of R
1 l i b r a r y ( ineq )
2 Lc . sim <− Lc ( income )
3 plot ( 0 : 1 , 0 : 1 , xlab="p" , ylab="L(p) " , c o l=" white
" )
4 polygon ( c (0 ,1 ,1 ,0) , c (0 ,0 ,1 ,0) , c o l=rgb
( 0 , 0 , 1 , . 1 ) , border=NA)
5 polygon ( Lc . sim$p , Lc . sim$L , c o l=rgb ( 0 , 0 , 1 , . 3 ) ,
border=NA)
6 l i n e s ( Lc . sim )
7 segments (0 ,0 ,1 ,1)
8 l i n e s ( Lc . lognorm , parameter =1, l t y =2)
q
q
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
p
L(p)
26
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Standard Parametric Distribution
For those distributions, we mention the R names in the gamlss package. Inference
can be done using
1 f i t <−gamlss (y~ 1 , family=LNO)
• log normal
f(x) =
1
xσ
√
2π
e−
(ln x−µ)2
2σ2
, x ≥ 0
with mean eµ+σ2
/2
, median eµ
, and variance (eσ2
− 1)e2µ+σ2
1 LNO(mu. l i n k = " i d e n t i t y " , sigma . l i n k = " log " )
2 dLNO(x , mu = 1 , sigma = 0.1 , nu = 0 , log = FALSE)
• gamma
f(x) =
x1/σ2
−1
exp[−x/(σ2
µ)]
(σ2µ)1/σ2
Γ(1/σ2)
, x ≥ 0
27
Arthur CHARPENTIER - Welfare, Inequality and Poverty
with mean µ and variance σ2
1 GA(mu. l i n k = " log " , sigma . l i n k =" log " )
2 dGA(x , mu = 1 , sigma = 1 , log = FALSE)
• Pareto
f(x) =
α xα
m
xα+1
for x ≥ xm
with cumulated distribution
F(x) = 1 −
xm
x
α
for x ≥ xm
with mean
αxm
(α − 1)
if α > 1, and variance
x2
mα
(α − 1)2(α − 2)
if α > 2.
1 PARETO2(mu. l i n k = " log " , sigma . l i n k = " log " )
2 dPARETO2(x , mu = 1 , sigma = 0.5 , log = FALSE)
28
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Larger Families
• GB1 - generalized Beta type 1
f(x) =
|a|xap−1
(1 − (x/b)a
)q−1
bapB(p, q)
, 0 < xa
< ba
where b , p , and q are positive
1 GB1(mu. l i n k = " l o g i t " , sigma . l i n k = " l o g i t " , nu . l i n k = " log " , tau . l i n k
= " log " )
2 dGB1(x , mu = 0.5 , sigma = 0.4 , nu = 1 , tau = 1 , log = FALSE)
The GB1 family includes the generalized gamma(GG), and Pareto as special
cases.
• GB2 - generalized Beta type 2
f(x) =
|a|xap−1
bapB(p, q)(1 + (x/b)a)p+q
Arthur CHARPENTIER - Welfare, Inequality and Poverty
1 GB2(mu. l i n k = " log " , sigma . l i n k = " i d e n t i t y " , nu . l i n k = " log " , tau .
l i n k = " log " )
4 dGB2(x , mu = 1 , sigma = 1 , nu = 1 , tau = 0 .5 , log = FALSE)
The GB2 nests common distributions such as the generalized gamma (GG),
Burr, lognormal, Weibull, Gamma, Rayleigh, Chi-square, Exponential, and the
log-logistic.
• Generalized Gamma
f(x) =
(p/ad
)xd−1
e−(x/a)p
Γ(d/p)
,
30
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
1 > load ( u r l ( " http : // freakonometrics . f r e e . f r /income_binned . RData" ) )
2 > head ( income_binned )
3 low high number mean std_e r r
4 1 0 4999 95 3606 964
5 2 5000 9999 267 7686 1439
6 3 10000 14999 373 12505 1471
7 4 15000 19999 350 17408 1368
8 5 20000 24999 329 22558 1428
9 6 25000 29999 337 27584 1520
10 > t a i l ( income_binned )
11 low high number mean std_e r r
12 46 225000 229999 10 228374 1197
13 47 230000 234999 13 232920 1370
14 48 235000 239999 11 236341 1157
15 49 240000 244999 14 242359 1474
16 50 245000 249999 11 247782 1487
17 51 250000 I n f 228 395459 189032
31
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
There is a dedicated package to work with such datasets,
1 > l i b r a r y ( b i n e q u a l i t y )
To fit a parametric distribution, e.g. a log-normal distribution, use functions of R
1 > n <− nrow ( income_binned )
2 > f i t_LN <− fitFunc (ID=rep ( " Fake Data " ,n) , hb=income_binned [ , " number "
] , bin_min=income_binned [ , " low " ] , bin_max=income_binned [ , " high " ] ,
obs_mean=income_binned [ , "mean" ] , ID_name=" Country " , d i s t r i b u t i o n=
LNO, distName="LNO" )
3 Time d i f f e r e n c e of 0.09900618 s e c s
4 f o r LNO f i t across 1 d i s t r i b u t i o n s
32
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
To visualize the cumulated distribution function, use
1 > N <− income_binned $number
2 > y1 <− cumsum(N) /sum(N)
3 > u <− seq (min( income_binned $low ) ,max( income
_binned $low ) , length =101)
4 > v <− plnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$
parameters [ 2 ] )
5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab="
Income " , ylab=" Cumulative Probability " )
6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , 0 , income_binned $ high [ i ] , y1 [ i ] , c o l=rgb
( 1 , 0 , 0 , . 2 ) )
7 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , y1 [ i ] , income_binned $ high [ i ] , c (0 , y1 ) [ i
] , c o l=rgb ( 1 , 0 , 0 , . 4 ) )
0 50000 100000 150000 200000 250000
0.00.20.40.60.8
Income
CumulativeProbability
33
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
and to visualize the cumulated distribution function,
use
1 > N=income_binned $number
2 > y2=N/sum(N) / d i f f ( income_binned $low )
3 > u=seq (min( income_binned $low ) ,max( income_
binned $low ) , length =101)
4 > v=dlnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$
parameters [ 2 ] )
5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab="
Income " , ylab=" Density " )
6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , 0 , income_binned $ high [ i ] , y2 [ i ] , c o l=rgb
( 1 , 0 , 0 , . 2 ) , border=" white " )
0 50000 100000 150000 200000 250000
0.0e+005.0e−061.0e−051.5e−05
Income
Density
34
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
But it is also possible to estimate all GB-distributions at once,
1 > f i t s=run_GB_family (ID=rep ( " Fake Data " ,n) ,hb=income_binned [ , " number "
] , bin_min=income_binned [ , " low " ] , bin_max=income_binned [ , " high " ] , obs
_mean=income_binned [ , "mean" ] ,
2 + ID_name=" Country " )
3 Time d i f f e r e n c e of 0.03800201 s e c s
4 f o r GB2 f i t across 1 d i s t r i b u t i o n s
5
6 Time d i f f e r e n c e of 0.3090181 s e c s
7 f o r GG f i t across 1 d i s t r i b u t i o n s
8
9 Time d i f f e r e n c e of 0.864049 s e c s
10 f o r BETA2 f i t across 1 d i s t r i b u t i o n s
...
1 Time d i f f e r e n c e of 0.04900193 s e c s
Arthur CHARPENTIER - Welfare, Inequality and Poverty
2 f o r LOGLOG f i t across 1 d i s t r i b u t i o n s
3
6 Time d i f f e r e n c e of 1.865106 s e c s
7 f o r PARETO2 f i t across 1 d i s t r i b u t i o n s
1 > f i t s $ f i t . f i l t e r [ , c ( " g i n i " , " a i c " , " bic " ) ]
2 g i n i a i c bic
3 1 NA NA NA
4 2 5.054377 34344.87 34364.43
5 3 5.110104 34352.93 34372.48
6 4 NA 53638.39 53657.94
7 5 4.892090 34845.87 34865.43
8 6 5.087506 34343.08 34356.11
9 7 4.702194 34819.55 34832.59
10 8 4.557867 34766.38 34779.41
11 9 NA 58259.42 58272.45
12 10 5.244332 34805.70 34818.73
1 > f i t s $ best_model$ a i c
Arthur CHARPENTIER - Welfare, Inequality and Poverty
2 Country obsMean d i s t r i b u t i o n estMean var
5 1 Fake Data NA LNO 72328.86 6969188937
6 cv cv_sqr g i n i t h e i l MLD
7 1 1.154196 1.332168 5.087506 0.4638252 0.4851275
8 a i c bic didConverge l o gL ike li ho od nparams
9 1 34343.08 34356.11 TRUE −17169.54 2
10 median sd
11 1 44400.23 83481.67
That was easy, those were simulated data...
37
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
Consider now some real data,
1 > data = read . table ( " http : // freakonometrics . f r e e . f r /us_income . txt " ,
sep=" , " , header=TRUE)
2 > head ( data )
3 low high number_1000 s mean std_e r r
4 1 0 4999 4245 1249 50
5 2 5000 9999 5128 7923 30
6 3 10000 14999 7149 12389 28
7 4 15000 19999 7370 17278 26
8 > t a i l ( data )
9 low high number_1000 s mean std_e r r
10 39 190000 194999 361 192031 115
11 40 195000 199999 291 197120 135
12 41 200000 249999 2160 219379 437
13 42 250000 9999999 2498 398233 6519
38
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
To fit a parametric distribution, e.g. a log-normal distribution, use
1 > n <− nrow ( data )
2 > f i t_LN <− fitFunc (ID=rep ( "US" ,n) , hb=data [ , " number_1000 s " ] , bin_min
=data [ , " low " ] , bin_max=data [ , " high " ] , obs_mean=data [ , "mean" ] , ID_
name=" Country " , d i s t r i b u t i o n=LNO, distName="LNO" )
3 Time d i f f e r e n c e of 0.1040058 s e c s
4 f o r LNO f i t across 1 d i s t r i b u t i o n s
39
Arthur CHARPENTIER - Welfare, Inequality and Poverty
Dealing with Binned Data
To visualize the cumulated distribution function, use
1 > N <− income_binned $number
2 > y1 <− cumsum(N) /sum(N)
3 > u <− seq (min( income_binned $low ) ,max( income
_binned $low ) , length =101)
4 > v <− plnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$
parameters [ 2 ] )
5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab="
Income " , ylab=" Cumulative Probability " )
6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , 0 , income_binned $ high [ i ] , y1 [ i ] , c o l=rgb
( 1 , 0 , 0 , . 2 ) )
7 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , y1 [ i ] , income_binned $ high [ i ] , c (0 , y1 ) [ i
] , c o l=rgb ( 1 , 0 , 0 , . 4 ) )
0 50000 100000 150000 200000 250000
0.00.20.40.60.8
Income
CumulativeProbability
40
Arthur CHARPENTIER - Welfare, Inequality and Poverty
and to visualize the cumulated distribution function,
use
1 > N=income_binned $number
2 > y2=N/sum(N) / d i f f ( income_binned $low )
3 > u=seq (min( income_binned $low ) ,max( income_
binned $low ) , length =101)
4 > v=dlnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$
parameters [ 2 ] )
5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab="
Income " , ylab=" Density " )
6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i
] , 0 , income_binned $ high [ i ] , y2 [ i ] , c o l=rgb
( 1 , 0 , 0 , . 2 ) , border=" white " )
0 50000 100000 150000 200000 250000
0.0e+005.0e−061.0e−051.5e−05
Income
Density
And the winner is....
1 > f i t s $ f i t . f i l t e r [ , c ( " g i n i " , " a i c " , " bic " ) ]
2 g i n i a i c bic
3 1 4.413411 825368.7 825407.4
Arthur CHARPENTIER - Welfare, Inequality and Poverty
4 2 4.395078 825598.8 825627.9
7 3 4.455112 825502.4 825531.5
8 4 4.480844 825881.5 825910.6
9 5 4.413282 825315.3 825344.4
10 6 4.922123 832408.2 832427.6
11 7 4.341085 827065.2 827084.6
12 8 4.318694 826112.9 826132.2
13 9 NA 831054.2 831073.6
14 10 NA NA NA
1 > f i t s $ best_model$ a i c
2 Country obsMean d i s t r i b u t i o n estMean var
3 1 US NA GG 65147.54 3152161910
4 cv cv_sqr g i n i t h e i l MLD
5 1 0.8617995 0.7426984 4.395078 0.3251443 0.3904942
6 a i c bic didConverge l o gL ike li ho od nparams
7 1 825598.8 825627.9 TRUE −412796.4 3
8 median sd
9 1 48953.6 56144.12
42
Arthur CHARPENTIER - Welfare, Inequality and Poverty
43

More Related Content

What's hot (20)

Slides ensae 8
Slides ensae 8Slides ensae 8
Slides ensae 8
 
Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Slides erm-cea-ia
Slides erm-cea-iaSlides erm-cea-ia
Slides erm-cea-ia
 
Slides ineq-3b
Slides ineq-3bSlides ineq-3b
Slides ineq-3b
 
Inequalities #2
Inequalities #2Inequalities #2
Inequalities #2
 
Slides ineq-4
Slides ineq-4Slides ineq-4
Slides ineq-4
 
Classification
ClassificationClassification
Classification
 
Slides barcelona Machine Learning
Slides barcelona Machine LearningSlides barcelona Machine Learning
Slides barcelona Machine Learning
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2
 
Slides erasmus
Slides erasmusSlides erasmus
Slides erasmus
 
Slides astin
Slides astinSlides astin
Slides astin
 
slides CIRM copulas, extremes and actuarial science
slides CIRM copulas, extremes and actuarial scienceslides CIRM copulas, extremes and actuarial science
slides CIRM copulas, extremes and actuarial science
 
Slides amsterdam-2013
Slides amsterdam-2013Slides amsterdam-2013
Slides amsterdam-2013
 
Slides ensae 9
Slides ensae 9Slides ensae 9
Slides ensae 9
 
Slides ads ia
Slides ads iaSlides ads ia
Slides ads ia
 
Multivariate Distributions, an overview
Multivariate Distributions, an overviewMultivariate Distributions, an overview
Multivariate Distributions, an overview
 
Slides erm
Slides ermSlides erm
Slides erm
 
Slides univ-van-amsterdam
Slides univ-van-amsterdamSlides univ-van-amsterdam
Slides univ-van-amsterdam
 
Slides sales-forecasting-session2-web
Slides sales-forecasting-session2-webSlides sales-forecasting-session2-web
Slides sales-forecasting-session2-web
 

Viewers also liked (19)

Slides ineq-1
Slides ineq-1Slides ineq-1
Slides ineq-1
 
Slides 2016 ineq_1
Slides 2016 ineq_1Slides 2016 ineq_1
Slides 2016 ineq_1
 
Slides inequality 2017
Slides inequality 2017Slides inequality 2017
Slides inequality 2017
 
Econometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 NonlinearitiesEconometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 Nonlinearities
 
Slides ensae-2016-1
Slides ensae-2016-1Slides ensae-2016-1
Slides ensae-2016-1
 
Slides ensae-2016-2
Slides ensae-2016-2Slides ensae-2016-2
Slides ensae-2016-2
 
IA-advanced-R
IA-advanced-RIA-advanced-R
IA-advanced-R
 
Slides ensae-2016-4
Slides ensae-2016-4Slides ensae-2016-4
Slides ensae-2016-4
 
15 03 16_data sciences pour l'actuariat_f. soulie fogelman
15 03 16_data sciences pour l'actuariat_f. soulie fogelman15 03 16_data sciences pour l'actuariat_f. soulie fogelman
15 03 16_data sciences pour l'actuariat_f. soulie fogelman
 
Slides ensae-2016-6
Slides ensae-2016-6Slides ensae-2016-6
Slides ensae-2016-6
 
Slides ensae-2016-9
Slides ensae-2016-9Slides ensae-2016-9
Slides ensae-2016-9
 
Slides ensae-2016-8
Slides ensae-2016-8Slides ensae-2016-8
Slides ensae-2016-8
 
Slides ensae-2016-7
Slides ensae-2016-7Slides ensae-2016-7
Slides ensae-2016-7
 
Slides ensae-2016-5
Slides ensae-2016-5Slides ensae-2016-5
Slides ensae-2016-5
 
Slides ensae-2016-3
Slides ensae-2016-3Slides ensae-2016-3
Slides ensae-2016-3
 
Slides ensae-2016-10
Slides ensae-2016-10Slides ensae-2016-10
Slides ensae-2016-10
 
Slides erm-cea-ia
Slides erm-cea-iaSlides erm-cea-ia
Slides erm-cea-ia
 
HdR
HdRHdR
HdR
 
Slides lln-risques
Slides lln-risquesSlides lln-risques
Slides lln-risques
 

Similar to Slides ineq-2

Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksFederico Cerutti
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheetSuvrat Mishra
 
Logit model testing and interpretation
Logit model testing and interpretationLogit model testing and interpretation
Logit model testing and interpretationFelipe Affonso
 
The Binomial,poisson _ Normal Distribution (1).ppt
The Binomial,poisson _ Normal Distribution (1).pptThe Binomial,poisson _ Normal Distribution (1).ppt
The Binomial,poisson _ Normal Distribution (1).pptIjaz Manzoor
 
slides_online_optimization_david_mateos
slides_online_optimization_david_mateosslides_online_optimization_david_mateos
slides_online_optimization_david_mateosDavid Mateos
 
Docslide.us 2002 formulae-and-tables
Docslide.us 2002 formulae-and-tablesDocslide.us 2002 formulae-and-tables
Docslide.us 2002 formulae-and-tablesbarasActuarial
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheetJoachim Gwoke
 
Multivriada ppt ms
Multivriada   ppt msMultivriada   ppt ms
Multivriada ppt msFaeco Bot
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4Arthur Charpentier
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingArthur Charpentier
 
Uncertainty in deep learning
Uncertainty in deep learningUncertainty in deep learning
Uncertainty in deep learningYujiro Katagiri
 
A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015Junfeng Liu
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2BarryK88
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)NYversity
 

Similar to Slides ineq-2 (20)

Introduction to Evidential Neural Networks
Introduction to Evidential Neural NetworksIntroduction to Evidential Neural Networks
Introduction to Evidential Neural Networks
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheet
 
Unit II PPT.pptx
Unit II PPT.pptxUnit II PPT.pptx
Unit II PPT.pptx
 
Probability Cheatsheet.pdf
Probability Cheatsheet.pdfProbability Cheatsheet.pdf
Probability Cheatsheet.pdf
 
Logit model testing and interpretation
Logit model testing and interpretationLogit model testing and interpretation
Logit model testing and interpretation
 
The Binomial,poisson _ Normal Distribution (1).ppt
The Binomial,poisson _ Normal Distribution (1).pptThe Binomial,poisson _ Normal Distribution (1).ppt
The Binomial,poisson _ Normal Distribution (1).ppt
 
slides_online_optimization_david_mateos
slides_online_optimization_david_mateosslides_online_optimization_david_mateos
slides_online_optimization_david_mateos
 
Docslide.us 2002 formulae-and-tables
Docslide.us 2002 formulae-and-tablesDocslide.us 2002 formulae-and-tables
Docslide.us 2002 formulae-and-tables
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheet
 
Multivriada ppt ms
Multivriada   ppt msMultivriada   ppt ms
Multivriada ppt ms
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Nested sampling
Nested samplingNested sampling
Nested sampling
 
Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4Slides econometrics-2018-graduate-4
Slides econometrics-2018-graduate-4
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 
stochastic processes assignment help
stochastic processes assignment helpstochastic processes assignment help
stochastic processes assignment help
 
Side 2019, part 2
Side 2019, part 2Side 2019, part 2
Side 2019, part 2
 
Uncertainty in deep learning
Uncertainty in deep learningUncertainty in deep learning
Uncertainty in deep learning
 
A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)
 

More from Arthur Charpentier (20)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
Side 2019 #9
Side 2019 #9Side 2019 #9
Side 2019 #9
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 
Side 2019 #7
Side 2019 #7Side 2019 #7
Side 2019 #7
 

Slides ineq-2

  • 1. Arthur CHARPENTIER - Welfare, Inequality and Poverty Arthur Charpentier charpentier.arthur@gmail.com http ://freakonometrics.hypotheses.org/ Université de Rennes 1, January 2015 Welfare, Inequality & Poverty, # 2 1
  • 2. Arthur CHARPENTIER - Welfare, Inequality and Poverty Modeling Income Distribution Let {x1, · · · , xn} denote some sample. Then x = 1 n n i=1 xi = n i=1 1 n xi This can be used when we have census data. 1 load ( u r l ( " http : // freakonometrics . f r e e . f r / income_5. RData" ) ) 2 income <− s o r t ( income ) 3 plot ( 1 : 5 , income ) income qq q q q 0 50000 100000 150000 200000 250000 It is possible to use survey data. If πi denote the probability to be drawn, use weights ωi ∝ 1 nπi 2
  • 3. Arthur CHARPENTIER - Welfare, Inequality and Poverty The weighted average is then xω = n i=1 ωi ω xi where ω = ωi. This is an unbaised estimator of the population mean. Sometime, data are obtained from stratified samples : before sampling, members of the population are groupes in homogeneous subgroupes (called a strata). Given S strata, such that the population in strata s is Ns, then xS = S s=1 Ns N xs where xs = 1 Ns i∈Ss xi 3
  • 4. Arthur CHARPENTIER - Welfare, Inequality and Poverty Statistical Tools Used to Describe the Distribution Consider a sample {x1, · · · , xn}. Usually, the order is not important. So let us order those values, x1:n min{xi} ≤ x2:n ≤ · · · ≤ xn−1:n ≤ xn:n max{xi} As usual, assume that xi’s were randomly drawn from an (unknown) distribution F. If F denotes the cumulative distribution function, F(x) = P(X ≤ x), one can prove that F(xi:n) = P(X ≤ xi:n) ∼ i n The quantile function is defined as the inverse of the cumulative distribution function F, Q(u) = F−1 (u) or F(Q(u)) = P(X ≤ Q(u)) = u 4
  • 5. Arthur CHARPENTIER - Welfare, Inequality and Poverty Lorenz curve The empirical version of Lorenz curve is L =    i n , 1 nx j≤i xj:n    1 > plot ( ( 0 : 5 ) / 5 , c (0 ,cumsum( income ) /sum( income ) ) ) 5 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Lorenz curve p L(p) q q q q
  • 6. Arthur CHARPENTIER - Welfare, Inequality and Poverty Gini Coefficient Gini coefficient is defined as the ratio of areas, A A + B . It can be defined using order statistics as G = 2 n(n − 1)x n i=1 i · xi:n − n + 1 n − 1 1 > n <− length ( income ) 2 > mu <− mean( income ) 3 > 2∗sum ( ( 1 : n) ∗ s o r t ( income ) ) / (mu∗n∗ (n−1))−(n +1)/ (n−1) 4 [ 1 ] 0.5800019 6 q q 0.0 0.2 0.4 0.6 0.8 1.0 0.00.20.40.60.81.0 p L(p) q q q qA B
  • 7. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution Fitting Assume that we now have more observations, 1 > load ( u r l ( " http : // freakonometrics . f r e e . f r /income_500. RData" ) ) We can use some histogram to visualize the distribu- tion of the income 1 > summary( income ) 2 Min . 1 st Qu. Median Mean 3rd Qu. Max. 3 2191 23830 42750 77010 87430 2003000 4 > s o r t ( income ) [ 4 9 5 : 5 0 0 ] 5 [ 1 ] 465354 489734 512231 539103 627292 2003241 6 > h i s t ( income , breaks=seq (0 ,2005000 , by=5000) ) Histogram of income income Frequency 0 500000 1000000 1500000 2000000 010203040 7
  • 8. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution Fitting Because of the dispersion, look at the histogram of the logarithm of the data 1 > h i s t ( log ( income , 1 0 ) , breaks=seq ( 3 , 6 . 5 , length =51) ) 2 > boxplot ( income , h o r i z o n t a l=TRUE, log=" x " ) qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq q q 2e+03 1e+04 5e+04 2e+05 1e+06 Histogram of log(income, 10) log(income, 10) Frequency 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 010203040 8
  • 9. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution Fitting The cumulative distribution function (on the log of the income) 1 > u <− s o r t ( income ) 2 > v <− ( 1 : 5 0 0 ) /500 3 > plot (u , v , type=" s " , log=" x " ) Income (log scale) CumulatedProbabilities 2e+03 1e+04 5e+04 2e+05 1e+06 0.00.20.40.60.81.0 9
  • 10. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution Fitting If we invert that graph, we have the quantile function 1 > plot (v , u , type=" s " , c o l=" red " , log=" y " ) 10 Probabilities Income(logscale) 0.0 0.2 0.4 0.6 0.8 1.0 2e+031e+045e+042e+051e+06
  • 11. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution Fitting On that dataset, Lorenz curve is 1 > plot ( ( 0 : 5 0 0 ) / 500 , c (0 ,cumsum( income ) /sum( income ) ) ) q q 0.0 0.2 0.4 0.6 0.8 1.0 0.00.20.40.60.81.0 p L(p) 11
  • 12. Arthur CHARPENTIER - Welfare, Inequality and Poverty Distribution and Confidence Intervals There are two techniques to get the distribution of an estimator θ, – a parametric one, based on some assumptions on the underlying distribution, – a nonparametric one, based on sampling techniques If Xi’s have a N(µ, σ2 ) distribution, then X ∼ N µ, σ2 n But sometimes, distribution can only be obtained as an approximation, because of asymptotic properties. From the central limit theorem, X → N µ, σ2 n as n → ∞ In the nonparametric case, the idea is to generate pseudo-samples of size n, by resampling from the original distribution. 12
  • 13. Arthur CHARPENTIER - Welfare, Inequality and Poverty Bootstraping Consider a sample x = {x1, · · · , xn}. At step b = 1, 2, · · · , B, generate a pseudo sample xb by sampling (with replacement) within sample x. Then compute any statistic θ(xb ) 1 > boot <− function ( sample , f , b=500){ 2 + F <− rep (NA, b) 3 + n <− length ( sample ) 4 + f o r ( i in 1: b) { 5 + idx <− sample ( 1 : n , s i z e=n , r e p l a c e=TRUE) 6 + F[ i ] <− f ( sample [ idx ] ) } 7 + return (F) } 13
  • 14. Arthur CHARPENTIER - Welfare, Inequality and Poverty Bootstraping Let us generate 10,000 bootstraped sample, and com- pute Gini index on those 1 >boot_g i n i <− boot ( income , gini ,1 e4 ) To visualize the distribution of the index 1 > h i s t ( boot_gini , p r o b a b i l i t y=TRUE) 2 > u <− seq ( . 4 , . 7 , length =251) 3 > v <− dnorm(u , mean( boot_g i n i ) , sd ( boot_g i n i ) ) 4 > l i n e s (u , v , c o l=" red " , l t y =2) boot_gini Density 0.45 0.50 0.55 0.60 051015 14
  • 15. Arthur CHARPENTIER - Welfare, Inequality and Poverty Continuous Versions The empirical cumulative distribution function Fn(x) = 1 n n i=1 1(xi ≤ x) Observe that Fn(xj:n) = j n If F is absolutely continuous, F(x) = x 0 f(t)dt i.e. f(x) = dF(x) dx . Then P(x ∈ [a, b]) = b a f(t)dt = F(b) − F(a). 15
  • 16. Arthur CHARPENTIER - Welfare, Inequality and Poverty Continuous Versions One can define quantiles as x = Q(p) = F−1 (p) The expected value is µ = ∞ 0 xf(x)dx = ∞ 0 [1 − F(x)]dx = 1 0 Q(p)dp. We can compute the average standard of living of the group below z. This is equivalent to the expectation of a truncated distribution. µ− z = 1 F(z) z 0 xf(x)dx = ∞ 0 1 − F(x) F(z) fx 16
  • 17. Arthur CHARPENTIER - Welfare, Inequality and Poverty Continuous Versions Lorenz curve is p → L(p) with L(p) = 1 µ Q(p) 0 xf(x)dx Gastwirth (1971) proved that L(p) = 1 µ p 0 Q(u)du = p 0 Q(u)du 1 0 Q(u)du The numerator sums the incomes of the bottom p proportion of the population. The denominator sums the incomes of all the population. L is a [0, 1] → [0, 1] function, continuous if F is continuous. Observe that L is increasing, since dL(p) dp = Q(p) µ Further, L is convex 17
  • 18. Arthur CHARPENTIER - Welfare, Inequality and Poverty The sample case L i n = i j=1 xj:n n j=1 xj:n The points {i/n, L(i/n)} are then linearly interpolated to complete the corresponding Lorenz curve. The continuous distribution case L(p) = F −1 (p) 0 ydF(y) ∞ 0 ydF(y) = 1 E(X) p 0 F−1 (u)du with p ∈ (0, 1). Let L be a continuous function on [0, 1], then L is a Lorenz curve if and only if L(0) = 0, L(1) = 1, L (0+ ) ≥ 0 and L (p) ≥ 0 on [0, 1]. 18
  • 19. Arthur CHARPENTIER - Welfare, Inequality and Poverty From Lorenz to Bonferroni The Bonferroni curve is B(p) = L(p) p and the Bonferroni index is BI = 1 − 1 0 B(p)dp. Define Pi = i n and Qi = 1 nx i j=1 xj then B = 1 n − 1 n−1 i=1 Pi − Qi Pi 19
  • 20. Arthur CHARPENTIER - Welfare, Inequality and Poverty Gini and Pietra indices The Gini index is defined as twice the area between the egalitarian line and the Lorenz curve G = 2 1 0 [p − L(p)]dp = 1 − 2 1 0 L(p)dp which can also be writen 1 − 1 E(X) ∞ 0 [1 − F(x)]2 dx Pietra index is defined as the maximal vertical deviation between the Lorenz curve and the egalitarian line P = max p∈(0,1) {p − L(p)} = E(|X − E(X)|) 2E(X) if F is strictly increasing (the maximum is reached in p = F(E(X))) 20
  • 21. Arthur CHARPENTIER - Welfare, Inequality and Poverty Examples E.g. consider the uniform distribution F(x) = min{1, x − a b − a 1(x ≥ a)} Then L(p) = 2ap + (b − a)2 p2 a + b and Gini index is G = b − a 3(a + b) E.g. consider a Pareto distribution, F(x) = 1 − x0 x α , x ≥ x0, with shape parameter α > 0. Then F−1 (u) = x0 (1 − u) 1 α 21
  • 22. Arthur CHARPENTIER - Welfare, Inequality and Poverty and L(p) = 1 − [1 − p]1− 1 α p ∈ (0, 1). and Gini index is G = 1 2α − 1 while Pietra index is, if α > 1 P = (α − 1)α−1 αα E.g. consider the lognormal distribution, F(x) = Φ log x − µ σ then L(p) = Φ(Φ−1 (p) − σ) p ∈ (0, 1). and Gini index is G = 2Φ σ √ 2 − 1 22
  • 23. Arthur CHARPENTIER - Welfare, Inequality and Poverty Fitting a Distribution The standard technique is based on maximum likelihood estimation, provided by 1 > l i b r a r y (MASS) 2 > f i t d i s t r ( income , " lognormal " ) 3 meanlog sdlog 4 10.72264538 1.01091329 5 ( 0.04520942) ( 0.03196789) For other distribution (such as the Gamma distribution), we might have to rescale 1 > ( f i t_g <− f i t d i s t r ( income/1e2 , "gamma" ) ) 2 shape rate 3 1.0812757769 0.0014040438 4 (0.0473722529) (0.0000544185) 5 > ( f i t_ln <− f i t d i s t r ( income/1e2 , " lognormal " ) ) 6 meanlog sdlog 7 6.11747519 1.01091329 8 (0.04520942) (0.03196789) 23
  • 24. Arthur CHARPENTIER - Welfare, Inequality and Poverty Fitting a Distribution We can compare the densities 1 > u=seq (0 ,2 e5 , length =251) 2 > h i s t ( income , breaks=seq (0 ,2005000 , by=5000) , c o l=rgb ( 0 , 0 , 1 , . 5 ) , border=" white " , xlim=c (0 ,2 e5 ) , p r o b a b i l i t y=TRUE) 3 > v_g <− dgamma(u/1e2 , f i t_g$ estimate [ 1 ] , f i t _g$ estimate [ 2 ] ) /1e2 4 > v_ln <− dlnorm (u/1e2 , f i t_ln $ estimate [ 1 ] , f i t_ln $ estimate [ 2 ] ) /1e2 5 > l i n e s (u , v_g , c o l=" red " , l t y =2) 6 > l i n e s (u , v_ln , c o l=rgb ( 1 , 0 , 0 , . 4 ) ) Income CumulatedProbabilities 0 50000 100000 150000 200000 0.00.20.40.60.81.0 Gamma Log Normal 24
  • 25. Arthur CHARPENTIER - Welfare, Inequality and Poverty Fitting a Distribution or the cumuluative distributions 1 x <− s o r t ( income ) 2 y <− ( 1 : 5 0 0 ) /500 3 plot (x , y , type=" s " , c o l=" black " ) 4 v_g <− pgamma(u/1e2 , f i t_g$ estimate [ 1 ] , f i t_g $ estimate [ 2 ] ) 5 v_ln <− plnorm (u/1e2 , f i t_ln $ estimate [ 1 ] , f i t _ln $ estimate [ 2 ] ) 6 l i n e s (u , v_g , c o l=" red " , l t y =2) 7 l i n e s (u , v_ln , c o l=rgb ( 1 , 0 , 0 , . 4 ) ) income Density 0 50000 100000 150000 200000 0.0e+005.0e−061.0e−051.5e−05 Gamma Log Normal One might consider the parametric version of Lorenz curve, to confirm the goodness of fit, e.g. a lognormal distribution with σ = 1 since 1 > f i t d i s t r ( income , " lognormal " ) 2 meanlog sdlog 3 10.72264538 1.01091329 25
  • 26. Arthur CHARPENTIER - Welfare, Inequality and Poverty Fitting a Distribution We can use functions of R 1 l i b r a r y ( ineq ) 2 Lc . sim <− Lc ( income ) 3 plot ( 0 : 1 , 0 : 1 , xlab="p" , ylab="L(p) " , c o l=" white " ) 4 polygon ( c (0 ,1 ,1 ,0) , c (0 ,0 ,1 ,0) , c o l=rgb ( 0 , 0 , 1 , . 1 ) , border=NA) 5 polygon ( Lc . sim$p , Lc . sim$L , c o l=rgb ( 0 , 0 , 1 , . 3 ) , border=NA) 6 l i n e s ( Lc . sim ) 7 segments (0 ,0 ,1 ,1) 8 l i n e s ( Lc . lognorm , parameter =1, l t y =2) q q 0.0 0.2 0.4 0.6 0.8 1.0 0.00.20.40.60.81.0 p L(p) 26
  • 27. Arthur CHARPENTIER - Welfare, Inequality and Poverty Standard Parametric Distribution For those distributions, we mention the R names in the gamlss package. Inference can be done using 1 f i t <−gamlss (y~ 1 , family=LNO) • log normal f(x) = 1 xσ √ 2π e− (ln x−µ)2 2σ2 , x ≥ 0 with mean eµ+σ2 /2 , median eµ , and variance (eσ2 − 1)e2µ+σ2 1 LNO(mu. l i n k = " i d e n t i t y " , sigma . l i n k = " log " ) 2 dLNO(x , mu = 1 , sigma = 0.1 , nu = 0 , log = FALSE) • gamma f(x) = x1/σ2 −1 exp[−x/(σ2 µ)] (σ2µ)1/σ2 Γ(1/σ2) , x ≥ 0 27
  • 28. Arthur CHARPENTIER - Welfare, Inequality and Poverty with mean µ and variance σ2 1 GA(mu. l i n k = " log " , sigma . l i n k =" log " ) 2 dGA(x , mu = 1 , sigma = 1 , log = FALSE) • Pareto f(x) = α xα m xα+1 for x ≥ xm with cumulated distribution F(x) = 1 − xm x α for x ≥ xm with mean αxm (α − 1) if α > 1, and variance x2 mα (α − 1)2(α − 2) if α > 2. 1 PARETO2(mu. l i n k = " log " , sigma . l i n k = " log " ) 2 dPARETO2(x , mu = 1 , sigma = 0.5 , log = FALSE) 28
  • 29. Arthur CHARPENTIER - Welfare, Inequality and Poverty Larger Families • GB1 - generalized Beta type 1 f(x) = |a|xap−1 (1 − (x/b)a )q−1 bapB(p, q) , 0 < xa < ba where b , p , and q are positive 1 GB1(mu. l i n k = " l o g i t " , sigma . l i n k = " l o g i t " , nu . l i n k = " log " , tau . l i n k = " log " ) 2 dGB1(x , mu = 0.5 , sigma = 0.4 , nu = 1 , tau = 1 , log = FALSE) The GB1 family includes the generalized gamma(GG), and Pareto as special cases. • GB2 - generalized Beta type 2 f(x) = |a|xap−1 bapB(p, q)(1 + (x/b)a)p+q
  • 30. Arthur CHARPENTIER - Welfare, Inequality and Poverty 1 GB2(mu. l i n k = " log " , sigma . l i n k = " i d e n t i t y " , nu . l i n k = " log " , tau . l i n k = " log " ) 4 dGB2(x , mu = 1 , sigma = 1 , nu = 1 , tau = 0 .5 , log = FALSE) The GB2 nests common distributions such as the generalized gamma (GG), Burr, lognormal, Weibull, Gamma, Rayleigh, Chi-square, Exponential, and the log-logistic. • Generalized Gamma f(x) = (p/ad )xd−1 e−(x/a)p Γ(d/p) , 30
  • 31. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data 1 > load ( u r l ( " http : // freakonometrics . f r e e . f r /income_binned . RData" ) ) 2 > head ( income_binned ) 3 low high number mean std_e r r 4 1 0 4999 95 3606 964 5 2 5000 9999 267 7686 1439 6 3 10000 14999 373 12505 1471 7 4 15000 19999 350 17408 1368 8 5 20000 24999 329 22558 1428 9 6 25000 29999 337 27584 1520 10 > t a i l ( income_binned ) 11 low high number mean std_e r r 12 46 225000 229999 10 228374 1197 13 47 230000 234999 13 232920 1370 14 48 235000 239999 11 236341 1157 15 49 240000 244999 14 242359 1474 16 50 245000 249999 11 247782 1487 17 51 250000 I n f 228 395459 189032 31
  • 32. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data There is a dedicated package to work with such datasets, 1 > l i b r a r y ( b i n e q u a l i t y ) To fit a parametric distribution, e.g. a log-normal distribution, use functions of R 1 > n <− nrow ( income_binned ) 2 > f i t_LN <− fitFunc (ID=rep ( " Fake Data " ,n) , hb=income_binned [ , " number " ] , bin_min=income_binned [ , " low " ] , bin_max=income_binned [ , " high " ] , obs_mean=income_binned [ , "mean" ] , ID_name=" Country " , d i s t r i b u t i o n= LNO, distName="LNO" ) 3 Time d i f f e r e n c e of 0.09900618 s e c s 4 f o r LNO f i t across 1 d i s t r i b u t i o n s 32
  • 33. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data To visualize the cumulated distribution function, use 1 > N <− income_binned $number 2 > y1 <− cumsum(N) /sum(N) 3 > u <− seq (min( income_binned $low ) ,max( income _binned $low ) , length =101) 4 > v <− plnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$ parameters [ 2 ] ) 5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab=" Income " , ylab=" Cumulative Probability " ) 6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , 0 , income_binned $ high [ i ] , y1 [ i ] , c o l=rgb ( 1 , 0 , 0 , . 2 ) ) 7 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , y1 [ i ] , income_binned $ high [ i ] , c (0 , y1 ) [ i ] , c o l=rgb ( 1 , 0 , 0 , . 4 ) ) 0 50000 100000 150000 200000 250000 0.00.20.40.60.8 Income CumulativeProbability 33
  • 34. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data and to visualize the cumulated distribution function, use 1 > N=income_binned $number 2 > y2=N/sum(N) / d i f f ( income_binned $low ) 3 > u=seq (min( income_binned $low ) ,max( income_ binned $low ) , length =101) 4 > v=dlnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$ parameters [ 2 ] ) 5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab=" Income " , ylab=" Density " ) 6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , 0 , income_binned $ high [ i ] , y2 [ i ] , c o l=rgb ( 1 , 0 , 0 , . 2 ) , border=" white " ) 0 50000 100000 150000 200000 250000 0.0e+005.0e−061.0e−051.5e−05 Income Density 34
  • 35. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data But it is also possible to estimate all GB-distributions at once, 1 > f i t s=run_GB_family (ID=rep ( " Fake Data " ,n) ,hb=income_binned [ , " number " ] , bin_min=income_binned [ , " low " ] , bin_max=income_binned [ , " high " ] , obs _mean=income_binned [ , "mean" ] , 2 + ID_name=" Country " ) 3 Time d i f f e r e n c e of 0.03800201 s e c s 4 f o r GB2 f i t across 1 d i s t r i b u t i o n s 5 6 Time d i f f e r e n c e of 0.3090181 s e c s 7 f o r GG f i t across 1 d i s t r i b u t i o n s 8 9 Time d i f f e r e n c e of 0.864049 s e c s 10 f o r BETA2 f i t across 1 d i s t r i b u t i o n s ... 1 Time d i f f e r e n c e of 0.04900193 s e c s
  • 36. Arthur CHARPENTIER - Welfare, Inequality and Poverty 2 f o r LOGLOG f i t across 1 d i s t r i b u t i o n s 3 6 Time d i f f e r e n c e of 1.865106 s e c s 7 f o r PARETO2 f i t across 1 d i s t r i b u t i o n s 1 > f i t s $ f i t . f i l t e r [ , c ( " g i n i " , " a i c " , " bic " ) ] 2 g i n i a i c bic 3 1 NA NA NA 4 2 5.054377 34344.87 34364.43 5 3 5.110104 34352.93 34372.48 6 4 NA 53638.39 53657.94 7 5 4.892090 34845.87 34865.43 8 6 5.087506 34343.08 34356.11 9 7 4.702194 34819.55 34832.59 10 8 4.557867 34766.38 34779.41 11 9 NA 58259.42 58272.45 12 10 5.244332 34805.70 34818.73 1 > f i t s $ best_model$ a i c
  • 37. Arthur CHARPENTIER - Welfare, Inequality and Poverty 2 Country obsMean d i s t r i b u t i o n estMean var 5 1 Fake Data NA LNO 72328.86 6969188937 6 cv cv_sqr g i n i t h e i l MLD 7 1 1.154196 1.332168 5.087506 0.4638252 0.4851275 8 a i c bic didConverge l o gL ike li ho od nparams 9 1 34343.08 34356.11 TRUE −17169.54 2 10 median sd 11 1 44400.23 83481.67 That was easy, those were simulated data... 37
  • 38. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data Consider now some real data, 1 > data = read . table ( " http : // freakonometrics . f r e e . f r /us_income . txt " , sep=" , " , header=TRUE) 2 > head ( data ) 3 low high number_1000 s mean std_e r r 4 1 0 4999 4245 1249 50 5 2 5000 9999 5128 7923 30 6 3 10000 14999 7149 12389 28 7 4 15000 19999 7370 17278 26 8 > t a i l ( data ) 9 low high number_1000 s mean std_e r r 10 39 190000 194999 361 192031 115 11 40 195000 199999 291 197120 135 12 41 200000 249999 2160 219379 437 13 42 250000 9999999 2498 398233 6519 38
  • 39. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data To fit a parametric distribution, e.g. a log-normal distribution, use 1 > n <− nrow ( data ) 2 > f i t_LN <− fitFunc (ID=rep ( "US" ,n) , hb=data [ , " number_1000 s " ] , bin_min =data [ , " low " ] , bin_max=data [ , " high " ] , obs_mean=data [ , "mean" ] , ID_ name=" Country " , d i s t r i b u t i o n=LNO, distName="LNO" ) 3 Time d i f f e r e n c e of 0.1040058 s e c s 4 f o r LNO f i t across 1 d i s t r i b u t i o n s 39
  • 40. Arthur CHARPENTIER - Welfare, Inequality and Poverty Dealing with Binned Data To visualize the cumulated distribution function, use 1 > N <− income_binned $number 2 > y1 <− cumsum(N) /sum(N) 3 > u <− seq (min( income_binned $low ) ,max( income _binned $low ) , length =101) 4 > v <− plnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$ parameters [ 2 ] ) 5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab=" Income " , ylab=" Cumulative Probability " ) 6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , 0 , income_binned $ high [ i ] , y1 [ i ] , c o l=rgb ( 1 , 0 , 0 , . 2 ) ) 7 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , y1 [ i ] , income_binned $ high [ i ] , c (0 , y1 ) [ i ] , c o l=rgb ( 1 , 0 , 0 , . 4 ) ) 0 50000 100000 150000 200000 250000 0.00.20.40.60.8 Income CumulativeProbability 40
  • 41. Arthur CHARPENTIER - Welfare, Inequality and Poverty and to visualize the cumulated distribution function, use 1 > N=income_binned $number 2 > y2=N/sum(N) / d i f f ( income_binned $low ) 3 > u=seq (min( income_binned $low ) ,max( income_ binned $low ) , length =101) 4 > v=dlnorm (u , f i t_LN$ parameters [ 1 ] , f i t_LN$ parameters [ 2 ] ) 5 > plot (u , v , c o l=" blue " , type=" l " , lwd=2, xlab=" Income " , ylab=" Density " ) 6 > f o r ( i in 1 : ( n−1)) r e c t ( income_binned $low [ i ] , 0 , income_binned $ high [ i ] , y2 [ i ] , c o l=rgb ( 1 , 0 , 0 , . 2 ) , border=" white " ) 0 50000 100000 150000 200000 250000 0.0e+005.0e−061.0e−051.5e−05 Income Density And the winner is.... 1 > f i t s $ f i t . f i l t e r [ , c ( " g i n i " , " a i c " , " bic " ) ] 2 g i n i a i c bic 3 1 4.413411 825368.7 825407.4
  • 42. Arthur CHARPENTIER - Welfare, Inequality and Poverty 4 2 4.395078 825598.8 825627.9 7 3 4.455112 825502.4 825531.5 8 4 4.480844 825881.5 825910.6 9 5 4.413282 825315.3 825344.4 10 6 4.922123 832408.2 832427.6 11 7 4.341085 827065.2 827084.6 12 8 4.318694 826112.9 826132.2 13 9 NA 831054.2 831073.6 14 10 NA NA NA 1 > f i t s $ best_model$ a i c 2 Country obsMean d i s t r i b u t i o n estMean var 3 1 US NA GG 65147.54 3152161910 4 cv cv_sqr g i n i t h e i l MLD 5 1 0.8617995 0.7426984 4.395078 0.3251443 0.3904942 6 a i c bic didConverge l o gL ike li ho od nparams 7 1 825598.8 825627.9 TRUE −412796.4 3 8 median sd 9 1 48953.6 56144.12 42
  • 43. Arthur CHARPENTIER - Welfare, Inequality and Poverty 43