For a long time, the study of parametric families of probability distributions has received profound interest due
to their usefulness in data analytics. My present work is an account of one approach which has generated a great
deal of activity.
There has been a general proclivity in the statistical literature to adopt more flexible methods to analyze data and
to represent features of the data as adequately and exhaustively as possible so as to reduce unrealistic and
unreliable assumptions. For the treatment of continuous observations within a parametric domain, one aspect
which has been little affected by the above process is the overwhelming role played by the assumption of
normality, which underlies the vast majority of such analyses. A major reason for this is certainly the unrivaled
mathematical tractability of the normal distribution and its simplicity in dealing with even the most complex data.
From a practical viewpoint, the most commonly adopted approach is transformation of the variables to achieve
normality which works satisfactorily for most of the data. However, this approach is not without its faults:
(i) the transformations are usually on each component separately, and achievement of joint normality is
only hoped for;
(ii) the transformed variables are more difficult to deal with in case of interpretations, especially when
each variable is transformed using a different function;
(iii) in case homoscedasticity is required, this often necessitates a different transformation from the one
for normality.
Alternatively, there exist several other parametric classes of distributions to choose from, many of which
are reviewed by Johnson & Kotz (1972). A special mention is due to the hyperbolic distribution
and its generalized version, which form a fairly flexible and mathematically tractable parametric class (Refer
to Barndorff-Nielsen & Blæsild (1983) for a summary account, and Blæsild (1981) for a detailed treatment
of the bivariate case and a numerical example). Except for data transformation, however, no alternative
method to the normal distribution has been adopted for regular use in applied work, within the framework
considered here of a parametric approach to handle continuous data.
Generally, it can be seen that the motive is to start from a symmetric distribution and then by a suitable
modification, generate a set of asymmetric distributions. In the present case, the simplest effect is the introduction of
skewness in the distribution under consideration. It should be noted, however, that my focus is not the
quintessential nature of skewness or methods for measuring it. Rather, my aim is to generate more
flexible and realistic parametric distributions that allow a possible departure from symmetry, in other words,
distributions incorporating skewness, for use in statistical work.
The present paper examines a different direction of the above broad problem, namely the possibility to extend
some of the classical methods to the class of skew normal distributions which has recently been discussed
by Azzalini & Dalla Valle (1996). We aim at demonstrating that this distribution achieves a reasonable
flexibility in real data fitting, while it maintains a number of convenient formal properties of the normal one.
The concentrated development of research in this area has attracted the interests of a large number of
statisticians recently.
In this introductory phase, let me summarize the different chapters.
 The motivation for such a class of distribution and the key concept of my development are formulated
in Chapter 1.
 Chapter 2 deals with the form of the distribution and the source for such a form.
 The different significant properties and the similarities and dissimilarities with the normality have
been nurtured in Chapter 3.
 Chapter 4 revolves around one such property, with the help of which we can generate skew-normal data.
 Chapter 5 pertains to the different methods of estimating parameters and comparing with those of
normality.
 In Chapter 6, I have analyzed the Roberts IQ data and fitted a skew normal distribution to it, showing
that it is preferable to the normal family.
 Chapter 7 discusses the applicability of skew normal distributions in a vast number of fields.
Various references are listed in the following chapters for further study. I have
also taken particular care to follow the correct technical conventions when introducing a statistical or mathematical
term. I hope you will enjoy this paper.
CHAPTER 1:
MOTIVATION BEHIND SKEW NORMAL DISTRIBUTION
A question naturally comes to our mind that with vast families of skewed probability distributions currently
available, do we really need any more? Can't we be happy with what we have? The answer is no, and there are
three motivations behind this answer.
The first lies in the essence of the mechanism itself, which starts with a continuous symmetric density function
which is then modified to generate a variety of alternative forms. The set of densities so constructed includes the
original symmetric one as an 'interior point'. (Let S be a subset of a topological space X. Then x is an 'interior
point' of S if some open subset of X containing x is contained in S.) Let us gaze upon the normal family, which is of huge
prominence. It is a well-known fact that the normal distribution is the limiting form of a large number of
non-normal parametric families, whereas in the following construction it is the 'central' form of a set of alternatives. This
situation is more in line with the common perception of the normal distribution as ‘central’ with respect to
others, which represent ‘departures from normality’ rather than ‘incomplete convergence to normality’.
The second motivation lies in the applicability of such tractable distributions. When assuming normality, we
also implicitly assume symmetry, while in reality the data may not be symmetric. Hence the removal of such a
restriction becomes a necessity.
The last motivation derives from the mathematical elegance and convenience of the construction in two
respects. Firstly, the new family of distributions emerges from a simple process not requiring complex
formulations, as we shall see later; secondly, the newly generated distributions closely resemble the
original ones and retain some properties of the parent symmetric distributions.
*) A general construction:
Since the role of symmetricity is crucial in our development, we recall the condition of central symmetry:
according to Serfling (2006), a random variable X is centrally symmetric about 0 if it is distributed as −X. Let us
state a proposition:
Denote by f0 a probability density function on Rd, by G0(·) a continuous distribution function on the real line,
and by w(·) a real valued function on Rd, such that f0(−x) = f0(x), w(−x) = −w(x), G0(−y) = 1 − G0(y) for all
x ∈ Rd, y ∈ R. Then f(x) = 2 f0(x) G0{w(x)} is a density function on Rd.
Technical proof:
Note that g(x) = 2[G0{w(x)} − 1/2] f0(x) is an odd function, and it is integrable because |g(x)| ≤ f0(x). Then
0 = ∫_{R^d} g(x) dx = ∫_{R^d} 2 f0(x) G0{w(x)} dx − 1,
so f(x) = 2 f0(x) G0{w(x)} integrates to 1 and, being non-negative, is a density. (QED)
Thus we have encapsulated the motivations and the construction behind the formulation of the skew-normal
distribution, and we now go on to its distributional form in the next chapter.
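The proposition is easy to verify numerically. The sketch below (in Python; the particular choices of f0 as the normal density, G0 as the logistic cdf and w(x) = x³ are mine, purely for illustration) checks that the modulated function integrates to 1:

```python
import math

def f0(x):
    # symmetric base density: the standard normal pdf, f0(-x) = f0(x)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def G0(y):
    # symmetric cdf on the real line: the logistic cdf, G0(-y) = 1 - G0(y)
    if y < -35:
        return 0.0
    return 1.0 / (1.0 + math.exp(-min(y, 35)))

def w(x):
    # an odd skewing function, w(-x) = -w(x)
    return x ** 3

def f(x):
    # the modulated density of the proposition
    return 2 * f0(x) * G0(w(x))

# midpoint-rule integral over a wide interval: should be numerically 1
n, lo, hi = 100000, -12.0, 12.0
h = (hi - lo) / n
total = sum(f(lo + (i + 0.5) * h) for i in range(n)) * h
print(round(total, 4))
```

Any other symmetric f0, symmetric G0 and odd w would pass the same check, which is the whole point of the construction.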
CHAPTER 2:
FUNCTIONAL FORM OF THE UNIVARIATE SKEW
NORMAL DISTRIBUTION AND ITS SOURCE
Let us consider the probability density function (pdf) and cumulative distribution function (cdf) of a standard normal
random variable Z,
ϕ(z) = (1/√(2π)) e^(−z²/2),  Φ(z) = ∫_{−∞}^{z} ϕ(t) dt,  z ∈ R.   (i)
By the proposition stated in Chapter 1, the product of the two functions in (i) gives rise to an interesting class of
random variables, which has been the subject of intense study for the last two decades. More precisely, for a real
number α,
f_X(x) = 2 ϕ(x) Φ(αx),  x ∈ R,
is the bona fide pdf of a new random variable X, which inherits some of the features of the normal distribution. The
class of distributions defined by the pdf given above was introduced by Azzalini (1985) and christened the "skew
normal distribution", with skewness parameter α, symbolically represented as X ~ SN(α).
For applied work we must introduce location and scale parameters, µ and σ respectively. Let Y be the new
variable Y = µ + σX (µ ∈ R, σ ∈ R⁺). The pdf is given by
f_Y(x) = (2/σ) ϕ((x − µ)/σ) Φ(α(x − µ)/σ) = (1/(σπ)) e^(−(x−µ)²/(2σ²)) ∫_{−∞}^{α(x−µ)/σ} e^(−t²/2) dt,
and symbolically, Y ~ SN(µ, σ, α).
The cdf of the skew normal distribution is given by
F_Y(x) = Φ((x − µ)/σ) − 2T((x − µ)/σ, α),
where T(h, a) is Owen's T-function, defined by
T(h, a) = (1/(2π)) ∫_0^a [e^(−h²(1+x²)/2) / (1 + x²)] dx.
The pdfs and cdfs of standard skew normal distributions for various values of α are shown graphically in the original document (figures not reproduced here).
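The cdf formula can be checked numerically. The following Python sketch computes Owen's T by a simple midpoint rule (my own choice for self-containedness; libraries such as scipy.special.owens_t provide it directly) and compares Φ(z) − 2T(z, α) against direct integration of the density 2ϕ(t)Φ(αt):

```python
import math

def phi(x):
    # standard normal pdf
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # standard normal cdf via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2))

def owens_t(h, a, n=20000):
    # midpoint rule for T(h, a) = (1/2π) ∫_0^a exp(−h²(1+t²)/2)/(1+t²) dt
    step = a / n
    s = 0.0
    for i in range(n):
        t = (i + 0.5) * step
        s += math.exp(-h * h * (1 + t * t) / 2) / (1 + t * t)
    return s * step / (2 * math.pi)

alpha, x = 3.0, 0.7
cdf_owen = Phi(x) - 2 * owens_t(x, alpha)

# cross-check: integrate the density 2·phi(t)·Phi(alpha·t) from -∞ (here -12) up to x
n, lo = 200000, -12.0
h = (x - lo) / n
cdf_direct = sum(2 * phi(lo + (i + 0.5) * h) * Phi(alpha * (lo + (i + 0.5) * h))
                 for i in range(n)) * h
print(abs(cdf_owen - cdf_direct) < 1e-4)
```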
CHAPTER 3:
PROPERTIES AND SIMILARITIES & DISSIMILARITIES
WITH NORMAL DISTRIBUTION
PROPERTIES
Let us now discuss some of the properties of the skew-normal distribution:
Property I For α = 0, X = Z and for α → ± ∞, X = ±| Z| where Z ∼ N(0, 1).
The above property shows that the normal and half-normal random variables lie at the center (α = 0) and
boundary (α = ±∞) of the class of SN random variables, respectively. Research following the publications of
Azzalini (1985, 1986), Henze (1986) and Arnold et al. (1993) has revealed that simple and common nonlinear
operations such as truncation, conditioning and censoring performed on normal random variables lead
invariably to versions of SN random variables. Consequently, and not surprisingly, it has been revealed that the
implicit appearance of SN random variables in the literature of statistics has a reasonably long pre-1985 history.
The first known birthplace of SN distributions is the work of Birnbaum (1950) in the context of educational
testing which involved truncation of normal variables, followed by the work of Weinstein (1964) and Nelson
(1964) on finding the distribution of the sum of a normal variable and an independent truncated normal
variable; other early works include Roberts (1966), O'Hagan and Leonard (1976), Aigner et al. (1977) and
Andel et al. (1984). For an extended review of the literature refer to Genton (2004), Arellano-Valle and Azzalini
(2006) and Pourahmadi (2007).
Property II If X ∼ SN(α), then −X ∼ SN( −α) for any α.
Property III Φ(x; −α) = 1 − Φ(−x; α), where Φ(x; α) is the df of the standard skew-normal family.
Property IV If X ∼ SN(α), then | X | and | Z | are identically distributed, where Z~N(0,1)
Property V where Φ is df of the standard normal distribution.
Property VI If X ∼ SN(α), then X² ∼ χ²₁, i.e. a chi-squared rv with 1 degree of freedom.
The chi-square distribution in Property VI, which is immediate from Property IV, was first recognized and employed
effectively by Roberts (1966). It implies in particular that the distributions of |X|, X², and all even functions of
X do not depend on the skewness parameter α, i.e. there exists an invariance property with respect to α that
could have interesting inferential consequences (Genton et al. 2001; Loperfido, 2001). For example, all
goodness-of-fit tests based on even functions of the data are incapable of distinguishing between normal and SN
distributions (refer to Loperfido, 2004).
It is to be noted that the converse of Property VI is not true.
Property VII A random variable X has the SN pdf iff it has the representation
X = δ|Z₁| + √(1 − δ²) Z₂,
where Z₁, Z₂ are independent N(0, 1) random variables, and δ = α/√(1 + α²) ∈ (−1, 1).
It is to be noted that the new parameter δ is, indeed, the correlation coefficient between X and |Z₁|.
Property VIII If X ∼ SN(α) and Z ∼ N(0, 1) are independent, then
(X + Z)/√2 ~ SN(α/√(2 + α²)).
This shows that, unlike normal random variables, the class of SN random variables is not closed
under the addition of independent copies of its members.
Property IX Let Xᵢ ∼ SN(αᵢ) be independent with αᵢ ≠ 0, i = 1, 2. Then, in general, X₁ + X₂ is not SN.
However, if X₁ and X₂ are dependent, sharing a common half-normal component, then X₁ + X₂ is SN.
More precisely, if
X₁ = δ₁|Z| + √(1 − δ₁²) Z₁,
X₂ = δ₂|Z| + √(1 − δ₂²) Z₂,
where Z, Z₁, Z₂ are independent N(0, 1), then it follows that
(X₁ + X₂)/√(2 + 2δ₁δ₂) ~ SN((δ₁ + δ₂)/√(2 − δ₁² − δ₂²)).
PROOFS:
 Property II:
X ~ SN(α) ⟹ f_X(x) = 2ϕ(x)Φ(αx). Let Z = −X. We know x ∈ R, x = −z, z ∈ R, and J = dx/dz = −1, so
f_Z(z) = 2ϕ(−z)Φ(α(−z)) |J| = 2ϕ(z)Φ((−α)z).
Thus Z ~ SN(−α). (QED)
 Property III:
LHS = Φ(x; −α) = Φ(x) − 2T(x, −α) = Φ(x) + 2T(x, α), since T(h, a) is odd in its second argument.
RHS = 1 − Φ(−x; α) = 1 − (Φ(−x) − 2T(−x, α)) = Φ(x) + 2T(−x, α) = Φ(x) + 2T(x, α), since T(h, a) is even in its first argument.
Hence LHS = RHS. (QED)
 Property IV:
For x > 0, f_|X|(x) = f_X(x) + f_X(−x) = 2ϕ(x)Φ(αx) + 2ϕ(x)Φ(−αx) = 2ϕ(x)[Φ(αx) + Φ(−αx)] = 2ϕ(x).
Thus |X| and |Z| are identically distributed, where Z ~ N(0,1). (QED)
 Property VI:
Let Y = X². Then x = ±√y, y > 0, and |J| = |∂x/∂y| = 1/(2√y). Hence
f_Y(y) = (1/(2√y)) f_X(√y) + (1/(2√y)) f_X(−√y)
      = (2/(2√y)) [ϕ(√y)Φ(α√y) + ϕ(−√y)Φ(−α√y)]
      = (1/(√(2π)√y)) e^(−y/2) [Φ(α√y) + Φ(−α√y)]
      = (1/(√(2π)√y)) e^(−y/2),  y > 0,
which is the pdf of a χ²₁
distribution.
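Property VI, and the α-invariance of even moments that it implies, can be illustrated by simulation via the representation of Property VII (the value of α, the sample size and the seed below are arbitrary choices of mine):

```python
import math, random

random.seed(42)
alpha = 4.0
delta = alpha / math.sqrt(1 + alpha ** 2)   # δ = α/√(1+α²)
c = math.sqrt(1 - delta ** 2)

# sample X ~ SN(α) via the representation X = δ|Z1| + √(1−δ²)·Z2 (Property VII)
xs = [delta * abs(random.gauss(0, 1)) + c * random.gauss(0, 1) for _ in range(200000)]

m2 = sum(x * x for x in xs) / len(xs)    # E[X²] = E[χ²₁] = 1, regardless of α
m4 = sum(x ** 4 for x in xs) / len(xs)   # E[X⁴] = E[(χ²₁)²] = 3, regardless of α
print(round(m2, 2), round(m4, 2))
```

The printed moments stay near 1 and 3 for any value of α, exactly as the invariance property predicts.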
DERIVATION OF MGF AND MOMENT MEASURES:
After these properties, we will derive the formulae for expectation, variance and skewness from the
MGF, derived below.
Given: From Property VII, X = δ|Z₁| + √(1 − δ²) Z₂.
M_|Z₁|(t) = E(e^(t|Z₁|)) = ∫_{−∞}^{∞} e^(t|u|) (e^(−u²/2)/√(2π)) du,  u = z₁,
         = √(2/π) ∫_0^∞ e^(tu) e^(−u²/2) du
         = 2 e^(t²/2) ∫_0^∞ (e^(−(u−t)²/2)/√(2π)) du
         = 2 e^(t²/2) Φ(t).
Therefore, M_X(t) = E(e^(tX)) = E(e^(t{a|Z₁| + bZ₂})), where a = δ and b = √(1 − δ²),
         = M_|Z₁|(ta) M_{Z₂}(tb)
         = 2 e^((at)²/2) Φ(ta) e^((bt)²/2)
         = 2 e^(t²/2) Φ(δt).
Now if Y = µ + σX,
M_Y(t) = 2 e^(µt + σ²t²/2) Φ(σδt).
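The closed form of M_Y(t) can be cross-checked against a Monte Carlo estimate of E[e^(tY)] built from the representation of Property VII (the parameter values, t, sample size and seed below are arbitrary choices of mine):

```python
import math, random

def Phi(x):
    # standard normal cdf
    return 0.5 * math.erfc(-x / math.sqrt(2))

mu, sigma, alpha, t = 10.0, 5.5, -4.0, 0.1
delta = alpha / math.sqrt(1 + alpha ** 2)

# closed form: M_Y(t) = 2·exp(µt + σ²t²/2)·Φ(σδt)
mgf = 2 * math.exp(mu * t + (sigma * t) ** 2 / 2) * Phi(sigma * delta * t)

# Monte Carlo estimate of E[exp(tY)] from simulated Y = µ + σ(δ|Z1| + √(1−δ²)Z2)
random.seed(7)
n, acc = 200000, 0.0
c = math.sqrt(1 - delta ** 2)
for _ in range(n):
    x = delta * abs(random.gauss(0, 1)) + c * random.gauss(0, 1)
    acc += math.exp(t * (mu + sigma * x))
mc = acc / n
print(round(mgf, 3), round(mc, 3))   # the two values agree closely
```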
Now, from the derivatives of the MGF at t = 0 we obtain the raw moments, from which E(Y), V(Y) and the skewness of Y follow:
E(Y) = µ + σδ√(2/π)
V(Y) = σ²(1 − 2δ²/π)
Skewness(Y) = γ₁ = ((4 − π)/2) · (δ√(2/π))³ / (1 − 2δ²/π)^(3/2)
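For the parameter values used later in Chapter 5, SN(10, 5.5, −4), these formulae evaluate as below, and a simulation reproduces them (a Python sketch; the seed and sample size are arbitrary choices of mine):

```python
import math, random

mu, sigma, alpha = 10.0, 5.5, -4.0
delta = alpha / math.sqrt(1 + alpha ** 2)
b = math.sqrt(2 / math.pi)

mean = mu + sigma * delta * b                        # E(Y)
var = sigma ** 2 * (1 - 2 * delta ** 2 / math.pi)    # V(Y)
skew = (4 - math.pi) / 2 * (delta * b) ** 3 / (1 - 2 * delta ** 2 / math.pi) ** 1.5

# Monte Carlo check via the stochastic representation of Property VII
random.seed(1)
c = math.sqrt(1 - delta ** 2)
ys = [mu + sigma * (delta * abs(random.gauss(0, 1)) + c * random.gauss(0, 1))
      for _ in range(200000)]
m = sum(ys) / len(ys)
s2 = sum((y - m) ** 2 for y in ys) / len(ys)
g1 = (sum((y - m) ** 3 for y in ys) / len(ys)) / s2 ** 1.5

# theoretical values: mean ≈ 5.74, variance ≈ 12.13, skewness ≈ -0.78
print(round(mean, 2), round(var, 2), round(skew, 2))
print(round(m, 2), round(s2, 2), round(g1, 2))
```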
In Chapter 5, we will see how to estimate these parameters from given data; Chapter 4 first shows how to generate such data.
CHAPTER 4:
GENERATION OF RANDOM SAMPLES FROM
SKEW NORMAL DISTRIBUTION
We will use Property VII to draw random samples using only normal variates. It is quite interesting to note how
the stochastic representation of the skew normal distribution is used to generate random data from it.
PROOF:
Let a = δ and b = √(1 − δ²), so that a/b = δ/√(1 − δ²) = α. Then
P(X ≤ x) = E_|Z₁|[P(a|Z₁| + bZ₂ ≤ x | |Z₁| = z)]
        = 2 ∫_0^∞ P(Z₂ ≤ (x − az)/b) ϕ(z) dz
        = 2 ∫_0^∞ Φ((x − az)/b) ϕ(z) dz.
Differentiating yields the density of X as follows:
(d/dx) P(X ≤ x) = (2/b) ∫_0^∞ ϕ((x − az)/b) ϕ(z) dz.
Using the fact that a² + b² = 1, we get
(d/dx) P(X ≤ x) = 2ϕ(x) ∫_0^∞ (1/(√(2π) b)) e^(−(z − ax)²/(2b²)) dz
        = 2ϕ(x) ∫_{−ax/b}^{∞} (1/√(2π)) e^(−t²/2) dt
        = 2ϕ(x) {1 − Φ(−ax/b)}
        = 2ϕ(x) Φ(αx), which is the skew normal density.
Thus X ~ SN(α). (QED)
Thus we can generate data in this manner using standard normal variates alone. Moreover, Azzalini has
developed an R package named 'sn' with which skew normal random data can easily be generated through
software.
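A minimal generator along these lines might look as follows (a Python sketch of the representation; in R the same task is handled by the 'sn' package, and the function name rskewnorm here is my own invention):

```python
import math, random

def rskewnorm(n, mu=0.0, sigma=1.0, alpha=0.0, rng=random):
    """Draw n variates from SN(mu, sigma, alpha) via the representation of
    Property VII: X = delta*|Z1| + sqrt(1-delta^2)*Z2, delta = alpha/sqrt(1+alpha^2),
    and then Y = mu + sigma*X."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    c = math.sqrt(1 - delta ** 2)
    return [mu + sigma * (delta * abs(rng.gauss(0, 1)) + c * rng.gauss(0, 1))
            for _ in range(n)]

random.seed(2024)
sample = rskewnorm(100000, mu=10, sigma=5.5, alpha=-4)
m = sum(sample) / len(sample)
dev3 = sum((y - m) ** 3 for y in sample) / len(sample)
print(round(m, 1), dev3 < 0)   # sample mean near 5.7; negative α gives negative skew
```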
CHAPTER 5:
ESTIMATION OF PARAMETERS OF SKEW NORMAL
DISTRIBUTION
Estimation of parameters is a fundamental problem in data analysis. However, though various methods of
estimating parameters have evolved over time, we deal with only two such methods.
Earlier Methods of Estimation: Estimation is the process of determining approximate values for parameters
of different populations or events. How well the parameter is approximated can depend on the method, the type
of data and other factors.
Gauss was the first to document the method of least squares, around 1794. This method chooses the parameter
values minimizing the sum of squared deviations between the data and the fitted model. However, least squares is only as robust
as the data points are close to the model, and thus outliers can cause a least squares estimate to be outside the
range of desired accuracy.
The method of moments is another way to estimate parameters. The 1st moment is the mean and the
2nd central moment the variance, while the standardized 3rd and 4th moments give the skewness and the kurtosis. In complex
models, with more than one parameter, it can be difficult to solve for these moments directly, and so moment
generating functions were developed using sophisticated analysis. These moment generating functions can also
be used to compute their respective moments.
Bayesian estimation is based on Bayes' Theorem for conditional probability. Bayesian analysis starts with
little or no information about the parameter to be estimated. Any data collected can be used to adjust the
function of the parameter, thereby improving the estimation of the parameter. This process of refinement can
continue as new data is collected until a satisfactory estimate is found.
Evolution of Maximum Likelihood Estimation: It was none other than R. A. Fisher who developed
maximum likelihood estimation. Fisher based his work on that of Karl Pearson, who promoted several
estimation methods, in particular the method of moments. While Fisher agreed with Pearson that the method of
moments is better than least squares, Fisher had an idea for an even better method. It took many years for him to
fully conceptualize his method, which ended up with the name maximum likelihood estimation. In 1912, when
he was a third-year undergraduate student, Fisher published a paper called "Absolute criterion for fitting
frequency curves." The concepts in this paper were based on the principle of inverse probability, which Fisher
later discarded. (If any method can be considered comparable to inverse probability, it is Bayesian estimation.)
Because Fisher was convinced that he had an idea for the superior method of estimation, criticism of his idea
only fueled his pursuit of the precise definition. In the end, his debates with other statisticians resulted in the
creation of many statistical terms we use today, including the word "estimation" itself and even "statistics".
Finally, Fisher defined the difference between probability and likelihood and put his final touches on maximum
likelihood estimation in 1922.
Finally, this chapter deals with a primitive though still common method of estimation, the method of moments,
and one of the revolutionary modern methods, the method of maximum likelihood. After generating random data
from SN(10, 5.5, −4), I have applied both of the following to estimate the three parameters.
I. METHOD OF MOMENTS :~
In chapter 3, I have worked out the MGF of the skew normal distribution and based on that, derived the
formulae of expectation, variance and skewness.
Let us denote the sample mean by m, the sample variance by s², and the sample skewness by g₁. Hence, based
on the given data (x₁, x₂, …, x_n), we calculate
m = (1/n) Σᵢ₌₁ⁿ xᵢ, equated to E(Y),
s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − m)², equated to V(Y),
g₁ = [n√(n − 1)/(n − 2)] Σᵢ₌₁ⁿ (xᵢ − m)³ / (Σᵢ₌₁ⁿ (xᵢ − m)²)^(3/2), equated to γ₁.
Thus, using the expressions for the mean, variance and skewness in Chapter 3, we get the estimates below.
Note:
 The method of moments estimates can be used as starting values for the maximum likelihood
estimates (since the latter have no closed form, as seen later).
 The sign of δ̂ should be the same as that of g₁. The maximum (theoretical) skewness is obtained
by setting δ = 1 in the skewness equation, giving |γ₁| ≈ 0.9953. However, it is possible that the
sample skewness is larger, and then δ̂ cannot be determined from these equations. Hence, when
estimating parameters using R code, I have skipped those samples where |δ̂| > 1, so as to estimate α
without hindrance.
After generating 5000 random samples from SN(10, 5.5, −4), the table on the next page shows the estimates
and the corresponding standard errors.
|δ̂| = √( (π/2) |g₁|^(2/3) / ( |g₁|^(2/3) + ((4 − π)/2)^(2/3) ) ), with the sign of δ̂ taken from g₁,
σ̂ = s / √(1 − 2δ̂²/π),
µ̂ = m − σ̂ δ̂ √(2/π),
α̂ = δ̂ / √(1 − δ̂²).
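The whole method-of-moments recipe can be collected into a short routine (a Python sketch; mom_estimates is a hypothetical name of mine, and the clamp on δ̂ implements the note above about samples whose skewness exceeds the theoretical bound):

```python
import math, random

def mom_estimates(xs):
    """Method of moments for SN(mu, sigma, alpha) from the sample
    mean, variance and skewness."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)
    g1 = (sum((x - m) ** 3 for x in xs) / n) / (sum((x - m) ** 2 for x in xs) / n) ** 1.5
    # invert the skewness formula: r = (2|g1|/(4-pi))^(1/3), |delta| = sqrt(pi/2)*r/sqrt(1+r^2)
    r = (2 * abs(g1) / (4 - math.pi)) ** (1 / 3)
    delta = math.copysign(math.sqrt(math.pi / 2) * r / math.sqrt(1 + r * r), g1)
    delta = max(min(delta, 0.995), -0.995)   # guard: sample skewness can exceed the bound
    sigma = math.sqrt(s2 / (1 - 2 * delta ** 2 / math.pi))
    mu = m - sigma * delta * math.sqrt(2 / math.pi)
    alpha = delta / math.sqrt(1 - delta ** 2)
    return mu, sigma, alpha

# try it on data simulated from SN(10, 5.5, -4) via the Property VII representation
random.seed(11)
d = -4 / math.sqrt(17)
c = math.sqrt(1 - d * d)
data = [10 + 5.5 * (d * abs(random.gauss(0, 1)) + c * random.gauss(0, 1))
        for _ in range(5000)]
mu_hat, sigma_hat, alpha_hat = mom_estimates(data)
print(round(mu_hat, 2), round(sigma_hat, 2), round(alpha_hat, 2))
```

As the table shows, the estimate of α in particular can be quite variable for samples of this size.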
PARAMETERS ESTIMATES STANDARD ERRORS
𝝁̂ 9.9864 0.0616
𝝈̂ 5.4946 0.0974
𝜶̂ -5.1286 0.0470
II. METHOD OF MAXIMUM LIKELIHOOD :~
 INTRODUCTION:
Let y = (y₁, y₂, …, y_n)′ be a vector of iid RVs from one of a family of distributions on R, indexed
by a p-dimensional parameter θ = (θ₁, θ₂, …, θ_p)′, θ ∈ Θ. Let me denote the df of y by F(y|θ) and
assume that the density function f(y|θ) exists. Then the likelihood function of θ is given by
L = ∏ᵢ₌₁ⁿ f(yᵢ|θ).
Let the p partial derivatives of the log-likelihood form the p × 1 vector
u(θ) = ∂ln L/∂θ = (∂ln L/∂θ₁, …, ∂ln L/∂θ_p)′.
The vector u(θ) is called the score vector of the log-likelihood function. The moments of u(θ) satisfy
two important identities. First, the expectation of u(θ) with respect to y is equal to zero; second, the
variance of u(θ) equals the negative of the expected second derivative of ln L(θ), i.e.,
V(u(θ)) = E[(u(θ))(u(θ))′] = −E{∂² ln L(θ)/∂θⱼ∂θₖ}.
The p × p matrix on the right-hand side is called the expected Fisher information matrix, usually
denoted by I(θ). The expectation here is taken over the distribution of y at a fixed value of θ. Under
conditions which allow the operations of integration with respect to y and differentiation with respect to
θ to be interchanged, the maximum likelihood estimate of θ is given by the solution θ̂ to the p equations
u(θ̂) = 0,
and under some regularity conditions, the distribution of θ̂ is asymptotically normal with mean θ and
variance-covariance matrix given by the p × p matrix I(θ)⁻¹, i.e., the inverse of the expected information
matrix. The p × p matrix J(θ) = −{∂² ln L(θ)/∂θⱼ∂θₖ}, without the expectation, is called the observed information matrix. In practice,
since the true value of θ is not known, these two matrices are estimated by substituting the estimated
value θ̂ to give I(θ̂) and J(θ̂), respectively. Asymptotically, these forms of the information matrix
can be shown to be equivalent.
From a computational standpoint, the above quantities are related to those computed to solve an
optimization problem as follows: −ln L(θ) corresponds to the objective function to be minimized; u(θ)
represents the gradient vector, the vector of first-order partial derivatives, usually denoted by g; and
I(θ) corresponds to the negative of the Hessian matrix H(θ), the matrix of second-order derivatives
of the objective function. In the MLE problem, the Hessian matrix is used to determine
whether the minimum of the objective function −ln L(θ) is achieved by the solution θ̂ to the equations
u(θ) = 0, i.e., whether θ̂ is a stationary point of ln L(θ). If this is the case, then θ̂ is the maximum
likelihood estimate of θ, and the asymptotic covariance matrix of θ̂ is given by the inverse of the
negative of the Hessian matrix evaluated at θ̂, which is the same as the observed information
matrix evaluated at θ̂.
Sometimes it is easier to use the observed information matrix for estimating the asymptotic
covariance matrix of θ̂, since using the expected information matrix requires the expectation to be
evaluated analytically. However, if computing the derivatives of ln L(θ) in closed form is difficult, or if
the optimization procedure does not produce an estimate of the Hessian as a byproduct, estimates of the
derivatives obtained using finite difference methods may be substituted.
Many a time, iterative methods are used to estimate θ, of which the most commonly used modern
technique is the Newton-Raphson method. Recall that the iteration for locating a maximum or minimum of ln L is
θ̂⁽ⁱ⁺¹⁾ = θ̂⁽ⁱ⁾ − H(θ̂⁽ⁱ⁾)⁻¹ u(θ̂⁽ⁱ⁾).
Observe that the Hessian needs to be computed and inverted at every step of the iteration. In difficult
cases when the Hessian cannot be evaluated in closed form, it may be substituted by a discrete estimate
obtained using finite difference methods as mentioned above. In either case, computation of the Hessian
may end up being a substantially large computational burden. When the expected information matrix
I(θ) can be derived analytically without too much difficulty, i.e., when the expectation can be expressed in
closed form for the elements of I(θ), and hence of I(θ)⁻¹, it may be substituted in the above
iteration to obtain the modified iteration
θ̂⁽ⁱ⁺¹⁾ = θ̂⁽ⁱ⁾ + I(θ̂⁽ⁱ⁾)⁻¹ u(θ̂⁽ⁱ⁾).
This saves on the computation of H(θ̂) because functions of the data y are not involved in the
computation of I(θ̂⁽ⁱ⁾) as they are in the computation of H(θ̂). This provides a sufficiently accurate
Hessian to correctly orient the direction to the maximum. This procedure is also called the method of
scoring.
 PROCEDURE:
Let us denote the likelihood function by L. Hence
ln L = c − (n/2) ln σ² − (1/2) Σᵢ₌₁ⁿ ((xᵢ − µ)/σ)² + Σᵢ₌₁ⁿ ln Φ(α(xᵢ − µ)/σ).
The score equations are given by:
(i) ∂ln L/∂µ = (1/σ) Σᵢ₌₁ⁿ ((xᵢ − µ)/σ) − (α/σ) Σᵢ₌₁ⁿ [ϕ(α(xᵢ − µ)/σ) / Φ(α(xᵢ − µ)/σ)] = 0
(ii) ∂ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (xᵢ − µ)² − (α/(2σ³)) Σᵢ₌₁ⁿ [ϕ(α(xᵢ − µ)/σ) / Φ(α(xᵢ − µ)/σ)] (xᵢ − µ) = 0
(iii) ∂ln L/∂α = Σᵢ₌₁ⁿ [ϕ(α(xᵢ − µ)/σ) / Φ(α(xᵢ − µ)/σ)] ((xᵢ − µ)/σ) = 0
If we let W(xᵢ) = ϕ(α(xᵢ − µ)/σ) / Φ(α(xᵢ − µ)/σ), then the score equations take a compact form in terms of
W(xᵢ); the maximum likelihood estimates have no closed form and are obtained by solving these equations
numerically.
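As a minimal illustration of the numerical solution, the sketch below treats µ and σ as known and solves the third score equation for α alone by Newton-Raphson; the full three-parameter problem is handled analogously with the vector score and the information matrix. (Function names, simulation settings and the one-parameter simplification are my own.)

```python
import math, random

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    return 0.5 * math.erfc(-u / math.sqrt(2))

def mle_alpha(z, a0=0.0, iters=100):
    """Newton-Raphson on score equation (iii), with mu and sigma treated as
    known (the z_i are already standardized). W(u) = phi(u)/Phi(u)."""
    a = a0
    for _ in range(iters):
        W = [phi(a * zi) / Phi(a * zi) for zi in z]
        score = sum(zi * w for zi, w in zip(z, W))
        # d/du W(u) = -u*W(u) - W(u)^2, so by the chain rule the second
        # derivative of ln L in alpha is sum of z_i^2 * (-(a*z_i)*W - W^2) < 0
        hess = sum(zi * zi * (-(a * zi) * w - w * w) for zi, w in zip(z, W))
        step = score / hess
        a -= step
        if abs(step) < 1e-10:
            break
    return a

# simulate standardized SN(alpha = 2) data and recover alpha
random.seed(5)
d = 2 / math.sqrt(5)
c = math.sqrt(1 - d * d)
z = [d * abs(random.gauss(0, 1)) + c * random.gauss(0, 1) for _ in range(20000)]
a_hat = mle_alpha(z)
print(round(a_hat, 2))   # close to the true value 2
```

Since ln Φ is strictly concave, the log-likelihood is concave in α and the iteration is well behaved.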
Using the Newton-Raphson method here amounts to using the Fisher information (the method of scoring). With
zᵢ = (xᵢ − µ)/σ, b = √(2/π) and
a_k = E{ z^k (ϕ(αz)/Φ(αz))² },  k = 0, 1, 2,
the expected Fisher information matrix is the symmetric 3 × 3 matrix whose upper triangle is
I_µµ = n(1 + α²a₀)/σ²,
I_µσ = (n/σ²){ bα(1 + 2α²)/(1 + α²)^(3/2) + α²a₁ },
I_µα = (n/σ){ b/(1 + α²)^(3/2) − αa₁ },
I_σσ = n(2 + α²a₂)/σ²,
I_σα = −nαa₂/σ,
I_αα = na₂.
Thus we get the corresponding estimates. After generating 5000 random samples from SN(10, 5.5,
−4), the table below shows the estimates and the corresponding standard errors.
PARAMETERS ESTIMATES STANDARD ERRORS
𝝁̂ 9.9966 0.0791
𝝈̂ 5.5079 0.0932
𝜶̂ -4.007 0.2292
NOTE: We can see by running R code that both sets of estimates carry some bias. The method of
moments estimate of α, in particular, is far less accurate than the corresponding ML estimate.
Apart from estimating the parameters, we also calculate the median, IQR, quartile deviation (QD) and Bowley's measure of
skewness from 5000 observations generated with the same skew-normal parameters.
MEDIAN 6.706
IQR 5.147
QD 2.573
BOWLEY’S MEASURE -0.416
Hence, the results show that the median exceeds the mean, the fairly large IQR indicates that the data are quite
spread out, and the negative value of Bowley's measure shows that the data are negatively skewed, all of which conforms to our
distribution.
CHAPTER 6:
ROBERTS IQ DATA
Arnold et al. (1993) applied the skew normal distribution to a portion of an IQ score data set from Roberts
(1988). In this section we expand the application to the full data set. The Roberts IQ data gives the Otis IQ
scores for 87 white males and 52 non-white males hired by a large insurance company in 1971. The data is
given in the following tables:
(Snapshot taken from Roberts)
To apply the skew normal as a truncated normal according to the motivation of the model given by Arnold et al.
(1993), we assume that these individuals were screened with respect to some variable Y, which is unknown. We
further assume that only individuals who scored above average with respect to the screening variable were
hired. Let X represent the IQ scores of the individuals hired. The variable X is the unscreened variable, and only
this variable is observed. We assume that (X, Y) has a bivariate normal distribution with mean vector (µ₁, µ₂),
variance vector (σ₁², σ₂²) and correlation ρ. Therefore the observed IQ scores represent a sample from a
nonstandard skew normal distribution.
We apply the non-standard skew normal maximum likelihood estimators to the IQ score sample to estimate the
mean and variance of IQ scores for the unscreened population. We now let X₁ represent the scores for whites and
let X2 represent the scores for non-whites. The two data sets displayed above are analyzed separately, each
under the assumptions of normality and skew normality. The estimates are given in the following tables:
CONCLUSION BASED ON THE GIVEN DATA:
For both data sets, under the assumption of normality the mean is overestimated and the standard deviation is
underestimated.
Using the estimates from above, we first transform the data sets, given in Tables 5.1 and 5.2, into data sets on
standard skew-normal random variables Z₁ and Z₂. We then estimate α₁′ and α₂′ for the standard skew-normal
random variables. The resulting estimates are α₁′ = 1.15 and α₂′ = 1.84.
Simultaneously, we can also carry out hypothesis testing to check for skew normality:
To test H₀: Z ~ SN(µ, σ², α = 0) against H₁: Z ~ SN(µ, σ², α ≠ 0).
The likelihood ratio test statistic is given by
λ = L(µ̂, σ̂, α = 0) / L(µ̂, σ̂, α̂),
where L denotes the likelihood function. It can be shown that, asymptotically,
−2 ln λ ~ χ²₁ under H₀.
Applying this to the given data, the p-values for the two data sets come out as 0.00129 and 0.01056,
both less than the chosen level of significance, 0.05.
Hence, we conclude that H₀ is rejected for both data sets.
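A simplified version of this test is sketched below in Python (µ and σ are treated as known and the MLE of α is found by a crude grid search, both simplifications of my own; the real analysis would refit all three parameters):

```python
import math, random

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    return 0.5 * math.erfc(-u / math.sqrt(2))

def loglik(z, a):
    # ln L(alpha) for standardized data: sum of ln[2*phi(z_i)*Phi(a*z_i)]
    return sum(math.log(2 * phi(zi) * Phi(a * zi)) for zi in z)

# simulate data from SN(alpha = 2) via the Property VII representation
random.seed(9)
d = 2 / math.sqrt(5)
c = math.sqrt(1 - d * d)
z = [d * abs(random.gauss(0, 1)) + c * random.gauss(0, 1) for _ in range(2000)]

# crude grid search for the MLE of alpha under H1
grid = [i / 20 for i in range(-100, 101)]            # alpha in [-5, 5]
alpha_hat = max(grid, key=lambda a: loglik(z, a))

stat = 2 * (loglik(z, alpha_hat) - loglik(z, 0.0))   # -2 ln(lambda)
pval = math.erfc(math.sqrt(stat / 2))                # survival function of chi-square(1)
print(round(stat, 1), pval < 0.05)
```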
OVERALL CONCLUSION:
In many real life applications, it has been observed that the unrestricted use of the normal distribution to model
data can yield erroneous results. For the Roberts (1988) IQ data analyzed here, the application of normality
resulted in overestimates of the mean IQ scores in both cases. This is due to the fact that the scores were
obtained by screening on some other variable which is unknown, giving rise to skewness in the data. For this
reason, I have used the skew normal distribution.
CHAPTER 7:
APPLICATIONS OF SKEW NORMAL IN VARIOUS
FIELDS
The various applications are listed below:
1. STATISTICAL PROCESS CONTROL: The most commonly used standard procedures of Statistical
Quality Control (SQC), control charts and acceptance sampling plans, are often implemented under the
assumption of normal data, which rarely holds in practice. The analysis of several data sets from diverse
areas of application, such as statistical process control (SPC) and reliability, leads us to notice that this
type of data usually exhibits moderate to strong asymmetry as well as light or heavy tails. Thus, despite
the simplicity and popularity of the Gaussian distribution, we conclude that in most cases
fitting a normal distribution to the data is not the best option. On the other hand, modeling real data sets,
even when we have some potential asymmetric models for the underlying data distribution, is always a
very difficult task due to uncontrollable perturbation factors. Hence, in these problems, the skew-
normal distribution comes to the rescue.
2. BIOMEDICAL STUDIES: Another important application is concerned with continuous longitudinal
responses in biomedical studies. The first attempt to use skew-normal and related distributions to relax
the standard assumption of normality of the random effects in linear mixed models has been developed
by Ma et al. (2004). Because the likelihood function does not have a closed form in this setting, they
proposed inference based on the EM algorithm as well as inference via MCMC simulations in a
Bayesian framework. In this approach, the choice of the degree of the polynomial involved in the
skewing function was identified by means of model selection criteria. A simulation study indicated that
a flexible model for the distribution of random effects in the linear mixed model results in more efficient
estimators of the fixed effects and also more efficient estimators of the mean and the variance of the
unobserved random effects. The special case where the distribution of the random effects is skew-
normal permits a closed form expression for the likelihood function and has been studied by Arellano-
Valle et al. (2005a). The use of skew-normal and related distributions in biomedical sciences has been
further promoted by Sahu & Dey (2004) for the development of survival models with a skewed frailty,
and by Chen (2004) for the construction of skewed link models for categorical response data.
3. GEOLOGICAL STUDIES: The skew-normal and related distributions can play a very important role in
applications arising from the geosciences. Kim & Mallick (2004) used the skew-normal distribution to
model spatial data. In the context of data assimilation, Naveau et al. (2004) developed a skewed
Kalman filter for the analysis of climatic time series. Specifically, they studied the impact of strong but
short-lived perturbations from large explosive volcanic eruptions on climate. The use of skew-normal
distributions gave a more realistic representation of volcanic forcing than the normal distribution.
Genton & Thompson (2003) used skew-elliptical time series to model sea levels and to evaluate the
risk of coastal flooding in Charlottetown, Canada.
4. LINEAR MODELS: Another area of interest is the application of the univariate skew-normal
distribution to linear models. In one of the US presidential elections, there was considerable discussion
about the measurement error of machine and hand vote recounts. In linear models, the error term is
assumed to be normal with mean zero. However, if we consider each Florida county separately, with
each showing a significant margin of victory for one of the two candidates, then the measurement
error will be skewed in favor of the winner. It is therefore natural to investigate a linear model
with a skew-normal error term.
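To see why a skewed error term matters, the small sketch below (a minimal Python illustration; the function name is ours, not from any cited package) evaluates the mean of an SN(µ, σ, α) variable using the moment formula E(Y) = µ + σδ√(2/π) with δ = α/√(1+α²), derived in Chapter 3. A nonzero α shifts the error mean away from zero, so an ordinary least-squares fit silently absorbs the skewness into the intercept.

```python
import math

def sn_mean(alpha, mu=0.0, sigma=1.0):
    """Mean of a SN(mu, sigma, alpha) variable: mu + sigma*delta*sqrt(2/pi)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)  # delta lies in (-1, 1)
    return mu + sigma * delta * math.sqrt(2.0 / math.pi)

# alpha = 0 recovers the normal case, whose error mean is exactly zero;
# any alpha != 0 produces a systematic shift in the error term.
print(sn_mean(0.0))   # 0.0
print(sn_mean(2.0))   # positive shift, absorbed by the OLS intercept
```

The shift grows towards σ√(2/π) as α → ∞ (the half-normal boundary case), which quantifies how far a naive normal-error fit can be pulled.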
5. EVOLUTIONARY GENETICS: Evolutionary algorithms constitute a set of optimization techniques
inspired by the biological evolution of a population of organisms that adapts to its surrounding
environment via mechanisms of mutation and selection. An algorithm of this type starts by
choosing an initial ‘population’ formed by a random set of n points in the feasible space of the target
function and makes it evolve through successive generations. In the evolution process, the best-performing
points are used to breed a new generation via a mutation operator. This step involves the
generation of new random points, typically by means of a multivariate Gaussian distribution. In this
framework, Berlik (2006) considered adopting an asymmetric parent distribution instead of a
Gaussian one. The main idea of directed mutation is to impart directionality to the search by generating
random numbers that lie preferably in the direction where the optimum is presumed to be. Operationally,
the SN family supplies the sampling distribution, depending on whether we want to keep the mutations in the
various components independent or allow for correlation.
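Directed mutation only requires a way to draw skew-normal variates. A minimal sketch, using the stochastic representation X = δ|Z₁| + √(1−δ²)Z₂ with δ = α/√(1+α²) (Property VII of Chapter 3); the function name and step size are ours, for illustration only:

```python
import math
import random

def rskewnorm(alpha, mu=0.0, sigma=1.0):
    """Draw one SN(mu, sigma, alpha) variate from two independent N(0,1) draws."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    x = delta * abs(z1) + math.sqrt(1.0 - delta * delta) * z2  # Property VII
    return mu + sigma * x

# Directed mutation: perturb a parent point, with steps skewed towards the
# direction in which the optimum is presumed to lie (here alpha > 0).
parent = 1.0
offspring = [parent + 0.1 * rskewnorm(alpha=4.0) for _ in range(10)]
```

For correlated mutations across components one would draw from the multivariate SN family instead, but the same representation idea carries over.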
6. SATELLITE IMAGING: The sixth application exploits the flexibility of the skew-normal distribution to
classify the pixels of a remotely sensed satellite image. Most remote sensing packages, for
example ENVI and ERDAS, assume that the populations follow a multivariate normal distribution.
A linear discriminant function (LDF) or quadratic discriminant function (QDF) is then used to classify the
pixels, according to whether the covariance matrices of the populations are assumed equal or unequal,
respectively. However, data obtained from satellite or airplane images often suffer from non-normality.
In this case, a skew-normal discriminant function (SDF) is one technique for obtaining a more accurate
image. Comparisons of the SDF with the LDF and QDF show that ignoring the skewness of the data
increases the misclassification probability and consequently yields a wrong image.
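The idea behind the SDF can be sketched in a few lines: fit a skew-normal density to each class and assign a pixel to the class with the highest log-density. This is a schematic single-band illustration, not the procedure of any particular package; the class names and (µ, σ, α) values below are invented, whereas a real application would estimate them from training pixels.

```python
import math

def sn_logpdf(x, mu, sigma, alpha):
    """Log-density of SN(mu, sigma, alpha): log(2/sigma * phi(z) * Phi(alpha*z))."""
    z = (x - mu) / sigma
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    big_phi = max(big_phi, 1e-300)  # guard against underflow of Phi far in the tail
    return math.log(2.0 / sigma) + log_phi + math.log(big_phi)

def sdf_classify(x, classes):
    """Assign pixel value x to the class whose skew-normal density is largest."""
    return max(classes, key=lambda name: sn_logpdf(x, *classes[name]))

# Hypothetical per-class (mu, sigma, alpha) estimates for one spectral band:
classes = {"water": (0.0, 1.0, -3.0), "vegetation": (2.0, 1.0, 3.0)}
print(sdf_classify(-1.0, classes))  # water
```

Setting every α to zero reduces the rule to the usual normal-theory QDF, which makes the comparison in the text concrete: the SDF only differs when the fitted skewness is nonzero.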
7. PSYCHIATRIC MEASURES: Variables arising from instruments designed to assess health status
often follow asymmetric and long-tailed distributions, resulting from a majority of healthy individuals
with low values and a few individuals with larger values reflecting particular disorders (e.g. screening
questionnaires for symptoms and diagnoses). Skewness occurs frequently in screening settings where
the distributions are otherwise nearly normal; it is thus particularly important to account for the values
reflecting disorder whilst preserving the usual normal properties of the general population. To perform
statistical analyses adequately, we can rely on empirically chosen transformations (e.g. logarithmic,
Box-Cox) to make the data conform to the methods’ assumptions. However, it is not always possible
to find a suitable transformation, and analyzing data on a different scale might compromise
interpretability. This means that, rather than the dataset properties informing the statistical analyses,
inappropriate or non-optimal methods in which the data are not fully exploited are often used, with
assumption violations accepted as an inevitable nuisance. This predicament can be avoided when
skew-normal distributions, which accommodate the asymmetry directly, are taken into account.
The references used in the text are given below:
1. Azzalini, A., The Skew-Normal and Related Families.
2. Pourahmadi, M. (Northern Illinois University), Construction of Skew-Normal Random Variables: Are
They Linear Combinations of Normal and Half-Normal?
3. Brown, N. D., Reliability Studies of the Skew Normal Distribution.
4. www.wikipedia.org
5. Figueiredo, F. (CEAUL and Faculdade de Economia da Universidade do Porto, Portugal) and
Gomes, M. I. (Universidade de Lisboa, FCUL, DEIO and CEAUL, Portugal), The Skew-Normal
Distribution in SPC.
I would like to express my special gratitude to Mr. Debjit Sengupta, my project guide, who
came up with such a wonderful project on the topic (Analysis of Skew-Normal Distributions). This not
only led me to do a great deal of research on such a unique topic but also acquainted me with
various new horizons in statistics. He made this project, which consumed a huge amount of hard work,
research and dedication, a grand success.
Secondly, I would also like to thank my parents and friends, who helped me a lot in finalizing this project.

  • 1. P a g e | 1 From a long time, the study of parametric families of probability distributions has received profound interest due to their usefulness in data analytics. My present work is an account of one approach which has generated a great deal of activity. There has been a general proclivity in the statistical literature to adopt more flexible methods to analyze data and to represent features of the data as adequately and exhaustively as possible so as to reduce unrealistic and unreliable assumptions. For the treatment of continuous observations within a parametric domain, one aspect which has been little affected by the above process is the overwhelming role played by the assumption of normality which is the sole basis for every data analysis. A major reason for this is certainly the unrivaled mathematical tractability of the normal distribution and its simplicity in dealing with even the most complex data. From a practical viewpoint, the most commonly adopted approach is transformation of the variables to achieve normality which works satisfactorily for most of the data. However, this approach is not without its faults: (i) the transformations are usually on each component separately, and achievement of joint normality is only hoped for; (ii) the transformed variables are more difficult to deal with in case of interpretations, especially when each variable is transformed using a different function; (iii) In case homoscedasticity is required, this often necessitates a different transformation from the one for normality. Alternatively, there exists several other parametric classes of distributions to choose from amongst which many are already reviewed by Johnson & Kotz (1972). A special mention is due to the hyperbolic distribution and its generalized version, which form a fairly flexible and mathematically tractable parametric class (Refer to Barndorff-Nielsen & Blæsild (1983) for a summary account, and Blæsild (1981) for a detailed treatment
  • 2. P a g e | 2 of the bivariate case and a numerical example). Except for data transformation, however, no alternative method to the normal distribution has been adopted for regular use in applied work, within the framework considered here of a parametric approach to handle continuous data. Generally, it can be seen that the motive is to start from a symmetric distribution and then by a suitable modification, generate a set of asymmetric distributions. In my case, the simplest effect is the introduction of skewness in the distribution under consideration. But it is to be noted that my focus in not dealing with the quintessential nature of skewness and methods for measuring it. Rather, my study is to generate more malleable and realistic parametric distributions with possible departure from symmetry and in other words, incorporating skewness so as to use them in statistical work. The present paper examines a different direction of the above broad problem, namely the possibility to extend some of the classical methods to the class of skew normal distributions which has recently been discussed by Azzalini & Dalla Valle (1996). We aim at demonstrating that this distribution achieves a reasonable flexibility in real data fitting, while it maintains a number of convenient formal properties of the normal one. The concentrated development of research in this area has attracted the interests of a large number of statisticians recently. In this introductory phase, let me summarize the different chapters.  The motivation for such a class of distribution and the key concept of my development are formulated in Chapter 1.  Chapter 2 deals with the form of the distribution and the source for such a form.  The different significant properties and the similarities and dissimilarities with the normality have been nurtured in Chapter 3.  Chapter 4 revolves around one such property with the help of which we can be able to generate skew- normal data.
  • 3. P a g e | 3  Chapter 5 pertains to the different methods of estimating parameters and comparing with those of normality.  In Chapter 6, I have analyzed Roberts IQ data and fitted it to a skew normal distribution to show its preference over normal family.  Chapter 7 discusses various applicability of skew normal distributions in a vast number of fields There are various references that have been enlisted in the following chapters for further studies. I have also taken particular care in following the correct technicalities on introducing a statistical or mathematical term. I hope you will enjoy this paper very much.
  • 4. P a g e | 4 CHAPTER 1: MOTIVATION BEHIND SKEW NORMAL DISTRIBUTION A question naturally comes to our mind that with vast families of skewed probability distributions currently available, do we really need any more? Can’t we be happy with what we have? The answer is no. and there are three motivations behind this answer. The first lies in the essence of the mechanism itself, which starts with a continuous symmetric density function which is then modified to generate a variety of alternative forms. The set of densities so constructed incudes the original symmetric one as an ‘interior point’. (Let S be a subset of a topological space X. Then x is an ‘interior point’ of S if x is contained in an open subset of S). Let us gaze upon the normal family, which is of huge prominence. It is well known fact that the normal distribution is the limiting form of a large number of non- normal parametric families, while in the following construction is the ‘central’ form of a set of alternatives. This situation is more in line with the common perception of the normal distribution as ‘central’ with respect to others, which represent ‘departures from normality’ rather than ‘incomplete convergence to normality’. The second motivation lies in the applicability of such tractable distributions. While considering normality, we also take into account its symmetricity while in reality that may not be so. Hence removal of such restriction has become a necessity. The last motivation derives from the mathematical elegance and complaisance of the construction in two respects. Firstly, the new family distributions emerges out of a simple process not requiring complex formulations, as we can see later and secondly, the newly generated distributions are more or less alike the previous ones and retain some properties of the parent symmetric distributions.
  • 5. P a g e | 5 *) A generalconstruction : Since the role of symmetricity is crucial in our development, we recall the condition of central symmetry: according to Serfling (2006), a random variable X is centrally symmetric about 0 if it is distributed as −X. Let us state a proposition: Denote by f0 a probability density function on Rd, by G0(·) a continuous distribution function on the real line, and by w(·) a real valued function on Rd, such that f0(−x) = f0(x), w(−x) = −w(x), G0(−y) = 1 − G0(y) for all x ∈ Rd, y ∈ R. Then f(x) = 2 f0(x) G0{w(x)} is a density function on Rd. Technical proof: Note that g(x) = 2 [G0{w(x)}− 1/2 ] f0(x) is an odd function and it is integrable because |g(x)| ≤ f0(x). Then 0 = ∫ 𝑔( 𝑥) 𝑑𝑥 = ∫ 2𝑓0( 𝑥) 𝐺0{ 𝑤( 𝑥)} 𝑑𝑥 − 1 𝑅 𝑑 𝑅 𝑑 Thus we have encapsulated the motivations and construction behind the formulation of the skew-normal distribution and now go on to its distributional form in the next chapter. (QED)
  • 6. P a g e | 6 CHAPTER 2: FUNCTIONAL FORM OF THE UNIVARIATE SKEW NORMAL DISTRIBUTION AND ITS SOURCE Let us consider the probability density function (pdf) and cumulative density function (cdf) of a standard normal distribution Z, (i) By the proposition stated in Chapter 1, the product of the two functions in (i) gives rise to an interesting class of random variables, which has been the subject of intense study for the last two decades. More precisely for a real number α, is the bona fide pdf of a new random variable X, which inherits some of the features of normal distribution. The class of distributions denoted by the pdf given above were introduced by Azzalini (1985) and christened “skew normal distributions” with the skewness parameter α, symbolically represented as X ~ SN (α). For applied work we must introduce the location and scale parameters, µ and σ respectively. Let Y be the new variable such that Y = µ + σX. (µ ε R, σ ε R+). The pdf is given by fY(x) = 𝟐 𝝈 𝝓( 𝒙−µ 𝝈 )ф (𝜶 𝒙−µ 𝝈 ) = 𝟏 𝝈 √ 𝟐 𝝅 𝒆 − 𝟏 𝟐 (𝒙−𝝁) 𝟐 𝝈 𝟐 ∫ 𝒆−𝒕 𝟐𝜶( 𝒙−𝝁 𝝈 ) −∞ 𝒅𝒕
  • 7. P a g e | 7 and symbolically, Y ~ SN (µ,σ,α). The cdf of the skew normal distribution is given by FY(x) = ф ( 𝒙−𝝁 𝝈 ) − 𝟐𝑻 ( 𝒙−𝝁 𝝈 , 𝜶) where T(h, a) is a Owen’s T-function defined by Now, the pdfs and cdfs for standard skew normal distributions has been graphically shown below:
  • 8. P a g e | 8 CHAPTER 3: PROPERTIES AND SIMILARITIES & DISSIMILARITIES WITH NORMAL DISTRIBUTION PROPERTIES  Let us now discuss some of the properties of skew-normal distribution:- Property I For α = 0, X = Z and for α → ± ∞, X = ±| Z| where Z ∼ N(0, 1). The above property shows that the normal and half-normal random variables lie at the center (α = 0) and boundary (α = ±∞) of the class of SN random variables, respectively. Research following the publications of Azzalini (1985, 1986), Henze (1986) and Arnold et al. (1993) have revealed that simple and common nonlinear operations such as truncation, conditioning and censoring performed on normal random variables lead invariably to versions of SN random variables. Consequently and not surprisingly, it has been revealed that the implicit appearance of SN random variables in the literature of statistics has a reasonably long pre-1985 history. The first known birthplace of SN distributions is the work of Birnbaum (1950) in the context of educational testing which involved truncation of normal variables, followed by the work of Weinstein (1964) and Nelson (1964) on finding the distribution of the sum of a normal variable and an independent truncated normal variable; some other early work are Roberts (1966), O’Hagan and Leonard (1976), Aigner et al. (1977) and Andel et al. (1984). For an extended review of the literature refer to Genton (2004), Arellano-Valle and Azzalini (2006) and Pourahmadi (2007). Property II If X ∼ SN(α), then −X ∼ SN( −α) for any α. Property III Φ(x;- α) = 1-Φ(-x; α)where Φ(x;α)is df of the standard normal family. Property IV If X ∼ SN(α), then | X | and | Z | are identically distributed, where Z~N(0,1)
  • 9. P a g e | 9 Property V where Φ is df of the standard normal distribution. Property VI If X ∼ SN(α), then X2 ∼ χ2 1, i.e. a chi-squared rv with df = 1. The chi-square distribution in Property IV which is immediate from III, was first recognized and employed effectively by Roberts (1966), it implies in particular that the distributions of | X| , X2, and all even functions of X do not depend on the skewness parameter α, i.e. there exists an invariance property, with respect to α, that could have interesting inferential consequences (Genton et al. 2001; Loperfido, 2001). For example, all goodness-of-fit tests based on even functions of the data are incapable of distinguishing between normal and SN distributions (refer to Loperfido, 2004). It is to be noted that the inverse of the property VI is not true. Property VII A random variable X has the SN pdf (2), iff it has the representation X = δ| Z1| + √𝟏 − 𝛅 𝟐 Z2 where Z1, Z2 are independent N(0, 1) random variables, and δ = 𝛂 √ 𝟏−𝛂 𝟐 ∈ [−1, 1]. It is to be noted that the new parameter δ is, indeed, the correlation coefficient between X and |Z1|. Property VIII If X ∼ SN(α)and Z ∼ N(0, 1) are independent, then 𝑿+𝒁 √𝟐 ~ 𝑺𝑵( 𝜶 √ 𝟐+𝜶 𝟐 ) This shows that unlike the normal random variables, the class of SN random-variables is not closed with respect to the addition of independent copies of its members. Property IX Let Xi ∼ SN(αi)be independent with αi ≠ 0, i = 1, 2. Then, in general, X1 + X2 is not SN. However, if X1 and X2 are dependent sharing a common half-normal, then X1 + X2 is SN. More precisely, if
  • 10. P a g e | 10 X1 = δ1|Z| + √𝟏 − 𝜹 𝟏 𝟐 Z1, X2 = δ2|Z| + √𝟏 − 𝜹 𝟐 𝟐 Z2, where Z, Z1, Z2 are independent N(0, 1). Then, it follows that 𝑿 𝟏+𝑿 𝟐 √ 𝟏+𝟐𝜹 𝟏 𝜹 𝟐 ~ 𝑺𝑵( 𝜹 𝟏+𝜹 𝟐 √ 𝟏+𝟐𝜹 𝟏 𝜹 𝟐 ) ~ PROOFS:  Property II:  Property III: X ~ SN(α) => fX(x) = 2 ϕ(x)Φ (αx) Let Z = -X ; We know x ε R x = -z , z ε R J = 𝑑𝑧 𝑑𝑥 = -1 fZ(z) = 2 ϕ(−z)Φ (α(−z)) | 𝐽 | = 2 ϕ(z)Φ ((−α)z) Thus Z ~ SN(-α) (QED) LHS = Φ(x;−α) = Φ(x) − 2T(x,−α) RHS = 1 − Φ(−x;α) = 1 −(Φ(−x) − 2𝑇(−𝑥, 𝛼)) = Φ(x) − 2𝑇(−𝑥, 𝛼) Since T(x,-α) = T(-x,α), LHS = RHS (QED)
  • 11. P a g e | 11  Property IV:  Property VI: Thus |X| and |Z| are identically distributed where Z ~ N(0,1) (QED) Let 𝑌 = 𝑋2 Then, 𝑥 = ±√ 𝑦 , y ε 𝑅2 |J| = 𝜕𝑥 𝜕𝑦 = 1 2√ 𝑦 Hence, 𝑓𝑌( 𝑦) = 1 2√ 𝑦 𝑓𝑋 (√ 𝑦) + 1 2√ 𝑦 𝑓𝑋 (−√ 𝑦) = 2 2√ 𝑦 [𝜙(√ 𝑦)Φ(𝛼√ 𝑦) + 𝜙(−√ 𝑦)Φ(−𝛼√ 𝑦)] = 1 √2𝜋√ 𝑦 𝑒− 𝑦 2 [Φ(𝛼√ 𝑦) + Φ(−𝛼√ 𝑦)] = 1 √2𝜋√ 𝑦 𝑒− 𝑦 2 which is thepdf of a 𝜒(1) 2 distribution.
  • 12. P a g e | 12 DERIVATION OF MGF AND MOMENT MEASURES: After these properties, we will derive the formulae for expectation, variance and skewness from derivation of MGF, give below. Given: From property VII, X = δ| Z1| + √𝟏 − 𝛅 𝟐 Z2 𝑀| 𝑍1|( 𝑡) = 𝐸( 𝑒 𝑡| 𝑍1| ) = ∫ 𝑒 𝑡| 𝑢| 𝑒 − 𝑢2 2 √2𝜋 𝑑𝑢 ∞ −∞ , u = z1 = √ 2 𝜋 ∫ 𝑒 𝑡𝑢 𝑒− 𝑢2 2 𝑑𝑢 ∞ 0 = 2𝑒 𝑡2 2 ∫ 𝑒 − (𝑢−𝑡)2 2 √2𝜋 𝑑𝑢 ∞ 0 = 2𝑒 𝑡2 2 𝛷( 𝑡) Therefore, 𝑀 𝑋 ( 𝑡) = 𝐸( 𝑒 𝑡𝑋) = 𝐸( 𝑒 𝑡{𝑎| 𝑍1|+𝑏𝑍2} ) , where a = δ and b = √1 − δ2 = 𝑀| 𝑍1|( 𝑡𝑎) 𝑀 𝑍2 ( 𝑡𝑏) = 2𝑒 (𝑎𝑡)2 2 𝛷( 𝑡𝑎) 𝑒 (𝑏𝑡)2 2 = 2𝑒 𝑡2 2 𝛷( 𝑡δ) Now if Y = µ + σX, 𝑴 𝒀( 𝒕) = 𝟐𝒆 𝝁𝒕+ (𝒕𝝈) 𝟐 𝟐 𝜱( 𝝈𝒕𝛅)
Differentiating the MGF at t = 0 gives the raw moments, from which the mean, variance and skewness of Y follow:
E(Y) = µ + σδ√(2/π)
V(Y) = σ²(1 − 2δ²/π)
Skewness(Y) = ((4 − π)/2) · (δ√(2/π))³ / (1 − 2δ²/π)^(3/2)
In Chapter 5, we will see how to estimate the parameters from a given data set.
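These moment formulae can be verified by simulation; the sketch below (NumPy/SciPy assumed) uses the parameter values SN(10, 5.5, −4) that appear later in the estimation chapter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, alpha, n = 10.0, 5.5, -4.0, 500_000
d = alpha / np.sqrt(1 + alpha**2)

# theoretical values from the formulas above
mean_th = mu + sigma * d * np.sqrt(2 / np.pi)
var_th = sigma**2 * (1 - 2 * d**2 / np.pi)
skew_th = (4 - np.pi) / 2 * (d * np.sqrt(2 / np.pi))**3 / (1 - 2 * d**2 / np.pi)**1.5

# sample via the representation of Property VII
z = d * np.abs(rng.standard_normal(n)) + np.sqrt(1 - d**2) * rng.standard_normal(n)
y = mu + sigma * z
```

The sample mean, variance and skewness of y should match the theoretical values closely.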
CHAPTER 4: GENERATION OF RANDOM SAMPLES FROM SKEW NORMAL DISTRIBUTION
We will use Property VII to draw random samples using normal distributions: a random variable X has the SN pdf iff it has the representation X = δ|Z1| + √(1 − δ²) Z2, where Z1, Z2 are independent N(0, 1) random variables and δ = α/√(1 + α²) ∈ (−1, 1). It is quite interesting to note how this special representation of the skew normal distribution is used to generate its random data.
PROOF: Let a = δ and b = √(1 − δ²). Conditioning on the half-normal component,
P(X ≤ x) = E|Z1|[P(X ≤ x | |Z1| = z)] = 2 ∫_0^∞ P(Z2 ≤ (x − az)/b) φ(z) dz = 2 ∫_0^∞ Φ((x − az)/b) φ(z) dz.
Differentiating yields the density of X as follows:
d/dx P(X ≤ x) = 2 ∫_0^∞ (1/b) φ((x − az)/b) φ(z) dz.
Using the fact that a² + b² = 1, the exponent can be rearranged to give
d/dx P(X ≤ x) = 2 φ(x) ∫_0^∞ (1/(√(2π) b)) e^(−(z − ax)²/(2b²)) dz
Substituting t = (z − ax)/b,
= 2 φ(x) ∫_(−ax/b)^∞ (1/√(2π)) e^(−t²/2) dt
= 2 φ(x) {1 − Φ(−ax/b)}
= 2 φ(x) Φ(αx),   since a/b = δ/√(1 − δ²) = α,
which is the skew normal density. Thus X ∼ SN(α). (QED)
Thus we can generate skew normal data in this manner using standard normal variates alone. Moreover, Azzalini has developed an R package named 'sn' through which skew normal random data can be generated directly in software.
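In the same spirit as the 'sn' package, the generation scheme of Property VII takes only a few lines; the sketch below (Python with NumPy/SciPy, offered as an assumed alternative to the R route) compares the simulated mean and variance with the exact values supplied by scipy.stats.skewnorm:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def rsn(alpha, size):
    """Draw SN(alpha) variates via X = d|Z1| + sqrt(1-d^2) Z2 (Property VII)."""
    d = alpha / np.sqrt(1 + alpha**2)
    z1 = np.abs(rng.standard_normal(size))   # half-normal component
    z2 = rng.standard_normal(size)           # independent normal component
    return d * z1 + np.sqrt(1 - d**2) * z2

x = rsn(2.0, 300_000)
m, v = stats.skewnorm.stats(2.0, moments="mv")   # exact mean and variance
```

The sample moments of x should reproduce the exact skew-normal mean and variance.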
CHAPTER 5: ESTIMATION OF PARAMETERS OF SKEW NORMAL DISTRIBUTION
Estimation of parameters is a fundamental problem in data analysis. Although various methods of estimating parameters have evolved over time, we deal with only two of them here.
Earlier Methods of Estimation: Estimation is the process of determining approximate values for the parameters of a population or process. How well a parameter is approximated can depend on the method, the type of data and other factors. Gauss was the first to document the method of least squares, around 1794. This method chooses the parameter values that best fit the model to the given data set. However, least squares is only as robust as the data points are close to the model, so outliers can push a least squares estimate outside the range of desired accuracy.
The method of moments is another way to estimate parameters. The 1st moment is the mean, the 2nd central moment is the variance, and the standardized 3rd and 4th moments give the skewness and kurtosis. In complex models with more than one parameter it can be difficult to compute these moments directly, and so moment generating functions were developed, from which the moments can be obtained by differentiation.
Bayesian estimation is based on Bayes' theorem for conditional probability. Bayesian analysis starts with little or no information about the parameter to be estimated. Any data collected can then be used to update the distribution of the parameter, thereby improving its estimate. This process of refinement can continue as new data are collected until a satisfactory estimate is found.
Evolution of Maximum Likelihood Estimation: It was none other than R. A. Fisher who developed maximum likelihood estimation. Fisher based his work on that of Karl Pearson, who promoted several estimation methods, in particular the method of moments.
While Fisher agreed with Pearson that the method of
moments is better than least squares, Fisher had an idea for an even better method. It took many years for him to fully conceptualize his method, which ended up with the name maximum likelihood estimation. In 1912, as a third-year undergraduate student, Fisher published a paper called "Absolute criterion for fitting frequency curves." The concepts in this paper were based on the principle of inverse probability, which Fisher later discarded. (If any method can be considered comparable to inverse probability, it is Bayesian estimation.) Because Fisher was convinced that he had an idea for a superior method of estimation, criticism of his idea only fueled his pursuit of its precise definition. In the end, his debates with other statisticians resulted in the creation of many statistical terms we use today, including the word "estimation" itself and even "statistics". Finally, Fisher defined the difference between probability and likelihood and put the finishing touches on maximum likelihood estimation in 1922.
This chapter thus deals with a primitive, though not uncommon, method of estimation, the method of moments, and one of the revolutionary modern methods, the method of maximum likelihood. After generating random data from SN(10, 5.5, -4), I have applied both to estimate the three parameters.
I. METHOD OF MOMENTS :~
In Chapter 3, I have worked out the MGF of the skew normal distribution and, based on it, derived the formulae for the expectation, variance and skewness. Let us denote the sample mean by m, the sample variance by s², and the sample skewness by g1. Based on the given data (x1, x2, …, xn), we calculate
m = (1/n) Σ xi,   set equal to E(X),
s² = (1/(n−1)) Σ (xi − m)²,   set equal to V(X),
g1 = [n√(n−1)/(n−2)] · Σ (xi − m)³ / (Σ (xi − m)²)^(3/2),   set equal to 𝞬1.
Thus, inverting the expressions for the mean, variance and skewness derived earlier, we get the estimates
|δ̂| = √( (π/2) · |g1|^(2/3) / (|g1|^(2/3) + ((4 − π)/2)^(2/3)) ),   with sign(δ̂) = sign(g1),
σ̂ = s / √(1 − 2δ̂²/π),
µ̂ = m − σ̂ δ̂ √(2/π),
α̂ = δ̂ / √(1 − δ̂²).
Note:
 The method of moments estimates can be used as starting values for the maximum likelihood estimates (for which, as seen later, there is no closed form).
 The sign of δ̂ should be the same as that of g1. The maximum theoretical skewness, obtained by letting |δ| → 1 in the skewness equation, is |𝞬1| ≈ 0.9953. It is, however, possible for the sample skewness to be larger, in which case δ̂ cannot be determined from these equations. Hence, when estimating parameters using R code, I have skipped those samples for which |δ̂| would exceed 1, so as to estimate α without hindrance.
After generating 5000 random samples from SN(10, 5.5, -4), the table on the next page shows the estimates and the corresponding se's.
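The method-of-moments procedure above can be put into code directly. The sketch below (NumPy/SciPy assumed; the function names are mine) simulates one sample from SN(10, 5.5, −4) and applies the moment estimators, returning None when the sample skewness exceeds the attainable maximum:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def rsn(mu, sigma, alpha, n):
    # SN draws via the representation of Property VII
    d = alpha / np.sqrt(1 + alpha**2)
    z = d * np.abs(rng.standard_normal(n)) + np.sqrt(1 - d**2) * rng.standard_normal(n)
    return mu + sigma * z

def sn_mom(x):
    """Method-of-moments estimates (mu, sigma, alpha); None if infeasible."""
    m, s, g1 = x.mean(), x.std(ddof=1), stats.skew(x)
    b = np.sqrt(2 / np.pi)
    r = (2 * abs(g1) / (4 - np.pi)) ** (2 / 3)
    delta = np.sign(g1) * np.sqrt(r / (1 + r)) / b    # solve the skewness equation
    if abs(delta) >= 1:
        return None                # sample skewness beyond the SN range
    sigma = s / np.sqrt(1 - b**2 * delta**2)
    mu = m - sigma * delta * b
    return mu, sigma, delta / np.sqrt(1 - delta**2)

est = sn_mom(rsn(10.0, 5.5, -4.0, 5000))
```

For a single sample of size 5000 the location and scale estimates land near the true values, while the shape estimate is noticeably more variable, in line with the table that follows.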
PARAMETER   ESTIMATE   STANDARD ERROR
µ̂          9.9864     0.0616
σ̂          5.4946     0.0974
α̂          -5.1286    0.0470
II. METHOD OF MAXIMUM LIKELIHOOD :~
 INTRODUCTION: Let y = (y1, y2, …, yn)' be a vector of iid RVs from one of a family of distributions on R indexed by a p-dimensional parameter θ = (θ1, θ2, …, θp)', θ ∈ Θ. Let me denote the df of y by F(y|θ) and assume that the density function f(y|θ) exists. Then the likelihood function of θ is given by
L = ∏_(i=1)^n f(yi|θ).
Let the p partial derivatives of the log-likelihood form the p × 1 vector
u(θ) = ∂ ln L/∂θ = (∂ ln L/∂θ1, …, ∂ ln L/∂θp)'.
The vector u(θ) is called the score vector of the log-likelihood function. The moments of u(θ) satisfy two important identities. First, the expectation of u(θ) with respect to y is equal to zero; second, the variance of u(θ) equals the expectation of the negative of the matrix of second derivatives of ln L(θ), i.e.,
V(u(θ)) = E[(u(θ))(u(θ))'] = −E{(∂² ln L(θ)/∂θj ∂θk)}.
The p × p matrix on the right-hand side is called the expected Fisher information matrix, usually denoted by I(θ). The expectation here is taken over the distribution of y at a fixed value of θ. Under
conditions which allow the operations of integration with respect to y and differentiation with respect to θ to be interchanged, the maximum likelihood estimate of θ is given by the solution θ̂ of the p equations u(θ̂) = 0, and under some regularity conditions the distribution of θ̂ is asymptotically normal with mean θ and variance-covariance matrix given by the p × p matrix I(θ)⁻¹, i.e., the inverse of the expected information matrix. The p × p matrix
J(θ) = −{(∂² ln L(θ)/∂θj ∂θk)},
the negative Hessian of the log-likelihood without the expectation taken, is called the observed information matrix. In practice, since the true value of θ is not known, these two matrices are estimated by substituting the estimated value θ̂, giving I(θ̂) and J(θ̂) respectively. Asymptotically, these forms of the information matrix can be shown to be equivalent.
From a computational standpoint, the above quantities are related to those computed to solve an optimization problem as follows: −ln L(θ) corresponds to the objective function to be minimized; u(θ) represents the gradient vector, the vector of first-order partial derivatives, usually denoted by g; and J(θ) corresponds to the negative of the Hessian matrix H(θ), the matrix of second-order derivatives of the objective function. In the MLE problem, the Hessian matrix is used to determine whether the minimum of the objective function −ln L(θ) is achieved at the solution θ̂ of the equations u(θ) = 0, i.e., whether the stationary point θ̂ is indeed a maximum of ln L(θ). If this is the case, then θ̂ is the maximum likelihood estimate of θ, and the asymptotic covariance matrix of θ̂ is given by the inverse of the negative of the Hessian matrix evaluated at θ̂, which is the same as J(θ̂), the observed information matrix evaluated at θ̂. Sometimes it is easier to use the observed information matrix for estimating the asymptotic covariance matrix of θ̂, since using the expected information I(θ̂) requires the expectation to be evaluated analytically.
However, if computing the derivatives of ln L(θ) in closed form is difficult, or if the optimization procedure does not produce an estimate of the Hessian as a byproduct, estimates of the derivatives obtained using finite difference methods may be substituted for the observed information matrix.
Often, iterative methods are used to compute θ̂, of which the most commonly used modern technique is the Newton-Raphson method. Recall that a stationary point of ln L(θ) is located by iterating
θ̂(i+1) = θ̂(i) − H(θ̂(i))⁻¹ u(θ̂(i)).
Observe that the Hessian needs to be computed and inverted at every step of the iteration. In difficult cases, when the Hessian cannot be evaluated in closed form, it may be replaced by a discrete estimate obtained using finite difference methods, as mentioned above. In either case, computation of the Hessian may end up being a substantial computational burden. When the expected information matrix I(θ) can be derived analytically without too much difficulty, i.e., when the expectation yields closed form expressions for the elements of I(θ) and hence of I(θ)⁻¹, it may be substituted into the above iteration to obtain the modified iteration
θ̂(i+1) = θ̂(i) + I(θ̂(i))⁻¹ u(θ̂(i)).
This saves on the computation of H(θ̂), because functions of the data y are not involved in the computation of I(θ̂(i)) as they are in the computation of H(θ̂), while still providing a sufficiently accurate surrogate for the Hessian to orient the iteration towards the maximum. This procedure is also called the method of scoring.
 PROCEDURE: Let us denote the likelihood function by L. With zi = (xi − µ)/σ, the log-likelihood is
ln L = c − (n/2) ln σ² − (1/2) Σ zi² + Σ ln Φ(αzi).
The score equations are given by
(i) ∂ ln L/∂µ = (1/σ) Σ zi − (α/σ) Σ φ(αzi)/Φ(αzi) = 0
(ii) ∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ (xi − µ)² − (α/(2σ³)) Σ [φ(αzi)/Φ(αzi)] (xi − µ) = 0
(iii) ∂ ln L/∂α = Σ [φ(αzi)/Φ(αzi)] zi = 0
If we let W(xi) = φ(αzi)/Φ(αzi), the maximum likelihood estimates are obtained by solving these three equations numerically. Applying the Newton-Raphson method here in its scoring form requires the Fisher information. Writing b = √(2/π), the Fisher information matrix is
row 1: [ n(1 + α²a0)/σ² ,   nbα(1 + 2α²)/(σ²(1 + α²)^(3/2)) + nα²a1/σ² ,   (n/σ){b/(1 + α²)^(3/2) − αa1} ]
row 2: [ nbα(1 + 2α²)/(σ²(1 + α²)^(3/2)) + nα²a1/σ² ,   n(2 + α²a2)/σ² ,   −nαa2/σ ]
row 3: [ (n/σ){b/(1 + α²)^(3/2) − αa1} ,   −nαa2/σ ,   na2 ]
where
ak = E{ z^k (φ(αz)/Φ(αz))² },   k = 0, 1, 2,
with z = (x − µ)/σ. Thus we get the corresponding estimates. After generating a random sample of size 5000 from SN(10, 5.5, -4), the table below shows the estimates and the corresponding se's.
PARAMETER   ESTIMATE   STANDARD ERROR
µ̂          9.9966     0.0791
σ̂          5.5079     0.0932
α̂          -4.007     0.2292
NOTE: Running the R code shows that both sets of estimates carry some bias; the method of moments estimate of α, however, is markedly off, while the ML estimates are consistent.
Apart from estimating the parameters, we also calculate the median, IQR, QD and Bowley's measure of skewness for the same skew normal parameters, using 5000 randomly generated observations.
MEDIAN             6.706
IQR                5.147
QD                 2.573
BOWLEY'S MEASURE   -0.416
The results show that the median exceeds the mean, the high value of the IQR indicates that the data are quite spread out, and the negative value of Bowley's measure shows that the data are negatively skewed, all of which conforms to our distribution.
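Since the score equations above have no closed-form solution, in practice the log-likelihood is maximized numerically. The sketch below (NumPy/SciPy assumed; it uses a general-purpose Nelder-Mead search rather than the scoring iteration described above) fits SN(µ, σ, α) to a simulated sample:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(6)

def rsn(mu, sigma, alpha, n):
    # SN draws via the representation of Property VII
    d = alpha / np.sqrt(1 + alpha**2)
    z = d * np.abs(rng.standard_normal(n)) + np.sqrt(1 - d**2) * rng.standard_normal(n)
    return mu + sigma * z

x = rsn(10.0, 5.5, -4.0, 5000)

def nll(theta, x):
    mu, log_sigma, alpha = theta
    sigma = np.exp(log_sigma)      # parameterize by log(sigma) to keep it positive
    z = (x - mu) / sigma
    # -ln L = n ln(sigma) + (1/2) sum z^2 - sum ln Phi(alpha z) + const
    return len(x) * log_sigma + 0.5 * np.sum(z**2) - np.sum(stats.norm.logcdf(alpha * z))

# crude starting values: sample mean/sd and the sign of the sample skewness
start = (x.mean(), np.log(x.std()), np.sign(stats.skew(x)))
res = optimize.minimize(nll, start, args=(x,), method="Nelder-Mead")
mu_hat, sigma_hat, alpha_hat = res.x[0], np.exp(res.x[1]), res.x[2]
```

The method-of-moments estimates could equally serve as starting values, as suggested in the note earlier.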
CHAPTER 6: ROBERTS IQ DATA
Arnold et al. (1993) applied the skew normal distribution to a portion of an IQ score data set from Roberts (1988). In this section we expand the application to the full data set. The Roberts IQ data give the Otis IQ scores for 87 white males and 52 non-white males hired by a large insurance company in 1971. The data are given in the following tables:
(Snapshot taken from Roberts (1988).)
To apply the skew normal as a truncated normal, following the motivation of the model given by Arnold et al. (1993), we assume that these individuals were screened with respect to some variable Y, which is unobserved, and that only individuals who scored above average on the screening variable were hired. Let X represent the IQ scores of the individuals hired; X is the unscreened variable, and only this variable is observed. We assume that (X, Y) has a bivariate normal distribution with mean vector (µ1, µ2),
variance vector (σ1², σ2²) and correlation ρ. Therefore the observed IQ scores represent a sample from a nonstandard skew normal distribution. We apply the nonstandard skew normal maximum likelihood estimators to the IQ score sample to estimate the mean and variance of IQ scores for the unscreened population. We now let X1 represent the scores for whites and X2 the scores for non-whites. The two data sets displayed above are analyzed separately, each under the assumptions of normality and of skew normality. The estimates are given in the following tables:
CONCLUSION BASED ON THE GIVEN DATA: For both data sets, under the assumption of normality the mean is overestimated and the standard deviation is underestimated. Using the estimates from above, we first transform the data sets, given in Tables 5.1 and 5.2, to data sets on standard skew normal random variables Z1 and Z2. We then estimate α1' and α2' for the standard skew normal random variables. The resulting estimates are α1' = 1.15 and α2' = 1.84. Simultaneously, we can also carry out a hypothesis test to check for skew normality. To test
H0: Z ∼ SN(µ, σ², α = 0) against H1: Z ∼ SN(µ, σ², α ≠ 0),
the likelihood ratio test statistic is given by
χ = L(µ̂0, σ̂0, α = 0) / L(µ̂, σ̂, α̂),
where L denotes the likelihood function, the numerator is maximized under H0 and the denominator over all parameter values. It can be shown that −2 ln χ ∼ χ1² under H0.
Applying this to the given data, the p-values for the two data sets come out as 0.00129 and 0.01056, both less than the chosen level of significance of 0.05. Hence we conclude that H0 is rejected for both data sets.
OVERALL CONCLUSION: In many real-life applications it has been observed that the unrestricted use of the normal distribution to model data can yield erroneous results. For the Roberts (1988) IQ data analyzed here, the assumption of normality resulted in overestimates of the mean IQ scores in both cases. This is because the scores were obtained by screening on some other, unobserved variable, giving rise to skewness in the data. For this reason, I have used the skew normal distribution.
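The likelihood ratio test above is straightforward to implement; the sketch below (NumPy/SciPy assumed, applied to simulated rather than the Roberts data) computes −2 ln χ and its χ1² p-value:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(7)

def sn_loglik(theta, x):
    # log-likelihood of SN(mu, sigma, alpha); the -n/2 ln(2*pi) constant is
    # dropped since it cancels in the likelihood ratio
    mu, log_sigma, alpha = theta
    sigma = np.exp(log_sigma)
    z = (x - mu) / sigma
    return (-len(x) * log_sigma - 0.5 * np.sum(z**2)
            + np.sum(np.log(2) + stats.norm.logcdf(alpha * z)))

def lrt_skewness(x):
    """LR test of H0: alpha = 0 (normality) against H1: alpha != 0."""
    # under H0 the MLEs are just the sample mean and the ML standard deviation
    l0 = sn_loglik((x.mean(), np.log(x.std()), 0.0), x)
    res = optimize.minimize(lambda t: -sn_loglik(t, x),
                            (x.mean(), np.log(x.std()), 1.0),
                            method="Nelder-Mead")
    stat = 2 * (-res.fun - l0)                 # -2 ln(chi)
    return stat, stats.chi2.sf(stat, df=1)

# clearly skewed sample, SN(alpha = 3): the test should reject H0
d = 3.0 / np.sqrt(10.0)
x_skew = d * np.abs(rng.standard_normal(2000)) + np.sqrt(1 - d**2) * rng.standard_normal(2000)
stat_s, p_s = lrt_skewness(x_skew)

# normal sample: the test should (usually) not reject H0
stat_n, p_n = lrt_skewness(rng.standard_normal(2000))
```

A p-value below the chosen level, as with the Roberts data sets, leads to rejection of normality in favor of the skew normal model.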
CHAPTER 7: APPLICATIONS OF SKEW NORMAL IN VARIOUS FIELDS
The various applications are listed below:
1. STATISTICAL PROCESS CONTROL: The most commonly used standard procedures of Statistical Quality Control (SQC), control charts and acceptance sampling plans, are often implemented under the assumption of normal data, which rarely holds in practice. The analysis of several data sets from diverse areas of application, such as statistical process control (SPC) and reliability, leads us to notice that this type of data usually exhibits moderate to strong asymmetry as well as light or heavy tails. Thus, despite the simplicity and popularity of the Gaussian distribution, we conclude that in most cases fitting a normal distribution to the data is not the best option. Moreover, modeling real data sets, even when we have some potential asymmetric models for the underlying data distribution, is always a very difficult task due to uncontrollable perturbation factors. In these problems, the skew normal distribution comes to the rescue.
2. BIOMEDICAL STUDIES: Another important application concerns continuous longitudinal responses in biomedical studies. The first attempt to use skew normal and related distributions to relax the standard assumption of normality of the random effects in linear mixed models was developed by Ma et al. (2004). Because the likelihood function does not have a closed form in this setting, they proposed inference based on the EM algorithm as well as inference via MCMC simulations in a Bayesian framework. In this approach, the degree of the polynomial involved in the skewing function was chosen by means of model selection criteria. A simulation study indicated that a flexible model for the distribution of the random effects in the linear mixed model results in more efficient estimators of the fixed effects, and also in more efficient estimators of the mean and the variance of the
unobserved random effects. The special case where the distribution of the random effects is skew normal permits a closed form expression for the likelihood function and has been studied by Arellano-Valle et al. (2005a). The use of skew normal and related distributions in the biomedical sciences has been further promoted by Sahu & Dey (2004) for the development of survival models with a skewed frailty, and by Chen (2004) for the construction of skewed link models for categorical response data.
3. GEOLOGICAL STUDIES: The skew normal and related distributions can play a very important role in applications arising in the geosciences. Kim & Mallick (2004) used the skew normal distribution to model spatial data. In the context of data assimilation, Naveau et al. (2004) developed a skewed Kalman filter for the analysis of climatic time series. Specifically, they studied the impact of strong but short-lived perturbations from large explosive volcanic eruptions on climate; the use of skew normal distributions gave a more realistic representation of volcanic forcing than the normal distribution. Genton & Thompson (2003) used skew-elliptical time series to model sea levels and to evaluate the risk of coastal flooding in Charlottetown, Canada.
4. LINEAR MODELS: Another area of interest is the application of the univariate skew normal to linear models. In one of the presidential elections in the US, there was considerable discussion about the measurement error of the machine and hand recounts. In linear models, the error term is assumed to be normal with mean zero. However, if we consider each Florida county separately, with each showing a significant margin of victory for one of the two candidates, then the measurement error will have skewness in favor of the winner. Therefore, it is appropriate to investigate the behavior of a linear model with a skew normal error term. 5.
EVOLUTIONARY GENETICS: Evolutionary algorithms constitute a set of optimization techniques inspired by the idea of biological evolution of a population of organisms which adapt to their surrounding environment via mechanisms of mutation and selection. An algorithm of this type starts by
choosing an initial 'population', formed by a random set of n points in the feasible space of the target function, and making it evolve through successive generations. In the evolution process, the best-performing points are used to breed a new generation via a mutation operator. This step involves the generation of new random points, typically using a multivariate Gaussian distribution. In this framework, Berlik (2006) considered adopting an asymmetric parent distribution instead of a Gaussian one. The main idea of directed mutation is to impart directionality to the search by generating random numbers that lie preferably in the direction where the optimum is presumed to be. Operationally, the SN family provides the sampling distribution, depending on whether we want to keep the mutations in the various components independent or allow for correlation.
6. SATELLITE IMAGING: The sixth application exploits the flexibility of the skew normal distribution to classify the pixels of a remotely sensed satellite image. In most remote sensing packages, for example ENVI and ERDAS, it is assumed that the populations are distributed as a multivariate normal. A linear discriminant function (LDF) or quadratic discriminant function (QDF) is then used to classify the pixels, according to whether the covariance matrices of the populations are assumed equal or unequal, respectively. However, data obtained from satellite or airplane images often suffer from non-normality. In this case, the skew normal discriminant function (SDF) is one technique for obtaining a more accurate image. Comparisons of the SDF with the LDF and QDF show that ignoring the skewness of the data increases the misclassification probability and consequently yields a wrong image. 7.
PSYCHIATRIC MEASURES: Variables arising from instruments designed to assess health status often follow asymmetric and long-tailed distributions, resulting from a majority of healthy individuals with low values and a few individuals with larger values reflecting particular disorders (e.g. screening questionnaires for symptoms and diagnoses). Skewness occurs frequently in such screening settings even where the distributions are otherwise nearly normal; thus it is particularly important to account for the values reflecting disorder whilst preserving the usual normal properties for the general population. To adequately perform statistical analyses, we can rely on empirically chosen transformations (e.g. logarithmic, Box-Cox) to make the data conform to the methods' assumptions. However, it is not always possible to find a suitable transformation, and analyzing data on a different scale might compromise interpretability. This means that, rather than the properties of the dataset informing the statistical analyses, inappropriate or non-optimal methods in which the data are not fully exploited are often used, with assumption violations being accepted as an inevitable nuisance. This problem is compounded when skew normal distributions are not taken into account.
The references used in the text are given below:
1. Azzalini, A., The Skew-Normal and Related Families.
2. Pourahmadi, M. (Northern Illinois University), Construction of Skew-Normal Random Variables: Are They Linear Combinations of Normal and Half-Normal?
3. Brown, N. D., Reliability Studies of the Skew Normal Distribution.
4. www.wikipedia.org
5. Figueiredo, F. (CEAUL and Faculdade de Economia da Universidade do Porto, Portugal) and Gomes, M. I. (Universidade de Lisboa, FCUL, DEIO and CEAUL, Portugal), The Skew-Normal Distribution in SPC.
I would like to express my special thanks and gratitude to Mr. Debjit Sengupta, my project guide, who came up with such a wonderful project on the topic "Analysis of Skew-Normal Distributions". This not only led me to do a great deal of research on a unique topic but also acquainted me with various new horizons in statistics. He made this project, which consumed a huge amount of hard work, research and dedication, a grand success. I would also like to thank my parents and friends, who helped me a lot in finalizing this project.