Missing data handling

Location:
Boston Data Festival
September 23rd 2016
What’s Missing? Methods in missing data analysis
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com

Slides and code
• Will be posted on the QuantUniversity Meetup page.
• If you are not a member, sign up here:
https://www.meetup.com/QuantUniversity-Meetup/

- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits

• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial
Analytics
• Prior Experience at MathWorks, Citigroup
and Endeca and 25+ financial services and
energy customers
• Regular Columnist for the Wilmott
Magazine
• Author of the forthcoming book
“Financial Modeling: A case study
approach” published by Wiley
• Chartered Financial Analyst and Certified
Analytics Professional
• Teaches Analytics in the Babson College
MBA program and at Northeastern
University, Boston
Sri Krishnamurthy
Founder and CEO

Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data
Science and Big Data Technologies
using MATLAB, Python and R
• Launching the Analytics Certificate
Program in 2016

(MATLAB version also available)

www.analyticscertificate.com/SparkWorkshop
Early bird
pricing ending
today!

 Definition
 Assumptions
 Work flow

What does a missing data problem look like?

Missing data
• Dealing with missing data has always been a challenge in data analysis.
• We need methods in missing data analysis that:
▫ Minimize bias
▫ Maximize use of available information, and
▫ Give good estimates of uncertainty, e.g., p-values and confidence intervals
[Figure: a data matrix of records (Rec No 1 to 4) by variables (1 to n), marking unit non-response, item non-response, missing data, and an unobserved/latent variable]

Assumptions (MCAR, MAR, MNAR)
• When values are missing completely at random (MCAR), the probability of
missingness is unrelated to the values of any variable
• Data are missing at random (MAR) if, given the values of other observed
variables, missingness is unrelated to the value that is missing
• E.g., a question about the typical number of hours spent browsing the internet
might be missing more often for married than unmarried participants, BUT
among the married subset, missingness is completely random: it is not related
to how many hours the person browses
• Data are missing not at random (MNAR) if missingness is related to the value
that is missing, and often to the values of other variables as well
• E.g. missing values are more prevalent among those who typically browse
more than among those who browse less
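
To make the three assumptions concrete, the sketch below simulates each mechanism on synthetic data. Everything in it is an assumption made for illustration (the sample size, the missingness rates, the normal distribution of browsing hours, and the independence of hours from marital status); none of it is data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: marital status and daily browsing hours (independent here).
married = rng.random(n) < 0.5
hours = rng.normal(loc=3.0, scale=1.0, size=n)

# MCAR: every answer has the same 20% chance of being missing.
mcar = rng.random(n) < 0.20
# MAR: missingness depends only on an observed variable (marital status).
mar = rng.random(n) < np.where(married, 0.40, 0.10)
# MNAR: missingness depends on the unobserved value itself (heavy browsers skip).
mnar = rng.random(n) < np.where(hours > 3.0, 0.40, 0.10)

true_mean = hours.mean()
observed_means = {
    "MCAR": hours[~mcar].mean(),
    "MAR": hours[~mar].mean(),
    "MNAR": hours[~mnar].mean(),
}
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.2f} vs true mean {true_mean:.2f}")
```

In this setup only the MNAR mask should pull the observed mean noticeably below the true mean, because the heavy browsers are the ones who drop out.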

Methods of handling missing data
• Deletion methods: Delete cases or variables that are missing
▫ Listwise deletion
▫ Pairwise deletion
▫ Variable deletion
• Imputation methods: Substitution methods
▫ Single imputation
 Mean imputation
 Conditional mean imputation
 Case mean imputation
 Regression imputation
 Last observation carried forward
 Worst case imputation
 Best case imputation
 EM imputation
▫ Multiple imputation

Listwise deletion
• A reasonable method when the proportion of missing data is small (a common rule of thumb is less than 15%).
• Advantages:
▫ It can be used for any type of statistical analysis.
▫ No special computations are required.
▫ The parameter estimates are unbiased when the data are MCAR.
▫ The standard errors are appropriate for the reduced sample, though larger than they would be with the complete data.
• Disadvantages:
▫ May remove a considerable fraction of the data
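
As a minimal sketch of listwise deletion, assuming pandas is available: the table below is a toy example in the style of this deck's Var1/Var2/Var3 slides, with two values set to missing for illustration. `DataFrame.dropna()` with default arguments drops every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy table (illustrative): Var1 is missing for case 2, Var3 for case 9.
df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Listwise deletion: keep only the rows with no missing values at all.
complete_cases = df.dropna()
fraction_lost = 1 - len(complete_cases) / len(df)

print(complete_cases)
print(f"Rows lost: {fraction_lost:.0%}")
```

Two missing cells cost two whole rows here, which is the disadvantage noted above: every observed value in those rows is discarded too.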

Pairwise deletion
• Pairwise deletion involves dropping cases with missing values on an analysis-by-analysis basis
• Advantages:
▫ Uses all available non-missing data
• Disadvantages:
▫ Estimated standard errors and test statistics can be biased, because different statistics are computed on different subsets of cases
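
A short sketch of pairwise deletion, again on an illustrative toy table. It relies on the fact that pandas `DataFrame.corr()` excludes missing values pair by pair, so each correlation uses every case where both variables are observed. The `pair_counts` table is a helper added here to show how many cases each pair actually uses.

```python
import numpy as np
import pandas as pd

# Toy table (illustrative): Var1 is missing for case 2, Var3 for case 9.
df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Each correlation is computed on the cases observed for that pair of variables.
pairwise_corr = df.corr()

# Number of cases available for each pair of variables.
observed = df.notna().astype(int)
pair_counts = observed.T.dot(observed)

print(pairwise_corr.round(3))
print(pair_counts)
```

The counts differ from pair to pair, which is why standard errors and test statistics built from such a matrix need care.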

Variable deletion
• Variable deletion involves dropping variables (columns) that have missing values, decided on a case-by-case basis
• Advantages:
▫ Makes sense when a variable has a lot of missing values and is of relatively low importance
• Disadvantages:
▫ Loss of information regarding the variable
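Mean imputation
• Replace missing values with the mean of that variable
Case  Var1  Var2  Var3
1     9     8     8
2     7.44  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7
(Case 2, Var1 was missing; 7.44 is the mean of the nine observed Var1 values. "?" marks a value that is still missing.)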
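
A minimal pandas sketch of mean imputation, using the Var1 column from the table above with the case 2 value treated as missing. It assumes pandas is available; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Var1 from the slide, with case 2 missing.
var1 = pd.Series(
    [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
    index=pd.RangeIndex(1, 11, name="Case"),
    name="Var1",
)

# Series.mean() skips missing values, so this is the mean of the observed cases.
imputed = var1.fillna(var1.mean())

print(round(imputed.loc[2], 2))
```

Because every missing cell receives the same value, mean imputation leaves the variable's mean unchanged but shrinks its variance.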

Conditional mean imputation
• Replace missing values with the mean of the variable for a relevant subgroup
Case  Var1  Sex  Var2  Var3
1     9     F    8     8
2     8.25  F    7     6
3     8     F    5     6
4     7     F    4     5
5     9     F    5     7
6     8     M    8     9
7     6     M    7     6
8     5     M    9     7
9     7     M    8     ?
10    8     M    8     7
(Case 2, Var1 was missing; 8.25 is the mean of Var1 among the other cases with Sex = F.)
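
The subgroup version can be sketched with a pandas group-wise transform. The data are the Var1 and Sex columns from the table above, with the case 2 value treated as missing; the approach shown is one reasonable way to do it, not necessarily the one used in the talk.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Sex": ["F"] * 5 + ["M"] * 5,
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Mean of Var1 within each Sex group, aligned row by row with df.
group_mean = df.groupby("Sex")["Var1"].transform("mean")

# Fill each missing value with the mean of its own subgroup.
df["Var1_imputed"] = df["Var1"].fillna(group_mean)

print(df.loc[2, "Var1_imputed"])
```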

Case mean imputation
• Replace missing values using information from the other variables for the same case
Case  Var1  Var2  Var3
1     9     8     8
2     6.50  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7
(Case 2, Var1 was missing; 6.50 is the mean of that case's Var2 and Var3 values, 7 and 6.)
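
Case mean imputation can be sketched as a row-wise mean over the other variables. This assumes the variables are on a comparable scale (for example, items of the same questionnaire), which the slide does not state explicitly.

```python
import numpy as np
import pandas as pd

# Slide data with Var1 missing for case 2 (Var3 for case 9 is left missing).
df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Mean of the other variables for each case (missing values are skipped).
row_mean = df[["Var2", "Var3"]].mean(axis=1)
df["Var1_imputed"] = df["Var1"].fillna(row_mean)

print(df.loc[2, "Var1_imputed"])
```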

Regression imputation
• Replace missing values using information from complete cases to “predict” the value of the missing data, based on a regression equation fitted to the cases with nonmissing values
Case  Var1  Var2  Var3
1     9     8     8
2     6.32  7     6
3     8     5     6
4     7     4     5
5     9     5     7
6     8     8     9
7     6     7     6
8     5     9     7
9     7     8     ?
10    8     8     7
VAR1′ = 4.621 - (0.734 * VAR2) + (1.139 * VAR3)
(For case 2: 4.621 - 0.734 * 7 + 1.139 * 6 ≈ 6.32.)
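
The regression step can be sketched with ordinary least squares on the complete cases. The code below assumes the slide's table with Var1 missing for case 2 and Var3 missing for case 9, and uses `numpy.linalg.lstsq` as one possible fitting routine; the talk does not say which tool produced the equation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Var1": [9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8],
        "Var2": [8, 7, 5, 4, 5, 8, 7, 9, 8, 8],
        "Var3": [8, 6, 6, 5, 7, 9, 6, 7, np.nan, 7],
    },
    index=pd.RangeIndex(1, 11, name="Case"),
)

# Fit Var1 ~ intercept + Var2 + Var3 on the rows with no missing values.
complete = df.dropna()
X = np.column_stack([np.ones(len(complete)), complete["Var2"], complete["Var3"]])
coef, *_ = np.linalg.lstsq(X, complete["Var1"].to_numpy(), rcond=None)
intercept, b_var2, b_var3 = coef

# Predict Var1 where it is missing and the predictors are observed.
to_fill = df["Var1"].isna()
df.loc[to_fill, "Var1"] = (
    intercept + b_var2 * df.loc[to_fill, "Var2"] + b_var3 * df.loc[to_fill, "Var3"]
)

print(f"Var1' = {intercept:.3f} + ({b_var2:.3f} * Var2) + ({b_var3:.3f} * Var3)")
print(round(df.loc[2, "Var1"], 2))
```

Fitted on the eight complete cases, the coefficients should agree with the equation on the slide up to rounding.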

Last observation carried forward
• Imputes the missing value with the value of the same outcome the most recent time it was observed
• Variants:
▫ Average of T1 and T2
Case  T1  T2  T3
1     9   8   8
2     ?   7   6
3     8   5   6
4     7   4   5
5     9   5   7
6     8   8   9
7     6   7   6
8     5   9   7
9     7   8   8
10    8   8   7
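
A small sketch of last observation carried forward on wide-format data, using pandas `DataFrame.ffill(axis=1)` to fill along the time axis. The three cases below are hypothetical and were chosen to show both a gap that can be filled and one that cannot.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: one row per case, one column per time point.
visits = pd.DataFrame(
    {
        "T1": [9.0, np.nan, 8.0],
        "T2": [8.0, 7.0, 5.0],
        "T3": [8.0, 6.0, np.nan],
    },
    index=pd.Index([1, 2, 3], name="Case"),
)

# Carry the most recent observed value forward within each row.
locf = visits.ffill(axis=1)

print(locf)
```

Case 3's missing T3 is filled with its T2 value, while case 2's missing T1 stays missing because no earlier observation exists to carry forward.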

Interpolation
• Use interpolation to fill in missing values
• Useful for longitudinal datasets
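
For a longitudinal series, a minimal sketch with pandas `Series.interpolate` looks like this. The measurements are hypothetical, and linear interpolation with equally spaced time points is an assumption; other interpolation methods exist.

```python
import numpy as np
import pandas as pd

# Hypothetical outcome measured at six equally spaced time points.
series = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
    index=pd.RangeIndex(1, 7, name="T"),
)

# Linear interpolation between the nearest observed neighbours.
filled = series.interpolate(method="linear")

print(filled)
```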

Worst case and best case imputation
• Worst case replaces a missing value with the worst case scenario for a categorical outcome
• Best case replaces a missing value with the best case scenario for a categorical outcome
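
Used together, the two imputations bracket a result. The sketch below assumes a hypothetical binary outcome coded 1 = success and 0 = failure, so the worst case fills missing values with 0 and the best case fills them with 1.

```python
import numpy as np
import pandas as pd

# Hypothetical binary outcome with two missing responses.
outcome = pd.Series([1, 0, np.nan, 1, np.nan, 1, 0, 1], name="success")

worst_case = outcome.fillna(0)   # assume every missing case failed
best_case = outcome.fillna(1)    # assume every missing case succeeded

lower, upper = worst_case.mean(), best_case.mean()
print(f"Success rate lies between {lower:.2f} and {upper:.2f}")
```

If a conclusion holds at both ends of this range, it does not depend on what the missing values actually were.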

Expectation-Maximization
• Substitute missing values using maximum likelihood (ML) estimates
• In the E-step, expected values of the missing data are calculated from the observed data and the current parameter estimates
• In the M-step, the procedure imputes the expected values from the E-step and then maximizes the likelihood function to obtain new parameter estimates
• The two steps are repeated until the parameter estimates converge
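
The E-step and M-step can be sketched for a simple case. The code below assumes a bivariate normal model in which x is fully observed and y has missing values; that model, the function name `em_bivariate_normal`, and the use of Var2 as x and Var1 as y are all choices made here for illustration, since the slide does not specify a model.

```python
import numpy as np

def em_bivariate_normal(x, y, max_iter=500, tol=1e-12):
    """EM for a bivariate normal where x is complete and y contains NaNs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)

    # x is fully observed, so its ML estimates never change.
    mu_x, s_xx = x.mean(), x.var()

    # Starting values taken from the complete cases.
    mu_y, s_yy = y[~miss].mean(), y[~miss].var()
    s_xy = np.mean((x[~miss] - x[~miss].mean()) * (y[~miss] - mu_y))

    for _ in range(max_iter):
        # E-step: expected y and y^2 for missing cases, given x and current estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        ey = np.where(miss, mu_y + slope * (x - mu_x), y)
        ey2 = np.where(miss, ey**2 + resid_var, y**2)

        # M-step: ML estimates computed from the expected sufficient statistics.
        new_mu_y = ey.mean()
        new_s_yy = ey2.mean() - new_mu_y**2
        new_s_xy = np.mean(x * ey) - mu_x * new_mu_y

        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy), abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break

    # Fill missing y with its conditional expectation at the final estimates.
    y_filled = np.where(miss, mu_y + (s_xy / s_xx) * (x - mu_x), y)
    return {"mu_y": mu_y, "s_yy": s_yy, "s_xy": s_xy, "y_filled": y_filled}

x = np.array([8, 7, 5, 4, 5, 8, 7, 9, 8, 8], dtype=float)       # Var2, complete
y = np.array([9, np.nan, 8, 7, 9, 8, 6, 5, 7, 8], dtype=float)  # Var1, case 2 missing
result = em_bivariate_normal(x, y)
print(round(result["mu_y"], 3), np.round(result["y_filled"], 2))
```

Note that the E-step adds the residual variance to the expected square of each missing value, so the variance estimate is not shrunk the way it is under plain mean or regression imputation.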

Multiple imputation
• Multiple imputation is quickly becoming the “gold standard” approach to handling missing values
• Computationally complex

Summary
We have covered missing data:
• Introduction: missing data definition, assumptions and work flow
• Deletion methods
▫ Listwise deletion
▫ Pairwise deletion
▫ Variable deletion
• Imputation methods
▫ Single imputation
 Mean imputation
 Conditional mean imputation
 Case mean imputation
 Regression imputation
 Last observation carried forward
 Interpolation
 Worst case imputation
 Best case imputation
 EM imputation
▫ Multiple imputation

Thank you!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
srikrishnamurthy
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and
shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
