Statistical Modeling in 3D: Describing, Explaining and Predicting
1. Statistical Modeling in 3D: Describing, Explaining, Predicting
University of Padova, Jun 15, 2018
Galit Shmueli
Institute of Service Science
2. My Academic Path
1997-2000 (PhD, Statistics): Israel Institute of Technology, Faculty of IE & M
2000-2002: Carnegie Mellon Univ., Department of Statistics
2002-2012: Univ. of Maryland, Smith School of Business
2011-2014: Indian School of Business, Hyderabad, India
2014-…: National Tsing Hua Univ., Institute of Service Science
My Research
"Entrepreneurial" statistical & data mining modeling
Interdisciplinary
Statistical Strategy
• To Explain or To Predict?
• Information Quality
• Data Mining for Causality
• Predicting with Causal Models
3. Road Map
1. Definitions
2. Monopolies & confusion in academia & industry
3. Explanatory, predictive, descriptive modeling & evaluation are different
   Why? Different modeling paths; explanatory power vs. predictive power
4. Where next?
9. Social sciences & management research
Domination of "Explain"
Purpose: test causal theory ("explain")
Association-based statistical models
Prediction & description nearly absent
10. Start with a causal theory
Generate causal hypotheses on constructs
Operationalize constructs → measurable variables
Fit statistical model
Statistical inference → causal conclusions
Classic journal paper
11. Misconception #1: the same model is best for explaining, describing, predicting
Social Sci & Mgmt: build an explanatory model and use it to "predict"
"A good explanatory model will also predict well"
"You must understand the underlying causes in order to predict"
"To examine the predictive power of the proposed model, we compare it to four models in terms of R2 adjusted"
12. Misconception #1: the same model is best for explaining, describing, predicting
CS / stat / engineering / industry: build a predictive model and use it to "explain"
2014 6th International Conference on Mobile Computing, Applications and Services (agent-based modeling using census data):
"our model is able to provide both predictions of how the population may vote and why they are voting this way"…
2009 IEEE International Conference on Systems, Man and Cybernetics
13. Misconception #2: explain > predict or predict > explain
Emanuel Parzen, Comment on "Statistical Modeling: The Two Cultures", Statistical Science 2001
"Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all"
*Chris Anderson is the editor in chief of Wired
15. Philosophy of Science
"Explanation and prediction have the same logical structure"
Hempel & Oppenheim, 1948
"It becomes pertinent to investigate the possibilities of predictive procedures autonomous of those used for explanation"
Helmer & Rescher, 1959
"Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding"
Dubin, Theory Building, 1969
17. Different Scientific Goals, Different Generalization
Explanatory Model: test/quantify causal effect between constructs for "average" unit in population
Descriptive Model: test/quantify distribution or correlation structure for measured "average" unit in population
Predictive Model: predict values for new/future individual units
19. Notation
Theoretical constructs: X, Y
Causal theoretical model: Y = F(X)
Measurable variables: X, Y
Statistical model: E(Y) = f(X)
Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, 2001
20. Five aspects to consider (explain/describe vs. predict)
Theory vs. Data
Causation vs. Association
Retrospective vs. Prospective
Bias vs. Variance
Average unit vs. Individual unit
21. "The goal of finding models that are predictively accurate differs from the goal of finding models that are true."
23. Example: Regression Model for Explanation
yi|xi = b0 + b1 xi + b2 xcontrols + ei
b1: parameter of interest (inference)
xcontrols: chosen to avoid Omitted Variable Bias (better to over-specify)
x, y: measures of the X, Y constructs
Underlying model: X → Y
Danger: endogeneity
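A minimal numpy sketch of the explanatory setup, on hypothetical simulated data (the variable names, true coefficients, and seed are illustrative assumptions, not from the talk): including the control recovers the causal coefficient, while omitting it produces omitted-variable bias.

```python
import numpy as np

# Hypothetical simulated data: true causal effect of x on y is 2.0
rng = np.random.default_rng(0)
n = 1000
control = rng.normal(size=n)             # confounder: drives both x and y
x = 0.8 * control + rng.normal(size=n)   # measure of the X construct
y = 2.0 * x + 1.5 * control + rng.normal(size=n)

# Over-specified model [1, x, control]: b1 estimates the causal effect
X_full = np.column_stack([np.ones(n), x, control])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Under-specified model [1, x]: omitting the control biases b1 upward here
X_omit = np.column_stack([np.ones(n), x])
b_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

print(b_full[1])  # close to the true effect 2.0
print(b_omit[1])  # biased: roughly 2 + 1.5 * cov(x, control) / var(x)
```

This is why the slide recommends over-specifying: the cost of an extra control is some variance, while omitting a confounder distorts the inference itself.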
24. Example: Regression Model for Description
yi|xi = b0 + b1 x1i + … + bp xpi + ei
b1, …, bp: parameters of interest (inference)
Predictors chosen because related to Y; danger: multicollinearity
All variables treated/interpreted as observable
Variables remain in the model only if statistically significant
Residual analysis for goodness-of-fit & testing assumptions
25. Example: Regression Model for Prediction
yi|xi = b0 + b1 x1i + … + bp xpi + ei
y: quantity of interest for new i's (prediction)
Predictors chosen because possibly correlated with Y; danger: over-fitting
All variables treated as observable and available at time of prediction
Variables retained only if they improve out-of-sample prediction
Evaluate overfitting (training vs. holdout)
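The predictive retention rule can be sketched with hypothetical simulated data (all names, sizes, and the seed are illustrative assumptions): a candidate predictor is kept only if it lowers the holdout RMSE, not because it is "significant" in-sample.

```python
import numpy as np

# Hypothetical data: x1 drives y; `extra` is pure noise
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
extra = rng.normal(size=n)               # candidate predictor, unrelated to y
y = 3.0 * x1 + rng.normal(size=n)

train, hold = slice(0, 150), slice(150, None)   # simple train/holdout split

def holdout_rmse(cols):
    """Fit by least squares on training rows, report RMSE on holdout rows."""
    X = np.column_stack([np.ones(n)] + cols)
    b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return np.sqrt(np.mean((y[hold] - X[hold] @ b) ** 2))

rmse_base = holdout_rmse([x1])
rmse_extra = holdout_rmse([x1, extra])
# Retain `extra` only if it lowers holdout RMSE; being noise, it usually doesn't
print(rmse_base, rmse_extra)
```

The design choice matches the slide: out-of-sample error, not in-sample fit, is the selection criterion for a predictive model.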
27. Predict ≠ Explain
"we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However… they could not help at all for improving the [predictive] accuracy."
Bell et al., 2008
28. Predict ≠ Describe
Election Polls
"There is a subtle, but important, difference between reflecting current public sentiment and predicting the results of an election. Surveys have focused largely on the former… [as opposed to] survey-based prediction models [that are] focused entirely on analysis and projection"
Kenett, Pfefferman & Steinberg (2017), "Election Polls – A Survey, A Critique, and Proposals", Annual Rev of Stat & its Applications
30. Study Design & Data Collection
Observational or experiment?
Primary or secondary data?
Instrument (reliability + validity vs. measurement accuracy)
How much data? How to sample?
Multilevel (nested) data (School → Class → Student):
explain/describe: increase the number of groups
predict: increase group size
35. Evaluation, Validation & Model Selection
Prediction: training data → statistical model → holdout data; predictive power, over-fitting analysis
Explanation: theoretical model ↔ statistical model ↔ data; validation, model fit → explanatory power
36. Point #2: Cannot infer one from the others
explanatory power vs. predictive power vs. descriptive power
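A toy illustration of this point on hypothetical simulated data (the linear vs. degree-10 polynomial comparison and all numbers are my assumptions, not from the talk): the over-fit model always attains at least as high an in-sample R², yet typically predicts new observations worse.

```python
import numpy as np

# Hypothetical data: the true relation is linear
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=60)
y = x + rng.normal(scale=0.5, size=60)

xt, yt = x[:40], y[:40]   # training sample
xh, yh = x[40:], y[40:]   # holdout sample

def fit_eval(degree):
    """In-sample R^2 (explanatory-style fit) and holdout RMSE (predictive power)."""
    coef = np.polyfit(xt, yt, degree)
    r2 = 1 - np.sum((yt - np.polyval(coef, xt)) ** 2) / np.sum((yt - yt.mean()) ** 2)
    rmse = np.sqrt(np.mean((yh - np.polyval(coef, xh)) ** 2))
    return r2, rmse

r2_lin, rmse_lin = fit_eval(1)      # simple model
r2_poly, rmse_poly = fit_eval(10)   # over-fit model
# The degree-10 fit always has at least as high in-sample R^2 (nested models),
# yet typically predicts holdout points worse than the linear fit
print(r2_lin, rmse_lin, r2_poly, rmse_poly)
```

Ranking models by in-sample fit and by holdout error can thus disagree, which is exactly why one power cannot be inferred from another.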
41. Currently in Academia (social sciences, management)
• Theory-based explanatory modeling
• Prediction underappreciated
• Distinction blurred
• Unfamiliar with predictive modeling (getting better)
How/why use prediction (predictive models + evaluation) for scientific research, beyond project-specific solution/utility/profit?
42. The predictive power of an explanatory/descriptive model has important scientific value: relevance, reality check, predictability
43. Prediction for Scientific Research
Generate new theory
Develop measures
Compare theories
Improve theory
Assess relevance
Evaluate predictability
Shmueli & Koppius, "Predictive Analytics in Information Systems Research", MIS Quarterly, 2011
44. Currently in Industry (and machine learning)
• Data-driven predictive modeling
• Prediction over-appreciated
• Distinction blurred
• A-B testing
• Unfamiliar with theory-based explanatory modeling
Predictive question: Will the customer pay?
Explanatory question: What causes non-payment?