Actuarial Analytics in R
1. Actuarial Science as Data Science
Actuarial Modeling in R
Revolution Analytics Webinar
Jim Guszcza, FCAS, MAAA
Deloitte Consulting LLP
University of Wisconsin-Madison
March 28, 2012
2. About Your Presenter
• James Guszcza, PhD, FCAS, MAAA
• National Predictive Analytics Lead – Deloitte Consulting Actuarial, Risk, Analytics practice
• Assistant professor of actuarial science & risk management – U. Wisconsin-Madison
• PhD in Philosophy – The University of Chicago
• Fellow of the Casualty Actuarial Society
• Extensive experience building predictive models and analyzing data in and outside of insurance
jguszcza@deloitte.com
jguszcza@bus.wisc.edu
2 Deloitte Analytics Institute © 2011 Deloitte LLP
3. Agenda
Introduction
Actuarial Science and Data Science
R Background
Case Studies
• Fitting a complex size of loss model
• Loss Reserving
• Bayesian Hierarchical Modeling
• Revolution: Tweedie Regression on big data
5. Not Just Hype
“Perhaps the most important cultural trend today: The
explosion of data about every aspect of our world and
the rise of applied math gurus who know how to use it.”
-- Chris Anderson, editor-in-chief of Wired
• So behavioral economics is important in insurance for two
classes of reasons:
• Decision-makers at insurance companies are human
• People making insurance purchasing decisions are human
6. Brave New World With Such Algorithms In IT
• The analysis of data affects:
• What we buy
• What we read
• What we watch
• How we network
• How we socialize
• The opinions we form
• Whom we date and marry!
7. Clinical vs Actuarial Judgment – the Motion Picture
8. Analytics Everywhere
• Neural net models are used to predict movie box-office returns based on
features of their scripts
• Decision tree models are used to help ER doctors better triage patients
complaining of chest pain.
• Predictive models are used to predict the price of different wine vintages
based on variables about the growing season.
• Predictive models to help commercial insurance underwriters better select
and price risks.
• Predict which non-custodial parents are at highest risk of falling into
arrears on their child support.
• Predicting which job candidates will successfully make it through the
interviewing / recruiting process… and which candidates will subsequently
retain and perform well on the job.
• Predicting which doctors are at highest risk of being sued for malpractice.
• Predicting the ultimate severity of injury claims.
(Deloitte applications in green)
9. At the Center of It All: Data Science
Or: “The Collision between Statistics and Computation”
• Today the analytics world is different largely due to exponential growth in computing power.
• The skill set underlying business analytics is increasingly called data science.
• Data science goes beyond:
• Traditional statistics
• Business intelligence [BI]
• Information technology
Image borrowed from Drew Conway’s blog
http://www.dataists.com/2010/09/the-data-science-venn-diagram
10. Where Do We Want to Be?
•Here?
Image borrowed from Drew Conway’s blog
http://www.dataists.com/2010/09/the-data-science-venn-diagram
11. Where Do We Want to Be?
•Or Here?
Image borrowed from Drew Conway’s blog
http://www.dataists.com/2010/09/the-data-science-venn-diagram
12. On then, on to R
14. R Overview
R is an open-source, object-oriented statistical programming language.
In the past decade, it has become the global lingua franca of statistics.
• History:
• R is based on the S statistical programming language developed by
John Chambers at Bell Labs in the 1980s
• R is an open-source implementation of the S language
• Developed by Robert Gentleman and Ross Ihaka at U Auckland
• Revolution R is a commercially supported, scalable implementation
of R, with parallel processing and big data capabilities
• Features:
• R is an interactive, object-oriented programming environment
• R has advanced graphical capabilities
• Statisticians around the world contribute add-on packages
15. On the Shoulders of Giants
• … therefore prominent people tend to say things like this:
http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all
16. Facets of R
• In a recent article John Chambers discussed 6 “Facets of R”
1. An interface to computational procedures of many kinds
2. Interactive, hands-on in real time
3. Functional in its model of programming
4. Object-oriented, “everything is an object”
5. Modular, built from standardized pieces
6. Collaborative, a world-wide, open-source effort
• Interactive interface: Chambers was influenced by APL
• In the days before spreadsheets, APL was very popular in the actuarial
community
• One of the rare interactive scientific computing environments
• Gives user ability to express novel computations
• Heavy emphasis on matrices and arrays
• But: unlike R, APL had no interface to procedures
17. A Network ExteRnality
• Hal Varian’s “giant” has grown at an exponential rate.
• The open-source nature of R has encouraged top researchers from around the world to contribute new, often highly advanced, packages.
• Result: a powerful “network effect”.
• The value of a product increases as more people use it.
• R has become something like the Wikipedia of the statistics world.
18. Adoption in the Actuarial World
19. Free from Frees
• Jed Frees at the University of Wisconsin-Madison has made R integral to
his new book on regression and time series. He maintains a nice website
containing R instructions, data, and code.
http://instruction.bus.wisc.edu/jfrees/jfreesbooks/Regression%20Modeling/BookWebDec2010/learnR.html
21. Some Everyday Uses of R
• Free-form Exploratory Data Analysis
• ad hoc data munging, data visualizations, fitting simple models on the fly
• Loss models (“exam 4/C”)
• Unsupervised Learning
• Correlation analysis, principal component / factor analysis, variable clustering,
k-means and hierarchical clustering, self-organizing maps, association rules
(aka “market basket analysis”), Latent Dirichlet Allocation
• Supervised Learning
• “statistics paradigm”: GLM, Multilevel/Hierarchical models, quantile
regression
• “machine learning paradigm”: CART, MARS, Random Forests, Neural
Networks, Support Vector Machines
• Bayesian data analysis (MCMC simulation), causal analysis
• Optimization
23. Modeling a Non-Trivial Loss Distribution
• A typical actuarial problem: modeling a highly skewed and ambiguous loss distribution
• Traditional medium of analysis: spreadsheets.
• Why limit ourselves?
[Figure: density plot of the loss distribution; x-axis: loss, from 0 to 5e+06]
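As a starting point for moving a size-of-loss fit out of a spreadsheet, here is a minimal sketch that fits a single lognormal by maximum likelihood. It is not the model fitted in the webinar, which the slide does not specify. The function names, the simulated sample and its parameters (12.0, 1.5) are all assumptions made for illustration.

```python
import math
import random

def fit_lognormal(losses):
    """Maximum-likelihood estimates of (mu, sigma) for a lognormal:
    the mean and (population) standard deviation of the log losses."""
    logs = [math.log(x) for x in losses]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def lognormal_mean(mu, sigma):
    """Expected loss implied by the fitted lognormal."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical data: simulated skewed losses standing in for real claims.
random.seed(2012)
sample = [random.lognormvariate(12.0, 1.5) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal(sample)
print(mu_hat, sigma_hat, lognormal_mean(mu_hat, sigma_hat))
```

A single lognormal is rarely adequate for a genuinely ambiguous loss distribution; mixtures or spliced models are a natural next step.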
25. Three Approaches to Loss Reserving
• A garden-variety loss triangle:
Cumulative Losses in 1000's
AY            premium     12     24     36     48     60     72     84     96    108    120  CL Ult  CL LR  CL res
1988            2,609    404    986  1,342  1,582  1,736  1,833  1,907  1,967  2,006  2,036   2,036   0.78       0
1989            2,694    387    964  1,336  1,580  1,726  1,823  1,903  1,949  1,987          2,017   0.75      29
1990            2,594    421  1,037  1,401  1,604  1,729  1,821  1,878  1,919                 1,986   0.77      67
1991            2,609    338    753  1,029  1,195  1,326  1,395  1,446                        1,535   0.59      89
1992            2,077    257    569    754    892    958  1,007                               1,110   0.53     103
1993            1,703    193    423    589    661    713                                        828   0.49     115
1994            1,438    142    361    463    533                                               675   0.47     142
1995            1,093    160    312    408                                                      601   0.55     193
1996            1,012    131    352                                                             702   0.69     350
1997              976    122                                                                    576   0.59     454
chain link               2.365  1.354  1.164  1.090  1.054  1.038  1.026  1.020  1.015  1.000  12,067           1,543
chain ldf                4.720  1.996  1.473  1.266  1.162  1.102  1.062  1.035  1.015  1.000
growth curve             21.2%  50.1%  67.9%  79.0%  86.1%  90.7%  94.2%  96.6%  98.5% 100.0%
• Let’s use R to forecast outstanding losses using three methods:
• Replicate the above chain-ladder spreadsheet calculation – easy!
• Use the Over-dispersed Poisson GLM model
• Longitudinal data analysis using growth curves
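The first of the three methods, the chain-ladder calculation, can be sketched in a few lines. The webinar does this in R; the sketch below uses Python with only the standard library, and the function names are illustrative. It assumes volume-weighted link ratios and no tail factor beyond 120 months, which matches the 1.000 tail shown in the triangle. Because the input losses are rounded to thousands, results agree with the spreadsheet only to rounding.

```python
# Cumulative losses (in 1000's) from the triangle on this slide, one row per
# accident year 1988-1997, at development ages 12, 24, ..., 120 months.
triangle = [
    [404, 986, 1342, 1582, 1736, 1833, 1907, 1967, 2006, 2036],
    [387, 964, 1336, 1580, 1726, 1823, 1903, 1949, 1987],
    [421, 1037, 1401, 1604, 1729, 1821, 1878, 1919],
    [338, 753, 1029, 1195, 1326, 1395, 1446],
    [257, 569, 754, 892, 958, 1007],
    [193, 423, 589, 661, 713],
    [142, 361, 463, 533],
    [160, 312, 408],
    [131, 352],
    [122],
]

def link_ratios(tri):
    """Volume-weighted age-to-age factors: sum of column k+1 over sum of
    column k, using only accident years observed at both ages."""
    n_ages = len(tri[0])
    ratios = []
    for k in range(n_ages - 1):
        rows = [r for r in tri if len(r) > k + 1]
        ratios.append(sum(r[k + 1] for r in rows) / sum(r[k] for r in rows))
    return ratios

def chain_ladder(tri):
    """Develop each accident year's latest value to ultimate (no tail)."""
    ratios = link_ratios(tri)
    ultimates = []
    for row in tri:
        ult = row[-1]
        for f in ratios[len(row) - 1:]:  # remaining development factors
            ult *= f
        ultimates.append(ult)
    reserves = [u - row[-1] for u, row in zip(ultimates, tri)]
    return ratios, ultimates, reserves

ratios, ultimates, reserves = chain_ladder(triangle)
print([round(f, 3) for f in ratios])   # compare with the "chain link" row
print(round(sum(reserves)))            # compare with the total "CL res"
```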
26. What Do You See?
• Let’s look at the loss triangle with fresh eyes.
• We would like to do stochastic reserving the “right” way.
• What considerations come to mind?
(Loss triangle as shown on slide 25.)
27. Some Essential Features of Loss Reserving
(Loss triangle as shown on slide 25.)
• Repeated measures
• The dataset is inherently longitudinal in nature.
• A “Bundle” of time series
• Loss triangle: a collection of time series that are “related” to one another…
• … no guarantee that the same development pattern is appropriate to each one
• Non-linear
• Each year’s loss development pattern is inherently non-linear
• Ultimate loss (ratio) is an asymptote
• Incomplete information
• Few loss triangles contain all of the information needed to make forecasts
• Most reserving exercises must incorporate judgment and/or background information
Loss reserving is inherently Bayesian
28. Origin of the Approach: Dave’s Idea + Random Effects
29. And Now it’s Bayesian
• Fully Bayesian model
• Provides posterior credible intervals (“range of reasonable reserves”)
• Add further hierarchical structure to simultaneously model loss development for multiple companies. (Wayne’s idea!)
31. Workers Comp Ratemaking
• We have 7 years of Workers Comp data
• Data from Klugman [1992 Bayes book]
• 128 workers comp classes (types of business)
• 7 years of summarized data
• Given: total payroll, claim count by class
• (payroll is a measure of “exposure” in this domain)
• Problem: use years 1-6 data to predict year 7
32. Empirical Bayes “Credibility” Approach
• Naïve approach:
• Calculate average year 1-6 claim frequency by class
• Use these 128 averages as estimates for year 7.
• Better approach: build empirical Bayes hierarchical model.
• “Bühlmann-Straub credibility model”
• “Shrinks” low-credibility classes towards the grand mean
• Use Douglas Bates’ lme4 package (UW-Madison again!)
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
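To make the “shrinkage” idea concrete, here is a minimal sketch of the classical Bühlmann-Straub estimator with payroll as the weight. It is not the lme4 Poisson model used in the talk, and the toy payroll and claim figures are hypothetical, not the Klugman data. It assumes every class has the same number of years and uses the standard nonparametric estimators of the within-class and between-class variances.

```python
# Hypothetical toy data: payroll (exposure) and claim counts by class and year.
payroll = {"A": [100, 100, 100], "B": [50, 50, 50], "C": [10, 10, 10]}
claims = {"A": [10, 12, 8], "B": [1, 2, 0], "C": [3, 1, 2]}

def buhlmann_straub(payroll, claims):
    classes = sorted(payroll)
    n_years = len(payroll[classes[0]])  # assumes equal history per class
    w = {j: sum(payroll[j]) for j in classes}           # total exposure
    rate = {j: sum(claims[j]) / w[j] for j in classes}  # class frequency
    w_tot = sum(w.values())
    grand = sum(w[j] * rate[j] for j in classes) / w_tot

    # Expected process variance: exposure-weighted within-class variation.
    epv = sum(
        p * (c / p - rate[j]) ** 2
        for j in classes
        for p, c in zip(payroll[j], claims[j])
    ) / (len(classes) * (n_years - 1))

    # Variance of hypothetical means: between-class variation, floored at 0.
    num = sum(w[j] * (rate[j] - grand) ** 2 for j in classes)
    num -= (len(classes) - 1) * epv
    den = w_tot - sum(w[j] ** 2 for j in classes) / w_tot
    vhm = max(num / den, 0.0)

    if vhm == 0.0:
        # No detectable between-class variation: full shrinkage.
        z = {j: 0.0 for j in classes}
        mean = grand
    else:
        k = epv / vhm
        z = {j: w[j] / (w[j] + k) for j in classes}     # credibility factors
        mean = sum(z[j] * rate[j] for j in classes) / sum(z.values())

    est = {j: z[j] * rate[j] + (1 - z[j]) * mean for j in classes}
    return z, rate, mean, est

z, rate, mean, est = buhlmann_straub(payroll, claims)
for j in sorted(est):
    print(j, round(z[j], 3), round(rate[j], 4), round(est[j], 4))
```

Classes with little payroll get a small credibility factor and are pulled hardest toward the overall mean, which is the behaviour the hierarchical model reproduces.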
33. Shrinkage Effect of Empirical Bayes Model
• Top row: estimated claim frequencies from un-pooled model.
• Separately calculate #claims/payroll by class
• Bottom row: estimated claim frequencies from Poisson hierarchical (credibility) model.
• Credibility estimates are “shrunk” towards the grand mean.
• Dotted line: shrinkage between 5% and 10%.
• Solid line: shrinkage > 10%
[Figure: “Modeled Claim Frequency by C…”, “Poisson Models: No Pooling and Simple…” (titles truncated in the source); rows: no pool, hierarchical; x-axis: Claim Frequency, with the grand mean marked]
34. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• Use the rjags package
• JAGS: Just Another Gibbs Sampler
• We’re standing on the shoulders of giants named David Spiegelhalter, Martyn Plummer, …
35. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• Poisson regression with an offset
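One way to read “Poisson regression with an offset”, using only the model stated on this slide: taking logs of the Poisson mean turns the payroll multiplier into an additive term whose coefficient is fixed at 1. The symbol \mu_i for the Poisson mean is introduced here for illustration.

$$\text{clmcnt}_i \sim \mathrm{Poi}(\mu_i), \qquad \mu_i = \text{payroll}_i \, \lambda_{j[i]} \;\Longleftrightarrow\; \log \mu_i = \log(\text{payroll}_i) + \log \lambda_{j[i]}$$

Here \log(\text{payroll}_i) is the offset, and \log \lambda_{j[i]} acts as the class-level intercept on the log scale.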
36. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• Allow for overdispersion
38. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• “Credibility weighting” (aka shrinkage) results from giving class-level intercepts
a probability sub-model.
39. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• Put a diffuse prior on all of the hyperparameters
• Fully Bayesian model
• Bayes or Bust!
40. Now Specify a Fully Bayesian Model
$$\text{clmcnt}_i \sim \mathrm{Poi}\left(\text{payroll}_i \, \lambda_{j[i]}\right), \qquad \lambda_j \sim N\left(\mu_\lambda, \sigma_\lambda^2\right)$$
• Here we specify a fully Bayesian model.
• Replace year-7 actual values with missing values
• We model the year-7 results … produce 128 posterior density estimates
• Can compare actual claims with Bayesian posterior probabilities
41. A Credible Result
• Let’s rank the top 30 WC classes by the median of the posterior predictive density of year-7 claim count.
• 87% of the top 30 classes have actual year-7 claim count falling within the 90% posterior credible interval.
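A coverage figure like the one above can be computed from posterior predictive draws roughly as follows. This is a sketch in Python rather than the R/JAGS workflow of the talk. The function names and toy draws are hypothetical, and an equal-tailed 90% interval is assumed, since the slide does not say which kind of interval was used.

```python
import statistics

def credible_interval_90(draws):
    """Equal-tailed 90% interval: the 5th and 95th percentiles of the draws."""
    cuts = statistics.quantiles(draws, n=20)  # cut points at 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]

def coverage(draws_by_class, actual_by_class):
    """Share of classes whose actual value lies inside its 90% interval."""
    hits = 0
    for cls, draws in draws_by_class.items():
        lo, hi = credible_interval_90(draws)
        if lo <= actual_by_class[cls] <= hi:
            hits += 1
    return hits / len(draws_by_class)

# Hypothetical posterior predictive draws of year-7 claim counts for 3 classes.
toy_draws = {c: list(range(101)) for c in ("a", "b", "c")}
toy_actual = {"a": 50, "b": 3, "c": 97}
print(coverage(toy_draws, toy_actual))
```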
43. Big Data Headed Our Way
• Credibility concerns and a Bayesian outlook
are part and parcel of actuarial science.
• But for many actuaries, working with “big
data” is a much more pressing concern.
• Many millions of personal lines policy terms
• Premium, loss, credit, billing transactions
• Telematics data
• … much more to come
• Base R handles data in memory
• This is beautiful for “small data” problems like doing loss
reserving on summarized data
• But breaks down for many industrial datasets
• So on to Revolution-R
44. The kaggle Allstate Claim Prediction Challenge Data
45. Loading the Data
• Data volume:
• 13M rows
• ~ 40 cols
• Took about 6-7 minutes to load
• Perform some variable transformations on the fly to minimize passes through the data.
• Data saved on disk in “xdf” file format for easy access and interactive modeling.
46. Viewing the Data
• Data characteristics:
• 13,184,290 rows
• A few dozen predictive variables (mostly blinded)
• Target variable: claim amount
• kaggle competition goal: build a model that segments well out-of-sample
• Let’s use the 2005-6 data to predict the 2007 data
• (Just a quick model to get a sense of Revolution R’s scalability)
• Tweedie regression models fit in seconds
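For orientation, the mean-variance relationship that defines a Tweedie GLM can be written as below. The slides do not state the power parameter or the link function used; the range 1 < p < 2 and the log link shown here are common choices for claim-amount data and are assumptions, as are the symbols x_i and \beta.

$$E[Y_i] = \mu_i, \qquad \mathrm{Var}(Y_i) = \phi \, \mu_i^{\,p}, \quad 1 < p < 2, \qquad \log \mu_i = x_i^{\top} \beta$$

For 1 < p < 2 the Tweedie distribution is a compound Poisson sum of gamma-distributed severities, so it places positive probability on a claim amount of exactly zero.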
47. Helpful Resources
• Edward (Jed) Frees – Regression Modeling with Actuarial and Financial Applications http://www.amazon.com/Regression-Actuarial-Financial-Applications-International/dp/0521135966
• Andrew Gelman / Jennifer Hill – Data Analysis Using Regression and Multilevel/Hierarchical Models http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X/ref=sr_1_1?s=books&ie=UTF8&qid=1332961819&sr=1-1
• Venables and Ripley – Modern Applied Statistics with S http://www.amazon.com/Modern-Applied-Statistics-Computing/dp/1441930086/ref=sr_1_1?s=books&ie=UTF8&qid=1332961867&sr=1-1
• Hastie, Tibshirani, Friedman – The Elements of Statistical Learning http://www.amazon.com/The-Elements-Statistical-Learning-Prediction/dp/0387848576/ref=sr_1_1?s=books&ie=UTF8&qid=1332961913&sr=1-1
• Gelman, Carlin, Stern, Rubin – Bayesian Data Analysis http://www.amazon.com/Bayesian-Analysis-Edition-Chapman-Statistical/dp/158488388X/ref=tag_dpp_lp_edpp_ttl_in
• John Kruschke – Doing Bayesian Data Analysis http://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0123814855/ref=sr_1_3?s=books&ie=UTF8&qid=1332961975&sr=1-3