Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure, with the horizontal axis representing calendar time and the vertical axis representing lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
We also share potential applications, in quantitative risk management among other fields, and demonstrate a large-scale credit risk analysis powered by big data computing.
Dual-time Modeling and Forecasting in Consumer Banking (2016)
1. Dual-time Data Structure APC Model Dual-time Factor Model Applications Appendix: R
Dual-time Modeling and Forecasting
in Consumer Banking
Dr. Aijun Zhang
March 2016 · Hong Kong
StatSoft.org 1
Outline of The Talk
1 Dual-time Data Structure
2 Age-Period-Cohort Model
3 Dual-time Factor Model
4 Applications in Consumer Banking
5 Appendix: Big Data Computing with R
Dual-time Data Collection in Consumer Banking
Consumer credit portfolios (cards, mortgages, auto loans, etc):
Cohort c (rows) by calendar time t (columns); ∗ marks an observed cell {c,t}:
            Calendar Time
Cohort   1   2   3   4  ...  T
  1      ∗   ∗   ∗   ∗  ...  ∗
  2          ∗   ∗   ∗  ...  ∗
  3              ∗   ∗  ...  ∗
  4                  ∗  ...  ∗
 ...                    ...  ∗
  T                          ∗
Cohort: a group of accounts sharing the same origination, and hence the same age and
longitudinal sampling. Better known as a “vintage” in banking.
Dual-time scales: age (vertical), calendar time (horizontal)
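To make the layout concrete, the sketch below enumerates the observed cells for T calendar periods. It assumes that cohort c is first observed in calendar period c, so that age is s = t − c (in line with the relation t − s = c used later in the talk). The function name `dual_time_cells` and the zero-based age are illustrative assumptions, not part of the talk.

```python
def dual_time_cells(T):
    """Enumerate observed (cohort, time, age) cells of the triangular layout.

    Assumption: cohort c originates in calendar period c and is observed
    at every period t = c, c+1, ..., T, so its age at time t is s = t - c.
    """
    cells = []
    for c in range(1, T + 1):        # cohort (vintage) index
        for t in range(c, T + 1):    # calendar periods in which it is observed
            cells.append({"cohort": c, "time": t, "age": t - c})
    return cells


cells = dual_time_cells(4)

# Print one row per cohort, mirroring the triangular table on the slide.
for c in range(1, 5):
    observed = {cell["time"] for cell in cells if cell["cohort"] == c}
    print(c, " ".join("*" if t in observed else "." for t in range(1, 5)))
```

Each record carries both time scales, which is the form the later models work with.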
Dual-time Data Diagrams
[Figure panels: Cohort-level Heatmap; Account-level Diagram∗]
∗ Known as the Lexis diagram (Keiding, 1990)
Dual-time Data: Other Examples
Research Objectives
1. To model the main effects of age, time and cohort
2. To model their interaction effects
3. To incorporate various covariates
4. To forecast performance over the time horizon
In this project, we propose a general framework for dual-time data
analytics, with the main focus on applications in the banking industry
and on the treatment of real big data.
Age-Period-Cohort Model
Intuitively, consider the additive decomposition of responses up to
some link (e.g. identity, logit, log):
$$E[y(s, t; c)] = \alpha_s + \beta_t + \gamma_c$$
[Age] $\alpha_s$: endogenous growth curve
[Period] $\beta_t$: exogenous dynamics
[Cohort] $\gamma_c$: origination heterogeneity
Note: The effects $\alpha_s, \beta_t, \gamma_c$ lie in the manifold $\{(s, t; c) : t - s = c\}$.
Challenges in APC Model Estimation
Cohort Effect Puzzle
Long debate in demography & sociology about the estimability of $\gamma_c$:
Ryder (1965), Mason, et al. (1973), Glenn (1976), Fienberg and Mason (1979), Fu
(2000), Yang and Land (2006), Yang, et al. (2008), Tu, et al. (2012), Luo (2013),
Bell and Jones (2014, 2015), Reither, et al. (2015a, 2015b) ... the “futile quest”!
Theorem (Intrinsic Non-Identifiability)
The intercepts and drifts of $\alpha_s, \beta_t, \gamma_c$ are not separable.
Other Challenges
The higher-order components of $\alpha_s, \beta_t, \gamma_c$ exhibit varying degrees of smoothness;
Unequal sample sizes: limited data for small t, large s, large c.
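The non-identifiability can be checked numerically. Because c = t − s, shifting the intercepts of the three effects by offsetting constants, or adding a drift d·s to the age effect, −d·t to the period effect and d·c to the cohort effect, leaves every fitted value unchanged. The sketch below uses arbitrary toy effect functions and shift values (all assumptions chosen for illustration) and compares the two parameterizations on a grid.

```python
def apc_predictor(alpha, beta, gamma, s, t):
    """Additive APC predictor at age s and calendar time t (cohort c = t - s)."""
    return alpha(s) + beta(t) + gamma(t - s)


# An arbitrary first set of effects (toy choices).
def alpha1(s):
    return 0.1 * s ** 2

def beta1(t):
    return 0.5 if t >= 3 else 0.0

def gamma1(c):
    return 0.2 * (c % 2)


# A second set: intercept shifts a, b and a linear drift d.
a, b, d = 1.5, -0.7, 0.3

def alpha2(s):
    return alpha1(s) + a + d * s

def beta2(t):
    return beta1(t) + b - d * t

def gamma2(c):
    return gamma1(c) - a - b + d * c


# The two parameterizations give the same predictor everywhere on the grid.
max_gap = max(
    abs(apc_predictor(alpha1, beta1, gamma1, s, t)
        - apc_predictor(alpha2, beta2, gamma2, s, t))
    for s in range(6) for t in range(6)
)
print(max_gap)
```

The data alone therefore cannot distinguish the two sets of effects, which is why constraints on the intercepts and on one slope are needed.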
Constrained Age-Period-Cohort Model
Consider the APC model $y(s, t; c) = \alpha_s + \beta_t + \gamma_c + \varepsilon$ subject to
$E[\beta_t] = E[\gamma_c] = 0$,
$\mathrm{Slope}[\gamma_c] = 0$,
Regularization on $\alpha_s, \beta_t, \gamma_c$.
We may formulate it as a regularized least squares problem
$$\min_{\alpha,\beta,\gamma} \sum_i \big[ y_i - \alpha(s_i) - \beta(t_i) - \gamma(c_i) \big]^2 + \Omega_s[\alpha] + \Omega_t[\beta] + \Omega_c[\gamma]$$
Then develop distributed and iterative algorithms¹ based on the
alternating least squares (ALS, aka Backfitting).
¹ Increasingly popular for {large-n, large-p} convex optimization on big data.
DtALS Algorithm 1.0
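Regularizers: ℓ2-norm (e.g. splines), ℓ1-norm (e.g. trend filtering,
Kim, et al. (SIAM Review, 2009) and Tibshirani (Ann. Statist., 2014))
Dual-time Alternating Least Squares via ℓ1-trend filtering²:
$$\hat\alpha^{k+1} \leftarrow \arg\min_{\alpha} \sum_i \big[ y_i - \alpha(s_i) - \hat\beta^{k}(t_i) - \hat\gamma^{k}(c_i) \big]^2 + \lambda_s \| D^{(2)} \alpha \|_1$$
$$\hat\beta^{k+1} \leftarrow \arg\min_{\beta} \sum_i \big[ y_i - \hat\alpha^{k+1}(s_i) - \beta(t_i) - \hat\gamma^{k}(c_i) \big]^2 + \lambda_t \| D^{(1)} \beta \|_1$$
$$\hat\beta^{k+1} \leftarrow \hat\beta^{k+1} - E[\hat\beta^{k+1}]$$
$$\hat\gamma^{k+1} \leftarrow \arg\min_{\gamma} \sum_i \big[ y_i - \hat\alpha^{k+1}(s_i) - \hat\beta^{k+1}(t_i) - \gamma(c_i) \big]^2 + \lambda_c \| D^{(2)} \gamma \|_1$$
$$\hat\gamma^{k+1}(c) \leftarrow \hat\gamma^{k+1}(c) - [\hat a + \hat b c], \quad \big( \hat a, \hat b \leftarrow \mathrm{lm}(\hat\gamma^{k+1} \sim c) \big)$$
² Based on the ADMM method by Ramdas & Tibshirani (JCGS, to appear),
by assuming piecewise-quadratic $\hat\alpha_s$ and $\hat\gamma_c$ and piecewise-linear $\hat\beta_t$.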
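The alternating structure can be sketched in a few lines. This is not the DtALS implementation from the talk: the ℓ1 trend-filtering steps are swapped for plain group means (all penalties set to zero), the expectation and the lm() step are taken as unweighted over the distinct levels, and the intercept and slope removed from the cohort effect are moved into the age and period effects (using c = t − s) so that fitted values are unchanged. All of these are assumptions made to keep the example short, and the names `group_means`, `linear_fit` and `dtals_sketch` are hypothetical.

```python
def group_means(values, keys):
    """Mean of values within each key: the exact least-squares update
    of one additive component when no smoothing penalty is applied."""
    sums, counts = {}, {}
    for v, k in zip(values, keys):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def linear_fit(x, y):
    """Least-squares intercept and slope of y on x (the lm() step)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def dtals_sketch(s, t, y, n_iter=100):
    c = [ti - si for si, ti in zip(s, t)]      # cohort = time - age
    alpha = dict.fromkeys(s, 0.0)
    beta = dict.fromkeys(t, 0.0)
    gamma = dict.fromkeys(c, 0.0)
    rss_path = []
    for _ in range(n_iter):
        # Alternating (backfitting) updates, one component at a time.
        alpha = group_means([yi - beta[ti] - gamma[ci]
                             for yi, ti, ci in zip(y, t, c)], s)
        beta = group_means([yi - alpha[si] - gamma[ci]
                            for yi, si, ci in zip(y, s, c)], t)
        gamma = group_means([yi - alpha[si] - beta[ti]
                             for yi, si, ti in zip(y, s, t)], c)
        # Constraints E[gamma] = 0 and Slope[gamma] = 0; the removed line
        # a + b*c = a + b*t - b*s is handed over to alpha and beta.
        levels = sorted(gamma)
        a, b = linear_fit(levels, [gamma[k] for k in levels])
        gamma = {k: v - (a + b * k) for k, v in gamma.items()}
        alpha = {k: v + a - b * k for k, v in alpha.items()}
        beta = {k: v + b * k for k, v in beta.items()}
        # Constraint E[beta] = 0; the removed mean goes into alpha.
        m = sum(beta.values()) / len(beta)
        beta = {k: v - m for k, v in beta.items()}
        alpha = {k: v + m for k, v in alpha.items()}
        rss_path.append(sum((yi - alpha[si] - beta[ti] - gamma[ci]) ** 2
                            for yi, si, ti, ci in zip(y, s, t, c)))
    return alpha, beta, gamma, rss_path


# Toy noise-free data on the triangular layout (cohort c observed at t = c..T).
T = 6
s, t, y = [], [], []
for cohort in range(1, T + 1):
    for period in range(cohort, T + 1):
        age = period - cohort
        s.append(age)
        t.append(period)
        y.append(0.05 * age ** 2 + (1.0 if period >= 4 else 0.0) + 0.3 * (cohort % 2))

alpha, beta, gamma, rss_path = dtals_sketch(s, t, y)
print(rss_path[0], rss_path[-1])
```

Replacing each `group_means` call with a penalized smoother would bring the sketch closer to the algorithm on the slide; the constraint handling would stay the same.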
Simulation Study
Data-generating mechanism: $\log(y(s, t; c)) = \alpha_s + \beta_t + \gamma_c + \varepsilon$
a. Endogenous $\alpha_s$ follows a smooth growth curve;
b. Exogenous $\beta_t$ mimics the macro dynamics plus a jump.
The DtALS algorithm recovered the underlying $\alpha_s, \beta_t, \gamma_c$ effects.
The spike in $\beta_t$ was auto-detected by the $\| D^{(1)} \beta \|_1$ regularization.
Simulation Study
The APC model may perform data smoothing in age, time and
cohort directions.
Dual-time Factor Model
Using the SVD idea for $\eta_{st}(s, t)$, we propose the dual-time factor model
$$\eta(s, t) = \mu + u_0(s) + v_0(t) + \sum_{k=1}^{K} u_k(s) v_k(t) \qquad (\ast)$$
where $K$ is a low rank, $v_k(t)$ are the latent dynamic factors, and
$u_k(s)$ are factor loadings, such that
$$\max\big\{ \| D^{(2)} u_k \|_1,\ k = 0, 1, \dots, K \big\} \le C_1$$
$$\max\big\{ \| D^{(1)} v_k \|_1,\ k = 0, 1, \dots, K \big\} \le C_2$$
for some constants $C_1, C_2 > 0$.
Key assumption: $v_k(t)$ piecewise-linear, $u_k(s)$ piecewise-quadratic.
Dual-time Factor Model: Special Cases
Lee-Carter mortality model (1992): $\eta(s, t) = \mu(s) + u(s) v(t)$
Nelson-Siegel three-factor yield curve model (1987)
$$\eta(s, t) = v_1(t) + \left( \frac{1 - e^{-\alpha_t s}}{\alpha_t s} \right) v_2(t) + \left( \frac{1 - e^{-\alpha_t s}}{\alpha_t s} - e^{-\alpha_t s} \right) v_3(t)$$
Theorem (DtFM vs. APC)
For $(s, t) \in \{0, 1, \dots, n - 1\}^2$, the dual-time factor model with
$u_k(s) = \xi_k^{n-s-1}$ and $v_k(t) = \sigma_k \xi_k^{t}$ for given $\{(\sigma_k, \xi_k)\}_{k=1}^{n}$ is
equivalent to the APC model with the following cohort effect
$$\gamma_c = \sum_{k=1}^{K} \sigma_k \xi_k^{n+c-1}, \quad \text{for } 1 - n \le c \le n - 1.$$
Proof by Hankel matrix with Vandermonde factorization.
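The algebra behind the theorem is short: each product u_k(s)·v_k(t) equals σ_k·ξ_k^(n−s−1+t), and with c = t − s the exponent is n + c − 1, so the interaction term depends on (s, t) only through the cohort. The check below confirms this numerically; the value of n and the (σ_k, ξ_k) pairs are arbitrary toy choices, not values from the talk.

```python
n = 5
params = [(1.0, 0.9), (0.5, 1.1)]   # toy (sigma_k, xi_k) pairs


def interaction(s, t):
    """Sum over k of u_k(s) * v_k(t), with u_k(s) = xi_k^(n-s-1) and
    v_k(t) = sigma_k * xi_k^t."""
    return sum((xi ** (n - s - 1)) * (sigma * xi ** t) for sigma, xi in params)


def cohort_effect(c):
    """gamma_c = sum over k of sigma_k * xi_k^(n+c-1)."""
    return sum(sigma * xi ** (n + c - 1) for sigma, xi in params)


# Compare the two expressions over the whole grid {0, ..., n-1}^2.
max_gap = max(
    abs(interaction(s, t) - cohort_effect(t - s))
    for s in range(n) for t in range(n)
)
print(max_gap)
```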
Dual-time Factor Model: Similar Works
PCA approach to study age-cohort and/or age-period associations by
Hatzopoulos & Haberman (Insurance Math. Econom. 2009, 2011, 2013, 2015):
$$\eta(s, t; c) = \mu(s) + \sum_{i=1}^{p} u_i(s) w_i(c) + \sum_{i=1}^{q} u'_i(s) v_i(t)$$
Functional dynamic factor model by Hays, Shen & Huang (2012):
$$\eta(s, t) = \sum_{k=1}^{K} u_k(s) v_k(t) \quad \text{s.t.} \quad \int u_k(s) u_l(s)\, ds = \delta_{kl}, \quad v_k(t) \sim \mathrm{AR}(p)$$
Both approaches involve considerable complication in model estimation.
DtALS Algorithm 2.0
Reformulate the dual-time factor model as regularized least squares:
$$\min_{u,v} \sum_i \big[ y_i - \eta(s_i, t_i) \big]^2 + \lambda_s \sum_{k=0}^{K} \| D^{(2)} u_k \|_1 + \lambda_t \sum_{k=0}^{K} \| D^{(1)} v_k \|_1$$
Step 1: When k = 0, run DtALS-Algo1.0 (additive) for $\hat\mu$, $\hat u_0(s)$ and $\hat v_0(t)$
Step 2: For k ← k + 1, update the residuals $\tilde\eta(s, t) \leftarrow \tilde\eta(s, t) - \hat u_k(s) \hat v_k(t)$.
Run alternating least squares (multiplicative)³ via ℓ1-trend filtering:
$$\hat u_{k+1} \leftarrow \arg\min_{u} \sum_i \big[ \tilde\eta(s_i, t_i) - u(s_i) \hat v_{k+1}(t_i) \big]^2 + \lambda_s \| D^{(2)} u \|_1$$
$$\hat v_{k+1} \leftarrow \arg\min_{v} \sum_i \big[ \tilde\eta(s_i, t_i) - \hat u_{k+1}(s_i) v(t_i) \big]^2 + \lambda_t \| D^{(1)} v \|_1$$
³ Similar ALS with ℓ2-norm was used by Hastie, et al. (JMLR, to appear)
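The multiplicative step can be illustrated without the trend-filtering penalties. With both penalties set to zero (an assumption made here for brevity, not the method on the slide), each half-step has a closed form: for fixed v, the best u(s) is a ratio of sums over the cells observed at age s, and symmetrically for v(t). The function name `rank_one_als` and the toy residual surface are hypothetical.

```python
def rank_one_als(obs, n_iter=20):
    """Unpenalized rank-one alternating least squares.

    obs: list of (s, t, r) residual cells. Returns dicts u, v such that
    r is approximated by u[s] * v[t]. Sums run over observed cells only,
    so incomplete (e.g. triangular) layouts are allowed.
    """
    ages = sorted({s for s, _, _ in obs})
    times = sorted({t for _, t, _ in obs})
    u = dict.fromkeys(ages, 0.0)
    v = dict.fromkeys(times, 1.0)            # arbitrary non-zero start
    for _ in range(n_iter):
        # u-step: least squares in u for fixed v.
        num = dict.fromkeys(ages, 0.0)
        den = dict.fromkeys(ages, 0.0)
        for s, t, r in obs:
            num[s] += r * v[t]
            den[s] += v[t] ** 2
        u = {s: num[s] / den[s] if den[s] > 0 else 0.0 for s in ages}
        # v-step: least squares in v for fixed u.
        num = dict.fromkeys(times, 0.0)
        den = dict.fromkeys(times, 0.0)
        for s, t, r in obs:
            num[t] += r * u[s]
            den[t] += u[s] ** 2
        v = {t: num[t] / den[t] if den[t] > 0 else 0.0 for t in times}
    return u, v


# Toy residual surface that is exactly rank one on a full grid.
u_true = {0: 1.0, 1: 2.0, 2: 0.5}
v_true = {1: 0.5, 2: -1.0, 3: 2.0}
obs = [(s, t, u_true[s] * v_true[t]) for s in u_true for t in v_true]

u_hat, v_hat = rank_one_als(obs)
print(max(abs(u_hat[s] * v_hat[t] - r) for s, t, r in obs))
```

Only the product u(s)·v(t) is determined; the scale can be moved freely between the loading and the factor, so a normalization would be needed before interpreting either one.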
Simulation Study
Extension 1: Binary and Survival Responses
DtFM: $\eta(s, t) = \mu + u_0(s) + v_0(t) + \sum_{k=1}^{K} u_k(s) v_k(t)$
Binary responses $y_i(s, t) \in \{-1, 1\}$ (e.g. panel logistic model)
$$\mathrm{LHS} = \mathrm{logit}\big\{ \Pr[ Y_i(s, t) = 1 ] \big\}$$
Survival responses on Lexis diagram under discrete setting:
$$\mathrm{LHS} = \log\big\{ \Pr[ Y_i(s, t) = 1 \mid Y_i(s + h, t + h) = 0,\ h < 0 ] \big\}$$
We obtain the dual-time discrete hazard rate models.
Both can be fitted via the modern GLM⁴ plus some data trick.
⁴ Glmnet with lasso and elastic-net, one of the best machine learning
algorithms today; also ℓ1-trend filtering with a family of link functions.
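The slides do not say which data trick is meant. One common device for discrete-time hazard models, assumed here, is person-period expansion: each account contributes one binary row per period in which it is at risk, with the outcome equal to 1 only in the period of the event. The expanded rows can then be passed to any binary GLM with age and calendar time as predictors. The field names and the 0/1 outcome coding below are illustrative assumptions.

```python
def person_period_rows(accounts, last_period):
    """Expand account-level survival records into one row per period at risk.

    accounts: list of dicts with 'id', 'origination' (cohort period) and
    'event_time' (calendar period of the event, or None if none observed).
    """
    rows = []
    for acc in accounts:
        start = acc["origination"]
        event_time = acc["event_time"]
        end = event_time if event_time is not None else last_period
        for t in range(start, end + 1):
            rows.append({
                "id": acc["id"],
                "age": t - start,              # lifetime scale
                "time": t,                     # calendar scale
                "event": int(t == event_time), # 1 only in the event period
            })
    return rows


# Toy accounts, purely illustrative.
accounts = [
    {"id": "A", "origination": 1, "event_time": 3},
    {"id": "B", "origination": 2, "event_time": None},   # censored
]
rows = person_period_rows(accounts, last_period=4)
for row in rows:
    print(row)
```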
Extension 2: Incorporating Covariates
Provided the endogenous covariates $z_i(s)$ that are time-varying
(or static) at the account (or cohort) level:
$$\eta_i(s, t \mid z_i) = \mu + u_0(s) + v_0(t) + \sum_{k=1}^{K} u_k(s) v_k(t) + \theta^T z_i(s)$$
Could be treated by an extra regression step in DtALS-Algo1.0.
Provided the exogenous covariates $x(t)$ (usually, macroeconomic
variables), they affect only the latent factors $v_k(t)$ via (say)
$$v_k(t) = \phi_k^T x(t) + \mathrm{AR}(p), \quad k = 0, 1, \dots, K$$
Could be separately fitted by ARMAX, after obtaining $\hat v_k(t)$.
Forecasting over Time Horizon
Many works on forecasting mortality or bond yields via APC or
factor models: Lee and Carter (1992), Li and Lee (2005), Kuang, et al.
(2008), Diebold and Li (2006), Hays, et al. (2012), Alai and Sherris
(2014), Hatzopoulos and Haberman (2009, 2011, 2013, 2015), ...
Common approach: modeling $\hat\beta_t$ or $\hat v_k(t)$ by time series models,
e.g. AR(p). When the exogenous variables $x(t)$ are used,
$$\hat v_k(t) = \phi_k^T x(t) + \mathrm{AR}(p), \quad k = 0, 1, \dots, K$$
Future $x(t)$ could be provided → US Stress Testing and CCAR!
But future $z_i(s)$ cannot be provided → Landmarking Method!
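A minimal sketch of scenario-based forecasting of one latent factor follows. It is a simplification rather than a full ARMAX fit: a single macro variable, an intercept (not present in the formula on the slide), ordinary least squares for the regression part, and an AR(1) fitted to the regression residuals in a second step. The function names and all numbers are illustrative assumptions.

```python
def ols_line(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_regression_ar1(v, x):
    """Two-step fit of v(t) = a + phi*x(t) + e(t), e(t) = rho*e(t-1) + noise."""
    a, phi = ols_line(x, v)
    e = [vi - a - phi * xi for vi, xi in zip(v, x)]
    den = sum(ei ** 2 for ei in e[:-1])
    rho = sum(e[i] * e[i - 1] for i in range(1, len(e))) / den if den > 0 else 0.0
    return a, phi, rho, e[-1]


def forecast(a, phi, rho, e_last, x_future):
    """Forecasts for horizons 1, 2, ... given a scenario path for x."""
    return [a + phi * xf + rho ** (h + 1) * e_last
            for h, xf in enumerate(x_future)]


# Toy inputs, purely illustrative (not data from the talk).
v_hist = [0.10, 0.25, 0.20, 0.40, 0.55, 0.50]   # estimated factor values
x_hist = [1.0, 1.5, 1.4, 2.0, 2.6, 2.5]         # macro variable history
a, phi, rho, e_last = fit_regression_ar1(v_hist, x_hist)

baseline = forecast(a, phi, rho, e_last, [2.5, 2.5, 2.5])
stressed = forecast(a, phi, rho, e_last, [3.0, 3.5, 4.0])
print(baseline)
print(stressed)
```

Supplying a different path for x is all that distinguishes a baseline from a stressed forecast, which is why supervisory scenarios fit naturally into this setup.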
Stress Testing and CCAR
Stress Testing: loss forecast under baseline & stressful scenarios
over the time horizon (e.g. nine-quarter forecast required for US
banks by the Dodd-Frank Act, signed by Obama in 2010)
Latest release of supervisory macroeconomic scenarios by the US Fed (Jan 28, 2016)
It is a mandatory part of the annual CCAR⁵ exercise for big banks.
⁵ Comprehensive Capital Analysis and Review (2011 onwards)
Landmarking Method
Landmarking method from clinical survival analysis⁶:
$$\lambda_i(s + h \mid z_i(s)) = \lambda_0^{(s)}(s + h) \exp\big\{ \theta^{(s)T} z_i(s) \big\}, \quad h \ge 0$$
How does it work?
1. Each landmark s is treated as a new origin with static $z_i(s)$
2. For each landmark, fit a separate CoxPH model over the h-horizon
3. For a test subject surviving at s, use the landmark model to
make an h-step prediction
⁶ Original idea by van Houwelingen (2007), Zheng and Heagerty (2005)
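The data preparation behind the steps above can be sketched as follows. For a given landmark age, only accounts still at risk are kept, the covariate is frozen at its landmark value, and follow-up is cut off at the end of the prediction window. Fitting the Cox model to the resulting rows is left out. The record layout (`exit_age`, `event`, `z`) and the function name are hypothetical.

```python
def landmark_dataset(accounts, landmark, horizon):
    """Build the training rows for one landmark model.

    accounts: list of dicts with 'id', 'exit_age' (age at event or
    censoring), 'event' (1 = event, 0 = censored) and 'z' (dict mapping
    age to the covariate value observed at that age).
    """
    window_end = landmark + horizon
    rows = []
    for acc in accounts:
        if acc["exit_age"] <= landmark:
            continue                          # no longer at risk at the landmark
        follow_up = min(acc["exit_age"], window_end) - landmark
        # Events after the window are treated as censored at the window end.
        event = int(acc["event"] == 1 and acc["exit_age"] <= window_end)
        rows.append({
            "id": acc["id"],
            "z": acc["z"][landmark],          # covariate frozen at the landmark
            "time": follow_up,                # time since the landmark
            "event": event,
        })
    return rows


# Toy accounts, purely illustrative.
accounts = [
    {"id": "A", "exit_age": 5, "event": 1, "z": {0: 0.1, 2: 0.3}},
    {"id": "B", "exit_age": 2, "event": 1, "z": {0: 0.4, 2: 0.4}},
    {"id": "C", "exit_age": 10, "event": 0, "z": {0: 0.6, 2: 0.7}},
    {"id": "D", "exit_age": 9, "event": 1, "z": {0: 0.2, 2: 0.5}},
]
rows = landmark_dataset(accounts, landmark=2, horizon=4)
for row in rows:
    print(row)
```

Repeating the construction for each landmark yields one dataset per landmark model.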
Landmarking Method for Dual-time Data
To make an h-step prediction for all cohorts at time snapshot $t$,
$$\eta_i(s + h, t + h \mid z_i(s)) = \eta_0^{(s)}(s + h, t + h) + \theta^{(s)T} z_i(s)$$
Smoothing $\eta_0^{(s)}$ and $\theta^{(s)}$ over landmark $s$ would yield a unified
framework that is more stable and interpretable.
An ongoing collaboration with Wells Fargo Model Risk Team …
Real Application 1: California Mortgages (2005-08)
Retail banks host big data for credit portfolios: mostly 10+ years
of vintages, millions of accounts, various loan characteristics
Background: mortgage loans were dominated by prepayment prior to the
2007-09 subprime crisis and suffered from foreclosure during the crisis
Dataset: mortgage loans from California, US (2005-08) with
covariates: FICO (credit score), CLTV (leverage ratio), NoteRate
(dynamic variable), Documentation (full vs. limited), IO Indicator
(Interest-Only), Loan Purpose (purchase, refi, cash-out refi), etc.
Real Application 1: California Mortgages (2005-08)
Dual-time Cox Modeling of Competing Risks (Default vs. Prepay)
$$\lambda_i^{(q)}(s, t) = \exp\Big\{ u_0^{(q)}(s) + v_0^{(q)}(t) + \theta_1^{(q)} \mathrm{CLTV}_i + \theta_2^{(q)} \mathrm{FICO}_i + \theta_3^{(q)} \mathrm{NoteRate}_i(s) + \theta_4^{(q)} \mathrm{DocFull}_i + \theta_5^{(q)} \mathrm{IO}_i + \theta_{6.p}^{(q)} \mathrm{Purpose.purchase}_i + \theta_{6.r}^{(q)} \mathrm{Purpose.refi}_i \Big\}$$
Results   CLTV     FICO     NoteRate   DocFull   IO      Purpose.p   Purpose.r
Default   2.841    −0.500   1.944      −1.432    1.202   −1.284      −0.656
Prepay    −0.285   0.385    0.2862     −0.091    0.141   0.185       −0.307
The estimates support the statement that “a homeowner with low leverage and good credit
would more likely prepay rather than choose to default”.
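Because the model is log-linear in the covariates, exponentiating a coefficient gives the multiplicative change in the hazard per unit of that covariate. The sketch below does this for the reported coefficients. The slides do not state how the covariates were scaled, so only the direction of each effect (above or below 1) should be read from the output.

```python
import math

# Coefficients copied from the results table.
coef = {
    "Default": {"CLTV": 2.841, "FICO": -0.500, "NoteRate": 1.944,
                "DocFull": -1.432, "IO": 1.202,
                "Purpose.p": -1.284, "Purpose.r": -0.656},
    "Prepay": {"CLTV": -0.285, "FICO": 0.385, "NoteRate": 0.2862,
               "DocFull": -0.091, "IO": 0.141,
               "Purpose.p": 0.185, "Purpose.r": -0.307},
}

# Hazard ratio per unit increase of each covariate: exp(theta).
hazard_ratio = {
    risk: {name: math.exp(theta) for name, theta in thetas.items()}
    for risk, thetas in coef.items()
}

for risk, ratios in hazard_ratio.items():
    print(risk, {name: round(hr, 3) for name, hr in ratios.items()})
```

Higher CLTV raises the default hazard and lowers the prepayment hazard, while higher FICO does the opposite, which matches the quoted statement.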
Real Application 1: California Mortgages (2005-08)
Exogenous Hazard vs. Macroeconomic Variables
The estimated exogenous default hazards $\exp\{\hat v(t)\}$ versus the
regional home price indices and unemployment rates.
Real Application 2: Freddie Mac SFLL Dataset
Freddie Mac’s Single Family Loan-Level Dataset, made available
since 2013, consists of origination variables, credit performance,
and actual loss for a “big” portion of fixed-rate mortgages.
March 2016 release: ∼21.8 million loans, ∼998 million
performance records, covering the years from 1999 to Q3 2015.
Big data wrangling by SparkR, then run the DtALS algorithm in
R and visualize the results.
Real Application 2: Freddie Mac SFLL Dataset
Appendix: Big Data Computing with R
Our current toolset⁷ for big data processing and computing:
Hadoop + Spark + R + H2O + Python
R is one of the best tools for (small) data analysis. Tremendous
efforts have been made to make R useful for big data, including “Microsoft
Revolution R”, “IBM BigInsights BigR”, “Oracle R”, “HP
Distributed R”, H2O, SparkR, Tessera, SupR, ...
One feasible procedure for handling big data:
1. Copy raw data files to HDFS with “hadoop fs -copyFromLocal”
2. Read from HDFS with SparkR to perform data wrangling
3. Collect wrangled data of modest/small size to the local R session
4. Run sophisticated analytics with R code and packages.
⁷ Maintained by HKBU Center for Big Data in Education