1. Big Data
Case Study
Proposed Model
Results and Conclusions
On Big Data Applications
P. M. Pardalos1, 2
1Center for Applied Optimization
Department of Industrial and Systems Engineering
University of Florida
2Laboratory of Algorithms and Technologies for Networks Analysis (LATNA)
National Research University Higher School of Economics
ItForum, 2013
P. M. Pardalos, On Big Data Applications
3. Big Data
Case Study
Proposed Model
Results and Conclusions
How big will the data be tomorrow?
Cisco: Global ’Net Traffic to Surpass 1 Zettabyte in 2016
http://cacm.acm.org/news/150041-cisco-global-net-traffic-to-surpass-
1-zettabyte-in-2016/fulltext
P. M. Pardalos, On Big Data Applications
4. Big Data
Case Study
Proposed Model
Results and Conclusions
What is BigData?
The proliferation of massive datasets brings with it a series
of special computational challenges.
This data avalanche arises in a wide range of scientific and
commercial applications.
With advances in computer and information technologies,
many of these challenges are beginning to be addressed.
P. M. Pardalos, On Big Data Applications
5. Big Data
Case Study
Proposed Model
Results and Conclusions
HandBook of Massive Datasets
Handbook of Massive Data Sets,
co-editors: J. Abello, P.M. Pardalos,
and M. Resende, Kluwer Academic
Publishers, (2002).
P. M. Pardalos, On Big Data Applications
6. Big Data
Case Study
Proposed Model
Results and Conclusions
Data Mining: the practice of searching through large
amounts of computerized data to find useful patterns or
trends.
Optimization: An act, process, or methodology of making
something (as a design, system, or decision) as fully
perfect, functional, or effective as possible; specifically :
the mathematical procedures (as finding the maximum of a
function) involved in this.
P. M. Pardalos, On Big Data Applications
7. Big Data
Case Study
Proposed Model
Results and Conclusions
Hard drive Cost
P. M. Pardalos, On Big Data Applications
8. Big Data
Case Study
Proposed Model
Results and Conclusions
Hard drive Capacity
P. M. Pardalos, On Big Data Applications
9. Big Data
Case Study
Proposed Model
Results and Conclusions
Processing power
P. M. Pardalos, On Big Data Applications
10. Big Data
Case Study
Proposed Model
Results and Conclusions
Some Examples of Massive Datasets
Internet Data
Weather Data
Telecommunications Data
Medical Data
Demographics Data
Financial Data
P. M. Pardalos, On Big Data Applications
11. Big Data
Case Study
Proposed Model
Results and Conclusions
Handling Massive Datasets
New computing technologies are needed to handle massive
datasets
Input-Output Parallel Systems
External Memory Algorithms
Quantum Computing
P. M. Pardalos, On Big Data Applications
12. Big Data
Case Study
Proposed Model
Results and Conclusions
Study Objective
An example in biomedicine
Acute Kidney Injury (AKI) is one of common complications
in post-operative patients.
The in-hospital mortality rate for patients with AKI may be
as high as 60%.
An elevated serum creatinine (sCr) level in blood is
commonly recognized as an indicator of AKI.
The degree to which patterns of sCr change are
associated with in-hospital mortality is unknown.
P. M. Pardalos, On Big Data Applications
13. Big Data
Case Study
Proposed Model
Results and Conclusions
Study Objective
Study Outline
Objectives:
To develop a comprehensive model for assessing the
mortality risk in post-operative patients
To establish a quantitative relationship between sCr pattern
and mortality risk
Results:
A probabilistic model was designed to assess mortality risk
in post-operative patients
The model provided high discriminative capacity and
accuracy (ROC = 0.92)
A quantitative association between sCr time series and
mortality risk was established
Simple and informative sCr risk factors were derived
P. M. Pardalos, On Big Data Applications
14. Big Data
Case Study
Proposed Model
Results and Conclusions
Study Objective
Data Set
We performed a retrospective study involving patients
admitted to Shands Hospital (Gainesville, FL) from 2000
through 2010.
For each patient who underwent a surgery detailed clinical
and outcome data were collected.
The analysis included 60,074 patients who underwent
in-patient surgery at Shands Hospital during a 10-years
period.
P. M. Pardalos, On Big Data Applications
15. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Logistic Function
Probabilistic score 1
P(C = 1|X = x) = 1 + exp w0 +
m
i=1
wigi(xi)
−1
; (1)
C ∈ {0, 1} is an outcome;
X = (X1, . . . , Xm) - the risk factors;
gi - nonlinear function associated with the ith factor
w0 = ln P(C = 1)/P(C = 0) is the a priori log odds ratio.
{wi}, i > 0 - weights indicating importance of the risk factors
1
S. Saria, A. Rajani,J. Gould,D. Koller,A. Penn, Integration of Early
Physiological Responses Predicts Later Illness Severity in Preterm Infants,
Sci Transl Med, 2010
P. M. Pardalos, On Big Data Applications
16. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Nonlinear Risk Factors Estimation
From the probabilistic score equation we can derive
w0 +
m
i=1
wi · gi(xi) = log
P(X = x|C = 0)
P(X = x|C = 1)
+ log
P(C = 0)
P(C = 1)
Under assumption of risk factors independence:
w0 +
m
i=1
wi · gi(xi) =
m
i=1
log
P(Xi = xi|C = 0)
P(Xi = xi|C = 1)
+ log
P(C = 0)
P(C = 1)
.
For continuous factors P(Xi = xi|C) implies
P(Xi ∈ [xi − ε, xi + ε]|C) where ε is sufficiently small
P. M. Pardalos, On Big Data Applications
17. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Model Parameters
max
w
k∈{0,1} j: Cj =k
ln P(Cj
= k|Xi = xj
i , . . . , Xi = xj
m);
j denotes patients in the data set;
Cj ∈ {0, 1} - outcome for the jth patient;
xj
i - value of ith risk factor for the jth patient.
We solved this problem approximately with interior point
method implemented in MATLAB
P. M. Pardalos, On Big Data Applications
18. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Nonlinear risk functions
We get
gi(xi) = ln
P(Xi = xi|C = 0)
P(Xi = xi|C = 1)
;
w0 = ln
P(C = 1)
P(C = 0)
;
wi = 1, i = 1, . . . , m.
The probabilities P(Xi = xi|C), i = 1, . . . , m were learned
from the data using maximum likelihood principle.
P. M. Pardalos, On Big Data Applications
19. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Discrete risk factors estimation
For discrete valued factors
gi(x) = ln
#{j : Cj = 1, xj
i = x}
#{j : Cj = 1}
#{j : Cj = 0}
#{j : Cj = 1, xj
i = x}
.
P. M. Pardalos, On Big Data Applications
20. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Continuous Risk Factors Estimation
For continuous factors we defined
gi(x) = ln
f0
i (x)
f1
i (x)
,
fk
i (x) = ϕ∗
(x − h∗
, θ∗
),
{ϕ∗
, h∗
, θ∗
} = arg max
ϕ∈Φ,θ,h
Lk
i (ϕ, θ, h),
Lk
i (ϕ, θ, h) =
j:Cj =k
ln
ϕ(xj
i − h, θ)
Fϕ(l2
i , θ) − Fϕ(l1
i , θ)
.
P. M. Pardalos, On Big Data Applications
21. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Continuous Risk Factors Estimation
1 2 3 4 5 6 7
0
50
100
150
200
maxCr/minCr value
Histogramofvalues,C=1
Observed values
Learned Inverse Gaussian distribution
1 2 3 4 5 6 7
0
0.5
1
1.5
2
x 10
4
maxCr/minCr value
Histogramofvalues,C=0
Observed values
Learned Loglogistic distribution
1 2 3 4 5 6 7
0
50
100
150
200
recentCr/minCr value
Histogramofvalues,C=1
Histogram of observed values
Learned Lognormal distribution
1 2 3 4 5 6 7
0
1
2
3
x 10
4
recentCr/minCr value
Histogramofvalues,C=0
Histogram of observed values
Learned LogLogistic distribution
P. M. Pardalos, On Big Data Applications
22. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Age Risk Factor
20 40 60 80 100
0
20
40
60
Age
Distributionofvalues,C=1
20 40 60 80 100
0
1000
2000
3000
Age
Distributionofvalues,C=0
P. M. Pardalos, On Big Data Applications
23. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Which creatinine (Cr) characteristics are ”good”?
Absolute Cr value is not very informative
(maximal Cr value)/(minimal Cr value) works good
Recent Cr value is important
P. M. Pardalos, On Big Data Applications
24. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Cr plots in patients with elevated Cr level
P. M. Pardalos, On Big Data Applications
25. Big Data
Case Study
Proposed Model
Results and Conclusions
Probabilistic score for mortality risk
Risk Factors Estimation
Creatinine time series analysis
Our Speculations
The risk is defined by a function R(RC, OD),
RC stands for recent condition
OD stands for overall damage suffered during AKI
One should look for features describing these two factors
P. M. Pardalos, On Big Data Applications
26. Big Data
Case Study
Proposed Model
Results and Conclusions
Resulting nonlinear risk functions
P. M. Pardalos, On Big Data Applications
27. Big Data
Case Study
Proposed Model
Results and Conclusions
Classification accuracy evaluation
70/30 cross-validation analysis has been performed
Random split into 70% training subset and 30% testing
subset
The average over 100 runs results are reported
P. M. Pardalos, On Big Data Applications
28. Big Data
Case Study
Proposed Model
Results and Conclusions
Learned risk factors weights
0.2 0.4 0.6 0.8 1
procedure
admission_type
doctor
morbidity
age
maxCr/minCr
recentCr/minCr
Oddsratio
20 30 40 50 60 70 80 90
0
2
4
Age
1 1.5 2 2.5 3
0
5
10
recentCr/minCr value
1 1.5 2 2.5 3
0
1
2
maxCr/minCr value
P. M. Pardalos, On Big Data Applications
29. Big Data
Case Study
Proposed Model
Results and Conclusions
The odds ratio based on Cr factors
1.5 2 2.5 3
1.5
2
2.5
3
maxCr/minCr value
recentCr/minCrvalue
2
4
6
8
10
12
14
P. M. Pardalos, On Big Data Applications
30. Big Data
Case Study
Proposed Model
Results and Conclusions
The Resulting ROC Curves
0 0.2 0.4 0.6 0.8 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
False positive rate (1 − specificity)
Truepositiverate(sensitivity)
Both Cr factors, AUC = 0.85
maxCr/minCr, AUC = 0.84
recentCr/minCr, 0.76
0 0.2 0.4 0.6 0.8 1
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
False positive rate (1 − specificity)
Truepositiverate(sensitivity)
Preop. + both Cr factors, AUC = 0.92
Preop. + maxCr/minCr, AUC = 0.91
Preop. + recentCr/minCr, AUC = 0.91
Preop. only, AUC = 0.83
P. M. Pardalos, On Big Data Applications
31. Big Data
Case Study
Proposed Model
Results and Conclusions
Cr time series patterns
admission discharge
1
2
3
4
Odds ratios: No recovery
time
sCr/minCr
1.79 (1.7,1.85)
6.46 (6.04,6.91)
21.39 (19.42,23.5)
admission discharge
1
2
3
4
Odds ratios: Partial recovery
time
sCr/minCr
0.79 (0.75,0.83)
2.34 (2.26,2.49)
8.3 (7.85,9.04)
admission discharge
1
2
3
4
Odds ratios: Full recovery
time
sCr/minCr
0.48 (0.44,0.53)
0.63 (0.57,0.73)
0.81 (0.71,0.99)
admission discharge
1
2
3
4
Odds ratios: Reference patterns
time
sCr/minCr
1 (0.88,1.22)
1.12
1 (0.93,1.12)
1.24
1 (0.96,1.03) 1.32
0.22 (0.17,0.25)
P. M. Pardalos, On Big Data Applications
32. Big Data
Case Study
Proposed Model
Results and Conclusions
Model Calibration
20 40 60 80 100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Percentile
Mortalityrisk
Probabilistic model score compared to mortality risks derived from the data
P. M. Pardalos, On Big Data Applications
33. Big Data
Case Study
Proposed Model
Results and Conclusions
Risk factors included Prob. model AUC (95% CI)
Preop. 0.835 (0.817-0.854)
Hosmer-Lemeshow statistics s = 95.06; p = 0.57
Preop. + maxCr/minCr 0.900 (0.888-0.911)
Hosmer-Lemeshow statistics s = 84.33; p = 0.84
Preop. + recentCr/minCr 0.910 (0.897-0.923)
Hosmer-Lemeshow statistics s = 72.19; p = 0.98
Preop. + both sCr factors 0.917 (0.906-0.929)
Hosmer-Lemeshow statistics s = 68.94; p = 0.99
maxCr/minCr only 0.834 (0.816-0.852)
Hosmer-Lemeshow statistics s = 1,818; p = 0
recentCr/minCr only 0.791 (0.764-0.818)
Hosmer-Lemeshow statistics s = 112.88; p = 0.14
both sCr factors 0.855 (0.837-0.873)
Hosmer-Lemeshow statistics s = 137.31; p = 0.01
P. M. Pardalos, On Big Data Applications
34. Big Data
Case Study
Proposed Model
Results and Conclusions
Conclusions
Relatively high classification accuracy was obtained using
probabilistic model
Creatinine time series do provide complimentary
information on patient’s risk of death
Presumably AKI effect on risk of death depends on two
factors:
Kidney recent condition
Overall kidney damage suffered since surgery
P. M. Pardalos, On Big Data Applications
35. Big Data
Case Study
Proposed Model
Results and Conclusions
The End
Thank you!
P. M. Pardalos, On Big Data Applications