Graphical Structure Learning accelerated with POWER9
1. Arghya Kusum Das, Ph.D.
Assistant Professor, University of Wisconsin-Platteville
In collaboration with
Radha Nagarajan, Ph.D.
Director, COSH, Marshfield Clinic Health System (Digital Health, Data Science, Bioinformatics, RWE)
2. o Overview of Graphical Models
o Implementation
o Preliminary Findings
o Healthcare Applications
Overview
4. Why Graphical Models?
o system-level abstractions: Graphical models can reveal system-level properties and behavior not
apparent in the reductionist representation. System-level abstractions are especially critical in
developing targeted interventions.
e.g. model COVID spread in a given community from contact tracing1; use the model to assist in
targeted community-based interventions/policies
e.g. model the signaling mechanism initiated by the COVID spike protein; use the model to identify
potential target molecules for drugs to minimize disease severity/inflammation2
o in-silico models: Graphical models can be experimented on in a controlled and cost-effective manner. This
includes posing questions to these models (e.g. inference).
e.g. given the evidence that a subject has cough, fever, sore throat and shortness of breath,
determine the probability that the subject is COVID +ve
o causal associations: Graphical models may reveal causal associations3 under certain implicit assumptions
(Note: we are attempting to decipher causality from observational data!)
1https://www.cdc.gov/coronavirus/2019-ncov/daily-life-coping/contact-tracing.html
2https://www.cebm.net/covid-19/dexamethasone/
3Pearl, J [2009] Causality: Models, Reasoning and Inference.
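As a rough illustration of the inference query above, the sketch below computes the probability of COVID +ve given observed symptoms with Bayes' rule. It assumes the symptoms are conditionally independent given COVID status (a naive Bayes structure), and every probability in it is a hypothetical placeholder, not a clinical estimate. The names `prior_pos`, `likelihood` and `posterior_positive` are made up for this example.

```python
# Hypothetical numbers only: none of these values come from real data.
prior_pos = 0.10  # assumed P(COVID +ve) before seeing any evidence

# Assumed P(symptom present | COVID status) for each symptom.
likelihood = {
    "cough": {"pos": 0.70, "neg": 0.20},
    "fever": {"pos": 0.60, "neg": 0.10},
    "sore_throat": {"pos": 0.40, "neg": 0.15},
    "shortness_of_breath": {"pos": 0.30, "neg": 0.05},
}


def posterior_positive(evidence, prior, likelihood):
    """Return P(COVID +ve | observed symptoms) under a naive Bayes assumption."""
    p_pos = prior          # running value of P(evidence, COVID +ve)
    p_neg = 1.0 - prior    # running value of P(evidence, COVID -ve)
    for symptom in evidence:
        p_pos *= likelihood[symptom]["pos"]
        p_neg *= likelihood[symptom]["neg"]
    # Normalize over the two possible states of the query variable.
    return p_pos / (p_pos + p_neg)


evidence = ["cough", "fever", "sore_throat", "shortness_of_breath"]
print(posterior_positive(evidence, prior_pos, likelihood))
```

With no evidence the function returns the prior; each observed symptom that is more likely under COVID +ve than under COVID -ve pushes the posterior upward.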
5. Problem:
What we have: Data across an informed set of variables (D)
What we need: Graphical structure (G) representing the associations between these variables
Pair-wise dependencies:
Direct associations between a given pair of nodes determined using similarity measures
Note: Associations between a pair of variables may not be direct and can be mediated through a third
variable. Conclusions based on pair-wise dependencies, while helpful, may be incomplete.
e.g. Loss of Taste (L) and Disease Severity (D) may not be associated as such (i.e. marginally
independent). However, L and D may be associated given that the subject has COVID (C)
[Figure: graphs over the nodes L, D and C]
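The contrast between marginal independence and conditional association can be reproduced on synthetic data. The sketch below is illustrative only: it assumes one particular generating mechanism in which L and D are drawn independently and C depends on both. That mechanism is chosen only to reproduce the statistical pattern and is not a claim about COVID. All probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed mechanism: L and D are generated independently of each other.
L = rng.random(n) < 0.3
D = rng.random(n) < 0.4

# C depends on both L and D (hypothetical probabilities).
p_c = 0.05 + 0.45 * L + 0.45 * D
C = rng.random(n) < p_c


def corr(x, y):
    """Pearson correlation between two boolean arrays."""
    return float(np.corrcoef(x.astype(float), y.astype(float))[0, 1])


marginal = corr(L, D)        # association ignoring C
given_c = corr(L[C], D[C])   # association among subjects with C = True

print(f"corr(L, D)         = {marginal:+.3f}")
print(f"corr(L, D | C = 1) = {given_c:+.3f}")
```

A purely pair-wise analysis of L and D would report no association here, while conditioning on C exposes one. This is the kind of dependency that structure learning is meant to capture.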
6. What we need: Graphical structure
Approach: Bayesian structure learning
- Models the joint probability distribution across the given informed set of variables
- Incorporates conditional dependencies between a given set of variables in an iterative manner
[Figure: graph over the nodes C, D and L]
7. Data?
o multivariate: more than one variable is measured
o Can be longitudinal or cross-sectional
longitudinal:
a continuous process is sampled as a function of time resulting in time series
challenging to obtain as several factors have to be controlled
cross-sectional:
replicate measurements of a continuous process are sampled in a given time window (snapshot)
relatively easier to obtain
Note: The approaches to be discussed implicitly assume that the properties of the data are preserved across
the replicate realizations.
8. Question: Given cross-sectional data on Loss of Taste (Yes/No), Disease Severity (Yes/No) and the result of
the COVID test (+/-), can we model the associations between them?
Three popular approaches for structure learning (static):
o Constraint-based: Learn the structure using conditional independence tests
o Search and score: Learn the structure that best fits the data using a greedy search with a scoring
criterion
o Hybrid: Learn the structure using a combination of constraint-based and search-and-score
approaches
Subject C (+/-) D (Y/N) L (Y/N)
1 + Y Y
2 + Y N
3 - Y Y
4 - N Y
. . . .
. . . .
. . . .
[Figure: nodes C, D and L with the edges between them unknown (?)]
9. Bayesian network structure learning:
o Exhaustive Enumeration: Number of possible structures grows super-exponentially with the number
of nodes n1.
a_n = \sum_{k=1}^{n} (-1)^{k-1} \binom{n}{k} 2^{k(n-k)} a_{n-k}, \qquad a_0 = 1
Note: Exhaustive enumeration in general is not computationally feasible from a practical standpoint.
1Robinson, R. W. "Counting Labeled Acyclic Digraphs." In New Directions in Graph Theory (Ed. F. Harary). New
Nodes DAGs
1 1
2 3
3 25
4 543
5 29281
. .
. .
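The counts in the table can be reproduced directly from Robinson's recurrence with a_0 = 1. The sketch below is a minimal implementation; the function name `count_dags` is introduced here for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k - 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 6):
    print(n, count_dags(n))
```

Python integers are arbitrary precision, so the same function can be called for larger n to see how quickly the search space grows.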
10. Markov Equivalence Class: probabilistically indistinguishable graphical structures.
p(L, D, C) = p(L \mid C) \, p(C) \, p(D \mid C)
p(L, D, C) = p(L \mid C) \, p(D) \, p(C \mid D)
p(L, D, C) = p(L) \, p(C \mid L) \, p(D \mid C)
Note: Even if exhaustive enumeration were possible, structures can be learned only up to the Markov
equivalence class.
[Figure: three graph structures over the nodes C, D and L, one per factorization]
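A small numerical check makes the equivalence concrete. The sketch below builds a joint distribution from the first factorization using hypothetical conditional probability tables, then recomputes it through the other two factorizations. All three arrays agree, which is why data alone cannot distinguish these structures.

```python
import numpy as np

# Hypothetical tables for binary variables (rows index C).
p_c = np.array([0.7, 0.3])                        # P(C)
p_l_given_c = np.array([[0.9, 0.1], [0.4, 0.6]])  # P(L | C), indexed [c, l]
p_d_given_c = np.array([[0.8, 0.2], [0.3, 0.7]])  # P(D | C), indexed [c, d]

# Factorization 1: p(L|C) p(C) p(D|C), stored as joint[l, d, c].
joint = np.einsum("c,cl,cd->ldc", p_c, p_l_given_c, p_d_given_c)

# Marginals and conditionals derived from the joint.
p_l = joint.sum(axis=(1, 2))
p_d = joint.sum(axis=(0, 2))
p_lc = joint.sum(axis=1)                 # P(L, C), indexed [l, c]
p_dc = joint.sum(axis=0)                 # P(D, C), indexed [d, c]
l_given_c = p_lc / p_lc.sum(axis=0)      # P(L | C), indexed [l, c]
d_given_c = p_dc / p_dc.sum(axis=0)      # P(D | C), indexed [d, c]
c_given_d = p_dc / p_d[:, None]          # P(C | D), indexed [d, c]
c_given_l = p_lc / p_l[:, None]          # P(C | L), indexed [l, c]

# Factorization 2: p(L|C) p(D) p(C|D).
joint_2 = np.einsum("lc,d,dc->ldc", l_given_c, p_d, c_given_d)
# Factorization 3: p(L) p(C|L) p(D|C).
joint_3 = np.einsum("l,lc,dc->ldc", p_l, c_given_l, d_given_c)

print(np.allclose(joint, joint_2), np.allclose(joint, joint_3))
```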
11. Search and Score (Hill Climbing):
P(G \mid D) \propto P(D \mid G) \, P(G), where P(D \mid G) is the likelihood and P(G) is the prior
Theoretical considerations on the complexity of greedy search under certain assumptions have been
investigated1
1Scutari, M et al. [2018] Learning Bayesian Networks from Big Data with Greedy Search, Statistics and Computing
12. Search and Score (Hill Climbing)
Hill-climbing is a sequential algorithm. The score of the present structure G* is generated by
modifying the previous structure (G), as in Step 4, in an iterative manner
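The iterative modify-and-score loop can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work. It assumes single-edge additions, deletions and reversals as the moves, a `max_fan_in` limit on parents per node, and a caller-supplied `score` function. `toy_score` and its target graph are hypothetical stand-ins for a data-driven score.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    frontier = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while frontier:
        u = frontier.pop()
        visited += 1
        for parent, child in edges:
            if parent == u:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return visited == len(nodes)


def neighbors(nodes, edges, max_fan_in):
    """Yield every valid graph one edge addition, deletion or reversal away."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                   # deletion is always valid
            cand = (edges - {(u, v)}) | {(v, u)}     # reversal
        else:
            cand = edges | {(u, v)}                  # addition
        fan_in_ok = all(
            sum(1 for _, child in cand if child == n) <= max_fan_in for n in nodes
        )
        if fan_in_ok and is_acyclic(nodes, cand):
            yield cand


def hill_climb(nodes, score, start=frozenset(), max_fan_in=2):
    """Greedy search: move to the best-scoring neighbor until none improves."""
    current = frozenset(start)
    best = score(current)
    while True:
        best_cand, best_cand_score = None, best
        for cand in neighbors(nodes, current, max_fan_in):
            s = score(cand)
            if s > best_cand_score:
                best_cand, best_cand_score = cand, s
        if best_cand is None:
            return current, best
        current, best = frozenset(best_cand), best_cand_score


# Hypothetical score: penalize every edge that differs from a made-up target.
target = frozenset({("C", "L"), ("C", "D")})


def toy_score(edges):
    return -len(edges ^ target)


graph, final_score = hill_climb(["C", "D", "L"], toy_score)
print(sorted(graph), final_score)
```

Because each accepted move strictly increases the score and the number of graphs is finite, the loop terminates, possibly at a local optimum.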
\mathrm{BIC\ Score} = \sum_{i=1}^{n} \log P(X_i \mid \Pi_{X_i}) - \frac{d}{2} \log n
Opportunities for distributing the computation in the hill climbing approach
o The potential structures interrogated in Step 4(a) can be distributed
o BIC score of a candidate structure is the sum of the scores of its local structures, hence can be
distributed
o The greedy aspect of hill-climbing, in conjunction with Markov equivalence, can result in convergence to a
local optimum. This encourages repeating the procedure with multiple random restarts, which in turn can be
distributed
Regularization term: (d/2) log n, where d = number of parameters
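The decomposability of the BIC score is what allows local scores to be computed independently. The sketch below is one possible way to compute it for discrete data held in a Pandas DataFrame. It makes several assumptions: n in the penalty is the number of samples, d counts the free parameters of each conditional probability table based on the levels observed in the data, probabilities are maximum-likelihood estimates, and there are no missing values. The function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter

import pandas as pd


def local_bic(data, node, parents=()):
    """BIC contribution of one node given its parent set."""
    parents = list(parents)
    n = len(data)
    if parents:
        configs = list(data[parents].itertuples(index=False, name=None))
    else:
        configs = [()] * n
    values = data[node].tolist()
    n_ijk = Counter(zip(configs, values))   # counts of (parent config, value)
    n_ij = Counter(configs)                 # counts of parent config
    loglik = sum(c * math.log(c / n_ij[cfg]) for (cfg, _), c in n_ijk.items())
    r = data[node].nunique()                            # levels of the node
    q = math.prod(data[p].nunique() for p in parents)   # parent configurations
    d = (r - 1) * q                                     # free parameters
    return loglik - 0.5 * d * math.log(n)


def bic(data, parent_sets):
    """Total BIC: the sum of local scores, one per node."""
    return sum(local_bic(data, node, pa) for node, pa in parent_sets.items())


# Hypothetical toy data for the three variables used in the example.
toy = pd.DataFrame({
    "C": [1, 1, 0, 0, 1, 0, 1, 0],
    "D": [1, 1, 1, 0, 1, 0, 0, 0],
    "L": [1, 0, 1, 1, 1, 0, 1, 0],
})
print(bic(toy, {"C": [], "D": ["C"], "L": ["C"]}))
```

Since a single-edge change only alters the parent set of one node (two for a reversal), only the affected local scores need to be recomputed for each candidate structure.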
13. Implementation: Architecture
*Image from IC922 Redbook
x86:
Server: HPE ProLiant DL580 servers
CPU Type: Intel Xeon EX-series
Cores per node: 16
DRAM: 512GB
POWER9:
Server: IC922
CPU Type: DD2.3 POWER9 processor modules
Cores per node: 160 virtual cores
Access up to 32 DIMMs
Sustained bandwidth 28.8 GB
14. Implementation:
o Data description: HEPMASS1,2 (10.5 × 10^6 samples comprising 28 variables; Baldi et al., 2016). All
continuous normalized features were discretized into binary categorical variables by thresholding
about their mean.
o Python Implementation:
Bayesian network using Pandas, NetworkX
1Baldi P, et al. [2016] Parameterized Neural Networks for High-Energy Physics. The Eur. Phys. J. C 76(235).
2Scutari, M et al. [2018] Learning Bayesian Networks from Big Data with Greedy Search, Statistics and Computing.
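The discretization step described above can be expressed in a few lines of Pandas. The sketch below assumes that values strictly above the column mean map to 1 and all others to 0. The slide does not state the tie-breaking rule, and the small DataFrame is a hypothetical stand-in for the HEPMASS features.

```python
import pandas as pd


def binarize_about_mean(features: pd.DataFrame) -> pd.DataFrame:
    """Map each continuous column to 0/1 by thresholding about its own mean."""
    return (features > features.mean()).astype(int)


# Hypothetical continuous features (not HEPMASS values).
features = pd.DataFrame({
    "f0": [-1.2, 0.3, 0.9, -0.4],
    "f1": [2.0, 2.5, 1.0, 0.5],
})
print(binarize_about_mean(features))
```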
[Figure: candidate network structures over the nodes A, B, C, D and E]
15. Multiple Cores Architecture: Dask Distributed
[Figure: Python/Dask APIs; Parallel Restart; Data; spawning multiple Hill Climbing instances over candidate structures on the nodes A, B, C, D and E; SHA-256 hash confirms uniqueness of visited graph]
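The pattern in this slide (independent restarts mapped over workers, with a SHA-256 hash identifying each graph) can be sketched as below. Two substitutions are made so the example runs with the standard library only, and both are assumptions rather than the setup used in this work. First, `concurrent.futures.ThreadPoolExecutor` stands in for Dask Distributed's map-style task submission. Second, `search_from` is a placeholder that returns its starting graph unchanged where a real hill-climbing run would go. The canonical edge-list encoding used for hashing is also an assumption.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor

NODES = ["A", "B", "C", "D", "E"]


def graph_hash(edges):
    """SHA-256 digest of a canonical (sorted) edge list."""
    canonical = ";".join(f"{u}->{v}" for u, v in sorted(edges))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def random_dag(nodes, seed, edge_prob=0.3):
    """Random starting DAG: edges only point forward in a shuffled node order."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    return frozenset(
        (order[i], order[j])
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if rng.random() < edge_prob
    )


def search_from(start):
    """Placeholder for one hill-climbing run; returns the start unchanged."""
    return start


def run_restart(seed):
    """One map task: build a random start, search, and hash the result."""
    result = search_from(random_dag(NODES, seed))
    return seed, result, graph_hash(result)


# Each restart is independent, so the seeds can be mapped over workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_restart, range(8)))

# Keep one representative per distinct graph, keyed by its hash.
unique = {}
for seed, graph, digest in results:
    unique.setdefault(digest, graph)

print(f"{len(results)} restarts, {len(unique)} distinct graphs")
```

Hashing a sorted edge list gives the same digest regardless of the order in which edges were added, so two restarts that converge to the same structure are recognized as duplicates.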
16. Performance of structure learning on POWER and x86:
Mean and standard deviation of the computational time across 5 runs on the HEPMASS data with
Hill-Climbing. A two-sample t-test with unequal variance was used to compare the times between the x86 and
POWER architectures (# implies a significant difference).
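For reference, the two-sample t-test with unequal variance (Welch's test) uses the statistic and degrees of freedom computed below. This is a sketch using only the standard library, and the timing values are hypothetical placeholders, not the measurements reported here. Obtaining a p-value additionally requires the t-distribution CDF, which is omitted.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical run times in seconds for 5 runs per architecture.
x86_times = [41200.0, 40850.0, 41900.0, 41500.0, 40700.0]
power_times = [23100.0, 22950.0, 23400.0, 23050.0, 23300.0]

t_stat, dof = welch_t(x86_times, power_times)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```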
The computational times differed significantly (p < 0.001) between the x86 and the POWER
architectures, with the POWER architecture taking considerably less time than x86. As expected, the BIC
score takes less computational time than the K2 score and these scores
[Figure: Performance of x86 and POWER9 on HEPMASS, two panels (BIC Score; K2 Score). x-axis: Max Fan in (1, 2, 3); y-axis: Time (Seconds), 0 to 50000; series: x86, POWER; # marked at each Max Fan in value in both panels]
17. Performance of structure learning on POWER and x86 with varying Map Tasks:
Mean and standard deviation of the computational time across 5 runs on the HEPMASS data with
Hill-Climbing. A two-sample t-test with unequal variance was used to compare the times between the x86 and
POWER architectures (# implies a significant difference).
There was a statistically significant difference in the computational time between the x86 and the POWER
architectures when random restarts were distributed as map task jobs. As the number of map tasks was
increased, the computational time decreased across both POWER and x86, and the separation in the average
time between x86 and POWER increased.
# corresponds to p < 0.05; * corresponds to p < 0.0001
[Figure: Performance of POWER and x86 with varying Map Tasks (BIC Score). x-axis: Map Tasks (2, 4, 8, 16, 32, 64, 128); y-axis: Time (Seconds), 0 to 50000; series: x86, POWER; significance markers in order: # # # # # * *]
18. Healthcare – current trends:
o Explosion in Digital Healthcare Data:
- Source Systems: Continued digitization from multiple sources (EHR, Claims, Registries, IoT) and multiple types
(Text, Image, Signals)
- Multiscale Profiles: Emphasis on capturing the complete description of patients.
- Common Data Models: Develop approaches for sharing observational healthcare data (OMOP/OHDSI) across
multiple organizations and research networks (e.g. HIE, PCORNet)
- High-throughput: molecular data (e.g. Next Generation Sequencing)
- FHIR: Development of Fast Healthcare Interoperability Resources (FHIR) for enhanced interoperability across systems
and devices
o Explosion in Analytics Adoption:
- Descriptive, Predictive, Prescriptive Analytics
- Shift from storage to analytics and consensus-based to evidence-based/data-driven approaches to impact
outcomes/KPIs.
- Surge in the adoption of Machine Learning (ML) and Artificial Intelligence (AI) approaches.
19. Healthcare Applications:
Graphical Models – where do they fit in
o Healthcare data sets are inherently multivariate and noisy owing to
several factors. Probabilistic graphical models are especially suited to
handling noisy data.
o Associations in multivariate healthcare data may be unknown.
Graphical models can discover novel associations (hypothesis
generation) in addition to validating known associations (hypothesis
testing). Deciphering these associations is critical in prescribing
targeted interventions.
o Graphical models fall under ML and AI1. They can be used for descriptive,
predictive and prescriptive analytics (e.g. Naïve Bayes Classifier). AI
aspect of graphical models: answering queries posed from the evidence
provided about a disease.
o Healthcare applications of graphical models include: diagnostic
reasoning, prognostic reasoning and treatment selection, and discovering
functional associations2
o Emphasis on inferring causal associations from observational
healthcare data with potential to complement classical approaches (e.g.
RCT3), RCTs being idealizations.
o Interpretable and easily visualized for critical evaluation in healthcare
settings.
Need: Architectures and programming environment that can implement
1Russell, S., Norvig, P. [2020] Artificial Intelligence: A Modern Approach, 4th ed.
2Lucas, PJF et al. [2004] Bayesian networks in biomedicine and health-care. Artif. Intell. Med. 30(3):201-14
3Berwick, D [2008] The Science of Improvement, JAMA, 1182-1184
4McLachlan, S et al. [2020] Bayesian networks in healthcare: Distribution by medical condition. Artificial Intelligence in Medicine. 107, 101912
20. Summary
o Structure learning is computationally intensive, especially across large data sets and large numbers of variables
o Preliminary findings revealed marked improvement in performance using POWER architectures in
addressing computational challenges of structure learning approaches such as hill-climbing
o Need for a more detailed investigation using a battery of data sets and across distinct graphical model
algorithms
o Graphical modeling approaches in general have considerable healthcare applications. Their ability to reason
under uncertainty makes them especially well suited to healthcare analytics.
o https://onstituteacademy.herokuapp.com
Acknowledgements
Marco Scutari, Ph.D. Senior Researcher, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA),
Switzerland
Terry Leatherland, Trish Froeschle, Thomas Prokop, IBM, USA
Ganesan Narayanswami, OpenPOWER leader in Education and Research