Population genetics and genomics are an emerging area for applying machine learning methods in healthcare and the biomedical sciences. Several large genomics initiatives, such as Genomics England, UK Biobank, the All of Us Project, and Europe's 1 Million Genomes Initiative, are making clinical and genomics data from large numbers of patients available to benefit biomedical research. A key challenge in these initiatives, however, is standardizing the clinical and outcomes data so that machine learning methods can be trained effectively to discover useful medical and scientific insights. In this talk, we will look at what data is available at scale and review some examples of applying common data and evidence models such as OMOP, FHIR and GA4GH to achieve this, based on projects that The Hyve has executed with some of these initiatives to harmonize their clinical, genomics, imaging and wearables data and make it FAIR.
3. Core values
● Share
● Reuse
● Specialize
Office Locations
● Utrecht, Netherlands
● Cambridge, US
Fast-growing
● Started in 2012
● Now 40+ people
We advance biology and medical sciences
by building and serving thriving open source communities
Customers
● Pharma & Life Sciences
● Healthcare
● Government & non-profit
4. 4
Teams
Research Data Management
● FAIR / Data Governance consultancy
● Fairspace (meta)data management
Cancer Genomics
● Cancer data warehouse: cBioPortal
● Knowledge base: Open Targets
Data Warehousing
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE
● Data catalogues: CKAN, DataVerse
6. 6
Relevance of FAIR Data in Pharma
“The first thing we’ve learned is the importance
of having outstanding data to actually base
your ML on. (...)
I think people underestimate how little clean
data there is out there, and how hard it is to
clean and link the data.”
- Vas Narasimhan, CEO Novartis
https://www.forbes.com/sites/davidshaywitz/2019/01/16/novartis-ceo-who-wanted-to-bring-tech-into-pharma-now-explains-why-its-so-hard
7. 7
FAIR Workshop at The Hyve in Utrecht, 2018
http://blog.thehyve.nl/blog/highlights-from-pistoia-alliances-fair-workshop
https://www.sciencedirect.com/science/article/pii/S1359644618303039
8. http://www.nature.com/articles/sdata201618
Accessible:
A1. standardized protocol
A1.1 open, free and universally implementable
A1.2. authentication and authorization
A2. metadata stay accessible
Reusable:
R1. attributes
R1.1. license
R1.2. provenance
R1.3. community standards
Interoperable:
I1. language for knowledge representation
I2. vocabularies that follow FAIR principles
I3. qualified references to other (meta)data
Findable:
F1. persistent identifier
F2. metadata
F3. metadata - data link
F4. registered or indexed
OMOP, FHIR,
i2b2, CDISC
etc.
RDF
DCAT
VoID
FAIRmetrics
PROV-O
CC
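As a rough illustration of how the FAIR sub-principles above translate into checkable metadata, the sketch below tests a dataset record for a handful of fields. The field names, the mapping from principle to field, and the record itself are assumptions made for this example (loosely inspired by DCAT and PROV-O terms); they are not an official FAIR metric.

```python
# Hypothetical mapping from a FAIR sub-principle to the metadata field
# we assume carries it. Neither the mapping nor the field names are normative.
REQUIRED_FIELDS = {
    "F1 persistent identifier": "identifier",
    "F2 metadata": "title",
    "A1 standardized protocol": "access_url",
    "I2 vocabularies": "conforms_to",
    "R1.1 license": "license",
    "R1.2 provenance": "was_derived_from",
}


def missing_fair_fields(record):
    """Return the sub-principles whose assumed field is absent or empty."""
    return [
        principle
        for principle, field in REQUIRED_FIELDS.items()
        if not record.get(field)
    ]


# Illustrative record; the DOI and URL are placeholders, not real resources.
record = {
    "identifier": "doi:10.0000/example",
    "title": "Example cohort",
    "access_url": "https://example.org/api/datasets/1",
    "conforms_to": "OMOP CDM",
    "license": "CC-BY-4.0",
}

print(missing_fair_fields(record))
```

In this record only the provenance field is missing, so the check reports the R1.2 entry. A real assessment would cover all sub-principles and inspect far more than field presence.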
14. 14
From: Architecture of a Biomedical Informatics Research Data Management Pipeline, Bauer, .. Sax et al. Stud Health Technol Inform. 2016;228:262-6.
FHIR
IHE
CDA
HL7
i2b2
OMOP
SMART
CWL/WDL
RADAR
OCI
GA4GH
openEHR
DICOM
SNOMED
ICD
LOINC
CDISC
DCAT
Two perspectives:
Healthcare: HL7 FHIR, RIM, SMART on FHIR, DCMs, openEHR etc.
Research & Trials: i2b2/tranSMART, OMOP, HPO, ICD, SNOMED-CT, LOINC, ….
15. 15
Bringing people & communities together
http://blog.thehyve.nl/blog/pan-european-health-data-networks-meeting
16. 16
Deep-dive into Health Data Networks
https://youtu.be/C95pl11zdAs
About the policy background of health data networks, patient consent, GDPR, wearables & a lot more!
18. 18
A small detour to our beginnings
▶ Objective Reality
▶ Subjective Reality
▶ Intersubjective Reality
“Ever since the Cognitive Revolution, Sapiens have been living in a
dual reality. On the one hand, the objective reality of rivers, trees
and lions; and on the other hand, the imagined reality of gods,
nations and corporations. As time went by, the imagined reality
became ever more powerful, so that today the very survival of rivers,
trees and lions depends on the grace of imagined entities such as the
United States and Google.”
19. 19
Data Models 101
▶ Problem space models
▶ Semantics of the model are restricted to those that characterize the “problem domain” as described by domain experts
▶ Domain Information Models (e.g. BRIDG):
Basic, pre-clinical, clinical, and translational research and associated regulatory artifacts, i.e. the data, organization, resources, rules, and processes involved in the formal assessment of the utility, impact, or other pharmacological, physiological, or psychological effects of a drug, procedure, process, subject characteristic, biologic, cosmetic, food or device on a human, animal, or other subject or substance plus all associated regulatory artifacts required for or derived from this effort, including data specifically associated with postmarket surveillance and adverse event reporting.
21. 21
Data Models 101
▶ Solution space models
▶ CDISC SDTM provides a standard for organizing and formatting data to streamline processes in collection, management, analysis and reporting
▶ i2b2: model patient-centric clinical and biological data for the purpose of translational research ‘from bench to bedside’
▶ OMOP: model data from healthcare databases for the purpose of observational research including studying the effects of medical products
▶ Models should have a clearly bounded domain of interest
23. 23
CDISC
▶ (Underlying) standards evolve over time
▶ SDTM is bound by its regulatory submission context
▶ Not meant / suited for analysis (cf. ADaM)
From Tim Williams (UCB), PhUSE 2017 paper
24. 24
Common Data Models Comparison
OMOP
▶ Scope: Observational Data
▶ Standardized Vocabularies
▶ Person Centric Model
▶ Pre-defined domains: Condition, Drug, Procedure, Measurement, Observation...
i2b2/tranSMART
▶ Scope: Translational Data
▶ Flexible Concept Trees
▶ Observation Centric Model
▶ Pre-defined dimensions: Patient, Study, Visit, Concept, Modifier etc.
RDF
▶ Scope: Not Limited
▶ ‘Knowledge’ Graph
▶ Flexible Model
▶ Building on Linked Open Data standards
Axis: increased standardization vs. increased flexibility
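To make the comparison concrete, the sketch below writes one and the same clinical fact (a diagnosis for one patient) in three shapes: an OMOP-style row, an i2b2-style observation fact, and RDF-style triples. Column names follow the OMOP `condition_occurrence` and i2b2 `observation_fact` tables, but only a few columns are shown, all values (including the concept id) are illustrative, and the `http://example.org/` predicates are made up for this example.

```python
from datetime import date

# OMOP style: a row in a pre-defined, person-centric domain table.
omop_row = {
    "person_id": 1,
    "condition_concept_id": 201826,  # illustrative standard concept id
    "condition_start_date": date(2019, 1, 16),
    "condition_source_value": "E11.9",
}

# i2b2 style: one generic observation fact that points at a concept code.
i2b2_fact = {
    "patient_num": 1,
    "concept_cd": "ICD10:E11.9",
    "start_date": date(2019, 1, 16),
    "modifier_cd": "@",  # "@" marks a fact without a modifier
}

# RDF style: subject-predicate-object triples with no fixed table layout.
EX = "http://example.org/"  # hypothetical namespace
triples = [
    (EX + "patient/1", EX + "hasDiagnosis", EX + "diagnosis/1"),
    (EX + "diagnosis/1", EX + "code", "E11.9"),
    (EX + "diagnosis/1", EX + "startDate", "2019-01-16"),
]


def objects(subject, predicate):
    """Look up all objects for a subject/predicate pair in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(EX + "diagnosis/1", EX + "code"))
```

The OMOP row is the most constrained (fixed table, standard concept), the triples the least; the i2b2 fact sits in between, with a fixed fact table but a freely defined concept code.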
27. 27
i2b2/tranSMART Data Model
▶ One observation domain
▶ Study-specific tree of concepts
▶ Supports:
▶ absolute and relative time series
▶ samples and replicates
▶ cross-study concepts and ontologies
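A study-specific concept tree can be pictured as a set of backslash-delimited concept paths that are folded into a nested structure. The sketch below assumes that path convention and uses invented study and concept names; it is not the tranSMART loader, only a minimal illustration of the idea.

```python
def build_concept_tree(paths):
    """Fold backslash-delimited concept paths into a nested dict."""
    tree = {}
    for path in paths:
        node = tree
        # Drop the leading/trailing separators, then walk down level by level.
        for part in path.strip("\\").split("\\"):
            node = node.setdefault(part, {})
    return tree


# Hypothetical concept paths for one study.
paths = [
    "\\Study A\\Demographics\\Age\\",
    "\\Study A\\Demographics\\Sex\\",
    "\\Study A\\Lab\\Glucose\\",
]

tree = build_concept_tree(paths)
print(sorted(tree["Study A"]))
```

Each observation then only needs to reference a leaf of this tree, which is what keeps the model down to a single observation domain.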
29. 29
FHIR
● Fast Healthcare Interoperability Resources, “HL7 REST API”
● Exchange of healthcare data elements such as Patient, Practitioner, Procedure, Medication
● FHIR Profiles describe usage
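The sketch below shows what a minimal FHIR Patient resource looks like as JSON and how a RESTful read URL is formed (`[base]/[type]/[id]`). The server base URL is hypothetical, and the "profile" is reduced to a list of required top-level elements; real FHIR profiles are expressed as StructureDefinition resources and constrain much more than presence.

```python
import json

FHIR_BASE = "https://fhir.example.org/R4"  # hypothetical server base URL


def read_url(resource_type, resource_id):
    """Build the URL for a FHIR 'read' interaction: [base]/[type]/[id]."""
    return f"{FHIR_BASE}/{resource_type}/{resource_id}"


# A minimal Patient resource using standard element names.
patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-01-01",
}

# Simplified stand-in for a profile: elements this project requires.
PROFILE_REQUIRED = ["name", "gender", "birthDate"]


def profile_violations(resource, required):
    """List required elements that are missing from the resource."""
    return [element for element in required if element not in resource]


print(read_url("Patient", patient["id"]))
print(json.dumps(patient, indent=2))
print(profile_violations(patient, PROFILE_REQUIRED))
```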
30. 30
Common Data Model Definition
▶ CDM is “a mechanism by which raw data are standardised to a common structure, format and terminology independently from any particular study in order to allow a combined analysis across several databases/datasets.”
▶ Standardisation of structure and content allows the use of standardised applications, tools and methods across the data to answer a wide range of questions.
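The definition above can be illustrated with a toy ETL: two sources that differ in structure and terminology are mapped to one common shape, after which a single query runs across both. The source layouts, field names and mapping rules are invented for this sketch and are much simpler than a real CDM mapping.

```python
# Two hypothetical sources with different column names and code formats.
site_a = [{"pid": "A1", "sex": "M", "dx": "E11.9"}]
site_b = [{"patient": 7, "gender": "male", "diagnosis_code": "E119"}]

# Terminology harmonisation: map local sex codes to one value set.
SEX_MAP = {"M": "male", "F": "female", "male": "male", "female": "female"}


def normalize_icd10(code):
    """Format an ICD-10 code with a dot after the third character."""
    code = code.replace(".", "").upper()
    return code if len(code) <= 3 else code[:3] + "." + code[3:]


def to_common(row, id_key, sex_key, dx_key, source):
    """Map one source row to the common structure used in this sketch."""
    return {
        "person_id": f"{source}:{row[id_key]}",
        "sex": SEX_MAP[row[sex_key]],
        "condition_code": normalize_icd10(row[dx_key]),
    }


common = [to_common(r, "pid", "sex", "dx", "A") for r in site_a]
common += [to_common(r, "patient", "gender", "diagnosis_code", "B") for r in site_b]

# One query now works across both datasets.
count = sum(1 for r in common if r["condition_code"] == "E11.9")
print(count)
```

Once both sources share structure and codes, the same counting logic (or any standardised tool) applies to either without study-specific adjustments.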
32. 32
MI: Architecture for health research IT
● Information Model: FHIR Profiles
● Data transport & persistence: FHIR for clinical data, GA4GH / genomics standards for genomics data
● Dedicated & reproducible data warehouses for analysis
FHIR + GA4GH Data Warehouses & Marts
Collaboration with O. Kohlbacher, University of Tübingen
33. 33
EHDEN: using the OMOP CDM
Data Catalogue
- White pages of EHRs, registries, etc.
- Store metadata about data provenance, governance, population characteristics etc.
- Increase FAIRness and citability (e.g. DOI)
Mapping Tools
- Training materials
- Data (mapping) quality assessment & ETL validation tooling
- European CDM vocabulary extensions
Federated Network
- Federated study execution
- Security and access control
- Dashboards for study results
- Remote research environment
- EHDEN Portal
- Interoperability & FAIR
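Federated study execution can be sketched as follows: the same analysis runs locally at each data partner, and only aggregate results leave the site. The site names, the data, and the small-cell suppression threshold below are assumptions for illustration; this is not the EHDEN or OHDSI tooling itself.

```python
def local_count(rows, condition_code):
    """Runs at the data partner; returns an aggregate, never patient rows."""
    return sum(1 for r in rows if r["condition_code"] == condition_code)


def federated_count(sites, condition_code, min_cell_size=5):
    """Collect per-site counts, suppressing counts below min_cell_size."""
    results = {}
    for name, rows in sites.items():
        n = local_count(rows, condition_code)
        # Assumed privacy rule: small counts are withheld (None).
        results[name] = n if n >= min_cell_size else None
    total = sum(n for n in results.values() if n is not None)
    return results, total


# Hypothetical sites whose data already share a common data model.
sites = {
    "site_1": [{"condition_code": "E11.9"}] * 6 + [{"condition_code": "I10"}] * 3,
    "site_2": [{"condition_code": "E11.9"}] * 2,
}

results, total = federated_count(sites, "E11.9")
print(results, total)
```

Because every site exposes the same structure and vocabulary, the identical `local_count` logic can be shipped to all of them, which is the practical payoff of mapping to a common data model first.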
35. 35
Deep-dive into Open Science with OHDSI
https://youtu.be/X5yuoJoL6xs
About a 5-day ‘study-a-thon’, in which about 40 scientists tried to predict the outcome of a long-running RCT with observational data.
38. 38
Deep-dive into Personal Health Train
https://youtu.be/jcxZiqkqMgc
About the technical architecture, the Personal Health Train, creating the stations, securing the data and federated workflows!