At the Pharma IT 2019 conference in London, Kees van Bochove, founder of The Hyve, gave a talk on how 2019 became the year in which many biopharmaceutical companies had operational programs to make data FAIR across the enterprise.
How 2019 became the year FAIR landed in biopharmaceutical R&D
1. Kees van Bochove, Founder, The Hyve
How 2019 became the year FAIR
landed in biopharmaceutical R&D
@keesvanbochove
#PharmaTec19
London, 24 Sep 2019
2. Outline
1. FAIR Data is about people
2. The data lake is a passing phase
3. Relational data models are back
3. The Hyve
We advance biology and medical research…
… by building and serving thriving open source communities.
Services
Professional support for
open source software in
biomedical informatics
➢Software development
➢Data engineering
➢Consultancy
➢Hosting / SLAs
Core values
Share
Reuse
Specialize
Office Locations
Utrecht, The Netherlands
Cambridge, MA, United States
Customer Segments
Pharma
Life Sciences
Healthcare
Fast-growing
Started in 2012
40+ people by now
5. The roots of FAIR
►Public-private partnership to advance:
►Open Science
► Sustainability & reuse of data
►Workshop in Leiden in 2014
►Towards a Modular Blueprint ‘Floor-plan’ of a safe
and fair Data Stewardship, Trading and Routing
environment, provisionally called the Data
FAIRPORT
https://www.lorentzcenter.nl/lc/web/2014/602/info.php3?wsid=602
6. FAIR Workshop at The Hyve in Utrecht, 2018
http://blog.thehyve.nl/blog/highlights-from-pistoia-alliances-fair-workshop
https://www.sciencedirect.com/science/article/pii/S1359644618303039
8. FAIR Data Principles <> People
GO-CHANGE: socio-cultural changes around working together on
data: it’s about connecting people to each other’s data
GO-TRAIN: promote awareness of FAIR and teach best practices on
how to make your data available to others
GO-BUILD: provide the infrastructure that supports this change
Goes by many names: digital transformation, data-driven, FAIR, silo-breaking
etc., but the result is improved (scientific) collaboration
9. Why resilience to change matters
● Domain changes and focus shifts: new data types,
applications etc.
● Organizational changes: M&A, re-orgs, people
moving roles etc.
● Technology changes: new software and hardware
platforms, analysis methods, automation, ML/AI etc.
10. Let’s look at one of the 15 principles as an example
Findable:
F1. (meta)data are assigned a globally
unique and persistent identifier;
F2. data are described with rich metadata;
GO-CHANGE
● Adapt information processes to systematically
acquire, capture and persist metadata
GO-TRAIN
● Work with data and domain experts to define
important metadata to capture for all datasets
GO-BUILD
▶ Choose a widely accepted, easy-to-produce
machine-readable format for describing metadata
(hint: RDFa, JSON-LD etc.)
▶ Master metadata management services
FAIR Maturity Indicators
● F2A Structured Metadata
● F2B Grounded Metadata
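As a minimal sketch of what F1 and F2 could look like in practice, the Python below builds a JSON-LD description of a dataset using the schema.org Dataset type. The base IRI (https://example.org/dataset/), the function name describe_dataset and the sample values are assumptions made for illustration and do not come from the talk. A UUID only provides uniqueness; persistence still depends on the organization keeping the identifier resolvable.

```python
import json
import uuid


def describe_dataset(title, description, keywords,
                     base_iri="https://example.org/dataset/"):
    """Return a minimal JSON-LD record for one dataset.

    base_iri is a hypothetical identifier scheme; a real one would be
    agreed on and maintained centrally.
    """
    # F1: a globally unique identifier (persistence is an organizational task)
    identifier = base_iri + str(uuid.uuid4())
    # F2: descriptive metadata in a machine-readable format (JSON-LD)
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": identifier,
        "identifier": identifier,
        "name": title,
        "description": description,
        "keywords": keywords,
    }


record = describe_dataset(
    title="Example biomarker panel",  # made-up sample values
    description="Illustrative dataset description.",
    keywords=["biomarker", "example"],
)
print(json.dumps(record, indent=2))
```

Which metadata fields matter is the GO-TRAIN question above; the code only shows the machine-readable packaging.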
11. FAIR Data is
about people
Statement #1
● Connecting people to
each other’s data
● Changing processes
● Supporting change
@keesvanbochove @TheHyveNL
13. The modern (?) monolith
Ingest
Self-service
Pipelines
Analytics
Enterprise Data Lake
Ingestion Team
Data Engineering Team
Unification Team
Search Team
Platform API Team
Analytics Team
Architectural division
Axis of
change
15. Decentralized data management
● IRI / identifier schemes
● Metadata standards
● Provenance standards
CDO
Data Federation
Oncology
Neuroscience
Development
ClinOps
HCS
Omics platforms
Data science
Preclinical
ADME/Tox
Biomarker dev.
RWD
Epidemiology
● Catalog function
● Data standards
● Entities / data sets
Publish
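The publish step in this picture can be sketched as a small catalog that stores only descriptions and pointers, while the data itself stays with the owning team. This is an illustrative sketch under assumptions: the class names DatasetEntry and Catalog, the field names and all sample values are hypothetical and are not taken from the talk or from any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    iri: str        # identifier following the central IRI scheme
    team: str       # owning team; the data stays with them
    title: str
    standard: str   # data standard the dataset conforms to
    location: str   # pointer to where the team hosts the data


class Catalog:
    """Central catalog function: holds metadata only, never the data."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        # Identifiers must be unique across all publishing teams
        if entry.iri in self._entries:
            raise ValueError(f"identifier already published: {entry.iri}")
        self._entries[entry.iri] = entry

    def find(self, *, team=None, standard=None):
        # A filter left as None matches every entry
        return [
            e for e in self._entries.values()
            if (team is None or e.team == team)
            and (standard is None or e.standard == standard)
        ]


catalog = Catalog()
catalog.publish(DatasetEntry(
    iri="https://example.org/dataset/onc-001", team="Oncology",
    title="Example oncology study", standard="CDISC SDTM",
    location="https://oncology.example.org/data/onc-001"))
catalog.publish(DatasetEntry(
    iri="https://example.org/dataset/epi-001", team="Epidemiology",
    title="Example cohort", standard="OMOP CDM v5",
    location="https://epidemiology.example.org/data/epi-001"))

matches = catalog.find(standard="OMOP CDM v5")
print([e.title for e in matches])
```

Only the identifier scheme and the catalog are shared; each team can change its own storage without touching the central function.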
16. Advantages of a decentralized FAIR approach
● More resilient to change: no dependency on large central functions
● Allows for an iterative data strategy operationalization (no ‘big bang’
data lake delivery needed, FAIRification can start today and locally)
● No need to shuffle people around to start a big data lake project:
embed informatics and data experts directly in the research and
development teams
● Centralize only standardization functions, decentralize the rest:
empower teams to do their own data science and informatics
● Embrace external data and collaborations: no need to
‘ingest first’ via a central function, use & link directly
17. The data lake is a
passing phase
Statement #2
● Centralization is a
potential bottleneck and
a barrier for change
● The solution is in
decentralization of
storage, applications etc.
● Standards management
and data federation as
central functions
18. Teams at The Hyve: open source communities
Research Data Management
● FAIR Data Governance consultancy
● Fairspace (meta)data management
Genomics
● Cancer data portal: cBioPortal
● Knowledge base: Open Targets
Health Data Networks
● Data warehouses: tranSMART, i2b2
● Cohort selection: Glowing Bear
● Request Portals: Podium
Real World Data
● Real world evidence: OMOP/OHDSI
● Wearables platform: RADAR-BASE
19. FAIR Services at The Hyve
● Semantic modelling: creating (meta)data models that allow traversal of
linked data
● Data conformance: choose the right data standard for specific problems,
align with community standards to maximize benefits from the open
science communities and precompetitive collaborations
● Data landscape: create an understanding of existing applications and
data sources in the company and readiness for FAIR
● FAIRification: get started with FAIRifying datasets, defining metadata,
appropriate standards, provenance etc.
● Data catalog: build collaborative environment around data catalog (e.g.
using Fairspace)
20. Example: OMOP CDM v5 for RWE/RWD
● Observational
healthcare
data
● Fields defined
per domain
● Standardized
Vocabularies
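To make the three points above concrete, the sketch below uses Python's built-in sqlite3 module with a heavily reduced subset of three OMOP CDM v5 tables (person, condition_occurrence, concept). The real tables have many more columns, and every row here is made-up illustrative data; concept IDs should be checked against an actual vocabulary release before any real use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced subset of OMOP CDM v5 columns, for illustration only
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    domain_id     TEXT,
    vocabulary_id TEXT
);
CREATE TABLE person (
    person_id         INTEGER PRIMARY KEY,
    gender_concept_id INTEGER REFERENCES concept(concept_id),
    year_of_birth     INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id               INTEGER REFERENCES person(person_id),
    condition_concept_id    INTEGER REFERENCES concept(concept_id),
    condition_start_date    TEXT
);
""")

# Standardized vocabulary: codes are stored as concept IDs, not free text
conn.executemany("INSERT INTO concept VALUES (?, ?, ?, ?)", [
    (8507, "MALE", "Gender", "Gender"),
    (8532, "FEMALE", "Gender", "Gender"),
    (201826, "Type 2 diabetes mellitus", "Condition", "SNOMED"),
])
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    (1, 8507, 1960), (2, 8532, 1975), (3, 8532, 1982),
])
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", [
    (1, 1, 201826, "2018-03-01"),
    (2, 1, 201826, "2019-01-15"),
    (3, 2, 201826, "2017-06-30"),
])

# Count distinct persons per condition by joining on the shared vocabulary
rows = conn.execute("""
    SELECT c.concept_name, COUNT(DISTINCT co.person_id)
    FROM condition_occurrence AS co
    JOIN concept AS c ON c.concept_id = co.condition_concept_id
    GROUP BY c.concept_name
""").fetchall()
print(rows)
```

Because conditions are recorded as concept IDs, the same query can run unchanged against any source that has been mapped to the model.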
21. cBioPortal: hard to resist value proposition
● 4000+ citations
in literature
● 20k+ unique
users per
month
● Local instances
deployed in
many pharma
companies
and cancer
centers
22. Relational data
models are back
Statement #3
● RDBMS were abandoned in favor
of NoSQL, ‘schemaless’,
‘we use Elasticsearch’ etc.
● But some applications need
strong (relational)
semantics (e.g. CDISC)
● Descriptions can be in
relational db (e.g. OMOP),
RDF, JSON-LD etc.
● Underlying infrastructure
doesn’t matter as long as it
does not leak abstractions
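One way to read the last point is sketched below: callers depend on a small interface, and the storage behind it can be swapped without the caller noticing. This is an illustration under assumptions, not the speaker's design; the names MetadataStore, DictStore, SqliteStore and title_of are hypothetical.

```python
import json
import sqlite3
from typing import Optional, Protocol


class MetadataStore(Protocol):
    """What callers may rely on; nothing about the backend shows through."""

    def put(self, iri: str, record: dict) -> None: ...
    def get(self, iri: str) -> Optional[dict]: ...


class DictStore:
    """In-memory backend."""

    def __init__(self):
        self._records = {}

    def put(self, iri, record):
        self._records[iri] = dict(record)

    def get(self, iri):
        record = self._records.get(iri)
        return dict(record) if record is not None else None


class SqliteStore:
    """Relational backend; records are serialized as JSON text."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE metadata (iri TEXT PRIMARY KEY, body TEXT NOT NULL)")

    def put(self, iri, record):
        self._conn.execute(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?)",
            (iri, json.dumps(record)))

    def get(self, iri):
        row = self._conn.execute(
            "SELECT body FROM metadata WHERE iri = ?", (iri,)).fetchone()
        return json.loads(row[0]) if row else None


def title_of(store: MetadataStore, iri: str) -> Optional[str]:
    # Caller code is identical for both backends
    record = store.get(iri)
    return record["name"] if record else None


iri = "https://example.org/dataset/demo"  # hypothetical identifier
titles = []
for store in (DictStore(), SqliteStore()):
    store.put(iri, {"name": "Demo dataset"})
    titles.append(title_of(store, iri))
print(titles)
```

Both backends return plain dictionaries and None for a missing record, so no SQL or storage detail reaches the caller.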
23. We advance biology and medical
sciences by building and serving
thriving open source communities