In the tranSMART 17.1 development project The Hyve is enriching tranSMART with support for time series, samples and cross-study concepts. The project is lead by the tranSMART Foundation and supported by Sanofi, Pfizer, Roche and AbbVie. Read more at http://thehyve.nl/transmart-17-1-time-series-samples-cross-study-concepts-and-more/.
As presented on the tranSMART Foundation Annual Meeting 2016: https://youtu.be/k1eKMhXbqOA?t=5h6m57s
4. 4
What are we solving here?
1. Missing crucial functionality
○ Time series, Samples, Cross-study concepts
○ Transcript-level RNA-Seq data
○ Large file storage
2. Code problems: ‘technical debt’
○ Monolithic architecture
○ Lack of automated tests
○ Old version Grails/Java, many code repositories
○ No documentation of database
5. 5
Backend only
● Stable, commercial grade core
○ Decoupling of the backend from the transmartApp User
Interface via the REST API
○ Towards an ecosystem of User interfaces on top of the
tranSMART backend
● Why only the backend?
○ transmartApp has issues: assumptions, old layered code
○ Enough work already
○ Current data will still work in transmartApp
○ 17.1 project will be part of the full 17.1 release
7. Time series, samples and
cross study concepts
i2b2 database alignment and extension
8. 8
History
● tranSMART was developed on top of i2b2
to combine clinical with omics data
● i2b2 has cross-study concepts (with ontology codes) and
support for storing samples and time series data
● tranSMART lost this:
○ Concepts are study specific
○ User Interface assumes a patient-concept pair to have
one value
“Patient John has for concept heart rate the value 80 bpm”
“Concept age in study A is not the same as
concept age in study B”
9. 9
Time series
● Absolute time
○ Blood measurement with start (and end) date+time
○ Hospital visit per patient grouping multiple measurements
with start (and end) date+time
● Relative time
○ ‘Baseline’ (0 days) or ‘Week 1’ (7 days) observation
○ Shared between patients
● Ordinal time
○ First, second and third observation
10. 10
Samples
● Differentiated by ‘modifiers’
○ Tumor and normal measurement
○ Multiple doses
○ Multiple tissues
● Differentiated only by a number ‘instance_num’
○ Multiple replicas
11. 11
Cross-study concepts
● We want ‘Age’ in different studies to be the same
concept
○ Get subjects which match ‘Age > 50’ from ALL studies
○ Use ontology codes, eg. from an external ontology server
● Difference with i2b2: tranSMART is study based
○ Study based data loading
○ Study based data access
● We need to support both
13. 13
Time series and samples - Example 1
A study with tumor and normal
samples
● Multiple observations for the same patient
differentiated by the modifier ‘tissue type’.
● The Start_Date (and End_Date) for the
observation will be empty.
● All observations will be linked to the same
trial_visit, which will link to the study.
14. Clinical trial with multiple timepoints
(Baseline, Week 1, Week 2)
● Multiple observations for the same patient
differentiated by their trial_visit.
● All observations will be linked to one of the
available trial_visits, which will link to the
study. Each trial_visit has a Label
(Baseline), a Unit (Days) and a Value (0, 7
and 14).
14
Time series and samples - Example 2
15. 15
Time series and samples - Example 3 (1/2)
An EHR dataset with observation and
visit timestamps and samples.
● Multiple observations for the same patient
differentiated by their observation
Start_Date, visit and Instance_Num.
● The Start_Date (and End_Date) for the
observation will be set to a timestamp.
● The Instance_Num will be set starting
from 1 for multiple samples on the same
observation Start_Date and visit.
16. 16
Time series and samples - Example 3 (2/2)
An EHR dataset with observation and
visit timestamps and samples.
● The observations from a patient will be
linked one visit per hospital visit.
● The Start_Date (and End_Date) for the
visit will be set to a timestamp including
time and date for the hospital visit.
● All observations will be linked to the same
trial_visit, which will link to the study.
17. ● Querying observations based on a combination of:
○ start time, end time
○ aggregated time series/samples values:
■ minimum, maximum, average
○ temporal constraints on sets of events:
■ define sets of events (e.g. A: all blood pressure readings for a
patient, B: the first use of drug X by the patient)
■ Specify constraints (e.g. All of A happen at least one week after
any of B).
17
Querying time series and samples
18. 18
● Querying patients based on observations:
○ Certain constraints are valid for any or for all observations
for the patient
○ Return patients where all observations of high blood
pressure occur after supply of drug X.
● Querying for aggregated values for numerical data:
○ minimum, maximum, average
Querying time series and samples
20. 20
TranSMART data types
● Metadata
○ Study, concept, patient metadata / Links to source data
● Clinical / NHTMP / Derived imaging data / Biobanking data
○ numerical and categorical
● Gene expression - RNA
○ Micro array
○ mRNAseq, miRNAseq - only linked to genes
○ qPCR miRNA
● Copy Number Variation data (Array CGH)
● Small Genomic Variants (SNP, indel – VCF format)
● Large genomic rearrangements
● Proteomics
○ Protein mass spectrometry – peptide or protein quantities
○ Immunoassay Rule-based medicine (RBM) – analyte concentrations
● Metabolomics
○ Metabolite quantities
21. 21
Transcript-level RNA-Seq data
● Adding a data type where measurements
(readcount, normalised readcount and z-score)
are linked to transcripts instead of genes
● Dictionary will link genes to transcript for searching
REF_ID GPL_ID CHROMOSOME START_BP END_BP TRANSCRIPT
ENST0001 RNASEQ_TRANSCRIPT_ANNOT X 1000 1100 TR1
2 RNASEQ_TRANSCRIPT_ANNOT Y 2000 2500
3 RNASEQ_TRANSCRIPT_ANNOT 10 3000 4000 TR2
23. 23
Linking with Arvados: Scalable Genomics
● Linking files in Arvados to studies in tranSMART
for the storage of large files (eg BAM, VCF)
● If possible:
○ Align with linking files in MongoDB to studies
● Eventual UI goals:
○ See in tranSMART which Arvados files linked to study
○ Start from tranSMART a Arvados workflow on Arvados
files
25. 25
Upgrade path / data migration
● If you have your data in 16.1 or 16.2
○ There will be a data migration path provided to 17.1
26. 26
Backwards compatibility
If you have your data in 16.1 or 16.2
● The current user interface (transmartApp) will still work on current
data
○ So only for data without time series, samples,
○ Plugins are not guaranteed to work (but might very well)
● The current REST clients will still work with the V1 version of the
REST API
28. 28
Documentation
● Documentation will be provided for all available
REST and Core API calls
● Data model design will describe the complete data
model
29. 29
Automated testing
● The Core API will have
unit and integration tests
with a minimal test
coverage of 70%.
● The RESTful API will have
automated functional
tests for all API calls.
34. 34
The Hyve team
● Project manager: Erik van Eeuwijk
● Business analyst: Ward Weistra (me)
● Technical lead: Gijs Kant
● Development team:
○ Piotr Zakrzewski (present)
○ Ruslan Forostianov (present)
○ Jan Kanis
○ Ewelina Grudzien
○ Olaf Meuwese
○ Barteld Klasens (automated testing)
35. 35
Timeline
● Module A and B: End of 2016
○ Time series, samples, cross-study concepts
○ Transcript-level RNA-Seq
● Module C and project release: End of Q1 2017
○ Linking with Arvados
● TranSMART 17.1 version release: Q2 2017
○ Integration with all community developments