3. Typical Branch Distribution
Grails Code
- transmartApp (without full repo history, always with wrong ancestry
  information ⇒ merging quite difficult)
- RModules (if you're lucky), but analyses definitions in DB not provided

Database
- SQL scripts on top of GPL 1.0 dump or later; probably
  insufficient/won't apply
- Stored procedures for ETL; overlapping definitions with yours, but no
  history ⇒ merging quite difficult
- Manual fixups always required (even if just permissions/synonyms)
Gustavo Lopes (The Hyve B.V.)
transmart-data
November 6, 2013
3 / 22
4. Typical Branch Distribution (II)
ETL
- High variability in strategies
- Instructions/sample data rarely provided
- Kettle scripts are problematic

Solr/Rserve/Configuration
- Solr schemas/dataimport.xml perpetually forgotten
- Idem for information on R packages
- Sample configuration rarely provided
5. Version Control
Version control is used ONLY for the Grails code...
But often squashed and with wrong ancestor information.
Forget about the database, Solr, and most of the ETL.

Result
- Merges are very difficult
- Changes cannot easily be tracked
- The reasons behind changes are unknown
- Regressions are introduced (no merge conflicts to flag them)
- Collaboration is based on e-mail attachments
6. Automation
Even with all the pieces...
- Setting up a new branch takes days; weeks for non-basic functionality
- No reproducibility in the process!

Result
- Devs driven away from a fully local environment (too much work)
- A robust environment for CI is passed over (too much work)
- Bugs cannot be reliably reproduced (see also: no consistent usage of VCS)
- Time wasted on deployment-specific mistakes/inconsistencies
7. Why?!
“The ‘source code’ for a work means the preferred form of the work for
making modifications to it.”
— GPL v3, section 1
Is everyone holding back “source code”?
More likely explanation:
No appropriate tooling being used
8. Situation for tranSMART 1.1
The situation is much better! Some problems remain, though.

The Good
- Create/populate DB is easy
- Most stuff is versioned
- CI for builds
- Image available
- Public issue tracking

The Bad
- No Oracle support
- Changes to DB scripts/seed data are ad hoc (lax structure)
- No mechanism to support/compare schemas with other branches
- R analyses are JSON blobs in TSVs
- No VCS for Solr or Rserve/images' setup
- Setting up Solr/Rserve is time-consuming
- Population of DB with sample data is still time-consuming
- Config changes required for dev
9. Description of transmart-data
We developed transmart-data to address most of these problems.
transmart-data is a set of scripts for managing tranSMART's environment
and certain application data (e.g. Solr schemas, DDL, seed data), which
is used by the scripts and sometimes generated by them.
It has a makefile-based interface.
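A minimal sketch of what such a makefile-based interface looks like in practice (the directory, Makefile, and targets below are invented for illustration, not transmart-data's actual tree):

```shell
# Illustration only: a per-component Makefile driven via `make -C <dir> <target>`,
# the same invocation style transmart-data uses.
dir=$(mktemp -d)
printf 'dump:\n\t@echo dumping DDL\nload:\n\t@echo loading DDL\n' > "$dir/Makefile"
make -s -C "$dir" dump    # -s silences make's directory messages
```

Each component directory carries its own targets, so `make -C <component> <task>` becomes the uniform entry point for every operation.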
10. transmart-data: Purposes
Purposes of transmart-data:
1. Allow setting up a complete dev environment quickly (< 30 min)
2. Bring versioning to the database schema and Solr files
3. Set up the Solr runtime
4. Invoke ETL pipelines
5. Set up Rserve

Target audience: programmers
11. transmart-data: Non-purposes
Non-purposes of transmart-data:
1. Setting up a production environment (some components can be used)
2. New users evaluating tranSMART (use a pre-built image)
3. Building transmartApp or its plugin dependencies (build them yourself
   or use artifacts from Bamboo/Nexus)
12. Configuration
Environment-variable-based configuration:

cp vars.sample vars
vim vars    # edit file
source vars
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
PGUSER=$USER
PGPASSWORD=
TABLESPACES=$HOME/pg/tablespaces/
PGSQL_BIN=$HOME/pg/bin/
ORAHOST=localhost
ORAPORT=1521
ORASID=orcl
ORAUSER="sys as sysdba"
ORAPASSWORD=mypassword
ORACLE_MANAGE_TABLESPACES=0
# continues...
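The mechanism is just a sourced shell fragment; a self-contained sketch (file path and values here are examples, not the real vars file):

```shell
# Sketch of the vars pattern: plain KEY=value assignments, sourced (and
# exported) into the current shell before any make invocation.
vars=$(mktemp)
cat > "$vars" <<'EOF'
PGHOST=/tmp
PGPORT=5432
PGDATABASE=transmart
EOF
set -a      # auto-export every assignment the fragment makes
. "$vars"
set +a
echo "PGDATABASE is $PGDATABASE"
```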
13. Database Schema Management
Support for Oracle and Postgres.

Postgres
- Uses pg_dump(all)
- Parses the dump files

# Dump
make -C postgres/ddl dump
make -C postgres/ddl/GLOBAL extensions.sql roles.sql
# Load
make -C postgres/ddl load

Oracle
- Queries dba_* tables
- Dumps DDL w/ DBMS_METADATA

# Dump
make -C oracle/ddl dump
# Load
make oracle
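The "parses the dump files" step can be pictured as splitting one monolithic dump into per-object files; a toy awk version (the two-table dump is fabricated, and real DDL parsing is considerably hairier):

```shell
# Toy splitter: start a new output file at each CREATE TABLE line.
work=$(mktemp -d)
cat > "$work/dump.sql" <<'EOF'
CREATE TABLE alpha (id int);
CREATE TABLE beta (id int);
EOF
awk '/^CREATE TABLE/ { f = $3 ".sql" }   # $3 is the table name
     f { print > (dir "/" f) }' dir="$work" "$work/dump.sql"
ls "$work"
```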
14. Seed Data
Only Postgres for now.

# Dump
# Tables to dump are listed in postgres/data/<schema>.lst
make -C postgres/data dump
make -C postgres/common minimize_diffs
# Load
make -C postgres/data load
# Load DDL and data
make postgres

Only for basic stuff with no ETL!
Pretty fast (DDL + data loaded in 10 s)
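The .lst idea reduces to a loop over a plain list of table names; a sketch (the second table name is invented, and echo stands in for the real pg_dump call):

```shell
# Sketch: dump every table named in a .lst file.
lst=$(mktemp)
printf 'searchapp.plugin_module\nsearchapp.search_role\n' > "$lst"
while read -r table; do
    echo "would dump $table"    # real version would run pg_dump -t "$table" ...
done < "$lst"
```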
15. ETL (I)
Unified interface for ETL.

Prepare dataset
1. Prepare ETL-specific source files
2. Prepare a file with ETL-specific params
3. Upload dataset to CDN (optional)

Load dataset
make -C samples/{oracle,postgres} load_<type>_<study id>
# Example:
make -C samples/postgres load_clinical_GSE8581

Everything is automated!
For each new ETL pipeline, support must be added.
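The load_<type>_<study id> convention amounts to name-based dispatch; a standalone sketch (the function and handlers are stubs for illustration; only the clinical type and the GSE8581 id come from the slide):

```shell
# Stub dispatcher mirroring the load_<type>_<study_id> target naming.
load() {
    target=$1
    type=${target#load_}; type=${type%%_*}   # first segment after load_
    prefix="load_${type}_"
    study=${target#"$prefix"}                # everything after the type
    case $type in
        clinical) echo "loading clinical data for $study" ;;
        *)        echo "no pipeline for type '$type'" >&2; return 1 ;;
    esac
}
load load_clinical_GSE8581    # -> loading clinical data for GSE8581
```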
17. RModules Analyses (transmartApp-DB)
Situation in transmartApp-DB:
update searchapp.plugin_module
set params = '{"id":"survivalAnalysis","converter":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Common/dataBuilders.R'')",
"source(''||PLUGINSCRIPTDIRECTORY||Common/ExtractConcepts.R'')","source(''||PLUGINSCRIPTDIRECTORY||Common/collapsingData.R'')",
"source(''||PLUGINSCRIPTDIRECTORY||Common/BinData.R'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/BuildSurvivalData.R'')",
"SurvivalData.build(\n\tinput.dataFile=''||TEMPFOLDERDIRECTORY||Clinical/clinical.i2b2trans'',\n\tconcept.time=''||TIME||'',
\n\tconcept.category=''||CATEGORY||'',\n\tconcept.eventYes=''||EVENTYES||'',\n\tbinning.enabled=''||BINNING||'',
\n\tbinning.bins=''||NUMBERBINS||'',\n\tbinning.type=''||BINNINGTYPE||'',\n\tbinning.manual=''||BINNINGMANUAL||'',
\n\tbinning.binrangestring=''||BINNINGRANGESTRING||'',\n\tbinning.variabletype=''||BINNINGVARIABLETYPE||'',
\n\tinput.gexFile=''||TEMPFOLDERDIRECTORY||mRNA/Processed_Data/mRNA.trans'',\n\tinput.snpFile=''||TEMPFOLDERDIRECTORY||SNP/snp.trans'',
\n\tconcept.category.type=''||TYPEDEP||'',\n\tgenes.category=''||GENESDEP||'',\n\tgenes.category.aggregate=''||AGGREGATEDEP||'',
\n\tsample.category=''||SAMPLEDEP||'',\n\ttime.category=''||TIMEPOINTSDEP||'',\n\tsnptype.category=''||SNPTYPEDEP||'')\n\t"]},
"name":"Survival Analysis","dataFileInputMapping":{"CLINICAL.TXT":"TRUE","SNP.TXT":"snpData","MRNA_DETAILED.TXT":"mrnaData"},
"dataTypes":{"subset1":["CLINICAL.TXT"]},"pivotData":false,"view":"SurvivalAnalysis",
"processor":{"R":["source(''||PLUGINSCRIPTDIRECTORY||Survival/CoxRegressionLoader.r'')",
"CoxRegression.loader(input.filename=''outputfile'')","source(''||PLUGINSCRIPTDIRECTORY||Survival/SurvivalCurveLoader.r'')",
"SurvivalCurve.loader(input.filename=''outputfile'',concept.time=''||TIME||'')"]},
"renderer":{"GSP":"/survivalAnalysis/survivalAnalysisOutput"},... (goes on)'
where module_name = 'pgsurvivalAnalysis';
Not very nice...
18. RModules Analyses (transmart-data)
In transmart-data:
One file per analysis
Files can be generated from DB data
Sanely formatted
But we really want to remove this from the DB!
array (
'id' => 'heatmap',
'name' => 'Heatmap',
'dataTypes' =>
array (
'subset1' =>
array (
0 => 'CLINICAL.TXT',
),
),
'dataFileInputMapping' =>
array (
'CLINICAL.TXT' => 'FALSE',
'SNP.TXT' => 'snpData',
'MRNA_DETAILED.TXT' => 'TRUE',
),
'pivotData' => false,
...
19. Rserve
Targets for Rserve:
- Download/build R
- Install R packages
- Start Rserve
- Install a System V init script for Rserve
- Idem for systemd

cd R
make -j8 bin/root/R
# some packages don't support concurrent builds
make install_packages
make start_Rserve
make start_Rserve.dbg
TRANSMART_USER=tomcat7 sudo -E make install_rserve_init
TRANSMART_USER=tomcat7 sudo -E make install_rserve_unit
20. Solr
- Solr (4.5.0) automatically downloaded and configured
- Solr cores automatically created
- User only needs to create a schema file and dataconfig.xml

# setup & start solr (psql)
make start
# just configure
make solr_home
make <core>_full_import
make <core>_delta_import
make clean_cores
ORACLE=1 make start
21. transmartApp Configuration
Out-of-tree config management:
- Targets for installing files
- Zero configuration for dev!
- Customization allowed without touching the target files
- Only supports our branches
- But a lot of configuration should be in-tree instead!
# install everything
# (previous files are backed up)
make install
# just one file:
make install_Config.groovy
make install_BuildConfig.groovy
make install_DataSource.groovy
# customizations in:
# Config-extra.php
# BuildConfig.groovy (limited)