tranSMART Community Meeting 5-7 Nov 13 - Session 3: Clinical Biomarker Discovery

TM4P
Translational Medicine for Patients

TM Data Hub Project
Implementation of a Translational Medicine Data Integration Platform

tranSMART Community Meeting
Developer Stream, Nov 06-2013
Charlotte Raillère (tranSMART Expert)
Claire Virenque (Project Manager)


Content of the presentation

● Update on Sanofi’s latest achievements
  1. IT security assessment of tranSMART
  2. Improvement of SNP (subject-level) data loading

● Update on work in progress
  3. New release under development (‘RC2’)
  4. tranSMART x MongoDB integration

tranSMART Community Meeting – Nov 06, 2013


Context – tranSMART at Sanofi

● Pilot experience with tranSMART from September 2011 to June 2012
  ● Evaluate tranSMART capabilities to support clinical biomarker research

● Implementation project launched in September 2012
  ● Identify the tranSMART improvements of highest value for Sanofi
  ● Implement these improvements through two successive tranSMART Release Candidates (RC)
    • RC1 has been available since March 2013; code base available on GitHub
    • RC2 build is in progress
    • RC2 is expected to move into production in Q2 2014

● Working version of tranSMART available for our early-adopter business units
  ● Objective: meet their ongoing needs related to translational research data integration
  ● Support for data curation and loading is also provided

tranSMART IT security assessment
Feedback
Special thanks to Vincent Rossetto and the IS Security Team!

Part 1 – Scope and Context

● Objective of Security Risk Assessment: Protect R&D information
  ● Mission of the R&D IS Security team: control and assess the risks to R&D information assets

● Risk assessment methodology
  ● ‘Ethical’ hacking (penetration testing)
    • From vulnerability scans to exploitation
    • Using free tools (Nessus, BackTrack, Metasploit, SharePoint Perl script)
    • With neither an account on Sanofi systems nor a standard Sanofi workstation
  ● Without an access account, try to gain high-level access (admin account, sensitive data)

● Risk Classification: Four grades
  ● From ‘High’: risk with important consequences for Sanofi activities; can happen or be caused easily
  ● To ‘Negligible’: risk with minor consequences; requires expert knowledge or a favorable context

● Recommendations: Remediation Action Plan
  ● With prioritization of the recommendations

Part 2 – tranSMART risk assessment results

● tranSMART strength overview
  ● No trivial system accounts found. No default database accounts found.
  ● Web servers are running under low privileges. User authentication cannot be bypassed.
    • Authentication through Sanofi’s Active Directory

● Main risks identified
  ● Credential disclosure (database, Tomcat, JBoss…)
  ● Session hijacking
  ● Privilege elevation
  ● Application malevolence (XSS)

● Impact
  ● Sensitive data disclosure
  ● Technical information disclosure
  ● Identity usurpation

Part 3 – Application Security weaknesses

● XSS attack: Certain parameters (tags) are prone to stored cross-site scripting attacks.
  ● This vulnerability can be exploited to take control of another administrator’s browser or, more probably, to mount phishing or viral spreading attacks
  ● Admin session hijacking, XSS alert:
    • <script>alert(String.fromCharCode(88, 83, 83, 32, 97, 116, 116, 97, 99, 107, 32, 105, 110, 32, 112, 114, 111, 103, 114, 101, 115, 115))</script>
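The proof-of-concept payload above builds its alert text from character codes, so it contains no quoted string for a naive filter to catch. As a minimal sketch, assuming tag values are written into HTML pages, the Python snippet below decodes those codes and shows output escaping of a stored value, one of the practices OWASP recommends against XSS. The helper name `render_tag` and the sample tag value are hypothetical and not part of tranSMART.

```python
import html

# Character codes copied from the proof-of-concept payload above.
CODES = [88, 83, 83, 32, 97, 116, 116, 97, 99, 107, 32,
         105, 110, 32, 112, 114, 111, 103, 114, 101, 115, 115]


def decode_payload(codes):
    """Mimic JavaScript's String.fromCharCode for a list of code points."""
    return "".join(chr(c) for c in codes)


def render_tag(value):
    """Hypothetical helper: escape a stored tag value before writing it into HTML."""
    return html.escape(value, quote=True)


message = decode_payload(CODES)             # the text the alert box would show
stored_tag = "<script>alert('x')</script>"  # illustrative malicious tag value
safe_html = render_tag(stored_tag)          # markup characters become entities

print(message)
print(safe_html)
```

Once escaped, the browser displays the tag value as text instead of executing it. Escaping at output time is only one layer; validating tag values when they are submitted is a common complement.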

● Privilege escalation: Basic users can access some administrative features
  ● The following URLs must not be accessible to users with a standard account:
    • /transmart/secureObjectAccess/manageAccess
    • /transmart/secureObjectAccess/manageAccessBySecObj
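Hiding administrative links in the UI does not protect these URLs; the check has to run on the server for every request. The sketch below illustrates such a check in Python under stated assumptions: the two paths come from the slide, while the role name `ROLE_ADMIN`, the helper `is_allowed`, and the non-admin example path are hypothetical and do not describe tranSMART's actual security configuration.

```python
# Paths taken from the slide; the role name and helper are illustrative assumptions.
ADMIN_ONLY_PATHS = (
    "/transmart/secureObjectAccess/manageAccess",
    "/transmart/secureObjectAccess/manageAccessBySecObj",
)
ADMIN_ROLE = "ROLE_ADMIN"  # hypothetical role name


def is_allowed(path, roles):
    """Return True if a user holding `roles` may request `path`."""
    # Drop the query string and any trailing slash before comparing.
    normalized = path.split("?", 1)[0].rstrip("/")
    if normalized in ADMIN_ONLY_PATHS:
        return ADMIN_ROLE in roles
    return True


standard_user = {"ROLE_USER"}
administrator = {"ROLE_USER", "ROLE_ADMIN"}

print(is_allowed(ADMIN_ONLY_PATHS[0], standard_user))   # standard account is refused
print(is_allowed(ADMIN_ONLY_PATHS[0], administrator))   # administrator is accepted
```

Exact matching is used on purpose: the first path is a prefix of the second, so a prefix test would blur the two rules.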


Part 4 – Recommendations and good practices

● Use good development practices to avoid XSS attacks and privilege escalation
  ● Based on development standards such as OWASP

● Ensure compliance of application accounts with the company’s password policy
  ● LDAP authentication using AD (preferred)
  ● Or set up a specific application password policy (password complexity, password expiration, timeout…)

● Encrypt tranSMART authentication (HTTPS)
  ● Avoid sniffing attacks and credential disclosure

● Avoid default or weak accounts
  ● Administrative consoles (JBoss, Tomcat, Axis2) must have complex and secret passwords
    • Risk: Exploit vulnerability to access admin areas and compromise the application (crafted application
    • Consequence: Can impact the application availability or the data confidentiality & integrity
  ● Database accounts (DBA, application) must have complex and secret passwords
    • Risk: Exploit vulnerability to access the Web application database
    • Consequence: Can impact the data confidentiality & integrity

● Sensitize users on security topics
  ● Lock the workstation or log off from the tranSMART session to avoid unauthorized access
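Where AD authentication is not available, the application-specific password policy recommended above (complexity and expiration) could be enforced with checks along these lines. This is only a sketch: the minimum length, the 90-day expiration period, and the function names are assumptions chosen for illustration, not values from Sanofi's policy.

```python
from datetime import date, timedelta

MIN_LENGTH = 12               # assumed minimum length
MAX_AGE = timedelta(days=90)  # assumed expiration period


def is_complex(password):
    """Require the minimum length plus lower, upper, digit and symbol characters."""
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ]
    return len(password) >= MIN_LENGTH and all(classes)


def is_expired(last_changed, today):
    """Return True when the password is older than the allowed age."""
    return today - last_changed > MAX_AGE


print(is_complex("tomcat"))                              # a default-style password fails
print(is_expired(date(2013, 1, 1), date(2013, 11, 6)))   # unchanged for more than 90 days
```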


Loading of SNP data
Latest achievements


Loading of SNP genotyping data

● Modification of loader.jar (from the tranSMART-ETL repository)
  ● Correction of errors
  ● Faster loading
    • Some inserts replaced by batch inserts
    • Parameters modified to insert/select data
  ● Fewer constraints on file format
    • Columns from the annotation file can be described in property files
    • New class to load SNP data from the Illumina platform
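Replacing row-by-row inserts with batch inserts reduces the number of statements sent to the database. loader.jar is a Java tool, so the actual change presumably lives in its database access code; the sketch below only illustrates the general idea in Python with the standard `sqlite3` module. The table `snp_calls`, its columns, and the batch size are hypothetical.

```python
import sqlite3


def load_genotypes(conn, rows, batch_size=1000):
    """Insert (patient_id, snp_id, genotype) rows in batches instead of one by one."""
    sql = "INSERT INTO snp_calls (patient_id, snp_id, genotype) VALUES (?, ?, ?)"
    batch = []
    inserted = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)  # one call per full batch
            inserted += len(batch)
            batch.clear()
    if batch:                             # flush the last, partial batch
        conn.executemany(sql, batch)
        inserted += len(batch)
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snp_calls (patient_id TEXT, snp_id TEXT, genotype TEXT)")

# Synthetic data: 3 patients x 2500 SNPs.
rows = ((f"P{p}", f"rs{s}", "AA") for p in range(3) for s in range(2500))
total = load_genotypes(conn, rows)
print(total)
```

Streaming rows from a generator keeps memory use bounded by the batch size, which matters when a study carries more than a million SNPs per patient.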

● Loading of three studies with SNP data from the Illumina platform (> 1 million SNPs)
  ● 4 patients → 40 minutes
  ● 30 patients → 5 hours
  ● 1500 patients → 80 hours
  Estimation (ongoing)

● Integration of SNP loading in ICE (tranSMART Curation & Loading Tool) done
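The three loading times listed above are easier to compare once they are normalized per patient. The short calculation below does that; it uses only the figures from the slide, and the function name is illustrative.

```python
# Loading times reported on the slide, converted to minutes: {patients: minutes}.
LOADS = {4: 40, 30: 5 * 60, 1500: 80 * 60}


def minutes_per_patient(patients, minutes):
    """Average loading time per patient, in minutes."""
    return minutes / patients


per_patient = {n: minutes_per_patient(n, m) for n, m in LOADS.items()}

for n, value in per_patient.items():
    print(f"{n} patients: {value:.1f} min per patient")
```

The two smaller studies both come out at 10 minutes per patient, while the 1500-patient figure corresponds to 3.2 minutes per patient. The slides do not explain the difference.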



New tranSMART release under development (‘RC2’)
Improvements – New features

tranSMART RC2 – Scope outline

● Accommodate new data types
  ● miRNA data (qPCR and microarray)
  ● Proteomic data (RBM data, mass spec data)
  ● Metabolomic data
  ● RNA sequencing data

● Accommodate serial data (time courses, dose responses, etc.)

● Enable sequential loading of data for a study

● Enhance critical current analytics
  ● Box Plot, Line Graph, Correlation Analysis, Grid View
  ● Plus adaptation of analytics to new data types

● Enhance data export features

Developments in progress. Partnership with Cognizant and The Hyve.
Completion of RC2 developments planned for January 2014.
Developments will be contributed back to the community.

tranSMART RC2 – Key points

● RC2 is built ‘on top of’ the Sanofi RC1 release
  ● ETL: impact of changes = high (Kettle scripts converted into Groovy, new ETL pipelines, mapping files modified)
  ● Data model: impact = high (creation of new tables for new data types, etc.)
  ● UI: impact = low

● Our goal is to converge towards the GPL version
  ● RC1 was merged with ‘Core DB’ & ‘Core API’ enhancements (from GPL1.1)
    • Start of the modularization of tranSMART
  ● New data types are implemented in a modular fashion.
    • This should help the future merging of RC2 with the open source code base

Limit deviation from the open source code base
Do not duplicate efforts
Maximally benefit from public tranSMART development efforts
Contribute back all developments to the community

tranSMART x MongoDB integration
Objective and timeline

MongoDB integration with tranSMART (1/2)

● MongoDB is a NoSQL document-oriented database

● Main need for tranSMART: physical storage of unstructured data (i.e., files)
  ● Any files that are uploaded and visible through the Browse tab of the Sanofi RC1 (raw data files, study-related documentation such as clinical protocols, etc.)
  ● Currently, files are stored on the tranSMART app server, which has limited storage capacity.
  → Objective: move storage of unstructured data from the tranSMART server to a MongoDB database

● Why MongoDB?
  ● Ability to store huge volumes of unstructured files
  ● Horizontal scalability
  ● Easy installation process
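MongoDB's usual convention for storing files is GridFS, which splits each file into chunks held as separate documents next to one metadata document per file. The slides do not say whether the Sanofi integration uses GridFS, so the following is only an in-memory sketch of that chunking idea. Plain Python containers stand in for the two collections, no MongoDB driver is involved, and the chunk size, field names and file name are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 255 * 1024  # bytes per chunk; an assumed value for this sketch


def store_file(files, chunks, filename, data, chunk_size=CHUNK_SIZE):
    """Record one metadata entry and split `data` into numbered chunks."""
    file_id = len(files) + 1
    files[file_id] = {
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks.append({"files_id": file_id, "n": n,
                       "data": data[start:start + chunk_size]})
    return file_id


def read_file(chunks, file_id):
    """Reassemble a file by concatenating its chunks in order."""
    parts = sorted((c for c in chunks if c["files_id"] == file_id),
                   key=lambda c: c["n"])
    return b"".join(c["data"] for c in parts)


files, chunks = {}, []
payload = b"clinical protocol " * 50000   # 900,000 bytes of sample content
fid = store_file(files, chunks, "protocol.pdf", payload)
restored = read_file(chunks, fid)
print(len(chunks), restored == payload)
```

Because every chunk is an independent document, chunks can be spread across servers, which is where the horizontal scalability listed above would come into play.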



MongoDB integration with tranSMART (2/2)

● Timelines
  ● Integration with Sanofi RC2 release (backend + UI): Q4-2013
  ● Testing in Q1-2014

Conclusion
Any questions?

Thank you!
Acknowledgement: Sherry Cao, Jike Cui, Angelo DeCristofano, Christophe Gibault, Lars
Greiffenberg, Manfred Hendlich, Rainer Kappes, Adam Palermo, Annick Peleraux, David
Peyruc, Charlotte Raillère, Vincent Rossetto, Claire Virenque

Making a difference in Healthcare with Information Technologies.



Additional slides



tranSMART RC1 – Summary

● Released in March 2013
  ● Code base available on GitHub

● Main improvements delivered in tranSMART RC1:

Topic 1: Data Management
  • Ability to organize data within a hierarchical structure (Program/Study/Assay) with new tagging capabilities
  • Synonym management for several dictionaries (e.g. compounds, genes, diseases)
  • New capabilities for posting, searching and exporting files
  • New functionality to load gene expression analysis results
  • Better support for time points/series
  • Improvement of tranSMART curation and loading tool & pipelines

Topic 2: tranSMART User Interface
  • Simplification of tranSMART UI:
    – All searching functionalities centralized
    – Synchronization of the browser and analysis modules

Topic 3: Data Searching and Analysis
  • Improvement of data searching capabilities:
    – Integrated search / filter for querying any data available (levels 1 to 4)
    – More search / filter criteria
  • Implementation of standard analytics from GPL1.0

RC1 – New organization of tranSMART UI

● Two main tabs, synchronized with each other:

Browse tab: global view of all the data available, from level 1 data (uncurated/raw files) to levels 3-4 data (analysis results, findings)
  • Navigate within Programs > Studies > Assays, Analysis and File Folders (see next slide)
  • Search data using dictionaries
  • Create new Programs > Studies > Assays and File Folders, and annotate (tag) them
  • Export files
  • Visualize gene expression analysis results

Analyze tab: run analyses on subject-level data (former Dataset Explorer)
  • Browse level 2 (processed) data, incl. clinical / preclinical / molecular data, etc.
  • Search subject-level data
  • Select data subsets (cohorts)
  • Run basic statistical and genomic analyses on those subsets (standard features from tranSMART v1.0)
  • Export data subsets

tranSMART RC2 – Requirements (1/2)

Area: Data loading / ETL pipelines
● Req #1 (Sprint 2): Optimize the clinical ETL pipeline to accelerate loading time for large clinical studies
● Req #2 (Sprint 4): Enable incremental loading of data for a given study
● Req #3 (Sprint 2): Enable loading of ‘serial’ high and low dimensional data (time course, dose response, different sampling conditions, etc.)
● Req #4 (Sprint 1): Improve samples handling
● Req #5 (Sprint 3): Enable loading of RBM subject-level data as high dimensional data
● Req #6 (Sprint 2): Enable loading of microarray miRNA subject-level data as high dimensional data
● Req #7 (Sprint 2): Enable loading of qPCR miRNA subject-level data
● Req #8 (Sprint 3): Enable loading of mass spec proteomic subject-level data as high dimensional data
● Req #9 (Sprint 3): Enable loading of metabolomic subject-level data as high dimensional data
● Req #10 (Done): Improve SNP subject-level data loading, in particular accelerate loading time
● Req #11 (Sprint 2): Enable loading of RNA sequencing subject-level data (gene-level expression quantification)
● Req #12 (Done): Optimize the management of annotation files for omic data

Area: Security
● Req #13 (Done): Set up user authentication through the company’s Active Directory
● Req #14 (Sprint 1): Implement security rules and user permissions in Browse tab (RC1 feature)

Area: Analytics – Advanced Workflows
● Req #15 (Sprint 2): Allow better analysis of ‘serial’ high and low dimensional data using existing analytics
● Req #16 (Sprint 4): Improve the Line Graph analytics:
  • Enable Line Graph to use high dimensional data
  • Better handle x axis
  • Add option to plot individual data in addition to group means or medians
● Req #17 (Sprint 2): Improve sub-categorization of high dimensional data (tissue, time points, etc.) in the high dimensional data node selection screen in Advanced Workflows (linked to req #3)
● Req #18 (Sprint 4): Improve the Boxplot analytics: make individual box plots for each variable when dragging multiple nodes in field ‘Dependent Variable’, and present output in table format
● Req #19 (Sprint 4): Improve the Correlation Analysis analytics

tranSMART RC2 – Requirements (2/2)

Area: Analytics – Advanced Workflows
● Req #20 (Sprint 3): Allow analysis of RBM data using existing analytics for high dimensional data
● Req #21 (Sprint 2): Allow analysis of microarray miRNA data using existing analytics
● Req #22 (Sprint 2): Allow analysis of qPCR miRNA and mRNA data using existing analytics
● Req #23 (Sprint 3): Allow analysis of mass spectrometry subject-level data using existing analytics
● Req #24 (Sprint 3): Allow analysis of metabolomic subject-level data using existing analytics
● Req #25 (Sprint 2): Allow analysis of RNA sequencing data using existing analytics

Area: Analytics – Grid View
● Req #26 (Sprint 3): Improve Grid View
  • Enable categorical variables in a single column
  • Enable column deletion, row or column selection
  • Enable export of selection
  • Automatically include variables used in Advanced Workflows
● Req #27 (Sprint 1): Display sample ID related to patient ID in Grid View

Area: Export
● Req #28 (Sprints 2+4): Improve export of data
  • Improve performance (response time) when exporting large data volumes
  • Add advanced filters to allow users to limit the exported data to a subset of clinical fields, genes…
  • Add ability to better categorize the data available for a study (clinical, gene expression, etc.)
  • Harmonize with Grid View export capabilities

Area: Tagging, Gene sign.
● Req #29 (Sprint 1): Add ability to preview a file in browser (IE8 and Firefox)
● Req #30 (Sprint 2): Add dictionaries for miRNA, proteins, metabolites
● Req #31 (Done): In Gene Signature/List tab, add gene symbols (linked to req #12)

Area: UI
● Req #32 (Sprint 2): Improve consistency and synchronization of data trees in Browse (Program Explorer panel) and in Analyze (Navigate Terms panel)

Area: Search
● Req #33 (Done): Secure file indexing
● Req #34 (Sprint 3): After running a free text search in Browse tab, when clicking on bold items in Program Explorer panel, highlight in right-hand side Browse panel:
  • String found in metadata (including in file names)
  • Files containing that string

Risk Assessment methodology


