A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk given by Jonathan Tedds of the University of Leicester for the Data Management in Practice workshop, which took place on 14 November 2013 at the London School of Hygiene and Tropical Medicine.
Enabling simultaneous analysis of multiple cohort studies: A BRISSKit use case
1. Enabling simultaneous analysis of multiple cohort studies without accessing the full dataset: a BRISSKit use case
Dr Jonathan Tedds jat26@le.ac.uk @jtedds
Senior Research Fellow,
Health Informatics & Interdisciplinary Research Group,
Department of Health Sciences (University of Leicester)
PI #BRISSKit http://www.brisskit.le.ac.uk
4. Data Reuse: asking new questions
• Hubble Space Telescope: papers based upon reuse of archived observations now exceed those based on the use described in the original proposal.
– http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision re life sciences: “Data reuse and the open data citation advantage”
– http://peerj.com/preprints/1/
5. Why open? Science as an Open Enterprise Report
• As a first step towards this intelligent openness, data that underpin a journal article should be made concurrently available in an accessible database
• We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable. [p.7]
• Royal Society June 2012, Science as an Open Enterprise, http://royalsociety.org/policy/projects/science-public-enterprise/report/
• Issues linking data to the scientific record:
– Data persistence
– Data and metadata quality
– Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), Lead author:
– “Science has been sleepwalking into crisis of replicability...and of the credibility of science”
– “Publishing articles without making the data available is scientific malpractice”
6. BRISSKit context:
The I4Health goal of applying knowledge engineering to close the ‘ICT gap’ between research and healthcare (Beck, T. et al 2012)
7. Biomedical Research Infrastructure Software Service Kit
A vision for cloud-based open source research applications
#BRISSKit
http://www.brisskit.le.ac.uk
9. BRISSKit USPs
• Integrated support for core research processes
• Well-established mature open source applications as prototyped in Cardiovascular, Respiratory, Cancer Theme Biobank: UK customised
• A platform for seamless management and integration between applications
• An API allows integration with existing clinical systems
• Easy set up, use and administration through browser (including on mobile devices)
• Capability of being hosted in any compliant cloud provider including UHL (NHS information governance)
12. BRISSKit Information Governance & Security Management Work Stream
- Dr Andrew Burnham leading
1. Information Governance Toolkit - analysis of Department of Health (DoH/NHS) IGT requirements vs. BRISSKit organisation/project and services/tools
a) Hosted Secondary Use Team/project (Hosted IGT)
b) Acute Trust (Acute Trust IGT)
2. IG Training Tool (NHS – University is registered)
3. Pseudonymisation requirements
4. Data Management Plan
5. IT Security & standards – Penetration Testing & Security Testing
6. Other NHS Standards/Requirements:
- Care Records Guarantee
- NHS Constitution
- NHS Records Management
- Patient Safety DSCN 14/2009, 18/2009
13. The semantic bridge
• OBiBa Onyx: records participant consent, questionnaire data and primary specimen IDs
• i2b2: cohort selection and data querying
• Bridging the two (marked “?” on the slide): Bio-ontology!
14. BRISSKit and Bio Banking
• Deploy solutions in international bio banking
initiatives
• Investment through Prof Paul Burton (Health
Sciences at Leicester/Bristol) & international
collaborations
• Building on strong informatics expertise at
University of Leicester in partnership with the
University Hospitals Leicester Trust
• Cardiovascular, Respiratory & Lifestyle BRUs
• Cancer Theme Biobank
• Genomics etc
16. Large data sets, why bother?
•Sample size
•Depth of phenotyping
•Quality of measurement
All critical
17. How big is BIG?
The direct effect of a gene
• 2,000 cases minimum, 10,000 cases better
Environmental and life-style factors
• Highly context specific: from hundreds to tens of
thousands of cases
Gene-lifestyle and gene-gene “interactions”
• Absolute minimum 10,000, usually need at least
20,000, a comprehensive platform needs at least
50,000
• Scientifically fundamental
18. The bottom line
• Scientific harmonization
• Restriction on access to individual level data
• Streamlined access to multiple data sets
Central to the integrative aims of P3G, PHOEBE,
BioSHaRE-eu etc
Also fundamental to the aims of potential BRISSKit
users
Effective data access is crucial
Effective joint analysis is essential too (integration)
Fundamental challenges
20. DataSHIELD: a novel solution
Take analysis to data not
data to analysis
One step analyses: simple
Iterative analyses: parallel
processes linked together by
entirely non-identifying
summary statistics
Typically produces
mathematically identical
results to fitting a single
model to all the data held
in one pooled data set
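The iterative case can be illustrated with a minimal sketch. It is not DataSHIELD's own code or API (DataSHIELD runs on R and Opal); it is plain Python under stated assumptions: a logistic model with one covariate, two hypothetical sites holding invented toy data, and a Newton-Raphson fit in which each site returns only its score vector and information matrix. The function names `site_summaries` and `fit` are illustrative. Fitting across the two sites gives the same coefficients, up to floating-point rounding, as fitting the pooled rows.

```python
import math


def site_summaries(rows, beta):
    """Runs at a data site. Returns the score vector and information matrix
    of a logistic model y ~ 1 + x at the current coefficients. Only these
    aggregates leave the site; the individual rows do not."""
    b0, b1 = beta
    score = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        score[0] += y - p
        score[1] += (y - p) * x
        info[0][0] += w
        info[0][1] += w * x
        info[1][0] += w * x
        info[1][1] += w * x * x
    return score, info


def fit(sites, iterations=25):
    """Runs at the analysis computer. Sums the per-site summaries and takes
    one Newton-Raphson step per round."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        score = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for rows in sites:
            s, i = site_summaries(rows, beta)
            for a in range(2):
                score[a] += s[a]
                for b in range(2):
                    info[a][b] += i[a][b]
        # Solve the 2x2 system info * step = score by Cramer's rule.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step0 = (info[1][1] * score[0] - info[0][1] * score[1]) / det
        step1 = (info[0][0] * score[1] - info[1][0] * score[0]) / det
        beta = [beta[0] + step0, beta[1] + step1]
    return beta


# Invented (x, y) rows for two hypothetical sites; not real study data.
site_a = [(0.5, 0), (1.0, 0), (1.5, 1), (2.0, 0), (2.5, 1)]
site_b = [(1.0, 1), (2.0, 1), (3.0, 1), (0.0, 0), (1.2, 0), (3.5, 0)]

split_fit = fit([site_a, site_b])    # rows stay at their sites
pooled_fit = fit([site_a + site_b])  # all rows in one pooled data set
print(split_fit)
print(pooled_fit)
```

The score and information matrix are sums over individuals, so adding the per-site sums gives the pooled quantities exactly; that is why the two fits agree.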
22. Horizontal DataSHIELD
[Diagram: an analysis computer (R) linked by web services, via the BioSHaRE web site, to three data computers (Finrisk, Prevend, 1958BC), each running Opal and R]
Opal includes
• DataSHIELD
• DataSHaPER
• Researcher ID
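A one-step analysis in this arrangement, with several data computers answering to one analysis computer, can be sketched as follows. This is an illustration in Python rather than the R and Opal stack named on the slide; the function names and the three `study_*` data sets are hypothetical, and the values are invented. Each data computer reduces its individual-level values to three aggregates, and the analysis computer combines them into a pooled mean and variance.

```python
import statistics


def local_summary(values):
    """Runs on a data computer: reduce individual-level values to aggregates."""
    return {
        "n": len(values),
        "total": sum(values),
        "total_sq": sum(v * v for v in values),
    }


def combine(summaries):
    """Runs on the analysis computer: pooled mean and sample variance
    computed from the per-study aggregates alone."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    total_sq = sum(s["total_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Invented measurements standing in for three separate data computers.
studies = {
    "study_1": [5.1, 6.3, 4.8, 5.9],
    "study_2": [6.0, 5.5, 7.2],
    "study_3": [4.9, 5.2, 6.1, 5.8, 6.4],
}

summaries = [local_summary(values) for values in studies.values()]
mean, variance = combine(summaries)

# Reference calculation on the pooled values, for comparison only.
pooled = [v for values in studies.values() for v in values]
print(mean, statistics.mean(pooled))
print(variance, statistics.variance(pooled))
```

The sum-of-squares formula is used here for brevity; it can lose precision when values are large relative to their spread.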
23. Horizontal DataSHIELD
[Diagram: an analysis computer (R) linked by web services, via the BioSHaRE web site, to three data computers (Finrisk, Prevend, 1958BC), each running Opal and R]
Work in progress:
• Embed Opal in BRISSKit
• ALSPAC
• MRC e-HIRCS
• +more…
26. Opal gains
• Direct interface with more tools
• I2B2 functionality
• Potential for enhanced user interface
[Diagram: Opal data computer (1958BC)]
BRISSKit gains
• DataSHIELD
• DataSHaPER
• Researcher ID
27. Everybody gains
• Enhanced combined
functionality
- better science
• Bigger user group
- greater portability
• Greater potential to
become a sustainable
standard
28. Enhanced joint analysis with
• Ethico-legal constraints, e.g. US/Europe biobanks
• Intellectual property issues, e.g. H3AFRICA
Editor's Notes
Hubble Space Telescope (HST): in operation since 1990. Observations are made on the basis of proposals; data is collected and made available to the proposers, stored at the Space Telescope Science Institute, and made available after an embargo. Each year approx. 200 proposals are selected from a field of 1,000, leading to c. 20,000 individual observations. There are now more research papers published on the basis of ‘reuse’ of the archived data than those based on the use described in the original proposal.
The SAOE report looked at the changing conduct of science. A key recommendation is that research data, i.e. the data underpinning research findings, should be as openly available as possible.