Reproducible, Open Data Science in the Life Sciences
Digital Research 2013, 9th-10th September 2013
Eamonn Maguire
Lead Software Engineer
University of Oxford e-Research Centre
What is “data science”...
“data science” - the storage, management and analysis of data sets
or...
“data science” - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.
These days, all hard sciences are “data sciences”.
We shift the focus of science from performing physical experiments, where data is the by-product used to test a hypothesis, to working directly with the data.
Both definitions have different levels of validity in terms of the etymology of the word “science”, but in this presentation both go very much hand in hand.
Why reproducible and open?
open - experiments are expensive...and often funded publicly. Data from experiments may extend way beyond the realms originally intended...
One without the other is really no use at all.
reproducible - findings need to be robust...and testable by the wider scientific community...
provided with data, metadata, analysis methods, algorithms
enabled by data, metadata, analysis methods, algorithms
The workflow of a “data scientist”...
[Figure: the data scientist at the centre of a cycle of Planning, Data Collection (use existing data or perform new experiment), Data Management, Analysis, Visualization and Publication]
Planning
What is your hypothesis?
What do you need to prove/disprove this?
You need an experimental design.
Preferably balanced groups, and enough samples for the results to be statistically valid.
If you need to generate the data...there’s an app for that.
Design Wizard
Plan your experiment by answering questions about what you want to measure and how you want to measure it...then let the tool create the design.
Creates the ISA-Tab stub, leaving you to fill in which files match which biological samples.
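To make the idea of a design stub concrete, here is a minimal Python sketch. It is not the Design Wizard's own output or API: the function names, the example factors and the sample naming scheme are all hypothetical, and the column headers ("Sample Name", "Factor Value[...]", "Raw Data File") are assumed to follow ISA-Tab naming conventions. It builds a balanced design (the same number of replicates for every factor combination) and writes a tab-delimited table with the file column left blank for the experimenter to fill in.

```python
import csv
import io
from itertools import product


def build_design_stub(factors, replicates):
    """Return one row per sample for a balanced design.

    Every combination of factor levels gets the same number of
    replicates; the data file column is left empty on purpose.
    """
    names = list(factors)
    rows = []
    for combo in product(*(factors[name] for name in names)):
        for rep in range(1, replicates + 1):
            # Hypothetical naming scheme: levels joined by "_" plus replicate.
            row = {"Sample Name": "_".join(combo) + f"_rep{rep}"}
            for name, level in zip(names, combo):
                row[f"Factor Value[{name}]"] = level
            row["Raw Data File"] = ""  # to be filled in once data exist
            rows.append(row)
    return rows


def to_tab_delimited(rows):
    """Serialise the rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical two-factor experiment with three replicates per group.
stub_rows = build_design_stub(
    {"treatment": ["control", "drug"], "timepoint": ["0h", "24h"]},
    replicates=3,
)
stub_text = to_tab_delimited(stub_rows)
print(stub_text)
```

Generating the rows from the factor levels, rather than typing them by hand, keeps the groups balanced by construction.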
You also need a data management strategy.
Which ontologies, minimal information checklists and exchange formats can be used for my domain?
What are the requirements of my funder for data deposition?
Which databases support my data?
Data Collection
New experiment data: collect the data and metadata from an experiment.
Use existing data: or use existing data and metadata to test a hypothesis...
[Figure, shown once for each option: SAMPLE 1 to SAMPLE 11 alongside their data files (FILE 1 to FILE 8; the remaining file labels are cut off)]
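Collecting metadata alongside data comes down to knowing which file belongs to which sample. The following Python sketch shows one way to check such a mapping for gaps. The function name and the example mapping are hypothetical and are not taken from the ISA tools or from the figure.

```python
def check_sample_file_links(samples, sample_to_files):
    """Report samples with no file and files claimed by several samples."""
    missing = [s for s in samples if not sample_to_files.get(s)]

    owners_by_file = {}
    for sample, files in sample_to_files.items():
        for name in files:
            owners_by_file.setdefault(name, []).append(sample)
    shared = {f: owners for f, owners in owners_by_file.items() if len(owners) > 1}

    return missing, shared


# Hypothetical mapping: eleven samples, only the first eight have a file yet.
samples = [f"SAMPLE {i}" for i in range(1, 12)]
links = {f"SAMPLE {i}": [f"FILE {i}"] for i in range(1, 9)}

missing, shared = check_sample_file_links(samples, links)
print("Samples without a file:", missing)
print("Files linked to more than one sample:", shared)
```

Running a check like this before deposition catches unlinked samples while the people who ran the experiment can still be asked.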
New experiment data
Excel
Create templates to fit the type of experiments to be described.
Curate your experiment using a desktop-based, platform-independent tool.
Describe & curate your experiment with geographically distributed collaborators.
Check out http://isa-tools.org to download.
Enables the creation of meaningful experiment data in a simple, extendable format.
Use existing data
Data Management
Shifting towards a new system
Analysis
The interesting bit...doing something with our data and metadata...
Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis. Also suggests packages to use for analysis based on the data types in the ISA-Tab file. Publication coming soon...
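As an illustration of what bringing together the context and the data can mean in practice, here is a minimal sketch. It is written in Python rather than R and does not reproduce the API of the R package mentioned above. The table contents, the column headers and the function name are assumptions made for this example. It reads a tab-delimited sample table and groups the data files by an experimental factor, so that an analysis can compare groups without hand-written file lists.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-delimited table in the style of an ISA-Tab assay file.
ASSAY_TABLE = (
    "Sample Name\tFactor Value[treatment]\tRaw Data File\n"
    "SAMPLE 1\tcontrol\tFILE 1\n"
    "SAMPLE 2\tcontrol\tFILE 2\n"
    "SAMPLE 3\tdrug\tFILE 3\n"
    "SAMPLE 4\tdrug\tFILE 4\n"
)


def group_files_by_factor(table_text, factor_column, file_column="Raw Data File"):
    """Map each level of a factor to the data files measured under it."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[factor_column]].append(row[file_column])
    return dict(groups)


groups = group_files_by_factor(ASSAY_TABLE, "Factor Value[treatment]")
for level, files in groups.items():
    print(level, files)
```

Because the grouping comes from the metadata itself, the same code works unchanged for any study that records its factors in the table.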
Analysis of ISA-Tab data in the Galaxy Environment. Creates Galaxy Library objects from ISA-Tab files.
Analysis of ISA-Tab data in the GenomeSpace Environment. Load and edit files stored on distributed servers.
Created by Brad Chapman at the Harvard School of Public Health
Visualization
Check out your experiment, visualize experimental design
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013, IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012, IEEE TVCG
Publication
Getting your work out there...
Publish, along with your research articles & specialised community repositories
Share, link and reason over experiments with linked data
Publication - current work
See http://www.slideshare.net/GigaScience/scott-edmunds-ismb-talk-on-big-data-publishing for a use case showing how we achieve this.
Metadata - encapsulates all the information about the experiment, providing links to the data files, publications and analysis protocols
Analysis results
Data files
Publications
Analysis workflows in the Galaxy Environment
Workflows
Presentations
Logs
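The idea of metadata that encapsulates the experiment and links out to everything else can be sketched as a small manifest. This is only an illustration: the manifest layout, the file names and both function names are hypothetical, and none of it represents the format used in the work described here.

```python
import json

# Kinds of item the slide lists as linked from the metadata (names are ours).
EXPECTED_KINDS = [
    "data_files",
    "analysis_results",
    "publications",
    "workflows",
    "presentations",
    "logs",
]


def build_experiment_bundle(metadata_file, linked):
    """Return a manifest: one metadata record pointing at every linked item."""
    return {
        "metadata": metadata_file,
        "links": {kind: sorted(paths) for kind, paths in linked.items()},
    }


def unlinked_kinds(manifest, expected_kinds):
    """List the expected kinds of item that the manifest does not link yet."""
    return [kind for kind in expected_kinds if not manifest["links"].get(kind)]


# Hypothetical bundle contents.
bundle = build_experiment_bundle(
    "i_investigation.txt",
    {
        "data_files": ["FILE 2", "FILE 1"],
        "analysis_results": ["results.tsv"],
        "publications": ["article.pdf"],
        "workflows": ["analysis_workflow.ga"],
    },
)

print(json.dumps(bundle, indent=2))
print("Still to link:", unlinked_kinds(bundle, EXPECTED_KINDS))
```

Keeping the links in one machine-readable record means a reader, or a script, can find every part of the experiment from a single starting point.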
Box it all up
The role of a data scientist (or, in the life sciences, a bioinformatician) is multifold.
We’ve presented a suite of tools to help the data scientist in the management of their data and support the creation of open, meaningful life science data.
[Figure: the data scientist workflow, with its stages (Planning, Data Collection, Data Management, Analysis, Visualization, Publication) labelled a to f]
“data science” - the storage, management and analysis of data sets
“data science” - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.
With the systems we have in place for data discovery, paired with data already created with the ISA suite of tools, we make data integration possible.
The workflow of a “data scientist”...
[Figure: the workflow diagram repeated, in growing numbers, across several slides]
Making science with data possible...
Recent Publications
Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013, IEEE TVCG
Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012, IEEE TVCG
ISA software suite: Overview of ISA-Tab and first set of tools
Rocca-Serra et al, 2010, Bioinformatics
Towards interoperable bioscience data: Presenting the ISA Commons, authored by more than 50 collaborators at over 30 scientific organizations around the globe.
Sansone et al, 2012, Nature Genetics
OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets.
Maguire et al, 2013, Bioinformatics
The Harvard Stem Cell Discovery Engine: an integrated repository and analysis system for
Ho Sui et al, 2012, Nucleic Acids Research
Taxonomy-based Glyph Design: Visualizing (ISA based) workflows of biological experiments
Maguire et al, 2012, IEEE TVCG
Standardizing data: ISA-Tab-Nano: ISA-Tab extension for nanotechnology applications authored by over 20 organizations incl. government agencies, academia and industry.
Baker et al, 2013, Nature Nanotechnology
MetaboLights: an open-access repository for metabolomics at EBI powered by ISA.
Haug et al, 2013, Nucleic Acids Research
The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects develops the ISAtoRDF module
Kohonen et al, 2013, Molecular Informatics
Thanks to
ISA team
Susanna-Assunta Sansone
Philippe Rocca-Serra
Eamonn Maguire
Alejandra Gonzalez-Beltran
Contributors
Marco Brandizi
Natalija Sklyar
Brad Chapman
Bob MacCallum
Kenneth Haug
Pablo Conesa
Audrey Kauffman
Funders
& Our Many Collaborators!
Stem Cell Commons
Nanotechnology Informatics Working Group
Questions?
http://isa-tools.org
http://isacommons.org
http://biosharing.org

Reproducible, Open Data Science in the Life Sciences

  • 1. Reproducible, Open Data Science in the Life Sciences Digital Research 2013, 9th-10th September 2013 Eamonn Maguire Lead Software Engineer University of Oxford e-Research Centre
  • 2. “data science” - the storage, management and analysis of data sets or... What is “data science”... these days, all hard sciences are “data sciences” "data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.  we shift the focus of science from performing physical experiments where data is the by-product used to test a hypothesis, to working directly with the data both definitions have different levels of validity in terms of the etymology of the word “science”, but in this presentation, both go very much hand in hand.
  • 3. Why reproducible and open? open experiments are expensive...and often funded publicly. data from experiments may extend way beyond the realms originally intended... one without the other is really no use at all. reproducible findings need to be robust...and testable by the wider scientific community... provided with data, metadata, analysis methods, algorithms enabled by data, metadata, analysis methods, algorithms
  • 5. Planning Planning What is your hypothesis? What do you need to prove/disprove this? You need an experimental design. Preferably balanced groups, but enough samples to make it statistically valid. If you need to generate the data...there’s an app for that. + Design Wizard Plan your experiment by answering questions about what you want to measure and how you want to measure it...then let the tool create the design Creates the ISA-Tab stub, leaving you to fill in which files match which biological samples.
  • 6. Planning Planning You also need a data management strategy. Which ontologies, minimal information checklists and exchange formats can be used for my domain? What are the requirements of my funder for data deposition? Which databases support my data?
  • 7. Data Collection Data Collection Use existing data Perform new experiment New experiment data Use existing data Collect the data and metadata from an experiment Or use existing data and metadata to test a hypothesis... SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 SAMPLE 5 SAMPLE 6 SAMPLE 7 SAMPLE 8 SAMPLE 9 SAMPLE 10 SAMPLE 11 FILE 1 FILE 2 FILE 3 FILE 4 FILE 5 FILE 6 FILE 7 FILE 8 FIL FIL FIL SAMPLE 1 SAMPLE 2 SAMPLE 3 SAMPLE 4 SAMPLE 5 SAMPLE 6 SAMPLE 7 SAMPLE 8 SAMPLE 9 SAMPLE 10 SAMPLE 11 FILE 1 FILE 2 FILE 3 FILE 4 FILE 5 FILE 6 FILE 7 FILE 8 FIL FIL FIL
  • 8. Data Collection Use existing data Perform new experiment New experiment data Excel Create templates to fit the type of experiments to be described Curate your experiment using a desktop-based, platform independent tool. Describe & curate your experiment with geographically distributed collaborators Check out http://isa-tools.org to download. Enables the creation of meaningful experiment data in a simple extendable format
  • 9. Data Collection Use existing data Perform new experiment Use existing data
  • 12. Analysis Analysis The interesting bit...doing something with our data and metadata... Analysis of ISA Tab data in the R language. Brings together the context and data to enable more meaningful analysis. Also suggests packages to use for analysis based on the data types in the ISA Tab file. Publication coming soon... Analysis of ISA-Tab data in the Galaxy Environment. Creates Galaxy Library objects from ISA-Tab files. Analysis of ISA-Tab data in the GenomeSpace Environment. Load and edit files stored on distributed servers. Created by Brad Chapman at the Harvard School for Public Health
  • 13. Visualization: Check out your experiment and visualize the experimental design. Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs (Maguire et al., 2013, IEEE TVCG). Taxonomy-based Glyph Design, with a Case Study on Visualizing Workflows of Biological Experiments (Maguire et al., 2012, IEEE TVCG).
  • 14. Publication: Getting your work out there... Publish alongside your research articles and in specialised community repositories. Share, link and reason over experiments with linked data.
  • 15. Publication - current work: The metadata encapsulates all the information about the experiment, providing links to the data files, publications and analysis protocols. Linked items: analysis results, data files, publications, analysis workflows in the Galaxy environment, presentations, logs. See http://www.slideshare.net/GigaScience/scott-edmunds-ismb-talk-on-big-data-publishing for a use case showing how we achieve this.
  • 16. Box it all up: The role of a data scientist (or, in the life sciences, a bioinformatician) is manifold. We’ve presented a suite of tools to help the data scientist manage their data and to support the creation of open, meaningful life science data. [Diagram: the data scientist at the centre of six steps: (a) Planning, (b) Data Collection, (c) Data Management, (d) Analysis, (e) Visualization, (f) Publication.] “data science” - the storage, management and analysis of data sets. “data science” - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis. By pairing the systems we have in place for data discovery with data already created using the ISA suite of tools, we make data integration possible.
  • 17. The workflow of a “data scientist”... [Diagram: the workflow cycle (Planning, Data Collection from existing data or a new experiment, Data Management, Analysis, Visualization, Publication), shown for two data scientists.] Making science with data possible...
  • 18. The workflow of a “data scientist”... [Diagram: the same workflow cycle, repeated for a third data scientist.] And this... Making science with data possible...
  • 19. The workflow of a “data scientist”... [Diagram: the same workflow cycle, repeated for a fourth data scientist.] Making science with data possible...
  • 20. The workflow of a “data scientist”... [Diagram: many copies of the same workflow cycle, some showing only a subset of the steps.] Making science with data possible...
  • 21. Recent Publications: Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs (Maguire et al., 2013, IEEE TVCG). Taxonomy-based Glyph Design, with a Case Study on Visualizing Workflows of Biological Experiments (Maguire et al., 2012, IEEE TVCG). ISA software suite: Overview of ISA-Tab and first set of tools (Rocca-Serra et al., 2010, Bioinformatics). Towards interoperable bioscience data: Presenting the ISA Commons, authored by more than 50 collaborators at over 30 scientific organizations around the globe (Sansone et al., 2012, Nature Genetics). OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets (Maguire et al., 2013, Bioinformatics). The Harvard Stem Cell Discovery Engine: an integrated repository and analysis system for [title truncated in source] (Ho Sui et al., 2012, Nucleic Acids Research). Visualizing (ISA-based) workflows of biological experiments: Taxonomy-based Glyph Design (Maguire et al., 2012, IEEE TVCG). Standardizing data: ISA-Tab-Nano: ISA-Tab extension for nanotechnology applications, authored by over 20 organizations incl. government agencies, academia and industry (Baker et al., 2013, Nature Nanotechnology). MetaboLights: an open-access repository for metabolomics at EBI powered by ISA (Haug et al., 2013, Nucleic Acids Research). The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects develops the ISAtoRDF module (Kohonen et al., 2013, Molecular Informatics).
  • 22. Thanks to: ISA team: Susanna-Assunta Sansone, Philippe Rocca-Serra, Eamonn Maguire, Alejandra Gonzalez-Beltran. Contributors: Marco Brandizi, Natalija Sklyar, Brad Chapman, Bob MacCallum, Kenneth Haug, Pablo Conesa, Audrey Kauffman. Funders & Our Many Collaborators! [Logos include Stem Cell Commons and the Nanotechnology Informatics Working Group.]
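
The planning and data collection steps (slides 5 and 7) describe a balanced design and a stub that leaves the sample-to-file matching to be filled in. The following is a minimal Python sketch of that idea, not the Design Wizard's actual output and not a valid ISA-Tab file. The function names, the group labels ("control", "treated") and the three-column layout are assumptions made for illustration; the column headers are only modelled on ISA-Tab naming conventions.

```python
import csv
import io
from itertools import cycle


def balanced_design(sample_ids, groups):
    """Assign samples to groups round-robin, so group sizes differ by at most one."""
    return {sample: group for sample, group in zip(sample_ids, cycle(groups))}


def write_stub(assignment):
    """Return a tab-delimited stub with an empty file column to fill in later.

    Hypothetical layout: headers imitate ISA-Tab style, but this is not ISA-Tab.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sample Name", "Factor Value[group]", "Raw Data File"])
    for sample, group in assignment.items():
        writer.writerow([sample, group, ""])
    return buf.getvalue()


def unmatched_samples(stub_text):
    """List the samples whose data file has not been filled in yet."""
    reader = csv.DictReader(io.StringIO(stub_text), delimiter="\t")
    return [row["Sample Name"] for row in reader if not row["Raw Data File"].strip()]


# Eleven samples, as in the slide 7 diagram; the two group labels are assumed.
samples = [f"SAMPLE {i}" for i in range(1, 12)]
design = balanced_design(samples, ["control", "treated"])
stub = write_stub(design)
missing = unmatched_samples(stub)
print(len(missing), "samples still need a data file")
```

Round-robin assignment is the simplest way to keep groups balanced; a real design would usually randomize the sample order first and check the sample size against the statistical power required.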