Big data, bioscience and the cloud biocatalyst june 2015 sullivan

One of the most challenging problems in bioscience is data integration. From subcellular studies to population simulations, we are faced with large volumes of data that are difficult to integrate. This presentation includes tips on getting started in big data bioscience.

Big Data, Bioscience
and the Cloud
Dan Sullivan
June 25, 2015
BioCatalyst: Cloud Computing in Bioscience
Oregon Bioscience Association

Overview
• Background
• Varieties of Big Data in Bioscience
• Continuous learning about Big Data & Cloud
• Making Connections

My Background
• Data Architect / Engineer
• NoSQL and relational data modeler
• Big data
• Analytics, machine learning and text mining
• Cloud computing
• Computational Biologist
• Author
• NoSQL for Mere Mortals
• Contributor to TechTarget

Big Data Challenges in Bioscience
Volume
Velocity
Variety
Integration

Varieties of Big Data in Bioscience
Subcellular – Genetics and Proteomics
Cellular – Metabolic and Signaling Pathways
Organism – Disease, Medicine, Insurance
Populations – Epidemiology, Social Networks

Genetics and Proteomics
• Genetic Sequencing
• Order of nucleotides in DNA
• Most DNA is common across species
• Many genes code for proteins
• Some variants associated with disease
• Which ones?
• Proteomics
• Structure and function of proteins
• Variation in protein sequence and structure associated with disease
• Which ones? In what context?
Images: http://www.masimo.it/hemoglobin/anemia.htm, https://en.wikipedia.org/wiki/DNA
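
To make the link between nucleotide order, protein sequence and disease-associated variants concrete, here is a minimal Python sketch. It is an illustration only, not part of the talk: the two short sequences are toy fragments modeled on the start of the human beta-globin coding sequence (the hemoglobin image credit above suggests the example), the codon table is deliberately partial, and the function names `translate` and `differences` are hypothetical.

```python
# Partial codon table: only the codons needed for this toy example.
CODON_TABLE = {
    "ATG": "M", "GTG": "V", "CAT": "H", "CTG": "L",
    "ACT": "T", "CCT": "P", "GAG": "E", "AAG": "K",
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = dna[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "?"))  # "?" = codon not in table
    return "".join(protein)


def differences(a, b):
    """Return (position, residue_a, residue_b) for each mismatch (1-based)."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Assumed toy fragments; they differ by one nucleotide in the seventh codon.
reference = "ATGGTGCATCTGACTCCTGAGGAGAAG"
variant = "ATGGTGCATCTGACTCCTGTGGAGAAG"

print(translate(reference))
print(translate(variant))
print(differences(translate(reference), translate(variant)))
```

A single nucleotide change swaps one amino acid in the translated protein. Finding which of the many such variants matter, and in what context, is the data problem the slide points to.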

Pathways
• Metabolic Pathways
• Series of chemical reactions
• Coordinated to produce reactants
• Choreography of molecules
• Signaling Pathways
• Molecules on cell surface detect changes in environment
• Cascade of reactions to change state of cell
• Choreography of molecules
• How do they interact?

Disease - Atherosclerosis
• Early 1950s: Korean War autopsies
• 1985-1998: Pathology studies - Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
• 2012-2016: Genomic and proteomic studies

Healthcare
• Genetics and Disease
• Post-Approval Drug Efficacy
• Discovering and Retrieving Medical Information
• Comparative Quality

Populations
• Infectious Disease Spread
• How fast will disease spread?
• What countermeasures are effective?
• What is the morbidity and mortality?
• Simulation
– Synthetic population
– Model interactions
– Probabilistic
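
The simulation bullets above (synthetic population, modeled interactions, probabilistic outcomes) can be sketched in a few lines of Python. This is a toy susceptible/infected/recovered model under stated assumptions, not the method used in any study mentioned in the talk: the function name `simulate_outbreak` and every parameter value (contacts per day, transmission and recovery probabilities) are hypothetical placeholders.

```python
import random


def simulate_outbreak(population=1000, initial_infected=5, contacts_per_day=8,
                      p_transmit=0.05, p_recover=0.1, days=120, seed=42):
    """Toy stochastic epidemic on a synthetic population.

    Returns one (susceptible, infected, recovered) tuple per simulated day.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # Synthetic population: one state per person ("S", "I" or "R").
    state = ["I"] * initial_infected + ["S"] * (population - initial_infected)
    history = []
    for _ in range(days):
        infected = [i for i, s in enumerate(state) if s == "I"]
        newly_infected = set()
        # Model interactions: each infected person meets random others.
        for _person in infected:
            for _ in range(contacts_per_day):
                other = rng.randrange(population)
                if state[other] == "S" and rng.random() < p_transmit:
                    newly_infected.add(other)
        # Probabilistic recovery of those infected at the start of the day.
        for person in infected:
            if rng.random() < p_recover:
                state[person] = "R"
        for person in newly_infected:
            state[person] = "I"
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


baseline = simulate_outbreak()
distancing = simulate_outbreak(contacts_per_day=3)  # a hypothetical countermeasure
print("peak infected, baseline:  ", max(i for _, i, _ in baseline))
print("peak infected, distancing:", max(i for _, i, _ in distancing))
```

Changing one parameter and rerunning is the basic way such a model addresses the questions on the slide: how fast the disease spreads and which countermeasures change the outcome. Realistic studies use far larger populations and many repeated runs.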

Why Cloud for Big Data in Bioscience?
• Scalability
• Access to compute- and memory-optimized virtual machines
• Virtually unlimited storage
• Speed
• Many bioscience computations are highly parallel
• Minimize time to analyze, lower IT overhead
• Cost
• AWS Spot Instances
• Google Preemptible VMs
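
As a rough illustration of why highly parallel computations suit multi-core cloud machines, the sketch below scores each DNA sequence independently and spreads the work across processes using Python's standard `concurrent.futures` module. The GC-content metric, the function names and the toy reads are assumptions chosen for brevity; they are not taken from the talk.

```python
from concurrent.futures import ProcessPoolExecutor


def gc_content(sequence):
    """Fraction of bases in a DNA sequence that are G or C."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def gc_content_parallel(sequences, workers=4):
    """Score each sequence in a worker process; input order is preserved."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


if __name__ == "__main__":
    # Hypothetical toy reads; real inputs would come from sequencing files.
    reads = ["ATGC", "GGCC", "ATAT", "GCGCAT"]
    print(gc_content_parallel(reads, workers=2))
```

Because no sequence depends on any other, the same pattern scales by raising `workers` on a larger virtual machine or by splitting the input across several machines.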

Continuous Learning
• Coursera
• Cloud Computing Concepts
• Bioinformatics: Life Sciences on Your Computer
• edX
• Introduction to Statistics
• Introduction to Biology
• Principles of Biochemistry
• Rackspace CloudU
• YouTube
• Big Data Vendors
• MapR
• Cloudera
• Hortonworks
• DataStax
• Databricks
• Trade Publications
– TechTarget
• SearchAWS
• SearchCloudComputing
• SearchCloudSecurity
– Health Data Management
– Harvard Business Review

LinkedIn Groups

Final Thoughts
• Great time to get into Biosciences and Big Data
• Don’t be intimidated if it’s been a while since you’ve studied biology – we are all constantly learning in this field
• Network online and in person
• Take advantage of free resources
• Courses
• Cloud
• AWS Free Tier
• MapR Hadoop On Demand Training
• Connect with me on LinkedIn
• https://www.linkedin.com/in/dansullivanpdx
• Join me at a Meetup
• Dan.sullivan@cambiahealth.com

Editor's Notes

Projects with any two of these (volume, velocity, variety) can probably be handled well by an RDBMS. When all three are encountered in one project, NoSQL can often provide better performance, with different levels of support for Consistency, Availability and Partition tolerance (CAP theorem).
Autopsies performed during the Korean War found evidence of early-onset atherosclerosis. There was not enough time for lifestyle factors, such as a high-fat diet, smoking and inactivity, to be the sole cause of plaque. Hypothesis: a genetic factor influences atherosclerosis. PDAY confirmed and expanded on the earlier findings. A large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes: 3,000 autopsies of 15-34 year olds. Aorta and LAD (left anterior descending coronary artery) samples were preserved in formalin-fixed, paraffin-embedded (FFPE) blocks. Liver samples were also collected. GPAA: use the liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks, which makes genomic and proteomic analysis possible today.
