One of the most challenging problems in bioscience is data integration. From subcellular studies to population simulations, we are faced with large volumes of difficult-to-integrate data. This presentation includes tips on getting started in big data bioscience.
3. My Background
Data Architect / Engineer
NoSQL and relational data modeler
Big data
Analytics, machine learning and text mining
Cloud computing
Computational Biologist
Author
NoSQL for Mere Mortals
Contributor to TechTarget
6. Varieties of Big Data in Bioscience
Subcellular – Genetics and Proteomics
Cellular – Metabolic and Signaling Pathways
Organism – Disease, Medicine, Insurance
Populations – Epidemiology, Social Networks
7. Genetics and Proteomics
• Genetic Sequencing
• Order of nucleotides in DNA
• Most DNA is common across species
• Many genes code proteins
• Some variants associated with disease
• Which ones?
• Proteomics
• Structure and function of proteins
• Variation in protein sequence and structure associated with disease
• Which ones? In what context?
Images: http://www.masimo.it/hemoglobin/anemia.htm, https://en.wikipedia.org/wiki/DNA
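The question of which variants a sample carries can be illustrated with a minimal sketch. It assumes two sequences that are already aligned and of equal length; the sequences and the `find_variants` helper are made up for illustration and are not from the presentation. Real pipelines use dedicated alignment and variant-calling tools.

```python
from typing import List, Tuple


def find_variants(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumption: both sequences are already aligned and the same length.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Made-up sequences, for illustration only.
reference = "ATGGTGCACCTGACTCCTGAG"
sample = "ATGGTGCACCTGACTCCTGTG"

for position, ref_base, alt_base in find_variants(reference, sample):
    print(f"position {position}: {ref_base} -> {alt_base}")
```

Finding the differences is the easy part; deciding which of them are associated with disease is where the integration with clinical and population data comes in.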
8. Pathways
• Metabolic Pathways
• Series of chemical reactions
• Coordinated to produce reactants
• Choreography of molecules
• Signaling Pathways
• Molecules on cell surface detect changes in environment
• Cascade of reactions to change state of cell
• Choreography of molecules
• How do they interact?
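One common way to reason about how molecules in a pathway interact is to treat the pathway as a directed graph. The sketch below is only an illustration: the molecule names and the `cascade` dictionary are hypothetical, and a real pathway model would come from a curated pathway database.

```python
from collections import deque

# Hypothetical signaling cascade: each key activates the molecules in its list.
cascade = {
    "receptor": ["kinase_A"],
    "kinase_A": ["kinase_B", "kinase_C"],
    "kinase_B": ["transcription_factor"],
    "kinase_C": [],
    "transcription_factor": [],
}


def downstream(graph, start):
    """Breadth-first walk returning every molecule reachable from `start`."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


print(downstream(cascade, "receptor"))
```

Graph databases, one family of NoSQL stores, are a natural fit for this kind of query once the pathway grows beyond what fits comfortably in memory.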
9. Disease - Atherosclerosis
• Early 1950s: Korean War autopsies
• 1985-1998: Pathology studies - Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
• 2012-2016: Genomic and proteomic studies
10. Healthcare
• Genetics and Disease
• Post-Approval Drug Efficacy
• Discovering and Retrieving Medical Information
• Comparative Quality
11. Populations
• Infectious Disease Spread
• How fast will disease spread?
• What countermeasures are effective?
• What is the morbidity and mortality?
• Simulation
– Synthetic population
– Model interactions
– Probabilistic
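The three simulation ingredients above (a synthetic population, modeled interactions, probabilistic outcomes) can be sketched in a few lines. This is a toy susceptible/infected/recovered model under assumed parameters, not the simulation used in any particular study; every parameter value and the `simulate_outbreak` name are illustrative assumptions.

```python
import random


def simulate_outbreak(population_size=1000, initial_infected=5,
                      contacts_per_day=8, transmission_prob=0.05,
                      recovery_prob=0.1, days=60, seed=42):
    """Toy stochastic outbreak; returns daily (susceptible, infected, recovered)."""
    rng = random.Random(seed)

    # Synthetic population: one state per person, "S", "I" or "R".
    state = ["S"] * population_size
    for person in rng.sample(range(population_size), initial_infected):
        state[person] = "I"

    history = []
    for _day in range(days):
        infected = [i for i, s in enumerate(state) if s == "I"]

        # Model interactions: each infected person meets random contacts.
        newly_infected = set()
        for person in infected:
            for _contact in range(contacts_per_day):
                other = rng.randrange(population_size)
                # Probabilistic: a contact transmits only some of the time.
                if state[other] == "S" and rng.random() < transmission_prob:
                    newly_infected.add(other)

        # Each currently infected person may recover today.
        for person in infected:
            if rng.random() < recovery_prob:
                state[person] = "R"
        for person in newly_infected:
            state[person] = "I"

        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


history = simulate_outbreak()
peak_day, peak = max(enumerate(history, start=1), key=lambda item: item[1][1])
print(f"peak of {peak[1]} infected on day {peak_day}")
```

Changing `contacts_per_day` or `transmission_prob` is a crude stand-in for testing countermeasures; because the model is probabilistic, meaningful answers come from running many seeds, which is where the compute demand grows.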
12. Why Cloud for Big Data in Bioscience?
• Scalability
• Access to compute- and memory-optimized virtual machines
• Virtually unlimited storage
• Speed
• Many bioscience computations are highly parallel
• Minimize time to analyze, lower IT overhead
• Cost
• AWS Spot Instances
• Google Preemptible VMs
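The point about highly parallel computations can be illustrated with a small sketch: the same function is applied independently to each sequence, so the work can be spread across workers. GC content is used here only as a convenient per-sequence calculation; the function names and the worker count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_content_many(sequences, max_workers=4):
    """Apply gc_content to each sequence independently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(gc_content, sequences))


# Made-up sequences, for illustration only.
print(gc_content_many(["GGCC", "ATAT", "ATGC"]))
```

Threads keep the sketch simple, but for CPU-bound work in CPython a `ProcessPoolExecutor` (which exposes the same `map` interface) or a cluster framework is the more realistic choice. Because each task is independent, this kind of job also tolerates interruption well, which is why low-cost interruptible instances are attractive.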
18. Final Thoughts
• Great time to get into Biosciences and Big Data
• Don’t be intimidated if it’s been a while since you’ve studied biology – we are all constantly learning in this field
• Network online and in person
• Take advantage of free resources
• Courses
• Cloud
• AWS Free Tier
• MapR Hadoop On Demand Training
• Connect with me on LinkedIn
• https://www.linkedin.com/in/dansullivanpdx
• Join me at a Meetup
• Dan.sullivan@cambiahealth.com
Editor's Notes
Projects with any two of these can probably be handled well by an RDBMS.
When all three are encountered in one project, NoSQL can often provide better performance, with different levels of support for consistency, availability and partition tolerance (CAP theorem).
Autopsies performed during the Korean War found evidence of early-onset atherosclerosis.
There was not enough time for lifestyle factors, such as a high-fat diet, smoking and inactivity, to be the sole cause of plaque. Hypothesis – a genetic factor influences atherosclerosis.
PDAY – confirmed and expanded on earlier findings. Large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes.
3,000 autopsies
15-34 year olds
Aorta and left anterior descending (LAD) coronary artery samples preserved in formalin-fixed, paraffin-embedded (FFPE) blocks.
Liver samples also collected.
GPAA - Use liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks. This makes genomic and proteomic analysis possible today.