SlideShare une entreprise Scribd logo
1  sur  38
Alternative Approaches to Managing and Integrating
Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division
 Bioinformatics and Relational Database
Management Systems (RDBMs)
 Use Cases – Text Mining and Atherosclerosis
 Bioinformatics and NoSQL Databases
 How to Choose a Database for Your Project
 Closing Comments
Relational Database – a database that [explicitly] stores
information about both the data and how it is related.”
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a
mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used
in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)
Volume of data
Variety of data
Integration of data
 Pragmatic
 Widely applicable
 Many options
 Modeling
 Reduce risk of data
anomalies.
 Separate logical
and physical
models
The key,
The whole key, and
Nothing but the key.
Implementation
bottlenecks
vs.
Data
Modeler
Developer
Scaling-up vs.
scaling-out
Frequent need for
denormalization
 Bioinformatics and Relational Database
Management Systems (RDBMs)
 Use Cases – Text Mining and Atherosclerosis
 Bioinformatics and NoSQL Databases
 How to Choose a Database for Your Project
 Closing Comments
Text Mining
Storing Text
Caching Word Vectors
Extracted Features
Experiment Results
Atherosclerosis
Research
Demographics
Sample Tracking
Genomic data
Sequence Variants
Mass Spec Results
Early 1950s Korean War
autopsies
2012-2016 Genomic and Proteomic
Architecture of Atherosclerosis (GPAA)
1985-1998 Pathodeterminants
of Atherosclerosis in Youth
(PDAY) study
“… tell your
children not to do
what I have done …”
House of the Rising Sun
American Folk Song
Started with
MySQL
Could have stayed with
relational model, but:
Requirements change
New data sets
Unknown data structures
Increasingly complex
normalized model
 Bioinformatics and Relational Database
Management Systems (RDBMs)
 Use Cases – Text Mining and Atherosclerosis
 Bioinformatics and NoSQL Databases
 How to Choose a Database for Your Project
 Closing Comments
Scalability
Cost
Availability
Consistency
Flexibility
 Key Value Databases
 Document Databases
 Wide Column Stores
 Graph Databases
 Search Engines
Features
Simple primitive data
structure
No predefined schema
Limited query capabilities
Dictionary-like
functionality at large scale
key3
key2
key1 value1
value2
value2
Bioinformatics Use Case
Word vectors in text
mining
Caching
Limitations
Key lookup only, no
generalized query
Small number of
attributes per entity
>>> Import redis
>>> r_server = redis.Redis(“localhost”)
>>> r_server.set(“sample:123:type”,”Aorta”)
>>> r_server.get(“sample:123:type”)
>>> “Aorta”
Features
 JSON/XML structures
 Fields vary between docs
 No predefined schema
 Documents analogous to
rows
 Collections analogous to
tables
 Query capabilities
Bioinformatics Use Case
Text mining
Atherosclerosis
Limitations
No joins
No referential integrity
checks
Object-based query language
{
id : <value>,
<key> : <value>,
<key> : <embedded
document>,
<key> : <array>
}
{
subject_id: "F8273",
age : "26",
sex : "M"
date_of_death : "12-Jan-1995”,
glycohemoglobin: 10%,
BMI : 22,
samples : [ {type:"Thoracic Aorta", AHA_score: 1},
{type:"Abdominal Aorta", AHA_score: 2},
{type:"LAD", AHA_Score:5} ],
sequence: {seq_file: "F8273_08152014.bam",
variant_file: "F8273_08152014.vcf”}
}
Features
Groups attributes into
column families
Column families store key-
value pairs
Implemented as sparse
multi-dimensional arrays
Denormalized
104-106 columns; 109 rows
 Bioinformatics Use Case
 Large studies
 Many experiments & data types
 Simulations
 Limitations
 Operationally
challenging
 Suitable for large
number of servers
Limitations
Less suited for tabular
data
Features
Highly normalized
Graph-based query
language (Gremlin)
SQL-inspired query
language (Cypher)
Support for path finding
and recursion Bioinformatics Use Case
Epidemiology
simulations
Interaction networks
 Bioinformatics and Relational Database
Management Systems (RDBMs)
 Use Cases – Text Mining and Atherosclerosis
 Bioinformatics and NoSQL Databases
 How to Choose a Database for Your Project
 Closing Comments
Relational:
Requirements known at start
of project
Entities described by common
attributes
Compliance and audit issues
Need normalization
Acceptable performance on
small number of servers
Need server side joins

Key value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data
types
Use denormalized
data
key3
key2
key1 value1
value2
value3
{
id : <value>,
<key> : <value>,
<key> : <embedded
document>,
<key> : <array>
}
 Wide column data stores:
 Extremely large volumes
of data
 High availability
 Graph Databases:
 Connected data
 Need path finding and
recursive queries
Multiple types of databases
NoSQL complements relational models
Research question drives selection
Balance benefits and limitations
May use multiple types of databases in a
single project
NoSQL databases are improving rapidly,
gaining additional functionality
* Slide 1:
* http://upload.wikimedia.org/wikipedia/commons/e/e9/Arthrobacter_arilaitensis_Re
117_genome.png
* http://upload.wikimedia.org/wikipedia/commons/b/b4/The_protein_interaction_net
work_of_Treponema_pallidum.png
* http://upload.wikimedia.org/wikipedia/commons/c/c7/Picoplancton_cytometrie.jpg
* http://www.ncbi.nlm.nih.gov/pubmed/?term=salmonella+typhimurium
* http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-form-a-hypothesis-5/
* Slide 2:
* http://pixabay.com/id/spreadsheet-excel-tabel-diagram-98491/
* http://en.wikipedia.org/wiki/File:MySQL.svg
* http://commons.wikimedia.org/wiki/File:Database-postgres.svg
* http://commons.wikimedia.org/wiki/File:SQLite_Logo_4.png
* http://commons.wikimedia.org/wiki/File:Oracle_logo.svg
* http://upload.wikimedia.org/wikipedia/commons/7/78/Sql-server-ce-4-logo.png
* Slide 3
* http://faculty.csuci.edu/Fminder.chen/mba550/caseStudy
* http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
* Sllide 4
* http://pixabay.com/static/uploads/photo/2013/07/12/17/22/database-
152091_640.png
* http://www.clker.com/clipart-desk-work.html
* Slide 6
* http://cdn.ttgtmedia.com/ITKE/uploads/blogs.dir/141/files/2012/09/greendatacent
er.jpg
* Slide 7
* http://en.wikipedia.org/wiki/Chase_(bank)
* http://en.wikipedia.org/wiki/Computer-
aided_dispatch#mediaviewer/File:Moderne_Leitstelle_Arbeitsplatz.jpg
* http://olioshealth.com/services/electronic-medical-record-implementation/
* Slide 9
* http://tran-bio3u-
fall09.wikispaces.com/file/view/Atherosclerosis.jpg/114176189/Atherosclerosis.jpg
* Slide 11
* http://arteriosclerotic.org/arteriosclerotic-cardiovascular/
Slide 12
http://opentodo.net/wp-content/uploads/2014/05/redis-300dpi.png
http://en.wikipedia.org/wiki/File:Riak_product_logo.png
http://download.oracle.com/berkeley-
db/docs/je/3.2.76/images/Oracle_BerkeleyDB_clr.bmp
http://www.yegor256.com/images/2014/04/dynamodb-logo.png
https://foundationdb.com/
http://www.aerospike.com/
Slide 13
http://arnoldit.com/wordpress/2008/05/07/enterprise-search-and-train-
wrecks/
Slide 15
http://upload.wikimedia.org/wikipedia/en/e/eb/MongoDB_Logo.png
http://tomphilip.me/couchdb-its-too-easy/
http://www.datanami.com/2014/02/25/look_out_mongo_here_comes_cou
chbase/
http://ravendb.net/
https://cloudant.com/
Slide 17
http://en.wikipedia.org/wiki/Apache_Cassandra#mediaviewer/File:Cassan
dra_logo.svg
https://svn.apache.org/repos/asf/hbase/branches/instant_schema_alter/s
rc/site/resources/images
https://accumulo.apache.org/
http://hypertable.com/http://radar.oreilly.com/2013/07/why-choose-a-
graph-database.html
Slide 18
http://genome.ucsc.edu/cgi-
bin/hgTracks?db=hg19&position=chr10%3A90973326-
90985006&hgsid=391056163_yzDnkth3pso3om9pe5BgBFunDug5
Slide 19
https://github.com/thinkaurelius/titan
http://www.neotechnology.com/logos/
http://en.wikipedia.org/wiki/OrientDB#mediaviewer/File:OrientdbLogo.p
ng
http://franz.com/
Slide 21
http://blogs.teradata.com/international/why-the-reports-of-the-death-
of-the-relational-database-are-an-exaggeration/
*Dr. Rebecca Wattam,
Advisor
*Becky Will, GPAA VT PI
*Chengdong Zhang, DBA & SE
*Cyberinfrastructure Division
*GPAA Collaborators
Limits of RDBMS and Need for NoSQL in Bioinformatics

Contenu connexe

Tendances

Environmental Genomics
Environmental Genomics Environmental Genomics
Environmental Genomics Erik Rumbaugh
 
Hydrocarbon degrading fungi
Hydrocarbon degrading fungiHydrocarbon degrading fungi
Hydrocarbon degrading fungianku00009
 
Industrial biotechnology presentattion
Industrial biotechnology presentattionIndustrial biotechnology presentattion
Industrial biotechnology presentattionAmulyaSingh10
 
Gene linkage analysis for crime scene
Gene linkage analysis for crime sceneGene linkage analysis for crime scene
Gene linkage analysis for crime sceneShabnam Ameenudeen
 
Biotechnology as a career
Biotechnology as a careerBiotechnology as a career
Biotechnology as a careerIshita Sidhu
 
Stem Cells and Tissue Engineering: past, present and future
Stem Cells and Tissue Engineering: past, present and futureStem Cells and Tissue Engineering: past, present and future
Stem Cells and Tissue Engineering: past, present and futureAna Rita Ramos
 
Biorefining and biobased products
Biorefining and biobased productsBiorefining and biobased products
Biorefining and biobased productsNNFCC
 
Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological databaseKAUSHAL SAHU
 
Presentation bioenergy
Presentation  bioenergyPresentation  bioenergy
Presentation bioenergyFardin Tiha
 
Tissue Engineering Poster
Tissue Engineering PosterTissue Engineering Poster
Tissue Engineering PosterShasta Rizzi
 
Engineering bone tissue using human Embryonic Stem Cells
Engineering bone tissue using human Embryonic Stem CellsEngineering bone tissue using human Embryonic Stem Cells
Engineering bone tissue using human Embryonic Stem CellsBalaganesh Kuruba
 
Future biotechnology
Future biotechnologyFuture biotechnology
Future biotechnologyOmnia Mohamed
 
Protein Sequence Databases
Protein Sequence Databases Protein Sequence Databases
Protein Sequence Databases Hemant Bothe
 

Tendances (20)

Protein structure analysis
Protein structure analysis Protein structure analysis
Protein structure analysis
 
Protein Data Bank (PDB)
Protein Data Bank (PDB)Protein Data Bank (PDB)
Protein Data Bank (PDB)
 
Environmental Genomics
Environmental Genomics Environmental Genomics
Environmental Genomics
 
Hydrocarbon degrading fungi
Hydrocarbon degrading fungiHydrocarbon degrading fungi
Hydrocarbon degrading fungi
 
Biofuels
 Biofuels Biofuels
Biofuels
 
Industrial biotechnology presentattion
Industrial biotechnology presentattionIndustrial biotechnology presentattion
Industrial biotechnology presentattion
 
Gene linkage analysis for crime scene
Gene linkage analysis for crime sceneGene linkage analysis for crime scene
Gene linkage analysis for crime scene
 
Biotechnology as a career
Biotechnology as a careerBiotechnology as a career
Biotechnology as a career
 
Stem Cells and Tissue Engineering: past, present and future
Stem Cells and Tissue Engineering: past, present and futureStem Cells and Tissue Engineering: past, present and future
Stem Cells and Tissue Engineering: past, present and future
 
Cloning vector
Cloning vectorCloning vector
Cloning vector
 
Biorefining and biobased products
Biorefining and biobased productsBiorefining and biobased products
Biorefining and biobased products
 
Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological database
 
Presentation bioenergy
Presentation  bioenergyPresentation  bioenergy
Presentation bioenergy
 
Tissue Engineering Poster
Tissue Engineering PosterTissue Engineering Poster
Tissue Engineering Poster
 
Engineering bone tissue using human Embryonic Stem Cells
Engineering bone tissue using human Embryonic Stem CellsEngineering bone tissue using human Embryonic Stem Cells
Engineering bone tissue using human Embryonic Stem Cells
 
Future biotechnology
Future biotechnologyFuture biotechnology
Future biotechnology
 
Introduction to Proteogenomics
Introduction to Proteogenomics Introduction to Proteogenomics
Introduction to Proteogenomics
 
TOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBITOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBI
 
Protein Sequence Databases
Protein Sequence Databases Protein Sequence Databases
Protein Sequence Databases
 
Analysis of gene expression
Analysis of gene expressionAnalysis of gene expression
Analysis of gene expression
 

En vedette

Key-Value Pairs
Key-Value PairsKey-Value Pairs
Key-Value Pairslittledata
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to chooseLars Thorup
 
Nosql part1 8th December
Nosql part1 8th December Nosql part1 8th December
Nosql part1 8th December Ruru Chowdhury
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Cloudera, Inc.
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisrobertstevens65
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-ConceptsBhaskar Gunda
 
Genomics in Public Health
Genomics in Public HealthGenomics in Public Health
Genomics in Public HealthJennifer Gardy
 
An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"Asar Khan
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSpark Summit
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQLRTigger
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQLMike Crabb
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 

En vedette (19)

Key-Value Pairs
Key-Value PairsKey-Value Pairs
Key-Value Pairs
 
SQL or NoSQL - how to choose
SQL or NoSQL - how to chooseSQL or NoSQL - how to choose
SQL or NoSQL - how to choose
 
Know what is NOSQL
Know what is NOSQL Know what is NOSQL
Know what is NOSQL
 
Nosql part1 8th December
Nosql part1 8th December Nosql part1 8th December
Nosql part1 8th December
 
SQL & NoSQL
SQL & NoSQLSQL & NoSQL
SQL & NoSQL
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
 
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambisTAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
TAMBIS: Transparent Access to Multiple Bioinformatics Information SourcesTambis
 
NoSQL-Database-Concepts
NoSQL-Database-ConceptsNoSQL-Database-Concepts
NoSQL-Database-Concepts
 
Genomics in Public Health
Genomics in Public HealthGenomics in Public Health
Genomics in Public Health
 
An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"
 
Solving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized GenomicsSolving The N+1 Problem In Personalized Genomics
Solving The N+1 Problem In Personalized Genomics
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQL
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 

Similaire à Limits of RDBMS and Need for NoSQL in Bioinformatics

2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptxRushikeshChikane2
 
Presentation On NoSQL Databases
Presentation On NoSQL DatabasesPresentation On NoSQL Databases
Presentation On NoSQL DatabasesAbiral Gautam
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lernetarunprajapati0t
 
Softwae and database in data communication network
Softwae and database in data communication networkSoftwae and database in data communication network
Softwae and database in data communication networkAyoubSohiabMohammad
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013Facundo Farias
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...Felix Gessert
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGijiert bestjournal
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...ijdms
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 
CBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationCBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationGuru Ji
 
No sql – rise of the clusters
No sql – rise of the clustersNo sql – rise of the clusters
No sql – rise of the clustersresponseteam
 
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptxRushikeshChikane2
 

Similaire à Limits of RDBMS and Need for NoSQL in Bioinformatics (20)

2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx2.Introduction to NOSQL (Core concepts).pptx
2.Introduction to NOSQL (Core concepts).pptx
 
Presentation On NoSQL Databases
Presentation On NoSQL DatabasesPresentation On NoSQL Databases
Presentation On NoSQL Databases
 
data base system to new data science lerne
data base system to new data science lernedata base system to new data science lerne
data base system to new data science lerne
 
Softwae and database in data communication network
Softwae and database in data communication networkSoftwae and database in data communication network
Softwae and database in data communication network
 
Nosql
NosqlNosql
Nosql
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
NoSQL Basics and MongDB
NoSQL Basics and  MongDBNoSQL Basics and  MongDB
NoSQL Basics and MongDB
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
 
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMINGEVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING
 
Nosql
NosqlNosql
Nosql
 
Unit-10.pptx
Unit-10.pptxUnit-10.pptx
Unit-10.pptx
 
Unit01 dbms
Unit01 dbmsUnit01 dbms
Unit01 dbms
 
Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...Comparative study of no sql document, column store databases and evaluation o...
Comparative study of no sql document, column store databases and evaluation o...
 
RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
CBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL PresentationCBSE XII Database Concepts And MySQL Presentation
CBSE XII Database Concepts And MySQL Presentation
 
No sql – rise of the clusters
No sql – rise of the clustersNo sql – rise of the clusters
No sql – rise of the clusters
 
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
3.Implementation with NOSQL databases Document Databases (Mongodb).pptx
 
No sq lv2
No sq lv2No sq lv2
No sq lv2
 

Plus de Dan Sullivan, Ph.D.

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryDan Sullivan, Ph.D.
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery MLDan Sullivan, Ph.D.
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningDan Sullivan, Ph.D.
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured dataDan Sullivan, Ph.D.
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupDan Sullivan, Ph.D.
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyDan Sullivan, Ph.D.
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivanDan Sullivan, Ph.D.
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyDan Sullivan, Ph.D.
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsDan Sullivan, Ph.D.
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Dan Sullivan, Ph.D.
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesDan Sullivan, Ph.D.
 

Plus de Dan Sullivan, Ph.D. (13)

How to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQueryHow to Design a Modern Data Warehouse in BigQuery
How to Design a Modern Data Warehouse in BigQuery
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Getting Started with BigQuery ML
Getting Started with BigQuery MLGetting Started with BigQuery ML
Getting Started with BigQuery ML
 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine Learning
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured data
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
Text mining meets neural nets
Text mining meets neural netsText mining meets neural nets
Text mining meets neural nets
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False Dichotomy
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
 
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual PropertyTools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key Patterns
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 

Dernier

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Dernier (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Limits of RDBMS and Need for NoSQL in Bioinformatics

  • 1. Alternative Approaches to Managing and Integrating Bioinformatics Data GBCB Seminar October 9, 2014 Dan Sullivan Cyberinfrastructure Division
  • 2.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 3. Relational Database – a database that [explicitly] stores information about both the data and how it is related.” (Source: http://en.wikipedia.org/wiki/Relational_database) NoSQL Database – “[a] database [that] provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.” (Source: http://en.wikipedia.org/wiki/NoSQL)
  • 4. Volume of data Variety of data Integration of data
  • 5.
  • 6.  Pragmatic  Widely applicable  Many options  Modeling  Reduce risk of data anomalies.  Separate logical and physical models
  • 7. The key, The whole key, and Nothing but the key.
  • 9.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 10.
  • 11.
  • 12. Text Mining Storing Text Caching Word Vectors Extracted Features Experiment Results Atherosclerosis Research Demographics Sample Tracking Genomic data Sequence Variants Mass Spec Results
  • 13.
  • 14. Early 1950s Korean War autopsies 2012-2016 Genomic and Proteomic Architecture of Atherosclerosis (GPAA) 1985-1998 Pathodeterminants of Atherosclerosis in Youth (PDAY) study
  • 15. “… tell your children not to do what I have done …” House of the Rising Sun American Folk Song
  • 16. Started with MySQL Could have stayed with relational model, but: Requirements change New data sets Unknown data structures Increasingly complex normalized model
  • 17.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 19.  Key Value Databases  Document Databases  Wide Column Stores  Graph Databases  Search Engines
  • 20. Features Simple primitive data structure No predefined schema Limited query capabilities Dictionary-like functionality at large scale key3 key2 key1 value1 value2 value2 Bioinformatics Use Case Word vectors in text mining Caching Limitations Key lookup only, no generalized query Small number of attributes per entity
  • 21. >>> Import redis >>> r_server = redis.Redis(“localhost”) >>> r_server.set(“sample:123:type”,”Aorta”) >>> r_server.get(“sample:123:type”) >>> “Aorta”
  • 22.
  • 23. Features  JSON/XML structures  Fields vary between docs  No predefined schema  Documents analogous to rows  Collections analogous to tables  Query capabilities Bioinformatics Use Case Text mining Atherosclerosis Limitations No joins No referential integrity checks Object-based query language { id : <value>, <key> : <value>, <key> : <embedded document>, <key> : <array> }
  • 24. { subject_id: "F8273", age : "26", sex : "M" date_of_death : "12-Jan-1995”, glycohemoglobin: 10%, BMI : 22, samples : [ {type:"Thoracic Aorta", AHA_score: 1}, {type:"Abdominal Aorta", AHA_score: 2}, {type:"LAD", AHA_Score:5} ], sequence: {seq_file: "F8273_08152014.bam", variant_file: "F8273_08152014.vcf”} }
  • 25.
  • 26. Features Groups attributes into column families Column families store key- value pairs Implemented as sparse multi-dimensional arrays Denormalized 104-106 columns; 109 rows  Bioinformatics Use Case  Large studies  Many experiments & data types  Simulations  Limitations  Operationally challenging  Suitable for large number of servers
  • 27.
  • 28. Limitations Less suited for tabular data Features Highly normalized Graph-based query language (Gremlin) SQL-inspired query language (Cypher) Support for path finding and recursion Bioinformatics Use Case Epidemiology simulations Interaction networks
  • 29.
  • 30.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 31. Relational: Requirements known at start of project Entities described by common attributes Compliance and audit issues Need normalization Acceptable performance on small number of servers Need server side joins 
  • 32. Key value: Caching Few attributes Document databases: Varying attributes Integrate diverse data types Use denormalized data key3 key2 key1 value1 value2 value3 { id : <value>, <key> : <value>, <key> : <embedded document>, <key> : <array> }
  • 33.  Wide column data stores:  Extremely large volumes of data  High availability  Graph Databases:  Connected data  Need path finding and recursive queries
  • 34.
  • 35. Multiple types of databases NoSQL complements relational models Research question drives selection Balance benefits and limitations May use multiple types of databases in a single project NoSQL databases are improving rapidly, gaining additional functionality
  • 36. * Slide 1: * http://upload.wikimedia.org/wikipedia/commons/e/e9/Arthrobacter_arilaitensis_Re 117_genome.png * http://upload.wikimedia.org/wikipedia/commons/b/b4/The_protein_interaction_net work_of_Treponema_pallidum.png * http://upload.wikimedia.org/wikipedia/commons/c/c7/Picoplancton_cytometrie.jpg * http://www.ncbi.nlm.nih.gov/pubmed/?term=salmonella+typhimurium * http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-form-a-hypothesis-5/ * Slide 2: * http://pixabay.com/id/spreadsheet-excel-tabel-diagram-98491/ * http://en.wikipedia.org/wiki/File:MySQL.svg * http://commons.wikimedia.org/wiki/File:Database-postgres.svg * http://commons.wikimedia.org/wiki/File:SQLite_Logo_4.png * http://commons.wikimedia.org/wiki/File:Oracle_logo.svg * http://upload.wikimedia.org/wikipedia/commons/7/78/Sql-server-ce-4-logo.png * Slide 3 * http://faculty.csuci.edu/Fminder.chen/mba550/caseStudy * http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf * Sllide 4 * http://pixabay.com/static/uploads/photo/2013/07/12/17/22/database- 152091_640.png * http://www.clker.com/clipart-desk-work.html * Slide 6 * http://cdn.ttgtmedia.com/ITKE/uploads/blogs.dir/141/files/2012/09/greendatacent er.jpg * Slide 7 * http://en.wikipedia.org/wiki/Chase_(bank) * http://en.wikipedia.org/wiki/Computer- aided_dispatch#mediaviewer/File:Moderne_Leitstelle_Arbeitsplatz.jpg * http://olioshealth.com/services/electronic-medical-record-implementation/ * Slide 9 * http://tran-bio3u- fall09.wikispaces.com/file/view/Atherosclerosis.jpg/114176189/Atherosclerosis.jpg * Slide 11 * http://arteriosclerotic.org/arteriosclerotic-cardiovascular/ Slide 12 http://opentodo.net/wp-content/uploads/2014/05/redis-300dpi.png http://en.wikipedia.org/wiki/File:Riak_product_logo.png http://download.oracle.com/berkeley- db/docs/je/3.2.76/images/Oracle_BerkeleyDB_clr.bmp http://www.yegor256.com/images/2014/04/dynamodb-logo.png https://foundationdb.com/ http://www.aerospike.com/ Slide 13 http://arnoldit.com/wordpress/2008/05/07/enterprise-search-and-train- wrecks/ Slide 15 http://upload.wikimedia.org/wikipedia/en/e/eb/MongoDB_Logo.png http://tomphilip.me/couchdb-its-too-easy/ http://www.datanami.com/2014/02/25/look_out_mongo_here_comes_cou chbase/ http://ravendb.net/ https://cloudant.com/ Slide 17 http://en.wikipedia.org/wiki/Apache_Cassandra#mediaviewer/File:Cassan dra_logo.svg https://svn.apache.org/repos/asf/hbase/branches/instant_schema_alter/s rc/site/resources/images https://accumulo.apache.org/ http://hypertable.com/http://radar.oreilly.com/2013/07/why-choose-a- graph-database.html Slide 18 http://genome.ucsc.edu/cgi- bin/hgTracks?db=hg19&position=chr10%3A90973326- 90985006&hgsid=391056163_yzDnkth3pso3om9pe5BgBFunDug5 Slide 19 https://github.com/thinkaurelius/titan http://www.neotechnology.com/logos/ http://en.wikipedia.org/wiki/OrientDB#mediaviewer/File:OrientdbLogo.p ng http://franz.com/ Slide 21 http://blogs.teradata.com/international/why-the-reports-of-the-death- of-the-relational-database-are-an-exaggeration/
  • 37. *Dr. Rebecca Wattam, Advisor *Becky Will, GPAA VT PI *Chengdong Zhang, DBA & SE *Cyberinfrastructure Division *GPAA Collaborators

Notes de l'éditeur

  1. Relational databases take advantage of relationships between entities (things, nouns) to minimize the amount of data stored NoSQL model entities but relationships are often implicit in structure. Less emphasis on minimizing storage, preserving data integrity, or avoiding data anomalies.
  2. Projects with any two of these can probably be well handled by RDBMS. When all three are encountered in one project, NoSQL can often provide better performance with different levels of support for Consistency, Availability and network Partitioning (CAP Theorem)
  3. Simple data sets can be managed in spreadsheets. Not ideal but works in some cases. Larger and more complicated data sets require a database. Relational is a natural next step from spreadsheets because of the tabular nature of data.
  4. Free, high quality RDBMSs available, e.g. MySQL PostgreSQL. Many commercial options as well. Mature set of tools, such as IDEs for database developers. Many resources and best practices available. From a more theoretic perspective, the relational model reduces risk of data anomalies (i.e. insert anomaly, delete anomaly & update anomaly). Also separates logical model (what we see as database users) from physical model (e.g. how data is actually stored on disk or other persistent storage media). Some performance disadvantages due to need for joins – gathering related information stored in separate tables and therefore on different parts of disk.
  5. Normalization is a process of reducing redundancy and risk of data anomalies. Several rules of normalization most important are Codd’s first three. Much of the code in RDBMS is designed to support querying normalized data: how to bring related data together, how to do it with an optimal set of steps (query optimizer)
  6. RDMBSs run well on single server. Can implement failover solutions, load balance read-only, difficult to have distributed RDBMSs with write operations and immediate consistency. Network and database latency causes delay in the time a row is updated in one instance and when it is updated in all others. Can require locking all replicas of rows until all replicas updated. Distributed RDBMS requires: Two phase commit for writes in Master-master configuration Master-slave replication helps with reads but not writes Sharding – helps if querying by shard key, otherwise need to query all servers Vertical partitioning – tables placed on different servers; hard to join tables on different servers Watch out for software license costs if scaling out with COTS. NoSQL database relax consistency constraint. Some implement eventual consistency. Implementation bottlenecks – need data modeler to change model schema and DBA to implement those changes. NoSQL allows developers to add columns, collections and other structures on the fly. Lose some benefits of RDBMS, such as referential integrity. Joins are time and resource consuming. Developers often deformalize to improve performance. Makes one question the use of RDBMSs if core functionality is not used.
  7. Relational good when - audit and compliance important - referential integrity - Immediate consistency - relational integrity - durability satisfied by backups Use cases: financial services, health care, manufacturing, even our own beloved Hokie Spa. Our use cases are different. Is relational really the best data model? Not necessary when - tolerant of some errors - availability primary concern - durability important
  8. Most important point of this talk Don’t be driven to choose a database model based on - what you are familiar with - what others say is the “best” data model - what has been used before just because it has been used before Let research requirements subject to constraints (time, funding, etc). Drive decision. Some of use learn this lesson the hard way.
  9. I’ll discuss how NoSQL databases can be used in two different bioinformatics areas: text mining and atherosclerosis I described text mining project in detail in seminar last semester so I won’t go into much detail in that area but I will spend a few minutes to provide background on atherosclerosis And I’ll use atherosclerosis examples when describing NoSQL data models.
  10. Build up of plaque inside arteries Plaque consists of fat, cholesterol, calcium and other substances Limits flow of oxygen Leads to: Heart attack Stroke From http://www.nhlbi.nih.gov/health/health-topics/topics/atherosclerosis/causes.html: The exact cause of atherosclerosis isn't known. However, studies show that atherosclerosis is a slow, complex disease that may start in childhood. It develops faster as you age. Atherosclerosis may start when certain factors damage the inner layers of the arteries. These factors include: Smoking High amounts of certain fats and cholesterol in the blood High blood pressure High amounts of sugar in the blood due to insulin resistanceexternal link icon or diabetesexternal link icon Plaque may begin to build up where the arteries are damaged. Over time, plaque hardens and narrows the arteries. Eventually, an area of plaque can rupture (break open). When this happens, blood cell fragments called platelets (PLATE-lets) stick to the site of the injury. They may clump together to form blood clots. Clots narrow the arteries even more, limiting the flow of oxygen-rich blood to your body.
  11. Autopsies performed during Korean War found evidence of early on set athero. Not enough time for lifestyle factors, such as high fat diet, smoking and inactivity to be sole cause of plague. Hypothesis – genetic factor influencing athero. PDAY – confirmed and expanded on earlier findings. Large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes. 3,000 autopsies 15-34 year olds Aorta and LAD samples preserved in fixed formalin, paraffin embedded blocks. Liver samples also collected. GPAA - Use liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks. Makes genomic and proteomics analysis possible today.
  12. Time for confession. I ignored earlier advice about letting requirements and constraints drive database selection in GPAA project. I’ve worked with relational databases extensively, developed models for demographic, phenotypic, genomic and proteomic data before. I did not pay enough attention to the “unknown unknowns” – collaborators had additional ideas of how to leverage other data about GWAS, eQTL, histones, chromatins, etc. Did not appreciate how much would change.
  13. Could have stayed with relational model, but: Requirements were changing New data sets: GWAS, eQTL, Chromatin Segmentation, Histones Unknown data structures for Multiple Reaction Monitoring (MRM) Mass Spec and SWATH Normalized model was beginning to be more trouble than it was worth. Flexibility was a primary concern.
  14. First 4 especially important to organizations with big data and need for constant access to data and applications – e.g. Facebook, Amazon, Google Flexibility is primary driver for us to consider and eventually adopt a NoSQL database.
  15. 4 most commonly referenced database types in NoSQL community and press. Will not discuss Search databases here. PATRIC is using hybrid Relational-Search database strategy which is significantly improving performance over relational-only approach. Integration key for bioinformaticians and biologist; Don’t make them integrate data.
  16. So simple, it is almost trivial. Can store non-atomic values as well, e.g. JSON documents, but can only access entire document, cannot select a single value in the document or search for values of a particular field.
  17. Example KV databases. Redis – popular, easy to use, commonly used for caching; master-slave replication; multiple servers respond to read request; one server handles writes Riak – scalable, masterless BerkeleyDB – first widely used KV data store Areospike and FoundationDB – supports ACID transactions Amazon DynamoDB available in cloud (just announced on 10/9/2014 DynamoDB will support documents as well as KVs)
  18. JSON/BSON or XML storage
  19. Cassandra developed by Facebook Hbase part of Hadoop ecosystem Accumulo designed to support cell level access control; originally created by NSA Hypertable – used commercially
  20. Neo4j is probably most widely used of graph dbs OrientDB incorportes document db features as well as graphdb Titan runs on cluster, used Cassandra or HDFS (I think) for distributed storgae GraphChi-DB – project to run large graphs on small machines, e.g. Mac Mini’s AllegroGraph – commercial product from Franz, a long established Lisp vendor