Alternative Approaches to Managing and Integrating Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division
* Bioinformatics and Relational Database Management Systems (RDBMS)
* Use Cases – Text Mining and Atherosclerosis
* Bioinformatics and NoSQL Databases
* How to Choose a Database for Your Project
* Closing Comments
Relational Database – “a database that [explicitly] stores information about both the data and how it is related.”
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)
Volume of data
Variety of data
Integration of data
* Pragmatic
* Widely applicable
* Many options
* Modeling
* Reduce risk of data anomalies
* Separate logical and physical models
The key, the whole key, and nothing but the key.
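The normalization rule quoted above can be sketched with Python's built-in sqlite3 module. The table and column names (subject, sample, tissue) are hypothetical and only loosely modeled on a tissue-sample study; this is a minimal illustration of a normalized layout, not the schema used in the project.

```python
import sqlite3

# Hypothetical normalized schema: every non-key column depends only on
# the key of the table it lives in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        subject_id TEXT PRIMARY KEY,
        sex        TEXT
    );
    CREATE TABLE sample (
        sample_id  INTEGER PRIMARY KEY,
        subject_id TEXT NOT NULL REFERENCES subject(subject_id),
        tissue     TEXT
    );
""")
conn.execute("INSERT INTO subject VALUES ('S1', 'M')")
conn.executemany(
    "INSERT INTO sample (subject_id, tissue) VALUES (?, ?)",
    [("S1", "Thoracic Aorta"), ("S1", "Abdominal Aorta")],
)

# The subject's sex is stored once; a join recombines it with each sample.
rows = conn.execute("""
    SELECT s.subject_id, s.sex, p.tissue
    FROM subject AS s
    JOIN sample AS p ON p.subject_id = s.subject_id
    ORDER BY p.sample_id
""").fetchall()
print(rows)
```

Because sex is stored in exactly one row, updating it cannot leave two samples disagreeing about the same subject, which is the kind of anomaly normalization is meant to prevent.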
Implementation bottlenecks
Data modeler vs. developer
Scaling-up vs. scaling-out
Frequent need for denormalization
* Bioinformatics and Relational Database Management Systems (RDBMS)
* Use Cases – Text Mining and Atherosclerosis
* Bioinformatics and NoSQL Databases
* How to Choose a Database for Your Project
* Closing Comments
Text Mining:
* Storing Text
* Caching Word Vectors
* Extracted Features
* Experiment Results
Atherosclerosis Research:
* Demographics
* Sample Tracking
* Genomic data
* Sequence Variants
* Mass Spec Results
Early 1950s: Korean War autopsies
1985-1998: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
2012-2016: Genomic and Proteomic Architecture of Atherosclerosis (GPAA)
“… tell your children not to do what I have done …”
House of the Rising Sun, American folk song
Started with MySQL
Could have stayed with relational model, but:
* Requirements change
* New data sets
* Unknown data structures
* Increasingly complex normalized model
* Bioinformatics and Relational Database Management Systems (RDBMS)
* Use Cases – Text Mining and Atherosclerosis
* Bioinformatics and NoSQL Databases
* How to Choose a Database for Your Project
* Closing Comments
Scalability
Cost
Availability
Consistency
Flexibility
* Key Value Databases
* Document Databases
* Wide Column Stores
* Graph Databases
* Search Engines
Features:
* Simple primitive data structure
* No predefined schema
* Limited query capabilities
* Dictionary-like functionality at large scale
key1  value1
key2  value2
key3  value3
Bioinformatics Use Case:
* Word vectors in text mining
* Caching
Limitations:
* Key lookup only, no generalized query
* Small number of attributes per entity
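>>> import redis
>>> # Connect to a local Redis server; decode_responses=True returns str instead of bytes
>>> r_server = redis.Redis("localhost", decode_responses=True)
>>> r_server.set("sample:123:type", "Aorta")
True
>>> r_server.get("sample:123:type")
'Aorta'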
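The word-vector caching use case might look like the sketch below. A plain dict stands in for the key-value store so the example runs without a server; the key scheme, the helper names, and the term-count "vector" are all hypothetical stand-ins, not the project's actual code.

```python
import json

# A plain dict stands in for the key-value store (assumption: a real store
# such as Redis would be driven through its own get/set calls instead).
store = {}

def vector_key(doc_id):
    # Hypothetical key scheme, in the style of "sample:123:type".
    return f"doc:{doc_id}:wordvector"

def compute_word_vector(text):
    # Simple term counts, used only as a stand-in for a real feature extractor.
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def get_word_vector(doc_id, text):
    key = vector_key(doc_id)
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)      # cache hit: skip recomputation
    vector = compute_word_vector(text)
    store[key] = json.dumps(vector)    # values are stored as opaque strings
    return vector

first = get_word_vector(42, "aorta lesion aorta")
second = get_word_vector(42, "ignored on a cache hit")
print(first, second)
```

The second call returns the cached vector without looking at its text argument, which also shows the limitation noted on the slide: the store can answer "what is under this key?" but not "which documents mention aorta?".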
Features:
* JSON/XML structures
* Fields vary between docs
* No predefined schema
* Documents analogous to rows
* Collections analogous to tables
* Query capabilities
Bioinformatics Use Case:
* Text mining
* Atherosclerosis
Limitations:
* No joins
* No referential integrity checks
* Object-based query language
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
{
  subject_id: "F8273",
  age: "26",
  sex: "M",
  date_of_death: "12-Jan-1995",
  glycohemoglobin: "10%",
  BMI: 22,
  samples: [ {type: "Thoracic Aorta", AHA_score: 1},
             {type: "Abdominal Aorta", AHA_score: 2},
             {type: "LAD", AHA_score: 5} ],
  sequence: {seq_file: "F8273_08152014.bam",
             variant_file: "F8273_08152014.vcf"}
}
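To illustrate how embedded arrays and varying fields are queried, here is a minimal sketch that uses plain Python dicts in place of a document collection. The second subject and the helper function are hypothetical; a document database would evaluate an equivalent query on the server.

```python
# Documents in a collection, represented as plain dicts (assumption: a
# document database would store and filter these server-side).
subjects = [
    {
        "subject_id": "F8273",
        "sex": "M",
        "BMI": 22,
        "samples": [
            {"type": "Thoracic Aorta", "AHA_score": 1},
            {"type": "Abdominal Aorta", "AHA_score": 2},
            {"type": "LAD", "AHA_score": 5},
        ],
    },
    # Hypothetical second subject with different fields: no BMI, no samples yet.
    {"subject_id": "X0001", "sex": "F", "smoker": True},
]

def subjects_with_score_at_least(docs, threshold):
    """Return IDs of subjects having any sample with AHA_score >= threshold."""
    return [
        doc["subject_id"]
        for doc in docs
        if any(s.get("AHA_score", 0) >= threshold for s in doc.get("samples", []))
    ]

matches = subjects_with_score_at_least(subjects, 4)
print(matches)
```

Note that the query has to tolerate missing fields (`doc.get("samples", [])`), since nothing in a schemaless collection guarantees that every document carries them.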
Features:
* Groups attributes into column families
* Column families store key-value pairs
* Implemented as sparse multi-dimensional arrays
* Denormalized
* 10^4-10^6 columns; 10^9 rows
Bioinformatics Use Case:
* Large studies
* Many experiments & data types
* Simulations
Limitations:
* Operationally challenging
* Suitable for large number of servers
Features:
* Highly normalized
* Graph-based query language (Gremlin)
* SQL-inspired query language (Cypher)
* Support for path finding and recursion
Bioinformatics Use Case:
* Epidemiology simulations
* Interaction networks
Limitations:
* Less suited for tabular data
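Path finding over an interaction network can be illustrated with a breadth-first search in plain Python. The node names and edges below are hypothetical, and this is only a sketch of the kind of traversal a graph database performs natively; it is not Gremlin or Cypher.

```python
from collections import deque

# Hypothetical undirected interaction network as an adjacency list.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(network, "A", "E")
print(path)
```

Expressing the same question in SQL would typically require recursive queries or repeated self-joins, which is why connected data is listed as a reason to consider a graph database.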
* Bioinformatics and Relational Database Management Systems (RDBMS)
* Use Cases – Text Mining and Atherosclerosis
* Bioinformatics and NoSQL Databases
* How to Choose a Database for Your Project
* Closing Comments
Relational:
* Requirements known at start of project
* Entities described by common attributes
* Compliance and audit issues
* Need normalization
* Acceptable performance on small number of servers
* Need server-side joins
Key value:
* Caching
* Few attributes
key1  value1
key2  value2
key3  value3
Document databases:
* Varying attributes
* Integrate diverse data types
* Use denormalized data
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
Wide column data stores:
* Extremely large volumes of data
* High availability
Graph Databases:
* Connected data
* Need path finding and recursive queries
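As a rough illustration only, the selection criteria above can be written as a small lookup. The flag names and the function are hypothetical, and real selection involves trade-offs that a table like this cannot capture.

```python
def suggest_database_types(needs):
    """Map a set of requirement flags to database families, following the
    criteria listed on the slides. Flag names are hypothetical."""
    rules = [
        ("server_side_joins", "relational"),
        ("compliance_and_audit", "relational"),
        ("caching", "key-value"),
        ("varying_attributes", "document"),
        ("extremely_large_volume", "wide column"),
        ("path_finding", "graph"),
    ]
    suggestions = []
    for flag, family in rules:
        if flag in needs and family not in suggestions:
            suggestions.append(family)
    return suggestions

picks = suggest_database_types({"caching", "varying_attributes", "path_finding"})
print(picks)
```

The function returns a list rather than a single answer because one project may reasonably combine several database types.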
* Multiple types of databases
* NoSQL complements relational models
* Research question drives selection
* Balance benefits and limitations
* May use multiple types of databases in a single project
* NoSQL databases are improving rapidly, gaining additional functionality
* Dr. Rebecca Wattam, Advisor
* Becky Will, GPAA VT PI
* Chengdong Zhang, DBA & SE
* Cyberinfrastructure Division
* GPAA Collaborators
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2

Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2

  • 1. Alternative Approaches to Managing and Integrating Bioinformatics Data GBCB Seminar October 9, 2014 Dan Sullivan Cyberinfrastructure Division
  • 2.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 3. Relational Database – a database that [explicitly] stores information about both the data and how it is related.” (Source: http://en.wikipedia.org/wiki/Relational_database) NoSQL Database – “[a] database [that] provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.” (Source: http://en.wikipedia.org/wiki/NoSQL)
  • 4. Volume of data Variety of data Integration of data
  • 5.
  • 6.  Pragmatic  Widely applicable  Many options  Modeling  Reduce risk of data anomalies.  Separate logical and physical models
  • 7. The key, The whole key, and Nothing but the key.
  • 9.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 10.
  • 11.
  • 12. Text Mining Storing Text Caching Word Vectors Extracted Features Experiment Results Atherosclerosis Research Demographics Sample Tracking Genomic data Sequence Variants Mass Spec Results
  • 13.
  • 14. Early 1950s Korean War autopsies 2012-2016 Genomic and Proteomic Architecture of Atherosclerosis (GPAA) 1985-1998 Pathodeterminants of Atherosclerosis in Youth (PDAY) study
  • 15. “… tell your children not to do what I have done …” House of the Rising Sun American Folk Song
  • 16. Started with MySQL Could have stayed with relational model, but: Requirements change New data sets Unknown data structures Increasingly complex normalized model
  • 17.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 19.  Key Value Databases  Document Databases  Wide Column Stores  Graph Databases  Search Engines
  • 20. Features Simple primitive data structure No predefined schema Limited query capabilities Dictionary-like functionality at large scale key3 key2 key1 value1 value2 value2 Bioinformatics Use Case Word vectors in text mining Caching Limitations Key lookup only, no generalized query Small number of attributes per entity
  • 21. >>> Import redis >>> r_server = redis.Redis(“localhost”) >>> r_server.set(“sample:123:type”,”Aorta”) >>> r_server.get(“sample:123:type”) >>> “Aorta”
  • 22.
  • 23. Features  JSON/XML structures  Fields vary between docs  No predefined schema  Documents analogous to rows  Collections analogous to tables  Query capabilities Bioinformatics Use Case Text mining Atherosclerosis Limitations No joins No referential integrity checks Object-based query language { id : <value>, <key> : <value>, <key> : <embedded document>, <key> : <array> }
  • 24. { subject_id: "F8273", age : "26", sex : "M" date_of_death : "12-Jan-1995”, glycohemoglobin: 10%, BMI : 22, samples : [ {type:"Thoracic Aorta", AHA_score: 1}, {type:"Abdominal Aorta", AHA_score: 2}, {type:"LAD", AHA_Score:5} ], sequence: {seq_file: "F8273_08152014.bam", variant_file: "F8273_08152014.vcf”} }
  • 25.
  • 26. Features Groups attributes into column families Column families store key- value pairs Implemented as sparse multi-dimensional arrays Denormalized 104-106 columns; 109 rows  Bioinformatics Use Case  Large studies  Many experiments & data types  Simulations  Limitations  Operationally challenging  Suitable for large number of servers
  • 27.
  • 28. Limitations Less suited for tabular data Features Highly normalized Graph-based query language (Gremlin) SQL-inspired query language (Cypher) Support for path finding and recursion Bioinformatics Use Case Epidemiology simulations Interaction networks
  • 29.
  • 30.  Bioinformatics and Relational Database Management Systems (RDBMs)  Use Cases – Text Mining and Atherosclerosis  Bioinformatics and NoSQL Databases  How to Choose a Database for Your Project  Closing Comments
  • 31. Relational: requirements known at start of project; entities described by common attributes; compliance and audit issues; need normalization; acceptable performance on small number of servers; need server-side joins.
  • 32. Key value: caching; few attributes. Document databases: varying attributes; integrate diverse data types; use denormalized data.
  • 33. Wide column data stores: extremely large volumes of data; high availability. Graph databases: connected data; need path finding and recursive queries.
  • 34.
  • 35. Multiple types of databases. NoSQL complements relational models. Research question drives selection. Balance benefits and limitations. May use multiple types of databases in a single project. NoSQL databases are improving rapidly, gaining additional functionality.
  • 37. Dr. Rebecca Wattam, Advisor; Becky Will, GPAA VT PI; Chengdong Zhang, DBA & SE; Cyberinfrastructure Division; GPAA Collaborators
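
The document model in slides 23 and 24 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the database used in the GPAA project: an in-memory list stands in for a collection, the second subject (`F9001`) and the helper `subjects_with_score_at_least` are hypothetical, and `age` is stored as a number for simplicity.

```python
import json

# Hypothetical subject record, modeled on the example document in slide 24.
subject = {
    "subject_id": "F8273",
    "age": 26,
    "sex": "M",
    "samples": [
        {"type": "Thoracic Aorta", "AHA_score": 1},
        {"type": "Abdominal Aorta", "AHA_score": 2},
        {"type": "LAD", "AHA_score": 5},
    ],
    "sequence": {"variant_file": "F8273_08152014.vcf"},
}

# A second, made-up document with fewer fields: no schema change is needed.
other = {"subject_id": "F9001", "sex": "F", "samples": []}

# An in-memory list standing in for a document collection.
collection = [subject, other]


def subjects_with_score_at_least(docs, threshold):
    """Return subject_ids that have any sample with AHA_score >= threshold."""
    return [
        doc["subject_id"]
        for doc in docs
        if any(s.get("AHA_score", 0) >= threshold for s in doc.get("samples", []))
    ]


print(subjects_with_score_at_least(collection, 5))
print(json.dumps(other))
```

Because the samples are embedded in each subject document, the query reads one document at a time and needs no join. The trade-off, as slide 23 notes, is that nothing enforces referential integrity between documents.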

Editor's notes

  1. Relational databases take advantage of relationships between entities (things, nouns) to minimize the amount of data stored. NoSQL databases model entities, but relationships are often implicit in the structure. They place less emphasis on minimizing storage, preserving data integrity, or avoiding data anomalies.
  2. Projects with any two of these can probably be handled well by an RDBMS. When all three are encountered in one project, NoSQL can often provide better performance, with different levels of support for consistency, availability and partition tolerance (CAP theorem).
  3. Simple data sets can be managed in spreadsheets. Not ideal but works in some cases. Larger and more complicated data sets require a database. Relational is a natural next step from spreadsheets because of the tabular nature of data.
  4. Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL. Many commercial options as well. Mature set of tools, such as IDEs for database developers. Many resources and best practices available. From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete and update anomalies). It also separates the logical model (what we see as database users) from the physical model (e.g. how data is actually stored on disk or other persistent storage media). There are some performance disadvantages due to the need for joins: gathering related information stored in separate tables, and therefore on different parts of the disk.
  5. Normalization is a process of reducing redundancy and the risk of data anomalies. There are several rules of normalization; the most important are Codd's first three. Much of the code in an RDBMS is designed to support querying normalized data: how to bring related data together, and how to do it with an optimal set of steps (the query optimizer).
  6. RDBMSs run well on a single server. You can implement failover solutions and load-balance read-only queries, but it is difficult to have distributed RDBMSs with write operations and immediate consistency. Network and database latency cause a delay between the time a row is updated in one instance and the time it is updated in all others. This can require locking all replicas of a row until all replicas are updated. Distributing an RDBMS involves: two-phase commit for writes in a master-master configuration; master-slave replication, which helps with reads but not writes; sharding, which helps if querying by shard key, otherwise all servers must be queried; vertical partitioning, where tables are placed on different servers, which makes it hard to join tables across servers. Watch out for software license costs if scaling out with COTS. NoSQL databases relax the consistency constraint; some implement eventual consistency. Implementation bottlenecks: a data modeler is needed to change the model schema and a DBA to implement those changes. NoSQL allows developers to add columns, collections and other structures on the fly, but loses some benefits of an RDBMS, such as referential integrity. Joins are time and resource consuming, so developers often denormalize to improve performance. This makes one question the use of RDBMSs if core functionality is not used.
  7. Relational is good when: audit and compliance are important; referential integrity is needed; immediate consistency is needed; relational integrity is needed; durability is satisfied by backups. Use cases: financial services, health care, manufacturing, even our own beloved Hokie Spa. Our use cases are different. Is relational really the best data model? Not necessary when: tolerant of some errors; availability is the primary concern; durability is important.
  8. Most important point of this talk: don't be driven to choose a database model based on what you are familiar with, what others say is the "best" data model, or what has been used before just because it has been used before. Let research requirements, subject to constraints (time, funding, etc.), drive the decision. Some of us learn this lesson the hard way.
  9. I'll discuss how NoSQL databases can be used in two different bioinformatics areas: text mining and atherosclerosis. I described the text mining project in detail in a seminar last semester, so I won't go into much detail in that area, but I will spend a few minutes providing background on atherosclerosis. I'll also use atherosclerosis examples when describing NoSQL data models.
  10. Build-up of plaque inside arteries. Plaque consists of fat, cholesterol, calcium and other substances. Limits flow of oxygen. Leads to: heart attack; stroke. From http://www.nhlbi.nih.gov/health/health-topics/topics/atherosclerosis/causes.html: The exact cause of atherosclerosis isn't known. However, studies show that atherosclerosis is a slow, complex disease that may start in childhood. It develops faster as you age. Atherosclerosis may start when certain factors damage the inner layers of the arteries. These factors include: smoking; high amounts of certain fats and cholesterol in the blood; high blood pressure; high amounts of sugar in the blood due to insulin resistance or diabetes. Plaque may begin to build up where the arteries are damaged. Over time, plaque hardens and narrows the arteries. Eventually, an area of plaque can rupture (break open). When this happens, blood cell fragments called platelets (PLATE-lets) stick to the site of the injury. They may clump together to form blood clots. Clots narrow the arteries even more, limiting the flow of oxygen-rich blood to your body.
  11. Autopsies performed during the Korean War found evidence of early-onset atherosclerosis. There was not enough time for lifestyle factors, such as a high-fat diet, smoking and inactivity, to be the sole cause of plaque. Hypothesis: a genetic factor influences atherosclerosis. PDAY confirmed and expanded on the earlier findings. A large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes: 3,000 autopsies of 15-34 year olds. Aorta and LAD samples were preserved in formalin-fixed, paraffin-embedded (FFPE) blocks. Liver samples were also collected. GPAA uses the liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks. This makes genomic and proteomic analysis possible today.
  12. Time for a confession. I ignored my earlier advice about letting requirements and constraints drive database selection in the GPAA project. I've worked with relational databases extensively and had developed models for demographic, phenotypic, genomic and proteomic data before. I did not pay enough attention to the "unknown unknowns": collaborators had additional ideas of how to leverage other data about GWAS, eQTL, histones, chromatin, etc. I did not appreciate how much would change.
  13. Could have stayed with the relational model, but: requirements were changing; there were new data sets (GWAS, eQTL, chromatin segmentation, histones); data structures were unknown for Multiple Reaction Monitoring (MRM) mass spec and SWATH. The normalized model was beginning to be more trouble than it was worth. Flexibility was a primary concern.
  14. The first four are especially important to organizations with big data and a need for constant access to data and applications, e.g. Facebook, Amazon, Google. Flexibility was the primary driver for us to consider and eventually adopt a NoSQL database.
  15. The four most commonly referenced database types in the NoSQL community and press. I will not discuss search databases here. PATRIC is using a hybrid relational-search database strategy, which is significantly improving performance over a relational-only approach. Integration is key for bioinformaticians and biologists; don't make them integrate data.
  16. So simple, it is almost trivial. Can store non-atomic values as well, e.g. JSON documents, but can only access entire document, cannot select a single value in the document or search for values of a particular field.
  17. Example KV databases. Redis: popular, easy to use, commonly used for caching; master-slave replication; multiple servers respond to read requests; one server handles writes. Riak: scalable, masterless. BerkeleyDB: first widely used KV data store. Aerospike and FoundationDB: support ACID transactions. Amazon DynamoDB: available in the cloud (just announced on 10/9/2014: DynamoDB will support documents as well as KVs).
  18. JSON/BSON or XML storage
  19. Cassandra: developed by Facebook. HBase: part of the Hadoop ecosystem. Accumulo: designed to support cell-level access control; originally created by the NSA. Hypertable: used commercially.
  20. Neo4j is probably the most widely used of the graph DBs. OrientDB incorporates document DB features as well as graph DB features. Titan runs on a cluster and uses Cassandra or HDFS (I think) for distributed storage. GraphChi-DB: a project to run large graphs on small machines, e.g. Mac Minis. AllegroGraph: a commercial product from Franz, a long-established Lisp vendor.
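
Note 6's point about sharding (a lookup by shard key touches one server, while any other query must visit all of them) can be illustrated with a small sketch. It assumes four in-memory dictionaries standing in for servers and a hash of the subject ID as the shard key. The names `shard_for`, `put`, `get_by_key` and `find_by_sex`, and the sample records, are hypothetical.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a server


def shard_for(key):
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(subject_id, record):
    """Store a record on the shard chosen by its subject_id."""
    shards[shard_for(subject_id)][subject_id] = record


def get_by_key(subject_id):
    """Shard-key lookup: touches exactly one shard."""
    return shards[shard_for(subject_id)].get(subject_id)


def find_by_sex(sex):
    """Non-key query: has to scan every shard."""
    hits = []
    for shard in shards:
        hits.extend(sid for sid, rec in shard.items() if rec["sex"] == sex)
    return sorted(hits)


# Made-up records for demonstration.
put("F8273", {"sex": "M"})
put("F9001", {"sex": "F"})
put("F9002", {"sex": "M"})

print(get_by_key("F8273"))
print(find_by_sex("M"))
```

The cost of `find_by_sex` grows with the number of shards, which is why the choice of shard key should follow the most common query pattern.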