1. Alternative Approaches to Managing and Integrating
Bioinformatics Data
GBCB Seminar
October 9, 2014
Dan Sullivan
Cyberinfrastructure Division
2. Bioinformatics and Relational Database
Management Systems (RDBMS)
Use Cases – Text Mining and Atherosclerosis
Bioinformatics and NoSQL Databases
How to Choose a Database for Your Project
Closing Comments
3. Relational Database – "a database that [explicitly] stores
information about both the data and how it is related."
(Source: http://en.wikipedia.org/wiki/Relational_database)
NoSQL Database – “[a] database [that] provides a
mechanism for storage and retrieval of data that is
modeled in means other than the tabular relations used
in relational databases.”
(Source: http://en.wikipedia.org/wiki/NoSQL)
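To make the two definitions concrete, here is a minimal sketch contrasting the same record in each model, using Python's built-in sqlite3 for the relational side; the gene fields are illustrative, not part of the talk.

```python
import sqlite3

# Relational: schema (tables, columns, types) declared up front;
# the database knows how the data is related.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene (symbol TEXT, chromosome INTEGER)")
con.execute("INSERT INTO gene VALUES ('APOE', 19)")
row = con.execute(
    "SELECT chromosome FROM gene WHERE symbol = 'APOE'").fetchone()
print(row[0])  # -> 19

# NoSQL (document-style): a nested structure with no predefined schema;
# relationships live inside the structure itself.
doc = {"symbol": "APOE", "chromosome": 19, "aliases": ["AD2"]}
print(doc["chromosome"])  # -> 19
```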
9. Bioinformatics and Relational Database
Management Systems (RDBMS)
Use Cases – Text Mining and Atherosclerosis
Bioinformatics and NoSQL Databases
How to Choose a Database for Your Project
Closing Comments
12. Text Mining
Storing Text
Caching Word Vectors
Extracted Features
Experiment Results
Atherosclerosis
Research
Demographics
Sample Tracking
Genomic data
Sequence Variants
Mass Spec Results
14. Early 1950s Korean War
autopsies
1985-1998 Pathobiological
Determinants of Atherosclerosis
in Youth (PDAY) study
2012-2016 Genomic and Proteomic
Architecture of Atherosclerosis (GPAA)
15. “… tell your
children not to do
what I have done …”
House of the Rising Sun
American Folk Song
16. Started with
MySQL
Could have stayed with
relational model, but:
Requirements change
New data sets
Unknown data structures
Increasingly complex
normalized model
17. Bioinformatics and Relational Database
Management Systems (RDBMS)
Use Cases – Text Mining and Atherosclerosis
Bioinformatics and NoSQL Databases
How to Choose a Database for Your Project
Closing Comments
20. Features
Simple primitive data
structure
No predefined schema
Limited query capabilities
Dictionary-like
functionality at large scale
key3
key2
key1 value1
value2
value3
Bioinformatics Use Case
Word vectors in text
mining
Caching
Limitations
Key lookup only, no
generalized query
Small number of
attributes per entity
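A key-value store can be sketched with little more than a dictionary: opaque values, lookup by key only, no general queries. The class and key names below are illustrative, not a real client API.

```python
# Minimal in-memory key-value store sketch (hypothetical API).
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        # Key lookup is the only access path -- no generalized query.
        return self._data.get(key, default)

# Caching a word vector from a text-mining step under a document key.
cache = KeyValueStore()
cache.put("pmid:12345:wordvec", {"plaque": 3, "artery": 2, "lipid": 1})
vec = cache.get("pmid:12345:wordvec")
print(vec["plaque"])  # -> 3
```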
26. Features
Groups attributes into
column families
Column families store
key-value pairs
Implemented as sparse
multi-dimensional arrays
Denormalized
10^4-10^6 columns; 10^9 rows
Bioinformatics Use Case
Large studies
Many experiments & data types
Simulations
Limitations
Operationally
challenging
Suitable for large
number of servers
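The sparse, denormalized layout can be sketched as a two-level dict mapping row key to column family to columns; the family and column names are illustrative.

```python
from collections import defaultdict

# Wide-column sketch: row key -> column family -> {column: value}.
table = defaultdict(lambda: defaultdict(dict))

# Each row carries only the columns it actually has -- sparse,
# denormalized, no fixed schema shared across rows.
table["sample:001"]["demographics"] = {"age": 24, "sex": "M"}
table["sample:001"]["mass_spec"] = {"peptide:AQLK": 0.82}
table["sample:002"]["demographics"] = {"age": 31}

print(table["sample:001"]["mass_spec"].get("peptide:AQLK"))  # -> 0.82
print(table["sample:002"]["mass_spec"])  # absent family comes back empty: {}
```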
28. Limitations
Less suited for tabular
data
Features
Highly normalized
Graph-based query
language (Gremlin)
SQL-inspired query
language (Cypher)
Support for path finding
and recursion
Bioinformatics Use Case
Epidemiology
simulations
Interaction networks
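Path finding of the kind Gremlin or Cypher expresses can be sketched with a breadth-first search over an adjacency list; the gene names below form a toy interaction network, not real curated edges.

```python
from collections import deque

# Toy protein-interaction network; a graph database stores such
# edges natively and answers path queries directly.
edges = {
    "APOE": ["LDLR", "APOB"],
    "LDLR": ["PCSK9"],
    "APOB": ["LDLR"],
    "PCSK9": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the traversal a graph query language expresses."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(edges, "APOE", "PCSK9"))  # -> ['APOE', 'LDLR', 'PCSK9']
```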
30. Bioinformatics and Relational Database
Management Systems (RDBMS)
Use Cases – Text Mining and Atherosclerosis
Bioinformatics and NoSQL Databases
How to Choose a Database for Your Project
Closing Comments
31. Relational:
Requirements known at start
of project
Entities described by common
attributes
Compliance and audit issues
Need normalization
Acceptable performance on
small number of servers
Need server side joins
32. Key value:
Caching
Few attributes
Document databases:
Varying attributes
Integrate diverse data
types
Use denormalized
data
key3
key2
key1 value1
value2
value3
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}
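The document model above can be sketched as plain dicts with varying attributes; the study and field names are illustrative.

```python
# Document-style records: each one can carry different attributes, so
# heterogeneous data types integrate without schema changes.
samples = [
    {"id": 1, "study": "PDAY", "age": 22, "variants": ["rs429358"]},
    {"id": 2, "study": "GPAA", "age": 29,
     "mass_spec": {"method": "MRM", "peptides": 412}},
]

# Unlike a key-value store, a document database indexes fields inside
# documents; here the equivalent query is a simple filter.
gpaa = [doc for doc in samples if doc["study"] == "GPAA"]
print(gpaa[0]["mass_spec"]["method"])  # -> MRM
```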
33. Wide column data stores:
Extremely large volumes
of data
High availability
Graph Databases:
Connected data
Need path finding and
recursive queries
35. Multiple types of databases
NoSQL complements relational models
Research question drives selection
Balance benefits and limitations
May use multiple types of databases in a
single project
NoSQL databases are improving rapidly,
gaining additional functionality
Relational databases take advantage of relationships between entities (things, nouns) to minimize the amount of data stored
NoSQL databases model entities, but relationships are often implicit in the structure. There is less emphasis on minimizing storage, preserving data integrity, or avoiding data anomalies.
Projects with any two of these can probably be well handled by RDBMS.
When all three are encountered in one project, NoSQL can often provide better performance with different levels of support for Consistency, Availability and network Partitioning (CAP Theorem)
Simple data sets can be managed in spreadsheets. Not ideal but works in some cases.
Larger and more complicated data sets require a database. Relational is a natural next step from spreadsheets because of the tabular nature of data.
Free, high-quality RDBMSs are available, e.g. MySQL and PostgreSQL. Many commercial options as well.
Mature set of tools, such as IDEs for database developers. Many resources and best practices available.
From a more theoretical perspective, the relational model reduces the risk of data anomalies (i.e. insert, delete, and update anomalies).
Also separates logical model (what we see as database users) from physical model (e.g. how data is actually stored on disk or other persistent storage media).
Some performance disadvantages due to need for joins – gathering related information stored in separate tables and therefore on different parts of disk.
Normalization is the process of reducing redundancy and the risk of data anomalies.
There are several normalization rules; the most important are Codd's first three normal forms.
Much of the code in an RDBMS is designed to support querying normalized data: how to bring related data together, and how to do so with an optimal set of steps (the query optimizer).
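A minimal normalized schema, and the join the optimizer must plan, can be shown with Python's built-in sqlite3; the table and column names are illustrative.

```python
import sqlite3

# Normalized schema: each subject is stored once; samples reference
# it by key. Updating the subject touches one row -- no update anomaly.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subject (id INTEGER PRIMARY KEY, age INTEGER);
    CREATE TABLE sample (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER REFERENCES subject(id),
        tissue TEXT
    );
""")
con.execute("INSERT INTO subject VALUES (1, 24)")
con.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                [(10, 1, 'aorta'), (11, 1, 'liver')])

# The join re-gathers the related data the normalization split apart.
rows = con.execute("""
    SELECT s.tissue, subj.age
    FROM sample s JOIN subject subj ON s.subject_id = subj.id
    ORDER BY s.id
""").fetchall()
print(rows)  # -> [('aorta', 24), ('liver', 24)]
```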
RDBMSs run well on a single server. Failover solutions and read-only load balancing are possible, but distributed RDBMSs with write operations and immediate consistency are difficult.
Network and database latency delays the point at which a row updated in one instance is updated in all the others. May require locking all replicas of a row until every replica is updated.
Distributed RDBMS requires:
Two phase commit for writes in Master-master configuration
Master-slave replication helps with reads but not writes
Sharding – helps if querying by shard key, otherwise need to query all servers
Vertical partitioning – tables placed on different servers; hard to join tables on different servers
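The sharding trade-off above can be sketched with in-memory dicts standing in for servers; the routing scheme and names are illustrative.

```python
# Hash-based sharding sketch: the shard key routes a query to one
# server; any other predicate must fan out to all shards.
SHARDS = [dict() for _ in range(4)]  # stand-ins for four servers

def shard_for(key):
    return SHARDS[hash(key) % len(SHARDS)]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    # Query by shard key: exactly one shard is touched.
    return shard_for(key).get(key)

def scan(predicate):
    # Query without the shard key: every shard must be searched.
    return [v for shard in SHARDS for v in shard.values() if predicate(v)]

put("subject:1", {"age": 24})
put("subject:2", {"age": 31})
print(get("subject:1"))                    # -> {'age': 24}
print(scan(lambda v: v["age"] > 25))       # -> [{'age': 31}]
```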
Watch out for software license costs if scaling out with COTS.
NoSQL databases relax the consistency constraint. Some implement eventual consistency.
Implementation bottlenecks – a data modeler must change the schema and a DBA must implement those changes. NoSQL allows developers to add columns, collections, and other structures on the fly. You lose some benefits of an RDBMS, such as referential integrity.
Joins are time and resource consuming. Developers often denormalize to improve performance. Makes one question the use of RDBMSs if core functionality is not used.
Relational good when:
- audit and compliance important
- referential integrity required
- immediate consistency required
- durability satisfied by backups
Use cases: financial services, health care, manufacturing, even our own beloved Hokie Spa.
Our use cases are different. Is relational really the best data model?
Not necessary when
- tolerant of some errors
- availability primary concern
- durability important
Most important point of this talk
Don’t be driven to choose a database model based on
- what you are familiar with
- what others say is the “best” data model
- what has been used before just because it has been used before
Let research requirements, subject to constraints (time, funding, etc.), drive the decision.
Some of us learn this lesson the hard way.
I’ll discuss how NoSQL databases can be used in two different bioinformatics areas: text mining and atherosclerosis
I described the text mining project in detail in a seminar last semester, so I won't go into much detail there, but I will spend a few minutes providing background on atherosclerosis.
And I’ll use atherosclerosis examples when describing NoSQL data models.
Build up of plaque inside arteries
Plaque consists of fat, cholesterol, calcium and other substances
Limits flow of oxygen
Leads to:
Heart attack
Stroke
From http://www.nhlbi.nih.gov/health/health-topics/topics/atherosclerosis/causes.html:
The exact cause of atherosclerosis isn't known. However, studies show that atherosclerosis is a slow, complex disease that may start in childhood. It develops faster as you age.
Atherosclerosis may start when certain factors damage the inner layers of the arteries. These factors include:
Smoking
High amounts of certain fats and cholesterol in the blood
High blood pressure
High amounts of sugar in the blood due to insulin resistance or diabetes
Plaque may begin to build up where the arteries are damaged. Over time, plaque hardens and narrows the arteries. Eventually, an area of plaque can rupture (break open).
When this happens, blood cell fragments called platelets (PLATE-lets) stick to the site of the injury. They may clump together to form blood clots. Clots narrow the arteries even more, limiting the flow of oxygen-rich blood to your body.
Autopsies performed during the Korean War found evidence of early-onset atherosclerosis.
Not enough time for lifestyle factors, such as a high-fat diet, smoking, and inactivity, to be the sole cause of plaque. Hypothesis – a genetic factor influences atherosclerosis.
PDAY – confirmed and expanded on the earlier findings. A large collaboration of pathologists collected samples from young people who died of non-cardiovascular causes.
3,000 autopsies
15-34 year olds
Aorta and LAD samples preserved as formalin-fixed, paraffin-embedded (FFPE) blocks.
Liver samples also collected.
GPAA – uses the liver samples to sequence genomes. Proteomics collaborators have developed techniques for extracting proteins from old FFPE blocks, making genomic and proteomic analysis possible today.
Time for confession.
I ignored earlier advice about letting requirements and constraints drive database selection in GPAA project.
I’ve worked with relational databases extensively, developed models for demographic, phenotypic, genomic and proteomic data before.
I did not pay enough attention to the “unknown unknowns” – collaborators had additional ideas of how to leverage other data about GWAS, eQTL, histones, chromatins, etc.
Did not appreciate how much would change.
Could have stayed with relational model, but:
Requirements were changing
New data sets: GWAS, eQTL, Chromatin Segmentation, Histones
Unknown data structures for Multiple Reaction Monitoring (MRM) Mass Spec and SWATH
Normalized model was beginning to be more trouble than it was worth.
Flexibility was a primary concern.
First 4 especially important to organizations with big data and need for constant access to data and applications – e.g. Facebook, Amazon, Google
Flexibility is primary driver for us to consider and eventually adopt a NoSQL database.
4 most commonly referenced database types in NoSQL community and press.
Will not discuss Search databases here. PATRIC is using hybrid Relational-Search database strategy which is significantly improving performance over relational-only approach.
Integration is key for bioinformaticians and biologists; don't make them integrate data.
So simple, it is almost trivial.
Can store non-atomic values as well, e.g. JSON documents, but can only access the entire document – you cannot select a single value within the document or search for values of a particular field.
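That limitation can be shown in a few lines, assuming a JSON document stored as an opaque string value; the key and fields are illustrative.

```python
import json

# A key-value store treats a stored JSON document as an opaque blob:
# to read one field you must fetch and parse the whole value, and the
# store cannot find documents by the value of a field inside them.
store = {}
store["doc:1"] = json.dumps({"title": "PDAY overview", "year": 1998})

doc = json.loads(store["doc:1"])  # the entire document comes back
print(doc["year"])  # -> 1998
```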
Example KV databases.
Redis – popular, easy to use, commonly used for caching; master-slave replication; multiple servers respond to read request; one server handles writes
Riak – scalable, masterless
BerkeleyDB – first widely used KV data store
Aerospike and FoundationDB – support ACID transactions
Amazon DynamoDB available in cloud (just announced on 10/9/2014 DynamoDB will support documents as well as KVs)
JSON/BSON or XML storage
Cassandra – developed by Facebook
HBase – part of the Hadoop ecosystem
Accumulo designed to support cell level access control; originally created by NSA
Hypertable – used commercially
Neo4j is probably most widely used of graph dbs
OrientDB – incorporates document DB features as well as graph DB features
Titan – runs on a cluster; uses Cassandra or HDFS (I think) for distributed storage
GraphChi-DB – a project to run large graphs on small machines, e.g. Mac Minis
AllegroGraph – commercial product from Franz, a long established Lisp vendor