Brian O'Connor HBase Talk - Triangle Hadoop Users Group Dec 2010
1. Big Bio and Fun with HBase
Brian O'Connor
Pipeline Architect
UNC Lineberger Comprehensive
Cancer Center
2. Overview
● The era of “Big Data” in biology
● Sequencing and scientific queries
● Computational requirements and data growth
● The appeal of HBase/Hadoop
● The SeqWare project and the Query Engine (QE)
● Implementation of HBase QE
● How well it works
● The future
● Adding indexing and search engine
● Why HBase/Hadoop are important to modern biology
3. Biology In Transition
● Biologists are used to working on one gene
their entire careers
● In the last 15 years biology has been
transitioning to a high-throughput, data-driven
discipline
● Now scientists study systems: thousands of genes, millions of SNPs, billions of bases of sequence
● Biology is not physics
4. The Beginnings
Capillary Sequencer
● The Human Genome with
Sanger Sequencing
● 1990-2000
● Output ~1000 bases per read
● $3 billion
● Gave us the blueprints for a
human being, let us find
most of the genes in
humans, understand
variation in humans
5. The Era of “Big Data” in Biology
● SOLiD 3 Plus
● 2x50 run, ~14 days
● >= 99.94% accuracy (colorspace corrected)
● ~1G reads/run, ~60GB high quality; conservatively maybe 40GB aligned
● ~60GB in ~14 days; a Human Genome in 2 Weeks!
● Cost ~$20K, $333/GB
● Illumina GAIIx
● 2x76 run, ~8 days
● >= 98.5% accuracy at 76 bases
● ~300M reads/flowcell, ~20-25GB high qual; 38M/lane, 2.5GB/lane, about 1.25GB aligned
● ~20GB in ~8 days
● Cost ~$10K, about $450/GB
● 454 GS FLX
● 400bp, ~10 hours
● Q20 at 400 bases
● ~1M reads/run, ~400MB high qual
● Homopolymer issue
● So ~1GB in ~1 day
● Cost ~$14K for 70x75, ~$90K/GB
● HeliScope
● ~35bp run, ~8 days
● <5% raw error rate
● ~700M reads/run, ~25GB/run
● So ~25GB in ~8 days
● Cost ?
6. Types of Questions Biologists Ask
● Want to ask simple questions:
● “What SNVs are in 5'UTR of phosphatases?”
● “What frameshifts affect PTEN in lung cancer?”
● “What genes include homozygous, non-
synonymous SNVs in all ovarian cancers?”
● Biologists want to see data across samples
● Database natural choice for this data, many examples
7. Impact of Next Gen Sequencing
● If the human genome gave us the generic blueprints, next gen sequencing lets us look at individuals' blueprints
● Applications:
● Sequence an individual's cancer (tumor and normal, subtract the mutations found just in the tumor) and find distinct druggable targets
● Sequence many people (those who have the disease and those who do not) and look at mutation patterns
8. The Big Problem
● Biologists think about reagents and tumors, not hard drives and CPUs
● This crazy data growth exceeds the growth of all IT components
● Dramatically different models for scalability are
required
● We're soon reaching a point where a
genome will cost less to sequence than it
does to look at the data!
9. Increase in Sequencer Output
[Chart: Illumina sequencer output; sequence file sizes per lane (bytes, log scale) plotted against date, 08/10/06 to 12/27/10]
● Suggests sequencer output increases by 10x every 2 years!
● Far outpacing hard drive, CPU, and bandwidth growth
● Moore's Law: CPU power doubles every 2 years
● Kryder's Law: storage quadruples every 2 years
http://genome.wellcome.ac.uk/doc_WTX059576.html
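The divergence on this slide can be sanity-checked with a tiny back-of-the-envelope calculation, assuming the quoted per-2-year growth factors:

```python
def growth(factor_per_2yr, years):
    """Total growth after `years` at a fixed factor per 2-year period."""
    return factor_per_2yr ** (years / 2)

sequencer = growth(10, 6)  # sequencer output: 10x every 2 years -> 1000x in 6 years
storage = growth(4, 6)     # Kryder's Law: 4x every 2 years -> 64x
cpu = growth(2, 6)         # Moore's Law: 2x every 2 years -> 8x
```

Even the fastest-growing IT component here (storage) falls behind sequencer output by more than an order of magnitude within 6 years.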
10. Lowering Costs = Bigger Projects
● Falling costs increase scope of projects
● The Cancer Genome Atlas (TCGA)
● 20 tumor types, 200 samples each in 4 years
● Around 4 Petabytes of data
● Once costs fall below $1,000 per genome, many predict the technology will move into clinical applications
● 1.5 million new cancer patients in 2010
● 1,500 Petabytes per year!?
● 4 PB/day vs 0.2 PB/day Facebook
http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/
http://seer.cancer.gov/statfacts/html/all.html
11. So We Need to Rethink Scaling Up
Particularly for Databases
● The old ways of growing can't keep up... can't
just “buy a bigger box”
● We need to embrace what the Peta-scale
community is currently doing
● The Googles, Facebooks, Twitters, etc of the
world are solving their scalability problems,
biology needs to learn from their solutions
● Clusters can be expanded and distributed storage used; the least scalable portion is the database, which biologists need in order to ask questions
12. Technologies That Can Help
● Many open source tools designed for Petascale
● Map/Reduce for data processing
● Hadoop for distributed file systems (HDFS) and
robust process control
● Pig/Hive/etc. for processing unstructured/semi-structured data
● HBase/Cassandra/etc. for databasing
● Could go unstructured, but biological data is most useful when aggregated, and a database is extremely good for this
13. HBase to the Rescue
● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is column oriented, sparse matrix
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits DB across many nodes transparently
● Locality of data, I can run map/reduce jobs that
process the table rows present on a given node
● 22M variants processed <1 minute on 5 node cluster
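The data model the bullets above describe can be sketched as a toy in Python. This is a deliberate simplification (not the real HBase client API): a sorted, sparse map of row key to columns, with every cell versioned by timestamp.

```python
import time
from collections import defaultdict

class SparseVersionedTable:
    """Toy sketch of HBase's data model: a sorted, sparse map of
    row key -> column -> {timestamp: value}. Not the real client API."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        # every write is versioned by a timestamp, as in HBase
        self.rows[row][column][ts if ts is not None else time.time_ns()] = value

    def get(self, row, column, ts=None):
        versions = self.rows.get(row, {}).get(column, {})
        if not versions:
            return None
        # the newest version wins unless an explicit timestamp is requested
        return versions[ts] if ts is not None else versions[max(versions)]

    def scan(self, start_row, stop_row):
        # rows come back in sorted key order, like an HBase scan
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, dict(self.rows[key])
```

Because rows are sparse, a table can be "billions of rows x millions of columns" without storing empty cells; only the cells actually written occupy space.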
14. Magically Scalable Databases
● Talking about distributed databases, forces a
huge shift in what you can and can't do
● “NoSQL is an umbrella term for a loosely
defined class of non-relational data stores that
break with a long history of relational databases
and ACID guarantees. Data stores that fall
under this term may not require fixed table
schemas, and usually avoid join operations.
The term was first popularized in early 2009.”
http://en.wikipedia.org/wiki/Nosql
15. What You Give Up
● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by the DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must
be re-written
● Certain ACID aspects
● Software maturity: most distributed NoSQL projects are very new
16. What You Gain
● Scalability is the clear win, you can have many
processes on many nodes hit the collection of
database servers
● Ability to look at very large datasets and do
complex computations across a cluster
● More flexibility in representing information now
and in the future
● HBase includes data timestamps/versions
● Integration with Hadoop
19. HBase APIs
● Basic API
● Connect to an HBase server like it's a single server
● Lets you iterate over the contents of the DB, with
flexible filtering of the results
● Can pull back/write key/values easily
● Map/Reduce source/sink
● Can use HBase tables easily from Map/Reduce
● May be easier/faster just to Map/Reduce than filter
with the API
● I want to use this more in the future
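The scan-plus-filter access pattern described above can be sketched in plain Python (a hedged illustration of the semantics, not the real HBase client API; the table contents are hypothetical):

```python
# Open a "scanner" over a key range and push a filter predicate into the
# scan, so the caller only iterates over matching key/values.
def scan_with_filter(table, start_key, stop_key, predicate):
    for key in sorted(table):                     # rows arrive in key order
        if start_key <= key < stop_key and predicate(key, table[key]):
            yield key, table[key]

# Toy table: row key -> column values (keys and columns are illustrative)
variants = {
    "chr1:000001200": {"type": "indel", "gene": "PTEN"},
    "chr1:000001450": {"type": "SNV", "gene": "TP53"},
    "chr2:000000042": {"type": "SNV", "gene": "PTEN"},
}
pten_on_chr1 = list(scan_with_filter(variants, "chr1:", "chr2:",
                                     lambda k, v: v["gene"] == "PTEN"))
```

The trade-off on the slide follows from this shape: a filtered scan streams rows through one client, while a Map/Reduce job applies the same predicate on every node in parallel.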
20. HBase Installation
● Install Hadoop first, version 0.20.x
● I installed HBase version 0.20.1
● Documented: http://tinyurl.com/23ftbrk
● Currently have a 7 node Hadoop/HBase cluster
running at UNC with about 40TB of storage, 56
CPUs, 168GB RAM
● RPMs from Cloudera (0.89.20100924)
http://www.cloudera.com
21. The SeqWare Project
[Architecture diagram]
● SeqWare LIMS: central portal for users that links to the tools to upload samples, trigger analysis, and view results
● SeqWare Query Engine: high-performance, distributed database and web query engine (Hadoop/HBase cluster), powers both the browser and interactive queries
● Genome Browser: genome browser and query engine frontend
● SeqWare MetaDB: central DB that coordinates all analysis and results metadata
● SeqWare Pipeline: controls analysis on the cluster (SGE), models analysis workflows for RNA-Seq and other NGS experimental designs
● SeqWare API* (Java/Perl/Python via Thrift): developer API, savvy users can control all of SeqWare's tools programmatically (* future project)
● Import Daemons: data import tool facilitates sequence delivery to the storage via the network
● Fully open source: http://seqware.sf.net
22. The SeqWare Query Engine Project
● SeqWare Query Engine is our HBase database for next gen sequence data
● High-performance, distributed database and web query engine, powers both browser and interactive queries
● Frontends: interactive web forms, the Genome Browser, and a Web API service, all backed by HBase on Hadoop clusters
23. Webservice Interfaces
RESTful XML Client API or HTML Forms
SeqWare Query Engine includes a RESTful web service that returns XML describing the variant DBs, a web form for querying, and BED/WIG data available via well-defined URLs
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
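For illustration, the example URL and BED line above can be built and parsed with short helpers. The query-parameter names come from the example URL; the helper functions themselves are hypothetical, and the fourth-column format is inferred from the single example line:

```python
from urllib.parse import urlencode

def query_url(base, db_id, fmt="bed", tag=None):
    # Assemble a variant query URL in the style of the slide's example.
    params = {"format": fmt}
    if tag is not None:
        params["filter.tag"] = tag
    return f"{base}/variants/mismatches/{db_id}?{urlencode(params)}"

def parse_bed_variant(line):
    # The first three BED fields are standard chrom/start/end; the fourth
    # encodes the base change and read counts, per the example output.
    chrom, start, end, name = line.split(None, 3)
    return {"chrom": chrom, "start": int(start), "end": int(end), "change": name}
```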
24. SeqWare QE Architecture Concepts
● Types: variants, translocations, coverage, consequence, and generic “features”
● Each can be “tagged” with arbitrary key-value pairs, which allows for the encoding of a surprising amount of annotation!
● The fundamental query is a list of objects filtered by these “tags”
● Examples:
● Indel (chr1): nonsynonymous; is_dbSNP | rs10292192
● SNV (chr12): nonsynonymous; is_dbSNP | rs10292192
● translocation (chr3/chr12): gene_fusion
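The model above can be sketched in a few lines: each feature carries arbitrary key-value tags, and the fundamental query filters a list of features by those tags. The feature layout here is illustrative, not SeqWare's actual classes:

```python
# Example features mirroring the slide: tags are free-form key-value pairs,
# with None standing in for a valueless tag.
features = [
    {"type": "indel", "loc": "chr1",
     "tags": {"nonsynonymous": None, "is_dbSNP": "rs10292192"}},
    {"type": "SNV", "loc": "chr12",
     "tags": {"nonsynonymous": None, "is_dbSNP": "rs10292192"}},
    {"type": "translocation", "loc": "chr3/chr12",
     "tags": {"gene_fusion": None}},
]

def query_by_tag(feats, key, value=None):
    # a feature matches on tag key alone, or on key and value when given
    return [f for f in feats
            if key in f["tags"] and (value is None or f["tags"][key] == value)]
```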
25. Requirements for Query Engine Backend
The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases
(~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying &
comparing samples (variants, coverage,
annotations)
– Support a crazy growth of data
26. HBase Query Engine Backend
● Stores variants, translocations, coverage info,
coding consequence reports, annotations as
“tags”, and generic “features”
● Common interface, uses HBase as backend
● Create a single database for all genomes
● Database is auto-sharded across cluster
● Can use Map/Reduce and other projects to do
sophisticated queries
● Performance seems very good!
27. HBase SeqWare Query Engine Backend
[Architecture diagram]
● ETL map and reduce jobs extract, transform, &/or load data in parallel into the HBase region server nodes
● HBase on HDFS (variant & coverage database system), coordinated by the HMaster
● Web service nodes process querying & loading via the HBase API, serving BED/WIG files and XML metadata through the RESTful web service to webservice clients
● Analysis nodes and the MetaDB sit alongside
28. Underlying HBase Tables

hg18Table (columns are family:label; values are serialized Variant object byte arrays):
key | variant:genome4 | variant:genome7 | coverage:genome7
chr15:00000123454 | byte[] | byte[] | byte[]

Genome1102NTagIndexTable (queries look up by tag, then filter the variant results):
key | rowId
is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123 | byte[]

Database on filesystem (HDFS):
key | timestamp | column:variant
chr15:00000123454 | t1 | genome7 byte[]
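The row-key scheme above can be sketched as follows; the helper names are my own. Zero-padding the position means lexicographic key order equals genomic order within a chromosome, which is what makes range scans over a region work:

```python
def variant_row_key(chrom, position, width=11):
    # e.g. chr15:00000123454, matching the key format in the table above
    return f"{chrom}:{position:0{width}d}"

def tag_index_key(tag, build, chrom, position, genome, vtype, ref, alt, version):
    # mirrors the secondary-index key shown above, e.g.
    # is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123
    return (f"{tag}|{build}.{chrom}.{position:011d}"
            f".variant.{genome}.{vtype}.{ref}.{alt}.{version}")
```

Prefixing the index key with the tag clusters all rows for one tag together, so a tag query becomes a prefix scan over the index table followed by fetches of the referenced variant rows.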
29. Backend Performance Comparison
[Chart: pileup load time for sample 1102N, HBase vs. Berkeley DB; variants loaded (0 to ~7M) against time (0 to ~18,000 s)]
30. Backend Performance Comparison
[Chart: BED export time for sample 1102N, HBase API vs. Map/Reduce vs. BerkeleyDB; variants dumped (0 to ~7M) against time (0 to ~7,000 s)]
31. Querying with Map/Reduce
Find all variants seen in ovarian cancer genomes tagged as “frameshift”
● Map: iterate over every variant, bin it if it is from an ovarian genome and tagged “frameshift”
● Reduce: for each item in the bin, print out the variant information
● Each node runs map and reduce over its local rows (node 1: chr1 variants, node 2: chr2 variants, node 3: chr3 variants)
● Finding variants with a tag, or comparing at the same genomic position, is very efficient
● Problem: overlapping features that do not start at the same genomic location
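The query above can be written as plain map and reduce functions, sketched here in Python; the variant fields ("disease", "tags", "gene", "id") are illustrative stand-ins, not SeqWare's actual schema:

```python
from itertools import groupby

def map_fn(variant):
    # Map: emit the variant if it is from an ovarian genome and tagged "frameshift"
    if variant["disease"] == "ovarian" and "frameshift" in variant["tags"]:
        yield variant["gene"], variant["id"]

def reduce_fn(gene, ids):
    # Reduce: report the binned variant information for one gene
    return gene, sorted(ids)

def run_job(variants):
    # sorting the (key, value) pairs plays the role of the shuffle phase
    pairs = sorted(kv for v in variants for kv in map_fn(v))
    return [reduce_fn(gene, [i for _, i in grp])
            for gene, grp in groupby(pairs, key=lambda kv: kv[0])]
```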
32. Problematic Querying with Map/Reduce
● Hard to do overlap queries, which are really important for biological DBs
● In the genome: Indel 1 spans chr1:1200-1500, Indel 2 spans chr1:1400-1700, and SNV 1 and SNV 2 fall inside both
● In the DB, features are keyed by start position:
chr1:000001200 | Indel 1 byte[]
chr1:000001400 | Indel 2 byte[]
chr1:000001450 | SNV 1 byte[] | SNV 2 byte[]
● Here, if I did a Map/Reduce query, only SNV 1 and SNV 2 would “overlap”
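The failure mode above in miniature: rows keyed by start position only surface features that *begin* at a given key, while a genuine overlap query must also return features that span it. The row contents are the slide's example; the helper functions are illustrative:

```python
# Rows keyed by start position, as in the table sketch above
rows = {
    "chr1:000001200": [("Indel 1", 1200, 1500)],
    "chr1:000001400": [("Indel 2", 1400, 1700)],
    "chr1:000001450": [("SNV 1", 1450, 1450), ("SNV 2", 1450, 1450)],
}

def by_start_key(position):
    # a key lookup: only features starting exactly at this position
    return [name for name, _, _ in rows.get(f"chr1:{position:09d}", [])]

def true_overlaps(position):
    # a full scan with an interval test: everything spanning the position
    return [name for feats in rows.values()
            for name, start, end in feats if start <= position <= end]
```

The key lookup at 1450 returns only the two SNVs, while the interval test also finds both indels spanning that position; the full scan is correct but touches every row, which is exactly what an overlap index is meant to avoid.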
33. SeqWare Query Engine Status
● Open source, you can try it now!
● Both BerkeleyDB & HBase backends
● Multiple genomes stored in the same table, very
Map/Reduce compatible for SNVs
● Basic secondary indexing for “tags”
● API used for queries via Webservice
● Prototype Map/Reduce examples including
“somatic” mutation detection in paired
normal/cancer samples
34. SeqWare Query Engine Future
● Dynamically building R-tree indexes or
Nested Containment Lists with Map/Reduce
“Experiences on Processing Spatial Data with
MapReduce” by Cary et al.
● Looking at using Katta or Solr for indexing free
text data such as gene descriptions, OMIM
entries, etc.
● Queries across samples with simple logic
● More testing, pushing our 7 node cluster,
finding the max number of genomes this cluster
can support
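As a rough illustration of the nested containment list idea cited above (a from-scratch sketch, not the implementation from Cary et al.): sorting intervals by ascending start and descending end nests contained intervals under their containers, and because no interval at a level contains another, ends at each level are sorted too, so an overlap query is a binary search plus a bounded scan per level:

```python
from bisect import bisect_left

def build_nclist(intervals):
    # intervals: (start, end, name) with inclusive coordinates.
    # Sort by start asc, end desc so any contained interval immediately
    # follows a container; a stack turns containment into nesting.
    ivs = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top, stack = [], []
    for iv in ivs:
        while stack and iv[1] > stack[-1][0][1]:
            stack.pop()                    # iv extends past this candidate parent
        node = (iv, [])                    # (interval, contained sublist)
        (stack[-1][1] if stack else top).append(node)
        stack.append(node)
    return top

def query_nclist(level, qstart, qend, hits=None):
    # Within one level ends are sorted: binary-search the first interval
    # ending at/after qstart, scan while starts precede qend, and recurse
    # into contained sublists (children of a non-overlapping parent cannot
    # overlap, so whole subtrees are skipped).
    if hits is None:
        hits = []
    ends = [iv[1] for iv, _ in level]
    i = bisect_left(ends, qstart)
    while i < len(level) and level[i][0][0] <= qend:
        iv, sub = level[i]
        hits.append(iv[2])
        query_nclist(sub, qstart, qend, hits)
        i += 1
    return hits
```

On slide 32's example this returns all four overlapping features at chr1:1450, where the start-keyed lookup found only the two SNVs.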
35. Final Thoughts
● Era of Big Data for Biology is here!
● CPU-bound problems remain, no doubt, but as short reads become long reads and prices per GBase drop, the problems shift to handling/mining data
● Tools designed for Peta-scale datasets are key
[Photo: Yahoo's Hadoop cluster]
36. For More Information
● http://seqware.sf.net
● http://hadoop.apache.org
● http://hbase.apache.org
● Brian O'Connor <briandoconnor@gmail.com>
● Article (12/21/2010): Brian D O’Connor, Barry Merriman and Stanley F Nelson. “SeqWare Query Engine: storing and searching sequence data in the cloud.” BMC Bioinformatics 2010, 11(Suppl 12)
● We have job openings!!