4. Webservice Interfaces
RESTful XML Client API or HTML Forms
SeqWare Query Engine includes a RESTful web service that returns XML describing the variant databases, an HTML form for building queries, and BED/WIG output available through well-defined URLs, for example (a client fetch sketch follows the sample output):
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
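As a rough illustration (not part of the original deck), a client could pull that BED output with plain Java HTTP calls; the host name and query parameters below are simply the example URL from this slide, so treat the endpoint as an assumption.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: fetch BED-formatted variants for chromosome 1, filtered by the PTEN tag,
// from the example query engine URL shown above.
public class FetchBedSketch {
    public static void main(String[] args) throws Exception {
        String url = "http://server.ucla.edu/seqware/queryengine/realtime/"
                   + "variants/mismatches/1?format=bed&filter.tag=PTEN";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is the BED track header or a variant interval,
                // e.g. chr10 89675294 89675295 G->T(...)
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}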
5. Loading in Genome Browsers
SeqWare Query Engine URLs can be loaded directly into the IGV and UCSC genome browsers.
6. Requirements for Query Engine Backend
The backend must:
– Represent many types of data
– Support a rich level of annotation
– Support very large variant databases
(~3 billion rows x thousands of columns)
– Be distributed across a cluster
– Support processing, annotating, querying &
comparing samples (variants, coverage,
annotations)
– Keep pace with rapid, sustained growth in data volume
7. Increase in Sequencer Output
[Chart: Illumina sequencer output at the Nelson Lab (UCLA); sequence file size per lane in bytes (log scale) plotted against date from late 2006 through late 2010. The trend suggests sequencer output increases by 5-10x every 2 years, far outpacing hard drive, CPU, and bandwidth growth.]
8. HBase to the Rescue?
● Billions of rows x millions of columns!
● Focus on random access (vs. HDFS)
● Table is a column-oriented, sparse matrix (see the table-creation sketch after this list)
● Versioning (timestamps) built in
● Flexible storage of different data types
● Splits DB across many nodes transparently
● Locality of data: I can run map/reduce jobs that process the table rows present on a given node
● 22M variants processed in under 1 minute on a 5-node cluster
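As a minimal sketch of that column-family layout (not from the deck; it uses the older HBase admin classes, which differ between HBase versions), creating the table shown on the next slide with its two families might look like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Sketch: create the hg18 table with "variant" and "coverage" column families.
// Each genome is just a qualifier within a family (variant:genome4, ...), so
// adding a genome never requires a schema change.
public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("hg18Table"));
        table.addFamily(new HColumnDescriptor("variant"));   // serialized Variant objects per genome
        table.addFamily(new HColumnDescriptor("coverage"));  // coverage blocks per genome
        admin.createTable(table);
        admin.close();
    }
}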
9. Underlying HBase Tables
hg18Table (logical view; "variant" and "coverage" are column families, genome IDs are column qualifiers):

  key                   variant:genome4   variant:genome7   coverage:genome7
  chr15:00000123454     byte[]            byte[]            byte[]

Each cell holds a Variant object serialized to a byte array.

Genome1102NTagIndexTable (secondary index on tags; queries look up rows by tag, then filter the variant results):

  key                                                             rowId
  is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123    byte[]

On the filesystem (HDFS) the database is stored by key, timestamp, and column:

  key                   timestamp   column:variant
  chr15:00000123454     t1          genome7 = byte[]
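A write against that layout could look roughly like this (a sketch only; it uses the older HTable/Put calls, and serializeVariant() is a hypothetical stand-in for the backend's byte serialization):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store one serialized Variant for genome7 at a zero-padded coordinate row key.
public class WriteVariantSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable hg18 = new HTable(conf, "hg18Table");

        byte[] rowKey = Bytes.toBytes("chr15:00000123454"); // padded so rows sort by position
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("variant"), Bytes.toBytes("genome7"), serializeVariant());
        hg18.put(put);
        hg18.close();
    }

    // Hypothetical stand-in for whatever byte serialization the backend uses.
    private static byte[] serializeVariant() {
        return new byte[0];
    }
}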
10. HBase API & Map/Reduce Querying
● HBase API
● Powers the current Backend & Webservice
● Provides a familiar API, scanners, iterators, etc
● Backend is written with this API; variants are retrieved by tags
● The database is distributed, but API access is single-threaded
● Prototyped somatic-mutation detection with Map/Reduce
● Every row is examined; variants present in the tumor but not in the normal are retrieved
● Map/Reduce jobs run on the node with the local data
● Highly parallel and faster than the single-threaded API (a mapper sketch follows)
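A rough sketch of what such a map task could look like with HBase's TableMapper; the tumor/normal column qualifiers and the output value are illustrative, not the deck's actual code.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Sketch: emit rows where the tumor genome has a variant call but the matched
// normal genome does not ("somatic" candidates). The mapper runs on the region
// server that holds the rows, so the scan stays local to the data.
public class SomaticMapperSketch extends TableMapper<ImmutableBytesWritable, Text> {

    private static final byte[] FAM    = Bytes.toBytes("variant");
    private static final byte[] TUMOR  = Bytes.toBytes("genomeTumor");   // illustrative qualifier
    private static final byte[] NORMAL = Bytes.toBytes("genomeNormal");  // illustrative qualifier

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        byte[] tumor  = row.getValue(FAM, TUMOR);
        byte[] normal = row.getValue(FAM, NORMAL);
        if (tumor != null && normal == null) {
            // Candidate somatic mutation: present in tumor, absent in normal.
            context.write(rowKey, new Text("somatic-candidate"));
        }
    }
}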
11. SeqWare Query Engine on HBase
[Architecture diagram: MapReduce ETL jobs (map and reduce nodes) extract, transform, and/or load data in parallel into the HBase-on-HDFS variant & coverage database system; Analysis/Web nodes handle querying and loading through the HBase API; the RESTful web service combines the variant/coverage results with metadata from the MetaDB and returns BED/WIG files and XML to clients.]
12. Status of HBase Backend
● Backends exist for both BerkeleyDB & HBase; a relational backend is coming soon
● Multiple genomes stored in the same table,
very Map/Reduce compatible
● Basic secondary indexing for “tags” (see the query sketch after this list)
● API used for queries via Webservice
● Prototype Map/Reduce example for “somatic”
mutation detection in paired normal/cancer
samples
● Currently loading 1102 normal/tumor (GBM)
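A rough sketch of how the tag index table from the earlier slide might be queried (older-style HBase client calls; the key parsing is simplified and illustrative): prefix-scan the index for a tag such as is_dbSNP, then fetch and filter the referenced variant rows from the main table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: look up variants tagged "is_dbSNP" via the secondary-index table,
// then pull the full variant rows from the main hg18 table.
public class TagQuerySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable tagIndex = new HTable(conf, "Genome1102NTagIndexTable");
        HTable hg18     = new HTable(conf, "hg18Table");

        // Index keys look like: is_dbSNP|hg18.chr15.00000123454.variant.genome4...
        // '}' sorts just after '|', so this start/stop pair covers the is_dbSNP prefix.
        Scan scan = new Scan(Bytes.toBytes("is_dbSNP|"), Bytes.toBytes("is_dbSNP}"));
        ResultScanner hits = tagIndex.getScanner(scan);
        for (Result hit : hits) {
            String indexKey = Bytes.toString(hit.getRow());
            String[] parts = indexKey.split("\\|")[1].split("\\.");  // hg18, chr15, position, ...
            String variantRow = parts[1] + ":" + parts[2];           // rebuild "chr15:00000123454"
            Result variant = hg18.get(new Get(Bytes.toBytes(variantRow)));
            // ...deserialize variant.getValue(...) here and apply any further filtering.
        }
        hits.close();
        tagIndex.close();
        hg18.close();
    }
}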
13. Backend Performance Comparison
Pileup Load Time 1102N
HBase vs. Berkeley DB
[Chart: variants loaded (0 to ~8 million, y-axis) vs. time in seconds (0 to ~18,000, x-axis), with one series per backend: load time BDB and load time HBase.]
14. Backend Performance Comparison
BED Export Time 1102N
HBase API vs. M/R vs. BerkeleyDB
[Chart: variants exported (0 to ~8 million, y-axis) vs. time in seconds (0 to ~7,000, x-axis), with series for dump time BDB, dump time HBase API, and dump time Map/Reduce.]
15. HBase/Hadoop Have Potential!
● Era of Big Data for Biology is here!
● CPU-bound problems remain, but as short reads become long reads and the price per gigabase drops, the bottleneck shifts to handling and mining data
● Tools designed for Peta-scale datasets are key
16. Next Steps
● Model other datatypes: copy number, RNA-seq gene/exon/splice-junction counts, isoforms, etc.
● Focus on porting analysis/querying to Map/Reduce
● Indexing beyond “tags” with Katta (distributed Lucene)
● Push scalability: what are the limits of an 8-node HBase/Hadoop cluster?
● Look at Cascading, Pig, Hive, etc as advanced
workflow and data mining tools
● Standards for Webservice dialect (DAS?)
● Exposing Query Engine through GALAXY
17. Acknowledgments
UCLA: Jordan Mendler, Michael Clark, Hane Lee, Bret Harry, Stanley Nelson
UNC: Sara Grimm, Matt Soloway, Jianying Li, Feri Zsuppan, Neil Hayes, Chuck Perou, Derek Chiang
18. Resources
● HBase & Hadoop: http://hadoop.apache.org
● When to use HBase:
http://blog.rapleaf.com/dev/?p=26
● NOSQL presentations:
http://blog.oskarsson.nu/2009/06/nosql-debrief.html
● Other DBs: CouchDB, Hypertable, Cassandra,
Project Voldemort, and more...
● Data mining tools: Pig and Hive
● SeqWare: http://seqware.sourceforge.net
● briandoconnor@gmail.com
20. Overview
● SeqWare Query Engine background
● New tools for combating the data deluge
● HBase/Hadoop in SeqWare Query Engine
● HBase for backend
● Map/Reduce & HBase API for webservice
● Better performance and scalability?
● Next steps
21. SeqWare Query Engine: BerkeleyDB Backend
[Architecture diagram: per-genome BerkeleyDB variant & coverage databases on a Lustre filesystem; Analysis/Web nodes process queries against these databases; the RESTful web service combines variant/coverage data with metadata from the MetaDB and returns BED/WIG files and XML to clients.]
22. More
● Details on API vs. M/R
● Details on the XML RESTful API & web app, including loading in the UCSC browser
● Details on generic store object (BerkeleyDB,
HBase, and Relational at Renci)
● Byte serialization from BerkeleyDB, custom
secondary key creation
23. Pressures of Sequencing
● A lot of data (50GB SRF file, 150GB alignment
files, 60GB variants for a 20x human genome)
● PostgreSQL (2x quad-core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows
● Data needs to be processed, annotated, queryable, and comparable (variants, coverage, annotations)
● ~3 billion rows x thousands of columns
24. Thoughts on BerkeleyDB
● BerkeleyDB let me:
● Create a database per genome, independent from a
single database daemon
● Provision database to cluster
● Adapt to key-value database semantics
● Limitations:
● Creation on single node only
● Not inherently distributed
● Performance issues with big DBs, high I/O wait
● Google to the rescue?
25. HBase Backend
● How the table(s) are actually structured
● Variants
● Coverage
● Etc
● How I do indexing currently (similar to indexing
feature extension)
● Multiple secondary indexes
26. Frontend
● RESTlet API (a resource sketch follows this list)
● What queries can you do?
● Examples
● URLs
● Potential to swap in generic M/R for many of these queries (less reliance on indexes, which should speed things up as the DB grows)
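Since the web service is built with RESTlet, a variants resource might look roughly like the sketch below; the class name, URL template attribute, and placeholder body are invented for illustration, and only the format/filter.tag parameters come from the example URL earlier in the deck.

import org.restlet.representation.Representation;
import org.restlet.representation.StringRepresentation;
import org.restlet.resource.Get;
import org.restlet.resource.ServerResource;

// Sketch of a RESTlet resource behind a route such as
// /queryengine/realtime/variants/mismatches/{chr}?format=bed&filter.tag=PTEN
public class VariantsResourceSketch extends ServerResource {

    @Get
    public Representation getVariants() {
        String chr    = (String) getRequestAttributes().get("chr"); // from the URL template
        String format = getQuery().getFirstValue("format", "xml");  // bed, wig, or xml
        String tag    = getQuery().getFirstValue("filter.tag");     // optional tag filter

        // A real resource would call the HBase/BerkeleyDB backend here;
        // this placeholder just echoes the parsed parameters.
        String body = "# would return " + format + " for " + chr
                + (tag == null ? "" : " filtered by tag " + tag);
        return new StringRepresentation(body);
    }
}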
27. Ideas for a distributed future
● Federated DBs/datastores/clusters for
computation rather than one giant datacenter
● Distribute software not data
28. Potential Questions
● How big is the DB to store whole human
genome?
● How long does it take to M/R 3 billion positions
on 5 node cluster?
● How does this work compare to other bioinformatics software (GATK, Crossbow, etc.)?
● How did I choose HBase instead of Pig, Hive,
etc?
29. Current Prototyping Work
● Validate creation of U87 (genome resequencing
at 20x) genome database
● SNVs
● Coverage
● Annotations
● Test fast querying of record subsets
● Test fast processing of whole DB using
MapReduce
● Test stability, fault-tolerance, auto-balancing,
and deployment issues along the way
30. What About Fast Queries?
● I'm fairly convinced I can create a distributed
HBase database on a Hadoop cluster
● I have a prototype HBase database running on
two nodes
● But HBase shines when bulk-processing the whole DB
● Big question is how to make individual lookups
fast
● Possible solution is HBase+Katta for indexes (distributed Lucene)
32. How Do We Scale Up the QE?
● Sequencers are increasing output by a factor of
10 every two years!
● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw
more hardware at a single database server!
● Must look for better ways to scale
33. Google to the Rescue?
● Companies like Google, Amazon, Facebook,
etc have had to deal with massive scalability
issues over the last 10+ years
● Solutions include:
● Frameworks like MapReduce
● Distributed file systems like HDFS
● Distributed databases like HBase
● Focus here on HBase
34. What Do You Give Up?
● SQL queries
● Well defined schema, normalized data structure
● Relationships managed by the DB
● Flexible and easy indexing of table columns
● Existing tools that query a SQL database must
be re-written
● Certain ACID aspects
● Software maturity, most distributed NOSQL
projects are very new
35. What Do You Gain?
● Scalability is the clear win: many processes on many nodes can hit the collection of database servers
● Ability to look at very large datasets and do
complex computations across a cluster
● More flexibility in representing information now
and in the future
● HBase includes data timestamps/versions
● Integration with Hadoop
36. SeqWare Query Engine
[Architecture diagram: Hadoop MapReduce ETL jobs (map and reduce nodes, coordinated by the name node) extract, transform, and load data in parallel into the HBase variant & coverage database system on HDFS; Web/Analysis nodes process queries; the RESTful web service combines variant/coverage data with hg18 annotations from the annotation database and returns BED/WIG files and XML to clients.]
37. What an HBase DB Looks Like
A record in the HBase table (logical view; "variant" and "coverage" are column families, genome IDs are column qualifiers):

  key                   variant:genome4   variant:genome7   coverage:genome7
  chr15:00000123454     byte[]            byte[]            byte[]

Each Variant object is serialized to a byte array. On the filesystem (HDFS) the database is stored by key, timestamp, and column:

  key                   timestamp   column:variant
  chr15:00000123454     t1          genome7 = byte[]
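For completeness, a read of that record could look like the following sketch (older HTable calls again; deserialization is left as a placeholder for whatever byte format the Variant objects use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: fetch the genome7 variant stored at chr15:00000123454.
public class ReadVariantSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable hg18 = new HTable(conf, "hg18Table");

        Result row = hg18.get(new Get(Bytes.toBytes("chr15:00000123454")));
        byte[] raw = row.getValue(Bytes.toBytes("variant"), Bytes.toBytes("genome7"));
        // raw is the serialized Variant object; decode it with the backend's codec.
        System.out.println(raw == null ? "no call" : raw.length + " bytes");
        hg18.close();
    }
}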
38. Scalability and BerkeleyDB
● BerkeleyDB let me:
● Create a database per genome, independent from a
single database daemon
● Provision database to cluster for distributed analysis
● Adapt to key-value database semantics with a nice API (see the sketch after this list)
● Limitations:
● Creation on single node only
● Want to query easily across genomes
● Databases are not distributed
● I saw performance issues, high I/O wait
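A minimal sketch of those key/value semantics with BerkeleyDB Java Edition; the environment path and key layout are illustrative, not the actual SeqWare store.

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

// Sketch: one BerkeleyDB JE environment per genome, keyed by genomic position.
public class BdbStoreSketch {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/data/genome4_db"), envConfig); // illustrative path

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database variants = env.openDatabase(null, "variants", dbConfig);

        DatabaseEntry key  = new DatabaseEntry("chr15:00000123454".getBytes("UTF-8"));
        DatabaseEntry data = new DatabaseEntry(new byte[0]); // serialized Variant bytes would go here
        variants.put(null, key, data);

        variants.close();
        env.close();
    }
}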
39. Would 2,000 Genomes Kill SQL?
● Say each genome has 5M variants (not counting
coverage!)
● 5M variant rows x 2,000 genomes = 10 billion rows
● Our DB server running PostgreSQL (2xquad core,
64GB RAM) died with the Celsius (Chado)
schema after loading ~1 billion rows
● So maybe conservatively we would have issues
with 150+ genomes
● That threshold is probably 1 year away with public
datasets available via SRA, 1000 genomes, TCGA
41. My Abstract
● backend/frontend
● Traverse and query with Map/Reduce
● Java web service with RESTlet
● Deployment on 8 node cluster
42. Background on Problem
● Why abandon PostgreSQL/MySQL/SQL?
● Experience with Celsius...
● What you give up
● What you gain
43. First Solution: BerkeleyDB
● Good:
● key/value data store
● Easy to use
● Great for testing
● Bad:
● Not performant for multiple genomes
● Manual distribution across cluster
● Does not play well with shared filesystems
44. Sequencers vs. Information Technology
● Sequencers are increasing output by a factor of
10 every two years!
● Hard drives: 4x every 2 years
● CPUs: 2x every 2 years
● Bandwidth: 2x every 2 years (really?!)
● So there's a huge disconnect, can't just throw
more hardware at a single database server!
● Must look for better ways to scale
45. ● What are we doing and what are the challenges? Big picture of the project (webservice, backend, etc.)
● How did people solve this problem before? How did I
attempt to solve this problem? Where did it break down?
● “New” approach, looking to Google et al for scalability for
big data problems
● What are HBase/Hadoop & what do they provide?
● How did I adapt HBase/Hadoop to my problem?
● Specifics of implementation: overall flow, tables, query
engine search (API), example M/R task
● Is this performant and does it scale? Can it handle billions of rows x millions of columns? Fast arbitrary retrieval?
● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for data mining