Big Bio and Fun with HBase

Brian O'Connor
Pipeline Architect
UNC Lineberger Comprehensive Cancer Center
Overview
●   The era of “Big Data” in biology
    ●   Sequencing and scientific queries
    ●   Computational requirements and data growth
●   The appeal of HBase/Hadoop
●   The SeqWare project and the Query Engine (QE)
    ●   Implementation of HBase QE
    ●   How well it works
●   The future
    ●   Adding indexing and search engine
    ●   Why HBase/Hadoop are important to modern biology
Biology In Transition
●   Biologists are used to working on one gene
    their entire careers
●   In the last 15 years biology has been
    transitioning to a high-throughput, data-driven
    discipline
●   Now scientists study systems:
    Thousands of genes   Millions of SNPs   Billions of bases of sequence




●   Biology is not physics
The Beginnings
●   The Human Genome with Sanger Sequencing
    ●   1990-2000
    ●   $3 billion
●   Gave us the blueprints for a human being, let us find most of
    the genes in humans, understand variation in humans
[Image: capillary sequencer, output ~1000 bases per run]
The Era of “Big Data” in Biology

SOLiD 3 Plus
● 2x50 run, ~14 days
● >= 99.94% accuracy (colorspace corrected)
● ~1G reads / run, ~60GB high quality
● Conservatively maybe 40GB aligned
● ~60GB in ~14 days
● Cost ~$20K, $333/GB
● A Human Genome in 2 Weeks!

Illumina GAIIx
● 2x76 run, ~8 days
● >= 98.5% accuracy at 76 bases
● ~300M reads / flowcell, ~20-25GB high qual
● 38M/lane, 2.5GB/lane, about 1.25GB aligned
● Cost ~$10K, about $450/GB
● ~20GB in ~8 days

454 GS FLX
● 400bp reads, ~10 hours
● Q20 at 400 bases
● ~1M reads / run, ~400MB high qual
● Homopolymer issue
● Cost ~$14K for 70x75, ~$90K/GB
● So ~1GB in ~1 day

HeliScope
● ~35bp run, ~8 days
● <5% raw error rate
● ~700M reads/run, ~25GB/run
● Cost ?
● So ~25GB in ~8 days
Types of Questions Biologists Ask
●   Want to ask simple questions:
    ●   “What SNVs are in the 5'UTRs of phosphatases?”
    ●   “What frameshifts affect PTEN in lung cancer?”
    ●   “What genes include homozygous, non-synonymous
        SNVs in all ovarian cancers?”
●   Biologists want to see data across samples
●   A database is the natural choice for this data; many examples exist
Impact of Next Gen Sequencing
●   If the human genome gave us the generic blueprints, next gen
    sequencing lets us look at individuals' blueprints
●   Applications:
    ●   Sequence many people, some who have a disease and some
        who do not, and look at mutation patterns
    ●   Sequence an individual's cancer and find distinct druggable
        targets (subtract the normal genome from the tumor genome
        to get the mutations present just in the tumor)
The Big Problem
●   Biologists think about reagents and tumors, not
    hard drives and CPUs
●   The crazy data growth exceeds the growth of all IT
    components
●   Dramatically different models for scalability are
    required
●   We're soon reaching a point where a
    genome will cost less to sequence than it
    does to look at the data!
Increase in Sequencer Output
[Chart: Illumina sequencer output: sequence file sizes per lane (bytes,
log scale) plotted by date from 08/10/06 to 12/27/10]

●   Suggests sequencer output increases by 10x every 2 years!
●   Far outpacing hard drive, CPU, and bandwidth growth
    ●   Moore's Law: CPU power doubles every 2 years
    ●   Kryder's Law: storage quadruples every 2 years

http://genome.wellcome.ac.uk/doc_WTX059576.html
Lowering Costs = Bigger Projects
    ●   Falling costs increase scope of projects
    ●   The Cancer Genome Atlas (TCGA)
          ●   20 tumor types, 200 samples each in 4 years
          ●   Around 4 Petabytes of data
    ●   Once costs fall below $1000, many predict the
        technology will move into clinical applications
          ●   1.5 million new cancer patients in 2010
          ●   1,500 Petabytes per year!?
          ●   4 PB/day vs. 0.2 PB/day at Facebook
http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/
http://seer.cancer.gov/statfacts/html/all.html
So We Need to Rethink Scaling Up
    Particularly for Databases
●   The old ways of growing can't keep up... can't
    just “buy a bigger box”
●   We need to embrace what the Peta-scale
    community is currently doing
●   The Googles, Facebooks, Twitters, etc of the
    world are solving their scalability problems,
    biology needs to learn from their solutions
●   Clusters can be expanded and distributed storage
    used; the least scalable piece is the database that
    biologists need in order to ask questions
Technologies That Can Help
●   Many open source tools designed for Petascale
    ●   Map/Reduce for data processing
    ●   Hadoop for distributed file systems (HDFS) and
        robust process control
    ●   Pig/Hive/etc. for processing unstructured/semi-
        structured data
    ●   HBase/Cassandra/etc. for databasing
●   Could go unstructured but biological data is
    most useful when aggregated and a database
    is extremely good for this
HBase to the Rescue
●   Billions of rows x millions of columns!
●   Focus on random access (vs. HDFS)
●   Table is column oriented, sparse matrix
●   Versioning (timestamps) built in
●   Flexible storage of different data types
●   Splits DB across many nodes transparently
●   Locality of data, I can run map/reduce jobs that
    process the table rows present on a given node
    ●   22M variants processed <1 minute on 5 node cluster
Magically Scalable Databases
●   Talking about distributed databases forces a huge
    shift in what you can and can't do
●   “NoSQL is an umbrella term for a loosely
    defined class of non-relational data stores that
    break with a long history of relational databases
    and ACID guarantees. Data stores that fall
    under this term may not require fixed table
    schemas, and usually avoid join operations.
    The term was first popularized in early 2009.”

              http://en.wikipedia.org/wiki/Nosql
What You Give Up
●   SQL queries
●   Well defined schema, normalized data structure
●   Relationships managed by the DB
●   Flexible and easy indexing of table columns
●   Existing tools that query a SQL database must
    be re-written
●   Certain ACID aspects
●   Software maturity, most distributed NOSQL
    projects are very new
What You Gain
●   Scalability is the clear win, you can have many
    processes on many nodes hit the collection of
    database servers
●   Ability to look at very large datasets and do
    complex computations across a cluster
●   More flexibility in representing information now
    and in the future
●   HBase includes data timestamps/versions
●   Integration with Hadoop
HBase Architecture
[Figure: HBase architecture, from
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]
HBase Tables
[Figures: conceptual view and physical storage view of an HBase table]
HBase APIs
●   Basic API
    ●   Connect to an HBase server like it's a single server
    ●   Lets you iterate over the contents of the DB, with
        flexible filtering of the results
    ●   Can pull back/write key/values easily
●   Map/Reduce source/sink
    ●   Can use HBase tables easily from Map/Reduce
    ●   May be easier/faster just to Map/Reduce than filter
        with the API
    ●   I want to use this more in the future
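To make the basic API concrete, here is a minimal hedged sketch using the HBase 0.20-era client classes mentioned on the installation slide; the table name, column family, and qualifier are hypothetical placeholders rather than SeqWare's actual schema.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BasicHBaseExample {
  public static void main(String[] args) throws Exception {
    // Connect as if talking to a single server; the ZooKeeper quorum is
    // picked up from hbase-site.xml on the classpath.
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "exampleTable"); // hypothetical table name

    // Write a key/value: family "variant", qualifier "genome4" (placeholders).
    Put put = new Put(Bytes.toBytes("chr15:00000123454"));
    put.add(Bytes.toBytes("variant"), Bytes.toBytes("genome4"), Bytes.toBytes("..."));
    table.put(put);
    table.flushCommits();

    // Read a single row back.
    Get get = new Get(Bytes.toBytes("chr15:00000123454"));
    Result row = table.get(get);
    System.out.println(row);

    // Iterate over a range of rows; filters can be attached to the Scan.
    Scan scan = new Scan(Bytes.toBytes("chr15:"), Bytes.toBytes("chr16:"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
  }
}
```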
HBase Installation
●   Install Hadoop first, version 0.20.x
●   I installed HBase version 0.20.1
●   Documented: http://tinyurl.com/23ftbrk
●   Currently have a 7 node Hadoop/HBase cluster
    running at UNC with about 40TB of storage, 56
    CPUs, 168GB RAM
●   RPMs from Cloudera (0.89.20100924)
    http://www.cloudera.com
The SeqWare Project
[Architecture diagram; components summarized below. Fully open source: http://seqware.sf.net]

●   SeqWare LIMS: central portal for users that links to the tools to
    upload samples, trigger analysis, and view results
●   Genome Browser: genome browser and query engine frontend
●   SeqWare Query Engine: high-performance, distributed database and
    web query engine that powers both the browser and interactive
    queries; backed by HBase on Hadoop clusters
●   SeqWare MetaDB: central DB that coordinates all analysis and
    results metadata
●   SeqWare Pipeline: controls analysis on the cluster (SGE clusters for
    big and small data), models analysis workflows for RNA-Seq and
    other NGS experimental designs
●   SeqWare API* (Java/Perl/Python via thrift over the network):
    developer API, savvy users can control all of SeqWare's tools
    programmatically (* future project)
●   Import Daemons: data import tool that facilitates sequence
    delivery to the storage nodes via the network
The SeqWare Query Engine Project
●   SeqWare Query Engine is our HBase database
    for next gen sequence data
[Diagram: Interactive Web Forms and the Genome Browser (the genome
browser and query engine frontend) talk to the SeqWare Query Engine
through a Web API Service; the query engine is a high-performance,
distributed database and web query engine, powering both the browser
and interactive queries, backed by HBase on Hadoop clusters]
Webservice Interfaces
●   RESTful XML Client API or HTML Forms
●   SeqWare Query Engine includes a RESTful web service that
    returns XML describing the variant DBs, a web form for
    querying, and BED/WIG data available via well-defined
    URLs, for example:
  http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN

            track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
            chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
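Since the data comes back over plain HTTP, any client can pull it. A minimal sketch using only java.net, not SeqWare's own client code; the URL is the example from this slide and the output is the BED track shown above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class QueryEngineClient {
  public static void main(String[] args) throws Exception {
    // Ask the query engine for mismatch variants in genome 1, formatted as
    // BED and filtered to those tagged with the gene symbol PTEN.
    String query = "http://server.ucla.edu/seqware/queryengine/realtime"
        + "/variants/mismatches/1?format=bed&filter.tag=PTEN";
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(query).openStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // BED track header followed by variant lines
    }
    in.close();
  }
}
```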
SeqWare QE Architecture Concepts
●   Types: variants, translocations, coverage, consequence, and
    generic “features”
●   Each can be “tagged” with arbitrary key-value pairs, which
    allows for the encoding of a surprising amount of annotation!
    ●   e.g. an Indel on chr1 or an SNV on chr12 tagged with
        nonsynonymous and is_dbSNP | rs10292192, or a chr3-chr12
        translocation tagged with gene_fusion
●   Fundamental query is a list of objects filtered by these “tags”
Requirements for Query Engine Backend
The backend must:
 –   Represent many types of data
 –   Support a rich level of annotation
 –   Support very large variant databases
     (~3 billion rows x thousands of columns)
 –   Be distributed across a cluster
 –   Support processing, annotating, querying &
     comparing samples (variants, coverage,
     annotations)
 –   Support a crazy growth of data
HBase Query Engine Backend
●   Stores variants, translocations, coverage info,
    coding consequence reports, annotations as
    “tags”, and generic “features”
●   Common interface, uses HBase as backend
    ●   Create a single database for all genomes
    ●   Database is auto-sharded across cluster
    ●   Can use Map/Reduce and other projects to do
        sophisticated queries
    ●   Performance seems very good!
HBase SeqWare Query Engine
[Architecture diagram, left to right: MapReduce, Backend, HBase API, Webservice, client]

●   MapReduce: ETL Map jobs and an ETL Reduce job extract,
    transform, &/or load data in parallel
●   Backend: HBase Region Servers plus the HMaster, i.e. HBase on
    HDFS as the variant & coverage database system
●   HBase API: analysis nodes and web nodes handle querying &
    loading, processing queries via the API
●   Webservice: a RESTful web service (plus the MetaDB) serves
    BED/WIG files and XML metadata to the client
Underlying HBase Tables
hg18Table (column family:label)
 key                 variant:genome4   variant:genome7   coverage:genome7
 chr15:00000123454   byte[]            byte[]            byte[]
 (each cell is a Variant object serialized to a byte array)

Genome1102NTagIndexTable
 key                                                             rowId
 is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123    byte[]
 (queries look up by tag, then filter the variant results)

Database on filesystem (HDFS)
 key                 timestamp   column:variant
 chr15:00000123454   t1          genome7 = byte[]
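To make the row-key layout concrete, here is a hedged sketch of a tag-index lookup followed by a variant fetch against these tables. The table and family names come from the slide, but the key parsing and the deserialization step are illustrative assumptions, not SeqWare's actual implementation.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TagIndexLookup {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable index = new HTable(conf, "Genome1102NTagIndexTable");
    HTable variants = new HTable(conf, "hg18Table");

    // Secondary-index scan: all index rows whose key starts with the tag.
    // Index keys look like tag|<build>.<contig>.<position>.<type>.<genome>...
    Scan tagScan = new Scan(Bytes.toBytes("is_dbSNP|"), Bytes.toBytes("is_dbSNP|~"));
    ResultScanner hits = index.getScanner(tagScan);
    for (Result hit : hits) {
      // The index key encodes where the variant lives; rebuild the hg18Table
      // row key from the contig and zero-padded position (parsing scheme
      // here is illustrative).
      String[] parts = Bytes.toString(hit.getRow()).split("\\|")[1].split("\\.");
      String rowKey = parts[1] + ":" + parts[2]; // e.g. chr15:00000123454

      Get get = new Get(Bytes.toBytes(rowKey));
      get.addColumn(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
      Result row = variants.get(get);
      byte[] serialized = row.getValue(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
      // serialized holds the Variant object as a byte array; deserialize it
      // with the application's codec and apply any remaining filters here.
    }
    hits.close();
  }
}
```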
Backend Performance Comparison

[Chart: Pileup load time for sample 1102N, HBase vs. Berkeley DB;
variants loaded (0 to ~8M) plotted against time in seconds (0 to ~18,000)
for each backend]
Backend Performance Comparison

[Chart: BED export time for sample 1102N, HBase API vs. Map/Reduce vs.
BerkeleyDB; variants dumped (0 to ~8M) plotted against time (0 to ~7,000)
for each method]
Querying with Map/Reduce
Find all variants seen in ovarian cancer genomes tagged as “frameshift”
●   Each node runs a map over the variant rows it holds locally
    (Chr1 variants on node 1, Chr2 on node 2, Chr3 on node 3, ...),
    binning a variant if the genome is ovarian and the variant is
    tagged frameshift; the reduce then prints the variant
    information for each item in the bin (a sketch of such a job
    follows this slide)
●   Finding variants with a tag, or comparing variants at the same
    genomic position, is very efficient
●   Problem: overlapping features that do not start at the same
    genomic location (see below)
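As promised above, a hedged sketch of what such a job might look like with the TableMapper/TableMapReduceUtil classes of that HBase era. The table name and column family follow the earlier slides; isOvarianFrameshift() is a stand-in for SeqWare's own variant deserialization and tag/sample checks, not real SeqWare code.

```java
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FrameshiftQuery {

  // The map runs on whichever region server holds the rows (data locality):
  // for each row, look at every genome in the "variant" family and bin the
  // variant if its genome is ovarian and it carries the "frameshift" tag.
  static class FrameshiftMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      NavigableMap<byte[], byte[]> genomes = row.getFamilyMap(Bytes.toBytes("variant"));
      if (genomes == null) {
        return;
      }
      for (Map.Entry<byte[], byte[]> genome : genomes.entrySet()) {
        // isOvarianFrameshift() stands in for deserializing the Variant
        // byte[] and checking its tags and the sample's disease type.
        if (isOvarianFrameshift(genome.getKey(), genome.getValue())) {
          ctx.write(new Text(Bytes.toString(rowKey.get())),
                    new Text(Bytes.toString(genome.getKey())));
        }
      }
    }

    private boolean isOvarianFrameshift(byte[] genomeLabel, byte[] serializedVariant) {
      return true; // illustrative stub
    }
  }

  // The reduce simply prints the binned variant information per position.
  static class PrintReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text position, Iterable<Text> variants, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : variants) {
        ctx.write(position, v);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "frameshift-query");
    job.setJarByClass(FrameshiftQuery.class);
    TableMapReduceUtil.initTableMapperJob(
        "hg18Table", new Scan(), FrameshiftMapper.class, Text.class, Text.class, job);
    job.setReducerClass(PrintReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}
```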
Problematic Querying Map/Reduce
●   Hard to do overlap queries, which are really important for
    biological DBs
●   In the genome: Indel 1 (chr1:1200-1500) and Indel 2
    (chr1:1400-1700) overlap each other, and both overlap SNV 1
    and SNV 2 at ~chr1:1450
●   In the DB the rows are keyed by start position:
    chr1:000001200 → Indel 1 byte[]
    chr1:000001400 → Indel 2 byte[]
    chr1:000001450 → SNV 1 byte[], SNV 2 byte[]
●   Here, if I did a Map/Reduce query grouping on the row key,
    only SNV 1 and SNV 2 would “overlap”, as sketched below
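To spell out the mismatch, a tiny sketch using the positions from this example; overlaps() is just an illustrative interval test, not SeqWare code.

```java
public class OverlapExample {
  // True interval overlap: two features overlap if each starts before the
  // other ends (half-open coordinates).
  static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
    return aStart < bEnd && bStart < aEnd;
  }

  public static void main(String[] args) {
    // Features from the slide: two indels plus two SNVs at position 1450.
    System.out.println(overlaps(1200, 1500, 1400, 1700)); // Indel 1 vs Indel 2: true
    System.out.println(overlaps(1200, 1500, 1450, 1451)); // Indel 1 vs SNV 1: true
    // But grouping rows by start-position key only pairs features with the
    // exact same key, so only SNV 1 and SNV 2 (both at chr1:000001450)
    // would be reported as "overlapping" by a naive per-key reduce.
    System.out.println("chr1:000001200".equals("chr1:000001450")); // false
  }
}
```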
SeqWare Query Engine Status
●   Open source, you can try it now!
●   Both BerkeleyDB & HBase backends
●   Multiple genomes stored in the same table, very
    Map/Reduce compatible for SNVs
●   Basic secondary indexing for “tags”
●   API used for queries via Webservice
●   Prototype Map/Reduce examples including
    “somatic” mutation detection in paired
    normal/cancer samples
SeqWare Query Engine Future
●   Dynamically building R-tree indexes or
    Nested Containment Lists with Map/Reduce (see
    “Experiences on Processing Spatial Data with
    MapReduce” by Cary et al.)
●   Looking at using Katta or Solr for indexing free
    text data such as gene descriptions, OMIM
    entries, etc.
●   Queries across samples with simple logic
●   More testing, pushing our 7 node cluster,
    finding the max number of genomes this cluster
    can support
Final Thoughts
●   Era of Big Data for Biology is here!
●   There are CPU-bound problems, no doubt, but as short
    reads become long reads and prices per GBase drop, the
    problems seem to shift to handling/mining data
●   Tools designed for Peta-scale datasets are key
[Photo: Yahoo's Hadoop cluster]
For More Information
●   http://seqware.sf.net
●   http://hadoop.apache.org
●   http://hbase.apache.org
●   Brian O'Connor <briandoconnor@gmail.com>
●   Article (12/21/2010):
    SeqWare Query Engine: storing and searching sequence
    data in the cloud
    Brian D O’Connor, Barry Merriman and Stanley F Nelson
    BMC Bioinformatics 2010, 11(Suppl 12)
●   We have job openings!!
