SeqWare Query Engine
Storing & Searching Sequence Data in the Cloud

              Brian O'Connor
       UNC Lineberger Comprehensive
              Cancer Center

                    BOSC
                July 9th, 2010
SeqWare Query Engine
●   Want to ask simple questions:
    ●   “What SNVs are in 5'UTR of phosphatases?”
    ●   “What frameshift indels affect PTEN?”
    ●   “What genes include homozygous, non-synonymous SNVs?”
●   SeqWare Query Engine created to query data
    ●   RESTful Webservice
    ●   Scalable/Queryable Backend
Variant Annotation with SeqWare

[Pipeline diagram] SeqWare Pipeline: Whole Genome/Exome reads → Alignment (BAM)
→ Variant Calling (pileup) → Consequence annotation (using dbSNP). The resulting
Variant, Coverage, & Consequence data pass through a Backend Interface into the
SeqWare Query Engine Backend (HBase or Berkeley DB stores), and the SeqWare
Query Engine Webservice (RESTlet) serves them as WIG and BED.
Webservice Interfaces
   RESTful XML Client API                          or                 HTML Forms




SeqWare Query Engine includes a RESTful web service that returns XML
describing the variant databases, an HTML form interface for building
queries, and BED/WIG output available via well-defined URLs, for example
(a minimal client sketch follows the example output):
  http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN

            track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009"
            chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
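
As a rough illustration (not part of the original slides), any HTTP client can
pull the BED output from a URL like the one above; this minimal Java sketch uses
only java.net classes, reuses the example PTEN query, and the class name is
hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BedFetchExample {
        public static void main(String[] args) throws Exception {
            // Example query from the slide: PTEN-tagged mismatches, returned as BED
            URL url = new URL("http://server.ucla.edu/seqware/queryengine/realtime/"
                    + "variants/mismatches/1?format=bed&filter.tag=PTEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is either the BED track header or one variant record
                System.out.println(line);
            }
            in.close();
            conn.disconnect();
        }
    }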
Loading in Genome Browsers
SeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers.
Requirements for Query
        Engine Backend
The backend must:
 –   Represent many types of data
 –   Support a rich level of annotation
 –   Support very large variant databases
     (~3 billion rows x thousands of columns)
 –   Be distributed across a cluster
 –   Support processing, annotating, querying &
     comparing samples (variants, coverage,
     annotations)
 –   Support a crazy growth of data
Increase in Sequencer Output

[Chart: Illumina Sequencer Output, Sequence File Sizes Per Lane (Nelson Lab,
UCLA). File size in bytes (log scale) vs. date, 08/10/06 through 12/27/10.]

Suggests Sequencer Output Increases by 5-10x Every 2 Years!
Far outpacing hard drive, CPU, and bandwidth growth
HBase to the Rescue?
●   Billions of rows x millions of columns!
●   Focus on random access (vs. HDFS); see the scan sketch below
●   Table is column oriented, sparse matrix
●   Versioning (timestamps) built in
●   Flexible storage of different data types
●   Splits DB across many nodes transparently
●   Locality of data, I can run map/reduce jobs that
    process the table rows present on a given node
    ●   22M variants processed <1 minute on 5 node cluster
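
To make the random-access point above concrete, here is a minimal, hypothetical
sketch against the classic HBase client API (not SeqWare code); it assumes the
row-key and column layout shown on the next slide, i.e. keys of the form
chr:zero-padded-position and a "variant" column family:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionScanExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "hg18Table");

            // Row keys sort as chr:zero-padded-position, so a byte-ordered scan
            // over [start, stop) returns exactly one genomic interval
            Scan scan = new Scan(Bytes.toBytes("chr15:00000100000"),
                                 Bytes.toBytes("chr15:00000200000"));
            scan.addColumn(Bytes.toBytes("variant"), Bytes.toBytes("genome7"));

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    byte[] serialized = row.getValue(Bytes.toBytes("variant"),
                                                     Bytes.toBytes("genome7"));
                    // a real client would deserialize the Variant object here
                    System.out.println(Bytes.toString(row.getRow()) + " -> "
                            + (serialized == null ? 0 : serialized.length) + " bytes");
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }

A full-table pass (the 22M-variant example above) would instead hand an
unbounded Scan to a map/reduce job, as sketched after the Map/Reduce querying
slide below.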
Underlying HBase Tables

hg18Table (column names are family:label):

 key                  variant:genome4   variant:genome7   coverage:genome7
 chr15:00000123454    byte[]            byte[]            byte[]

 (each cell holds a Variant object serialized to a byte array)

Genome1102NTagIndexTable (secondary index on tags):

 key                                                              rowId:
 is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123     byte[]

 (queries look up by tag, then filter the variant results; see the client
 API sketch below)

Database on filesystem (HDFS):

 key                  timestamp   column:variant
 chr15:00000123454    t1          genome7 = byte[]
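
A minimal sketch of writing and reading this layout with the classic HBase
client API; the table names, families, and key formats come from the slide
above, but the index-table qualifier and the serialization stand-in are
assumptions, and this is illustrative rather than SeqWare's actual code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VariantTableExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Main table: one row per position, one column per genome under the
            // "variant"/"coverage" families, cell value = serialized Variant object
            HTable hg18 = new HTable(conf, "hg18Table");
            byte[] rowKey = Bytes.toBytes("chr15:00000123454");
            byte[] serializedVariant = new byte[0]; // stand-in for a real serialized Variant
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("variant"), Bytes.toBytes("genome4"), serializedVariant);
            hg18.put(put);

            // Tag index table: tag plus full variant coordinates form the row key;
            // the value points back at the main-table row (qualifier name assumed)
            HTable tagIndex = new HTable(conf, "Genome1102NTagIndexTable");
            Put indexPut = new Put(Bytes.toBytes(
                    "is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123"));
            indexPut.add(Bytes.toBytes("rowId"), Bytes.toBytes(""), rowKey);
            tagIndex.put(indexPut);

            // Direct lookup of one position for one genome
            Get get = new Get(rowKey);
            get.addColumn(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
            Result result = hg18.get(get);
            byte[] value = result.getValue(Bytes.toBytes("variant"), Bytes.toBytes("genome4"));
            System.out.println("variant bytes: " + (value == null ? 0 : value.length));

            tagIndex.close();
            hg18.close();
        }
    }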
HBase API & Map/Reduce Querying
●   HBase API
    ●   Powers the current Backend & Webservice
    ●   Provides a familiar API, scanners, iterators, etc
    ●   Backend written using this; retrieves variants by tags
    ●   Distributed database, but queries via the API run in a single thread
●   Prototype somatic mutations by Map/Reduce
    ●   Every row is examined, variants in tumor not in
        normal are retrieved
    ●   Map/Reduce jobs run on node with local data
    ●   Highly parallel & faster than the single-threaded API (sketch below)
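
As a rough sketch only (class names, genome column names, and output handling
are assumptions, not the actual prototype), a map-only HBase TableMapper job
expressing the tumor-vs-normal comparison above could look like this:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SomaticMutationMR {

        // The mapper runs where the rows live (data locality): emit positions where
        // the tumor genome has a variant but the matched normal does not
        static class SomaticMapper extends TableMapper<Text, Text> {
            private static final byte[] VARIANT = Bytes.toBytes("variant");
            private static final byte[] TUMOR   = Bytes.toBytes("genome7"); // assumed tumor column
            private static final byte[] NORMAL  = Bytes.toBytes("genome4"); // assumed normal column

            @Override
            protected void map(ImmutableBytesWritable row, Result columns, Context context)
                    throws IOException, InterruptedException {
                byte[] tumor  = columns.getValue(VARIANT, TUMOR);
                byte[] normal = columns.getValue(VARIANT, NORMAL);
                if (tumor != null && normal == null) {
                    context.write(new Text(Bytes.toString(row.get())),
                                  new Text("candidate somatic variant"));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "somatic-mutation-scan");
            job.setJarByClass(SomaticMutationMR.class);

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("variant"));
            scan.setCaching(500);        // fetch rows in batches per RPC
            scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

            TableMapReduceUtil.initTableMapperJob("hg18Table", scan,
                    SomaticMapper.class, Text.class, Text.class, job);
            job.setNumReduceTasks(0);    // map-only: write candidates straight to HDFS
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/somatic-candidates"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }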
SeqWare Query Engine on HBase

[Architecture diagram] MapReduce ETL Map jobs on the Data Nodes and an ETL
Reduce job on the Name Node extract, transform, &/or load data in parallel into
the HBase-on-HDFS Variant & Coverage Database System. Analysis/Web Nodes handle
querying & loading via the HBase API and back a RESTful Web Service, which
combines Variant/Coverage data with metadata from the MetaDB and returns
BED/WIG files and XML metadata to clients.
Status of HBase Backend
●   Both BerkeleyDB & HBase, Relational soon
●   Multiple genomes stored in the same table,
    very Map/Reduce compatible
●   Basic secondary indexing for “tags”
●   API used for queries via Webservice
●   Prototype Map/Reduce example for “somatic”
    mutation detection in paired normal/cancer
    samples
●   Currently loading 1102 normal/tumor (GBM)
Backend Performance Comparison

[Chart: Pileup Load Time, 1102N: HBase vs. Berkeley DB. Variants loaded
(0 to 8,000,000) vs. time in seconds (0 to 18,000); series: load time bdb,
load time hbase.]
Backend Performance Comparison

[Chart: BED Export Time, 1102N: HBase API vs. M/R vs. BerkeleyDB. Variants
exported (0 to 8,000,000) vs. time (0 to 7,000); series: dump time bdb,
dump time hbase, dump time m/r.]
HBase/Hadoop Have Potential!
●   Era of Big Data for Biology is here!
●   CPU-bound problems remain, no doubt, but as short reads become long
    reads and the price per gigabase drops, the problems shift toward
    handling and mining the data
●   Tools designed for Peta-scale datasets are key
Next Steps
●   Model other datatypes: copy number, RNAseq
    gene/exon/splice junction counts, isoforms etc
●   Focus on porting analysis/querying to Map/Reduce
●   Indexing beyond “tags” with Katta (distributed Lucene)
●   Push scalability, what are the limits of an 8 node
    HBase/Hadoop cluster?
●   Look at Cascading, Pig, Hive, etc as advanced
    workflow and data mining tools
●   Standards for Webservice dialect (DAS?)
●   Exposing Query Engine through GALAXY
Acknowledgments

      UCLA                 UNC
   Jordan Mendler      Sara Grimm
   Michael Clark       Matt Soloway
   Hane Lee            Jianying Li
   Bret Harry          Feri Zsuppan
   Stanley Nelson      Neil Hayes
                        Chuck Perou
                        Derek Chiang
Resources
●   HBase & Hadoop: http://hadoop.apache.org
●   When to use HBase:
    http://blog.rapleaf.com/dev/?p=26
●   NOSQL presentations:
    http://blog.oskarsson.nu/2009/06/nosql-debrief.html
●   Other DBs: CouchDB, Hypertable, Cassandra,
    Project Voldemort, and more...
●   Data mining tools: Pig and Hive
●   SeqWare: http://seqware.sourceforge.net
●   briandoconnor@gmail.com
Extra Slides
Overview
●   SeqWare Query Engine background
●   New tools for combating the data deluge
●   HBase/Hadoop in SeqWare Query Engine
    ●   HBase for backend
    ●   Map/Reduce & HBase API for webservice
●   Better performance and scalability?
●   Next steps
SeqWare Query Engine:
Lustre Filesystem

[Architecture diagram] Per-genome BerkeleyDB Variant & Coverage Databases sit
on a Lustre filesystem; Analysis/Web Nodes process queries against them and
back a RESTful Web Service, which combines Variant/Coverage data with metadata
from the MetaDB and returns BED/WIG files and XML metadata to clients.
More
●   Details on API vs. M/R
●   Details on XML Restful API & web app including
    loading in UCSC browser
●   Details on generic store object (BerkeleyDB,
    HBase, and Relational at Renci)
●   Byte serialization from BerkeleyDB, custom
    secondary key creation
Pressures of Sequencing
●   A lot of data (50GB SRF file, 150GB alignment
    files, 60GB variants for a 20x human genome)
●   PostgreSQL (2xquad core, 64GB RAM) died
    with the Celsius schema (microarray database)
    after loading ~1 billion rows
●   Needs to be processed, annotated, and
    queryable & comparable (variants, coverage,
    annotations)
●   ~3 billion rows x thousands of columns
●   COMBINE WITH PREVIOUS SLIDE
Thoughts on BerkeleyDB
●   BerkeleyDB let me:
    ●   Create a database per genome, independent from a
        single database daemon
    ●   Provision database to cluster
    ●   Adapt to key-value database semantics (sketch below)
●   Limitations:
    ●   Creation on single node only
    ●   Not inherently distributed
    ●   Performance issues with big DBs, high I/O wait
●   Google to the rescue?
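
For reference, the Berkeley DB Java Edition pattern described above (one
environment and database per genome, plain key-value puts and gets, no database
daemon) looks roughly like this; the path and class name are made up for
illustration:

    import java.io.File;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class BerkeleyDbGenomeStore {
        public static void main(String[] args) throws Exception {
            // One self-contained environment per genome; the directory must already exist
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("/data/genome4_db"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "variants", dbConfig);

            // Key-value semantics: genomic position -> serialized Variant object
            DatabaseEntry key = new DatabaseEntry("chr15:00000123454".getBytes("UTF-8"));
            DatabaseEntry value = new DatabaseEntry(new byte[0]); // stand-in for a serialized Variant
            db.put(null, key, value);

            DatabaseEntry found = new DatabaseEntry();
            OperationStatus status = db.get(null, key, found, LockMode.DEFAULT);
            if (status == OperationStatus.SUCCESS) {
                System.out.println("variant bytes: " + found.getData().length);
            }

            db.close();
            env.close();
        }
    }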
HBase Backend
●   How the table(s) are actually structured
    ●   Variants
    ●   Coverage
    ●   Etc
●   How I do indexing currently (similar to indexing
    feature extension)
    ●   Multiple secondary indexes
Frontend
●   RESTlet API
●   What queries can you do?
    ●   Examples
    ●   URLs
●   Potential for swapping out generic M/R for
    many of these queries (less reliance on indexes
    which will speed things up as DB grows)
Ideas for a distributed future
●   Federated Dbs/datastores/clusters for
    computation rather than one giant datacenter
●   Distribute software not data
Potential Questions
●   How big is the DB to store whole human
    genome?
●   How long does it take to M/R 3 billion positions
    on 5 node cluster?
●   How does my stuff compare to other bioinf
    software? GATK, Crossbow, etc
●   How did I choose HBase instead of Pig, Hive,
    etc?
Current Prototyping Work
●   Validate creation of U87 (genome resequencing
    at 20x) genome database
    ●   SNVs
    ●   Coverage
    ●   Annotations
●   Test fast querying of record subsets
●   Test fast processing of whole DB using
    MapReduce
●   Test stability, fault-tolerance, auto-balancing,
    and deployment issues along the way
What About Fast Queries?
●   I'm fairly convinced I can create a distributed
    HBase database on a Hadoop cluster
●   I have a prototype HBase database running on
    two nodes
●   But HBase shines when bulk processing DB
●   Big question is how to make individual lookups
    fast
●   Possible solution is HBase+Katta for indexes
    (distributed Lucene)
SeqWare Query Engine
Lustre Filesystem

[Architecture diagram] Per-genome BerkeleyDB Variant & Coverage Databases sit
on a Lustre filesystem; Analysis/Web Nodes (8 CPU, 32GB RAM each) process
queries and back a RESTful Web Service, which combines Variant/Coverage data
with annotations (hg18) from an Annotation Database and returns BED/WIG files
and DASXML to clients.
How Do We Scale Up the QE?
●   Sequencers are increasing output by a factor of
    10 every two years!
●   Hard drives: 4x every 2 years
●   CPUs: 2x every 2 years
●   Bandwidth: 2x every 2 years (really?!)
●   So there's a huge disconnect, can't just throw
    more hardware at a single database server!
●   Must look for better ways to scale
Google to the Rescue?
●   Companies like Google, Amazon, Facebook,
    etc have had to deal with massive scalability
    issues over the last 10+ years
●   Solutions include:
    ●   Frameworks like MapReduce
    ●   Distributed file systems like HDFS
    ●   Distributed databases like HBase
●   Focus here on HBase
What Do You Give Up?
●   SQL queries
●   Well defined schema, normalized data structure
●   Relationships managed by the DB
●   Flexible and easy indexing of table columns
●   Existing tools that query a SQL database must
    be re-written
●   Certain ACID aspects
●   Software maturity, most distributed NOSQL
    projects are very new
What Do You Gain?
●   Scalability is the clear win, you can have many
    processes on many nodes hit the collection of
    database servers
●   Ability to look at very large datasets and do
    complex computations across a cluster
●   More flexibility in representing information now
    and in the future
●   HBase includes data timestamps/versions
●   Integration with Hadoop
SeqWare Query Engine
Hadoop Map Reduce + Hadoop HDFS

[Architecture diagram] MapReduce ETL Map jobs on the Data Nodes and an ETL
Reduce job on the Name Node extract, transform, & load data in parallel into
the HBase Variant & Coverage Database System on HDFS. Analysis/Web Nodes
process queries and back a RESTful Web Service, which combines Variant/Coverage
data with annotations (hg18) from an Annotation Database and returns BED/WIG
files and XML metadata to clients.
What an HBase DB Looks Like

A record in my HBase (column names are family:label):

 key                  variant:genome4   variant:genome7   coverage:genome7
 chr15:00000123454    byte[]            byte[]            byte[]

 (Variant object serialized to a byte array)

Database on filesystem (HDFS):

 key                  timestamp   column:variant
 chr15:00000123454    t1          genome7 = byte[]
Scalability and BerkeleyDB
●   BerkeleyDB let me:
    ●   Create a database per genome, independent from a
        single database daemon
    ●   Provision database to cluster for distributed analysis
    ●   Adapt to key-value database semantics with nice API
●   Limitations:
    ●   Creation on single node only
    ●   Want to query easily across genomes
    ●   Databases are not distributed
    ●   I saw performance issues, high I/O wait
Would 2,000 Genomes Kill SQL?
●   Say each genome has 5M variants (not counting
    coverage!)
●   5M variant rows x 2,000 genomes = 10 billion rows
●   Our DB server running PostgreSQL (2xquad core,
    64GB RAM) died with the Celsius (Chado)
    schema after loading ~1 billion rows
●   So maybe conservatively we would have issues
    with 150+ genomes
●   That threshold is probably 1 year away with public
    datasets available via SRA, 1000 genomes, TCGA
Related Projects
My Abstract
●   backend/frontend
●   Traverse and query with Map/Reduce
●   Java web service with RESTlet
●   Deployment on 8 node cluster
Background on Problem
●   Why abandon PostgreSQL/MySQL/SQL?
    ●   Experience with Celsius...
●   What you give up
●   What you gain
First Solution: BerkeleyDB
●   Good:
    ●   key/value data store
    ●   Easy to use
    ●   Great for testing
●   Bad:
    ●   Not performant for multiple genomes
    ●   Manual distribution across cluster
    ●   Annoying phobia of shared filesystems
Sequencers vs. Information Technology
●   Sequencers are increasing output by a factor of
    10 every two years!
●   Hard drives: 4x every 2 years
●   CPUs: 2x every 2 years
●   Bandwidth: 2x every 2 years (really?!)
●   So there's a huge disconnect, can't just throw
    more hardware at a single database server!
●   Must look for better ways to scale
●   What are we doing, what are the challenges. Big picture of
    the project (webservice, backend etc)
●   How did people solve this problem before? How did I
    attempt to solve this problem? Where did it break down?
●   “New” approach, looking to Google et al for scalability for
    big data problems
●   What is HBase/Hadoop & what do they provide?
●   How did I adapt HBase/Hadoop to my problem?
●   Specifics of implementation: overall flow, tables, query
    engine search (API), example M/R task
●   Is this performant, does this scale? Can I get
    billion x million? Fast arbitrary retrieval?
●   Next steps: more data types, focus on M/R for analytical
    tasks, focus on Katta for rich querying, push scalability w/
    8 nodes (test with genomes), look at Cascading & other
    tools for data mining
