Cascading and BigData Problems




       Chris K Wensel
       Concurrent, Inc.
                          Copyright Concurrent, Inc. 2011. All rights reserved.
About Me
•   Concurrent, Inc., Founder
     • Cascading support and tools
     • http://concurrentinc.com/

•   Cascading, Lead Developer (started Sept 2007)
     •  An alternative API to MapReduce
     •  http://cascading.org/

•   Formerly Hadoop mentoring and training
     •  Sun - Apple - HP - LexisNexis - startups - etc

•   Formerly Systems Architect & Consultant
     •  Thomson/Reuters - TeleAtlas - startups - etc
Overview

• Case Studies
• What’s in common?
• Where does Hadoop fit?
• Processing vs Innovation

Case Studies

• ShareThis
• BestBuy
• FlightCaster
• Etsy
• Ion Flux
Summary
• All running in production with Hadoop
• All use AWS, most use Elastic MapReduce
• All production processing was implemented in
  Cascading

• Various other tools used at different stages of
  development

ShareThis

• Cascading + AWS (pre-EMR)
• Daily event log processing, initially multiple
  TB and growing
• Details in the O’Reilly Hadoop book by
  Tom White


Lessons
[Diagram: logprocessor (every X hrs) ... crawler (every Y hrs) ... indexer (on crawl completion)]




• Mark data as bad and why, never discard
 • useful for upstream debugging
• Data is seasonal, cyclical, and bursty
• Tune your app and cluster to the workload
• (garbage collect Hadoop clusters)
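
The first lesson, marking data as bad and recording why instead of discarding it, might be sketched as follows. The record layout (tab-separated timestamp, user id, URL) and the names `Event`, `parse_event`, and `process` are assumptions made for illustration; they are not the log format or code used at ShareThis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    raw: str                          # original line, always retained
    fields: Optional[dict] = None     # parsed values, only set for good records
    bad_reason: Optional[str] = None  # None means the record is good


def parse_event(line: str) -> Event:
    """Parse one tab-separated log line; never raise, never drop."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return Event(raw=line, bad_reason=f"expected 3 fields, got {len(parts)}")
    timestamp, user_id, url = parts
    if not timestamp.isdigit():
        return Event(raw=line, bad_reason="timestamp is not an integer")
    if not url.startswith(("http://", "https://")):
        return Event(raw=line, bad_reason="url has no http(s) scheme")
    return Event(raw=line, fields={"ts": int(timestamp), "user": user_id, "url": url})


def process(lines):
    """Split input into good and bad streams; bad records keep their reason."""
    good, bad = [], []
    for line in lines:
        event = parse_event(line)
        (bad if event.bad_reason else good).append(event)
    return good, bad
```

Because the bad stream keeps both the raw line and the reason, it can be written next to the good output and inspected later when debugging whatever produced it.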
BestBuy - Behavioral Ad-Targeting

• Cascading + AWS (Elastic MapReduce)
• Daily automated User Behavior Segmentation
• 6 wks dev, 3 TB/day, $13k/mo
• 500% increase in return on ad spend from a
  similar campaign a year before
• http://aws.amazon.com/solutions/case-studies/razorfish/
Cluster
[Diagram: an E-Commerce Site supplies input to S3. Inside Amazon Web Services, the behavior app runs on Elastic MapReduce Slaves (Map/Reduce over HDFS), writes output to S3, and feeds the Ad System.]




•   200+ nodes, 9-12 hour runs
•   30+ days of history + 3TB daily
•   Remote HTTP update of ad-server
    • of only changed data
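
Updating the ad server with only changed data amounts to diffing the current run's output against the previous run's. A minimal sketch, assuming each run yields a mapping from user id to segment; the function name `changed_records` and the sample ids are hypothetical, and the HTTP call itself is left out.

```python
_MISSING = object()  # sentinel so that a stored value of None is still comparable


def changed_records(previous: dict, current: dict) -> dict:
    """Return the upserts and deletes needed to move `previous` to `current`."""
    upserts = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    deletes = sorted(key for key in previous if key not in current)
    return {"upserts": upserts, "deletes": deletes}


yesterday = {"u1": "gamer", "u2": "audiophile", "u3": "home-theater"}
today = {"u1": "gamer", "u2": "photographer", "u4": "gamer"}
delta = changed_records(yesterday, today)
# delta["upserts"] holds u2 and u4; delta["deletes"] holds u3; u1 is untouched
```

Only `delta` would need to travel to the remote system, which keeps the update small even when the full result set is large.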
Road Blocks
• No one really understood the data
 • Character formats (UTF-8 vs ...)
 • Zero byte chars
 • Unique columns not unique
 • Outliers in the data
• Creating test data
• QAing the data
 • result data was also big
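
Several of these road blocks can be surfaced early with a cheap profiling pass over a sample of the raw data. The sketch below is an assumption about how such a check might look, not the project's code: it counts rows that are not valid UTF-8, rows containing zero bytes, and repeated values in a column that is expected to be unique. The tab-separated layout and the `key_index` parameter are hypothetical.

```python
from collections import Counter


def profile_rows(raw_rows, key_index=0, delimiter=b"\t"):
    """Scan raw byte rows and tally common data-quality problems."""
    problems = Counter()
    seen_keys = Counter()
    for raw in raw_rows:
        if b"\x00" in raw:
            problems["zero_byte"] += 1
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            problems["not_utf8"] += 1
        fields = raw.rstrip(b"\n").split(delimiter)
        if key_index < len(fields):
            seen_keys[fields[key_index]] += 1
        else:
            problems["missing_key"] += 1
    # A "unique" column is only unique if every key was seen exactly once.
    problems["duplicate_key"] = sum(n - 1 for n in seen_keys.values() if n > 1)
    return dict(problems)
```

Working on bytes instead of decoded strings is deliberate: the decode step is one of the things being tested.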
FlightCaster - Predicting Flight Delays
• Clojure + Cascading + AWS
• Scours data on every domestic flight for
  the past 10 years and matches it to
  real-time conditions

• Machine learning on Cascading, Scoring on
  app server

• 3 mos dev, 10 GB/day, <$2k/mo
Lessons

• Even with a good abstraction, you must intuit the
  underlying model (MapReduce) to improve
  throughput

• i.e. Logical vs Physical plans
 • we still need DBAs after decades of query
    planner dev


Etsy - Online Marketplace

• JRuby + Cascading + AWS
• 1B page views and multiple TB of log data per month
• 40-50 cascading.jruby jobs a night
• http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/
• http://www.concurrentinc.com/casestudies/etsy
Initially

• JRuby for the ‘analysts’
• Log pre-processing
• DB snapshot diffs
• Nightly and ad-hoc analytics

Data Driven Products
• Search index/scoring (under dev)
• Taste Test
• Facebook gift recommender
• Suggested shops
• Top query list, etc...
• Many more on the way
Ion Flux - Gene Sequencing

• Cascading + AWS
• Sequence Alignment
• http://aws.amazon.com/solutions/case-studies/ion-flux/



Cluster

• 10-30 nodes, using new HPC instances
• 200-500 cores
• runs up to 50 hours

Architecture

[Diagram: end-to-end sequencing workflow. Component labels recovered from the slide, roughly in flow order; grouping is inferred from the slide layout.]
• Clinical Lab: DNA Sample, Sample Prep, Ion Torrent Chip
• Ion Torrent - Torrent Sequencer: Measure DNA, RAW Data File, FTP Client
• Ion Torrent - Torrent Server: FTP Server, RAW Data File, Basecalling, FastQ Sequence File, FTP Server
• Ion Flux - Flux Capacitor: FTP Client, FastQ Sequence File, Split File, FastQ Sequence Chunk, Compress, Compressed Sequence File, Transfer Agent
• Ion Flux - LIMS: Transfer Agent, Cloud Input, LIMS ReST Server (EC2), Chip Metadata, LIMS Database (RDS)
• Ion Flux - Pipeline Controller: Wait, Upload Complete?, Start Pipeline
• AWS - S3 Storage: Software & Data, FastQ Sequence Chunks, Performance Data, PILEUP Variants
• Ion Flux - Sequencing Pipeline (AWS - EMR Cluster): Create Cluster, Bootstrap Cluster Nodes, Start Node Profiler, Configure Pipeline, TMAP, SAM Alignments, Sort by position, Sorted SAM Alignments, Split to Bins, SAM Alignment Bins, SRMA, Corrected SAM Alignments, PILEUP, PILEUP Variants, Performance Data, Cluster Cleanup, Shutdown Cluster
• Ion Flux - Annotation Server (EC2): Annotation Database (RDS), Annotation ReST Server
• Ion Flux - Variant Server (EC2): Variants ReST Server, Variant Database (RDS)
• Ion Flux - Client Website (EC2): Variant Report, Complete Runs
• External Partners; Third Party Clients (Client App)
• Region labels: Heavy Lifting, Cascading, Delivery
Common Architecture
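[Diagram: loggers emit raw data, which passes through intermediate data (marked "?") to become valuable data. Roles shown beneath: Producer, Analyst / Developer, Consumer. The bottom of the diagram is labelled Value.]
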


•   New data continuously arriving
•   Actively incorporating the new with the old
•   Updating backend systems
Common Constraints

• Speed of light
• Understanding the data
• Creating tests and validating the results
• Lifecycle phases have different environments
 • dev vs. integration vs. prod
• Better algorithms, less cost, more complexity
Apps Have Many Stages

• Heavy Lifting
• Modeling & Learning
• Scoring
• Processing



Heavy Lifting

• ETL Style processes hampered by physics
• Moving/Transferring/Packaging data
• Data cleansing and value normalization

Modeling & Learning
• Also known as “Data Mining”
• Ask lots of questions to understand the
  data
• Machine learning, or
• Ad-hoc queries
• Where the innovation happens
Processing

• Transforming and/or combining multiple
  data sets into new data sets or models


 • Analytics,
 • statistics,
 • enrichment,
 • entity extraction,
 • indexing (w/ scoring),
 • feature reduction,
 • matching
Scoring

• Apply what’s learned
• Sometimes batch (as part of Processing)
 • indices with search result ranking
• Sometimes transactional, req/resp
 • prediction, recommendations, etc
In Summary
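[Diagram: event -> data -> signal -> info -> knowledge. The transitions are labelled collection, cleansing, processing, and delivery, with normalization, mining, and scoring shown beneath cleansing and processing.]

The point of computing systems is to make data more valuable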
Where does Hadoop fit?


Hadoop
[Diagram: a Cluster contains Racks, and each Rack contains Nodes. A Global Compute-space and a Global Namespace span all of the nodes.]





• Distributed, replicated storage for large files
• Distributed, fault-tolerant execution of batch processes
• Scale out vs (legacy) scale up
• Java API allows complex analysis, more freedom
MapReduce
•   A “divide and conquer” strategy for
    parallelizing workloads against collections of
    data


•   Map & Reduce are two user-defined functions
    chained via key/value pairs

•   It’s really Map->Group->Reduce, where Group
    is built in
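
The Map->Group->Reduce shape can be sketched in a few lines of single-process Python. This illustrates the model only and is not Hadoop's API; `map_group_reduce`, the sample log records, and the two user functions are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_group_reduce(records, mapper, reducer):
    """Run mapper over (key, value) records, group by key, run reducer per group."""
    groups = defaultdict(list)
    # Map: each input pair may emit zero or more (key, value) pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Group: supplied by the framework, not written by the user.
            groups[out_key].append(out_value)
    # Reduce: each unique key is handled once, with all of its values.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


def status_mapper(_offset, line):
    # Hypothetical log line layout: "<status> <bytes>"
    status, size = line.split()
    yield status, int(size)


def total_reducer(status, sizes):
    yield status, sum(sizes)


log = [(0, "200 512"), (8, "404 0"), (14, "200 1024")]
totals = map_group_reduce(log, status_mapper, total_reducer)
# totals == [("200", 1536), ("404", 0)]
```

The user supplies only `status_mapper` and `total_reducer`; the grouping step sits between them and is the same for every job.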

Keys and Values
•   Map translates input keys and values to new keys and
    values
        [K1,V1] -> Map -> [K2,V2]*

•   System Groups each unique key with all its values
        [K2,V2] -> Group -> [K2,{V2,V2,....}]

•   Reduce translates the values of each unique key to new
    keys and values
        [K2,{V2,V2,....}] -> Reduce -> [K3,V3]*

* = zero or more


Word Count
Mapper
 [0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

 ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}]

Reducer
 ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
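
The word count above can be mimicked locally to see what each phase contributes. This is a plain-Python sketch and not Hadoop code: the function names are hypothetical, character offsets stand in for the byte offsets a real input format would supply, and sorting followed by `itertools.groupby` plays the part of the built-in Group step.

```python
from itertools import groupby
from operator import itemgetter


def mapper(offset, line):
    """Emit [word, 1] for every word in the line; the offset key is ignored."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the grouped counts for one word."""
    yield word, sum(counts)


def word_count(lines):
    # Map phase: keys are line offsets, values are the lines themselves.
    pairs = []
    offset = 0
    for line in lines:
        pairs.extend(mapper(offset, line))
        offset += len(line) + 1  # +1 for the newline that separated the lines
    # Group phase: sort by key, then collect each key's values together.
    pairs.sort(key=itemgetter(0))
    counts = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            counts[key] = total
    return counts


result = word_count(["when in the course of human events", "when the course"])
# result["when"] == 2 and result["events"] == 1
```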




Divide and Conquer Parallelism
• Since the ‘records’ entering the Map and ‘groups’
  entering the Reduce are independent

• That is, there is no expectation of order or
  requirement to share state between records/groups

• Arbitrary numbers of Map and Reduce function
  instances can be created against arbitrary portions
  of input data
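
Because records are independent, the input can be cut into any number of pieces without changing the result. The sketch below illustrates that property with a thread pool; the names `split`, `square_all`, and `parallel_map` are hypothetical, and threads are used only to keep the example self-contained, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    """Cut records into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def square_all(chunk):
    """A stateless per-record function: no record depends on another."""
    return [x * x for x in chunk]


def parallel_map(records, parts):
    """Apply square_all to each chunk on its own worker and stitch the results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        mapped_chunks = list(pool.map(square_all, split(records, parts)))
    return [y for chunk in mapped_chunks for y in chunk]


data = list(range(10))
# One worker, three workers, or seven workers all produce the same squares.
same = parallel_map(data, 1) == parallel_map(data, 3) == parallel_map(data, 7)
```

If `square_all` kept state between records, the answer would start to depend on where the chunk boundaries fell, which is exactly what the model rules out.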
Cluster
[Diagram: a Cluster of Racks, each holding Nodes, with map and reduce instances spread across the nodes.]





• Multiple instances of each Map and Reduce
  function are distributed throughout the cluster

Another View
[Diagram: [K1,V1] -> Map -> [K2,V2] -> Combine -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3], with Combine and Reduce marked "same code". An input file is divided into split1, split2, split3, split4, ..., each read by a Mapper Task. Mapper output passes through the Shuffle to Reducer Tasks, which write part-00000, part-00001, ... part-000N into an output directory. Mappers must complete before Reducers can begin.]
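
The flow in this view (splits, per-mapper combine, shuffle, one part file per reducer) can be imitated in a single process. This is a sketch under stated assumptions, not Hadoop's implementation: all function names are hypothetical, and a CRC32 hash stands in for whatever partitioning function a real cluster would use.

```python
import zlib
from collections import defaultdict


def sum_values(key, values):
    """Used as both the combiner and the reducer ("same code")."""
    return key, sum(values)


def group(pairs):
    """Collect the values of each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def map_task(split):
    """Map one split of lines to (word, 1) pairs, then combine locally."""
    pairs = [(word, 1) for line in split for word in line.split()]
    # Combine: pre-aggregate this mapper's own output before the shuffle.
    return [sum_values(key, values) for key, values in group(pairs).items()]


def partition(key, num_reducers):
    """Stable hash so a given key always reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers):
    # All mapper tasks finish before any reducer starts.
    mapper_outputs = [map_task(split) for split in splits]
    # Shuffle: route each pair to the reducer that owns its key.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    # Reduce: each reducer produces one part file of the output directory.
    parts = {}
    for index, pairs in enumerate(reducer_inputs):
        reduced = [sum_values(k, vs) for k, vs in sorted(group(pairs).items())]
        parts[f"part-{index:05d}"] = dict(reduced)
    return parts


parts = run_job([["a b a"], ["b c"], ["a"]], num_reducers=2)
# Two part files whose keys do not overlap; together they hold a=3, b=2, c=1.
```

Reusing `sum_values` for both steps works here because summing is associative and commutative; a reducer without that property could not double as a combiner.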



Architectural Components
[Diagram: the Client submits jobs to the JobTracker, which hands tasks to TaskTrackers. Each TaskTracker runs mapper and reducer child JVMs. The Client and the child JVMs issue ns operations to the NameNode (shown with a Secondary) and read/write data blocks held by the DataNodes.]

•   Solid boxes are unique applications
•   Dashed boxes are child JVM instances on same node as parent
•   Dotted boxes are blocks of managed files on same node as parent
Deployment Topology
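[Diagram: separate Nodes host the Client, the JobTracker, a TaskTracker together with a DataNode, the NameNode, and the Secondary. The Client sends jobs to the JobTracker, which sends tasks to the TaskTracker. An annotation reads "Not uncommon to be same node".]
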




•   Job Client may run on any node
•   NameNode and JobTracker may run on the same node (Master)
•   DataNode and TaskTracker instances should run on the same node (Slaves)
•   NameNode and Secondary NameNode typically shouldn’t run on the same node
Complex job assemblies
•   Real applications are many MapReduce jobs chained together

•   Linked by intermediate (usually temporary) files

•   Executed in order, by hand, from the ‘client’ application

[Figure: a Count Job followed by a Sort Job, each a Map then a Reduce, linked by Files. Each Map reads and emits [ k, v ]; each Reduce receives [ k, [v] ] and emits [ k, v ].]

[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection
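
The Count Job and Sort Job chain can be sketched in plain Python. This is only an in-process simulation, assuming JSON lines as the file format; `run_job`, `word_count_then_sort`, and the mapper/reducer names are hypothetical and are not part of Hadoop or Cascading. It shows the two jobs run in order by the "client", linked by a temporary intermediate file.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_job(mapper, reducer, in_path, out_path):
    """Simulate one MapReduce job: read [k, v] lines, write [k, v] lines."""
    groups = defaultdict(list)
    with open(in_path) as src:
        for line in src:
            key, value = json.loads(line)
            # Map emits zero or more [k, v] pairs per input pair.
            for out_key, out_value in mapper(key, value):
                groups[out_key].append(out_value)
    with open(out_path, "w") as dst:
        # Simulated shuffle: each reduce call sees [k, [v]], in key order.
        for key in sorted(groups):
            for out_key, out_value in reducer(key, groups[key]):
                dst.write(json.dumps([out_key, out_value]) + "\n")


# Count job: [offset, line] -> [word, count]
def count_map(_offset, line):
    for word in line.split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


# Sort job: re-key by count so the key ordering sorts the output by count.
def sort_map(word, count):
    yield count, word


def sort_reduce(count, words):
    for word in sorted(words):
        yield count, word


def word_count_then_sort(lines):
    """The 'client': runs both jobs in order, linked by a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "input")
        counts = os.path.join(tmp, "counts")  # intermediate file
        result = os.path.join(tmp, "sorted")
        with open(source, "w") as f:
            for offset, line in enumerate(lines):
                f.write(json.dumps([offset, line]) + "\n")
        run_job(count_map, count_reduce, source, counts)
        run_job(sort_map, sort_reduce, counts, result)
        with open(result) as f:
            return [tuple(json.loads(line)) for line in f]
```

Calling `word_count_then_sort(["a b a", "b a c"])` returns the words ordered by ascending count. Every additional job in a real chain adds another intermediate file and another step the client has to sequence.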
Match two sets using prefix filtering

[Figure: a chain of Map/Reduce jobs linked by intermediate Files, labeled Tokenize, Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, and Join RHS / Match Job.]


Real World Apps
[Figure: dependency graph of a single application's jobs, each node labeled [n/75] map+reduce or [n/75] map, numbered [1/75] through [75/75].]

1 app, 75 jobs

green = map + reduce
purple = map
blue = join/merge
orange = map split
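
An application with this many jobs needs its execution order derived from the job dependencies rather than written out by hand. The following is a minimal sketch of that idea using Python's standard `graphlib` module (Python 3.9+). It is not Cascading's planner, and the job names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves; every job in a wave has all of its inputs ready.

    `dependencies` maps a job name to the set of jobs whose output it reads.
    Jobs in the same wave do not depend on each other.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the jobs form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking dependents
    return waves
```

For example, `execution_waves({"count": {"tokenize"}, "join": {"tokenize", "count"}, "sort": {"join"}})` yields one job per wave, while a diamond-shaped graph places its two middle jobs in the same wave.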
Cascading
[Figure: Word Count/Sort Flow. Data is read, passes through Parse, Group, Count, and Sort, and Data is written. The operations are laid over Map, Reduce, Map, Reduce stages, with [ f1,f2,.. ] tuples passed between them.]

[ f1, f2,... ] = tuples with field names
• Alternative model & API to MapReduce
 • pipe/filters of re-usable operations
• For rapidly implementing Data Processing Systems
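
The pipe/filter model over tuples with field names can be illustrated with a small Python sketch of the Word Count/Sort flow. The function names (`parse`, `group_and_count`, `sort_by`, `flow`, `word_count_sort`) are hypothetical and are not the Cascading API. Dictionaries stand in for tuples with named fields, and everything runs in memory.

```python
from collections import Counter


def parse(records, field="line"):
    """Split the text in `field` into one {'word': ...} tuple per token."""
    for record in records:
        for word in record[field].split():
            yield {"word": word}


def group_and_count(records, field="word"):
    """Group tuples on `field` and emit one tuple per group with its count."""
    for value, n in Counter(r[field] for r in records).items():
        yield {field: value, "count": n}


def sort_by(records, *fields):
    """Order tuples by the named fields."""
    return sorted(records, key=lambda r: tuple(r[f] for f in fields))


def flow(source, *pipes):
    """Connect re-usable operations into one pipeline and run it."""
    data = source
    for pipe in pipes:
        data = pipe(data)
    return list(data)


def word_count_sort(lines):
    source = [{"line": line} for line in lines]
    return flow(source, parse, group_and_count,
                lambda rs: sort_by(rs, "count", "word"))
```

Each operation only knows the field names it reads and writes, so it can be reused in other flows. How the operations map onto Map and Reduce stages is left to the framework and is not modeled here.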
Cascading

• Allows for Unit testing independent of
  integration
• Re-usable libraries
• Integration is first class
• Homogeneous framework for scheduling
• Any JVM based host language
Elastic MapReduce
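[Figure: a User works through the CLI or Console with Amazon Web Services Elastic MapReduce. The cluster has a Master, which runs the Client, and Slaves, which run Map/Reduce (mr) over temp data in HDFS. The jar, input, and output are in S3.]
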
• Clusters typically single purpose
• S3 used for storage between runs
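Architecture Isn’t Innovation

[Figure: a pipeline from event to data to signal to info to knowledge, through the stages collection, cleansing, processing, and delivery. Operationalization is marked above the pipeline; innovation, with normalization, scoring, and mining, is marked below.]

Rate of innovation and arrival of answers are proportional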
Big vs Lots
                   "Big" Data    Lots of Data
Data Mining*           !              ?
Data Processing        !              !

! = Hadoop
? = RDBMS, R, etc.
* Data Warehousing


• Big - too much to fit in/on any one thing
• Lots - complexity arising from keeping
  track of all the bits
At Rest vs In Motion
[Figure: Data At Rest. Loggers produce raw data; ETL feeds data warehousing and data mining, which an Analyst uses.]

[Figure: Data In Motion. Loggers produce raw data; data processing turns it into valuable data for a Consumer.]

• Hub/Spoke vs Incremental Layers
• Static Schema vs Dynamic Views
• Monolithic vs Distributed
Hadoop for Processing
[Figure: Value Creation, Scalability, Simplicity]

• Delivering Value from Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
Simplicity
[Figure: a Cluster of Racks, each holding Nodes. The nodes' cpus form a Global Compute-space and their disks form a Global Namespace.]

• Virtualization across resources, not within (PaaS)
  • A single FileSystem across disks - no DBA
  • A single Execution System across CPUs - less IT
• One app installed and managed across hardware
Scalability
[Figure: Users run Clients that submit jobs to a Cluster of Racks, each holding Nodes.]

• Scalability - continued reliability and met expectations as demand changes
• Application Scalability - data grows, app/infra expand
• Organizational Scalability - simpler infra and apps
Delivering Value
[Figure: a Producer's events and loggers feed raw data into data processing (ETL and analytics on Hadoop + Cascading), which delivers reporting, product, and operational outputs to a Consumer. The outputs are labeled Value.]

• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
Data In Motion
[Figure: Data In Motion. Loggers produce raw data; data processing turns it into valuable data for a Consumer.]

• Data always arriving, results being delivered
• Not paying the upfront cost of indexing
• No upfront schema design
• “ETL” is built into the processing pipeline
Where to Innovate?
                   "Big" Data    Lots of Data
Data Mining*           !              ?
Data Processing        !              !

! = Hadoop
? = RDBMS, R, etc.
* Data Warehousing


• Whether Hadoop makes sense as your innovation platform depends on the problem

Hadoop for Innovating

[Figure: three axes labeled value, latency, and degrees of freedom, with regions marked innovation.]

• Need to ask similar questions repeatedly
  • Indexes help here
• Need a reasonably high abstraction
  • Existing libraries and a simple syntax
• Third-party tool support
Innovation Abstractions
• Syntax
 • Pig
 • Hive - now has some indexing support
• Language (easier to operationalize)
 • Cascalog
 • Cascading.jruby
 • 3 new Scala languages pending release
Data At Rest
[Figure: Data At Rest. Loggers produce raw data; ETL feeds data warehousing and data mining, which an Analyst uses.]

• Hadoop becomes a warehouse (with schemas)
• Without indexes, queries have high latency
• ETL becomes an independent architecture
Don’t throw out the baby with the bath water
• Need low latency responses
• Need support for existing tools
• Need to avoid retraining analysts
   •   RDBMS (Aster, Greenplum, Vertica, Oracle)
   •   R
   •   SAS
   •   MicroStrategy
   •   Tableau
Baling Wire & Bubble Gum

• Integrating these tools with Hadoop adds brittleness and inefficiencies
 • Hadoop Streaming
 • RHIPE, etc.
Operationalizing
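[Figure: a pipeline from event to data to signal to info to knowledge, through the stages collection, cleansing, processing, and delivery. Operationalization is marked above the pipeline; innovation, with normalization, scoring, and mining, is marked below.]
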
•    Minimize the number of processing technologies (debt)
•    Don’t lose sight of the physical model/plan
•    XML is not a programming language
•    String concatenation isn’t programming
Resources
•   Chris K Wensel
    • chris@wensel.net
    • @cwensel

•   Cascading & Cascalog
    • http://cascading.org
    • @cascading

•   Concurrent, Inc.
    • http://concurrentinc.com
    • @concurrent
    • http://concurrentinc.com/careers

Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Cascading and BigData Problems

  • 1. Cascading and BigData Problems Chris K Wensel Concurrent, Inc. Copyright Concurrent, Inc. 2011. All rights reserved.
  • 2. About Me • Concurrent, Inc., Founder • Cascading support and tools • http://concurrentinc.com/ • Cascading, Lead Developer (started Sept 2007) • An alternative API to MapReduce • http://cascading.org/ • Formerly Hadoop mentoring and training • Sun - Apple - HP - LexisNexis - startups - etc • Formerly Systems Architect & Consultant • Thomson/Reuters - TeleAtlas - startups - etc
  • 3. Overview • Case Studies • What’s in common? • Where does Hadoop fit? • Processing vs Innovation
  • 4. Case Studies • ShareThis • BestBuy • FlightCaster • Etsy • Ion Flux
  • 5. Summary • All running in production with Hadoop • All use AWS, most use Elastic MapReduce • All production processing was implemented in Cascading • Various other tools used at different stages of development
  • 6. ShareThis • Cascading + AWS (pre-EMR) • Daily event log processing, initially multiple TB and growing • Details in the O’Reilly Hadoop book from Tom White
  • 7. Lessons [Diagram: logprocessor (every X hrs), crawler (every Y hrs), indexer (on crawl completion)] • Mark data as bad and why, never discard • useful for upstream debugging • Data is seasonal, cyclical, and bursty • Tune your app and cluster to the workload • (garbage collect Hadoop clusters)
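The "mark data as bad and why, never discard" lesson can be illustrated with a minimal Python sketch. The tab-separated record layout (timestamp, user, url) and the names `Record`, `parse_event`, and `split_good_bad` are assumptions made for illustration; they do not come from the talk or from Cascading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    raw: str                          # original line, always kept
    fields: Optional[dict] = None     # parsed values when the record is good
    bad_reason: Optional[str] = None  # why the record was rejected, if it was


def parse_event(raw: str) -> Record:
    """Parse 'timestamp<TAB>user<TAB>url'; tag failures instead of dropping them."""
    parts = raw.rstrip("\n").split("\t")
    if len(parts) != 3:
        return Record(raw, bad_reason=f"expected 3 fields, got {len(parts)}")
    timestamp, user, url = parts
    if not timestamp.isdigit():
        return Record(raw, bad_reason="timestamp is not an integer")
    return Record(raw, fields={"timestamp": int(timestamp), "user": user, "url": url})


def split_good_bad(lines):
    """Route every record to exactly one output; nothing is silently discarded."""
    good, bad = [], []
    for line in lines:
        record = parse_event(line)
        (bad if record.bad_reason else good).append(record)
    return good, bad
```

Keeping the raw line next to the reason lets whoever owns the data source reproduce the failure later without re-running the whole job.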
  • 8. BestBuy - Behavioral Ad-Targeting • Cascading + AWS (Elastic MapReduce) • Daily automated User Behavior Segmentation • 6 wks dev, 3 TB/day, $13k/mo • 500% increase in return on ad spend from a similar campaign a year before • http://aws.amazon.com/solutions/case-studies/razorfish/
  • 9. Cluster [Diagram labels: E-Commerce Site, S3 (input/output), Amazon Web Services Elastic MapReduce (Slaves, Map/Reduce, behavior app, HDFS), Ad System] • 200+ nodes, 9-12 hour runs • 30+ days of history + 3 TB daily • Remote HTTP update of ad-server, of only changed data
  • 10. Road Blocks • No one really understood the data • Character formats (UTF-8 vs ...) • Zero byte chars • Unique columns not unique • Outliers in the data • Creating test data • QAing the data • result data was also big
  • 11. FlightCaster - Predicting Flight Delays • Clojure + Cascading + AWS • Scours data on every domestic flight for the past 10 years and matches it to real-time conditions • Machine learning on Cascading, Scoring on app server • 3 mos dev, 10 GB/day, <$2k/mo
  • 12. Lessons • Even with a good abstraction, you must intuit the underlying model (MapReduce) to improve throughput • i.e. Logical vs Physical plans • we still need DBAs after decades of query planner dev
  • 13. Etsy - Online Marketplace • JRuby + Cascading + AWS • 1B page-views & multiple TB of logs per month • 40-50 cascading.jruby jobs a night • http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/ • http://www.concurrentinc.com/casestudies/etsy
  • 14. Initially • JRuby for the ‘analysts’ • Log pre-processing • db snapshot diffs • nightly and ad-hoc analytics
  • 15. Data Driven Products • Search index/scoring (under dev) • Taste Test • Facebook gift recommender • Suggested shops • Top query list, etc... • Many more on the way
  • 16. Ion Flux - Gene Sequencing • Cascading + AWS • Sequence Alignment • http://aws.amazon.com/solutions/case-studies/ion-flux/
  • 17. Cluster • 10-30 nodes, using new HPC instances • 200-500 cores • runs up to 50 hours
  • 18. Architecture [Diagram labels: Clinical Lab (Ion Torrent sequencer, Torrent Server), FTP upload of FastQ and RAW data, Ion Flux Pipeline Controller, Flux Capacitor, LIMS, Annotation Server, Variant Server and Client Website (EC2, RDS, ReST), AWS S3 storage, and the Ion Flux Sequencing Pipeline on an AWS EMR cluster driven by Cascading (TMAP alignment, SRMA, sort by position, split to bins, PILEUP variants)]
  • 19. Common Architecture [Diagram labels: loggers, raw data, intermediate data, valuable data; Producer, Developer, Analyst, Consumer; Value] • New data continuously arriving • Actively incorporating the new with the old • Updating backend systems
  • 20. Common Constraints • Speed of light • Understanding the data • Creating tests and validating the results • Lifecycle phases have different environments • dev vs. integration vs. prod • Better algorithms, less cost, more complexity
  • 21. Apps Have Many Stages • Heavy Lifting • Modeling & Learning • Processing • Scoring
  • 22. Heavy Lifting • ETL Style processes hampered by physics • Moving/Transferring/Packaging data • Data cleansing and value normalization
  • 23. Modeling & Learning • Also known as “Data Mining” • Ask lots of questions to understand the data • Machine learning, or • Ad-hoc queries • Where the innovation happens
  • 24. Processing • Transforming and/or combining multiple data sets into new data sets or models • Analytics, statistics, enrichment, entity extraction, indexing (w/ scoring), feature reduction, matching
  • 25. Scoring • Apply what’s learned • Sometimes batch (as part of Processing) • indices with search result ranking • Sometimes transactional, req/resp • prediction, recommendations, etc
  • 26. In Summary [Diagram labels: collection, cleansing, processing, delivery; event, data, signal, info, knowledge; normalization, scoring, mining] • The point of computing systems is to make data more valuable
  • 27. Where does Hadoop fit?
  • 28. Hadoop [Diagram: Cluster of Racks of Nodes; Global Compute-space; Global Namespace] • Distributed replicated storage for large files • Distributed fault tolerant exec of batch processes • Scale out vs (legacy) scale up • Java API allows complex analysis, more freedom
  • 29. MapReduce • A “divide and conquer” strategy for parallelizing workloads against collections of data • Map & Reduce are two user defined functions chained via Key Value Pairs • It’s really Map->Group->Reduce where Group is built in
  • 30. Keys and Values • Map translates input keys and values to new keys and values: [K1,V1] -> Map -> [K2,V2]* • System Groups each unique key with all its values: [K2,V2] -> Group -> [K2,{V2,V2,....}] • Reduce translates the values of each unique key to new keys and values: [K2,{V2,V2,....}] -> Reduce -> [K3,V3]* • * = zero or more
  • 31. Word Count • Mapper: [0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1] • Group: ["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}] • Reducer: ["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
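The Map -> Group -> Reduce word count can be sketched in plain Python. This is a single-process illustration of the key/value model only, not Hadoop code; the names `run_map_reduce`, `map_fn`, and `reduce_fn` are hypothetical, and lowercasing words is an added assumption.

```python
from collections import defaultdict


def map_fn(offset, line):
    """[K1,V1] -> [K2,V2]*: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """[K2,{V2,...}] -> [K3,V3]*: sum the values collected for one key."""
    yield word, sum(counts)


def run_map_reduce(records, map_fn, reduce_fn):
    # Group is the built-in step: collect every value emitted under the same key.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


records = [(0, "when in the course of human events"), (35, "when when")]
counts = dict(run_map_reduce(records, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are user code; the grouping in the middle is what the framework supplies.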
  • 32. Divide and Conquer Parallelism • Since the ‘records’ entering the Map and ‘groups’ entering the Reduce are independent • That is, there is no expectation of order or requirement to share state between records/groups • Arbitrary numbers of Map and Reduce function instances can be created against arbitrary portions of input data
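A small Python sketch can show why that independence permits arbitrary parallelism: the input is split among any number of map instances, keys are routed to any number of reduce instances, and the result does not change. Threads stand in for cluster nodes here, and the helper names and the CRC32 partitioner are assumptions for illustration, not Hadoop's implementation.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(lines):
    """One Map instance: emits (word, 1) pairs for its own split only."""
    return [(word, 1) for line in lines for word in line.split()]


def partition(key, num_reducers):
    """Stable hash so every pair with the same key reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_task(pairs):
    """One Reduce instance: groups its pairs by key and sums the values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def word_count(lines, num_mappers, num_reducers):
    # Arbitrary portions of the input go to each map instance.
    splits = [lines[i::num_mappers] for i in range(num_mappers)]
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_task, splits))
        # Shuffle: route each pair to the reducer that owns its key.
        buckets = [[] for _ in range(num_reducers)]
        for pairs in mapped:
            for key, value in pairs:
                buckets[partition(key, num_reducers)].append((key, value))
        reduced = list(pool.map(reduce_task, buckets))
    result = {}
    for part in reduced:
        result.update(part)  # keys never overlap across reducers
    return result
```

Calling `word_count(lines, 1, 1)` and `word_count(lines, 3, 2)` on the same input returns the same dictionary, which is the property the slide describes.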
  • 33. Cluster [Diagram: Cluster of Racks of Nodes, with map and reduce instances spread across the nodes] • Multiple instances of each Map and Reduce function are distributed throughout the cluster
  • 34. Another View [Diagram: [K1,V1] -> Map -> [K2,V2] -> Combine -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3]; file splits (split1, split2, split3, split4, ...) feed Mapper Tasks running the same code, a Shuffle feeds Reducer Tasks, outputs part-00000, part-00001, ... part-000N land in a directory] • Mappers must complete before Reducers can begin
  • 35. Architectural Components [Diagram labels: Client (jobs), JobTracker (tasks), TaskTracker with mapper and reducer child JVMs; NameNode and Secondary (ns operations); DataNodes (read/write, data blocks)] • Solid boxes are unique applications • Dashed boxes are child JVM instances on same node as parent • Dotted boxes are blocks of managed files on same node as parent
  • 36. Deployment Topology [Diagram labels: Client, JobTracker, NameNode, Secondary, TaskTracker, DataNode; JobTracker and NameNode marked "Not uncommon to be same node"] • Job Client may run on any node • NameNode and JobTracker may run on same node (Master) • DataNode and TaskTracker instances should run on same node (Slaves) • NameNode and SecondaryNode shouldn’t typically run on same node
  • 37. Complex job assemblies • Real applications are many MapReduce jobs chained together • Linked by intermediate (usually temporary) files • Executed in order, by hand, from the ‘client’ application [Diagram: File -> Count Job (Map, Reduce) -> File -> Sort Job (Map, Reduce) -> File; [ k, v ] = key and value pair; [ k, [v] ] = key and associated values collection]
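The Count Job and Sort Job chain can be sketched in Python as two functions linked by a temporary file, with a small "client" function running them in order. This is an illustrative stand-in only: each function plays the role of a whole MapReduce job, and the file formats and function names are assumptions.

```python
import json
import os
import tempfile
from collections import Counter


def count_job(input_path, output_path):
    """First job: count words and write [word, count] pairs to a file."""
    counts = Counter()
    with open(input_path, encoding="utf-8") as src:
        for line in src:
            counts.update(line.split())
    with open(output_path, "w", encoding="utf-8") as dst:
        for word, n in counts.items():
            dst.write(json.dumps([word, n]) + "\n")


def sort_job(input_path, output_path):
    """Second job: read the intermediate file and sort by descending count."""
    with open(input_path, encoding="utf-8") as src:
        pairs = [json.loads(line) for line in src]
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    with open(output_path, "w", encoding="utf-8") as dst:
        for word, n in pairs:
            dst.write(f"{word}\t{n}\n")


def run_app(input_path, output_path):
    """The 'client': runs the jobs in order and cleans up the temporary file."""
    fd, tmp_path = tempfile.mkstemp(suffix=".counts")
    os.close(fd)
    try:
        count_job(input_path, tmp_path)
        sort_job(tmp_path, output_path)
    finally:
        os.remove(tmp_path)
```

Even with two jobs, the client owns ordering, intermediate paths, and cleanup by hand, and that bookkeeping grows with every job added to the chain.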
  • 38. [Diagram: matching two sets using prefix filtering, built as a chain of Map/Reduce jobs linked by files: Tokenize, Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, Join RHS / Match Job]
  • 39. Real World Apps [Diagram: dependency graph of jobs numbered 1/75 through 75/75, mostly map+reduce with a few map-only jobs] • 1 app, 75 jobs • green = map + reduce • purple = map • blue = join/merge • orange = map split
  • 40. Cascading [Diagram: Word Count/Sort Flow: Data -> Parse -> Group -> Count -> Sort -> Data, laid over two Map/Reduce jobs; [ f1, f2,... ] = tuples with field names] • Alternative model & API to MapReduce • pipe/filters of re-usable operations • For rapidly implementing Data Processing Systems
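The pipe/filter idea, with reusable operations applied to tuples that carry field names, can be sketched in Python as follows. This is not the Cascading API (Cascading is a Java library, and its planner maps such a flow onto MapReduce jobs); `parse`, `group_count`, `sort_by`, and `flow` are hypothetical names used only to show the shape of the model.

```python
from collections import Counter


def parse(field, into):
    """Reusable operation: split one text field into one tuple per word."""
    def op(tuples):
        for t in tuples:
            for word in t[field].split():
                yield {into: word}
    return op


def group_count(field, into):
    """Reusable operation: group on a field and count each group."""
    def op(tuples):
        counts = Counter(t[field] for t in tuples)
        for value, n in counts.items():
            yield {field: value, into: n}
    return op


def sort_by(field, reverse=False):
    """Reusable operation: order tuples by one named field."""
    def op(tuples):
        return iter(sorted(tuples, key=lambda t: t[field], reverse=reverse))
    return op


def flow(source, *operations):
    """Chain the operations into one pipeline from source to sink."""
    stream = iter(source)
    for op in operations:
        stream = op(stream)
    return list(stream)


result = flow(
    [{"line": "to be or not to be"}],
    parse("line", "word"),
    group_count("word", "count"),
    sort_by("count", reverse=True),
)
```

The application is written against named fields and operations; how many underlying jobs it takes to run is left to the framework rather than to the author.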
  • 41. Cascading • Allows for Unit testing independent of integration • Re-usable libraries • Integration is first class • Homogeneous framework for scheduling • Any JVM based host language
  • 42. Elastic MapReduce [Diagram labels: User, CLI, Console, Client (jar), Amazon Web Services Elastic MapReduce (Master, Slaves, Map/Reduce, temp, HDFS), S3 (input, output)] • Clusters typically single purpose • S3 used for storage between runs
  • 43. Architecture Isn’t Innovation [Diagram labels: operationalization; collection, cleansing, processing, delivery; event, data, signal, info, knowledge; normalization, scoring, mining; innovation] • Rate of innovation and arrival of answers are proportional
  • 44. Big vs Lots [Diagram: Lots of Data vs "Big" Data, against Data Mining* and Data Processing; ! = Hadoop; ? = RDBMS, R, etc; * Data Warehousing] • Big - too much to fit in/on any one thing • Lots - complexity arising from keeping track of all the bits
  • 45. At Rest vs In Motion [Diagrams: Data At Rest (loggers, raw data, ETL, data warehousing, data mining, Analyst) and Data In Motion (loggers, raw data, data processing, valuable data, Consumer process)] • Hub/Spoke vs Incremental Layers • Static Schema vs Dynamic Views • Monolithic vs Distributed
  • 46. Hadoop for Processing [Diagram labels: Value Creation, Scalability, Simplicity] • Delivering Value from Innovation • Scalability, Not Performance • Simplifies Infrastructure
  • 47. Simplicity [Diagram: Cluster of Racks of Nodes; cpus: Global Compute-space; disks: Global Namespace] • Virtualization across resources, not within (PaaS) • A single FileSystem across disks - no DBA • A single Execution System across CPUs - less IT • One app installed and managed across hardware
  • 48. Scalability [Diagram: Users' Clients submitting jobs to a Cluster of Racks of Nodes] • Scalability - continued reliability and met expectations as demand changes • Application Scalability - data grows, app/infra expand • Organizational Scalability - simpler infra and apps
  • 49. Delivering Value [Diagram labels: events, loggers, raw data, data processing, Hadoop + Cascading (etl, analytics), reporting, product, operational; Producer, Consumer; Value] • Unconstrained processing model • Data processing requires integration • Processing must not fail or fall behind
  • 50. Data In Motion [Diagram: Data In Motion (loggers, raw data, data processing, valuable data, Consumer process)] • Data always arriving, results being delivered • Not paying the upfront cost of indexing • No upfront schema design • “ETL” is built into the processing pipeline
  • 51. Where to Innovate? [Diagram: Lots of Data vs "Big" Data, against Data Mining* and Data Processing; ! = Hadoop; ? = RDBMS, R, etc; * Data Warehousing] • Depends on the problem whether Hadoop makes sense as your innovation platform
  • 52. Hadoop for Innovating [Diagram labels: value, innovation, latency, degrees of freedom] • Need to ask similar questions repeatedly • Indexes help here • Need a reasonably high abstraction • Existing libraries and a simple syntax • Third-party Tool support
  • 53. Innovation Abstractions • Syntax • Pig • Hive - now has some indexing support • Language (easier to operationalize) • Cascalog • Cascading.jruby • 3 new Scala languages pending release
  • 54. Data At Rest [Diagram: Data At Rest (loggers, raw data, ETL, data warehousing, data mining, Analyst)] • Hadoop becomes a warehouse (with Schemas) • and without indexes, high latency queries • ETL becomes an independent architecture
  • 55. Don’t throw out the baby with the bath water • Need low latency responses • Need support for existing tools • Need to not retrain analysts • RDBMS (Aster, Greenplum, Vertica, Oracle) • R • SAS • MicroStrategy • Tableau
  • 56. Baling Wire & Bubble Gum • Integrating them with Hadoop adds brittleness and inefficiencies • Hadoop Streaming • RHIPE, etc.
  • 57. Operationalizing [Diagram labels: operationalization; collection, cleansing, processing, delivery; event, data, signal, info, knowledge; normalization, scoring, mining; innovation] • Minimize the number of processing tech (debt) • Don’t lose sight of the physical model/plan • XML is not a programming language • String concatenation isn’t programming
  • 58. Resources • Chris K Wensel • chris@wensel.net • @cwensel • Cascading & Cascalog • http://cascading.org • @cascading • Concurrent, Inc. • http://concurrentinc.com • @concurrent • http://concurrentinc.com/careers

Editor's notes

  1. Startups expecting to need 'web scale' implementations are committing to technologies that might not be a good fit. Doing so can be a dramatic waste of time, money and resources when they can ill afford it. Do you really have a Big Data problem? Do you have a plan for what you are going to do with it? Chris will try to explain where he sees Hadoop being used most successfully and will offer up some guidelines on when to consider adopting it and any complementary technologies.