“The Workflow Abstraction”

                     Strata SC
                     2013-02-28

                     Paco Nathan
                     Concurrent, Inc.
                     San Francisco, CA
                     @pacoid




                   Copyright @2013, Concurrent, Inc.




Friday, 01 March 13                                                                                           1
Background: dual, in both quantitative and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
The Workflow Abstraction

    1. Funnel
    2. Circa 2008
    3. Cascading
    4. Sample Code
    5. Workflows
    6. Abstraction
    7. Trendlines

    [Figure: word count flow diagram with nodes Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin Left (RHS), Regex token, GroupBy token, Count, Word Count; regions labeled M and R]

This talk is about the workflow abstraction:
 * the business process of structuring data
 * the practices of building robust apps at scale
 * the open source projects for Enterprise Data Workflows

We’ll consider some theory, examples, best practices, and trendlines:
what drivers brought us here, and where is this work heading?

Most of all, the goal is to make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows (robust apps at scale) for Hadoop and beyond.
Marketing Funnel – overview

    In reference to Making Data Work…

    Almost every business uses a model similar to this – give or take a few steps.

    Customer leads go in at the top, those get refined through several stages, then results flow out the bottom.

    [Figure: funnel stages, top to bottom: Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat]

Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.

This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
Marketing Funnel – clickstream

    Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream:

      • ad impressions
      • URL clicks
      • landing page views
      • new user registrations
      • session cookies
      • online purchases
      • social network activity
      • etc.

    [Figure: funnel stages with clickstream events alongside: Impression, Click, Sign Up, Purchase, "Like"]

Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
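
As a rough illustration of how raw log events relate to funnel stages, here is a minimal Python sketch. The tab-separated log format, the event names, and the event-to-stage mapping are all assumptions made for this example (the mapping loosely follows where the events sit next to the stages on the slide); real clickstream logs will differ.

```python
from collections import Counter

# Hypothetical mapping from clickstream event types to funnel stages.
# This is an assumption for illustration, not a standard taxonomy.
EVENT_TO_STAGE = {
    "impression": "Awareness",
    "click": "Interest",
    "signup": "Evaluation",
    "purchase": "Conversion",
    "like": "Referral",
}

def count_funnel_events(log_lines):
    """Tally events per funnel stage.

    Each line is assumed to hold three tab-separated fields:
    timestamp, user id, event type. Malformed lines and unknown
    event types are skipped.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        stage = EVENT_TO_STAGE.get(fields[2].lower())
        if stage is not None:
            counts[stage] += 1
    return counts

# Made-up sample records, including one malformed line.
sample_log = [
    "2013-02-28T10:00:00\tu1\timpression",
    "2013-02-28T10:00:05\tu1\tclick",
    "2013-02-28T10:01:00\tu2\timpression",
    "2013-02-28T10:02:00\tu1\tsignup",
    "not a valid line",
]
funnel_counts = count_funnel_events(sample_log)
```

Counts like these are the raw inputs to the funnel metrics; at production scale the same tally would run as a distributed job rather than a single loop.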
Marketing Funnel – metrics

    A variety of clickstream metrics can be used as performance indicators at different stages of the funnel:

      • CPM: cost per thousand
      • CTR: click-through rate
      • CPA: cost per action
      • etc.

    [Figure: funnel annotated with metrics per stage: Awareness: CPM; Interest: CTR; Evaluation: behaviors; Conversion: CPA; Referral: NPS, social graph, etc.; Repeat: loyalty, win back, etc.]

The many different highly-nuanced metrics which apply are mind-boggling :)
Marketing Funnel – example calculations

    metric    cost      events    formula                   rate
    CPM       $4,000    10^6      $4,000 ÷ (10^6 ÷ 10^3)    $4.00
    CTR       -         3∙10^3    3∙10^3 ÷ 10^6             0.3%
    CPA       -         20        $4,000 ÷ 20               $200

Here are examples of the kinds of calculations performed...
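
The slide's arithmetic can be reproduced with a few small functions. This is a minimal sketch: the function and variable names are chosen here for illustration, and the input figures are the ones from the slide.

```python
def cpm(cost, impressions):
    """Cost per thousand impressions."""
    return cost / (impressions / 1000)

def ctr(clicks, impressions):
    """Click-through rate, as a fraction of impressions."""
    return clicks / impressions

def cpa(cost, actions):
    """Cost per action (for example, per purchase)."""
    return cost / actions

# Figures taken from the slide.
cost = 4000.0        # campaign spend in dollars
impressions = 10**6  # ad impressions served
clicks = 3 * 10**3   # clicks on those ads
actions = 20         # completed actions

print(f"CPM = ${cpm(cost, impressions):.2f}")  # CPM = $4.00
print(f"CTR = {ctr(clicks, impressions):.1%}")  # CTR = 0.3%
print(f"CPA = ${cpa(cost, actions):.2f}")       # CPA = $200.00
```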
Marketing Funnel – predictive model

    Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc.

    Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance:

        ROI = (LTV − CPP) ∕ CPP

    As an example, after crunching lots of logs, suppose that…

        CPP = $200
        LTV = $2000

        ROI = ($2000 − $200) ∕ $200

    for a 9x multiple

For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
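
The ROI formula from the slide is simple enough to check directly. Here is a minimal Python sketch using the slide's example values; the function name and the guard against a non-positive CPP are additions for illustration.

```python
def roi(ltv, cpp):
    """Return on investment per customer: (LTV - CPP) / CPP."""
    if cpp <= 0:
        # Guard added for this sketch: the ratio is meaningless
        # without a positive acquisition cost.
        raise ValueError("CPP must be positive")
    return (ltv - cpp) / cpp

# Example values from the slide.
example_roi = roi(ltv=2000.0, cpp=200.0)
print(f"ROI = {example_roi:.0f}x")  # ROI = 9x
```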
Marketing Funnel – example architecture

    Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data…

    [Figure: architecture diagram with components Customers, Web App, Cache, Logs, source taps, sink taps, trap tap, Data Workflow, PMML, Hadoop Cluster, Support, Modeling, Analytics Cubes, Reporting, customer profile DBs, Customer Prefs]

Here’s an example architecture for using clickstream metrics within an online business.
Marketing Funnel – complexities

    Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc.

    Campaigns target specific geo/demo, test alternate landing pages, probably need to segment customer base…

    These issues make clickstream data large and yet sparse.

    Other issues:

      • seasonal variation
      • fluctuating currency exchange rates
      • distortions due to credit card fraud
      • diminishing returns
      • forecasting requirements

    [Figure: funnel annotated with clickstream events and per-stage metrics, with × marks at several points]

However, real life intervenes. In many businesses, this is a complicated model to calculate correctly.

scrubs
many vendors, data sources, different metrics to be aligned
lots of roll-ups
Bayesian point estimates
forecasts and dashboards

social dimension makes this convoluted
not simple
Marketing Funnel – very large scale

    Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.

    Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.

    The need for these insights has been a driver for Hadoop-related technologies.

    [Figure: funnel annotated with clickstream events and per-stage metrics]

The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
Marketing Funnel – very large scale

    funnel modeling and optimization requires complex data workflows to obtain the required insights

These needs imply complex data workflows.

It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
The Workflow Abstraction

    1. Funnel
    2. Circa 2008
    3. Cascading
    4. Sample Code
    5. Workflows
    6. Abstraction
    7. Trendlines

A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
Circa 2008 – Hadoop at scale

    Scenario: Analytics team at a large ad network…

    Company had invested $MM capex in a large data warehouse across LOBs

    Mission-critical app had been written as a large SQL workflow in the DW

    Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily

    Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance

    [Figure: funnel stages beside a pipeline labeled query/load clickstream, RDBMS, roll-ups, collab filter, per-user recommends]

Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.

Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
Circa 2008 – Hadoop at scale
             Issues:

              • critical app had hit hard limits for scalability
              • several TB of data, 100s of servers
              • batch window length vs. failure rate vs. SLA
                in the context of business growth posed
                an existential risk

             We built out a team to address these issues
             as rapidly as possible…
             Needed to re-create that data workflow
             based on Enterprise requirements.

             [Diagram: marketing funnel (Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat); data flow (clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends), one point marked ×]




Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.

We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
Circa 2008 – Hadoop at scale

             Approach:

              • reverse-engineered business process from
                ~1500 lines of undocumented SQL
              • created a large, multi-step Apache Hadoop
                app on AWS
              • leveraged cloud strategy to trade $MM
                capex for lower, scalable opex
              • Amazon identified our app as one of the
                largest Hadoop deployments on EC2
              • our app became a case study for AWS
                prior to Elastic MapReduce launch

             [Diagram: clickstream, query/load, RDBMS, msg queue, HDFS, roll-ups, collab filter, per-user recommends]




Our solution involved dependencies among more than a dozen Hadoop job steps.
Circa 2008 – Hadoop at scale




             Unresolved:

              • ETL was still a separate app
              • difficult to handle exceptions, notifications,
                debugging, etc., across the entire workflow
              • data scientists wore beepers since Ops
                lacked visibility into business process
              • coding directly in MapReduce created
                a staffing bottleneck

             [Diagram: clickstream, query/load, RDBMS, msg queue, HDFS, roll-ups, collab filter, per-user recommends; three points marked ×]




This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.

Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.

Three issues about Enterprise workflows:
 * staffing bottleneck unless there’s a good abstraction layer
 * operational complexity, mostly due to lack of transparency
 * system integration problems *are* the main problem to solve
Circa 2008 – Hadoop at scale

             a good solution for a large, commercial
             Apache Hadoop deployment, but the app’s
             workflow management lacked crucial
             features…

             which led to a search for a better
             workflow abstraction



While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.

I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
The Workflow Abstraction
             1. Funnel
             2. Circa 2008
             3. Cascading
             4. Sample Code
             5. Workflows
             6. Abstraction
             7. Trendlines

Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.




Cascading initially grew from interaction with the Nutch project, before Hadoop had a name.

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without any abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – back to LISP in the 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows:

               • leverages JVM and Java-based tools without any need
                 to create an entirely new language
               • allows many programmers who have J2EE expertise
                 to build apps that leverage the economics of Hadoop
                 clusters
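
To make the "data pipelines are functional in nature" point concrete, here is a minimal sketch in plain Java. It is not the Cascading API: it uses `java.util.stream` (Java 8 or later, which postdates the API described here), and the `Event` class, its fields, and the campaign names are hypothetical. The idea it illustrates is that a workflow step can be written as a composition of side-effect-free operations (filter, group by key, aggregate) over a collection of records.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Hypothetical clickstream record: the campaign shown and whether the user clicked.
    static final class Event {
        final String campaign;
        final boolean clicked;

        Event(String campaign, boolean clicked) {
            this.campaign = campaign;
            this.clicked = clicked;
        }
    }

    // A pipeline of pure steps: filter -> group by key -> count per key.
    // Nothing here mutates shared state, which is what lets a framework
    // split the same logic across many machines.
    static Map<String, Long> clicksPerCampaign(List<Event> events) {
        return events.stream()
            .filter(e -> e.clicked)
            .collect(Collectors.groupingBy(e -> e.campaign, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("spring", true),
            new Event("spring", false),
            new Event("summer", true),
            new Event("spring", true));

        // prints {spring=2, summer=1}
        System.out.println(clicksPerCampaign(events));
    }
}
```

The filter step corresponds loosely to map-side work and the group-and-count step to reduce-side work; a workflow API expresses the same shape while leaving the distribution to the platform.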




Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
quotes…

                       “Cascading gives Java developers the ability to build
                        Big Data applications on Hadoop using their existing
                        skillset … Management can really go out and build a
                        team around folks that are already very experienced
                        with Java. Switching over to this is really a very short
                        exercise.”
                            CIO, Thor Olavsrud
                            2012-06-06
                            cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

                       “Masks the complexity of MapReduce, simplifies the
                        programming, and speeds you on your journey toward
                        actionable analytics … A vast improvement over native
                        MapReduce functions or Pig UDFs.”
                            2012 BOSSIE Awards, James Borck
                            2012-09-18
                            infoworld.com/slideshow/65089




Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”

The issues:
 * staffing bottleneck
 * operational complexity
 * system integration
Cascading – deployments

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera
              • 5+ year history of Enterprise production deployments,
                   ASL 2 license, GitHub src, http://conjars.org
              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.




Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
examples…

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                           deployments
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


                     github.com/nathanmarz/cascalog/wiki
                     github.com/twitter/scalding/wiki




Many case studies, many Enterprise production deployments now for 5+ years.
examples…

                       Cascading as the basis for workflow
                       abstractions atop Hadoop and more,
                       with a 5+ year history of production
                       deployments across multiple verticals




Cascading as a basis for workflow abstraction, for Enterprise data workflows
The Workflow Abstraction
             1. Funnel
             2. Circa 2008
             3. Cascading
             4. Sample Code
             5. Workflows
             6. Abstraction
             7. Trendlines


Code samples in Cascading / Cascalog / Scalding, based on Word Count
The Ubiquitous Word Count
             Definition:

               count how often each word appears
               in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it:

              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.
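
The map/reduce pseudocode on this slide can be sketched as runnable, single-process Java. This is only an illustration under stated assumptions, not Hadoop or Cascading code: `segment(text)` is assumed to mean lowercasing and splitting on non-word characters, `emit` is modeled by returning values, and the `run` method is a hypothetical stand-in for the shuffle that a real framework performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // map: emit a (word, "1") pair for every word in one document's text.
    // Assumption: words are lowercased and split on non-word characters.
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[] { w, "1" });
            }
        }
        return pairs;
    }

    // reduce: sum the partial counts that were emitted for one word.
    static String reduce(String word, Iterable<String> group) {
        int count = 0;
        for (String pc : group) {
            count += Integer.parseInt(pc);
        }
        return String.valueOf(count);
    }

    // Stand-in for the framework: group emitted values by key, then reduce each group.
    static Map<String, String> run(Map<String, String> docs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc01", "A rain shadow is a dry area");
        docs.put("doc02", "The rain falls on the windward side");
        System.out.println(run(docs));
    }
}
```

Because `map` sees one document at a time and `reduce` sees one word at a time, both can run in parallel on separate machines; only the grouping step needs to move data between them.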


Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
word count – conceptual flow diagram


             [Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; map (M) and reduce (R) phases marked]

             1 map
             1 reduce
             18 lines code

             cascading.org/category/impatient
             gist.github.com/3900702


Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java
             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];

             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps (tab-delimited text, with a header line)
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();



Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
word count – generated flow diagram
             [head]

             Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
                 [{2}:'doc_id', 'text']
                 [{2}:'doc_id', 'text']

             map:
             Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
                 [{1}:'token']
                 [{1}:'token']

             GroupBy('wc')[by:['token']]
                 wc[{1}:'token']
                 [{1}:'token']

             reduce:
             Every('wc')[Count[decl:'count']]
                 [{2}:'token', 'count']
                 [{1}:'token']

             Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
                 [{2}:'token', 'count']
                 [{2}:'token', 'count']

             [tail]


As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
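To make the two halves of that flow plan concrete, here is a minimal plain-Java analogy of the map side (split each line into tokens) and the reduce side (group by token, then count). It is only a sketch: the class and method names are made up for illustration, nothing here is Cascading API, and dropping empty tokens is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy of the flow plan; hypothetical names, not Cascading API.
class FlowPlanSketch {
    // Delimiter set used by the word count examples: space, brackets, parens, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // "map" side: like Each + RegexSplitGenerator, turn text lines into a token stream.
    static List<String> tokenize(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
                .filter(token -> !token.isEmpty())   // assumption: drop empty tokens
                .collect(Collectors.toList());
    }

    // "reduce" side: like GroupBy on token followed by Every + Count.
    static Map<String, Long> count(List<String> tokens) {
        return tokens.stream()
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (dry area)", "rain, rain.");
        // Prints the (token, count) pairs in sorted token order.
        System.out.println(count(tokenize(lines)));
    }
}
```

The split between `tokenize` and `count` mirrors the map/reduce boundary in the plan: everything before the GroupBy can run per record, everything after needs the grouped data.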
word count – Cascalog / Clojure
             [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count, with map (M) and reduce (R) phases marked]

             (ns impatient.core
               (:use [cascalog.api]
                     [cascalog.more-taps :only (hfs-delimited)])
               (:require [clojure.string :as s]
                         [cascalog.ops :as c])
               (:gen-class))

             (defmapcatop split [line]
               "reads in a line of string and splits it by regex"
               (s/split line #"[\[\]\(\),.)\s]+"))

             (defn -main [in out & args]
               (?<- (hfs-delimited out)
                    [?word ?count]
                    ((hfs-delimited in :skip-header? true) _ ?line)
                    (split ?line :> ?word)
                    (c/count ?count)))

             ; Paul Lam
             ; github.com/Quantisan/Impatient




Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure
             github.com/nathanmarz/cascalog/wiki

               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn




Judging from language features, customer case studies, and best practices in general,
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala
           import com.twitter.scalding._

           class WordCount(args : Args) extends Job(args) {
             Tsv(args("doc"),
                  ('doc_id, 'text),
                  skipHeader = true)
               .read
               .flatMap('text -> 'token) {
                  text : String => text.split("[ \\[\\]\\(\\),.]")
                }
               .groupBy('token) { _.size('count) }
               .write(Tsv(args("wc"), writeHeader = true))
           }




Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
word count – Scalding / Scala
             github.com/twitter/scalding/wiki

                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language




If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
word count – Scalding / Scala
             github.com/twitter/scalding/wiki

               • extends the Scala collections API so that distributed lists
                 become “pipes” backed by Cascading
               • code is compact, easy to understand
               • nearly 1:1 between elements of conceptual flow diagram
                 and function calls
               • extensive libraries are available for linear algebra, abstract
                 algebra, machine learning – e.g., Matrix API, Algebird, etc.
               • significant investments by Twitter, Etsy, eBay, etc.
               • great for data services at scale
                 (imagine SOA infra @ Google as an open source project)
               • less learning curve than Cascalog,
                 not as much of a high-level language

             Cascalog and Scalding DSLs leverage the functional aspects
             of MapReduce, helping to limit complexity in process



Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
                     1. Funnel
                     2. Circa 2008
                     3. Cascading
                     4. Sample Code
                     5. Workflows
                     6. Abstraction
                     7. Trendlines


Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop.
Enterprise Data Workflows
            Back to our marketing funnel, let’s consider
            an example app… at the front end

            LOB use cases drive demand for apps

            [diagram labels: Customers; Web App; Cache; Logs; Support; Modeling; PMML; Analytics Cubes; Reporting; Data Workflow; source tap; sink tap; trap tap; Customer Prefs; customer profile DBs; Hadoop Cluster]




LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
             An example… in the back office

             Organizations have substantial investments
             in people, infrastructure, process




Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
              An example… for the heavy lifting!

              “Main Street” firms are migrating
              workflows to Hadoop, for cost
              savings and scale-out




“Main Street” firms have invested in Hadoop to address Big Data needs,
offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Cascading workflows – taps

               • taps integrate other data frameworks, as tuple streams
               • these are “plumbing” endpoints in the pattern language
               • sources (inputs), sinks (outputs), traps (exceptions)
               • text delimited, JDBC, Memcached,
                 HBase, Cassandra, MongoDB, etc.
               • data serialization: Avro, Thrift,
                 Kryo, JSON, etc.
               • extend a new kind of tap in just
                 a few lines of Java

             schema and provenance get
             derived from analysis of the taps




Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
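As a rough illustration of the three tap roles, the following plain-Java sketch models a source as an input list, a sink as an output list, and a trap as the place where records that fail processing end up. All names are hypothetical and none of this is the Cascading API; it only shows the idea that a bad record is set aside while the rest of the flow keeps going.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical plain-Java model of the tap roles; not Cascading API.
class TapSketch {
    // Runs each source record through a step. Results go to the sink;
    // records whose processing throws are diverted to the trap.
    static <I, O> void run(List<I> source, Function<I, O> step, List<O> sink, List<I> trap) {
        for (I record : source) {
            try {
                sink.add(step.apply(record));
            } catch (RuntimeException e) {
                trap.add(record);   // keep the bad record, let the flow continue
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("1", "2", "oops", "4");
        List<Integer> sink = new ArrayList<>();
        List<String> trap = new ArrayList<>();
        run(source, Integer::parseInt, sink, trap);
        System.out.println("sink=" + sink + " trap=" + trap);
    }
}
```

Keeping the endpoints separate from the processing step is what lets the same step read from, or write to, a different data framework without changes to the step itself.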
Cascading workflows – taps

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();

            source and sink taps for TSV data in HDFS



Here are the taps in the WordCount source
Cascading workflows – topologies

               • topologies execute workflows on clusters
               • flow planner is like a compiler for queries
                 - Hadoop (MapReduce jobs)
                 - local mode (dev/test or special config)
                 - in-memory data grids (real-time)
               • flow planner can be extended
                 to support other topologies

             blend flows in different topologies
             into the same app – for example,
             batch (Hadoop) + transactions (IMDG)




Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
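One way to picture the planner idea is that the logical flow stays the same while the thing that executes it gets swapped. The plain-Java sketch below is an analogy only: `Connector`, `LOCAL`, and `PARALLEL` are hypothetical names, and a parallel stream merely stands in for a cluster topology.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: these names are illustrative, not Cascading API.
class TopologySketch {
    // A connector decides how a logical step is physically executed.
    interface Connector {
        <I, O> List<O> execute(List<I> input, Function<I, O> step);
    }

    // Stand-in for local mode: a single thread, handy for dev/test.
    static final Connector LOCAL = new Connector() {
        @Override
        public <I, O> List<O> execute(List<I> input, Function<I, O> step) {
            return input.stream().map(step).collect(Collectors.toList());
        }
    };

    // Stand-in for a cluster topology: work is spread across threads.
    static final Connector PARALLEL = new Connector() {
        @Override
        public <I, O> List<O> execute(List<I> input, Function<I, O> step) {
            return input.parallelStream().map(step).collect(Collectors.toList());
        }
    };

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("rain", "shadow", "dry");
        Function<String, Integer> step = String::length;   // the logical flow
        // Same flow definition, two execution strategies, same result.
        System.out.println(LOCAL.execute(tokens, step));
        System.out.println(PARALLEL.execute(tokens, step));
    }
}
```

In the Cascading sample code the corresponding choice is which flow connector class gets instantiated, such as the `HadoopFlowConnector` shown on the slides.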
Cascading workflows – topologies

            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            // flow planner for the Apache Hadoop topology
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps (tab-delimited text with a header line)
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex to split "document" text lines into token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
             .addSource( docPipe, docTap )
             .addTailSink( wcPipe, wcTap );
            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();



Here is the flow planner for Hadoop in the WordCount source
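
To make the logic of that flow concrete, here is a rough plain-Java sketch of what the WordCount pipe assembly computes: split each text line on the same delimiter class, group by token, then count. This is not Cascading code. The class name `WordCountSketch`, the method `countWords`, and the sample lines are made up for illustration, and the sketch drops empty tokens as a simplification.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; runs in a single JVM with no Hadoop or Cascading.
class WordCountSketch {
    // Same delimiter class as the splitter in the slide:
    // space, square brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> countWords(List<String> textLines) {
        return textLines.stream()
            // roughly the Each + RegexSplitGenerator step: one line in, many tokens out
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // simplification of this sketch: skip empty tokens left by adjacent delimiters
            .filter(token -> !token.isEmpty())
            // roughly the GroupBy + Every(Count) step: group on token, count each group
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (dry area)", "rain, and a dry area.");
        System.out.println(countWords(lines));
    }
}
```

The point of the comparison: the Cascading version states the same three steps as pipes, and leaves it to the flow planner to decide how they run on a cluster.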
example topologies…




Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.

Several other widely used platforms would also be likely suspects for Cascading flow planners.
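
One way to picture the idea of one workflow with several topologies is the plain-Java sketch below. The logical flow (group by token, count) is written once, and two stand-in "planners" decide how it executes. Everything here is hypothetical: `Planner`, `LOCAL`, and `PARALLEL` are illustration names, not Cascading classes, and a parallel stream is only a loose analogy for a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not the Cascading planner API.
class TopologySketch {
    // Stand-in for a flow planner: it chooses how the logical flow gets executed.
    interface Planner {
        Map<String, Long> run(List<String> tokens);
    }

    // The logical flow is defined once, independent of how it will be executed.
    static Map<String, Long> groupAndCount(Stream<String> tokens) {
        return tokens.collect(
            Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // "Local mode" analogy: sequential execution in one thread.
    static final Planner LOCAL = tokens -> groupAndCount(tokens.stream());

    // "Cluster" analogy: the same flow, executed across multiple threads.
    static final Planner PARALLEL = tokens -> groupAndCount(tokens.parallelStream());

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("rain", "dry", "rain", "area", "dry", "rain");
        System.out.println(LOCAL.run(tokens));
        System.out.println(PARALLEL.run(tokens));
    }
}
```

Both planners return the same counts, which is the property that lets an app swap or blend topologies without rewriting its business logic.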
The Workflow Abstraction

  • 1. “The Workflow Abstraction” Strata SC 2013-02-28 Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Friday, 01 March 13 1 Background: dual in quantitative and distributed systems. I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
  • 2. The Workflow Abstraction Document Collection Scrub Tokenize token M 1. Funnel HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 2. Circa 2008 3. Cascading 4. Sample Code 5. Workflows 6. Abstraction 7. Trendlines Friday, 01 March 13 2 This talk is about the workflow abstraction: * the business process of structuring data * the practices of building robust apps at scale * the open source projects for Enterprise Data Workflows We’ll consider some theory, examples, best practices, trendlines -- what are the drivers that brought us, and where is this work heading toward? Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
  • 3. Marketing Funnel – overview In reference to Making Data Work… Customers Almost every business uses a model similar to this – give or take a few steps. Campaigns Customer leads go in at the top, Awareness those get refined through several stages, then results flow out the bottom. Interest Evalutation Conversion Referral Repeat Friday, 01 March 13 3 Let’s consider one of the most fundamental predictive models used in business: a marketing funnel. This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
  • 4. Marketing Funnel – clickstream Different funnel stages get represented in ecommerce by events captured in Customers log files, as a class of machine data called clickstream Campaigns Impression • ad impressions Awareness • URL clicks Click • landing page views Interest • new user registrations Sign Up Evalutation • session cookies Purchase • online purchases Conversion • social network activity "Like" • etc. Referral Repeat Friday, 01 March 13 4 Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
  • 5. Marketing Funnel – metrics A variety of clickstream metrics can be used as performance indicators Customers at different stages of the funnel: Campaigns • CPM: cost per thousand Impression • CTR: click-through rate Awareness CPM • CPA: cost per action Click • etc. Interest CTR Sign Up Evalutation behaviors Purchase Conversion CPA "Like" Referral NPS, social graph, etc. Repeat loyalty, win back, etc. Friday, 01 March 13 5 The many different highly-nuanced metrics which apply are mind-boggling :)
  • 6. Marketing Funnel – example calculations Customers Campaigns Awareness Interest metric cost events formula rate Evalutation Conversion Referral Repeat $4,000 CPM $4,000 10^6 ÷ $4.00 (10^6 ÷ 10^3) 3∙10^3 CTR - 3∙10^3 ÷ 10^6 0.3% $4,000 CPA - 20 ÷ $200 20 Friday, 01 March 13 6 Here are examples of the kinds of calculations performed...
  • 7. Marketing Funnel – predictive model Given these metrics, we can go further to estimate cost per paying user (CPP) Customers customer lifetime value (LTV), etc. Campaigns Then we can build a predictive model for return on investment (ROI) per customer, Awareness summarizing the funnel performance: ROI = (LTV − CPP) ∕ CPP Interest As an example, after crunching lots of logs, Evalutation suppose that… Conversion CPP = $200 LTV = $2000 Referral ROI = ($2000 − $200) ∕ $200 Repeat for a 9x multiple Friday, 01 March 13 7 For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers, which describes the efficiency of the marketing funnel at different stages.
  • 8. Marketing Funnel – example architecture Customers Campaigns Customers Awareness Let’s consider an example architecture Interest Evalutation for calculating, reporting, and taking action Web Conversion on funnel metrics, based on large-scale App Referral Repeat clickstream data… logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Friday, 01 March 13 8 Here’s an example architecture of using clickstream metrics within an online business.
  • 9. Marketing Funnel – complexities Multiple ad partners, different contracts terms, reporting different metrics at Customers × × different times, click scrubs, etc. Campaigns Campaigns target specific geo/demo, Impression × × test alternate landing pages, probably Awareness CPM need to segment customer base… Click These issues make clickstream data Interest CTR large and yet sparse. Sign Up Evalutation behaviors Other issues: × Purchase • seasonal variation Conversion CPA • fluctuating currency exchange rates "Like" Referral NPS, social graph, etc. • distortions due to credit card fraud • diminishing returns Repeat loyalty, win back, etc. • forecasting requirements Friday, 01 March 13 9 However, real life intercedes. In many businesses, this is a complicated model to calculate correctly. scrubs many vendors, data sources, different metrics to be aligned lots of roll-ups Bayesian point estimates forecasts and dashboards social dimension makes this convoluted not simple
  • 10. Marketing Funnel – very large scale
    Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.
    Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.
    The need for these insights has been a driver for Hadoop-related technologies.
    [Diagram: funnel stages with events and metrics, as on slide 9]
    Notes: The needs for large-scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
  • 11. Marketing Funnel – very large scale
    Callout: funnel modeling and optimization requires complex data workflows to obtain the required insights
    Notes: These needs imply complex data workflows. It’s not about doing a BI query or a pivot table; that’s how retailers were thinking when Amazon came along.
  • 12. The Workflow Abstraction
    1. Funnel  2. Circa 2008  3. Cascading  4. Sample Code  5. Workflows  6. Abstraction  7. Trendlines
    [Diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count, with map (M) and reduce (R) markers]
    Notes: A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
  • 13. Circa 2008 – Hadoop at scale
    Scenario: Analytics team at a large ad network…
    Company had invested $MM capex in a large data warehouse across LOBs.
    Mission-critical app had been written as a large SQL workflow in the DW.
    Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily.
    Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance.
    [Diagram: funnel stages; clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends]
    Notes: Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network. Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
  • 14. Circa 2008 – Hadoop at scale
    Issues:
    • critical app had hit hard limits for scalability
    • several TB of data, 100s of servers
    • batch window length vs. failure rate vs. SLA, in the context of business growth, posed an existential risk
    We built out a team to address these issues as rapidly as possible…
    Needed to re-create that data workflow based on Enterprise requirements.
    [Diagram: funnel stages; clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends]
    Notes: Marching orders: 5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City; 5 weeks to reverse engineer the mission-critical app without any access to its author; 5 weeks to implement a Hadoop version which could scale out on EC2. We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
  • 15. Circa 2008 – Hadoop at scale
    Approach:
    • reverse-engineered business process from ~1500 lines of undocumented SQL
    • created a large, multi-step Apache Hadoop app on AWS
    • leveraged cloud strategy to trade $MM capex for lower, scalable opex
    • Amazon identified our app as one of the largest Hadoop deployments on EC2
    • our app became a case study for AWS prior to Elastic MapReduce launch
    [Diagram: clickstream, msg queue, HDFS, query/load, RDBMS, roll-ups, collab filter, per-user recommends]
    Notes: Our solution involved dependencies among more than a dozen Hadoop job steps.
  • 16. Circa 2008 – Hadoop at scale
    Unresolved:
    • ETL was still a separate app
    • difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
    • data scientists wore beepers since Ops lacked visibility into business process
    • coding directly in MapReduce created a staffing bottleneck
    [Diagram: clickstream, msg queue, HDFS, query/load, RDBMS, roll-ups, collab filter, per-user recommends]
    Notes: This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM -- for troubleshooting, handling exceptions, notifications, etc. Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea. Three issues about Enterprise workflows:
    * staffing bottleneck unless there’s a good abstraction layer
    * operational complexity, mostly due to lack of transparency
    * system integration problems *are* the main problem to solve
  • 17. Circa 2008 – Hadoop at scale
    Callout: a good solution for a large, commercial Apache Hadoop deployment, but the app’s workflow management lacked crucial features… which led to a search for a better workflow abstraction
    Notes: While leading this team, I sought out other ways of managing a complex workflow involving Hadoop. I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
  • 18. The Workflow Abstraction
    1. Funnel  2. Circa 2008  3. Cascading  4. Sample Code  5. Workflows  6. Abstraction  7. Trendlines
    [Diagram: same flow as on slide 12]
    Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
  • 19. Cascading – origins
    API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.
    Wensel was following the Nutch open source project – before Hadoop even had a name.
    He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.
    Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without any abstraction layer.
  • 20. Cascading – functional programming
    Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
    To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
    • leverages JVM and Java-based tools without a need to create an entirely new language
    • allows many programmers who have J2EE expertise to build apps that leverage the economics of Hadoop clusters
    Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
  • 21. quotes…
    “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
    CIO, Thor Olavsrud, 2012-06-06
    cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
    “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
    2012 BOSSIE Awards, James Borck, 2012-09-18
    infoworld.com/slideshow/65089
    Notes: Industry analysts are picking up on the staffing costs related to Hadoop: “no free lunch”. The issues:
    * staffing bottleneck
    * operational complexity
    * system integration
  • 22. Cascading – deployments
    • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
    • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
    • 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
    • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
    Notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 23. examples…
    • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
    • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
      Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
      Scalding in Scala (2012): github.com/twitter/scalding/wiki
    Notes: Many case studies, many Enterprise production deployments now for 5+ years.
  • 24. examples…
    Callout: Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals
    Notes: Cascading as a basis for workflow abstraction, for Enterprise data workflows.
  • 25. The Workflow Abstraction
    1. Funnel  2. Circa 2008  3. Cascading  4. Sample Code  5. Workflows  6. Abstraction  7. Trendlines
    [Diagram: same flow as on slide 12]
    Notes: Code samples in Cascading / Cascalog / Scalding, based on Word Count.
  • 26. The Ubiquitous Word Count
    Definition: count how often each word appears in a collection of text documents
    This simple program provides an excellent test case for parallel processing, since it:
    • requires a minimal amount of code
    • demonstrates use of both symbolic and numeric values
    • shows a dependency graph of tuples as an abstraction
    • is not many steps away from useful search indexing
    • serves as a “Hello World” for Hadoop apps
    Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
    [Diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
    Notes: Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already... Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
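The map and reduce pseudocode can be traced on a single machine with plain Java collections. This is a sketch only: the class `WordCountSketch`, the `\W+` tokenizer standing in for `segment`, and the in-memory grouping that plays the role of the shuffle are assumptions for illustration. It does not use Hadoop or Cascading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-process sketch; names and tokenizer are assumptions, not from the talk.
final class WordCountSketch {

    /** Map phase: emit one token per word; each token stands for the pair (word, 1). */
    static List<String> map(String text) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.split("\\W+")) {
            if (!w.isEmpty()) {
                tokens.add(w);
            }
        }
        return tokens;
    }

    /** Reduce phase: sum the partial counts collected for one word. */
    static int reduce(List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    /** Runs map over every document, groups by word (the "shuffle"), then reduces each group. */
    static Map<String, Integer> wordCount(List<String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String doc : docs) {
            for (String w : map(doc)) {
                groups.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }
}
```

In a distributed framework, `map` runs in parallel across document splits and the grouping step moves data between machines, which is where most of the cost lies.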
  • 27. word count – conceptual flow diagram
    [Diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
    1 map, 1 reduce, 18 lines of code
    cascading.org/category/impatient
    gist.github.com/3900702
    Notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
  • 28. word count – Cascading app in Java
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
    [Diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
    Notes: Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd to last line generates a DOT file for the flow diagram.
  • 29. word count – generated flow diagram
    [Diagram: the generated flow plan, nodes listed from head to tail, with the tuple fields shown on each edge]
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
        [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text']
      (map) Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
        [{1}:'token'] [{1}:'token']
      GroupBy('wc')[by:['token']]
        wc[{1}:'token'] [{1}:'token']
      (reduce) Every('wc')[Count[decl:'count']]
        [{2}:'token', 'count'] [{1}:'token']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
        [{2}:'token', 'count'] [{2}:'token', 'count']
      [tail]
    Notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
  • 30. word count – Cascalog / Clojure
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
      ; github.com/Quantisan/Impatient
    [Diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
    Notes: Here is the same Word Count app written in Clojure, using Cascalog.
  • 31. word count – Cascalog / Clojure
    github.com/nathanmarz/cascalog/wiki
    • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
    • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
    • composable subqueries, used for test-driven development (TDD) practices at scale
    • Leiningen build: simple, no surprises, in Clojure itself
    • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
    • has a learning curve, limited number of Clojure developers
    • aggregators are the magic, and those take effort to learn
    Notes: From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
  • 32. word count – Scalding / Scala
      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
    [Diagram: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
    Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
  • 33. word count – Scalding / Scala
    github.com/twitter/scalding/wiki
    • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
    • code is compact, easy to understand
    • nearly 1:1 between elements of conceptual flow diagram and function calls
    • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
    • significant investments by Twitter, Etsy, eBay, etc.
    • great for data services at scale
    • less learning curve than Cascalog, not as much of a high-level language
    Notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
  • 34. word count – Scalding / Scala
    Callout: Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
    • great for data services at scale (imagine SOA infra @ Google as an open source project)
    Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
  • 35. The Workflow Abstraction
    1. Funnel  2. Circa 2008  3. Cascading  4. Sample Code  5. Workflows  6. Abstraction  7. Trendlines
    [Diagram: same flow as on slide 12]
    Notes: Tracking back to the Marketing Funnel as an example workflow… Let’s consider how Cascading apps incorporate other components beyond Hadoop.
  • 36. Enterprise Data Workflows
    Back to our marketing funnel, let’s consider an example app… at the front end
    LOB use cases drive demand for apps
    [Diagram: example architecture, as on slide 8]
    Notes: LOB use cases drive the demand for Big Data apps.
  • 37. Enterprise Data Workflows
    An example… in the back office
    Organizations have substantial investments in people, infrastructure, process
    [Diagram: example architecture, as on slide 8]
    Notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
  • 38. Enterprise Data Workflows
    An example… for the heavy lifting!
    “Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
    [Diagram: example architecture, as on slide 8]
    Notes: “Main Street” firms have invested in Hadoop to address Big Data needs, offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 39. Cascading workflows – taps
    • taps integrate other data frameworks, as tuple streams
    • these are “plumbing” endpoints in the pattern language
    • sources (inputs), sinks (outputs), traps (exceptions)
    • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
    • data serialization: Avro, Thrift, Kryo, JSON, etc.
    • extend a new kind of tap in just a few lines of Java
    Callout: schema and provenance get derived from analysis of the taps
    [Diagram: example architecture, as on slide 8]
    Notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
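The three tap roles (source, sink, trap) can be pictured with a small plain-Java sketch. It is only an analogy under stated assumptions: the class `TapSketch` and its `run` method are hypothetical and are not the Cascading API. The sketch reads records from a source, writes results to a sink, and diverts any record whose operation throws into a trap instead of failing the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrative analogy only; TapSketch and run are hypothetical names, not Cascading classes.
final class TapSketch {

    /** Applies op to each source record; results go to the sink, failing records to the trap. */
    static <I, O> void run(Iterable<I> source, Function<I, O> op, List<O> sink, List<I> trap) {
        for (I record : source) {
            try {
                sink.add(op.apply(record));
            } catch (RuntimeException e) {
                trap.add(record); // keep the bad record for later inspection
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> sink = new ArrayList<>();
        List<String> trap = new ArrayList<>();
        run(Arrays.asList("1", "x", "3"), Integer::parseInt, sink, trap);
        System.out.println(sink + " " + trap); // prints [1, 3] [x]
    }
}
```

Keeping the failing records as data, rather than only as log lines, is what lets exception handling stay inside the workflow definition.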
  • 40. Cascading workflows – taps
    The Word Count source from slide 28, with the tap definitions called out:
      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    Callout: source and sink taps for TSV data in HDFS
    Notes: Here are the taps in the WordCount source.
  • 41. Cascading workflows – topologies
    • topologies execute workflows on clusters
    • flow planner is like a compiler for queries
      - Hadoop (MapReduce jobs)
      - local mode (dev/test or special config)
      - in-memory data grids (real-time)
    • flow planner can be extended to support other topologies
    Callout: blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
    [Diagram: example architecture, as on slide 8]
    Notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
  • 42. Cascading workflows – topologies
    The Word Count source from slide 28, with the flow planner called out:
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
      Flow wcFlow = flowConnector.connect( flowDef );
    Callout: flow planner for Apache Hadoop topology
    Notes: Here is the flow planner for Hadoop in the WordCount source.
  • 43. example topologies…
    Notes: Here are some examples of topologies for distributed computing -- Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.