1. “The Workflow Abstraction”
Strata SC
2013-02-28
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright © 2013, Concurrent, Inc.
Friday, 01 March 13 1
Background: dual in quantitative and distributed systems.
I’ve spent the past decade leading innovative Data teams responsible for many successful large-scale apps -
2. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
[Figure: flow diagram for Word Count with a stop word filter: Document Collection, Scrub token, Tokenize, Regex token, HashJoin Left (RHS: Stop Word List), GroupBy token, Count, Word Count, with M and R markers]
This talk is about the workflow abstraction:
* the business process of structuring data
* the practices of building robust apps at scale
* the open source projects for Enterprise Data Workflows
We’ll consider some theory, examples, best practices, trendlines --
what are the drivers that brought us here, and where is this work heading?
Most of all, make it easy for people from all kinds of backgrounds to build Enterprise Data Workflows -- robust apps at scale -- for Hadoop and beyond.
3. Marketing Funnel – overview
In reference to Making Data Work…
Almost every business uses a model similar to this – give or take a few steps.
Customer leads go in at the top, those get refined through several stages, then results flow out the bottom.
[Figure: marketing funnel: Customers, Campaigns, Awareness, Interest, Evaluation, Conversion, Referral, Repeat]
Let’s consider one of the most fundamental predictive models used in business: a marketing funnel.
This is an exercise which I’ve had to run through at nearly every firm in recent years -- analytics for the marketing funnel.
4. Marketing Funnel – clickstream
Different funnel stages get represented in ecommerce by events captured in log files, as a class of machine data called clickstream:
• ad impressions
• URL clicks
• landing page views
• new user registrations
• session cookies
• online purchases
• social network activity
• etc.
[Figure: marketing funnel stages annotated with events: Impression, Click, Sign Up, Purchase, "Like"]
Online advertising involves what we call “clickstream” data, lots of events in log files -- i.e., lots of unstructured data.
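To make the idea of "events in log files" concrete, here is a minimal Java sketch that tallies clickstream events by type. The tab-separated layout (timestamp, user id, event type) and the event names are assumptions made for this illustration; they are not a log format described in the talk.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tally clickstream events by type.
final class ClickstreamTally {

    // Assumed log layout: timestamp <TAB> user_id <TAB> event_type
    static Map<String, Long> countByEvent(List<String> logLines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip malformed lines rather than failing the whole batch
            }
            counts.merge(fields[2], 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "2013-02-28T10:00:00\tu1\timpression",
            "2013-02-28T10:00:05\tu1\tclick",
            "2013-02-28T10:01:00\tu2\timpression",
            "garbled line",
            "2013-02-28T10:02:00\tu1\tpurchase");
        System.out.println(countByEvent(log)); // {impression=2, click=1, purchase=1}
    }
}
```

Counts like these are the raw inputs for funnel metrics; at real scale the same tally would run as a distributed job rather than over an in-memory list.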
5. Marketing Funnel – metrics
A variety of clickstream metrics can be used as performance indicators at different stages of the funnel:
• CPM: cost per thousand impressions
• CTR: click-through rate
• CPA: cost per action
• etc.
[Figure: marketing funnel annotated with events and metrics: Impression; Awareness (CPM); Click; Interest (CTR); Sign Up; Evaluation (behaviors); Purchase; Conversion (CPA); "Like"; Referral (NPS, social graph, etc.); Repeat (loyalty, win back, etc.)]
The many different highly-nuanced metrics which apply are mind-boggling :)
6. Marketing Funnel – example calculations
metric  cost    events  formula                  rate
CPM     $4,000  10^6    $4,000 ÷ (10^6 ÷ 10^3)   $4.00
CTR     -       3∙10^3  3∙10^3 ÷ 10^6            0.3%
CPA     -       20      $4,000 ÷ 20              $200
[Figure: marketing funnel stages, as on slide 3]
Here are examples of the kinds of calculations performed...
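The slide 6 arithmetic can be sketched in a few lines of Java. The class and method names below are hypothetical, chosen for this illustration; the inputs are the figures from the slide ($4,000 spend, 10^6 impressions, 3∙10^3 clicks, 20 actions).

```java
// Hypothetical helper for the three metrics on slide 6.
final class FunnelMetrics {

    // CPM: cost per thousand impressions
    static double cpm(double cost, long impressions) {
        return cost / (impressions / 1000.0);
    }

    // CTR: clicks divided by impressions, as a fraction (0.003 means 0.3%)
    static double ctr(long clicks, long impressions) {
        return (double) clicks / impressions;
    }

    // CPA: cost per action (for example, per purchase)
    static double cpa(double cost, long actions) {
        return cost / actions;
    }

    public static void main(String[] args) {
        System.out.printf("CPM = $%.2f%n", cpm(4000.0, 1_000_000L));
        System.out.printf("CTR = %.1f%%%n", 100.0 * ctr(3_000L, 1_000_000L));
        System.out.printf("CPA = $%.2f%n", cpa(4000.0, 20L));
    }
}
```

Run as written, this should reproduce the $4.00 CPM, 0.3% CTR, and $200 CPA from the slide.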
7. Marketing Funnel – predictive model
Given these metrics, we can go further to estimate cost per paying user (CPP), customer lifetime value (LTV), etc.
Then we can build a predictive model for return on investment (ROI) per customer, summarizing the funnel performance:
ROI = (LTV − CPP) ∕ CPP
As an example, after crunching lots of logs, suppose that…
CPP = $200
LTV = $2000
ROI = ($2000 − $200) ∕ $200
for a 9x multiple
[Figure: marketing funnel stages, as on slide 3]
For applications within a business, we can use these calculated metrics to create a predictive model for the profitability of customers,
which describes the efficiency of the marketing funnel at different stages.
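As a small worked example of the ROI formula from slide 7, here is a Java sketch. The class name and the second scenario (CPP doubling to $400) are assumptions added for illustration; only the $200 / $2000 case comes from the slide.

```java
// Hypothetical sketch of ROI = (LTV - CPP) / CPP from slide 7.
final class FunnelRoi {

    static double roi(double ltv, double cpp) {
        if (cpp <= 0.0) {
            throw new IllegalArgumentException("CPP must be positive");
        }
        return (ltv - cpp) / cpp;
    }

    public static void main(String[] args) {
        System.out.println(roi(2000.0, 200.0)); // 9.0, the 9x multiple from the slide
        System.out.println(roi(2000.0, 400.0)); // 4.0, if acquisition cost doubled (assumed scenario)
    }
}
```

The second line shows why the funnel's efficiency matters: with LTV held fixed, doubling the cost per paying user cuts the multiple by more than half.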
8. Marketing Funnel – example architecture
Let’s consider an example architecture for calculating, reporting, and taking action on funnel metrics, based on large-scale clickstream data…
[Figure: example architecture: Customers, Web App, logs, Cache, Support; source, sink, and trap taps around a Data Workflow on a Hadoop Cluster; Modeling (PMML); Analytics Cubes; Customer Prefs; customer profile DBs; Reporting]
Here’s an example architecture of using clickstream metrics within an online business.
9. Marketing Funnel – complexities
Multiple ad partners, different contract terms, reporting different metrics at different times, click scrubs, etc.
Campaigns target specific geo/demo, test alternate landing pages, probably need to segment customer base…
These issues make clickstream data large and yet sparse.
Other issues:
• seasonal variation
• fluctuating currency exchange rates
• distortions due to credit card fraud
• diminishing returns
• forecasting requirements
[Figure: marketing funnel with metrics, as on slide 5, with several stages marked ×]
However, real life intercedes. In many businesses, this is a complicated model to calculate correctly.
* scrubs
* many vendors, data sources, different metrics to be aligned
* lots of roll-ups
* Bayesian point estimates
* forecasts and dashboards
* social dimension makes this convoluted
* not simple
10. Marketing Funnel – very large scale
Even a small start-up may need to make decisions about billions of events, many millions of users, and millions of dollars in annual ad spend.
Ad networks attempt to simplify and optimize parts of the funnel process as a value-add.
The need for these insights has been a driver for Hadoop-related technologies.
[Figure: marketing funnel with metrics, as on slide 5]
The needs for large scale funnel modeling and optimization have been drivers for MapReduce, Hadoop, and related “Big Data” technologies.
11. Marketing Funnel – very large scale
funnel modeling and optimization requires complex data workflows to obtain the required insights
These needs imply complex data workflows.
It’s not about doing a BI query or a pivot table;
that’s how retailers were thinking when Amazon came along.
12. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
[Figure: Word Count flow diagram, as on slide 2]
A personal history of ad networks, Apache Hadoop apps, and Enterprise data workflows, circa 2008.
13. Circa 2008 – Hadoop at scale
Scenario: Analytics team at a large ad network…
Company had invested $MM capex in a large data warehouse across LOBs
Mission-critical app had been written as a large SQL workflow in the DW
Marketing funnel metrics were estimated for many advertisers, many campaigns, many publishers, many customers – billions of calculations daily
Predictive models matched publisher ~ advertiser and campaign ~ user, to optimize marketing funnel performance
[Figure: marketing funnel stages; data pipeline with clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends]
Experience with a large marketing funnel optimization problem, as Director of Analytics at an ad network.
Most of the revenue depended on one app, written in a DW -- monolithic SQL which nobody at the company understood.
14. Circa 2008 – Hadoop at scale
Issues:
• critical app had hit hard limits for scalability
• several TB of data, 100s of servers
• batch window length vs. failure rate vs. SLA in the context of business growth posed an existential risk
We built out a team to address these issues as rapidly as possible…
Needed to re-create that data workflow based on Enterprise requirements.
[Figure: marketing funnel stages; data pipeline with clickstream, query/load, RDBMS, roll-ups, collab filter, per-user recommends, with one component marked ×]
Marching orders:
5 weeks to build a Data Science team of 10 (mostly Stats PhDs and DevOps) in Kansas City;
5 weeks to reverse engineer the mission-critical app without any access to its author;
5 weeks to implement a Hadoop version which could scale-out on EC2.
We had a great team, the members of which have moved on to senior roles at Apple, Facebook, Merkle, Quantcast, IMVU, etc.
15. Circa 2008 – Hadoop at scale
Approach:
• reverse-engineered business process from ~1500 lines of undocumented SQL
• created a large, multi-step Apache Hadoop app on AWS
• leveraged cloud strategy to trade $MM capex for lower, scalable opex
• Amazon identified our app as one of the largest Hadoop deployments on EC2
• our app became a case study for AWS prior to Elastic MapReduce launch
[Figure: data pipeline with clickstream, query/load, RDBMS, msg queue, HDFS, roll-ups, collab filter, per-user recommends]
Our solution involved dependencies among more than a dozen Hadoop job steps.
16. Circa 2008 – Hadoop at scale
Unresolved:
• ETL was still a separate app
• difficult to handle exceptions, notifications, debugging, etc., across the entire workflow
• data scientists wore beepers since Ops lacked visibility into business process
• coding directly in MapReduce created a staffing bottleneck
[Figure: data pipeline as on slide 15, with several components marked ×]
This underscores the need for a unified space for the entire data workflow, visible to the compiler and JVM --
for troubleshooting, handling exceptions, notifications, etc.
Otherwise, for apps at scale, Ops will give up and force the data scientists to wear beepers 24/7, which is almost never a good idea.
Three issues about Enterprise workflows:
* staffing bottleneck unless there’s a good abstraction layer
* operational complexity, mostly due to lack of transparency
* system integration problems *are* the main problem to solve
17. Circa 2008 – Hadoop at scale
a good solution for a large, commercial Apache Hadoop deployment, but the app’s workflow management lacked crucial features… which led to a search for a better workflow abstraction
While leading this team, I sought out other ways of managing a complex workflow involving Hadoop.
I found out about the Cascading open source project, and called the API author. Oddly enough, as I was walking into the interview for my next job, we passed each other in the parking lot.
18. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
[Figure: Word Count flow diagram, as on slide 2]
Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
19. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for several popular
data products.
Wensel was following the Nutch open source project –
before Hadoop even had a name.
He noted that it would become difficult to find Java
developers to write complex Enterprise apps directly
in Apache Hadoop – a potential blocker for leveraging
this new open source technology.
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name.
API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
20. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any need
to create an entirely new language
• allows many programmers who have J2EE expertise
to build apps that leverage the economics of Hadoop
clusters
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
21. quotes…
“Cascading gives Java developers the ability to build
Big Data applications on Hadoop using their existing
skillset … Management can really go out and build a
team around folks that are already very experienced
with Java. Switching over to this is really a very short
exercise.”
CIO, Thor Olavsrud
2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
“Masks the complexity of MapReduce, simplifies the
programming, and speeds you on your journey toward
actionable analytics … A vast improvement over native
MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck
2012-09-18
infoworld.com/slideshow/65089
Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”
The issues:
* staffing bottleneck
* operational complexity
* system integration
22. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• partners: Amazon AWS, Microsoft Azure, Hortonworks,
MapR, EMC, SpringSource, Cloudera
• 5+ year history of Enterprise production deployments,
ASL 2 license, GitHub src, http://conjars.org
• use cases: ETL, marketing funnel, anti-fraud, social media,
retail pricing, search analytics, recommenders, eCRM,
utility grids, genomics, climatology, etc.
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.
Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
23. examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
in functional programming open source projects atop
Cascading – used for their large-scale production
deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Many case studies, many Enterprise production deployments now for 5+ years.
24. examples…
Cascading as the basis for workflow abstractions atop Hadoop and more, with a 5+ year history of production deployments across multiple verticals
Cascading as a basis for workflow abstraction, for Enterprise data workflows
25. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
[Figure: Word Count flow diagram, as on slide 2]
Code samples in Cascading / Cascalog / Scalding, based on Word Count
26. The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents
This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

[Figure: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
Due to my close ties to Freemasonry, I’m obligated to speak about WordCount at this point.
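The map and reduce pseudocode on slide 26 can be sketched as plain Java that runs in memory, with no Hadoop involved. This is an illustration under stated assumptions: the class name, the lowercasing step, and the `\W+` tokenizer are choices made here, since the slide leaves `segment(text)` unspecified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the map and reduce steps; not a Hadoop or Cascading program.
final class WordCountSketch {

    // map: emit a (word, 1) pair for each token in the text
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // assumed tokenizer: lowercase, split on runs of non-word characters
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // reduce: sum the partial counts grouped under one word
    static int reduce(String word, List<Integer> group) {
        int count = 0;
        for (int pc : group) {
            count += pc;
        }
        return count;
    }

    // driver: map every document, group the pairs by word, then reduce each group
    static Map<String, Integer> run(Map<String, String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "doc01", "A rain shadow is a dry area",
            "doc02", "on the lee side of a mountain area");
        System.out.println(run(docs));
    }
}
```

The grouping step in `run` stands in for the shuffle that a MapReduce framework performs between the two phases; that is the part a cluster distributes across machines.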
27. word count – conceptual flow diagram
[Figure: Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
1 map
1 reduce
18 lines code
cascading.org/category/impatient
gist.github.com/3900702
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
28. word count – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop
2nd to last line: generates a DOT file for the flow diagram
29. word count – generated flow diagram
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
[{1}:'token']
[{1}:'token']
GroupBy('wc')[by:['token']]
wc[{1}:'token']
[{1}:'token']
reduce
Every('wc')[Count[decl:'count']]
[{2}:'token', 'count']
[{1}:'token']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[{2}:'token', 'count']
[{2}:'token', 'count']
[tail]
As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
30. word count – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
Here is the same Word Count app written in Clojure, using Cascalog.
31. word count – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
From what we see about language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.
Great for large-scale, complex apps, where small teams must limit the complexities in their process.
32. word count – Scalding / Scala
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
Here is the same Word Count app written in Scala, using Scalding.
Very compact, easy to understand; however, also more imperative than Cascalog.
33. word count – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less learning curve than Cascalog,
not as much of a high-level language
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
34. word count – Scalding / Scala
Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
• great for data services at scale (imagine SOA infra @ Google as an open source project)
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
35. The Workflow Abstraction
1. Funnel
2. Circa 2008
3. Cascading
4. Sample Code
5. Workflows
6. Abstraction
7. Trendlines
[Figure: Word Count flow diagram, as on slide 2]
Tracking back to the Marketing Funnel as an example workflow…
Let’s consider how Cascading apps incorporate other components beyond Hadoop
36. Enterprise Data Workflows
Back to our marketing funnel, let’s consider an example app… at the front end
LOB use cases drive demand for apps
[Figure: example architecture, as on slide 8]
LOB use cases drive the demand for Big Data apps
37. Enterprise Data Workflows
An example… in the back office
Organizations have substantial investments in people, infrastructure, process
[Figure: example architecture, as on slide 8]
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
38. Enterprise Data Workflows
An example… for the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
[Figure: example architecture, as on slide 8]
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
39. Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
schema and provenance get derived from analysis of the taps
[Figure: example architecture, as on slide 8]
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
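To illustrate the roles of the three endpoint kinds named on slide 39 (sources, sinks, traps), here is a plain-Java sketch. It is not the Cascading Tap API: the class, the method `runFlow`, and the in-memory lists standing in for external systems are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of source, sink, and trap endpoints.
// This is not the Cascading Tap API; the names here are invented for the sketch.
final class TapSketch {

    // Apply `op` to each record from the source. Successful results go to the sink;
    // records whose processing throws are diverted to the trap instead of failing the run.
    static <I, O> void runFlow(Iterable<I> source, Function<I, O> op,
                               List<O> sink, List<I> trap) {
        for (I record : source) {
            try {
                sink.add(op.apply(record));
            } catch (RuntimeException e) {
                trap.add(record);
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = List.of("10", "20", "oops", "30");
        List<Integer> sink = new ArrayList<>();
        List<String> trap = new ArrayList<>();
        runFlow(source, Integer::parseInt, sink, trap);
        System.out.println("sink = " + sink); // sink = [10, 20, 30]
        System.out.println("trap = " + trap); // trap = [oops]
    }
}
```

The design point is that bad records become data routed to a known place, so one malformed input does not abort a long batch and Ops can inspect the trap afterward.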
40. Cascading workflows – taps
From the Word Count app on slide 28, the source and sink taps for TSV data in HDFS:
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
Here are the taps in the WordCount source
41. Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
[Figure: example architecture, as on slide 8]
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
42. Cascading workflows – topologies
From the Word Count app on slide 28, the flow planner for the Apache Hadoop topology:
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
Here is the flow planner for Hadoop in the WordCount source
43. example topologies…
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.
Several other widely used platforms would also be likely suspects for Cascading flow planners.