2. WHO AM I?
• Lead developer of the Cascading open-source project
• Founder of Concurrent, Inc.
• Involved with Apache Hadoop since it was called Apache Nutch
• Systems Architect, not a Data Scientist
3. For creating data-oriented applications, frameworks,
and languages [on Apache Hadoop].
Originally designed to hide the complexity of Hadoop and
prevent thinking in MapReduce.
cascading.org
4. SOME STATS
• Started in 2007
• 2.0 released June 2012
• 2.5 out now
• 3.0 WIP (if you look for it)
• Apache 2.0 licensed
• Supports all Hadoop distros
6. • Cascading Java API
• Data normalization and cleansing of search and click-through
logs for use by analytics tools
• Easy to operationalize the heavy lifting of data
7. THE CLIMATE CORPORATION
• Cascalog (Clojure)
• Weather-pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
8. TWITTER
• Scalding (Scala)
• Machine learning (linear algebra) to improve:
‣ User experience
‣ Ad quality (matching users and ad effectiveness)
• All revenue applications run on Cascading/Scalding
• IPO
9. THE DURKHEIM PROJECT
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
11. CASCADING
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Architecture diagram: the Cascading Processing, Integration, and Scheduler APIs sit between the Process Planner/Scheduler and Apache Hadoop, connecting data stores, scripting languages (Scala, Clojure, JRuby, Jython, Groovy), and enterprise Java.]
12. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc.
‣ Rolling windows
[Diagram: a Pipeline chaining filters and functions over data, and a Topology showing Split, Join, and Merge.]
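The patterns above map onto familiar collection operations. A minimal plain-Java sketch (no Cascading; class and method names are illustrative) showing a filter, a function, grouping, and a count aggregation in one pipeline:

```java
import java.util.*;
import java.util.stream.*;

public class Patterns {
    // filter -> function -> grouping -> aggregation, as in the patterns list
    public static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+"))) // function: tokenize each line
            .filter(token -> !token.isEmpty())                  // filter: drop empty tokens
            .collect(Collectors.groupingBy(t -> t,              // grouping by token
                     Collectors.counting()));                   // aggregation: count
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("a b", "b c b")));
    }
}
```

Cascading expresses the same shape as a graph of pipes rather than a single in-memory stream, which is what lets the planner map it onto cluster jobs.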
13. word count – Cascading Java API

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work
wcFlow.writeDOT( "wc.dot" );                    // <<-- on next slide
wcFlow.complete();                              // <<-- runs jobs on the cluster
[Numbered callouts on the code map its sections to configuration, integration, processing, and scheduling concerns.]
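The token-splitting character class used by RegexSplitGenerator can be tried on its own. A standalone plain-Java sketch (not part of the Impatient code; names are illustrative):

```java
import java.util.Arrays;

public class TokenizerDemo {
    // Same character class as the Cascading example:
    // space, square brackets, parentheses, comma, period
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    public static String[] tokens(String text) {
        return Arrays.stream(text.split(DELIMITERS))
                     .filter(t -> !t.isEmpty()) // splitting leaves empty strings between adjacent delimiters
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("Hadoop (word count), a classic.")));
        // -> [Hadoop, word, count, a, classic]
    }
}
```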
17. word count – Scalding (Scala)

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
18. word count – Cascalog (Clojure)

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
19. "FOR THE IMPATIENT" SERIES
• Step-by-step tutorials on Cascading on GitHub
• Community has ported them to Scalding and Cascalog
• http://docs.cascading.org/impatient/
20. WHY CASCADING?
• Foundation of patterns and best practices for building
languages, frameworks, and applications
• Designed to abstract Hadoop away from the business logic
• Other models than MapReduce on the way!
21. LINGUAL
• ANSI-compatible SQL
• JDBC driver
• Cascading Java API
• SQL command shell
• Catalog manager tool
• Data Provider API
[Architecture diagram: the Lingual query planner, JDBC API, Provider API, and catalog sit on Cascading over Apache Hadoop, connecting data stores, the CLI/shell, and enterprise Java.]
25.

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
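The hire_age arithmetic above (days between birth date and hire date, divided by 365.25 to approximate years) translates directly to any JVM client of the same JDBC connection. A minimal Java sketch of just that calculation (class and method names are illustrative):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class HireAge {
    // Same arithmetic as the R line:
    // as.integer(as.Date(HIRE_DATE) - as.Date(BIRTH_DATE)) / 365.25
    public static double hireAge(LocalDate birth, LocalDate hire) {
        return ChronoUnit.DAYS.between(birth, hire) / 365.25;
    }

    public static void main(String[] args) {
        System.out.println(hireAge(LocalDate.of(1960, 1, 1), LocalDate.of(1990, 1, 1)));
    }
}
```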
28. DATA PROVIDER API
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a "fat jar" on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The jar is dynamically loaded into the cluster
30. WHY LINGUAL
• Quickly migrate existing workloads from an RDBMS to Hadoop
• Quickly extract data from Hadoop into applications
31. PATTERN
• Predictive model scoring
• Java API and PMML parser
• Supports:
‣ (General) Regression
‣ Clustering
‣ Decision Trees
‣ Random Forest
‣ and ensembles of models
[Architecture diagram: the Pattern PMML parser and Pattern API sit on Cascading over Apache Hadoop, connecting data stores and enterprise Java.]
33. WHY PATTERN
• Standards compliance provides integration with many tools
• Models are independent of data and integration
• You debug only Cascading, not an ensemble of applications
35. DRIVEN
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated in a failure
• Plugin available via a Maven repo
• Beta UI hosted online
• http://cascading.io/driven/
38. CASCADING 3.0
• New query planner
‣ User-definable assertion and transformation rules
‣ Sub-graph isomorphism pattern matching
‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75
• Apache Tez support
• And likely other platforms