A QUICK INTRODUCTION TO
THE CASCADING ECOSYSTEM
Chris K Wensel | Hadoop Summit EU 2014
• Lead developer of the Cascading open-source project
• Founder of Concurrent, Inc.
• Involved with Apache Hadoop since it was called Apache Nutch
• Systems Architect, not a Data Scientist
WHO AM I?
2
3
For creating data oriented applications, frameworks,
and languages [on Apache Hadoop]
Originally designed to hide complexity of Hadoop and
prevent thinking in MapReduce
cascading.org
• Started in 2007
• 2.0 released June 2012
• 2.5 out now
• 3.0 WIP (if you look for it)
• Apache 2.0 Licensed
• Supports all Hadoop distros
SOME STATS
4
5
What’s it used for?
6
• Cascading Java API
• Data normalization and cleansing of search and click-through
logs for use by analytics tools
• Easy to operationalize heavy lifting of data
7
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
8
• Scalding (Scala)
• Machine learning (linear algebra) to improve
• User experience
• Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
• IPO
TWITTER
9
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
KEY PROJECTS
10
Lingual Pattern
Cascading
Apache Hadoop
Scalding Cascalog
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
11
Process Planner
Processing API Integration API
Scheduler API
Scheduler
Apache Hadoop
Cascading
Data Stores
Scripting
Scala, Clojure, JRuby, Jython, Groovy
Enterprise Java
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc
‣ Rolling windows
SOME COMMON PATTERNS
12
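To make the pattern names above concrete, here is a minimal sketch in plain Java streams. It does not use the Cascading API. The `Click` type, the `countryByUser` lookup, and the method name are hypothetical, invented only for this illustration. It chains a filter, an inner join against a lookup table, a grouping, and an average aggregation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of the patterns; none of these types come from Cascading.
final class CommonPatternsSketch {

    // Hypothetical input record: one page view and how long it took to serve.
    static final class Click {
        final String userId;
        final int millis;

        Click(String userId, int millis) {
            this.userId = userId;
            this.millis = millis;
        }
    }

    // Filter -> inner join -> grouping -> aggregation, in one pass over the clicks.
    static Map<String, Double> averageMillisByCountry(List<Click> clicks,
                                                      Map<String, String> countryByUser) {
        return clicks.stream()
            // Filter: discard records without a usable timing.
            .filter(click -> click.millis > 0)
            // Inner join: keep only clicks whose user exists on the other side.
            .filter(click -> countryByUser.containsKey(click.userId))
            // Grouping by the joined-in country, then an Average aggregation per group.
            .collect(Collectors.groupingBy(
                (Click click) -> countryByUser.get(click.userId),
                TreeMap::new,
                Collectors.averagingInt((Click click) -> click.millis)));
    }

    public static void main(String[] args) {
        List<Click> clicks = List.of(
            new Click("u1", 100), new Click("u1", 300),
            new Click("u2", 50), new Click("u2", 0),
            new Click("u3", 70)); // u3 has no country and is dropped by the join
        Map<String, String> countryByUser = Map.of("u1", "DE", "u2", "FR");
        System.out.println(averageMillisByCountry(clicks, countryByUser)); // prints {DE=200.0, FR=50.0}
    }
}
```

In Cascading the same steps would be expressed as pipe operations over tuples and planned onto the cluster. The sketch only shows what each pattern does to the data.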
[Diagram: a linear Pipeline (data -> function -> filter -> function) and a Topology (data flowing through filters and functions that Split, Join, and Merge)]
13
word count – Cascading Java API

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (tab-delimited text with a header line)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
wcFlow.complete(); // <<-- Runs jobs on Cluster
[Slide callouts mark which parts of the code handle scheduling, processing, integration, and configuration]
14
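For readers without a cluster at hand, the logic of the word count flow (split each line into tokens, group by token, count each group) can be sketched in plain Java. This is an in-memory analogue, not Cascading code. The class and method names are hypothetical. Dropping empty tokens is a choice made here for readability, not something the slide's flow is stated to do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration only: no Cascading classes are used here.
final class WordCountSketch {

    // Same delimiter set as the slide's splitter regex: space [ ] ( ) , .
    static final String TOKEN_DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split every line into tokens, group identical tokens, and count each group.
    static Map<String, Long> countTokens(List<String> lines) {
        return lines.stream()
            // Analogue of the Each + splitter step: one line in, many tokens out.
            .flatMap(line -> Arrays.stream(line.split(TOKEN_DELIMITERS)))
            // Assumption of this sketch: skip the empty strings left between adjacent delimiters.
            .filter(token -> !token.isEmpty())
            // Analogue of GroupBy + Every(Count): one sorted entry per distinct token.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("rain shadow, rain", "a (dry) shadow.");
        System.out.println(countTokens(lines)); // prints {a=1, dry=1, rain=2, shadow=2}
    }
}
```

The difference that matters is where the work runs: the Cascading flow is planned into cluster jobs and reads and writes through taps, while this sketch holds everything in one JVM.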
[Figure: wc.dot, the flow planned as a single mapreduce job; nodes listed from head to tail]
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
A REAL WORLD APP
15
[Figure: the planned job graph, steps [1/75] through [75/75]; most steps are map+reduce, while steps 55, 57 to 62, 71, and 72 are map-only]
1 App, 1 Flow, 75 Steps/MRJobs
green = map + reduce
purple = map
blue = join/merge
orange = map split
A graph of jobs, not operations!
16
It’s not just for Java
17
word count – Scalding (Scala)

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
18
word count – Cascalog (Clojure)

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
• Step-by-step tutorials on Cascading on GitHub
• Community has ported them to Scalding and Cascalog
• http://docs.cascading.org/impatient/
“FOR THE IMPATIENT” SERIES
19
• Foundation of patterns and best practices for building
Languages, Frameworks, and Applications
• Designed to abstract Hadoop away from the business logic
• Other models than MapReduce on the way!
WHY CASCADING?
20
• ANSI Compatible SQL
• JDBC Driver
• Cascading Java API
• SQL Command Shell
• Catalog Manager Tool
• Data Provider API
LINGUAL
21
Query Planner
JDBC API Lingual API Provider API
Cascading
Apache Hadoop Lingual Data Stores
CLI / Shell Enterprise Java
Catalog
22
Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
23
JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  // identifiers are double-quoted, so the quotes are escaped inside the Java string
  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}
SHELL - !TABLES
24
25
# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
26
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
27
“But we use a custom data format”
• Any Cascading Tap and/or Scheme can be used from JDBC
• Use a “fat jar” on local disk or from a Maven repo
‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0
• The Jar is dynamically loaded into cluster
DATA PROVIDER API
28
29
Amazon Elastic MapReduce
Job Job Job Job
SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...
Amazon S3
Amazon RedShift
file1 file2
results
• Quickly migrate existing work loads from RDBMS to Hadoop
• Quickly extract data from Hadoop into applications
WHY LINGUAL
30
• Predictive model scoring
• Java API and PMML parser
• Supports:
‣ (General) Regression
‣ Clustering
‣ Decision Trees
‣ Random Forest
‣ and ensembles of models
PATTERN
31
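"Scoring" here means applying an already-trained model to each incoming record. As a rough illustration of the idea for the simplest supported model family, a linear regression, the plain-Java sketch below evaluates intercept plus weighted fields for one row. It does not use the Pattern API or parse PMML. The class name, field names, and coefficients are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of regression scoring; not the Pattern API.
final class RegressionScoringSketch {

    private final double intercept;
    private final Map<String, Double> coefficients;

    RegressionScoringSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = new LinkedHashMap<>(coefficients);
    }

    // score = intercept + sum over model fields of (coefficient * field value)
    double score(Map<String, Double> row) {
        double result = intercept;
        for (Map.Entry<String, Double> term : coefficients.entrySet()) {
            Double value = row.get(term.getKey());
            if (value == null) {
                // A row that lacks a field the model needs cannot be scored.
                throw new IllegalArgumentException("missing field: " + term.getKey());
            }
            result += term.getValue() * value;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical model: y = 1.0 + 2.0 * x - 0.5 * z
        RegressionScoringSketch model =
            new RegressionScoringSketch(1.0, Map.of("x", 2.0, "z", -0.5));
        System.out.println(model.score(Map.of("x", 3.0, "z", 4.0))); // prints 5.0
    }
}
```

The model is built elsewhere and only evaluated here, which is the split the PMML-based workflow relies on: the model file carries the coefficients, and the data flow supplies the rows.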
PMML Parser Pattern API
Cascading
Apache Hadoop
Pattern
Data Stores
Enterprise Java
32
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
• Standards compliance provides integration with many tools
• Models are independent of data and integration
• Only debugging Cascading, not an ensemble of applications
WHY PATTERN
33
CLOSING THE LOOP
34
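[Diagram: a loop between Desktop and Cluster. Labeled steps: import data, create models, export models, execute models, import results, test results. DATA moves between the two through JDBC flows, models move as PMML, and Pattern runs them as a flow of jobs on the cluster]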
• Understand how your application maps onto your cluster
• Identify bottlenecks (data, code, or the system)
• Jump to the line of code implicated on a failure
• Plugin available via Maven repo
• Beta UI hosted online
DRIVEN
35
http://cascading.io/driven/
MANAGED WITH DRIVEN
36
37
• New query planner
‣ User definable Assertion and Transformation rules
‣ Sub-Graph Isomorphism Pattern Matching
‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph
isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75
• Hadoop Tez support
• And likely other platforms
CASCADING 3.0
38
THERE’S A BOOK!
39
Enterprise Data Workflows with Cascading

- Paco Nathan	

O’Reilly, 2013	

amazon.com/dp/1449358721
CONTACT
40
@cwensel | @cascading	

chris@wensel.net	

www.cascading.org	

www.concurrentinc.com

Osi security architecture in network.pptx
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 

Hadoop User Group EU 2014

  • 2. WHO AM I? • Lead developer of the Cascading open-source project • Founder of Concurrent, Inc. • Involved with Apache Hadoop since it was called Apache Nutch • Systems Architect, not a Data Scientist
  • 3. For creating data-oriented applications, frameworks, and languages [on Apache Hadoop]. Originally designed to hide the complexity of Hadoop and prevent thinking in MapReduce. cascading.org
  • 4. SOME STATS • Started in 2007 • 2.0 released June 2012 • 2.5 out now • 3.0 WIP (if you look for it) • Apache 2.0 Licensed • Supports all Hadoop distros
  • 6. • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools • Easy to operationalize heavy lifting of data
  • 7. • Cascalog (Clojure) • Weather pattern modeling to protect growers against loss • ETL against 20+ datasets daily • Machine learning to create models • Purchased by Monsanto for $930M US
  • 8. TWITTER • Scalding (Scala) • Machine learning (linear algebra) to improve ‣ User experience ‣ Ad quality (matching users and ad effectiveness) • All revenue applications are running on Cascading/Scalding • IPO
  • 9. • Estimate suicide risk from what people write online • Cascading + Cassandra • You can do more than optimize ad yields • http://www.durkheimproject.org
  • 11. CASCADING • Java API (alternative to Hadoop MapReduce) • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters [Diagram labels: Process Planner, Processing API, Integration API, Scheduler API, Scheduler, Apache Hadoop, Cascading, Data Stores, Scripting (Scala, Clojure, JRuby, Jython, Groovy), Enterprise Java]
  • 12. SOME COMMON PATTERNS • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc. ‣ Rolling windows [Diagrams: a Pipeline of filters and functions over data; a Topology with Split, Join, and Merge over data]
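
The patterns listed on this slide (filter, function, grouping, aggregation) can be illustrated without a cluster. What follows is a minimal sketch in plain Java streams, not the Cascading API: the class name `PipelinePatternsSketch` and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, not Cascading.
public class PipelinePatternsSketch {

    // "Function" pattern: one input line yields many output tokens.
    static List<String> tokenize(String line) {
        return Arrays.asList(line.toLowerCase().split("[ \\[\\](),.]+"));
    }

    // filter -> function -> filter -> grouping -> aggregation (count)
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            .filter(line -> !line.trim().isEmpty())       // filter: drop blank lines
            .flatMap(line -> tokenize(line).stream())     // function: line -> tokens
            .filter(token -> !token.isEmpty())            // filter: drop empty tokens
            .collect(Collectors.groupingBy(               // grouping: key on the token
                token -> token,
                TreeMap::new,                             // sorted keys for stable output
                Collectors.counting()));                  // aggregation: count per group
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A rose is a rose", "", "is it?");
        System.out.println(wordCount(lines));
    }
}
```

Each stage maps onto one bullet of the slide; in Cascading the same stages would be expressed as pipe assemblies and planned onto Hadoop jobs rather than run in a single JVM.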
  • 13. word count – Cascading Java API

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
    wcFlow.complete(); // <<-- Runs jobs on Cluster

    [Slide callouts marking regions of the code: configuration, integration, processing, scheduling]
  • 15. A REAL WORLD APP [Figure: job graph of 75 steps, labeled [1/75] through [75/75]; most steps are map+reduce, a few are map-only] 1 App, 1 Flow, 75 Steps/MR Jobs. Legend: green = map + reduce, purple = map, blue = join/merge, orange = map split. A graph of jobs, not operations!
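
A graph of jobs implies that each step can start only after the steps it reads from have finished. The following is a minimal sketch of that idea using Kahn's topological ordering in plain Java. It is not Cascading's scheduler; the class `JobGraphSketch`, the method `runOrder`, and the step names are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Cascading's scheduler.
public class JobGraphSketch {

    // dependsOn maps each step to the steps that must finish before it starts.
    static List<String> runOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new LinkedHashMap<>();        // unmet dependency counts
        Map<String, List<String>> dependents = new LinkedHashMap<>(); // upstream -> downstream
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String upstream : e.getValue()) {
                pending.putIfAbsent(upstream, 0); // steps named only as dependencies
                dependents.computeIfAbsent(upstream, k -> new ArrayList<>()).add(e.getKey());
            }
        }

        // Start with every step that has nothing to wait for.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.poll();
            order.add(step);
            // Finishing a step releases the steps that were waiting on it.
            for (String next : dependents.getOrDefault(step, Collections.emptyList())) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != pending.size()) {
            throw new IllegalStateException("cycle in job graph");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("join", List.of("cleanA", "cleanB"));
        graph.put("report", List.of("join"));
        System.out.println(runOrder(graph));
    }
}
```

Steps that sit in the ready queue at the same time have no dependency on each other, so a real scheduler could submit them to the cluster concurrently.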
  • 16. It’s not just for Java
  • 17. word count – Scalding (Scala)

    // Sujit Pal
    // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

    package com.mycompany.impatient

    import com.twitter.scalding._

    class Part2(args : Args) extends Job(args) {
      val input = Tsv(args("input"), ('docId, 'text))
      val output = Tsv(args("output"))
      input.read.
        flatMap('text -> 'word) {
          text : String => text.split("""\s+""")
        }.
        groupBy('word) { group => group.size }.
        write(output)
    }
  • 18. word count – Cascalog (Clojure)

    ; Paul Lam
    ; github.com/Quantisan/Impatient

    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\]\(\),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))
  • 19. “FOR THE IMPATIENT” SERIES • Step-by-step tutorials on Cascading on GitHub • Community has ported them to Scalding and Cascalog • http://docs.cascading.org/impatient/
  • 20. WHY CASCADING? • Foundation of patterns and best practices for building Languages, Frameworks, and Applications • Designed to abstract Hadoop away from the business logic • Other models than MapReduce on the way!
  • 21. LINGUAL • ANSI Compatible SQL • JDBC Driver • Cascading Java API • SQL Command Shell • Catalog Manager Tool • Data Provider API [Diagram labels: Query Planner, JDBC API, Lingual API, Provider API, Cascading, Apache Hadoop, Lingual, Data Stores, CLI / Shell, Enterprise Java, Catalog]
  • 22. Cascading API

    FlowDef flowDef = FlowDef.flowDef()
      .setName( "sqlflow" )
      .addSource( "example.employee", emplTap )
      .addSource( "example.sales", salesTap )
      .addSink( "results", resultsTap );

    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );
  • 23. JDBC driver

    public void run() throws ClassNotFoundException, SQLException {
      Class.forName( "cascading.lingual.jdbc.Driver" );
      Connection connection =
        DriverManager.getConnection(
          "jdbc:lingual:local;schemas=src/main/resources/data/example" );
      Statement statement = connection.createStatement();

      ResultSet resultSet = statement.executeQuery(
        "select *\n"
          + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
          + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
          + "on e.\"EMPID\" = s.\"CUST_ID\"" );

      // do something

      resultSet.close();
      statement.close();
      connection.close();
    }
  • 25. Querying Lingual from R over JDBC

    # load the JDBC package
    library(RJDBC)

    # set up the driver
    drv <- JDBC("cascading.lingual.jdbc.Driver",
      "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

    # set up a database connection to a local repository
    connection <- dbConnect(drv,
      "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

    # query the repository: in this case the MySQL sample database (CSV files)
    df <- dbGetQuery(connection,
      "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
    head(df)

    # use R functions to summarize and visualize part of the data
    df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
    summary(df$hire_age)

    library(ggplot2)
    m <- ggplot(df, aes(x=hire_age))
    m <- m + ggtitle("Age at hire, people named Gina")
    m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 26. Output of the R summary:

    > summary(df$hire_age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      20.86   27.89   31.70   31.61   35.01   43.92

  • 27. “But we use a custom data format”
  • 28. DATA PROVIDER API • Any Cascading Tap and/or Scheme can be used from JDBC • Use a “fat jar” on local disk or from a Maven repo ‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0 • The jar is dynamically loaded into the cluster
  • 29. [Diagram: the query SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ... runs as a series of Jobs on Amazon Elastic MapReduce; other labels: Amazon S3, Amazon RedShift, file1, file2, results]
  • 30. WHY LINGUAL • Quickly migrate existing workloads from RDBMS to Hadoop • Quickly extract data from Hadoop into applications
  • 31. PATTERN • Predictive model scoring • Java API and PMML parser • Supports: ‣ (General) Regression ‣ Clustering ‣ Decision Trees ‣ Random Forest ‣ and ensembles of models [Diagram labels: PMML Parser, Pattern API, Cascading, Apache Hadoop, Pattern, Data Stores, Enterprise Java]
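
Model scoring means applying an already trained model to each incoming record. The sketch below shows the idea for the simplest case on the list, a linear regression, in plain Java. It does not use the Pattern API or parse PMML: the class `RegressionScoringSketch`, the field names, and the coefficient values are all assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the Pattern API and not a PMML parser.
public class RegressionScoringSketch {
    private final double intercept;
    private final Map<String, Double> coefficients; // field name -> weight

    public RegressionScoringSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = new LinkedHashMap<>(coefficients);
    }

    // score = intercept + sum over fields of (coefficient * field value)
    public double score(Map<String, Double> tuple) {
        double result = intercept;
        for (Map.Entry<String, Double> term : coefficients.entrySet()) {
            Double value = tuple.get(term.getKey());
            if (value == null) {
                throw new IllegalArgumentException("missing field: " + term.getKey());
            }
            result += term.getValue() * value;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("x", 2.0);
        weights.put("y", 0.5);
        RegressionScoringSketch model = new RegressionScoringSketch(1.0, weights);

        Map<String, Double> tuple = new LinkedHashMap<>();
        tuple.put("x", 3.0);
        tuple.put("y", 4.0);
        System.out.println(model.score(tuple));
    }
}
```

The model here is just data (an intercept and a set of weights), which is the property that lets a model be exported from one tool and scored elsewhere.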
  • 32. Pattern with the Cascading API

    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );
  • 33. WHY PATTERN • Standards compliance provides integration with many tools • Models are independent of data and integration • Only debugging Cascading, not an ensemble of applications
  • 34. CLOSING THE LOOP [Diagram labels: Cluster, Pattern, Desktop, Job, PMML Flow, JDBC Flow, PMML, DATA; steps: import data, create models, export models, execute models, import results, test results]
  • 35. DRIVEN • Understand how your application maps onto your cluster • Identify bottlenecks (data, code, or the system) • Jump to the line of code implicated on a failure • Plugin available via Maven repo • Beta UI hosted online • http://cascading.io/driven/
  • 38. CASCADING 3.0 • New query planner ‣ User-definable Assertion and Transformation rules ‣ Sub-Graph Isomorphism Pattern Matching ‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75 • Hadoop Tez support • And likely other platforms
  • 39. THERE’S A BOOK! Enterprise Data Workflows with Cascading, Paco Nathan, O’Reilly, 2013. amazon.com/dp/1449358721