4. In a Nutshell
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Alternative Java API to MapReduce with built-in Processing Planner and Workload Scheduler
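The role of a processing planner can be sketched in plain Java. This is a toy model only, not Cascading's API: the class `ToyPlanner`, the `Step` type, and the `each:` / `group:` / `every:` naming convention are all hypothetical. It assumes, as a simplification, that a planner cuts a declared chain of operations into MapReduce-style jobs at each grouping boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Cascading's API): turns a declared chain of
// operations into MapReduce-style steps, starting a new step at each
// grouping that follows an earlier grouping.
public class ToyPlanner {

    // One planned job: operations before the grouping run map-side,
    // the grouping and everything after it run reduce-side.
    static final class Step {
        final List<String> mapSide = new ArrayList<>();
        final List<String> reduceSide = new ArrayList<>();
        boolean grouped = false;

        @Override
        public String toString() {
            return "map" + mapSide + " reduce" + reduceSide;
        }
    }

    // Assumed convention: an operation whose name starts with "group:"
    // marks a grouping boundary; anything else is a per-record operation.
    static List<Step> plan(List<String> operations) {
        List<Step> steps = new ArrayList<>();
        Step current = new Step();
        steps.add(current);
        for (String op : operations) {
            boolean isGroup = op.startsWith("group:");
            if (isGroup && current.grouped) {
                // A second grouping cannot share the same job, so open a new step.
                current = new Step();
                steps.add(current);
            }
            if (isGroup) {
                current.grouped = true;
                current.reduceSide.add(op);
            } else if (current.grouped) {
                current.reduceSide.add(op);
            } else {
                current.mapSide.add(op);
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> flow = Arrays.asList(
                "each:parse", "each:filter", "group:user",
                "every:count", "group:count", "every:top");
        for (Step step : plan(flow)) {
            System.out.println(step);
        }
    }
}
```

The point of the sketch is the division of labor: the application declares what should happen to the data, and the planner decides how many jobs that takes.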
5. On Many Platforms
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• Apache Hadoop
• Amazon Elastic MapReduce
• MapR
• EMC/GreenPlum
• and more**
7. RazorFish/BestBuy
Java
[unit, regression, & integration testing]
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• E-Commerce visitor/customer behavior classification
• Rule processing against proprietary logs
• Backend system integration
8. FlightCaster
JVM Language/DSL
[scripting, ad-hoc queries, etc]
Logical Planner
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• They predict flight delays 6 hrs in advance
• Created own API/DSL in Clojure
• Used to build predictive models
9. Etsy
JVM Language/DSL
[scripting, ad-hoc queries, etc]
Logical Planner
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• Online retailer
• Forked own API/DSL in JRuby
• Cascading.JRuby - available on GitHub
10. What
• User behavior on site
• Data driven site features
• Taste Test
• Facebook gift recommender
• Suggested Shops
• Top Query List
• plus many more on the way
11. BackType
JVM Language/DSL
[scripting, ad-hoc queries, etc]
Logical Planner
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• Marketing intelligence
• Created Cascalog, an API/DSL in Clojure, available on GitHub
12. Ion Flux
Java
[unit, regression, & integration testing]
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
Gene sequencing
15. Pig/Hive
Query Syntax Extension API
Logical Planner
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
Great for ad-hoc queries, but hard to operationalize
16. Oozie/Azkaban
Scheduler
Syntax
Processing API Integration API
Scheduler API
Physical Planner
Scheduler
Platform
• Great for gluing command line apps together
• A JVM scripting language + Cascading is less brittle and offers more degrees of freedom
17. But They are Complementary
• No reason Oozie (or Talend) can’t be used to drive Cascading apps
• No reason Cascading can’t drive raw MR/Pig/Hive processes (see Riffle)
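One way to picture mixed units of work being driven together is to order them by the data they read and write. The sketch below is a hypothetical illustration in plain Java, not the API of Cascading, Riffle, or Oozie: `ToyCascade`, `Unit`, the unit names, and the paths are all invented for the example. It assumes each unit declares its input and output paths and runs a topological sort (Kahn's algorithm) over them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model (hypothetical): orders units of work so that every unit runs
// after the units that produce its inputs.
public class ToyCascade {

    // A unit of work could stand for a Cascading app, a raw MR job, or a Pig script.
    static final class Unit {
        final String name;
        final Set<String> reads;
        final Set<String> writes;

        Unit(String name, Set<String> reads, Set<String> writes) {
            this.name = name;
            this.reads = reads;
            this.writes = writes;
        }
    }

    static List<String> order(List<Unit> units) {
        Map<String, Integer> pending = new LinkedHashMap<>();      // unmet dependencies per unit
        Map<String, List<String>> dependents = new HashMap<>();    // producer -> consumers
        for (Unit u : units) {
            pending.put(u.name, 0);
            dependents.put(u.name, new ArrayList<>());
        }
        // A consumer depends on a producer when it reads a path the producer writes.
        for (Unit producer : units) {
            for (Unit consumer : units) {
                if (producer != consumer
                        && !Collections.disjoint(producer.writes, consumer.reads)) {
                    dependents.get(producer.name).add(consumer.name);
                    pending.merge(consumer.name, 1, Integer::sum);
                }
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String name = ready.remove();
            result.add(name);
            for (String next : dependents.get(name)) {
                if (pending.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (result.size() != units.size()) {
            throw new IllegalStateException("cyclic dependency between units of work");
        }
        return result;
    }

    public static void main(String[] args) {
        List<Unit> units = new ArrayList<>();
        units.add(new Unit("pigReport", Collections.singleton("/scored"), Collections.singleton("/report")));
        units.add(new Unit("cascadingScore", Collections.singleton("/clean"), Collections.singleton("/scored")));
        units.add(new Unit("rawMrClean", Collections.singleton("/logs"), Collections.singleton("/clean")));
        System.out.println(order(units));
    }
}
```

Because the ordering depends only on declared inputs and outputs, it does not matter which tool implements each unit, which is the sense in which the tools are complementary.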
18. Architecture isn’t Innovation
[Diagram: a data pipeline of collection, cleansing, processing, and delivery; other labels: event, data, signal, info, knowledge, normalization, scoring, mining]
The point of computing systems is to make data more valuable
Everything else is an implementation detail
Copyright Concurrent, Inc. 2011. All rights reserved.
19. Cascading 2.0
• Removed dependencies on Hadoop
• Improved Processing Planner architecture
• Improved integration APIs
20. To Do
• Support more platforms, including in-memory stream processing
• Make Planner more intelligent and leverage more complex data flow topologies
• Integrate with more systems and applications