1. Cascading and BigData Problems
Chris K Wensel
Concurrent, Inc.
Copyright Concurrent, Inc. 2011. All rights reserved.
2. About Me
• Concurrent, Inc., Founder
• Cascading support and tools
• http://concurrentinc.com/
• Cascading, Lead Developer (started Sept 2007)
• An alternative API to MapReduce
• http://cascading.org/
• Formerly Hadoop mentoring and training
• Sun - Apple - HP - LexisNexis - startups - etc
• Formerly Systems Architect & Consultant
• Thomson/Reuters - TeleAtlas - startups - etc
3. Overview
• Case Studies
• What’s in common?
• Where does Hadoop fit?
• Processing vs Innovation
4. Case Studies
• ShareThis
• BestBuy
• FlightCaster
• Etsy
• Ion Flux
5. Summary
• All running in production with Hadoop
• All use AWS, most use Elastic MapReduce
• All production processing was implemented in
Cascading
• Various other tools used at different stages of
development
6. ShareThis
• Cascading + AWS (pre-EMR)
• Daily event log processing, initially multiple
TB and growing
• Details in the O’Reilly Hadoop book from
Tom White
7. Lessons
[Diagram: logprocessor, crawler, and indexer jobs; schedule labels: every X hrs, every Y hrs, on crawl completion]
• Mark data as bad and why, never discard
• useful for upstream debugging
• Data is seasonal, cyclical, and bursty
• Tune your app and cluster to the workload
• (garbage collect Hadoop clusters)
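The "mark data as bad and why, never discard" lesson can be sketched in a few lines. This is a minimal illustration and not the ShareThis code: the tab-separated `timestamp`, `user`, `url` layout and the `Record`, `parse_line`, and `process` names are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    raw: str                          # the original line is always kept
    fields: Optional[dict] = None     # parsed values, when parsing succeeded
    bad_reason: Optional[str] = None  # None means the record parsed cleanly


def parse_line(raw: str) -> Record:
    """Parse 'timestamp<TAB>user<TAB>url' (hypothetical layout).

    On failure the raw line is kept and tagged with the reason.
    """
    parts = raw.rstrip("\n").split("\t")
    if len(parts) != 3:
        return Record(raw, bad_reason=f"expected 3 fields, got {len(parts)}")
    timestamp, user, url = parts
    try:
        ts = int(timestamp)
    except ValueError:
        return Record(raw, bad_reason="timestamp is not an integer")
    return Record(raw, fields={"timestamp": ts, "user": user, "url": url})


def process(lines):
    """Split records into a good stream and a bad stream; nothing is dropped."""
    good, bad = [], []
    for line in lines:
        record = parse_line(line)
        (bad if record.bad_reason else good).append(record)
    return good, bad
```

Writing the bad stream to its own output, with the reason attached, is what makes the upstream debugging mentioned above possible.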
8. BestBuy - Behavioral Ad-Targeting
• Cascading + AWS (Elastic MapReduce)
• Daily automated User Behavior Segmentation
• 6wks dev, 3T/day, $13k/mo
• 500% increase in return on ad spend from a
similar campaign a year before
• http://aws.amazon.com/solutions/case-studies/razorfish/
9. Cluster
[Diagram: Amazon Web Services Elastic MapReduce cluster; Slaves run the Map/Reduce behavior app over HDFS, with input and output in S3; external systems: E-Commerce Site, Ad System]
• 200+ nodes, 9-12 hour runs
• 30+ days of history + 3TB daily
• Remote HTTP update of ad-server
• of only changed data
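Sending "only changed data" to the ad server amounts to diffing the current run against the previous one. The sketch below assumes each run produces a user-to-segment mapping; that shape and the `diff_segments` name are hypothetical, and the HTTP update itself is left out.

```python
_MISSING = object()  # sentinel so a stored value of None still compares correctly


def diff_segments(previous: dict, current: dict):
    """Compare two runs and return only what the ad server needs to hear about.

    upserts: users that are new or whose segment changed
    deletes: users present in the previous run but absent now
    """
    upserts = {
        user: segment
        for user, segment in current.items()
        if previous.get(user, _MISSING) != segment
    }
    deletes = sorted(user for user in previous if user not in current)
    return upserts, deletes
```

With 30+ days of history behind each run, most entries are unchanged from one day to the next, so the update payload stays far smaller than the full result set.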
10. Road Blocks
• No one really understood the data
• Character formats (UTF-8 vs ...)
• Zero byte chars
• Unique columns not unique
• Outliers in the data
• Creating test data
• QAing the data
• result data was also big
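Several of these road blocks can be detected mechanically before the heavy processing starts. The following is a rough sketch with hypothetical helper names (`decode_line`, `duplicate_keys`), not the checks the project actually ran.

```python
from collections import Counter


def decode_line(raw: bytes):
    """Return (text, problems) for one raw input line."""
    problems = []
    if b"\x00" in raw:
        problems.append("zero byte")
        raw = raw.replace(b"\x00", b"")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
        # Keep going with replacement characters rather than losing the line.
        text = raw.decode("utf-8", errors="replace")
    return text, problems


def duplicate_keys(rows, key_index=0):
    """Report values of a supposedly unique column that occur more than once."""
    counts = Counter(row[key_index] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

Running checks like these on a sample of each day's input is one inexpensive way to learn what the data really looks like.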
11. FlightCaster - Predicting Flight Delays
• Clojure + Cascading + AWS
• Scours data on every domestic flight for the past 10 years and matches it to real-time conditions
• Machine learning on Cascading, Scoring on
app server
• 3mos dev, 10G/day, <$2k/mo
12. Lessons
• Even with a good abstraction, you must intuit the
underlying model (MapReduce) to improve
throughput
• i.e. Logical vs Physical plans
• we still need DBAs after decades of query
planner dev
13. Etsy - Online Marketplace
• JRuby + Cascading + AWS
• 1B page-views & multi-TB of logs per month
• 40-50 cascading.jruby jobs a night
• http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/
• http://www.concurrentinc.com/casestudies/etsy
14. Initially
• JRuby for the ‘analysts’
• Log pre-processing,
• db snapshot diffs,
• nightly and ad-hoc analytics
15. Data Driven Products
• Search index/scoring (under dev)
• Taste Test
• Facebook gift recommender
• Suggested shops
• Top query list, etc...
• Many more on the way
16. Ion Flux - Gene Sequencing
• Cascading + AWS
• Sequence Alignment
• http://aws.amazon.com/solutions/case-studies/ion-flux/
17. Cluster
• 10-30 nodes, using new HPC instances
• 200-500 cores,
• runs up to 50 hours
18. Architecture
[Architecture diagram. Components shown: Clinical Lab (Ion Torrent - Torrent Sequencer, Ion Torrent - Torrent Server); Ion Flux - Flux Capacitor (split, compress, and transfer FastQ sequence files); Ion Flux - LIMS (Cloud LIMS ReST server on EC2, metadata database on RDS); Ion Flux - Pipeline Controller; Ion Flux - Annotation Server (ReST server, database on RDS); Ion Flux - Variant Server (ReST server on EC2, variant database on RDS); Ion Flux - Client Website (EC2) for external partners and third-party clients; AWS - S3 Storage (software and performance data, FastQ sequence data chunks, PILEUP variants). Heavy Lifting: the Ion Flux - Sequencing Pipeline on an AWS - EMR Cluster running Cascading, with steps labeled Create Cluster, Bootstrap Nodes, Configure Pipeline, TMAP Alignment, Sort by position, Split to Bins, SRMA, PILEUP, Cleanup Cluster, and Shutdown Cluster, plus a Node Performance Profiler.]
19. Common Architecture
[Diagram: loggers, raw data, ?, intermediate data, valuable data; roles: Producer, Analyst, Developer, Consumer; axis: Value]
• New data continuously arriving
• Actively incorporating the new with the old
• Updating backend systems
20. Common Constraints
• Speed of light
• Understanding the data
• Creating tests and validating the results
• Lifecycle phases have different environments
• dev vs. integration vs. prod
• Better algorithms, less cost, more complexity
21. Apps Have Many Stages
• Heavy Lifting
• Modeling & Learning
• Scoring
• Processing
22. Heavy Lifting
• ETL Style processes hampered by physics
• Moving/Transferring/Packaging data
• Data cleansing and value normalization
23. Modeling & Learning
• Also known as “Data Mining”
• Ask lots of questions to understand the
data
• Machine learning, or
• Ad-hoc queries
• Where the innovation happens
24. Processing
• Transforming and/or combining multiple
data sets into new data sets or models
• Analytics,
• statistics,
• enrichment,
• entity extraction,
• indexing (w/ scoring),
• feature reduction,
• matching
25. Scoring
• Apply what’s learned
• Sometimes batch (as part of Processing)
• indices with search result ranking
• Sometimes transactional, req/resp
• prediction, recommendations, etc
26. In Summary
[Diagram: a pipeline of collection, cleansing, processing, and delivery stages, with normalization, scoring, and mining as activities; data labels along the flow: event data signal info knowledge]
The point of computing systems is to make data
more valuable
27. Where does Hadoop fit?
28. Hadoop
[Diagram: a Cluster of Racks of Nodes, presenting one Global Compute-space and one Global Namespace]
• Distributed replicated storage for large files
• Distributed fault tolerant exec of batch processes
• Scale out vs (legacy) scale up
• Java API allows complex analysis, more freedom
29. MapReduce
• A “divide and conquer” strategy for
parallelizing workloads against collections of
data
• Map & Reduce are two user defined functions
chained via Key Value Pairs
• It’s really Map->Group->Reduce where Group
is built in
30. Keys and Values
• Map translates input keys and values to new keys and values
• System Groups each unique key with all its values
• Reduce translates the values of each unique key to new keys and values
[K1,V1] Map [K2,V2]*
[K2,V2] Group [K2,{V2,V2,....}]
[K2,{V2,V2,....}] Reduce [K3,V3]*
* = zero or more
31. Word Count
Mapper
[0, "when in the course of human events"] Map ["when",1] ["in",1] ["the",1] [...,1]
["when",1] ["when",1] ["when",1] ["when",1] ["when",1] Group ["when",{1,1,1,1,1}]
Reducer
["when",{1,1,1,1,1}] Reduce ["when",5]
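The word count walk-through can be reproduced in plain Python. This is a single-process sketch of the Map, Group, Reduce sequence, not Hadoop's Java API; the function names are made up for the example, and a line index stands in for the byte offset that serves as the input key on the slide.

```python
from collections import defaultdict


def map_fn(offset, line):
    """[K1,V1] -> [K2,V2]*: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def group(pairs):
    """[K2,V2] -> [K2,{V2,...}]: the step the framework performs for you."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """[K2,{V2,...}] -> [K3,V3]*: sum the ones collected for each word."""
    yield word, sum(counts)


def word_count(lines):
    mapped = [pair
              for offset, line in enumerate(lines)
              for pair in map_fn(offset, line)]
    grouped = group(mapped)
    return dict(pair
                for word, counts in grouped.items()
                for pair in reduce_fn(word, counts))
```

Only `map_fn` and `reduce_fn` are user code in the MapReduce model; `group` is shown here only to make the built-in step visible.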
32. Divide and Conquer Parallelism
• The ‘records’ entering the Map and the ‘groups’ entering the Reduce are independent
• That is, there is no expectation of order or
requirement to share state between records/
groups
• Arbitrary numbers of Map and Reduce function
instances can be created against arbitrary portions
of input data
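Because records share no state, the input can be cut into arbitrary splits without changing the answer. The sketch below demonstrates this with a thread pool; each worker counts the words in its own split (a map with local pre-aggregation), and the partial results are merged afterwards. The names are hypothetical, and Python threads are used only to show the independence, not to claim real speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_split(lines):
    """One Map instance: counts words in its own split, sharing no state."""
    return Counter(word for line in lines for word in line.split())


def parallel_word_count(lines, num_splits):
    """Run one map_split per split, then merge the partial counts."""
    # Any partitioning works; here line i goes to split i % num_splits.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        partials = list(pool.map(map_split, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return dict(total)
```

Calling `parallel_word_count` with 1, 2, or 5 splits returns the same dictionary, which is the property that lets a cluster choose the number of instances freely.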
33. Cluster
[Diagram: a Cluster of Racks of Nodes, with map and reduce instances spread across the nodes]
• Multiple instances of each Map and Reduce
function are distributed throughout the cluster
34. Another View
[K1,V1] Map [K2,V2] Combine Group [K2,{V2,...}] Reduce [K3,V3]
[Diagram: Mapper Tasks read split1, split2, split3, split4 ... of a file; Shuffle connects Mapper Tasks to Reducer Tasks, which write part-00000, part-00001, ... part-000N into a directory; one label reads "same code"]
Mappers must complete before Reducers can begin
35. Architectural Components
[Diagram: the Client submits jobs to the JobTracker, which hands tasks to the TaskTracker; the TaskTracker runs mappers and reducers in child JVMs; the NameNode and Secondary handle ns operations; DataNodes hold data blocks and serve read/write operations]
• Solid boxes are unique applications
• Dashed boxes are child JVM instances on same node as parent
• Dotted boxes are blocks of managed files on same node as parent
36. Deployment Topology
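[Diagram: Nodes hosting the Client, the JobTracker, a TaskTracker with a DataNode, the NameNode, and the Secondary; jobs flow from Client to JobTracker, tasks from JobTracker to TaskTracker; annotation: "Not uncommon to be same node"]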
• Job Client may run on any node
• NameNode and JobTracker may run on same node (Master)
• DataNode and TaskTracker instances should run on same node (Slaves)
• NameNode and Secondary NameNode shouldn’t typically run on same node
37. Complex job assemblies
• Real applications are many MapReduce jobs chained together
• Linked by intermediate (usually temporary) files
• Executed in order, by hand, from the ‘client’ application
[Diagram: File, Count Job (Map, Reduce), File, Sort Job (Map, Reduce), File; [ k, v ] passes between files and functions, [ k, [v] ] between each Map and Reduce]
[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection
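The Count Job followed by a Sort Job, linked by a temporary file and driven by hand from the client, can be imitated on a local disk. This is only an analogy in plain Python with made-up function and file names; real jobs would be submitted to the cluster and the files would live in HDFS.

```python
import os
import tempfile
from collections import Counter


def count_job(input_path, output_path):
    """Job 1: read text, write 'word<TAB>count' lines to an intermediate file."""
    counts = Counter()
    with open(input_path) as src:
        for line in src:
            counts.update(line.split())
    with open(output_path, "w") as dst:
        for word, n in counts.items():
            dst.write(f"{word}\t{n}\n")


def sort_job(input_path, output_path):
    """Job 2: read job 1's output, order by count descending, then by word."""
    with open(input_path) as src:
        rows = [line.rstrip("\n").split("\t") for line in src]
    rows.sort(key=lambda row: (-int(row[1]), row[0]))
    with open(output_path, "w") as dst:
        for word, n in rows:
            dst.write(f"{word}\t{n}\n")


def run_chain(text):
    """The 'client': runs the jobs in order and wires them together by path."""
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "input.txt")
        counted = os.path.join(workdir, "counts.tsv")  # intermediate, temporary
        ranked = os.path.join(workdir, "sorted.tsv")
        with open(raw, "w") as f:
            f.write(text)
        count_job(raw, counted)
        sort_job(counted, ranked)
        with open(ranked) as f:
            return [tuple(line.rstrip("\n").split("\t")) for line in f]
```

Every extra step means another job, another intermediate file, and more ordering logic in the client, which is the bookkeeping the following slides set out to remove.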
38. [Diagram: matching two sets using prefix filtering, built as a chain of MapReduce jobs linked by Files: Tokenize Job, Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, Join RHS / Match Job]
40. Cascading
[Diagram: Word Count/Sort Flow: Data, Parse, Group, Count, Sort, Data, laid over two Map/Reduce jobs; [ f1, f2,... ] = tuples with field names]
• Alternative model & API to MapReduce
• pipe/filters of re-usable operations
• For rapidly implementing Data Processing Systems
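The pipe/filter idea can be illustrated without any framework. The sketch below is plain Python and deliberately not the Cascading API: dicts stand in for tuples with field names, and `parse`, `group_count`, `sort_by`, and `flow` are hypothetical names chosen to mirror the Parse, Group, Count, Sort flow.

```python
def parse(records):
    """Turn each {'line': ...} tuple into one {'word': ...} tuple per word."""
    for record in records:
        for word in record["line"].split():
            yield {"word": word.lower()}


def group_count(records, field):
    """Group on a named field and emit one tuple per group with its count."""
    counts = {}
    for record in records:
        counts[record[field]] = counts.get(record[field], 0) + 1
    for value, n in counts.items():
        yield {field: value, "count": n}


def sort_by(records, field, reverse=False):
    """Order tuples by a named field."""
    yield from sorted(records, key=lambda r: r[field], reverse=reverse)


def flow(source, *stages):
    """Connect re-usable operations into one pipeline and run it."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


result = flow(
    [{"line": "a b a"}],
    parse,
    lambda rs: group_count(rs, "word"),
    lambda rs: sort_by(rs, "count", reverse=True),
)
```

The author describes operations in terms of fields and pipes; deciding how those operations map onto Map and Reduce jobs is left to the framework's planner.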
41. Cascading
• Allows for Unit testing independent of
integration
• Re-usable libraries
• Integration is first class
• Homogeneous framework for scheduling
• Any JVM based host language
42. Elastic MapReduce
[Diagram: Amazon Web Services Elastic MapReduce; the User works through the CLI or Console Client; a Master and Slaves run Map/Reduce (mr) with temp data in HDFS; the jar, input, and output live in S3]
• Clusters typically single purpose
• S3 used for storage between runs
43. Architecture Isn’t Innovation
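[Diagram: a pipeline of collection, cleansing, processing, and delivery stages, with normalization, scoring, and mining as activities; data labels along the flow: event data signal info knowledge; spans labeled operationalization and innovation]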
Rate of innovation and arrival of answers are
proportional
44. Big vs Lots
[Matrix: Data Mining* and Data Processing against "Big" Data and Lots of Data]
Data Mining*: ! ?
Data Processing: ! !
! = Hadoop
? = RDBMS, R, etc
* Data Warehousing
• Big - too much to fit in/on any one thing
• Lots - complexity arising from keeping
track of all the bits
45. At Rest vs In Motion
[Diagram, Data At Rest: loggers, raw data, ETL, data warehousing, data mining, ETL, Analyst]
[Diagram, Data In Motion: loggers, raw data, data processing, valuable data, Consumer process]
• Hub/Spoke vs Incremental Layers
• Static Schema vs Dynamic Views
• Monolithic vs Distributed
46. Hadoop for Processing
Value Creation
Scalability
Simplicity
• Delivering Value from Innovation
• Scalability, Not Performance
• Simplifies Infrastructure
47. Simplicity
[Diagram: a Cluster of Racks of Nodes; cpus form the Global Compute-space, disks form the Global Namespace]
• Virtualization across resources, not within (PaaS)
• A single FileSystem across disks - no DBA
• A single Execution System across CPUs - less IT
• One app installed and managed across hardware
48. Scalability
[Diagram: Users' Clients submit jobs to a Cluster of Racks of Nodes]
• Scalability - continued reliability and met expectations as
demand changes
• Application Scalability - data grows, app/infra expand
• Organizational Scalability - simpler infra and apps
49. Delivering Value
[Diagram: loggers, raw data, data processing on Hadoop + Cascading (etl, analytics); outputs: events, reporting, product, operational; roles: Producer, Consumer; axis: Value]
• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind
50. Data In Motion
[Diagram, Data In Motion: loggers, raw data, data processing, valuable data, Consumer process]
• Data always arriving, results being delivered
• Not paying the upfront cost of indexing
• No upfront schema design
• “ETL” is built into the processing pipeline
51. Where to Innovate?
[Matrix: Data Mining* and Data Processing against "Big" Data and Lots of Data]
Data Mining*: ! ?
Data Processing: ! !
! = Hadoop
? = RDBMS, R, etc
* Data Warehousing
• Depends on the problem whether Hadoop
makes sense as your innovation platform
52. Hadoop for Innovating
[Chart: innovation curves, with axes labeled value, latency, and degrees of freedom]
• Need to ask similar questions repeatedly
• Indexes help here
• Need a reasonably high abstraction
• Existing libraries and a simple syntax
• Third-party Tool support
53. Innovation Abstractions
• Syntax
• Pig
• Hive - now has some indexing support
• Language (easier to operationalize)
• Cascalog
• Cascading.jruby
• 3 new Scala languages pending release
54. Data At Rest
[Diagram, Data At Rest: loggers, raw data, ETL, data warehousing, data mining, ETL, Analyst]
• Hadoop becomes a warehouse (with Schemas)
• and without indexes, high latency queries
• ETL becomes an independent architecture
55. Don’t throw out the baby with the bath water
• Need low latency responses
• Need support for existing tools
• Need to not retrain analysts
• RDBMS (Aster, GreenPlum, Vertica, Oracle)
• R
• SAS
• MicroStrategy
• Tableau
56. Baling Wire & Bubble Gum
• Integrating them with Hadoop adds
brittleness and inefficiencies
• Hadoop Streaming
• RHIPE, etc..
57. Operationalizing
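[Diagram: a pipeline of collection, cleansing, processing, and delivery stages, with normalization, scoring, and mining as activities; data labels along the flow: event data signal info knowledge; spans labeled operationalization and innovation]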
• Minimize the number of processing tech (debt)
• Don’t lose sight of the physical model/plan
• XML is not a programming language
• String concatenation isn’t programming
58. Resources
• Chris K Wensel
• chris@wensel.net
• @cwensel
• Cascading & Cascalog
• http://cascading.org
• @cascading
• Concurrent, Inc.
• http://concurrentinc.com
• @concurrent
• http://concurrentinc.com/careers
Editor's notes
Startups expecting to need 'web scale' implementations are committing to technologies that might not be a good fit. Doing so can be a dramatic waste of time, money and resources when they can ill afford to do so. Do you really have a Big Data problem? Do you have a plan for what you are going to do with it? Chris will try to explain where he sees Hadoop being used most successfully and will offer up some guidelines on when to consider adopting it and any complementary technologies.