Cascading and BigData Problems

Chris K Wensel
Concurrent, Inc.
Copyright Concurrent, Inc. 2011. All rights reserved.

About Me
• Concurrent, Inc., Founder
• Cascading support and tools
• http://concurrentinc.com/

• Cascading, Lead Developer (started Sept 2007)
• An alternative API to MapReduce
• http://cascading.org/

• Formerly Hadoop mentoring and training
• Sun - Apple - HP - LexisNexis - startups - etc

• Formerly Systems Architect & Consultant
• Thomson/Reuters - TeleAtlas - startups - etc

Overview

• Case Studies
• What’s in common?
• Where does Hadoop fit?
• Processing vs Innovation


Case Studies

• ShareThis
• BestBuy
• FlightCaster
• Etsy
• Ion Flux

Summary
• All running in production with Hadoop
• All use AWS, most use Elastic MapReduce
• All production processing was implemented in
Cascading

• Various other tools used at different stages of
development


ShareThis

• Cascading + AWS (pre-EMR)
• Daily event log processing, initially multiple
TB and growing
• Details in the O’Reilly Hadoop book from
Tom White


Lessons
[Diagram: logprocessor, crawler, and indexer stages, with schedule labels "every X hrs", "every Y hrs", and "on crawl completion"]

• Mark data as bad and why, never discard
• useful for upstream debugging
• Data is seasonal, cyclical, and bursty
• Tune your app and cluster to the workload
• (garbage collect Hadoop clusters)
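
The "mark data as bad and why, never discard" lesson can be sketched as follows. This is a minimal illustration, not the ShareThis implementation: the two-column `timestamp<TAB>url` log format, the `Record` class, and the reason strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One log line plus a verdict; the field names are illustrative."""
    raw: str
    fields: Optional[dict] = None
    bad_reason: Optional[str] = None  # None means the record is good


def parse_line(raw: str) -> Record:
    """Parse 'timestamp<TAB>url' and tag failures instead of dropping them."""
    if "\x00" in raw:
        return Record(raw, bad_reason="contains zero byte")
    parts = raw.rstrip("\n").split("\t")
    if len(parts) != 2:
        return Record(raw, bad_reason=f"expected 2 columns, got {len(parts)}")
    timestamp, url = parts
    if not timestamp.isdigit():
        return Record(raw, bad_reason="timestamp is not an integer")
    return Record(raw, fields={"timestamp": int(timestamp), "url": url})


def split_good_bad(lines):
    """Return (good, bad); every input line lands in exactly one list."""
    records = [parse_line(line) for line in lines]
    good = [r for r in records if r.bad_reason is None]
    bad = [r for r in records if r.bad_reason is not None]
    return good, bad


# Hypothetical input: one valid line and two malformed ones.
lines = [
    "1300000000\thttp://example.com/a",
    "oops\thttp://example.com/b",
    "1300000001",
]
good, bad = split_good_bad(lines)
for record in bad:
    print(record.bad_reason, "<-", repr(record.raw))
```

Because the raw line travels with the reason, the bad records can be written to their own output and handed back to whoever owns the logger.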

BestBuy - Behavioral Ad-Targeting

• Cascading + AWS (Elastic MapReduce)
• Daily automated User Behavior Segmentation
• 6 weeks dev, 3 TB/day, $13k/month
• 500% increase in return on ad spend from a
similar campaign a year before
• http://aws.amazon.com/solutions/case-studies/razorfish/

Cluster
[Diagram: an E-Commerce Site and an Ad System connected through Amazon Web Services: S3 holds input and output, and an Elastic MapReduce cluster of Slaves runs the Map/Reduce behavior app over HDFS]

• 200+ nodes, 9-12 hour runs
• 30+ days of history + 3TB daily
• Remote HTTP update of the ad server, sending only changed data
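
Sending only changed data implies comparing the current run's output with the previous one. A minimal sketch of that comparison, assuming each run yields a mapping from user id to segment. The ids, segment names, and the `changed_records` helper are hypothetical, and the HTTP call to the ad server is left out.

```python
_MISSING = object()  # sentinel so a stored None still compares correctly


def changed_records(previous: dict, current: dict) -> dict:
    """Return the upserts and deletes that turn `previous` into `current`."""
    upserts = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    deletes = sorted(key for key in previous if key not in current)
    return {"upserts": upserts, "deletes": deletes}


# Hypothetical user -> segment outputs from two consecutive daily runs.
yesterday = {"u1": "sports", "u2": "tv", "u3": "audio"}
today = {"u1": "sports", "u2": "gaming", "u4": "cameras"}

delta = changed_records(yesterday, today)
print(delta)
```

Unchanged entries such as `u1` never leave the cluster, which matters when the result data is itself large.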

Road Blocks
• No one really understood the data
• Character formats (UTF-8 vs ...)
• Zero byte chars
• Unique columns not unique
• Outliers in the data
• Creating test data
• QAing the data
• result data was also big

FlightCaster - Predicting Flight Delays
• Clojure + Cascading + AWS
• Scours data on every domestic flight for
the past 10 years and matches it to real-time
conditions

• Machine learning on Cascading, Scoring on
app server

• 3 months dev, 10 GB/day, <$2k/month

Lessons

• Even with a good abstraction, you must intuit the
underlying model (MapReduce) to improve
throughput

• i.e. Logical vs Physical plans
• we still need DBAs after decades of query
planner dev


Etsy - Online Marketplace

• JRuby + Cascading + AWS
• 1B page views & multiple TB of log data per month
• 40-50 cascading.jruby jobs a night
• http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-with-hadoop-and-cascading/
• http://www.concurrentinc.com/casestudies/etsy

Initially

• JRuby for the ‘analysts’
• Log pre-processing,
• DB snapshot diffs,
• nightly and ad-hoc analytics


Data Driven Products
• Search index/scoring (under dev)
• Taste Test
• Facebook gift recommender
• Suggested shops
• Top query list, etc...
• Many more on the way

Ion Flux - Gene Sequencing

• Cascading + AWS
• Sequence Alignment
• http://aws.amazon.com/solutions/case-studies/ion-flux/


Cluster

• 10-30 nodes, using new HPC instances
• 200-500 cores,
• runs up to 50 hours


Architecture

[Diagram: end-to-end Ion Flux architecture. Labeled components: Clinical Lab, Ion Torrent - Torrent Sequencer, Ion Torrent - Torrent Server (EC2), Ion Flux - Pipeline Controller, Ion Flux - Flux Capacitor, Ion Flux - LIMS, Ion Flux - Annotation Server, Ion Flux - Variant Server (EC2), Ion Flux - Client Website, Third Party Clients, External Partners, AWS - S3 Storage, and an AWS - EMR Cluster running the Ion Flux - Sequencing Pipeline on Cascading ("Heavy Lifting"). Databases are on RDS; data formats shown are FastQ, SAM, and PILEUP.]
[Sequencing pipeline steps labeled in the diagram: Create Cluster, Bootstrap Cluster Nodes, Configure Pipeline, TMAP Alignment, Sort by position, Split to Bins, SRMA, PILEUP, Variants, Cluster Cleanup, Shutdown Cluster.]


Common Architecture
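[Diagram: loggers produce raw data, which passes through an unspecified step ("?") and intermediate data to become valuable data; roles shown are Producer, Analyst, Developer, and Consumer; the flow is labeled Value]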

• New data continuously arriving
• Actively incorporating the new with the old
• Updating backend systems

Common Constraints

• Speed of light
• Understanding the data
• Creating tests and validating the results
• Lifecycle phases have different environments
• dev vs. integration vs. prod
• Better algorithms, less cost, more complexity

Apps Have Many Stages

• Heavy Lifting
• Modeling & Learning
• Scoring
• Processing


Heavy Lifting

• ETL Style processes hampered by physics
• Moving/Transferring/Packaging data
• Data cleansing and value normalization


Modeling & Learning
• Also known as “Data Mining”
• Ask lots of questions to understand the
data
• Machine learning, or
• Ad-hoc queries
• Where the innovation happens

Processing

• Transforming and/or combining multiple
data sets into new data sets or models

• Analytics,
• statistics,
• enrichment,
• entity extraction,
• indexing (w/ scoring),
• feature reduction,
• matching

Scoring

• Apply what’s learned
• Sometimes batch (as part of Processing)
• indices with search result ranking
• Sometimes transactional, req/resp
• prediction, recommendations, etc

In Summary
[Diagram: stages collection, cleansing, processing, delivery; a second row labeled event data, signal, info, knowledge; activities labeled normalization, mining, scoring]

The point of computing systems is to make data
more valuable

Where does Hadoop fit?


Hadoop
[Diagram: a Cluster of Racks, each holding Nodes, presented as one Global Compute-space and one Global Namespace]

• Distributed replicated storage for large files
• Distributed fault-tolerant execution of batch processes
• Scale out vs (legacy) scale up
• Java API allows complex analysis, more freedom

MapReduce
• A “divide and conquer” strategy for
parallelizing workloads against collections of
data

• Map & Reduce are two user-defined functions
chained via key/value pairs

• It’s really Map->Group->Reduce where Group
is built in


Keys and Values
• Map translates input keys and values to new keys and values
• System Groups each unique key with all its values
• Reduce translates the values of each unique key to new keys and values

[K1,V1] -> Map -> [K2,V2]*
[K2,V2] -> Group -> [K2,{V2,V2,....}]
[K2,{V2,V2,....}] -> Reduce -> [K3,V3]*
* = zero or more
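
The Map -> Group -> Reduce contract can be sketched in a few lines. This is a single-process illustration of the key/value signatures only, not Hadoop's API; the `map_group_reduce` helper and the city/temperature records are made up for the example.

```python
from collections import defaultdict


def map_group_reduce(inputs, mapper, reducer):
    """Run Map, then the built-in Group, then Reduce over (key, value) pairs."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        # Map: each [K1,V1] yields zero or more [K2,V2]
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # Group: collect every V2 under its K2
    output = []
    for k2 in sorted(groups):  # keys are visited in sorted order here
        # Reduce: each [K2,{V2,...}] yields zero or more [K3,V3]
        output.extend(reducer(k2, groups[k2]))
    return output


def temperature_map(line_id, line):
    """Emit (city, temperature); a malformed line emits nothing at all."""
    parts = line.split()
    if len(parts) == 2:
        yield parts[0], int(parts[1])


def max_reduce(city, temperatures):
    yield city, max(temperatures)


inputs = [(1, "paris 21"), (2, "oslo 9"), (3, "paris 25"), (4, "")]
result = map_group_reduce(inputs, temperature_map, max_reduce)
print(result)
```

The empty fourth record shows the "zero or more" rule: the mapper is free to emit nothing for an input.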


Word Count
Mapper
[0, "when in the course of human events"] -> Map -> ["when",1] ["in",1] ["the",1] [...,1]

["when",1] ["when",1] ["when",1] ["when",1] ["when",1] -> Group -> ["when",{1,1,1,1,1}]

Reducer
["when",{1,1,1,1,1}] -> Reduce -> ["when",5]
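
The word count walk-through can be reproduced with a small in-memory sketch. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework, and the list index from `enumerate` stands in for the byte-offset key shown on the slide.

```python
from collections import defaultdict


def word_count_map(offset, line):
    """Mapper: ignore the offset key and emit [word, 1] for every word."""
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reducer: collapse the grouped 1s into a single total."""
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    results = {}
    for k2, values in groups.items():
        for k3, v3 in reducer(k2, values):
            results[k3] = v3
    return results


# Five copies of the line, so "when" is grouped as {1,1,1,1,1}.
text = ["when in the course of human events"] * 5
records = list(enumerate(text))
counts = run_job(records, word_count_map, word_count_reduce)
print(counts["when"])
```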


Divide and Conquer Parallelism
• The ‘records’ entering the Map and the ‘groups’
entering the Reduce are independent

• That is, there is no expectation of order or
requirement to share state between records/groups

• So arbitrary numbers of Map and Reduce function
instances can be created against arbitrary portions
of input data
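
Because records and groups are independent, the answer should not depend on how the input is carved up. A minimal sketch of that property, assuming a thread pool as a stand-in for cluster nodes; the function names and chunking scheme are illustrative, and Python threads here show independence rather than real speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """One Map instance: sees only its own portion and shares no state."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_group(item):
    """One Reduce call: sees only one key and its values."""
    word, ones = item
    return word, sum(ones)


def word_count(lines, num_map_tasks, num_workers=4):
    # Strided slicing scatters the records, so no order is preserved.
    chunks = [lines[i::num_map_tasks] for i in range(num_map_tasks)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
        groups = defaultdict(list)
        for pairs in mapped:
            for word, one in pairs:
                groups[word].append(one)
        reduced = list(pool.map(reduce_group, groups.items()))
    return dict(reduced)


lines = ["a b a", "b c", "a", "c c b"]
one_task = word_count(lines, num_map_tasks=1)
three_tasks = word_count(lines, num_map_tasks=3)
print(one_task == three_tasks, three_tasks)
```

Asking for more map tasks than there are lines simply produces some empty chunks, which map to nothing.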

Cluster
[Diagram: multiple map and reduce instances spread across the Racks of a Cluster]

• Multiple instances of each Map and Reduce
function are distributed throughout the cluster


Another View
[K1,V1] -> Map -> [K2,V2] -> Combine -> Group -> [K2,{V2,...}] -> Reduce -> [K3,V3]

[Diagram: a file is divided into splits (split1, split2, split3, split4, ...), each read by a Mapper Task; Shuffle carries Mapper Task output to the Reducer Tasks; each Reducer Task writes one part file (part-00000, part-00001, part-000N) into the output directory. One label reads "same code".]

Mappers must complete before Reducers can begin
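
The Combine step can be sketched as a local pre-aggregation inside each Mapper Task, which shrinks what the Shuffle has to move. This sketch assumes the "same code" label means Combine reuses the Reduce function, a common Hadoop pattern that only works when the function is associative and commutative (summing is). The `run` helper and the sample splits are illustrative.

```python
from collections import defaultdict


def map_words(line):
    return [(word, 1) for word in line.split()]


def sum_values(key, values):
    """Used as both Combine and Reduce; valid because addition can be regrouped."""
    return key, sum(values)


def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run(splits, use_combiner):
    """Return (final counts, number of pairs that crossed the shuffle)."""
    shuffled = []
    for split in splits:  # one Mapper Task per split
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            # Combine: pre-aggregate this task's output before the shuffle
            pairs = [sum_values(k, vs) for k, vs in group(pairs).items()]
        shuffled.extend(pairs)
    result = dict(sum_values(k, vs) for k, vs in group(shuffled).items())
    return result, len(shuffled)


splits = [["when when in", "when the"], ["when in", "the the"]]
plain_result, plain_pairs = run(splits, use_combiner=False)
combined_result, combined_pairs = run(splits, use_combiner=True)
print(plain_pairs, combined_pairs, combined_result)
```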


Architectural Components
[Diagram: the Client submits jobs to the JobTracker, which assigns tasks to TaskTrackers; each TaskTracker runs mappers and reducers in child JVMs. The NameNode (with its Secondary) serves namespace (ns) operations; DataNodes hold data blocks and serve read/write operations.]

• Solid boxes are unique applications
• Dashed boxes are child JVM instances on same node as parent
• Dotted boxes are blocks of managed files on same node as parent

Deployment Topology
[Diagram: Client, JobTracker, NameNode, and Secondary each shown on a Node, with a TaskTracker and a DataNode sharing a Node; a callout reads "Not uncommon to be same node"]

• Job Client may run on any node
• NameNode and JobTracker may run on same node (Master)
• DataNode and TaskTracker instances should run on same node (Slaves)
• NameNode and Secondary NameNode shouldn’t typically run on same node

Complex job assemblies
• Real applications are many MapReduce jobs chained together

• Linked by intermediate (usually temporary) files

• Executed in order, by hand, from the ‘client’ application

[Diagram: File -> Count Job (Map, Reduce) -> File -> Sort Job (Map, Reduce) -> File; each Map and Reduce emits [ k, v ], and each Reduce receives [ k, [v] ]]

[ k, v ] = key and value pair
[ k, [v] ] = key and associated values collection
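
The Count Job -> Sort Job chain can be sketched as two jobs that communicate only through a temporary file, run in order by the client. This is a local illustration, not Hadoop code: `run_job`, the JSON-lines file format, and the file names are assumptions for the example.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_job(in_path, out_path, mapper, reducer):
    """One job: read a file, map, group, reduce, write a file."""
    groups = defaultdict(list)
    with open(in_path) as src:
        for line in src:
            for key, value in mapper(line.rstrip("\n")):
                groups[key].append(value)
    with open(out_path, "w") as dst:
        for key in sorted(groups):  # reducers see keys in sorted order
            for out_key, out_value in reducer(key, groups[key]):
                dst.write(json.dumps([out_key, out_value]) + "\n")


def count_map(line):
    for word in line.split():
        yield word, 1


def count_reduce(word, ones):
    yield word, sum(ones)


def sort_map(line):
    word, count = json.loads(line)
    yield count, word  # make the count the key so grouping orders by it


def sort_reduce(count, words):
    for word in sorted(words):
        yield word, count


def count_then_sort(lines):
    """The 'client': runs the jobs by hand, in order, via a temp file."""
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "input.txt")
        intermediate = os.path.join(workdir, "counts.tmp")
        final = os.path.join(workdir, "sorted.txt")
        with open(source, "w") as f:
            f.write("\n".join(lines) + "\n")
        run_job(source, intermediate, count_map, count_reduce)
        run_job(intermediate, final, sort_map, sort_reduce)
        with open(final) as f:
            return [tuple(json.loads(line)) for line in f]


ranking = count_then_sort(["b a b", "c b a"])
print(ranking)
```

Every extra job adds another hand-managed file and another ordering dependency in the client, which is the bookkeeping the following slides are concerned with.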

[Diagram: "Match two sets using prefix filtering" as a chain of MapReduce jobs linked by Files: Tokenize Count Job, Join Tokens/Counts Job, Sort/Prefix Filter Job, Self Join Job, Unique Pairs Job, Join LHS Job, Join RHS / Match Job]

Real World Apps
[Diagram: dependency graph of the jobs in one application, numbered up to [75/75]; each node is labeled map or map+reduce]

1 app, 75 jobs

green = map + reduce
purple = map
blue = join/merge
orange = map split

Cascading
Word Count/Sort Flow
[Diagram: Data -> Parse -> Group -> Count -> Sort -> Data, with [ f1,f2,.. ] tuples passed between operations]
[ f1, f2,... ] = tuples with field names

• Alternative model & API to MapReduce
• pipe/filters of re-usable operations
• For rapidly implementing Data Processing Systems
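
The pipe/filter idea behind the Word Count/Sort flow can be illustrated with plain Python generators passing tuples with field names. This is an analogy only: Cascading itself is a Java API, and none of the function names below come from it.

```python
from collections import Counter


def parse(lines):
    """Parse: one line in, one {'word': ...} tuple out per token."""
    for line in lines:
        for token in line.lower().split():
            yield {"word": token}


def group_count(tuples, field):
    """Group + Count: one output tuple per distinct value of `field`."""
    counts = Counter(t[field] for t in tuples)
    for value, count in counts.items():
        yield {field: value, "count": count}


def sort_by(tuples, field, reverse=False):
    """Sort: order tuples by the named field."""
    return sorted(tuples, key=lambda t: t[field], reverse=reverse)


def word_count_sort_flow(lines):
    """Wire the reusable operations together: Parse -> Group/Count -> Sort."""
    return sort_by(group_count(parse(lines), "word"), "count", reverse=True)


result = word_count_sort_flow(["When in the course", "in the", "the"])
for row in result:
    print(row)
```

Each operation refers to fields by name and knows nothing about its neighbors, so `parse` or `sort_by` can be unit tested alone and reused in other flows.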

Cascading

• Allows for Unit testing independent of
integration
• Re-usable libraries
• Integration is first class
• Homogeneous framework for scheduling
• Any JVM based host language

Elastic MapReduce
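[Diagram: a User drives Elastic MapReduce on Amazon Web Services through the CLI, Console, or a Client; the cluster has a Master and Slaves running Map/Reduce (mr) with temp data on HDFS; the jar, input, and output live in S3]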

• Clusters typically single purpose
• S3 used for storage between runs

Architecture Isn’t Innovation
[Diagram: labels operationalization, mining, innovation]

Rate of innovation and arrival of answers are
proportional

Big vs Lots
[Table: rows Data Mining* and Data Processing; columns Lots of Data and "Big" Data. Marks as extracted: Data Mining* row "! ?", Data Processing row "! !"]
! = Hadoop
? = RDBMS, R, etc
* Data Warehousing

• Big - too much to fit in/on any one thing
• Lots - complexity arising from keeping
track of all the bits

At Rest vs In Motion
[Diagram, Data At Rest: loggers -> raw data -> ETL -> data warehousing -> data mining by an Analyst]
[Diagram, Data In Motion: loggers -> raw data -> data processing -> valuable data -> Consumer process]

• Hub/Spoke vs Incremental Layers
• Static Schema vs Dynamic Views
• Monolithic vs Distributed

Hadoop for Processing
[Diagram: layers labeled Value Creation, Scalability, Simplicity]

• Delivering Value from Innovation
• Scalability, Not Performance
• Simplifies Infrastructure

Simplicity
[Diagram: a Cluster of Racks whose cpus form a Global Compute-space and whose disks form a Global Namespace]

• Virtualization across resources, not within (PaaS)
• A single FileSystem across disks - no DBA
• A single Execution System across CPUs - less IT
• One app installed and managed across hardware

Scalability
[Diagram: Users at several Clients submit jobs to one Cluster of Racks]

• Scalability - continued reliability and met expectations as
demand changes
• Application Scalability - data grows, app/infra expand
• Organizational Scalability - simpler infra and apps

Delivering Value
[Diagram: loggers on the Producer side emit events as raw data; data processing on Hadoop + Cascading (etl, analytics) delivers reporting, product, and operational outputs to the Consumer; the flow is labeled Value]

• Unconstrained processing model
• Data processing requires integration
• Processing must not fail or fall behind

Data In Motion
[Diagram, Data In Motion: loggers -> raw data -> data processing -> valuable data -> Consumer process]

• Data always arriving, results being delivered
• Not paying the upfront cost of indexing
• No upfront schema design
• “ETL” is built into the processing pipeline

Where to Innovate?
[Table: rows Data Mining* and Data Processing; columns Lots of Data and "Big" Data. Marks as extracted: Data Mining* row "! ?", Data Processing row "! !"]
! = Hadoop
? = RDBMS, R, etc
* Data Warehousing

• Depends on the problem whether Hadoop
makes sense as your innovation platform


Hadoop for Innovating

[Diagram: labels value, innovation, latency, degrees of freedom]

• Need to ask similar questions repeatedly
• Indexes help here
• Need a reasonably high abstraction
• Existing libraries and a simple syntax
• Third-party Tool support

Innovation Abstractions
• Syntax
• Pig
• Hive - now has some indexing support
• Language (easier to operationalize)
• Cascalog
• Cascading.jruby
• 3 new Scala languages pending release

Data At Rest
[Diagram, Data At Rest: loggers -> raw data -> ETL -> data warehousing -> data mining by an Analyst]

• Hadoop becomes a warehouse (with Schemas)
• and without indexes, high latency queries
• ETL becomes an independent architecture

Don’t throw out the baby with the bath water
• Need low latency responses
• Need support for existing tools
• Need to not retrain analysts
• RDBMS (Aster, Greenplum, Vertica, Oracle)
• R
• SAS
• MicroStrategy
• Tableau

Baling Wire & Bubble Gum

• Integrating them with Hadoop adds
brittleness and inefficiencies
• Hadoop Streaming
• RHIPE, etc.

Operationalizing
[Diagram: labels operationalization, mining, innovation]

• Minimize the number of processing technologies (technical debt)
• Don’t lose sight of the physical model/plan
• XML is not a programming language
• String concatenation isn’t programming

Resources
• Chris K Wensel
• chris@wensel.net
• @cwensel

• Cascading & Cascalog
• http://cascading.org
• @cascading

• Concurrent, Inc.
• http://concurrentinc.com
• @concurrent
• http://concurrentinc.com/careers
