2. Agenda
• Part #1: Big Data
• Part #2: Why Hadoop, How, and When
• Part #3: Overview of the Coding Ecosystem (Pig / Hive / Cascading)
• Part #4: Overview of the Machine Learning Ecosystem (Mahout)
• Part #5: Overview of the Extended Ecosystem
5. "Big" Data in 1999

struct Element {
    Key key;
    void* stat_data;
};
...

• C
• Optimized data structures
• Perfect hashing
• HP-UX servers, 4 GB RAM
• 100 GB of data
• Web crawler with socket reuse, HTTP 0.9
• 1 month

Dataiku 1/8/14
6. Big Data in 2013
• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A dozen NoSQL data stores
• MPP databases
• Real-time
• 1 hour
7. To Hadoop
[Diagram: data volumes and platform costs by domain and year — Web Search (1999, 2010), Logistics (2004), Banking and CRM (2008), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013); volumes ranging from 1 TB to 1,000 TB and costs from $10M to $1B, contrasting the "SQL or ad hoc" era with the "SQL + Hadoop" era.]
8. Meet Hal Alowne

Hal Alowne
BI Manager, Dim's Private Showroom
A European e-commerce web site:
• $100M revenue
• 1 million customers
• 1 data analyst (Hal himself)

Dim Sum
CEO & Founder, Dim's Private Showroom
"Hey Hal! We need a big data platform like the big guys. Let's just do as they do!"

The Big Data Copy Cat Project. The big guys:
• $10B+ revenue
• 100M+ customers
• 100+ data scientists

Dataiku - Data Tuesday
24. MERIT = TIME + ROI
TIME: 6 months. ROI: apps.

The long way (2013 to 2014):
• Find the right people (6 months?)
• Choose the technology (6 months?)
• Make it work (6 months?)

Instead, build the lab in 6 months rather than 18 (2013):
• Train people
• Reuse working patterns

Then deploy apps that actually deliver value:
• Targeted newsletter
• Recommender systems
• Adapted products / promotions
27. CHOOSE TECHNOLOGY
A map with several crowded regions:
• Scalability Central: Hadoop, ElasticSearch, Ceph, SOLR
• NoSQL-Slavia: Cassandra, MongoDB, Riak, CouchBase
• Machine Learning Mystery Land: Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, Sphere, MLBase, LibSVM, RapidMiner
• Real-Time Island: Kafka, Flume, Spark, Storm, Drill
• SQL Columnar Republic: InfiniDB, Vertica, GreenPlum, Impala, Netezza
• Visualization County: QlikView, Tableau, Kibana, SpotFire, D3
• Statistician's Old House: SPSS, R, SAS, pandas
• Data Clean Wasteland: Talend
• And in between: Pig, Cascading

Dataiku - Pig, Hive and Cascading
28. Big Data Use Case #1: Manage Volumes
• The Business Intelligence stack has scalability and maintenance issues
• The back office implements business rules that are being challenged
• The existing infrastructure cannot cope with per-user information

Main pain point: 23 hours 52 minutes to compute the Business Intelligence aggregates for one day.
29. Big Data Use Case #1: Manage Volumes
Goals:
• Relieve the current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on the website: more recommendations, more profiling, etc.
• Train existing people in machine learning and segmentation

Results:
• 1h12 to compute the aggregates, available every morning
• New home page personalization deployed in a few weeks

Setup: Hadoop cluster (24 cores) on Google Compute Engine, Python + R + Vertica, a 12 TB dataset, a 6-week project.
30. Big Data Use Case #2: Find Patterns
• Correlation between community size and engagement / virality
• Meaningful patterns: 2 players / family / group
• What is the minimum number of friends to have in the application to get additional engagement?

Findings: one very large community, some mid-size communities, and lots of small clusters (mostly 2 players).
31. How do I (pre)process data?
Inputs:
• Implicit user data (views, searches, …): 500 TB
• Explicit user data (clicks, buys, …): 50 TB
• User information (location, graph, …): 1 TB
• Content data (title, categories, price, …): 200 GB
• A/B test data

Transformations produce per-user stats, per-content stats, a user similarity matrix, a content similarity matrix, and a rank predictor, all of which feed the predictor runtime.
33. The Questions
Pour data in:
• How often?
• What kind of interaction?
• How much?
Compute something smart about it:
• How complex?
• Do you need all data at once?
• How incremental?
Make it available:
• Interaction?
• Random access?
35. The Text Use Case
Pour data in: a large volume (1 TB) of text-like data (logs, docs, …).
Compute something smart about it: a massive global transformation, then aggregation (counting, inverted index, …).
Make it available: every day.
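The "counting, inverted index" aggregation can be sketched in a few lines of Python (an in-process toy, not Hadoop code; the document ids d1/d2 are made up):

```python
from collections import defaultdict

def inverted_index(docs):
    # Map: emit (word, doc_id) for every word; Reduce: collect the
    # posting list (set of documents) per word.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = inverted_index({"d1": "big data", "d2": "big hadoop"})
# "big" appears in both documents, "data" only in d1
```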
36. What's Difficult (back in 2000)
• Large data won't fit in one server
• Large computations (a few hours) are bound to fail one time or another
• Data is so big that memory is too small to perform full aggregations
• Parallelization with threading is error-prone
• Data is so big that the Ethernet cable is not big enough
37. What's Difficult (back in 2000), and the Hadoop answers
• Large data won't fit in one server: HDFS
• Large computations (a few hours) are bound to fail one time or another: JOB TRACKER (failed tasks are retried)
• Data is so big that memory is too small to perform full aggregations: MAP REDUCE
• Parallelization with threading is error-prone: MAP REDUCE
• Data is so big that the Ethernet cable is not big enough: MAP REDUCE (computation moves to the data)
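The MapReduce model behind these answers can be sketched in plain Python (a single-process toy using only the standard library; the real thing shards both phases across machines and handles the shuffle for you):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit (word, 1) for every word, as in the classic word count.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The shuffle/sort groups identical keys together; each reducer
    # then sums the values for its key.
    result = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        result[word] = sum(count for _, count in group)
    return result

counts = reduce_phase(map_phase(["big data", "big hadoop"]))
```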
44. Pig History
• Yahoo Research, 2006
• Inspired by Sawzall, a Google paper from 2003
• An Apache project since 2007
• Initial motivation, search log analytics: how long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary over the course of a day/week/month? …

words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t')
    AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
45. Hive History
• Developed by Facebook in January 2007
• Open sourced in August 2008
• Initial motivation: provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';
46. Cascading History
• Authored by Chris Wensel, 2008
• Associated projects:
  ◦ Cascalog: Cascading in Clojure
  ◦ Scalding: Cascading in Scala (Twitter, 2012)
  ◦ Lingual (to be released soon): SQL layer on top of Cascading
47. Pig / Hive: Mapping to MapReduce Jobs

events = LOAD '/events' USING PigStorage('\t')
    AS (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price,
    MAX(timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000; -- VAT excluded

Job 1, mapper: LOAD, FILTER
Job 1, reducer: shuffle and sort by user, then GROUP, FOREACH, FILTER
48. Pig / Hive: Mapping to MapReduce Jobs

events = LOAD '/events' USING PigStorage('\t')
    AS (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price,
    MAX(timestamp) AS max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1, mapper: LOAD, FILTER
Job 1, reducer: shuffle and sort by user, then GROUP, FOREACH, FILTER
Job 2, mapper: LOAD (from tmp)
Job 2, reducer: shuffle and sort by max_ts, then STORE
49. Pig: How Does It Work
The data execution plan is compiled into map/reduce jobs (10 in this example), executed in parallel (or not).
50. Hive Joins
How to join with MapReduce?

Mappers tag each record with a table index (tbl_idx) and emit it keyed by uid:
• Table 1 (tbl_idx 1): (uid, name) rows such as (1, Dupont) and (2, Durand)
• Table 2 (tbl_idx 2): (uid, type) rows such as (1, Type1), (2, Type1), (2, Type2)

Shuffle by uid, sort by (uid, tbl_idx): each reducer sees, for a given uid, the name record first and then all the type records, and emits the joined rows:
• Reducer 1: uid 1 gives (Dupont, Type1)
• Reducer 2: uid 2 gives (Durand, Type1) and (Durand, Type2)
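The tagged shuffle-and-sort join can be sketched in plain Python (a toy stand-in for the jobs Hive generates; the table contents follow the slide's example, the helper names are ours):

```python
from itertools import groupby
from operator import itemgetter

users  = [(1, "Dupont"), (2, "Durand")]              # table 1: (uid, name)
events = [(1, "Type1"), (2, "Type1"), (2, "Type2")]  # table 2: (uid, type)

def mappers():
    # Tag each record with a table index so that, after sorting by
    # (uid, tbl_idx), names arrive before types for every uid.
    for uid, name in users:
        yield (uid, 0, name)
    for uid, etype in events:
        yield (uid, 1, etype)

def reduce_join(tagged):
    # Shuffle by uid (here: a plain sort), then join within each uid group.
    joined = []
    for uid, group in groupby(sorted(tagged), key=itemgetter(0)):
        rows = list(group)
        names = [v for _, idx, v in rows if idx == 0]
        types = [v for _, idx, v in rows if idx == 1]
        for name in names:
            for etype in types:
                joined.append((uid, name, etype))
    return joined

result = reduce_join(mappers())
```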
52. Comparing without Comparable
• Philosophy
  ◦ Procedural vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Partitioning
  ◦ Formats integration
  ◦ External code integration
• Performance and optimization
53. Procedural vs Declarative

Transformation as a sequence of operations (Pig):

Users = LOAD 'users' AS (name, age, ipaddr);
Clicks = LOAD 'clicks' AS (user, url, value);
ValuableClicks = FILTER Clicks BY value > 0;
UserClicks = JOIN Users BY name, ValuableClicks BY user;
Geoinfo = LOAD 'geoinfo' AS (ipaddr, dma);
UserGeo = JOIN UserClicks BY ipaddr, Geoinfo BY ipaddr;
ByDMA = GROUP UserGeo BY dma;
ValuableClicksPerDMA = FOREACH ByDMA GENERATE group, COUNT(UserGeo);
STORE ValuableClicksPerDMA INTO 'ValuableClicksPerDMA';

Transformation as a set of formulas (Hive):

INSERT INTO ValuableClicksPerDMA
SELECT dma, count(*)
FROM geoinfo JOIN (
    SELECT name, ipaddr FROM users JOIN clicks ON (users.name = clicks.user)
    WHERE value > 0
) x ON (geoinfo.ipaddr = x.ipaddr)
GROUP BY dma;
54. Data Type and Model
Rationale: all three extend the basic data model with complex data types
◦ array-like: [event1, event2, event3]
◦ map-like: {type1:value1, type2:value2, …}

Different approaches:
◦ Resilient schema
◦ Static typing
◦ No static typing
55. Hive: Data Type and Schema

CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
);

Simple types:
• TINYINT, SMALLINT, INT, BIGINT: 1, 2, 4 and 8 bytes
• FLOAT, DOUBLE: 4 and 8 bytes
• BOOLEAN
• STRING: arbitrary length, replaces VARCHAR
• TIMESTAMP

Complex types:
• ARRAY: array of typed items (0-indexed)
• MAP: associative map
• STRUCT: complex class-like objects

Dataiku Training – Hadoop for Data Science
56. Pig: Data Types and Schema

rel = LOAD '/folder/path/'
    USING PigStorage('\t')
    AS (col:type, col:type, col:type);

Simple types:
• int, long, float, double: 32 and 64 bits, signed
• chararray: a string
• bytearray: an array of … bytes
• boolean: a boolean

Complex types:
• tuple: an ordered fieldname:value map
• bag: a set of tuples
57. Cascading: Data Type and Schema
• Supports any Java type, provided it can be serialized in Hadoop
• No support for schema typing

Simple types:
• Int, Long, Float, Double: 32 and 64 bits, signed
• String: a string
• byte[]: an array of … bytes
• Boolean: a boolean

Complex types:
• Object: must be "Hadoop serializable"
58. Style Summary

Framework | Style       | Typing                                       | Data model                             | Metadata store
----------|-------------|----------------------------------------------|----------------------------------------|---------------
Pig       | Procedural  | Static + dynamic                             | scalar + tuple + bag (fully recursive) | No (HCatalog)
Hive      | Declarative | Static + dynamic, enforced at execution time | scalar + list + map                    | Integrated
Cascading | Procedural  | Weak                                         | scalar + Java objects                  | No
59. Comparing without Comparable
• Philosophy
  ◦ Procedural vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing, error management and environment
• Integration
  ◦ Partitioning
  ◦ Formats integration
  ◦ External code integration
• Performance and optimization
61. Headaches: Pig
• Out of memory errors (reducer)
• Exceptions in builtin / extended functions (handling of null)
• Null vs ""
• Nested FOREACH and scoping
• Date management (Pig 0.10)
• Implicit field ordering
63. Headaches: Hive
• Out of memory errors in reducers
• Few debugging options
• Null vs ""
• No builtin "first"
64. Headaches: Cascading
• Weak typing errors (comparing Int and String, …)
• Illegal operation sequences (group after group, …)
• Implicit field ordering
65. Testing
Motivation:
• How to perform unit tests?
• How to have different versions of the same script (parameters)?
68. Checkpointing
Motivation:
• Lots of iterations while developing on Hadoop
• Sometimes jobs fail
• Sometimes you need to restart from the start …

Pipeline: parse logs, then per-page stats, page/user correlation, filtering, output. When a step fails: FIX and relaunch.
69. Pig: Manual Checkpointing
Use the STORE command to manually store intermediate files, then comment out the beginning of the script (parse logs, per-page stats, page/user correlation, filtering, output) and relaunch.
71. Cascading: Topological Scheduler
• Checks the timestamp of each intermediate file
• Executes a step only if its inputs are more recent than its output
(Pipeline: parse logs, per-page stats, page/user correlation, filtering, output.)
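That timestamp check is easy to sketch in Python (illustrative only; the file names mirror the pipeline above, and the helper name needs_rebuild is ours, not Cascading's):

```python
import os
import tempfile

def needs_rebuild(output, inputs):
    # Re-run a step only if the output is missing or any input is more
    # recent than it, mirroring Cascading's check on intermediate files.
    if not os.path.exists(output):
        return True
    out_ts = os.path.getmtime(output)
    return any(os.path.getmtime(p) > out_ts for p in inputs)

# Tiny demo: an output built after its input does not need a rebuild.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "parse_logs.tsv")
    dst = os.path.join(d, "per_page_stats.tsv")
    open(src, "w").close()
    os.utime(src, (1000, 1000))   # input: old timestamp
    open(dst, "w").close()
    os.utime(dst, (2000, 2000))   # output: newer timestamp
    fresh = needs_rebuild(dst, [src])                       # up to date
    missing = needs_rebuild(os.path.join(d, "absent"), [src])  # must run
```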
73. Comparing without Comparable
• Philosophy
  ◦ Procedural vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Formats integration
  ◦ Partitioning
  ◦ External code integration
• Performance and optimization
74. Formats Integration
Motivation:
• Ability to integrate different file formats
  ◦ Text delimited
  ◦ Sequence file (binary Hadoop format)
  ◦ Avro, Thrift, …
• Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)

Format impact on size and performance:

Format                       | Size on disk (GB) | Hive processing time (24 cores)
-----------------------------|-------------------|--------------------------------
Text file, uncompressed      | 18.7              | 1m32s
1 text file, gzipped         | 3.89              | 6m23s (no parallelization)
JSON, compressed             | 7.89              | 2m42s
Multiple text files, gzipped | 4.02              | 43s
Sequence file, block, gzip   | 5.32              | 1m18s
Text file, LZO indexed       | 7.03              | 1m22s
76. Partitions
Motivation: no support for "UPDATE" patterns; any increment is performed by adding or deleting a partition.

Common partition schemes on Hadoop:
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above
77. Hive Partitioning
Partitioned tables:

CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure:

/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
78. Cascading Partitions
• No direct support for partitions
• Support for "glob" taps, to read from multiple files using patterns
➔ You can code your own custom or virtual partition schemes
82. Integration Summary

Framework | Partition / incremental updates | External code                                          | Format integration
----------|---------------------------------|--------------------------------------------------------|---------------------------
Pig       | No direct support               | Simple UDFs                                            | Doable, rich community
Hive      | Fully integrated, SQL-like      | Complex UDFs, but regular; Java expressions embeddable | Doable, existing community
Cascading | With coding                     | Very simple, but complex dev setup                     | Doable, growing community
83. Comparing without Comparable
• Philosophy
  ◦ Procedural vs Declarative
  ◦ Data Model and Schema
• Productivity
  ◦ Headachability
  ◦ Checkpointing
  ◦ Testing and environment
• Integration
  ◦ Formats integration
  ◦ Partitioning
  ◦ External code integration
• Performance and optimization
84. Optimization
Several common MapReduce optimization patterns:
◦ Combiners
◦ Map join
◦ Job fusion
◦ Job parallelism
◦ Reducer parallelism

Different support per framework:
◦ Fully automatic
◦ Pragmas / directives / options
◦ Coding style / code to write
85. Combiner
Perform a partial aggregate at the mapper stage.

SELECT date, COUNT(*) FROM product GROUP BY date

Without a combiner: each mapper emits one record per input row (date, product id); every record crosses the network, and the reducers compute the final counts (2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1).
86. Combiner
Perform a partial aggregate at the mapper stage.

SELECT date, COUNT(*) FROM product GROUP BY date

With a combiner: each mapper emits partial counts (e.g. 2012-02-14: 8 and 2012-02-15: 12 from one mapper; 2012-02-14: 12, 2012-02-15: 23, 2012-02-16: 1 from another), and the reducers simply sum the partials (20, 35, 1). Reduced network bandwidth, better parallelism.
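The combiner's effect can be sketched in Python (a toy stand-in for the mapper-side aggregation; the dates come from the slide, the row contents are made up):

```python
from collections import Counter

def mapper_with_combiner(rows):
    # Combiner: partial COUNT(*) per date inside one mapper, before the
    # shuffle, so only one record per distinct date crosses the network.
    return Counter(date for date, _product in rows)

partial1 = mapper_with_combiner(
    [("2012-02-14", "a"), ("2012-02-14", "b"), ("2012-02-15", "c")])
partial2 = mapper_with_combiner(
    [("2012-02-15", "d"), ("2012-02-16", "e")])

# Reducer: just sum the partial counts.
totals = partial1 + partial2
```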
87. Join Optimization: Map Join
Load the smaller table in memory on every mapper and join there, skipping the reduce phase.
• Hive: set hive.auto.convert.join = true;
• Pig: JOIN … USING 'replicated'
• Cascading: HashJoin (no aggregation support after a HashJoin)
88. Number of Reducers
• Critical for performance
• Estimated from the size of the input data
  ◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1 GB)
  ◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1 GB)
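The estimate is simple arithmetic, sketched here in Python (illustrative only; the engines' actual sizing logic has more knobs, and the helper name is ours):

```python
import math

def estimated_reducers(input_bytes, bytes_per_reducer=1 << 30):
    # Both Hive and Pig derive the reducer count from
    # input size / bytes-per-reducer (default 1 GB), with a floor of 1.
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

# E.g. the 12 TB dataset from use case #1 at the default setting:
reducers = estimated_reducers(12 * (1 << 40))  # 12 TB / 1 GB = 12288
```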
94. Clustering Applications
• Fraud: detect outliers
• CRM: mine for customer segments
• Image processing: similar images
• Search: similar documents
• Search: allocate topics
95. K-Means
• Guess an initial placement for the centroids
• Assign each point to the closest center (MAP)
• Reposition each center (REDUCE)
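One MAP/REDUCE iteration of K-Means can be sketched in plain Python (an in-memory toy, not Mahout code; the points and helper names are made up):

```python
from collections import defaultdict

def closest(point, centroids):
    # MAP: assign a point to the nearest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    # REDUCE: group points by assigned centroid, then reposition each
    # centroid at the mean of its cluster.
    clusters = defaultdict(list)
    for point in points:
        clusters[closest(point, centroids)].append(point)
    new_centroids = []
    for i, c in enumerate(centroids):
        pts = clusters.get(i)
        if pts:
            new_centroids.append(tuple(sum(dim) / len(pts) for dim in zip(*pts)))
        else:
            new_centroids.append(c)  # keep a centroid that attracted no points
    return new_centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
```

Mahout runs exactly this loop as a chain of MapReduce jobs, which is why each iteration pays the cost of a full read and write to disk.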
105. Clustering Challenges
• Curse of dimensionality
• Choice of distance / number of parameters
• Performance
• Choice of the number of clusters
106. Mahout Clustering Challenges
• No integrated feature engineering stack: get ready to write data processing in Java
• Hadoop SequenceFile required as input
• Iterations run as Map/Reduce jobs that read and write to disk: relatively slow compared to in-memory processing
111. Convert a CSV File to a Mahout Vector
Real code would also include:
• Converting categorical variables to dimensions
• Variable rescaling
• Dropping IDs (name, forename, …)
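What that preprocessing looks like, sketched in Python rather than Java (illustrative only; the column names and helper are made up, and Mahout itself would want the result written as SequenceFile vectors):

```python
def rows_to_vectors(rows, drop, categorical):
    # Toy CSV-to-vector preprocessing: drop ID columns, one-hot encode
    # categorical columns, min-max rescale numeric columns to [0, 1].
    numeric = [k for k in rows[0] if k not in drop and k not in categorical]
    # First pass: collect category levels and numeric ranges.
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    lo = {k: min(float(r[k]) for r in rows) for k in numeric}
    hi = {k: max(float(r[k]) for r in rows) for k in numeric}
    vectors = []
    for r in rows:
        vec = []
        for k in numeric:  # rescale
            span = hi[k] - lo[k]
            vec.append((float(r[k]) - lo[k]) / span if span else 0.0)
        for c in categorical:  # one dimension per category level
            vec.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
        vectors.append(vec)
    return vectors

rows = [
    {"name": "Dupont", "age": "20", "country": "FR"},
    {"name": "Durand", "age": "40", "country": "DE"},
]
vectors = rows_to_vectors(rows, drop={"name"}, categorical={"country"})
```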
112. Mahout Algorithms

Algorithm                | Parameters                          | Implicit assumption                   | Output
-------------------------|-------------------------------------|---------------------------------------|-------
K-Means                  | K (number of clusters), convergence | Circles                               | Point → ClusterId
Fuzzy K-Means            | K (number of clusters), convergence | Circles                               | Point → ClusterId*, probability
Expectation Maximization | K (number of clusters), convergence | Gaussian distribution                 | Point → ClusterId*, probability
Mean-Shift Clustering    | Distance boundaries, convergence    | Gradient-like distribution            | Point → ClusterId
Top-Down Clustering      | Two clustering algorithms           | Hierarchy                             | Point → large ClusterId, small ClusterId
Dirichlet Process        | Model distribution                  | Points are a mixture of distributions | Point → ClusterId, probability
Spectral Clustering      | –                                   | –                                     | Point → ClusterId
MinHash Clustering       | Number of hashes / keys, hash type  | High dimension                        | Point → hash*
116. What If?
Pour data in: data comes continuously?
Compute something smart about it: aggregation patterns are not "hashable"?
Make it available: human interaction requires results fast or incrementally available?
117. After Hadoop
Beyond massive batch (MapReduce over HDFS):
• Random access
• In-memory, multicore machine learning
• Faster in-memory computation
• Real-time distributed computation
• Faster SQL analytics queries
118. HBase
• Started by Powerset (now part of Bing) in 2007
• Provides a key-value store on top of Hadoop
120. GraphLab
• High-performance, distributed computing framework, in C++
• Started in 2009 at Carnegie Mellon
• Main applications in machine learning tasks: topic modeling, collaborative filtering, computer vision
• Can read data from HDFS
121. Spark
• Developed in 2010 at UC Berkeley
• Provides a distributed memory abstraction for efficient sequences of map/filter/join applications
• Can read from and store to HDFS or files
123. Storm
• Developed in 2011 by Nathan Marz at BackType (then Twitter)
• Provides a framework for distributed, real-time, fault-tolerant computation
• Not a message queuing system but a complex event processing system