Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014

How do
Elephant
Make Babies
Florian Douetteau
CEO, Dataiku

Agenda
•

Part #1 Big Data

•

Part #2 Why Hadoop, How, and When

•

Part #3 Overview of the Coding Ecosystem
Pig / Hive / Cascading

•

Part #4 Overview of the Machine Learning Ecosystem
Mahout

•

Part #5 Overview of the Extended Ecosystem

PART #1
BIG DATA

3
Dataiku 1/8/14

Collocation

Dataiku

C
o
l
l
o
c
a
t

A familiar grouping of words,
especially words that habitually
appear together and thereby
convey meaning by association.

Big

Apple

Big

Mama

Big

Data
4
1/8/14

“Big” Data in 1999
struct Element {
Key key;
void* stat_data ;
}
….

C
Optimized Data structures
Perfect Hashing
HP-UNIX Servers – 4GB Ram
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
Dataiku

1 Month
5
1/8/14

Big Data in 2013







Hadoop
Java / Pig / Hive / Scala / Closure / …
A Dozen NoSQL data store
MPP Databases
Real-Time

1 Hour
6
Dataiku 1/8/14

To Hadoop
1 TB
1B $

1 TB
?$
1 TB
100M $

Web Search
1999
Logistics
2004

10 TB
10M $
100 TB
?$

Banking
CRM
2008

SQL OR AD HOC

50TB
1B$
1000TB
500M $
E-Commerce
2013

Social Gaming
2011
Web
Search
2010

Online
Advertising
2012

SQL + HADOOP

Meet Hal Alowne

Hal Alowne
BI Manager
Dim‟s Private Showroom
European E-commerce Web site
• 100M$ Revenue
• 1 Million customer
• 1 Data Analyst (Hal Himself)

Dataiku - Data Tuesday

‟

Dim Sum
CEO & Founder
Dim‟s Private Showroom

Hey Hal ! We need
a big data platform
like the big guys.
Let‟s just do as they do!
Big Data
Copy Cat Project

”

Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientist
1/8/14

8

QUESTION #1
IS IT EASY OR NOT
?

I SEND IT
TO THE
MARKETING
CLOUD

WE ALL LIVE
IN A BIG DATA
LAKE

ALL MY DATA
PROBABLY FITS
IN HERE

QUESTION #5 (at last)
HUMAN OR NOT ?

MACHINE
LEARNING
WILL SAVE
US ALL

MERIT = TIME + ROI
TIME : 6 MONTHS

ROI : APPS
2014

2013

Find the right
people
(6 months?)

Choose the
technology
(6 months?)

Make it work
(6 months?)

2013

Build the lab
(6 months)
• Train People
• Reuse working patterns

Build a lab in 6 months
(rather than 18 months)

Dataiku

Targeted
Newsletter
Recommender
Systems

Adapted Product
/ Promotions
Deploy apps
24
that actually deliver value
1/9/14

Statistics and Machine Learning is complex
!
Try to
understand
myself

25
Dataiku

1/9/14

(Some Book you might want to read)

26
Dataiku

1/9/14

CHOOSE TECHNOLOGY
NoSQL-Slavia

Machine Learning
Mystery Land

Scalability Central

Hadoop

ElasticSearch

Ceph

SOLR

Scikit-Learn
GraphLAB
prediction.io jubatus
Mahout
WEKA

Sphere

Cassandra

MongoDB
Riak
CouchBase

MLBase

LibSVM

Real-time island
SQL Colunnar Republic
InfiniDB

Drill

Kafka Flume
Spark Storm

RapidMiner

Vertica

GreenPlum
Impala
Netezza

QlickView

Cascading

Tableau

Vizualization County
Dataiku - Pig, Hive and Cascading

SPSS

Panda

Pig

Kibana
SpotFire D3

R

SAS

Talend

Data Clean Wasteland

Statistician Old
House

Big Data Use Case #1
Manage Volumes






Business Intelligence Stack as
Scalability and maintenance
issues
Backoffice implements
business rules that are
challenged
Existing infrastructure cannot
cope with per-user
information

Main Pain Point:

23 hours 52 minutes to
compute Business Intelligence
aggregates for one day.

28
Dataiku 1/9/14

Manage Volumes
•

•

•

Relieve their current DWH and
accelerate production of some
aggregates/KPIs
Be the backbone for new
personalized user experience on
their website: more
recommendations, more profiling,
etc.,
Train existing people around
machine learning and
segmentation experience

1h12

to perform the aggregate,
available every morning

New

home page personalization
deployed in a few weeks

Hadoop

Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6 weeks projects

29
Dataiku - Data Tuesday 1/9/14

Find Patterns


Correlation
◦ between community size and
engagement / virality



Some mid-size
communities

Meaningul patterns

◦ 2 players / Family / Group



What is the minimum
number of friends to have in
the application to get
additional engagement ?

A very large community

Lots of small clusters
mostly 2 players)

30
Dataiku

1/9/14

How do I (pre)process data?
Implicit User Data
(Views, Searches…)

Online User
Information
Transformation
Predictor

500TB
Transformation
Matrix

Explicit User Data

Predictor
Runtime

(Click, Buy, …)

Per User Stats

Rank Predictor

50TB
Per Content Stats

User Information
(Location, Graph…)
User Similarity

1TB
Content Data
(Title, Categories, Price, …)

200GB

Content Similarity

A/B Test Data


Always the same
Pour Data In

Compute Something
Smart About It

Make Available

The Questions
Pour Data In

How often ?
What kind of
interaction?
How much ?

Compute Something
Smart About It

How complex ?
Do you need all
data at once ?
How incremental
?

Make Available

Interaction ?
Random Access ?

PART #2
AT THE BEGINNING

WAS THE
ELEPHANT

The Text Use Case
Pour Data In

Large Volume
1TB
Textual Like Data
(Logs, Docs,….)

Compute Something
Smart About It

Massive Global
Transformation
Then Aggregation
(Counting, Invert
Index, ….)

Make Available

Every Day

What‟s Difficult
(back in 2000)
•

Large Data won‟t fit in one server

•

Large computation (a few hours) are
bound to fail one time or another

•

Data is so big that my memory is too big
to perform full aggregations

•

Parallelization with threading is errorprone

•

Data is so big that my Ethernet cable is
not that big

What‟s Difficult
(back in 2000)
•

Large Data won‟t fit in one server

HDFS
•

Large computation (a few hours) are
bound to fail one time or another

•

Data is so big that my memory is too big
to perform full aggregations

•

Parallelization with threading is errorprone

•

Data is so big that my Ethernet cable is
not that big

MAP REDUCE

JOB TRACKER

MapReduce
How to count works in many many boxes

MapReduce
PREREQUESITES

GROUPS CAN
BE DETERMINED
AT THE ROW
LEVEL

AGGREGATION
OPERATION IS
IDEMPOTENT
41

Pig History



Yahoo Research in 2006
Inspired from Sawzall, a Google Paper from
2003
2007 as an Apache Project



Initial motivation




◦ Search Log Analytics: how long is the average user
session ? how many links does a user click ? on before
leaving a website ? how do click patterns vary in the
course of a day/week/month ? …

words = LOAD '/training/hadoopwordcount/output„ USING PigStorage(„t‟)
AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;

Hive History


Developed by Facebook in January 2007



Open source in August 2008



Initial Motivation

◦ Provide a SQL like abstraction to perform statistics on
status updates

create external table wordcounts (
word string,
count int
) row format delimited fields terminated by 't'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit
10;
select SUM(count) from wordcounts where word like
„th%‟;

Cascading History


Authored by Chris Wensel 2008



Associated Projects

◦ Cascalog : Cascading in Closure
◦ Scalding : Cascading in Scala (Twitter in 2012)
◦ Lingual ( to be released soon): SQL layer on top
of cascading


Pig Hive

Mapping to Mapreduce jobs
events

= LOAD „/events‟ USING PigStorage(„t‟) AS
(type:chararray, user:chararray, price:int, timestamp:int);

events_filtered = FILTER events BY type;
by_user

= GROUP events_filtered BY user;

price_by_user

= FOREACH by_user GENERATE type, SUM(price) AS total_price,
MAX(timestamp) as max_ts;

high_pbu

= FILTER price_by_user BY total_price 1000;

Job 1 : Mapper
LOAD

FILTER

Job 1 : Reducer1
Shuffle and
sort by user

GROUP

FOREACH

FILTER

* VAT
excluded
Dataiku - Innovation Services

1/8/14

47

Pig Hive

Mapping to Mapreduce jobs
= LOAD „/events‟ USING PigStorage(„t‟) AS
(type:chararray, user:chararray, price:int, timestamp:int);

events

events_filtered = FILTER events BY type;
by_user

= GROUP events_filtered BY user;

price_by_user

= FOREACH by_user GENERATE type, SUM(price) AS total_price,
MAX(timestamp) as max_ts;

high_pbu

= FILTER price_by_user BY total_price 1000;

recent_high

= ORDER high_pbu BY max_ts DESC;

STORE recent_high INTO „/output‟;

Job 1: Mapper
LOAD

FILTER

Job 1 :Reducer
Shuffle and
sort by user

Job 2: Mapper
LOAD
(from tmp)

GROUP

FOREACH

FILTER

Job 2: Reducer
Shuffle and
sort by max_ts

STORE
48


1/8/14

Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)


Hive Joins

How to join with MapReduce ?
Uid
tbl_idx

uid

1
2

1
1
2

Dupont

Type2

Type1

2

Type2

type

Tbl_idx

Name

Type
Uid

1

Type

Durand

Type1

Durand

Type2
2

Name

2

Type1

2
2

Type1

Reducer 1

2
2

Dupont

1

2

Durand

Uid
2

Type

Dupont

Shuffle by uid
Sort by (uid, tbl_idx)
uid

Name

1

1

Dupont

1

tbl_idx

Type
Uid

1
1

Name

name
1

1

Tbl_idx

Type1

Type1

Mappers output

Reducer 2
50


1/8/14

Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization


Procedural Vs Declarative


Transformation as a
sequence of operations

Users
= load 'users' as (name, age, ipaddr);
Clicks
= load 'clicks' as (user, url, value);
ValuableClicks
= filter Clicks by value 0;
UserClicks
= join Users by name, ValuableClicks by
user;
Geoinfo
= load 'geoinfo' as (ipaddr, dma);
UserGeo
= join UserClicks by ipaddr, Geoinfo by
ipaddr;
ByDMA
= group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group,
COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';



Transformation as a set of
formulas

insert into ValuableClicksPerDMA select
dma, count(*)
from geoinfo join (
select name, ipaddr from
users join clicks on (users.name =
clicks.user)
where value 0;
) using ipaddr
group by dma;


Data type and Model
Rationale


All three Extend basic data model with extended
data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}



Different approach
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing


Hive
Data Type and Schema
CREATE TABLE visit (
user_name
user_id
user_details
);

STRING,
INT,
STRUCTage:INT, zipcode:INT

Simple type

Details

TINYINT, SMALLINT, INT, BIGINT

1, 2, 4 and 8 bytes

FLOAT, DOUBLE

4 and 8 bytes

BOOLEAN
STRING

Arbitrary-length, replaces VARCHAR

TIMESTAMP
Complex type

Details

ARRAY

Array of typed items (0-indexed)

MAP

Associative map

STRUCT

Complex class-like objects
55

Dataiku Training – Hadoop for Data Science

1/8/14

Data types and Schema
Pig
rel = LOAD '/folder/path/'
USING PigStorage(„t‟)
AS (col:type, col:type, col:type);
Simple type

Details

int, long, float,
double

32 and 64 bits, signed

chararray

A string

bytearray

An array of … bytes

boolean

A boolean

Complex type

Details

tuple

a tuple is an ordered fieldname:value map

bag

a bag is a set of tuples
56


1/8/14

Data Type and Schema
Cascading




Support for Any Java Types, provided they can be
serialized in Hadoop
No support for Typing
Simple type

Details

Int, Long, Float,
Double

32 and 64 bits, signed

String

A string

byte[]

An array of … bytes

Boolean

A boolean

Complex type
Object


Details
Object must be « Hadoop serializable »

Style Summary
Style

Typing

Data Model

Metadata
store

Pig

Procedural

Static +
Dynamic

scalar +
tuple+ bag
(fully
recursive)

No
(HCatalog)

Hive

Declarative

Static +
Dynamic,
enforced at
execution
time

scalar+ list +
map

Integrated

Cascading

Procedural

Weak

scalar+ java
objects

No




Philosophy



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment



Integration
◦ Partitioning





Headachility
Motivation


Does debugging the
tool lead to bad
headaches ?


Headaches
Pig


Out Of Memory Error (Reducer)



Exception in Building /
Extended Functions
(handling of null)



Null vs “”



Nested Foreach and scoping



Date Management (pig 0.10)



Field implicit ordering


A Pig Error


Headaches
Hive


Out of Memory Errors in
Reducers



Few Debugging Options



Null / “”



No builtin “first”


Headaches
Cascading


Weak Typing Errors (comparing
Int and String … )



Illegal Operation Sequence
(Group after group …)



Field Implicit Ordering


Testing
Motivation



How to perform unit tests ?
How to have different versions of the same script
(parameter) ?


Testing
Pig





System Variables
Comment to test
No Meta Programming
pig –x local to execute on local files


Testing / Environment
Cascading



Junit Tests are possible
Ability to use code to actually comment out some
variables


Checkpointing
Motivation





Lots of iteration while developing on Hadoop
Sometime jobs fail
Sometimes need to restart from the start …

Parse Logs

Per Page Stats

Page User Correlation

FIX and
relaunch

Filtering

Output

Pig
Manual Checkpointing


STORE Command to manually
store files

Parse Logs

Per Page Stats


// COMMENT Beginning
of script and relaunch

Filtering

Output

Cascading
Automated Checkpointing


Ability to re-run a
flow automatically
from the last saved
checkpoint

addCheckpoint(…
)


Cascading
Topological Scheduler




Check each file intermediate timestamp
Execute only if more recent

Parse Logs

Per Page Stats


Filtering


Output

Productivity Summary
Headaches
Pig

Hive

Cascading

Checkpointing/Rep
lay

Testing /
Metaprogrammation

Lots

Manual Save

Difficult Meta
programming, easy local
testing

Few, but without None (That‟s SQL)
debugging
options
Weak Typing
Complexity


Checkpointing
Partial Updates

None (That‟s SQL)

Possible



Philosophy



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning





Formats Integration
Motivation


Ability to integrate different file formats



Ability to integrate with external data sources or sink (
MongoDB, ElasticSearch, Database. …)

◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ..

Format impact on size and performance
Format

Size on Disk (GB)

HIVE Processing time (24 cores)

Text File, uncompressed

18.7

1m32s

1 Text File, Gzipped

3.89

6m23s

JSON compressed

7.89

2m42s

multiple text file gzipped

4.02

43s

Sequence File, Block, Gzip

5.32

1m18s

Text File, LZO Indexed

7.03

1m22s


(no parallelization)

Format Integration





Hive: Serde (Serialize-Deserializer)
Pig : Storage
Cascading: Tap


Partitions
Motivation




No support for “UPDATE” patterns, any increment is
performed by adding or deleting a partition
Common partition schemas on Hadoop
◦
◦
◦
◦
◦

By Date /apache_logs/dt=2013-01-23
By Data center /apache_logs/dc=redbus01/…
By Country
…
Or any combination of the above


Hive Partitioning
Partitioned tables

CREATE TABLE event (
user_id INT,
type STRING,
message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
…

INSERT OVERWRITE TABLE event PARTITION(ds='2013-01-27',
server_id=„s1‟)
SELECT * FROM event_tmp;

1/8/14

77

Cascading Partition
No Direct support for partition
 Support for “Glob” Tap, to build read from files using patterns




➔

You can code your own custom or virtual partition schemes


External Code Integration
Simple UDF
Pig

Hive

Cascadin
g


Hive Complex UDF
(Aggregators)


Cascading
Direct Code Evaluation

Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO


Integration
Summary

Partition/Increme External Code
ntal Updates
Pig

No Direct Support

Hive

Cascading


Fully integrated,
SQL Like

With Coding

Simple

Format
Integration
Doable and rich
community

Very simple, but
Doable and existing
complex dev setup
community

Complex UDFS
but regular, and
Java Expression
embeddable

Doable and
growing
commuinty

Optimization


Several Common Map Reduce Optimization Patterns
◦
◦
◦
◦
◦



Combiners
MapJoin
Job Fusion
Job Parallelism
Reducer Parallelism

Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write


Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
2012-02-14 4354

Map

…

2012-02-14 4354

2012-02-15 21we2

…

Reduc
e

2012-02-14 20

2012-02-15 21we2

2012-02-15 35
2012-02-14 qa334
…
2012-02-15 23aq2

2012-02-14 qa334
…
2012-02-15 23aq2


2012-02-16 1

Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date
Map
2012-02-14 4354

2012-02-14 8

…

2012-02-15 12

Reduc
e

2012-02-14 20

2012-02-15 21we2

2012-02-15 35
2012-02-14 qa334
…
2012-02-15 23aq2

2012-02-14 12
2012-02-15 23
2012-02-16 1

Reduced network bandwith. Better
parallelism

2012-02-16 1

Join Optimization
Map Join
Hive
set hive.auto.convert.join =
true;
Pig

Cascadin
g

( no aggregation support after HashJoin)


Number of Reducers


Critical for performance



Estimated per the size of input file

◦ Hive
 divide size per hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig
 divide size pig.exec.reducers.bytes.per.reducer (default 1GB)


Performance Optimization
Summary

Combiner
Optimization

Pig
Cascading
Hive


Join
Optimization

Number of
reducers
optimization

Automatic

Option

Estimate or DIY

DIY

HashJoin

DIY

Partial
DIY

Automatic
(Map Join)

Estimate or DIY

Clustering
Revenu
e

One Cluster
Centroid
== Center of
the cluster

c

Age

clustering applications
•

Fraud: Detect Outliers

•

CRM : Mine for customer segments

•

Image Processing : Similar Images

•

Search : Similar documents

•

Search : Allocate Topics

K-Means
Guess an initial placement for centroids

Assign each point to closest Center

Reposition Center

MAP

REDUCE

clustering challenges
•

Curse of Dimensionality

•

Choice of distance / number of parameters

•

Performance

•

Choice # of clusters

Mahout Clustering
Challenges
•

No Integrated Feature Engineering Stack:
Get ready to write data processing in Java

•

Hadoop SequenceFile required as an input

•

Iterations as Map/Reduce read and write to disks:
Relatively slow compared to in-memory
processing

Data Processing

Image

Voice

Log / DB

Data Processing

Vectorized
Data

Mahout K-Means on Text
Workflow
Text
Files
mahout
seqdirectory

Mahout Sequence Files
mahout
seq2parse

Tfidf Vectors
mahout
kmeans

Clusters

Mahout K-Means on
Database Extract Worflow
Database Dump (CSV)
org.apache.mahout.clustering.conve
rsion.InputDriver

Mahout Vectors
mahout
kmeans

Clusters

Convert a CSV File to
Mahout Vector
•

Real Code would have
•

Converting Categorical
variables to dimensions

•

Variable Rescaling

•

Dropping IDs (name,
forname …)

Mahout Algorithms
Parameters

Implicit Assumption

Ouput

K-Means

K (number of clusters)
Convergence

Circles

Point - ClusterId

Fuzzy K-Means

K (number of clusters)
Convergence

Circles

Point - ClusterId * , Probability

Expectation
Maximization

K (Number of clusterS)
Convergence

Gaussian distribution

Point - ClusterId*, Probability

Mean-Shift
Clustering

Distance boundaries,
Convergence

Gradient like distribution

Point - Cluster ID

Top Down
Clustering

Two Clustering Algorithns

Hierarchy

Point - Large ClusterId, Small
ClusterId

Dirichlet
Process

Model Distribution

Points are a mixture of
distribution

Point - ClusterId, Probability

Spectral
Clustering

-

-

Point - ClusterId

MinHash
Clustering

Number of hash / keys
Hash Type

High Dimension

Point - Hash*

Comparing Clustering
KMeans

MeanShif
t

Dirichlet

Fuzzy
KMeans

What if ?
Pour Data In

Data Comes
continously ?

Compute Something
Smart About It

Aggregation
patterns are not
“hashable”

Make Available

Human
Interaction
requires results
fast or
incrementally
available ?

After Hadoop
Random Access
In Memory
MultiCore
Machine Learning

Faster in Memory
Computation

Massive Batch
Map Reduce Over HDFS

Real-Time
Distributed
Computation
Faster SQL Analytics
Queries

HBase
• Started by Powerset (now in Bing) in 2007

• Provide a key-value store on top of Hadoop

GRAPHLAB
•

High-Perfomance, distributed computing framework, in C++

•

Started in 2009, Carneggie-Mellon

•

Main application in Machine Learning Tasks: Topic Modeling, Collaborative Filtering, Computer Vision

•

Can read data in HDFS

SPARK
• Developped in 2010 at UC Berkeley

• Provide a distributed memory abstraction for
efficient sequence of map/filter/join applications.
• Can Read/Store to HDFS or file

STORM
• Developped in 2011 by Nathan Marz at BackType
(then Twitter)
• Provide a framework for distributed real-time fault
tolerant computation
• Not a message queuing system, a complex event
processing system

IMPALA
• Started by Cloudera in 2012

• Provide real-time answers to SQL Queries on top
of HDFS

Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014

Similar to Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014 (20)

More from Dataiku

More from Dataiku (14)

Recently uploaded

Recently uploaded (20)

Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014