How do
Elephant
Make Babies
Florian Douetteau
CEO, Dataiku
Agenda
• Part #1 Big Data
• Part #2 Why Hadoop, How, and When
• Part #3 Overview of the Coding Ecosystem
  Pig / Hive / Cascading
• Part #4 Overview of the Machine Learning Ecosystem
  Mahout
• Part #5 Overview of the Extended Ecosystem
PART #1
BIG DATA

Dataiku 1/8/14
Collocation

A familiar grouping of words,
especially words that habitually
appear together and thereby
convey meaning by association.

Big

Apple

Big

Mama

Big

Data
“Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….

C
Optimized Data structures
Perfect Hashing
HP-UNIX Servers – 4GB Ram
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
Big Data in 2013







Hadoop
Java / Pig / Hive / Scala / Clojure / …
A dozen NoSQL data stores
MPP Databases
Real-Time

1 Hour
To Hadoop
1 TB
1B $

1 TB
?$
1 TB
100M $

Web Search
1999
Logistics
2004

10 TB
10M $
100 TB
?$

Banking
CRM
2008

SQL OR AD HOC

50TB
1B$
1000TB
500M $
E-Commerce
2013

Social Gaming
2011
Web
Search
2010

Online
Advertising
2012

SQL + HADOOP
Meet Hal Alowne

Hal Alowne
BI Manager
Dim's Private Showroom
European E-commerce Web site
• 100M$ Revenue
• 1 Million customer
• 1 Data Analyst (Hal Himself)

Dataiku - Data Tuesday

Dim Sum
CEO & Founder
Dim's Private Showroom

"Hey Hal! We need
a big data platform
like the big guys.
Let's just do as they do!"
Big Data
Copy Cat Project

Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientist
QUESTION #1
IS IT EASY OR NOT
?
SUBTLE
PATTERNS
"MORE
BUSINESS"
BUTTONS
QUESTION #2
WHO TO HIRE
?
DATA SCIENTIST
AT NIGHT
DATA CLEANER
THE DAY
PARADOX #3
WHERE ?
MY DATA
IS WORTH
MILLIONS
I SEND IT
TO THE
MARKETING
CLOUD
QUESTION #4
IS IT BIG OR NOT
WE ALL LIVE
IN A BIG DATA
LAKE
ALL MY DATA
PROBABLY FITS
IN HERE
QUESTION #5 (at last)
HUMAN OR NOT ?
MACHINE
LEARNING
WILL SAVE
US ALL
I JUST WANT
MORE
REPORTS
MERIT = TIME + ROI
TIME : 6 MONTHS

ROI : APPS
2014

2013

Find the right
people
(6 months?)

Choose the
technology
(6 months?)

Make it work
(6 months?)

2013

Build the lab
(6 months)
• Train People
• Reuse working patterns

Build a lab in 6 months
(rather than 18 months)

Targeted
Newsletter
Recommender
Systems

Adapted Product
/ Promotions
Deploy apps
that actually deliver value
Statistics and Machine Learning are complex!
Try to
understand
myself
(Some Book you might want to read)

CHOOSE TECHNOLOGY
NoSQL-Slavia

Machine Learning
Mystery Land

Scalability Central

Hadoop

ElasticSearch

Ceph

SOLR

Scikit-Learn
GraphLAB
prediction.io jubatus
Mahout
WEKA

Sphere

Cassandra

MongoDB
Riak
CouchBase

MLBase

LibSVM

Real-time island
SQL Columnar Republic
InfiniDB

Drill

Kafka Flume
Spark Storm

RapidMiner

Vertica

GreenPlum
Impala
Netezza

QlikView

Cascading

Tableau

Visualization County
Dataiku - Pig, Hive and Cascading

SPSS

Pandas

Pig

Kibana
SpotFire D3

R

SAS

Talend

Data Clean Wasteland

Statistician Old
House
Big Data Use Case #1
Manage Volumes






• Business Intelligence stack has scalability and maintenance issues
• Backoffice implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information

Main Pain Point:

23 hours 52 minutes to
compute Business Intelligence
aggregates for one day.

Big Data Use Case #1
Manage Volumes
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation experience

1h12 to perform the aggregate, available every morning

New home page personalization deployed in a few weeks

Hadoop Cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6 weeks project
Big Data Use Case #2
Find Patterns


Correlation
◦ between community size and
engagement / virality



Some mid-size
communities

Meaningful patterns

◦ 2 players / Family / Group



What is the minimum
number of friends to have in
the application to get
additional engagement ?

A very large community

Lots of small clusters
(mostly 2 players)
How do I (pre)process data?
Implicit User Data
(Views, Searches…)

Online User
Information
Transformation
Predictor

500TB
Transformation
Matrix

Explicit User Data

Predictor
Runtime

(Click, Buy, …)

Per User Stats

Rank Predictor

50TB
Per Content Stats

User Information
(Location, Graph…)
User Similarity

1TB
Content Data
(Title, Categories, Price, …)

200GB

Content Similarity

A/B Test Data

Always the same
Pour Data In

Compute Something
Smart About It

Make Available
The Questions
Pour Data In

How often ?
What kind of
interaction?
How much ?

Compute Something
Smart About It

How complex ?
Do you need all
data at once ?
How incremental
?

Make Available

Interaction ?
Random Access ?
PART #2
AT THE BEGINNING

WAS THE
ELEPHANT
The Text Use Case
Pour Data In

Large Volume
1TB
Textual Like Data
(Logs, Docs,….)

Compute Something
Smart About It

Massive Global
Transformation
Then Aggregation
(Counting, Invert
Index, ….)

Make Available

Every Day
What's Difficult
(back in 2000)
• Large data won't fit in one server
• Large computations (a few hours) are bound to fail at one time or another
• Data is so big that my memory is too small to perform full aggregations
• Parallelization with threading is error-prone
• Data is so big that my Ethernet cable is not that big
What's Difficult
(back in 2000)
• Large data won't fit in one server

HDFS
• Large computations (a few hours) are bound to fail at one time or another
• Data is so big that my memory is too small to perform full aggregations
• Parallelization with threading is error-prone
• Data is so big that my Ethernet cable is not that big

MAP REDUCE

JOB TRACKER
MapReduce
How to count words in many, many boxes
MapReduce
PREREQUISITES

GROUPS CAN
BE DETERMINED
AT THE ROW
LEVEL

AGGREGATION
OPERATION IS
IDEMPOTENT
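The word-counting idea above can be sketched in a few lines of plain Python. This is only an illustration of the three phases (map, shuffle, reduce) running in one process, not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this sketch. Each "box" stands for one input split, and the group (the word) is decided row by row, as the prerequisite slide asks.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the partial counts received for one word."""
    return key, sum(values)


def word_count(boxes):
    """Run the three phases over several 'boxes' (input splits)."""
    pairs = [pair for box in boxes for line in box for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count([["the elephant", "the pig"], ["the hive"]])` counts "the" three times even though its occurrences sit in different boxes, because the shuffle brings every pair with the same key to the same reducer.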
Questions ?
PART #3
CODING
HADOOP
Pig History



Yahoo Research in 2006
Inspired by Sawzall, a Google paper from 2003
2007 as an Apache project

Initial motivation

◦ Search log analytics: how long is the average user
session? How many links does a user click on before
leaving a website? How do click patterns vary in the
course of a day/week/month? …

words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t')
    AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
Hive History


Developed by Facebook in January 2007



Open source in August 2008



Initial Motivation

◦ Provide a SQL like abstraction to perform statistics on
status updates

create external table wordcounts (
    word string,
    count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';
select * from wordcounts order by count desc limit 10;
select SUM(count) from wordcounts where word like 'th%';
Cascading History


Authored by Chris Wensel 2008



Associated Projects

◦ Cascalog : Cascading in Clojure
◦ Scalding : Cascading in Scala (Twitter in 2012)
◦ Lingual ( to be released soon): SQL layer on top
of cascading

Pig & Hive

Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price,
    MAX(timestamp) as max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;

Job 1: Mapper
LOAD
FILTER

Job 1: Reducer
Shuffle and sort by user
GROUP
FOREACH
FILTER

* VAT excluded
Pig & Hive

Mapping to MapReduce jobs
events = LOAD '/events' USING PigStorage('\t') AS
    (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user = GROUP events_filtered BY user;
price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price,
    MAX(timestamp) as max_ts;
high_pbu = FILTER price_by_user BY total_price > 1000;
recent_high = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1: Mapper
LOAD
FILTER

Job 1: Reducer
Shuffle and sort by user
GROUP
FOREACH
FILTER

Job 2: Mapper
LOAD (from tmp)

Job 2: Reducer
Shuffle and sort by max_ts
STORE
Pig
How does it work
Data Execution Plan compiled into 10
map reduce jobs executed in parallel
(or not)

Hive Joins

How to join with MapReduce ?

[Diagram: two input tables, one with columns (uid, name) holding rows such as Dupont and Durand, one with columns (uid, type) holding rows such as Type1 and Type2. The mappers output every row keyed by uid and tagged with a table index (tbl_idx). Rows are shuffled by uid and sorted by (uid, tbl_idx); Reducer 1 and Reducer 2 then each join the name rows and type rows of the uids they receive.]
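The join described on this slide can be sketched in plain Python. This is a single-process illustration of the idea, not Hive's actual implementation; the function names and the constants `NAMES` and `TYPES` (standing in for the tbl_idx values) are assumptions made for the example.

```python
from collections import defaultdict

NAMES, TYPES = 0, 1  # hypothetical tbl_idx values: 0 = (uid, name), 1 = (uid, type)


def map_rows(names, types):
    """Mappers: emit every row keyed by uid and tagged with its table index."""
    pairs = [(uid, (NAMES, name)) for uid, name in names]
    pairs += [(uid, (TYPES, typ)) for uid, typ in types]
    return pairs


def shuffle_and_sort(pairs):
    """Shuffle by uid, then sort each group by tbl_idx (name rows come first)."""
    groups = defaultdict(list)
    for uid, tagged in pairs:
        groups[uid].append(tagged)
    for tagged_rows in groups.values():
        tagged_rows.sort(key=lambda tagged: tagged[0])  # stable: keeps input order per table
    return groups


def reduce_join(uid, tagged_rows):
    """Reducer: buffer the name rows, then pair each type row with them."""
    names, joined = [], []
    for tbl_idx, value in tagged_rows:
        if tbl_idx == NAMES:
            names.append(value)
        else:
            joined.extend((uid, name, value) for name in names)
    return joined


def join(names, types):
    """Inner join of (uid, name) with (uid, type) on uid."""
    groups = shuffle_and_sort(map_rows(names, types))
    return [row for uid in sorted(groups) for row in reduce_join(uid, groups[uid])]
```

Sorting by (uid, tbl_idx) is what lets the reducer keep only the first table's rows in memory: by the time a type row arrives, every name row for that uid has already been seen.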
WHAT IS
THE BEST
TOOL ?
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization

Procedural Vs Declarative

Transformation as a
sequence of operations

Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Transformation as a set of
formulas

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using ipaddr
group by dma;
Data type and Model
Rationale


All three extend the basic data model with extended
data types
◦ array-like [ event1, event2, event3]
◦ map-like { type1:value1, type2:value2, …}



Different approach
◦ Resilient Schema
◦ Static Typing
◦ No Static Typing

Hive
Data Type and Schema
CREATE TABLE visit (
    user_name    STRING,
    user_id      INT,
    user_details STRUCT<age:INT, zipcode:INT>
);

Simple type | Details
TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
FLOAT, DOUBLE | 4 and 8 bytes
BOOLEAN |
STRING | Arbitrary-length, replaces VARCHAR
TIMESTAMP |

Complex type | Details
ARRAY | Array of typed items (0-indexed)
MAP | Associative map
STRUCT | Complex class-like objects
Data types and Schema
Pig
rel = LOAD '/folder/path/'
    USING PigStorage('\t')
    AS (col:type, col:type, col:type);

Simple type | Details
int, long, float, double | 32 and 64 bits, signed
chararray | A string
bytearray | An array of … bytes
boolean | A boolean

Complex type | Details
tuple | a tuple is an ordered fieldname:value map
bag | a bag is a set of tuples
Data Type and Schema
Cascading




Support for any Java types, provided they can be
serialized in Hadoop
No support for typing

Simple type | Details
Int, Long, Float, Double | 32 and 64 bits, signed
String | A string
byte[] | An array of … bytes
Boolean | A boolean

Complex type | Details
Object | Object must be « Hadoop serializable »
Style Summary

Tool | Style | Typing | Data Model | Metadata store
Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
Cascading | Procedural | Weak | scalar + java objects | No
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment



Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration



Performance and optimization

Headachability
Motivation


Does debugging the
tool lead to bad
headaches ?

Headaches
Pig


Out Of Memory Error (Reducer)



Exceptions in built-in /
extended functions
(handling of null)



Null vs “”



Nested Foreach and scoping



Date Management (pig 0.10)



Field implicit ordering

A Pig Error

Headaches
Hive


Out of Memory Errors in
Reducers



Few Debugging Options



Null / “”



No builtin “first”

Headaches
Cascading


Weak Typing Errors (comparing
Int and String … )



Illegal Operation Sequence
(Group after group …)



Field Implicit Ordering

Testing
Motivation



How to perform unit tests ?
How to have different versions of the same script
(parameter) ?

Testing
Pig





System Variables
Comment to test
No Meta Programming
pig -x local to execute on local files
Testing / Environment
Cascading



Junit Tests are possible
Ability to use code to actually comment out some
variables

Checkpointing
Motivation





Lots of iteration while developing on Hadoop
Sometimes jobs fail
Sometimes need to restart from the start …

Parse Logs

Per Page Stats

Page User Correlation

FIX and
relaunch

Filtering

Output
Pig
Manual Checkpointing


STORE Command to manually
store files

Parse Logs

Per Page Stats

Page User Correlation

// COMMENT Beginning
of script and relaunch

Filtering

Output
Cascading
Automated Checkpointing


Ability to re-run a
flow automatically
from the last saved
checkpoint

addCheckpoint(…)
Cascading
Topological Scheduler




Check each intermediate file's timestamp
Execute a step only if its inputs are more recent than its output

Parse Logs

Per Page Stats

Page User Correlation

Filtering


Output
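The timestamp rule on this slide is the same one `make` uses, and it can be sketched in plain Python. This is not Cascading's API: `needs_run` and `run_flow` are hypothetical helpers written for this illustration, and the steps are assumed to be listed already in dependency (topological) order.

```python
import os


def needs_run(inputs, output):
    """True if the output is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)


def run_flow(steps):
    """Run steps in order, skipping those whose output is still fresh.

    `steps` is a list of (inputs, output, action) tuples, sorted so that
    every step comes after the steps that produce its inputs.
    Returns the outputs that were actually rebuilt.
    """
    rebuilt = []
    for inputs, output, action in steps:
        if needs_run(inputs, output):
            action(inputs, output)
            rebuilt.append(output)
    return rebuilt
```

With a chain such as Parse Logs, Per Page Stats, Output, fixing only the last step and relaunching re-executes that step alone, since the earlier intermediate files are still newer than their inputs.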
Productivity Summary

Tool | Headaches | Checkpointing/Replay | Testing / Metaprogramming
Pig | Lots | Manual Save | Difficult metaprogramming, easy local testing
Hive | Few, but without debugging options | None (That's SQL) | None (That's SQL)
Cascading | Weak Typing Complexity | Checkpointing, Partial Updates | Possible
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration



Performance and optimization

Formats Integration
Motivation


Ability to integrate different file formats
◦ Text Delimited
◦ Sequence File (Binary Hadoop format)
◦ Avro, Thrift ..

Ability to integrate with external data sources or sinks
(MongoDB, ElasticSearch, Database, …)
Format impact on size and performance

Format | Size on Disk (GB) | Hive processing time (24 cores)
Text File, uncompressed | 18.7 | 1m32s
1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
JSON compressed | 7.89 | 2m42s
multiple text file gzipped | 4.02 | 43s
Sequence File, Block, Gzip | 5.32 | 1m18s
Text File, LZO Indexed | 7.03 | 1m22s
Format Integration





Hive: SerDe (Serializer-Deserializer)
Pig: Storage
Cascading: Tap
Partitions
Motivation




No support for "UPDATE" patterns, any increment is
performed by adding or deleting a partition
Common partition schemas on Hadoop
◦ By Date /apache_logs/dt=2013-01-23
◦ By Data center /apache_logs/dc=redbus01/…
◦ By Country
◦ …
◦ Or any combination of the above
Hive Partitioning
Partitioned tables

CREATE TABLE event (
    user_id INT,
    type STRING,
    message STRING)
PARTITIONED BY (day STRING, server_id STRING);
Disk structure
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
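The disk layout above is simply "one directory level per partition column", which is what makes partition pruning cheap: a query on one day only has to list the matching directories. A minimal Python sketch of that mapping follows; `partition_path` and `prune` are hypothetical helpers written for this illustration, not part of Hive.

```python
def partition_path(table_root, partition):
    """Build the directory of one partition from (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition]
    return "/".join([table_root.rstrip("/")] + parts)


def prune(paths, **wanted):
    """Keep only the files whose partition directories match `wanted`."""
    selected = []
    for path in paths:
        # Every "column=value" path segment describes one partition column.
        segments = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(segments.get(column) == value for column, value in wanted.items()):
            selected.append(path)
    return selected
```

For example, `partition_path("/hive/event", [("day", "2013-01-27"), ("server_id", "s1")])` gives the first directory of the listing, and `prune(files, day="2013-01-28")` keeps only the files of that day without opening any of them.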
Cascading Partition
No direct support for partitions
Support for "Glob" Tap, to read from files using patterns

➔ You can code your own custom or virtual partition schemes
External Code Integration
Simple UDF
Pig

Hive

Cascading
Hive Complex UDF
(Aggregators)

Cascading
Direct Code Evaluation

Uses Janino, a very cool project:
http://docs.codehaus.org/display/JANINO

Integration
Summary

Tool | Partition/Incremental Updates | External Code | Format Integration
Pig | No Direct Support | Simple | Doable and rich community
Hive | Fully integrated, SQL Like | Very simple, but complex dev setup | Doable and existing community
Cascading | With Coding | Complex UDFs but regular, and Java Expression embeddable | Doable and growing community
Comparing without Comparable


Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema



Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment



Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration



Performance and optimization

Optimization


Several Common Map Reduce Optimization Patterns
◦ Combiners
◦ MapJoin
◦ Job Fusion
◦ Job Parallelism
◦ Reducer Parallelism

Different support per framework
◦ Fully Automatic
◦ Pragma / Directives / Options
◦ Coding style / Code to write
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date

[Diagram, before the combiner is applied: the Map tasks forward every input row (2012-02-14 4354, 2012-02-15 21we2, …, 2012-02-14 qa334, 2012-02-15 23aq2, …) to the Reduce task, which outputs 2012-02-14 20, 2012-02-15 35 and 2012-02-16 1.]
Combiner
Perform Partial Aggregate at Mapper Stage
SELECT date, COUNT(*) FROM product GROUP BY date

[Diagram, with the combiner: one Map task emits the partial counts 2012-02-14 8 and 2012-02-15 12, the other emits 2012-02-14 12, 2012-02-15 23 and 2012-02-16 1; the Reduce task sums them into 2012-02-14 20, 2012-02-15 35 and 2012-02-16 1.]

Reduced network bandwidth. Better
parallelism
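The effect of a combiner on the `GROUP BY date` query can be sketched in plain Python. This is a single-process illustration, not Hadoop code; `map_with_combiner` and `reduce_counts` are names invented for the example, and each input split is assumed to be a list of (date, product id) rows.

```python
from collections import Counter


def map_with_combiner(rows):
    """Mapper + combiner: pre-aggregate the count per date inside one split."""
    return Counter(date for date, _product_id in rows)


def reduce_counts(partials):
    """Reducer: add up the partial counts coming from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return dict(total)
```

Each mapper now ships at most one record per distinct date instead of one record per input row, which is where the bandwidth saving comes from. This only works because partial counts can be summed again in the reducer without changing the result.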
Join Optimization
Map Join
Hive
set hive.auto.convert.join = true;
Pig

Cascading

(no aggregation support after HashJoin)
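A map join avoids the shuffle entirely: when one side is small enough to fit in memory, every mapper loads it into a hash table and joins its share of the big side locally. A minimal Python sketch of that idea follows; `map_join` is a hypothetical helper, and the small table is assumed to have unique keys.

```python
def map_join(big_rows, small_table):
    """Inner-join (key, value) rows of the big input against an in-memory dict.

    Assumes keys in `small_table` are unique; with duplicates the last one wins.
    """
    lookup = dict(small_table)  # small side, loaded once per mapper
    return [(key, value, lookup[key]) for key, value in big_rows if key in lookup]
```

No reducer is involved, so the join costs a single pass over the big input. The trade-off is memory: the small side must fit in each mapper.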
Number of Reducers


Critical for performance



Estimated from the size of the input files

◦ Hive
  divide size by hive.exec.reducers.bytes.per.reducer (default 1GB)
◦ Pig
  divide size by pig.exec.reducers.bytes.per.reducer (default 1GB)
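The estimate above is a ceiling division of the input size by the bytes-per-reducer setting. A small Python sketch, under two assumptions that are not from the slides: "1GB" is taken as 10**9 bytes, and the optional `max_reducers` cap is a hypothetical parameter added for illustration.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=None):
    """Number of reducers for a job: ceil(input size / bytes per reducer), at least 1."""
    count = max(1, math.ceil(input_bytes / bytes_per_reducer))
    return count if max_reducers is None else min(count, max_reducers)
```

With these assumptions a 12 TB input yields 12000 reducers, which shows why the setting often has to be overridden by hand (the "DIY" option) on large inputs or on skewed data.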
Performance & Optimization
Summary

Tool | Combiner Optimization | Join Optimization | Number of reducers optimization
Pig | Automatic | Option | Estimate or DIY
Cascading | DIY | HashJoin | DIY
Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
Questions ?
PART #4
QUICK
MAHOUT
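Clustering
[Scatter plot: points plotted by Age (x-axis) and Revenue (y-axis)]
Clustering
[Same scatter plot with one cluster highlighted and its centroid c marked.
Centroid == center of the cluster]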
clustering applications
•

Fraud: Detect Outliers

•

CRM : Mine for customer segments

•

Image Processing : Similar Images

•

Search : Similar documents

•

Search : Allocate Topics
K-Means
Guess an initial placement for centroids

Assign each point to closest Center

Reposition Center

MAP

REDUCE
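The three steps above map naturally onto one MapReduce job per iteration: the assignment is the MAP, the repositioning is the REDUCE. A minimal single-process Python sketch follows; it is not Mahout code, the function names are invented for the example, and squared Euclidean distance is an assumption (the slides leave the distance open).

```python
def closest(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_step(points, centroids):
    """One iteration: MAP assigns points, REDUCE repositions each center."""
    # MAP: emit (cluster index, point) for every point.
    assigned = [(closest(point, centroids), point) for point in points]
    # REDUCE: average the points of each cluster to get its new center.
    new_centroids = []
    for index, old_center in enumerate(centroids):
        members = [point for idx, point in assigned if idx == index]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old_center)  # empty cluster: keep the previous center
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat the map/reduce step a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

On Hadoop every call to `kmeans_step` would be a full job that reads the points from disk and writes the centroids back, which is the cost the next slides point out.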
clustering challenges
•

Curse of Dimensionality

•

Choice of distance / number of parameters

•

Performance

•

Choice # of clusters
Mahout Clustering
Challenges
•

No Integrated Feature Engineering Stack:
Get ready to write data processing in Java

•

Hadoop SequenceFile required as an input

•

Iterations as Map/Reduce read and write to disks:
Relatively slow compared to in-memory
processing
Data Processing

Image

Voice

Log / DB

Data Processing

Vectorized
Data
Mahout K-Means on Text
Workflow
Text
Files
mahout
seqdirectory

Mahout Sequence Files
mahout
seq2sparse

Tfidf Vectors
mahout
kmeans

Clusters
Mahout K-Means on
Database Extract Workflow
Database Dump (CSV)
org.apache.mahout.clustering.conversion.InputDriver

Mahout Vectors
mahout
kmeans

Clusters
Convert a CSV File to
Mahout Vector
• Real code would have
  • Converting categorical variables to dimensions
  • Variable rescaling
  • Dropping IDs (name, first name …)
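The three preparation steps listed above can be sketched in plain Python, independently of Mahout (which would need Java and its own vector classes). Everything here is an assumption made for illustration: the `vectorize` helper, min-max rescaling to [0, 1] as the rescaling method, and one dimension per category value (one-hot encoding).

```python
import csv
import io


def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: drop IDs, rescale, one-hot encode."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numeric = [c for c in rows[0] if c not in id_columns and c not in categorical_columns]
    # One dimension per distinct value of each categorical column.
    categories = {c: sorted({row[c] for row in rows}) for c in categorical_columns}
    # Min and max of each numeric column, used for rescaling to [0, 1].
    bounds = {
        c: (min(float(row[c]) for row in rows), max(float(row[c]) for row in rows))
        for c in numeric
    }
    vectors = []
    for row in rows:
        vector = []
        for c in numeric:
            low, high = bounds[c]
            vector.append((float(row[c]) - low) / (high - low) if high > low else 0.0)
        for c in categorical_columns:
            vector.extend(1.0 if row[c] == value else 0.0 for value in categories[c])
        vectors.append(vector)
    return vectors
```

Without the rescaling step a column such as revenue would dominate the distance computation over a column such as age, simply because its values are larger.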
Mahout Algorithms

Algorithm | Parameters | Implicit Assumption | Output
K-Means | K (number of clusters), Convergence | Circles | Point -> ClusterId
Fuzzy K-Means | K (number of clusters), Convergence | Circles | Point -> ClusterId*, Probability
Expectation Maximization | K (number of clusters), Convergence | Gaussian distribution | Point -> ClusterId*, Probability
Mean-Shift Clustering | Distance boundaries, Convergence | Gradient like distribution | Point -> ClusterId
Top Down Clustering | Two Clustering Algorithms | Hierarchy | Point -> Large ClusterId, Small ClusterId
Dirichlet Process | Model Distribution | Points are a mixture of distributions | Point -> ClusterId, Probability
Spectral Clustering | - | - | Point -> ClusterId
MinHash Clustering | Number of hash / keys, Hash Type | High Dimension | Point -> Hash*
Comparing Clustering
KMeans

MeanShift

Dirichlet

Fuzzy
KMeans
Questions ?
PART #5

ELEPHANT MAKE BABIES
What if ?
Pour Data In

Data Comes
continuously ?

Compute Something
Smart About It

Aggregation
patterns are not
“hashable”

Make Available

Human
Interaction
requires results
fast or
incrementally
available ?
After Hadoop
Random Access
In Memory
MultiCore
Machine Learning

Faster in Memory
Computation

Massive Batch
Map Reduce Over HDFS

Real-Time
Distributed
Computation
Faster SQL Analytics
Queries
HBase
• Started by Powerset (now in Bing) in 2007

• Provide a key-value store on top of Hadoop
HBASE
GRAPHLAB
•

High-Performance, distributed computing framework, in C++

•

Started in 2009, Carnegie Mellon

•

Main application in Machine Learning Tasks: Topic Modeling, Collaborative Filtering, Computer Vision

•

Can read data in HDFS
SPARK
• Developed in 2010 at UC Berkeley

• Provide a distributed memory abstraction for
efficient sequence of map/filter/join applications.
• Can Read/Store to HDFS or file
SPARK
STORM
• Developed in 2011 by Nathan Marz at BackType
(then Twitter)
• Provide a framework for distributed real-time fault
tolerant computation
• Not a message queuing system, a complex event
processing system
STORM
STORM WITH
HADOOP
IMPALA
• Started by Cloudera in 2012

• Provide real-time answers to SQL Queries on top
of HDFS
BENCHMARK
Questions ?

More Related Content

What's hot

How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchKlaas Bosteels
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
PASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLPASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLJen Stirrup
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...DataWorks Summit
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
Machine learning in real-time - the next frontier
Machine learning in real-time - the next frontierMachine learning in real-time - the next frontier
Machine learning in real-time - the next frontierSnowplow Analytics
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitzRaghu Kashyap
 
Your Data Nerd Friends Need You!
Your Data Nerd Friends Need You!Your Data Nerd Friends Need You!
Your Data Nerd Friends Need You! DataKitchen
 
What makes an effective data team?
What makes an effective data team?What makes an effective data team?
What makes an effective data team?Snowplow Analytics
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
 

What's hot (20)

How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine Learning
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013
 
Back to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from ScratchBack to Square One: Building a Data Science Team from Scratch
Back to Square One: Building a Data Science Team from Scratch
 
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16thDataiku, Pitch Data Innovation Night, Boston, Septembre 16th
Dataiku, Pitch Data Innovation Night, Boston, Septembre 16th
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
Dataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin BuzzwordsDataiku Flow and dctc - Berlin Buzzwords
Dataiku Flow and dctc - Berlin Buzzwords
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
PASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureMLPASS Summit Data Storytelling with R Power BI and AzureML
PASS Summit Data Storytelling with R Power BI and AzureML
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-banking
 
Machine learning in real-time - the next frontier
Machine learning in real-time - the next frontierMachine learning in real-time - the next frontier
Machine learning in real-time - the next frontier
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
 
Your Data Nerd Friends Need You!
Your Data Nerd Friends Need You!Your Data Nerd Friends Need You!
Your Data Nerd Friends Need You!
 
What makes an effective data team?
What makes an effective data team?What makes an effective data team?
What makes an effective data team?
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
BigData Analytics
BigData AnalyticsBigData Analytics
BigData Analytics
 

Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014

  • 1. How do Elephant Make Babies Florian Douetteau CEO, Dataiku
  • 2. Agenda • Part #1 Big Data • Part #2 Why Hadoop, How, and When • Part #3 Overview of the Coding Ecosystem Pig / Hive / Cascading • Part #4 Overview of the Machine Learning Ecosystem Mahout • Part #5 Overview of the Extended Ecosystem
  • 4. Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Big Apple / Big Mama / Big Data
  • 5. “Big” Data in 1999: struct Element { Key key; void* stat_data; } …. C; optimized data structures; perfect hashing; HP-UNIX servers with 4 GB RAM; 100 GB data; web crawler with socket reuse; HTTP 0.9. 1 Month
  • 6. Big Data in 2013: Hadoop; Java / Pig / Hive / Scala / Clojure / …; a dozen NoSQL data stores; MPP databases; real-time. 1 Hour
  • 7. To Hadoop [chart of use cases by data volume and revenue, moving from "SQL OR AD HOC" to "SQL + HADOOP"]. Use cases: Web Search 1999; Logistics 2004; Banking CRM 2008; Web Search 2010; Social Gaming 2011; Online Advertising 2012; E-Commerce 2013. Values shown: 1 TB, 1B $; 1 TB, ?$; 1 TB, 100M $; 10 TB, 10M $; 100 TB, ?$; 50 TB, 1B $; 1000 TB, 500M $
  • 8. Meet Hal Alowne. Hal Alowne, BI Manager, Dim's Private Showroom (European e-commerce web site): 100M$ revenue; 1 million customers; 1 data analyst (Hal himself). Dim Sum, CEO & Founder, Dim's Private Showroom: “Hey Hal! We need a big data platform like the big guys. Let's just do as they do!” Big Data Copy Cat Project. Big Guys: 10B$+ revenue; 100M+ customers; 100+ data scientists
  • 9. QUESTION #1 IS IT EASY OR NOT ?
  • 17. I SEND IT TO THE MARKETING CLOUD
  • 18. QUESTION #4 IS IT BIG OR NOT
  • 19. WE ALL LIVE IN A BIG DATA LAKE
  • 20. ALL MY DATA PROBABLY FITS IN HERE
  • 21. QUESTION #5 (at last) HUMAN OR NOT ?
  • 24. MERIT = TIME + ROI. TIME: 6 MONTHS. ROI: APPS. 2013 to 2014: find the right people (6 months?), choose the technology (6 months?), make it work (6 months?). 2013: build the lab (6 months), train people, reuse working patterns. Build a lab in 6 months (rather than 18 months). Deploy apps that actually deliver value: targeted newsletter; recommender systems; adapted product / promotions
  • 25. Statistics and Machine Learning is complex ! Try to understand myself
  • 26. (Some books you might want to read)
  • 27. CHOOSE TECHNOLOGY [map of the ecosystem]. Regions: NoSQL-Slavia; Machine Learning Mystery Land; Scalability Central; Real-time Island; SQL Columnar Republic; Visualization County; Data Clean Wasteland; Statistician Old House. Technologies shown: Hadoop, ElasticSearch, Ceph, SOLR, Scikit-Learn, GraphLab, prediction.io, jubatus, Mahout, WEKA, Sphere, Cassandra, MongoDB, Riak, CouchBase, MLBase, LibSVM, InfiniDB, Drill, Kafka, Flume, Spark, Storm, RapidMiner, Vertica, GreenPlum, Impala, Netezza, QlikView, Cascading, Tableau, SPSS, Pandas, Pig, Kibana, SpotFire, D3, R, SAS, Talend
  • 28. Big Data Use Case #1: Manage Volumes. Business Intelligence stack has scalability and maintenance issues. Backoffice implements business rules that are challenged. Existing infrastructure cannot cope with per-user information. Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
  • 29. Big Data Use Case #1: Manage Volumes. Goals: relieve their current DWH and accelerate production of some aggregates/KPIs; be the backbone for new personalized user experience on their website (more recommendations, more profiling, etc.); train existing people around machine learning and segmentation experience. Results: 1h12 to perform the aggregate, available every morning; new home page personalization deployed in a few weeks. Setup: Hadoop cluster (24 cores); Google Compute Engine; Python + R + Vertica; 12 TB dataset; 6-week project
  • 30. Big Data Use Case #2: Find Patterns. Correlation between community size and engagement / virality. Meaningful patterns: 2 players / family / group. What is the minimum number of friends to have in the application to get additional engagement? [Graph labels: a very large community; some mid-size communities; lots of small clusters (mostly 2 players)]
  • 31. How do I (pre)process data? [Data-flow diagram]. Sources: Implicit User Data (Views, Searches…); Explicit User Data (Click, Buy, …); User Information (Location, Graph…); Content Data (Title, Categories, Price, …); A/B Test Data. Volumes shown: 500TB, 50TB, 1TB, 200GB. Other nodes: Online User Information, Transformation Predictor, Transformation Matrix, Predictor Runtime, Per User Stats, Rank Predictor, Per Content Stats, User Similarity, Content Similarity
  • 32. Always the same: Pour Data In / Compute Something Smart About It / Make Available
  • 33. The Questions. Pour Data In: How often? What kind of interaction? How much? Compute Something Smart About It: How complex? Do you need all data at once? How incremental? Make Available: Interaction? Random access?
  • 34. PART #2 AT THE BEGINNING WAS THE ELEPHANT
  • 35. The Text Use Case. Pour Data In: large volume, 1TB of text-like data (logs, docs, …). Compute Something Smart About It: massive global transformation, then aggregation (counting, inverted index, …). Make Available: every day
  • 36. What's Difficult (back in 2000) • Large data won't fit in one server • Large computations (a few hours) are bound to fail at one time or another • Data is so big that my memory is too small to perform full aggregations • Parallelization with threading is error-prone • Data is so big that my Ethernet cable is not that big
  • 37. What's Difficult (back in 2000) • Large data won't fit in one server: HDFS • Large computations (a few hours) are bound to fail at one time or another • Data is so big that my memory is too small to perform full aggregations • Parallelization with threading is error-prone • Data is so big that my Ethernet cable is not that big. MAP REDUCE / JOB TRACKER
  • 38.
  • 39.
  • 40. MapReduce: How to count words in many many boxes
  • 41. MapReduce PREREQUISITES: GROUPS CAN BE DETERMINED AT THE ROW LEVEL; AGGREGATION OPERATION IS IDEMPOTENT
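The two MapReduce slides above can be illustrated with a word count, sketched here in plain Python rather than on Hadoop. Everything in the sketch is an assumption made for illustration (the function names `map_phase`, `shuffle`, `reduce_phase`, `word_count` and the sample `boxes`); none of it comes from the slides. The group key (the word) is derived from each row alone, and the aggregation is a sum, whose partial results can be merged in any order.

```python
from collections import defaultdict


def map_phase(line):
    # Emit one (word, 1) pair per word; the key depends only on this row.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Summing is order-independent, so values may arrive from any box.
    return key, sum(values)


def word_count(boxes):
    # Each "box" is a list of text lines held by one machine.
    pairs = (pair for box in boxes for line in box for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data spread over two boxes.
boxes = [["big data big apple"], ["big mama", "data lake"]]
counts = word_count(boxes)
print(counts)
```

In a real cluster the shuffle moves data over the network; here it is a dictionary, which keeps the three phases visible without any infrastructure.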
  • 44. Pig History: Yahoo Research in 2006; inspired from Sawzall, a Google paper from 2003; 2007 as an Apache project. Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
      words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
      sorted_words = ORDER words BY count DESC;
      first_words = LIMIT sorted_words 10;
      DUMP first_words;
  • 45. Hive History: developed by Facebook in January 2007; open source in August 2008. Initial motivation: provide a SQL-like abstraction to perform statistics on status updates.
      create external table wordcounts (word string, count int) row format delimited fields terminated by '\t' location '/training/hadoop-wordcount/output';
      select * from wordcounts order by count desc limit 10;
      select SUM(count) from wordcounts where word like 'th%';
  • 46. Cascading History: authored by Chris Wensel, 2008. Associated projects: Cascalog (Cascading in Clojure); Scalding (Cascading in Scala, Twitter in 2012); Lingual (to be released soon), a SQL layer on top of Cascading
  • 47. Pig Hive Mapping to Mapreduce jobs
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) as max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      Job 1, Mapper: LOAD, FILTER
      Job 1, Reducer: shuffle and sort by user, GROUP, FOREACH, FILTER
      * VAT excluded
  • 48. Pig Hive Mapping to Mapreduce jobs
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      events_filtered = FILTER events BY type;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) as max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      recent_high = ORDER high_pbu BY max_ts DESC;
      STORE recent_high INTO '/output';
      Job 1, Mapper: LOAD, FILTER
      Job 1, Reducer: shuffle and sort by user, GROUP, FOREACH, FILTER
      Job 2, Mapper: LOAD (from tmp)
      Job 2, Reducer: shuffle and sort by max_ts, STORE
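As a rough illustration of why the script above needs two jobs, the following Python sketch mimics the plan: one pass that groups by user and one that sorts by `max_ts`. It is a sketch under assumptions, not Pig's actual planner. The sample `events` rows, the function names `job1` and `job2`, and the filter condition (`type == "buy"`) are all hypothetical, since the slide does not show the filter predicate.

```python
from collections import defaultdict

# Hypothetical sample rows: (type, user, price, timestamp).
events = [
    ("buy", "alice", 800, 10),
    ("buy", "alice", 400, 30),
    ("view", "alice", 0, 40),
    ("buy", "bob", 1500, 20),
    ("buy", "carol", 200, 50),
]


def job1(rows, wanted_type="buy", threshold=1000):
    # Map side: LOAD + FILTER, emitting (user, (price, timestamp)).
    mapped = [(user, (price, ts))
              for (etype, user, price, ts) in rows if etype == wanted_type]
    # Shuffle: bring all values for one user together.
    groups = defaultdict(list)
    for user, value in mapped:
        groups[user].append(value)
    # Reduce side: GROUP + FOREACH (sum, max) + FILTER on the total.
    out = []
    for user, values in groups.items():
        total_price = sum(price for price, _ in values)
        max_ts = max(ts for _, ts in values)
        if total_price > threshold:
            out.append((user, total_price, max_ts))
    return out


def job2(rows):
    # A second shuffle is needed because the sort key (max_ts) differs
    # from the grouping key (user) used in the first job.
    return sorted(rows, key=lambda row: row[2], reverse=True)


recent_high = job2(job1(events))
print(recent_high)
```

The point of the split is that each MapReduce job has exactly one shuffle key, so grouping by user and ordering by timestamp cannot share a job.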
  • 49. Pig: How does it work? Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
  • 50. Hive Joins: How to join with MapReduce? [Diagram: the mappers output rows from a (uid, name) table and a (uid, type) table, each tagged with a table index (tbl_idx); rows are shuffled by uid and sorted by (uid, tbl_idx); Reducer 1 and Reducer 2 each receive the name rows (Dupont, Durand) and type rows (Type1, Type2) for their uids.]
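A reduce-side join along the lines of this diagram can be sketched in Python as follows. The table contents are hypothetical (they only echo the names on the slide), and the single in-process sort stands in for the real shuffle across reducers.

```python
from collections import defaultdict

# Hypothetical input tables
names = [(1, "Dupont"), (2, "Durand")]               # (uid, name)
types = [(1, "Type1"), (2, "Type1"), (2, "Type2")]   # (uid, type)

def map_phase():
    # Tag every row with its table index: 1 for names, 2 for types
    for uid, name in names:
        yield (uid, 1), name
    for uid, type_ in types:
        yield (uid, 2), type_

def reduce_side_join():
    # Shuffle by uid, sort by (uid, tbl_idx): for each uid the name rows
    # arrive before the type rows
    partitions = defaultdict(list)
    for (uid, tbl_idx), value in sorted(map_phase()):
        partitions[uid].append((tbl_idx, value))
    joined = []
    for uid, rows in partitions.items():
        seen_names = []
        for tbl_idx, value in rows:
            if tbl_idx == 1:
                seen_names.append(value)   # buffer the (small) name side
            else:
                joined.extend((uid, name, value) for name in seen_names)
    return joined
```

Sorting on the table index is what lets the reducer buffer only one side of the join and stream the other.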
  • 52. Comparing without Comparable
    ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
    ◦ Productivity: Headachability; Checkpointing; Testing and environment
    ◦ Integration: Partitioning; Formats Integration; External Code Integration
    ◦ Performance and optimization
  • 53. Procedural vs Declarative
    ◦ Transformation as a sequence of operations
      Users = load 'users' as (name, age, ipaddr);
      Clicks = load 'clicks' as (user, url, value);
      ValuableClicks = filter Clicks by value > 0;
      UserClicks = join Users by name, ValuableClicks by user;
      Geoinfo = load 'geoinfo' as (ipaddr, dma);
      UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
      ByDMA = group UserGeo by dma;
      ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
      store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
    ◦ Transformation as a set of formulas
      insert into ValuableClicksPerDMA
      select dma, count(*)
      from geoinfo join (
        select name, ipaddr
        from users join clicks on (users.name = clicks.user)
        where value > 0
      ) using ipaddr
      group by dma;
  • 54. Data Type and Model Rationale
    ◦ All three extend the basic data model with extended data types: array-like [ event1, event2, event3 ]; map-like { type1:value1, type2:value2, … }
    ◦ Different approaches: Resilient Schema; Static Typing; No Static Typing
  • 55. Hive Data Type and Schema
      CREATE TABLE visit ( user_name STRING, user_id INT, user_details STRUCT<age:INT, zipcode:INT> );
    Simple type | Details
    TINYINT, SMALLINT, INT, BIGINT | 1, 2, 4 and 8 bytes
    FLOAT, DOUBLE | 4 and 8 bytes
    BOOLEAN |
    STRING | Arbitrary-length, replaces VARCHAR
    TIMESTAMP |
    Complex type | Details
    ARRAY | Array of typed items (0-indexed)
    MAP | Associative map
    STRUCT | Complex class-like objects
    Dataiku Training – Hadoop for Data Science 1/8/14
  • 56. Data Types and Schema: Pig
      rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
    Simple type | Details
    int, long, float, double | 32 and 64 bits, signed
    chararray | A string
    bytearray | An array of … bytes
    boolean | A boolean
    Complex type | Details
    tuple | A tuple is an ordered fieldname:value map
    bag | A bag is a set of tuples
  • 57. Data Type and Schema: Cascading
    ◦ Support for any Java types, provided they can be serialized in Hadoop
    ◦ No support for typing
    Simple type | Details
    Int, Long, Float, Double | 32 and 64 bits, signed
    String | A string
    byte[] | An array of … bytes
    Boolean | A boolean
    Complex type | Details
    Object | Object must be "Hadoop serializable"
  • 58. Style Summary
    Tool | Style | Typing | Data Model | Metadata store
    Pig | Procedural | Static + Dynamic | scalar + tuple + bag (fully recursive) | No (HCatalog)
    Hive | Declarative | Static + Dynamic, enforced at execution time | scalar + list + map | Integrated
    Cascading | Procedural | Weak | scalar + Java objects | No
  • 59. Comparing without Comparable
    ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
    ◦ Productivity: Headachability; Checkpointing; Testing, error management and environment
    ◦ Integration: Partitioning; Formats Integration; External Code Integration
    ◦ Performance and optimization
  • 60. Headachability Motivation: Does debugging the tool lead to bad headaches?
  • 61. Headaches: Pig
    ◦ Out of memory errors (reducer)
    ◦ Exceptions in built-in / extended functions (handling of null)
    ◦ Null vs ""
    ◦ Nested FOREACH and scoping
    ◦ Date management (Pig 0.10)
    ◦ Field implicit ordering
  • 62. A Pig Error
  • 63. Headaches: Hive
    ◦ Out of memory errors in reducers
    ◦ Few debugging options
    ◦ Null / ""
    ◦ No built-in "first"
  • 64. Headaches: Cascading
    ◦ Weak typing errors (comparing Int and String, …)
    ◦ Illegal operation sequence (group after group, …)
    ◦ Field implicit ordering
  • 65. Testing Motivation
    ◦ How to perform unit tests?
    ◦ How to have different versions of the same script (parameters)?
  • 66. Testing: Pig
    ◦ System variables
    ◦ Comment to test
    ◦ No metaprogramming
    ◦ pig -x local to execute on local files
  • 67. Testing / Environment: Cascading
    ◦ JUnit tests are possible
    ◦ Ability to use code to actually comment out some variables
  • 68. Checkpointing Motivation
    ◦ Lots of iteration while developing on Hadoop
    ◦ Sometimes jobs fail
    ◦ Sometimes need to restart from the start …
    [Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output; fix and relaunch]
  • 69. Pig: Manual Checkpointing
    ◦ STORE command to manually store files
    [Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output; comment out (//) the beginning of the script and relaunch]
  • 70. Cascading: Automated Checkpointing
    ◦ Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…)
  • 71. Cascading: Topological Scheduler
    ◦ Check each intermediate file's timestamp
    ◦ Execute only if more recent
    [Diagram: pipeline stages Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output]
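The timestamp rule on this slide is the same one `make` uses: a step is re-run only when one of its inputs is newer than its output. The following Python sketch illustrates that rule only; it is not Cascading's API, and the step names, file names and dependency edges are assumptions made for the example.

```python
def stale_steps(steps, mtimes):
    """Return the names of the steps that must be re-run.

    steps:  list of (name, input_paths, output_path) in topological order.
    mtimes: dict mapping path -> modification time; a missing output
            means the step has never run.
    """
    mtimes = dict(mtimes)  # work on a copy
    to_run = []
    for name, inputs, output in steps:
        newest_input = max(mtimes[path] for path in inputs)
        if output not in mtimes or mtimes[output] < newest_input:
            to_run.append(name)
            # Once re-run, the output is newer than every existing file
            mtimes[output] = max(mtimes.values()) + 1
    return to_run

# Hypothetical flow, loosely following the stage names on the slide
steps = [
    ("parse_logs", ["raw"], "parsed"),
    ("per_page_stats", ["parsed"], "stats"),
    ("page_user_correlation", ["parsed"], "corr"),
    ("filtering", ["stats", "corr"], "filtered"),
    ("output", ["filtered"], "out"),
]
# "corr" is older than "parsed", so it and everything downstream is stale
mtimes = {"raw": 1, "parsed": 2, "stats": 3, "corr": 1, "filtered": 4, "out": 5}
rerun = stale_steps(steps, mtimes)
```

Here `rerun` contains the correlation step and the two steps downstream of it, while the parsing and per-page statistics are skipped.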
  • 72. Productivity Summary
    Tool | Headaches | Checkpointing / Replay | Testing / Metaprogramming
    Pig | Lots | Manual save | Difficult metaprogramming, easy local testing
    Hive | Few, but without debugging options | None (that's SQL) | None (that's SQL)
    Cascading | Weak typing complexity | Checkpointing, partial updates | Possible
  • 73. Comparing without Comparable
    ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
    ◦ Productivity: Headachability; Checkpointing; Testing and environment
    ◦ Integration: Formats Integration; Partitioning; External Code Integration
    ◦ Performance and optimization
  • 74. Formats Integration Motivation
    ◦ Ability to integrate different file formats: Text Delimited; Sequence File (binary Hadoop format); Avro, Thrift, …
    ◦ Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
    Format impact on size and performance
    Format | Size on disk (GB) | Hive processing time (24 cores)
    Text File, uncompressed | 18.7 | 1m32s
    1 Text File, Gzipped | 3.89 | 6m23s (no parallelization)
    JSON compressed | 7.89 | 2m42s
    Multiple text files, gzipped | 4.02 | 43s
    Sequence File, Block, Gzip | 5.32 | 1m18s
    Text File, LZO Indexed | 7.03 | 1m22s
  • 75. Format Integration
    ◦ Hive: SerDe (Serializer-Deserializer)
    ◦ Pig: Storage
    ◦ Cascading: Tap
  • 76. Partitions Motivation
    ◦ No support for "UPDATE" patterns: any increment is performed by adding or deleting a partition
    ◦ Common partition schemes on Hadoop: by date (/apache_logs/dt=2013-01-23); by data center (/apache_logs/dc=redbus01/…); by country; … or any combination of the above
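The key=value directory convention above is simple enough to sketch in Python. The helper names (`partition_path`, `parse_partition`, `prune`) and the sample paths are hypothetical; the point is only that an incremental load adds a directory and a query selects directories by their keys.

```python
def partition_path(root, **keys):
    """Build a path such as /apache_logs/dt=2013-01-23/dc=redbus01."""
    return root + "".join(f"/{key}={value}" for key, value in keys.items())

def parse_partition(path):
    """Extract the key=value components of a partition path."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(paths, **wanted):
    """Keep only the partitions whose keys match: no data file is opened."""
    return [
        path for path in paths
        if all(parse_partition(path).get(k) == v for k, v in wanted.items())
    ]

paths = [
    partition_path("/apache_logs", dt="2013-01-23", dc="redbus01"),
    partition_path("/apache_logs", dt="2013-01-23", dc="redbus02"),
    partition_path("/apache_logs", dt="2013-01-24", dc="redbus01"),
]
one_day = prune(paths, dt="2013-01-23")
```

Replacing a day of data then amounts to deleting and rewriting one directory, which is the "increment by adding or deleting a partition" pattern.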
  • 77. Hive Partitioning
    Partitioned tables
      CREATE TABLE event ( user_id INT, type STRING, message STRING) PARTITIONED BY (day STRING, server_id STRING);
    Disk structure
      /hive/event/day=2013-01-27/server_id=s1/file0
      /hive/event/day=2013-01-27/server_id=s1/file1
      /hive/event/day=2013-01-27/server_id=s2/file0
      /hive/event/day=2013-01-27/server_id=s2/file1
      …
      /hive/event/day=2013-01-28/server_id=s2/file0
      /hive/event/day=2013-01-28/server_id=s2/file1
    Loading a partition
      INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1') SELECT * FROM event_tmp;
  • 78. Cascading Partition
    ◦ No direct support for partitions
    ◦ Support for "Glob" Tap, to read from files using patterns
    ◦ ➔ You can code your own custom or virtual partition schemes
  • 79. External Code Integration: Simple UDF (Pig, Hive, Cascading)
  • 80. Hive: Complex UDF (Aggregators)
  • 81. Cascading: Direct Code Evaluation. Uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
  • 82. Integration Summary
    Tool | Partition / Incremental Updates | External Code | Format Integration
    Pig | No direct support | Simple | Doable and rich community
    Hive | Fully integrated, SQL-like | Very simple, but complex dev setup | Doable and existing community
    Cascading | With coding | Complex UDFs but regular, and Java expressions embeddable | Doable and growing community
  • 83. Comparing without Comparable
    ◦ Philosophy: Procedural vs Declarative; Data Model and Schema
    ◦ Productivity: Headachability; Checkpointing; Testing and environment
    ◦ Integration: Formats Integration; Partitioning; External Code Integration
    ◦ Performance and optimization
  • 84. Optimization
    ◦ Several common MapReduce optimization patterns: Combiners; MapJoin; Job Fusion; Job Parallelism; Reducer Parallelism
    ◦ Different support per framework: Fully automatic; Pragma / Directives / Options; Coding style / Code to write
  • 85. Combiner: Perform partial aggregation at the mapper stage
      SELECT date, COUNT(*) FROM product GROUP BY date
    [Diagram, without combiner: each map task forwards every (date, id) record, e.g. (2012-02-14, 4354), (2012-02-15, 21we2), (2012-02-14, qa334), (2012-02-15, 23aq2), to the reduce stage, which outputs 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1.]
  • 86. Combiner: Perform partial aggregation at the mapper stage
      SELECT date, COUNT(*) FROM product GROUP BY date
    [Diagram, with combiner: one map task emits the partial counts 2012-02-14: 8, 2012-02-15: 12; the other emits 2012-02-14: 12, 2012-02-15: 23, 2012-02-16: 1; the reduce stage sums them into 2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1.]
    ◦ Reduced network bandwidth. Better parallelism
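The effect of a combiner on the GROUP BY above can be sketched in Python: each map task pre-aggregates its own split, so only one record per distinct date leaves the mapper. The input splits and function names below are hypothetical.

```python
from collections import Counter

# Hypothetical input splits of (date, product_id) records, one per map task
split_a = [("2012-02-14", "4354"), ("2012-02-15", "21we2"), ("2012-02-14", "qa334")]
split_b = [("2012-02-15", "23aq2"), ("2012-02-16", "x1")]

def map_with_combiner(split):
    # Combiner: partial COUNT(*) per date, computed on the mapper side
    return Counter(date for date, _ in split)

def reduce_counts(partials):
    # Reducer: sum the partial counts coming from every map task
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

partials = [map_with_combiner(split_a), map_with_combiner(split_b)]
counts = reduce_counts(partials)

# Records crossing the network with and without the combiner
shuffled_without = len(split_a) + len(split_b)
shuffled_with = sum(len(partial) for partial in partials)
```

This works because COUNT is associative and commutative; an aggregate such as a median cannot be combined this way.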
  • 87. Join Optimization: Map Join
    ◦ Hive: set hive.auto.convert.join = true;
    ◦ Pig
    ◦ Cascading (no aggregation support after HashJoin)
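The idea behind a map join is that when one side of the join is small enough to fit in memory, every mapper loads it into a hash table and joins while streaming the large side, so no shuffle or reduce phase is needed. A minimal Python sketch, with hypothetical data and a hypothetical `map_join` helper:

```python
def map_join(big_rows, small_rows):
    """Hash join performed entirely on the map side."""
    # Build phase: load the small table into an in-memory hash table
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Probe phase: stream the big table and emit matches (inner join)
    for key, value in big_rows:
        for match in lookup.get(key, []):
            yield (key, value, match)

users = [(1, "Dupont"), (2, "Durand")]             # small side
events = [(1, "click"), (2, "view"), (3, "buy")]   # big side
joined = list(map_join(events, users))
```

The event for uid 3 has no matching user and is dropped, as in an inner join.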
  • 88. Number of Reducers
    ◦ Critical for performance
    ◦ Estimated from the size of the input files
    ◦ Hive: divide the size by hive.exec.reducers.bytes.per.reducer (default 1 GB)
    ◦ Pig: divide the size by pig.exec.reducers.bytes.per.reducer (default 1 GB)
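The estimate described above is a ceiling division. The Python sketch below shows the arithmetic only; the function name is hypothetical, 1 GB is taken as 10**9 bytes, and the upper bound `max_reducers` is an assumption added for the example (both tools expose a setting that caps the estimate).

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate the reducer count as ceil(input size / bytes per reducer)."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer, at most the (assumed) configured cap
    return max(1, min(max_reducers, wanted))

small_job = estimate_reducers(2_500_000_000)   # 2.5 GB of input
large_job = estimate_reducers(50 * 10**9)      # 50 GB of input
```

With the default setting, 2.5 GB of input yields 3 reducers and 50 GB yields 50. Because the estimate only sees the input size, a job whose filter discards most rows (or whose join multiplies them) is a typical case for setting the number by hand.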
  • 89. Performance Optimization Summary
    Tool | Combiner optimization | Join optimization | Number of reducers optimization
    Pig | Automatic | Option | Estimate or DIY
    Cascading | DIY | HashJoin | DIY
    Hive | Partial DIY | Automatic (Map Join) | Estimate or DIY
  • 94. clustering applications • Fraud: Detect Outliers • CRM : Mine for customer segments • Image Processing : Similar Images • Search : Similar documents • Search : Allocate Topics
  • 95. K-Means: Guess an initial placement for the centroids. Then iterate: assign each point to the closest center (MAP), reposition each center (REDUCE)
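The map and reduce roles named on this slide can be made concrete with a small Python sketch. It runs in a single process on hypothetical 2-D points and is not Mahout's implementation; it only shows which part of each iteration is the map step and which is the reduce step.

```python
from collections import defaultdict

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_iteration(points, centroids):
    # MAP: emit (index of closest centroid, point)
    groups = defaultdict(list)
    for point in points:
        groups[closest(point, centroids)].append(point)
    # REDUCE: the new centroid is the mean of the points assigned to it;
    # a centroid with no points keeps its position
    new_centroids = []
    for i, old in enumerate(centroids):
        assigned = groups.get(i)
        if not assigned:
            new_centroids.append(old)
        else:
            new_centroids.append(tuple(sum(dim) / len(assigned) for dim in zip(*assigned)))
    return new_centroids

def kmeans(points, centroids, max_iter=20):
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:   # converged: nothing moved
            break
        centroids = updated
    return centroids

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

On Hadoop each call to `kmeans_iteration` would be a full MapReduce job reading from and writing to disk, which is why iterative algorithms are comparatively slow there.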
  • 105. clustering challenges • Curse of Dimensionality • Choice of distance / number of parameters • Performance • Choice # of clusters
  • 106. Mahout Clustering Challenges • No Integrated Feature Engineering Stack: Get ready to write data processing in Java • Hadoop SequenceFile required as an input • Iterations as Map/Reduce read and write to disks: Relatively slow compared to in-memory processing
  • 108. Data Processing [Diagram: Image, Voice, Log / DB → Data Processing → Vectorized Data]
  • 109. Mahout K-Means on Text Workflow: Text Files → mahout seqdirectory → Mahout Sequence Files → mahout seq2sparse → TF-IDF Vectors → mahout kmeans → Clusters
  • 110. Mahout K-Means on Database Extract Workflow: Database Dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout Vectors → mahout kmeans → Clusters
  • 111. Convert a CSV File to Mahout Vector • Real code would also include: • Converting categorical variables to dimensions • Variable rescaling • Dropping IDs (name, first name, …)
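The three preprocessing steps listed on this slide can be sketched in plain Python. This does not use Mahout's vector classes: the function name, the sample CSV and the choice of min-max rescaling and one-hot encoding are assumptions made for illustration.

```python
import csv
import io

def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: numeric columns first, then categories."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Dropping IDs: identifier columns never reach the vector
    numeric = [c for c in rows[0] if c not in id_columns and c not in categorical_columns]
    # Categorical variables: one dimension per distinct value (one-hot)
    categories = {c: sorted({r[c] for r in rows}) for c in categorical_columns}
    # Variable rescaling: min-max scaling of each numeric column to [0, 1]
    bounds = {c: (min(float(r[c]) for r in rows), max(float(r[c]) for r in rows)) for c in numeric}
    vectors = []
    for r in rows:
        vec = []
        for c in numeric:
            lo, hi = bounds[c]
            vec.append((float(r[c]) - lo) / (hi - lo) if hi > lo else 0.0)
        for c in categorical_columns:
            vec.extend(1.0 if r[c] == value else 0.0 for value in categories[c])
        vectors.append(vec)
    return vectors

# Hypothetical database extract
data = "name,age,country\nDupont,20,FR\nDurand,40,DE\nMartin,30,FR\n"
vectors = vectorize(data, id_columns=["name"], categorical_columns=["country"])
```

Each vector here has three dimensions: the rescaled age, then one dimension each for DE and FR. Without rescaling, a column with a large numeric range would dominate the distance used by K-Means.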
  • 112. Mahout Algorithms
    Algorithm | Parameters | Implicit assumption | Output
    K-Means | K (number of clusters), convergence | Circles | Point - ClusterId
    Fuzzy K-Means | K (number of clusters), convergence | Circles | Point - ClusterId*, Probability
    Expectation Maximization | K (number of clusters), convergence | Gaussian distribution | Point - ClusterId*, Probability
    Mean-Shift Clustering | Distance boundaries, convergence | Gradient-like distribution | Point - ClusterId
    Top Down Clustering | Two clustering algorithms | Hierarchy | Point - Large ClusterId, Small ClusterId
    Dirichlet Process | Model distribution | Points are a mixture of distributions | Point - ClusterId, Probability
    Spectral Clustering | - | - | Point - ClusterId
    MinHash Clustering | Number of hashes / keys, hash type | High dimension | Point - Hash*
  • 116. What if? • Pour Data In: data comes continuously? • Compute Something Smart About It: aggregation patterns are not "hashable"? • Make Available: human interaction requires results fast or incrementally available?
  • 117. After Hadoop • Massive Batch MapReduce over HDFS • Random Access • In-Memory Multicore Machine Learning • Faster In-Memory Computation • Real-Time Distributed Computation • Faster SQL Analytics Queries
  • 118. HBase • Started by Powerset (now in Bing) in 2007 • Provide a key-value store on top of Hadoop
  • 119. HBASE
  • 120. GRAPHLAB • High-performance distributed computing framework, in C++ • Started in 2009 at Carnegie Mellon • Main application in machine learning tasks: Topic Modeling, Collaborative Filtering, Computer Vision • Can read data in HDFS
  • 121. SPARK • Developed in 2010 at UC Berkeley • Provides a distributed memory abstraction for efficient sequences of map/filter/join applications • Can read from and store to HDFS or files
  • 122. SPARK
  • 123. STORM • Developed in 2011 by Nathan Marz at BackType (then Twitter) • Provides a framework for distributed, real-time, fault-tolerant computation • Not a message queuing system: a complex event processing system
  • 124. STORM
  • 126. IMPALA • Started by Cloudera in 2012 • Provide real-time answers to SQL Queries on top of HDFS