Open Source Big Data Analytics Toolkits Comparison

Big Data Analytics: Open Source Toolkits
Prafulla Wani
Snehalata Deorukhkar

Introduction
 Talk Background
– More Data Beats Better Algorithms
– Evaluate “Analytics Toolkits” that support Hadoop
 Speaker Backgrounds
– Data Engineers
– No PhDs in statistics

Big Data Analytics Toolkits
 Evaluation parameters
– Ease of use
• Development APIs
• # of Algorithms supported
– Performance
• Scalable Architecture
• Disk-based / Memory-based
 Open-source only
– RHadoop
– Mahout
– MADLib
– HiveMall
– H2O
– Spark-MLLib

Analytics Project lifecycle
Gather Data → Train Model(s) → Compare Accuracy → Predict Future
 Train Model(s)
– Train Algorithm 1 (Logistic regression)
– Train Algorithm 2 (SVM)
– ......
– Train Algorithm N
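
The lifecycle above can be sketched on a single machine in plain Python. This is only an illustration under stated assumptions: the two "algorithms" (a majority-class baseline and a one-feature threshold rule) are hypothetical stand-ins for the logistic regression and SVM named on the slide, and the data is synthetic.

```python
import random


def train_majority(rows):
    """Baseline model: always predict the most common training label."""
    ones = sum(label for _, label in rows)
    majority = 1 if ones * 2 >= len(rows) else 0
    return lambda x: majority


def train_stump(rows):
    """Threshold model: pick the cut on x that classifies most rows correctly."""
    best_threshold, best_hits = None, -1
    for threshold, _ in rows:
        hits = sum((x >= threshold) == bool(label) for x, label in rows)
        if hits > best_hits:
            best_threshold, best_hits = threshold, hits
    return lambda x: 1 if x >= best_threshold else 0


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def run_lifecycle(seed=0):
    rng = random.Random(seed)
    # Gather data (synthetic assumption: the label is 1 when x is above 0.5).
    data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]
    train, test = data[:150], data[150:]
    # Train model(s): one candidate per algorithm.
    candidates = {"majority": train_majority(train), "stump": train_stump(train)}
    # Compare accuracy on held-out rows.
    scores = {name: accuracy(model, test) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    # Predict with the winning model.
    return best, scores, candidates[best](0.9)
```

Calling `run_lifecycle()` returns the name of the winning candidate, the held-out accuracy of every candidate, and a prediction for a new value.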

Analytics (Pre-Hadoop era)
Chart (performance vs. ease of use), tools plotted so far:
– Single machine: R, Octave
 R
– Started in 1993
– Very popular
– 5589 packages
– Written primarily in C and Fortran
 Octave
– Started in 1988
– Open source, with features comparable to Matlab

Timeline
Years on axis: 2006, 2008, 2010, 2011, 2012, 2013, 2014
 Hadoop (milestones)
– Core Hadoop (HDFS, MapReduce)
– HBase, Zookeeper, Pig, Hive…
– Avro, Sqoop
– Cloudera Impala
– YARN

Architecture
 RHadoop: R on the client/edge node, and R on each node of the Hadoop cluster
 Mahout: Mahout on the client/edge node, running Map/Reduce jobs on the Hadoop cluster

Timeline
Years on axis: 2006, 2008, 2010, 2011, 2012, 2013, 2014
 RHadoop (releases)
– rhdfs, rmr
– rmr 2.0
– plyrmr

RHadoop
 Provides R packages –
– rhdfs - to read/write from/to HDFS
– rhbase - to read/write from/to HBase
– rmr - to express map-reduce programs in R
 Does not provide out-of-the-box packages for model training

RHadoop
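library(rmr2)  # rmr 2.x: mapreduce(), keyval(), from.dfs(), values()

logistic.regression = function(input, iterations, dims, alpha) {
  # Map: gradient contribution of one chunk; column 1 of M holds the labels.
  # Defined inside the function so that plane and g are in scope.
  lr.map = function(., M) {
    Y = M[,1]
    X = M[,-1]
    keyval(
      1,
      Y * X *
        g(-Y * as.numeric(X %*% t(plane))))}
  # Reduce: sum the partial gradients column by column.
  lr.reduce = function(k, Z)
    keyval(k, t(as.matrix(apply(Z, 2, sum))))
  plane = t(rep(0, dims))           # initial separating plane
  g = function(z) 1/(1 + exp(-z))   # sigmoid
  for (i in 1:iterations) {
    # One MapReduce job per iteration.
    gradient =
      values(from.dfs(mapreduce(
        input,
        map = lr.map,
        reduce = lr.reduce,
        combine = TRUE)))
    plane = plane + alpha * gradient }
  plane }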
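
For readers without a Hadoop cluster, the same map/reduce structure can be traced in plain Python. This is a single-machine sketch, not RHadoop itself: it assumes each row is (label, features...) with labels in {-1, +1}, and the function names simply mirror the R code.

```python
import math
from functools import reduce


def g(z):
    """Sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(row, plane):
    """Map step: gradient contribution of one row (label first, then features)."""
    y, x = row[0], row[1:]
    margin = y * sum(xi * wi for xi, wi in zip(x, plane))
    return [y * xi * g(-margin) for xi in x]


def lr_reduce(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]


def logistic_regression(rows, iterations, dims, alpha):
    plane = [0.0] * dims
    for _ in range(iterations):
        # One pass over all rows per iteration, as one MapReduce job would do.
        gradient = reduce(lr_reduce, (lr_map(r, plane) for r in rows))
        plane = [w + alpha * gi for w, gi in zip(plane, gradient)]
    return plane
```

Each loop iteration corresponds to one MapReduce job in the R version: the map step emits per-row gradients, the reduce step sums them, and the driver applies the update.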

Timeline
Years on axis: 2006, 2008, 2010, 2011, 2012, 2013, 2014
 Mahout (milestones)
– Started as a subproject of Apache Lucene
– 4 releases (0.1 – 0.4)
– Top-level Apache project
– Recommendation engines: a common case study for Hadoop
– 0.8 release
– Decision to reject new MapReduce implementations
– Future implementations on top of Apache Spark
– Integration with H2O platform

Mahout
 Original goal: to implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore"
 Java-based library with MapReduce implementations of common analytics algorithms
 Key algorithms
– Recommendation algorithms / Collaborative filtering
– Classification
– Clustering
– Frequent Pattern Growth

Mahout
 Train the model:
mahout org.apache.mahout.df.mapreduce.BuildForest \
    -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info \
    -sl 5 -t 1000 -o crwd_forest
 Test the model:
mahout org.apache.mahout.df.mapreduce.TestForest \
    -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions

Summary
Chart (performance vs. ease of use), tools plotted so far:
– Single machine: R, Octave
– Distributed, disk-based: Mahout, RHadoop

Aging MapReduce
 Machine learning algorithms are iterative in nature
 Mahout algorithms involve multiple MapReduce stages
 Intermediate results are written to HDFS
 MR job is launched for each iteration
 IO overhead
Diagram: iterative jobs do an HDFS read and an HDFS write around every iteration (iter. 1, iter. 2, …); interactive queries (Query 1, Query 2, Query 3, …) each do an HDFS read of the same input to produce result 1, result 2, result 3, …
Slow due to replication and disk IO
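
The IO overhead can be made concrete with a small counting sketch. `CountingStore` is a hypothetical stand-in for HDFS that only counts reads and writes, and `step` is a placeholder for one iteration of any iterative algorithm; nothing here uses a real Hadoop API.

```python
class CountingStore:
    """Stand-in for HDFS that counts how often it is read and written."""

    def __init__(self):
        self.blobs, self.reads, self.writes = {}, 0, 0

    def write(self, key, values):
        self.writes += 1
        self.blobs[key] = list(values)

    def read(self, key):
        self.reads += 1
        return list(self.blobs[key])


def step(values):
    """One iteration of some iterative algorithm (here: halve every value)."""
    return [v / 2 for v in values]


def run_disk_based(store, iterations):
    """MapReduce style: every iteration reads its input and persists its output."""
    key = "input"
    for i in range(iterations):
        data = store.read(key)
        key = f"iter-{i + 1}"
        store.write(key, step(data))
    return store.read(key)


def run_memory_based(store, iterations):
    """In-memory style: load once, keep intermediate results in memory."""
    data = store.read("input")
    for _ in range(iterations):
        data = step(data)
    return data
```

With three iterations, the disk-based loop touches the store seven times after the initial load (four reads, three writes), while the in-memory loop touches it once. On a real cluster every write is also replicated.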

Disk Trend
 Disk throughput increasing slowly
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf

Memory Trend
 RAM throughput increasing exponentially
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf

Timeline
Years on axis: 2009, 2010, 2011, 2012, 2013, 2014
 Spark (milestones)
– Started as a research project at UC Berkeley AMPLab
– Open-sourced
– Wins Best Paper Award at USENIX NSDI
– Accepted into Apache incubator
– Spark 0.8 release introduced MLLib
– Spark-MLLib 1.0 released

Spark – Data sharing
 Resilient Distributed Datasets (RDDs)
– Distributed collections of objects that can be cached in
memory across cluster nodes
– Manipulated through various parallel operations
– Automatically rebuilt on failures
Diagram: the input is loaded into distributed memory with one-time processing; iterations (iter. 1, iter. 2, …) and queries (Query 1, Query 2, Query 3, …) then read from distributed memory
10-100x faster than network and disk
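
The "automatically rebuilt on failures" property rests on each dataset remembering how it was derived. The toy class below sketches that idea in plain Python. `ToyRDD` and its methods are hypothetical and are not the Spark API; unlike Spark, this sketch caches every dataset it computes.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # function: parent rows -> rows
        self._parent = parent
        self._cached = None
        self.computations = 0     # how many times this dataset was (re)built

    @classmethod
    def from_source(cls, load):
        return cls(lambda _: list(load()))

    def map(self, fn):
        return ToyRDD(lambda rows: [fn(r) for r in rows], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda rows: [r for r in rows if pred(r)], parent=self)

    def collect(self):
        # Serve from memory when possible, otherwise rebuild from the lineage.
        if self._cached is None:
            parent_rows = self._parent.collect() if self._parent else None
            self._cached = self._compute(parent_rows)
            self.computations += 1
        return list(self._cached)

    def drop_cache(self):
        """Simulate losing the in-memory copy, for example on node failure."""
        self._cached = None
```

Repeated `collect()` calls are served from memory; after `drop_cache()` the next call recomputes the rows from the parent datasets instead of failing.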

MLLib
 Spark implementation of some common machine learning algorithms and utilities, including
– Classification
– Regression
– Clustering
 Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms
– val model = SVMWithSGD.train(training, numIterations)
– val clusters = KMeans.train(parsedData, numClusters, numIterations)
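
To show roughly what a call like KMeans.train(parsedData, numClusters, numIterations) computes, here is a single-machine sketch of Lloyd's k-means in plain Python. It is not MLLib code: the function name, the random initialization, and the `seed` parameter are assumptions made for this illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, num_clusters, num_iterations, seed=0):
    rng = random.Random(seed)
    # Assumption: start from randomly chosen input points.
    centers = rng.sample(points, num_clusters)
    for _ in range(num_iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        centers = [mean(grp) if grp else c for grp, c in zip(groups, centers)]
    return centers
```

The assignment step is the part a distributed implementation runs in parallel across partitions; the update step only needs per-cluster sums and counts.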

SparkR - R Interface over Spark
 Currently supports data transformation functions (lapply() etc.) on distributed Spark datasets
 Does not support running out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train)
 Work is in progress on SparkR-MLLib integration, which may address this limitation

Timeline
Years on axis: 2009, 2010, 2011, 2012, 2013, 2014
 MADLib (milestones)
– Began as a collaboration between researchers, engineers and data scientists
– Initial release
– MADLib port for Impala

MADLib
 An open-source library for scalable in-database analytics
 Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ
 Key MADLib architecture principles are:
– Operating on the data locally, in the database.
– Utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details.
– Leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
– Open implementation maintaining active ties into ongoing academic research.
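
The first principle, operating on the data inside the database, can be sketched with a user-defined aggregate. The example below uses Python's built-in sqlite3 module as a stand-in for an MPP database. The aggregate name `linregr_slope` and the table `obs` are hypothetical and are not MADLib's API.

```python
import sqlite3


class LinregrSlope:
    """User-defined aggregate: least-squares slope of y on x."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, y, x):
        # Called once per row by the database engine.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Called once at the end to turn the running sums into a result.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            return None
        return (self.n * self.sxy - self.sx * self.sy) / denom


def fit_slope_in_database(rows):
    """Load (x, y) rows into a table and fit the slope with a single SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("linregr_slope", 2, LinregrSlope)
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    (slope,) = conn.execute("SELECT linregr_slope(y, x) FROM obs").fetchone()
    conn.close()
    return slope
```

The model is fitted by the query itself, so the rows never leave the database. In a shared-nothing engine, each segment could keep such running sums for its own rows before a final merge.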

MADLib Architecture
Components shown in the diagram:
 User Interface
 “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
 High-level Abstraction Layer (iteration controller, …)
 RDBMS built-in functions
 MPP Query Processing (Greenplum, PostgreSQL, Impala …)
 Functions for Inner Loops (for streaming algorithms)
 Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
Implementation labels on the diagram: SQL, generated from specification; C++

Timeline
Years on axis: 2009, 2010, 2011, 2012, 2013, 2014
 H2O (milestones)
– H2O project open-sourced
– Latest stable release of H2O, 2.4.3.4, released on May 13, 2014

H2O
 Open source math and prediction engine
 Distributed, in-memory computations
 Creates a cluster of H2O nodes, which are map-only tasks
 Provides a graphical interface to load data, view summaries and train models
 Certified for major Hadoop distributions

H2O on Hadoop Deployment
Diagram: hadoop jar … is run on the Hadoop edge node and submitted to the Job Tracker; each Hadoop map task on the Hadoop Task Tracker nodes hosts one H2O instance, and together these form the H2O cluster; data is read from HDFS on the Hadoop HDFS data nodes
Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12

H2O Programming Interface
 R package "H2O"
– prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
– summary(prostate.data)
– h2o.glm
– h2o.kmeans

Community involvement
Number of commits, for the 30 days ending 27 May:
Toolkit        # of commits
Mahout         20
Spark-MLLib    249
MADLib         0
H2O            557

HiveMall
 Machine learning and feature engineering functions through UDFs/UDAFs/UDTFs of Hive
 Supports various algorithms for
– Classification: Perceptron, Adaptive Regularization of Weight Vectors (AROW)
– Regression: Logistic Regression using Stochastic Gradient Descent
– Recommendation: Minhash (LSH with Jaccard index)
– k-Nearest Neighbor
– Feature engineering
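
The Minhash entry is the least self-explanatory item in the list, so here is a minimal sketch of the idea in plain Python. It is not HiveMall's implementation: the MD5-based hash family, the function names, and the default of 128 hash functions are assumptions, and the input sets are assumed to be non-empty.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """One member of a seeded hash family (assumption: MD5-based, for illustration)."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the smallest hash value seen in the set."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard index."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two sets agree on the minimum of one hash function with probability equal to their Jaccard index, so short signatures can replace full set comparisons when searching for similar items.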

Summary
Chart (performance vs. ease of use), tools plotted so far:
– Single machine: R, Octave
– Distributed, disk-based: Mahout, RHadoop
– Distributed, memory-based: H2O, MLLib
– Also plotted: MADLib + Impala, HiveMall

MLBase - Vision
 Optimizer built on top of Spark & MLLib
 A Declarative Approach
 Abstracts complexities of variable & algorithm selection
– var X = load("als_clinical", 2 to 10)
– var y = load("als_clinical", 1)
– var (fn-model, summary) = doClassify(X, y)
Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
Lifecycle: Gather Data → Train Model(s) → Compare Accuracy → Predict Future

Summary
Chart (performance vs. ease of use), final placement:
– Single machine: R, Octave
– Distributed, disk-based: Mahout, RHadoop
– Distributed, memory-based: H2O, MLBase, MLLib
– Also plotted: MADLib + Impala, HiveMall

Yes, We Are Hiring!
Thank You!
