Big Data Analytics –
Open Source Toolkits
Prafulla Wani
Snehalata Deorukhkar
Introduction
 Talk Background
– More Data Beats Better Algorithms
– Evaluate “Analytics Toolkits” that support Hadoop
 Speaker Backgrounds
– Data Engineers
– No PhDs in statistics
Big Data Analytics Toolkits
 Evaluation parameters
– Ease of use
• Development APIs
• # of Algorithms supported
– Performance
• Scalable Architecture
• Disk-based / Memory-based
 Open-source only
– RHadoop
– Mahout
– MADLib
– HiveMall
– H2O
– Spark-MLLib
Analytics Project lifecycle
[Diagram: Gather Data, Train Model(s), Compare Accuracy, Predict Future]
 Train Algorithm 1 (Logistic regression)
 Train Algorithm 2 (SVM)
 ......
 Train Algorithm N
Analytics (Pre-Hadoop era)
[Chart: Performance vs. Ease of use. Single Machine: R, Octave]
R –
 Started in 1993
 Very popular
 5589 packages
 Written primarily in C and Fortran
Octave –
 Started in 1988
 Open source, with features comparable to Matlab
Timeline (axis: 2006, 2008, 2010, 2011, 2012, 2013, 2014)
Hadoop
– Core Hadoop (HDFS, MapReduce)
– HBase, Zookeeper, Pig, Hive…
– Avro, Sqoop
– Cloudera Impala
– YARN
Architecture
[Diagram: two panels, RHadoop and Mahout. Each panel shows a Client/Edge Node submitting Map/Reduce jobs to a Hadoop Cluster. In the RHadoop panel, R appears on the edge node and on the cluster nodes.]
Timeline
RHadoop
– rhdfs, rmr
– rmr 2.0
– plyrmr
RHadoop
 Provides R packages –
– rhdfs - to read/write from/to HDFS
– rhbase - to read/write from/to HBase
– rmr - to express map-reduce programs in R
 Does not provide out-of-the-box packages for model training
RHadoop
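library(rmr2)  # assumed package: provides mapreduce, keyval, from.dfs, values

# Gradient ascent for logistic regression; labels in column 1 are assumed to be -1/+1
logistic.regression = function(input, iterations, dims, alpha) {
  plane = t(rep(0, dims))
  g = function(z) 1/(1 + exp(-z))
  # Map: partial gradient for one chunk M of the input.
  # Defined inside the function so that plane and g are in scope.
  lr.map = function(., M) {
    Y = M[,1]
    X = M[,-1]
    keyval(1, Y * X * g(-Y * as.numeric(X %*% t(plane))))
  }
  # Reduce: sum the partial gradients column-wise
  lr.reduce = function(k, Z)
    keyval(k, t(as.matrix(apply(Z, 2, sum))))
  for (i in 1:iterations) {
    gradient = values(from.dfs(mapreduce(
      input,
      map = lr.map,
      reduce = lr.reduce,
      combine = TRUE)))
    plane = plane + alpha * gradient
  }
  plane
}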
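The same map/reduce structure can be sketched without a cluster. The following is a minimal single-process illustration in plain Python, not RHadoop code: the chunked input, the toy data, and the function names are assumptions made for the example, and labels are assumed to be -1/+1 as in the R version.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(chunk, plane):
    """Map step: partial gradient for one chunk of (y, x) rows, y in {-1, +1}."""
    grad = [0.0] * len(plane)
    for y, x in chunk:
        margin = y * sum(w * xi for w, xi in zip(plane, x))
        scale = y * sigmoid(-margin)
        for j, xi in enumerate(x):
            grad[j] += scale * xi
    return grad


def lr_reduce(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]


def logistic_regression(chunks, iterations, dims, alpha):
    plane = [0.0] * dims
    for _ in range(iterations):
        # One "map task" per chunk, then a single reduce over the partial results.
        partials = [lr_map(chunk, plane) for chunk in chunks]
        gradient = reduce(lr_reduce, partials)
        plane = [w + alpha * g for w, g in zip(plane, gradient)]
    return plane


# Hypothetical toy data: x[0] is a bias term, the label follows the sign of x[1].
chunks = [
    [(1, [1.0, 2.0]), (-1, [1.0, -1.5])],
    [(1, [1.0, 1.0]), (-1, [1.0, -2.0])],
]
plane = logistic_regression(chunks, iterations=50, dims=2, alpha=0.1)
```

Each pass over the data corresponds to one MapReduce job in the RHadoop version, which is why the number of iterations directly drives the number of jobs launched.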
Timeline
Mahout
– Started as a subproject of Apache Lucene
– 4 releases (0.1 – 0.4)
– Top-level Apache project
– 0.8 release
– Decision to reject new MapReduce implementation
– Future implementations on top of Apache Spark
– Integration with H2O platform
[Callout: Recommendation Engines – Common Case study for Hadoop]
Mahout
 Original goal - To implement all 10 algorithms from the paper "Map-Reduce for Machine Learning on Multicore" by Andrew Ng and co-authors
 Java-based library with MapReduce implementations of common analytics algorithms
 Key algorithms
– Recommendation algorithms / Collaborative filtering
– Classification
– Clustering
– Frequent Pattern Growth
Mahout
 Train the model:
mahout org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info -sl 5 -t 1000 -o crwd_forest
 Test the model:
mahout org.apache.mahout.df.mapreduce.TestForest -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
Summary
[Chart: Performance vs. Ease of use]
– Regions: Single Machine; Distributed, Disk-based
– Toolkits plotted: R, Octave, Mahout, RHadoop
Aging MapReduce
 Machine learning algorithms are iterative in nature
 Mahout algorithms involve multiple MapReduce stages
 Intermediate results are written to HDFS
 MR job is launched for each iteration
 IO overhead
[Diagram: iterative jobs (iter. 1, iter. 2, …) alternate HDFS reads and HDFS writes between iterations; interactive queries (Query 1, Query 2, Query 3, …) each read the input from HDFS to produce result 1, result 2, result 3, …]
Slow due to replication and disk IO
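The IO pattern described above can be illustrated with a small single-machine sketch. It is only an analogy: local JSON files stand in for HDFS, the `step` function is a hypothetical iteration, and no replication is modelled. It shows that the disk-based loop performs one read and one write per iteration, while the in-memory loop produces the same result with none.

```python
import json
import os
import tempfile


def step(values):
    """One hypothetical iteration: move every value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2.0 for v in values]


def run_disk_based(values, iterations, workdir):
    """Each iteration reads its input from a file and writes its output to a file."""
    path = os.path.join(workdir, "iter_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    io_ops = 0  # counts only the reads and writes inside the loop
    result = values
    for i in range(iterations):
        with open(path) as f:  # read the previous iteration's output
            current = json.load(f)
        io_ops += 1
        result = step(current)
        path = os.path.join(workdir, f"iter_{i + 1}.json")
        with open(path, "w") as f:  # write this iteration's output
            json.dump(result, f)
        io_ops += 1
    return result, io_ops


def run_in_memory(values, iterations):
    """The same computation with intermediate results kept in memory."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current


data = [1.0, 2.0, 3.0, 10.0]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_io_ops = run_disk_based(data, 5, workdir)
memory_result = run_in_memory(data, 5)
```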
Disk Trend
 Disk throughput increasing slowly
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
Memory Trend
 RAM throughput increasing exponentially
Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
Timeline (axis: 2009, 2010, 2011, 2012, 2013, 2014)
Spark
– Started as a research project at UC Berkeley AMPLab
– Open-sourced
– Wins Best Paper Award at USENIX NSDI
– Accepted into Apache incubator
– Spark 0.8 release introduced MLLib
– Spark-MLLib 1.0 released
Spark – Data sharing
 Resilient Distributed Datasets (RDDs)
– Distributed collections of objects that can be cached in memory across cluster nodes
– Manipulated through various parallel operations
– Automatically rebuilt on failures
[Diagram: input is loaded into distributed memory by one-time processing; iterations (iter. 1, iter. 2, …) and queries (Query 1, Query 2, Query 3, …) then share the data in memory]
10-100x faster than network and disk
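The idea of a cached collection that is rebuilt from its lineage can be sketched in a few lines of plain Python. This is a toy, single-process model and not Spark's API: the `ToyRDD` class, its methods, and the `lose_cache` failure hook are all hypothetical names chosen for illustration, and unlike real RDDs this toy caches every result eagerly on first use.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how it was derived (its lineage)."""

    def __init__(self, compute, parent=None):
        self._compute = compute  # function: parent data (or None) -> list
        self._parent = parent
        self._cache = None
        self.compute_count = 0

    @classmethod
    def from_source(cls, load):
        return cls(lambda _parent_data: list(load()))

    def map(self, fn):
        return ToyRDD(lambda data: [fn(x) for x in data], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda data: [x for x in data if pred(x)], parent=self)

    def collect(self):
        # Serve from the cache when possible; otherwise rebuild from the parent.
        if self._cache is None:
            parent_data = self._parent.collect() if self._parent else None
            self._cache = self._compute(parent_data)
            self.compute_count += 1
        return list(self._cache)

    def lose_cache(self):
        """Simulate losing the cached data, for example after a node failure."""
        self._cache = None


source = ToyRDD.from_source(lambda: range(10))
squares = source.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()   # computed through the whole lineage
second = evens.collect()  # served from the cache
evens.lose_cache()
third = evens.collect()   # rebuilt from squares, which is still cached
```

Because only the lost piece is recomputed from its parent, repeated queries and iterations avoid going back to the original input.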
MLLib
 Spark implementation of some common machine learning algorithms and utilities, including
– Classification
– Regression
– Clustering
 Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms –
– val model = SVMWithSGD.train(training, numIterations)
– val clusters = KMeans.train(parsedData, numClusters, numIterations)
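To give a feel for what a one-line call such as KMeans.train(parsedData, numClusters, numIterations) hides, here is a minimal single-machine sketch of Lloyd's k-means algorithm in plain Python. It is not MLLib's implementation: the function name, the toy data, and the naive choice of initial centers (the first few points) are assumptions made for the example.

```python
def kmeans_train(points, num_clusters, num_iterations):
    """Lloyd's algorithm on a list of equal-length tuples.

    Initial centers are simply the first num_clusters points (an assumption
    made to keep the sketch deterministic).
    """
    centers = [tuple(p) for p in points[:num_clusters]]
    for _ in range(num_iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


# Hypothetical toy data: two well-separated groups of 2-D points.
parsed_data = [(0.0, 0.0), (9.0, 9.0), (0.2, 0.1), (9.1, 8.9), (0.1, 0.2), (8.9, 9.1)]
clusters = kmeans_train(parsed_data, num_clusters=2, num_iterations=10)
```

The assignment step is the part that parallelizes naturally over partitions of the data, which is what makes the algorithm a good fit for a distributed, in-memory engine.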
SparkR - R Interface over Spark
 Currently supports data transformation functions such as lapply() on Spark's distributed model
 Does not support running the out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train)
 Work is in progress on SparkR-MLLib integration, which may address this limitation
Timeline
MADLib
– Began as a collaboration between researchers, engineers and data scientists
– Initial release
– MADLib port for Impala
MADLib
 An open-source library for scalable in-database analytics
 Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ
 Key MADLib architecture principles are:
– Operating on the data locally, in the database.
– Utilizing best-of-breed database engines, but separating the machine learning logic from database-specific implementation details.
– Leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
– Open implementation maintaining active ties to ongoing academic research.
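The first principle, operating on the data where it lives, can be illustrated with a small sketch. It does not use MADLib: SQLite (through Python's standard sqlite3 module) stands in for the database, and the `samples` table and its contents are hypothetical. The database computes the aggregates for a simple linear regression, so only five numbers leave it instead of every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],  # toy data on the line y = 2x + 1
)

# The aggregation runs inside the database engine; only the sums are returned.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM samples"
).fetchone()

# Closed-form least-squares fit from the sufficient statistics.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
conn.close()
```

In an MPP database the same aggregate query would be evaluated in parallel on each segment and the partial sums combined, which is the pattern the shared-nothing principle refers to.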
MADLib Architecture
[Diagram: layered architecture]
– User Interface
– "Driver" Functions (outer loops of iterative algorithms, optimizer invocations)
– High-level Abstraction Layer (iteration controller, …)
– RDBMS Built-in functions
– Functions for Inner Loops (for streaming algorithms)
– Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
– MPP Query Processing (Greenplum, PostgreSQL, Impala …)
– Side labels: SQL, generated from specification; C++
Timeline
H2O
– H2O Project open-sourced
– Latest stable release of H2O, 2.4.3.4, released on May 13, 2014
H2O
 Open source math and prediction engine
 Distributed, in-memory computations
 Creates a cluster of H2O nodes, which are map-only tasks
 Provides a graphical interface to load data, view summaries and train models
 Certified for major Hadoop distributions
H2O on Hadoop Deployment
[Diagram: "hadoop jar …" is run on the Hadoop edge node and goes to the Job Tracker; H2O nodes run as Hadoop map tasks on the Hadoop Task Tracker nodes, which form the H2O cluster; data is read from HDFS on the Hadoop HDFS data nodes]
Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12
H2O Programming Interface
 R-Package “H2O”
– prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
– summary(prostate.data)
– h2o.glm
– h2o.kmeans
Community involvement
                Mahout   Spark-MLLib   MADLib   H2O
# of commits    20       249           0        557
(For the 30 days ending 27 May)
HiveMall
 Machine learning and feature engineering functions through UDFs/UDAFs/UDTFs of Hive
 Supports various algorithms for –
– Classification – Perceptron, Adaptive Regularization of Weight Vectors (AROW)
– Regression - Logistic Regression using Stochastic Gradient Descent
– Recommendation - Minhash (LSH with Jaccard index)
– k-Nearest Neighbor
– Feature engineering
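As an illustration of the MinHash technique named in the list above, the following plain-Python sketch estimates the Jaccard index of two sets from short signatures. It is not HiveMall's UDF: the hash construction (seeded SHA-1), the signature length, and the example item sets are assumptions made for the example, and the input sets are assumed to be non-empty.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree approximates the Jaccard index."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


# Hypothetical item sets for two users.
likes_a = {"item1", "item2", "item3", "item4"}
likes_b = {"item3", "item4", "item5", "item6"}
sig_a = minhash_signature(likes_a)
sig_b = minhash_signature(likes_b)
estimate = estimated_jaccard(sig_a, sig_b)
exact = exact_jaccard(likes_a, likes_b)
```

Signatures have a fixed length regardless of set size, so similar users or items can be found by comparing or bucketing signatures instead of full sets.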
Summary
[Chart: Performance vs. Ease of use]
– Regions: Single Machine; Distributed, Disk-based; Distributed, Memory-based
– Toolkits plotted: R, Octave, Mahout, RHadoop, H2O, MLLib, MADLib + Impala, HiveMall
MLBase - Vision
 Optimizer built on top of Spark & MLLib
 A Declarative Approach
 Abstracts complexities of variable & algorithm selection
– var X = load("als_clinical", 2 to 10)
– var y = load("als_clinical", 1)
– var (fn-model, summary) = doClassify(X, y)
Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
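The declarative idea behind doClassify can be sketched as a function that trains several candidate models, compares their holdout accuracy, and returns the best one with a summary. This is a toy illustration and not MLBase: the function names, the two deliberately simple candidate learners, and the data are all hypothetical.

```python
def train_majority(X, y):
    """Baseline: always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda row: label


def train_nearest_centroid(X, y):
    """Predict the label whose class centroid is closest (squared Euclidean)."""
    centroids = {}
    for label in set(y):
        rows = [r for r, l in zip(X, y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        return min(
            centroids,
            key=lambda l: sum((a - b) ** 2 for a, b in zip(row, centroids[l])),
        )

    return predict


def accuracy(model, X, y):
    return sum(1 for r, l in zip(X, y) if model(r) == l) / len(y)


def do_classify(X, y, candidates, holdout=0.25):
    """Train every candidate, score it on a holdout split, return the best."""
    split = int(len(X) * (1 - holdout))
    X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]
    summary = {}
    best_name, best_model = None, None
    for name, train in candidates.items():
        model = train(X_train, y_train)
        summary[name] = accuracy(model, X_test, y_test)
        if best_name is None or summary[name] > summary[best_name]:
            best_name, best_model = name, model
    return best_model, summary


# Hypothetical one-feature data set with labels 0 and 1.
X = [[0.1], [0.2], [0.9], [0.15], [0.25], [1.0], [0.05], [0.85]]
y = [0, 0, 1, 0, 0, 1, 0, 1]
candidates = {"majority": train_majority, "nearest_centroid": train_nearest_centroid}
fn_model, summary = do_classify(X, y, candidates)
```

The caller states only what is wanted (a classifier for X and y); which algorithm wins is decided by the comparison step, mirroring the train, compare, predict lifecycle.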
[Diagram: Gather Data, Train Model(s), Compare Accuracy, Predict Future]
Summary
[Chart: Performance vs. Ease of use]
– Regions: Single Machine; Distributed, Disk-based; Distributed, Memory-based
– Toolkits plotted: R, Octave, Mahout, RHadoop, H2O, MLBase, MLLib, MADLib + Impala, HiveMall
Yes, We Are Hiring!
Thank You!

Contenu connexe

Tendances

Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Databricks
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesDataWorks Summit
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache SparkDatabricks
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Databricks
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleSpark Summit
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing EcosystemDatabricks
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningDataWorks Summit
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 

Tendances (20)

Spark 101
Spark 101Spark 101
Spark 101
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
ROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlowROCm and Distributed Deep Learning on Spark and TensorFlow
ROCm and Distributed Deep Learning on Spark and TensorFlow
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
Ray and Its Growing Ecosystem
Ray and Its Growing EcosystemRay and Its Growing Ecosystem
Ray and Its Growing Ecosystem
 
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningApache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep Learning
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 

En vedette

Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...
Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...
Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...DATAVERSITY
 
H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 Sri Ambati
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Sri Ambati – CEO, 0xdata at MLconf ATL
Sri Ambati – CEO, 0xdata at MLconf ATLSri Ambati – CEO, 0xdata at MLconf ATL
Sri Ambati – CEO, 0xdata at MLconf ATLMLconf
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Jeffrey Breen
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data EnvironmentsSri Ambati
 
The Importance of MDM - Eternal Management of the Data Mind
The Importance of MDM - Eternal Management of the Data MindThe Importance of MDM - Eternal Management of the Data Mind
The Importance of MDM - Eternal Management of the Data MindDATAVERSITY
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learningjoshwills
 

En vedette (14)

Green datacenters
Green datacentersGreen datacenters
Green datacenters
 
Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...
Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...
Introducing Stig: A New Open Source, Non-relational, Distributed Graph Databa...
 
H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 H2O on Hadoop Dec 12
H2O on Hadoop Dec 12
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?A Continuously Deployed Hadoop Analytics Platform?
A Continuously Deployed Hadoop Analytics Platform?
 
Sri Ambati – CEO, 0xdata at MLconf ATL
Sri Ambati – CEO, 0xdata at MLconf ATLSri Ambati – CEO, 0xdata at MLconf ATL
Sri Ambati – CEO, 0xdata at MLconf ATL
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
 
H2O Big Data Environments
H2O Big Data EnvironmentsH2O Big Data Environments
H2O Big Data Environments
 
Big data with r
Big data with rBig data with r
Big data with r
 
The Importance of MDM - Eternal Management of the Data Mind
The Importance of MDM - Eternal Management of the Data MindThe Importance of MDM - Eternal Management of the Data Mind
The Importance of MDM - Eternal Management of the Data Mind
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
FOG COMPUTING
FOG COMPUTINGFOG COMPUTING
FOG COMPUTING
 

Similaire à Open Source Big Data Analytics Toolkits Comparison

Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkManish Gupta
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 

Similaire à Open Source Big Data Analytics Toolkits Comparison (20)

Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Tutorial5
Tutorial5Tutorial5
Tutorial5
 
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...
 
Data Science
Data ScienceData Science
Data Science
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 

Open Source Big Data Analytics Toolkits Comparison

  • 1. Big Data Analytics – Open Source Toolkits. Prafulla Wani, Snehalata Deorukhkar.
  • 2. Introduction. Talk background: more data beats better algorithms; evaluate "analytics toolkits" that support Hadoop. Speaker backgrounds: data engineers, no PhDs in statistics.
  • 3. Big Data Analytics Toolkits. Evaluation parameters: ease of use (development APIs, number of algorithms supported) and performance (scalable architecture, disk-based or memory-based). Open-source only: RHadoop, Mahout, MADlib, HiveMall, H2O, Spark MLlib.
  • 4. Analytics project lifecycle: Gather Data → Train Model(s) → Compare Accuracy → Predict Future. Training covers Algorithm 1 (logistic regression), Algorithm 2 (SVM), ..., Algorithm N.
  • 5. Analytics (pre-Hadoop era). Chart axes: performance vs. ease of use, with R and Octave in the single-machine region. R: started in 1993; very popular; 5589 packages; written primarily in C and Fortran. Octave: started in 1988; open source, with features comparable to Matlab.
  • 7. Architecture (diagram). RHadoop: R on the client/edge node and on every node of the Hadoop cluster. Mahout: Mahout on the client/edge node only, submitting Map/Reduce jobs to the Hadoop cluster.
  • 9. RHadoop. Provides R packages: rhdfs (read/write from/to HDFS), rhbase (read/write from/to HBase), rmr (express map-reduce programs in R). Does not provide out-of-the-box packages for model training.
  • 10. RHadoop
      logistic.regression = function(input, iterations, dims, alpha) {
        plane = t(rep(0, dims))
        g = function(z) 1/(1 + exp(-z))
        for (i in 1:iterations) {
          gradient =
            values(from.dfs(mapreduce(
              input,
              map = lr.map,
              reduce = lr.reduce,
              combine = T)))
          plane = plane + alpha * gradient }
        plane }
      lr.map =
        function(., M) {
          Y = M[,1]
          X = M[,-1]
          keyval(1, Y * X * g(-Y * as.numeric(X %*% t(plane))))}
      lr.reduce =
        function(k, Z)
          keyval(k, t(as.matrix(apply(Z, 2, sum))))
  • 11. Timeline (diagram, 2006–2014). Hadoop: core Hadoop (HDFS, MapReduce); HBase, ZooKeeper, Pig, Hive…; Avro, Sqoop; Cloudera Impala; YARN. RHadoop: rhdfs, rmr; rmr 2.0; plyrmr. Mahout: started as a subproject of Apache Lucene; 4 releases (0.1 – 0.4); top-level Apache project; recommendation engines as a common case study for Hadoop; 0.8 release; decision to reject new MapReduce implementations; future implementations on top of Apache Spark; integration with the H2O platform.
  • 12. Mahout. Original goal: to implement all 10 algorithms from Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore". Java-based library with MapReduce implementations of common analytics algorithms. Key algorithms: recommendation / collaborative filtering, classification, clustering, frequent pattern growth.
  • 13. Mahout
      Train the model:
      mahout org.apache.mahout.df.mapreduce.BuildForest -Dmapred.max.split.size=1884231 -oob -d train.arff -ds train.info -sl 5 -t 1000 -o crwd_forest
      Test the model:
      mahout org.apache.mahout.df.mapreduce.TestForest -i test.arff -ds train.info -m crwd_forest -a -mr -o crwd_predictions
  • 15. Aging MapReduce. Machine learning algorithms are iterative in nature. Mahout algorithms involve multiple MapReduce stages. Intermediate results are written to HDFS. An MR job is launched for each iteration. IO overhead. Diagram: each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration starts; each of several queries reads the same input from HDFS again to produce its result. Slow due to replication and disk IO.
  • 16. Disk Trend. Disk throughput increasing slowly. Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 17. Memory Trend. RAM throughput increasing exponentially. Reference - http://www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2013-08-30_AMPCamp2013.pdf
  • 18. Timeline (diagram, 2009–2014). Spark: started as a research project at UC Berkeley AMPLab (2009); open-sourced; wins Best Paper Award at USENIX NSDI; accepted into Apache Incubator; Spark 0.8 release introduced MLlib; Spark MLlib 1.0 released.
  • 19. Spark – Data sharing. Resilient Distributed Datasets (RDDs): distributed collections of objects that can be cached in memory across cluster nodes; manipulated through various parallel operations; automatically rebuilt on failures. Diagram: the input is processed once into distributed memory, and iterations and queries then share data through memory, 10-100x faster than network and disk.
  • 20. MLlib. Spark implementation of some common machine learning algorithms and utilities, including classification, regression and clustering. Pre-packaged libraries (in Scala, Java, Python) for analytics algorithms:
      val model = SVMWithSGD.train(training, numIterations)
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
  • 21. SparkR - R interface over Spark. Currently supports data transformation functions such as lapply() on the distributed Spark model. It does not support running out-of-the-box models (e.g. SVMWithSGD.train or KMeans.train). Work is in progress on SparkR-MLlib integration, which may address this limitation.
  • 22. Timeline (diagram, 2009–2014). Spark: started as a research project at UC Berkeley AMPLab (2009); open-sourced; wins Best Paper Award at USENIX NSDI; accepted into Apache Incubator; Spark 0.8 release introduced MLlib; Spark MLlib 1.0 released. MADlib: began as a collaboration between researchers, engineers and data scientists (2010); initial release; MADlib port for Impala.
  • 23. MADlib. An open-source library for scalable in-database analytics. Supports Postgres, Pivotal Greenplum Database, and Pivotal HAWQ. Key MADlib architecture principles: operating on the data locally, in the database; using best-of-breed database engines while separating the machine learning logic from database-specific implementation details; leveraging MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability; an open implementation that maintains active ties to ongoing academic research.
  • 24. MADlib Architecture (diagram). Layers, top to bottom: user interface; "driver" functions (outer loops of iterative algorithms, optimizer invocations); high-level abstraction layer (iteration controller, …); RDBMS built-in functions; functions for inner loops (for streaming algorithms); low-level abstraction layer (matrix operations, C++ to RDBMS type bridge, …); MPP query processing (Greenplum, PostgreSQL, Impala, …). Side labels: "SQL, generated from specification" and "C++".
  • 25. Timeline (diagram, 2009–2014). Spark: started as a research project at UC Berkeley AMPLab (2009); open-sourced; wins Best Paper Award at USENIX NSDI; accepted into Apache Incubator; Spark 0.8 release introduced MLlib; Spark MLlib 1.0 released. MADlib: began as a collaboration between researchers, engineers and data scientists (2010); initial release; MADlib port for Impala. H2O: project open-sourced; latest stable release, H2O 2.4.3.4, released on May 13, 2014.
  • 26. H2O. Open-source math and prediction engine. Distributed, in-memory computations. Creates a cluster of H2O nodes, which are map-only tasks. Provides a graphical interface to load data, view summaries and train models. Certified for major Hadoop distributions.
  • 27. H2O on Hadoop Deployment (diagram). Labels: Hadoop edge node issuing "hadoop jar …"; Job Tracker; Hadoop cluster; Hadoop Task Tracker nodes, each running an H2O map task (together forming the H2O cluster); Hadoop HDFS data nodes. Reference - http://www.slideshare.net/0xdata/h2o-on-hadoop-dec-12
  • 28. H2O Programming Interface. R package "H2O":
      prostate.data = h2o.importURL(localH2O, path = "<path>", key = "<key>")
      summary(prostate.data)
      h2o.glm
      h2o.kmeans
  • 29. Community involvement. Number of commits for the 30 days ending 27 May: Mahout 20; Spark MLlib 249; MADlib 0; H2O 557.
  • 30. HiveMall. Machine learning and feature engineering functions through Hive UDFs/UDAFs/UDTFs. Supports various algorithms for: classification (Perceptron, Adaptive Regularization of Weight Vectors (AROW)); regression (logistic regression using stochastic gradient descent); recommendation (MinHash, i.e. LSH with Jaccard index); k-nearest neighbor; feature engineering.
  • 32. MLBase - Vision. Optimizer built on top of Spark and MLlib. A declarative approach. Abstracts the complexities of variable and algorithm selection:
      var X = load("als_clinical", 2 to 10)
      var Y = load("als_clinical", 1)
      var (fn-model, summary) = doClassify(X, Y)
    Reference - http://www.slideshare.net/chaochen5496/mlllib-sparkmeetup8613finalreduced
    Diagram: the project lifecycle from slide 4 (Gather Data, Train Model(s), Compare Accuracy, Predict Future).
  • 34. Yes, We Are Hiring! Thank You!
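
The RHadoop logistic regression on slide 10 may be easier to follow as plain code. What follows is a minimal Python sketch of the same map/reduce gradient loop, not the rmr implementation. It assumes labels in {-1, +1}, which the term `Y * X * g(-Y * ...)` in the R code suggests, and it uses an in-process list of chunks as a stand-in for HDFS input splits. The names `lr_map`, `lr_reduce` and `chunks` are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def lr_map(chunk, plane):
    """Map step: one gradient contribution per (y, x) row, with y in {-1, +1}."""
    contributions = []
    for y, x in chunk:
        margin = y * sum(w * xi for w, xi in zip(plane, x))
        scale = y * sigmoid(-margin)
        contributions.append([scale * xi for xi in x])
    return contributions


def lr_reduce(rows):
    """Reduce step: column-wise sum of all contributions."""
    return [sum(column) for column in zip(*rows)]


def logistic_regression(chunks, iterations, dims, alpha):
    """Gradient ascent; every iteration is one full map + reduce pass."""
    plane = [0.0] * dims
    for _ in range(iterations):
        # Assumption: chunks play the role of input splits processed by mappers.
        mapped = [row for chunk in chunks for row in lr_map(chunk, plane)]
        gradient = lr_reduce(mapped)
        plane = [w + alpha * g for w, g in zip(plane, gradient)]
    return plane


if __name__ == "__main__":
    # Toy data: first feature is a constant bias term, second separates the classes.
    chunks = [
        [(1, [1.0, 2.0]), (-1, [1.0, -2.0])],
        [(1, [1.0, 1.5]), (-1, [1.0, -1.0])],
    ]
    print(logistic_regression(chunks, iterations=50, dims=2, alpha=0.1))
```

Each pass of the loop corresponds to one MapReduce job in the RHadoop version, which is where the per-iteration job launch and HDFS traffic described on slide 15 come from.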

Editor's notes

  1. Gather data: exploratory analytics, variable selection, dimensionality reduction (PCA, SVD). Train model. Compare model performance: AUC curve etc. Predict the future.
  2. Now let us understand how analytics was done in the pre-Hadoop era. Tools like R and Octave were widely used; they run on a single machine and perform reasonably well on small datasets. Both R and Octave are open-source, high-level interpreted languages. R started in 1993. It is mainly written in C and Fortran. R is very popular among statisticians and data scientists for computational statistics, visualization and data science. It has a vibrant community noted for its active contributions in the form of packages; it has 5589 packages. The Octave language is quite similar to Matlab, so most programs are easily portable. But both languages are limited in the volume of data they can handle and are not suitable for analytics on huge and dynamic data sets. Hadoop is the de facto standard for storing and processing huge volumes of data.
  3. Hadoop was started by Doug Cutting, growing out of the Nutch project, and was developed further at Yahoo. Until 2007 it had two core components: HDFS and MapReduce. In 2008, tools like HBase and ZooKeeper were added to the Hadoop ecosystem. In 2010 Avro and Sqoop were added, and the ecosystem is still growing. Two main tools, RHadoop and Mahout, were developed to leverage the distributed processing of the Hadoop framework. Introduction of YARN… it opens the Hadoop framework to many other frameworks beyond MapReduce. RHadoop? RHIPE?? 2012?
  4. RHadoop is an open-source collection of three R packages that allow users to manage and analyze data with Hadoop from the R environment. R, along with the RHadoop (or RHIPE) packages, needs to be installed on all the nodes, including the edge node; RHadoop/RHIPE then submits the job from the client/edge node. Mahout is a Java library with MapReduce implementations of machine learning algorithms. In the case of Mahout, only the Mahout library needs to be present on the client/edge node; for distributed algorithms, the Mahout job is submitted to the Hadoop cluster as an MR job.
  5. RHadoop? RHIPE?? 2012? plyrmr provides additional data manipulation capabilities.
  6. RHadoop consists of the following packages: rmr2, functions providing Hadoop MapReduce functionality in R; rhdfs, functions providing file management of HDFS from within R; rhbase, functions providing database management for the HBase distributed database from within R.
  7. This is sample code for logistic regression in RHadoop. The logistic regression available in R cannot be reused.
  8. RHadoop? RHIPE?? 2012? We saw adoption of Mahout-based recommendation engines across the industry…
  9. Mahout is a Java library with MR implementations of common machine learning algorithms. It was developed to provide scalable, parallelized machine learning algorithms based on the Hadoop framework. The original aim of the Mahout project was to implement all 10 algorithms discussed in Andrew Ng's paper "Mapreduce …. "
  10. One of the reasons MapReduce is criticized is its restricted programming framework. MapReduce tasks must be written as acyclic dataflow programs: a stateless mapper followed by a stateless reducer, executed by a batch job scheduler. Repeated querying of datasets becomes difficult, so iterative algorithms are hard to write. After each MapReduce iteration, data has to be persisted to disk before the next iteration can proceed.
  11. MADlib grew out of discussions between database engine developers, data scientists, IT architects, and academics interested in new approaches to scalable, sophisticated in-database analytics. Their exchanges were written up in a paper at VLDB 2009 that coined the term "MAD Skills" for data analysis. The MADlib project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at Pivotal (formerly Greenplum); today it also includes researchers from Stanford and the University of Florida. Latest version: 1.5. MADlib’s initial release included: Naive Bayes, k-means, SVM, quantile, linear and logistic regression, and matrix factorization.
  12. Spark started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010. The AMPLab continues to perform research on both improving Spark and on systems built on top of it. After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. A wide range of contributors now develop the project (over 120 developers from 25 companies). MLlib is developed as part of the Apache Spark project, so it gets tested and updated with each Spark release. Spark became a top-level Apache project in Feb 2014. Current version: 1.0. Included: SVM, logistic regression, k-means, ALS. Hadoop YARN support in Spark.
  14. MADlib grew out of discussions between database-engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper at VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida. Latest version: 1.5. Algorithms supported: • Classification: Naive Bayes classification, random forest • Regression: logistic regression, linear regression, multinomial logistic regression, elastic net regularization • Clustering: k-means • Topic modeling: Latent Dirichlet Allocation, etc. • Association rule mining: Apriori