Machine Learning With H2O vs SparkML

by Arnab Biswas
June 2018

H2O
Open Source, In-Memory, Distributed Machine Learning Tool
• Open Source (Apache 2.0)
• In-Memory (Faster)
• Distributed (Big Data/No Sampling)
• Third Version (Stable)
• Easy To Use
• Mission - "How do we get this to work efficiently at big data scale?"
http://docs.h2o.ai/

Multiple Language Support
• R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
• Entire library is embedded inside a jar file
• Written in Java, so it naturally supports Java & Scala
• R, Python, JavaScript, Excel, Tableau and Flow communicate with H2O clusters using REST API calls
• Easy to switch between R/Python/Java/Flow environments

H2O : Advantage
• Uses in-memory compression (2-4 times smaller than gzip)
• Data frames are much smaller in memory and on disk
• Handles billions of data rows in-memory, even with a small cluster
• Data gets distributed across multiple JVMs
• Modeling uses the whole data set (without sampling)
• Faster training/prediction time
• The larger the data set, the greater the performance benefit
• Includes Flow, a web-based GUI (easy to use for non-programmers)
• However, not very impressive!
• Easy to deploy models in production
• Checkpoint
• Continue training an existing model with new data
• Iterative Methods (???)
https://en.wikipedia.org/wiki/H2O_(software)

Clustering (1/2)
• Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster
• Clustering enhances speed of computation
• Hadoop/Spark for clustering is NOT mandatory
• Multi-node cluster with shared memory model
• All computation in-memory
• Each node sees only some rows of data
• No limit to cluster size
• Distributed Data Frames (collection of vectors)
• Columns are distributed (across nodes)
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
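The idea that each node sees only some rows of a column, and that results are combined afterwards, can be sketched in plain Python. This is only an illustration of the pattern, not H2O code: the helper names (split_rows, local_stats, distributed_mean) are hypothetical, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(column, n_nodes):
    """Split one column into n_nodes contiguous row chunks (one per hypothetical node)."""
    size, extra = divmod(len(column), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        end = start + size + (1 if i < extra else 0)
        chunks.append(column[start:end])
        start = end
    return chunks


def local_stats(chunk):
    """Map step: a node summarizes only the rows it holds."""
    return sum(chunk), len(chunk)


def distributed_mean(column, n_nodes=3):
    """Fork the map step over all chunks, then join the partial results."""
    chunks = split_rows(column, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


ages = [23, 35, 41, 29, 52, 38, 47]
print(distributed_mean(ages))
```

No node ever needs the full column. Only the small (sum, count) pairs travel between workers, which is why this style of computation scales with the number of rows.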

Clustering : Limitations (2/2)
• For small data, clustering introduces slowness
• Find the sweet spot between data size & number of nodes
• Each node in the cluster should be the same size (recommended)
• New nodes cannot be added once the cluster starts up
• If any machine dies, the whole cluster must be rebuilt
• If a single node gets removed, the whole cluster becomes unusable
• Nodes should be physically close, to minimize network latency
• Each node must be running the same version of h2o.jar
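Why small data can get slower on a cluster, and why a sweet spot exists, can be shown with a toy cost model. Everything here is an assumption made for illustration: the function names, the 10 seconds per million rows, and the 2 seconds of coordination overhead per extra node are invented constants, not H2O measurements.

```python
def training_time(rows, nodes, sec_per_million_rows=10.0, overhead_per_node=2.0):
    """Toy estimate: compute time shrinks with more nodes, coordination cost grows."""
    compute = (rows / 1_000_000) * sec_per_million_rows / nodes
    coordination = overhead_per_node * (nodes - 1)
    return compute + coordination


def best_node_count(rows, max_nodes=16):
    """Return the node count with the lowest estimated time under the toy model."""
    return min(range(1, max_nodes + 1), key=lambda n: training_time(rows, n))


for rows in (100_000, 10_000_000, 1_000_000_000):
    print(rows, best_node_count(rows))
```

Under these made-up constants the smallest data set is fastest on a single node, while the largest one keeps benefiting from every node available. Real numbers have to come from benchmarking on the actual cluster.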

Productionizing H2O
1. Build a model using Python/R/Java/Flow
2. Download the model (as a POJO or MOJO) as a zip file
3. Download the resulting h2o-genmodel.jar (a library that supports scoring)
4. Invoke the model from a Java class to generate predictions
• Can be easily embedded inside a Java Application
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
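Steps 2 and 3 above go through the same REST interface that the other clients use. The sketch below only builds the download URLs and does not contact a cluster. The endpoint paths and the default port 54321 reflect my reading of the H2O-3 documentation and should be treated as assumptions to verify against the installed version. The function name export_urls and the model id are hypothetical.

```python
from urllib.parse import quote


def export_urls(model_id, host="localhost", port=54321):
    """Build download URLs for a trained model (assumed H2O-3 REST paths)."""
    base = f"http://{host}:{port}"
    safe_id = quote(model_id, safe="")  # model ids may contain characters that need escaping
    return {
        "pojo": f"{base}/3/Models.java/{safe_id}",   # assumed path: Java source of the model
        "mojo": f"{base}/3/Models/{safe_id}/mojo",   # assumed path: MOJO zip archive
        "genmodel_jar": f"{base}/3/h2o-genmodel.jar",  # assumed path: scoring library
    }


for name, url in export_urls("gbm_model").items():
    print(name, url)
```

The downloaded model and h2o-genmodel.jar then go on the classpath of the Java application that performs the scoring (step 4).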

H2O Flow
• Web-based interactive client environment
• Similar to Jupyter Notebook
• Can be used by non-programmers as well (mouse clicks!)
• Combine code execution, text, mathematics, plots & rich media in a single document
• Allows
• Data upload
• View data uploaded directly / through other clients
• Build Model
• View models built directly / through other clients
• Predict
• View predictions generated directly or through other clients
• Check cluster/CPU status

Algorithms
• Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost
• Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
• Miscellaneous: Word2vec
• Common: Quantiles, Early Stopping
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf

H2O Ecosystem
• H2O
• Steam
• Enterprise Steam
• Sparkling Water
• Driverless AI
• H2O4GPU

H2O Steam
• End-to-end platform that streamlines the entire process of building and deploying applications
• Cluster Manager
• Start/stop cluster, allocate memory, start/pause/stop H2O instances
• Secure multi-tenant environment
• Model Manager
• Build, store, manage, compare, promote (historical) models
• Run A/B Test for models
• Scoring Server
• Deploys a model
• Scoring through REST API or In-App

Sparkling Water (1/3)
• Combines the fast, scalable machine learning algorithms of H2O with
the capabilities of Spark
• Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster
• “Certified on Spark”

Sparkling Water – Use Case (2/3)
Use Case 1:
The data pipeline consists of multiple data transformations using the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm.
Use Case 2:
The data pipeline uses H2O’s parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer.
http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html

Sparkling Water – Use Case (3/3)
Use Case 3:
1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time.
2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O / Sparkling Water infrastructure.

Spark (MLlib) vs H2O
• Spark is better at the data preparation and data munging steps
• H2O’s algorithms are faster than those in Spark MLlib
• MLlib underperforms in terms of memory, CPU and time
• H2O provides a web interface (Flow) for data visualization
• H2O and MLlib have overlapping algorithms
• H2O is better for productionization
• The POJO/MOJO approach is friendlier to integrate with Java applications
• Allows evaluation metrics visualization, tracking jobs and job statuses
• H2O provides grid search out of the box (Spark ML offers ParamGridBuilder with CrossValidator)
• Spark has better community support
• H2O has enterprise support
See the References slide

Case Study I : Migration From Spark MLlib To H2O (1/3)
• Needs of the “iyzico” fraud detection product
• Continuous Delivery: Models need to be continuously deployed to production
• Real-Time Fraud Detection: Prediction time of max 100 ms
• High Availability & Scalability
• Low Learning Curve: Stack should be usable by data scientists & SW developers
• Open Source
• Fast: Fast prototyping & deploying
• On Premise
• Initial Choice
• prediction.io + Spark ML
Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2

Case Study I (2/3)
• Benchmarking Criteria : TensorFlow, Spark ML, H2O (Winner)
• Simplicity of deploying an existing model (local env) to production
• POJO-based models. Easy to deploy in a Java environment
• Release management and DevOps cycle are easy
• Hardware requirements for training
• Memory needed for training with 1 million transactions & 100 features with RF (64 trees)
Spark ML : 16 GB RAM, TensorFlow : 10 GB, H2O : 2 GB
• Decision Trees and Bayesian Models
• Python, R, SQL Support
• Experimentation on local environment
• Experiments can be done with Python, R
• Prediction time (ms)

Case Study I (3/3)
• Feature engineering and the data pipeline were already in Java 8, so no migration was needed
• Migration from Spark ML + prediction.io to H2O
• 60 GB RAM saved (Spark ML & prediction.io needed it for model training)
• 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
• Response time decreased by a factor of about 8.5 (from 300 milliseconds to 35 milliseconds)

Case Study II : Booking.com (1/n)
Source: https://www.youtube.com/watch?v=_CBKECLkIt8

Spark/Sparkling Water – Do I need it?

Benchmarking ML Libraries
https://github.com/szilard/benchm-ml
• Training data
• Number of rows varied as 10K, 100K, 1M, 10M
• ~1K features
• Binary Classification Problem
• Hardware (Single Instance)
• Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
• If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
• Observations
• Training time
• Maximum memory usage during training
• AUC (predictive accuracy)
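AUC, the accuracy measure listed above, is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that definition follows. The function name auc and the sample data are hypothetical, and the pairwise loop is chosen for clarity rather than speed (it is not how a benchmark on 10M rows would compute it).

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.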

Random Forest
H2O
• Fast, uses all cores, more accurate
• Memory efficient
• 1M : 5 GB, 10M : 25 GB
Spark MLlib
• Slower
• Larger memory footprint
• Runs OOM at n = 1M
• With 250 GB, finishes for 1M, but crashes for 10M
• AUC broke at 1M
• Spark 2.0 is even slower
XGBoost
• Fast
• High accuracy
• Memory efficient
• 1M : 2 GB, 10M : 9 GB

Gradient Boosting Machines
learn_rate=0.01, max_depth=16, n_trees=1000
learn_rate=0.1, max_depth=6, n_trees=300
• Memory footprint of GBMs is smaller than for RF
• Bottleneck is mainly training time
• Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees
• H2O and xgboost are the fastest

Performance of various GBM implementations
For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application.
https://github.com/szilard/GBM-perf

Do I need Big Data?
• Single Instance vs Cluster
• Sending data over a network vs using shared memory
• Several distributed systems have significant computation & memory overhead
• Map-reduce style communication pattern : Not the best fit for many ML algorithms
Benchmarking For Bigger Data

Netflix VectorFlow
• Minimalist library
• Specifically optimized for training on sparse data
• Single-machine, multi-core environment

Benchmarking For Bigger Data
• Not enough clarity about the hardware used
• For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours
Single Node
Multiple Nodes

Security In H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html

Disadvantages
• No High Availability (HA) for Clusters
• Doesn’t work well on sparse data
• GPU Support is in alpha stage
• There is No SVM
• Cluster support helps with Big Data
• Small data needs a single, fast machine with lots of cores

References
• https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How
• https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool
• https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually
• https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ

H2O Architecture
https://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf
http://gotocon.com/dl/goto-berlin-2014/slides/PetrMaj_and_TomasNykodym_FastAnalyticsOnBigData.pdf

H2O Frame Distributed Fork & Join

Do I need Spark to run H2O?
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638

H2O : POJO vs MOJO
- POJOs are not supported for source files larger than 1 GB
- MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models.
- POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
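The POJO restrictions listed above can be written down as a small check. This is a sketch of the slide's rules only, not an H2O API. The function name, the algorithm labels, and the reading of "1 GB" as 1024**3 bytes are assumptions.

```python
# Algorithms the slide lists as having no POJO support (labels are hypothetical spellings).
POJO_UNSUPPORTED = {"xgboost", "glrm", "stackedensemble"}
# Assumption: the 1 GB source-file limit is interpreted as 1024**3 bytes.
POJO_MAX_SOURCE_BYTES = 1024 ** 3


def pojo_supported(algo, pojo_source_bytes):
    """Return True if, per the rules above, the model can be exported as a POJO."""
    if algo.lower() in POJO_UNSUPPORTED:
        return False
    return pojo_source_bytes <= POJO_MAX_SOURCE_BYTES


print(pojo_supported("gbm", 50 * 1024 ** 2))      # small GBM source file
print(pojo_supported("xgboost", 50 * 1024 ** 2))  # algorithm excluded from POJO export
print(pojo_supported("drf", 2 * 1024 ** 3))       # source file above the size limit
```

When the check fails, MOJO is the export format to fall back on for the algorithms in the MOJO list above.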

Spark ML vs Spark MLlib
• Spark MLlib vs Spark ML :
• https://spark.apache.org/docs/latest/ml-guide.html
