Machine Learning With H2O vs SparkML
by Arnab Biswas
June 2018
H2O
Open Source, In-Memory, Distributed Machine Learning Tool
• Open Source (Apache 2.0)
• In-Memory (Faster)
• Distributed (Big Data/No Sampling)
• Third Version (Stable)
• Easy To Use
• Mission: "How do we get this to work efficiently at big data scale?"
http://docs.h2o.ai/
• R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow)
• Entire library is embedded inside a jar file
• Written in Java, so it natively supports Java & Scala
• R, Python, JavaScript, Excel, Tableau and Flow communicate with H2O clusters using REST API calls
• Easy to switch between R/Python/Java/Flow environments
Multiple Language Support
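Because every client talks to the cluster over REST, a request is just an HTTP URL on an H2O node. The sketch below only builds such a URL and sends nothing. The host, the port 54321 (H2O's default) and the `3/Cloud` cluster-status endpoint are assumptions not taken from the slides, and `h2o_rest_url` is a hypothetical helper name.

```python
from urllib.parse import urlunsplit


def h2o_rest_url(host: str, port: int, endpoint: str) -> str:
    """Build the HTTP URL for a REST endpoint on an H2O node (hypothetical helper)."""
    path = "/" + endpoint.lstrip("/")
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


# Assumed values: a local node on H2O's default port, cluster-status endpoint.
cluster_status_url = h2o_rest_url("localhost", 54321, "3/Cloud")
print(cluster_status_url)
```

Any HTTP client (a browser, curl, or a language binding) could issue a GET against that URL, which is why switching between R, Python and Flow is cheap.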
• Uses in-memory compression (2-4 times smaller than gzip)
• Data frames are much smaller in memory and on disk
• Handles billions of data rows in-memory, even with a small cluster
• Data gets distributed across multiple JVMs
• Modeling uses the whole data set (without sampling)
• Faster training/prediction time
• The larger the data set, the better the performance
• Includes Flow, a web-based GUI (easy to use for non-programmers)
• However, not very impressive!
• Easy to deploy models in production
• Checkpoint
• Continue training an existing model with new data
• Iterative Methods (???)
H2O : Advantage
https://en.wikipedia.org/wiki/H2O_(software)
Clustering (1/2)
• Can be deployed on a single node / multi-node cluster / Hadoop cluster
/ Apache Spark cluster
• Clustering enhances speed of computation
• Hadoop/Spark for clustering is NOT mandatory
• Multi-node cluster with shared memory model
• All computation in-memory
• Each node sees only some rows of data
• No limit to cluster size
• Distributed Data Frames (collection of vectors)
• Columns are distributed (across nodes)
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
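The idea that each node sees only some rows, with results combined across the cluster, can be illustrated with a toy map/reduce over one column. This is a minimal single-process sketch in plain Python, not H2O's implementation: the "nodes" are just list chunks, and `split_rows`, `local_summary` and `combine` are hypothetical names.

```python
from typing import List, Tuple


def split_rows(rows: List[float], n_nodes: int) -> List[List[float]]:
    """Assign contiguous chunks of rows to nodes, as evenly as possible."""
    base, extra = divmod(len(rows), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def local_summary(chunk: List[float]) -> Tuple[float, int]:
    """Map step: each node summarizes only the rows it holds."""
    return sum(chunk), len(chunk)


def combine(partials: List[Tuple[float, int]]) -> float:
    """Reduce step: merge the per-node summaries into a global mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


column = [float(v) for v in range(1, 11)]   # ten rows of one column
chunks = split_rows(column, n_nodes=3)      # rows spread over three "nodes"
partials = [local_summary(c) for c in chunks]
mean = combine(partials)
print(mean)
```

Only the small per-node summaries cross the "network" here, never the raw rows, which is the property that lets training use the whole data set without sampling.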
Clustering : Limitations (2/2)
• For small data, clustering introduces slowness
• Find the sweet spot between data size & number of nodes
• Each node on the cluster must be of same size (Recommended)
• New nodes cannot be added once the cluster starts up
• If any machine dies, the whole cluster must be rebuilt
• If a single node gets removed, whole cluster becomes unusable
• Nodes should be physically close, to minimize network latency
• Each node must be running the same version of h2o.jar
Productionizing H2O
1. Build a model using Python/R/Java/Flow
2. Download the model (as a POJO or MOJO) as a zip file
3. Download the resulting h2o-genmodel.jar (a library that supports scoring)
4. Invoke the model from a Java class to generate predictions
• Can be easily embedded inside a Java application
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
H2O Flow
• Web-based interactive client environment
• Similar to Jupyter Notebook
• Can be used by non-programmers as well (mouse clicks!)
• Combines code execution, text, mathematics, plots & rich media in a single document
• Allows
• Data upload
• View data uploaded directly / through other clients
• Build Model
• View models built directly / through other clients
• Predict
• View predictions generated directly or through other clients
• Check cluster/CPU status
Algorithms
• Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost
• Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA)
• Miscellaneous: Word2vec
• Common: Quantiles, Early Stopping
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
H2O Ecosystem
• H2O
• Steam
• Enterprise Steam
• Sparkling Water
• Driverless AI
• H2O4GPU
H2O Steam
• End-to-end platform that streamlines the entire process of building and deploying applications
• Cluster Manager
• Start/stop cluster, allocate memory, start/pause/stop H2O instances
• Secure multi-tenant environment
• Model Manager
• Build, store, manage, compare, promote (historical) models
• Run A/B Test for models
• Scoring Server
• Deploys a model
• Scoring through REST API or In-App
Sparkling Water (1/3)
• Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark
• Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster
• “Certified on Spark”
Sparkling Water – Use Case (2/3)
Use Case 1:
The data pipeline consists of multiple data transformations performed with the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm.
Use Case 2:
The data pipeline uses H2O's parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer.
http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
Sparkling Water – Use Case (3/3)
Use Case 3:
1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time.
2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O / Sparkling Water infrastructure.
Spark (MLlib) vs H2O
• Spark is better at the data preparation and data munging steps
• H2O algorithms are faster than their Spark MLlib counterparts
• MLlib underperforms in terms of memory, CPU and time
• H2O provides a web interface (Flow) for data visualization
• H2O and MLlib have overlapping algorithms
• H2O is better for productionization
• POJO/MOJO approach is friendlier to integrate with Java applications
• Allows evaluation metrics visualization, tracking jobs and job statuses
• H2O supports grid search (Spark ML also provides it through ParamGridBuilder with CrossValidator)
• Spark has better community support
• H2O has enterprise support
Check the slide on References
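Grid search, mentioned in the comparison above, amounts to enumerating every combination of candidate hyperparameter values and keeping the best-scoring one. The sketch below is plain Python and calls neither H2O nor Spark. The candidate values are borrowed from the GBM settings shown later in this deck, and `toy_score` is a made-up stand-in for a real validation metric, so its "winner" means nothing.

```python
from itertools import product
from typing import Dict, Iterator, List


def grid(hyper_params: Dict[str, List]) -> Iterator[Dict]:
    """Yield one dict per combination in the Cartesian product of the values."""
    names = sorted(hyper_params)
    for values in product(*(hyper_params[n] for n in names)):
        yield dict(zip(names, values))


def toy_score(params: Dict) -> float:
    """Hypothetical placeholder; a real search would train and validate a model."""
    return -abs(params["learn_rate"] - 0.1) - abs(params["max_depth"] - 6) / 100


hyper_params = {
    "learn_rate": [0.01, 0.1],
    "max_depth": [6, 16],
    "n_trees": [300, 1000],
}
candidates = list(grid(hyper_params))
best = max(candidates, key=toy_score)
print(len(candidates), best)
```

The number of candidates is the product of the list lengths (2 x 2 x 2 = 8 here), which is why grids grow quickly and why training speed per model matters.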
• Need for the “iyzico” fraud detection product
• Continuous Delivery: Models need to be continuously deployed to production
• Real-Time Fraud Detection: Prediction time of max 100 ms
• High Availability & Scalability
• Low Learning Curve: Stack should be usable by data scientists & SW developers
• Open Source
• Fast: Fast prototyping & deploying
• On Premise
• Initial Choice
• prediction.io + Spark ML
Case Study I : Migration From Spark MLlib To H2O (1/3)
Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
Case Study I (2/3)
• Benchmarking Criteria : TensorFlow, Spark ML, H2O (Winner)
• Simplicity of deploying an existing model (local env) to production
• POJO based models. Easy to deploy in Java environment
• Release management and DevOps cycle are easy
• Hardware requirements for training
• Memory needed for training with 1 million transactions & 100 features with RF (64 trees): Spark ML : 16 GB RAM, TensorFlow : 10 GB, H2O : 2 GB
• Decision Trees and Bayesian Models
• Python, R, SQL Support
• Experimentation on local environment
• Experiments can be done with Python, R
• Prediction time (ms)
• Feature Engineering, Data Pipeline was in Java 8. No need for migration
• Migration from Spark ML + prediction.io to H2O
• 60 GB RAM saved (Spark ML & prediction.io needed it for model training)
• 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time)
• Response time decreased about 8.6 times (300 milliseconds to 35 milliseconds)
Case Study I (3/3)
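The response-time figures from the migration can be checked with a few lines of arithmetic. The 300 ms and 35 ms values and the 100 ms prediction-time requirement come from the case-study slides; the variable names are illustrative.

```python
before_ms = 300.0   # response time with Spark ML + prediction.io
after_ms = 35.0     # response time after migrating to H2O
budget_ms = 100.0   # real-time fraud detection requirement

speedup = before_ms / after_ms
reduction_pct = (before_ms - after_ms) / before_ms * 100
within_budget_before = before_ms <= budget_ms
within_budget_after = after_ms <= budget_ms

print(f"speedup: {speedup:.1f}x, latency reduction: {reduction_pct:.0f}%")
print(f"meets 100 ms budget: before={within_budget_before}, after={within_budget_after}")
```

The ratio works out to roughly 8.6x, and only the post-migration latency fits inside the 100 ms budget.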
Case Study II : Booking.com (1/3)
Source: https://www.youtube.com/watch?v=_CBKECLkIt8
Case Study II : Booking.com (2/3)
Case Study II : Booking.com (3/3)
Spark/Sparkling Water – Do I need it?
Benchmarking ML Libraries
https://github.com/szilard/benchm-ml
• Training data
• Number of rows varied as 10K, 100K, 1M, 10M
• ~1K features
• Binary classification problem
• Hardware (Single Instance)
• Amazon EC2 c3.8xlarge (32 cores, 60 GB RAM)
• If OOM, r3.8xlarge instance (32 cores, 250 GB RAM)
• Observations
• Training time
• Maximum memory usage during training
• AUC (predictive accuracy)
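AUC, the accuracy measure listed above, is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is fine for illustration but quadratic in the number of rows. It is not the benchmark's own code, and the example labels and scores are made up.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC for binary labels (1 = positive, 0 = negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so AUC can be compared across libraries regardless of how each one calibrates its scores.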
Random Forest
H2O
• Fast, uses all cores, more accurate
• Memory efficient
• 1M rows : 5 GB, 10M rows : 25 GB
Spark MLlib
• Slower
• Larger memory footprint
• Runs out of memory (OOM) at n = 1M
• With 250 GB, finishes for 1M, but crashes for 10M
• AUC broke at 1M
• Spark 2.0 is even slower
XGBoost
• Fast
• High accuracy
• Memory efficient
• 1M rows : 2 GB, 10M rows : 9 GB
Gradient Boosting Machines
learn_rate=0.01, max_depth=16, n_trees=1000
learn_rate=0.1, max_depth=6, n_trees=300
• Memory footprint of GBMs is smaller than for RF
• Bottleneck is mainly training time
• Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees
• H2O and xgboost are the fastest
Performance of various GBM implementations
For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application.
https://github.com/szilard/GBM-perf
Do I need Big Data?
• Single Instance vs Cluster
• Sending data over a network vs using shared memory
• Several distributed systems have significant computation & memory overhead
• Map-reduce style communication pattern : Not the best fit for many ML algorithms
Benchmarking For Bigger Data
Netflix VectorFlow
• Minimalist library
• Specifically optimized for training on sparse data
• Single-machine, multi-core environment
Benchmarking For Bigger Data
• Not enough clarity about the hardware used
• For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours
Single Node
Multiple Nodes
Security In H2O
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/security.html
Disadvantages
• No High Availability (HA) for Clusters
• Doesn’t work well on sparse data
• GPU Support is in alpha stage
• There is No SVM
• Cluster support helps Big Data
• Small data is better served by a single fast machine with many cores
References
• https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How
• https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool
• https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually
• https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
Questions
H2O Architecture
https://www.stat.berkeley.edu/~ledell/docs/h2o_hpccon_oct2015.pdf
http://gotocon.com/dl/goto-berlin-2014/slides/PetrMaj_and_TomasNykodym_FastAnalyticsOnBigData.pdf
H2O Frame Distributed Fork & Join
Do I need Spark to run H2O?
- https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library
- https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
H2O : POJO vs MOJO
- POJOs are not supported for source files larger than 1G
- MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models.
- POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models.
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
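The POJO/MOJO rules above can be encoded as a small lookup to decide which export formats are open for a given model. This is a sketch of the slide's rules only, not an H2O API: `export_formats` is a hypothetical helper, "1G" is assumed to mean 1 GiB, and any algorithm the slide does not name as POJO-unsupported is assumed to allow POJO export, which the slide does not state explicitly.

```python
from typing import List

# Lists copied from the slide above.
MOJO_SUPPORTED = {
    "AutoML", "Deep Learning", "DRF", "GBM", "GLM", "GLRM", "K-Means",
    "Stacked Ensembles", "SVM", "Word2vec", "XGBoost",
}
POJO_UNSUPPORTED = {"XGBoost", "GLRM", "Stacked Ensembles"}
POJO_MAX_SOURCE_BYTES = 1024 ** 3  # assumption: "1G" read as 1 GiB


def export_formats(algo: str, pojo_source_bytes: int) -> List[str]:
    """Return the export formats the slide's rules allow (hypothetical helper)."""
    formats = []
    if algo in MOJO_SUPPORTED:
        formats.append("MOJO")
    # Assumption: algorithms not listed as unsupported can be exported as POJO.
    if algo not in POJO_UNSUPPORTED and pojo_source_bytes <= POJO_MAX_SOURCE_BYTES:
        formats.append("POJO")
    return formats


print(export_formats("GBM", 10 * 1024 ** 2))      # small GBM source file
print(export_formats("XGBoost", 10 * 1024 ** 2))  # XGBoost has no POJO export
print(export_formats("GBM", 2 * 1024 ** 3))       # source file over the size limit
```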
Spark ML vs Spark MLlib
• Spark MLlib vs Spark ML :
• https://spark.apache.org/docs/latest/ml-guide.html
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Machine Learning With H2O vs SparkML

  • 1. Machine Learning by H2O vs SparkML Arnab Biswas June 2018
  • 2. H2O Open Source, In-Memory, Distributed Machine Learning Tool • Open Source (Apache 2.0) • In-Memory (Faster) • Distributed (Big Data/No Sampling) • Third Version (Stable) • Easy To Use • Mission - "How do we get this to work efficiently at big data scale?" http://docs.h2o.ai/
  • 3. Multiple Language Support • R, Python, Scala, Java, JSON, JavaScript, Web Interface (Flow) • Entire library is embedded inside a jar file • Written in Java, naturally supports Java & Scala • R, Python, JavaScript, Excel, Tableau, and Flow communicate with H2O clusters using REST API calls • Easy to switch between R/Python/Java/Flow environments
  • 4. H2O : Advantage • Uses in-memory compression (2-4 times smaller than gzip) • Data frames are much smaller in memory and on disk • Handles billions of data rows in-memory, even with a small cluster • Data gets distributed across multiple JVMs • Modeling using the whole set of data (without sampling) • Faster training/prediction time • The larger the data set, the better the performance • Includes Flow, a web-based GUI (easy to use for non-programmers) • However, not very impressive! • Easy to deploy models in production • Checkpoint • Continue training an existing model with new data • Iterative Methods (???) https://en.wikipedia.org/wiki/H2O_(software)
  • 5. Clustering (1/2) • Can be deployed on a single node / multi-node cluster / Hadoop cluster / Apache Spark cluster • Clustering enhances speed of computation • Hadoop/Spark for clustering is NOT mandatory • Multi-node cluster with shared memory model • All computation in-memory • Each node sees only some rows of data • No limit to cluster size • Distributed Data Frames (collection of vectors) • Columns are distributed (across nodes) - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  • 6. Clustering : Limitations (2/2) • For small data, clustering introduces slowness • Find the sweet spot between data size & number of nodes • Each node in the cluster should be the same size (recommended) • New nodes cannot be added once the cluster starts up • If any machine dies, the whole cluster must be rebuilt • If a single node gets removed, the whole cluster becomes unusable • Nodes should be physically close, to minimize network latency • Each node must be running the same version of h2o.jar
  • 7. Productionizing H2O 1. Build a model using Python/R/Java/Flow 2. Download the model (as a POJO or MOJO) as a zip file. 3. Download the resulting h2o-genmodel.jar (a library supporting scoring) 4. Invoke the model from a Java class to generate predictions • Can be easily embedded inside a Java application http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
  • 8. H2O Flow • Web-based interactive client environment • Similar to Jupyter Notebook • Can be used by non-programmers as well (mouse clicks!) • Combine code execution, text, mathematics, plots & rich media in a single document • Allows • Data upload • View data uploaded directly / through other clients • Build Model • View models built directly / through other clients • Predict • View predictions generated directly or through other clients • Check cluster/CPU status
  • 9. Algorithms • Supervised: Cox Proportional Hazards, Deep Learning, Distributed Random Forest, Generalized Linear Model, Gradient Boosting Machine, Naïve Bayes Classifier, Stacked Ensembles, XGBoost • Unsupervised: Aggregator, Generalized Low Rank Models (GLRM), K-Means Clustering, Principal Component Analysis (PCA) • Miscellaneous: Word2vec • Common: Quantiles, Early Stopping https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/flow/images/H2O-Algorithms-Road-Map.pdf
  • 10. H2O Ecosystem • H2O • Steam • Enterprise Steam • Sparkling Water • Driverless AI • H2O4GPU
  • 11. H2O Steam • End-to-end platform that streamlines the entire process of building and deploying applications • Cluster Manager • Start/stop cluster, allocate memory, start/pause/stop H2O instances • Secure multi-tenant environment • Model Manager • Build, store, manage, compare, promote (historical) models • Run A/B Test for models • Scoring Server • Deploys a model • Scoring through REST API or In-App
  • 12. Sparkling Water (1/3) • Combines the fast, scalable machine learning algorithms of H2O with the capabilities of Spark • Provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster • “Certified on Spark”
  • 13. Sparkling Water – Use Case (2/3) Use Case 1: Data pipeline consists of multiple data transformations with the help of the Spark API. The final form of the data is transformed into an H2O frame and passed to an H2O algorithm. Use Case 2: Data pipeline consists of H2O’s parallel data load and parse capabilities, while the Spark API is used as another provider of data transformations. H2O can also be used as an in-place data transformer. http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/index.html
  • 14. Sparkling Water – Use Case (3/3) Use Case 3: 1. The off-line training pipeline, invoked regularly, uses the Spark & H2O APIs and provides an H2O model as output. The model is exported in a form independent of the H2O run-time. 2. The streaming data pipeline (using Spark Streaming) uses the model trained in the first pipeline to score the incoming data. Since the model is exported with no run-time dependency on H2O, the streaming pipeline can be lightweight and independent of the H2O/Sparkling Water infrastructure.
  • 15. Spark (MLlib) vs H2O • Spark is better at the data preparation and data munging steps • H2O is faster than the algorithms in Spark MLlib • MLlib underperforms in terms of memory, CPU and time • H2O provides a web interface (Flow) for data visualization • H2O and MLlib have overlapping algorithms • H2O is better for productionization • POJO/MOJO approach is more friendly to integrate with Java applications • Allows visualization of evaluation metrics and tracking of jobs and job statuses • H2O allows grid search (Spark doesn’t?) • Spark has better community support • H2O has enterprise support Check the slide on References
  • 16. Case Study I : Migration From Spark MLlib To H2O (1/3) • Needs of the “iyzico” fraud detection product • Continuous Delivery: Models need to be continuously deployed on production • Real-Time Fraud Detection: Prediction time of max 100 ms • High Availability & Scalability • Low Learning Curve: Stack should be usable by data scientists & software developers • Open Source • Fast: Fast prototyping & deploying • On Premise • Initial Choice • prediction.io + Spark ML Source: https://iyzico.engineering/spark-ml-to-h2o-migration-for-machine-learning-in-iyzico-dcba86b8eab2
  • 17. Case Study I (2/3) • Benchmarking Criteria : TensorFlow, SparkML, H2O (Winner) • Simplicity of deploying an existing model (local env) to production • POJO based models. Easy to deploy in Java environment • Release management and DevOps cycle are easy • Hardware requirements for training • Memory need for training with 1 million transactions & 100 features with RF (64 trees): Spark ML : 16 GB RAM, TensorFlow : 10 GB, H2O : 2 GB • Decision Trees and Bayesian Models • Python, R, SQL Support • Experimentation on local environment • Experiments can be done with Python, R • Prediction time (ms)
  • 18. Case Study I (3/3) • Feature engineering and the data pipeline were in Java 8. No need for migration • Migration from Spark ML + prediction.io to H2O • 60 GB RAM saved (Spark ML & prediction.io needed it for model training) • 12 cores saved (Spark ML & prediction.io needed these cores to reduce model training time) • Response time decreased almost 9 times (300 milliseconds to 35 milliseconds)
  • 19. Case Study II : Booking.com (1/n) Source: https://www.youtube.com/watch?v=_CBKECLkIt8
  • 20. Case Study II : Booking.com (2/n)
  • 21. Case Study II : Booking.com (3/3)
  • 22. Spark/Sparkling Water – Do I need it?
  • 23. Benchmarking ML Libraries https://github.com/szilard/benchm-ml • Training data • Number of rows varied as 10K, 100K, 1M, 10M • ~1K features • Binary Classification Problem • Hardware (Single Instance) • Amazon EC2 c3.8xlarge (32 cores, 60GB RAM) • If OOM, r3.8xlarge instance (32 cores, 250GB RAM) • Observations • Training time • Maximum memory usage during training • AUC (predictive accuracy)
  • 24. Random Forest H2O • Fast, uses all cores, more accurate • Memory Efficient • 1M : 5G, 10M : 25G Spark MLlib • Slower • Larger memory footprint • Runs OOM at n = 1M • With 250G, finishes for 1M, but crashes for 10M • AUC broke at 1M • Spark 2.0 is even slower XGBoost • Fast • High accuracy • Memory efficient • 1M : 2G, 10M : 9G
  • 25. Gradient Boosting Machines (learn_rate=0.01, max_depth=16, n_trees=1000) (learn_rate=0.1, max_depth=6, n_trees=300) • Memory footprint of GBMs smaller than for RF • Bottleneck is mainly training time • Spark is inefficient in memory (especially for deeper trees) & crashes. Works for shallow trees • H2O and xgboost are the fastest
  • 26. Performance of various GBM implementations For deployment, H2O has the best ways to deploy as a real-time (fast scoring) application. https://github.com/szilard/GBM-perf
  • 27. Do I need Big Data? • Single Instance vs Cluster • Sending data over a network vs using shared memory • Several distributed systems have significant computation & memory overhead • Map-reduce style communication pattern : Not best fit for many ML algorithms Benchmarking For Bigger Data
  • 28. Netflix VectorFlow • Minimalist library • Specifically optimized for training sparse data • Single-machine, multi-core environment
  • 29. Benchmarking For Bigger Data • Not enough clarity about the hardware used • For tree-based ensembles (RF, GBM), H2O and xgboost can train on 100M records on a single server, though the training times become several hours Single Node Multiple Nodes
  • 31. Disadvantages • No High Availability (HA) for Clusters • Doesn’t work well on sparse data • GPU Support is in alpha stage • There is no SVM • Cluster support helps Big Data • Small data needs a single, fast machine with a lot of cores
  • 32. References • https://www.quora.com/Does-H2O-software-allow-you-to-perform-faster-machine-learning-if-it-is-not-used-on-a-cluster-How • https://www.quora.com/Why-would-one-use-H2O-ai-over-scikit-learn-machine-learning-tool • https://www.quora.com/What-are-the-risks-of-using-H2O-ai-framework-When-would-my-company-need-to-pay-anything-to-H2O-ai-Is-the-framework-buggy-somehow-or-is-it-hard-to-install-configure-extend-Do-I-need-to-pay-for-consultancy-eventually • https://groups.google.com/forum/#!msg/h2ostream/m2HIfUxfw-k/X8G2-OMQAwAJ
  • 35. H2O Frame Distributed Fork & Join
  • 36. Do I need Spark to run H2O? - https://stackoverflow.com/questions/47894205/which-the-benefits-of-sparking-water-over-h20-machine-learning-library - https://stackoverflow.com/questions/48697292/is-there-any-performance-difference-for-ml-training-between-h2o-multi-node-clust/48697638#48697638
  • 37. H2O : POJO vs MOJO - POJOs are not supported for source files larger than 1G - MOJOs are supported for AutoML, Deep Learning, DRF, GBM, GLM, GLRM, K-Means, Stacked Ensembles, SVM, Word2vec, and XGBoost models. - POJOs are also not supported for XGBoost, GLRM, or Stacked Ensembles models. http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
  • 38. Spark ML vs Spark MLlib • Spark MLlib vs Spark ML : • https://spark.apache.org/docs/latest/ml-guide.html
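
The benchmark slides above report AUC as the measure of predictive accuracy for the binary classification problem. As a rough illustration of what that number means, here is a minimal pure-Python sketch of a rank-based AUC. It is not the code used in the benchmark; the function name `auc` and the sample labels and scores are hypothetical and chosen only for illustration.

```python
def auc(labels, scores):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative).

    Equals the probability that a randomly chosen positive gets a higher
    score than a randomly chosen negative, with ties counted as one half.
    Hypothetical helper for illustration only.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items occupy 1-based ranks i+1 .. j and share the average rank.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Made-up scores: one positive is ranked below one negative.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a library whose AUC "breaks" at larger data sizes is a concern independent of its speed or memory use.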