Examine the unique features of the MapR Converged Data Platform and how they can support production-grade enterprise machine learning - Ends with a live demo using H2O - Presented at Hadoop Summit Tokyo 2016
40. Agenda
• Why tooling matters in Machine Learning
• What is H2O and Sparkling Water
• Why MapR
• Demo
41. ML project problems
• Multiple data sources
• Different formats
• Large volumes of data to be read
• System bootstrap time
• Collaboration between data scientists
• Comparing models
• Deployment of the model
• Versioning
• Too many moving parts!
• etc.
42. Successful ML platform
• Fast ingestion and manipulation of versatile data
• Intuitive modeling UI/API
• Easy model validation, visualisation and comparison
• Easy model deployment w/ versioning for fast predictions
43. What is H2O?
• Open source platform
• Exposes math and predictive algorithms
• GLM, Random Forest, GBM, Deep Learning etc.
• Written in high performance Java - native Java API
• Supports multiple file formats and data sources
• ETL capabilities
• Highly parallel and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Runs on top of most major Hadoop distributions
[Diagram labels: ML platform, Ingestion platform, Big data platform]
44. FlowUI
• Notebook style open source interface for H2O
• Code execution, mathematics, plots, and rich media
45. Why H2O?
• Fast ingestion and manipulation of versatile data
  • Blazing fast data parsing, supports multiple formats and data sources
• Intuitive modeling UI/API
  • FlowUI, R/Python/REST APIs
• Easy model validation, visualisation and comparison
  • Cross-validation, FlowUI graphs, comparison via Steam
• Easy model deployment w/ versioning for fast predictions
  • Model export as POJO, deploy as service via Steam
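The cross-validation and model comparison step on this slide can be illustrated without H2O. The following is a minimal pure-Python sketch, not H2O's API: the helper names (`k_fold_indices`, `cross_validate`), the two toy models and the synthetic data are all assumptions made for illustration.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, predict, rows, labels, k=5, seed=0):
    """Return the mean held-out accuracy of a model over k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Toy model 1 (hypothetical): always predict the most common training label.
def fit_majority(rows, labels):
    return max(set(labels), key=labels.count)

def predict_majority(model, row):
    return model

# Toy model 2 (hypothetical): predict 1 when the feature exceeds the training mean.
def fit_threshold(rows, labels):
    return sum(r[0] for r in rows) / len(rows)

def predict_threshold(model, row):
    return 1 if row[0] > model else 0

# Synthetic data: the label is 1 when the single feature is 10 or more.
rows = [[float(i)] for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]

acc_majority = cross_validate(fit_majority, predict_majority, rows, labels, k=5)
acc_threshold = cross_validate(fit_threshold, predict_threshold, rows, labels, k=5)
print(f"majority: {acc_majority:.2f}  threshold: {acc_threshold:.2f}")
```

Both models are scored on the same folds (same seed), which is what makes the two accuracies comparable.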
46. What is Sparkling Water?
• Framework integrating Spark and H2O
• H2O instances on Spark executors
• Allows calling Spark and H2O methods together
47. Why MapR?
• H2O + MapR-FS = fast data ingestion made even faster
• Data resilience
• MapR snapshots + H2O modelling from checkpoints = continuous and versioned modelling
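The idea of continuing a model from a checkpoint while keeping every version can be sketched in plain Python. This is an illustration under stated assumptions, not MapR's snapshot mechanism or H2O's checkpoint feature: a temporary directory of numbered files stands in for snapshots, a running mean stands in for a real model, and all function and file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(directory, state):
    """Write the model state as the next numbered version and return its path."""
    version = len(list(Path(directory).glob("model_v*.json"))) + 1
    path = Path(directory) / f"model_v{version:04d}.json"
    path.write_text(json.dumps(state))
    return path

def load_latest_checkpoint(directory):
    """Return the newest saved state, or a fresh state if none exists yet."""
    paths = sorted(Path(directory).glob("model_v*.json"))
    if not paths:
        return {"count": 0, "mean": 0.0}
    return json.loads(paths[-1].read_text())

def continue_training(state, batch):
    """Update a running mean with a new batch (stand-in for real training)."""
    count, mean = state["count"], state["mean"]
    for value in batch:
        count += 1
        mean += (value - mean) / count
    return {"count": count, "mean": mean}

with tempfile.TemporaryDirectory() as model_dir:
    # Each new batch resumes from the latest version and writes a new one.
    for batch in ([10.0, 20.0], [30.0], [40.0, 50.0]):
        state = continue_training(load_latest_checkpoint(model_dir), batch)
        save_checkpoint(model_dir, state)
    versions = sorted(p.name for p in Path(model_dir).glob("model_v*.json"))
    final_state = load_latest_checkpoint(model_dir)

print(versions, final_state)
```

Earlier versions are never overwritten, so any past model state can be reloaded and compared with the current one.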
49. Airline delay classification
• Model predicting flight delays
• Pipeline: ETL → Modelling → Predictions
• ETL: load data from CSVs
• Modelling: model using H2O’s GLM
* https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
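For readers unfamiliar with what a binomial GLM does in a delay classifier, here is a minimal pure-Python sketch of logistic regression fitted by batch gradient descent. It is not the demo script and not H2O's GLM implementation: the feature choice, the function names and the eight data rows are synthetic assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit intercept and weights by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / len(rows)
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
    return intercept, weights

def predict_delay_probability(model, row):
    """Return the modelled probability that a flight is delayed."""
    intercept, weights = model
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))

# Synthetic rows, NOT real airline data:
# [scheduled departure hour / 24, distance in 1000 km]
rows = [[0.25, 0.5], [0.30, 1.2], [0.35, 0.8], [0.40, 2.0],
        [0.70, 0.5], [0.75, 1.2], [0.80, 0.8], [0.85, 2.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = delayed (made-up labels)

model = fit_logistic(rows, labels)
early = predict_delay_probability(model, [0.25, 1.0])
late = predict_delay_probability(model, [0.85, 1.0])
print(f"P(delay | early departure) = {early:.3f}")
print(f"P(delay | late departure)  = {late:.3f}")
```

In this toy data only the departure time separates the classes, so the fitted weight on that feature is positive and later departures get a higher delay probability.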