Skutil brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. Agenda
About me
Problem statement
Overview
Package motivation
Notable H2O additions
Side-by-side
Questions
3. About me
Taylor Smith
Data scientist at State Farm
M.S. Analytics from The University of Texas at Austin
~3 years in data science, ~6 years writing software
tgsmith61591@gmail.com
http://github.com/tgsmith61591
https://www.linkedin.com/in/taylorgsmith
@TayGriffinSmith
5. DS/DE—typical division of labor
Data scientist
1. Frame the problem
2. Gather raw data
3. Analyze
Data engineer
1. Gather raw data
2. Consolidate data
3. Production
6. Where’s the disconnect?
Exploration
Technologies (Hadoop/Spark/Python/R)
Implementation
Technologies (Python/R/Java)
Dependencies/versioning
Discrepancy in tooling
7. Package motivation
What is skutil?
Began as a pre-processing library to unify functionality from caret (R), sklearn, etc.
Specifically relevant to actuarial departments—(why?)
Evolved to include H2O modules
Objectives:
Deliver an easy transition into the world of distributed computing that H2O offers
Help bridge “gap” between data scientist and data engineer roles
Provide the same, familiar interface that sklearn users have come to know and love
8. Package motivation [cont’d]
Regarding R…
H2O package completeness
Why Python…
Quickly growing active user base
Easily supported by non-DS engineers
CI/CD friendly
https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/
9. Skutil—Notable H2O additions
H2OPipeline
Similar to sklearn.pipeline.Pipeline
Example steps: H2OTransformer → H2OTransformer → H2OEstimator
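The pipeline idea on this slide (a chain of transformers ending in an estimator) can be sketched in plain Python. This is only an illustration of the pattern that sklearn.pipeline.Pipeline popularized; every class below (MeanCenterer, Scaler, ThresholdClassifier, SimplePipeline) is a hypothetical stand-in that operates on Python lists, not skutil's or H2O's actual API.

```python
# Hypothetical stand-ins for illustration only; these are NOT skutil or H2O classes.

class MeanCenterer:
    """Transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class Scaler:
    """Transformer: divides by the largest absolute value seen during fit."""

    def fit(self, X, y=None):
        self.scale_ = max(abs(x) for x in X) or 1.0
        return self

    def transform(self, X):
        return [x / self.scale_ for x in X]


class ThresholdClassifier:
    """Estimator: predicts 1 when the feature exceeds a fixed threshold."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy estimator

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]


class SimplePipeline:
    """Chains (name, step) pairs: every step but the last must transform."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        # Fit each transformer on the output of the previous one.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Re-apply the fitted transformations, then delegate to the estimator.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


pipe = SimplePipeline([
    ("center", MeanCenterer()),
    ("scale", Scaler()),
    ("clf", ThresholdClassifier()),
])
pipe.fit([1.0, 2.0, 3.0, 4.0, 5.0])
predictions = pipe.predict([0.0, 3.0, 6.0])
print(predictions)
```

The point of the pattern is that the statistics learned during fit travel with the pipeline, so new data is transformed exactly as the training data was.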
10. Skutil—Notable H2O additions [cont’d]
H2OGridSearchCV (and H2ORandomizedSearchCV)
Similar to sklearn.grid_search module
Parameter grid → Param set 0 … Param set n → Best model
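The flow on this slide (expand a parameter grid into parameter sets, score each one, keep the best) can be sketched as follows. This is a minimal illustration, not the H2OGridSearchCV implementation: expand_grid, grid_search and toy_score are hypothetical names, and the scoring function is a stand-in for the cross-validated model score a real search would compute.

```python
from itertools import product


def expand_grid(param_grid):
    """Turn {"a": [1, 2], "b": [3]} into a list of parameter-set dicts."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, values))
        for values in product(*(param_grid[k] for k in keys))
    ]


def grid_search(param_grid, score_fn):
    """Score every parameter set and return the best (params, score) pair."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(param_grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical score that peaks at max_depth=5, learn_rate=0.1.
    # A real search would fit a model per fold and average a metric here.
    return -abs(params["max_depth"] - 5) - abs(params["learn_rate"] - 0.1)


grid = {"max_depth": [3, 5, 7], "learn_rate": [0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

A randomized search follows the same loop but samples a fixed number of parameter sets instead of enumerating the full product.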
11. Ok, I have a model… now what?
Deploying in Python?
Pickle-compatible persistence
Entire pipelines can be stored
Deploying model in Java?
Leverage H2O’s built-in “download POJO” capability*
(a future release will auto-generate the main class and compile a runnable fat JAR)
* Just the H2O model; not the full pipeline
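For the Python deployment path, "pickle-compatible persistence" means a fitted object can be serialized and restored with its learned state intact. The sketch below shows that round trip with the standard library's pickle module on a plain Python object. FittedPipeline is a hypothetical stand-in; skutil's own save and load calls are not shown in the slides, and an H2O-backed model may need more than this because its state lives in the H2O cluster.

```python
import pickle


class FittedPipeline:
    """Hypothetical stand-in for a fitted pipeline with learned state."""

    def __init__(self, offset, threshold):
        self.offset = offset        # e.g. a mean learned during fit
        self.threshold = threshold  # e.g. a decision threshold

    def predict(self, X):
        return [1 if (x - self.offset) > self.threshold else 0 for x in X]


original = FittedPipeline(offset=3.0, threshold=0.0)

# Serialize to bytes; pickle.dump(obj, file) would write to disk instead.
payload = pickle.dumps(original)

# Later, possibly in another process that can import the same class:
restored = pickle.loads(payload)

print(restored.predict([0.0, 3.0, 6.0]))
```

The restoring process must be able to import the same class definitions, which is one reason pinned dependencies and versioning matter when handing a model from exploration to production.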
12. Skutil at a glance—present and future
Current (v0.1.3)
Transformers
Feature selection
Imputation
Class balancers
Model selection & Pipelines
Road map
PySpark integration
(Thank you to fellow contributor, Charles Drotar)
Automated runnable JAR creation using Jinja