Skutil - H2O meets Sklearn - Taylor Smith


Skutil brings together the best of H2O and sklearn: it delivers an easy transition into the world of distributed computing that H2O offers while providing the same familiar interface that sklearn users have come to know and love. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata


Scikit-Util
H2O MEETS SKLEARN
Taylor Smith
October 26, 2016

Agenda
• About me
• Problem statement
• Overview
• Package motivation
• Notable H2O additions
• Side-by-side
• Questions

About me
• Taylor Smith
• Data scientist at State Farm
• M.S. Analytics from The University of Texas at Austin
• ~3 years in data science, ~6 years writing software
tgsmith61591@gmail.com
http://github.com/tgsmith61591
https://www.linkedin.com/in/taylorgsmith
@TayGriffinSmith

Problem statement
WHY AM I STANDING HERE TALKING TO YOU?

DS/DE: typical division of labor
• Data scientist
  1. Frame the problem
  2. Gather raw data
  3. Analyze
• Data engineer
  1. Gather raw data
  2. Consolidate data
  3. Production

Where’s the disconnect?
• Exploration
  • Technologies (Hadoop/Spark/Python/R)
• Implementation
  • Technologies (Python/R/Java)
  • Dependencies/versioning
• Discrepancy in tooling

Package motivation
• What is skutil?
  • Began as a pre-processing library to unify Caret, sklearn, etc.
  • Specifically relevant to actuarial departments (why?)
  • Evolved to include H2O modules
• Objectives:
  • Deliver an easy transition into the world of distributed computing that H2O offers
  • Help bridge the “gap” between data scientist and data engineer roles
  • Provide the same, familiar interface that sklearn users have come to know and love

Package motivation [cont’d]
• Regarding R…
  • H2O package completeness
• Why Python…
  • Quickly growing active user base
  • Easily supported by non-DS engineers
  • CI/CD friendly
https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/

Skutil: Notable H2O additions
• H2OPipeline
  • Similar to sklearn.pipeline.Pipeline
H2OTransformer → H2OTransformer → H2OEstimator
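
The transformer → transformer → estimator chain on this slide can be sketched in plain Python. This is only an illustration of the pipeline pattern that sklearn.pipeline.Pipeline popularized, not skutil's or H2O's actual API: the class names `MeanImputer`, `MinMaxScaler`, `NearestNeighbor` and `Pipeline` below are hypothetical stand-ins, and the data is made up for the example.

```python
# Hypothetical stand-ins illustrating the pipeline pattern; none of these
# classes come from skutil, H2O, or sklearn.

class MeanImputer:
    """Transformer: replaces None with the column mean learned during fit."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means_ = [
            sum(v for v in col if v is not None) / sum(v is not None for v in col)
            for col in cols
        ]
        return self

    def transform(self, rows):
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(row)]
            for row in rows
        ]


class MinMaxScaler:
    """Transformer: rescales each column to [0, 1] using the training range."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.mins_ = [min(c) for c in cols]
        self.spans_ = [(max(c) - min(c)) or 1.0 for c in cols]
        return self

    def transform(self, rows):
        return [
            [(v - lo) / span for v, lo, span in zip(row, self.mins_, self.spans_)]
            for row in rows
        ]


class NearestNeighbor:
    """Estimator: predicts the label of the closest training row."""

    def fit(self, rows, labels):
        self.rows_, self.labels_ = rows, labels
        return self

    def predict(self, rows):
        def dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [
            self.labels_[min(range(len(self.rows_)), key=lambda i: dist(r, self.rows_[i]))]
            for r in rows
        ]


class Pipeline:
    """Chains transformers and ends with an estimator."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object); the last one is the estimator

    def fit(self, rows, labels):
        for _, step in self.steps[:-1]:
            rows = step.fit(rows).transform(rows)  # fit each transformer, pass data on
        self.steps[-1][1].fit(rows, labels)
        return self

    def predict(self, rows):
        for _, step in self.steps[:-1]:
            rows = step.transform(rows)  # reuse what was learned during fit
        return self.steps[-1][1].predict(rows)


# Made-up training data with one missing value.
X_train = [[1.0, 10.0], [2.0, None], [9.0, 90.0], [10.0, 100.0]]
y_train = ["low", "low", "high", "high"]

pipe = Pipeline([
    ("impute", MeanImputer()),
    ("scale", MinMaxScaler()),
    ("model", NearestNeighbor()),
])
pipe.fit(X_train, y_train)
predictions = pipe.predict([[1.5, None], [9.8, 99.0]])
print(predictions)
```

The point of the pattern is that new data passes through exactly the same fitted preprocessing steps as the training data, so the caller only ever deals with one object.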

Skutil: Notable H2O additions [cont’d]
• H2OGridSearchCV (and H2ORandomizedSearchCV)
  • Similar to the sklearn.grid_search module
Parameter grid → Param set 0 … Param set n → Best model
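
The flow from parameter grid to parameter sets to best model can be sketched with the standard library alone. This is a simplified illustration of cross-validated grid search in general, not the implementation of H2OGridSearchCV or of sklearn: `param_sets`, `k_folds`, `grid_search_cv` and the toy `KNearest` model are hypothetical names, and the data is invented for the example.

```python
from itertools import product

# Hypothetical helpers illustrating cross-validated grid search; not skutil code.

def param_sets(grid):
    """Expand {'a': [1, 2], 'b': [3]} into one dict per parameter combination."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def k_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx); fold f holds out samples where i % n_folds == f."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


class KNearest:
    """Toy 1-D k-nearest-neighbour classifier used as the model under search."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, xs, ys):
        self.xs_, self.ys_ = list(xs), list(ys)
        return self

    def predict(self, xs):
        out = []
        for x in xs:
            order = sorted(range(len(self.xs_)), key=lambda i: abs(self.xs_[i] - x))
            votes = [self.ys_[i] for i in order[: self.k]]
            out.append(max(sorted(set(votes)), key=votes.count))  # majority vote
        return out


def grid_search_cv(make_model, grid, xs, ys, n_folds=3):
    """Score every parameter set by mean CV accuracy, then refit the best one."""
    results = []
    for params in param_sets(grid):
        scores = []
        for train, test in k_folds(len(xs), n_folds):
            model = make_model(**params).fit([xs[i] for i in train], [ys[i] for i in train])
            preds = model.predict([xs[i] for i in test])
            hits = sum(p == ys[i] for p, i in zip(preds, test))
            scores.append(hits / len(test))
        results.append((sum(scores) / len(scores), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    best_model = make_model(**best_params).fit(xs, ys)  # refit on all the data
    return best_model, best_params, best_score


# Made-up 1-D data with two noisy labels.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
grid = {"k": [1, 3, 5]}

best_model, best_params, best_score = grid_search_cv(KNearest, grid, xs, ys)
print(best_params, round(best_score, 3))
```

A randomized search follows the same loop but samples a fixed number of parameter sets instead of enumerating the full grid.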

Ok, I have a model… now what?
• Deploying in Python?
  • Pickle-compatible persistence
  • Entire pipelines can be stored
• Deploying model in Java?
  • Leverage H2O’s built-in “download POJO” capability*
  • (future release will auto-gen main class and compile runnable fat-jar)
* Just the H2O model; not the full pipeline

Skutil at a glance: present and future
• Current (v0.1.3)
  • Transformers
  • Feature selection
  • Imputation
  • Class balancers
  • Model selection & Pipelines
• Road map
  • PySpark integration
    • (Thank you to fellow contributor, Charles Drotar)
  • Automated runnable jar creation using jinja

H2O vs. Sklearn
Load data
Split data
Fit model

Skutil vs. Sklearn
Load data
Split data
Fit model

Contenu connexe

Tendances

Intro to H2O Machine Learning in R at Santa Clara University

Sri Ambati

Data Science at Scale by Sarah Guido

Spark Summit

Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability. ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience using H2O's R/Python/Flow (Web) interfaces. Jo-fai (or Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in UK where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specialized in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.

H2O Deep Water - Making Deep Learning Accessible to Everyone

Sri Ambati

H2O intro at Dallas Meetup

Sri Ambati

H2O Big Join Slides

Sri Ambati

Deep Learning with MXNet - Dmitry Larko

Sri Ambati

The focus of this presentation is scalable machine learning using the h2o R and Python packages. H2O is an open source, distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java, however, fully-featured APIs are available in R, Python, Scala, REST/JSON, and also through a web interface. Since H2O's algorithm implementations are distributed, this allows the software to scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of Generalized Linear Models, Gradient Boosting Machines, Random Forest, Deep Neural Nets, Stacked Ensembles (aka "Super Learners"), dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), anomaly detection methods, among others. R and Python code with H2O machine learning code examples will be demoed live and will be made available on GitHub for participants to follow along on their laptops if they choose. For those interested in running the code on a multi-node Amazon EC2 cluster, an H2O AMI is also available. Author Bio: Dr. Erin LeDell is a Machine Learning Scientist at H2O.ai, the company that produces the open source machine learning platform, H2O. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE in 2016) and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc.

Scalable Machine Learning in R and Python with H2O

Sri Ambati

The majority of a data scientist’s time is spent cleaning and organizing data before insights can be derived. Frequently, with large datasets, a lack of integration with visualization tools makes it hard to know what’s most interesting in the data and also creates challenges for validating numerical insights from models. Given the vast number of tools available in the ecosystem, it is hard to experiment with different tools to pick the most suitable one, especially given the complexity involved in integrating them with one’s solution. The speakers will present an easy to use workflow that solves this integration challenge by combining various open source libraries, databases (e.g. Hive, Postgres, MySQL, HBase etc.) and visualization with distributed analytics. Intel developed a highly scalable library built over Apache Spark with novel graph, statistical and machine learning algorithms that also enhances the user experience of Apache Spark via easier to use APIs. This session will showcase how to address the above mentioned issues for a drug similarity use case. We’ll go from ETL operations on raw drug data to deriving relevant features from the drug’s chemical structure using statistical and graph algorithms, using techniques to identify best model and parameters for this data to derive insights, and then demonstrating the ease of connectivity to different databases and visualization tools.

Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...

Databricks

Strata San Jose 2016: Scalable Ensemble Learning with H2O

Sri Ambati

How Deep Learning Will Make Us More Human Again While deep learning is taking over the AI space, most of us are struggling to keep up with the pace of innovation. Arno Candel shares success stories and challenges in training and deploying state-of-the-art machine learning models on real-world datasets. He will also share his insights into what the future of machine learning and deep learning might look like, and how to best prepare for it. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

ArnoCandelAIFrontiers011217

Sri Ambati

Spark Summit 2015 keynote: Making Big Data Simple with Spark

Databricks

H2O Rains with Databricks Cloud - NY 02.16.16

Sri Ambati

High Performance Machine Learning in R with H2O

Sri Ambati

The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures. In this three-part session, Ayad Shammout and Denny will show: 1) How we tried to solve this problem using traditional DW techniques 2) How we took advantage of the DW capabilities in Apache Spark AND easily transition to Spark MLlib so we could more easily predict available OR block times resulting in better OR utilization and shorter wait times for patients. 3) Some of the key learnings we had when migrating from DW to Spark.

Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...

Databricks

AI Development with H2O.ai

Yalçın Yenigün

We’ve all heard that AI is going to become as ubiquitous in the enterprise as the telephone, but what does that mean exactly? Everyone in IBM has a telephone; and everyone knows how to use her telephone; and yet IBM isn’t a phone company. How do we bring AI to the same standard of ubiquity — where everyone in a company has access to AI and knows how to use AI; and yet the company is not an AI company? In this talk, we’ll break down the challenges a domain expert faces today in applying AI to real-world problems. We’ll talk about the challenges that a domain expert needs to overcome in order to go from “I know a model of this type exists” to “I can tell an application developer how to apply this model to my domain.” We’ll conclude the talk with a live demo that show cases how a domain expert can cut through the five stages of model deployment in minutes instead of days using IBM and other open source tools.

Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...

Databricks

Introduction to Analytics with Azure Notebooks and Python

Jen Stirrup

This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that are require additional technologies to implement correctly. Here’s an example outline of some of the topics that will be covered in the talk: Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally. By Vida Ha at Spark Summit East 2016.

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...

Databricks

Machine Learning with Spark

elephantscale

Sparkling Water 5 28-14

Sri Ambati

Tendances (20)

Intro to H2O Machine Learning in R at Santa Clara University

Data Science at Scale by Sarah Guido

H2O Deep Water - Making Deep Learning Accessible to Everyone

H2O intro at Dallas Meetup

H2O Big Join Slides

Deep Learning with MXNet - Dmitry Larko

Scalable Machine Learning in R and Python with H2O

Distributed End-to-End Drug Similarity Analytics and Visualization Workflow w...

Strata San Jose 2016: Scalable Ensemble Learning with H2O

ArnoCandelAIFrontiers011217

Spark Summit 2015 keynote: Making Big Data Simple with Spark

H2O Rains with Databricks Cloud - NY 02.16.16

High Performance Machine Learning in R with H2O

Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...

AI Development with H2O.ai

Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...

Introduction to Analytics with Azure Notebooks and Python

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...

Machine Learning with Spark

Sparkling Water 5 28-14

En vedette

Sparkling Water 2.0 - Michal Malohlava

Sri Ambati

Cybersecurity with AI - Ashrith Barthur

Sri Ambati

H2O AutoML roadmap - Ray Peck

Sri Ambati

Over-fitting, misread data, NAs, collinear column elimination and other common issues play havoc in the day of practicing data scientist. In this talk, Mark Landry, one of the world’s leading Kagglers, will review the top 10 common pitfalls and steps to avoid them. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Top 10 Data Science Practitioner Pitfalls

Sri Ambati

How Sparkling Water brings Fast Scalable Machine learning via H2O to Apache Spark. By Michal Malohlava and H2O.ai Our 100th Meetup at 0xdata, September 30, 2014 Open Source meets Out Door. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

2014 09 30_sparkling_water_hands_on

Sri Ambati

Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup Basic knowledge of R/python and general ML concepts Note: This is bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate in the workshop Level: 200 Time: 2 Hours Agenda: - Introduction to ML, H2O and Sparkling Water - Refresher of data manipulation in R & Python - Supervised learning ---- Understanding liner regression model with an example ---- Understanding binomial classification with an example ---- Understanding multinomial classification with an example - Unsupervised learning ---- Understanding k-means clustering with an example - Using machine learning models in production - Sparkling Water Introduction & Demo

Applied Machine learning using H2O, python and R Workshop

Avkash Chauhan

H2O World - Sparkling Water - Michal Malohlava

Sri Ambati

Nvidia Deep Learning Solutions - Alex Sabatier

Sri Ambati

Deep Water brings the latest and greatest in the Deep Learning space all under the H2O hood. Use Tensorflow, Mxnet & Caffe all from standard H2O interface's including R, Python & Flow. Also deploy your models easily using the H2O platform. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Deep Water - GPU Deep Learning for H2O - Arno Candel

Sri Ambati

Machine Learning with H2O, Spark, and Python at Strata 2015

Sri Ambati

Deep Learning R Vignette Documentation: https://github.com/0xdata/h2o/tree/master/docs/deeplearning/ Deep Learning has been dominating recent machine learning competitions with better predictions. Unlike the neural networks of the past, modern Deep Learning methods have cracked the code for training stability and generalization. Deep Learning is not only the leader in image and speech recognition tasks, but is also emerging as the algorithm of choice in traditional business analytics. This talk introduces Deep Learning and implementation concepts in the open-source H2O in-memory prediction engine. Designed for the solution of enterprise-scale problems on distributed compute clusters, it offers advanced features such as adaptive learning rate, dropout regularization and optimization for class imbalance. World record performance on the classic MNIST dataset, best-in-class accuracy for eBay text classification and others showcase the power of this game changing technology. A whole new ecosystem of Intelligent Applications is emerging with Deep Learning at its core. About the Speaker: Arno Candel Prior to joining 0xdata as Physicist & Hacker, Arno was a founding Senior MTS at Skytree where he designed and implemented high-performance machine learning algorithms. He has over a decade of experience in HPC with C++/MPI and had access to the world's largest supercomputers as a Staff Scientist at SLAC National Accelerator Laboratory where he participated in US DOE scientific computing initiatives. While at SLAC, he authored the first curvilinear finite-element simulation code for space-charge dominated relativistic free electrons and scaled it to thousands of compute nodes. He also led a collaboration with CERN to model the electromagnetic performance of CLIC, a ginormous e+e- collider and potential successor of LHC. Arno has authored dozens of scientific papers and was a sought-after academic conference speaker. 
He holds a PhD and Masters summa cum laude in Physics from ETH Zurich. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

H2O Distributed Deep Learning by Arno Candel 071614

Sri Ambati

Scalable Data Science and Deep Learning with H2O In this session, we introduce the H2O data science platform. We will explain its scalable in-memory architecture and design principles and focus on the implementation of distributed deep learning in H2O. Advanced features such as adaptive learning rates, various forms of regularization, automatic data transformations, checkpointing, grid-search, cross-validation and auto-tuning turn multi-layer neural networks of the past into powerful, easy-to-use predictive analytics tools accessible to everyone. We will present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases. By the end of the hands-on-session, attendees will have learned to perform end-to-end data science workflows with H2O using both the easy-to-use web interface and the flexible R interface. We will cover data ingest, basic feature engineering, feature selection, hyperparameter optimization with N-fold cross-validation, multi-model scoring and taking models into production. We will train supervised and unsupervised methods on realistic datasets. With best-of-breed machine learning algorithms such as elastic net, random forest, gradient boosting and deep learning, you will be able to create your own smart applications. A local installation of RStudio is recommended for this session. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

H2O Deep Learning at Next.ML

Sri Ambati

Video: https://www.youtube.com/watch?v=R3IXd1iwqjc Meetup: http://www.meetup.com/SF-Bay-ACM/events/231709894/ In this talk, Arno Candel presents a brief history of AI and how Deep Learning and Machine Learning techniques are transforming our everyday lives. Arno will introduce H2O, a scalable open-source machine learning platform, and show live demos on how to train sophisticated machine learning models on large distributed datasets. He will show how data scientists and application developers can use the Flow GUI, R, Python, Java, Scala, JavaScript and JSON to build smarter applications, and how to take them to production. He will present customer use cases from verticals including insurance, fraud, churn, fintech, and marketing. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Transform your Business with AI, Deep Learning and Machine Learning

Sri Ambati

Sparkling Water is the newest application on the Apache Spark in-memory platform to extend Machine Learning for better predictions and to quickly deploy models into production. H2O is proud to partner with Cloudera and Databricks to bring this capability to a wide audience. H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast¬TM Scoring Engine. Learn more by going to http://www.h2o.ai and contact us for more information. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Sparkling Water Webinar October 29th, 2014

Sri Ambati

Sparkling Water Meetup

Sri Ambati

H2O World - What Do Companies Need to do to Stay Ahead - Michael Marks

Sri Ambati

H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston

Sri Ambati

Introduction to Sparkling Water - Spark Summit East 2016

Sri Ambati

Sparkling Water

h2oworld

H2O World - PySparkling Water - Nidhi Mehta

Sri Ambati

En vedette (20)

Sparkling Water 2.0 - Michal Malohlava

Cybersecurity with AI - Ashrith Barthur

H2O AutoML roadmap - Ray Peck

Top 10 Data Science Practitioner Pitfalls

2014 09 30_sparkling_water_hands_on

Applied Machine learning using H2O, python and R Workshop

H2O World - Sparkling Water - Michal Malohlava

Nvidia Deep Learning Solutions - Alex Sabatier

Deep Water - GPU Deep Learning for H2O - Arno Candel

Machine Learning with H2O, Spark, and Python at Strata 2015

H2O Distributed Deep Learning by Arno Candel 071614

H2O Deep Learning at Next.ML

Transform your Business with AI, Deep Learning and Machine Learning

Sparkling Water Webinar October 29th, 2014

Sparkling Water Meetup

H2O World - What Do Companies Need to do to Stay Ahead - Michael Marks

H2O World - ML Could Solve NLP Challenges: Ontology Management - Erik Huddleston

Introduction to Sparkling Water - Spark Summit East 2016

Sparkling Water

H2O World - PySparkling Water - Nidhi Mehta

Similaire à Skutil - H2O meets Sklearn - Taylor Smith

NYC_2016_slides

Nathan Halko

Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. Horton Randall Pruim Daniel T. Kaplan A Student's Guide to R Project MOSAIC 2 horton, kaplan, pruim Copyright (c) 2015 by Nicholas J. Horton, Randall Pruim, & Daniel Kaplan. Edition 1.2, November 2015 This material is copyrighted by the authors under a Creative Commons Attribution 3.0 Unported License. You are free to Share (to copy, distribute and transmit the work) and to Remix (to adapt the work) if you attribute our work. More detailed information about the licensing is available at this web page: http: //www.mosaic-web.org/go/teachingRlicense.html. Cover Photo: Maya Hanna. http://www.mosaic-web.org/go/teachingRlicense.html http://www.mosaic-web.org/go/teachingRlicense.html Contents 1 Introduction 13 2 Getting Started with RStudio 15 3 One Quantitative Variable 27 4 One Categorical Variable 39 5 Two Quantitative Variables 45 6 Two Categorical Variables 55 7 Quantitative Response, Categorical Predictor 61 8 Categorical Response, Quantitative Predictor 69 9 Survival Time Outcomes 73 4 horton, kaplan, pruim 10 More than Two Variables 75 11 Probability Distributions & Random Variables 83 12 Power Calculations 89 13 Data Management 93 14 Health Evaluation (HELP) Study 107 15 Exercises and Problems 111 16 Bibliography 115 17 Index 117 About These Notes We present an approach to teaching introductory and in- termediate statistics courses that is tightly coupled with computing generally and with R and RStudio in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling based inference, and multivari- ate graphical techniques. A secondary goal is to facilitate computing with data through use of small simulation studies and appropriate statistical analysis workflow. This follows the philosophy outlined by Nolan and Temple Lang1. The importance of modern computation in statis- 1 D. Nolan and D. Temple Lang. 
Computing in the statistics curriculum. The American Statistician, 64(2):97–107, 2010 tics education is a principal component of the recently adopted American Statistical Association’s curriculum guidelines2. 2 ASA Undergraduate Guide- lines Workgroup. 2014 curriculum guidelines for undergraduate programs in statistical science. Technical report, American Statistical Associa- tion, November 2014. http: //www.amstat.org/education/ curriculumguidelines.cfm Throughout this book (and its companion volumes), we introduce multiple activities, some appropriate for an introductory course, others suitable for higher levels, that demonstrate key concepts in statistics and modeling while also supporting the core material of more traditional courses. A Work in Progress Caution! Despite our best efforts, you WILL find bugs both in this document and in our code. Please let us know when y ...

Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx

wellesleyterresa

A Look into the Apache OODT Ecosystem

Chris Mattmann

Samsung SDS OpeniT - The possibility of Python

Insuk (Chris) Cho

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L. Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com. Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.

Elastic Data Analytics Platform @Datadog

C4Media

Linked Data in Learning Analytics Tools

Mathieu d'Aquin

Nowadays, Data Science is buzzing all over the place. But what is a, so-called, Data Scientist? Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data. However, we enter an age where velocity is a key. Not only the velocity of your data is high, but the time to market is shortened. Hence, the time separating the moment you receive a set of data and the time you’ll be able to deliver added value is crucial. In this talk, we’ll review the legacy Data Science methodologies, what it meant in terms of delivered work and results. Afterwards, we’ll slightly move towards different concepts, techniques and tools that Data Scientists will have to learn and appropriate in order to accomplish their tasks in the age of Big Data. The dissertation is closed by exposing the Data Fellas view on a solution to the challenges, specially thanks to the Spark Notebook and the Shar3 product we develop.

Towards a rebirth of data science (by Data Fellas)

Andy Petrella

Hadoop/Spark Non-Technical Basics

Zitao Liu

The main objective of this workshop is to give the audience hands on experience with several Hadoop technologies and jump start their hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping in to the technology, the Founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when. NOTE: To complete hands on poriton in the time allotted, attendees should come with a newly created AWS (Amazon Web Services) Account and complete the other prerequisites found in the DataKitchen blog <http: />.

Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...

DataKitchen

Alluxio Global Online Meetup Apr 23, 2020 For more Alluxio events: https://www.alluxio.io/events/ Speakers: Jiao (Jennie) Wang, Intel Tsai Louie, Intel Bin Fan, Alluxio Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced without network being I/O bottlenecked. Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications. This talk, we will go over: - What is Analytics Zoo and how it works - How to run Analytics Zoo with Alluxio in deep learning applications - Initial performance benchmark results using the Analytics Zoo + Alluxio stack

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio

Alluxio, Inc.

Skutil - H2O meets Sklearn - Taylor Smith

1. Scikit-Util H2O MEETS SKLEARN Taylor Smith October 26, 2016

2. Agenda
- About me
- Problem statement
- Overview
- Package motivation
- Notable H2O additions
- Side-by-side
- Questions

3. About me
- Taylor Smith
- Data scientist at State Farm
- M.S. Analytics from The University of Texas at Austin
- ~3 years in data science, ~6 years writing software
tgsmith61591@gmail.com
http://github.com/tgsmith61591
https://www.linkedin.com/in/taylorgsmith
@TayGriffinSmith

4. Problem statement WHY AM I STANDING HERE TALKING TO YOU?

5. DS/DE: typical division of labor
- Data scientist
  1. Frame the problem
  2. Gather raw data
  3. Analyze
- Data engineer
  1. Gather raw data
  2. Consolidate data
  3. Production

6. Where’s the disconnect?
- Exploration
- Technologies (Hadoop/Spark/Python/R)
- Implementation
- Technologies (Python/R/Java)
- Dependencies/versioning
- Discrepancy in tooling

7. Package motivation
- What is skutil?
  - Began as a pre-processing library to unify Caret, sklearn, etc.
  - Specifically relevant to actuarial departments (why?)
  - Evolved to include H2O modules
- Objectives:
  - Deliver an easy transition into the world of distributed computing that H2O offers
  - Help bridge “gap” between data scientist and data engineer roles
  - Provide the same, familiar interface that sklearn users have come to know and love

8. Package motivation [cont’d]
- Regarding R…
  - H2O package completeness
- Why Python…
  - Quickly growing active user base
  - Easily supported by non-DS engineers
  - CI/CD friendly
https://www.r-bloggers.com/on-the-growth-of-r-and-python-for-data-science/

9. Skutil: Notable H2O additions
- H2OPipeline
  - Similar to sklearn.pipeline.Pipeline
[Diagram labels: H2OTransformer, H2OTransformer, H2OEstimator]
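The chaining idea behind a pipeline of transformers followed by an estimator can be sketched in plain Python. This is only an illustration of the pattern that sklearn.pipeline.Pipeline follows; it is not skutil's H2OPipeline, and every class name here (`MeanCenterer`, `AbsMaxScaler`, `CentroidClassifier`, `TinyPipeline`) is a hypothetical stand-in that operates on lists rather than H2O frames.

```python
class MeanCenterer:
    """Toy transformer: subtracts the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class AbsMaxScaler:
    """Toy transformer: divides each column by its largest absolute value."""

    def fit(self, X, y=None):
        # Fall back to 1.0 for an all-zero column to avoid dividing by zero.
        self.scales_ = [max(abs(v) for v in col) or 1.0 for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v / s for v, s in zip(row, self.scales_)] for row in X]


class CentroidClassifier:
    """Toy estimator: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: sq_dist(row, self.centroids_[c]))
            for row in X
        ]


class TinyPipeline:
    """Chains transformers and one final estimator (hypothetical sketch)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs; the last is the estimator

    def fit(self, X, y):
        # Each transformer is fitted on the output of the previous one.
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data passes through the already-fitted transformers unchanged.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


X_train = [[1.0, 10.0], [2.0, 12.0], [8.0, 30.0], [9.0, 33.0]]
y_train = ["low", "low", "high", "high"]

pipe = TinyPipeline([
    ("center", MeanCenterer()),
    ("scale", AbsMaxScaler()),
    ("clf", CentroidClassifier()),
])
pipe.fit(X_train, y_train)
predictions = pipe.predict([[1.5, 11.0], [8.5, 31.0]])
```

The point of the pattern is that the same fitted object carries every preprocessing step, so scoring new data cannot accidentally skip or reorder a transformation.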

10. Skutil: Notable H2O additions [cont’d]
- H2OGridSearchCV (and H2ORandomizedSearchCV)
  - Similar to sklearn.grid_search module
[Diagram labels: Parameter grid, Param set 0, …, Param set n, Best model]
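The flow in the diagram (parameter grid, expanded into parameter sets 0 through n, reduced to a best model) can be sketched with the standard library alone. This is a toy illustration of cross-validated grid search in general, not the H2OGridSearchCV implementation; `ShrunkenMean`, `expand_grid`, `k_fold_indices`, and `grid_search_cv` are hypothetical names invented for the example.

```python
from itertools import product


class ShrunkenMean:
    """Toy regressor: predicts shrink * mean(y) + offset for every row."""

    def __init__(self, shrink=1.0, offset=0.0):
        self.shrink = shrink
        self.offset = offset

    def fit(self, X, y):
        self.value_ = self.shrink * sum(y) / len(y) + self.offset
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


def expand_grid(param_grid):
    """Expand a dict of candidate lists into one dict per parameter set."""
    names = sorted(param_grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(param_grid[name] for name in names))
    ]


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs using interleaved folds."""
    for k in range(n_folds):
        test_idx = [i for i in range(n_rows) if i % n_folds == k]
        train_idx = [i for i in range(n_rows) if i % n_folds != k]
        yield train_idx, test_idx


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that larger is better."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def grid_search_cv(make_model, param_grid, X, y, n_folds=3):
    """Score every parameter set by k-fold CV and refit the best on all data."""
    results = []
    for params in expand_grid(param_grid):
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(X), n_folds):
            model = make_model(**params)
            model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            preds = model.predict([X[i] for i in test_idx])
            fold_scores.append(neg_mse([y[i] for i in test_idx], preds))
        results.append((sum(fold_scores) / len(fold_scores), params))
    best_score, best_params = max(results, key=lambda r: r[0])
    best_model = make_model(**best_params).fit(X, y)
    return best_model, best_params, best_score


X = [[i] for i in range(6)]
y = [4.0, 5.0, 6.0, 4.0, 5.0, 6.0]
grid = {"shrink": [0.5, 1.0, 1.5], "offset": [0.0, 1.0]}

param_sets = expand_grid(grid)
best_model, best_params, best_score = grid_search_cv(ShrunkenMean, grid, X, y)
```

A randomized search differs only in the first step: instead of enumerating every combination, it samples a fixed number of parameter sets from the grid.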

11. Ok, I have a model… now what?
- Deploying in Python?
  - Pickle-compatible persistence
  - Entire pipelines can be stored
- Deploying model in Java?
  - Leverage H2O’s built-in “download POJO” capability*
  - (future release will auto-gen main class and compile runnable fat-jar)
* Just the H2O model; not the full pipeline
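As a rough sketch of what pickle-compatible persistence means on the Python side, the snippet below round-trips a fitted object through the standard `pickle` module. The `fitted_pipeline` structure and the `predict` helper are hypothetical stand-ins built from plain lists and dicts; they are not skutil objects, and the numbers are made up for the example.

```python
import pickle

# Hypothetical stand-in for a fitted pipeline: an ordered list of named steps,
# each holding the parameters it learned during fitting.
fitted_pipeline = [
    ("center", {"means": [5.0, 21.25]}),
    ("scale", {"scales": [4.0, 11.75]}),
    ("model", {"weights": [0.6, 0.4], "intercept": 0.0}),
]


def predict(pipeline, row):
    """Apply the stored preprocessing steps, then the stored linear model."""
    steps = dict(pipeline)
    centered = [v - m for v, m in zip(row, steps["center"]["means"])]
    scaled = [v / s for v, s in zip(centered, steps["scale"]["scales"])]
    model = steps["model"]
    return sum(w * v for w, v in zip(model["weights"], scaled)) + model["intercept"]


# Serialize every step in one object, as a training job might do.
blob = pickle.dumps(fitted_pipeline)

# Restore it, as a separate scoring process might do.
restored = pickle.loads(blob)

before = predict(fitted_pipeline, [9.0, 33.0])
after = predict(restored, [9.0, 33.0])
```

Because preprocessing and model travel together in one serialized object, the scoring side does not need to re-implement any transformation. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.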

12. Skutil at a glance: present and future
- Current (v0.1.3)
  - Transformers
  - Feature selection
  - Imputation
  - Class balancers
  - Model selection & Pipelines
- Road map
  - PySpark integration
  - (Thank you to fellow contributor, Charles Drotar)
  - Automated runnable jar creation using jinja +

13. H2O vs. Sklearn SKUTIL IN ACTION

14. H2O vs. Sklearn Load data Split data Fit model

15. Skutil vs. Sklearn Load data Split data Fit model

16. Questions? THANK YOU!!
