The Road to Production: Automating your Anomaly Detectors - by jao (Jose A. Ortega), Co-Founder and Chief Technology Officer at BigML.
Machine Learning School in The Netherlands, 2022.
3. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
3 / 61
4. Machine Learning as a System Service
The goal
Machine Learning as a system-level service
• Accessibility
• Integrability
• Automation
• Ease of use
6. Machine Learning as a System Service
The goal
Machine Learning as a system-level service
The means
• APIs: ML building blocks
• Abstraction layer over feature
engineering
• Abstraction layer over algorithms
• Automation
7. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
10. RESTful done right: Whitebox resources
• Your data, your model
• Model reverse engineering becomes moot
• Maximizes reach (Web, CLI, desktop, IoT)
11. RESTful-ish ML Services
• Excellent abstraction layer
• Transparent data model
• Immutable resources and UUIDs: traceability
• Simple yet effective interaction model
• Easy access from any language (API bindings)
Algorithmic complexity and computing resources management
problems mostly washed away
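The immutable-resource model makes traceability mechanical: every resource is addressed by a stable ID of the form `<type>/<identifier>`. A minimal sketch of working with such IDs; the `parse_resource_id` helper and the exact 24-hex-digit grammar are illustrative assumptions, not part of any official bindings:

```python
import re

# Assumed ID grammar for this sketch: "<resource type>/<24 hex digits>".
RESOURCE_RE = re.compile(
    r"^(source|dataset|model|anomaly|evaluation)/([0-9a-f]{24})$")

def parse_resource_id(resource_id):
    """Split a resource ID into (type, uuid); raise ValueError if malformed."""
    match = RESOURCE_RE.match(resource_id)
    if match is None:
        raise ValueError(f"not a resource ID: {resource_id!r}")
    return match.group(1), match.group(2)
```

Because resources are immutable, such an ID always denotes the same artifact, which is what makes workflows auditable end to end.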
14. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
19. Tumor detection using anomalies
Given data about a tumor:
• Extract the relevant features that
characterize it (unsupervised
learning)
• Classify the tumor as either benign
or malignant, improving diagnosis
and avoiding unnecessary surgery
20. Tumor detection using anomalies
Given data about a tumor:
• Extract the relevant features that
characterize it (unsupervised
learning)
• Classify the tumor as either benign
or malignant, improving diagnosis
and avoiding unnecessary surgery
Example: University of Wisconsin Hospital’s Cancer dataset
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
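The idea can be sketched locally with scikit-learn, which ships this same Wisconsin dataset; this is only an illustration of anomaly flags versus malignancy labels, not the BigML workflow the talk describes:

```python
# Local sketch (assumption: scikit-learn stands in for BigML here).
# Fit an unsupervised anomaly detector, then check how its flags
# line up with the benign/malignant labels it never saw.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest

X, y = load_breast_cancer(return_X_y=True)    # y: 0 = malignant, 1 = benign
detector = IsolationForest(contamination=0.1, random_state=0).fit(X)
flags = detector.predict(X)                   # -1 = anomalous, 1 = normal

flagged = [label for flag, label in zip(flags, y) if flag == -1]
malignant_share = flagged.count(0) / len(flagged)
print(f"{len(flagged)} tumors flagged; {malignant_share:.0%} of them malignant")
```

The point of the example is the division of labor: the anomaly detector extracts "unusualness" without labels, and that signal is then used to support the benign/malignant decision.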
28. (Non) automation via Web UI
Strengths of Web UI
Simple: Just clicking around
Discoverable: Exploration and experimenting
Abstract: Transparent error handling and scalability
29. (Non) automation via Web UI
Strengths of Web UI
Simple: Just clicking around
Discoverable: Exploration and experimenting
Abstract: Transparent error handling and scalability
Problems of Web UI
Only simple: Simple tasks are simple, hard tasks quickly get hard
No automation or batch operations: Clicking humans don't scale well
30. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
34. Is this production code?
How do we generalize to, say, 100 datasets?
35. Example workflow: Python bindings
# Now do it 100 times, serially
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s, "out_of_bag": True}))
    api.ok(train[i])
    model.append(api.create_model(train[i]))
    api.ok(model[i])
    api.ok(test[i])
    evaluation.append(api.create_evaluation(model[i], test[i]))
    api.ok(evaluation[i])
36. Example workflow: Python bindings
# More efficient if we parallelize, but at what level?
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s, "out_of_bag": True}))
    # Do we wait here?
    api.ok(train[i])
    api.ok(test[i])
for i in range(0, 100):
    model.append(api.create_model(train[i]))
    api.ok(model[i])
for i in range(0, 100):
    evaluation.append(api.create_evaluation(model[i], test[i]))
    api.ok(evaluation[i])
37. Example workflow: Python bindings
# More efficient if we parallelize, but at what level?
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s, "out_of_bag": True}))
for i in range(0, 100):
    # Or do we wait here?
    api.ok(train[i])
    model.append(api.create_model(train[i]))
for i in range(0, 100):
    # and here?
    api.ok(model[i])
    api.ok(test[i])
    evaluation.append(api.create_evaluation(model[i], test[i]))
    api.ok(evaluation[i])
38. Example workflow: Python bindings
# More efficient if we parallelize, but how do we handle errors?
train, test, model, evaluation = [], [], [], []
for i in range(0, 100):
    r, s = 0.8, i
    train.append(api.create_dataset(dataset, {"rate": r, "seed": s}))
    test.append(api.create_dataset(dataset, {"rate": r, "seed": s, "out_of_bag": True}))
for i in range(0, 100):
    api.ok(train[i])
    model.append(api.create_model(train[i]))
for i in range(0, 100):
    try:
        api.ok(model[i])
        api.ok(test[i])
        evaluation.append(api.create_evaluation(model[i], test[i]))
        api.ok(evaluation[i])
    except Exception:
        # How to recover if test[i] has failed? New datasets? Abort?
        pass
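One common client-side answer to both questions (where to wait, how to recover) is to make each train/test/model/evaluation chain an independent task and parallelize at the task level, so a failure only loses that one chain. A runnable sketch of the pattern; the `StubAPI` class is a hypothetical stand-in invented here so the structure can be shown without network access or real bindings:

```python
from concurrent.futures import ThreadPoolExecutor

class StubAPI:
    """Hypothetical stand-in for the real API bindings (no network calls)."""
    def create_dataset(self, parent, args=None):
        return {"resource": "dataset/stub", "args": args or {}}
    def create_model(self, dataset):
        return {"resource": "model/stub"}
    def create_evaluation(self, model, dataset):
        return {"resource": "evaluation/stub"}
    def ok(self, resource):
        return True  # the real call polls until the resource is finished

api = StubAPI()
dataset = {"resource": "dataset/parent"}

def evaluate_split(seed):
    """Build one whole chain; any failure stays contained to this task."""
    args = {"rate": 0.8, "seed": seed}
    train = api.create_dataset(dataset, args)
    test = api.create_dataset(dataset, dict(args, out_of_bag=True))
    api.ok(train)
    model = api.create_model(train)
    api.ok(model)
    api.ok(test)
    evaluation = api.create_evaluation(model, test)
    api.ok(evaluation)
    return evaluation

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(evaluate_split, i) for i in range(100)]
    results, failures = [], []
    for future in futures:
        try:
            results.append(future.result())
        except Exception as exc:
            failures.append(exc)  # recover, retry, or abort per chain
```

Even with this structure, all the scheduling, retry, and bookkeeping logic lives in the client, which is exactly the complexity the next slides object to.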
39. Client-side Machine Learning Automation
Problems of bindings-based, client solutions
Complexity: Lots of details outside the problem domain
Reuse: No inter-language compatibility
Scalability: Client-side workflows are hard to optimize
Reproducibility: Noisy, complex and hard to audit development environment
Not enough abstraction
41. Rich, parameterized workflows: cross-validation
# --dataset takes a parameterized input;
# --k-folds sets the number of folds during validation
bigmler analyze --cross-validation \
        --dataset $(cat output/diabetes/dataset) \
        --k-folds 3 \
        --output-dir output/diabetes-validation
42. Client-side Machine Learning automation
Problems of client-side solutions
Hard to generalize: Declarative client tools hide complexity at the cost of flexibility
Hard to combine: Black–box tools cannot be easily integrated as parts of bigger client–side workflows
Hard to audit: Client–side development environments are complex and very hard to sandbox
Not enough automation
43. Client-side Machine Learning automation
Problems of client-side solutions
Complex: Too fine-grained, leaky abstractions
Cumbersome: Error handling, network issues
Hard to reuse: Tied to a single programming language
Hard to scale: Parallelization again a problem
Hard to generalize: Declarative client tools hide complexity at the cost of flexibility
Hard to combine: Black–box tools cannot be easily integrated as parts of bigger client–side workflows
Hard to audit: Client–side development environments are complex and very hard to sandbox
Not enough abstraction
44. Client-side Machine Learning automation
Problems of client-side solutions
Complex: Too fine-grained, leaky abstractions
Cumbersome: Error handling, network issues
Hard to reuse: Tied to a single programming language
Hard to scale: Parallelization again a problem
Hard to generalize: Declarative client tools hide complexity at the cost of flexibility
Hard to combine: Black–box tools cannot be easily integrated as parts of bigger client–side workflows
Hard to audit: Client–side development environments are complex and very hard to sandbox
The algorithmic complexity and computing resources management problems that were "mostly washed away" are back!
45. Client-side Machine Learning automation
Algorithmic complexity and computing resources management problems are back!
46. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
52. In a Nutshell
1. Workflows reified as server–side, RESTful resources
2. Domain–specific language for ML workflow automation
53. Workflows as RESTful Resources
Library: Reusable building block: a collection of WhizzML definitions that can be imported by other libraries or scripts.
Script: Executable code that describes an actual workflow.
• Imports: List of libraries with code used by the script.
• Inputs: List of input values that parameterize the workflow.
• Outputs: List of values computed by the script and returned to the user.
Execution: Given a script and a complete set of inputs, the workflow can be executed and its outputs generated.
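The rough anatomy of such a script resource and of an execution request can be sketched as JSON; the field names below are illustrative of the Library/Script/Execution model, not an exact API schema, and the `script/...` and `dataset/...` IDs are placeholders:

```python
# Illustrative only: a script declares its code, imports, inputs and outputs,
# and an execution pairs a script with a complete set of input values.
script = {
    "source_code": "(create-model dataset-id)",
    "imports": [],                                  # libraries the script uses
    "inputs": [{"name": "dataset-id", "type": "dataset-id"}],
    "outputs": [{"name": "model-id", "type": "model-id"}],
}
execution_request = {
    "script": "script/...",                         # placeholder script ID
    "inputs": [["dataset-id", "dataset/..."]],      # complete set of inputs
}
```

Because both objects are themselves RESTful resources, workflows gain the same traceability and immutability as datasets and models.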
59. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
61. Syntactic Abstraction in WhizzML: Simple workflow
;; ML artifacts are first-class citizens,
;; we only need to talk about our domain
(let ([train-id test-id] (create-dataset-split id 0.8)
      model-id (create-model train-id))
  (create-evaluation test-id
                     model-id
                     {"name" "Evaluation 80/20"
                      "missing_strategy" 0}))
62. Syntactic Abstraction in WhizzML: Simple workflow
;; ML artifacts are first-class citizens,
;; we only need to talk about our domain
(let ([train-id test-id] (create-dataset-split id 0.8)
      model-id (create-model train-id))
  (create-evaluation test-id
                     model-id
                     {"name" "Evaluation 80/20"
                      "missing_strategy" 0}))
Ready for production!
63. Domain Specificity and Scalability: Trivial parallelization
;; Workflow for 1 resource
(let ([train-id test-id] (create-dataset-split id 0.8)
      model-id (create-model train-id))
  (create-evaluation test-id model-id))
64. Domain Specificity and Scalability: Trivial parallelization
;; Workflow for arbitrary number of resources
(let (splits (for (id input-datasets)
               (create-dataset-split id 0.8)))
  (for (split splits)
    (create-evaluation (create-model (split 0)) (split 1))))
65. Domain Specificity and Scalability: Trivial parallelization
;; Workflow for arbitrary number of resources
(let (splits (for (id input-datasets)
               (create-dataset-split id 0.8)))
  (for (split splits)
    (create-evaluation (create-model (split 0)) (split 1))))
Ready for production!
69. Outline
1 ML as a system service
2 ML as a RESTful cloudy service
3 Machine Learning workflows
4 Client–side automation
5 Server–side workflow automation
6 A first taste of WhizzML: abstraction is back
7 And back to the (distributed) client: BigMLOps
70. Package and deploy BigML workflows in a few clicks
1 Create an Application
2 Connect to BigML and add Workflows and models
3 Package everything in a container
4 Deploy and monitor your application