DutchMLSchool. Introduction to Machine Learning with the BigML Platform

1st edition | July 8-11, 2019
1

BigML, Inc #DutchMLSchool
Introduction to BigML
Making Machine Learning Beautifully Simple
2
Poul Petersen
CIO, BigML, Inc

Sampling the Audience
3
Expert: Published papers at KDD, ICML, NIPS, etc. or
developed own ML algorithms used at large scale
Aficionado: Understands pros/cons of different
techniques and/or can tweak algorithms as needed
Practitioner: Very familiar with ML packages (Weka,
Scikit, BigML, etc.)
Newbie: Just taking a Coursera ML class or reading an
introductory book on ML
Absolute beginner: ML sounds like science fiction

A Present for You
4

Free 1-Month PRO Subscription
5
https://bigml.com/accounts/register/
dutchmlschool

A Brief History of BigML
6
• BigML Mission: To make Machine
Learning Beautifully Simple
• BigML Founded in Corvallis,
Oregon in 2011 - long before ML
was "cool"
• You’ve never heard of it?
• Most innovative city in the United
States!

A Brief History of BigML
7

BigML Platform
8
Web-based Frontend, Visualizations
Tools - https://bigml.com/tools
REST API - https://bigml.com/api
Distributed Machine Learning Backend: SOURCE SERVER, DATASET SERVER, MODEL SERVER, PREDICTION SERVER, EVALUATION SERVER, SAMPLE SERVER, WHIZZML SERVER
Smart Infrastructure (auto-deployable, auto-scalable): SERVERS, EVENTS, GEARMAN QUEUE, MESSAGE QUEUE, RUNQUEUE SCALER, BUSY SCALER, AWS COSTS, DESIRED TOPOLOGY, AUTO TOPOLOGY, ACTUAL TOPOLOGY

BigML Platform
9
Same architecture as the previous slide (Web-based Frontend, Tools, REST API, Distributed Machine Learning Backend, Smart Infrastructure), with one addition: On-Premises

Machine Learning Motivation
10
Imagine:
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
What Next?

Machine Learning Motivation
11
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How often has the expert been right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

Data vs Expert
12
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT    SOLD      PREDICT
2424    360000    400262
1785    307500    320195
1003    185000    222211
4135    600000    614651
1676    328500    306538
1012    247000    223339
3352    420000    516541
2825    435350    450508
PRICE = 125.3*SQFT + 96535
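
The PREDICT column can be reproduced by applying the slide's line to each SQFT value. The snippet below is a minimal sketch: the data and coefficients are copied from the slide, while the function name and the mean-absolute-error summary are illustrative additions.

```python
# (SQFT, SOLD) pairs copied from the slide's table.
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]

def predict_price(sqft):
    """Apply the slide's line: PRICE = 125.3*SQFT + 96535."""
    return 125.3 * sqft + 96535

predictions = [predict_price(sqft) for sqft, _ in SALES]

# Mean absolute error between the line's prediction and the actual sale price.
mae = sum(abs(p - sold) for p, (_, sold) in zip(predictions, SALES)) / len(SALES)

for (sqft, sold), predicted in zip(SALES, predictions):
    print(f"{sqft:>5} sqft  sold={sold:>7}  predicted={predicted:>9.0f}")
print(f"mean absolute error: {mae:.0f}")
```

Comparing each prediction with the actual sale price is one simple way to validate the answer, which is exactly what was hard to do with a human expert.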

Data vs Expert
13
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How often has the expert been right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics

Data vs Expert
14
Replace the expert with data
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT SOLD
2424 360000
1785 307500
1003 185000
4135 600000
1676 328500
1012 247000
3352 420000
2825 435350
PRICE = 125.3*SQFT + 96535

More Data!
15
SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44.594828   -123.269328  360000
1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44.643876   -123.238189  307500
1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44.593704   -123.295424  185000
4135  5     3.5    4748 NW Veronica    Suncrest           6098      2004        3              44.5929659  -123.306916  600000
1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44.5945279  -123.291523  328500
1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44.591476   -123.262841  247000
3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44.579439   -123.333888  420000
2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44.570883   -123.272113  435350
Uhhhh……..
• Can we still fit a line to 10 variables? (well, yes)
• Will fitting a line give good results? (unlikely)
• What about those text fields and categorical values?

Modeling Home Prices
16

What just happened?
17
Home
Data
Square Feet?
Location?
Model Prediction:
Price=418K

Some Terminology…
18
Home
Data
Model Prediction:
Price=418K
Training
Data
• Modeling
• Clustering
• Anomaly Detection
• Association Discovery
ML
Resource
ML
Platform
“Consume” the model
or
“put into production”
• Dashboard
• Custom Application
• Wearable / Edge device
• Batch Process

Model Choices
19
• A single Decision Tree was easy to understand, but could we
build something stronger?
• There are actually hundreds of algorithms…
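
To make the "easy to understand" point concrete, here is a minimal sketch of the smallest possible regression tree: a single split on SQFT, chosen to minimize squared error. The data comes from the earlier "Data vs Expert" slide; the function names are illustrative, and this is not BigML's tree implementation.

```python
# (SQFT, SOLD) pairs copied from the "Data vs Expert" slide.
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean of `values`."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows):
    """Return (threshold, left_mean, right_mean) minimizing total squared error."""
    rows = sorted(rows)
    best = None
    for i in range(1, len(rows)):
        left = [price for _, price in rows[:i]]
        right = [price for _, price in rows[i:]]
        # Candidate threshold: midpoint between neighbouring SQFT values.
        threshold = (rows[i - 1][0] + rows[i][0]) / 2
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]

def predict(stump, sqft):
    """Read the prediction off the tree: one question, two possible answers."""
    threshold, left_mean, right_mean = stump
    return left_mean if sqft < threshold else right_mean

stump = best_split(SALES)
print("IF SQFT < %.1f THEN %.0f ELSE %.0f" % stump)
```

The resulting model is a single readable rule, which is why a tree is easy to explain. A full decision tree repeats the same split search inside each branch and over every available field.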

Model Choices
20

Model Choices
21
• A single Decision Tree was easy to understand, but could we
build something stronger?
• There are actually hundreds of algorithms…
• BigML carefully implements the best in terms of interpretability
and the ability to work with real-world data:
• Linear Regression
• Logistic Regression
• Single Decision Trees
• Decision Forest / Random Decision Forest
• Boosted Trees
• Deepnets (wait - those are hard, right?)

Deepnets are Hard, Right?
22
x1 x2 x3 x4        Inputs (4 Features)
h1 h2 h3 h4 h5     Hidden layer
h1 h2 h3 h4 h5     Hidden layer
h1 h2 h3 h4 h9     Hidden layer….
y1 y2 y3           Outputs (3 Classes)
h1 = activation?(wx, x) ?
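
The slide leaves the exact form of a hidden unit as an open question. A common convention, assumed here, is that each unit applies an activation function to a weighted sum of its inputs plus a bias. The sketch below runs one forward pass through a network shaped like the diagram (4 features, 5 hidden units, 3 classes). All weights, the ReLU activation and the softmax output are illustrative assumptions, not BigML's Deepnet internals.

```python
import math

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every unit."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 4 features -> 5 hidden units -> 3 classes.
W_HIDDEN = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
B_HIDDEN = [0.0] * 5
W_OUT = [[0.05 * (k - j) for j in range(5)] for k in range(3)]
B_OUT = [0.0] * 3

def forward(x):
    hidden = dense(x, W_HIDDEN, B_HIDDEN, relu)
    scores = dense(hidden, W_OUT, B_OUT, lambda z: z)  # raw class scores
    return softmax(scores)

probabilities = forward([1.0, 0.5, -0.5, 2.0])
print(probabilities)
```

Training a network means finding good values for every entry of the weight and bias lists, and the number of layers and units must be chosen before training starts. That is where the difficulty lies.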

BigML Deepnets
23
• The success of a Deepnet is dependent on getting the right
network structure for the dataset
• But, there are too many parameters:
• Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)

Model Choices
24

Choosing the Algorithm
25
[Chart]
Axes: Decreasing Interpretability / Better Representation / Longer Training; Increasing Data Size / Complexity
Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance)
Models: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets
STILL TOO HARD?

OptiML
26
• Each resource has several parameters that impact quality
• Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal
parameters
• Why not make the model type, Decision Tree, Boosted Tree,
etc, a parameter as well?
• Similar to Deepnet network search, but finds the optimum
machine learning algorithm and parameters for your data
automatically
• Outputs the top performing algorithms and parameters for your
data… Why use just one “best” result?
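
The idea of treating the model type as just another parameter can be sketched as a search loop: fit every candidate, score each one on held-out data, and return a ranked list rather than a single winner. The example below is a toy illustration on the eight sales from the "Data vs Expert" slide. The candidate set, the stump thresholds and the leave-one-out scoring are assumptions made for the sketch, and OptiML's actual search strategy is not shown here.

```python
# (SQFT, SOLD) pairs copied from the "Data vs Expert" slide.
SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]

def fit_mean(rows):
    """Baseline: always predict the average sale price."""
    avg = sum(y for _, y in rows) / len(rows)
    return lambda sqft: avg

def fit_line(rows):
    """Ordinary least squares line through the rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sxx
    return lambda sqft: my + slope * (sqft - mx)

def fit_stump(threshold):
    """One-split tree; the threshold plays the role of a tunable parameter."""
    def fit(rows):
        overall = sum(y for _, y in rows) / len(rows)
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        left_mean = sum(left) / len(left) if left else overall
        right_mean = sum(right) / len(right) if right else overall
        return lambda sqft: left_mean if sqft < threshold else right_mean
    return fit

def loo_mae(fit, rows):
    """Leave-one-out mean absolute error: train on all rows but one, test on it."""
    errors = []
    for i, (x, y) in enumerate(rows):
        model = fit(rows[:i] + rows[i + 1:])
        errors.append(abs(model(x) - y))
    return sum(errors) / len(errors)

# Hypothetical search space: model type and parameters side by side.
CANDIDATES = {
    "mean": fit_mean,
    "line": fit_line,
    "stump@2000": fit_stump(2000),
    "stump@3000": fit_stump(3000),
}

ranking = sorted((loo_mae(fit, SALES), name) for name, fit in CANDIDATES.items())
for error, name in ranking:
    print(f"{name:<12} leave-one-out MAE = {error:,.0f}")
```

Keeping the whole ranking instead of only the top entry is what makes the next step possible: the best few models can be combined.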

Fusions
27
• Similar to an Ensemble, but we can mix different model types
• Logistic Regression, plus a Deepnet for example
• You can also create a fusion with different training sets!
• Last week, plus last month data, etc
• Or a Fusion of OptiML models
• Combines the “best of the best”
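
For a numeric target such as sale price, mixing model types can be pictured as averaging their predictions. The sketch below assumes that simple (optionally weighted) averaging. The first model is the line from the "Data vs Expert" slide, the second is a made-up one-split tree, and `fuse` is a hypothetical helper rather than a BigML API call.

```python
def linear_model(sqft):
    # The line from the "Data vs Expert" slide.
    return 125.3 * sqft + 96535

def stump_model(sqft):
    # Hypothetical one-split tree; threshold and leaf values are made up.
    return 290000.0 if sqft < 2600 else 480000.0

def fuse(models, weights=None):
    """Return a predictor that takes the weighted average of `models`."""
    weights = weights or [1.0] * len(models)
    total = sum(weights)

    def predict(sqft):
        return sum(w * m(sqft) for w, m in zip(weights, models)) / total

    return predict

fusion = fuse([linear_model, stump_model])
print(fusion(2000))
```

Because `fuse` only requires that each member can produce a prediction, the members can be different algorithms or the same algorithm trained on different time windows.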

OptiML & Fusions
28

ML Workflows
29
[Workflow diagram elements: SOLD HOMES, FORSALE HOMES, DATASET, FILTER, NEW FEATURES, MODEL, BATCH PREDICTION, DEALS DATASET]
• Real-world ML Applications
are workflows!
• Often requires
unsupervised learning!

Let’s build a recommender
30
Typical way to shop for a home…

Recommender Idea
31
[Diagram: a Sample of homes (? ? ? ?) drawn from All Homes For Sale yields Preference Data, which trains a Preference Model]
… then use the Preference Model to
filter all the homes on the market

Recommender Problem #1
32
What if there are really unusual homes in the data?
• A mansion with 20 bathrooms
• A home with no bedrooms
• A lot size that is smaller than the home?
We don’t want to show these as suggestions
because they are unusual…. How do we detect
anomalies?

Anomaly Detection
33

What just happened?
34
• We wanted to find and remove unusual houses.
• We created an Anomaly Detector and examined
the top anomalies.
• We found some unusual houses to remove and
discovered bad data (missing values) that we want
to fix.

A clever way to fix missing data
35
Let’s use Machine Learning…
SQFT    PRICE        BEDS       BATHS
3,125   US$530,000   5          3
2,100   US$460,000   (missing)  2
1,200   US$250,000   3          (missing)
3,950   US$610,000   6          4
Predicted by a model for each field: BEDS = 4, BATHS = 1.5

WhizzML
36

What just happened?
37
• We had a Dataset with missing values.
• We wanted to apply an algorithm to fix the missing
values with Machine Learning
• Rather than write the algorithm, we found what we
needed in the WhizzML public gallery.
• Now that we have cloned the Script we can use it
again and again.
• We can write new ones too!

Recommender Problem #2
38
• How can we avoid showing essentially the
same house over and over?
All Homes
?
?
?
Sample
Modern

39
• How can we avoid showing essentially the
same house over and over?
All Homes
Modern
Lots of
Land
• Great! What if we don’t know how to group
them? Or how many groups?
?
sample
?
sample

Clustering
40

What just happened?
41
• Since we don’t know how many groups of homes
there should be, we used G-means Clustering to find
the optimum number of groups of homes
• Our recommender will use these groups to create a
better sampling for user preference
• We also tried to understand the home clusters using
“model clusters” but the models were difficult to
interpret
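
G-means builds on k-means and adds a statistical test to decide how many groups to keep. The sketch below shows only the simpler building block, plain k-means with a fixed number of groups, so the grouping step is visible. The four homes are taken from the earlier missing-data slide (with the filled-in BEDS value), and the feature scaling and starting centroids are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with a fixed k = len(centroids)."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids

# (SQFT in thousands, BEDS) for four homes; the scaling is an assumption.
HOMES = [(3.125, 5), (2.1, 4), (1.2, 3), (3.95, 6)]

centroids = kmeans(HOMES, centroids=[HOMES[2], HOMES[3]])
assignments = [nearest(home, centroids) for home in HOMES]
print(centroids, assignments)
```

Sampling one home per group, instead of sampling from all homes at once, is what keeps the recommender from showing near-identical houses.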

Understanding Clusters Better
42
What if we could get rules like…
IF SQFT >= 3,125 THEN “Cluster 1”
SQFT    PRICE        BEDS  BATHS  CLUSTER
3,125   US$530,000   5     3      Cluster 1
2,100   US$460,000   4     2      Cluster 3
1,200   US$250,000   3     1.5    Cluster 5
3,950   US$610,000   6     4      Cluster 1

Association Discovery
43

What just happened?
44
• We used a Batch Centroid to add the Cluster
assignment of each home as a feature to the Dataset
• We used Association Discovery to find “interesting”
relationships between the features, including the Cluster
assignment
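
What makes a relationship "interesting" is usually expressed with a few standard measures. The sketch below computes three of them (support, confidence and lift) for the example rule "IF SQFT >= 3,125 THEN Cluster 1" on the four homes from the previous slide. The helper name is illustrative, and this is a hand calculation rather than BigML's Association Discovery.

```python
# SQFT and CLUSTER columns from the "Understanding Clusters Better" slide.
HOMES = [
    {"sqft": 3125, "cluster": "Cluster 1"},
    {"sqft": 2100, "cluster": "Cluster 3"},
    {"sqft": 1200, "cluster": "Cluster 5"},
    {"sqft": 3950, "cluster": "Cluster 1"},
]

def rule_metrics(rows, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(rows)
    both = sum(1 for r in rows if antecedent(r) and consequent(r))
    ante = sum(1 for r in rows if antecedent(r))
    cons = sum(1 for r in rows if consequent(r))
    support = both / n                          # how often the whole rule holds
    confidence = both / ante if ante else 0.0   # how often THEN follows IF
    lift = confidence / (cons / n) if cons else 0.0  # versus chance
    return {"support": support, "confidence": confidence, "lift": lift}

metrics = rule_metrics(
    HOMES,
    antecedent=lambda r: r["sqft"] >= 3125,
    consequent=lambda r: r["cluster"] == "Cluster 1",
)
print(metrics)
```

On these four rows the rule has support 0.5, confidence 1.0 and lift 2.0: every home of at least 3,125 square feet is in Cluster 1, twice the rate expected by chance.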

45
There is much more interesting information than just the
number of BEDS, BATHS, etc.
• Unfortunately, these "remarks" are not available in the
Redfin download
• Adding them to our dataset requires crawling the
website
• Like most ML projects, preparing the data is 80% of
the difficulty (fortunately I already did it!)

Topic Modeling
46

What just happened?
47
• We extended the home dataset with the syndicated
remarks text field
• We built a model to predict sale price and explored how
key words discovered in the remarks impacted price
• We used topic modeling to create a deeper thematic
understanding of the remarks
• Homes that are "in-town" or "out-of-town"
• We extended the dataset with fields that represent how
related each home is to each of these topics
• This will allow our clustering to group homes by a deeper
meaning than just BEDS, BATHS, etc.
• Is there a better way to capture “locality”?

Idea: Better Feature
48
Worth More
Worth Less

A Better Feature for Home Prices
49
LATITUDE    LONGITUDE     REFERENCE LATITUDE  REFERENCE LONGITUDE  Distance (m)
44.583      -123.296775   44.5638             -123.2794            700
44.604414   -123.296129   44.5638             -123.2794            30.4
44.600108   -123.29707    44.5638             -123.2794            19.38
44.603077   -123.295004   44.5638             -123.2794            37.8
44.589587   -123.301154   44.5638             -123.2794            23.39

Haversine Formula
50
https://en.wikipedia.org/wiki/Haversine_formula
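
A minimal sketch of the haversine distance in plain Python follows. It uses the reference point from the previous slide's table, and the mean Earth radius of 6,371 km is an assumption. This is not the WhizzML/Flatline script used in the demo.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Reference point taken from the table on the previous slide.
REFERENCE = (44.5638, -123.2794)

def distance_from_reference(lat, lon):
    return haversine_m(lat, lon, *REFERENCE)

print(round(distance_from_reference(44.583, -123.296775)))
```

The new feature is this function evaluated once per row, so it can be recomputed for fresh data or for another city by changing only the reference point.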

Feature Engineering
51

What just happened?
52
• We wanted to create a new feature “distance from OSU”
• This is possible with Flatline, a DSL for feature engineering
• Rather than writing the code for the coordinate
transformation, we found a ready-made script shared in
the WhizzML gallery
• We cloned the script and transformed the dataset
• This can be easily repeated with new datasets: fresh data
or different cities

Recommender Idea
53
?
?
Modern
Lots of
Land
Small
?
?
?
?
Preference
Model
Preference
Data

House Recommender
54

