Introduction to Machine Learning with the BigML Platform - ML for Executives Course.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
3. BigML, Inc #DutchMLSchool
Sampling the Audience
Expert: Published papers at KDD, ICML, NIPS, etc., or
developed own ML algorithms used at large scale
Aficionado: Understands pros/cons of different
techniques and/or can tweak algorithms as needed
Practitioner: Very familiar with ML packages (Weka,
Scikit, BigML, etc.)
Newbie: Just taking Coursera ML class or reading an
introductory book to ML
Absolute beginner: ML sounds like science fiction
6. BigML, Inc #DutchMLSchool
A Brief History of BigML
• BigML Mission: To make Machine
Learning Beautifully Simple
• BigML was founded in Corvallis,
Oregon in 2011, long before ML
was "cool"
• You’ve never heard of it?
• Most innovative city in the United
States!
8. BigML, Inc #DutchMLSchool
BigML Platform
[Architecture diagram]
Web-based Frontend: Visualizations
Distributed Machine Learning Backend: SOURCE, DATASET, MODEL, PREDICTION, EVALUATION, SAMPLE, and WHIZZML servers
Tools - https://bigml.com/tools
REST API - https://bigml.com/api
Smart Infrastructure (auto-deployable, auto-scalable): servers, events, Gearman queue, message queue, runqueue scaler, busy scaler, AWS costs, desired / auto / actual topology
9. BigML, Inc #DutchMLSchool
BigML Platform
[Architecture diagram repeated from the previous slide, now also labeled On-Premises]
10. BigML, Inc #DutchMLSchool
Machine Learning Motivation
Imagine:
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
What Next?
11. BigML, Inc #DutchMLSchool
Machine Learning Motivation
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
  • Experience with similar properties?
  • Do they consider all relevant variables?
  • Knowledge of market up to date?
• Hard to validate answer:
  • How many times has the expert been right / wrong?
  • Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
12. BigML, Inc #DutchMLSchool
Data vs Expert
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
PRICE = 125.3*SQFT + 96535
SQFT  SOLD    PREDICT
2424  360000  400262
1785  307500  320195
1003  185000  222211
4135  600000  614651
1676  328500  306538
1012  247000  223339
3352  420000  516541
2825  435350  450508
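A minimal Python sketch of how the PREDICT column can be reproduced, assuming it was produced by applying the PRICE formula to each SQFT value. The names predict_price and mean_absolute_error are illustrative and not part of BigML. The slide's coefficients are rounded, so a result may differ from the slide by about a dollar.

```python
# Past sales from the slide: (SQFT, SOLD).
PAST_SALES = [
    (2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
    (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350),
]


def predict_price(sqft):
    """Apply the slide's fitted line: PRICE = 125.3*SQFT + 96535."""
    return 125.3 * sqft + 96535


def mean_absolute_error(rows):
    """Average absolute gap between predicted and actual sale price."""
    return sum(abs(predict_price(sqft) - sold) for sqft, sold in rows) / len(rows)


if __name__ == "__main__":
    # Print SQFT, SOLD, and the predicted price side by side.
    for sqft, sold in PAST_SALES:
        print(sqft, sold, round(predict_price(sqft)))
    print("MAE:", round(mean_absolute_error(PAST_SALES)))
```

Comparing PREDICT against SOLD in this way is also how the data-driven answer can be validated, which is hard to do for an expert.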
13. BigML, Inc #DutchMLSchool
Data vs Expert
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
  • Experience with similar properties?
  • Do they consider all relevant variables?
  • Knowledge of market up to date?
• Hard to validate answer:
  • How many times has the expert been right / wrong?
  • Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
14. BigML, Inc #DutchMLSchool
Data vs Expert
Replace the expert with data
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT SOLD
2424 360000
1785 307500
1003 185000
4135 600000
1676 328500
1012 247000
3352 420000
2825 435350
PRICE = 125.3*SQFT + 96535
15. BigML, Inc #DutchMLSchool
More Data!
SQFT  BEDS  BATHS  ADDRESS             LOCATION           LOT SIZE  YEAR BUILT  PARKING SPOTS  LATITUDE    LONGITUDE    SOLD
2424  4     3      1522 NW Jonquil     Timberhill SE 2nd  5227      1991        2              44.594828   -123.269328  360000
1785  3     2      7360 NW Valley Vw   Country Estates    25700     1979        2              44.643876   -123.238189  307500
1003  2     1      2620 NW Chinaberry  Tamarack Village   4792      1978        2              44.593704   -123.295424  185000
4135  5     3.5    4748 NW Veronica    Suncrest           6098      2004        3              44.5929659  -123.306916  600000
1676  3     2      2842 NW Monterey    Corvallis          8712      1975        2              44.5945279  -123.291523  328500
1012  3     1      2320 NW Highland    Corvallis          9583      1959        2              44.591476   -123.262841  247000
3352  4     3      1205 NW Ridgewood   Ridgewood 2        60113     1975        2              44.579439   -123.333888  420000
2825  3            411 NW 16th         Wilkins Addition   4792      1938        1              44.570883   -123.272113  435350
Uhhhh……..
• Can we still fit a line to 10 variables? (well, yes)
• Will fitting a line give good results? (unlikely)
• What about those text fields and categorical values?
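One common way to make a categorical field such as LOCATION usable by a numeric model is one-hot encoding. The sketch below is a generic illustration in Python; the slides do not say how BigML handles categorical fields internally, and the helper name one_hot is an assumption.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Three LOCATION values taken from the table above.
locations = ["Suncrest", "Corvallis", "Corvallis"]
categories, encoded = one_hot(locations)
# categories -> ['Corvallis', 'Suncrest']
# encoded    -> [[0, 1], [1, 0], [1, 0]]
```

Free-text fields such as ADDRESS need a different treatment, since nearly every value is unique.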
18. BigML, Inc #DutchMLSchool
Some Terminology…
[Diagram labels: Training Data, ML Platform, ML Resource, Model, Home Data, Prediction: Price=418K]
• Modeling
• Clustering
• Anomaly Detection
• Association Discovery
“Consume” the model or “put into production”:
• Dashboard
• Custom Application
• Wearable / Edge device
• Batch Process
21. BigML, Inc #DutchMLSchool
Model Choices
• A single Decision Tree was easy to understand, but could we
build something stronger?
• There are actually hundreds of algorithms…
• BigML carefully implements the best in terms of interpretability
and the ability to work with real-world data:
• Linear Regression
• Logistic Regression
• Single Decision Trees
• Decision Forest / Random Decision Forest
• Boosted Trees
• Deepnets (wait - those are hard, right?)
23. BigML, Inc #DutchMLSchool
BigML Deepnets
• The success of a Deepnet is dependent on getting the right
network structure for the dataset
• But, there are too many parameters:
• Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution: Metalearning (a good initial guess)
• Solution: Network search (try a bunch)
25. BigML, Inc #DutchMLSchool
Choosing the Algorithm
[Chart: algorithm choice by project stage]
Axis: Decreasing Interpretability / Better Representation / Longer Training
Axis: Increasing Data Size / Complexity
Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance)
Algorithms plotted: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets
STILL TOO HARD?
26. BigML, Inc #DutchMLSchool
OptiML
• Each resource has several parameters that impact quality
• Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal
parameters
• Why not make the model type, Decision Tree, Boosted Tree,
etc, a parameter as well?
• Similar to Deepnet network search, but finds the optimum
machine learning algorithm and parameters for your data
automatically
• Outputs the top performing algorithms and parameters for your
data… Why use just one “best” result?
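The idea of treating the model type as one more parameter can be sketched in a few lines of Python. This is not OptiML's actual search procedure, which the slides do not describe; it is an exhaustive search over three hand-picked candidates, scored by leave-one-out mean absolute error on the past-sales data from the earlier slide. All function names and the candidate list are assumptions made for illustration.

```python
from statistics import mean

# (SQFT, SOLD) pairs from the past-sales slide.
SALES = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
         (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]

# Each candidate is (model type, parameters): the model type is searched too.
CANDIDATES = [
    ("mean", {}),
    ("linear", {"intercept": True}),
    ("linear", {"intercept": False}),
]


def fit(model_type, params, rows):
    """Train one candidate and return a predict(sqft) function."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    if model_type == "mean":
        average = mean(ys)
        return lambda sqft: average
    if model_type == "linear":
        if params["intercept"]:
            mx, my = mean(xs), mean(ys)
            slope = (sum((x - mx) * (y - my) for x, y in rows)
                     / sum((x - mx) ** 2 for x in xs))
            offset = my - slope * mx
        else:
            # Least-squares line forced through the origin.
            slope = sum(x * y for x, y in rows) / sum(x * x for x in xs)
            offset = 0.0
        return lambda sqft: slope * sqft + offset
    raise ValueError(f"unknown model type: {model_type}")


def loo_mae(model_type, params, rows):
    """Leave-one-out mean absolute error: train on all rows but one, test on it."""
    errors = []
    for i, (x, y) in enumerate(rows):
        predict = fit(model_type, params, rows[:i] + rows[i + 1:])
        errors.append(abs(predict(x) - y))
    return mean(errors)


def search(rows, top_k=2):
    """Score every candidate and return the top_k as (error, type, params)."""
    scored = [(loo_mae(t, p, rows), t, p) for t, p in CANDIDATES]
    scored.sort(key=lambda item: item[0])
    return scored[:top_k]


results = search(SALES)
```

Returning the top few candidates rather than a single winner matches the last bullet above: several good results can be kept and combined later.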
27. BigML, Inc #DutchMLSchool
Fusions
• Similar to an Ensemble, but we can mix different model types
• Logistic Regression, plus a Deepnet for example
• You can also create a fusion with different training sets!
• Last week’s data plus last month’s data, etc.
• Or a Fusion of OptiML models
• Combines the “best of the best”
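A rough Python sketch of the idea of mixing model types, assuming the combined prediction is a plain average of the individual predictions (the slides do not state how a Fusion combines its members). The first model is the fitted line from the earlier slide; nearest_sale_model is a hypothetical second model invented for this example.

```python
# (SQFT, SOLD) pairs from the past-sales slide.
PAST_SALES = [(2424, 360000), (1785, 307500), (1003, 185000), (4135, 600000),
              (1676, 328500), (1012, 247000), (3352, 420000), (2825, 435350)]


def linear_model(sqft):
    """The fitted line from the earlier slide."""
    return 125.3 * sqft + 96535


def nearest_sale_model(sqft):
    """Hypothetical second model: price of the past sale closest in size."""
    _, sold = min(PAST_SALES, key=lambda row: abs(row[0] - sqft))
    return sold


def fusion_predict(models, sqft):
    """Combine different model types by averaging their predictions."""
    predictions = [model(sqft) for model in models]
    return sum(predictions) / len(predictions)


estimate = fusion_predict([linear_model, nearest_sale_model], 2000)
```

The same fusion_predict function would work unchanged for models trained on different training sets, since it only needs each member's prediction.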
29. BigML, Inc #DutchMLSchool
ML Workflows
[Workflow diagram labels: SOLD HOMES, FILTER, NEW FEATURES, DATASET, MODEL; FORSALE HOMES, FILTER, NEW FEATURES, DATASET, BATCH PREDICTION, DEALS]
• Real-world ML Applications
are workflows!
• Often require
unsupervised learning!
31. BigML, Inc #DutchMLSchool
Recommender Idea
[Diagram labels: All Homes Forsale, Sample, Preference Data, Preference Model]
… then use the Preference Model to
filter all the homes on the market
32. BigML, Inc #DutchMLSchool
What if there are really unusual homes in the data?
• A mansion with 20 bathrooms
• A home with no bedrooms
• A lot size that is smaller than the home?
We don’t want to show these as suggestions
because they are unusual… How do we detect
anomalies?
34. BigML, Inc #DutchMLSchool
What just happened?
• We wanted to find and remove unusual houses.
• We created an Anomaly Detector and examined
the top anomalies.
• We found some unusual houses to remove and
discovered bad data (missing values) that we want
to fix.
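To make "examining the top anomalies" concrete, here is a small Python stand-in. It is not the Anomaly Detector used in the demo; it scores a single field with a modified z-score based on the median and the median absolute deviation (MAD). The BATHS values, the function names, and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def robust_scores(values):
    """Modified z-score of each value, based on the median and the MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to measure against, so nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * abs(v - center) / mad for v in values]


def top_anomalies(values, threshold=3.5):
    """Indices of values scoring above the threshold, most unusual first."""
    scores = robust_scores(values)
    flagged = [i for i, score in enumerate(scores) if score > threshold]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)


# Hypothetical BATHS values; the 20-bathroom mansion is at index 4.
baths = [2, 3, 1, 2.5, 20, 2, 3]
unusual = top_anomalies(baths)
```

A single-field score like this cannot catch combinations such as a lot smaller than the home, which is one reason to use a detector that looks at all fields together.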
35. BigML, Inc #DutchMLSchool
A clever way to fix missing data
Let’s use Machine Learning…
SQFT   PRICE       BEDS       BATHS
3,125  US$530,000  5          3
2,100  US$460,000  (missing)  2
1,200  US$250,000  3          (missing)
3,950  US$610,000  6          4
Predicted values: BEDS = 4, BATHS = 1.5
37. BigML, Inc #DutchMLSchool
What just happened?
• We had a Dataset with missing values.
• We wanted to apply an algorithm to fix the missing
values with Machine Learning
• Rather than write the algorithm, we found what we
needed in the WhizzML public gallery.
• Now that we have cloned the Script we can use it
again and again.
• We can write new ones too!
38. BigML, Inc #DutchMLSchool
Recommender Problem #2
• How can we avoid showing essentially the
same house over and over?
[Diagram labels: All Homes, Sample, Modern]
39. BigML, Inc #DutchMLSchool
Recommender Problem #2
• How can we avoid showing essentially the
same house over and over?
[Diagram: All Homes grouped into "Modern" and "Lots of Land", with a sample drawn from each group]
• Great! What if we don’t know how to group
them? Or how many groups?
41. BigML, Inc #DutchMLSchool
What just happened?
• Since we don’t know how many groups of homes
there should be, we used G-means Clustering to find
the optimum number of groups of homes
• Our recommender will use these groups to create a
better sampling for user preference
• We also tried to understand the home clusters using
“model clusters” but the models were difficult to
interpret
42. BigML, Inc #DutchMLSchool
Understanding Clusters Better
What if we could get rules like…
If SQFT >= 3,125 THEN “Cluster 1”
SQFT   PRICE       BEDS  BATHS  CLUSTER
3,125  US$530,000  5     3      Cluster 1
2,100  US$460,000  4     2      Cluster 3
1,200  US$250,000  3     1.5    Cluster 5
3,950  US$610,000  6     4      Cluster 1
44. BigML, Inc #DutchMLSchool
What just happened?
• We used a Batch Centroid to add the Cluster
assignment of each home as a feature to the Dataset
• We used Association Discovery to find “interesting”
relationships between the features, including the Cluster
assignment
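How "interesting" a rule is can be quantified with the standard support, confidence, and lift measures. The Python sketch below only scores one given rule, the SQFT >= 3,125 rule from the earlier slide, on the four example homes; Association Discovery itself searches for such rules automatically. The function name rule_metrics is an assumption.

```python
# The four example homes with their cluster assignment.
HOMES = [
    {"sqft": 3125, "cluster": "Cluster 1"},
    {"sqft": 2100, "cluster": "Cluster 3"},
    {"sqft": 1200, "cluster": "Cluster 5"},
    {"sqft": 3950, "cluster": "Cluster 1"},
]


def rule_metrics(rows, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    total = len(rows)
    both = sum(1 for row in rows if antecedent(row) and consequent(row))
    matches_if = sum(1 for row in rows if antecedent(row))
    matches_then = sum(1 for row in rows if consequent(row))
    support = both / total
    confidence = both / matches_if if matches_if else 0.0
    lift = confidence / (matches_then / total) if matches_then else 0.0
    return support, confidence, lift


support, confidence, lift = rule_metrics(
    HOMES,
    antecedent=lambda home: home["sqft"] >= 3125,
    consequent=lambda home: home["cluster"] == "Cluster 1",
)
# support -> 0.5, confidence -> 1.0, lift -> 2.0
```

A lift above 1 means the rule's condition makes the cluster more likely than it is across all homes.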
45. BigML, Inc #DutchMLSchool
Recommender Problem #3
There is much more interesting information in the listing
"remarks" than just the number of BEDS, BATHS, etc.
• Unfortunately, these "remarks" are not available in the
Redfin download
• Adding them to our dataset requires crawling the
website
• Like most ML projects, preparing the data is 80% of
the difficulty (fortunately I already did it!)
47. BigML, Inc #DutchMLSchool
What just happened?
• We extended the home dataset with the syndicated
remarks text field
• We built a model to predict sale price and explored how
key words discovered in the remarks impacted price
• We used topic modeling to create a deeper thematic
understanding of the remarks
• Homes that are "in-town" or "out-of-town"
• We extended the dataset with fields that represent how
related each home is to each of these topics
• This will allow our clustering to group homes by a deeper
meaning than just BEDS, BATHS, etc
• Is there a better way to capture “locality”?
52. BigML, Inc #DutchMLSchool
What just happened?
• We wanted to create a new feature “distance from OSU”
• This is possible with Flatline, a DSL for feature engineering
• Rather than writing the code for the coordinate
transformation, we found a ready-made script shared in
the WhizzML gallery
• We cloned the script and transformed the dataset
• This can be easily repeated with new datasets: fresh data
or different cities
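For readers who want to see what a "distance from OSU" feature involves, here is a Python analogue using the haversine formula. It is not the Flatline expression or the WhizzML script from the demo. The OSU coordinates below are approximate and are an assumption; the home coordinates come from the first row of the earlier data table.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Approximate location of the OSU campus in Corvallis (assumed values).
OSU_LAT, OSU_LON = 44.5638, -123.2794


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_from_osu(latitude, longitude):
    """New feature: distance in kilometers from a home to OSU."""
    return haversine_km(latitude, longitude, OSU_LAT, OSU_LON)


# First home in the data table: LATITUDE 44.594828, LONGITUDE -123.269328.
first_home_km = distance_from_osu(44.594828, -123.269328)
```

Swapping in another city's reference point is a one-line change, which is the sense in which the transformation can be repeated on new datasets.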