2. Social @tsunamiide tsunami.io Earthquake Enterprises
Two parts
Simple Clustering Algorithm
Using ML with Large Datasets
3.
Very elegant
Scales to large datasets
It is simple and easy to learn
Works with unsupervised data
4.
Competitive Analysis
Compare products from Company A with Company B by clustering them into groups
Semi-Structured Search Engine
Show different results to different users depending on how they are classified
▪ What Google thinks about you:
https://www.google.com/settings/ads/onweb/
5.
Multivariate data set (i.e. each row is a float[])
Classification is labeled
Not linearly separable
Popular for testing ML algorithms
7.
E.g. Classifying text documents
Charting no longer makes sense
Need to rely on derived metrics
9.
Many ML algorithms expect the features to be in the range [-1, 1] or [0, 1]
K-means will work with any range, but for many distance functions features with larger ranges will crowd out those with smaller ones
We can use this to emphasize some factors over others
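A minimal sketch of the scaling idea above, assuming min-max scaling to [0, 1] and plain Python lists of floats. The function name `min_max_scale` and the `weights` parameter are hypothetical, introduced only for illustration; the weights show how a deliberately larger range can emphasize one factor.

```python
def min_max_scale(rows, weights=None):
    """Rescale every column to [0, 1], then multiply by an optional per-column weight."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    if weights is None:
        weights = [1.0] * n_cols  # no emphasis: every column ends up in [0, 1]
    scaled = []
    for r in rows:
        out = []
        for j in range(n_cols):
            span = highs[j] - lows[j]
            # A constant column has zero span; map it to 0 to avoid dividing by zero.
            value = (r[j] - lows[j]) / span if span else 0.0
            out.append(value * weights[j])
        scaled.append(out)
    return scaled


# The second column has a far larger raw range and would dominate a Euclidean distance.
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = min_max_scale(rows)                          # both columns now span [0, 1]
emphasized = min_max_scale(rows, weights=[2.0, 1.0])  # first column now spans [0, 2]
```

With equal weights both columns contribute on the same scale; a weight above 1 stretches a column so it counts for more in the distance.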
10.
select the number of clusters (K)
select a seed for each cluster (centroid)
do {
    assign each item in the training set to the closest centroid
    update each centroid to the mean of its assigned items
} while (any of the centroids have moved)
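The loop above can be sketched in Python roughly as follows. This is an illustrative version, not the code used in the talk: it assumes Euclidean distance, seeds chosen by sampling random rows, and a `max_iter` safety cap that is not part of the slide's pseudocode. The names `k_means` and `euclidean` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two equal-length float vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(rows, k, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of float vectors."""
    rng = random.Random(seed)
    # Select a seed centroid for each cluster by picking k distinct rows.
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = []
    for _ in range(max_iter):  # assumed safety cap on the do/while loop
        # Assign each item to the index of its closest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(row, centroids[c]))
            for row in rows
        ]
        # Update each centroid to the mean of its assigned items.
        new_centroids = []
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep a centroid that lost all items
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, assignment


rows = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = k_means(rows, k=2)
```

On this toy data the first three rows end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random seeds.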
11.
Number of clusters is known (3)
Pick seeds by randomly selecting 3 rows from the dataset
We intentionally pick 3 close together for demonstration
12.
Number of clusters
Distance functions
Feature scaling
Datasets
E.g. the included abalone and breast cancer datasets
14.
Faster algorithms with more data will often beat slower algorithms with less data.
15.
Some algorithms do not scale well
e.g. Layered NN can take many days (not suited to tutorials)
ML algorithms need to be run repeatedly
Tuning hyper-parameters
K-fold cross validation
Feature discovery
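To make the repeated-run cost concrete, here is a minimal sketch of how K-fold cross validation splits a dataset, assuming contiguous folds over row indices with no shuffling. The helper name `k_fold_indices` is hypothetical; the model has to be trained once per fold, so a slow algorithm pays its training time K times over.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    # A real run would fit the model on `train` and score it on `test` here.
    print(len(train), "training rows,", len(test), "test rows")
```

Setting `k` equal to the number of rows gives leave-one-out cross validation, which needs as many training runs as there are rows.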
16.
Random Forest
Built in, popular and effective
Leave one out
My preferred
17.
Use a fast algorithm for factor discovery
Use a slow algorithm for final solution
Many competitions are won by starting the slow algorithm as soon as possible