Slides from a lightning talk on classification methods, originally given at Open Source Open Mic Chicago 01/2016. Yes, I know I left things out. You try covering this in 5 minutes.
5. things to know
- you need data labeled with the correct answers to
“train” these algorithms before they work
- feature = dimension = attribute of the data
- class = category = Harry Potter house
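To make the vocabulary concrete, here is a minimal sketch of what labeled data can look like in code. The feature names, values, and houses are all invented for illustration.

```python
# Each row is one labeled example: a feature vector plus the correct class.
# Feature names and numbers are made up for illustration.
FEATURE_NAMES = ["bravery", "ambition"]

labeled_data = [
    ((0.9, 0.2), "Gryffindor"),
    ((0.8, 0.3), "Gryffindor"),
    ((0.2, 0.9), "Slytherin"),
    ((0.1, 0.8), "Slytherin"),
]

# Split into the two things a classifier trains on.
features = [x for x, _ in labeled_data]   # one tuple of feature values per row
labels = [y for _, y in labeled_data]     # the "correct answers"
classes = sorted(set(labels))             # the categories to choose between
```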
15. logistic regression
“divide it with a log function”
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
+ gives you probabilities
+ the model is a formula
+ can “threshold” to make model more or less
conservative
💩💩💩💩💩💩💩💩💩💩💩
- only works with linear decision boundaries
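A rough sketch of "the model is a formula" and of thresholding. The weights and intercept below are made-up stand-ins for whatever a real training run would learn, and the function names are just for this example.

```python
import math

# Hypothetical coefficients; a real model learns these from labeled data.
WEIGHTS = [1.5, -2.0]
INTERCEPT = 0.25

def predict_probability(features):
    """Return P(class = 1): the logistic function of a weighted sum."""
    z = INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, threshold=0.5):
    """Predict 1 when the probability clears the threshold.

    Raising the threshold makes positive predictions more conservative.
    """
    return int(predict_probability(features) >= threshold)

example = [1.0, 0.5]
p = predict_probability(example)        # roughly 0.68
default_call = classify(example)        # clears the 0.5 threshold
strict_call = classify(example, 0.9)    # does not clear a 0.9 threshold
```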
16. SVMs (support vector machines)
“*advanced* draw a line through it”
- better definition of “shitty” (picks the line with the biggest margin between the classes)
- lines can turn into non-linear
shapes if you transform your
data
26. SVMs (support vector machines)
“*advanced* draw a line through it”
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
+ works well on a lot of different shapes of data
thanks to the kernel trick
💩💩💩💩💩💩💩💩💩💩💩
- not super easy to explain to people
- can only kinda do probabilities
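This is not an SVM, just a sketch of the "transform your data" idea behind it. The toy data, the `transform` function, and the hand-picked cutoff are all assumptions for illustration. A real SVM would learn the line, and the kernel trick gets the same effect without computing the transform explicitly.

```python
# 1-D toy data: class 1 sits in the middle, class 0 on both sides,
# so no single cut-point on x separates the classes.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

def transform(x):
    """Lift x into two dimensions: (x, x squared)."""
    return (x, x * x)

def linear_rule(point, cutoff=1.0):
    """A straight line in the lifted space: class 1 when x squared < cutoff."""
    _, x_squared = point
    return int(x_squared < cutoff)

# A linear rule on the transformed data is a non-linear rule on the original x.
predictions = [linear_rule(transform(x)) for x in xs]
```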
34. KNN (k-nearest neighbors)
“what do similar cases look like?”
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
+ no training, adding new data is easy
+ you get to define “distance”
💩💩💩💩💩💩💩💩💩💩💩
- can be outlier-sensitive
- you have to define “distance”
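A minimal KNN sketch in plain Python, assuming tiny made-up 2-D data. The `distance` argument is where you get to (and have to) define "distance"; both distance functions here are just common choices, not the only ones.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_predict(training, query, k=3, distance=euclidean):
    """Label the query by majority vote among its k closest training rows."""
    nearest = sorted(training, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# "Training" is just storing the labeled rows.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

near_a = knn_predict(training, (1.2, 1.4))
near_b = knn_predict(training, (6.4, 6.1), distance=manhattan)

# Adding new data is a plain append; nothing gets refit.
training.append(((6.2, 6.3), "B"))
```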
39. decision tree learners
“make a flow chart of it”
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
+ fit all kinds of arbitrary shapes
+ output is a clear set of
conditionals
💩💩💩💩💩💩💩💩💩💩💩
- extremely prone to overfitting
- have to rebuild when you get new
data
- probability estimates are crude (just the class mix in each leaf)
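To show what "a clear set of conditionals" means, here is a hand-written sketch of what a small fitted tree could boil down to. The features, thresholds, and classes are invented; a real learner would pick the splits from labeled data.

```python
def predict_house(bravery, ambition):
    """A hypothetical two-level tree; thresholds are invented, not learned."""
    if bravery >= 0.6:
        if ambition >= 0.8:
            return "Slytherin"
        return "Gryffindor"
    if ambition >= 0.5:
        return "Slytherin"
    return "Hufflepuff"

brave = predict_house(bravery=0.9, ambition=0.2)
ambitious = predict_house(bravery=0.2, ambition=0.9)
```

Because the whole model is these branches, new data that shifts the best split points means regrowing the tree rather than nudging it.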
41. ensemble models
“make a bunch of models and combine them”
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
+ don’t overfit as much as their component parts
+ generally don’t require much parameter tweaking
+ if data doesn’t change very often, you can make
them semi-online by just adding new trees
+ can provide probabilities
💩💩💩💩💩💩💩💩💩💩💩
- slower than their component parts (though if
those are fast, it doesn’t matter)
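A bare-bones sketch of combining models by majority vote, with the vote fraction standing in for a probability. The three one-rule "stumps" are hypothetical and hand-written; a real ensemble such as a random forest would train each component on the data.

```python
# Three deliberately weak, hand-written models (hypothetical rules).
def stump_a(x):
    return int(x[0] > 0.5)

def stump_b(x):
    return int(x[1] > 0.5)

def stump_c(x):
    return int(x[0] + x[1] > 1.2)

MODELS = [stump_a, stump_b, stump_c]

def ensemble_probability(x, models=MODELS):
    """Fraction of component models voting for class 1."""
    votes = [model(x) for model in models]
    return sum(votes) / len(votes)

def ensemble_predict(x, models=MODELS):
    """Class 1 when at least half of the models vote for it."""
    return int(ensemble_probability(x, models) >= 0.5)

unanimous = ensemble_probability((0.9, 0.9))   # every stump votes 1
outvoted = ensemble_predict((0.9, 0.2))        # only stump_a votes 1
```

Growing the ensemble is a matter of appending another model to the list, which is the "semi-online" idea. The cost is that every prediction now runs all of them.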