3. Inputs
● Data Matrix (Regression)
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
.56        | Red         | .456        | Male        | .589
.78        | Green       | .654        | Female      | .6654
.987       | Blue        | .678        | Female      | .789
.123       | Blue        | .999        | Male        | .543
4. Inputs
● Data Matrix (Binary Classification)
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
Yes        | Red         | .456        | Male        | .589
No         | Green       | .654        | Female      | .6654
Yes        | Blue        | .678        | Female      | .789
No         | Blue        | .999        | Male        | .543
5. Inputs To Streaming Classification
● Observations now have an explicit arrival order.
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013
6. Inputs To Streaming Classification
● New Observations can arrive at any time
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013
Yes        | Red         | .456        | Male        | .456        | NOW
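The rows above can be represented in code as a single streaming record with an explicit arrival time. This is a minimal sketch; the field names and types are assumptions for illustration, not the authors' actual schema.

```scala
import java.time.LocalDate

// One streaming observation, mirroring the table layout above.
// Field names are hypothetical.
case class Observation(
  predictand: Boolean,  // Yes / No label
  color: String,        // Predictor 1
  x2: Double,           // Predictor 2
  sex: String,          // Predictor 3
  x4: Double,           // Predictor 4
  time: LocalDate       // explicit arrival order
)

// The first row of the table above:
val obs = Observation(true, "Red", 0.456, "Male", 0.589, LocalDate.of(2011, 1, 1))
```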
7. Problems
● Do the important predictors change over time, and when does this change occur?
● How far back is data relevant to today’s problem?
● What happens when our predictors change again in the future?
● What if this is all happening rapidly… will it scale?
8. Enter Online Random Forest
● Input is a single new observation
● Trees learn incrementally on this new data
● Trees are dropped from the forest based on performance and replaced with a new “ungrown” tree
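The update loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the "tree" here is just a majority-vote counter standing in for a real incrementally grown decision tree, and the accuracy threshold is a made-up parameter.

```scala
// A stand-in for an incrementally grown tree: predicts the majority
// label seen so far and tracks its own running accuracy.
class OnlineTree {
  private var yes = 0
  private var no = 0
  private var correct = 0
  private var seen = 0

  def predict(): String = if (yes >= no) "Yes" else "No"

  // Learn from one labeled observation; score the prediction first.
  def update(label: String): Unit = {
    if (seen > 0 && predict() == label) correct += 1
    seen += 1
    if (label == "Yes") yes += 1 else no += 1
  }

  def accuracy: Double = if (seen <= 1) 1.0 else correct.toDouble / (seen - 1)
}

// Forest: feed each observation to every tree, then replace
// poor performers with fresh "ungrown" trees.
class OnlineForest(nTrees: Int, minAccuracy: Double) {
  private var trees = Vector.fill(nTrees)(new OnlineTree)

  def update(label: String): Unit = {
    trees.foreach(_.update(label))
    trees = trees.map(t => if (t.accuracy < minAccuracy) new OnlineTree else t)
  }

  def predict(): String = {
    val votes = trees.map(_.predict())
    if (votes.count(_ == "Yes") >= votes.count(_ == "No")) "Yes" else "No"
  }
}
```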
9. Visualization of a single tree
Accuracy on test cases: 75%
[Tree diagram: leaf class counts (5, 6) and (0, 70); pure-data leaves stop splitting]
10. Visualization of a single tree
Accuracy on test cases: 55%
[Tree diagram: leaf class counts (0, 70), (2, 25), (20, 3); 50 new observations have arrived and another split is created off the parent node’s left branch]
11. Tree gets pruned
Accuracy on test cases: 55% … compare it to a random variable and incorporate the age of the tree. Accuracy is too bad: prune the tree.
[Tree diagram: leaf class counts (0, 70), (2, 25), (20, 3)]
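One possible reading of the pruning rule above (the slides do not give the exact formula): a tree is pruned when its test accuracy falls below a random-guessing baseline, but young trees get a grace period so they are not dropped before they have had a chance to grow. Both parameters below are assumptions.

```scala
// Hypothetical pruning criterion: worse than chance AND old enough to judge.
def shouldPrune(accuracy: Double,
                age: Int,                       // observations this tree has seen
                randomBaseline: Double = 0.5,   // accuracy of random guessing
                gracePeriod: Int = 100): Boolean =
  if (age < gracePeriod) false  // too young to judge fairly
  else accuracy < randomBaseline
```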
12. New Tree
It’s a stump that hasn’t yet split any data. If asked for a classification, it will vote the prior probability calculated from the last 100 observations that the old, pruned tree saw.
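The stump's vote described above can be sketched as follows. The window size of 100 comes from the slide; the class representation and the fallback prior for an empty history are assumptions.

```scala
// A fresh stump: no splits yet, so it votes the prior probability of
// the positive class over the last 100 labels its pruned predecessor saw.
class Stump(history: Seq[Boolean]) {
  private val window = history.takeRight(100)

  def vote: Double =
    if (window.isEmpty) 0.5  // no history: uninformative prior (assumption)
    else window.count(identity).toDouble / window.size
}
```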
13. Online Random Forest
● By dropping trees that predict poorly, we can adapt to changes in the important predictors
● If previous data is relevant to today’s problem, trees learned from it in the past. If it is no longer relevant, that will be reflected in the accuracy and the tree will get pruned
14. Online Random Forest
● This process of incremental learning and dropping is constantly occurring, so we can constantly adapt to a changing signal
● We built our Online Random Forest with Scala’s actor framework
● We distribute each tree’s computations (and physical location), so we can handle high-volume input data streams
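The slides say each tree's computation is distributed via Scala's actor framework. As a dependency-free stand-in for that idea, this sketch uses standard-library Futures instead of actors: each tree scores the incoming observation independently and in parallel, and the forest aggregates the votes. The per-tree vote function is a placeholder, not the authors' logic.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Placeholder for a real tree's prediction on one observation.
def treeVote(treeId: Int, observation: Vector[Double]): Boolean =
  observation.sum > treeId % 2  // dummy logic for illustration

// Each tree votes concurrently; the forest takes the majority.
def forestPredict(nTrees: Int, obs: Vector[Double]): Boolean = {
  val votes = Future.sequence((0 until nTrees).map(id => Future(treeVote(id, obs))))
  val results = Await.result(votes, 5.seconds)
  results.count(identity) * 2 > results.size
}
```

In the actor version, each tree would instead live in its own actor (possibly on another machine) and receive observations as messages, which is what lets the forest absorb a high-volume stream.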