3. Inputs
● Data Matrix (Regression)
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
.56        | Red         | .456        | Male        | .589
.78        | Green       | .654        | Female      | .6654
.987       | Blue        | .678        | Female      | .789
.123       | Blue        | .999        | Male        | .543
4. Inputs
● Data Matrix (Binary Classification)
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4
Yes        | Red         | .456        | Male        | .589
No         | Green       | .654        | Female      | .6654
Yes        | Blue        | .678        | Female      | .789
No         | Blue        | .999        | Male        | .543
5. Inputs To Streaming Classification
● Observations now have an explicit arrival order.
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013
6. Inputs To Streaming Classification
● New Observations can arrive at any time
Predictand | Predictor 1 | Predictor 2 | Predictor 3 | Predictor 4 | Time
Yes        | Red         | .456        | Male        | .589        | Jan 1st 2011
No         | Green       | .654        | Female      | .6654       | Feb 4th 2012
Yes        | Blue        | .678        | Female      | .789        | Feb 5th 2013
No         | Blue        | .999        | Male        | .543        | July 4th 2013
Yes        | Red         | .456        | Male        | .456        | NOW
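The rows above can be represented in code as a single streaming record with an explicit arrival time. This is a minimal sketch; the field names and types are assumptions for illustration, not the authors' actual schema.

```scala
import java.time.LocalDate

// One streaming observation, mirroring the table layout above.
// Field names are hypothetical.
case class Observation(
  predictand: Boolean,  // Yes / No label
  color: String,        // Predictor 1
  x2: Double,           // Predictor 2
  sex: String,          // Predictor 3
  x4: Double,           // Predictor 4
  time: LocalDate       // explicit arrival order
)

// The first row of the table above:
val obs = Observation(true, "Red", 0.456, "Male", 0.589, LocalDate.of(2011, 1, 1))
```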
7. Problems
● Do the important predictors change over time, and when does this change occur?
● How far back is data relevant to today’s problem?
● What happens when our predictors change again in the future?
● What if this is all happening rapidly… will it scale?
8. Enter Online Random Forest
● Input is a single new observation
● Trees learn incrementally on this new data
● Trees are dropped from the forest based on performance and replaced with a new “ungrown” tree
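The update loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the "tree" here is just a majority-vote counter standing in for a real incrementally grown decision tree, and the accuracy threshold is a made-up parameter.

```scala
// A stand-in for an incrementally grown tree: predicts the majority
// label seen so far and tracks its own running accuracy.
class OnlineTree {
  private var yes = 0
  private var no = 0
  private var correct = 0
  private var seen = 0

  def predict(): String = if (yes >= no) "Yes" else "No"

  // Learn from one labeled observation; score the prediction first.
  def update(label: String): Unit = {
    if (seen > 0 && predict() == label) correct += 1
    seen += 1
    if (label == "Yes") yes += 1 else no += 1
  }

  def accuracy: Double = if (seen <= 1) 1.0 else correct.toDouble / (seen - 1)
}

// Forest: feed each observation to every tree, then replace
// poor performers with fresh "ungrown" trees.
class OnlineForest(nTrees: Int, minAccuracy: Double) {
  private var trees = Vector.fill(nTrees)(new OnlineTree)

  def update(label: String): Unit = {
    trees.foreach(_.update(label))
    trees = trees.map(t => if (t.accuracy < minAccuracy) new OnlineTree else t)
  }

  def predict(): String = {
    val votes = trees.map(_.predict())
    if (votes.count(_ == "Yes") >= votes.count(_ == "No")) "Yes" else "No"
  }
}
```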
9. Visualization of a single tree
Accuracy on test cases: 75%
[Tree diagram: leaf class counts (5, 6) and (0, 70); pure-data leaves stop splitting]
10. Visualization of a single tree
Accuracy on test cases: 55%
[Tree diagram: leaf class counts (0, 70), (2, 25), (20, 3); 50 new observations have arrived and another split is created off the parent node’s left branch]
11. Tree gets pruned
Accuracy on test cases: 55% … compare it to a random variable and incorporate the age of the tree. Accuracy is too bad: prune the tree.
[Tree diagram: leaf class counts (0, 70), (2, 25), (20, 3)]
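One possible reading of the pruning rule above (the slides do not give the exact formula): a tree is pruned when its test accuracy falls below a random-guessing baseline, but young trees get a grace period so they are not dropped before they have had a chance to grow. Both parameters below are assumptions.

```scala
// Hypothetical pruning criterion: worse than chance AND old enough to judge.
def shouldPrune(accuracy: Double,
                age: Int,                       // observations this tree has seen
                randomBaseline: Double = 0.5,   // accuracy of random guessing
                gracePeriod: Int = 100): Boolean =
  if (age < gracePeriod) false  // too young to judge fairly
  else accuracy < randomBaseline
```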
12. New Tree
It’s a stump that hasn’t yet split any data. If asked for a classification, it will vote the prior probability calculated from the last 100 observations that the old, pruned tree saw.
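The stump's vote described above can be sketched as follows. The window size of 100 comes from the slide; the class representation and the fallback prior for an empty history are assumptions.

```scala
// A fresh stump: no splits yet, so it votes the prior probability of
// the positive class over the last 100 labels its pruned predecessor saw.
class Stump(history: Seq[Boolean]) {
  private val window = history.takeRight(100)

  def vote: Double =
    if (window.isEmpty) 0.5  // no history: uninformative prior (assumption)
    else window.count(identity).toDouble / window.size
}
```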
13. Online Random Forest
● By dropping trees that predict poorly, we can adapt to changes in the important predictors
● If previous data is relevant to today’s problem, trees learned from it in the past. If it is no longer relevant, that will be reflected in the accuracy and the tree will get pruned
14. Online Random Forest
● This process of incremental learning and dropping is constantly occurring, so we can constantly adapt to a changing signal
● We built our Online Random Forest with Scala’s actor framework
● We distribute each tree’s computations (and physical location), so we can handle high-volume input data streams
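The slides say each tree's computation is distributed via Scala's actor framework. As a dependency-free stand-in for that idea, this sketch uses standard-library Futures instead of actors: each tree scores the incoming observation independently and in parallel, and the forest aggregates the votes. The per-tree vote function is a placeholder, not the authors' logic.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Placeholder for a real tree's prediction on one observation.
def treeVote(treeId: Int, observation: Vector[Double]): Boolean =
  observation.sum > treeId % 2  // dummy logic for illustration

// Each tree votes concurrently; the forest takes the majority.
def forestPredict(nTrees: Int, obs: Vector[Double]): Boolean = {
  val votes = Future.sequence((0 until nTrees).map(id => Future(treeVote(id, obs))))
  val results = Await.result(votes, 5.seconds)
  results.count(identity) * 2 > results.size
}
```

In the actor version, each tree would instead live in its own actor (possibly on another machine) and receive observations as messages, which is what lets the forest absorb a high-volume stream.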