How to Grow Distributed Random Forests
Jan Vitek
Purdue University
on sabbatical at 0xdata
Photo credit: http://foundwalls.com/winter-snow-forest-pine/
Overview
•Not a data scientist…
•I implement programming languages for a living…
•Leading the FastR project, a next-generation R implementation…
•Today I’ll tell you how to grow a distributed random forest in 2KLOC
PART I
Why so random?
Introducing:
Random Forest
Bagging
Out of bag error estimate
Confusion matrix
Photo credit: http://foundwalls.com/winter-snow-forest-pine/
Leo Breiman. Random forests. Machine learning, 2001.
Classification Trees
•Consider a supervised learning problem on a simple data set with two classes and two features, x in [1,4] and y in [5,8]
•We can build a classification tree to predict the classes of new observations
[Figure: scatter plot of the two-class training data]
Classification Trees
•Consider a supervised learning problem on a simple data set with two classes and two features, x in [1,4] and y in [5,8]
•We can build a classification tree to predict the classes of new observations
[Figure: the scatter plot partitioned by a classification tree; nodes split on thresholds such as x > 2.6 and y > 6.5]
ClassificationTrees
•Classification trees overfit the data
[Figure: the classification tree from the previous slide, grown deep enough to fit individual training points]
Random Forest
•Avoid overfitting by building many randomized, partial trees and voting to determine the class of new observations
[Figure: three scatter plots, each showing the random subset of the data seen by one tree]
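•For concreteness, a minimal sketch of the vote, assuming each tree is reduced to a classify function (the Tree interface here is a stand-in for illustration, not the H2O class):

// Hypothetical sketch: majority vote over an ensemble of trees.
interface Tree { int classify(float[] features); }

static int vote(Tree[] forest, float[] features, int nClasses) {
  int[] votes = new int[nClasses];
  for (Tree t : forest)
    votes[t.classify(features)]++;        // each tree casts one vote
  int best = 0;
  for (int c = 1; c < nClasses; c++)      // pick the plurality class
    if (votes[c] > votes[best]) best = c;
  return best;
}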
Random Forest
•Each tree sees part of the training set and captures part of the information it contains
[Figure: three random data subsets and the three partial trees grown from them]
Bagging
•First rule of RF: each tree sees a different random selection (without replacement) of the training set
[Figure: three random samples of the training data, one per tree]
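•A minimal sketch of this sampling step, assuming rows are addressed by index; selection is without replacement, as the rule above says (a partial Fisher–Yates shuffle, not the H2O code):

// Illustrative: choose nRows*rate distinct row indices for one tree.
static int[] sampleWithoutReplacement(int nRows, double rate, java.util.Random rnd) {
  int k = (int) (nRows * rate);
  int[] idx = new int[nRows];
  for (int i = 0; i < nRows; i++) idx[i] = i;
  for (int i = 0; i < k; i++) {           // shuffle only the first k slots
    int j = i + rnd.nextInt(nRows - i);
    int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
  }
  return java.util.Arrays.copyOf(idx, k); // the rows this tree will see
}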
Split selection
•Second rule of RF:
Splits are selected to maximize gain on a random subset
of features. Each split sees a new random subset.
[Diagram: evaluating a candidate split; the quality of a split is measured by Gini impurity or information gain]
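•The two criteria, computed from per-class counts on one side of a candidate split (illustrative only; the actual H2O split code appears in Part III):

// Gini impurity: 1 - sum(p_c^2); 0 means the node is pure.
static double gini(int[] counts) {
  int n = 0; for (int c : counts) n += c;
  double g = 1.0;
  for (int c : counts) { double p = (double) c / n; g -= p * p; }
  return g;
}

// Entropy in bits: -sum(p_c * log2(p_c)); information gain is the drop
// in entropy from the parent node to the weighted children.
static double entropy(int[] counts) {
  int n = 0; for (int c : counts) n += c;
  double e = 0.0;
  for (int c : counts)
    if (c > 0) { double p = (double) c / n; e -= p * Math.log(p) / Math.log(2); }
  return e;
}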
OOBE
•One can use the training data to get an error estimate (“out of bag error” or OOBE)
•Validate each tree on the complement of its training sample
[Figure: each tree is validated on the training rows it did not see]
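•A sketch of the estimate, reusing the Tree stand-in from the voting sketch; inBag[t][r] (whether tree t sampled row r) is hypothetical bookkeeping, not an H2O structure:

// Illustrative: score each row only with the trees that never saw it.
static double oobError(Tree[] forest, boolean[][] inBag,
                       float[][] rows, int[] labels, int nClasses) {
  int wrong = 0, scored = 0;
  for (int r = 0; r < rows.length; r++) {
    int[] votes = new int[nClasses];
    for (int t = 0; t < forest.length; t++)
      if (!inBag[t][r]) votes[forest[t].classify(rows[r])]++;
    int best = 0, total = 0;
    for (int c = 0; c < nClasses; c++) {
      total += votes[c];
      if (votes[c] > votes[best]) best = c;
    }
    if (total == 0) continue;             // row was in every sample; skip it
    scored++;
    if (best != labels[r]) wrong++;
  }
  return (double) wrong / scored;
}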
Validation
•Validation can be done using OOBE (which is often convenient as
it does not require preprocessing) or with a separate validation
data set.
•A Confusion Matrix summarizes the class assignments performed during validation and gives an overview of the classification errors

assigned \ actual     Red   Green   error
Red                    15       5     33%
Green                   1      10     10%
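•Tallying such a matrix is a one-pass loop (a sketch; the row/column orientation follows the table above):

// Illustrative: m[assigned][actual] counts validation outcomes.
static int[][] confusion(int[] assigned, int[] actual, int nClasses) {
  int[][] m = new int[nClasses][nClasses];
  for (int i = 0; i < assigned.length; i++)
    m[assigned[i]][actual[i]]++;          // off-diagonal entries are errors
  return m;
}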
PART II
Demo
Running RF on Iris
Photo credit: http://foundwalls.com/winter-snow-forest-pine/
Iris RF results
Sample tree
•The demo emits the tree as straight-line Java; the class labels are quoted (and the return type widened to String) here so the generated code compiles:

// Column constants
int COLSEPALLEN = 0;
int COLSEPALWID = 1;
int COLPETALLEN = 2;
int COLPETALWID = 3;

String classify(float fs[]) {
  if( fs[COLPETALWID] <= 0.800000 )
    return "Iris-setosa";
  else
    if( fs[COLPETALLEN] <= 4.750000 )
      if( fs[COLPETALWID] <= 1.650000 )
        return "Iris-versicolor";
      else
        return "Iris-virginica";
    else
      if( fs[COLSEPALLEN] <= 6.050000 )
        if( fs[COLPETALWID] <= 1.650000 )
          return "Iris-versicolor";
        else
          return "Iris-virginica";
      else
        return "Iris-virginica";
}
Comparing accuracy
•We compared several implementations and found that we are OK:

On all datasets most tools are reasonably accurate and give similar results. It is noteworthy that wiseRF, which usually runs the fastest, does not give as accurate results as R and Weka (this is the case for the Vehicle, Stego and Spam datasets). H2O consistently gives the best results for these small datasets. In the larger case studies, the tools are practically tied on Credit (with Weka and wiseRF being 0.2% better). For Intrusion H2O is 0.8% less accurate than wiseRF, and for the largest dataset, Covtype, H2O is markedly more accurate than wiseRF (over 10%).
Dataset     H2O     R       Weka    wiseRF
Iris         2.0%    2.0%    2.0%    2.0%
Vehicle     21.3%   21.3%   22.0%   22.0%
Stego       13.6%   13.9%   14.0%   14.9%
Spam         4.2%    4.2%    4.4%    5.2%
Credit       6.7%    6.7%    6.5%    6.5%
Intrusion   21.2%   19.0%   19.5%   20.4%
Covtype      3.6%   22.9%    –      14.8%
Tab. 2: The best overall classification errors for individual tools and datasets. H2O is generally the most accurate, with the exception of Intrusion where it is slightly less precise than R. R is the second best, with the exception of larger datasets like Covtype where internal restrictions seem to result in significant loss of accuracy.
The dataset is available at http://nsl.cs.unb.ca/NSL-KDD/. This dataset is used as a benchmark by Mahout at https://cwiki.apache.org/MAHOUT/partial-implementation.html
Dataset    Features  Predictor  Instances (train/test)  Imbalanced  Missing observations
Iris          4      3 classes  100/50                  NO          0
Vehicle      18      4 classes  564/282                 NO          0
Stego       163      3 classes  3,000/4,500             NO          0
Spam         57      2 classes  3,067/1,534             YES         0
Credit       10      2 classes  100,000/50,000          YES         29,731
Intrusion    41      2 classes  125,973/22,544          NO          0
Covtype      54      7 classes  387,342/193,672         YES         0

Tab. 1: Overview of the seven datasets. The first four datasets are micro benchmarks that we use for calibration. The last three datasets are medium-sized problems. Credit is the only dataset with missing observations. There are several imbalanced datasets.
2.1 Iris
The “Iris” dataset is a classical dataset (http://archive.ics.uci.edu/ml/datasets/Iris). The data set contains 3 classes of 50 instances each, where each class refers to a type of plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other.
PART III
Writing a DRF algorithm
in Java with H2O
Design choices,
implementation techniques,
pitfalls.
Photo credit: http://foundwalls.com/winter-snow-forest-pine/
Distributing and Parallelizing RF
•When data does not fit in RAM, what impact does that have on random forest?
•How do we sample?
•How do we select splits?
•How do we estimate OOBE?
Insights
•RF building parallelizes extremely well when the random data sample fits in memory
•Trees can be built in parallel trivially
•Tree size increases with data volume
•Validation requires trees to be co-located with data
Strategy
•Start with a randomized partition of the data on nodes
•Build trees in parallel on subsets of each node’s data
•Exchange trees for validation
Reading and Parsing Data
•H2O does that for us and returns a ValueArray, which is a row-ordered distributed table:

class ValueArray extends Iced implements Cloneable {
  // (excerpt: signatures only)
  long   numRows();
  int    numCols();
  long   length();
  double datad(long rownum, int colnum);  // read one cell as a double
}
•Each 4MB chunk of the VA is stored on a (possibly) different node and identified by a unique key
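•A sketch of walking the table through this interface (the accessors are those from the listing above; the loop itself is illustrative):

// Illustrative full scan of a ValueArray, one cell at a time.
double sumAllCells(ValueArray va) {
  double total = 0;
  for (long r = 0; r < va.numRows(); r++)
    for (int c = 0; c < va.numCols(); c++)
      total += va.datad(r, c);            // row-major access
  return total;
}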
Extracting random subsets
•Each node holds a random set of 4MB chunks of the value array
final ValueArray ary = DKV.get( dataKey ).get();
ArrayList<RecursiveAction> dataInhaleJobs = new ArrayList<RecursiveAction>();
for( final Key k : keys ) {
  if (!k.home()) continue;                               // skip non-local keys
  final int rows = ary.rpc(ValueArray.getChunkIndex(k)); // rows in this chunk
  dataInhaleJobs.add(new RecursiveAction() {
    @Override protected void compute() {
      for( int j = 0; j < rows; ++j )
        for( int c = 0; c < ncolumns; ++c )
          localData.add( ary.datad(j, c) );
    }});
}
ForkJoinTask.invokeAll(dataInhaleJobs);
Evaluating splits
•Each feature that must be considered for a split requires
processing data of the form (feature value, class)
{ (3.4, red), (3.3, green), (2, red), (5, green), (6.1, green) }
•We should sort the values before processing
{ (2, red), (3.3, green), (3.4, red), (5, green), (6.1, green) }
•But since each split is done on a different set of rows, we would have to re-sort the features at every split
•Trees can have 100k splits
Evaluating splits
•Instead we discretize the values
{ (2, red), (3.3, green), (3.4, red), (5, green), (6.1, green) }
•becomes
{ (0, red), (1, green), (2, red), (3, green), (4, green) }
•and no sorting is required, as we can represent the colors by arrays (sized by the cardinality of the feature)
•For efficiency we can bin multiple values together
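•A sketch of the discretization step with equal-width bins (H2O’s actual binning strategy may differ):

// Illustrative: replace each raw value by the index of its bin.
static short[] discretize(double[] col, int nBins) {
  double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
  for (double v : col) { min = Math.min(min, v); max = Math.max(max, v); }
  double width = (max - min) / nBins;
  short[] binned = new short[col.length];
  for (int i = 0; i < col.length; i++) {
    int b = width == 0 ? 0 : (int) ((col[i] - min) / width);
    binned[i] = (short) Math.min(b, nBins - 1);  // clamp the max into the last bin
  }
  return binned;
}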
Evaluating splits
•The implementation of an entropy-based split is now simple:
Split ltSplit(int col, Data d, int[] dist, Random rand) {
  final int[] distL = new int[d.classes()], distR = dist.clone();
  final double upperBoundReduction = upperBoundReduction(d.classes());
  double maxReduction = -1; int bestSplit = -1;
  for (int i = 0; i < columnDists[col].length - 1; ++i) {
    // move bin i from the right of the split point to the left
    for (int j = 0; j < distL.length; ++j) {
      double v = columnDists[col][i][j]; distL[j] += v; distR[j] -= v;
    }
    int totL = 0, totR = 0;
    for (int e: distL) totL += e;
    for (int e: distR) totR += e;
    double eL = 0, eR = 0;
    for (int e: distL) eL += gain(e, totL);
    for (int e: distR) eR += gain(e, totR);
    double eReduction = upperBoundReduction - ((eL*totL + eR*totR) / (totL+totR));
    if (eReduction > maxReduction) { bestSplit = i; maxReduction = eReduction; }
  }
  return Split.split(col, bestSplit, maxReduction);
}
Parallelizing tree building
•Trees are built in parallel with the Fork/Join framework
Statistic left = getStatistic(0, data, seed + LTSSINIT);
Statistic rite = getStatistic(1, data, seed + RTSSINIT);
int c = split.column, s = split.split;
SplitNode nd = new SplitNode(c, s, …);
data.filter(nd, res, left, rite);     // partition rows into res[0] and res[1]
FJBuild fj0 = null, fj1 = null;
Split ls = left.split(res[0], depth >= maxdepth);
Split rs = rite.split(res[1], depth >= maxdepth);
if (ls.isLeafNode()) nd.l = new LeafNode(...);
else fj0 = new FJBuild(ls, res[0], depth+1, seed + LTSINIT);
if (rs.isLeafNode()) nd.r = new LeafNode(...);
else fj1 = new FJBuild(rs, res[1], depth+1, seed - RTSINIT);
if (data.rows() > ROWSFORKTRESHOLD) … // fork only when there is enough work
fj0.fork();                           // left subtree builds asynchronously
nd.r = fj1.compute();                 // right subtree builds on this thread
nd.l = fj0.join();
Challenges
•Found out that Java Random isn’t
•Tree size does get to be a challenge
•Need more randomization
•Determinism is needed for debugging
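•On the first point: java.util.Random is a 48-bit linear congruential generator, so seeds derived as seed + constant (as in the tree-building code above) can yield correlated streams. One common remedy, shown here as an illustration rather than the fix used in H2O, is to pass every derived seed through a strong mixing function first:

// Illustrative: SplitMix64 finalizer; decorrelates nearby seeds while
// keeping runs reproducible from a single root seed.
static long mix(long z) {
  z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
  z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
  return z ^ (z >>> 31);
}
// e.g. new java.util.Random(mix(rootSeed + treeIndex))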
PART IV
Playing with DRF
Covtype,
playing with knobs
Photo credit: http://foundwalls.com/winter-snow-forest-pine/
Covtype
Dataset    Features  Predictor  Instances (train/test)  Imbalanced  Missing observations
Iris          4      3 classes  100/50                  NO          0
Vehicle      18      4 classes  564/282                 NO          0
Stego       163      3 classes  3,000/4,500             NO          0
Spam         57      2 classes  3,067/1,534             YES         0
Credit       10      2 classes  100,000/50,000          YES         29,731
Intrusion    41      2 classes  125,973/22,544          NO          0
Covtype      54      7 classes  387,342/193,672         YES         0

Tab. 1: Overview of the seven datasets. The first four datasets are micro benchmarks that we use for calibration. The last three datasets are medium-sized problems. Credit is the only dataset with missing observations. There are several imbalanced datasets.
Varying sampling rate for covtype
•Changing the proportion of data used for each tree affects error
•The danger is overfitting and losing the OOBE
4.2 Sampling Rate
H2O supports changing the proportion of the population that is randomly selected (without replacement) to build each tree. Figure 1 illustrates the impact of varying the sampling rate between 1 and 99% when building trees for Covtype. The blue line tracks the OOB error; it shows clearly that beyond approximately 80% sampling the OOBE will not improve. The red line shows the classification error, which keeps dropping, suggesting that for Covtype more data is better.
[Plot: error vs. sampling rate (0–100%); curves for OOB err and Classif err]
Fig. 1: Impact of changing the sampling rate on the overall error rate (both classification and OOB).
Recommendation: The sampling rate needs to be set to the level that minimizes the classification error. The OOBE is a good estimator of the classification error.
[Plot: error vs. number of features per split (0–50); one curve per sampling rate]
Impact of changing the number of features used to evaluate each split with different sampling rates (50%, 60%, 70% and 80%) on the overall error rate (both classification and OOB).
Changing #features per split for covtype
•Increasing the number of features can be beneficial
•The impact is not huge, though
Ignoring features for covtype
•Some features are best ignored
[Plot: error vs. index of the ignored feature (0–50); curves for OOB err and Classif err]
Fig. 2: Impact of ignoring one feature at a time on the overall error rate (both classification and OOB).
Conclusion
•Random forest is a powerful machine learning technique
•It’s easy to write a distributed and parallel implementation
•Different implementation choices are possible
•Scaling it up to TB data comes next…
Photo credit: www.twitsnaps.com