Distributed Logistic Model Trees

Classification algorithms play an important role in different business areas, such as fraud detection, cross-selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other, non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano and Mateo Álvarez presented a distributed Spark implementation of the Logistic Model Tree (LMT) algorithm (Landwehr et al. (2005), Machine Learning, 59(1-2), 161-205), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equally or better than other popular algorithms on several performance metrics such as accuracy, precision/recall or area under the ROC curve.


  1. Distributed Logistic Model Trees, Stratio Intelligence. Mateo Álvarez and Antonio Soriano, @StratioBD
  2. Aerospace Engineer, MSc in Propulsion Systems (UPM), Master in Data Science (URJC). Working as a data scientist and Big Data developer in the data science department at Stratio Big Data. mateo-alvarez
  3. Ph.D. in Telecommunications, MSc in Electronic Systems Engineering and Telecommunication Technologies, Systems and Networks (UPV), and MSc “Big Data Expert” (UTAD). Working as a data scientist and Big Data developer in the data science department at Stratio Big Data. @Phd_A_Soriano
  4. Why use interpretable algorithms instead of “black boxes”; Logistic Regression; Decision Trees; Variance-Bias tradeoff; Metrics; Demo. Logistic Model Trees; Distributed implementation; Cost function & configuration parameters; Demo
  5. • Why use interpretable algorithms instead of “black boxes” • Logistic Regression • Decision Trees • Variance-Bias tradeoff @StratioBD
  6. Accuracy vs. Explainability: medical studies, power management, financial environments, criminal activity
  7. Logistic regression: predicted probability as a function of the feature, with a decision threshold.
  8. Logistic regression: the same plot, highlighting a region where the fit is locally poor.
  9. Logistic regression: predicted probability vs. feature with the decision threshold (continued).
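
As a minimal sketch of the thresholding idea behind slides 7-9 (not the speakers' code; the toy data and parameter values are assumptions), Spark ML's LogisticRegression exposes the probability of the positive class and a configurable decision threshold:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    object LogisticThresholdSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("lr-threshold").getOrCreate()
        import spark.implicits._

        // Toy single-feature dataset: label 1 tends to appear for larger feature values.
        val data = Seq((0.0, 0.5), (0.0, 1.1), (0.0, 1.8), (1.0, 2.4), (1.0, 3.0), (1.0, 3.7))
          .map { case (label, x) => (label, Vectors.dense(x)) }
          .toDF("label", "features")

        // Fit a logistic regression; the threshold decides where P(y=1|x) is cut into a class.
        val model = new LogisticRegression().setMaxIter(50).setThreshold(0.5).fit(data)

        // Each row gets a probability and a prediction derived from the threshold.
        model.transform(data).select("features", "probability", "prediction").show(false)
        spark.stop()
      }
    }
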
  10. Decision tree: root node split on Feature 1, with two leaf nodes.
  11. Decision tree: root node split on Feature 1, two child splits on Feature 2, four leaf nodes.
  12. Decision tree: the same tree, again highlighting a region with a locally poor fit.
  13. Bias-variance tradeoff: mean error vs. model complexity, with training error and test error; total error = bias² + variance; underfitting vs. overfitting regions.
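
The decomposition sketched on slide 13 is the standard one for squared-error loss; written out in LaTeX, the expected prediction error at a point x splits into squared bias, variance, and irreducible noise:

    \mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
      = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
      + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{Variance}}
      + \underbrace{\sigma^2}_{\text{irreducible error}}

Here f is the true relationship, \hat{f} the fitted model, and \sigma^2 the noise variance; slides 15-17 illustrate each of the three terms in turn.
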
  14. Bias and variance (illustration).
  15. Missing variables that are important for making the predictions (bias).
  16. Overfitting to the sample/training data (variance).
  17. Irreducible error in the prediction.
  18. • Logistic Model Trees • Distributed implementation • Cost function & configuration parameters • Demo @StratioBD
  19. LMT growth: the root node, with performance metrics evaluated at the node.
  20. LMT growth: the root node is split on Feature 1; performance metrics are evaluated at the root and at each child node.
  21. LMT growth: the tree keeps growing; performance metrics are evaluated at every node.
  22. LMT: tree with the root split on Feature 1 and two child splits on Feature 2.
  23. LMT: tree with the root split on Feature 1 and a split on Feature 2.
  24. LMT: tree with the root split on Feature 1 and a split on Feature 2 (continued).
  25. DISTRIBUTED IMPLEMENTATION: Spark’s Decision Tree (distributed implementation of random forests); Spark’s Logistic Regression or Weka’s Logistic Regression on the nodes
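
Below is a hedged Scala sketch of how the two building blocks named on slide 25 could be composed into an LMT-style model: grow a shallow tree with Spark's distributed decision tree, then fit one logistic regression on the rows reaching each leaf. This is not Stratio's implementation; the leafCol routing it relies on is a Spark 3.x feature, and the object, method and column names here are illustrative.

    import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression, LogisticRegressionModel}
    import org.apache.spark.sql.DataFrame

    object LmtSketch {
      // Grow a distributed tree, then fit a logistic regression per leaf
      // (the basic shape of a Logistic Model Tree).
      def trainLeafModels(train: DataFrame): Map[Double, LogisticRegressionModel] = {
        val tree = new DecisionTreeClassifier().setMaxDepth(3).fit(train)

        // Route every training row to its leaf via the model's leaf-index column.
        val routed = tree.setLeafCol("leafId").transform(train)

        // Fit one logistic regression on the rows that fall into each leaf.
        val leafIds = routed.select("leafId").distinct().collect().map(_.getDouble(0))
        leafIds.map { id =>
          val leafData = routed.filter(routed("leafId") === id)
          id -> new LogisticRegression().setMaxIter(50).fit(leafData)
        }.toMap
      }
    }
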
  26. LMT cost function to fix the logistic regression threshold: AccuracyCostFunction, ConfusionMatrix, PrecisionCostFunction, PrecisionRecallCostFunction, RocCostFunction. The same cost function is used as the pruning criterion.
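
The list on slide 26 suggests a pluggable cost-function interface. Below is a hypothetical Scala sketch: only the names AccuracyCostFunction and PrecisionCostFunction come from the slide, everything else is an assumption. It scores a candidate threshold from confusion-matrix counts and picks the threshold that minimises the cost.

    // Hypothetical interface for the pluggable cost functions listed on slide 26.
    trait CostFunction {
      def cost(tp: Long, fp: Long, tn: Long, fn: Long): Double
    }

    // Lower cost is better, so accuracy and precision are turned into 1 - metric.
    object AccuracyCostFunction extends CostFunction {
      def cost(tp: Long, fp: Long, tn: Long, fn: Long): Double =
        1.0 - (tp + tn).toDouble / (tp + tn + fp + fn)
    }

    object PrecisionCostFunction extends CostFunction {
      def cost(tp: Long, fp: Long, tn: Long, fn: Long): Double =
        if (tp + fp == 0) 1.0 else 1.0 - tp.toDouble / (tp + fp)
    }

    object ThresholdSelection {
      // countsAt evaluates the confusion matrix obtained with a given threshold.
      def bestThreshold(candidates: Seq[Double],
                        countsAt: Double => (Long, Long, Long, Long),
                        costFn: CostFunction): Double =
        candidates.minBy { t =>
          val (tp, fp, tn, fn) = countsAt(t)
          costFn.cost(tp, fp, tn, fn)
        }
    }
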
  27. ADVANTAGES OF THIS IMPLEMENTATION. Big datasets: the power of Spark to distribute building the tree and the logistic regressions. Medium datasets: distributed tree growth with Weka’s logistic regression. Small datasets: although distributing the data for the decision tree can be slow, the cost functions can still be used, and specific optimizations can be applied for particular cases.
  28. Example of the DLMT algorithm on a synthetic dataset
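
The synthetic dataset used in the demo is not reproduced here; a sketch of how a comparable two-class synthetic dataset could be generated in Spark (sizes, seed and cluster centers are arbitrary assumptions):

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    object SyntheticDataSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("dlmt-synthetic").getOrCreate()
        import spark.implicits._

        val rnd = new Random(42)
        // Two Gaussian blobs in 2D: class 0 centered at (0, 0), class 1 at (2, 2).
        val rows = (1 to 10000).map { _ =>
          val label = if (rnd.nextBoolean()) 1.0 else 0.0
          val shift = if (label == 1.0) 2.0 else 0.0
          (label, Vectors.dense(rnd.nextGaussian() + shift, rnd.nextGaussian() + shift))
        }
        val data = rows.toDF("label", "features")
        data.groupBy("label").count().show()
        spark.stop()
      }
    }
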
  29. • Metrics • Demo @StratioBD
  30. Confusion matrix (rows: true condition, columns: prediction): (Positive, Positive) = True Positives; (Positive, Negative) = False Negatives; (Negative, Positive) = False Positives; (Negative, Negative) = True Negatives. TPR (Recall) = TP/(TP+FN), insensitive to class imbalance. FPR = FP/(FP+TN), insensitive to class imbalance. Precision = TP/(TP+FP), sensitive to class imbalance. Accuracy = (TP+TN)/(TP+TN+FP+FN), sensitive to class imbalance.
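
The four quantities on slide 30, computed directly from the confusion-matrix counts as a plain Scala sketch (the case class name is ours):

    // Metrics from slide 30, derived from the four confusion-matrix counts.
    final case class ConfusionCounts(tp: Long, fp: Long, tn: Long, fn: Long) {
      def tpr: Double       = tp.toDouble / (tp + fn)                   // recall; insensitive to class imbalance
      def fpr: Double       = fp.toDouble / (fp + tn)                   // insensitive to class imbalance
      def precision: Double = tp.toDouble / (tp + fp)                   // sensitive to class imbalance
      def accuracy: Double  = (tp + tn).toDouble / (tp + tn + fp + fn)  // sensitive to class imbalance
    }

    // Example: ConfusionCounts(tp = 90, fp = 30, tn = 870, fn = 10).precision == 0.75
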
  31. ROC curve: True Positive Rate (Recall) vs. False Positive Rate. AUROC (AUC), the area under the TPR-vs-FPR curve, is insensitive to class imbalance. Best performance is towards high TPR at low FPR.
  32. Precision-Recall curve: Precision vs. True Positive Rate (Recall). AUPRC, the area under the precision-recall curve, is sensitive to class imbalance. Best performance is towards high precision at high recall.
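
Both areas described on slides 31-32 are available directly from Spark MLlib's BinaryClassificationMetrics, given (score, label) pairs; a minimal sketch (names are ours):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    object CurveMetricsSketch {
      // scoreAndLabels: (model score in [0, 1], true label 0.0 or 1.0) per instance.
      def areas(scoreAndLabels: RDD[(Double, Double)]): (Double, Double) = {
        val metrics = new BinaryClassificationMetrics(scoreAndLabels)
        val auroc = metrics.areaUnderROC() // area under the TPR-vs-FPR curve
        val auprc = metrics.areaUnderPR()  // area under the precision-recall curve
        (auroc, auprc)
      }
    }
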
  33. Automatic Benchmarking Framework (ABF): data and a set of algorithms f_1 … f_n fed into the benchmark.
  34. @StratioBD
  35. Accuracy vs. Explainability. Performance metrics: AUROC, AUPRC, accuracy. Automatic Benchmarking Framework (ABF).
  36. THANK YOU. UNITED STATES Tel: (+1) 408 5998830. EUROPE Tel: (+34) 91 828 64 73. contact@stratio.com www.stratio.com @StratioBD
  37. people@stratio.com WE ARE HIRING @StratioBD
