COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK
Christophe Salperwyck – EDF R&D
Simon Maby – OCTO Technology - @simonm...
Datastax Breakfast (Petit Déjeuner Datastax), 14-04-15. Courbo Spark: an example of Machine Learning on time series


In recent years we have seen a major shift in the ecosystem of data management solutions. Usage has also evolved, on both the analytical and the transactional side: next-day (D+1) batch processing is no longer inevitable!

What are the findings and the prospects for traditional information systems, now that event-driven technologies are increasingly accessible and widely adopted?

Courbo-Spark: an example of Machine Learning on time series

Decision trees are well-known classification and regression models in the Machine Learning world. In EDF's industrial context, this type of algorithm often has to be applied to time series. We will present how EDF and OCTO adapted the decision tree implementation in Spark to process large volumes of load curves.


  1. COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK
     Christophe Salperwyck – EDF R&D
     Simon Maby – OCTO Technology - @simonmaby
     Xdata project: www.xdata.fr, grants from the "Investissement d'Avenir" program, 'Big Data' call
  2. AGENDA
     1. PROBLEM DESCRIPTION
     2. IMPLEMENTATION
        • Courbotree: presentation of the algorithm
        • From MLlib to Courbospark
     3. PERFORMANCE
        • Configuration (cluster description, Spark config…)
     4. FEEDBACK ON SPARK/MLLIB
  3. FRENCH METERS DATA
  4. • 1 measurement every 10 min
     • 35 million customers
     • Time series per customer: 144 points per day x 365 days
     → Annual data volume: about 1,800 billion records, 120 TB of raw data
     BIG DATA!
  5. LOAD CURVES CLASSIFICATION
     Contract type   Region   …   Equipment type   Load curve
     9 kVA           75       …   Elec             (plot)
     6 kVA           22       …   Gas              (plot)
     …               …        …   …                …
     12 kVA          34       …   Elec             (plot)
  6. WHY A DECISION TREE?
     • Easy to understand
     • Ability to explore the model
     • Ability to choose the expressivity of the model
  7. SPLIT CRITERIA: INERTIA
     Goal: find the most different curves depending on an explanatory feature.
     How to split? We can either:
     • minimize the dispersion of the curves within each group (intra-class inertia), or
     • maximize the differences between the groups' average curves (inter-class inertia)
  8. MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type)
     [Figure: load curves for Electrical and Gas equipment with their mean curves; x-axis: Hour; y-axis: P in W; annotation: ArgMax(d)]
  9. EXISTING DISTRIBUTED DECISION TREES
     • Scalable Distributed Decision Trees in Spark MLLib. Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
     • A MapReduce Implementation of C4.5 Decision Tree Algorithm. Wei Dai, Wei Ji. International Journal of Database Theory and Application, Vol. 7, No. 1, 2014, pages 49-60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
     • PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
     • Distributed Decision Tree Learning for Mining Big Data Streams. Arinto Murdopo. Master thesis, Yahoo! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
  10. MLLIB DECISION TREE PARALLELIZATION
  11. HORIZONTAL STRATEGY
      Step 1: compute average curves per bin on every host (Host 1, Host 2 and Host 3, each with bins [0:10[ and [10:20[)
      Step 2: collect the per-bin results on Host 1 and find the best split
  12. FROM MLLIB TO COURBOSPARK
      To build the tree:
      • Criteria: entropy, Gini, variance
      • Data structure: LabeledPoint
  13. FROM MLLIB TO COURBOSPARK
      To build the tree:
      • Criteria: entropy, Gini, variance, inertia (to compare time-series)
      • Data structure: LabeledPoint, TimeSeries
      • Finding split point for nominal features
      For data visualization of the tree:
      • Quantiles on the nodes and leaves
      • Loss of inertia
      • Number of curves per node and per leaf
  14. DEALING WITH NOMINAL FEATURES
      Current implementation for regression:
      → order the categories by their mean on the target
      [Figure: categories A, B, C, D positioned by their target mean]
      Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{D}
  15. NOMINAL VALUES: TYPE OF CONTRACT
      4 CATEGORIES {A, B, C, D}
      [Figure: A, B, C, D?]
  16. DEALING WITH NOMINAL FEATURES
      Hard to order curves…
      Solution 1:
      Compare curves 2 by 2 → {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}…
      Problem: combinatorial explosion in n, the number of distinct categories. Complexity is O(2^n)
  17. DEALING WITH NOMINAL FEATURES
      Solution 2:
      Agglomerative Hierarchical Clustering. Bottom-up approach.
      Complexity is O(n^3), and we don't expect n > 100
  18. HOW TO
      • Algorithm parameters
      • Configure Spark context
      • Load the data file
      • Learn the model
  19. LOOKING FOR THE TEST CONFIGURATION
      For a constant global capacity on 12 nodes: 120 cores + 120 GB RAM
      #Executors   RAM per exec.   Cores per exec.   Performance on 100 GB of data
      12           10 GB           10                22 minutes
      24           5 GB            5                 17 minutes
      60           2 GB            2                 12 minutes
      120          1 GB            1                 15 minutes
  20. SCALABILITY TO #CONTAINERS
  21. SCALABILITY TO #CONTAINERS
  22. SCALABILITY TO #CONTAINERS
  23. SCALABILITY TO #LINES
  24. FRAMEWORK STABILITY
      Tested on:
      • 10 GB, 100 GB, 200 GB, 300 GB, 400 GB, 500 GB, 1 TB
      • Categorical and continuous variables
      • Bin sizes from 100 to 1000
  25. SCALABILITY TO #COLUMNS
  26. SCALABILITY TO #CATEGORIES
  27. [No text on this slide]
  28. REAL LIFE DATASET
      [Figure: training time versus data size; x-axis: Data in GB (0 to 1400); y-axis: Time in minutes (0 to 400)]
      • 9 executors with 20 GB and 8 cores
      • 10 to 1,000 million load curves (10 numerical and 10 categorical features)
  29. TUNING
      • spark.default.parallelism
      • spark.executor.memory
      • spark.storage.memoryFraction
      • spark.akka.frameSize
  30. FEEDBACK: developers' view
      • Flawless transition from local to cluster mode
      • Debug mode with an IDE
      • Good performance requires knowledge
  31. HEY SCALA <3
  32. FEEDBACK: data scientists' view
      • The API is not very data oriented
      • …but now we have SparkSQL and DataFrames!
      • IPython + pySpark
      • Feature engineering vs. model engineering
  33. FEEDBACK: ops view
      • Better than MapReduce
      • Performance is predictable for tested code
      • YARNed
      • Lots of releases, MLlib code is evolving quickly
  34. FUTURE WORK
      • Unbalanced trees
      • Improve performance
      • Other criteria for time-series comparison
      • Missing values in explanatory features
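
The record count on slide 4 can be checked with a few lines of arithmetic. This is only a sanity check of the figures quoted on the slide; the variable names are illustrative.

```python
# Figures taken from slide 4.
customers = 35_000_000
minutes_between_measures = 10
days_per_year = 365

# One measurement every 10 minutes gives 144 points per day.
points_per_day = 24 * 60 // minutes_between_measures

# Total number of records produced in one year.
records_per_year = customers * points_per_day * days_per_year

print(points_per_day)                  # 144
print(records_per_year / 1e9)          # about 1,840 billion records
```

The result is consistent with the "1800 billion records" order of magnitude given on the slide.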
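
The two inertia criteria of slide 7 can be sketched in plain Python. This is a minimal illustration, not the Courbospark code: it assumes the squared Euclidean distance between curves (the slides do not name the distance), and the function names and the tiny 3-point curves are made up for the example.

```python
from typing import List, Sequence

Curve = Sequence[float]


def mean_curve(curves: List[Curve]) -> List[float]:
    """Point-wise average of curves that all have the same length."""
    n = len(curves)
    return [sum(points) / n for points in zip(*curves)]


def sq_dist(a: Curve, b: Curve) -> float:
    """Squared Euclidean distance (an assumed choice of curve distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra_inertia(groups: List[List[Curve]]) -> float:
    """Dispersion of the curves around the mean curve of their own group."""
    total = 0.0
    for group in groups:
        center = mean_curve(group)
        total += sum(sq_dist(curve, center) for curve in group)
    return total


def inter_inertia(groups: List[List[Curve]]) -> float:
    """Size-weighted distance of each group mean to the overall mean curve."""
    overall = mean_curve([curve for group in groups for curve in group])
    return sum(len(g) * sq_dist(mean_curve(g), overall) for g in groups)


# Hypothetical split on "Equipment type": two electrical and two gas curves.
electrical = [[1.0, 5.0, 9.0], [3.0, 7.0, 9.0]]
gas = [[1.0, 1.0, 2.0], [1.0, 3.0, 2.0]]
groups = [electrical, gas]

intra = intra_inertia(groups)
inter = inter_inertia(groups)
total = intra_inertia([electrical + gas])
```

With this distance, total inertia equals intra plus inter inertia, so minimizing the first criterion and maximizing the second select the same split.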
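
The horizontal strategy of slide 11 can be simulated without a cluster. In this sketch each inner list stands for the rows held by one host; the helper names and the sample data are hypothetical. The pattern (per-partition partial aggregates, then a merge) is comparable in spirit to Spark's `RDD.aggregate`, but nothing here claims to reproduce the MLlib or Courbospark internals.

```python
from functools import reduce


def partial_stats(partition):
    """Step 1, on each host: per-bin sum of curves and number of curves."""
    stats = {}
    for bin_id, curve in partition:
        curve_sum, count = stats.get(bin_id, ([0.0] * len(curve), 0))
        stats[bin_id] = ([a + b for a, b in zip(curve_sum, curve)], count + 1)
    return stats


def merge(left, right):
    """Step 2, on the collecting host: combine two sets of partial results."""
    merged = dict(left)
    for bin_id, (curve_sum, count) in right.items():
        if bin_id in merged:
            other_sum, other_count = merged[bin_id]
            curve_sum = [a + b for a, b in zip(other_sum, curve_sum)]
            count += other_count
        merged[bin_id] = (curve_sum, count)
    return merged


# Three "hosts", each holding (bin of the explanatory feature, 2-point curve).
hosts = [
    [("[0:10[", [1.0, 2.0]), ("[10:20[", [4.0, 4.0])],
    [("[0:10[", [3.0, 2.0])],
    [("[10:20[", [6.0, 8.0]), ("[0:10[", [2.0, 5.0])],
]

collected = reduce(merge, map(partial_stats, hosts))
average_curves = {
    bin_id: [value / count for value in curve_sum]
    for bin_id, (curve_sum, count) in collected.items()
}
```

Only sums and counts travel between hosts, so the amount of data collected depends on the number of bins and the curve length, not on the number of curves.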
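
The regression approach of slide 14 (order the categories by their mean target, then test only the splits that respect this order) can be sketched as follows. The function name and the sample values are illustrative; the values are chosen so that the resulting order is A, C, B, D, as in the partitions listed on the slide.

```python
def ordered_partitions(samples):
    """Return the n-1 candidate binary partitions for a nominal feature.

    samples: list of (category, numeric target) pairs.
    """
    # Accumulate sum and count of the target per category.
    stats = {}
    for category, target in samples:
        total, count = stats.get(category, (0.0, 0))
        stats[category] = (total + target, count + 1)

    # Order categories by their mean target value.
    ordered = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])

    # Cut the ordered list at every position: n categories give n-1 splits.
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


samples = [("A", 1.0), ("A", 2.0), ("B", 6.0),
           ("C", 4.0), ("C", 5.0), ("D", 9.0)]
partitions = ordered_partitions(samples)

for left, right in partitions:
    print(left, "/", right)
```

This trick relies on a scalar target that can be sorted, which is why it does not carry over directly to curves.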
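
Solution 2 on slide 17 can be sketched as a naive agglomerative clustering of the per-category mean curves. Several points are assumptions, since the slides do not give them: the squared Euclidean distance, merging clusters by size-weighted mean, and stopping when two clusters remain so that they form the binary split. All names and data are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two curves (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_clusters(c1, c2):
    """Merge two clusters given as (categories, mean curve, size)."""
    cats1, mean1, n1 = c1
    cats2, mean2, n2 = c2
    n = n1 + n2
    mean = [(x * n1 + y * n2) / n for x, y in zip(mean1, mean2)]
    return (cats1 + cats2, mean, n)


def binary_split_by_ahc(category_means):
    """category_means: {category: (mean curve, number of curves)}."""
    # Bottom-up: start with one cluster per category.
    clusters = [([cat], list(mean), n) for cat, (mean, n) in category_means.items()]
    while len(clusters) > 2:
        # Find the two closest clusters among all pairs.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: sq_dist(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = merge_clusters(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(cluster[0]) for cluster in clusters]


contract_types = {
    "A": ([1.0, 1.0], 10),
    "B": ([1.0, 2.0], 10),
    "C": ([8.0, 9.0], 5),
    "D": ([9.0, 9.0], 5),
}
split = binary_split_by_ahc(contract_types)
```

Each of the at most n-2 merges scans all pairs of clusters, which gives the O(n^3) behaviour quoted on the slide (ignoring the curve length), against the 2^(n-1) - 1 binary partitions of an exhaustive search.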
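
The four properties of slide 29 can be passed to `spark-submit` as `--conf key=value` pairs. The sketch below only builds that argument list; the values are placeholders chosen for illustration, not the settings used in the presented tests, and the property names are the Spark 1.x spellings.

```python
# Illustrative values only: the slides list the properties, not their settings.
tuning = {
    "spark.default.parallelism": "240",
    "spark.executor.memory": "2g",
    "spark.storage.memoryFraction": "0.6",
    "spark.akka.frameSize": "64",
}


def to_submit_args(properties):
    """Turn a property dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(properties.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(tuning)
print(" ".join(["spark-submit"] + submit_args))
```

Keeping the properties in one dictionary makes it easy to rerun the same job with different executor sizes, as in the comparison of slide 19.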
