SlideShare une entreprise Scribd logo
1  sur  38
COURBOSPARK:
DECISION TREE FOR
TIME-SERIES ON SPARK
Christophe Salperwyck – EDF R&D
Simon Maby – OCTO Technology - @simonmaby
Xdata project: www.xdata.fr, grants from
"Investissement d'Avenir" program, 'Big Data' call
| 2
AGENDA
1. PROBLEM DESCRIPTION
2. IMPLEMENTATION
• Courbotree: presentation of the algorithm
• From mllib to courbospark
3. PERFORMANCES
• Configuration (cluster description, spark config…)
4. FEEDBACK ON SPARK/MLLIB
| 3
FRENCH METERS DATA
| 4
• 1 measure every 10 min
• 35 million customers
• Time-series: 144 points x 365 days
 Annual data volume: 1800 billion records, 120 TB
of raw data
BIG DATA!
| 5
LOAD CURVES CLASSIFICATION
Contract type Region … Equipment type Load Curve
9KVA 75 … Elec
6KVA 22 … Gas
… … … … …
12KVA 34 … Elec
| 6
WHY A DECISION TREE?
• Easy to understand
• Ability to explore the model
• Ability to choose the
expressivity of the model
| 7
Goal: find the most different curves depending on an explanatory
feature
How to split? we can either:
• Minimize curves dispersion (intra inertia)
or
• Maximize differences between average curves (inter inertia)
SPLIT CRITERIA: INERTIA
| 8
MAXIMIZE DIFFERENCES BETWEEN AVERAGE
CURVES (feature: Equipment Type)
Electrical
Gas
Hour
PinW
ArgMax(d)
mean
| 9
EXISTING DISTRIBUTED DECISION TREE
Scalable Distributed Decision Trees in Spark MLLib
Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet
Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp-
content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf
A MapReduce Implementation of C4.5 Decision Tree Algorithm
Wei Dai, Wei Ji. International Journal of Database Theory and Application. Vol. 7, No. 1, 2014, pages 49-
60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf
Distributed Decision Tree Learning for Mining Big Data Streams
Arinto Murdopo, Master thesis, Yahoo ! Labs Barcelona, July 2013.
http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
| 10
MLLIB DECISION TREE PARALLELIZATION
| 11
Step 1:
compute average
curves
[0:10[ [10:20[ [0:10[ [10:20[ [0:10[ [10:20[
Host 1 Host 2 Host 3
[0:10[ [10:20[
Host 1
Step 2:
collect and find
the best split
HORIZONTAL STRATEGY
| 12
To build the tree:
• Criteria: entropy, Gini, variance
• Data structure: LabelPoint
FROM MLLIB TO COURBOSPARK
| 13
To build the tree:
• Criteria: entropy, Gini, variance, inertia (to compare time-series)
• Data structure: LabelPoint, TimeSeries
• Finding split point for nominal features
For data visualization of the tree:
• Quantile on the nodes and leaves
• Lost of inertia
• Number of curves per nodes, leaves
FROM MLLIB TO COURBOSPARK
| 14
DEALING WITH NOMINAL FEATURES
Current implementation for regression:
 order the categories by their mean on the target
A BC D
Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{C}
| 15
NOMINAL VALUES: TYPE OF CONTRACT
4 CATEGORIES {A, B, C, D}
A B
C D?
| 16
DEALING WITH NOMINAL FEATURES
Hard to order curves…
Solution 1:
Compare curves 2 by 2  {A}/{BCD}, {AB}/{CD}, {ABC}/{D},
{AC}/{BD}…
Problem:
Combinatory problem depending on n the number of
different categories. Complexity is O(2n).
| 17
DEALING WITH NOMINAL FEATURES
Solution 2:
Agglomerative Hierarchical Clustering. Bottom up approach.
Complexity is O(n3) - we don’t expect n > 100
| 18
HOW TO
Algorithm parameters
Configure spark context
Load the data file
Learn the model
| 19
LOOKING FOR THE TEST CONFIGURATION
For a constant global capacity on 12 nodes:
•120 cores + 120 GB RAM
#Executors RAM per exec. Cores per exec. Performance on
100Gb data
12 10 GB 10 22 minutes
24 5 GB 5 17 minutes
60 2 GB 2 12 minutes
120 1 GB 1 15 minutes
| 20
SCALABILITY TO #CONTAINERS
| 21
SCALABILITY TO #CONTAINERS
| 22
SCALABILITY TO #CONTAINERS
| 23
SCALABILITY TO #LINES
| 24
FRAMEWORK STABILITY
Tested on:
• 10GB, 100GB, 200GB, 300GB,
400GB, 500GB, 1TB
• Categorical and continuous
variables
• Bin sizes from 100 to 1000
| 25
SCALABILITY TO #COLUMNS
| 26
SCALABILITY TO #CATEGORIES
| 27
| 28
REAL LIFE DATASET
0
50
100
150
200
250
300
350
400
0 200 400 600 800 1000 1200 1400
Timeinminutes
Data in GB
• 9 executors with 20 GB and 8 cores
• 10 to 1000 millions load curves (10 numerical and 10 categorical features)
| 29
• spark.default.parallelism
• spark.executor.memory
• spark.storage.memoryfraction
• spark.akka.framesize
TUNING
| 30
Developers view
• Flawless transition from local to cluster mode
• Debug mode with an IDE
• Good performances need knowledge
FEEDBACKS
| 31
HEY SCALA <3
| 32
Data Scientists view
• The API is not very data oriented
• …but now we have SparkSQL and Dataframes!
• IPython + pySpark
• Feature engineering VS model engineering
FEEDBACKS
| 33
OPS view
• Better than mapReduce
• Performances are predictable for tested code
• YARNed
• Lots of releases, MlLib code is evolving quickly
FEEDBACKS
| 34
FUTURE WORKS
• Unbalanced trees
• Improve performance
• Other criteria for time-series comparison
• Missing values in explanatory features
MERCI
| 36
EXISTING DISTRIBUTED DECISION TREE
Partitionning Engine Target Ensemble Pruning
MLlib Horizontal Spark Num + Nom Yes No
MR C4.5 Vertical MR Nom No No
PLANET Horizontal MR Num + Nom Yes No
SAMOA Vertical Storm/S4 Num + Nom Yes Yes
| 37
EXAMPLE OF TREE ON LOAD CURVES
| 38
Store a curve +
explanatory features
Main class to
learn the model
Inertia split crterion
Store statistiques about the
model: quantiles, average
curve, lost in inertia

Contenu connexe

Tendances

ExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiersRim Moussa
 
Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020GEO Analytics Canada
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsRim Moussa
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchRim Moussa
 
Opening the Path to Technical Excellence
Opening the Path to Technical ExcellenceOpening the Path to Technical Excellence
Opening the Path to Technical ExcellenceNETWAYS
 
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatengeKarin Patenge
 
Polar Use Case - ExtremeEarth Open Workshop
Polar Use Case  - ExtremeEarth Open WorkshopPolar Use Case  - ExtremeEarth Open Workshop
Polar Use Case - ExtremeEarth Open WorkshopExtremeEarth
 
Big Linked Data Federation - ExtremeEarth Open Workshop
Big Linked Data Federation - ExtremeEarth Open WorkshopBig Linked Data Federation - ExtremeEarth Open Workshop
Big Linked Data Federation - ExtremeEarth Open WorkshopExtremeEarth
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Geoffrey Fox
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence GeneratorRim Moussa
 
Addressing Emerging Challenges in Designing HPC Runtimes
Addressing Emerging Challenges in Designing HPC RuntimesAddressing Emerging Challenges in Designing HPC Runtimes
Addressing Emerging Challenges in Designing HPC Runtimesinside-BigData.com
 

Tendances (20)

ExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and Achievements
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
 
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP benchMultidimensional DB design, revolving TPC-H benchmark into OLAP bench
Multidimensional DB design, revolving TPC-H benchmark into OLAP bench
 
Opening the Path to Technical Excellence
Opening the Path to Technical ExcellenceOpening the Path to Technical Excellence
Opening the Path to Technical Excellence
 
The National Oceanographic Data Center’s NetCDF Templates
The National Oceanographic Data Center’s NetCDF TemplatesThe National Oceanographic Data Center’s NetCDF Templates
The National Oceanographic Data Center’s NetCDF Templates
 
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge
20190703_AGIT_GeoRasterWorkshop_GriddedData_KPatenge
 
Polar Use Case - ExtremeEarth Open Workshop
Polar Use Case  - ExtremeEarth Open WorkshopPolar Use Case  - ExtremeEarth Open Workshop
Polar Use Case - ExtremeEarth Open Workshop
 
Big Linked Data Federation - ExtremeEarth Open Workshop
Big Linked Data Federation - ExtremeEarth Open WorkshopBig Linked Data Federation - ExtremeEarth Open Workshop
Big Linked Data Federation - ExtremeEarth Open Workshop
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
Asd 2015
Asd 2015Asd 2015
Asd 2015
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Big Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use CaseBig Data Seervices in Danaos Use Case
Big Data Seervices in Danaos Use Case
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
 
Addressing Emerging Challenges in Designing HPC Runtimes
Addressing Emerging Challenges in Designing HPC RuntimesAddressing Emerging Challenges in Designing HPC Runtimes
Addressing Emerging Challenges in Designing HPC Runtimes
 
Earth Science Data and Information System (ESDIS) Project Update
Earth Science Data and Information System (ESDIS) Project UpdateEarth Science Data and Information System (ESDIS) Project Update
Earth Science Data and Information System (ESDIS) Project Update
 

En vedette

Real-time Energy Data Analytics with Storm
Real-time Energy Data Analytics with StormReal-time Energy Data Analytics with Storm
Real-time Energy Data Analytics with StormDataWorks Summit
 
Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011Hortonworks
 
Determining the k in k-means with MapReduce
Determining the k in k-means with MapReduceDetermining the k in k-means with MapReduce
Determining the k in k-means with MapReduceThibault Debatty
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesDataWorks Summit
 
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...Amazon Web Services
 
Word count example in hadoop mapreduce using java
Word count example in hadoop mapreduce using javaWord count example in hadoop mapreduce using java
Word count example in hadoop mapreduce using javaChirag Ahuja
 

En vedette (10)

Real-time Energy Data Analytics with Storm
Real-time Energy Data Analytics with StormReal-time Energy Data Analytics with Storm
Real-time Energy Data Analytics with Storm
 
Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011Apache Hadoop 0.23 at Hadoop World 2011
Apache Hadoop 0.23 at Hadoop World 2011
 
Pig on Storm
Pig on StormPig on Storm
Pig on Storm
 
Determining the k in k-means with MapReduce
Determining the k in k-means with MapReduceDetermining the k in k-means with MapReduce
Determining the k in k-means with MapReduce
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-series
 
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...
AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendatio...
 
Word count example in hadoop mapreduce using java
Word count example in hadoop mapreduce using javaWord count example in hadoop mapreduce using java
Word count example in hadoop mapreduce using java
 

Similaire à CourboSpark: Decision Tree for Time-series on Spark

Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...
Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...
Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...OCTO Technology
 
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...zionsaint
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...DataStax Academy
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...DATAVERSITY
 
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...Matteo Ferroni
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战Weiwei Fang
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4Gaurav "GP" Pal
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefGaurav "GP" Pal
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...Edge AI and Vision Alliance
 
Big data talk barcelona - jsr - jc
Big data talk   barcelona - jsr - jcBig data talk   barcelona - jsr - jc
Big data talk barcelona - jsr - jcJames Saint-Rossy
 
Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)Ivo Andreev
 

Similaire à CourboSpark: Decision Tree for Time-series on Spark (20)

CourboSpark
CourboSparkCourboSpark
CourboSpark
 
Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...
Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...
Petit Déjeuner Datastax 14-04-15 Courbo Spark : exemple de Machine Learning s...
 
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...
DARPA ERI Summit 2018: The End of Moore’s Law & Faster General Purpose Comput...
 
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste...
 
CONDOR @ NGCLE@e-Novia 15.11.2017
CONDOR @ NGCLE@e-Novia 15.11.2017CONDOR @ NGCLE@e-Novia 15.11.2017
CONDOR @ NGCLE@e-Novia 15.11.2017
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applic...
 
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
[EUC2016] DockerCap: a software-level power capping orchestrator for Docker c...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战数据中心网络研究:机遇与挑战
数据中心网络研究:机遇与挑战
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...
“Once-for-All DNNs: Simplifying Design of Efficient Models for Diverse Hardwa...
 
Big data talk barcelona - jsr - jc
Big data talk   barcelona - jsr - jcBig data talk   barcelona - jsr - jc
Big data talk barcelona - jsr - jc
 
Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

CourboSpark: Decision Tree for Time-series on Spark

  • 1. COURBOSPARK: DECISION TREE FOR TIME-SERIES ON SPARK Christophe Salperwyck – EDF R&D Simon Maby – OCTO Technology - @simonmaby Xdata project: www.xdata.fr, grants from "Investissement d'Avenir" program, 'Big Data' call
  • 2. | 2 AGENDA 1. PROBLEM DESCRIPTION 2. IMPLEMENTATION • Courbotree: presentation of the algorithm • From mllib to courbospark 3. PERFORMANCES • Configuration (cluster description, spark config…) 4. FEEDBACK ON SPARK/MLLIB
  • 4. | 4 • 1 measure every 10 min • 35 million customers • Time-series: 144 points x 365 days  Annual data volume: 1800 billion records, 120 TB of raw data BIG DATA!
  • 5. | 5 LOAD CURVES CLASSIFICATION Contract type Region … Equipment type Load Curve 9KVA 75 … Elec 6KVA 22 … Gas … … … … … 12KVA 34 … Elec
  • 6. | 6 WHY A DECISION TREE? • Easy to understand • Ability to explore the model • Ability to choose the expressivity of the model
  • 7. | 7 Goal: find the most different curves depending on an explanatory feature How to split? we can either: • Minimize curves dispersion (intra inertia) or • Maximize differences between average curves (inter inertia) SPLIT CRITERIA: INERTIA
  • 8. | 8 MAXIMIZE DIFFERENCES BETWEEN AVERAGE CURVES (feature: Equipment Type) Electrical Gas Hour PinW ArgMax(d) mean
  • 9. | 9 EXISTING DISTRIBUTED DECISION TREE Scalable Distributed Decision Trees in Spark MLLib Manish Amde (Origami Logic), Hirakendu Das (Yahoo! Inc.), Evan Sparks (UC Berkeley), Ameet Talwalkar (UC Berkeley). Spark Summit 2014. http://spark-summit.org/wp- content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf A MapReduce Implementation of C4.5 Decision Tree Algorithm Wei Dai, Wei Ji. International Journal of Database Theory and Application. Vol. 7, No. 1, 2014, pages 49- 60. http://www.chinacloud.cn/upload/2014-03/14031920373451.pdf PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo. VLDB 2009. http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36296.pdf Distributed Decision Tree Learning for Mining Big Data Streams Arinto Murdopo, Master thesis, Yahoo ! Labs Barcelona, July 2013. http://people.ac.upc.edu/leandro/emdc/arinto-emdc-thesis.pdf
  • 10. | 10 MLLIB DECISION TREE PARALLELIZATION
  • 11. | 11 Step 1: compute average curves [0:10[ [10:20[ [0:10[ [10:20[ [0:10[ [10:20[ Host 1 Host 2 Host 3 [0:10[ [10:20[ Host 1 Step 2: collect and find the best split HORIZONTAL STRATEGY
  • 12. | 12 To build the tree: • Criteria: entropy, Gini, variance • Data structure: LabelPoint FROM MLLIB TO COURBOSPARK
  • 13. | 13 To build the tree: • Criteria: entropy, Gini, variance, inertia (to compare time-series) • Data structure: LabelPoint, TimeSeries • Finding split point for nominal features For data visualization of the tree: • Quantile on the nodes and leaves • Lost of inertia • Number of curves per nodes, leaves FROM MLLIB TO COURBOSPARK
  • 14. | 14 DEALING WITH NOMINAL FEATURES Current implementation for regression:  order the categories by their mean on the target A BC D Partitions tested: {A}/{CBD}, {AC}/{BD}, {ACB}/{C}
  • 15. | 15 NOMINAL VALUES: TYPE OF CONTRACT 4 CATEGORIES {A, B, C, D} A B C D?
  • 16. | 16 DEALING WITH NOMINAL FEATURES Hard to order curves… Solution 1: Compare curves 2 by 2  {A}/{BCD}, {AB}/{CD}, {ABC}/{D}, {AC}/{BD}… Problem: Combinatory problem depending on n the number of different categories. Complexity is O(2n).
  • 17. | 17 DEALING WITH NOMINAL FEATURES Solution 2: Agglomerative Hierarchical Clustering. Bottom up approach. Complexity is O(n3) - we don’t expect n > 100
  • 18. | 18 HOW TO Algorithm parameters Configure spark context Load the data file Learn the model
  • 19. | 19 LOOKING FOR THE TEST CONFIGURATION For a constant global capacity on 12 nodes: •120 cores + 120 GB RAM #Executors RAM per exec. Cores per exec. Performance on 100Gb data 12 10 GB 10 22 minutes 24 5 GB 5 17 minutes 60 2 GB 2 12 minutes 120 1 GB 1 15 minutes
  • 20. | 20 SCALABILITY TO #CONTAINERS
  • 21. | 21 SCALABILITY TO #CONTAINERS
  • 22. | 22 SCALABILITY TO #CONTAINERS
  • 24. | 24 FRAMEWORK STABILITY Tested on: • 10GB, 100GB, 200GB, 300GB, 400GB, 500GB, 1TB • Categorical and continuous variables • Bin sizes from 100 to 1000
  • 26. | 26 SCALABILITY TO #CATEGORIES
  • 27. | 27
  • 28. | 28 REAL LIFE DATASET 0 50 100 150 200 250 300 350 400 0 200 400 600 800 1000 1200 1400 Timeinminutes Data in GB • 9 executors with 20 GB and 8 cores • 10 to 1000 millions load curves (10 numerical and 10 categorical features)
  • 29. | 29 • spark.default.parallelism • spark.executor.memory • spark.storage.memoryfraction • spark.akka.framesize TUNING
  • 30. | 30 Developers view • Flawless transition from local to cluster mode • Debug mode with an IDE • Good performances need knowledge FEEDBACKS
  • 32. | 32 Data Scientists view • The API is not very data oriented • …but now we have SparkSQL and Dataframes! • IPython + pySpark • Feature engineering VS model engineering FEEDBACKS
  • 33. | 33 OPS view • Better than mapReduce • Performances are predictable for tested code • YARNed • Lots of releases, MlLib code is evolving quickly FEEDBACKS
  • 34. | 34 FUTURE WORKS • Unbalanced trees • Improve performance • Other criteria for time-series comparison • Missing values in explanatory features
  • 35. MERCI
  • 36. | 36 EXISTING DISTRIBUTED DECISION TREE Partitionning Engine Target Ensemble Pruning MLlib Horizontal Spark Num + Nom Yes No MR C4.5 Vertical MR Nom No No PLANET Horizontal MR Num + Nom Yes No SAMOA Vertical Storm/S4 Num + Nom Yes Yes
  • 37. | 37 EXAMPLE OF TREE ON LOAD CURVES
  • 38. | 38 Store a curve + explanatory features Main class to learn the model Inertia split crterion Store statistiques about the model: quantiles, average curve, lost in inertia

Notes de l'éditeur

  1. Ability to choose the expressivity of the model (by choosing the number of leaves)
  2. Inertia/mean is additive
  3. Horizontal scalability
  4. Complexity is linear with the number of categories Why not ordered categories by their distance to the mean curve to keep the current implementation?
  5. How do I order those curves?
  6. 100 => 0.3s 200 => 2s 400 => 15s
  7. Meilleur conf avec deux variables
  8. Inertia is additive
  9. Inertia is additive
  10. Inertia is additive
  11. Inertia is additive
  12. Inertia is additive
  13. Inertia is additive
  14. Inertia is additive
  15. Parler des next steps pour scaler mieux. Études à lancer
  16. Flawless transition from local to cluster mode Debug mode with an IDE Good performances need knowledge Hey Scala The API is not very data oriented But SparkSQL and Dataframes are changing the game IPython + pySpark Custom code is possible: feature engineering VS model engineering
  17. Flawless transition from local to cluster mode Debug mode with an IDE Good performances need knowledge Hey Scala The API is not very data oriented But SparkSQL and Dataframes are changing the game IPython + pySpark Custom code is possible: feature engineering VS model engineering
  18. Flawless transition from local to cluster mode Debug mode with an IDE Good performances need knowledge Hey Scala The API is not very data oriented But SparkSQL and Dataframes are changing the game IPython + pySpark Custom code is possible: feature engineering VS model engineering
  19. YARN monitoring, scheduling and ressource management
  20. To be updated