1. EDF conducted a proof of concept using Hadoop to store and analyze massive time-series data from smart meters.
2. The proof of concept involved storing over 1 billion records per day from 35 million smart meters and running analytics queries against them.
3. Results showed Hadoop could handle tactical queries with low latency and complex analytical queries within acceptable timeframes, making it a low-cost option for massive time-series storage and analysis.
Hadoop Proof of Concept for Massive Time-Series Analytics
1. A proof of concept with Hadoop: storage and analytics of electrical time-series
June 13th, 2012
Bruno JACQUIN, Marie-Luce PICARD, Leeley DAIO-PIRES DOS SANTOS, Alzennyr GOMES DA SILVA, David WORMS, Charles BERNARD
2. Outline
1. A very brief presentation of the EDF Group
2. Smart metering data
3. Massive data management for utilities?
4. A Proof of Concept using Hadoop
5. Conclusion and Perspectives
4. EDF Group profile
• A leading player in the energy market, active in all areas of electricity from generation to trading and network management.
• Balance between regulated and deregulated activities.
• Expertise in engineering and operating generation plants and networks.
• Expertise in the design and promotion of energy eco-efficiency solutions.
• Leader in the French and UK electricity markets, solid positions in Italy and numerous other European countries; industrial operations in Asia and the United States.
Key figures (consolidated data at 12.31.2010):
• 37 million customers worldwide
• 630.4 TWh electricity generation worldwide
• 108.9 g of CO2 per kWh generated (CO2 emissions from EDF Group electricity and heat generation)
• 158,842 employees worldwide
• €65.2 billion in sales
6. Smart-Grids projects everywhere in the world ...
(Map: red = electricity, green = gas, blue = water; triangle = trial or pilot, circle = project)
EDF R&D: creating value and preparing the future
7. The EDF Group: a bright outlook for smart grids
(Diagram: benefits centered on the smart meter)
• Lower consumption peaks mean less dependence on high-carbon generation
• Clearer information to raise awareness of energy-saving strategies
• Decarbonization of the energy mix through a smoother integration of renewable energies into networks
• Billing based on actual consumption
• Reduction in network losses to boost the competitiveness of the system
• Precision in targeted investments for the maintenance and modernization of networks
• More efficient repairs to networks after extreme weather events
• Promoting the development of electric transportation that emits fewer greenhouse gases
• New energy uses (e.g. electric mobility, storage, etc.)
8. Smart Grids: what? And what for?
• Environmental, economic, social and policy drivers lead to a deep change in the energy sector:
  • Climate change, environmental concerns
  • Increased pressure for operational and financial efficiency
  • Increasing awareness of consumers, role of citizens
  • Technological pressure (IT, smart devices)
Source – Wikipedia:
A smart grid delivers electricity from suppliers to consumers using digital technology with two-way communications to control appliances at consumers' homes to save energy, reduce cost and increase reliability and transparency. It overlays the electrical grid with an information and net metering system, and includes smart meters. Such a modernized electricity network is being promoted by many governments as a way of addressing energy independence, global warming and emergency resilience issues.
10. What does a load curve look like? (Big Data, or « the data deluge »)
11. What does a load curve look like? (2)
12. What does a load curve look like? (3)
Individual load curves:
- Left: same customer, two different days
- Top: same day, two different customers
14. Massive data management in the energy domain: myth or reality?
• Challenges:
  • More complexity in the electric power system (demand response, distributed generation …)
  • Faster evolution of customer indoor equipment (smart meters and devices, Internet of Things …)
  ⇒ Core business will involve more IT and data management
• The R&D SIGMA project deals with scalability and Big Data:
  • Skills on Big Data techniques
  • Prototyping on business cases
  • With internal (IT), academic or industrial partners
15. Massive data management in the energy domain: myth or reality?
• The SIGMA project studies and experiments with appropriate methods and techniques:
  • Storage technologies for massive data sets, especially time-series
  • Data processing:
    • Complex Event Processing, real-time analytical processing
    • Large-scale data mining: massively parallel processing, distributed data mining
  • Use cases:
    • Smart grids, CRM and customer insight, generation optimization: consumption and production forecasting, power plant maintenance
17. Storing massive time series
• Objective: Proof of Concept for running a large number of queries (variable levels of complexity, with variable scopes, frequencies and acceptable latencies) on a huge number of load curves
• Data: individual curves, weather data, contractual information, network data
• 1 measurement every 10 min for 35 million customers, over one year
• Annual volume of data: 1,800 billion records; 120 TB of uncompressed data
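These volumes are easy to sanity-check: a reading every 10 minutes is 144 readings per meter per day, and the implied per-record size falls out of the totals. A quick back-of-the-envelope in Python (the bytes-per-record figure is derived here, not stated in the slides):

```python
# Sanity-check the announced data volumes for the PoC.
meters = 35_000_000
readings_per_day = 24 * 60 // 10      # one measurement every 10 minutes -> 144

records_per_day = meters * readings_per_day
records_per_year = records_per_day * 365

print(f"{records_per_day / 1e9:.2f} billion records/day")    # ~5 billion
print(f"{records_per_year / 1e9:.0f} billion records/year")  # ~1,840 billion, matching the slide

# Implied average uncompressed record size, from the 120 TB annual total
bytes_per_record = 120e12 / records_per_year
print(f"~{bytes_per_record:.0f} bytes/record")
```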
18. Storing massive time series: objectives
• Build an « operational Data Warehouse » able to:
  • Serve a large volume of data
  • Ingest newly arriving data
    • Pre-processing, synchronization and filling
  • Allow concurrent, simultaneous queries:
    • Tactical queries: curve selection compared with a mean curve
    • Analytical queries: aggregated curves
    • Ad-hoc queries
    • ‘Recoflux’ (simplified)
  • Extraction capabilities
20. Using relational technologies for storing massive time series
• Relational approaches, Very Large DataBases
• Work carried out with partners: Teradata, Oracle, IBM, EMC², HP
• Appliances or software offers
• Shared-nothing or shared-everything? Column-based, row-based or hybrid mode?
• Separation between an operational use (ODS) and an analytical use (DWH)?
21. Using Hadoop for storing massive time series
• Native distributed file system (HDFS)
• Distributed processing using the Map/Reduce paradigm
• Wide adoption among dotcoms but very limited industrial deployment; maturity is yet to come, despite the major vendors arriving with offers including integration, appliances and support
• Internal PoC concluded in April 2012
22. The data model
Compressed data volume on HDFS:
⇒ 10 TB (× 3 replicas)
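Going from 120 TB of raw data to 10 TB on HDFS implies roughly a 12:1 compression ratio; with HDFS's 3-way replication the physical footprint is about 30 TB. A quick check (the ratio is derived, not stated on the slide):

```python
# Derive the compression ratio and physical footprint from the slide's figures.
raw_tb = 120          # uncompressed annual volume
compressed_tb = 10    # compressed volume on HDFS, before replication
replication = 3       # HDFS replication factor used in the PoC

ratio = raw_tb / compressed_tb
physical_tb = compressed_tb * replication
print(f"compression ratio ~{ratio:.0f}:1, physical footprint {physical_tb} TB")
```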
24. Design
• Hive at the center of our DW
• HBase at the forefront of data access
25. Design
• Hive at the center of our DW:
  • Allows ad-hoc and complex analytical queries
  • Customer tables stored as RCFile are replicated on all Data Nodes (19)
  • Consumption measurements are partitioned by day and by customer profile criteria
  • Daily volume: 25 GB; average block size: 10 MB
• HBase at the forefront of data access:
  • Allows low-latency queries
  • Recent metering data stored in “in-memory” tables
  • Stores a subset of measurements and aggregates in tables with “Bloom filters” enabled
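The slides do not give the actual HBase row-key layout. A common design for this access pattern, sketched here purely as an assumption, is a composite key of customer id plus day, so that fetching a customer's recent curve is a short contiguous scan:

```python
# Hypothetical HBase row-key layout (not from the slides): composite
# <customer_id, day> key so one customer's days are stored contiguously.

def row_key(customer_id: int, day: str) -> bytes:
    """Build a composite row key, e.g. row_key(136630, '2008-01-01')."""
    # Zero-pad the id so keys sort numerically under HBase's byte-wise ordering.
    return f"{customer_id:010d}#{day}".encode()

def scan_range(customer_id: int) -> tuple[bytes, bytes]:
    """Start/stop keys covering every day stored for one customer."""
    prefix = f"{customer_id:010d}#"
    return prefix.encode(), (prefix + "~").encode()  # '~' sorts after digits and '-'

start, stop = scan_range(136630)
print(start, stop)
```

With this layout, the tactical "curve selection" query from the PoC maps to a single bounded scan rather than a full-table filter.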
26. Hardware configuration: the cluster
• 20 nodes in 2 racks:
  • 7 × 1U nodes with 4 × 1 TB disks
  • 13 × 2U nodes with 8 × 1 TB disks
• Total: 132 TB; 336 cores (AMD)
• Hadoop distribution: Cloudera CDH3u3 (open source)
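The 132 TB total follows directly from the per-node disk counts:

```python
# Verify the advertised raw capacity from the per-node disk configuration.
small_nodes_tb = 7 * 4 * 1    # 7 x 1U nodes, 4 x 1 TB disks each
large_nodes_tb = 13 * 8 * 1   # 13 x 2U nodes, 8 x 1 TB disks each
total_tb = small_nodes_tb + large_nodes_tb
print(total_tb)  # 132
```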
28. Time series representation models: options
§ TUPLE
CREATE TABLE cdc_tuple ( id_cdc INT, date_releve TINYINT, p INT )
PARTITIONED BY (day STRING, optarif STRING, psousc TINYINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
§ ARRAY
CREATE TABLE cdc_array ( id_cdc INT, values array< array< int > > ) …
§ COLUMN
CREATE TABLE cdc_144_cols ( id_cdc INT, p1 INT, p2 INT, …, p144 INT ) …
29. Time series representation models: options
Getting a daily individual load curve:
SELECT * FROM cdc_tuple WHERE day = '2008-01-01' AND id_cdc = 136630;
30. Time series representation models: impact
• Computing a global aggregated load curve for 1 day

Representation model   Daily volume             Query execution time
Tuple                  10.1 GB (× 3 replicas)   2 min 22 sec
Column                 8.8 GB (× 3 replicas)    1 min 17 sec
Array                  16 GB (× 3 replicas)     1 min 18 sec
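One likely reason the column and array models outperform the tuple model (an interpretation, not stated on the slide) is row count: the tuple model stores one row per measurement, while column and array store one row per customer per day, cutting the rows Hive must process by 144×:

```python
# Rows per day under each representation model (derived from the PoC figures).
customers = 35_000_000
measurements_per_day = 144

tuple_rows = customers * measurements_per_day  # one row per measurement
column_or_array_rows = customers               # one row per customer per day

print(f"tuple: {tuple_rows:,} rows/day")
print(f"column/array: {column_or_array_rows:,} rows/day")
print(f"row-count reduction: {tuple_rows // column_or_array_rows}x")
```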
31. Results - HBase
• Tactical queries are successfully handled by HBase, offering low latencies under a high concurrent load.

Representation model                Execution period   Concurrent queries   Queries/sec   Query execution time (sec)
Columns (7 × 144)                   1 minute           100                  470           0.21
Array (7 × 1 array of 144 values)   1 minute           100                  495           0.20
Columns (7 × 144)                   5 minutes          500                  524           0.19
Array (7 × 1 array of 144 values)   5 minutes          500                  430           0.18

Query: curve selection
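A quick consistency check with Little's law (L = λ·W; this is an observation about the numbers, not something in the slides): for the 100-client runs, throughput × latency lands right at ~99 in-flight queries, matching the stated concurrency. For the 500-client runs the product stays near 100 rather than rising to 500, which could suggest the reported execution time excludes queue wait, or that the cluster saturates around that level.

```python
# Little's law: in-flight queries L = throughput (qps) * latency (s).
runs = [
    ("Columns, 100 clients", 470, 0.21),
    ("Array, 100 clients",   495, 0.20),
    ("Columns, 500 clients", 524, 0.19),
    ("Array, 500 clients",   430, 0.18),
]
for name, qps, latency_s in runs:
    print(f"{name}: ~{qps * latency_s:.0f} queries in flight")
```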
32. Results – Hive (1)

Query                                                                 Execution time (tuple representation)
Aggregation France (sum), 10 min interval                             1 min, 56 sec
Load curve aggregated by contractual information                      2 min, 21 sec
Analysing consumption trends by customers' building characteristics   1 hour, 18 sec
TOP N customers, candidates for a power level update                  1 hour, 7 min, 35 sec

Results for different queries:
- Planned queries (with adequate partitioning)
- Ad-hoc queries
33. Results – Hive (2)
• Recoflux scenarios

Scenario   Sequential mode (minutes)   Parallel mode (minutes)
           1 day      1 week           1 day      1 week
521        1.44       10.10            1.56       3.00
522        27.87      195.09           28.50      31.01
523        7.98       23.94
524        10.71      74.99            15.97      19.58
525        6.10       42.70            7.45       8.39
526        0.86       6.08             0.92       2.43

• Recoflux is a very important business application (power consumption aggregations are computed according to different criteria; updates and temporal data): the results are really acceptable.
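The parallel mode pays off mainly on the week-long runs. Computing the speedups from the table above:

```python
# One-week runtimes (minutes): sequential vs parallel, from the Recoflux table.
# Scenario 523 is omitted because its parallel figures are missing in the source.
week = {
    521: (10.10, 3.00),
    522: (195.09, 31.01),
    524: (74.99, 19.58),
    525: (42.70, 8.39),
    526: (6.08, 2.43),
}
for scenario, (seq, par) in week.items():
    print(f"scenario {scenario}: {seq / par:.1f}x speedup over one week")
```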
34. Using NoSQL technologies for storing massive time series: results
Integration of Hadoop with Tableau Software: visualization of 700k feeders
35. Alternative approach for storing massive time-series: conclusions
• The minuses:
  • Not yet mature; little feedback available from industry
  • Shortage of skills in Europe (configuration and tuning require specialized expertise)
  • Major vendors' offerings: still young, but actively emerging
• The pluses:
  • Low cost
  • Ability to recycle existing commodity hardware
  • One of the few solutions that allow coupling structured and unstructured data
  • Flexibility, despite being a complex system to deploy and manage; fault tolerant and scalable
36. Alternative approach for storing massive time-series: conclusions
• Perspectives:
  • Partners offering industrial support
  • Hardware configuration
  • Usage of statistical libraries
  • Connectivity with the relational world
• Usages:
  • ETL
  • Intelligent and reliable archival solution
  • High-throughput data presentation (publication)
37. Conclusions and perspectives
• Hadoop perspectives:
  • Non-traditional usage of Hadoop, with a structured schema
  • Will become a component of the company IS for non-critical usages
• Any suggestions?
  • Storage mode for time-series?
  • Usages?
• Contacts:
  • marie-luce.picard@edf.fr
  • bruno.jacquin@edf.fr