In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats, such as sensor readings, process logs, and structured data from RDBMSs. The need of the hour is setting up efficient data pipelines that can compute advanced analytics models on these data and use the results to customize services, predict future needs, or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines that can be automatically deployed on a variety of cloud-based execution platforms.
4. Sample Scenario
• Infrastructure for pollution monitoring managed by Lombardia Informatica, an agency of the Lombardy region in Italy.
• A network of sensors acquires pollution data every day. The data cover three entities (sketched below):
• sensors, containing information about a specific acquiring sensor, such as its ID, pollutant type, and unit of measure
• data acquisition stations, each managing a set of sensors, with information on their position (e.g., longitude/latitude)
• pollution values, containing the values acquired by the sensors, the timestamp, and the validation status. Each value is validated by a human operator who manually labels it as valid or invalid.
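A minimal sketch of the three entities as Python data classes; the field names are illustrative assumptions, not the actual Lombardia Informatica schema:

from dataclasses import dataclass
from typing import List

@dataclass
class Sensor:
    sensor_id: int           # ID of the acquiring sensor
    pollutant_type: str      # e.g., "PM10" (hypothetical value)
    unit_of_measure: str     # e.g., "ug/m3" (hypothetical value)

@dataclass
class Station:
    station_id: int
    longitude: float
    latitude: float
    sensor_ids: List[int]    # sensors managed by this station

@dataclass
class PollutionValue:
    sensor_id: int
    value: float
    timestamp: str           # acquisition time
    valid: bool              # label assigned manually by a human operator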
5. Reference Scenario
• The goal is to design and deploy a Big Data pipeline to:
• predict the labels of acquired data in real time
• alert the operator when anomalous values are observed
6. Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
8. Declarative Model
• The pipeline includes two processing stages: a training stage and a prediction stage
• Our declarative model will include two requirement specifications, DS1 for the batch training stage and DS2 for the streaming prediction stage (serialized in the sketch below):
DS1:
DataPreparation.DataTransformation.Filtering;
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Training;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Batch.
DS2:
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Prediction;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Streaming.
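A minimal sketch, assuming a plain Python serialization, of how DS1 and DS2 might be carried around as machine-readable specifications; only the dotted feature paths come from the slides, the dictionary layout is illustrative:

# Hypothetical serialization of the two requirement specifications.
declarative_model = {
    "DS1": [  # batch training stage
        "DataPreparation.DataTransformation.Filtering",
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Training",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Batch",
    ],
    "DS2": [  # streaming prediction stage
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Prediction",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Streaming",
    ],
}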
9. Procedural Model
• Based on the declarative models, the TOREADOR service selection (SS) will return a set of services consistent with DS1 and DS2
• The user can easily compose these services to address the scenario's goals
[Diagram: DS1 → SS → service composition SC1; DS2 → SS → service composition SC2]
10. Procedural Model
• The two compositions must be connected, as the egestion (output) of SC1 is the ingestion (input) for SC2
[Diagram: DS1 → SS → SC1; DS2 → SS → SC2]
13. Deployment Model
• The TOREADOR compiler translates SC1 and SC2 into executable orchestrations (WC1 and WC2, connected 1-n) in a suitable workflow language
[Diagram: DS1 → SS → SC1 → WC1; DS2 → SS → SC2 → WC2]
The compiled workflow, shown for both WC1 and WC2:
spark-filter-sensorsTest : filter
  --expr="sensorsDF#SensorId === 5958"
  --inputPath="/user/root/sensors/joined.csv"
  --outputPath="/user/root/sensors_test.csv" &&
spark-assemblerTest : spark-assembler
  --features="Data,Quote"
  --inputPath="/user/root/sensors_test.csv"
  --outputPath="/user/root/sensors/sensors_test_assembled.csv" &&
spark-gbt-predict : batch-gradientboostedtree-classification-predict
  --inputPath=/user/root/sensors/sensors
  --outputPath=/user/root/sensors/sensors
  --model=/user/root/sensors/model
15. Deployment Model
• The execution of WC2 produces the results (E2)
[Diagram: DS1 → SS → SC2 → WC2 → E2]
17. The Code-based Line
Code Once/Deploy Everywhere
The TOREADOR code-based line user is an expert programmer, aware of the potential (flexibility and controllability) and purposes (analytics developed from scratch or migration of legacy code) of a code-based approach.
She expresses the parallel computation of a coded algorithm in terms of parallel primitives. TOREADOR distributes it among computational nodes hosted by different cloud environments.
The resulting computation can be saved as a service for the service-based line.
[Diagram: I. Code → II. Transform → III. Deploy, via the skeleton-based code compiler]
18. Code-based compiler
import math
import random

def data_parallel_region(distr, func, *repl):
    # Parallel primitive: apply func to every element of the
    # distributed collection (sequential stand-in shown here).
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes the Euclidean distance between two vectors"""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))

def kmeans_init(data, k):
    """Returns the initial centroid configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns the given instance paired with the key of the
    nearest centroid (centroids are (key, vector) pairs)"""
    comparator = lambda c: distance(c[1], p)
    nearest = min(centroids, key=comparator)
    return (nearest[0], p)
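A short usage sketch (illustrative, not from the slides) showing how the data_parallel_region primitive drives the assignment step over a toy dataset:

# Toy dataset of 2-D points.
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]

# Key the centroids so kmeans_assign can return (key, point) pairs.
centroids = list(enumerate(kmeans_init(data, 2)))

# The parallel primitive maps the assignment step over the dataset;
# TOREADOR would distribute this region across cloud-hosted nodes.
assignments = data_parallel_region(data, kmeans_assign, centroids)
print(assignments)   # e.g., [(0, [0.0, 0.0]), (0, [0.1, 0.2]), ...]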
[Diagram: the source code is matched against parallel skeletons (MapReduce, Bag of Tasks, Producer-Consumer, ...) and transformed into a skeleton instance plus secondary scripts]
19. Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
20. Want to give it a try? Stay tuned!
http://www.toreador-project.eu/community/
Or reach us at info@toreador-project.eu
23. Declarative Models: vocabulary
• The declarative model offers a vocabulary for a computation-independent description of a BDA (Big Data Analytics campaign)
• Organized into 5 areas:
• Representation (Data Model, Data Type, Management, Partitioning)
• Preparation (Data Reduction, Expansion, Cleaning, Anonymization)
• Analytics (Analytics Model, Task, Learning Approach, Expected Quality)
• Processing (Analysis Goal, Interaction, Performances)
• Visualization and Reporting (Goal, Interaction, Data Dimensionality)
• Each specification can be structured on three levels (see the sketch below):
• Goal: Indicator – Objective – Constraint
• Feature: Type – Sub-Type – Sub-Sub-Type
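A minimal sketch of the two specification shapes; the goal values are hypothetical, while the feature path also appears later in the deck:

# Goal: Indicator - Objective - Constraint (hypothetical values).
goal_spec = {
    "indicator": "ExecutionTime",
    "objective": "Minimize",
    "constraint": "<= 1 hour",
}

# Feature: Type - Sub-Type - Sub-Sub-Type, written as a dotted path.
feature_spec = "Data_Analytics.Analytics_Aim.Task.Crisp_Clustering"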
24. Declarative Models
• A web-based GUI for specifying the requirements of a BDA
• No coding required: suitable for basic users
• Analytics services are provided by the target TOREADOR platform
• Big Data campaign built by composing existing services
• Based on model transformations
25. Declarative Models
• A web-based GUI for specifying the requirements of a BDA, e.g.:
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
30. Methodology: Building Blocks
• Declarative Specifications allow customers to define the declarative models shaping a BDA and to retrieve a set of compatible services
• The Service Catalog specifies the set of abstract services (e.g., algorithms, mechanisms, or components) available to Big Data customers and consultants for building their BDA
• The Service Composition Repository supports the specification of the procedural model defining how services are composed to carry out the Big Data analytics, as an abstract Big Data service composition
• Deployment Configurations define the platform-dependent version of a procedural model, as a workflow ready to be executed on the target Big Data platform
31. Overview of the Methodology
[Diagram: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution; backed by the Declarative Specifications, Service Catalog, Service Composition Repository, and Deployment Configurations of the MBDAaaS platform, targeting the Big Data platform]
32. Procedural Models
• Platform-independent models that formally and unambiguously describe how analytics should be configured and executed
• Generated following the goals and constraints specified in the declarative models
• They provide a workflow in the form of a service orchestration built from the following constructs (sketched below):
• Sequence
• Choice
• If-then
• Do-While
• Split-Join
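A minimal sketch, assuming a simple tuple-based representation, of how such an orchestration could be written down; the two parallel branch services are hypothetical:

# Hypothetical nested representation of a procedural model:
# a Sequence whose last step is a Split-Join of two branches.
procedural_model = (
    "sequence", [
        ("service", "filter"),                # from the sample scenario
        ("service", "spark-assembler"),       # from the sample scenario
        ("split-join", [
            ("service", "train-regression"),  # hypothetical branch
            ("service", "compute-metrics"),   # hypothetical branch
        ]),
    ],
)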
33. Service Composition
• The user creates the flow based on the list of returned services
• Services are enriched with ad hoc parameters
• The flow is submitted to the service, which translates it into an OWL-S service composition
36. Service Composition
• All internals are made explicit
• Clear specification of the services
• Reuse and modularity
38. Overview of the Methodology
[Diagram repeated from slide 31: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution]
39. Workflow compiler
• It consists of two main sub-processes (sketched below):
• Structure generation: the compiler parses the procedural model and identifies the process operators (sequence, alternative, parallel, loop) composing it
• Service configuration: for each abstract service in the procedural model, the corresponding concrete service is identified and inserted in the deployment model
• Supports transformations to any orchestration engine available as a service
• Available for Oozie and Spring XD
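A minimal sketch of the two sub-processes; the node shapes, catalog, and output format are assumptions for illustration, not the actual TOREADOR compiler:

# Hypothetical catalog mapping abstract services to the concrete ones
# installed on the target platform (used by service configuration).
CATALOG = {
    "filter": "spark-filter",
    "spark-assembler": "spark-assembler",
    "train-regression": "batch-gbt-train",   # hypothetical
    "compute-metrics": "spark-metrics",      # hypothetical
}

def compile_node(node):
    kind = node[0]
    if kind == "service":
        # Service configuration: swap in the platform's concrete service.
        return ("job", CATALOG[node[1]])
    if kind == "sequence":
        # Structure generation: recognize the operator, recurse on children.
        return ("chain", [compile_node(c) for c in node[1]])
    if kind == "split-join":
        return ("fork-join", [compile_node(c) for c in node[1]])
    raise ValueError("unknown operator: " + kind)

# Reusing the tuple-based procedural model sketched earlier:
deployment_model = compile_node(
    ("sequence", [("service", "filter"),
                  ("split-join", [("service", "train-regression"),
                                  ("service", "compute-metrics")])]))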
40. Deployment Model
• The workflow compiler takes as input:
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms)
• It produces as output an executable workflow, for example an Oozie workflow (sketched below):
• the XML file of the workflow
• job.properties
• system variables
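For concreteness, a heavily simplified sketch (emitted here as Python strings) of what the generated Oozie artifacts could look like; the action, script, and property names are illustrative assumptions:

# Sketch of a generated Oozie workflow.xml with a single shell action.
workflow_xml = """
<workflow-app name="sensors-pipeline" xmlns="uri:oozie:workflow:0.5">
  <start to="filter"/>
  <action name="filter">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>spark-filter.sh</exec>
      <argument>--inputPath=${inputPath}</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>filter action failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

# Sketch of the accompanying job.properties.
job_properties = """
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
inputPath=/user/root/sensors/joined.csv
oozie.wf.application.path=${nameNode}/user/root/apps/sensors-pipeline
"""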
41. Translating the Composition Structure
• Deployment models:
• specify how procedural models are instantiated and configured on a target platform
• drive analytics execution in real scenarios
• are platform-dependent
• The workflow compiler transforms the procedural model into a deployment model that can be directly executed on the target platform
• This transformation is based on a compiler that takes as input the OWL-S service composition and information on the target platform (e.g., installed services/algorithms), and produces as output a technology-dependent workflow
42. Translating the Composition Structure
• The OWL-S service composition structure is mapped onto the different control constructs of the target engine, as sketched below
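One plausible mapping from OWL-S control constructs to Oozie workflow nodes; this particular table is an illustrative assumption, not TOREADOR's published mapping:

# Hypothetical OWL-S construct -> Oozie node mapping.
OWLS_TO_OOZIE = {
    "Sequence":     "chain of <action> nodes linked via <ok to=.../>",
    "Split-Join":   "<fork>/<join> pair",
    "If-Then-Else": "<decision> node with <switch>/<case>",
    "Repeat-While": "unrolled or delegated to a sub-workflow "
                    "(Oozie workflows are acyclic)",
}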
43. Generating an Executable Workflow
• Workflows contain three distinct types of placeholders (see the sketch below):
• GREEN placeholders are SYSTEM variables defined in the Oozie properties
• RED placeholders are JOB variables defined in the job.properties file
• YELLOW placeholders are ARGUMENTS of the executable jobs on the Oozie server
• More on the demo…
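Without the colors, the three placeholder types can be sketched as follows; which variable falls into which class is an illustrative assumption:

# Hypothetical fragment of a generated action, one line per placeholder type.
action_fragment = """
<exec>${filterScript}</exec>                  <!-- GREEN: SYSTEM variable (Oozie properties) -->
<argument>--inputPath=${inputPath}</argument> <!-- RED: JOB variable (job.properties) -->
<argument>--expr=sensorsDF#SensorId === 5958</argument> <!-- YELLOW: job ARGUMENT -->
"""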