In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats, such as sensor readings, process logs, and structured data from RDBMSs. The need of the hour is setting up efficient data pipelines that can compute advanced analytics models on these data and use the results to customize services, predict future needs, or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines that can be automatically deployed on a variety of cloud-based execution platforms.
4. Sample Scenario
• Infrastructure for pollution monitoring managed by Lombardia Informatica, an agency of the Lombardy region in Italy.
• A network of sensors acquires pollution data every day. The data cover three entities (sketched below):
• sensors, containing information about a specific acquiring sensor, such as its ID, pollutant type, and unit of measure
• data acquisition stations, each managing a set of sensors, with information on their position (e.g., longitude/latitude)
• pollution values, containing the values acquired by the sensors, the timestamp, and the validation status. Each value is validated by a human operator who manually labels it as valid or invalid.
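A minimal sketch of the three entities as Python data classes; the field names are illustrative assumptions, not the actual Lombardia Informatica schema:

from dataclasses import dataclass
from typing import List

@dataclass
class Sensor:
    sensor_id: int           # ID of the acquiring sensor
    pollutant_type: str      # e.g., "PM10" (hypothetical value)
    unit_of_measure: str     # e.g., "ug/m3" (hypothetical value)

@dataclass
class Station:
    station_id: int
    longitude: float
    latitude: float
    sensor_ids: List[int]    # sensors managed by this station

@dataclass
class PollutionValue:
    sensor_id: int
    value: float
    timestamp: str           # acquisition time
    valid: bool              # label assigned manually by a human operator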
5. Reference Scenario
• The goal is to design and deploy a Big Data pipeline to:
• predict the labels of acquired data in real time
• alert the operator when anomalous values are observed
6. Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
8. Declarative Model
• The pipeline includes two processing stages: a training stage and a prediction stage
• Our declarative model will include two requirement specifications, DS1 for the batch training stage and DS2 for the streaming prediction stage (serialized in the sketch below):
DS1:
DataPreparation.DataTransformation.Filtering;
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Training;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Batch.
DS2:
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Prediction;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Streaming.
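A minimal sketch, assuming a plain Python serialization, of how DS1 and DS2 might be carried around as machine-readable specifications; only the dotted feature paths come from the slides, the dictionary layout is illustrative:

# Hypothetical serialization of the two requirement specifications.
declarative_model = {
    "DS1": [  # batch training stage
        "DataPreparation.DataTransformation.Filtering",
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Training",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Batch",
    ],
    "DS2": [  # streaming prediction stage
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Prediction",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Streaming",
    ],
}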
9. Procedural Model
• Based on the declarative models, the TOREADOR service selection (SS) will return a set of services consistent with DS1 and DS2
• The user can easily compose these services to address the scenario's goals
[Diagram: DS1 → SS → service composition SC1; DS2 → SS → service composition SC2]
10. Procedural Model
• The two compositions must be connected, as the egestion (output) of SC1 is the ingestion (input) for SC2
[Diagram: DS1 → SS → SC1; DS2 → SS → SC2]
13. Deployment Model
• The TOREADOR compiler translates SC1 and SC2 into executable orchestrations (WC1 and WC2, connected 1-n) in a suitable workflow language
[Diagram: DS1 → SS → SC1 → WC1; DS2 → SS → SC2 → WC2]
The compiled workflow, shown for both WC1 and WC2:
spark-filter-sensorsTest : filter
  --expr="sensorsDF#SensorId === 5958"
  --inputPath="/user/root/sensors/joined.csv"
  --outputPath="/user/root/sensors_test.csv" &&
spark-assemblerTest : spark-assembler
  --features="Data,Quote"
  --inputPath="/user/root/sensors_test.csv"
  --outputPath="/user/root/sensors/sensors_test_assembled.csv" &&
spark-gbt-predict : batch-gradientboostedtree-classification-predict
  --inputPath=/user/root/sensors/sensors
  --outputPath=/user/root/sensors/sensors
  --model=/user/root/sensors/model
15. Deployment Model
• The execution of WC2 produces the results (E2)
[Diagram: DS1 → SS → SC2 → WC2 → E2]
17. The Code-based Line
Code Once/Deploy Everywhere
The TOREADOR code-based line user is an expert programmer, aware of the potential (flexibility and controllability) and purposes (analytics developed from scratch or migration of legacy code) of a code-based approach.
She expresses the parallel computation of a coded algorithm in terms of parallel primitives. TOREADOR distributes it among computational nodes hosted by different cloud environments.
The resulting computation can be saved as a service for the service-based line.
[Diagram: I. Code → II. Transform → III. Deploy, via the skeleton-based code compiler]
18. Code-based compiler
import math
import random

def data_parallel_region(distr, func, *repl):
    # Parallel primitive: apply func to every element of the
    # distributed collection (sequential stand-in shown here).
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes the Euclidean distance between two vectors"""
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))

def kmeans_init(data, k):
    """Returns the initial centroid configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns the given instance paired with the key of the
    nearest centroid (centroids are (key, vector) pairs)"""
    comparator = lambda c: distance(c[1], p)
    nearest = min(centroids, key=comparator)
    return (nearest[0], p)
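A short usage sketch (illustrative, not from the slides) showing how the data_parallel_region primitive drives the assignment step over a toy dataset:

# Toy dataset of 2-D points.
data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]

# Key the centroids so kmeans_assign can return (key, point) pairs.
centroids = list(enumerate(kmeans_init(data, 2)))

# The parallel primitive maps the assignment step over the dataset;
# TOREADOR would distribute this region across cloud-hosted nodes.
assignments = data_parallel_region(data, kmeans_assign, centroids)
print(assignments)   # e.g., [(0, [0.0, 0.0]), (0, [0.1, 0.2]), ...]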
[Diagram: the source code is matched against parallel skeletons (MapReduce, Bag of Tasks, Producer-Consumer, ...) and transformed into a skeleton instance plus secondary scripts]
19. Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
20. Want to give it a try? Stay tuned!
http://www.toreador-project.eu/community/
Or reach us at info@toreador-project.eu
23. Declarative Models: vocabulary
• The declarative model offers a vocabulary for a computation-independent description of a BDA (Big Data Analytics campaign)
• Organized into 5 areas:
• Representation (Data Model, Data Type, Management, Partitioning)
• Preparation (Data Reduction, Expansion, Cleaning, Anonymization)
• Analytics (Analytics Model, Task, Learning Approach, Expected Quality)
• Processing (Analysis Goal, Interaction, Performances)
• Visualization and Reporting (Goal, Interaction, Data Dimensionality)
• Each specification can be structured on three levels (see the sketch below):
• Goal: Indicator – Objective – Constraint
• Feature: Type – Sub-Type – Sub-Sub-Type
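A minimal sketch of the two specification shapes; the goal values are hypothetical, while the feature path also appears later in the deck:

# Goal: Indicator - Objective - Constraint (hypothetical values).
goal_spec = {
    "indicator": "ExecutionTime",
    "objective": "Minimize",
    "constraint": "<= 1 hour",
}

# Feature: Type - Sub-Type - Sub-Sub-Type, written as a dotted path.
feature_spec = "Data_Analytics.Analytics_Aim.Task.Crisp_Clustering"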
24. Declarative Models
• A web-based GUI for specifying the requirements of a BDA
• No coding required: suitable for basic users
• Analytics services are provided by the target TOREADOR platform
• Big Data campaign built by composing existing services
• Based on model transformations
25. Declarative Models
• A web-based GUI for specifying the requirements of a BDA, e.g.:
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
30. Methodology: Building Blocks
• Declarative Specifications allow customers to define the declarative models shaping a BDA and to retrieve a set of compatible services
• The Service Catalog specifies the set of abstract services (e.g., algorithms, mechanisms, or components) available to Big Data customers and consultants for building their BDA
• The Service Composition Repository supports the specification of the procedural model defining how services are composed to carry out the Big Data analytics, as an abstract Big Data service composition
• Deployment Configurations define the platform-dependent version of a procedural model, as a workflow ready to be executed on the target Big Data platform
31. Overview of the Methodology
[Diagram: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution; backed by the Declarative Specifications, Service Catalog, Service Composition Repository, and Deployment Configurations of the MBDAaaS platform, targeting the Big Data platform]
32. Procedural Models
• Platform-independent models that formally and unambiguously describe how analytics should be configured and executed
• Generated following the goals and constraints specified in the declarative models
• They provide a workflow in the form of a service orchestration built from the following constructs (sketched below):
• Sequence
• Choice
• If-then
• Do-While
• Split-Join
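A minimal sketch, assuming a simple tuple-based representation, of how such an orchestration could be written down; the two parallel branch services are hypothetical:

# Hypothetical nested representation of a procedural model:
# a Sequence whose last step is a Split-Join of two branches.
procedural_model = (
    "sequence", [
        ("service", "filter"),                # from the sample scenario
        ("service", "spark-assembler"),       # from the sample scenario
        ("split-join", [
            ("service", "train-regression"),  # hypothetical branch
            ("service", "compute-metrics"),   # hypothetical branch
        ]),
    ],
)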
33. Service Composition
• The user creates the flow based on the list of returned services
• Services are enriched with ad hoc parameters
• The flow is submitted to the service, which translates it into an OWL-S service composition
36. Service Composition
• All internals are made explicit
• Clear specification of the services
• Reuse and modularity
38. Overview of the Methodology
[Diagram repeated from slide 31: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution]
39. Workflow compiler
• It consists of two main sub-processes (sketched below):
• Structure generation: the compiler parses the procedural model and identifies the process operators (sequence, alternative, parallel, loop) composing it
• Service configuration: for each abstract service in the procedural model, the corresponding concrete service is identified and inserted in the deployment model
• Supports transformations to any orchestration engine available as a service
• Available for Oozie and Spring XD
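A minimal sketch of the two sub-processes; the node shapes, catalog, and output format are assumptions for illustration, not the actual TOREADOR compiler:

# Hypothetical catalog mapping abstract services to the concrete ones
# installed on the target platform (used by service configuration).
CATALOG = {
    "filter": "spark-filter",
    "spark-assembler": "spark-assembler",
    "train-regression": "batch-gbt-train",   # hypothetical
    "compute-metrics": "spark-metrics",      # hypothetical
}

def compile_node(node):
    kind = node[0]
    if kind == "service":
        # Service configuration: swap in the platform's concrete service.
        return ("job", CATALOG[node[1]])
    if kind == "sequence":
        # Structure generation: recognize the operator, recurse on children.
        return ("chain", [compile_node(c) for c in node[1]])
    if kind == "split-join":
        return ("fork-join", [compile_node(c) for c in node[1]])
    raise ValueError("unknown operator: " + kind)

# Reusing the tuple-based procedural model sketched earlier:
deployment_model = compile_node(
    ("sequence", [("service", "filter"),
                  ("split-join", [("service", "train-regression"),
                                  ("service", "compute-metrics")])]))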
40. Deployment Model
• The workflow compiler takes as input:
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms)
• It produces as output an executable workflow, for example an Oozie workflow (sketched below):
• the XML file of the workflow
• job.properties
• system variables
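For concreteness, a heavily simplified sketch (emitted here as Python strings) of what the generated Oozie artifacts could look like; the action, script, and property names are illustrative assumptions:

# Sketch of a generated Oozie workflow.xml with a single shell action.
workflow_xml = """
<workflow-app name="sensors-pipeline" xmlns="uri:oozie:workflow:0.5">
  <start to="filter"/>
  <action name="filter">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>spark-filter.sh</exec>
      <argument>--inputPath=${inputPath}</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>filter action failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

# Sketch of the accompanying job.properties.
job_properties = """
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
inputPath=/user/root/sensors/joined.csv
oozie.wf.application.path=${nameNode}/user/root/apps/sensors-pipeline
"""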
41. Translating the Composition Structure
• Deployment models:
• specify how procedural models are instantiated and configured on a target platform
• drive analytics execution in real scenarios
• are platform-dependent
• The workflow compiler transforms the procedural model into a deployment model that can be directly executed on the target platform
• This transformation is based on a compiler that takes as input the OWL-S service composition and information on the target platform (e.g., installed services/algorithms), and produces as output a technology-dependent workflow
42. Translating the Composition Structure
• The OWL-S service composition structure is mapped onto the different control constructs of the target engine, as sketched below
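One plausible mapping from OWL-S control constructs to Oozie workflow nodes; this particular table is an illustrative assumption, not TOREADOR's published mapping:

# Hypothetical OWL-S construct -> Oozie node mapping.
OWLS_TO_OOZIE = {
    "Sequence":     "chain of <action> nodes linked via <ok to=.../>",
    "Split-Join":   "<fork>/<join> pair",
    "If-Then-Else": "<decision> node with <switch>/<case>",
    "Repeat-While": "unrolled or delegated to a sub-workflow "
                    "(Oozie workflows are acyclic)",
}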
43. Generating an Executable Workflow
• Workflows contain three distinct types of placeholders (see the sketch below):
• GREEN placeholders are SYSTEM variables defined in the Oozie properties
• RED placeholders are JOB variables defined in the job.properties file
• YELLOW placeholders are ARGUMENTS of the executable jobs on the Oozie server
• More on the demo…
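Without the colors, the three placeholder types can be sketched as follows; which variable falls into which class is an illustrative assumption:

# Hypothetical fragment of a generated action, one line per placeholder type.
action_fragment = """
<exec>${filterScript}</exec>                  <!-- GREEN: SYSTEM variable (Oozie properties) -->
<argument>--inputPath=${inputPath}</argument> <!-- RED: JOB variable (job.properties) -->
<argument>--expr=sensorsDF#SensorId === 5958</argument> <!-- YELLOW: job ARGUMENT -->
"""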