Designing Big Data Pipelines
Applying the TOREADOR Methodology
BDVA webinar
Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
Methodology again
Pipeline: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting artifacts: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations
Layers: TOREADOR Platform, Big Data Platform
Alternative entry points: TOREADOR code-based line, TOREADOR recipes
Methodology again
The same pipeline, with its stages abbreviated: DS (Declarative Specification) → SS (Service Selection) → SC (Service Composition) → WC (Workflow Compilation) → E (Execution)
Sample Scenario
• Infrastructure for pollution monitoring managed by Lombardia Informatica, an agency of the Lombardy region in Italy.
• A network of sensors acquires pollution data every day:
• sensors, describing a specific acquiring sensor (e.g., ID, pollutant type, unit of measure)
• data acquisition stations, managing a set of sensors and information regarding their position (e.g., longitude/latitude)
• pollution values, containing the values acquired by sensors, the timestamp, and the validation status. Each value is validated by a human operator who manually labels it as valid or invalid.
• The goal is to design and deploy a Big Data pipeline to:
• predict the labels of acquired data in real time
• alert the operator when anomalous values are observed
Reference Scenario
Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler to generate executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
Without the methodology…
Hand-built pipeline: Sensor Data → Queue (Kafka) → Compute predictive label (Spark) → Store (HBase) → Display/Query
• Draft the pipeline stages
• Identify the technology
• Develop the scripts
• Deploy
Slow, error-prone, difficult to reuse…
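Hand-coding that pipeline means writing glue like the following. This is a hedged PySpark sketch, not deck code: the topic name, broker address, and the commented scoring/HBase steps are assumptions (an HBase sink needs a separate connector, and the Kafka source needs the spark-sql-kafka package on the classpath).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pollution-monitor").getOrCreate()

# ingest sensor readings from a Kafka topic (topic and broker are assumed)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-data")
       .load())

# parsing, model scoring, and the HBase sink are sketched as comments,
# since they depend on the trained model and on an HBase connector:
# scored = model.transform(parse(raw))
# scored.writeStream.foreachBatch(write_to_hbase).start()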
• The pipeline includes two processing stages: a training stage and a prediction stage
• Our declarative model (DM) will therefore include two requirement specifications:
DS1: DataPreparation.DataTransformation.Filtering;
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Training;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Batch.
DS2: DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Prediction;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Streaming.
Declarative Model
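Read as data, the two specifications might look like this — a sketch in our own notation (plain Python dicts, not the TOREADOR TDM schema) — which is what Service Selection later matches against the Service Catalog:

DS1 = {  # batch training stage
    "DataPreparation.DataTransformation": "Filtering",
    "DataAnalytics.LearningApproach": "Supervised",
    "DataAnalytics.LearningStep": "Training",
    "DataAnalytics.AnalyticsAim": "Regression",
    "DataProcessing.AnalyticsGoal": "Batch",
}

DS2 = {  # streaming prediction stage
    "DataAnalytics.LearningApproach": "Supervised",
    "DataAnalytics.LearningStep": "Prediction",
    "DataAnalytics.AnalyticsAim": "Regression",
    "DataProcessing.AnalyticsGoal": "Streaming",
}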
• Based on the Declarative Models, the TOREADOR Service Selection (SS) returns a set of services consistent with DS1 and DS2
• The user can easily compose these services to address the scenario's goals
Procedural Model
DS1 → SS → SC1
DS2 → SS → SC2
• The two compositions must be connected, as the egestion (output) of SC1 is the ingestion (input) of SC2
Procedural Model
DS1 → SS → SC1
DS2 → SS → SC2
• The TOREADOR compiler translates SC1 and SC2 into executable orchestrations in a suitable workflow language
Deployment Model
DS1 → SS → SC1 → WC1
DS2 → SS → SC2 → WC2
WC1 (compiled workflow, excerpt):
spark-filter-sensorsTest : filter
  --expr="sensorsDF#SensorId === 5958"
  --inputPath="/user/root/sensors/joined.csv"
  --outputPath="/user/root/sensors_test.csv" &&
spark-assemblerTest : spark-assembler
  --features="Data,Quote"
  --inputPath="/user/root/sensors_test.csv"
  --outputPath="/user/root/sensors/sensors_test_assembled.csv" &&
spark-gbt-predict : batch-gradientboostedtree-classification-predict
  --inputPath=/user/root/sensors/sensors
  --outputPath=/user/root/sensors/sensors
  --model=/user/root/sensors/model
WC2: the slide repeats an analogous compiled listing for the second composition (each workflow chains 1-n services).
Deployment
• The execution of WC2 produces the results
Deployment Model
DS2 → SS → SC2 → WC2 → E2
The Code-based Line
Code Once/Deploy Everywhere
The TOREADOR code-based line user is an expert programmer, aware of the potential (flexibility and controllability) and purposes (analytics developed from scratch, or migration of legacy code) of a code-based approach.
She expresses the parallel computation of a coded algorithm in terms of parallel primitives.
TOREADOR distributes it among computational nodes hosted by different Cloud environments.
The resulting computation can be saved as a service for the Service-based line.
I. Code → II. Transform → III. Deploy
Skeleton-Based Code Compiler
Code-based compiler
import math
import random

def data_parallel_region(distr, func, *repl):
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes the Euclidean distance between two vectors"""
    return math.sqrt(sum([(x[1] - x[0]) ** 2 for x in zip(a, b)]))

def kmeans_init(data, k):
    """Returns the initial centroids configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns the given instance paired with the key of the nearest centroid"""
    comparator = lambda x: distance(x[1], p)
    # the slide truncates here; a plausible completion (our assumption):
    return min(enumerate(centroids), key=comparator)[0], p
Source Code → Skeleton + Secondary Scripts
Parallel patterns: MapReduce, Bag of Tasks, Producer Consumer, …
(The slide shows the source code above instantiated into one skeleton variant per parallel pattern.)
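For illustration only, a minimal driver continuing the Python source code above (the sample points and k are invented): it shows how one k-means assignment step is expressed through the data-parallel primitive, which the skeletons then map onto a pattern such as MapReduce.

points = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]
centroids = kmeans_init(points, 2)

# "map" step: pair every point with the key of its nearest centroid;
# data_parallel_region is the hook the skeleton compiler parallelizes
assignments = data_parallel_region(points, kmeans_assign, centroids)
print(assignments)  # e.g. [(0, [1.0, 2.0]), (0, [1.5, 1.8]), (1, ...), ...]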
Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services for both batch and stream computations
• Platform independence
Use a smart compiler to generate executable computations for different platforms
• End-to-end verifiability
Include an end-to-end procedure for checking the consistency of model specifications
• Model reuse and refinement
Store declarative, procedural, and deployment models as templates to replicate or extend designs
Or reach us at info@toreador-project.eu
2017
Want to give it a try? Stay tuned!
http://www.toreador-project.eu/community/
Thank you
Declarative Model Definition
Declarative Models: vocabulary
• The declarative model offers a vocabulary for a computation-independent description of a BDA
• Organized in 5 areas:
• Representation (Data Model, Data Type, Management, Partitioning)
• Preparation (Data Reduction, Expansion, Cleaning, Anonymization)
• Analytics (Analytics Model, Task, Learning Approach, Expected Quality)
• Processing (Analysis Goal, Interaction, Performance)
• Visualization and Reporting (Goal, Interaction, Data Dimensionality)
• Each specification can be structured in three levels:
• Goal: Indicator – Objective – Constraint
• Feature: Type – Sub Type – Sub Sub Type
Declarative Models
• A web-based GUI for specifying the requirements of a BDA
• No coding, for basic users
• Analytics services are provided by the target TOREADOR platform
• Big Data campaign built by composing existing services
• Based on model transformations
Declarative Models
• A web-based GUI for specifying the requirements of a BDA
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
Declarative Models: machine readable
• A web-based GUI for specifying the requirements of a BDA
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
…
"tdm:label": "Data Representation",
"tdm:incorporates": [
  {
    "@type": "tdm:Feature",
    "tdm:label": "Data Source Model Type",
    "tdm:constraint": "{}",
    "tdm:incorporates": [
      {
        "@type": "tdm:Feature",
        "tdm:label": "Data Structure",
        "tdm:constraint": "{}",
        "tdm:visualisationType": "Option",
        "tdm:incorporates": [
          {
            "@type": "tdm:Feature",
            "tdm:constraint": "{}",
            "tdm:label": "Structured",
            "$$hashKey": "object:21"
          }
        ]
      },
....
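A short sketch (our own traversal, using only the field names visible above) of how such a machine-readable TDM fragment can be walked programmatically; the file name is hypothetical:

import json

def leaf_labels(feature):
    """Recursively collect the labels of leaf tdm:Feature nodes."""
    children = feature.get("tdm:incorporates", [])
    if not children:
        return [feature.get("tdm:label", "?")]
    labels = []
    for child in children:
        labels += leaf_labels(child)
    return labels

# "declarative_model.json" is a hypothetical file holding the fragment above
with open("declarative_model.json") as f:
    tdm = json.load(f)
print(leaf_labels(tdm))  # e.g. ['Structured', ...]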
Interference Declaration
• A few examples
Data_Preparation.Anonymization.Technique.k-anonymity
→ ¬ Data_Analytics.Analytics_Quality.False_Positive_Rate.low
Data_Preparation.Anonymization.Technique.hashing
→ ¬ Data_Analytics.Analytics_Aim.Task.Crisp_Clustering.algorithm=k-means
Data_Representation.Storage_Property.Coherence_Model.Strong_Consistency
→ ¬ Data_Representation.Storage_Property.Partitioning
• Interference Declarations
• Boolean Interference: P → ¬Q
• Intensity of an Interference: DP ∩ DQ
• Interference Enforcement
• Fuzzy interpretation: max(1−P, 1−Q)
Consistency Check
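A minimal sketch of our reading of this slide (not project code): a Boolean interference P → ¬Q and its fuzzy enforcement over selection degrees in [0, 1]:

def boolean_interference(p_selected, q_selected):
    """P -> not Q: violated only when both features are selected."""
    return (not p_selected) or (not q_selected)

def fuzzy_satisfaction(p, q):
    """Fuzzy interpretation from the slide: satisfaction = max(1-P, 1-Q)."""
    return max(1.0 - p, 1.0 - q)

# e.g. k-anonymity fully selected (1.0) while a low false-positive rate
# is strongly required (0.8): satisfaction 0.2 flags an inconsistency
print(fuzzy_satisfaction(1.0, 0.8))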
Service-Based Line
Methodology: Building Blocks
• Declarative Specifications allow customers to define declarative models shaping a BDA and retrieve a set of compatible services
• The Service Catalog specifies the set of abstract services (e.g., algorithms, mechanisms, or components) that are available to Big Data customers and consultants for building their BDA
• The Service Composition Repository permits specifying the procedural model defining how services can be composed to carry out the Big Data analytics
• Supports specification of an abstract Big Data service composition
• Deployment Configurations define the platform-dependent version of a procedural model, as a workflow that is ready to be executed on the target Big Data platform
Overview of the Methodology
Pipeline: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting artifacts: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations
Layers: MBDAaaS Platform, Big Data Platform
Procedural Models
• Platform-independent models that formally and unambiguously describe how analytics should be configured and executed
• They are generated following the goals and constraints specified in the declarative models
• They provide a workflow in the form of a service orchestration built from these constructs (see the sketch below):
• Sequence
• Choice
• If-then
• Do-While
• Split-Join
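For instance, a procedural model could be represented with hypothetical types like these (a sketch in our own notation, not the TOREADOR schema):

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Service:
    name: str

@dataclass
class Sequence:
    steps: List["Node"]

@dataclass
class SplitJoin:
    branches: List["Node"]

Node = Union[Service, Sequence, SplitJoin]

# SC1 as a plain sequence: filter, then assemble features, then train
sc1 = Sequence([Service("filter"),
                Service("spark-assembler"),
                Service("gbt-train")])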
• The user creates the flow based on the list of returned services
• Services are enriched with ad hoc parameters
• The flow is submitted to the service, which translates it into an OWL-S service composition
Service Composition
• All internals are made explicit
• Clear specification of the services
• Reuse and modularity
Service Composition
Deployment Model Definition
Overview of the Methodology
Pipeline: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting artifacts: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations
Layers: MBDAaaS Platform, Big Data Platform
• It consists of two main sub-processes:
• Structure generation: the compiler parses the procedural model and identifies the process operators (sequence, alternative, parallel, loop) composing it
• Service configuration: for each service in the procedural model, the corresponding concrete service is identified and inserted in the deployment model
• Supports transformations to any orchestration engine available as a service
• Available for Oozie and Spring XD
Workflow Compiler
• The workflow compiler takes as input:
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms)
• It produces as output an executable workflow, for example an Oozie workflow:
• XML file of the workflow
• job.properties
• System variables
Deployment Model
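To illustrate the structure-generation step, a simplified sketch (not the project compiler) that turns a Sequence of service names into an Oozie-style workflow skeleton; the action bodies are left as comments since they are platform-specific:

def compile_sequence(name, services):
    """Map a Sequence of service names onto a chain of Oozie actions."""
    xml = [f'<workflow-app name="{name}" xmlns="uri:oozie:workflow:0.4">',
           f'  <start to="{services[0]}"/>']
    for svc, nxt in zip(services, services[1:] + ["end"]):
        xml += [f'  <action name="{svc}">',
                '    <!-- platform-specific job definition goes here -->',
                f'    <ok to="{nxt}"/>',
                '    <error to="fail"/>',
                '  </action>']
    xml += ['  <kill name="fail"><message>Action failed</message></kill>',
            '  <end name="end"/>',
            '</workflow-app>']
    return "\n".join(xml)

print(compile_sequence("sc1", ["filter", "spark-assembler", "gbt-train"]))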
Translating the Composition Structure
• Deployment models:
• specify how procedural models are instantiated and configured on a target platform
• drive analytics execution in real scenarios
• are platform-dependent
• The workflow compiler transforms the procedural model into a deployment model that can be directly executed on the target platform
• This transformation is based on a compiler that takes as input:
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms)
• and produces as output a technology-dependent workflow
Translating the Composition Structure
• The OWL-S service composition structure is mapped onto different control constructs
• Workflows contain 3 types of distinct placeholders:
• GREEN placeholders are SYSTEM variables defined in the Oozie properties
• RED placeholders are JOB variables defined in the file job.properties
• YELLOW placeholders are ARGUMENTS of executable jobs on the Oozie server
• More in the demo… (see the sketch below)
Generating an Executable Workflow
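A toy sketch of the three placeholder classes being resolved; the variable names are invented for illustration:

from string import Template

action = Template("$nameNode/user/$jobUser/sensors --expr=$filterExpr")
system_vars = {"nameNode": "hdfs://namenode:8020"}  # GREEN: Oozie properties
job_vars = {"jobUser": "root"}                      # RED: job.properties
arguments = {"filterExpr": "SensorId===5958"}       # YELLOW: job arguments
print(action.substitute(**system_vars, **job_vars, **arguments))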
Analytics Deployment Approach