Introduction Active Data Discussion Conclusion
Active Data
A Data-Centric Approach to Data Life-Cycle Management
Anthony Simonet (1), Gilles Fedak (1),
Matei Ripeanu (2), Samer Al-Kiswany (2)
(1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia
November 18th, 2013
A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 1/20
Outline
Introduction
Data Life Cycle Management
Use-case
Requirements
Active Data
Active Data: principles & features
Example: Globus Online and iRODS
Discussion
Advantages
Limitations
Conclusion
Related works
Conclusion
Big Data
Science and Industry have become data-intensive
Volume of data produced by science and industry grows exponentially
How to store this deluge of data?
How to extract knowledge and sense?
How to make data valuable?
Some examples
CERN’s Large Hadron Collider: 1.5PB/week
Large Synoptic Survey Telescope, Chile: 30 TB/night
Billion edge social network graphs
Searching and mining the Web
Data Life Cycle
Creation/Acquisition
Transfer
Replication
Disposal/Archiving
Definition
The life cycle is the course of operational stages through which
data pass from the time when they enter a system to the time
when they leave it.
Data Life Cycle Management
Complicated scenarios
Execution of workflow
Complex interactions between software
Need to quickly react to operational events
Ad-hoc task-centric approaches
Hard to program, maintain and debug
No formal specification
Complicates interactions between systems
Data Life Cycle Use-case
Example: the Advanced Photon Source at Argonne National Lab
100TB of raw data per day
Raw data are preprocessed and registered in a Globus dataset
catalog
Data are analyzed by various applications
Results are stored in the dataset catalog and shared
Figure: workflow at the Advanced Photon Source. Components: Instrument (Beamline), Local Storage, Metadata Catalog, Remote Data Center, Academic Cluster. Operations: Transfer, Extract & Register Metadata, Analysis, More analysis, Upload result, Register result metadata.
Use-case
Task Centric
Independent scripts
Hard to program, maintain, verify
Coarse granularity
Vs
Data Centric
Express data dependencies
Cross data-center coordination
User-level fault-tolerance
Incremental processing
Requirements
Challenges: a perfect system should...
Simply represent the life cycle of data distributed across
different data centers and systems
Simplify modeling of and reasoning about data life cycle management (DLM)
Hide the complexity resulting from using different
infrastructures and systems
Be easy to integrate with existing systems
Active Data principles
System programmers expose their system’s internal data life cycle
with a model based on Petri Nets.
A life cycle model is made of Places and Transitions
Figure: Petri net with places Created, Written, Read and Terminated, and transitions t1 to t4; a token (•) marks Created.
Each token has a unique identifier, corresponding to the actual
data item’s.
Figure: the same life cycle model, with the token (•) now at Written.
A transition is fired whenever a data state changes.
public void handler() {
    computeMD5();
}
Clients may plug their own code into transitions.
It is executed whenever the transition is fired.
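The principles above (places, transitions, one token per data item, client code attached to transitions) can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the Active Data API: the class `LifeCycle`, its methods, the data identifier `file-42`, and the arc `t1: Created -> Written` are all hypothetical, since the slides do not say which places each transition connects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: LifeCycle and its methods are hypothetical names,
// not the Active Data API.
final class LifeCycle {
    // transition name -> {source place, destination place}
    private final Map<String, String[]> transitions = new HashMap<>();
    // transition name -> client code attached to it
    private final Map<String, List<Runnable>> handlers = new HashMap<>();
    // token (data item identifier) -> place it currently marks
    private final Map<String, String> tokens = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new String[] {from, to});
    }

    /** Puts a token for the given data item in a place. */
    void createToken(String dataId, String place) {
        tokens.put(dataId, place);
    }

    /** Attaches client code to a transition. */
    void onTransition(String name, Runnable handler) {
        handlers.computeIfAbsent(name, k -> new ArrayList<>()).add(handler);
    }

    /** Moves the item's token across the transition, then runs the handlers. */
    void fire(String dataId, String name) {
        String[] arc = transitions.get(name);
        if (arc == null) {
            throw new IllegalArgumentException("Unknown transition: " + name);
        }
        if (!arc[0].equals(tokens.get(dataId))) {
            throw new IllegalStateException(dataId + " is not in place " + arc[0]);
        }
        tokens.put(dataId, arc[1]);
        for (Runnable handler : handlers.getOrDefault(name, List.of())) {
            handler.run();
        }
    }

    String placeOf(String dataId) {
        return tokens.get(dataId);
    }

    public static void main(String[] args) {
        LifeCycle model = new LifeCycle();
        model.addTransition("t1", "Created", "Written"); // assumed arc
        model.onTransition("t1", () -> System.out.println("handler ran for t1"));
        model.createToken("file-42", "Created");
        model.fire("file-42", "t1");
        System.out.println(model.placeOf("file-42"));
    }
}
```

Rejecting a transition whose source place does not hold the token is one way to keep the model consistent with the system it mirrors; a real runtime might instead log the mismatch.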
Active Data features
The Active Data programming model and runtime environment:
Allows clients to react to life cycle progression
Transparently exposes distributed data sets
Can be integrated with existing systems
Has scalable performance and minimal overhead over
existing systems
Implementation
Prototype implemented in Java (2,800 LOC)
Client/Service communication is Publish/Subscribe
2 types of subscription:
Every transition for a given data item
Every data item for a given transition
Figure: clients subscribe to the Active Data Service.
Several ways to publish transitions
Instrument the code
Read the logs
Rely on an existing notification system
The service orders transitions by time of arrival
Figure: clients publish transitions to the Active Data Service; other clients subscribe to it.
Clients run transition handler code locally
Transition handlers are executed
Serially
In a blocking way
In the order transitions were published
Figure: clients publish transitions to the Active Data Service, which notifies the subscribed clients.
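The publish/subscribe behaviour described above (two subscription types; handlers run serially, in a blocking way, in publication order) can be sketched as follows. This is a single-process illustration under stated assumptions: `TransitionBus` and its methods are hypothetical names, and the sketch collapses the service and its clients into one JVM, whereas in the prototype clients run handler code locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative only: TransitionBus is a hypothetical name, not the Active Data API.
final class TransitionBus {
    // Handlers receive (data item identifier, transition name).
    private final Map<String, List<BiConsumer<String, String>>> byItem = new HashMap<>();
    private final Map<String, List<BiConsumer<String, String>>> byTransition = new HashMap<>();

    /** Subscription type 1: every transition for a given data item. */
    synchronized void subscribeToItem(String dataId, BiConsumer<String, String> handler) {
        byItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(handler);
    }

    /** Subscription type 2: every data item for a given transition. */
    synchronized void subscribeToTransition(String transition,
                                            BiConsumer<String, String> handler) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(handler);
    }

    /**
     * Publishes one transition. Because the method is synchronized, publications
     * are processed one at a time in arrival order, and each handler runs to
     * completion (blocking) before the next one starts.
     */
    synchronized void publish(String dataId, String transition) {
        for (BiConsumer<String, String> h : byItem.getOrDefault(dataId, List.of())) {
            h.accept(dataId, transition);
        }
        for (BiConsumer<String, String> h : byTransition.getOrDefault(transition, List.of())) {
            h.accept(dataId, transition);
        }
    }

    public static void main(String[] args) {
        TransitionBus bus = new TransitionBus();
        bus.subscribeToItem("d1", (id, t) -> System.out.println(id + " went through " + t));
        bus.subscribeToTransition("t2", (id, t) -> System.out.println(t + " fired for " + id));
        bus.publish("d1", "t1");
        bus.publish("d2", "t2");
    }
}
```

A handler that subscribes from inside its own callback would modify a list being iterated; the sketch ignores that case for brevity.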
Performance evaluation: Throughput
Figure: Average number of transitions per second handled by the Active Data Service (x-axis: number of clients, 10 to 550; y-axis: transitions per second, 5,000 to 35,000).
Clients publish 10,000 transitions in a row without pausing.
The prototype scales up to 30,000 transitions per second.
Example: Data Provenance
Definition
The complete history of data life cycle derivations and operations.
Assess the quality of data
Keep track of the origin of data over time
Specialized Provenance Aware Storage Systems
−→ What about heterogeneous systems?
Example with Globus Online and iRODS
Globus Online: file transfer service
iRODS: data store and metadata catalog
Example: Globus Online and iRODS
Data events coming from Globus Online and iRODS
Figure: two life cycle models. iRODS: places Created, Get, Put, Terminated; transitions t5 to t10. Globus Online: places Created, Succeeded, Failed, Terminated; transitions t1 to t4. A token (•) marks Created in the Globus Online model.
Id: {GO: 7b9e02c4-925d-11e2}
Figure: the same two models; the token (•) now appears at Succeeded in the Globus Online model. Handler attached in the Globus Online model:
public void handler() {
    iput(...);
}
Figure: the same two models; a token (•) now marks Created in the iRODS model.
Id: {GO: 7b9e02c4-925d-11e2,
iRODS: 10032}
Figure: the same two models; the token (•) is now at Put in the iRODS model. Handler attached in the iRODS model:
public void handler() {
    annotate();
}
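In the figures above, the token's identifier grows from {GO: ...} to {GO: ..., iRODS: ...} once the data item enters iRODS. One possible way to represent such an identifier is a map from system name to local identifier. The sketch below is an assumption for illustration only: `CompositeId` and its methods are hypothetical and are not taken from the Active Data prototype.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: CompositeId is a hypothetical name, not the Active Data API.
final class CompositeId {
    // system name -> identifier of the data item inside that system
    private final Map<String, String> ids = new LinkedHashMap<>();

    /** Records the identifier this data item has in one system. */
    CompositeId with(String system, String localId) {
        ids.put(system, localId);
        return this;
    }

    /** Returns the item's identifier in the given system, or null if unknown. */
    String in(String system) {
        return ids.get(system);
    }

    /** True when both identifiers share at least one (system, id) pair. */
    boolean sameItemAs(CompositeId other) {
        for (Map.Entry<String, String> e : ids.entrySet()) {
            if (e.getValue().equals(other.ids.get(e.getKey()))) {
                return true;
            }
        }
        return false;
    }

    @Override
    public String toString() {
        return "Id: " + ids;
    }

    public static void main(String[] args) {
        // Identifier values copied from the slides.
        CompositeId id = new CompositeId().with("GO", "7b9e02c4-925d-11e2");
        System.out.println(id);
        id.with("iRODS", "10032"); // added once the item exists in iRODS
        System.out.println(id);
    }
}
```

Matching on any shared (system, id) pair lets a handler on the iRODS side recognise an item it first saw as a Globus Online transfer.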
$ imeta ls -d test/out_test_4628
AVUs defined for dataObj test/out_test_4628:
attribute: GO_FAULTS
value: 0
----
attribute: GO_COMPLETION_TIME
value: 2013-03-21 19:28:41Z
----
attribute: GO_REQUEST_TIME
value: 2013-03-21 19:28:17Z
----
attribute: GO_TASK_ID
value: 7b9e02c4-925d-11e2-97ce-123139404f2e
----
attribute: GO_SOURCE
value: go#ep1/~/test
----
attribute: GO_DESTINATION
value: asimonet#fraise/~/out_test_4628
Advantages
Simple and graphical way to program DLM operations
Allows formal verification of some properties of data life cycles
Easy coordination between systems
Easy to scale
Easy to debug
Easy fault tolerance
Fine-grain interaction with data life cycle
Limitations
Reasoning in terms of life cycle events can be complex
Lack of a standard
Related work
Data-centric parallel processing
Programming models:
MapReduce and higher level abstractions: PigLatin, Twister
Incremental systems: MapReduce-Online, Percolator, Chimera,
Nephele
Other models with implicit parallelism: Swift, Dryad, Allpairs
Storage systems
BitDew
MosaStore
Provenance Aware Storage Systems
Active Storage
Conclusion
Active Data is...
Data-centric & Event-driven
System-level data integration
What’s next?
Advanced representation of operations that consume and
produce data: represent data derivation
Data collection abilities
Distributed implementation of the Publish/Subscribe layer
Thank you!
Questions?
Inria booth #2116
A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 20/20

Contenu connexe

Similaire à Active Data PDSW'13

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Gilles Fedak
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept Miha Ahronovitz
 
Ramin elahi fog_computing_ecosystem_final_dec22_updated
Ramin elahi fog_computing_ecosystem_final_dec22_updatedRamin elahi fog_computing_ecosystem_final_dec22_updated
Ramin elahi fog_computing_ecosystem_final_dec22_updatedHarshitParkar6677
 
Science Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related servicesScience Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related servicesriround
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Data in Motion - tech-intro-for-paris-hackathon
Data in Motion - tech-intro-for-paris-hackathonData in Motion - tech-intro-for-paris-hackathon
Data in Motion - tech-intro-for-paris-hackathonCisco DevNet
 
OSFair2017 Workshop | Brokering services facilitating interoperability and da...
OSFair2017 Workshop | Brokering services facilitating interoperability and da...OSFair2017 Workshop | Brokering services facilitating interoperability and da...
OSFair2017 Workshop | Brokering services facilitating interoperability and da...Open Science Fair
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for usebolu804
 
Discrete MFG IoT Factory of the Future
Discrete MFG IoT Factory of the FutureDiscrete MFG IoT Factory of the Future
Discrete MFG IoT Factory of the FutureMainstay
 
Get Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog ComputingGet Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog ComputingBiren Gandhi
 
Cari presentation maurice-tchoupe-joskelngoufo
Cari presentation maurice-tchoupe-joskelngoufoCari presentation maurice-tchoupe-joskelngoufo
Cari presentation maurice-tchoupe-joskelngoufoMokhtar SELLAMI
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Seema Shah
 
OGF Standards Overview - Globus World 2013
OGF Standards Overview - Globus World 2013OGF Standards Overview - Globus World 2013
OGF Standards Overview - Globus World 2013Alan Sill
 
MT85 Challenges at the Edge: Dell Edge Gateways
MT85 Challenges at the Edge: Dell Edge GatewaysMT85 Challenges at the Edge: Dell Edge Gateways
MT85 Challenges at the Edge: Dell Edge GatewaysDell EMC World
 
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)EUDAT
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Data Streaming in IoT and Big Data Analytics
Data Streaming in  IoT and Big Data AnalyticsData Streaming in  IoT and Big Data Analytics
Data Streaming in IoT and Big Data AnalyticsVincenzo Gulisano
 
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfMAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfGary Mazzaferro
 
High Performance Green Infrastructure, New Directions in Real-Time Control
High Performance Green Infrastructure, New Directions in Real-Time ControlHigh Performance Green Infrastructure, New Directions in Real-Time Control
High Performance Green Infrastructure, New Directions in Real-Time ControlMarcus Quigley
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An IntroductionDenodo
 

Similaire à Active Data PDSW'13 (20)

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastruc...
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
 
Ramin elahi fog_computing_ecosystem_final_dec22_updated
Ramin elahi fog_computing_ecosystem_final_dec22_updatedRamin elahi fog_computing_ecosystem_final_dec22_updated
Ramin elahi fog_computing_ecosystem_final_dec22_updated
 
Science Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related servicesScience Gateways: one portal, many e-Infrastructures and related services
Science Gateways: one portal, many e-Infrastructures and related services
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Data in Motion - tech-intro-for-paris-hackathon
Data in Motion - tech-intro-for-paris-hackathonData in Motion - tech-intro-for-paris-hackathon
Data in Motion - tech-intro-for-paris-hackathon
 
OSFair2017 Workshop | Brokering services facilitating interoperability and da...
OSFair2017 Workshop | Brokering services facilitating interoperability and da...OSFair2017 Workshop | Brokering services facilitating interoperability and da...
OSFair2017 Workshop | Brokering services facilitating interoperability and da...
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for use
 
Discrete MFG IoT Factory of the Future
Discrete MFG IoT Factory of the FutureDiscrete MFG IoT Factory of the Future
Discrete MFG IoT Factory of the Future
 
Get Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog ComputingGet Cloud Resources to the IoT Edge with Fog Computing
Get Cloud Resources to the IoT Edge with Fog Computing
 
Cari presentation maurice-tchoupe-joskelngoufo
Cari presentation maurice-tchoupe-joskelngoufoCari presentation maurice-tchoupe-joskelngoufo
Cari presentation maurice-tchoupe-joskelngoufo
 
Gc vit sttp cc december 2013
Gc vit sttp cc december 2013Gc vit sttp cc december 2013
Gc vit sttp cc december 2013
 
OGF Standards Overview - Globus World 2013
OGF Standards Overview - Globus World 2013OGF Standards Overview - Globus World 2013
OGF Standards Overview - Globus World 2013
 
MT85 Challenges at the Edge: Dell Edge Gateways
MT85 Challenges at the Edge: Dell Edge GatewaysMT85 Challenges at the Edge: Dell Edge Gateways
MT85 Challenges at the Edge: Dell Edge Gateways
 
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Data Streaming in IoT and Big Data Analytics
Data Streaming in  IoT and Big Data AnalyticsData Streaming in  IoT and Big Data Analytics
Data Streaming in IoT and Big Data Analytics
 
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdfMAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
MAZZ -Bob Towards BIG DATA-RA-AlloyCloud-NIST_BD.pdf
 
High Performance Green Infrastructure, New Directions in Real-Time Control
High Performance Green Infrastructure, New Directions in Real-Time ControlHigh Performance Green Infrastructure, New Directions in Real-Time Control
High Performance Green Infrastructure, New Directions in Real-Time Control
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 

Plus de Gilles Fedak

Which Computing Infrastructure for the Decentralized World ?
Which Computing Infrastructure for the Decentralized World ?Which Computing Infrastructure for the Decentralized World ?
Which Computing Infrastructure for the Decentralized World ?Gilles Fedak
 
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...Gilles Fedak
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesGilles Fedak
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingGilles Fedak
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetGilles Fedak
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsGilles Fedak
 

Plus de Gilles Fedak (6)

Which Computing Infrastructure for the Decentralized World ?
Which Computing Infrastructure for the Decentralized World ?Which Computing Infrastructure for the Decentralized World ?
Which Computing Infrastructure for the Decentralized World ?
 
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...
Devcon3 : iExec Allowing Scalable, Efficient, and Virtualized Off-chain Execu...
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud Computing
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the Internet
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, Optimizations
 

Dernier

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Dernier (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost PC performance: How more available memory can improve productivity

Active Data PDSW'13

  • 1. Active Data: A Data-Centric Approach to Data Life-Cycle Management. Anthony Simonet (1), Gilles Fedak (1), Matei Ripeanu (2), Samer Al-Kiswany (2). (1) Inria, ENS Lyon, University of Lyon; (2) University of British Columbia. Presented by A. Simonet (Inria) at PDSW’13, November 18th, 2013. Sections: Introduction, Active Data, Discussion, Conclusion.
  • 2. Outline. Introduction: Data Life Cycle Management; Use-case; Requirements. Active Data: Active Data principles & features; Example: Globus Online and iRODS. Discussion: Advantages; Limitations. Conclusion: Related works; Conclusion.
  • 3. Big Data. Science and industry have become data-intensive. The volume of data produced by science and industry grows exponentially. How to store this deluge of data? How to extract knowledge and sense? How to make data valuable? Some examples: CERN’s Large Hadron Collider: 1.5 PB/week; Large Synoptic Survey Telescope, Chile: 30 TB/night; billion-edge social network graphs; searching and mining the Web.
  • 4. Data Life Cycle. Stages: Creation/Acquisition, Transfer, Replication, Disposal/Archiving. Definition: the life cycle is the course of operational stages through which data pass from the time when they enter a system to the time when they leave it.
  • 5. Data Life Cycle Management. Complicated scenarios: execution of workflows; complex interactions between software; need to quickly react to operational events. Ad-hoc task-centric approaches: hard to program, maintain and debug; no formal specification; complicate interactions between systems.
  • 6. Data Life Cycle Use-case. Example: the Advanced Photon Source at Argonne National Lab. 100 TB of raw data per day. Raw data are preprocessed and registered in a Globus dataset catalog. Data are analyzed by various applications. Results are stored in the dataset catalog and shared. [Diagram: Instrument (Beamline), Local Storage, Metadata Catalog, Remote Data Center, Academic Cluster; steps: Transfer, Extract & Register Metadata, Transfer, Analysis, More analysis, Upload result, Register result metadata.]
  • 7. Use-case: Task-Centric vs. Data-Centric. Independent scripts; express data dependencies; hard to program, maintain, verify; cross-data-center coordination; coarse granularity; user-level fault tolerance; incremental processing.
  • 8. Requirements. Challenges: a perfect system should simply represent the life cycle of data distributed across different data centers and systems; simplify data life-cycle management (DLM) modeling and reasoning; hide the complexity resulting from using different infrastructures and systems; be easy to integrate with existing systems.
  • 9. Active Data principles. System programmers expose their system’s internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions. [Petri net diagram: places Created, Written, Read, Terminated; transitions t1, t2, t3, t4; the token is in Created.] Each token has a unique identifier, corresponding to the actual data item’s identifier.
  • 10. Active Data principles. System programmers expose their system’s internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions. [Petri net diagram: places Created, Written, Read, Terminated; transitions t1, t2, t3, t4; the token is in Written.] A transition is fired whenever a data item’s state changes.
  • 11. Active Data principles. System programmers expose their system’s internal data life cycle with a model based on Petri nets. A life cycle model is made of places and transitions. [Petri net diagram: places Created, Written, Read, Terminated; transitions t1, t2, t3, t4; the token is in Written.] Handler: public void handler() { computeMD5(); } Clients may plug code into transitions; it is executed whenever the transition is fired.
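
The life cycle model described on the "Active Data principles" slides can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the prototype's API: the names LifeCycleSketch, Token, Transition, Handler, plug and fire are hypothetical, and the arcs (t1 from Created to Written, t2 from Written to Read, t3 from Read to Terminated) are assumed, since the transcript does not preserve the arcs of the diagram (t4 is left out for that reason).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class and method names are NOT the Active Data API.
class LifeCycleSketch {

    /** A token stands for one data item and carries that item's identifier. */
    static final class Token {
        final String dataId;
        String place;

        Token(String dataId, String place) {
            this.dataId = dataId;
            this.place = place;
        }
    }

    /** A transition moves a token from its input place to its output place. */
    static final class Transition {
        final String name;
        final String from;
        final String to;

        Transition(String name, String from, String to) {
            this.name = name;
            this.from = from;
            this.to = to;
        }
    }

    /** Client code plugged into a transition. */
    interface Handler {
        void handle(Token token, Transition transition);
    }

    private final Map<String, Transition> transitions = new HashMap<>();
    private final Map<String, List<Handler>> handlers = new HashMap<>();

    void addTransition(String name, String from, String to) {
        transitions.put(name, new Transition(name, from, to));
    }

    /** Attaches a handler that runs every time the named transition fires. */
    void plug(String transitionName, Handler handler) {
        handlers.computeIfAbsent(transitionName, k -> new ArrayList<>()).add(handler);
    }

    /** Fires a transition: checks the input place, moves the token, runs the handlers. */
    void fire(String transitionName, Token token) {
        Transition t = transitions.get(transitionName);
        if (t == null) {
            throw new IllegalArgumentException("unknown transition " + transitionName);
        }
        if (!t.from.equals(token.place)) {
            throw new IllegalStateException(token.dataId + " is not in place " + t.from);
        }
        token.place = t.to;
        for (Handler h : handlers.getOrDefault(transitionName, Collections.emptyList())) {
            h.handle(token, t);
        }
    }

    /** Assumed arc layout; the transcript does not preserve the diagram's arcs. */
    static LifeCycleSketch example() {
        LifeCycleSketch lc = new LifeCycleSketch();
        lc.addTransition("t1", "Created", "Written");
        lc.addTransition("t2", "Written", "Read");
        lc.addTransition("t3", "Read", "Terminated");
        return lc;
    }

    public static void main(String[] args) {
        LifeCycleSketch lc = example();
        // Stand-in for the slide's computeMD5() handler.
        lc.plug("t1", (token, t) ->
                System.out.println("would compute MD5 of " + token.dataId));
        Token token = new Token("data-42", "Created");
        lc.fire("t1", token);
        System.out.println(token.dataId + " is now in " + token.place);
    }
}
```

Rejecting a transition whose input place does not hold the token is what makes the model checkable: an out-of-order event surfaces as an error instead of silently corrupting the recorded state.
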
  • 12. Active Data features. The Active Data programming model and runtime environment: allows clients to react to life cycle progression; exposes distributed data sets transparently; can be integrated with existing systems; has scalable performance and minimum overhead over existing systems.
  • 13. Implementation. Prototype implemented in Java (2,800 LOC). Client/service communication is publish/subscribe. Two types of subscription: every transition for a given data item; every data item for a given transition. [Diagram: clients subscribe to the Active Data Service.]
  • 14. Implementation. Several ways to publish transitions: instrument the code; read the logs; rely on an existing notification system. The service orders transitions by time of arrival. [Diagram: clients publish transitions to, and subscribe to, the Active Data Service.]
  • 15. Implementation. Clients run transition handler code locally. Transition handlers are executed serially, in a blocking way, and in the order transitions were published. [Diagram: clients publish transitions and subscribe; the Active Data Service notifies subscribers.]
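
The publish/subscribe behaviour listed on the "Implementation" slides (two subscription types, ordering by time of arrival, serial and blocking handler execution) can be illustrated with a small in-process sketch. It is a simplification under stated assumptions: in the prototype, clients and service are separate and handlers run on the client side, whereas here everything lives in one JVM, and the names PubSubSketch, Event, subscribeToDataItem, subscribeToTransition and publish are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process sketch; NOT the Active Data Service API.
class PubSubSketch {

    /** One published transition for one data item, stamped in order of arrival. */
    static final class Event {
        final long seq;
        final String dataId;
        final String transition;

        Event(long seq, String dataId, String transition) {
            this.seq = seq;
            this.dataId = dataId;
            this.transition = transition;
        }
    }

    interface Handler {
        void handle(Event event);
    }

    private long nextSeq = 0;
    private final Map<String, List<Handler>> byDataItem = new HashMap<>();
    private final Map<String, List<Handler>> byTransition = new HashMap<>();

    /** Subscription type 1: every transition for a given data item. */
    synchronized void subscribeToDataItem(String dataId, Handler h) {
        byDataItem.computeIfAbsent(dataId, k -> new ArrayList<>()).add(h);
    }

    /** Subscription type 2: every data item for a given transition. */
    synchronized void subscribeToTransition(String transition, Handler h) {
        byTransition.computeIfAbsent(transition, k -> new ArrayList<>()).add(h);
    }

    /**
     * Publishes a transition. The method is synchronized, so events are
     * numbered in arrival order and handlers run one at a time; publish()
     * returns only after every matching handler has finished (blocking).
     */
    synchronized Event publish(String dataId, String transition) {
        Event e = new Event(nextSeq++, dataId, transition);
        for (Handler h : byDataItem.getOrDefault(dataId, Collections.emptyList())) {
            h.handle(e);
        }
        for (Handler h : byTransition.getOrDefault(transition, Collections.emptyList())) {
            h.handle(e);
        }
        return e;
    }

    public static void main(String[] args) {
        PubSubSketch service = new PubSubSketch();
        service.subscribeToDataItem("d1",
                e -> System.out.println("#" + e.seq + " d1 went through " + e.transition));
        service.subscribeToTransition("t2",
                e -> System.out.println("#" + e.seq + " t2 fired for " + e.dataId));
        service.publish("d1", "t1");
        service.publish("d2", "t2");
        service.publish("d1", "t2");
    }
}
```

A single lock is the simplest way to obtain the serial, in-order guarantee; it also means one slow handler delays every later notification, which is the trade-off of blocking execution.
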
  • 16. Performance evaluation: Throughput. [Figure: Average number of transitions per second handled by the Active Data Service. x-axis: # clients (10 to 550); y-axis: transitions per second (5,000 to 35,000).] Clients publish 10,000 transitions in a row without pausing.
  • 17. Performance evaluation: Throughput. [Figure: Average number of transitions per second handled by the Active Data Service. x-axis: # clients (10 to 550); y-axis: transitions per second (5,000 to 35,000).] The prototype scales up to 30,000 transitions per second.
  • 18. Example: Data Provenance. Definition: the complete history of data life cycle derivations and operations. Assess the quality of data. Keep track of the origin of data over time. Specialized Provenance Aware Storage Systems.
  • 19. Example: Data Provenance. Definition: the complete history of data life cycle derivations and operations. Assess the quality of data. Keep track of the origin of data over time. Specialized Provenance Aware Storage Systems. → What about heterogeneous systems?
  • 20. Example: Data Provenance. Definition: the complete history of data life cycle derivations and operations. Assess the quality of data. Keep track of the origin of data over time. Specialized Provenance Aware Storage Systems. → What about heterogeneous systems? Example with Globus Online (file transfer service) and iRODS (data store and metadata catalog).
  • 21. Example: Globus Online and iRODS. Data events coming from Globus Online and iRODS. [Diagram: iRODS life cycle with places Created, Get, Put, Terminated and transitions t5 to t10; Globus Online life cycle with places Created, Succeeded, Failed, Terminated and transitions t1 to t4. The token is in the Globus Online place Created.] Token id: {GO: 7b9e02c4-925d-11e2}
  • 22. Example: Globus Online and iRODS. Data events coming from Globus Online and iRODS. [Same diagram; the token is in the Globus Online place Succeeded.] Handler: public void handler() { iput(...); }
  • 23. Example: Globus Online and iRODS. Data events coming from Globus Online and iRODS. [Same diagram; the token is in the iRODS place Created.] Token id: {GO: 7b9e02c4-925d-11e2, iRODS: 10032}
  • 24. Example: Globus Online and iRODS. Data events coming from Globus Online and iRODS. [Same diagram; the token is in the iRODS place Put.] Handler: public void handler() { annotate(); }
  • 25. Example: Globus Online and iRODS. Resulting metadata as listed by imeta:
      $ imeta ls -d test/out_test_4628
      AVUs defined for dataObj test/out_test_4628:
      attribute: GO_FAULTS
      value: 0
      ----
      attribute: GO_COMPLETION_TIME
      value: 2013-03-21 19:28:41Z
      ----
      attribute: GO_REQUEST_TIME
      value: 2013-03-21 19:28:17Z
      ----
      attribute: GO_TASK_ID
      value: 7b9e02c4-925d-11e2-97ce-123139404f2e
      ----
      attribute: GO_SOURCE
      value: go#ep1/~/test
      ----
      attribute: GO_DESTINATION
      value: asimonet#fraise/~/out_test_4628
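
The hand-off between the two life cycles in this example can be sketched as follows. This is a hedged illustration only: it calls neither Globus Online nor iRODS, and the names CompositionSketch, Token, newTransferToken, transferSucceeded, storedInIrods and annotate are hypothetical. The only details taken from the slides are the place names, the composite token identifier, and the GO_TASK_ID attribute shown in the imeta listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; it does not call Globus Online or iRODS.
class CompositionSketch {

    /** A token whose identifier maps each system to its own id for the data item. */
    static final class Token {
        final Map<String, String> ids = new LinkedHashMap<>();
        String system;
        String place;
    }

    /** A new transfer starts in the Globus Online place Created. */
    static Token newTransferToken(String globusTaskId) {
        Token t = new Token();
        t.ids.put("GO", globusTaskId);
        t.system = "Globus Online";
        t.place = "Created";
        return t;
    }

    /** The transfer completed: the token moves to the Globus Online place Succeeded. */
    static void transferSucceeded(Token t) {
        requirePlace(t, "Globus Online", "Created");
        t.place = "Succeeded";
    }

    /**
     * Stand-in for the handler that runs iput(...): once the file is stored,
     * the token gains an iRODS id and enters the iRODS place Created.
     * The irodsId argument would come from iRODS; here the caller supplies it.
     */
    static void storedInIrods(Token t, String irodsId) {
        requirePlace(t, "Globus Online", "Succeeded");
        t.ids.put("iRODS", irodsId);
        t.system = "iRODS";
        t.place = "Created";
    }

    /** Stand-in for annotate(): builds the metadata to attach to the iRODS object. */
    static Map<String, String> annotate(Token t) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("GO_TASK_ID", t.ids.get("GO"));
        return attributes;
    }

    private static void requirePlace(Token t, String system, String place) {
        if (!system.equals(t.system) || !place.equals(t.place)) {
            throw new IllegalStateException("token is in " + t.system + "/" + t.place);
        }
    }

    public static void main(String[] args) {
        Token token = newTransferToken("7b9e02c4-925d-11e2");
        transferSucceeded(token);
        storedInIrods(token, "10032");
        System.out.println("token id: " + token.ids);
        System.out.println("metadata: " + annotate(token));
    }
}
```

Keeping both identifiers on one token is what lets a handler attached to an iRODS transition look up facts about the earlier Globus Online transfer, which is the provenance link the example is after.
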
  • 26. Advantages. Simple and graphical way to program DLM operations. Allows formal verification of some properties of data life cycles. Easy coordination between systems. Easy to scale. Easy to debug. Easy fault tolerance. Fine-grained interaction with the data life cycle.
  • 27. Limitations. Reasoning in terms of life cycle events is complex. Lack of a standard.
  • 28. Related works. Data-centric parallel processing: programming models (MapReduce and higher-level abstractions: PigLatin, Twister); incremental systems (MapReduce-Online, Percolator, Chimera, Nephele); other models with implicit parallelism (Swift, Dryad, Allpairs). Storage systems: BitDew; MosaStore; Provenance Aware Storage Systems; Active Storage.
  • 29. Conclusion. Active Data is: data-centric & event-driven; system-level data integration. What’s next? Advanced representation of operations that consume and produce data (represent data derivation). Data collection abilities. Distributed implementation of the publish/subscribe layer.
  • 30. Thank you! Questions? Inria booth #2116.