Presentation held at Euro-Par 2013, Aachen, Germany
Abstract. Controlling the granularity of workflow activities executed on widely distributed computing platforms such as grids is required to reduce the impact of task queuing and data transfer time. Most existing granularity control approaches assume extensive knowledge about the applications and resources (e.g. task duration on each resource), and that both the workload and available resources do not change over time. We propose a granularity control algorithm for platforms where such clairvoyant and offline conditions are not realistic. Our method groups tasks when the fineness degree of the application, which takes into account the ratio of shared data and the queuing/round-trip time ratio, becomes higher than a threshold determined from execution traces. The algorithm also de-groups task groups when new resources arrive. The application's behavior is constantly monitored so that the characteristics useful for the optimization are progressively discovered. Experimental results, obtained with 3 workflow activities deployed on the European Grid Infrastructure, show that (i) the grouping process yields speed-ups of about 2.5 when the amount of available resources is constant and that (ii) the use of de-grouping yields speed-ups of 2 when resources progressively appear.
More information: www.rafaelsilva.com
On-line, non-clairvoyant optimization of workflow activity granularity on grids
1. On-line, Non-Clairvoyant Optimization of Workflow Activity Granularity on Grids
Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
Rafael FERREIRA DA SILVA, Tristan GLATARD
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Frédéric DESPREZ
INRIA, University of Lyon, LIP, ENS Lyon
Lyon, France
Euro-Par 2013
August 26-30, 2013
2. Outline
Context
The Virtual Imaging Platform
Problem definition
Task granularity
Self-healing of workflow executions on grids
Task granularity control process
Experiments and results
Conclusion
4. Context
Virtual Imaging Platform (VIP)
Medical imaging science-gateway
Grid of ~180 sites (EGI – http://www.egi.eu)
Significant usage
452 registered users from 50 countries
Consumed 472 CPU years from August 2012 to July 2013
http://dirac.france-grilles.fr
[Figure: VIP consumption since August 2012]
5. Workflow Execution
1. Input data upload
2. User launches a simulation
3. MOTEUR generates invocations
4. GASW generates grid jobs
5. Jobs are submitted to DIRAC
6. Pilot jobs are submitted to EGI
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results
6. Task Granularity
Low performance of lightweight (a.k.a. fine-grained) tasks:
High queuing times
Communication overhead
[Figure: timeline of tasks t1 to t5 on resources R1 to R3, ungrouped and grouped; lightweight task executions are delayed]
Grouping into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time
7. Workflow Self-Healing
Problem: costly manual operations
Rescheduling tasks, restarting services or replicating data files
In this work: task granularity in distributed workflows
Objective: automated platform administration
Autonomous detection of fine-grained tasks
Perform appropriate set of actions
Assumptions: online and non-clairvoyant
Only partial information available
Decisions must be fast
Production conditions: no prediction of user activity or workload
8. General MAPE-K loop
[Figure: MAPE-K loop linking Monitoring, Analysis, Planning, Execution and Knowledge; monitoring is triggered by an event (job completion and failures) or a timeout and produces monitoring data]
Analysis: incident degrees are mapped to levels 1 to 3
Incident 1: degree η = 0.8
Incident 2: degree η = 0.4
Incident 3: degree η = 0.1
Planning: roulette wheel selection, incident i is drawn with probability
\[ \frac{\eta_i}{\sum_{j=1}^{n} \eta_j} \]
Incident 1 selected
Association rules for incident 1:
Rule     Confidence (ρ)   ρ × η
2 → 1    0.8              0.32
3 → 1    0.2              0.02
1 → 1    1.0              0.80
Roulette wheel selection based on association rules: incident 2 selected
Execution: set of actions (x2)
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents
on distributed computing infrastructures, Future Generation Computer Systems
(FGCS), in press, 2013.
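The roulette wheel selection on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `selection_probabilities` and `select_incident` are hypothetical, and the degrees are the example values shown on the slide (0.8, 0.4 and 0.1).

```python
import random


def selection_probabilities(degrees):
    """Map each incident to eta_i / sum_j eta_j (its share of the wheel)."""
    total = sum(degrees.values())
    if total <= 0:
        raise ValueError("at least one incident degree must be positive")
    return {incident: eta / total for incident, eta in degrees.items()}


def select_incident(degrees, rng=random):
    """Draw one incident with probability proportional to its degree."""
    probabilities = selection_probabilities(degrees)
    incidents = list(probabilities)
    weights = [probabilities[incident] for incident in incidents]
    return rng.choices(incidents, weights=weights, k=1)[0]


# Example degrees from the slide for incidents 1, 2 and 3.
degrees = {1: 0.8, 2: 0.4, 3: 0.1}
probabilities = selection_probabilities(degrees)

# A seeded generator is used here only to make the draw reproducible.
selected = select_incident(degrees, random.Random(0))
```

With these degrees, incident 1 occupies 0.8 / 1.3 of the wheel, so it is the most likely selection but not a guaranteed one.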
9. Incident Levels and Actions
Incident degrees are quantized into discrete incident levels
Thresholds are determined from visual mode clustering or K-means
[Figure: incident levels separated by thresholds; below the threshold no actions are triggered, above it a set of actions is triggered]
Thresholds cluster platform configurations into groups
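Mapping an incident degree to a discrete level can be sketched as a threshold lookup. The function names and the threshold values below are hypothetical (the slides derive thresholds from platform traces), and treating a degree equal to a threshold as belonging to the lower level is an assumption taken from the η ≤ τ convention used for Level 1 later in the deck.

```python
from bisect import bisect_left


def incident_level(degree, thresholds):
    """Return the 1-based level of `degree` for the given thresholds.

    A degree equal to a threshold stays in the lower level.
    """
    return bisect_left(sorted(thresholds), degree) + 1


def should_trigger_actions(degree, thresholds):
    """Level 1 triggers nothing; any higher level triggers a set of actions."""
    return incident_level(degree, thresholds) > 1


# Hypothetical thresholds, for illustration only.
thresholds = [0.3, 0.6]
levels = [incident_level(eta, thresholds) for eta in (0.1, 0.3, 0.4, 0.8)]
```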
11. Fineness control: degree
Incident degree:
\[ \eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \} \]
\[ d_i = \frac{\tilde{t}_{shared}}{\tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})} \]
\[ r_i = \frac{\max_{j \in [1,n_i]} q_j}{\max_{j \in [1,n_i]} q_j + \tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})} \]
[Figure: task execution timeline split into queued time, shared input data, other input data and application execution, labelled with q_j, \tilde{t}_{shared} and \tilde{t}]
\tilde{t}, \tilde{t}_{shared}: median task phase durations
i = waiting task
n = number of waiting tasks
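The fineness degree defined on this slide (η_f as the maximum of f_i = d_i · r_i) can be sketched as below. This is an illustrative reading of the formulas, not the authors' code: the function names, the `(queue_times, n_i)` input layout and the numeric values are all assumptions.

```python
def item_fineness(queue_times, n_i, t_median, t_shared_median):
    """Compute f_i = d_i * r_i for one waiting item.

    queue_times: queuing times q_j observed for the item
    n_i: the n_i term of the slide's formulas
    t_median, t_shared_median: median durations (t~ and t~_shared)
    """
    # Denominator term shared by d_i and r_i.
    work = t_shared_median + n_i * (t_median - t_shared_median)
    d_i = t_shared_median / work          # share of time spent on shared input data
    q_max = max(queue_times)
    r_i = q_max / (q_max + work)          # queuing time relative to round-trip time
    return d_i * r_i


def fineness_degree(waiting, t_median, t_shared_median):
    """eta_f = max_i f_i over waiting items given as (queue_times, n_i)."""
    return max(
        item_fineness(queue_times, n_i, t_median, t_shared_median)
        for queue_times, n_i in waiting
    )


# Hypothetical values in seconds, for illustration only.
waiting = [([500.0], 1), ([900.0, 300.0], 2)]
eta_f = fineness_degree(waiting, t_median=500.0, t_shared_median=100.0)
```

In this example the first item gives d = 0.2 and r = 0.5, so f = 0.1, while the second gives f = 1/18; the degree is therefore driven by the finer item.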
12. Fineness control: task estimation
Estimation of task durations
Job phases: setup, inputs download, execution, outputs upload
Assumption: bag of tasks (all jobs have equal durations)
Median-based estimation:
Phase             Median duration   Real job duration   Estimated job duration
setup             50s               42s (completed)     42s
inputs download   250s              300s (completed)    300s
execution         400s              20s (current)       400s*
outputs upload    15s               ?                   15s
Total             \tilde{t} = 715s                      \tilde{t}_i = 757s
*: max(400s, 20s) = 400s
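The median-based estimation on this slide can be reproduced with a few lines of Python. This is a sketch of the rule as read from the example (real durations for completed phases, the larger of median and elapsed time for the current phase, medians for the remaining phases); the function name and argument layout are hypothetical.

```python
def estimate_job_duration(median_phases, completed, current_elapsed):
    """Estimate the total duration of a running job, phase by phase.

    median_phases: median duration of each phase, in execution order
    completed: real durations of the phases already completed
    current_elapsed: time spent so far in the current phase
    """
    estimate = list(completed)            # completed phases: real durations
    current = len(completed)
    if current < len(median_phases):
        # Current phase: it lasts at least as long as already observed.
        estimate.append(max(median_phases[current], current_elapsed))
        # Phases not started yet: fall back to the medians.
        estimate.extend(median_phases[current + 1:])
    return sum(estimate)


# Values from the slide: setup, inputs download, execution, outputs upload.
median_phases = [50, 250, 400, 15]
t_median = sum(median_phases)
t_i = estimate_job_duration(median_phases, completed=[42, 300], current_elapsed=20)
```

With the slide's values this gives 715s for the median job and 42 + 300 + max(400, 20) + 15 = 757s for the running job.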
13. Fineness control: levels and actions
Levels: identified from the platform logs
Level 1 (no actions): $\eta_f \le \tau_f$
Level 2 (action: task grouping): $\eta_f > \tau_f$
Actions
Task grouping
Grouped pairwise until $\eta_f \le \tau_f$ or the number of waiting groups Q is smaller than or equal to the number of running groups R
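The grouping rule on this slide (group pairwise until the fineness degree drops to the threshold or Q ≤ R) can be sketched as a loop. Several details are assumptions made for illustration: the function name `group_pairwise`, the caller-supplied `degree_fn`, the choice of merging the two smallest groups first (the slide only says "pairwise"), and the toy degree function used in the example.

```python
def group_pairwise(waiting, running, tau_f, degree_fn):
    """Merge waiting groups two at a time.

    Stops when the degree is at most tau_f, or when the number of
    waiting groups Q is at most the number of running groups R.
    """
    groups = [list(group) for group in waiting]
    while len(groups) >= 2 and len(groups) > running and degree_fn(groups) > tau_f:
        groups.sort(key=len)              # assumption: merge the two smallest groups
        merged = groups[0] + groups[1]
        groups = groups[2:] + [merged]
    return groups


def toy_degree(groups):
    """Toy stand-in for the fineness degree: finer groups score higher."""
    return 1.0 / min(len(group) for group in groups)


# Six single-task groups waiting, two groups running, hypothetical threshold.
waiting = [[1], [2], [3], [4], [5], [6]]
grouped = group_pairwise(waiting, running=2, tau_f=0.6, degree_fn=toy_degree)
```

In a real setting `degree_fn` would recompute the fineness degree from monitoring data after each merge, which is why it is passed in rather than hard-coded.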
14. Coarseness control
Incident degree:
\[ \eta_c = \frac{R}{Q + R} \]
Levels:
\[ \tau_c = 0.5 \]
[Figure: timelines on resources R1 to R3 showing tasks t1 to t5 ("Tasks at t1") and the groups t1, t2+t3, t4+t5 ("Grouped tasks at t2"); under non-stationary load, grouping causes a loss of parallelism, which is addressed by task de-grouping]
De-group tasks when R > Q
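The coarseness test on this slide reduces to a one-line ratio: with τ_c = 0.5, the condition η_c > τ_c is equivalent to R > Q. The sketch below uses hypothetical function names, and returning 0.0 when there are no groups at all is an assumption made to avoid a division by zero.

```python
TAU_C = 0.5  # threshold given on the slide


def coarseness_degree(running, waiting):
    """eta_c = R / (Q + R), with R running groups and Q waiting groups."""
    total = running + waiting
    return running / total if total else 0.0


def should_degroup(running, waiting, tau_c=TAU_C):
    """De-group when eta_c exceeds tau_c, i.e. when R > Q for tau_c = 0.5."""
    return coarseness_degree(running, waiting) > tau_c


decisions = [should_degroup(r, q) for r, q in ((3, 1), (2, 2), (1, 3))]
```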
15. Workload for Case Studies
Based on the workload of VIP, January 2011 to April 2012
112 users, 2,941 workflow executions, 339,545 pilot jobs
680,988 tasks:
338,989 completed
138,480 error
105,488 aborted
15,576 aborted replicas
48,293 stalled
34,162 queued
Case studies on:
Pilot jobs
User accounting
Task analysis
Bag of tasks
Workflows
R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
17. Experiment Conditions
Experiment 1
Evaluate the fineness control process under stationary load
Experiment 2
Evaluate the de-grouping control process under non-stationary load
Workflow characteristics
18. Results: stationary load
Fineness control yields a significant makespan reduction for all repetitions
19. Results: stationary load (2)
Task grouping speeds up SimuBloch and FIELD-II by up to a factor of 2.6, and PET-SORTEO/emission by up to a factor of 2.5
Not able to group all SimuBloch tasks in a single group because 2 tasks must be completed for the task estimation process
20. Results: non-stationary load
[Figure: two panels, "Resources appear progressively" and "Resources appear suddenly"]
Speeds up executions by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness
Fineness is penalized by its lack of adaptation: slowdown of 20%
21. Results: non-stationary load (2)
The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates a strong positive correlation
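For readers who want to reproduce this kind of check, a linear (Pearson) correlation coefficient can be computed as sketched below. The function name is hypothetical and the sample values are invented placeholders, not the measurements behind the 0.91 reported on the slide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Placeholder data in seconds, for illustration only.
average_queuing_times = [100.0, 200.0, 300.0, 400.0]
makespans = [1200.0, 1500.0, 2100.0, 2300.0]
r = pearson(average_queuing_times, makespans)
```

A value close to 1 means that runs with longer average queuing times also tend to have longer makespans.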
23. Concluding remarks
Context
Autonomous handling of task granularity in workflow executions
No strong assumptions on resource characteristics and workload
Summary of the proposed method
Implements a generic MAPE-K loop
Determines task fineness based on queue waiting time and estimated data transfer time of shared input data
Tasks are grouped pairwise as long as Q > R and tasks are too fine
Tasks are ungrouped when the number of available resources increases
Optimizing task granularity
Properly detects and handles lightweight tasks
Stationary load: fineness control significantly reduces the makespan of all applications
Non-stationary load: the de-grouping algorithm compensates for the lack of adaptation of task grouping
24. Thank you for your attention.
Questions?
Acknowledgments:
VIP users and project members
French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063)
EC FP7 Programme (312579 ER-flow)
European Grid Initiative (EGI)
France-Grilles