International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 1, Number 2, Sept – Oct (2010), pp. 85-96
© IAEME, http://www.iaeme.com/ijcet.html
ADAPTIVE LOAD BALANCING TECHNIQUES IN
GLOBAL SCALE GRID ENVIRONMENT
D. Asir
PG Scholar
Department of Computer Science and Engineering
Karunya University, Coimbatore
E-Mail: asird@karunya.edu.in
Shamila Ebenezer
Assistant Professor
Department of Computer Science and Engineering
Karunya University, Coimbatore
E-Mail: shamila_cse@karunya.edu
Daniel D.
PG Scholar
Department of Computer Science and Engineering
Karunya University, Coimbatore
E-Mail: Daniel_joen@yahoo.com
ABSTRACT
Data partitioning and load balancing are important components of parallel
computations. Many different partitioning strategies have been developed, with great
effectiveness in parallel applications. But the load-balancing problem is not yet solved
completely; new applications and architectures require new partitioning features.
Increased use of heterogeneous computing architectures requires partitioners that account
for non-uniform computing, network, and memory resources. This paper surveys
different adaptive techniques for partial differential systems that address the
load-balancing problem.
Index Terms: Dynamic load balancing; Performance characterization;
Adaptive mesh refinement.
I. INTRODUCTION
Adaptive load balancing operates smoothly and scales reliably when facing spikes
in data volumes or unexpected utilization loads on the grid. It also selects the best node
for session execution based on resource requirements and availability. We consider an
application-centric performance characterization of dynamic partitioning and load-balancing
techniques for the distributed adaptive grid hierarchies that underlie parallel adaptive mesh
refinement (AMR) techniques [1,14] for the solution of partial differential equations.
Early adaptive techniques of mesh motion (r-refinement) have been giving way to
methods that combine mesh refinement/coarsening (h-refinement) with order variation
(p-refinement) [3]. As advances in computer architecture enable the solution of complex
three-dimensional problems, the efficiency, reliability, and robustness provided by
adaptivity will make its use even more advantageous. Parallel computation will be
essential in these simulations. Processor load balancing must be dynamic, since frequent
adaptive enrichment will upset a balanced computation. Adaptive finite element
methods have workloads that are unpredictable or that change during the computation; such
applications require dynamic load balancers that adjust the decomposition as the
computation proceeds. Numerous strategies for static and dynamic load balancing have
been developed, including recursive bisection (RB) methods, space-filling curve (SFC)
partitioning, and graph partitioning, including multilevel and diffusive methods [7,10]. These
methods provide effective partitioning for many applications, perhaps suggesting that the
load-balancing problem is solved. Efficient parallel execution of these irregular grid
applications requires the partitioning of the associated graph into p parts with the
following two objectives: (i) each partition has an equal amount of total vertex weight;
(ii) the total weight of the edges cut by the partitions is minimized [2]. Simulation of
three dimensional flow with chemical reactions and plasma discharge in complex
geometries is one of the most resource demanding problems in computational science,
requiring both high performance and high-throughput computing. Grid computing
technologies opened up new opportunities to access virtually unlimited computational
resources, and inspired many researchers to develop new methodologies and algorithms
for parallel distributed applications on the Grid.
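The two partitioning objectives above can be made concrete with a short sketch; the graph representation and function name here are illustrative, not taken from any of the surveyed systems.

```python
def partition_quality(vertex_weights, edges, edge_weights, parts, p):
    """Evaluate a p-way partition: load balance and total edge cut.

    vertex_weights[v] -> weight of vertex v
    edges             -> list of (u, v) pairs
    edge_weights      -> weight of each edge (parallel to edges)
    parts[v]          -> partition index (0..p-1) assigned to vertex v
    """
    # Objective (i): each part should hold an equal share of vertex weight.
    loads = [0.0] * p
    for v, w in enumerate(vertex_weights):
        loads[parts[v]] += w
    avg = sum(loads) / p
    imbalance = max(loads) / avg  # 1.0 is a perfect balance

    # Objective (ii): minimize the weight of edges crossing partitions.
    edge_cut = sum(w for (u, v), w in zip(edges, edge_weights)
                   if parts[u] != parts[v])
    return imbalance, edge_cut
```

A partitioner searches for the assignment `parts` that keeps `imbalance` near 1.0 while minimizing `edge_cut`.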
II. ALB ALGORITHMS
A. Adaptive mesh-refinement algorithms (AMR)
1) Space-Filling Curves: Space-filling curves (SFC) [1] are a class of locality-preserving
mappings from d-dimensional space to 1-dimensional space. The self-similar or
recursive nature of these mappings can be exploited to represent a hierarchical structure and
to maintain locality across different levels of the hierarchy. The SFC representation of the
adaptive grid hierarchy is a 1-D ordered list of composite grid blocks, where each
composite block represents a block of the entire grid hierarchy and may contain more
than one grid level.
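As a concrete illustration of such a locality-preserving mapping, the following sketch computes a Morton (Z-order) index, one common SFC; the paper does not prescribe a particular curve, so this example is an assumption.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of (x, y) to get a 1-D Z-order (Morton) index.

    Cells that are close in 2-D space tend to be close on the resulting
    1-D curve, which is the locality property SFC partitioners exploit:
    the ordered list of cells is simply cut into p contiguous pieces.
    """
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)       # x bits on even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)   # y bits on odd positions
    return idx

# Ordering the cells of a 4x4 grid by Morton index visits them in
# recursive Z-shaped blocks, keeping spatial neighbours nearby.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: morton_index(*c))
```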
2) Independent Grid Distribution: Distributes the grids independently across the
processors. This distribution leads to balanced loads and no redistribution is required
when grids are created or deleted. In the adaptive grid hierarchy, a fine grid typically
corresponds to a small region of the underlying coarse grid. If both the fine and coarse
grids are distributed over the entire set of processors, all the processors will
communicate with the small set of processors corresponding to the associated coarse
grid region, causing a serialization bottleneck.
3) Combined Grid Distribution: Distributes the total work load in the grid hierarchy by
first forming a simple linear structure by abutting grids at a level and then decomposing
this structure into partitions of equal load. Regridding operations involving the creation
or deletion of a grid are extremely expensive, as they require an almost complete
redistribution of the grid hierarchy [4]. The combined grid decomposition does not
exploit the parallelism available within a level of the hierarchy.
4) Independent Level Distribution: Each level of the grid hierarchy is distributed by
partitioning the combined load of all component grids at the level among the
processors. This scheme overcomes some of the drawbacks of the independent grid
distribution. Parallelism within a level of the hierarchy is exploited. Although the inter-
grid communication bottleneck is reduced in this case, the required scatter
communications can be expensive. Creation or deletion of component grids at any level
requires a redistribution of the entire level.
5) Iterative Tree Balancing: A table is created from the grids at each time step, which
keeps pointers to neighboring and parent grids. For every grid, immediate neighbors and
children are also considered along with load distribution. Thus load balancing, inter-level
communication, and intra-level communication are addressed together. This
scheme is used for distributing finite-element meshes and is promising, as it deals with all
the constraints to some extent.
6) Weighted Distribution: First, a weight is assigned to each of the overheads above. This
weight defines the significance and contribution of the overhead to the overall application
performance. The next step uses these weights to compute the affinity of each
component grid to the different processors. Initially, grids have no affinity for any
processor.
B. Dynamic Load Balancing via Tiling
The tiling load-balancing system [3] is a modification of a global load-balancing
technique that is applicable to a wide class of two-dimensional, uniform-grid
applications. Global balance is achieved by performing local balancing within
overlapping processor neighborhoods, where each processor is defined to be the center of
a neighborhood. Local balance involves element migrations to processors in the same
neighborhood that have elements sharing edges. Such a tiling system is required by the
adaptive refinement algorithm: because elemental workloads may vary due to refinement,
the tiling algorithm must account for elemental workloads when performing local load
balancing.
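The neighborhood-centered local balancing described above might be sketched as follows; the data layout and the proportional-transfer rule are simplifying assumptions, and the actual selection and migration of elements is omitted.

```python
def tile_balance_step(loads, neighborhoods):
    """One local-balancing pass in the spirit of the tiling scheme.

    loads[i]         -> current workload of processor i
    neighborhoods[i] -> processors in the neighborhood centred on i
    Returns how much work each overloaded centre should shed to each
    lighter neighbour so that every neighbourhood approaches its local
    average; repeating such passes drives the system toward global balance.
    """
    transfers = {}
    for center, members in neighborhoods.items():
        local_avg = sum(loads[m] for m in members) / len(members)
        surplus = loads[center] - local_avg
        if surplus <= 0:
            continue  # centre is not overloaded within its own neighbourhood
        # Offer work to underloaded members, proportional to their deficit.
        deficits = {m: local_avg - loads[m] for m in members
                    if loads[m] < local_avg}
        total_deficit = sum(deficits.values())
        for m, d in deficits.items():
            transfers[(center, m)] = surplus * d / total_deficit
    return transfers
```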
C. Multi-criteria Geometric Partitioning
Crash simulations are "multiphase" applications consisting of two separate
phases: computation of forces and contact detection. Obtaining a single decomposition
that is good with respect to both phases would remove the need for communication
between phases. Each object would have multiple loads, corresponding to its workload in
each phase. The challenge would be computing a single decomposition that is balanced
with respect to all loads. Such a multi criteria partitioner could be used in other situations
as well, such as balancing both computational work and memory usage. Most geometric
partitioners reduce the partitioning problem [6] to a one-dimensional problem. Multi-criteria
load balancing can be formulated as either a multi-constraint or a multi-objective
problem. Often, the balance of each load is considered a constraint and has to satisfy a
certain tolerance. Such a formulation fits the standard form, where, in this case, there is
no objective, only constraints. Unfortunately, there is no guarantee that a solution exists
to this problem. In practice, we want a "best possible" decomposition [7], even if the
desired balance criteria cannot be satisfied. Thus, an alternative is to make the constraints
objectives; that is, we want to achieve as good a balance as possible with respect to all
loads.
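Treating each phase's balance as an objective, the quality of a candidate decomposition can be scored by its worst imbalance over all phases, as in this illustrative sketch (names and representation are assumed):

```python
def multi_load_imbalance(loads_per_phase, parts, p):
    """Worst-case imbalance of one decomposition over several phases.

    loads_per_phase[k][v] -> work of object v in phase k
    parts[v]              -> part assigned to object v
    A multi-objective partitioner would search for the single
    decomposition minimising this worst-over-all-phases value.
    """
    worst = 0.0
    for phase in loads_per_phase:
        part_load = [0.0] * p
        for v, w in enumerate(phase):
            part_load[parts[v]] += w
        avg = sum(part_load) / p
        worst = max(worst, max(part_load) / avg)
    return worst
```

With two phases of per-object loads, a decomposition balanced for one phase may be badly imbalanced for the other; the score exposes that.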
D. Repartitioning Algorithms Based on Multilevel Diffusion
The multilevel graph partitioning algorithm [2] implemented in METIS has three
phases: a coarsening phase, a partitioning phase, and a refinement phase. During the
coarsening phase, a sequence of smaller graphs is constructed from an input graph by
collapsing vertices together. When enough vertices have been collapsed together so that
the coarsest graph is sufficiently small, a k-way partition is found. Finally, the partition of
the coarsest graph is projected back to the original graph by refining it at each
uncoarsening level using a k-way partitioning refinement algorithm. In the coarsening
phase, only pairs of nodes that belong to the same partition are considered for merging.
Hence, the initial partition of the coarsest level graph is identical to the input partition of
the graph that is being repartitioned and thus does not need to be computed. This makes
the coarsening phase completely parallelizable, as coarsening is local to each processor.
The uncoarsening phase of MLD contains two subphases: multilevel diffusion and
multilevel refinement. In the multi-level diffusion phase, balance is sought on the
coarsest graph in a process similar to multilevel refinement. This is accomplished by
forcing the migration of vertices out of overbalanced partitions.
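One coarsening level by heavy-edge matching, with the same-partition restriction noted above, might look like the following sketch; details such as tie-breaking and visit order are simplifying assumptions, not METIS's actual implementation.

```python
def heavy_edge_coarsen(adj, vertex_weights, parts):
    """One level of graph coarsening by heavy-edge matching.

    adj[u] -> {v: edge_weight}; parts[u] -> current partition of u.
    Each unmatched vertex is matched with the unmatched neighbour in the
    SAME partition joined by the heaviest edge (the repartitioning
    restriction above), and the pair collapses into one coarse vertex
    whose weight is the sum of the pair's weights.
    """
    vmap = {}          # fine vertex -> coarse vertex
    coarse_id = 0
    for u in sorted(adj):
        if u in vmap:
            continue
        cands = [(w, v) for v, w in adj[u].items()
                 if v not in vmap and parts[v] == parts[u]]
        if cands:
            _, mate = max(cands)   # heaviest edge wins
            vmap[u] = vmap[mate] = coarse_id
        else:
            vmap[u] = coarse_id    # no legal match: vertex survives alone
        coarse_id += 1
    coarse_w = [0] * coarse_id
    for u, c in vmap.items():
        coarse_w[c] += vertex_weights[u]
    return vmap, coarse_w
```

Because matching is restricted to same-partition pairs, the coarsest graph inherits the input partition, which is why the initial partition need not be recomputed.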
Figure 2.1 Multilevel diffusion repartitioning
Multilevel diffusion repartitioning algorithms are made up of three phases: graph
coarsening, multilevel diffusion, and multilevel refinement. The coarsening phase results
in a series of contracted graphs. The multilevel diffusion phase balances the graph using
the very coarsest graphs. The multilevel refinement phase seeks to improve the edge-cut
disturbed by the balancing process. Optionally, the multilevel diffusion can be guided by
a diffusion solution. We will refer to our multilevel undirected diffusion repartitioning
algorithm as MLD and to our multilevel directed diffusion repartitioning algorithm as
MLDD. Single-level directed diffusion (SLDD) will be used to provide a comparison
with our multilevel diffusion schemes. In SLDD, diffusion and refinement are performed
only on the original input graph and thus, no graph contraction is performed.
E. SAMR (Structured Adaptive Mesh Refinement)
The adaptive characteristics of SAMR applications [14] are analyzed from four
aspects: granularity, dynamicity, load imbalance, and dispersion.
1) Granularity: The basic entity for data movement is a grid. Each grid consists of a
computational interior and a ghost zone. The computational interior is the region of
interest that has been refined from the immediately coarser level; the ghost zone is the
part added to the exterior of the computational interior in order to obtain boundary
information. For the computational interior, there is a requirement for the minimum
number of cells, which is equal to the refinement ratio to the power of the number of
dimensions.
2) Dynamicity: After each time-step of every level, the adaptation process is invoked
based on one or more refinement criteria defined at the beginning of the simulation.
The local regions satisfying the criteria will be refined. High frequency of adaptation
requires the underlying DLB method to execute very fast, as well as to maintain high
quality of load balancing.
3) Load Imbalance: The ideal balanced load is calculated. The standard deviation is
small compared to the average load, which means that the average load reflects
the entire load distribution.
4) Dispersion: A few processors see their loads increase dramatically while most
processors have little or no change. All the processors can be grouped into four
subgroups, each with similar characteristics, with the percentage of
refinement ranging from zero to 86%. These calculations indicate that different
datasets exhibit different load distributions, and the underlying DLB scheme should
provide high-quality load balancing for all of them. After taking into
consideration the adaptive characteristics of the SAMR application, an improved
DLB scheme was developed [14]. DLB is composed of two steps: a moving-grid phase
and a splitting-grid phase.
Moving Grid Phase:
Step 1: Initialize Moveflag and Splitflag to one, and Lastmin and Lastmax to zero.
Step 2: If the condition Maxload/Avgload > threshold holds, the load is imbalanced.
Step 3: Maxproc then moves a grid to Minproc (using global information), provided the
grid's load is no more than (threshold * Avgload - Minload).
Step 4: This phase continues until all grids residing on Maxproc are too large to be
moved.
Splitting Grid Phase:
Step 1: Maxproc finds Maxgrid, its largest grid.
Step 2: If the size of Maxgrid is no more than (Avgload - Minload), the grid is moved
from Maxproc to Minproc.
Step 3: Otherwise, Maxproc splits the grid into two smaller grids.
Step 4: The piece whose size is around (Avgload - Minload) is redistributed to Minproc.
F. Adaptive workload balancing (AWLB) on heterogeneous resources
One of the factors that determine the performance of parallel applications on
heterogeneous resources is the quality of the workload distribution, e.g. through
functional decomposition or domain decomposition. Optimal load distribution is
characterized by two things: (1) all processors have a workload proportional to their
computational capacity and (2) communications between the processors are minimized.
These goals are conflicting since the communication is minimized when all the workload
is processed by a single processor and no communication takes place, and distributing the
workload inevitably incurs communication overheads. Thus, it is necessary to find a
balance and define a metric [15] that characterizes the quality of workload distribution
for a parallel problem.
1. Benchmark the resources dynamically assigned to the parallel application; measure the
resource characteristics that constitute the set of resource parameters µ (available
processing power, memory and links bandwidth).
2. Estimate the range of possible values of the application parameter fc. The minimal
value is fmin = 0, which corresponds to the case when no communications occur
between the parallel processes of the application. The upper bound can be calculated
based on the following reasoning: for parallel processing to make sense, that is, to
ensure that running a parallel program on several processors is faster than sequential
execution, the calculation time should exceed the communication time. For homogeneous
resources this can be expressed as follows.
3. Search through the range of possible values of fc in [0 . . . fc max] to find the optimal
value fc* minimizing the application execution time. For each value of fc calculate the
corresponding load distribution based on the resource parameters µ. With this
distribution, perform one time step and measure the execution time, which is the target
optimization function. Selection of the next value of fc can be done by any optimization
method for unimodal smooth functions; for instance, a simple line-search method can be
used.
4. Execute further calculations using the discovered fc*.
5. In the case of dynamic resources whose performance is influenced by other factors
(which is generally the case on the Grid), a periodic re-estimation of the resource
parameters µ and load redistribution shall be performed during the run-time of the
application. Re-balancing shall be invoked if the application performance over the last
step drops by more than a certain user-defined threshold.
6. If the application is dynamically changing, then fc* must be periodically re-estimated
on the same set of resources.
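Steps 2-3 above amount to a one-dimensional search over fc; a minimal sampling-based sketch follows, with `run_step` standing in for a measured application time step (both names are assumptions, not AWLB's API).

```python
def awlb_tune(run_step, fc_max, samples=8):
    """Sketch of the AWLB parameter search (steps 2-3 above).

    run_step(fc) -> measured execution time of one application time step
    when the workload is distributed according to the ratio fc in
    [0, fc_max]. A simple sampling line search stands in for any
    optimisation method for unimodal smooth functions; the best fc*
    found would then be used for the remaining computation (step 4).
    """
    best_fc, best_t = 0.0, run_step(0.0)
    for i in range(1, samples + 1):
        fc = fc_max * i / samples
        t = run_step(fc)          # one timed step per candidate fc
        if t < best_t:
            best_fc, best_t = fc, t
    return best_fc
```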
G. The Path Algorithm
There are two steps to implement the PATH algorithm:
First Step: We use the simple single-packet algorithm (SMSP) to check the network
structure and to obtain the bottleneck link Lk. Compared with the standard single-packet
algorithm (SDSP) [12], the SMSP algorithm does not have to measure the bandwidth of each
link of the whole network.
Second Step: Use Packet Train with header probe to measure the bandwidth of
the link Lk. The source sends out a header packet H and a packet train T1, T2,… Tn.
Both the header and the packet train are UDP packets. All the packets Ti of the packet-
train are of the same size. Sh, the size of header packet H is much larger than St, the size
of Ti. Each packet Ti contains only 8 bytes, used for identifying the packet.
We denote the time-to-live (TTL) of a packet by tj if the packet expires after
reaching router Rj. The TTL of all the packet-train packets Ti is tj, so the Ti packets will
stop at router Rj, and Rj responds with ICMP time-exceeded packets to the source.
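Once the reply timestamps for the train packets are collected, the bottleneck bandwidth follows from the packet-train dispersion; this sketch shows only that calculation, with the actual UDP/ICMP machinery and the header-probe details omitted and the function name assumed.

```python
def train_bandwidth(st_bytes, arrival_times):
    """Estimate bottleneck-link bandwidth from packet-train dispersion.

    st_bytes      -> size of each train packet Ti in bytes
    arrival_times -> times at which the replies for T1..Tn come back
    On the bottleneck link the packets end up spaced by their
    transmission time, so bandwidth ~ packet size / average gap.
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    avg_gap = sum(gaps) / len(gaps)
    return st_bytes * 8 / avg_gap  # bits per second
```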
III. EVALUATION
Efficient data structures used for adaptive refinement and tiling include trees of
grids with finer grids regarded as offspring of coarser ones. Within each grid, AVL tree
structures [3] permit easy insertion and deletion of elements as they migrate between
processors. Similar tree structures at inter-processor boundaries facilitate the transfer of
data between neighboring processors. Most previous work focuses on incorporating
environment information into preselected partitioning algorithms [6,7,10]. As an
alternative, such information could be used to select appropriate partitioning strategies.
The work assigned to these nodes is then recursively partitioned among the nodes in their
subtrees. Different partitioning methods can be used in each level and subtree to
produce effective partitions with respect to the network; for example, graph or
hypergraph partitioners could minimize communication between nodes connected by slow
networks, while fast geometric partitioners operate within each node. A repartitioning of a
dynamic graph can be computed by simply partitioning the new graph from scratch.
However, since no concern is given for the existing partition, most vertices are not likely
to be assigned to their initial partitions with this method. Intelligent remapping of the
resulting partition can reduce the required movement of vertices, but vertex migration can
still be quite high. The second strategy is to use the existing partitioning as input for a
repartitioning algorithm and to attempt to minimize the difference between the original
partition and the output partition. This strategy can result in much smaller vertex
migration compared to schemes that partition the modified graph from scratch. (The three
phases of the multilevel diffusion repartitioning algorithms are described in Section II.)
DLB is not a scratch-remap scheme because it takes into
consideration the previous load distribution during the current redistribution process.
Compared to a diffusion scheme, the DLB scheme differs in two ways. First, the DLB
scheme addresses the issue of coarse granularity in SAMR applications [14]: it splits
large-sized grids located on overloaded processors if the movement of grids alone is not
enough to handle load imbalance. Second, the DLB scheme chooses direct data movement
between overloaded and underloaded processors instead of movement only between
neighboring processors.
IV. CONCLUSION
In this paper we surveyed various adaptive techniques for balancing the load in a
global-scale grid environment. By using the DLB scheme, comprising a moving-grid
phase and a splitting-grid phase, the total execution time of SAMR applications was
reduced by up to 47%, and the quality of load balancing was improved by more than two
times, especially when the number of processors is larger than 16. For the multilevel
diffusion technique, results on a variety of synthetic and application meshes show that it
is a robust scheme for repartitioning a wide variety of adaptive meshes. For adaptive
finite element methods, data movement from an old decomposition to a new one can
consume orders of magnitude more time than the actual computation of the new
decomposition; highly incremental partitioning strategies that minimize data movement
are therefore important for high performance of adaptive simulations.
REFERENCES
[1] Characterizing the Performance of Dynamic Distribution and Load-Balancing
Techniques for Adaptive Grid Hierarchies, Mausumi Shee, Samip Bhavsar, and
Manish Parashar, Proceedings of the IASTED International Conference Parallel and
Distributed Computing and Systems November 3-6, 1999 in Cambridge
Massachusetts, USA.
[2] Multilevel Diffusion Schemes for Repartitioning of Adaptive Meshes, Kirk Schloegel,
George Karypis, and Vipin Kumar, Journal of Parallel and Distributed Computing 47,
109-124 (1997), Article No. PC971410.
[3] Parallel Adaptive hp-Refinement Techniques for Conservation Laws, Karen D.
Devine and Joseph E. Flaherty, Applied Numerical Mathematics, 20 (1996) 367-386
Sandia National Laboratories Tech. Rep. SAND95-1142J
[4] Adaptive Performance Modeling on Hierarchical Grid Computing Environments,
Wahid Nasri, Luiz Angelo Steffenel, and Denis Trystram, Laboratoire ID-IMAG,
INPG, Grenoble, France (2007).
[5] Object-Based Adaptive Load Balancing for MPI Programs, Milind Bhandarkar, L. V.
Kalé, Eric de Sturler, and Jay Hoeflinger, research funded by the U.S. Department of
Energy through the University of California under Subcontract B341494, October 6,
2000.
[6] Parallel Dynamic Graph Partitioning for Adaptive Unstructured Meshes, C. Walshaw,
M. Cross, and M. G. Everett, JOURNAL OF PARALLEL AND DISTRIBUTED
COMPUTING 47, 102–108 (1997) ARTICLE NO. PC971407
[7] New Challenges in Dynamic Load Balancing, Karen D. Devine, Erik G. Boman,
Robert T. Heaphy, and Bruce A. Hendrickson, Sandia contract PO15162 and the
Computer Science Research Institute at Sandia National Laboratories.
[8] H. Casanova, “Simgrid: A Toolkit for the Simulation of Application Scheduling,” in
Proceedings of the IEEE International Symposium on Cluster Computing and the
Grid (CCGrid’01), May 2001, pp. 430–437.
[9] G. Shao, Adaptive Scheduling of Master/Worker Applications on Distributed
Computational Resources, Ph.D. thesis, University of California, San Diego, May
2001.
[10] On Partitioning Dynamic Adaptive Grid Hierarchies, Manish Parashar and James
C. Browne, Binary Black-Hole NSF Grand Challenge (NSF ACS/PHY 9318152),
January 1996.
[11] Hash-Storage Techniques for Adaptive Multilevel Solvers and Their Domain
Decomposition Parallelization, Contemporary Mathematics, Volume 218, 1998.
[12] A. B. Downey, "Using Pathchar to Estimate Internet Link Characteristics," ACM
SIGCOMM '99, pp. 241-250.
[13] Adaptive Load Balancing for Divide-and-Conquer Grid Applications, Rob V. van
Nieuwpoort, Jason Maassen, Gosia Wrzesińska, Thilo Kielmann, and Henri E. Bal,
Kluwer Academic Publishers, 2004.
[14] Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications,
Zhiling Lan, Valerie E. Taylor, Greg Bryan, National Computational Science
Alliance (ACI- 9619019)
[15] V.V. Korkhov, et al., A Grid-based Virtual Reactor: Parallel performance and
adaptive load balancing, J. Parallel Distrib. Comput. (2007), doi:
10.1016/j.jpdc.2007.08.010