INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print)
ISSN 0976 – 6375 (Online)
Volume 4, Issue 5, September – October (2013), pp. 109-114
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI)
www.jifactor.com

DYNAMIC DATA REPLICATION AND JOB SCHEDULING BASED ON
POPULARITY AND CATEGORY
Priya Deshpande1, Brijesh Khundhawala2, Prasanna Joeg3
1 Assistant Professor, MITCOE, Pune
2 ME Student, MITCOE, Pune
3 Professor, MITCOE, Pune

ABSTRACT
Dealing with a huge amount of data makes the requirement for efficient data access more critical in data grids. Improving data access time is one way of reducing job execution time, i.e. improving performance. To speed up data access and reduce bandwidth consumption, data grids replicate data in multiple locations. This paper studies a new data replication strategy in data grids which takes into account two important issues concerning replication: the storage capability of different nodes and the bandwidth consumption between nodes. It also considers the popularity of a file for replacement: less popular files get lower priority than more popular files. The limitation on storage also needs to be considered. Performance can be optimized by placing a file as close to the client as possible. Our algorithm optimizes replication by taking into consideration the popularity of the file, the limited storage and the category of the file.
Keywords: Data Replication, Job Scheduling, Replica Strategy
I. INTRODUCTION
Large-scale, geographically distributed systems are becoming very popular for data-intensive applications, most importantly scientific applications. The life sciences, astrophysics and bioinformatics research communities are deploying grid systems to process large amounts of data stored at geographically dispersed locations. Millions of files are generated regularly, amounting to many terabytes. The volume of interesting data is measured in terabytes and will reach petabytes in a short time, because technology and research capabilities are growing fast [13]. There is therefore a great need to ensure efficient access to such huge and widely dispersed data in a data grid. In a data grid, performance is strongly influenced

by data locality [1]. Data replication is a widely known method used to improve the performance of data access in distributed systems. By creating replicas we can efficiently reduce bandwidth consumption and access latency. In particular, increasing the data read performance from the perspective of clients is the motive of data replication algorithms. Replication is a mechanism for creating and managing multiple copies of files. A replica management service can be viewed as composed of the following activities: creating new replicas, registering these new replicas in a replica catalogue, and querying the catalogue to find the locations of the respective replicas. The replication mechanism involves three main questions: which file should be replicated, when to replicate, and where to replicate. To improve job execution time we have also tried to consider job scheduling. First, we place data in the grid category-wise, i.e. data of the same category is placed as close together as possible. Then, during replica replacement, the least popular file, decided based on its access frequency, is evicted to make room for the required data. We also try to place a job as close as possible to the data it requires, so that overall performance can be improved.
II. RELATED WORKS
In the grid computing environment, data replication and scheduling are primary concerns for performance optimization. Replica selection, replica placement and replica replacement have always been crucial for performance. Replica placement should be done in such a way that the file transfer time for job execution is minimized. Replica replacement has well-known strategies such as LRU and LFU. Much research is ongoing in these areas.
EDGSim [2], a simulator implemented by the European Data Grid project, was designed to simulate the performance of the European Data Grid, but it focused on optimizing the scheduling algorithm; data location is considered important there, but replication is not considered. GridNet [3], in contrast, aims to address data replication: it proposed a dynamic replication algorithm and memory middleware that was evaluated to improve data access time.
The importance of data locality was first described by K. Ranganathan [4], who suggested replication strategies to reduce network bandwidth and access delay; our system architecture is similar to the one proposed there, with a few changes. H. Sato et al. [5] proposed a file replication algorithm that improved simple replication methods by taking network capacity and file access patterns into consideration. Similarly, R.S. Chang et al. [6] proposed the Latest Access Largest Weight (LALW) method, which uses data access history, applying a greater weight to more recent accesses in data replication.
In [9], a decentralized architecture for adaptive media dissemination was proposed. The authors assumed that the popularity of the datasets follows the Zipf distribution and defined the replica weight based on popularity.
In [7], a Dynamic Optimal Replication Strategy is proposed which is based on a file's access history, the network status and the file's size; its results show that it works better than LRU and LFU. In [8], a dynamic strategy is proposed which tracks changes in the data access patterns and then applies whichever traditional replication strategy, such as LRU or LFU, best fits the current access pattern.
In this paper we propose a strategy that takes into account the file access history to determine popularity, the category of the data and the location of the job to be executed, which gives hope for better performance. The rest of the paper is structured as follows: Section 3 proposes the system architecture for the strategy, Section 4 specifies the steps for dynamic replication, and Section 5 defines the scheduling strategy. Section 6 gives the conclusion and Section 7 lists the references.


III. SYSTEM ARCHITECTURE
The system architecture for our grid model is shown in Fig. 1.
The components of our architecture are as follows (a small illustrative sketch of these components follows Figure 1):
• LS: Local Scheduler of a grid site.
• DS: Dataset/Data Scheduler of a grid site.
• RC: Replica Catalogue, which stores the list of all replicas on the grid.
• CE: Computing Element. Each grid site contains zero or more CEs providing its computing capability.
• SE: Storage Element. Each site contains zero or more SEs representing its storage capacity.
• JB: Job Broker, which receives jobs from users and submits them to an appropriate grid site.
• Replication Manager (RM): a centralized server that stores the replication information of the system. It contains an Active Replicator that performs replication for the system. It would be better to have a decentralized RM.

Figure 1 System Architecture [12]
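To make the roles of these components concrete, a minimal sketch of how they could be modelled as data structures is given below. The class names, fields and the dictionary-based replica catalogue are illustrative assumptions of ours; they are not taken from the paper or from any particular grid middleware.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class StorageElement:
        """SE: the storage capacity of a site (sizes in MB, an arbitrary choice)."""
        capacity_mb: float
        files: Dict[str, float] = field(default_factory=dict)   # FID -> file size in MB

        def free_space(self) -> float:
            return self.capacity_mb - sum(self.files.values())

    @dataclass
    class GridSite:
        """A grid site with its computing and storage resources."""
        name: str
        computing_elements: int          # number of CEs on the site
        storage: StorageElement          # the site's SE
        category: str                    # research category this site mainly serves

    # RC: replica catalogue mapping each file identifier to the sites holding a replica.
    ReplicaCatalogue = Dict[str, List[str]]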
Let us briefly describe how the process works. After every predefined interval, the replication manager collects data usage information from the environment. The interval should not be too large, because the information collected should be fresh; equally, it should not be too small, because collecting too frequently increases bandwidth usage and processing load. The manager then decides which files to replicate, based on the collected data and on the strategy defined in Section 4; in doing so it considers both the distance between sites and the relationships of the data to each other.
Jobs from various clients are submitted to the Job Broker. The Job Broker can be thought of as analogous to the NameNode of Hadoop [11]: it decides to which machine a job should be assigned. In our model the LS and DS work in parallel. While the LS executes a job, the DS looks for the data required, on the local machine and on other machines, one step ahead, so time is saved and system utilization is increased. This is shown in detail in Section 5.
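The periodic collection step described above can be sketched as a simple loop. The helper names collect_usage and decide_replication are placeholders introduced here for illustration, and the default interval value is arbitrary.

    import time

    def replica_manager_loop(collect_usage, decide_replication, interval_s=600):
        # collect_usage() is assumed to return {FID: number of accesses} for the last interval;
        # decide_replication(stats) is assumed to create and place replicas as in Section 4.
        while True:
            stats = collect_usage()        # fresh usage information from the environment
            decide_replication(stats)      # choose which files to replicate and where
            time.sleep(interval_s)         # not too short (overhead), not too long (stale data)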
IV. DYNAMIC REPLICATION
Here it is assumed that the data in data grids belongs to a field of research, e.g. biology, chemistry, meteorology, medicine, etc. [12]; these fields form the first level of a hierarchical tree. Splitting them further down, we can divide biology into cell biology, molecular biology, cell technology, proteomics, etc., and these categories can be split further still. The reason behind this assumption is that

data in one category is rarely or never used in another category. By doing so, we can form a hierarchical tree of relationships between data of different categories. Each data entry belongs to one category and is more closely related to data entries of the same category than to those of other categories. Because replication takes place before job execution, it is better to put a replica near the site that frequently uses it. If we gather together data that has a high probability of being used, performance will definitely increase. Our idea is to gather data that is highly related into small regions, so that jobs which use that data will be scheduled to run in those regions.
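As a purely illustrative aid, the fragment below shows one way such a category hierarchy could be represented; the category names and the helper function are our own examples, not part of the proposed system.

    # First level: research fields; deeper levels: their sub-fields.
    category_tree = {
        "Biology": {
            "Cell Biology": {},
            "Molecular Biology": {},
            "Proteomics": {},
        },
        "Meteorology": {},
        "Medical": {},
    }

    def category_path(tree, target, path=()):
        """Return the root-to-category path, e.g. ('Biology', 'Proteomics')."""
        for name, children in tree.items():
            if name == target:
                return path + (name,)
            found = category_path(children, target, path + (name,))
            if found:
                return found
        return None

    print(category_path(category_tree, "Proteomics"))   # ('Biology', 'Proteomics')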
As discussed previously, the main issues of replication are:
• Which data should be replicated?
• Where should the new replica be placed?
The following sections answer these questions.
A. REPLICA DECISION
In order to decide which file needs to be copied, we find the popularity of the files and choose a file based on that. In actual usage, data access patterns change over time, so any dynamic replication strategy must keep track of file access histories to decide when, what and where to replicate. The “popularity” of a file is determined by finding its access rate by various clients/users. Finding the popular files is therefore the key and first step of our strategy. Here, it is assumed that a recently popular file will be accessed more frequently in the near future. This popularity record is maintained by every replication server. The data category is also important for the replica decision: the decision is made according to the category of the data, and related data is placed together. Each unique file is assigned a unique identifier (FID). At regular intervals our algorithm is invoked to find the popularity of the files. Access history logs are cleared at the beginning of each replication interval to capture the current access pattern dynamics. The interval is chosen based on the arrival rate of data requests: a short interval is chosen when the data request rate is high, and vice versa; the interval is adapted dynamically. The data accesses of each unique file are aggregated and summarized, and the Number of Accesses NOA(f) is stored on the server. The average number of accesses is then calculated, and any file that has more accesses than the average needs to be replicated. The chosen file is replicated only if its number of replicas is less than a threshold value. The threshold value can be decided by the following equation:
R = q / w
where R is the relative capacity of the whole system, q is the sum of all nodes' capacities and w is the total size of all the files in the data grid.
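A minimal sketch of this decision step follows, assuming the access counts NOA(f), the current replica counts, the node capacities and the file sizes are available as plain dictionaries and lists; the function name and its signature are ours, not the paper's.

    def files_to_replicate(noa, replica_count, node_capacities, file_sizes):
        # noa            : {FID: number of accesses NOA(f) in the last interval}
        # replica_count  : {FID: current number of replicas}
        # node_capacities: capacities of all nodes (q = their sum)
        # file_sizes     : {FID: size} (w = total size of all files)
        avg_access = sum(noa.values()) / max(len(noa), 1)
        R = sum(node_capacities) / max(sum(file_sizes.values()), 1)   # threshold R = q / w
        popular = [f for f, n in noa.items() if n > avg_access]
        return [f for f in popular if replica_count.get(f, 1) < R]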
B. REPLICA PLACEMENT
As stated above, our strategy tries to put a replica as close as possible to the category it belongs to, so that jobs belonging to that category are executed nearby, which in turn reduces the time spent transferring files at job execution time. For example, in an organization, if we place the data related to the HR department as close together as possible, it will be faster for a job to fetch all the data it requires. We can even place a job by taking its category into account, and in the same manner we can place the data of different departments by considering their categories. To put the files closer together we compute the distance, defined as the time required to transfer the file from one node to another. The distance should therefore be as low as possible; but for replicas of the same file the distance should be as great as possible, so that two replicas of the same file do not end up in the same region. To choose a site for the newly created replica, we evaluate the distance to every site for the selected file, and the site which offers the lowest distance is chosen to store the new replica. If the available storage is less than required, we will


again use the popularity of the files on the target site to find the least popular files, which are deleted, and then the new replica is stored.
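The placement and replacement rule can be sketched as follows. The site interface (a files dictionary and a free_space() method), the distance and popularity callbacks, and the assumption that candidate_sites already excludes regions holding another replica of the file are all our own illustrative choices.

    def place_replica(fid, file_size, candidate_sites, distance, noa_at):
        # distance(site)  : estimated time to transfer the file to that site
        # noa_at(site, f) : access count of file f at that site (its local popularity)
        target = min(candidate_sites, key=distance)        # lowest transfer-time distance wins
        while target.free_space() < file_size and target.files:
            victim = min(target.files, key=lambda f: noa_at(target, f))
            del target.files[victim]                       # evict the least popular file first
        target.files[fid] = file_size
        return target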
V. SCHEDULING STRATEGY
As discussed above, we try to place a job as close as possible to the category to which it belongs. The Job Broker makes use of both the DS (Data Scheduler) and the LS (Local Scheduler) to optimize job execution. The Job Broker calculates the estimated time taken by every site and then chooses the site with the minimum estimated time:
Estimated required time = DT(j) + QT(j) + ET
DT: time required to transfer the data from other nodes to the site where the job is being executed.
QT: queuing time.
ET: time required to execute the job. ...... [12]
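A minimal sketch of this selection, treating the three time components as placeholder estimator functions supplied by the broker (the function names are ours), could look as follows.

    def choose_site(job, sites, data_transfer_time, queueing_time, execution_time):
        def estimated_time(site):
            return (data_transfer_time(job, site)    # DT: move missing files to the site
                    + queueing_time(job, site)       # QT: wait in the site's queue
                    + execution_time(job, site))     # ET: run the job itself
        return min(sites, key=estimated_time)        # site with the minimum estimated time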
After this process, the job is assigned to the site with the minimum estimated time. In traditional systems, all the data needed is gathered first and only then does the job's execution start. In our strategy, data fetching and job execution are done in parallel. The Local Scheduler fetches the files required for job execution that are already on the local site and puts them in the queue in the order in which they will be used. Files which are not available locally are brought to the current site by the Data Scheduler: while the job is executing, the DS fetches and brings the files to the local site. For example, while the site is executing the first task, the DS will try to bring the files needed for the second task of the job. If a required file has not yet arrived when it is needed, the CE waits, and as soon as the file arrives it resumes execution. This strategy therefore minimizes the time required to execute a job. The procedure is sketched in the following pseudocode:
Job Execution:
    Receive Job(J);
    CreateThread(LS)
    {
        ReceiveData(d);
        ExecuteJob();
    }
    CreateThread(DS)
    {
        Data d = FetchData();
        SendData(d);
    }
    Return Result;
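A hedged sketch of this parallel behaviour using Python threads is given below; the job and task attributes (job.tasks, task.input_file) and the fetch_remote and run_task callbacks are assumptions made for illustration only.

    import queue
    import threading

    def execute_job(job, local_files, fetch_remote, run_task):
        ready = queue.Queue()

        def data_scheduler():                       # DS: bring files to the local site
            for task in job.tasks:
                name = task.input_file
                path = local_files.get(name) or fetch_remote(name)
                ready.put((task, path))             # hand over each file as soon as it is available

        threading.Thread(target=data_scheduler, daemon=True).start()

        results = []                                # LS: execute tasks in order of arrival,
        for _ in job.tasks:                         # blocking only if a file has not arrived yet
            task, path = ready.get()
            results.append(run_task(task, path))
        return results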

VI. CONCLUSION
In this paper, we proposed a dynamic optimal strategy which first calculates the popularity of the files based on the data access history and then takes the most popular files into consideration. If the number of replicas of a chosen file is less than the threshold value, a replica is placed on the most appropriate node based on the file's category. A job scheduling strategy is also suggested to improve performance. Traditional replication strategies do not react to the current status, so they are not as effective as dynamic replication strategies. Still, there are many areas that need to be considered for improving performance in the data grid environment; more parameters need to be considered in future work, as grid sizes are increasing drastically and complexity is growing.
VII. REFERENCES
[1] Foster, I., "The Grid: A New Infrastructure for 21st Century Science", Physics Today, Vol. 55, pp. 42-47, 2002, John Wiley & Sons.
[2] P. Crosby, EDGSim, http://www.hep.ucl.ac.uk/~pac/EDGSim/
[3] H. Lamehamedi, et al., "Simulation of Dynamic Data Replication Strategies in Data Grids", In Proc. of the 12th Heterogeneous Computing Workshop (HCW2003), Nice, France, April 2003, IEEE-CS Press.
[4] K. Ranganathan, I. Foster, "Design and Evaluation of Dynamic Replication Strategies for a High-Performance Data Grid", International Conference on Computing in High Energy and Nuclear Physics, 2001.
[5] H. Sato, et al., "Access-Pattern and Bandwidth Aware File Replication Algorithm in a Grid Environment", International Conference on Grid Computing, pp. 250-257, 2008.
[6] R.S. Chang, H.P. Chang, "A Dynamic Data Replication Strategy Using Access-Weights in Data Grids", Journal of Supercomputing, Vol. 45, No. 3, pp. 277-295, 2008.
[7] Wuqing Zhao, Xianbin Xu, Zhuowei Wang, Yuping Zhang, Shuibing He, "A Dynamic Optimal Replication Strategy in Data Grid Environment", IEEE, 2010.
[8] Myunghoon Jeon, Kwang-Ho Lim, Hyun Ahn, Byoung-Dai Lee, "Dynamic Data Replication Scheme in Cloud Computing Environment", IEEE, 2012.
[9] Philippe Cudre-Mauroux, Karl Aberer, "A Decentralized Architecture for Adaptive Media Dissemination", ICME 2002 Proceedings, pp. 533-536, 2002.
[10] Mohammad Shorfuzzaman, Peter Graham, Rasit Eskicioglu, "Popularity Driven Dynamic Replica Placement in Hierarchical Data Grids", Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008.
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System", Yahoo!, Sunnyvale, California, USA.
[12] Nhan Nguyen Dang, Soonwook Hwang, Sang Boem Lim, "Improvement of Data Grid's Performance by Combining Job Scheduling with Dynamic Replication Strategy", The Sixth International Conference on Grid and Cooperative Computing, 2007.
[13] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets", Journal of Network and Computer Applications, Vol. 23, pp. 187-200, 2000.
[14] M. Pushpalatha, T. Ramarao, Revathi Venkataraman, Sorna Lakshmi, "Mobility Aware Data Replication using Minimum Dominating Set in Mobile Ad Hoc Networks", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, pp. 645-658, 2012, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
