Kiran Kumar Reddi et al Int. Journal of Engineering Research and Applications
ISSN : 2248-9622, Vol. 3, Issue 6, Nov-Dec 2013, pp.708-711

RESEARCH ARTICLE

www.ijera.com

OPEN ACCESS

Different Techniques to Transfer Big Data: A Survey
Kiran Kumar Reddi¹, Dnvsls Indira²

¹Department of Computer Science, Krishna University, Machilipatnam, Krishna Dt., AP-521001
²Research Scholar, Department of CS, Krishna University, Machilipatnam, Krishna Dt., AP-521001

ABSTRACT
Nowadays the world revolves around social networks; most people interact with each other primarily via the Internet. Transmitting data via the Internet is a routine and common task for users today. Transferring a gigabyte of data in an entire day was once normal; however, users now transmit multiple gigabytes in a single hour. Big Data is the combination of structured, semi-structured, unstructured, homogeneous, and heterogeneous data. With the influx of big data and massive scientific data sets that are measured in tens of petabytes, users have the propensity to transfer ever larger amounts of data. When data sets of this magnitude are transferred on public or shared networks, the performance of all workloads in the system is impacted. This paper addresses the issues and challenges inherent in transferring big data over networks and provides a survey of current transfer techniques.
Key words: Big Data, Grid, Parallel Transfer, NICE
I. INTRODUCTION
Over the past several years there has been a tremendous increase in the amount of data being transferred between Internet users. Escalating usage of streaming multimedia and other Internet-based applications has contributed to this surge in data transmissions. Another facet of the increase is due to the expansion of Big Data, which refers to data sets that are an order of magnitude larger than the standard file transmitted via the Internet. Big Data can range in size from hundreds of gigabytes to petabytes. Today everything is being stored digitally. Within the past decade, everything from banking transactions to medical history has migrated to digital storage. This change from physical documents to digital files has necessitated the creation of large data sets and consequently the transfer of large amounts of data. There is no sign that the amount of data being stored or transmitted by users is steady or even decreasing. Every year average Internet users are moving more and more data through their Internet connections. Depending on the bandwidth of these connections and the size of the data sets being transmitted, the duration of transfers could potentially be measured in days or even weeks. There exists a need for an efficient transfer technique that can move large amounts of data quickly and easily without impacting other users or applications. This paper presents different techniques to transfer data across the globe. It surveys grids, parallel techniques to transfer data over the Internet, and a new model to transfer big data efficiently across the globe.

II. RELATED WORK
Retrieving large data files (GB, TB, PB) is a complicated and time-consuming process. These long-duration transfers could take tens of hours to several days, and a normal "one click and wait" method will not suffice. During the course of the transfer, servers may go off-line and network conditions may change in ways that either hinder or stop the transfer completely. The user needs to know how to maintain the data transmission until completion. This paper gives an overview of previously existing techniques to transfer files around the world. This section concentrates on grid computing to distribute data and on parallel transfer techniques.

2.1 GRIDS
Grid computing has emerged as a framework for aggregating geographically distributed, heterogeneous resources that enables secure and unified access to computing, storage, and networking resources (1). Grid applications have vast datasets and/or complex computations that require secure resource sharing among geographically distributed systems. The term "Grid" was inspired by the electrical grid system, where a user can plug an appliance into a universal socket and have instant access to power without knowing exactly where that power was generated or how it came to reach the socket (1). The vision for grids was similar: a user could simply access as much computing power as required through a common interface without concern for who was providing the resources. Currently, grids have not yet reached that level of simplicity. Grids offer coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (2). A virtual organization (VO) comprises a set of individuals and/or institutions having access to computers, software, data, and other resources for collaborative problem-solving or other purposes (3).
A grid can also be defined as a system that
coordinates resources that are not subject to
centralized control, using standard, open, general-purpose protocols and interfaces in order to
deliver nontrivial qualities of service (4). Data grids,
a specialized extension of grid computing, are
responsible for providing the infrastructure and
services to access, transfer, and modify massive
datasets stored in distributed storage resources (5).
They allow users to access computational and storage
resources in order to execute data-intensive
applications on remote data. The objective of a data
grid system is to integrate heterogeneous data files
stored in a large number of geographically distributed
sites into a single virtual data management system
and to provide diverse services to fit the needs of
high-performance distributed and data intensive
computing (5).
2.2 GRID MIDDLEWARE
The sharing of resources in a grid is facilitated
and controlled by a set of services that allow
resources to be discovered, accessed, allocated,
monitored, and accounted for, regardless of
their physical location (6). Since these services
create a layer between physical resources and
applications, they are often referred to as Grid
Middleware. Every grid has different service
requirements; therefore, the architecture and grid
middleware implementation of every grid can vary.
The middleware of many grids is based on the
software architecture called the Globus Toolkit
(7).
The toolkit is a set of libraries and
programs that address common problems that
occur when building distributed system services
and applications (8). It provides a set of
infrastructure services that implement interfaces for
managing computational, storage, and other
resources. The Globus Toolkit provides all of these
services and it is left to grid administrators to
determine whether or not to include certain services
in their grid implementation. These are a few of the
well known and widely used grids that deploy
the Globus Toolkit: TeraGrid, Open Science Grid,
EGEE, Worldwide LHC Computing Grid (WLCG)
(8), China National Grid, UK National Grid Service
and NAREGI (9). The architecture of the Globus
Toolkit contains several components, each of which
is responsible for different grid functions. A few of
these services are (7):
• Grid Resource Allocation and Management
(GRAM) - This service initiates, monitors, and
manages the execution of computations on
remote computers. It allows a user to specify:
the quantity and type of resources needed, the
data sets required for their computation, the
executable application to be run, the
necessary security credentials, and the job
persistence requirements.
• Data access and movement - The reliable file
www.ijera.com

www.ijera.com

transfer (RFT) service is provided to ensure that
data is successfully transferred from one location
to another.
• Replica Management - This service keeps track
of all replicas and their content using a replica
location service (RLS) and a data replication
service (DRS).
• Monitoring and Discovery - Multiple services
collect and process information about the
configuration and state of all resources to enable
monitoring of system status.
• Security - Services establish the identity of
users or services (authentication), protect
communications, and determine who is allowed
to perform what actions (authorization), as well
as manage user credentials and maintain group
membership information.
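As a rough illustration of the replica-management role described above, the sketch below implements a toy replica catalog that maps a logical file name to the physical locations holding copies. This is loosely modeled on what an RLS does; the class, file names, and gsiftp URLs are all hypothetical, and the real service has a far richer, distributed interface.

```python
# Toy replica catalog sketch (hypothetical, not the real RLS API):
# a logical file name maps to the set of physical URLs holding copies.

class ReplicaCatalog:
    def __init__(self):
        self._replicas = {}  # logical name -> set of physical URLs

    def register(self, logical_name, physical_url):
        """Record that a copy of `logical_name` exists at `physical_url`."""
        self._replicas.setdefault(logical_name, set()).add(physical_url)

    def lookup(self, logical_name):
        """Return all known physical locations of a logical file, sorted."""
        return sorted(self._replicas.get(logical_name, set()))

catalog = ReplicaCatalog()
catalog.register("lhc/run42.dat", "gsiftp://se1.example.org/data/run42.dat")
catalog.register("lhc/run42.dat", "gsiftp://se2.example.org/data/run42.dat")
print(catalog.lookup("lhc/run42.dat"))
```

A data replication service would sit on top of such a catalog, copying files between sites and registering the new locations as it goes.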
There are several grid applications
available for moving data from one location in a
grid to another. The most widely-used data
movement tool, which is also a component of the
Globus Toolkit, is called GridFTP (10; 11). It is
an extension of the File Transfer Protocol (FTP) and
was designed specifically for grid environments.
Several data retrieval techniques have been developed
specifically for retrieving large files in grid
computing environments. The sizes of data files
requested in grids are much larger than normal web
data requests. It is not uncommon for a grid data
file size to be measured in gigabytes or terabytes.
Users want to be able to download these files as
quickly as possible, by any means necessary.
Since utilizing a single server can be limiting,
retrieving data from multiple servers in parallel
(also known as data co-allocation) has been
suggested as an alternative.
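The co-allocation idea can be illustrated with a short sketch: split a file's bytes across replica servers in proportion to each server's throughput, so that ideally all portions finish at the same time. The server names and rates below are hypothetical, a minimal sketch rather than any particular tool's algorithm.

```python
# Sketch of data co-allocation: assign each replica server a byte count
# proportional to its (assumed) transfer rate, so all servers would
# ideally finish their portion simultaneously.

def proportional_split(file_size, rates):
    """Split `file_size` bytes across servers in proportion to `rates`."""
    total_rate = sum(rates.values())
    servers = sorted(rates)
    shares, assigned = {}, 0
    for server in servers[:-1]:
        share = file_size * rates[server] // total_rate
        shares[server] = share
        assigned += share
    shares[servers[-1]] = file_size - assigned  # remainder to the last server
    return shares

# A 1000 MB file across three replicas with rates of 10, 30, 60 MB/s:
print(proportional_split(1000, {"a": 10, "b": 30, "c": 60}))
# {'a': 100, 'b': 300, 'c': 600} -> each portion takes 10 s at its rate
```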

III. PARALLEL TRANSMISSION TECHNIQUES USED ON THE INTERNET

Rodriguez and Biersack present
mechanisms for parallel access to data on the
Internet (11). They develop two different
parallel-access schemes: history-based and dynamic. The
goal for all of their schemes is to balance the load
amongst all available servers by allocating a
workload to each replica that is proportional to its
service rate. The authors state that parallel access has
additional overhead in comparison to a single access.
The additional overhead occurs when multiple
connections are opened and extra traffic is generated
to perform block requests. In order to minimize
these overhead costs, these techniques should only
be utilized for larger files. Their history-based
technique utilizes a database with information about
previous rates from the different servers to the
receiver in order to estimate future transfers. Using
these estimates the algorithm assigns varying
portions of the file to each replica with the goal that
all servers will finish transferring the portions at the
same time. The authors evaluate their history-based
technique using live webservers on the Internet
distributed across the world. Due to the presence of
other Internet traffic, the authors found that the
performance of their technique varied at different
times of the day. They found that network
conditions rapidly change and estimating the
transfer rate to every server using past histories
results in poor estimates. Their results show that
during peak traffic times when transfer rates vary
dramatically and historical information is not a
good indicator of future performance, their history-based
parallel access technique has higher
download times than clients accessing a single
server. In response to the performance of their
history-based technique, the authors develop a
dynamic technique that adjusts to changing
network conditions. Their dynamic technique
divides the desired file into a fixed number of
equal sized blocks. The client requests one block
from every replica. When a server completes a
request, another block is assigned. When there are a
small number of blocks outstanding, idle servers
are requested to deliver blocks that have already
been assigned to another server, but have yet to
be received. There will then be multiple servers
working on the same requests. The authors state that
the bandwidth wasted on overlapping these requests
is smaller than the worst-case scenario of waiting
for the slowest server to deliver the last block. To
further enhance the performance of their
technique, they utilize TCP-persistent connections
between the client and every server to minimize
the overhead of opening multiple TCP connections.
They also propose pipelining the requests to each
server in order to decrease inter block idle times.
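The dynamic block-assignment loop described above can be sketched as a small simulation: the file is cut into equal-sized blocks, every server starts on one block, and whichever server finishes first is handed the next outstanding block, so faster servers naturally deliver more blocks. Server names and per-block times are hypothetical, and the endgame duplication of outstanding blocks is omitted for brevity.

```python
# Simulation of dynamic parallel access: equal-sized blocks are handed
# out one at a time, and a server gets a new block whenever it finishes
# its current one, so faster servers serve proportionally more blocks.

import heapq

def dynamic_assign(num_blocks, block_time):
    """block_time: server -> seconds per block. Returns (blocks per server, finish time)."""
    served = {s: [] for s in block_time}
    next_block = 0
    events = []  # min-heap of (finish_time, server)
    for s in sorted(block_time):         # start every server on one block
        if next_block < num_blocks:
            heapq.heappush(events, (block_time[s], s))
            served[s].append(next_block)
            next_block += 1
    finish = 0.0
    while events:
        t, s = heapq.heappop(events)     # earliest server to complete a block
        finish = max(finish, t)
        if next_block < num_blocks:      # hand the idle server the next block
            served[s].append(next_block)
            next_block += 1
            heapq.heappush(events, (t + block_time[s], s))
    return served, finish

served, finish = dynamic_assign(6, {"fast": 1.0, "slow": 3.0})
print(served, finish)
# {'fast': [0, 2, 3, 4], 'slow': [1, 5]} 6.0
```

Note that in this run the slow server still delivers the final block late, which is exactly the worst case the authors' endgame duplication (re-requesting outstanding blocks from idle servers) is designed to cut short.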
With pipelining, a new block request is sent to a
server before the previous block request is
completely received. In the evaluations of their
dynamic technique, the authors find that there is
a significant speedup in comparison to a single
server access. Since the dynamic technique is not
relying on historical information and can adapt to
changing network conditions, it has greater
performance than requesting data from a single
server even under peak traffic conditions. They also
observe that the transfer time of a dynamic parallel
access is very close to the optimum transfer time.
Utilizing request pipelining, the authors demonstrate
that their technique would be almost equal to the
optimal transfer time. These advanced retrieval
techniques allow users to utilize multiple resources
simultaneously. These advanced techniques
provide improved performance for users, however
they are quite complicated to implement and use.
They require significant user involvement and
multiple user decisions that can dramatically
affect the performance of the transfer. A user
needs to know how to make these
techniques function properly and efficiently. Another
configuration option that is frequently left for the
user to determine is segment size. In some
advanced techniques, the data file is divided into
small portions called segments. The segment size is
often left for the user to decide and the size chosen
can affect the performance of the transfer.
Determining the appropriate segment size is not a
simple task. If the size is too small, a server may
receive hundreds to thousands of requests for
portions of a single file. This will result in longer
disk service times at a server, as the number of
users increases. A server's storage system can best
service requests if it has greater knowledge of a user's
workload. It can better schedule reading from the
hard disks, as well as take advantage of prefetching and caching strategies.
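The request-load effect of segment size can be made concrete with a little arithmetic: the number of requests generated for one file is the ceiling of file size over segment size, so halving the segment size doubles the request count. The sizes below are illustrative.

```python
# Segment-size trade-off: requests per file = ceil(file_size / segment_size).
# A small segment size multiplies the request load on each server.

def num_requests(file_size_bytes, segment_bytes):
    """Requests needed to fetch one file, rounding the last partial segment up."""
    return -(-file_size_bytes // segment_bytes)  # ceiling division

GB = 1024 ** 3
KB = 1024
print(num_requests(10 * GB, 256 * KB))        # 40960 requests with 256 KB segments
print(num_requests(10 * GB, 64 * 1024 * KB))  # 160 requests with 64 MB segments
```

At tens of thousands of requests per file, even modest user counts can swamp a server's disk scheduler, which is why larger segments help it batch reads and prefetch effectively.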

IV. NICE MODEL

A new NICE model for Big Data
transfers, based on a store-and-forward
model instead of an end-to-end approach, is
presented. This model ensures that Big Data
transfers only occur during low-demand periods, when
there is idle bandwidth that can be repurposed for
these large transfers. Under this model, Big Data are
transmitted when the Internet traffic at the sender's
LAN is low. If the Internet traffic at the receiver's
LAN is high at this time, then the data are stored at a
staging server and later transmitted to the receiver.
Similar to the nice command in Linux, a transfer
tool based on the nice algorithm gives itself low
priority and is nice to other applications using the
Internet. The overall goal is to develop an
application that can transmit big data via the
Internet for all users within any campus.

Fig 1: Nice Model architecture
To make use of the maximum installed
bandwidth without impacting other users, the key is
to open multiple transmission streams during
low-traffic periods. If the sender and receiver are in the
same time zone then a direct transmission from
sender to receiver is feasible. If the sender and
receiver are in different time zones, then the low
traffic periods at the two end points do not
coincide. In this instance, the file is transmitted
from the sender to one or more staging server(s),
placed in the Internet zone. Depending on the
Internet configuration between the sender and
receiver, the file may be transmitted to a single
staging server via multiple streams or the file
may be divided and parts of the file are transmitted
concurrently to multiple staging servers. When the
receiver's LAN traffic is low, the file can be
transmitted from the staging server(s) to the
receiver. A file transmission tool such as GridFTP
could be used for the transmission from the sender to
the staging server(s) and from the staging server(s) to
the receiver.
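A minimal sketch of the scheduling decision described above, assuming a single load threshold and hypothetical LAN load measurements (a real tool would monitor live utilization rather than take these as parameters):

```python
# Sketch of the NICE store-and-forward decision: send directly when both
# LANs are in a low-traffic period, stage the data when only the sender's
# LAN is idle, and drain the staging server once the receiver's LAN is
# quiet. The threshold and load values are hypothetical.

def nice_schedule(sender_load, receiver_load, low=0.3):
    """Return the transfer action for one scheduling interval."""
    if sender_load < low and receiver_load < low:
        return "send-direct"            # both ends idle: end-to-end transfer
    if sender_load < low:
        return "send-to-staging"        # push to staging server(s)
    if receiver_load < low:
        return "staging-to-receiver"    # drain staged data to the receiver
    return "wait"                       # both LANs busy: stay nice

# Sender in a low-traffic period, receiver (different time zone) busy:
print(nice_schedule(0.1, 0.8))  # send-to-staging
print(nice_schedule(0.9, 0.1))  # staging-to-receiver
print(nice_schedule(0.1, 0.1))  # send-direct
```

When sender and receiver sit in different time zones, the first two branches alternate over a day: the sender fills the staging server(s) during its quiet hours, and the receiver drains them during its own.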

V. CONCLUSION & FUTURE WORK

Big data transfers via the Internet are
not a commonplace task for most users today.
Currently, there are no tools to facilitate these kinds
of transmissions. The task of transferring massive
amounts of data across the country or even the globe
is a challenging and daunting undertaking for any
user.
As the popularity of distributed storage
propagates and the amount of scientific data
continues to surge, the demand for big data transfers
will grow at a tremendous rate. The existing tools
for moving large amounts of data are based on
the parallel model, which is designed to grab as
much bandwidth as possible by opening
concurrent data streams. This greedy approach
may be good for a single user's transfer,
however it is not scalable for multiple users on a
shared network. The entire system suffers when
users attempt to grab bandwidth. Parallel systems
support transferring data from multiple servers via the
Internet, but implementing such a system is very difficult.
The Nice model can be used for big data transfers.
Under this model, these transfers are relegated to low
demand periods when there is ample, idle bandwidth
available. This bandwidth can then be
repurposed for big data transmissions without
impacting other users in the system. Since the
nice model uses a store-and-forward approach by
utilizing staging servers, the model is able to
accommodate differences in time zones and
variations in bandwidth. There are many challenging
issues, such as network protocols, security, compression
techniques, routing algorithms, traffic monitoring,
and bandwidth management devices. New algorithms
are required to transfer big data around the globe by
addressing all of the above issues.

REFERENCES
[1] FOSTER, I., AND KESSELMAN, C. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[2] FOSTER, I., KESSELMAN, C., AND TUECKE, S. The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of High Performance Computing Applications 15, 3 (Fall 2001), 200-222.
[3] LAURE, E., FISHER, S. M., FROHNER, A., GRANDI, C., KUNSZT, P. Z., KRENEK, A., MULMO, O., PACINI, F., PRELZ, F., WHITE, J., BARROSO, M., BUNCIC, P., HEMMER, F., DI MEGLIO, A., AND EDLUND, A. Programming the Grid with gLite. Computational Methods in Science and Technology 12, 1 (2006), 33-45.
[4] FOSTER, I. What is the Grid? A three point checklist. GRIDToday (July 2002).
[5] XIAO QIN, H. J. Data Grids: Supporting Data-Intensive Applications in Wide Area Networks. 2006, pp. 481-494.
[6] LAURE, E., HEMMER, F., PRELZ, F., BECO, S., FISHER, S., LIVNY, M., GUY, L., BARROSO, M., BUNCIC, P., KUNSZT, P. Z., DI MEGLIO, A., AIMAR, A., EDLUND, A., GROEP, D., PACINI, F., SGARAVATTO, M., AND MULMO, O. Middleware for the next generation grid infrastructure. Computing in High Energy Physics and Nuclear Physics (October 2004), 826.
[7] FOSTER, I. Globus Toolkit version 4: Software for service-oriented systems. In IFIP International Conference on Network and Parallel Computing (2006), pp. 2-13.
[8] Worldwide LHC Computing Grid (WLCG). http://lcg.web.cern.ch/LCG/.
[9] NAREGI: National Research Grid Initiative. http://www.naregi.org, 2009.
[10] ALLCOCK, W., BESTER, J., BRESNAHAN, J., CHERVENAK, A. L., FOSTER, I., KESSELMAN, C., MEDER, S., NEFEDOVA, V., AND QUESNEL, D. Secure, efficient data transport and replica management for high-performance data-intensive computing. In IEEE Mass Storage Conference (2001).
[11] ALLCOCK, W., BESTER, J., BRESNAHAN, J., CHERVENAK, A. L., FOSTER, I., KESSELMAN, C., MEDER, S., NEFEDOVA, V., QUESNEL, D., AND TUECKE, S. Data management and transfer in high performance computational grid environments. Parallel Computing Journal 28, 5 (May 2002), 749-771.
[12] RODRIGUEZ, P., AND BIERSACK, E. W. Dynamic parallel access to replicated content in the Internet. IEEE/ACM Trans. Netw. 10, 4 (2002), 455-465.


The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Dq36708711

Kiran Kumar Reddi et al Int. Journal of Engineering Research and Applications
ISSN : 2248-9622, Vol. 3, Issue 6, Nov-Dec 2013, pp. 708-711

RESEARCH ARTICLE                                             OPEN ACCESS

Different Techniques to Transfer Big Data: A Survey

Kiran Kumar Reddi 1, Dnvsls Indira 2
1 Department of Computer Science, Krishna University, Machilipatnam, Krishna Dt., AP-521001
2 Research Scholar, Department of CS, Krishna University, Machilipatnam, Krishna Dt., AP-521001

ABSTRACT
Nowadays the world revolves around social networks: most people interact with one another primarily over the Internet. Transmitting data via the Internet is a routine and common task for users today. Where transferring a gigabyte of data over an entire day was once normal, users now transmit multiple gigabytes in a single hour. Big Data is the combination of structured, semi-structured, unstructured, homogeneous and heterogeneous data. With the influx of big data and massive scientific data sets measured in tens of petabytes, users have the propensity to transfer ever larger amounts of data. When data sets of this magnitude are transferred over public or shared networks, the performance of all workloads in the system is impacted. This paper addresses the issues and challenges inherent in transferring big data over networks and provides a survey of current transfer techniques.

Key words: Big Data, Grid, Parallel Transfer, NICE

I. INTRODUCTION
Over the past several years there has been a tremendous increase in the amount of data being transferred between Internet users. Escalating usage of streaming multimedia and other Internet-based applications has contributed to this surge in data transmissions. Another facet of the increase is due to the expansion of Big Data, which refers to data sets that are an order of magnitude larger than the standard file transmitted via the Internet. Big Data can range in size from hundreds of gigabytes to petabytes. Today everything is being stored digitally. Within the past decade, everything from banking transactions to medical history has migrated to digital storage. This change from physical documents to digital files has necessitated the creation of large data sets and, consequently, the transfer of large amounts of data. There is no sign that the amount of data being stored or transmitted by users is steady or decreasing. Every year, average Internet users are moving more and more data through their Internet connections. Depending on the bandwidth of these connections and the size of the data sets being transmitted, the duration of transfers could potentially be measured in days or even weeks. There exists a need for an efficient transfer technique that can move large amounts of data quickly and easily without impacting other users or applications. This paper presents different techniques to transfer data across the globe: it surveys grids and parallel techniques for transferring data over the Internet, and then describes a new model for transferring big data efficiently across the globe.

II. RELATED WORK
Retrieving large data files (GB, TB, PB) is a complicated and time-consuming process. These long-duration transfers could take tens of hours to several days, and a normal "one click and wait" method will not suffice. During the course of the transfer, servers may go off-line and network conditions may change in ways that either hinder or stop the transfer completely. The user needs to know how to maintain the data transmission until completion. This paper gives an overview of previously existing techniques to transfer files around the world. This section concentrates on grid computing to distribute data and on parallel transfer techniques.

2.1 GRIDS
Grid computing has emerged as a framework for aggregating geographically distributed, heterogeneous resources that enables secure and unified access to computing, storage and networking resources (1). Grid applications have vast datasets and/or complex computations that require secure resource sharing among geographically distributed systems. The term "Grid" was inspired by the electrical grid system, where a user can plug an appliance into a universal socket and have instant access to power without knowing exactly where that power was generated or how it came to reach the socket (1). The vision for grids was similar: a user could simply access as much computing power as required through a common interface, without concern for who was providing the resources. Currently, grids have not yet reached that level of simplicity. Grids offer coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (2). A virtual organization (VO) comprises a set of individuals and/or institutions having access to computers, software, data, and other resources for collaborative problem-solving or other purposes (3).
A grid can also be defined as a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces in order to deliver nontrivial qualities of service (4). Data grids, a specialized extension of grid computing, are responsible for providing the infrastructure and services to access, transfer, and modify massive datasets stored in distributed storage resources (5). They allow users to access computational and storage resources in order to execute data-intensive applications on remote data. The objective of a data grid system is to integrate heterogeneous data files stored in a large number of geographically distributed sites into a single virtual data management system and to provide diverse services to fit the needs of high-performance distributed and data-intensive computing (5).

2.2 GRID MIDDLEWARE
The sharing of resources in a grid is facilitated and controlled by a set of services that allow resources to be discovered, accessed, allocated, monitored, and accounted for, regardless of their physical location (6). Since these services create a layer between physical resources and applications, they are often referred to as Grid Middleware. Every grid has different service requirements; therefore, the architecture and middleware implementation of every grid can vary. The middleware of many grids is based on the Globus Toolkit (7), a set of libraries and programs that address common problems that occur when building distributed system services and applications (8). It provides a set of infrastructure services that implement interfaces for managing computational, storage, and other resources. The Globus Toolkit provides all of these services, and it is left to grid administrators to determine whether or not to include particular services in their grid implementation. A few of the well-known and widely used grids that deploy the Globus Toolkit are: TeraGrid, Open Science Grid, EGEE, Worldwide LHC Computing Grid (WLCG) (8), China National Grid, UK National Grid Service and NAREGI (9). The architecture of the Globus Toolkit contains several components, each of which is responsible for different grid functions. A few of these services are (7):
• Grid Resource Allocation and Management (GRAM) - This service initiates, monitors, and manages the execution of computations on remote computers. It allows a user to specify the quantity and type of resources needed, the data sets required for their computation, the executable application to be run, the necessary security credentials, and the job persistence requirements.
• Data access and movement - The Reliable File Transfer (RFT) service is provided to ensure that data is successfully transferred from one location to another.
• Replica management - This service keeps track of all replicas and their content using a Replica Location Service (RLS) and a Data Replication Service (DRS).
• Monitoring and discovery - Multiple services collect and process information about the configuration and state of all resources to enable monitoring of system status.
• Security - These services establish the identity of users or services (authentication), protect communications, and determine who is allowed to perform what actions (authorization), as well as manage user credentials and maintain group membership information.
There are several grid applications available for moving data from one location in a grid to another. The most widely used data movement tool, which is also a component of the Globus Toolkit, is called GridFTP (10, 11).
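As a hedged illustration of how such a transfer is typically launched, the fragment below assembles a command line for the globus-url-copy client that ships with the toolkit. The -p (parallel data streams) and -tcp-bs (TCP buffer size) options are as commonly documented for globus-url-copy, but the exact flags, the host names, and the helper function itself are illustrative assumptions that should be checked against the installed client's documentation:

```python
def gridftp_command(src_url, dst_url, streams=4, tcp_buf=2 * 1024 * 1024):
    """Assemble a globus-url-copy invocation for a parallel GridFTP transfer.

    -p sets the number of parallel TCP data streams and -tcp-bs the TCP
    buffer size in bytes; verify both against your client version before
    relying on this sketch.
    """
    return [
        "globus-url-copy",
        "-p", str(streams),        # parallel data streams
        "-tcp-bs", str(tcp_buf),   # TCP buffer size in bytes
        src_url,                   # e.g. gsiftp://source.example.org/data/big.dat
        dst_url,                   # e.g. file:///scratch/big.dat
    ]

# Where the client is installed, the transfer would be launched with:
# subprocess.run(gridftp_command("gsiftp://source.example.org/big.dat",
#                                "file:///scratch/big.dat"), check=True)
```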
GridFTP is an extension of the File Transfer Protocol (FTP) and was designed specifically for grid environments. Several data retrieval techniques have been developed specifically for retrieving large files in grid computing environments. The sizes of the data files requested in grids are much larger than those of normal web requests; it is not uncommon for a grid data file to be measured in gigabytes or terabytes. Users want to be able to download these files as quickly as possible, by any means necessary. Since utilizing a single server can be limiting, retrieving data from multiple servers in parallel (also known as data co-allocation) has been suggested as an alternative.

III. PARALLEL TRANSMISSION TECHNIQUES USED ON THE INTERNET
Rodriguez and Biersack present mechanisms for parallel access to data on the Internet (12). They develop two different parallel-access schemes: history-based and dynamic. The goal of both schemes is to balance the load among all available servers by allocating to each replica a workload proportional to its service rate. The authors note that parallel access carries additional overhead in comparison to a single access: multiple connections must be opened, and extra traffic is generated to perform block requests. To minimize these overhead costs, the techniques should only be utilized for larger files. Their history-based technique uses a database of previously observed transfer rates from the different servers to the receiver in order to estimate future transfers. Using these estimates, the algorithm assigns varying portions of the file to each replica with the goal that all servers will finish transferring their portions at the same time. The authors evaluate their history-based technique using live web servers distributed across the Internet. Due to the presence of other Internet traffic, they found that the performance of the technique varied at different times of the day. Network conditions change rapidly, and estimating the transfer rate to every server from past history results in poor estimates. Their results show that during peak traffic times, when transfer rates vary dramatically and historical information is not a good indicator of future performance, the history-based parallel-access technique has higher download times than clients accessing a single server.
In response to the performance of the history-based technique, the authors develop a dynamic technique that adjusts to changing network conditions. The dynamic technique divides the desired file into a fixed number of equal-sized blocks. The client requests one block from every replica; when a server completes a request, another block is assigned to it. When only a small number of blocks remain outstanding, idle servers are asked to deliver blocks that have already been assigned to another server but have yet to be received, so multiple servers may end up working on the same requests. The authors argue that the bandwidth wasted on these overlapping requests is smaller than the worst-case cost of waiting for the slowest server to deliver the last block. To further enhance the performance of the technique, they use persistent TCP connections between the client and every server to minimize the overhead of opening multiple TCP connections. They also propose pipelining the requests to each server in order to decrease inter-block idle times.
With pipelining, a new block request is sent to a server before the previous block has been completely received. In the evaluations of their dynamic technique, the authors find a significant speedup in comparison to a single-server access. Since the dynamic technique does not rely on historical information and can adapt to changing network conditions, it outperforms requesting data from a single server even under peak traffic conditions. They also observe that the transfer time of a dynamic parallel access is very close to the optimum transfer time, and that with request pipelining their technique comes almost equal to the optimal transfer time.
These advanced retrieval techniques allow users to utilize multiple resources simultaneously and provide improved performance; however, they are quite complicated to implement and use. They require significant user involvement and multiple user decisions that can dramatically affect the performance of the transfer, and a user needs considerable know-how to make these techniques function properly and efficiently. Another configuration option frequently left for the user to determine is segment size. In some advanced techniques, the data file is divided into small portions called segments, and the segment size chosen can affect the performance of the transfer. Determining the appropriate segment size is not a simple task: if the size is too small, a server may receive hundreds to thousands of requests for portions of a single file, which results in longer disk service times as the number of users increases. A server's storage system can best service requests when it has greater knowledge of a user's workload; it can then better schedule reading from the hard disks and take advantage of prefetching and caching strategies.

IV. NICE MODEL
A new NICE model for Big Data transfers, based on a store-and-forward model instead of an end-to-end approach, is presented. The NICE model ensures that Big Data transfers occur only during low-demand periods, when there is idle bandwidth that can be repurposed for these large transfers. Under this model, Big Data are transmitted when the Internet traffic at the sender's LAN is low. If the Internet traffic at the receiver's LAN is high at that time, the data are stored at a staging server and transmitted to the receiver later. Similar to the nice command in Linux, a transfer tool based on the nice algorithm gives itself low priority and is "nice" to other applications using the Internet. The overall goal is to develop an application that can transmit big data via the Internet for all users within any campus.

Fig 1: NICE model architecture

In order to make use of the maximum installed bandwidth without impacting other users, the key is to open multiple transmission streams during periods of low traffic. If the sender and receiver are in the same time zone, then a direct transmission from sender to receiver is feasible. If the sender and receiver are in different time zones, then the low-traffic periods at the two end points do not coincide. In this instance, the file is transmitted from the sender to one or more staging server(s) placed in the Internet zone. Depending on the Internet configuration between the sender and
receiver, the file may be transmitted to a single staging server via multiple streams, or the file may be divided and its parts transmitted concurrently to multiple staging servers. When the receiver's LAN traffic is low, the file can be transmitted from the staging server(s) to the receiver. A file transmission tool such as GridFTP could be used for the transmission from the sender to the staging server(s) and from the staging server(s) to the receiver.

V. CONCLUSION & FUTURE WORK
Big data transfers via the Internet are not a commonplace task for most users today, and there are currently no tools that make these kinds of transmissions easy. The task of transferring massive amounts of data across the country or even the globe is a challenging and daunting undertaking for any user. As the popularity of distributed storage propagates and the amount of scientific data continues to surge, the demand for big data transfers will grow at a tremendous rate. The existing tools for moving large amounts of data are based on the parallel model, which is designed to grab as much bandwidth as possible by opening concurrent data streams. This greedy approach may be good for a single user's transfer, but it does not scale to multiple users on a shared network: the entire system suffers when users compete to grab bandwidth. Parallel systems can draw on multiple servers to transfer data via the Internet, but such systems are very difficult to implement. The NICE model can be used for big data transfers instead. Under this model, transfers are relegated to low-demand periods when there is ample idle bandwidth available. This bandwidth can then be repurposed for big data transmissions without impacting other users in the system.
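The NICE policy just summarized reduces to a three-way decision per scheduling interval. The sketch below is purely our own illustration of that decision; the Site structure, the 0.3 idle threshold, and all names are assumptions, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Site:
    lan_utilization: float  # fraction of LAN bandwidth in use, 0.0-1.0

IDLE_THRESHOLD = 0.3  # illustrative only; the paper does not fix a value

def nice_schedule(sender: Site, receiver: Site) -> str:
    """Decide how a NICE-style transfer proceeds in the current interval.

    Data moves only while the sender's LAN is idle; if the receiver's
    LAN is busy at that moment (e.g. the sites are in different time
    zones), the data is parked on a staging server and forwarded once
    the receiver's LAN quiets down.
    """
    if sender.lan_utilization >= IDLE_THRESHOLD:
        return "wait"    # be nice: sender's LAN is busy, hold the transfer
    if receiver.lan_utilization < IDLE_THRESHOLD:
        return "direct"  # both LANs idle: sender -> receiver
    return "stage"       # sender idle, receiver busy: sender -> staging server
```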
Since the NICE model uses a store-and-forward approach built on staging servers, it is able to accommodate differences in time zones and variations in bandwidth. Many challenging issues remain, including network protocols, security, compression techniques, routing algorithms, traffic monitoring, and bandwidth management devices. New algorithms that take all of these issues into account are required to transfer big data around the globe.

REFERENCES
[1] FOSTER, I., AND KESSELMAN, C. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[2] FOSTER, I., KESSELMAN, C., AND TUECKE, S. The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of High Performance Computing Applications 15, 3 (Fall 2001), 200-222.
[3] LAURE, E., FISHER, S. M., FROHNER, A., GRANDI, C., KUNSZT, P. Z., KRENEK, A., MULMO, O., PACINI, F., PRELZ, F., WHITE, J., BARROSO, M., BUNCIC, P., HEMMER, F., DI MEGLIO, A., AND EDLUND, A. Programming the Grid with gLite. Computational Methods in Science and Technology 12, 1 (2006), 33-45.
[4] FOSTER, I. What is the Grid? A three point checklist. GRIDToday (July 2002).
[5] QIN, X., AND JIANG, H. Data Grids: Supporting Data-Intensive Applications in Wide Area Networks. 2006, pp. 481-494.
[6] LAURE, E., HEMMER, F., PRELZ, F., BECO, S., FISHER, S., LIVNY, M., GUY, L., BARROSO, M., BUNCIC, P., KUNSZT, P. Z., DI MEGLIO, A., AIMAR, A., EDLUND, A., GROEP, D., PACINI, F., SGARAVATTO, M., AND MULMO, O. Middleware for the next generation grid infrastructure. Computing in High Energy Physics and Nuclear Physics (October 2004), 826.
[7] FOSTER, I. Globus Toolkit version 4: Software for service-oriented systems. In IFIP International Conference on Network and Parallel Computing (2006), pp. 2-13.
[8] Worldwide LHC Computing Grid (WLCG). http://lcg.web.cern.ch/LCG/.
[9] NAREGI: National Research Grid Initiative. http://www.naregi.org, 2009.
[10] ALLCOCK, W., BESTER, J., BRESNAHAN, J., CHERVENAK, A. L., FOSTER, I., KESSELMAN, C., MEDER, S., NEFEDOVA, V., AND QUESNEL, D. Secure, efficient data transport and replica management for high-performance data-intensive computing. In IEEE Mass Storage Conference (2001).
[11] ALLCOCK, W., BESTER, J., BRESNAHAN, J., CHERVENAK, A. L., FOSTER, I., KESSELMAN, C., MEDER, S., NEFEDOVA, V., QUESNEL, D., AND TUECKE, S. Data management and transfer in high performance computational grid environments. Parallel Computing Journal 28, 5 (May 2002), 749-771.
[12] RODRIGUEZ, P., AND BIERSACK, E. W. Dynamic parallel access to replicated content in the Internet. IEEE/ACM Transactions on Networking 10, 4 (2002), 455-465.