1. Data Placement Scheduling
between Distributed Repositories
Stork 1.0 and beyond
Mehmet Balman
Louisiana State University
Baton Rouge, LA
2. Motivation
Scientific applications are becoming more data intensive
(dealing with petabytes of data)
We use geographically distributed resources to satisfy
immense computational requirements
The distributed nature of these resources makes data
movement a major bottleneck for end-to-end
application performance
Therefore, complex middleware is required to
orchestrate the use of these storage and network
resources between collaborating parties, and to manage
the end-to-end distribution of data.
3.
Data Movement using Stork
Data Scheduling
Tuning Data Transfer Operations
Failure-Awareness
Job Aggregation
Future Directions
Agenda
4.
Advanced Data Transfer Protocols (e.g., GridFTP)
High throughput data transfer
Data Scheduler: Stork
Organizing data movement activities
Ordering data transfer requests
Moving Large Data Sets
5. A scientific application generates an immense amount
of simulation data using supercomputing resources
The generated data is stored in a temporary space and
needs to be moved to a data repository for further processing
or archiving
Another application may be waiting for this generated data as
its input to start execution
Delaying the data transfer operation, or completing the
transfer far later than the expected time, may create several
problems
– (other resources are waiting for this transfer operation
to complete)
Use case
6.
Stork: A batch scheduler for Data Placement
activities
Supports plug-in data transfer modules for
specific protocols/services
Throttling: deciding the number of concurrent
transfers
Keeps a log of data placement activities
Adds fault tolerance to data transfers
Tunes protocol transfer parameters (e.g., the number of
parallel TCP streams)
Scheduling Data Movement Jobs
8. End-to-end bulk data transfer (latency wall)
TCP based solutions
FAST TCP, Scalable TCP, etc.
UDP based solutions
RBUDP, UDT, etc.
Most of these solutions require kernel-level
changes
Not preferred by most domain scientists
Fast Data Transfer
9.
Take an application-level transfer protocol (e.g.,
GridFTP) and tune it up for better performance:
Using Multiple (Parallel) streams
Tuning Buffer size
(efficient utilization of available network capacity)
Level of Parallelism in End-to-end Data Transfer
number of parallel data streams connected to a data transfer
service for increasing the utilization of network bandwidth
number of concurrent data transfer operations that are
initiated at the same time for better utilization of system
resources.
Application Level Tuning
10.
Instead of a single connection at a time, multiple
TCP streams are opened to a single data transfer
service in the destination host.
We gain larger bandwidth in TCP, especially in a
network with a low packet-loss rate; parallel connections
better utilize the TCP buffer available to the data
transfer, such that N connections might be N times
faster than a single connection
Multiple TCP streams, however, add extra overhead to the system
Parallel TCP Streams
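The idea can be sketched in plain Python (an illustrative sketch, not Stork's GridFTP module): the payload is split into pieces, and each piece is pushed over its own TCP connection to the destination host.

```python
import socket
import threading

def send_chunk(host, port, data):
    # one TCP stream per chunk; sendall blocks until the chunk is queued
    with socket.create_connection((host, port)) as sock:
        sock.sendall(data)

def parallel_send(host, port, payload, n_streams):
    """Split payload into n_streams pieces and send each over its own
    TCP connection, mimicking a parallel-stream transfer."""
    size = (len(payload) + n_streams - 1) // n_streams
    threads = [threading.Thread(target=send_chunk,
                                args=(host, port, payload[i * size:(i + 1) * size]))
               for i in range(n_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A real transfer service (GridFTP) negotiates the parallelism level in-protocol and reassembles the stripes on the server side; this sketch only shows where the concurrency comes from.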
11. Average Throughput using parallel streams over 1Gbps
Experiments in the LONI (www.loni.org) environment - transferring a file to
Queenbee (QB) from a Linux machine
12.
Instead of predictive sampling, use data from
actual transfer
transfer data by chunks (partial transfers) and
also set control parameters on the fly.
measure throughput for every transferred data
chunk
gradually increase the number of parallel
streams until it reaches an equilibrium point
Adaptive Tuning
13.
No need to probe the system and make
measurements with external profilers
Does not require any complex model for
parameter optimization
Adapts to changing environment
But, overhead in changing parallelism level
Fast start (exponentially increase the number
of parallel streams)
Adaptive Tuning
14.
Start with single stream (n=1)
Measure instant throughput for every data chunk transferred
(fast start)
Increase the number of parallel streams (n=n*2),
transfer the data chunk
measure instant throughput
If current throughput value is better than previous one,
continue
Otherwise, set n to the old value and gradually increase
parallelism level (n=n+1)
If there is no throughput gain from increasing the number of streams (found
the equilibrium point)
Increase chunk size (delay measurement period)
Adaptive Tuning
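The loop above can be expressed as a short control routine (an illustrative sketch: `measure(n)` stands in for transferring one data chunk with n parallel streams and returning the observed instant throughput):

```python
def tune_streams(measure, max_streams=64):
    """Adaptive tuning sketch: fast-start by doubling the number of
    parallel streams while each chunk's throughput improves, then grow
    one stream at a time until the equilibrium point is reached."""
    n = 1
    best = measure(n)                 # throughput of the first chunk
    while n * 2 <= max_streams:       # fast start: n = n * 2
        t = measure(n * 2)
        if t <= best:
            break                     # keep the old n, switch to gradual phase
        n, best = n * 2, t
    while n + 1 <= max_streams:       # gradual phase: n = n + 1
        t = measure(n + 1)
        if t <= best:
            break                     # no further gain: equilibrium point
        n, best = n + 1, t
    return n
```

A real scheduler would then keep transferring at the chosen level and enlarge the chunk size to stretch the measurement period, as the slide notes.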
18. • Dynamic Environment:
• data transfers are prone to frequent failures
• what went wrong during data transfer?
• No access to the remote resources
• Messages get lost due to system malfunction
• Instead of waiting for failures to happen
• Detect possible failures and malfunctioning services
• Search for another data server
• Alternate data transfer service
• Classify erroneous cases to make better decisions
Failure Awareness
19. • Use Network Exploration Techniques
– Check availability of the remote service
– Resolve host and determine connectivity failures
– Detect available data transfer services
– should be fast and efficient so as not to burden system/network
resources
• Error while transfer is in progress?
– Error_TRANSFER
• Retry or not?
• When to re-initiate the transfer
• Use alternate options?
Error Detection
20. • Data transfer protocols do not always return appropriate error codes
• Using error messages generated by the data transfer protocol
• A better logging facility and classification
• Recover from Failure
– Retry the failed operation
– Postpone scheduling of a failed operation
• Early Error Detection
– Initiate the transfer when the erroneous
condition has recovered
– Or use alternate options
Error Classification
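A minimal sketch of such message-based classification (the patterns below are hypothetical examples, not Stork's actual rule set): fatal conditions are postponed, transient ones are retried, and anything unrecognized is logged for later classification.

```python
import re

# Hypothetical message patterns; a real deployment classifies the messages
# its transfer protocol actually emits, since error codes alone are often
# not informative enough.
FATAL = [r"no such file", r"permission denied", r"authentication failed"]
RECOVERABLE = [r"connection (reset|refused|timed out)", r"temporar(y|ily)"]

def classify(message):
    """Map a transfer error message to a scheduling decision."""
    msg = message.lower()
    if any(re.search(p, msg) for p in FATAL):
        return "fatal"       # postpone/report; retrying cannot help
    if any(re.search(p, msg) for p in RECOVERABLE):
        return "retry"       # re-initiate once the condition clears
    return "unknown"         # keep in the log for later classification
```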
22. SCOOP data - Hurricane Gustav simulations
Hundreds of files (250 data transfer operations)
Small (100MB) and large files (1G, 2G)
Failure Aware Scheduling
23. • Verify the successful completion of the operation
by checking the checksum and file size.
• For GridFTP, the Stork transfer module can recover
from a failed operation by restarting from the last
transmitted file. In case of a retry after a failure, the
scheduler informs the transfer module to recover
and restart the transfer using the information from
a rescue file created by the checkpoint-enabled
transfer module.
• An “intelligent” (dynamic tuning) alternative to
Globus RFT (Reliable File Transfer)
New Transfer Modules
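The completion check above (size plus checksum) might look like the following sketch; SHA-256 is an assumption here, any digest agreed with the source side would do.

```python
import hashlib
import os

def verify_transfer(local_path, expected_size, expected_digest):
    """Confirm a completed transfer: compare the file size first (cheap),
    then the checksum, against values reported by the source."""
    if os.path.getsize(local_path) != expected_size:
        return False
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest() == expected_digest
```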
24. • Multiple data movement jobs are combined and
processed as a single transfer job
• Information about the aggregated job is stored in the
job queue and it is tied to a main job which is actually
performing the transfer operation such that it can be
queried and reported separately.
• Hence, aggregation is transparent to the user
We have seen significant performance improvements,
especially with small data files
– decreasing the amount of protocol usage
– reducing the number of independent network connections
Job Aggregation
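One way to sketch the grouping step (illustrative; Stork's actual queue records carry more state): jobs sharing the same source and destination hosts are merged under a main job that performs the transfer, while member job ids remain individually queryable.

```python
from collections import defaultdict

def aggregate_jobs(jobs):
    """Combine queued transfer jobs by (src_host, dest_host) so each group
    runs as a single transfer operation; the first member acts as the
    "main" job, and members stay reportable on their own."""
    groups = defaultdict(list)
    for job_id, src_host, dest_host, path in jobs:
        groups[(src_host, dest_host)].append((job_id, path))
    return [{"main": members[0][0], "src": src, "dest": dest,
             "members": members}
            for (src, dest), members in groups.items()]
```

Each aggregate then needs only one connection setup and one round of protocol negotiation, which is where the gain for many small files comes from.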
25. Experiments on LONI (Louisiana Optical Network Initiative) :
1024 transfer jobs from Ducky to Queenbee (rtt avg 5.129 ms) - 5MB
data file per job
Job Aggregation
26. We need priority-based data transfer scheduling
with advance reservation and provisioning to allow
researchers to use data placement as-a-service
where they can plan ahead and reserve the time
period for their data movement operations.
Need to orchestrate advance storage and network
allocation together for data movements (very little
progress in the literature)
Future Directions
27. Next generation research networks such as ESNet
and Internet2
– provide high-speed on-demand data access
between collaborating institutions by delivering
network-as-a-service
On-Demand Secure Circuits and Advance
Reservation System (OSCARS)
• Guaranteed bandwidth (at certain time, for a
certain bandwidth and length of time)
Network Reservation
29. Research Concept
accept time constraints
allow users to plan ahead
orchestrate resource allocation
provide advance resource reservation
reserve the scheduler’s time for future
data movement operation
30. Methodology
two separate queues
Planning Phase
resource reservation and time allocation
− Preemption?
− Confirm submission of a request?
Execution Phase
re-organization, tuning, and ordering
Failure-awareness
Job Aggregation
Dynamic Adaptation in data transfers
Priority-based scheduling (earliest deadline?)
31. Methodology
Phase 1:
The scheduler checks the availability of resources in a
given time period and determines whether the requested
operation can be satisfied within the given time
constraints
The server and the network capacity are allocated
in advance for the future time period
Phase 2:
The scheduler considers other requests reserved for
future time windows and re-orders operations in the
current time period
Aggregation
Pre-processing
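Phase 1 admission control can be sketched as a capacity check over the requested time window (a single-link model with abstract bandwidth units, purely for illustration):

```python
class ReservationCalendar:
    """Planning-phase sketch: track bandwidth reservations on one link and
    admit a new request only if capacity holds over its whole window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reservations = []        # (start, end, bandwidth) tuples

    def usage_at(self, t):
        # total bandwidth already promised at time t
        return sum(bw for s, e, bw in self.reservations if s <= t < e)

    def try_reserve(self, start, end, bandwidth):
        """Admit the request if capacity suffices at every point where
        the load can change inside [start, end); else reject it."""
        points = [start] + [t for s, e, _ in self.reservations
                            for t in (s, e) if start < t < end]
        if all(self.usage_at(t) + bandwidth <= self.capacity
               for t in points):
            self.reservations.append((start, end, bandwidth))
            return True
        return False
```

Phase 2 would then order the admitted operations inside the current time window (aggregation, tuning, priorities) without violating the windows reserved here.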