Berkeley Lab - Computing Sciences Seminar - Reminder
TOMORROW, June 24, 2:00pm - 3:00pm, Bldg. 50F, Room 1647
Berkeley Lab - Computing Sciences Seminar
Date:
Wednesday, June 24, 2009
Time:
2:00pm - 3:00pm
Location:
Bldg. 50F, Room 1647
Speaker:
Mehmet Balman
Department of Computer Science
Louisiana State University
Title:
Data Migration between Distributed Repositories for Collaborative
Research
Abstract:
Scientific applications, especially in areas such as physics,
biology, and astronomy, have become more complex and compute
intensive. Often, such applications require geographically
distributed resources to satisfy their immense computational
requirements. Consequently, these applications also have increasing
distributed data intensive requirements, dealing with petabytes of
data. The distributed nature of these resources has made data
movement the major bottleneck for end-to-end application performance. Our
approach is to use a dynamic network layer where data placement
middleware needs to adapt to the changing conditions in the
environment. Furthermore, heterogeneous resources and differing data
access and security protocols are among the challenges the data
placement middleware must handle. Complex middleware is
required to orchestrate the use of these storage and network
resources between collaborating parties, and to manage the
end-to-end distribution of data.
We present a data placement scheduler for mitigating the data
bottleneck in collaborative peta-scale applications. In this talk,
we will give details on recent research in data scheduling, some use
cases for transferring very large data sets into distributed
repositories, and experiments on effective data movement over 1Gbps
and 10Gbps networks. We will also describe advanced features
including aggregation of data placement jobs with small data files,
dynamic tuning of data transfer operations to minimize the effect of
network latency, error detection and classification, and restarting
transfer operations after transfer interruptions.
Host of Seminar:
Arie Shoshani
------------------------------------------------------------------------
For additional information, such as site access or directions to the
conference room, please contact CSSeminars-Help@hpcrd.lbl.gov.
Web Contact: CSSeminars-Help@hpcrd.lbl.gov
_______________________________________________
CSSeminars mailing list
CSSeminars@hpcrdm.lbl.gov
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/csseminars
1. Data Movement between
Distributed Repositories for
Large Scale Collaborative
Science
Mehmet Balman
Louisiana State University
Baton Rouge, LA
2. Motivation
Scientific applications are becoming more data intensive
(dealing with petabytes of data)
We use geographically distributed resources to satisfy
immense computational requirements
The distributed nature of the resources has made data
movement a major bottleneck for end-to-end
application performance
Therefore, complex middleware is required to
orchestrate the use of these storage and network
resources between collaborating parties, and to manage
the end-to-end distribution of data.
3. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
4. PetaShare
• Distributed Storage System in Louisiana
• Spans seven research institutions
• 300TB of disk storage
• 400TB of tape (will be online soon)
using:
IRODS (Integrated Rule-Oriented Data System)
www.irods.org
PetaShare
5. PetaShare as an example
Global Namespace among distributed resources
Client tools and interfaces:
Pcommands
Petashell (Parrot)
Petafs (FUSE)
Windows Browser
Web Portal
The general scenario is to use an intermediate storage
area (of limited capacity) and then transfer files to
remote storage for post-processing and long-term
archival
6. PetaShare Architecture
Fast and Efficient Data Migration in PetaShare?
LONI (Louisiana Optical Network Initiative)
www.loni.org
7. Lightweight client tools for transparent access
Petashell, based on Parrot
Petafs, a FUSE client
To improve throughput performance, we have
implemented an advance buffer cache in the Petafs and
Petashell clients, aggregating I/O requests to minimize
the number of network messages.
Is it efficient for bulk data transfer?
PetaShare Client Tools
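To illustrate the buffer-cache idea above, here is a minimal sketch in Python (names such as send_to_server are hypothetical; the actual Petafs/Petashell implementation lives inside the FUSE/Parrot clients): small writes accumulate locally and are flushed as one network message.

    # Sketch only: aggregate small I/O requests into one network message.
    # BUFFER_SIZE and send_to_server are illustrative, not the PetaShare API.
    BUFFER_SIZE = 1 << 20  # flush after 1 MB of buffered writes

    class BufferedWriter:
        def __init__(self, send_to_server):
            self.send = send_to_server  # callable issuing one network request
            self.chunks, self.pending = [], 0

        def write(self, data: bytes):
            self.chunks.append(data)
            self.pending += len(data)
            if self.pending >= BUFFER_SIZE:
                self.flush()  # many small writes become one message

        def flush(self):
            if self.chunks:
                self.send(b"".join(self.chunks))
                self.chunks, self.pending = [], 0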
9. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
10. Advanced Data Transfer Protocols (e.g., GridFTP)
High throughput data transfer
Data Scheduler: Stork
Organizing data movement activities
Ordering data transfer requests
Moving Large Data Sets
11. Stork: A batch scheduler for Data Placement
activities
Supports plug-in data transfer modules for
specific protocols/services
Throttling: deciding the number of concurrent
transfers
Keeps a log of data placement activities
Adds fault tolerance to data transfers
Tunes protocol transfer parameters (number
of parallel TCP streams)
Scheduling Data Movement Jobs
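As a rough illustration of the throttling idea (this is not Stork's actual implementation, which uses ClassAd-style job descriptions and pluggable transfer modules), a scheduler can cap the number of in-flight transfers with a fixed-size worker pool:

    # Sketch only: throttled scheduling of data placement jobs.
    from concurrent.futures import ThreadPoolExecutor
    import logging

    logging.basicConfig(level=logging.INFO)
    MAX_CONCURRENT = 4  # throttling: cap on simultaneous transfers

    def run_transfer(job):
        # a real module would dispatch on protocol (gsiftp://, irods://, ...)
        logging.info("start: %s -> %s", job["src"], job["dest"])
        # ... perform the transfer; log the result for fault tolerance ...
        return job

    def schedule(jobs):
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
            for job in pool.map(run_transfer, jobs):
                logging.info("done: %s", job["dest"])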
13. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
14. End-to-end bulk data transfer (latency wall)
TCP based solutions
FAST TCP, Scalable TCP, etc.
UDP based solutions
RBUDP, UDT, etc.
Most of these solutions require kernel level
changes
Not preferred by most domain scientists
Fast Data Transfer
15. Take an application-level transfer protocol (e.g.,
GridFTP) and tune it up for better performance:
Using Multiple (Parallel) streams
Tuning Buffer size
(efficient utilization of available network capacity)
Level of Parallelism in End-to-end Data Transfer
number of parallel data streams connected to a data transfer
service for increasing the utilization of network bandwidth
number of concurrent data transfer operations that are
initiated at the same time for better utilization of system
resources.
Application Level Tuning
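For instance, GridFTP's globus-url-copy client exposes both knobs on the command line: -p sets the number of parallel streams and -tcp-bs the TCP buffer size in bytes. A sketch (the endpoints and paths are made up):

    # Sketch only: invoke globus-url-copy with tuned parameters.
    import subprocess

    subprocess.run([
        "globus-url-copy",
        "-p", "8",               # 8 parallel TCP streams
        "-tcp-bs", "4194304",    # 4 MB TCP buffer size
        "gsiftp://src.example.org/data/file.dat",   # hypothetical endpoints
        "gsiftp://qb.example.org/scratch/file.dat",
    ], check=True)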
16. Instead of a single connection at a time, multiple
TCP streams are opened to a single data transfer
service on the destination host.
We gain larger bandwidth in TCP, especially in a
network with a low packet loss rate; parallel
connections better utilize the TCP buffer available to
the data transfer, such that N connections might be N
times faster than a single connection.
However, multiple TCP streams result in extra overhead in the system.
Parallel TCP Streams
17. Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to
QB from a Linux machine
18. Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to QB from
an IBM machine
21. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
22. Can we predict this
behavior?
Yes, we can come up with
a good estimate of the
parallelism level using:
Network statistics
Extra measurement
Historical data
Parameter Estimation
23. Single stream: theoretical calculation of
throughput based on MSS, RTT and packet
loss rate.
Naive assumption: n streams gain as much as the
total throughput of n single streams (not correct).
A better model: a relation is established between
RTT, p and the number of streams n (see below).
Parallel Stream Optimization
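The equations on this slide were images and did not survive extraction. A hedged reconstruction in LaTeX, using the quantities the slide names (MSS, RTT, loss rate p) and the standard Mathis-style model; the exact constants in the original slide may differ:

    % Single stream: Mathis-style throughput bound
    Th \le \frac{MSS}{RTT} \cdot \frac{c}{\sqrt{p}}

    % Naive assumption: n streams scale the bound linearly (not correct,
    % since the loss rate p itself grows with n)
    Th_n \le n \cdot \frac{MSS}{RTT} \cdot \frac{c}{\sqrt{p}}

    % Better model: relate RTT, p and n; the three unknowns a, b, c are
    % why three sample transfers are needed for curve fitting (next slide)
    Th_n = \frac{MSS}{RTT} \cdot \frac{n}{\sqrt{a\,n^{c} + b}}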
25. Might not reflect the best possible current settings
(Dynamic Environment)
What if network conditions change?
Requires three sample transfers (curve fitting)
Need to probe the system and make
measurements with external profilers
Requires a complex model for parameter
optimization
Parameter Estimation
26. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
27. Instead of predictive sampling, use data from the
actual transfer:
transfer data by chunks (partial transfers) and
also set control parameters on the fly.
measure throughput for every transferred data
chunk
gradually increase the number of parallel
streams until it reaches an equilibrium point
Adaptive Tuning
28. No need to probe the system and make
measurements with external profilers
Does not require any complex model for
parameter optimization
Adapts to changing environment
But: there is overhead in changing the parallelism level
Fast start (exponentially increase the number
of parallel streams)
Adaptive Tuning
29. Start with single stream (n=1)
Measure instant throughput for every data chunk transferred
(fast start)
Increase the number of parallel streams (n=n*2),
transfer the data chunk
measure instant throughput
If the current throughput value is better than the previous
one, continue
Otherwise, set n to the old value and gradually increase the
parallelism level (n=n+1)
If no throughput gain by increasing number of streams
(found the equilibrium point)
Increase chunk size (delay measurement period)
Adaptive Tuning
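A minimal sketch of this search in Python (transfer_chunk is a hypothetical callable that moves one data chunk with n parallel streams and returns the measured throughput):

    # Sketch only: exponential fast start, then linear search for the
    # equilibrium parallelism level, as described on the slide above.
    def find_parallelism(transfer_chunk):
        n, best = 1, transfer_chunk(1)
        # fast start: double the stream count while throughput improves
        while True:
            t = transfer_chunk(n * 2)
            if t <= best:
                break           # doubling overshot; fall back to old n
            n, best = n * 2, t
        # gradual phase: increase one stream at a time
        while True:
            t = transfer_chunk(n + 1)
            if t <= best:
                return n        # equilibrium point: no further gain
            n, best = n + 1, t

After the equilibrium point is found, the scheduler keeps transferring with n streams and enlarges the chunk size so that throughput is re-measured less often.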
30. Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file
to QB from an IBM machine
31. Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from a Linux machine
32. Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from a Linux machine
36. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
37. • Dynamic Environment:
• data transfers are prone to frequent failures
• what went wrong during data transfer?
• No access to the remote resources
• Messages get lost due to system malfunction
• Instead of waiting for failures to happen
• Detect possible failures and malfunctioning services
• Search for another data server
• Alternate data transfer service
• Classify erroneous cases to make better decisions
Failure Awareness
38. • Use Network Exploration Techniques
– Check availability of the remote service
– Resolve host and determine connectivity failures
– Detect available data transfer services
– Should be fast and efficient, so as not to burden
system/network resources
• Error while transfer is in progress?
– Error_TRANSFER
• Retry or not?
• When to re-initiate the transfer?
• Use alternate options?
Error Detection
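A lightweight probe along these lines (host names and ports are placeholders) can separate name-resolution failures from connectivity failures before a retry is scheduled:

    # Sketch only: classify why a remote transfer service is unreachable.
    import socket

    def probe(host, port, timeout=3.0):
        try:
            socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror:
            return "resolve_failed"     # host name problem
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "service_available"
        except OSError:
            return "connect_failed"     # host resolves, service unreachable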
39. • Data transfer protocols do not always return appropriate error codes
• Using error messages generated by the data transfer protocol
• A better logging facility and classification
• Recover from Failure
– Retry failed operation
– Postpone scheduling of a failed operation
• Early Error Detection
– Initiate transfer when erroneous condition recovered
– Or use alternate options
Error Classification
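One plausible shape for such a classifier (the patterns are illustrative; real GridFTP error strings vary by version and deployment):

    # Sketch only: map transfer error messages onto scheduling decisions.
    TRANSIENT = ("timed out", "connection reset", "temporarily unavailable")
    PERMANENT = ("no such file", "permission denied", "authentication")

    def classify(message):
        msg = message.lower()
        if any(p in msg for p in PERMANENT):
            return "postpone"   # early error detection: do not retry blindly
        if any(p in msg for p in TRANSIENT):
            return "retry"      # recoverable: retry the failed operation
        return "alternate"      # unknown cause: try an alternate server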
41. SCOOP data - Hurricane Gustav simulations
Hundreds of files (250 data transfer operations)
Small (100MB) and large (1GB, 2GB) files
Failure Aware Scheduling
42. • Verify the successful completion of the operation
by checking checksum and file size.
• For GridFTP, the Stork transfer module can recover
from a failed operation by restarting from the last
transmitted file. In case of a retry after a failure, the
scheduler informs the transfer module to recover
and restart the transfer using the information from
a rescue file created by the checkpoint-enabled
transfer module.
• An “intelligent” (dynamic tuning) alternative to
Globus RFT (Reliable File Transfer)
New Transfer Modules
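The verification step can be as simple as the following sketch (the checksum algorithm and the source of the expected values are assumptions; the slide does not specify them):

    # Sketch only: verify a completed transfer by file size and checksum.
    import hashlib
    import os

    def verify(path, expected_size, expected_md5):
        if os.path.getsize(path) != expected_size:
            return False                 # size mismatch: incomplete transfer
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest() == expected_md5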
43. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
44. • Multiple data movement jobs are combined and
processed as a single transfer job
• Information about each aggregated job is stored in the
job queue and tied to a main job that actually performs
the transfer operation, so each job can still be queried
and reported separately.
• Hence, aggregation is transparent to the user
• We have seen a significant performance improvement,
especially with small data files
– decreasing the amount of protocol usage
– reducing the number of independent network
connections
Job Aggregation
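A minimal sketch of the aggregation step (the job records and batch limit are illustrative): jobs sharing the same endpoints are grouped, and each group becomes one main transfer job.

    # Sketch only: combine queued jobs that share endpoints into batches.
    from collections import defaultdict

    def aggregate(jobs, max_batch=50):
        groups = defaultdict(list)
        for job in jobs:
            groups[(job["src_host"], job["dest_host"])].append(job)
        batches = []
        for (src, dest), members in groups.items():
            for i in range(0, len(members), max_batch):
                # one main job performs the transfer; member jobs stay in
                # the queue tied to it, so each can be reported separately
                batches.append({"src": src, "dest": dest,
                                "jobs": members[i:i + max_batch]})
        return batches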
45. Experiments on LONI (Louisiana Optical Network Initiative) :
1024 transfer jobs from Ducky to Queenbee (avg RTT 5.129 ms) - 5MB
data file per job
Job Aggregation
46. ➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
47. • Performance bottleneck
– Hundreds of jobs submitted to a single batch
scheduler, Stork
• Single point of failure
Stork: Central Scheduling Framework
48. • Interaction between data schedulers
– Manage data activities with lightweight agents in
each site
– Job Delegation
– Peer-to-peer data movement
– Data and server striping
– Make use of replicas for multi-source downloads
Distributed Data Scheduling
Future Plans