Brasil Ross 2011
- 1. BRASIL
Basic Resource Aggregation Interface Layer
Eric Van Hensbergen (bergevan@us.ibm.com)
Pravin Shinde (now at ETH Zurich)
Noah Evans (now at Bell-Labs)
IBM Research Austin
ROSS Workshop 05/31/2011 © 2011 IBM Corporation
- 2. IBM Research
Motivation
Data Flow Oriented HPC: Problems
(Figure 7: The data dependency graph of the Hartree-Fock procedure using a traditional formulation, for a portion of the computation. Alongside: a 64,000 node torus.)
- 3. IBM Research
PUSH Pipelines
UNIX Model
a|b|c
PUSH Model
a |< b >| c
For more detail: refer to PODC09 Short Paper on PUSH Dataflow Shell
3 ROSS Workshop 05/31/2011 © 2011 IBM Corporation
- 4. IBM Research
PUSH Implementation
(Diagram: the shell splits the pipeline across processes; a multiplexor pipe fans records out to several shell/command pairs, and a demultiplexor pipe fans their outputs back in.)
- 5. IBM Research
Back to Motivation: 64,000 Nodes
(Diagram: aggregation hierarchy: terminal t, login node L, I/O nodes I1 and I2, compute nodes c1 through c4.)
- 6. IBM Research
Related Work / Background
• MPI
• “A Compositional Environment for MPI Programs”
• Hadoop & Other Map/Reduce Solutions
• cpu (from Plan 9) - http://plan9.bell-labs.com/magic/man2html/1/cpu
• xcpu (from LANL) - http://www.xcpu.org/
• xcpu2 (from LANL)
- 7. IBM Research
File System Interfaces
Control File Syntax
• reserve [n] [os] [arch]
• dir [wdir]
• exec [command] [args]
• kill
• killonclose
• nice [n]
• splice [path]
Environment Syntax
• [key] = [value]
Namespace Syntax
• mount [-abcC] [server] [mnt]
• bind [-abcC] [new] [old]
• import [-abc] [host] [path]
• cd [dir]
• unmount [new] [old]
• clear
• . [path]
(Figure: directory tree of the local resources, session, and sub-session interfaces.)
- 8. IBM Research
Distribution and Aggregation
(Figure: per-node session directory trees being distributed and aggregated across the node hierarchy.)
- 9. IBM Research
Command Line Interface(s)
• Direct File Interaction
echo "res 2" > ./mpoint/csrv/local/0/ctl
echo "exec date" > ./mpoint/csrv/local/0/0/ctl
echo "exec wc" > ./mpoint/csrv/local/0/1/ctl
echo "xsplice 0 1" > ./mpoint/csrv/local/0/ctl
cat ./mpoint/csrv/local/0/stdio
• Python Command Line
runJob.py -n 4 /bin/date
• PUSH
ORS=blm find . -type f |< xargs chasen | sort | uniq -c >| sort -rn
- 12. IBM Research
Simple Micro-benchmark: Job start times
(Chart: per-phase breakdown of job start time versus number of nodes; legend: Reservation, Execution, Aggregation, Termination, Housekeeping.)
- 13. IBM Research
Job Completion Times with I/O
(Chart: per-phase breakdown of job completion time versus number of nodes; legend: Reservation, Execution, Input, Aggregation, Termination, Housekeeping.)
- 14. IBM Research
Discussion
• Non-buffered communication channels are great for synchronization, but bad for
performance: aggregation points become serialization points.
• Pushing multiplexing into PUSH was a mistake
• it adds extra copies of the data and extra context switches for I/O record separation
• pushing multiplexing into the infrastructure takes each pipeline stage from 6 copies and context switches
down to 2
• "beltway buffers" and other techniques may reduce this further or eliminate copies altogether
• Transitive mounts of namespace are an elegant way to bypass network segmentation,
but they incur a lot of overhead
• Allowing splice inside a reservation is of dubious usefulness. It seems like it might
be more useful to allow individual elements to be spawned outside a reservation.
• Fan-in and fan-out are too limited a use case. What about deterministic
delivery? What about many-to-many pipeline components? What about collectives
and barriers?
• Fault-tolerance and fault-debugging properties were poor: there was no channel for
out-of-band communication of error or logging information. When transitive mounts
went down, the system wedged, with no integrated way of figuring out where it
went down.
• File systems and communication are outside the scope of this work, but remain a sore
point for performance with the Plan 9 kernel on Blue Gene. Both issues are actively
being worked on, with a release planned later this summer.
- 15. IBM Research
Future Work: Cutting Out The Middle Man
(Diagram: today each hop runs App over Brasil over Kernel on the initiating terminal or node, the parent or gateway node(s), and the compute node; the proposed design drops the intermediate Brasil instances so applications talk across kernels directly.)
- 16. IBM Research
Future Work: Splice Optimizations via Direct Communication
• Splice operations through the name space are elegant, but transitive mounting of name
space means that data has to flow through the hierarchy in order to be spliced from one
element to another
• A direct communication path would be much more desirable, particularly on BG/P,
where the hierarchy is constructed on the tree network but the torus is the preferred
node-to-node data transport
• Solution: add a layer of indirection to splice communication
• add a file to the session directories which contains a list of "locations" for this node, essentially giving a
recipe for connecting directly to the channel, in order of performance. If the source node cannot use any of
these recipes, it falls back to the name space path.
• example:
% cat ./mpoint/csrv/parent/c4/local/0/location
torus!1.3.4!564
tree!0.23!564
tcp!10.1.3.4!564
• in the simplest embodiment we mount the namespace of the node in order to access its interfaces, but we
also want to experiment with direct connections for data and/or potentially hiding everything inside the csrv
interface so the mpipe splice code (and end users) can be more or less ignorant of how the resources are accessed
- 17. IBM Research
Future Work:
• Turn fan-out/fan-in/many-to-many pipeline
operations into a system primitive with
syntax similar to the existing pipe primitives but
with the semantics of a multipipe
• respect record boundaries
• allow broadcast
• allow enumerated specification of destination
• support splice operations
• Build out the rest of the Brasil infrastructure based
on this new primitive
• all stdio channels are multipipes
• allow record buffers for decoupled performance
• all ctl channels implemented on top of multipipes
• etc.
• The same model could potentially be used to
implement collective operations and
barriers within a distributed file system
namespace
(Figure: multipipe packet header fields TYPE, SIZE, DESTINATION, PARAMETERS, written with pwrite(pipefd, buf, sz, ~(0));)
- 18. IBM Research
Brasil Code Available: http://www.bitbucket.org/ericvh/hare/usr/brasil
New Version In Progress: http://www.bitbucket.org/ericvh/hare/sys/src/cmd/uem
http://goo.gl/5eFB
This project is supported in part by the
U.S. Department of Energy under
Award Number DE-FG02-08ER25851
http://www.research.ibm.com/austin
- 19. IBM Research
Core Concept: BRASIL
Basic Resource Aggregate System Inferno Layer
•Stripped down Inferno - No GUI or anything we can live
without, minimal footprint
•Runs as a daemon (no console), all interaction via 9p
mounts of its namespace
•Different modes
•default (exports /srv/brasil or listens on tcp!127.0.0.1!5670)
•gateway (exports over standard I/O - to be used by ssh initialization)
•terminal (initiates ssh connection and starts a gateway)
•Runs EVERYWHERE
•User’s workstation
•Surveyor Login Nodes
•I/O Nodes
•Compute Nodes
- 20. IBM Research
Preferred Embodiment: BRASIL Desktop Extension Model
(Diagram: workstation connected over an ssh-duct to a login node, then to an I/O node and its CPU nodes.)
•Setup
•User starts brasild on workstation
•brasild ssh's to the login node and starts another brasil, hooking the two
together with 27b-6 and mounting resources in /csrv
•User mounts brasild on workstation into namespace using 9pfuse or
v9fs (or can mount from Plan 9 peer node, 9vx, p9p or ACME-sac)
•Boot
•User runs anl/run script on workstation
•script interacts with taskfs on login node to start cobalt qsub
•when the I/O nodes boot they connect their csrv to the login csrv
•when the CPU nodes boot they connect to the csrv on their I/O node
•Task Execution
•User runs anl/exec script on workstation to run app
•script reserves x nodes for app using taskfs
•taskfs on workstation aggregates execution by using taskfs running
on I/O nodes
- 21. IBM Research
Core Concept: Central Services
•Establish hierarchical namespace of cluster services on /csrv
•Automount remote servers based on reference (ie. cd criswell)
•Export local services for use elsewhere within the network
(Figure: node topology t, L, l1, l2, I1, I2, c1 through c4; each node's /csrv holds /local plus an entry per known node, e.g. /t, /L, /l1, /l2, /c1 ... /c4, each of which contains its own /local.)
- 23. IBM Research
Our Approach: Workload Optimized Distribution
Desktop Extension
- 24. IBM Research
Our Approach: Workload Optimized Distribution
(Figure, built up across slides 24-27: the Desktop Extension and PUSH Pipeline Model are composed from local, proxy, and aggregate services layered over remote services, giving Aggregation via Dynamic Namespace, Scaling, Reliability, and a Distributed Service Model.)
- 34. IBM Research
Example Simple Invocation
•mpipefs
•mount /srv/mpipe /n/testpipe
•ls -l /n/testpipe
--rw-rw-rw- M 24 ericvh ericvh 0 Oct 10 18:10 /n/testpipe/data
•echo hello > /n/testpipe/data
•cat /n/testpipe/data
hello
28 © 2010 IBM Corporation
- 35. IBM Research
Passing Arguments via aname
•mount /srv/mpipe /n/test othername
•ls -l /n/test
--rw-rw-rw- M 26 ericvh ericvh 0 Oct 10 18:12 /n/test/otherpipe
•mount /srv/mpipe /n/test2 -b bcastpipe
•mount /srv/mpipe /n/test3 -e 5 enumpipe
•....you get the idea, read the man page for more details
- 36. IBM Research
Example for writing control blocks
int
pipewrite(int fd, char *data, ulong size, ulong which)
{
	int n;
	char hdr[255];
	ulong tag = ~0;
	char pkttype = 'p';

	/* header write goes at offset ~0 */
	n = snprint(hdr, 31, "%c\n%lud\n%lud\n\n", pkttype, size, which);
	n = pwrite(fd, hdr, n+1, tag);
	if(n <= 0)
		return n;
	return write(fd, data, size);
}
- 37. IBM Research
Larger Example (execfs)
/proc
  /clone
  /###
    /stdin    (mount -a /srv/mpipe /proc/### stdin)
    /stdout   (mount -a /srv/mpipe /proc/### stdout)
    /stderr   (mount -a /srv/mpipe /proc/### stderr)
    /args
    /ctl
    /fd
    /fpregs
    /kregs
    /mem
    /note
    /noteid
    /notepg
    /ns
    /proc
    /profile
    /regs
    /segment
    /status
    /text
    /wait
- 38. IBM Research
Really Large Example (gangfs)
/proc
  /gclone
  /status
  /g###
    /stdin    (mount -a /srv/mpipe /proc/### -b stdin)
    /stdout   (mount -a /srv/mpipe /proc/### stdout)
    /stderr   (mount -a /srv/mpipe /proc/### stderr)
    /ctl
    /ns
    /status
    /wait
...and then, post-exec from the gangfs clone, execfs stdins are spliced from g#/stdin
...and then execfs stdouts and stderrs are spliced to g#/stdout and g#/stderr
...and you can do -e # with stdin to get enumerated instead of broadcast pipes