SlideShare une entreprise Scribd logo
1  sur  8
Télécharger pour lire hors ligne
A Java Fork/Join Framework
                                                              Doug Lea
                                                  State University of New York at
                                                             Oswego
                                                        Oswego NY 13126
                                                         315−341−2688
                                                       dl@cs.oswego.edu

ABSTRACT                                                             Some associated programming techniques and examples are
                                                                     discussed in section 4.4 of Concurrent Programming in Java,
This paper describes the design, implementation, and                 second edition [7]. This paper discusses the design (section 2),
performance of a Java framework for supporting a style of            implementation (section 3), and performance (section 4) of
parallel programming in which problems are solved by                 FJTask, a JavaTM framework that supports this programming
(recursively) splitting them into subtasks that are solved in        style. FJTask is available as part of the util.concurrent
parallel, waiting for them to complete, and then composing           package from http://gee.cs.oswego.edu.
results. The general design is a variant of the work−stealing
framework devised for Cilk. The main implementation                  2. DESIGN
techniques surround efficient construction and management of         Fork/join programs can be run using any framework that
tasks queues and worker threads. The measured performance            supports construction of subtasks that are executed in parallel,
shows good parallel speedups for most programs, but also             along with a mechanism for waiting out their completion.
suggests possible improvements.                                      However, the java.lang.Thread class (as well as POSIX
                                                                     pthreads, upon which Java threads are often based) are
                                                                     suboptimal vehicles for supporting fork/join programs:
1. INTRODUCTION                                                      •   Fork/join tasks have simple and regular synchronization and
Fork/Join parallelism is among the simplest and most effective           management requirements. The          computation graphs
design techniques for obtaining good parallel performance.               produced by fork/join tasks admit much more efficient
Fork/join algorithms are parallel versions of familiar divide−           scheduling tactics than needed for general−purpose threads.
and−conquer algorithms, taking the typical form:                         For example, fork/join tasks never need to block except to
                                                                         wait out subtasks. Thus, the overhead and bookkeeping
   Result solve(Problem problem) {                                       necessary for tracking blocked general−purpose threads are
                                                                         wasted.
     if (problem is small)
       directly solve problem                                        •   Given reasonable base task granularities, the cost of
                                                                         constructing and managing a thread can be greater than the
     else {                                                              computation time of the task itself. While granularities can
       split problem into independent parts                              and should be subject to tuning when running programs on
       fork new subtasks to solve each part                              particular platforms, the extremely coarse granularities
       join all subtasks                                                 necessary to outweigh thread overhead limits opportunities
                                                                         for exploiting parallelism.
       compose result from subresults
     }                                                               In short, standard thread frameworks are just too heavy to
                                                                     support most fork/join programs. But since threads form the
   }                                                                 basis of many other styles of concurrent and parallel
                                                                     programming as well, it is impossible (or at least impractical) to
The fork operation starts a new parallel fork/join subtask. The      remove overhead or tune scheduling of threads themselves just
join operation causes the current task not to proceed until the      for the sake of supporting this style.
forked subtask has completed. Fork/join algorithms, like other
                                                                     While the ideas surely have a longer heritage, the first published
divide−and−conquer algorithms, are nearly always recursive,
                                                                     framework to offer systematic solutions to these problems was
repeatedly splitting subtasks until they are small enough to solve
                                                                     Cilk[5]. Cilk and other lightweight executable frameworks layer
using simple, short sequential methods.
                                                                     special−purpose fork/join support on top of an operating
                                                                     system’s basic thread or process mechanisms. This tactic applies
                                                                     equally well to Java, even though Java threads are in turn
                                                                     layered onto lower−level OS capabilities. The main advantage
                                                                     of creating such a Java lightweight execution framework is to
                                                                     enable fork/join programs to be written in a more portable
                                                                     fashion and to run on the wide range of systems supporting
                                                                     JVMs.
                                                                     The FJTask framework is based on a variant of the design
                                                                     used in Cilk. Other variants are seen in Hood[4], Filaments[8],
                                                                     stackthreads[10], and related systems relying on lightweight
class ATask                                                         As the standard example of how this framework appears to the
  extends FJTask {                                                  programmer, here is a class computing the Fibonacci function:
   public void run() {                                              class Fib extends FJTask {
     split...                               Task                       static final int threshold = 13;
     fork...                                                           volatile int number; // arg/result
     join...
     compose...                                                        Fib(int n) { number = n; }
   }                                                                   int getAnswer() {
}                                   Task                                  if (!isDone())
                                                                             throw new IllegalStateException();
                                                                          return number;
                                                                       }
                                                                        public void run() {
                               Task                                       int n = number;
                                                                          if (n <= threshold) //            granularity ctl
                                                                            number = seqFib(n);
                                                                          else {
                                                                            Fib f1 = new Fib(n −            1);
                                                                            Fib f2 = new Fib(n −            2);
             Worker                        Worker                           coInvoke(f1, f2);
                                                                            number = f1.number +            f2.number;
executable tasks. All of these frameworks map tasks to threads            }
in about the same way that operating systems map threads to             }
CPUs, but exploit the simplicity, regularity, and constraints of        public static void main(String[] args) {
fork/join programs in performing the mapping. While all of                try {
these frameworks can accommodate (to varying extents) parallel              int groupSize = 2; // for example
programs written in different styles, they optimize for fork/join           FJTaskRunnerGroup group =
designs:                                                                        new FJTaskRunnerGroup(groupSize);
                                                                            Fib f = new Fib(35); // for example
•   A pool of worker threads is established. Each worker thread             group.invoke(f);
    is a standard ("heavy") thread (here, an instance of Thread             int result = f.getAnswer();
    subclass FJTaskRunner) that processes tasks held in                     System.out.println("Answer: " +
    queues. Normally, there are as many worker threads as there                                 result);
                                                                          }
    are CPUs on a system. In native frameworks such as Cilk,              catch (InterruptedException ex) {}
    these are mapped to kernel threads or lightweight processes,        }
    and in turn to CPUs. In Java, the JVM and OS must be
    trusted to map these threads to CPUs. However, this is a            int seqFib(int n) {
    very simple task for the OS, since these threads are                  if (n <= 1) return n;
    computationally intensive. Any reasonable mapping strategy            else return seqFib(n−1) + seqFib(n−2);
                                                                        }
    will map these threads to different CPUs.                       }
•   All fork/join tasks are instances of a lightweight executable   This version runs at least thirty times faster than an equivalent
    class, not instances of threads. In Java, independently         program in which each new task is run in a new
    executable tasks must implement interface Runnable and          java.lang.Thread on the platforms described in section 4.
    define a run method. In the FJTask framework, these             It does so while maintaining the intrinsic portability of
    tasks subclass FJTask rather than subclassing Thread,           multithreaded Java programs. There are only two tuning
    both of which implement Runnable. (In both cases, a class       parameters of typical interest to programmers:
    can alternatively implement Runnable and then supply
    instances to be run within executing tasks or threads.          •   The number of worker threads to construct, which should
    Because tasks operate under restricted rules supported by           ordinarily correspond to the number of CPUs available on a
    FJTask methods, it is much more convenient to subclass              platform (or fewer, to reserve processing for other unrelated
    FJTask, so as to be able to directly invoke them.)                  purposes, or occasionally more, to soak up non−
                                                                        computational slack).
•   A special purpose queuing and scheduling discipline is used
    to manage tasks and execute them via the worker threads         •   A granularity parameter that represents the point at which
    (see section 2.1). These mechanics are triggered by those           the overhead of generating tasks outweighs potential
    few methods provided in the task class: principally fork,           parallelism benefits. This parameter is typically more
    join, isDone (a completion status indicator), and some              algorithm−dependent than platform−dependent. It is
    convenience methods such as coInvoke that forks then                normally possible to settle on a threshold that achieves good
    joins two or more tasks.                                            results when run on a uniprocessor, yet still exploits
                                                                        multiple CPUs when they are present. As a side−benefit, this
•   A simple control and management facility (here,                     approach meshes well with JVM dynamic compilation
    FJTaskRunnerGroup) sets up worker pools and initiates               mechanics that optimize small methods better than
    execution of a given fork/join task when invoked from a             monolithic procedures. This, along with data locality
    normal thread (such as the one performing main in a Java            advantages, can cause fork/join algorithms to outperform
    program).                                                           other kinds of algorithms even on uniprocessors.
2.1 Work−Stealing                                                    It reduces contention by having stealers operate on the opposite
The heart of a fork/join framework lies in its lightweight           side of the deque as owners. It also exploits the property of
scheduling mechanics. FJTask adapts the basic tactics                recursive divide−and−conquer algorithms of generating "large"
pioneered in the Cilk work−stealing scheduler:                       tasks early. Thus, an older stolen task is likely to provide a
                                                                     larger unit of work, leading to further recursive decompositions
•   Each worker thread maintains runnable tasks in its own           by the stealing thread.
    scheduling queue.
                                                                     As one consequence of these rules, programs that employ
•   Queues are maintained as double−ended queues (i.e.,              relatively small task granularities for base actions tend to run
    deques, usually pronounced "decks"), supporting both LIFO        faster than those that only use coarse−grained partitioning or
    push and pop operations, as well as a FIFO take                  those that do not use recursive decomposition. Even though
    operation.                                                       relatively few tasks are stolen in most fork/join programs,
•   Subtasks generated in tasks run by a given worker thread         creating many fine−grained tasks means that a task is likely to
    are pushed onto that workers own deque.                          be available whenever a worker thread is ready to run it.
•   Worker threads process their own deques in LIFO                  3. IMPLEMENTATION
    (youngest−first) order, by popping tasks.                        The framework has been implemented in about 800 lines of pure
•   When a worker thread has no local tasks to run, it attempts      Java code, mainly in class FJTaskRunner, a subclass of
    to take ("steal") a task from another randomly chosen            java.lang.Thread. FJTasks themselves maintain only a
    worker, using a FIFO (oldest first) rule.                        boolean completion status, and perform all other operations via
                                                                     delegation    to    their   current    worker   threads.     The
•   When a worker thread encounters a join operation, it
                                                                     FJTaskRunnerGroup class serves to construct worker
    processes other tasks, if available, until the target task is
                                                                     threads, maintains some shared state (for example, the identities
    noticed to have completed (via isDone). All tasks
                                                                     of all worker threads, needed for steal operations), and helps
    otherwise run to completion without blocking.
                                                                     coordinate startup and shutdown.
•   When a worker thread has no work and fails to steal any
                                                                     More detailed implementation documentation is available inside
    from others, it backs off (via yields, sleeps, and/or priority   the util.concurrent package. This section discusses only
    adjustment − see section 3) and tries again later unless all     two sets of problems and solutions encountered when
    workers are known to be similarly idle, in which case they       implementing this framework: Supporting efficient deque
    all block until another task is invoked from top−level.          operations (push, pop, and take), and managing the steal
                                                                     protocol by which threads obtain new work.

                                                                     3.1 Deques
                                                                     To enable efficient and scalable execution, task management
                                        Stealing                     must be made as fast as possible. Creating, pushing, and later
                                                                     popping (or, much less frequently, taking) tasks are analogs of
                                                                     procedure call overhead in sequential programs. Lower overhead
                                                                     enables programmers to adopt smaller task granularities, and in
                                                                     turn better exploit parallelism.
                                                                     Task allocation itself is the responsibility of the JVM. Java
                                                                     garbage collection relieves us of needing to create a special−
                                                                     purpose memory allocator to maintain tasks. This substantially
                                       Pushing                       reduces the complexity and lines of code needed to implement
                                                                     FJTasks compared to similar frameworks in other languages.
                                                                     The basic structure of the deque employs the common scheme
                                    Deque                            of using a single (although resizable) array per deque, along
               Base       Top                                        with two indices: The top index acts just like an array−based
                                                                     stack pointer, changing upon push and pop. The base index
                                                                     is modified only by take. Since FJTaskRunner operations
                                                                     are all intimately tied to the concrete details of the deque (for
                                                                     example, fork simply invokes push), this data structure is
                                                                     directly embedded in the class rather than being defined as a
                                      Popping                        separate component.
                                                                     Because the deque array is accessed by multiple threads,
                                                                     sometimes without full synchronization (see below), yet
                                                                     individual Java array elements cannot be declared as
                                                                     volatile, each array element is actually a fixed reference to a
                                                                     little forwarding object maintaining a single volatile
                                                                     reference. This decision was made originally to ensure
As discussed in more detail in [5], the use of LIFO rules for        conformance with Java memory rules, but the level of
each thread processing its own tasks, but FIFO rules for stealing    indirection that it entails turns out to improve performance on
other tasks is optimal for a wide class of recursive fork/join       tested platforms, presumably by reducing cache contention due
designs. Less formally, the scheme offers two basic advantages:
to accesses of nearby elements, which are spread out a bit more     of a main task, upon its completion, and around global full−stop
in memory due to the indirection.                                   synchronization points employed in some fork/join algorithms.
The main challenges in deque implementation surround                The main issue here is what to do when a worker thread has no
synchronization and its avoidance. Even on JVMs with                local tasks and cannot steal one from any other thread. If the
optimized synchronization facilities[2], the need to obtain locks   program is running on a dedicated multiprocessor, then one
for every push and pop operation becomes a bottleneck.              could make the case for relying on hard busy−wait spins looping
However, adaptations of tactics taken in Cilk[5] provide a          to try to steal work. However, even here, attempted steals
solution based on the following observations:                       increase contention, which can slow down even those threads
•   The push and pop operations are only invoked by owner           that are not idle (due to locking protocols in section 3.1).
    threads.                                                        Additionally, in more typical usage contexts of this framework,
                                                                    the operating system should somehow be convinced to try to run
•   Access to the take operation can easily be confined to one      other unrelated runnable processes or threads.
    stealing thread at a time via an entry lock on take. (This
    deque lock also serves to disable take operations when          The tools for achieving this in Java are weak, have no
    necessary.) Thus, interference control is reduced to a two−     guarantees (see [6, 7]), but usually appear to be acceptable in
    party synchronization problem.                                  practice (as do similar techniques described for Hood[3]). A
                                                                    thread that fails to obtain work from any other thread lowers its
•   The pop and take operations can only interfere if the           priority before attempting additional steals, performs
    deque is about to become empty. Otherwise they are              Thread.yield between attempts, and registers itself as
    guaranteed to operate on disjoint elements of the array.        inactive in its FJTaskRunnerGroup. If all others become
Defining the top and base indices as volatile ensures that          inactive, they all block waiting for additional main tasks.
a pop and take can proceed without locking if the deque is          Otherwise, after a given number of additional spins, threads
sure to have more than one element. This is done via a Dekker−      enter a sleeping phase, where they sleep (for up to 100ms) rather
like algorithm in which push pre−decrements top:                    than yield between steal attempts. These imposed sleeps can
  if (−−top >= base) ...                                            cause artificial lags in programs that take a long time to split
and take pre−increments base:                                       their tasks. But this appears to be the best general−purpose
  if (++base < top) ...                                             compromise. Future versions of the framework may supply
In each case they must then check to see if this could have         additional control methods so that programmers can override
caused the deque to become empty by comparing the two               defaults when they impact performance.
indices. An asymmetric rule is used upon potential conflict: pop
rechecks state and tries to continue after obtaining the deque      4. PERFORMANCE
lock (the same one as held by take), backing off only if the        In these times of nearly continuous performance improvements
deque is truly empty. A take operation instead just backs off       of compilers and JVMs, performance measurements are only of
immediately, typically then trying to steal from a different        transient value. However, the metrics reported in this section
victim. This asymmetry represents the only significant departure    reveal some basic properties of the framework.
from the otherwise similar THE protocol used in Cilk.               A collection of seven fork/join test programs are briefly
The use of volatile indices also enables the push operation         described in the following table. These programs are adaptations
to proceed without synchronization unless the deque array is        of those available as demos in the util.concurrent
about to overflow, in which case it must first obtain the deque     package. They were selected to show some diversity in the kinds
lock to resize the array. Otherwise, simply ensuring that top is    of problems that can be run within this framework, as well as to
updated only after the deque array slot is filled in suppresses     obtain results for some common parallel test programs.
interference by any take.                                                 Program                      Description
Subsequent to initial implementation, it was discovered that
                                                                    Fib                The Fibonnaci program shown in section 2,
several JVMs do not conform to the Java Memory Model [6]
                                                                                       run with argument 47 and granularity
rule requiring accurate reads after writes of pairs of volatile
                                                                                       threshold 13.
fields. As a workaround, the criterion for pop to retry under
lock was adjusted to trigger if there appear to be two or fewer     Integrate          Recursive      Gaussian    quadrature      of
elements, and the take operation added a secondary lock to
                                                                                         2⋅i 1 ⋅x     2⋅i 1
                                                                                                            summing over odd
ensure a memory barrier. This suffices as long as at most one
index change is missed by the owner thread (which holds here                           values of i from 1 to 5 and integrating from
for platforms that otherwise maintain proper memory order                              −47 to 48.
when reading volatile fields), and causes only a tiny               Micro              Best−move finder for a board game, run
slowdown in performance.                                                               with a lookahead of 4 moves.
3.2 Stealing and Idling                                             Sort               Merge/Quick sort (based on an algorithm
Worker threads in work−stealing frameworks know nothing                                from Cilk) of 100 million numbers.
about the synchronization demands of the programs they are          MM                 Multiply 2048 X 2048 matrices of doubles.
running. They simply generate, push, pop, take, manage the
status of, and execute tasks. The simplicity of this scheme leads   LU                 Decompose 4096 X 4096 matrix of doubles.
to efficient execution when there is plenty of work for all
                                                                    Jacobi             Iterative mesh relaxation with barriers: 100
threads. However, this streamlining comes at the price of relying
                                                                                       steps of nearest neighbor averaging on a
on heuristics when there is not enough work; i.e., during startup
                                                                                       4096 X 4096 matrix of doubles.
For the main tests, programs were run on a 30−CPU Sun                           However, in general, the results show that increasing the number
Enterprise 10000 running the Solaris Production 1.2 JVM (an                     of threads reliably increased the number of CPUs employed.
early version of the 1.2.2_05 release) on Solaris 7. The JVM                    Speedups are reported as Timen / Time1. The best overall
was run with environment parameters selecting "bound threads"                   speedup was seen for the integration program (speedup of 28.2
for thread mappings, and memory parameters discussed in                         for 30 threads). The worst was for the LU decomposition
section 4.2. A few additional measurements reported below were                  program (speedup of 15.35 for 30 threads).
run on a 4−CPU Sun Enterprise 450.
                                                                                Another way of measuring scalability is in terms of task rates,
                                                                                the average time taken to execute a single task (which may be
                                      Times                                     either a recursive or a leaf step). The following figure shows
                700                                                             data from a single instrumented run capturing task rates. Ideally,
                                                                                the numbers of tasks processed per unit time per thread should
                600                                                             be constant. The fact that they generally slightly decrease with
                                                                                numbers of threads indicates some scalability limitations. Note
                500                                                             the fairly wide difference in task rates, reflecting differences in
                                                                                task granularities. The smallest tasks sizes are seen in Fib,
  Seconds




                400                                                             which even with a threshold setting of 13, generated and
                                                                                executed a total of 2.8 million tasks per second when run with
                300                                                             30 threads.

                200                                                                                                       Task rates
                                                                                                           120000
                100

                                                                                                           100000

                                                                                    Tasks/sec per thread
                 0
                         Fib   Micro Integ       MM      LU     Jaco   Sort
                                                                bi                                         80000
Programs were run with very large input parameters in order to
minimize timer granularity and JVM warm−up artifacts. Some                                                 60000
other warm−up effects were avoided by running an initial
problem set before starting timers. Most data are medians of                                               40000
three runs, but some (including most follow−up measurements
in sections 4.2−4.4) only reflect single runs so are a bit noisy.                                          20000
4.1 Speedups
Scalability measurements were obtained by running the same                                                     0
problem set using worker thread groups of size 1 ... 30. There is                                                   Fib   Micr Inte− MM   LU   Jaco Sort
                                                                                                                          o    grat            bi
                                                                                Four factors appear to account for tail−offs in speedups in those
                                    Speedups                                    programs that did not scale linearly. Three of them are common
                30                                                              to any parallel framework, but we’ll start with one unique to
                                                                                FJTask (as opposed to Cilk etc), GC effects.
                25
                                                                       Ideal    4.2 Garbage Collection
                                                                       Fib
     Speedups




                20                                                              In many ways, modern GC facilities are perfect matches to
                                                                       Micro    fork/join frameworks: These programs can generate enormous
                15                                                              numbers of tasks, nearly all of which quickly turn into garbage
                                                                       Integ
                                                                                after they are executed. At any given time, deterministic
                10                                                     MM
                                                                                fork/join programs require only at most p (where p is the
                                                                       LU       number of threads) times the bookkeeping space that would be
                 5                                                     Jacobi   required for sequential versions of these programs[10].
                                                                       Sort     Generational semispace copying collectors (including the one in
                 0                                                              the JVM used for these measurements − see [1]) cope with this
                      1234567891 1 11 1 1 11 1 1 22 2 22 2 2 22 2 3             well because they only traverse and copy non−garbage cells. By
                               0 1 23 4 5 67 8 9 01 2 34 5 6 78 9 0             doing so, they evade one of the trickiest problems in manual
                                                                                parallel memory management, tracing memory that is allocated
                                      Threads                                   by one thread but then used by another. The garbage collector is
no way to know whether the JVM always mapped each thread to                     oblivious to the origin of memory, so need not deal with such
a different CPU when available. While there is no evidence                      issues.
about this either way, it is possible that the lags to map new                  As a simple indicator of the general superiority of generational
threads to CPUs increased with numbers of threads and/or                        copying collection, a four−thread run of Fib that executes in 5.1
varied systematically across the different test programs.                       seconds when using the memory settings used in the main
                                                                                experiments reported here runs in 9.1 seconds if the generational
GC Effects: Fib                              Memory bandwidth effects: Sorting
              30                                                                30

              25                                                                25

              20                                                                20
    Speedup




                                                                      Speedup
                                                                                                                                           Ideal
                                                   Ideal
              15                                                                15                                                         Bytes
                                                   Fib−64m
                                                                                                                                           Shorts
              10                                   Fib−4m                       10                                                         Ints
                                                   Fib−scaled
              5                                                                     5                                                      Longs

              0                                                                     0
                   1 3 5 7 9 1 1 1 1 1 2 2 2 2 2                                        1234567891 11 1 1 11 1 1 12 2 22 2 2 22 2 2 3
                             1 3 5 7 9 1 3 5 7 9                                                 0 12 3 4 56 7 8 90 1 23 4 5 67 8 9 0

                             Threads                                                                    Threads
copying phase is disabled on this JVM (in which case this JVM       4.4 Task Synchronization
relies entirely on mark−sweep).                                     As discussed in section 3.2, work−stealing frameworks
However, these GC mechanisms turn into scaling problems             sometimes encounter problems dealing with frequent global
when memory is allocated at such a high rate that threads must      synchronization of tasks. The worker threads continue to poll for
be stopped nearly continuously to perform collections. The          more work even though there isn’t any, thus generating
following figure shows the difference in speedups across three      contention, and, in FJTask, sometimes even forcing threads
memory settings (this JVM supports optional arguments to set        into idle sleeps.
memory parameters): with the default 4 megabyte semispaces,         The Jacobi program illustrates some of the consequent issues.
with 64 megabyte semispaces, and with the amount of memory          This program performs 100 steps, where in each step, all cells
scaled to the number of threads (2 + 2p megabytes). With            are updated according to a simple nearest neighbor averaging
smaller semispaces, the overhead of stopping threads and            formula. A global (tree−based) barrier separates each step. To
collecting garbage starts to impact scaling as the garbage−         determine the magnitude of synchronization effects, a version of
generation rate climbs due to additional threads.
In light of these results, 64m semispaces were used for all other
test runs. A better policy would have been to scale the amount
                                                                                               Sync Effects: Jacobi
of memory to the number of threads in each test. (As seen in the                    30
figure, this would have caused all speedups to appear slightly
more linear.). Alternatively, or in addition, program−specific                      25
task granularity thresholds could be increased proportionally
with numbers of threads.                                                            20
                                                                          Speedup




                                                                                                                                        Ideal
4.3 Memory Locality and Bandwidth                                                   15
                                                                                                                                        1step/sync
Four of the test programs create and operate on very large
                                                                                    10                                                  10steps/−
shared arrays or matrices: Sorting numbers, and multiplying,
                                                                                                                                        sync
decomposing or performing relaxation on matrices. Of these,
sorting is probably the most sensitive to the consequences of                           5
having to move data around processors, and thus to the
aggregate memory bandwidth of the system as a whole. To help                            0
determine the nature of these effects, the Sort program was                                 1 3 5 7 9 1 1 1 1 1 2 2 2 2 2
recast into four versions, that sorted bytes, shorts, ints, and                                       1 3 5 7 9 1 3 5 7 9
longs respectively. Each version sorted data only in the range of
0...255, to ensure that they were otherwise equivalent. The wider                                      Threads
the data elements, the more memory traffic.                         this program was written to synchronize only after every 10
The results show that increased memory traffic leads to poorer      steps. The scaling differences show the impact of the current
speedups, although they do not provide definitive evidence that     policy, and indicate the need to include additional methods in
this is the only cause of tail−offs.                                future versions of this framework to allow programmers to
                                                                    override default parameters and policies. (Note however that
Element width also impacts absolute performance. For example,       this graph probably slightly exaggerates pure synchronization
using only one thread, sorting bytes took 122.5 seconds, while      effects, since the 10−step version is also likely to maintain
sorting longs took 242.4 seconds.                                   somewhat greater task locality.)
4.5 Task Locality
FJTask, like other fork/join frameworks, is optimized for the                                       Other Frameworks
case where worker threads locally consume the vast majority of                            8
the tasks they create. When this does not occur, performance can
suffer, for two reasons:                                                                  7
•   Stealing tasks encounters much more overhead than does                                6
    popping tasks.                                                                                                             FJTask
                                                                                          5                                    Cilk
•




                                                                                Seconds
    In most programs in which tasks operate on shared data,
    running your own subdivided task is likely to maintain better                                                              Hood
                                                                                          4                                    Stack−
    locality of data access.
                                                                                                                               threads
                                                                                          3
                                    Locality effects                                                                           Filaments
                        0.225                                                             2
                          0.2                                                             1
    Proportion stolen




                        0.175
                                                                Fib                       0
                         0.15
                                                                Micro                         Fib   MM Sort   LU   Intg Jac
                        0.125
                                                                Integrate   different choices among these than used here could produce
                          0.1
                                                                MM          nearly any relative performance ranking among these
                        0.075                                   LU          frameworks in many high−performance applications.
                         0.05                                   Jacobi      FJTask generally performs worse (on this JVM) on programs
                        0.025                                   Sort        that mainly do floating point computations on arrays and
                           0                                                matrices. Even though JVMs are improving, they are still not
                                                                            always competitive with powerful back−end optimizers used in
                                1 3 5 7 9 1 1 1 1 1 2 2 2 2 2
                                                                            C and C++ programs. Although not shown in this chart,
                                          1 3 5 7 9 1 3 5 7 9
                                                                            FJTask versions of all programs ran faster than versions of
                                          Threads                           programs in these other frameworks when they were compiled
                                                                            without optimization enabled, and some informal tests suggest
As seen in the figure, in most programs, the relative number of             that most of the remaining differences are due to array bounds
stolen tasks is at most a few percent. However, the LU and MM               checks and related runtime obligations. These are, of course,
programs generate larger imbalances in workloads (and thus                  issues and problems that are attracting much attention and effort
relatively more stealing) as the number of threads increase. It is          by JVM and compiler developers. The observed differences in
possible that some algorithmic tuning of these programs could               code quality for computationally intensive programs are likely
reduce this effect, and in turn lead to better speedups.                    to diminish.
4.6 Comparisons to Other Frameworks                                         5. CONCLUSIONS
It is impossible to perform definitive or even very meaningful              This paper has demonstrated that it is possible to support
measurements comparing programs across languages and                        portable, efficient, scalable parallel processing in pure Java, and
frameworks. However, relative measurements can at least show                to provide a convenient API for programmers who can exploit
some of the basic advantages and limitations of the FJTask                  the framework by following only a few simple design rules and
framework versus similar frameworks written in other languages              design patterns (as presented and discussed in [7]). The
(here, C and C++). The following chart compares performance                 observed performance characteristics of the sample programs
of programs that were either based on or are similar to those               measured here both provide further guidance for users of the
supplied with the Cilk, Hood, Stackthreads, and/or Filaments                framework, and suggest some areas of potential improvement to
distributions. All of these were run with 4 threads on a 4−CPU              the framework itself.
Enterprise 450. To avoid reconfiguring other frameworks or
their test programs, all tests were run with smaller problems sets          Even though scalability results were shown only for a single
than used above. All results represent the best of three runs,              JVM, some of the main empirical findings should hold more
using the compiler and run−time settings that appeared to                   generally:
provide the fastest times. Fib was run without any granularity              •   While generational GC generally meshes well with
threshold; i.e., an implicit threshold of 1. (The "prune" setting in            parallelism, it can hinder scalability when garbage
Filaments Fib was set to 1024, which causes it to behave more                   generation rates force very frequent collections. On this
similarly to the other versions.)                                               JVM, the underlying cause appears to be that stopping
Speedups going from one to four threads were very similar                       threads for collection takes time approximately proportional
(between 3.0 and 4.0) across the different frameworks and test                  to the number of running threads. Because more running
programs, so the accompanying chart focuses on differences in                   threads generate more garbage per unit time, overhead can
absolute performance. However, because the multithreading                       climb approximately quadratically with numbers of threads.
aspects of all these frameworks are all fast, most of these                     Even so, this significantly impacts performance only when
differences reflect differences in application−specific code                    the GC rate is relatively high to begin with. However, the
quality stemming from different compilers, optimization                         resulting issues invite further research and development of
switches, and configuration parameters. In fact, it is likely that              concurrent and parallel GC algorithms. The results presented
here additionally demonstrate the desirability of providing     7. REFERENCES
    tuning options and adaptive mechanisms on multiprocessor
                                                                    [1] Agesen, Ole, David Detlefs, and J. Eliot B. Moss. Garbage
    JVMs to scale memory to the number of active processors.
                                                                        Collection and Local Variable Type−Precision and
•   Most scalability issues reveal themselves only when                 Liveness in Java Virtual Machines. In Proceedings of 1998
    programs are run using more CPUs than are even available            ACM SIGPLAN Conference on Programming Language
    on most stock multiprocessors. FJTask (as well as other             Design and Implementation (PLDI), 1998.
    fork/join frameworks) appear to provide nearly ideal
    speedups for nearly any fork/join program on commonly           [2] Agesen, Ole, David Detlefs, Alex Garthwaite, Ross
    available 2−way, 4−way, and 8−way SMP machines. The                 Knippel, Y.S. Ramakrishna, and Derek White. An Efficient
    present paper appears to be the first reporting systematic          Meta−lock for Implementing Ubiquitous Synchronization.
    results for any fork/join framework designed for stock              In Proceedings of OOPSLA ’99, ACM, 1999.
    multiprocessors run on more than 16 CPUs. Further               [3] Arora, Nimar, Robert D. Blumofe, and C. Greg Plaxton.
    measurements are needed to see if the patterns of results           Thread Scheduling for Multiprogrammed Multiprocessors.
    seen here hold in other frameworks as well.                         In Proceedings of the Tenth Annual ACM Symposium on
•   Characteristics of application programs (including memory           Parallel Algorithms and Architectures (SPAA), Puerto
    locality, task locality, use of global synchronization) often       Vallarta, Mexico, June 28 − July 2, 1998.
    have more bearing on both scalability and absolute              [4] Blumofe, Robert D. and Dionisios Papadopoulos. Hood: A
    performance than do characteristics of the framework, JVM           User−Level Threads Library for Multiprogrammed
    or underlying OS. For example, informal tests showed that           Multiprocessors. Technical Report, University of Texas at
    the careful avoidance of synchronization in deques                  Austin, 1999.
    (discussed in section 3.1) has essentially no impact on
    programs with relatively low task generation rates such as      [5] Frigo, Matteo, Charles Leiserson, and Keith Randall. The
    LU. However, the focus on keeping task management                   Implementation of the Cilk−5 Multithreaded Language. In
    overhead to a minimum widens the range of applicability             Proceedings of 1998 ACM SIGPLAN Conference on
    and utility of the framework and the associated design and          Programming Language Design and Implementation
    programming techniques.                                             (PLDI), 1998.
In addition to incremental improvements, future work on this        [6] Gosling, James, Bill Joy, and Guy Steele. The Java
framework may include construction of useful applications (as           Language Specification, Addison−Wesley, 1996.
opposed to demos and tests), subsequent evaluations under           [7] Lea, Doug. Concurrent Programming in Java, second
production program loads, measurements on different JVMs,               edition, Addison−Wesley, 1999.
and development of extensions geared for use with clusters of
multiprocessors.                                                    [8] Lowenthal, David K., Vincent W. Freeh, and Gregory R.
                                                                        Andrews. Efficient Fine−Grain Parallelism on Shared−
6. ACKNOWLEDGMENTS                                                      Memory Machines. Concurrency−Practice and Experience,
This work was supported in part by a collaborative research             10,3:157−173, 1998.
grant from Sun Labs. Thanks to Ole Agesen, Dave Detlefs,            [9] Simpson, David, and F. Warren Burton. Space efficient
Christine Flood, Alex Garthwaite, and Steve Heller of the Sun           execution of deterministic parallel programs. IEEE
Labs Java Topics Group for advice, help, and comments. David            Transactions on Software Engineering, December, 1999.
Holmes, Ole Agesen, Keith Randall, Kenjiro Taura, and the
anonymous referees provided useful comments on drafts of this       [10] Taura, Kenjiro, Kunio Tabata, and Akinori Yonezawa.
paper. Bill Pugh pointed out the read−after−write limitations of        "Stackthreads/MP: Integrating Futures into Calling
JVMs discussed in section 3.1. Very special thanks to Dave Dice         Standards." In Proceedings of ACM SIGPLAN Symposium
for reserving time and performing test runs on the 30−way               on Principles & Practice of Parallel Programming
Enterprise.                                                             (PPoPP), 1999.

Contenu connexe

Tendances

A DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel ArchitecturesA DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel Architectures
Jason Hearne-McGuiness
 
Aspect Oriented Programming
Aspect Oriented ProgrammingAspect Oriented Programming
Aspect Oriented Programming
mukhtarhudaya
 
LSC Revisited - From Scenarios to Distributed Components
LSC Revisited - From Scenarios to Distributed ComponentsLSC Revisited - From Scenarios to Distributed Components
LSC Revisited - From Scenarios to Distributed Components
Dirk Fahland
 
Computer Architecture: A quantitative approach - Cap4 - Section 8
Computer Architecture: A quantitative approach - Cap4 - Section 8Computer Architecture: A quantitative approach - Cap4 - Section 8
Computer Architecture: A quantitative approach - Cap4 - Section 8
Marcelo Arbore
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NET
SANKARSAN BOSE
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5G
Ericsson
 

Tendances (18)

A DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel ArchitecturesA DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel Architectures
 
Aspect Oriented Programming
Aspect Oriented ProgrammingAspect Oriented Programming
Aspect Oriented Programming
 
Modular Mathematical Modelling of Biological Systems
Modular Mathematical Modelling of Biological SystemsModular Mathematical Modelling of Biological Systems
Modular Mathematical Modelling of Biological Systems
 
Java Concurrency Quick Guide
Java Concurrency Quick GuideJava Concurrency Quick Guide
Java Concurrency Quick Guide
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
 
Transactional Roll-backs and upgrades [preview]
Transactional Roll-backs and upgrades [preview]Transactional Roll-backs and upgrades [preview]
Transactional Roll-backs and upgrades [preview]
 
LSC Revisited - From Scenarios to Distributed Components
LSC Revisited - From Scenarios to Distributed ComponentsLSC Revisited - From Scenarios to Distributed Components
LSC Revisited - From Scenarios to Distributed Components
 
Computer Architecture: A quantitative approach - Cap4 - Section 8
Computer Architecture: A quantitative approach - Cap4 - Section 8Computer Architecture: A quantitative approach - Cap4 - Section 8
Computer Architecture: A quantitative approach - Cap4 - Section 8
 
ASM Core API - methods (part 1)
ASM Core API - methods (part 1)ASM Core API - methods (part 1)
ASM Core API - methods (part 1)
 
shashank_micro92_00697015
shashank_micro92_00697015shashank_micro92_00697015
shashank_micro92_00697015
 
BENEVOL'11 - Reverse Engineering Architectural Feature Models
BENEVOL'11 - Reverse Engineering Architectural Feature ModelsBENEVOL'11 - Reverse Engineering Architectural Feature Models
BENEVOL'11 - Reverse Engineering Architectural Feature Models
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
Parallel Programming in .NET
Parallel Programming in .NETParallel Programming in .NET
Parallel Programming in .NET
 
Async and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NLAsync and parallel patterns and application design - TechDays2013 NL
Async and parallel patterns and application design - TechDays2013 NL
 
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5G
 
Chapter6 threads
Chapter6 threadsChapter6 threads
Chapter6 threads
 

En vedette

Concurrency in java
Concurrency in javaConcurrency in java
Concurrency in java
Abhra Basak
 
Coordinating non blocking io melb-clj
Coordinating non blocking io melb-cljCoordinating non blocking io melb-clj
Coordinating non blocking io melb-clj
Logan Campbell
 
Through the JMX Window
Through the JMX WindowThrough the JMX Window
Through the JMX Window
C2B2 Consulting
 
Tutorial4 Threads
Tutorial4  ThreadsTutorial4  Threads
Tutorial4 Threads
tech2click
 
Bigtable
BigtableBigtable
Bigtable
nextlib
 
Practiced agile developer with tdd & bdd
Practiced agile developer with tdd & bdd Practiced agile developer with tdd & bdd
Practiced agile developer with tdd & bdd
Sakares Saengkaew
 

En vedette (20)

Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
 
What Your Jvm Has Been Trying To Tell You
What Your Jvm Has Been Trying To Tell YouWhat Your Jvm Has Been Trying To Tell You
What Your Jvm Has Been Trying To Tell You
 
Many Cores Java - Session One: Threads and Threads
Many Cores Java - Session One: Threads and ThreadsMany Cores Java - Session One: Threads and Threads
Many Cores Java - Session One: Threads and Threads
 
Closures for Java
Closures for JavaClosures for Java
Closures for Java
 
Concurrency in java
Concurrency in javaConcurrency in java
Concurrency in java
 
Correct and efficient synchronization of java thread
Correct and efficient synchronization of java threadCorrect and efficient synchronization of java thread
Correct and efficient synchronization of java thread
 
Coordinating non blocking io melb-clj
Coordinating non blocking io melb-cljCoordinating non blocking io melb-clj
Coordinating non blocking io melb-clj
 
Through the JMX Window
Through the JMX WindowThrough the JMX Window
Through the JMX Window
 
#JavaOne What's in an object?
#JavaOne What's in an object?#JavaOne What's in an object?
#JavaOne What's in an object?
 
Reactive Principles and Microservices
Reactive Principles and MicroservicesReactive Principles and Microservices
Reactive Principles and Microservices
 
Lecture on Java Concurrency Day 3 on Feb 11, 2009.
Lecture on Java Concurrency Day 3 on Feb 11, 2009.Lecture on Java Concurrency Day 3 on Feb 11, 2009.
Lecture on Java Concurrency Day 3 on Feb 11, 2009.
 
Java Memory Consistency Model - concepts and context
Java Memory Consistency Model - concepts and contextJava Memory Consistency Model - concepts and context
Java Memory Consistency Model - concepts and context
 
JVM Memory Management Details
JVM Memory Management DetailsJVM Memory Management Details
JVM Memory Management Details
 
IT MATCH: Vastgoedfinanciering voor zelfstandigen / freelancers in België
IT MATCH: Vastgoedfinanciering voor zelfstandigen / freelancers in BelgiëIT MATCH: Vastgoedfinanciering voor zelfstandigen / freelancers in België
IT MATCH: Vastgoedfinanciering voor zelfstandigen / freelancers in België
 
Tutorial4 Threads
Tutorial4  ThreadsTutorial4  Threads
Tutorial4 Threads
 
JAVA NIO
JAVA NIOJAVA NIO
JAVA NIO
 
Bigtable
BigtableBigtable
Bigtable
 
Threads in Java
Threads in JavaThreads in Java
Threads in Java
 
Practiced agile developer with tdd & bdd
Practiced agile developer with tdd & bdd Practiced agile developer with tdd & bdd
Practiced agile developer with tdd & bdd
 
Ch4 Threads
Ch4 ThreadsCh4 Threads
Ch4 Threads
 

Similaire à A Java Fork_Join Framework

Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Baruch Sadogursky
 
Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
IndicThreads
 
High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devices
Mr. Chanuwan
 
Raminder kaur presentation_two
Raminder kaur presentation_twoRaminder kaur presentation_two
Raminder kaur presentation_two
ramikaurraminder
 
Distributed Algorithms
Distributed AlgorithmsDistributed Algorithms
Distributed Algorithms
913245857
 

Similaire à A Java Fork_Join Framework (20)

Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptx
 
Wiki 2
Wiki 2Wiki 2
Wiki 2
 
Cc module 3.pptx
Cc module 3.pptxCc module 3.pptx
Cc module 3.pptx
 
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
Breaking The Clustering Limits @ AlphaCSP JavaEdge 2007
 
J2EE Batch Processing
J2EE Batch ProcessingJ2EE Batch Processing
J2EE Batch Processing
 
J threads-pdf
J threads-pdfJ threads-pdf
J threads-pdf
 
Cloud computing and its applications.pptx
Cloud computing and its applications.pptxCloud computing and its applications.pptx
Cloud computing and its applications.pptx
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
LOCK-FREE PARALLEL ACCESS COLLECTIONS
LOCK-FREE PARALLEL ACCESS COLLECTIONSLOCK-FREE PARALLEL ACCESS COLLECTIONS
LOCK-FREE PARALLEL ACCESS COLLECTIONS
 
Lock free parallel access collections
Lock free parallel access collectionsLock free parallel access collections
Lock free parallel access collections
 
Scala a case4
Scala a case4Scala a case4
Scala a case4
 
Optimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardwareOptimizing your java applications for multi core hardware
Optimizing your java applications for multi core hardware
 
Java Performance, Threading and Concurrent Data Structures
Java Performance, Threading and Concurrent Data StructuresJava Performance, Threading and Concurrent Data Structures
Java Performance, Threading and Concurrent Data Structures
 
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
BARRACUDA, AN OPEN SOURCE FRAMEWORK FOR PARALLELIZING DIVIDE AND CONQUER ALGO...
 
jvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrencyjvm/java - towards lock-free concurrency
jvm/java - towards lock-free concurrency
 
High level programming of embedded hard real-time devices
High level programming of embedded hard real-time devicesHigh level programming of embedded hard real-time devices
High level programming of embedded hard real-time devices
 
Multithreading
MultithreadingMultithreading
Multithreading
 
Raminder kaur presentation_two
Raminder kaur presentation_twoRaminder kaur presentation_two
Raminder kaur presentation_two
 
Threads
ThreadsThreads
Threads
 
Distributed Algorithms
Distributed AlgorithmsDistributed Algorithms
Distributed Algorithms
 

Plus de Hiroshi Ono

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
Hiroshi Ono
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
Hiroshi Ono
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
Hiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
Hiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
Hiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
Hiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
Hiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
Hiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
Hiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
Hiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
Hiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
Hiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
Hiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
Hiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
Hiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
Hiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
Hiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
Hiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
Hiroshi Ono
 

Plus de Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

A Java Fork_Join Framework

  • 1. A Java Fork/Join Framework Doug Lea State University of New York at Oswego Oswego NY 13126 315−341−2688 dl@cs.oswego.edu ABSTRACT Some associated programming techniques and examples are discussed in section 4.4 of Concurrent Programming in Java, This paper describes the design, implementation, and second edition [7]. This paper discusses the design (section 2), performance of a Java framework for supporting a style of implementation (section 3), and performance (section 4) of parallel programming in which problems are solved by FJTask, a JavaTM framework that supports this programming (recursively) splitting them into subtasks that are solved in style. FJTask is available as part of the util.concurrent parallel, waiting for them to complete, and then composing package from http://gee.cs.oswego.edu. results. The general design is a variant of the work−stealing framework devised for Cilk. The main implementation 2. DESIGN techniques surround efficient construction and management of Fork/join programs can be run using any framework that tasks queues and worker threads. The measured performance supports construction of subtasks that are executed in parallel, shows good parallel speedups for most programs, but also along with a mechanism for waiting out their completion. suggests possible improvements. However, the java.lang.Thread class (as well as POSIX pthreads, upon which Java threads are often based) are suboptimal vehicles for supporting fork/join programs: 1. INTRODUCTION • Fork/join tasks have simple and regular synchronization and Fork/Join parallelism is among the simplest and most effective management requirements. The computation graphs design techniques for obtaining good parallel performance. produced by fork/join tasks admit much more efficient Fork/join algorithms are parallel versions of familiar divide− scheduling tactics than needed for general−purpose threads. and−conquer algorithms, taking the typical form: For example, fork/join tasks never need to block except to wait out subtasks. Thus, the overhead and bookkeeping Result solve(Problem problem) { necessary for tracking blocked general−purpose threads are wasted. if (problem is small) directly solve problem • Given reasonable base task granularities, the cost of constructing and managing a thread can be greater than the else { computation time of the task itself. While granularities can split problem into independent parts and should be subject to tuning when running programs on fork new subtasks to solve each part particular platforms, the extremely coarse granularities join all subtasks necessary to outweigh thread overhead limits opportunities for exploiting parallelism. compose result from subresults } In short, standard thread frameworks are just too heavy to support most fork/join programs. But since threads form the } basis of many other styles of concurrent and parallel programming as well, it is impossible (or at least impractical) to The fork operation starts a new parallel fork/join subtask. The remove overhead or tune scheduling of threads themselves just join operation causes the current task not to proceed until the for the sake of supporting this style. forked subtask has completed. Fork/join algorithms, like other While the ideas surely have a longer heritage, the first published divide−and−conquer algorithms, are nearly always recursive, framework to offer systematic solutions to these problems was repeatedly splitting subtasks until they are small enough to solve Cilk[5]. Cilk and other lightweight executable frameworks layer using simple, short sequential methods. special−purpose fork/join support on top of an operating system’s basic thread or process mechanisms. This tactic applies equally well to Java, even though Java threads are in turn layered onto lower−level OS capabilities. The main advantage of creating such a Java lightweight execution framework is to enable fork/join programs to be written in a more portable fashion and to run on the wide range of systems supporting JVMs. The FJTask framework is based on a variant of the design used in Cilk. Other variants are seen in Hood[4], Filaments[8], stackthreads[10], and related systems relying on lightweight
  • 2. class ATask As the standard example of how this framework appears to the extends FJTask { programmer, here is a class computing the Fibonacci function: public void run() { class Fib extends FJTask { split... Task static final int threshold = 13; fork... volatile int number; // arg/result join... compose... Fib(int n) { number = n; } } int getAnswer() { } Task if (!isDone()) throw new IllegalStateException(); return number; } public void run() { Task int n = number; if (n <= threshold) // granularity ctl number = seqFib(n); else { Fib f1 = new Fib(n − 1); Fib f2 = new Fib(n − 2); Worker Worker coInvoke(f1, f2); number = f1.number + f2.number; executable tasks. All of these frameworks map tasks to threads } in about the same way that operating systems map threads to } CPUs, but exploit the simplicity, regularity, and constraints of public static void main(String[] args) { fork/join programs in performing the mapping. While all of try { these frameworks can accommodate (to varying extents) parallel int groupSize = 2; // for example programs written in different styles, they optimize for fork/join FJTaskRunnerGroup group = designs: new FJTaskRunnerGroup(groupSize); Fib f = new Fib(35); // for example • A pool of worker threads is established. Each worker thread group.invoke(f); is a standard ("heavy") thread (here, an instance of Thread int result = f.getAnswer(); subclass FJTaskRunner) that processes tasks held in System.out.println("Answer: " + queues. Normally, there are as many worker threads as there result); } are CPUs on a system. In native frameworks such as Cilk, catch (InterruptedException ex) {} these are mapped to kernel threads or lightweight processes, } and in turn to CPUs. In Java, the JVM and OS must be trusted to map these threads to CPUs. However, this is a int seqFib(int n) { very simple task for the OS, since these threads are if (n <= 1) return n; computationally intensive. Any reasonable mapping strategy else return seqFib(n−1) + seqFib(n−2); } will map these threads to different CPUs. } • All fork/join tasks are instances of a lightweight executable This version runs at least thirty times faster than an equivalent class, not instances of threads. In Java, independently program in which each new task is run in a new executable tasks must implement interface Runnable and java.lang.Thread on the platforms described in section 4. define a run method. In the FJTask framework, these It does so while maintaining the intrinsic portability of tasks subclass FJTask rather than subclassing Thread, multithreaded Java programs. There are only two tuning both of which implement Runnable. (In both cases, a class parameters of typical interest to programmers: can alternatively implement Runnable and then supply instances to be run within executing tasks or threads. • The number of worker threads to construct, which should Because tasks operate under restricted rules supported by ordinarily correspond to the number of CPUs available on a FJTask methods, it is much more convenient to subclass platform (or fewer, to reserve processing for other unrelated FJTask, so as to be able to directly invoke them.) purposes, or occasionally more, to soak up non− computational slack). • A special purpose queuing and scheduling discipline is used to manage tasks and execute them via the worker threads • A granularity parameter that represents the point at which (see section 2.1). These mechanics are triggered by those the overhead of generating tasks outweighs potential few methods provided in the task class: principally fork, parallelism benefits. This parameter is typically more join, isDone (a completion status indicator), and some algorithm−dependent than platform−dependent. It is convenience methods such as coInvoke that forks then normally possible to settle on a threshold that achieves good joins two or more tasks. results when run on a uniprocessor, yet still exploits multiple CPUs when they are present. As a side−benefit, this • A simple control and management facility (here, approach meshes well with JVM dynamic compilation FJTaskRunnerGroup) sets up worker pools and initiates mechanics that optimize small methods better than execution of a given fork/join task when invoked from a monolithic procedures. This, along with data locality normal thread (such as the one performing main in a Java advantages, can cause fork/join algorithms to outperform program). other kinds of algorithms even on uniprocessors.
  • 3. 2.1 Work−Stealing It reduces contention by having stealers operate on the opposite The heart of a fork/join framework lies in its lightweight side of the deque as owners. It also exploits the property of scheduling mechanics. FJTask adapts the basic tactics recursive divide−and−conquer algorithms of generating "large" pioneered in the Cilk work−stealing scheduler: tasks early. Thus, an older stolen task is likely to provide a larger unit of work, leading to further recursive decompositions • Each worker thread maintains runnable tasks in its own by the stealing thread. scheduling queue. As one consequence of these rules, programs that employ • Queues are maintained as double−ended queues (i.e., relatively small task granularities for base actions tend to run deques, usually pronounced "decks"), supporting both LIFO faster than those that only use coarse−grained partitioning or push and pop operations, as well as a FIFO take those that do not use recursive decomposition. Even though operation. relatively few tasks are stolen in most fork/join programs, • Subtasks generated in tasks run by a given worker thread creating many fine−grained tasks means that a task is likely to are pushed onto that workers own deque. be available whenever a worker thread is ready to run it. • Worker threads process their own deques in LIFO 3. IMPLEMENTATION (youngest−first) order, by popping tasks. The framework has been implemented in about 800 lines of pure • When a worker thread has no local tasks to run, it attempts Java code, mainly in class FJTaskRunner, a subclass of to take ("steal") a task from another randomly chosen java.lang.Thread. FJTasks themselves maintain only a worker, using a FIFO (oldest first) rule. boolean completion status, and perform all other operations via delegation to their current worker threads. The • When a worker thread encounters a join operation, it FJTaskRunnerGroup class serves to construct worker processes other tasks, if available, until the target task is threads, maintains some shared state (for example, the identities noticed to have completed (via isDone). All tasks of all worker threads, needed for steal operations), and helps otherwise run to completion without blocking. coordinate startup and shutdown. • When a worker thread has no work and fails to steal any More detailed implementation documentation is available inside from others, it backs off (via yields, sleeps, and/or priority the util.concurrent package. This section discusses only adjustment − see section 3) and tries again later unless all two sets of problems and solutions encountered when workers are known to be similarly idle, in which case they implementing this framework: Supporting efficient deque all block until another task is invoked from top−level. operations (push, pop, and take), and managing the steal protocol by which threads obtain new work. 3.1 Deques To enable efficient and scalable execution, task management Stealing must be made as fast as possible. Creating, pushing, and later popping (or, much less frequently, taking) tasks are analogs of procedure call overhead in sequential programs. Lower overhead enables programmers to adopt smaller task granularities, and in turn better exploit parallelism. Task allocation itself is the responsibility of the JVM. Java garbage collection relieves us of needing to create a special− purpose memory allocator to maintain tasks. This substantially Pushing reduces the complexity and lines of code needed to implement FJTasks compared to similar frameworks in other languages. The basic structure of the deque employs the common scheme Deque of using a single (although resizable) array per deque, along Base Top with two indices: The top index acts just like an array−based stack pointer, changing upon push and pop. The base index is modified only by take. Since FJTaskRunner operations are all intimately tied to the concrete details of the deque (for example, fork simply invokes push), this data structure is directly embedded in the class rather than being defined as a Popping separate component. Because the deque array is accessed by multiple threads, sometimes without full synchronization (see below), yet individual Java array elements cannot be declared as volatile, each array element is actually a fixed reference to a little forwarding object maintaining a single volatile reference. This decision was made originally to ensure As discussed in more detail in [5], the use of LIFO rules for conformance with Java memory rules, but the level of each thread processing its own tasks, but FIFO rules for stealing indirection that it entails turns out to improve performance on other tasks is optimal for a wide class of recursive fork/join tested platforms, presumably by reducing cache contention due designs. Less formally, the scheme offers two basic advantages:
  • 4. to accesses of nearby elements, which are spread out a bit more of a main task, upon its completion, and around global full−stop in memory due to the indirection. synchronization points employed in some fork/join algorithms. The main challenges in deque implementation surround The main issue here is what to do when a worker thread has no synchronization and its avoidance. Even on JVMs with local tasks and cannot steal one from any other thread. If the optimized synchronization facilities[2], the need to obtain locks program is running on a dedicated multiprocessor, then one for every push and pop operation becomes a bottleneck. could make the case for relying on hard busy−wait spins looping However, adaptations of tactics taken in Cilk[5] provide a to try to steal work. However, even here, attempted steals solution based on the following observations: increase contention, which can slow down even those threads • The push and pop operations are only invoked by owner that are not idle (due to locking protocols in section 3.1). threads. Additionally, in more typical usage contexts of this framework, the operating system should somehow be convinced to try to run • Access to the take operation can easily be confined to one other unrelated runnable processes or threads. stealing thread at a time via an entry lock on take. (This deque lock also serves to disable take operations when The tools for achieving this in Java are weak, have no necessary.) Thus, interference control is reduced to a two− guarantees (see [6, 7]), but usually appear to be acceptable in party synchronization problem. practice (as do similar techniques described for Hood[3]). A thread that fails to obtain work from any other thread lowers its • The pop and take operations can only interfere if the priority before attempting additional steals, performs deque is about to become empty. Otherwise they are Thread.yield between attempts, and registers itself as guaranteed to operate on disjoint elements of the array. inactive in its FJTaskRunnerGroup. If all others become Defining the top and base indices as volatile ensures that inactive, they all block waiting for additional main tasks. a pop and take can proceed without locking if the deque is Otherwise, after a given number of additional spins, threads sure to have more than one element. This is done via a Dekker− enter a sleeping phase, where they sleep (for up to 100ms) rather like algorithm in which push pre−decrements top: than yield between steal attempts. These imposed sleeps can if (−−top >= base) ... cause artificial lags in programs that take a long time to split and take pre−increments base: their tasks. But this appears to be the best general−purpose if (++base < top) ... compromise. Future versions of the framework may supply In each case they must then check to see if this could have additional control methods so that programmers can override caused the deque to become empty by comparing the two defaults when they impact performance. indices. An asymmetric rule is used upon potential conflict: pop rechecks state and tries to continue after obtaining the deque 4. PERFORMANCE lock (the same one as held by take), backing off only if the In these times of nearly continuous performance improvements deque is truly empty. A take operation instead just backs off of compilers and JVMs, performance measurements are only of immediately, typically then trying to steal from a different transient value. However, the metrics reported in this section victim. This asymmetry represents the only significant departure reveal some basic properties of the framework. from the otherwise similar THE protocol used in Cilk. A collection of seven fork/join test programs are briefly The use of volatile indices also enables the push operation described in the following table. These programs are adaptations to proceed without synchronization unless the deque array is of those available as demos in the util.concurrent about to overflow, in which case it must first obtain the deque package. They were selected to show some diversity in the kinds lock to resize the array. Otherwise, simply ensuring that top is of problems that can be run within this framework, as well as to updated only after the deque array slot is filled in suppresses obtain results for some common parallel test programs. interference by any take. Program Description Subsequent to initial implementation, it was discovered that Fib The Fibonnaci program shown in section 2, several JVMs do not conform to the Java Memory Model [6] run with argument 47 and granularity rule requiring accurate reads after writes of pairs of volatile threshold 13. fields. As a workaround, the criterion for pop to retry under lock was adjusted to trigger if there appear to be two or fewer Integrate Recursive Gaussian quadrature of elements, and the take operation added a secondary lock to 2⋅i 1 ⋅x 2⋅i 1 summing over odd ensure a memory barrier. This suffices as long as at most one index change is missed by the owner thread (which holds here values of i from 1 to 5 and integrating from for platforms that otherwise maintain proper memory order −47 to 48. when reading volatile fields), and causes only a tiny Micro Best−move finder for a board game, run slowdown in performance. with a lookahead of 4 moves. 3.2 Stealing and Idling Sort Merge/Quick sort (based on an algorithm Worker threads in work−stealing frameworks know nothing from Cilk) of 100 million numbers. about the synchronization demands of the programs they are MM Multiply 2048 X 2048 matrices of doubles. running. They simply generate, push, pop, take, manage the status of, and execute tasks. The simplicity of this scheme leads LU Decompose 4096 X 4096 matrix of doubles. to efficient execution when there is plenty of work for all Jacobi Iterative mesh relaxation with barriers: 100 threads. However, this streamlining comes at the price of relying steps of nearest neighbor averaging on a on heuristics when there is not enough work; i.e., during startup 4096 X 4096 matrix of doubles.
  • 5. For the main tests, programs were run on a 30−CPU Sun However, in general, the results show that increasing the number Enterprise 10000 running the Solaris Production 1.2 JVM (an of threads reliably increased the number of CPUs employed. early version of the 1.2.2_05 release) on Solaris 7. The JVM Speedups are reported as Timen / Time1. The best overall was run with environment parameters selecting "bound threads" speedup was seen for the integration program (speedup of 28.2 for thread mappings, and memory parameters discussed in for 30 threads). The worst was for the LU decomposition section 4.2. A few additional measurements reported below were program (speedup of 15.35 for 30 threads). run on a 4−CPU Sun Enterprise 450. Another way of measuring scalability is in terms of task rates, the average time taken to execute a single task (which may be Times either a recursive or a leaf step). The following figure shows 700 data from a single instrumented run capturing task rates. Ideally, the numbers of tasks processed per unit time per thread should 600 be constant. The fact that they generally slightly decrease with numbers of threads indicates some scalability limitations. Note 500 the fairly wide difference in task rates, reflecting differences in task granularities. The smallest tasks sizes are seen in Fib, Seconds 400 which even with a threshold setting of 13, generated and executed a total of 2.8 million tasks per second when run with 300 30 threads. 200 Task rates 120000 100 100000 Tasks/sec per thread 0 Fib Micro Integ MM LU Jaco Sort bi 80000 Programs were run with very large input parameters in order to minimize timer granularity and JVM warm−up artifacts. Some 60000 other warm−up effects were avoided by running an initial problem set before starting timers. Most data are medians of 40000 three runs, but some (including most follow−up measurements in sections 4.2−4.4) only reflect single runs so are a bit noisy. 20000 4.1 Speedups Scalability measurements were obtained by running the same 0 problem set using worker thread groups of size 1 ... 30. There is Fib Micr Inte− MM LU Jaco Sort o grat bi Four factors appear to account for tail−offs in speedups in those Speedups programs that did not scale linearly. Three of them are common 30 to any parallel framework, but we’ll start with one unique to FJTask (as opposed to Cilk etc), GC effects. 25 Ideal 4.2 Garbage Collection Fib Speedups 20 In many ways, modern GC facilities are perfect matches to Micro fork/join frameworks: These programs can generate enormous 15 numbers of tasks, nearly all of which quickly turn into garbage Integ after they are executed. At any given time, deterministic 10 MM fork/join programs require only at most p (where p is the LU number of threads) times the bookkeeping space that would be 5 Jacobi required for sequential versions of these programs[10]. Sort Generational semispace copying collectors (including the one in 0 the JVM used for these measurements − see [1]) cope with this 1234567891 1 11 1 1 11 1 1 22 2 22 2 2 22 2 3 well because they only traverse and copy non−garbage cells. By 0 1 23 4 5 67 8 9 01 2 34 5 6 78 9 0 doing so, they evade one of the trickiest problems in manual parallel memory management, tracing memory that is allocated Threads by one thread but then used by another. The garbage collector is no way to know whether the JVM always mapped each thread to oblivious to the origin of memory, so need not deal with such a different CPU when available. While there is no evidence issues. about this either way, it is possible that the lags to map new As a simple indicator of the general superiority of generational threads to CPUs increased with numbers of threads and/or copying collection, a four−thread run of Fib that executes in 5.1 varied systematically across the different test programs. seconds when using the memory settings used in the main experiments reported here runs in 9.1 seconds if the generational
  • 6. GC Effects: Fib Memory bandwidth effects: Sorting 30 30 25 25 20 20 Speedup Speedup Ideal Ideal 15 15 Bytes Fib−64m Shorts 10 Fib−4m 10 Ints Fib−scaled 5 5 Longs 0 0 1 3 5 7 9 1 1 1 1 1 2 2 2 2 2 1234567891 11 1 1 11 1 1 12 2 22 2 2 22 2 2 3 1 3 5 7 9 1 3 5 7 9 0 12 3 4 56 7 8 90 1 23 4 5 67 8 9 0 Threads Threads copying phase is disabled on this JVM (in which case this JVM 4.4 Task Synchronization relies entirely on mark−sweep). As discussed in section 3.2, work−stealing frameworks However, these GC mechanisms turn into scaling problems sometimes encounter problems dealing with frequent global when memory is allocated at such a high rate that threads must synchronization of tasks. The worker threads continue to poll for be stopped nearly continuously to perform collections. The more work even though there isn’t any, thus generating following figure shows the difference in speedups across three contention, and, in FJTask, sometimes even forcing threads memory settings (this JVM supports optional arguments to set into idle sleeps. memory parameters): with the default 4 megabyte semispaces, The Jacobi program illustrates some of the consequent issues. with 64 megabyte semispaces, and with the amount of memory This program performs 100 steps, where in each step, all cells scaled to the number of threads (2 + 2p megabytes). With are updated according to a simple nearest neighbor averaging smaller semispaces, the overhead of stopping threads and formula. A global (tree−based) barrier separates each step. To collecting garbage starts to impact scaling as the garbage− determine the magnitude of synchronization effects, a version of generation rate climbs due to additional threads. In light of these results, 64m semispaces were used for all other test runs. A better policy would have been to scale the amount Sync Effects: Jacobi of memory to the number of threads in each test. (As seen in the 30 figure, this would have caused all speedups to appear slightly more linear.). Alternatively, or in addition, program−specific 25 task granularity thresholds could be increased proportionally with numbers of threads. 20 Speedup Ideal 4.3 Memory Locality and Bandwidth 15 1step/sync Four of the test programs create and operate on very large 10 10steps/− shared arrays or matrices: Sorting numbers, and multiplying, sync decomposing or performing relaxation on matrices. Of these, sorting is probably the most sensitive to the consequences of 5 having to move data around processors, and thus to the aggregate memory bandwidth of the system as a whole. To help 0 determine the nature of these effects, the Sort program was 1 3 5 7 9 1 1 1 1 1 2 2 2 2 2 recast into four versions, that sorted bytes, shorts, ints, and 1 3 5 7 9 1 3 5 7 9 longs respectively. Each version sorted data only in the range of 0...255, to ensure that they were otherwise equivalent. The wider Threads the data elements, the more memory traffic. this program was written to synchronize only after every 10 The results show that increased memory traffic leads to poorer steps. The scaling differences show the impact of the current speedups, although they do not provide definitive evidence that policy, and indicate the need to include additional methods in this is the only cause of tail−offs. future versions of this framework to allow programmers to override default parameters and policies. (Note however that Element width also impacts absolute performance. For example, this graph probably slightly exaggerates pure synchronization using only one thread, sorting bytes took 122.5 seconds, while effects, since the 10−step version is also likely to maintain sorting longs took 242.4 seconds. somewhat greater task locality.)
  • 7. 4.5 Task Locality FJTask, like other fork/join frameworks, is optimized for the Other Frameworks case where worker threads locally consume the vast majority of 8 the tasks they create. When this does not occur, performance can suffer, for two reasons: 7 • Stealing tasks encounters much more overhead than does 6 popping tasks. FJTask 5 Cilk • Seconds In most programs in which tasks operate on shared data, running your own subdivided task is likely to maintain better Hood 4 Stack− locality of data access. threads 3 Locality effects Filaments 0.225 2 0.2 1 Proportion stolen 0.175 Fib 0 0.15 Micro Fib MM Sort LU Intg Jac 0.125 Integrate different choices among these than used here could produce 0.1 MM nearly any relative performance ranking among these 0.075 LU frameworks in many high−performance applications. 0.05 Jacobi FJTask generally performs worse (on this JVM) on programs 0.025 Sort that mainly do floating point computations on arrays and 0 matrices. Even though JVMs are improving, they are still not always competitive with powerful back−end optimizers used in 1 3 5 7 9 1 1 1 1 1 2 2 2 2 2 C and C++ programs. Although not shown in this chart, 1 3 5 7 9 1 3 5 7 9 FJTask versions of all programs ran faster than versions of Threads programs in these other frameworks when they were compiled without optimization enabled, and some informal tests suggest As seen in the figure, in most programs, the relative number of that most of the remaining differences are due to array bounds stolen tasks is at most a few percent. However, the LU and MM checks and related runtime obligations. These are, of course, programs generate larger imbalances in workloads (and thus issues and problems that are attracting much attention and effort relatively more stealing) as the number of threads increase. It is by JVM and compiler developers. The observed differences in possible that some algorithmic tuning of these programs could code quality for computationally intensive programs are likely reduce this effect, and in turn lead to better speedups. to diminish. 4.6 Comparisons to Other Frameworks 5. CONCLUSIONS It is impossible to perform definitive or even very meaningful This paper has demonstrated that it is possible to support measurements comparing programs across languages and portable, efficient, scalable parallel processing in pure Java, and frameworks. However, relative measurements can at least show to provide a convenient API for programmers who can exploit some of the basic advantages and limitations of the FJTask the framework by following only a few simple design rules and framework versus similar frameworks written in other languages design patterns (as presented and discussed in [7]). The (here, C and C++). The following chart compares performance observed performance characteristics of the sample programs of programs that were either based on or are similar to those measured here both provide further guidance for users of the supplied with the Cilk, Hood, Stackthreads, and/or Filaments framework, and suggest some areas of potential improvement to distributions. All of these were run with 4 threads on a 4−CPU the framework itself. Enterprise 450. To avoid reconfiguring other frameworks or their test programs, all tests were run with smaller problems sets Even though scalability results were shown only for a single than used above. All results represent the best of three runs, JVM, some of the main empirical findings should hold more using the compiler and run−time settings that appeared to generally: provide the fastest times. Fib was run without any granularity • While generational GC generally meshes well with threshold; i.e., an implicit threshold of 1. (The "prune" setting in parallelism, it can hinder scalability when garbage Filaments Fib was set to 1024, which causes it to behave more generation rates force very frequent collections. On this similarly to the other versions.) JVM, the underlying cause appears to be that stopping Speedups going from one to four threads were very similar threads for collection takes time approximately proportional (between 3.0 and 4.0) across the different frameworks and test to the number of running threads. Because more running programs, so the accompanying chart focuses on differences in threads generate more garbage per unit time, overhead can absolute performance. However, because the multithreading climb approximately quadratically with numbers of threads. aspects of all these frameworks are all fast, most of these Even so, this significantly impacts performance only when differences reflect differences in application−specific code the GC rate is relatively high to begin with. However, the quality stemming from different compilers, optimization resulting issues invite further research and development of switches, and configuration parameters. In fact, it is likely that concurrent and parallel GC algorithms. The results presented
  • 8. here additionally demonstrate the desirability of providing 7. REFERENCES tuning options and adaptive mechanisms on multiprocessor [1] Agesen, Ole, David Detlefs, and J. Eliot B. Moss. Garbage JVMs to scale memory to the number of active processors. Collection and Local Variable Type−Precision and • Most scalability issues reveal themselves only when Liveness in Java Virtual Machines. In Proceedings of 1998 programs are run using more CPUs than are even available ACM SIGPLAN Conference on Programming Language on most stock multiprocessors. FJTask (as well as other Design and Implementation (PLDI), 1998. fork/join frameworks) appear to provide nearly ideal speedups for nearly any fork/join program on commonly [2] Agesen, Ole, David Detlefs, Alex Garthwaite, Ross available 2−way, 4−way, and 8−way SMP machines. The Knippel, Y.S. Ramakrishna, and Derek White. An Efficient present paper appears to be the first reporting systematic Meta−lock for Implementing Ubiquitous Synchronization. results for any fork/join framework designed for stock In Proceedings of OOPSLA ’99, ACM, 1999. multiprocessors run on more than 16 CPUs. Further [3] Arora, Nimar, Robert D. Blumofe, and C. Greg Plaxton. measurements are needed to see if the patterns of results Thread Scheduling for Multiprogrammed Multiprocessors. seen here hold in other frameworks as well. In Proceedings of the Tenth Annual ACM Symposium on • Characteristics of application programs (including memory Parallel Algorithms and Architectures (SPAA), Puerto locality, task locality, use of global synchronization) often Vallarta, Mexico, June 28 − July 2, 1998. have more bearing on both scalability and absolute [4] Blumofe, Robert D. and Dionisios Papadopoulos. Hood: A performance than do characteristics of the framework, JVM User−Level Threads Library for Multiprogrammed or underlying OS. For example, informal tests showed that Multiprocessors. Technical Report, University of Texas at the careful avoidance of synchronization in deques Austin, 1999. (discussed in section 3.1) has essentially no impact on programs with relatively low task generation rates such as [5] Frigo, Matteo, Charles Leiserson, and Keith Randall. The LU. However, the focus on keeping task management Implementation of the Cilk−5 Multithreaded Language. In overhead to a minimum widens the range of applicability Proceedings of 1998 ACM SIGPLAN Conference on and utility of the framework and the associated design and Programming Language Design and Implementation programming techniques. (PLDI), 1998. In addition to incremental improvements, future work on this [6] Gosling, James, Bill Joy, and Guy Steele. The Java framework may include construction of useful applications (as Language Specification, Addison−Wesley, 1996. opposed to demos and tests), subsequent evaluations under [7] Lea, Doug. Concurrent Programming in Java, second production program loads, measurements on different JVMs, edition, Addison−Wesley, 1999. and development of extensions geared for use with clusters of multiprocessors. [8] Lowenthal, David K., Vincent W. Freeh, and Gregory R. Andrews. Efficient Fine−Grain Parallelism on Shared− 6. ACKNOWLEDGMENTS Memory Machines. Concurrency−Practice and Experience, This work was supported in part by a collaborative research 10,3:157−173, 1998. grant from Sun Labs. Thanks to Ole Agesen, Dave Detlefs, [9] Simpson, David, and F. Warren Burton. Space efficient Christine Flood, Alex Garthwaite, and Steve Heller of the Sun execution of deterministic parallel programs. IEEE Labs Java Topics Group for advice, help, and comments. David Transactions on Software Engineering, December, 1999. Holmes, Ole Agesen, Keith Randall, Kenjiro Taura, and the anonymous referees provided useful comments on drafts of this [10] Taura, Kenjiro, Kunio Tabata, and Akinori Yonezawa. paper. Bill Pugh pointed out the read−after−write limitations of "Stackthreads/MP: Integrating Futures into Calling JVMs discussed in section 3.1. Very special thanks to Dave Dice Standards." In Proceedings of ACM SIGPLAN Symposium for reserving time and performing test runs on the 30−way on Principles & Practice of Parallel Programming Enterprise. (PPoPP), 1999.