SlideShare a Scribd company logo
1 of 21
Download to read offline
C++ Data-flow Parallelism sounds great! But how
practical is it? Let’s see how well it works with the
             SPEC2006 benchmark suite
    (Yet Another!) Investigation into using Library-Based,
       Data-Parallelism Support in C++(03 & 11-ish).


               Jason Mc Guiness & Colin Egan

                  accuconf-12@hussar.demon.co.uk
                         c.egan@herts.ac.uk
                     http://libjmmcg.sf.net/


                       30th April 2012
Copyright © J.M.McGuiness, 2012.

  Sequence of Presentation.

               An introduction: why I am here.
               How do we manage all these threads:
                     libraries, languages and compilers.
               An outline of a data-flow library: Parallel Pixie Dust (PPD).
               Introduction to SPEC2006 followed by my methodology in
               applying data-flow to it.
               Leading on to an analysis of SPEC2006.
               How well did the coding go ... some simple examples &
               discussion around pitfalls.
                     If time & technology allow some real-live examples!
               Some results. Wow numbers! Yes really!
               Conclusion & Future directions.
Copyright © J.M.McGuiness, 2012.

  Introduction.

        Why yet another thread-library presentation?
               Because we still find it hard to write multi-threaded programs
               correctly.
                     According to programming folklore.
               We haven’t successfully replaced the von Neumann
               architecture:
                     Stored program architectures are still prevalent.
               The memory wall still affects us:
                     The CPU-instruction retirement rate, i.e. rate at which
                     programs require and generate data, exceeds the the memory
                     bandwidth - a by product of Moore’s Law.
                            Modern architectures add extra cores to CPUs, in this
                            instance, extra memory buses which feed into those cores.
Copyright © J.M.McGuiness, 2012.

  A Quick Review of Some Threading Models:

               Restrict ourselves to library-based techniques.
               Raw library calls.
               The “thread as an active class”.
               Co-routines.
               The “thread pool” containing those objects.
                     Producer-consumer models are a sub-set of this model.
               The issue: classes used to implement business logic also
               implement the operations to be run on the threads. This often
               means that the locking is intimately entangled with those
               objects, and possibly even the threading logic.
                     This makes it much harder to debug these applications. The
                     question of how to implement multi-threaded debuggers
                     correctly is still an open question.
Copyright © J.M.McGuiness, 2012.

  What is Data-flow and why is it so good?
        “Manchester Data-flow Machine”, developed in Manchester in 80s1 .

                 Not von Neumann architecture. No PC. Large number of
                 execution units & special memory (technically CAMs).
                         Compiler-aided:
                                Compiler automatically identifies & tags dependencies which
                                annotate the output binary.
                                Dependencies & related instructions loaded into CAM.
                                When all of the tags are satisfied for a sequence of
                                instructions, that is fired, i.e. submitted to an execution unit.
                         No explicit locks.
                                No race conditions nor deadlocks.
                                Considerable effort put into the data-flow compiler to generate
                                provably correct code.
        Is The Way Forward™.
             Why don’t we use it? Need to recompile source-code & buy
             “exotic” hardware.
                         Inertia, nih2 -ism. Companies hate to do this.
            1
                Stephen Furber is continuing the work there.
Copyright © J.M.McGuiness, 2012.

  Part 1: Design Features of PPD.

        The main design features of PPD are:
               It targets general purpose threading using a data-flow model of
               parallelism:
                     This type of scheduling may be viewed as dynamic scheduling
                     (run-time) as opposed to static scheduling (potentially
                     compile-time), where the operations are statically assigned to
                     execution pipelines, with relatively fixed timing constraints.
               DSEL implemented as futures and thread pools (of many
               different types using traits).
                     Can be used to implement a tree-like thread schedule.
                     “thread-as-an-active-class” exists.
                     Gives rise to important properties: efficient, dead-lock &
                     race-condition free schedules.
Copyright © J.M.McGuiness, 2012.

  Part 2: Design Features of PPD.

        Extensions to the basic DSEL have been added:
               Certain binary functors: each operand is executed in parallel.
               Adapters for the STL collections to assist with thread-safety.
                     Combined with thread pools allows replacement of the STL
                     algorithms.
               Optimisations including void return-type & GSS(k), or baker’s
               scheduling:
                     May reduce synchronisation costs on thread-limited systems or
                     pools in which the synchronisation costs are high.
               Amongst other influences, PPD was born out of discussions
               with Richard Harris and motivated by Kevlin Henney’s
               presentation to ACCU’04 regarding threads.
Copyright © J.M.McGuiness, 2012.

  The STL-style Parallel Algorithms.
        Analysis of SPEC2006, amongst other reasons lead me to
        implement in PPD:
          1. for_each()
          2. find_if() & find()
          3. count_if() & count()
          4. transform() (both overloads)
          5. copy()
          6. accumulate() (both overloads)
          7. fill_n() & fill()
          8. reverse()
          9. max_element() & min_element() (both overloads)
         10. merge() & sort() (both overloads)
                     Based upon Batcher’s bitonic sort, not Cole’s as recommended
                     by Knuth.
                     Work in progress: still getting the scalability of these right.
Copyright © J.M.McGuiness, 2012.

  Example using accumulate().
                     Listing 1: Accumulate with a Thread Pool and Future.
        t y p e d e f ppd : : t h r e a d _ p o o l <
                        p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e ,
                        g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading ,
                        pool_traits : : normal_fifo , std : less , 1
        > pool_type ;
        t y p e d e f ppd : : s a f e _ c o l l n <
                        v e c t o r <i n t >, l o c k _ t r a i t s : : c r i t i c a l _ s e c t i o n _ l o c k _ t y p e
        > vtr_colln_t ;
        t y p e d e f pool_type : : accumulate_t<
                        vtr_colln_t
        >:: e x e c u t i o n _ c o n t e x t e x e c u t i o n _ c o n t e x t ;
        vtr_colln_t v ;
        v . push_back ( 1 ) ; v . push_back ( 2 ) ;
        execution_context context (
                        p o o l . a c c u m u l a t e ( v , 1 , s t d : : p l u s <v t r _ c o l l n _ t : : v a l u e _ t y p e > ( ) )
        );
        a s s e r t ( ∗ c o n t e x t ==4);


                 The accumulate() returns an execution_context:
                           Released when all the asynchronous applications of the binary
                           operation complete & read-lock is dropped.
                 Note accumulate_t: contains an atomic, specialized counter
                 that accrues the result using suitable locking according to the
                 API and counter-type.
                           This is effectively a map-reduce operation.
Copyright © J.M.McGuiness, 2012.

  Scalability of accumulate().
                                                      Time for accumulate algorithm to run with varying
                                                       number of random data items (boost::mt19937).
                                         0.18

                                         0.16                                                                 2097152

                                         0.14                                                                 33554432

                       Run time (sec).   0.12                                                                 67108864

                                          0.1                                                                 134217728

                                         0.08

                                         0.06

                                         0.04

                                         0.02

                                           0
                                         Sequential       1               2                  4            8               12
                                                                        Number of threads.




               Some scaling, but too small range of threads to really see.
               Smallest data-set 2,097,152 items, run-time 3ms → 2ms.
                     Scalability affected by the size of a quantum of a thread on a processor.
                                            Effect ceases with over 67,108,864 data elements.
                     Need huge data-sets to make effective use of coarse-grain parallelism.
                     ~7% increase from 1 → 2 threads probably due to movement of dataset.
               One run was taken. Dual processor, 6-cores each, AMD Opteron 4168 2.6GHz, 16Gb RAM in 4
               ranked DIMMs (2 per processor), NUMA mode m/b & kernel v3.2.1 on Gentoo GNU/Linux, gcc
               “v4.6.2 p1.4 pie-0.5.0” with –std=c++0x & head revision (std::move added).
Copyright © J.M.McGuiness, 2012.

  Why SPEC20064 ?
                 Published since 1989, numerous updated versions in that time.
                 “A useful tool for anyone interested in how hardware systems
                 will perform under compute-intensive workloads based on real
                 applications.”
                 Publications in ACM, IEEE & LNCS amongst many others.
                 Studied in academia a great deal:
                         For example ~60% performance improvement on 4 cores and a
                         further ~10% on 8 when examining loop-level parallelism3 .
                         Poor scalability, code was written to be implicitly sequential.
                                For better performance design for parallelism...
                                Is that a premature optimisation, ignoring Hoare’s dictum!

                 Choose something - not pulled it out of a hat; worse still
                 something that makes me look artificially good!
            3
                Packirisamy, V.; Zhai, A.; Hsu, W.-C.; Yew, P.-C. & Ngai, T.-F. (2009), Exploring speculative
        parallelism in SPEC2006., in ’ISPASS’ , IEEE, , pp. 77-88 .
            4
                http://www.spec.org/cpu2006/press/release.html
Copyright © J.M.McGuiness, 2012.

  Methodology.

        Naive approach...
               Examine by how much is the performance affected by a simple
               analysis of a C++ code-base and subsequent application of
               high-level data-parallel constructs to it.
                     Will give a flavour of the analysis done & a subjective report
                     on how well the application seemed to go.
               Did not:
                     run a profiler to examine hot-spots in the code,
                     attempt to understand algorithms and program design,
                     do a full, detailed, critical analysis of the code,
                     do a full rewrite in a data-flow style.
                     No time, but of course one would really do these.
Copyright © J.M.McGuiness, 2012.

  An Examination of SPEC2006 for STL Usage.

         SPEC2006 was examined for usage of STL algorithms used.
                                                               Distribution of algorithms in 483.xalancbmk.                                                                                                                                                        Distribution of algorithms in 477.dealII.
                                100                                                                                                                                                     100

                                                                                                                                     set::count                                                                                                                                                                                                                                                                    set::count
                                                                                                                                     map::count                                                                                                                                                                                                                                                                    map::count
                                                                                                                                     string::find                                                                                                                                                                                                                                                                  string::find
                                                                                                                                     set::find                                                                                                                                                                                                                                                                     set::find
                                80                                                                                                                                                       80
                                                                                                                                     map::find                                                                                                                                                                                                                                                                     map::find
                                                                                                                                     Global algorithm                                                                                                                                                                                                                                                              Global algorithm



                                60                                                                                                                                                       60
        Number of occurences.




                                                                                                                                                                Number of occurences.
                                40                                                                                                                                                       40




                                20                                                                                                                                                       20




                                  0                                                                                                                                                       0
                                                                                                                                                                                              copy


                                                                                                                                                                                                              fill


                                                                                                                                                                                                                            sort


                                                                                                                                                                                                                                           find


                                                                                                                                                                                                                                                                count_if


                                                                                                                                                                                                                                                                                         accumulate


                                                                                                                                                                                                                                                                                                                    min_element


                                                                                                                                                                                                                                                                                                                                           inner_product


                                                                                                                                                                                                                                                                                                                                                                       transform


                                                                                                                                                                                                                                                                                                                                                                                                 inplace_merge


                                                                                                                                                                                                                                                                                                                                                                                                                            find_if


                                                                                                                                                                                                                                                                                                                                                                                                                                              binary_search


                                                                                                                                                                                                                                                                                                                                                                                                                                                                              find
                                      for_each


                                                 swap


                                                        find


                                                                find_if


                                                                          copy


                                                                                 reverse


                                                                                           remove


                                                                                                    transform


                                                                                                                stable_sort


                                                                                                                              sort


                                                                                                                                       replace


                                                                                                                                                 fill


                                                                                                                                                        count




                                                                                                                                                                                                     fill_n


                                                                                                                                                                                                                     swap


                                                                                                                                                                                                                                   count


                                                                                                                                                                                                                                                  lower_bound


                                                                                                                                                                                                                                                                           max_element


                                                                                                                                                                                                                                                                                                      upper_bound


                                                                                                                                                                                                                                                                                                                                  unique


                                                                                                                                                                                                                                                                                                                                                           remove_if


                                                                                                                                                                                                                                                                                                                                                                                   nth_element


                                                                                                                                                                                                                                                                                                                                                                                                                 for_each


                                                                                                                                                                                                                                                                                                                                                                                                                                      equal


                                                                                                                                                                                                                                                                                                                                                                                                                                                              adjacent_find
                                                                                 Algorithm.
                                                                                                                                                                                                     Algorithm.
Copyright © J.M.McGuiness, 2012.

  Analysis of 483.xalancbmk


        Classified as an integer benchmark using a version of the Xalan-C
        XSLT processor.
               Large, not a toy: over 500000 lines in over 2500 files.
               Single global thread_pool.
               Effort: at least 5 days.
                     Lot of effort for little reward.
               Small function bodies make it hard to easily increase
               parallelism by moving independent statements so they overlap.
               We cannot wantonly replace algorithms with parallel ones...
Copyright © J.M.McGuiness, 2012.

  Parallelising 483.xalancbmk
        9 of the over 200 identified uses of STL algorithms were actually
        easily parallelised:
               for_each(): none parallelisable:
                     57 used for deallocation: needs parallel memory allocator!
                     Accounts for over half of all of the applications of
                     for_each().
                     Remainder unlikely to be replaceable: the functor must have
                     side-effects, further locking required for correct execution.
               swap(): right type of algorithm is important; better
               implemented by swapping internal pointers, not the data
               elements themselves. None parallelisable.
               find() & find_if(): replaced 5 & 2 respectively.
               copy() & reverse(): none replaced.
               transform(): none replaced.
               fill() & sort(): replaced 1 each
Copyright © J.M.McGuiness, 2012.

  Analysis of 477.dealII

        Classified as a floating-point benchmark: “numerical solution of
        partial differential equations using the adaptive finite element
        method”.
               Large: over 180000 lines in over 483 files.
               Actually contains a hybrid implementation of data-flow for
               parallelism, but I didn’t test it.
               Again, single global thread_pool.
               Old version of boost included: got odd errors, so had to
               remove that!
                     Be sure to use compatible versions of libraries, beware that
                     they can be hidden anywhere!
               Effort: at least 2 days.
                     Excludes “fiddling” time.
Copyright © J.M.McGuiness, 2012.

  Parallelising 477.dealII
        Of the over 230 identified uses of STL algorithms 33 were actually
        easily parallelised:
               copy(): replaced none: usage requires full treatment of
               arbitrary ranges.
               fill() & fill_n(): replaced none & 8 respectively.
               swap(): none, as per xalancbmk.
               sort(): replaced 9.
               count() & count_if(): replaced 3 & 5 respectively.
               find() & find_if(): replaced none: mainly
               member-functions.
               min_element() & max_element(): replaced 2 & 3
               respectively.
               accumulate(): replaced 3.
               transform(): replaced none.
Copyright © J.M.McGuiness, 2012.

  Reflect Upon Original Analysis of Benchmarks.
        Very few of the initially identified algorithms could be replaced.
            Functional-decomposition nature of the code breaks the basic
            blocks into too small a scale to readily identify parallelism.
                     Larger function bodies & style of use make it easier to increase
                     parallelism by overlapping independent statements.
               Uninvestigated potential side-effects confound the ready
               replacement of many algorithms with parallel versions.
               Use of aspect-style programming would have greatly eased
               replacement of types.
                     Ease conversion of collections into those that are adapted to
                     parallel use, return types, etc.
               Complicated compilation options used to test robustness of
               build (gcc “v4.5.3-r2 p1.1 pie-0.4.7”):
               -pipe -std=c++98 -mtune=amdfam10 -march=amdfam10 -O3
               -fomit-frame-pointer -fpeel-loops -ftracer -funswitch-loops
               -funit-at-a-time -ftree-vectorize -freg-struct-return -msseregparm

                     Would be very sensible to move to C++11.
Copyright © J.M.McGuiness, 2012.

  Performance of Modified Benchmarks.
                                               SPEC2006 performance results.
                   20

                   18

                   16

                   14

                   12
        SPECbase




                   10

                    8
                                                                                              477.dealII 4.5.3
                    6
                                                                                              477.dealII 4.6.2
                    4
                                                                                              483.xalancbmk 4.5.3
                    2
                                                                                              483.xalancbmk 4.6.2
                    0
                   Baseline       Sequential          1              2              4               8            12
                                                          Number of threads.

                        “ref” dataset. The median of five runs was taken. Error bars: max & min performance.
Copyright © J.M.McGuiness, 2012.

  Commentary upon the results.
        xalancbmk:
             General lack of scaling. Not enough parallelism exposed, given
             it is not very scalable.
             Sequential performance is effectively the same as the baseline,
             so no harm to use it speculatively.
             ~5% performance reduction when using threads.
                     Shows that the threads are actually used, just not effectively.
                     Roughly constant, so independent of number of threads, likely
                     to be overhead of the implementation of data-flow in PPD.
        dealII:
             No cost to speculatively use parallel data-flow model.
                     I did check that the parallel code was called!
               Other code costs swamp costs of threading, unlike xalancbmk.
                     Don’t sweat the O/S-level threading construct costs.
               Design of algorithm critical to scalability.
                     Up-front design cost higher as more care required.
Copyright © J.M.McGuiness, 2012.

  Conclusions & Future Directions.
               Benchmarks were not written with parallelism in mind, so the
               parallelism revealed is poor.
                     When designing an algorithm that may become parallel, vital
                     to target a programming model.
                            Ensure option of running sequentially: no cost if it is not used.
                            Different models scale differently on the same code-base.
                            So carefully choose a model that will reflect any inherent
                            parallelism in the selected algorithm.
                            This is hard: look around on the web.

               Better statistics including collection sizes to assist in
               identifying opportunities for parallelisation.
                     Yes, profiling is vital, good profiling better!
                            Perfectly parallelised code runs at the speed of the sequential
                            portion of it, which is asymptotically approached as the
                            parallel portion is enhanced. (One of Amdhal’s Laws?)

               No golden bullet: post-facto bodging parallelism onto a large
               pre-existing code-base is a waste of time & money.

More Related Content

What's hot

IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialRoger Rafanell Mas
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data FlowsSANKARSAN BOSE
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hoodRichardWarburton
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problemsRichard Ashworth
 
Final training course
Final training courseFinal training course
Final training courseNoor Dhiya
 
論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」kurotaki_weblab
 
First steps with Keras 2: A tutorial with Examples
First steps with Keras 2: A tutorial with ExamplesFirst steps with Keras 2: A tutorial with Examples
First steps with Keras 2: A tutorial with ExamplesFelipe
 
Parallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical SectionParallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical SectionTony Albrecht
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshopodsc
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and KerasJie He
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
Parallel Programming Primer
Parallel Programming PrimerParallel Programming Primer
Parallel Programming PrimerSri Prasanna
 
Shifu plugin-trainer and pmml-adapter
Shifu plugin-trainer and pmml-adapterShifu plugin-trainer and pmml-adapter
Shifu plugin-trainer and pmml-adapterLisa Hua
 

What's hot (19)

IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
Parallel computation
Parallel computationParallel computation
Parallel computation
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Task Parallel Library Data Flows
Task Parallel Library Data FlowsTask Parallel Library Data Flows
Task Parallel Library Data Flows
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hood
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
Producer consumer-problems
Producer consumer-problemsProducer consumer-problems
Producer consumer-problems
 
Final training course
Final training courseFinal training course
Final training course
 
論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」論文輪読資料「Gated Feedback Recurrent Neural Networks」
論文輪読資料「Gated Feedback Recurrent Neural Networks」
 
First steps with Keras 2: A tutorial with Examples
First steps with Keras 2: A tutorial with ExamplesFirst steps with Keras 2: A tutorial with Examples
First steps with Keras 2: A tutorial with Examples
 
Parallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical SectionParallel Programming: Beyond the Critical Section
Parallel Programming: Beyond the Critical Section
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshop
 
Introduction to neural networks and Keras
Introduction to neural networks and KerasIntroduction to neural networks and Keras
Introduction to neural networks and Keras
 
fj
fjfj
fj
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Parallel Programming Primer
Parallel Programming PrimerParallel Programming Primer
Parallel Programming Primer
 
Shifu plugin-trainer and pmml-adapter
Shifu plugin-trainer and pmml-adapterShifu plugin-trainer and pmml-adapter
Shifu plugin-trainer and pmml-adapter
 

Similar to C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see how well it works with the SPEC2006 benchmark suite

Close encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet CodeClose encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet Codelbergmans
 
Close Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codeClose Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codelbergmans
 
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...7mind
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014ijcite
 
Eventdriven I/O - A hands on introduction
Eventdriven I/O - A hands on introductionEventdriven I/O - A hands on introduction
Eventdriven I/O - A hands on introductionMarc Seeger
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
High Performance NodeJS
High Performance NodeJSHigh Performance NodeJS
High Performance NodeJSDicoding
 
Comparing Write-Ahead Logging and the Memory Bus Using
Comparing Write-Ahead Logging and the Memory Bus UsingComparing Write-Ahead Logging and the Memory Bus Using
Comparing Write-Ahead Logging and the Memory Bus Usingjorgerodriguessimao
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with DockerDocker, Inc.
 
Advanced Node.JS Meetup
Advanced Node.JS MeetupAdvanced Node.JS Meetup
Advanced Node.JS MeetupLINAGORA
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkbanq jdon
 
MarGotAspect - An AspectC++ code generator for the mARGOt framework
MarGotAspect - An AspectC++ code generator for the mARGOt frameworkMarGotAspect - An AspectC++ code generator for the mARGOt framework
MarGotAspect - An AspectC++ code generator for the mARGOt frameworkLeonardo Arcari
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesAmazon Web Services
 
Serenity Project: Security in Software Enginering
Serenity Project: Security in Software EngineringSerenity Project: Security in Software Enginering
Serenity Project: Security in Software EngineringFrancisco Sanchez Cid
 
Determan SummerSim_submit_rev3
Determan SummerSim_submit_rev3Determan SummerSim_submit_rev3
Determan SummerSim_submit_rev3John Determan
 

Similar to C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see how well it works with the SPEC2006 benchmark suite (20)

Close encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet CodeClose encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet Code
 
Close Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codeClose Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet code
 
Introducing Parallel Pixie Dust
Introducing Parallel Pixie DustIntroducing Parallel Pixie Dust
Introducing Parallel Pixie Dust
 
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
distage: Purely Functional Staged Dependency Injection; bonus: Faking Kind Po...
 
Concurrency and parallel in .net
Concurrency and parallel in .netConcurrency and parallel in .net
Concurrency and parallel in .net
 
Surge2012
Surge2012Surge2012
Surge2012
 
E5 05 ijcite august 2014
E5 05 ijcite august 2014E5 05 ijcite august 2014
E5 05 ijcite august 2014
 
Eventdriven I/O - A hands on introduction
Eventdriven I/O - A hands on introductionEventdriven I/O - A hands on introduction
Eventdriven I/O - A hands on introduction
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
High Performance NodeJS
High Performance NodeJSHigh Performance NodeJS
High Performance NodeJS
 
parallel-computation.pdf
parallel-computation.pdfparallel-computation.pdf
parallel-computation.pdf
 
Comparing Write-Ahead Logging and the Memory Bus Using
Comparing Write-Ahead Logging and the Memory Bus UsingComparing Write-Ahead Logging and the Memory Bus Using
Comparing Write-Ahead Logging and the Memory Bus Using
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with Docker
 
Advanced Node.JS Meetup
Advanced Node.JS MeetupAdvanced Node.JS Meetup
Advanced Node.JS Meetup
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFramework
 
MarGotAspect - An AspectC++ code generator for the mARGOt framework
MarGotAspect - An AspectC++ code generator for the mARGOt frameworkMarGotAspect - An AspectC++ code generator for the mARGOt framework
MarGotAspect - An AspectC++ code generator for the mARGOt framework
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 
Serenity Project: Security in Software Enginering
Serenity Project: Security in Software EngineringSerenity Project: Security in Software Enginering
Serenity Project: Security in Software Enginering
 
Determan SummerSim_submit_rev3
Determan SummerSim_submit_rev3Determan SummerSim_submit_rev3
Determan SummerSim_submit_rev3
 
Caffe2
Caffe2Caffe2
Caffe2
 

C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see how well it works with the SPEC2006 benchmark suite

  • 1. C++ Data-flow Parallelism sounds great! But how practical is it? Let’s see how well it works with the SPEC2006 benchmark suite (Yet Another!) Investigation into using Library-Based, Data-Parallelism Support in C++(03 & 11-ish). Jason Mc Guiness & Colin Egan accuconf-12@hussar.demon.co.uk c.egan@herts.ac.uk http://libjmmcg.sf.net/ 30th April 2012
  • 2. Copyright © J.M.McGuiness, 2012. Sequence of Presentation. An introduction: why I am here. How do we manage all these threads: libraries, languages and compilers. An outline of a data-flow library: Parallel Pixie Dust (PPD). Introduction to SPEC2006 followed by my methodology in applying data-flow to it. Leading on to an analysis of SPEC2006. How well did the coding go ... some simple examples & discussion around pitfalls. If time & technology allow some real-live examples! Some results. Wow numbers! Yes really! Conclusion & Future directions.
  • 3. Copyright © J.M.McGuiness, 2012. Introduction. Why yet another thread-library presentation? Because we still find it hard to write multi-threaded programs correctly. According to programming folklore. We haven’t successfully replaced the von Neumann architecture: Stored program architectures are still prevalent. The memory wall still affects us: The CPU-instruction retirement rate, i.e. rate at which programs require and generate data, exceeds the the memory bandwidth - a by product of Moore’s Law. Modern architectures add extra cores to CPUs, in this instance, extra memory buses which feed into those cores.
  • 4. Copyright © J.M.McGuiness, 2012. A Quick Review of Some Threading Models: Restrict ourselves to library-based techniques. Raw library calls. The “thread as an active class”. Co-routines. The “thread pool” containing those objects. Producer-consumer models are a sub-set of this model. The issue: classes used to implement business logic also implement the operations to be run on the threads. This often means that the locking is intimately entangled with those objects, and possibly even the threading logic. This makes it much harder to debug these applications. The question of how to implement multi-threaded debuggers correctly is still an open question.
  • 5. Copyright © J.M.McGuiness, 2012. What is Data-flow and why is it so good? “Manchester Data-flow Machine”, developed in Manchester in 80s1 . Not von Neumann architecture. No PC. Large number of execution units & special memory (technically CAMs). Compiler-aided: Compiler automatically identifies & tags dependencies which annotate the output binary. Dependencies & related instructions loaded into CAM. When all of the tags are satisfied for a sequence of instructions, that is fired, i.e. submitted to an execution unit. No explicit locks. No race conditions nor deadlocks. Considerable effort put into the data-flow compiler to generate provably correct code. Is The Way Forward™. Why don’t we use it? Need to recompile source-code & buy “exotic” hardware. Inertia, nih2 -ism. Companies hate to do this. 1 Stephen Furber is continuing the work there.
  • 6. Copyright © J.M.McGuiness, 2012. Part 1: Design Features of PPD. The main design features of PPD are: It targets general purpose threading using a data-flow model of parallelism: This type of scheduling may be viewed as dynamic scheduling (run-time) as opposed to static scheduling (potentially compile-time), where the operations are statically assigned to execution pipelines, with relatively fixed timing constraints. DSEL implemented as futures and thread pools (of many different types using traits). Can be used to implement a tree-like thread schedule. “thread-as-an-active-class” exists. Gives rise to important properties: efficient, dead-lock & race-condition free schedules.
  • 7. Copyright © J.M.McGuiness, 2012. Part 2: Design Features of PPD. Extensions to the basic DSEL have been added: Certain binary functors: each operand is executed in parallel. Adapters for the STL collections to assist with thread-safety. Combined with thread pools allows replacement of the STL algorithms. Optimisations including void return-type & GSS(k), or baker’s scheduling: May reduce synchronisation costs on thread-limited systems or pools in which the synchronisation costs are high. Amongst other influences, PPD was born out of discussions with Richard Harris and motivated by Kevlin Henney’s presentation to ACCU’04 regarding threads.
  • 8. Copyright © J.M.McGuiness, 2012. The STL-style Parallel Algorithms. Analysis of SPEC2006, amongst other reasons lead me to implement in PPD: 1. for_each() 2. find_if() & find() 3. count_if() & count() 4. transform() (both overloads) 5. copy() 6. accumulate() (both overloads) 7. fill_n() & fill() 8. reverse() 9. max_element() & min_element() (both overloads) 10. merge() & sort() (both overloads) Based upon Batcher’s bitonic sort, not Cole’s as recommended by Knuth. Work in progress: still getting the scalability of these right.
  • 9. Copyright © J.M.McGuiness, 2012. Example using accumulate(). Listing 1: Accumulate with a Thread Pool and Future. t y p e d e f ppd : : t h r e a d _ p o o l < p o o l _ t r a i t s : : worker_threads_get_work , p o o l _ t r a i t s : : f i x e d _ s i z e , g e n e r i c _ t r a i t s : : j o i n a b l e , platform_api , heavyweight_threading , pool_traits : : normal_fifo , std : less , 1 > pool_type ; t y p e d e f ppd : : s a f e _ c o l l n < v e c t o r <i n t >, l o c k _ t r a i t s : : c r i t i c a l _ s e c t i o n _ l o c k _ t y p e > vtr_colln_t ; t y p e d e f pool_type : : accumulate_t< vtr_colln_t >:: e x e c u t i o n _ c o n t e x t e x e c u t i o n _ c o n t e x t ; vtr_colln_t v ; v . push_back ( 1 ) ; v . push_back ( 2 ) ; execution_context context ( p o o l . a c c u m u l a t e ( v , 1 , s t d : : p l u s <v t r _ c o l l n _ t : : v a l u e _ t y p e > ( ) ) ); a s s e r t ( ∗ c o n t e x t ==4); The accumulate() returns an execution_context: Released when all the asynchronous applications of the binary operation complete & read-lock is dropped. Note accumulate_t: contains an atomic, specialized counter that accrues the result using suitable locking according to the API and counter-type. This is effectively a map-reduce operation.
  • 10. Copyright © J.M.McGuiness, 2012. Scalability of accumulate(). Time for accumulate algorithm to run with varying number of random data items (boost::mt19937). 0.18 0.16 2097152 0.14 33554432 Run time (sec). 0.12 67108864 0.1 134217728 0.08 0.06 0.04 0.02 0 Sequential 1 2 4 8 12 Number of threads. Some scaling, but too small range of threads to really see. Smallest data-set 2,097,152 items, run-time 3ms → 2ms. Scalability affected by the size of a quantum of a thread on a processor. Effect ceases with over 67,108,864 data elements. Need huge data-sets to make effective use of coarse-grain parallelism. ~7% increase from 1 → 2 threads probably due to movement of dataset. One run was taken. Dual processor, 6-cores each, AMD Opteron 4168 2.6GHz, 16Gb RAM in 4 ranked DIMMs (2 per processor), NUMA mode m/b & kernel v3.2.1 on Gentoo GNU/Linux, gcc “v4.6.2 p1.4 pie-0.5.0” with –std=c++0x & head revision (std::move added).
  • 11. Copyright © J.M.McGuiness, 2012. Why SPEC20064 ? Published since 1989, numerous updated versions in that time. “A useful tool for anyone interested in how hardware systems will perform under compute-intensive workloads based on real applications.” Publications in ACM, IEEE & LNCS amongst many others. Studied in academia a great deal: For example ~60% performance improvement on 4 cores and a further ~10% on 8 when examining loop-level parallelism3 . Poor scalability, code was written to be implicitly sequential. For better performance design for parallelism... Is that a premature optimisation, ignoring Hoare’s dictum! Choose something - not pulled it out of a hat; worse still something that makes me look artificially good! 3 Packirisamy, V.; Zhai, A.; Hsu, W.-C.; Yew, P.-C. & Ngai, T.-F. (2009), Exploring speculative parallelism in SPEC2006., in ’ISPASS’ , IEEE, , pp. 77-88 . 4 http://www.spec.org/cpu2006/press/release.html
  • 12. Copyright © J.M.McGuiness, 2012. Methodology. Naive approach... Examine by how much is the performance affected by a simple analysis of a C++ code-base and subsequent application of high-level data-parallel constructs to it. Will give a flavour of the analysis done & a subjective report on how well the application seemed to go. Did not: run a profiler to examine hot-spots in the code, attempt to understand algorithms and program design, do a full, detailed, critical analysis of the code, do a full rewrite in a data-flow style. No time, but of course one would really do these.
  • 13. Copyright © J.M.McGuiness, 2012. An Examination of SPEC2006 for STL Usage. SPEC2006 was examined for usage of STL algorithms used. Distribution of algorithms in 483.xalancbmk. Distribution of algorithms in 477.dealII. 100 100 set::count set::count map::count map::count string::find string::find set::find set::find 80 80 map::find map::find Global algorithm Global algorithm 60 60 Number of occurences. Number of occurences. 40 40 20 20 0 0 copy fill sort find count_if accumulate min_element inner_product transform inplace_merge find_if binary_search find for_each swap find find_if copy reverse remove transform stable_sort sort replace fill count fill_n swap count lower_bound max_element upper_bound unique remove_if nth_element for_each equal adjacent_find Algorithm. Algorithm.
  • 14. Copyright © J.M.McGuiness, 2012. Analysis of 483.xalancbmk Classified as an integer benchmark using a version of the Xalan-C XSLT processor. Large, not a toy: over 500000 lines in over 2500 files. Single global thread_pool. Effort: at least 5 days. Lot of effort for little reward. Small function bodies make it hard to easily increase parallelism by moving independent statements so they overlap. We cannot wantonly replace algorithms with parallel ones...
  • 15. Copyright © J.M.McGuiness, 2012. Parallelising 483.xalancbmk 9 of the over 200 identified uses of STL algorithms were actually easily parallelised: for_each(): none parallelisable: 57 used for deallocation: needs parallel memory allocator! Accounts for over half of all of the applications of for_each(). Remainder unlikely to be replaceable: the functor must have side-effects, further locking required for correct execution. swap(): right type of algorithm is important; better implemented by swapping internal pointers, not the data elements themselves. None parallelisable. find() & find_if(): replaced 5 & 2 respectively. copy() & reverse(): none replaced. transform(): none replaced. fill() & sort(): replaced 1 each
  • 16. Copyright © J.M.McGuiness, 2012. Analysis of 477.dealII Classified as a floating-point benchmark: “numerical solution of partial differential equations using the adaptive finite element method”. Large: over 180000 lines in over 483 files. Actually contains a hybrid implementation of data-flow for parallelism, but I didn’t test it. Again, single global thread_pool. Old version of boost included: got odd errors, so had to remove that! Be sure to use compatible versions of libraries, beware that they can be hidden anywhere! Effort: at least 2 days. Excludes “fiddling” time.
  • 17. Copyright © J.M.McGuiness, 2012. Parallelising 477.dealII Of the over 230 identified uses of STL algorithms 33 were actually easily parallelised: copy(): replaced none: usage requires full treatment of arbitrary ranges. fill() & fill_n(): replaced none & 8 respectively. swap(): none, as per xalancbmk. sort(): replaced 9. count() & count_if(): replaced 3 & 5 respectively. find() & find_if(): replaced none: mainly member-functions. min_element() & max_element(): replaced 2 & 3 respectively. accumulate(): replaced 3. transform(): replaced none.
  • 18. Copyright © J.M.McGuiness, 2012. Reflect Upon Original Analysis of Benchmarks. Very few of the initially identified algorithms could be replaced. Functional-decomposition nature of the code breaks the basic blocks into too small a scale to readily identify parallelism. Larger function bodies & style of use make it easier to increase parallelism by overlapping independent statements. Uninvestigated potential side-effects confound the ready replacement of many algorithms with parallel versions. Use of aspect-style programming would have greatly eased replacement of types. Ease conversion of collections into those that are adapted to parallel use, return types, etc. Complicated compilation options used to test robustness of build (gcc “v4.5.3-r2 p1.1 pie-0.4.7”): -pipe -std=c++98 -mtune=amdfam10 -march=amdfam10 -O3 -fomit-frame-pointer -fpeel-loops -ftracer -funswitch-loops -funit-at-a-time -ftree-vectorize -freg-struct-return -msseregparm Would be very sensible to move to C++11.
  • 19. Copyright © J.M.McGuiness, 2012. Performance of Modified Benchmarks. SPEC2006 performance results. 20 18 16 14 12 SPECbase 10 8 477.dealII 4.5.3 6 477.dealII 4.6.2 4 483.xalancbmk 4.5.3 2 483.xalancbmk 4.6.2 0 Baseline Sequential 1 2 4 8 12 Number of threads. “ref” dataset. The median of five runs was taken. Error bars: max & min performance.
  • 20. Copyright © J.M.McGuiness, 2012. Commentary upon the results. xalancbmk: General lack of scaling. Not enough parallelism exposed, given it is not very scalable. Sequential performance is effectively the same as the baseline, so no harm to use it speculatively. ~5% performance reduction when using threads. Shows that the threads are actually used, just not effectively. Roughly constant, so independent of number of threads, likely to be overhead of the implementation of data-flow in PPD. dealII: No cost to speculatively use parallel data-flow model. I did check that the parallel code was called! Other code costs swamp costs of threading, unlike xalancbmk. Don’t sweat the O/S-level threading construct costs. Design of algorithm critical to scalability. Up-front design cost higher as more care required.
  • 21. Copyright © J.M.McGuiness, 2012. Conclusions & Future Directions. Benchmarks were not written with parallelism in mind, so the parallelism revealed is poor. When designing an algorithm that may become parallel, vital to target a programming model. Ensure option of running sequentially: no cost if it is not used. Different models scale differently on the same code-base. So carefully choose a model that will reflect any inherent parallelism in the selected algorithm. This is hard: look around on the web. Better statistics including collection sizes to assist in identifying opportunities for parallelisation. Yes, profiling is vital, good profiling better! Perfectly parallelised code runs at the speed of the sequential portion of it, which is asymptotically approached as the parallel portion is enhanced. (One of Amdhal’s Laws?) No golden bullet: post-facto bodging parallelism onto a large pre-existing code-base is a waste of time & money.