Datorarkitektur                                                  Fö 11/12 - 1


                 ARCHITECTURES FOR PARALLEL COMPUTATION


                  1. Why Parallel Computation

                  2. Parallel Programs

                  3. A Classification of Computer Architectures

                  4. Performance of Parallel Architectures

                  5. The Interconnection Network

                  6. Array Processors

                  7. Multiprocessors

                  8. Multicomputers

                  9. Vector Processors

                 10. Multimedia Extensions to Microprocessors


Datorarkitektur                                                  Fö 11/12 - 2


                         Why Parallel Computation?


      •  The need for high performance!

         Two main factors contribute to the high performance of
         modern processors:

           1. Fast circuit technology
           2. Architectural features:
              - large caches
              - multiple fast buses
              - pipelining
              - superscalar architectures (multiple functional units)

         However:

      •  Computers with a single CPU are often unable to meet the
         performance needs of certain areas:
           - fluid flow analysis and aerodynamics;
           - simulation of large, complex systems, for example in
             physics, economics, biology, engineering;
           - computer-aided design;
           - multimedia.

      •  Applications in the above domains are characterized by a
         very large amount of numerical computation and/or a very
         large quantity of input data.





Datorarkitektur                                                  Fö 11/12 - 3


                       A Solution: Parallel Computers


      •  One solution to the need for high performance:
         architectures in which several CPUs run concurrently in
         order to solve a certain application.

      •  Such computers have been organized in very different ways.
         Some key features:
           - number and complexity of the individual CPUs
           - availability of common (shared) memory
           - interconnection topology
           - performance of the interconnection network
           - I/O devices
           - ...


Datorarkitektur                                                  Fö 11/12 - 4


                             Parallel Programs


      1. Parallel sorting

         Unsorted-1   Unsorted-2   Unsorted-3   Unsorted-4
             |            |            |            |
           Sort-1       Sort-2       Sort-3       Sort-4
             |            |            |            |
          Sorted-1     Sorted-2     Sorted-3     Sorted-4
              \            \           /            /
                              Merge
                                |
                           S O R T E D

Datorarkitektur                                                  Fö 11/12 - 5


                         Parallel Programs (cont'd)


      A possible program for parallel sorting:

      var t: array[1..1000] of integer;
      - - - - - - - - - - -
      procedure sort(i,j:integer);
        - sort elements between t[i] and t[j] -
      end sort;
      procedure merge;
        - - merge the four sub-arrays - -
      end merge;
      - - - - - - - - - - -
      begin
        - - - - - - - -
        cobegin
             sort(1,250)|
             sort(251,500)|
             sort(501,750)|
             sort(751,1000)
        coend;
        merge;
        - - - - - - - -
      end;


Datorarkitektur                                                  Fö 11/12 - 6


                         Parallel Programs (cont'd)


      2. Matrix addition:

         | a11 a21 ⋅ ⋅ ⋅ am1 |   | b11 b21 ⋅ ⋅ ⋅ bm1 |   | c11 c21 ⋅ ⋅ ⋅ cm1 |
         | a12 a22 ⋅ ⋅ ⋅ am2 |   | b12 b22 ⋅ ⋅ ⋅ bm2 |   | c12 c22 ⋅ ⋅ ⋅ cm2 |
         | a13 a23 ⋅ ⋅ ⋅ am3 | + | b13 b23 ⋅ ⋅ ⋅ bm3 | = | c13 c23 ⋅ ⋅ ⋅ cm3 |
         |  ⋅   ⋅       ⋅   |   |  ⋅   ⋅       ⋅   |   |  ⋅   ⋅       ⋅   |
         | a1n a2n ⋅ ⋅ ⋅ amn |   | b1n b2n ⋅ ⋅ ⋅ bmn |   | c1n c2n ⋅ ⋅ ⋅ cmn |

      Sequential version:

      var  a: array[1..n,1..m] of integer;
           b: array[1..n,1..m] of integer;
           c: array[1..n,1..m] of integer;
           i,j: integer;
      - - - - - - - - - - -
      begin
        - - - - - - - -
        for i:=1 to n do
          for j:=1 to m do
            c[i,j]:=a[i,j]+b[i,j];
          end for
        end for
        - - - - - - - -
      end;

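      Returning to the sorting program on slide 5: as a concrete (if
      rough) counterpart, here is one way its cobegin/coend could be
      expressed in C with POSIX threads. The helper names sort_range
      and cmp are ours, and the merge step is omitted, as on the
      slide.

      #include <pthread.h>
      #include <stdlib.h>

      #define N 1000
      static int t[N + 1];                    /* t[1..1000], as on slide 5 */

      static int cmp(const void *x, const void *y)
      {
          return *(const int *)x - *(const int *)y;
      }

      struct range { int i, j; };             /* sort elements t[i..j] */

      static void *sort_range(void *arg)
      {
          struct range *r = arg;
          qsort(&t[r->i], r->j - r->i + 1, sizeof(int), cmp);
          return NULL;
      }

      int main(void)
      {
          /* cobegin ... coend: start the four sorts, then wait for all */
          struct range rs[4] = {{1,250}, {251,500}, {501,750}, {751,1000}};
          pthread_t th[4];
          for (int k = 0; k < 4; k++)
              pthread_create(&th[k], NULL, sort_range, &rs[k]);
          for (int k = 0; k < 4; k++)
              pthread_join(th[k], NULL);
          /* merge of the four sorted sub-arrays omitted, as on the slide */
          return 0;
      }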




Datorarkitektur                                                  Fö 11/12 - 7


                         Parallel Programs (cont'd)


      Matrix addition - parallel version:

      var  a: array[1..n,1..m] of integer;
           b: array[1..n,1..m] of integer;
           c: array[1..n,1..m] of integer;
           i: integer;
      - - - - - - - - - - -
      procedure add_vector(n_ln:integer);
        var j:integer
      begin
        for j:=1 to m do
           c[n_ln,j]:=a[n_ln,j]+b[n_ln,j];
        end for
      end add_vector;

      begin
        - - - - - - - -
        cobegin for i:=1 to n do
            add_vector(i);
        coend;
        - - - - - - - -
      end;


Datorarkitektur                                                  Fö 11/12 - 8


                         Parallel Programs (cont'd)


      Matrix addition - vector computation version:

      var  a: array[1..n,1..m] of integer;
           b: array[1..n,1..m] of integer;
           c: array[1..n,1..m] of integer;
           i,j: integer;
      - - - - - - - - - - -
      begin
        - - - - - - - -
        for i:=1 to n do
           c[i,1:m]:=a[i,1:m]+b[i,1:m];
        end for;
        - - - - - - - -
      end;


      Or even so:

      begin
        - - - - - - - -
        c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m];
        - - - - - - - -
      end;

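      For comparison, a rough C rendering of the row-parallel version
      on slide 7, with an OpenMP pragma playing the role of
      cobegin/coend; the sizes N and M and the initialization are
      made up for illustration.

      #include <stdio.h>

      #define N 4
      #define M 5

      int main(void)
      {
          int a[N][M], b[N][M], c[N][M];

          for (int i = 0; i < N; i++)         /* made-up input data */
              for (int j = 0; j < M; j++) {
                  a[i][j] = i + j;
                  b[i][j] = i * j;
              }

          #pragma omp parallel for            /* one parallel activity per row,  */
          for (int i = 0; i < N; i++)         /* like cobegin for i / add_vector */
              for (int j = 0; j < M; j++)
                  c[i][j] = a[i][j] + b[i][j];

          printf("%d\n", c[N-1][M-1]);
          return 0;
      }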
Datorarkitektur                                                  Fö 11/12 - 9


                         Parallel Programs (cont'd)


      Pipeline model of computation:

          x --->[ y = 5 × sqrt(45 + log x) ]---> y

      decomposed into two pipeline stages:

          x --->[ a = 45 + log x ]---a--->[ y = 5 × sqrt(a) ]---> y


Datorarkitektur                                                  Fö 11/12 - 10


                         Parallel Programs (cont'd)


      A program for the previous computation:

      channel ch:real;
      - - - - - - - - -
      cobegin
        var x:real;
        while true do
           read(x);
           send(ch,45+log(x));
        end while |

        var v:real;
        while true do
           receive(ch,v);
           write(5*sqrt(v));
        end while
      coend;
      - - - - - - - - -

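      One way to make this pipeline concrete in C (our sketch, not
      from the slides) is two POSIX threads with a pipe standing in
      for the channel ch: the first stage reads x and sends
      45 + log x, the second receives a value and writes 5 times its
      square root.

      #include <math.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <pthread.h>

      static int ch[2];                         /* the channel: a POSIX pipe */

      static void *stage1(void *arg)            /* a = 45 + log x */
      {
          double x;
          (void)arg;
          while (scanf("%lf", &x) == 1) {
              double a = 45 + log(x);
              write(ch[1], &a, sizeof a);       /* send(ch, a)   */
          }
          close(ch[1]);                         /* end of stream */
          return NULL;
      }

      static void *stage2(void *arg)            /* y = 5 * sqrt(a) */
      {
          double v;
          (void)arg;
          while (read(ch[0], &v, sizeof v) > 0) /* receive(ch, v) */
              printf("%f\n", 5 * sqrt(v));
          return NULL;
      }

      int main(void)
      {
          pthread_t t1, t2;
          pipe(ch);
          pthread_create(&t1, NULL, stage1, NULL);
          pthread_create(&t2, NULL, stage2, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }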




Datorarkitektur                                                  Fö 11/12 - 11


             Flynn's Classification of Computer Architectures


      •  Flynn's classification is based on the nature of the
         instruction flow executed by the computer and that of the
         data flow on which the instructions operate.


      1. Single Instruction stream, Single Data stream (SISD)

              CPU
              Control unit --- instr. stream ---> Processing unit
                                                        |
                                                   data stream
                                                        |
                                                     Memory


Datorarkitektur                                                  Fö 11/12 - 12


                       Flynn's Classification (cont'd)


      2. Single Instruction stream, Multiple Data stream (SIMD)

      SIMD with shared memory

         Control unit --IS--> Processing unit_1 --DS1--+
                      --IS--> Processing unit_2 --DS2--+--> Interconnection <--> Shared
                               . . .                   |       Network           Memory
                      --IS--> Processing unit_n --DSn--+

Datorarkitektur                                                  Fö 11/12 - 13


                       Flynn's Classification (cont'd)


      SIMD with no shared memory

         Control unit [LM] --IS--> Processing unit_1 [LM1] --DS1--+
                           --IS--> Processing unit_2 [LM2] --DS2--+--> Interconnection
                                    . . .                         |       Network
                           --IS--> Processing unit_n [LMn] --DSn--+


Datorarkitektur                                                  Fö 11/12 - 14


                       Flynn's Classification (cont'd)


      3. Multiple Instruction stream, Multiple Data stream (MIMD)

      MIMD with shared memory

         CPU_1: Control unit_1 --IS1--> Processing unit_1 [LM1] --DS1--+
         CPU_2: Control unit_2 --IS2--> Processing unit_2 [LM2] --DS2--+--> Interconnection <--> Shared
                 . . .                                                 |       Network           Memory
         CPU_n: Control unit_n --ISn--> Processing unit_n [LMn] --DSn--+





Datorarkitektur                                                  Fö 11/12 - 15


                       Flynn's Classification (cont'd)


      MIMD with no shared memory

         CPU_1: Control unit_1 --IS1--> Processing unit_1 [LM1] --DS1--+
         CPU_2: Control unit_2 --IS2--> Processing unit_2 [LM2] --DS2--+--> Interconnection
                 . . .                                                 |       Network
         CPU_n: Control unit_n --ISn--> Processing unit_n [LMn] --DSn--+


Datorarkitektur                                                  Fö 11/12 - 16


                   Performance of Parallel Architectures


      Important questions:

      •  How fast does a parallel computer run at its maximal
         potential?

      •  What execution speed can we expect from a parallel computer
         for a concrete application?

      •  How do we measure the performance of a parallel computer
         and the performance improvement we get by using such a
         computer?

Datorarkitektur                                                  Fö 11/12 - 17


                             Performance Metrics


      •  Peak rate: the maximal computation rate that can be
         theoretically achieved when all modules are fully utilized.

         The peak rate is of no practical significance for the user.
         It is mostly used by vendor companies for marketing their
         computers.

      •  Speedup: measures the gain we get by using a certain
         parallel computer to run a given parallel program in order
         to solve a specific problem:

                  S = T_S / T_P

         T_S: execution time needed with the best sequential
              algorithm;
         T_P: execution time needed with the parallel algorithm.


Datorarkitektur                                                  Fö 11/12 - 18


                         Performance Metrics (cont'd)


      •  Efficiency: this metric relates the speedup to the number
         of processors used; by this it provides a measure of the
         efficiency with which the processors are used:

                  E = S / p

         S: speedup;
         p: number of processors.

         For the ideal situation, in theory:

                  S = T_S / (T_S / p) = p,   which means   E = 1

      In practice, the ideal efficiency of 1 cannot be achieved!

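      A quick worked example, with invented numbers rather than
      anything from the slides: suppose the best sequential
      algorithm needs T_S = 100 s, and the parallel program needs
      T_P = 25 s on p = 8 processors. Then:

                  S = T_S / T_P = 100 / 25 = 4
                  E = S / p = 4 / 8 = 0.5

      so the machine runs 4 times faster than the sequential
      version, but each processor does useful work, on average, only
      half of the time.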




Datorarkitektur                                                  Fö 11/12 - 19


                                Amdahl's Law

      •  Consider f to be the ratio of computations that, according
         to the algorithm, have to be executed sequentially
         (0 ≤ f ≤ 1); p is the number of processors.

                  T_P = f × T_S + (1 − f) × T_S / p

                  S = T_S / (f × T_S + (1 − f) × T_S / p)
                    = 1 / (f + (1 − f) / p)

         [Figure: speedup S, from 1 to 10, plotted against f, from 0
          to 1.0; S drops steeply toward 1 as f grows.]


Datorarkitektur                                                  Fö 11/12 - 20


                            Amdahl's Law (cont'd)


      Amdahl's law: even a small ratio of sequential computation
      imposes a limit on the achievable speedup; a speedup higher
      than 1/f cannot be achieved, regardless of the number of
      processors.

                  E = S / p = 1 / (f × (p − 1) + 1)

      To efficiently exploit a high number of processors, f must be
      small (the algorithm has to be highly parallel).

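      To see the bound numerically (again with invented numbers):
      take f = 0.1, i.e. 10% of the computation is sequential. Then:

                  p = 10:    S = 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3
                  p → ∞:     S → 1 / f = 10

      Even with unlimited processors the speedup can never exceed
      10, and already at p = 10 the efficiency has dropped to
      E = 1 / (0.1 × 9 + 1) ≈ 0.53.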
Datorarkitektur                                                  Fö 11/12 - 21


                   Other Aspects which Limit the Speedup


      •  Besides the intrinsic sequentiality of some parts of an
         algorithm, there are also other factors that limit the
         achievable speedup:
           - communication cost
           - load balancing of the processors
           - costs of creating and scheduling processes
           - I/O operations

      •  There are many algorithms with a high degree of
         parallelism; for such algorithms the value of f is very
         small and can be ignored. These algorithms are suited for
         massively parallel systems; in such cases the other
         limiting factors, like the cost of communications, become
         critical.


Datorarkitektur                                                  Fö 11/12 - 22


                     Efficiency and Communication Cost


      Consider a highly parallel computation, so that f (the ratio
      of sequential computations) can be neglected.

      We define f_c, the fractional communication overhead of a
      processor:

         T_calc: time that a processor spends executing
                 computations;
         T_comm: time that a processor is idle because of
                 communication;

                  f_c = T_comm / T_calc

                  T_P = (T_S / p) × (1 + f_c)

                  S = T_S / T_P = p / (1 + f_c)

                  E = 1 / (1 + f_c) ≈ 1 − f_c   (for small f_c)

      •  With algorithms that have a high degree of parallelism,
         massively parallel computers, consisting of a large number
         of processors, can be used efficiently if f_c is small;
         this means that the time spent by a processor on
         communication has to be small compared to its effective
         computation time.

      •  In order to keep f_c reasonably small, the size of the
         processes cannot go below a certain limit.

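      Plugging in an invented value shows how tight the
      approximation is: if a processor computes for T_calc = 100
      time units for every T_comm = 5 units it sits idle
      communicating, then:

                  f_c = 5 / 100 = 0.05
                  E = 1 / (1 + 0.05) ≈ 0.952

      while the approximation 1 − f_c gives 0.95.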




Datorarkitektur                                                  Fö 11/12 - 23


                         The Interconnection Network


      •  The interconnection network (IN) is a key component of the
         architecture. It has a decisive influence on the overall
         performance and cost.

      •  The traffic in the IN consists of data transfers and
         transfers of commands and requests.

      •  The key parameters of the IN are
           - total bandwidth: transferred bits/second
           - cost


Datorarkitektur                                                  Fö 11/12 - 24


                     The Interconnection Network (cont'd)


      Single Bus

         Node1        Node2      . . .     Noden
           |            |                    |
       ----+------------+---------...--------+----

      •  Single bus networks are simple and cheap.
      •  One single communication is allowed at a time; the
         bandwidth is shared by all nodes.
      •  Performance is relatively poor.
      •  In order to keep a certain level of performance, the number
         of nodes has to be limited (16 - 20).

Datorarkitektur                                                  Fö 11/12 - 25


                     The Interconnection Network (cont'd)


      Completely connected network

         [Figure: five nodes, Node1 .. Node5, with a direct link
          between every pair of nodes.]

      •  Each node is connected to every other node.

      •  Communications can be performed in parallel between any
         pair of nodes.

      •  Both performance and cost are high.

      •  Cost increases rapidly with the number of nodes.


Datorarkitektur                                                  Fö 11/12 - 26


                     The Interconnection Network (cont'd)


      Crossbar network

         [Figure: nodes Node1 .. Noden attached to a grid of row and
          column wires, with a switch at each crossing point.]

      •  The crossbar is a dynamic network: the interconnection
         topology can be modified by positioning the switches.

      •  The crossbar switch is completely connected: any node can
         be directly connected to any other one.

      •  Fewer interconnections are needed than for the static
         completely connected network; however, a large number of
         switches is needed.

      •  A large number of communications can be performed in
         parallel (although a given node can receive or send only
         one data item at a time).





Datorarkitektur                                                  Fö 11/12 - 27


                     The Interconnection Network (cont'd)


      Mesh network

         Node1    Node5    Node9     Node13
         Node2    Node6    Node10    Node14
         Node3    Node7    Node11    Node15
         Node4    Node8    Node12    Node16

         (4 × 4 mesh: each node is linked to its horizontal and
          vertical neighbours)

      •  Mesh networks are cheaper than completely connected ones
         and provide relatively good performance.

      •  In order to transmit information between certain nodes,
         routing through intermediate nodes is needed (a maximum of
         2×(n−1) intermediate nodes for an n×n mesh).

      •  It is possible to provide wraparound connections: between
         nodes 1 and 13, 2 and 14, etc.

      •  Three-dimensional meshes have also been implemented.


Datorarkitektur                                                  Fö 11/12 - 28


                     The Interconnection Network (cont'd)


      Hypercube network

         [Figure: a 4-dimensional hypercube with 16 nodes,
          N0 .. N15; each node is linked to the 4 nodes whose labels
          differ in exactly one bit.]

      •  2^n nodes are arranged in an n-dimensional cube. Each node
         is connected to n neighbours.

      •  In order to transmit information between certain nodes,
         routing through intermediate nodes is needed (a maximum of
         n intermediate nodes).

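      The hypercube routing bound is easy to check in a few lines of
      C (a sketch of ours, not from the slides): two node labels
      differ exactly in the bit positions of src XOR dst, and
      routing corrects one bit per hop, so a message needs at most n
      hops in an n-dimensional cube.

      #include <stdio.h>

      /* number of hops between two hypercube nodes: the labels differ
         in the set bits of (src XOR dst), and each hop fixes one bit */
      static int hops(unsigned src, unsigned dst)
      {
          unsigned diff = src ^ dst;
          int h = 0;
          while (diff) {
              h += diff & 1u;
              diff >>= 1;
          }
          return h;
      }

      int main(void)
      {
          /* the 4-dimensional, 16-node cube of the figure: N0 .. N15 */
          printf("N0 -> N15: %d hops\n", hops(0, 15)); /* 4 hops: worst case */
          printf("N2 -> N6 : %d hops\n", hops(2, 6));  /* 1 hop: neighbours  */
          return 0;
      }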
Datorarkitektur                                                  Fö 11/12 - 29


                               SIMD Computers

         Control  --->  PU   PU   . . .   PU
          unit          PU   PU   . . .   PU
                           . . .
                        PU   PU   . . .   PU

         (a mesh of processing units PU, all executing the
          instruction broadcast by the control unit)

      •  SIMD computers are usually called array processors.

      •  The PUs are usually very simple: an ALU which executes the
         instruction broadcast by the CU, a few registers, and some
         local memory.

      •  The first SIMD computer:
           - ILLIAC IV (1970s): 64 relatively powerful processors
             (mesh connection, see above).

      •  A contemporary commercial computer:
           - CM-2 (Connection Machine) by Thinking Machines
             Corporation: 65 536 very simple processors (connected
             as a hypercube).

      •  Array processors are highly specialized for numerical
         problems that can be expressed in matrix or vector format
         (see the program on slide 8).
         Each PU computes one element of the result.


Datorarkitektur                                                  Fö 11/12 - 30


                              MULTIPROCESSORS


      Shared memory MIMD computers are called multiprocessors:

          Local Memory     Local Memory           Local Memory
          Processor 1      Processor 2     . . .  Processor n
               |                |                      |
               +---------- Shared Memory -------------+

      •  Some multiprocessors have no shared memory which is central
         to the system and equally accessible to all processors.
         All the memory is distributed as local memory to the
         processors. However, each processor has access to the local
         memory of any other processor ⇒ a global physical address
         space is available.

         This memory organization is called distributed shared
         memory.





Datorarkitektur                                                  Fö 11/12 - 31


                          Multiprocessors (cont'd)


      •  Communication between processors is through the shared
         memory. One processor can change the value in a location
         and the other processors can read the new value.

      •  From the programmer's point of view, communication is
         realised through shared variables; these are variables
         which can be accessed by each of the parallel activities
         (processes):
           - table t on slide 5;
           - matrices a, b, and c on slide 7.

      •  With many fast processors memory contention can seriously
         degrade performance ⇒ multiprocessor architectures don't
         support a high number of processors.


Datorarkitektur                                                  Fö 11/12 - 32


                          Multiprocessors (cont'd)


      •  IBM System/370 (1970s): two IBM CPUs connected to shared
         memory.
         IBM System/370-XA (1981): multiple CPUs can be connected to
         shared memory.
         IBM System/390 (1990s): similar features to the S/370-XA,
         with improved performance. Possibility to connect several
         multiprocessor systems together through fast fibre-optic
         links.

      •  CRAY X-MP (mid 1980s): one to four vector processors
         connected to shared memory (cycle time: 8.5 ns).
         CRAY Y-MP (1988): one to eight vector processors connected
         to shared memory; 3 times more powerful than the CRAY X-MP
         (cycle time: 4 ns).
         C90 (early 1990s): further development of the CRAY Y-MP;
         16 vector processors.
         CRAY 3 (1993): maximum 16 vector processors (cycle time:
         2 ns).

      •  Butterfly multiprocessor system, by BBN Advanced Computers
         (1985/87): maximum 256 Motorola 68020 processors,
         interconnected by a sophisticated dynamic switching
         network; distributed shared memory organization.
         BBN TC2000 (1990): improved version of the Butterfly using
         the Motorola 88100 RISC processor.

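      As a minimal illustration of communication through shared
      variables, here is a C/pthreads sketch of ours (not from the
      slides): one thread stores a new value into a shared location,
      and the main thread reads it back after the join; the mutex
      stands in for the synchronization a real program would need.

      #include <pthread.h>
      #include <stdio.h>

      /* a shared variable, accessible to all parallel activities */
      static int shared_value;
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *writer(void *arg)
      {
          (void)arg;
          pthread_mutex_lock(&lock);
          shared_value = 42;               /* one activity changes the value  */
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t t;
          pthread_create(&t, NULL, writer, NULL);
          pthread_join(t, NULL);
          printf("%d\n", shared_value);    /* another one reads the new value */
          return 0;
      }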
Datorarkitektur                                                  Fö 11/12 - 33


                               Multicomputers


      MIMD computers with a distributed address space, in which each
      processor has its own private memory that is not visible to
      other processors, are called multicomputers:

          Private Memory     Private Memory          Private Memory
          Processor 1        Processor 2      . . .  Processor n
               |                  |                       |
               +------ Interconnection Network ----------+


Datorarkitektur                                                  Fö 11/12 - 34


                           Multicomputers (cont'd)


      •  Communication between processors is only by passing
         messages over the interconnection network.

      •  From the programmer's point of view this means that no
         shared variables are available (a variable can be accessed
         only by one single process). For communication between
         parallel activities (processes) the programmer uses
         channels and send/receive operations (see the program on
         slide 10).

      •  There is no competition between the processors for shared
         memory ⇒ the number of processors is not limited by memory
         contention.

      •  The speed of the interconnection network is an important
         parameter for the overall performance.

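      On present-day multicomputers this programming style is
      usually provided by a message-passing library; the sketch
      below (ours, not from the slides) shows the same send/receive
      pattern in MPI, run with two processes.

      #include <mpi.h>
      #include <stdio.h>

      /* run with: mpirun -np 2 ./a.out */
      int main(int argc, char **argv)
      {
          int rank, v;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          if (rank == 0) {
              v = 42;
              MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send(ch,v)    */
          } else if (rank == 1) {
              MPI_Recv(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);                      /* receive(ch,v) */
              printf("received %d\n", v);
          }
          MPI_Finalize();
          return 0;
      }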




Datorarkitektur                                                  Fö 11/12 - 35


                           Multicomputers (cont'd)

      •  Intel iPSC/2 (1989): 128 CPUs of type 80386, interconnected
         by a 7-dimensional hypercube (2^7 = 128).

      •  Intel Paragon (1991): over 2000 processors of type i860 (a
         high performance RISC), interconnected by a two-dimensional
         mesh network.

      •  KSR-1 by Kendall Square Research (1992): 1088 processors
         interconnected by a ring network.

      •  nCUBE/2S by nCUBE (1992): 8192 processors interconnected by
         a 10-dimensional hypercube.

      •  Cray T3E MC512 (1995): 512 CPUs interconnected by a
         three-dimensional mesh; each CPU is a DEC Alpha RISC.

      •  Network of workstations:
         A group of workstations connected through a Local Area
         Network (LAN) can be used together as a multicomputer for
         parallel computation. Performance will usually be lower
         than with specialized multicomputers, because of the
         communication speed of the LAN; however, this is a cheap
         solution.


Datorarkitektur                                                  Fö 11/12 - 36


                              Vector Processors


      •  Vector processors include in their instruction set, beside
         scalar instructions, also instructions operating on
         vectors.

      •  Array processors (SIMD computers; see slide 29) can operate
         on vectors by executing the same instruction simultaneously
         on pairs of vector elements; each instruction is executed
         by a separate processing element.

      •  Several computer architectures have implemented vector
         operations using the parallelism provided by pipelined
         functional units. Such architectures are called vector
         processors.

Datorarkitektur                                                  Fö 11/12 - 37


                          Vector Processors (cont'd)


      •  Vector processors are not parallel processors: there are
         not several CPUs running in parallel. They are SISD
         processors which have implemented vector instructions
         executed on pipelined functional units.

      •  Vector computers usually have vector registers, each of
         which can store 64 up to 128 words.

      •  Vector instructions (see slide 40):
           - load a vector from memory into a vector register
           - store a vector into memory
           - arithmetic and logic operations between vectors
           - operations between vectors and scalars
           - etc.

      •  From the programmer's point of view this means that vector
         operations can be used in programs (see the program on
         slide 8), and the compiler translates these operations into
         vector instructions at the machine level.


Datorarkitektur                                                  Fö 11/12 - 38


                          Vector Processors (cont'd)

                                      Scalar unit
                          scalar        scalar registers
                       instructions     scalar functional units
         Instruction
          decoder                                                  Memory
                          vector        vector registers
                       instructions     vector functional units
                                      Vector unit

      •  Vector computers:
           - CDC Cyber 205
           - CRAY
           - IBM 3090 (an extension to the IBM System/370)
           - NEC SX
           - Fujitsu VP
           - HITACHI S8000





Datorarkitektur                                           Fö 11/12 - 39

                           The Vector Unit

        •   A vector unit typically consists of
             - pipelined functional units
             - vector registers

        •   Vector registers:
             - n general purpose vector registers Ri, 0 ≤ i ≤ n-1;
             - a vector length register VL, which stores the length l
               (0 ≤ l ≤ s) of the currently processed vector(s);
               s is the length of the vector registers Ri;
             - a mask register M, which stores a set of l bits, one
               for each element in a vector register, interpreted as
               boolean values; vector instructions can be executed in
               masked mode, so that vector register elements
               corresponding to a false value in M are ignored.
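
        As a concrete mental model of this register set, here is a
        minimal C sketch (our own illustration; the register count,
        the register length s = 64, and the double element type are
        assumptions, not taken from any particular machine). It
        models masked execution, the WHERE(M) form on the next slide:

            #include <stdbool.h>

            /* Toy model of the vector-unit state described above. */
            #define S     64        /* length s of each register Ri */
            #define NREGS 8         /* n general purpose registers  */

            typedef struct {
                double R[NREGS][S]; /* vector registers R0..R(n-1)  */
                int    VL;          /* vector length, 0 <= VL <= S  */
                bool   M[S];        /* mask, one bit per element    */
            } vector_unit;

            /* WHERE(M) Rd <- Ra + Rb: element-wise addition over the
               first VL elements; elements whose mask bit is false
               are left untouched. */
            static void vadd_masked(vector_unit *vu, int d, int a, int b) {
                for (int i = 0; i < vu->VL; i++)
                    if (vu->M[i])
                        vu->R[d][i] = vu->R[a][i] + vu->R[b][i];
            }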
Datorarkitektur                                           Fö 11/12 - 40

                          Vector Instructions

        LOAD-STORE instructions:
            R ← A(x1:x2:incr)            load
            A(x1:x2:incr) ← R            store

            R ← MASKED(A)                masked load
            A ← MASKED(R)                masked store

            R ← INDIRECT(A(X))           indirect load
            A(X) ← INDIRECT(R)           indirect store

        Arithmetic - logic:
            R ← R' b_op R''
            R ← S b_op R'
            R ← u_op R'
            M ← R rel_op R'
            WHERE(M) R ← R' b_op R''

        Chaining:
            R2 ← R0 + R1
            R3 ← R2 * R4

            Execution of the vector multiplication does not have to
            wait until the vector addition has terminated: as elements
            of the sum are generated by the addition pipeline, they
            enter the multiplication pipeline; thus, addition and
            multiplication are performed (partially) in parallel.
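
        To see how much chaining buys, consider a rough timing model
        (our own illustration; the startup latencies are assumed): a
        pipelined unit with startup latency s delivers one result per
        cycle once the pipeline is full, so for vectors of length n:

            T_unchained = (s_add + n) + (s_mul + n)
            T_chained   ≈  s_add + s_mul + n

        Chaining thus saves roughly n cycles, which dominates for
        long vectors.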
Datorarkitektur                                           Fö 11/12 - 41

                      Vector Instructions (cont’d)

        In a Pascal-like language with vector computation:

            if T[1..50]>0 then
               T[1..50]:=T[1..50]+1;

        A compiler for a vector computer generates something like:

            R0 ← T(0:49:1)
            VL ← 50
            M ← R0 > 0
            WHERE(M) R0 ← R0 + 1
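
        For reference, the generated masked sequence computes exactly
        what the following plain scalar loop computes (a minimal C
        sketch of our own; the 0-based indexing and the sample data
        are assumptions):

            #include <stdio.h>

            #define N 50

            int main(void) {
                int T[N];
                for (int i = 0; i < N; i++)     /* sample data: -24 .. 25 */
                    T[i] = i - 24;

                /* Scalar equivalent of  M ← R0 > 0;  WHERE(M) R0 ← R0 + 1:
                   only elements with a true mask bit are incremented. */
                for (int i = 0; i < N; i++)
                    if (T[i] > 0)
                        T[i] = T[i] + 1;

                printf("T[24]=%d T[25]=%d\n", T[24], T[25]);  /* 0 and 2 */
                return 0;
            }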


Datorarkitektur                                           Fö 11/12 - 42

              Multimedia Extensions to General Purpose
                           Microprocessors

        •   Video and audio applications very often deal with large
            arrays of small data types (8 or 16 bits).

        •   Such applications exhibit a large potential of SIMD
            (vector) parallelism.

        •   New generations of general purpose microprocessors have
            been equipped with special instructions to exploit this
            potential of parallelism.

        •   The specialised multimedia instructions perform vector
            computations on bytes, half-words, or words.





Datorarkitektur                                           Fö 11/12 - 43

              Multimedia Extensions to General Purpose
                      Microprocessors (cont’d)

        Several vendors have extended the instruction set of their
        processors in order to improve performance with multimedia
        applications:

            •   MMX for the Intel x86 family
            •   VIS for UltraSparc
            •   MDMX for MIPS
            •   MAX-2 for Hewlett-Packard PA-RISC

        The Pentium line provides 57 MMX instructions. They treat
        data in a SIMD fashion (see textbook pg. 353).

Datorarkitektur                                           Fö 11/12 - 44

              Multimedia Extensions to General Purpose
                      Microprocessors (cont’d)

        The basic idea: subword execution

        •   Use the entire width of a processor data path (32 or 64
            bits), even when processing the small data types used in
            signal processing (8, 12, or 16 bits).

        With a word size of 64 bits, the adders can be used to
        implement eight 8-bit additions in parallel.

        This is practically a kind of SIMD parallelism, at a reduced
        scale.
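
        The subword idea can even be mimicked in software on an
        ordinary 64-bit ALU. In this minimal C sketch (our own
        illustration, not an MMX instruction), masking the top bit of
        each byte keeps carries from crossing byte boundaries, so one
        64-bit addition performs eight independent (wrapping) 8-bit
        additions:

            #include <stdint.h>
            #include <stdio.h>

            /* Eight independent 8-bit additions in one 64-bit
               operation: add the low 7 bits of every byte first,
               then patch in the carry-free top bits, so no carry
               crosses a byte boundary. */
            static uint64_t add8x8(uint64_t a, uint64_t b) {
                uint64_t lo = (a & 0x7F7F7F7F7F7F7F7FULL)
                            + (b & 0x7F7F7F7F7F7F7F7FULL);
                return lo ^ ((a ^ b) & 0x8080808080808080ULL);
            }

            int main(void) {
                uint64_t a = 0x01020304050607FFULL;   /* bytes a7..a0 */
                uint64_t b = 0x1010101010101001ULL;   /* bytes b7..b0 */
                printf("%016llx\n", (unsigned long long)add8x8(a, b));
                /* prints 1112131415161700: each byte is (ai+bi) mod 256 */
                return 0;
            }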
Datorarkitektur                                           Fö 11/12 - 45

              Multimedia Extensions to General Purpose
                      Microprocessors (cont’d)

        Three packed data types are defined for parallel operations:
        packed byte, packed half word, and packed word. All fit into
        a 64-bit long word:

            Packed byte:       q7  q6  q5  q4  q3  q2  q1  q0
            Packed half word:  q3      q2      q1      q0
            Packed word:       q1              q0
            Long word:         q0                        (64 bits)

Datorarkitektur                                           Fö 11/12 - 46

              Multimedia Extensions to General Purpose
                      Microprocessors (cont’d)

        •   Examples of SIMD arithmetic with the MMX instruction set:

            ADD R3 ← R1,R2
                R1 holds packed bytes a7 .. a0 and R2 holds b7 .. b0;
                the eight additions are performed in parallel:
                R3 = a7+b7 | a6+b6 | a5+b5 | a4+b4 | a3+b3 | a2+b2 | a1+b1 | a0+b0

            MPYADD R3 ← R1,R2
                neighbouring bytes are multiplied pairwise and the
                products added, giving four half-word results:
                R3 = (a7×b7)+(a6×b6) | (a5×b5)+(a4×b4) | (a3×b3)+(a2×b2) | (a1×b1)+(a0×b0)
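
        On today’s x86 processors the MMX idea lives on in the
        SSE/AVX extensions. The following C sketch (our own; it uses
        the SSE2 intrinsic _mm_add_epi8 rather than an MMX
        instruction, and SSE2 registers are 128 bits wide, so 16
        bytes are added at once) shows the packed-byte ADD from
        above:

            #include <emmintrin.h>    /* SSE2 intrinsics, x86 only */
            #include <stdint.h>
            #include <stdio.h>

            int main(void) {
                uint8_t a[16], b[16], c[16];
                for (int i = 0; i < 16; i++) { a[i] = (uint8_t)i; b[i] = 10; }

                __m128i va = _mm_loadu_si128((const __m128i *)a);
                __m128i vb = _mm_loadu_si128((const __m128i *)b);
                __m128i vc = _mm_add_epi8(va, vb);  /* 16 byte adds at once */
                _mm_storeu_si128((__m128i *)c, vc);

                for (int i = 0; i < 16; i++)
                    printf("%d ", c[i]);            /* prints 10 11 .. 25 */
                printf("\n");
                return 0;
            }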




Datorarkitektur                                           Fö 11/12 - 47

              Multimedia Extensions to General Purpose
                      Microprocessors (cont’d)

        How to get the data ready for computation?
        How to get the results back in the right format?

        •   Packing and Unpacking

            PACK.W R3 ← R1,R2
                R1 holds two words a1, a0 and R2 holds two words
                b1, b0; the four values are truncated to half words
                and packed into R3:
                R3 = truncated b1 | truncated b0 | truncated a1 | truncated a0

            UNPACK R3 ← R1
                R1 holds four half words a3, a2, a1, a0; the lower
                two are expanded to full words in R3:
                R3 = a1 | a0
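
        A small C sketch of the truncating pack as we read the slide
        (the function name and the exact result layout are our own;
        real instruction sets also provide saturating variants that
        clamp instead of truncate):

            #include <stdint.h>
            #include <stdio.h>

            /* PACK.W-style truncating pack: R1 holds words a1,a0 and
               R2 holds words b1,b0; keep only the low 16 bits of each
               value and pack the four half words into one 64-bit
               register, laid out as b1 | b0 | a1 | a0. */
            static uint64_t pack_w(uint64_t r1, uint64_t r2) {
                uint16_t a0 = (uint16_t)r1, a1 = (uint16_t)(r1 >> 32);
                uint16_t b0 = (uint16_t)r2, b1 = (uint16_t)(r2 >> 32);
                return ((uint64_t)b1 << 48) | ((uint64_t)b0 << 32)
                     | ((uint64_t)a1 << 16) |  (uint64_t)a0;
            }

            int main(void) {
                uint64_t r = pack_w(0x0001234500012346ULL,   /* a1, a0 */
                                    0x0007000800070009ULL);  /* b1, b0 */
                printf("%016llx\n", (unsigned long long)r);
                /* prints 0008000923452346 */
                return 0;
            }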
                                                                                                                  parallel computer depends not only on the number
                                                                                                                  of available processors but is limited by
                                                                                                                  characteristics of the executed programs.
   UNPACK R3 ← R1
              a3                        a2                  a1              a0                             •      The efficiency of using a parallel computer is
                                                                                                                  influenced by features of the parallel program, like:
                                                                                                                  degree of parallelism, intensity of inter-processor
                                                                                                                  communication, etc.
                         a1                                       a0




Petru Eles, IDA, LiTH                                                                            Petru Eles, IDA, LiTH
Datorarkitektur                                            Fö 11/12 - 49




                               Summary (cont’d)


          •      A key component of a parallel architecture is the
                 interconnection network.

          •      Array processors execute the same operation on a
                 set of interconnected processing units. They are
                 specialized for numerical problems expressed in
                 matrix or vector formats.

          •      Multiprocessors are MIMD computers in which all
                 CPUs have access to a common shared address
                 space. The number of CPUs is limited.

          •      Multicomputers have a distributed address
                 space. Communication between CPUs is only by
                 message passing over the interconnection network.
                 The number of interconnected CPUs can be high.

          •      Vector processors are SISD processors which
                 include in their instruction set instructions operating
                 on vectors. They are implemented using pipelined
                 functional units.

          •      Multimedia applications exhibit a large potential of
                 SIMD parallelism. The instruction set of modern
                 general purpose microprocessors (Pentium,
                 UltraSparc) has been extended to support SIMD-
                 style parallelism with operations on short vectors.



Petru Eles, IDA, LiTH
