UNIT 3 - Basic
Processing Unit
Overview
   Instruction Set Processor (ISP)
   Central Processing Unit (CPU)
   A typical computing task consists of a series
    of steps specified by a sequence of machine
    instructions that constitute a program.
   An instruction is executed by carrying out a
    sequence of more rudimentary operations.
Some Fundamental
        Concepts
Fundamental Concepts
   Processor fetches one instruction at a time and performs the operation
    specified.
   Instructions are fetched from successive memory locations until a branch or
    a jump instruction is encountered.
   Processor keeps track of the address of the memory location containing the
    next instruction to be fetched using the Program Counter (PC).
   Instruction Register (IR)
Executing an Instruction
   Fetch the contents of the memory location pointed
    to by the PC. The contents of this location are
    loaded into the IR (fetch phase).
                     IR ← [[PC]]
   Assuming that the memory is byte addressable,
    increment the contents of the PC by 4 (fetch phase).
                     PC ← [PC] + 4
   Carry out the actions specified by the instruction in
    the IR (execution phase).
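A rough sketch of this fetch/execute cycle in Python (the memory contents, the 4-byte instruction length, and the use of instruction strings are assumptions made only for illustration):

    # Minimal sketch of the fetch/execute cycle for a byte-addressable machine.
    # Memory contents and the 4-byte instruction length are illustrative assumptions.
    memory = {0: "Add (R3), R1", 4: "Move (R1), R2", 8: "Halt"}

    pc = 0                          # Program Counter
    while True:
        ir = memory[pc]             # fetch phase: IR <- [[PC]]
        pc = pc + 4                 # fetch phase: PC <- [PC] + 4
        if ir == "Halt":
            break
        print(f"executing: {ir}")   # execution phase: carry out the actions in IR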
Processor Organization

   Figure 7.1. Single-bus organization of the datapath inside a processor:
   PC, MAR, MDR, IR, the general-purpose registers R0 through R(n-1), TEMP, Y,
   and Z are all connected by the internal processor bus; MAR drives the address
   lines and MDR the data lines of the memory bus; the Select MUX chooses between
   Y and the constant 4 for one ALU input; the instruction decoder and control
   logic generate the control signals (ALU control lines: Add, Sub, ..., XOR,
   carry-in).

   Note: MDR has two inputs and two outputs.

   Textbook Page 413
Executing an Instruction
   Transfer a word of data from one processor
    register to another or to the ALU.
   Perform an arithmetic or a logic operation
    and store the result in a processor register.
   Fetch the contents of a given memory
    location and load them into a processor
    register.
   Store a word of data from a processor
    register into a given memory location.
Register Transfers

   Figure 7.2. Input and output gating for the registers in Figure 7.1
   (each register Ri has gating signals Riin and Riout onto the internal
   processor bus; Y feeds one ALU input through the Select MUX, which can
   substitute the constant 4, and the ALU result is captured in Z via Zin/Zout).
Register Transfers

   All operations and data transfers are controlled by the processor clock.

   Figure 7.3. Input and output gating for one register bit
   (a D flip-flop clocked by the processor clock; Riin selects whether the bit
   is reloaded from the bus or keeps its current value, and Riout gates the
   output onto the bus).
Performing an Arithmetic or
Logic Operation
    The ALU is a combinational circuit that has no
     internal storage.
   The ALU gets one operand from the MUX (register Y or the constant 4) and the
    other from the bus. The result is temporarily stored in register Z.
    What is the sequence of operations to add the
     contents of register R1 to those of R2 and store the
     result in R3?
    1.   R1out, Yin
    2.   R2out, SelectY, Add, Zin
    3.   Zout, R3in
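A minimal sketch of these three control steps, with the registers of Figure 7.1 modeled as Python variables (the initial register values are arbitrary assumptions):

    # Registers modeled as a dictionary; the single bus is an ordinary variable.
    reg = {"R1": 10, "R2": 32, "R3": 0, "Y": 0, "Z": 0}

    # Step 1: R1out, Yin -- R1 drives the bus and Y latches it
    bus = reg["R1"]
    reg["Y"] = bus

    # Step 2: R2out, SelectY, Add, Zin -- the ALU adds Y (via the MUX) and the bus
    bus = reg["R2"]
    reg["Z"] = reg["Y"] + bus

    # Step 3: Zout, R3in -- Z drives the bus and R3 latches the result
    bus = reg["Z"]
    reg["R3"] = bus

    assert reg["R3"] == 42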
Fetching a Word from Memory
   Address into MAR; issue Read operation; data into MDR.
   Figure 7.4. Connection and control signals for register MDR
   (MDRin and MDRout gate MDR to and from the internal processor bus; MDRinE and
   MDRoutE gate it to and from the memory-bus data lines).
Fetching a Word from Memory
   The response time of each memory access varies
    (cache miss, memory-mapped I/O,…).
   To accommodate this, the processor waits until it
    receives an indication that the requested operation
    has been completed (Memory-Function-Completed,
    MFC).
   Move (R1), R2
   MAR ← [R1]
   Start a Read operation on the memory bus
   Wait for the MFC response from the memory
   Load MDR from the memory bus
   R2 ← [MDR]
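A hedged sketch of this sequence, where the memory interface is reduced to a hypothetical memory_read helper that reports a variable latency until MFC (all names and timings below are illustrative assumptions):

    import random

    def memory_read(address):
        # Hypothetical memory interface: returns (data, cycles until MFC).
        return 100 + address, random.randint(1, 5)

    # Move (R1), R2
    R1 = 0x20
    MAR = R1                            # MAR <- [R1]
    data, latency = memory_read(MAR)    # start a Read operation on the memory bus
    for cycle in range(latency):        # wait for the MFC response from the memory
        pass                            # the processor idles until MFC is asserted
    MDR = data                          # load MDR from the memory bus
    R2 = MDR                            # R2 <- [MDR]
    print(hex(R2))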
Timing

   Assume MAR is always available on the address lines of the memory bus.

   MARin:  MAR ← [R1]
   Read:   start a Read operation on the memory bus
   MFC:    wait for the MFC response from the memory
   MDRinE: load MDR from the memory bus
   MDRout: R2 ← [MDR]

   Figure 7.5. Timing of a memory Read operation.
Execution of a Complete
Instruction
   Add (R3), R1
   Fetch the instruction
   Fetch the first operand (the contents of the
    memory location pointed to by R3)
   Perform the addition
   Load the result into R1
Architecture

   Figure 7.2. Input and output gating for the registers in Figure 7.1
   (repeated here for reference: register gating signals Riin/Riout, the Y and Z
   registers, the Select MUX with the constant 4, and the ALU).
Execution of a Complete Instruction

   Add (R3), R1

   Step   Action

   1      PCout, MARin, Read, Select4, Add, Zin
   2      Zout, PCin, Yin, WMFC
   3      MDRout, IRin
   4      R3out, MARin, Read
   5      R1out, Yin, WMFC
   6      MDRout, SelectY, Add, Zin
   7      Zout, R1in, End

   Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
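As a sketch, the seven control steps above can be treated as data that a control step counter walks through, asserting one set of signals per clock cycle (the signal names follow Figure 7.6; the little interpreter itself is an illustrative assumption):

    # Control sequence of Figure 7.6, one set of active signals per control step.
    ADD_R3_IND_R1 = [
        {"PCout", "MARin", "Read", "Select4", "Add", "Zin"},   # T1
        {"Zout", "PCin", "Yin", "WMFC"},                       # T2
        {"MDRout", "IRin"},                                    # T3
        {"R3out", "MARin", "Read"},                            # T4
        {"R1out", "Yin", "WMFC"},                              # T5
        {"MDRout", "SelectY", "Add", "Zin"},                   # T6
        {"Zout", "R1in", "End"},                               # T7
    ]

    for step, signals in enumerate(ADD_R3_IND_R1, start=1):
        print(f"T{step}: assert {sorted(signals)}")
        if "End" in signals:            # End resets the control step counter
            break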
Execution of Branch
Instructions
   A branch instruction replaces the contents of
    PC with the branch target address, which is
    usually obtained by adding an offset X given
    in the branch instruction.
   The offset X is usually the difference between
    the branch target address and the address
    immediately following the branch instruction.
   Conditional branch
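A small worked example of the offset (the addresses are arbitrary): if the branch instruction is at address 1000 and the target is 1100, the PC already holds 1004 after the fetch phase, so the instruction must encode X = 96.

    branch_address = 1000                     # address of the branch instruction (assumed)
    target_address = 1100                     # branch target address (assumed)
    updated_pc = branch_address + 4           # PC after the fetch phase
    offset_X = target_address - updated_pc    # offset encoded in the instruction
    assert updated_pc + offset_X == target_address   # PC <- [PC] + X reaches the target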
Execution of Branch
Instructions

   Step   Action

   1      PCout, MARin, Read, Select4, Add, Zin
   2      Zout, PCin, Yin, WMFC
   3      MDRout, IRin
   4      Offset-field-of-IRout, Add, Zin
   5      Zout, PCin, End

   Figure 7.7. Control sequence for an unconditional branch instruction.
Multiple-Bus Organization

   Figure 7.8. Three-bus organization of the datapath:
   a general-purpose register file drives buses A and B and is written from bus
   C; the PC has its own incrementer; a MUX selects either bus A or the constant
   4 for one ALU input; the ALU takes inputs A and B and drives its result R onto
   bus C; the instruction decoder is fed from IR, and MDR and MAR connect the
   datapath to the memory-bus data and address lines.
Multiple-Bus Organization
   Add R4, R5, R6


   Step   Action

   1      PCout, R=B, MARin, Read, IncPC
   2      WMFC
   3      MDRoutB, R=B, IRin
   4      R4outA, R5outB, SelectA, Add, R6in, End

   Figure 7.9. Control sequence for the instruction Add R4,R5,R6
               for the three-bus organization in Figure 7.8.
Quiz

   What is the control sequence for execution of the instruction

       Add R1, R2

   including the instruction fetch phase? (Assume the single-bus architecture of
   Figure 7.1.)
Hardwired Control
Overview
   To execute instructions, the processor must
    have some means of generating the control
    signals needed in the proper sequence.
   Two categories: hardwired control and
    microprogrammed control
   A hardwired system can operate at high speed, but it has little flexibility.
Control Unit Organization
   Figure 7.10. Control unit organization
   (the clock drives a control step counter; a decoder/encoder block combines the
   step count, the IR contents, external inputs, and condition codes to produce
   the control signals).
Detailed Block Description
   Figure 7.11. Separation of the decoding and encoding functions
   (the clock drives a control step counter, which can be Reset; a step decoder
   produces T1, T2, ..., Tn; the instruction decoder produces INS1 ... INSm from
   the IR; the encoder combines these with external inputs and condition codes
   to generate the control signals, including Run and End).
Generating Zin
   Zin = T1 + T6 • ADD + T4 • BR + …
   Figure 7.12. Generation of the Zin control signal for the processor in
   Figure 7.1 (Zin is asserted at T1 of every instruction, at T6 of an Add, and
   at T4 of a Branch).
Generating End
   End = T7 • ADD + T5 • BR + (T5 • N + T4 • N′) • BRN + …

   Figure 7.13. Generation of the End control signal
   (End is asserted at T7 of an Add, at T5 of an unconditional Branch, and for a
   Branch<0 either at T5 when N = 1 or at T4 when N = 0).
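A minimal combinational sketch of these two equations, with the step signals T1..T7 and the decoded instruction signals modeled as booleans (the treatment of the N flag follows the reconstructed End expression above):

    def zin(T, ADD, BR):
        # Zin = T1 + T6.ADD + T4.BR + ...
        return T[1] or (T[6] and ADD) or (T[4] and BR)

    def end(T, ADD, BR, BRN, N):
        # End = T7.ADD + T5.BR + (T5.N + T4.N').BRN + ...
        return (T[7] and ADD) or (T[5] and BR) or (((T[5] and N) or (T[4] and not N)) and BRN)

    # Example: step T6 of an Add instruction asserts Zin but not End.
    T = {i: (i == 6) for i in range(1, 8)}
    print(zin(T, ADD=True, BR=False))                      # True
    print(end(T, ADD=True, BR=False, BRN=False, N=False))  # False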
A Complete Processor
   Figure 7.14. Block diagram of a complete processor:
   an instruction unit, an integer unit, and a floating-point unit; separate
   instruction and data caches; a bus interface connects the processor to the
   system bus, which also serves the main memory and the input/output devices.
Microprogrammed
          Control
Overview
   Control signals are generated by a program similar to machine
    language programs.
   Control Word (CW); microroutine; microinstruction




   One control word per microinstruction, one bit per control signal, in the
   column order: PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin,
   Zout, R1out, R1in, R3out, WMFC, End.

   Micro-
   instruction
   1            0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
   2            1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
   3            0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
   4            0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
   5            0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
   6            0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
   7            0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1

   Figure 7.15. An example of microinstructions for Figure 7.6.
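A sketch of how one row of the table is obtained: a control word is simply the set of active signals mapped onto a fixed bit position per signal (the column order follows the table above; the helper function is an illustrative assumption):

    # One bit position per control signal, in the column order of Figure 7.15.
    SIGNALS = ["PCin", "PCout", "MARin", "Read", "MDRout", "IRin", "Yin", "Select",
               "Add", "Zin", "Zout", "R1out", "R1in", "R3out", "WMFC", "End"]

    def control_word(active):
        # Encode a set of active signals as a control word (bit string).
        return "".join("1" if s in active else "0" for s in SIGNALS)

    # Micro-instruction 1: PCout, MARin, Read, Select4, Add, Zin
    # (the Select bit set to 1 chooses the constant 4).
    print(control_word({"PCout", "MARin", "Read", "Select", "Add", "Zin"}))
    # -> 0111000111000000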
Overview

   Step   Action

   1      PCout, MARin, Read, Select4, Add, Zin
   2      Zout, PCin, Yin, WMFC
   3      MDRout, IRin
   4      R3out, MARin, Read
   5      R1out, Yin, WMFC
   6      MDRout, SelectY, Add, Zin
   7      Zout, R1in, End

   Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Overview
   Control store
   Figure 7.16. Basic organization of a microprogrammed control unit
   (the IR drives a starting address generator that loads the µPC; the clock
   steps the µPC, which addresses the control store; the addressed location
   supplies the control word, CW).

   One function cannot be carried out by this simple organization.
Overview
   The previous organization cannot handle the situation when the control
    unit is required to check the status of the condition codes or external
    inputs to choose between alternative courses of action.
   Use conditional branch microinstruction.
   Address   Microinstruction

   0         PCout, MARin, Read, Select4, Add, Zin
   1         Zout, PCin, Yin, WMFC
   2         MDRout, IRin
   3         Branch to starting address of appropriate microroutine
   ...       ...
   25        If N=0, then branch to microinstruction 0
   26        Offset-field-of-IRout, SelectY, Add, Zin
   27        Zout, PCin, End

   Figure 7.17. Microroutine for the instruction Branch<0.
Overview

   Figure 7.18. Organization of the control unit to allow conditional branching
   in the microprogram (a starting and branch address generator takes the IR,
   external inputs, and condition codes; it loads the µPC, which addresses the
   control store to produce the CW).
Microinstructions
   A straightforward way to structure
    microinstructions is to assign one bit position
    to each control signal.
   However, this is very inefficient.
   The length can be reduced: most signals are
    not needed simultaneously, and many
    signals are mutually exclusive.
   All mutually exclusive signals are placed in the same group and encoded in
    binary.
Partial Format for the
Microinstructions
   Microinstruction fields: F1  F2  F3  F4  F5  F6  F7  F8

   F1 (4 bits)         F2 (3 bits)        F3 (3 bits)        F4 (4 bits)   F5 (2 bits)
   0000: No transfer   000: No transfer   000: No transfer   0000: Add     00: No action
   0001: PCout         001: PCin          001: MARin         0001: Sub     01: Read
   0010: MDRout        010: IRin          010: MDRin         ...           10: Write
   0011: Zout          011: Zin           011: TEMPin        1111: XOR
   0100: R0out         100: R0in          100: Yin
   0101: R1out         101: R1in                             (16 ALU
   0110: R2out         110: R2in                             functions)
   0111: R3out         111: R3in
   1010: TEMPout
   1011: Offsetout

   F6 (1 bit)      F7 (1 bit)      F8 (1 bit)
   0: SelectY      0: No action    0: Continue
   1: Select4      1: WMFC         1: End

   What is the price paid for this scheme?

   Figure 7.19. An example of a partial format for field-encoded microinstructions.
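A hedged sketch of field decoding: each field value enables at most one signal in its group, so mutually exclusive signals share a few encoded bits instead of one bit each (the dictionaries transcribe part of Figure 7.19; the decode helper is an assumption). The price paid is mainly the extra decoder delay in the path of every control signal, and arbitrary combinations of signals within a group can no longer be expressed.

    # Partial decoders for fields F1 and F2 of the format in Figure 7.19.
    F1 = {0b0000: None, 0b0001: "PCout", 0b0010: "MDRout", 0b0011: "Zout",
          0b0100: "R0out", 0b0101: "R1out", 0b0110: "R2out", 0b0111: "R3out"}
    F2 = {0b000: None, 0b001: "PCin", 0b010: "IRin", 0b011: "Zin",
          0b100: "R0in", 0b101: "R1in", 0b110: "R2in", 0b111: "R3in"}

    def decode(f1, f2):
        # Return the register-transfer signals selected by fields F1 and F2.
        return [s for s in (F1[f1], F2[f2]) if s is not None]

    print(decode(0b0011, 0b101))   # ['Zout', 'R1in'] -- one gating signal per group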
Further Improvement
   Enumerate the patterns of required signals in
    all possible microinstructions. Each
    meaningful combination of active control
    signals can then be assigned a distinct code.
   Vertical organization
   Horizontal organization
Microprogram Sequencing
   If all microprograms require only straightforward sequential execution of
    microinstructions except for branches, letting a μPC govern the sequencing
    would be efficient.
   However, this approach has two disadvantages:
   Having a separate microroutine for each machine instruction results
    in a large total number of microinstructions and a large control store.
   Longer execution time because it takes more time to carry out the
    required branches.
   Example: Add src, Rdst
   Four addressing modes: register, autoincrement,
    autodecrement, and indexed (with indirect forms).
- Bit-ORing
- Wide-Branch Addressing
- WMFC

   Contents of IR for this example: OP code (bit 11 and above), Mode (bits 10-8,
   here 0 1 0), Rsrc (bits 7-4), Rdst (bits 3-0).
   Address    Microinstruction
   (octal)

   000        PCout, MARin, Read, Select4, Add, Zin
   001        Zout, PCin, Yin, WMFC
   002        MDRout, IRin
   003        µBranch {µPC ← 101 (from Instruction decoder);
                       µPC5,4 ← [IR10,9]; µPC3 ← [IR10]′ ⋅ [IR9]′ ⋅ [IR8]}
   121        Rsrcout, MARin, Read, Select4, Add, Zin
   122        Zout, Rsrcin
   123        µBranch {µPC ← 170; µPC0 ← [IR8]′}, WMFC
   170        MDRout, MARin, Read, WMFC
   171        MDRout, Yin
   172        Rdstout, SelectY, Add, Zin
   173        Zout, Rdstin, End

   Figure 7.21. Microinstructions for Add (Rsrc)+,Rdst.
   Note: the microinstruction at location 170 is not executed for this
   addressing mode.
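A hedged sketch of the bit-ORing used at location 003: the instruction decoder supplies a base microaddress, and selected IR bits are ORed into low-order µPC bits so that one microbranch can reach any of several starting addresses (the bit positions follow the microbranch above, including the reconstructed complements; the IR value is an illustrative assumption):

    def wide_branch(ir):
        # Bit-OR mode bits of the IR into the base microaddress 101 (octal).
        upc = 0o101                              # base address from the decoder
        upc |= ((ir >> 9) & 0b11) << 4           # uPC5,4 <- IR10,9
        if not (ir >> 10) & 1 and not (ir >> 9) & 1 and (ir >> 8) & 1:
            upc |= 1 << 3                        # uPC3 <- IR10'.IR9'.IR8
        return upc

    ir = 0b010_0011_0101   # mode bits 10-8 = 010 (autoincrement), Rsrc = R3, Rdst = R5
    print(oct(wide_branch(ir)))   # -> 0o121, start of the autoincrement microroutine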
Microinstructions with Next-
Address Field
   The microprogram we discussed requires several
    branch microinstructions, which perform no useful
    operation in the datapath.
   A powerful alternative approach is to include an
    address field as a part of every microinstruction to
    indicate the location of the next microinstruction to
    be fetched.
   Pros: separate branch microinstructions are virtually
    eliminated; few limitations in assigning addresses to
    microinstructions.
   Cons: additional bits for the address field (around
    1/6)
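A minimal sketch of next-address sequencing, assuming a toy control store in which every microinstruction carries the address of its successor (addresses and contents are illustrative):

    # Each microinstruction = (active signals, next address); addresses in octal.
    control_store = {
        0o000: ({"PCout", "MARin", "Read", "Select4", "Add", "Zin"}, 0o001),
        0o001: ({"Zout", "PCin", "Yin", "WMFC"}, 0o002),
        0o002: ({"MDRout", "IRin"}, 0o003),
        0o003: (set(), None),   # here a real unit would branch via the instruction decoder
    }

    uAR = 0o000                 # microinstruction address register
    while uAR is not None:
        signals, uAR = control_store[uAR]
        print(sorted(signals))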
Microinstructions with Next-Address Field

   Figure 7.22. Microinstruction-sequencing organization
   (the IR, external inputs, and condition codes feed decoding circuits that load
   the µAR; the control store delivers the µIR, whose next-address field feeds
   back into the decoding circuits while the microinstruction decoder generates
   the control signals).
Microinstruction

   F0 (8 bits)            F1 (3 bits)         F2 (3 bits)        F3 (3 bits)
   Address of next        000: No transfer    000: No transfer   000: No transfer
   microinstruction       001: PCout          001: PCin          001: MARin
                          010: MDRout         010: IRin          010: MDRin
                          011: Zout           011: Zin           011: TEMPin
                          100: Rsrcout        100: Rsrcin        100: Yin
                          101: Rdstout        101: Rdstin
                          110: TEMPout

   F4 (4 bits)     F5 (2 bits)      F6 (1 bit)     F7 (1 bit)
   0000: Add       00: No action    0: SelectY     0: No action
   0001: Sub       01: Read         1: Select4     1: WMFC
   ...             10: Write
   1111: XOR

   F8 (1 bit)      F9 (1 bit)       F10 (1 bit)
   0: NextAdrs     0: No action     0: No action
   1: InstDec      1: ORmode        1: ORindsrc

   Figure 7.23. Format for microinstructions in the example of Section 7.5.3.
Implementation of the Microroutine

   Figure 7.24. Implementation of the microroutine of Figure 7.21 using a
   next-microinstruction address field; the control words occupy octal addresses
   000-003, 121-122, and 170-173. (See Figure 7.23 for the encoded signals.)
   Figure 7.25. Some details of the control-signal-generating circuitry:
   decoders driven by the Rsrc and Rdst fields of the IR produce the signals
   R0in/R0out through R15in/R15out; the decoding circuits combine the
   next-address field, the IR, external inputs, and condition codes, and the
   InstDecout, ORmode, and ORindsrc signals modify the µAR (the bit-ORing
   technique).
Prefetching
   One drawback of microprogrammed control is that it leads to a slower
    operating speed because of the time it takes to fetch microinstructions
    from the control store.
   Faster operation is achieved if the next microinstruction is prefetched
    while the current one is being executed.
   In this way, execution time is overlapped with fetch time.
Prefetching – Disadvantages
   Sometimes the status flags and the results of the currently executing
    microinstruction are needed to determine the address of the next one.
   Thus there is a chance that the wrong microinstruction will be prefetched.
   In that case, the fetch must be repeated with the correct address.
Emulation
   Emulation allows us to replace obsolete equipment with more up-to-date
    machines.
   It facilitates transitions to new computer systems with minimal disruption.
   It is easiest when the machines involved have similar architectures.
Pipelining
Overview
   Pipelining is widely used in modern
    processors.
   Pipelining improves system performance in
    terms of throughput.
   Pipelined organization requires sophisticated
    compilation techniques.
Basic Concepts
Making the Execution of
Programs Faster
   Use faster circuit technology to build the
    processor and the main memory.
   Arrange the hardware so that more than one
    operation can be performed at the same time.
   In the latter way, the number of operations
    performed per second is increased even
    though the elapsed time needed to perform
    any one operation is not changed.
Traditional Pipeline Concept

 Laundry  Example
 Ann, Brian, Cathy, Dave
  each have one load of clothes
   to wash, dry, and fold
 Washer takes 30 minutes



 Dryer   takes 40 minutes

 “Folder”   takes 20 minutes
Traditional Pipeline Concept
   Sequential laundry takes 6 hours for 4 loads (each load: 30 min wash,
    40 min dry, 20 min fold).
   If they learned pipelining, how long would laundry take?
Traditional Pipeline Concept
   Pipelined laundry takes 3.5 hours for 4 loads: after the first wash, a new
    load moves into the dryer every 40 minutes, the time of the slowest stage.
Traditional Pipeline Concept
                                                  Pipelining doesn’t help
    the latency of a single task; it helps the throughput of the entire workload
   Pipeline rate is limited by the slowest pipeline stage
   Multiple tasks operate simultaneously using different resources
   Potential speedup = number of pipe stages
   Unbalanced lengths of pipe stages reduce speedup
   Time to "fill" the pipeline and time to "drain" it reduce speedup
   Stall for dependences
Use the Idea of Pipelining in a
Computer
   Fetch + Execution

   (a) Sequential execution: instruction Ii is fully fetched (Fi) and executed
       (Ei) before Ii+1 begins.
   (b) Hardware organization: an instruction fetch unit and an execution unit,
       separated by an interstage buffer B1.
   (c) Pipelined execution: while Ei of one instruction proceeds in the
       execution unit, Fi+1 of the next instruction proceeds in the fetch unit.

   Figure 8.1. Basic idea of instruction pipelining.
Use the Idea of Pipelining in a
Computer

   Fetch + Decode + Execution + Write

   (a) Instruction execution divided into four steps: each instruction passes
       through F (fetch instruction), D (decode instruction and fetch operands),
       E (execute operation), and W (write results), one step per clock cycle,
       with a new instruction entering the pipeline every cycle.
   (b) Hardware organization: the four units are separated by interstage buffers
       B1, B2, and B3.

   Textbook page: 457

   Figure 8.2. A 4-stage pipeline.
Role of Cache Memory
   Each pipeline stage is expected to complete in one clock cycle.
   The clock period should be long enough to let the slowest pipeline stage
    complete.
   Faster stages must wait for the slowest one to complete.
   Since main memory is very slow compared with execution, if each instruction
    had to be fetched from main memory, the pipeline would be almost useless.
   Fortunately, we have caches.
Pipeline Performance
   The potential increase in performance
    resulting from pipelining is proportional to the
    number of pipeline stages.
   However, this increase would be achieved
    only if all pipeline stages require the same
    time to complete, and there is no interruption
    throughout program execution.
   Unfortunately, this is not true.
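A small sketch of the ideal case behind this claim: with n equal stages and no stalls, k instructions finish in n + (k - 1) cycles instead of n * k, so the speedup approaches n (the stage and instruction counts below are arbitrary):

    def pipelined_cycles(n_stages, k_instructions, stalls=0):
        # Cycles to complete k instructions on an n-stage pipeline with some stall cycles.
        return n_stages + (k_instructions - 1) + stalls

    n, k = 4, 1000
    sequential = n * k                      # one instruction at a time
    pipelined = pipelined_cycles(n, k)      # ideal pipelined execution
    print(sequential / pipelined)           # ~3.99, close to the 4x ideal only without stalls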
Pipeline Performance
   Instruction I2 takes three cycles in the Execute stage (E2); the instructions
   behind it cannot advance, so the pipeline stalls.

   Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Pipeline Performance
   The previous pipeline is said to have been stalled for two clock
    cycles.
   Any condition that causes a pipeline to stall is called a hazard.
   Data hazard – any condition in which either the source or the
    destination operands of an instruction are not available at the
    time expected in the pipeline. So some operation has to be
    delayed, and the pipeline stalls.
   Instruction (control) hazard – a delay in the availability of an
    instruction causes the pipeline to stall.
   Structural hazard – the situation when two instructions require
    the use of a given hardware resource at the same time.
Pipeline Performance

   Instruction hazard example: the fetch of instruction I2 misses in the cache,
   so F2 takes four clock cycles instead of one. The Decode, Execute, and Write
   stages behind it sit idle, creating bubbles in the pipeline.

   (a) Instruction execution steps in successive clock cycles
   (b) Function performed by each processor stage in successive clock cycles
       (idle periods = stalls, or bubbles)

   Figure 8.4. Pipeline stall caused by a cache miss in F2.
Pipeline Performance

   Load X(R1), R2

   Structural hazard example: the Load needs an extra cycle (M2) for its memory
   access, so its Write stage is delayed by one cycle and collides with the
   cycle in which the following instruction would write its result; the
   instructions behind the Load are each delayed by one cycle.

   Figure 8.5. Effect of a Load instruction on pipeline timing.
Pipeline Performance
   Again, pipelining does not result in individual
    instructions being executed faster; rather, it is the
    throughput that increases.
   Throughput is measured by the rate at which
    instruction execution is completed.
   Pipeline stall causes degradation in pipeline
    performance.
   We need to identify all hazards that may cause the
    pipeline to stall and to find ways to minimize their
    impact.
Data Hazards
Data Hazards
   We must ensure that the results obtained when instructions are
    executed in a pipelined processor are identical to those obtained
    when the same instructions are executed sequentially.
   Hazard occurs
        A←3+A
        B←4×A
   No hazard
        A←5×C
        B ← 20 + C
   When two operations depend on each other, they must be
    executed sequentially in the correct order.
   Another example:
        Mul R2, R3, R4
        Add R5, R4, R6
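A hedged sketch of how the hazard in this pair can be recognized: compare the destination of the earlier instruction with the sources of the later one (the tuple encoding, with the destination written last as in the three-operand examples above, is an assumption of this sketch):

    # Operands as written on the slide; the last register is the destination
    # (Mul R2,R3,R4 computes R4 <- [R2] * [R3]).
    i1 = ("Mul", "R2", "R3", "R4")
    i2 = ("Add", "R5", "R4", "R6")

    def raw_hazard(earlier, later):
        # True if 'later' reads the register that 'earlier' writes (read-after-write).
        dest = earlier[-1]
        sources = later[1:-1]
        return dest in sources

    print(raw_hazard(i1, i2))   # True: the Add needs R4 before the Mul has written it back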
Data Hazards

   I1 (Mul) writes its result in W1. I2 (Add) needs that result as an operand,
   so its Decode stage is held (D2 extends over three cycles) until the operand
   is available, and I3 and I4 are delayed behind it.

   Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Operand Forwarding
   Instead of from the register file, the second
    instruction can get data directly from the
    output of ALU after the previous instruction is
    completed.
   A special arrangement needs to be made to
    “forward” the output of ALU to the input of
    ALU.
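A minimal sketch of the forwarding decision at an ALU input: if the instruction currently in the Write stage produces the register that the Execute stage needs, take the value from RSLT instead of the stale register-file copy (the stage records and register values are illustrative assumptions):

    def alu_operand(reg_name, register_file, write_stage):
        # Forward RSLT if the Write stage is about to update reg_name; else read the file.
        if write_stage is not None and write_stage["dest"] == reg_name:
            return write_stage["result"]     # forwarding path from RSLT
        return register_file[reg_name]       # normal path from the register file

    regs = {"R2": 3, "R3": 5, "R4": 0, "R5": 7}
    write_stage = {"dest": "R4", "result": 15}    # Mul R2,R3,R4 has just left Execute

    # Add R5,R4,R6 in the Execute stage gets R4 forwarded rather than the stale 0.
    print(alu_operand("R5", regs, write_stage), alu_operand("R4", regs, write_stage))   # 7 15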
   (a) Datapath: the register file supplies Source 1 (SRC1) and Source 2 (SRC2)
       to the ALU, whose result is held in RSLT before being written back to the
       destination register.
   (b) Position of the source and result registers in the processor pipeline:
       SRC1 and SRC2 belong to the Execute stage and RSLT to the Write stage; a
       forwarding path carries RSLT back to the ALU input.

   Figure 8.7. Operand forwarding in a pipelined processor.
Handling Data Hazards in
Software
   Let the compiler detect and handle the
    hazard:
        I1: Mul R2, R3, R4
            NOP
            NOP
        I2: Add R5, R4, R6
   The compiler can reorder the instructions to
    perform some useful work during the NOP
    slots.
Side Effects
   The previous example is explicit and easily detected.
   Sometimes an instruction changes the contents of a register
    other than the one named as the destination.
   When a location other than one explicitly named in an instruction
    as a destination operand is affected, the instruction is said to
    have a side effect. (Example?)
    Example: condition code flags:
         Add R1, R3
         AddWithCarry R2, R4
   Instructions designed for execution on pipelined hardware should
    have few side effects.
Instruction Hazards
Overview
   Whenever the stream of instructions supplied
    by the instruction fetch unit is interrupted, the
    pipeline stalls.
   Cache miss
   Branch
Unconditional Branches
   I2 is a branch whose target is Ik. I3, fetched right after the branch, must
   be discarded (X) once the branch is recognized, so the execution unit is idle
   for one cycle before the target Ik is fetched and executed.

   Figure 8.8. An idle cycle caused by a branch instruction.
Branch Timing

   - Branch penalty
   - Reducing the penalty

   (a) Branch address computed in the Execute stage: the two instructions
       fetched after the branch (I3 and I4) are discarded, so the branch penalty
       is two cycles before the target Ik is fetched.
   (b) Branch address computed in the Decode stage: only one instruction (I3) is
       discarded, reducing the branch penalty to one cycle.

   Figure 8.9. Branch timing.
Instruction Queue and
Prefetching
   The instruction fetch unit fetches instructions into an instruction queue
   ahead of the rest of the pipeline; a dispatch/decode unit takes instructions
   from the queue and passes them to the Execute and Write stages.

   Figure 8.10. Use of an instruction queue in the hardware organization of
                Figure 8.2b.
Conditional Branches
   A conditional branch instruction introduces
    the added hazard caused by the dependency
    of the branch condition on the result of a
    preceding instruction.
   The decision to branch cannot be made until
    the execution of that instruction has been
    completed.
   Branch instructions represent about 20% of
    the dynamic instruction count of most
    programs.
Delayed Branch
   The instructions in the delay slots are always
    fetched. Therefore, we would like to arrange
    for them to be fully executed whether or not
    the branch is taken.
   The objective is to place useful instructions in
    these slots.
   The effectiveness of the delayed branch
    approach depends on how often it is possible
    to reorder instructions.
Delayed Branch
            LOOP          Shift_left            R1
                          Decrement             R2
                          Branch=0              LOOP
            NEXT          Add                   R1,R3

                   (a) Original program loop



            LOOP          Decrement             R2
                          Branch=0              LOOP
                          Shift_left            R1
            NEXT          Add                   R1,R3


                   (b) Reordered instructions


    Figure 8.12. Reordering of instructions for a delayed branch.
Delayed Branch
   The timing shows the last two passes through the loop: the Shift instruction
   in the delay slot is fetched and executed on both passes, whether the branch
   is taken (next instruction: Decrement) or not taken (next instruction: Add).

   Figure 8.13. Execution timing showing the delay slot being filled during the
                last two passes through the loop in Figure 8.12.
Branch Prediction
   To predict whether or not a particular branch will be taken.
   Simplest form: assume branch will not take place and continue to
    fetch instructions in sequential address order.
   Until the branch is evaluated, instruction execution along the
    predicted path must be done on a speculative basis.
   Speculative execution: instructions are executed before the
    processor is certain that they are in the correct execution
    sequence.
   Need to be careful so that no processor registers or memory
    locations are updated until it is confirmed that these instructions
    should indeed be executed.
Incorrectly Predicted Branch
   I1 (Compare) sets the condition codes. I2 (Branch>0) is predicted as not
   taken, so I3 and I4 are fetched speculatively. When the branch decision is
   resolved in E2 and turns out to be taken, I3 and I4 are discarded (X) and the
   target instruction Ik is fetched instead.

   Figure 8.14. Timing when a branch decision has been incorrectly predicted as
                not taken.
Branch Prediction
   Better performance can be achieved if we arrange
    for some branch instructions to be predicted as
    taken and others as not taken.
   Use hardware to observe whether the target
    address is lower or higher than that of the branch
    instruction.
   Let compiler include a branch prediction bit.
   So far the branch prediction decision is always the
    same every time a given instruction is executed –
    static branch prediction.
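A hedged sketch combining the two static rules just mentioned: predict taken for a backward branch (target below the branch, typically a loop) and not taken for a forward branch, unless a compiler-supplied prediction bit overrides the rule (the exact rule combination is an illustrative assumption):

    def static_prediction(branch_address, target_address, compiler_hint=None):
        # Predict 'taken' or 'not taken' before the branch condition is known.
        if compiler_hint is not None:        # compiler-included branch prediction bit
            return "taken" if compiler_hint else "not taken"
        # Hardware rule: backward branches are usually loop branches, hence taken.
        return "taken" if target_address < branch_address else "not taken"

    print(static_prediction(0x1000, 0x0FE0))   # taken (backward branch)
    print(static_prediction(0x1000, 0x1040))   # not taken (forward branch)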
Influence on
Instruction Sets
Overview
   Some instructions are much better suited to
    pipeline execution than others.
   Addressing modes
   Condition code flags
Addressing Modes
   Addressing modes include simple ones and
    complex ones.
   In choosing the addressing modes to be
    implemented in a pipelined processor, we
    must consider the effect of each addressing
    mode on instruction flow in the pipeline:
   Side effects
   The extent to which complex addressing modes cause
    the pipeline to stall
   Whether a given mode is likely to be used by compilers
Recall

   Load X(R1), R2

   The Load with indexed addressing needs an extra cycle (M2) for its memory
   access, delaying the instructions behind it.

   Figure 8.5. Effect of a Load instruction on pipeline timing.

   Load (R1), R2
Complex Addressing Mode
   Load (X(R1)), R2

   The complex mode keeps the Load in the pipeline for several extra cycles:
   after F and D it computes X + [R1], fetches [X + [R1]], then fetches
   [[X + [R1]]], and finally writes the result; the operand is forwarded to the
   next instruction, which must wait in its Decode stage until the operand is
   available.

   (a) Complex addressing mode
Simple Addressing Mode
   Add #X, R1, R2
   Load (R2), R2
   Load (R2), R2

   The same access pattern expressed with simple modes: the Add computes
   X + [R1], the first Load fetches [X + [R1]], and the second Load fetches
   [[X + [R1]]]; each instruction flows through the pipeline normally and the
   next instruction follows without extra stalls.

   (b) Simple addressing mode
Addressing Modes
   In a pipelined processor, complex addressing
    modes do not necessarily lead to faster execution.
   Advantage: they reduce the number of instructions and
    the program space.
   Disadvantages: they may cause the pipeline to stall, require more
    hardware to decode, and are not convenient for the compiler to
    work with.
   Conclusion: complex addressing modes are not
    suitable for pipelined execution.
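   As a rough check of this conclusion, the sketch below (our own simplified model,
    assuming one instruction fetched per cycle, a single execute/memory stage used in
    order, and operand forwarding so there are no extra data-hazard stalls) counts the
    cycle in which the last result is written for the two equivalent sequences of
    Figures (a) and (b) above: both finish in cycle 7.

      # Python sketch: cycle of the final Write for a sequence of instructions,
      # where exec_cycles[i] is the number of execute/memory cycles instruction i needs.
      def last_write_cycle(exec_cycles):
          e_free = 0            # cycle in which the execute stage last finished
          last_w = 0
          for i, n in enumerate(exec_cycles):
              fetch = i + 1                      # F in cycle i+1
              decode = fetch + 1                 # D in the next cycle
              e_start = max(decode + 1, e_free + 1)
              e_end = e_start + n - 1
              e_free = e_end
              last_w = e_end + 1                 # W follows the last E/M cycle
          return last_w

      complex_mode = [3, 1]        # Load (X(R1)),R2 needs 3 execute cycles; next instr
      simple_mode  = [1, 1, 1, 1]  # Add #X,R1,R2; Load (R2),R2; Load (R2),R2; next instr
      print(last_write_cycle(complex_mode))   # 7
      print(last_write_cycle(simple_mode))    # 7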
Addressing Modes
   Good addressing modes should satisfy the following:
   Access to an operand does not require more than one
    access to the memory
   Only load and store instructions access memory operands
   The addressing modes used do not have side effects
   Register, register indirect, and index modes satisfy these conditions.
Condition Codes
   If an optimizing compiler attempts to reorder
    instructions to avoid stalling the pipeline when
    branches or data dependencies between
    successive instructions occur, it must ensure
    that reordering does not cause a change in
    the outcome of a computation.
   The dependency introduced by the condition-
    code flags reduces the flexibility available for
    the compiler to reorder instructions.
Condition Codes
          Add                  R1,R2
          Compare              R3,R4
          Branch=0             . . .


             (a) A program fragment




          Compare              R3,R4
          Add                  R1,R2
          Branch=0             . . .


            (b) Instructions reordered
        Figure 8.17. Instruction reordering.
Condition Codes
   Two conclusions:
   To provide flexibility in reordering instructions, the
    condition-code flags should be affected by as few
    instructions as possible.
   The compiler should be able to specify in which
    instructions of a program the condition codes are
    affected and in which they are not.
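   A small hedged sketch of the kind of check a compiler could apply before performing
    the reordering of Figure 8.17: an instruction may be moved between a Compare and
    the Branch that reads its flags only if it does not itself write the condition
    codes. The instruction encoding here is invented purely for illustration.

      # Python sketch: is it safe to move an instruction between Compare and Branch?
      def can_move_between_compare_and_branch(instr):
          return not instr["writes_flags"] and not instr["is_branch"]

      add_plain = {"op": "Add R1,R2",   "writes_flags": False, "is_branch": False}
      add_flags = {"op": "AddCC R1,R2", "writes_flags": True,  "is_branch": False}

      print(can_move_between_compare_and_branch(add_plain))   # True:  reorder as in 8.17(b)
      print(can_move_between_compare_and_branch(add_flags))   # False: would clobber Compare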
Datapath and Control
     Considerations
Original Design

   (Figure 7.8. Three-bus organization of the datapath: buses A, B, and C;
    PC with an incrementer; register file; Constant 4 and a MUX feeding ALU
    input A; ALU with result register R on bus C; instruction decoder; IR;
    MDR; MAR; memory-bus data and address lines.)
Pipelined Design

   Modifications to the three-bus datapath for pipelined execution:
   - Separate instruction and data caches
   - PC is connected to IMAR
   - DMAR
   - Separate MDR (MDR/Read and MDR/Write)
   - Buffers for ALU (interstage buffers at its input and output)
   - Instruction queue
   - Instruction decoder output
   - Control signal pipeline

   Operations that can proceed in parallel in this organization:
   - Reading an instruction from the instruction cache
   - Incrementing the PC
   - Decoding an instruction
   - Reading from or writing into the data cache
   - Reading the contents of up to two regs
   - Writing into one register in the reg file
   - Performing an ALU operation

        Figure 8.18. Datapath modified for pipelined execution, with
            interstage buffers at the input and output of the ALU.
Superscalar Operation
Overview
   The maximum throughput of a pipelined processor
    is one instruction per clock cycle.
   If we equip the processor with multiple processing
    units to handle several instructions in parallel in
    each processing stage, several instructions start
    execution in the same clock cycle – multiple-issue.
   Processors capable of achieving an instruction execution
    throughput of more than one instruction per cycle are called
    superscalar processors.
   Multiple-issue requires a wider path to the cache
    and multiple execution units.
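   A minimal sketch (our own toy model, not the textbook's hardware) of dual issue
    with the two execution units of Figure 8.19 below: in each cycle the dispatch unit
    issues, in program order, at most one instruction to each unit, so up to two
    instructions can start per cycle. Data hazards are ignored and both units are
    assumed to be internally pipelined.

      # Python sketch: in-order dual issue to one integer and one floating-point unit.
      def issue_cycles(program):
          """program: list of 'int'/'fp' in program order; returns issue cycle per instr."""
          cycles = []
          cycle = 1
          i = 0
          while i < len(program):
              used = set()            # units already given an instruction this cycle
              while i < len(program) and program[i] not in used and len(used) < 2:
                  used.add(program[i])
                  cycles.append(cycle)
                  i += 1
              cycle += 1
          return cycles

      print(issue_cycles(["fp", "int", "fp", "int"]))   # [1, 1, 2, 2]: two issues per cycle
      print(issue_cycles(["int", "int", "int"]))        # [1, 2, 3]: only one integer unit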
Superscalar

   (Figure 8.19. A processor with two execution units: an instruction fetch
    unit F feeds an instruction queue; a dispatch unit issues instructions to
    a floating-point unit and an integer unit; a write stage W writes the
    results.)
Timing

   Clock cycle     1     2     3      4      5      6      7
   I1 (Fadd)       F1    D1    E1A    E1B    E1C    W1
   I2 (Add)        F2    D2    E2     W2
   I3 (Fsub)             F3    D3     E3A    E3B    E3C    W3
   I4 (Sub)              F4    D4     E4     W4

   Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19,
                assuming no hazards are encountered.
Out-of-Order Execution
   Hazards
   Exceptions
   Imprecise exceptions
   Precise exceptions
   Clock cycle     1     2     3      4      5      6      7
   I1 (Fadd)       F1    D1    E1A    E1B    E1C    W1
   I2 (Add)        F2    D2    E2                   W2
   I3 (Fsub)             F3    D3     E3A    E3B    E3C    W3
   I4 (Sub)              F4    D4                   E4     W4

                   (a) Delayed write
Execution Completion
   It is desirable to use out-of-order execution, so that an
    execution unit is freed to execute other instructions as soon as
    possible.
   At the same time, instructions must be completed in program
    order to allow precise exceptions.
   The use of temporary registers
   Commitment unit
   Clock cycle     1     2     3      4      5      6      7
   I1 (Fadd)       F1    D1    E1A    E1B    E1C    W1
   I2 (Add)        F2    D2    E2     TW2           W2
   I3 (Fsub)             F3    D3     E3A    E3B    E3C    W3
   I4 (Sub)              F4    D4     E4     TW4           W4

                   (b) Using temporary registers
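   A minimal sketch (our own simplification, not the textbook's circuit) of the idea
    behind the temporary registers and the commitment unit shown in (b): results are
    written to temporaries as soon as they are produced, possibly out of program order,
    and are copied to the architectural registers strictly in program order, so that
    exceptions remain precise.

      # Python sketch: in-order commitment using temporary (rename) registers.
      def commit_in_order(results):
          """results: (program_index, dest_reg, value) tuples in completion order."""
          arch_regs = {}
          temp = {}                          # program_index -> (dest_reg, value)
          next_to_commit = 0
          for idx, dest, value in results:
              temp[idx] = (dest, value)      # write the result to a temporary register
              while next_to_commit in temp:  # commit everything now ready, in program order
                  d, v = temp.pop(next_to_commit)
                  arch_regs[d] = v
                  next_to_commit += 1
          return arch_regs

      # I2 (Add) finishes before I1 (Fadd), but its result is committed only after I1's.
      print(commit_in_order([(1, "R_add", 7), (0, "R_fadd", 2.5), (2, "R_fsub", -1.0)]))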