Modern CPUs and Caches - A Starting Point for Programmers
Yaser Zhian
Fanafzar Game Studio
IGDI, Workshop 07, January 2nd, 2013
• Some notes about the subject
• CPUs and their gimmicks
• Caches and their importance
• How the CPU and OS handle memory logically
• These are very complex subjects
  ◦ Expect very few details and much simplification
• These are very complicated subjects
  ◦ Expect much generalization and omission
• No time
  ◦ Even a full course would be hilariously insufficient
• Not an expert
  ◦ Sorry! Can’t help much.
• Just a pile of loosely related stuff
• Pressure for performance
• Backwards compatibility
• Cost/power/etc.
• The ridiculous “numbers game”
• Law of diminishing returns
• Latency vs. Throughput
• You can always solve your bandwidth (throughput) problems with money, but that is rarely so for lag (latency).
• Relative rates of improvement (from David Patterson’s keynote, HPEC 2004), latency vs. bandwidth:
  ◦ CPU, 80286 till Pentium 4: 21x vs. 2250x
  ◦ Ethernet, 10Mb till 10Gb: 16x vs. 1000x
  ◦ Disk, 3600 till 15000 rpm: 8x vs. 143x
  ◦ DRAM, plain till DDR: 4x vs. 120x
• At the simplest level, the von Neumann model stipulates:
  ◦ Program is data, and is stored in memory along with data (departing from Turing’s model)
  ◦ Program is executed sequentially
• Not the way computers function anymore…
  ◦ The abstraction is still used for thinking about programs
  ◦ But it’s leaky as heck!
• “Not Your Father’s von Neumann Machine!”
• Speed of light: we can’t send and receive signals to and from all parts of the die in one cycle anymore
• Power: more transistors lead to more power, which leads to much more heat
• Memory: the CPU isn’t even close to being the bottleneck anymore. “All your base are belong to” memory
• Complexity: adding more transistors for more sophisticated operation won’t give much of a speedup (e.g., doubling the transistors might give 2%)
• The family was introduced with the 8086 in 1978
• Today, new members are still fully binary backward-compatible with that puny machine (5 MHz clock, 20-bit addressing, 16-bit registers)
• It had very few registers
• It had segmented memory addressing (joy!)
• It had many complex instructions and several addressing modes
• 1982 (80286): Protected mode, MMU
• 1985 (80386): 32-bit ISA, Paging
• 1989 (80486): Pipelining, Cache, Integrated FPU
• 1993 (Pentium): Superscalar, 64-bit bus, MMX
• 1995 (Pentium Pro): µ-ops, OoO Execution, Register Renaming, Speculative Execution
• 1998–1999 (K6-2, PIII): 3DNow!, SSE
• 2003 (Opteron): 64-bit ISA
• 2006 (Core 2): Multi-core
• Registers were expanded from (all 16-bit, none really general-purpose):
  ◦ AX, BX, CX, DX
  ◦ SI, DI, BP, SP
  ◦ CS, DS, ES, SS, Flags, IP
• To:
  ◦ 16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI, R8-R15) plus RIP and Flags and others
  ◦ 16 x 128-bit XMM regs. (XMM0-...)
    ▪ Or 16 x 256-bit YMM regs. (YMM0-...)
  ◦ More than a thousand logically different instructions (the usual, plus string processing, cryptography, CRC, complex numbers, etc.)
• The Fetch-Decode-Execute-Retire cycle
• Strategies for more performance:
  ◦ More complex instructions, doing more in hardware (CISCing things up)
  ◦ Faster CPU clock rates (the free lunch)
  ◦ Instruction-Level Parallelism (SIMD + gimmicks)
  ◦ Adding cores (the free lunch is over!)
• And then, there are gimmicks…
• Pipelining
• µ-ops
• Superscalar Pipelines
• Out-of-order Execution
• Speculative Execution
• Register Renaming
• Branch Prediction
• Prefetching
• Store Buffer
• Trace Cache
• …
Classic sequential execution:
• The lengths of instruction executions vary a lot (5-10 times is usual; several orders of magnitude also happen)

  [Timeline: Instruction 1, Instruction 2, Instruction 3 and Instruction 4 execute strictly one after another]
It’s really more like this for the CPU:
• Instructions have many sub-parts, and they engage different parts of the CPU

  [Timeline: each instruction passes through Fetch, Decode, Execute and Retire stages (F1 D1 E1 R1, then F2 D2 E2 R2, and so on), still one instruction after another]
So why not do this:
• This is called “pipelining”
• It increases throughput (significantly)
• It doesn’t decrease the latency of a single instruction

  [Timeline: the stages overlap; while instruction 1 decodes, instruction 2 fetches (F1 D1 E1 R1 / F2 D2 E2 R2 / …, each starting one stage later)]
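Pipelining raises throughput only when consecutive operations are independent; a dependent chain still pays the full latency at every step. Here is a minimal C++ sketch (mine, not from the slides) contrasting one long dependent add chain with four independent chains. Compiled with modest optimization (say -O1, so the compiler neither vectorizes nor folds the loops into closed forms), the second loop typically runs severalfold faster; the constants and the 4-way split are illustrative assumptions.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    int main() {
        const int64_t N = 400'000'000;

        // One dependent chain: every add needs the result of the previous one.
        auto t0 = std::chrono::steady_clock::now();
        int64_t a = 0;
        for (int64_t i = 0; i < N; ++i)
            a += i;
        auto t1 = std::chrono::steady_clock::now();

        // Four independent chains: the pipeline can overlap them.
        int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int64_t i = 0; i < N; i += 4) {
            s0 += i; s1 += i + 1; s2 += i + 2; s3 += i + 3;
        }
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::printf("dependent:   %.1f ms (%lld)\n", ms(t1 - t0).count(), (long long)a);
        std::printf("independent: %.1f ms (%lld)\n", ms(t2 - t1).count(),
                    (long long)(s0 + s1 + s2 + s3));
    }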
But it has its own share of problems
• Hazards, stalls, flushing, etc.
• Execution of i2 depends on the result of i1
• After i2, we jump, and i3, i4, … are flushed out

    i1: add EAX,120
    i2: jmp [EAX]
    i3: mov [4*EBX+42],EDX
    i4: add ECX,[EAX]

  [Timeline: i2’s Execute stage stalls until i1’s result is ready; i3 and i4 are fetched and decoded but flushed once the jump resolves]
• Instructions are broken up into simple, orthogonal µ-ops
  ◦ mov EAX,EDX might generate only one µ-op
  ◦ mov EAX,[EDX] might generate two:
    1. µld tmp0,[EDX]
    2. µmov EAX,tmp0
  ◦ add [EAX],EDX probably generates three:
    1. µld tmp0,[EAX]
    2. µadd tmp0,EDX
    3. µst [EAX],tmp0
• The CPU, then, gets two layers:
  ◦ One that breaks operations up into µ-ops
  ◦ One that executes µ-ops
• The part that executes µ-ops can be simpler (more RISCy) and therefore faster.
• More complex instructions can be supported without (much) complicating the CPU.
• Pipelining (and the other gimmicks) can happen at the µ-op level.
• CPUs that issue (or retire) more than one instruction per cycle are called superscalar
• Can be thought of as a pipeline with more than one lane
• Simplest form: an integer pipe plus a floating-point pipe
• These days, CPUs do 4 or more
• Obviously requires more of each type of execution unit in the CPU
• To keep your pipeline from stalling as much as possible, issue the next instructions even if you can’t start the current one.
• But of course, only if there are no hazards (dependencies) and there are execution units available.

    add RAX,RAX
    add RAX,RBX   ; depends on the previous instruction
    add RCX,RDX   ; independent: can be, and is, started before the previous instruction
• This obviously also applies at the µ-op level:

    mov RAX,[mem0]
    mul RAX,42
    add RAX,[mem1]   ; the fetch of [mem1] starts long before the multiply’s result is available

    push RAX
    call Func        ; push RAX is really "sub RSP,8" then "mov [RSP],RAX"; since call needs
                     ; RSP too, it waits only for the subtraction, not the store, before starting
• Consider this:

    mov RAX,[mem0]
    mul RAX,42
    mov [mem1],RAX
    mov RAX,[mem2]
    add RAX,7
    mov [mem3],RAX

• Logically, the two halves are totally separate.
• However, the reuse of RAX will stall the pipeline.
• Modern CPUs have a lot of temporary, unnamed registers at their disposal.
• They will detect the logical independence, and will use one of those in the second block instead of RAX.
• And they will keep track of which register is which, where.
• In effect, they rename another register to RAX.
• There might not even be a real RAX!
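The same idea is visible at the source level. In this C++ sketch (mine; mem0 through mem3 are hypothetical globals standing in for the slide’s [mem] operands), reusing the name x creates only a false dependence between the two blocks, which hardware renaming (or the compiler’s own register allocation) removes:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical memory locations, volatile so the loads and stores actually happen.
    volatile int64_t mem0 = 3, mem1 = 0, mem2 = 5, mem3 = 0;

    void two_blocks() {
        int64_t x = mem0;   // block 1
        x *= 42;
        mem1 = x;

        x = mem2;           // block 2 reuses the *name* x, not its value, so
        x += 7;             // renaming lets it overlap with block 1
        mem3 = x;
    }

    int main() {
        two_blocks();
        std::printf("%lld %lld\n", (long long)mem1, (long long)mem3);   // 126 12
    }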
• This is, for once, simpler than it might seem!
• Every time a register is assigned to, a new temporary register is used in its stead.
• Consider this:

    mov RAX,[cached]
    mov RBX,[uncached]
    add RBX,RAX
    mul RAX,42        ; the rename happens here
    mov [mem0],RAX
    mov [mem1],RBX

  Renaming on the mul means that it won’t clobber RAX (which the add, still waiting on the load of [uncached], needs), so we can do the multiply and reach the first store much sooner.
• The CPU always depends on knowing where the next instruction is, so it can go ahead and work on it.
• That’s why branches in code are anathema to modern, deep pipelines and all the gimmicks they pull.
• If only the CPU could somehow guess where the target of each branch is going to be…
• That’s where branch prediction comes in.
• So the CPU guesses the target of a jump (if it doesn’t know for sure), and continues to speculatively execute instructions from there.
• For a conditional jump, the CPU must also predict whether the branch is taken or not.
• If the CPU is right, the pipeline flows smoothly. If not, the pipeline must be flushed, and much time and many resources are wasted on a misprediction.
• In this code, both the target and whether the jump happens at all must be predicted:

    cmp RAX,0
    jne [RBX]

• The above can effectively jump anywhere!
• But usually branches are closer to this, which has only two possible targets:

    cmp RAX,0
    jne somewhere_specific
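In C++ terms (a sketch of mine, not from the slides), the jne [RBX] case corresponds to any call through a pointer, such as a virtual call, where the predictor must guess a whole target address; a plain if is the jne somewhere_specific case with just two outcomes:

    #include <cstdio>

    struct Base {
        virtual void f() { std::puts("Base"); }   // dispatched through a vtable:
        virtual ~Base() = default;                // an *indirect* branch
    };
    struct Derived : Base {
        void f() override { std::puts("Derived"); }
    };

    void run(Base& b, bool flag) {
        b.f();                         // target address must be predicted (BTB)
        if (flag) std::puts("taken");  // direct conditional jump: taken or not
    }

    int main() {
        Derived d;
        run(d, true);
    }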
• In a simple form, when a branch is executed, its target is stored in a table called the BTB (Branch Target Buffer). When that branch is encountered again, the target address is predicted to be the value read from the BTB.
• As you might guess, this doesn’t work in many situations (e.g., an alternating branch).
• Also, the size of the BTB is limited, so the CPU will forget the last targets of some jumps.
• A simple expansion on the previous idea is to use a saturating counter along with each entry of the BTB.
• For example, with a 2-bit counter:
  ◦ The branch is predicted not taken if the counter is 0 or 1.
  ◦ The branch is predicted taken if the counter is 2 or 3.
  ◦ Each time the branch is taken, the counter is incremented, and vice versa.

  [State diagram: Strongly Not Taken ⇄ Weakly Not Taken ⇄ Weakly Taken ⇄ Strongly Taken; each taken branch (T) moves one state to the right, each not-taken branch (NT) one to the left, saturating at the ends]
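The whole scheme fits in a few lines. Here is a minimal simulation of one BTB entry’s 2-bit counter (my sketch; the outcome sequence is made up for illustration):

    #include <cstdio>

    int main() {
        int counter = 2;   // start in "weakly taken"
        const bool outcomes[] = {true, true, false, true, true, true, false, false};
        int hits = 0, total = 0;

        for (bool taken : outcomes) {
            bool predicted = (counter >= 2);   // 0,1 -> not taken; 2,3 -> taken
            hits += (predicted == taken);
            ++total;
            if (taken)  { if (counter < 3) ++counter; }   // saturate at 3
            else        { if (counter > 0) --counter; }   // saturate at 0
            std::printf("actual=%d predicted=%d counter=%d\n", taken, predicted, counter);
        }
        std::printf("accuracy: %d/%d\n", hits, total);
    }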
• But this behaves very badly in common situations.
• For an alternating branch:
  ◦ If the counter starts at 00 or 11, it will mispredict 50% of the time.
  ◦ If the counter starts at 01 and the branch is taken the first time, it will mispredict 100% of the time!
• As an improvement, we can store the history of the last N occurrences of the branch in the BTB, and use 2^N counters, one for each of the possible history patterns.
• For N=4 and 2-bit counters, we’ll have:
  ◦ This is an extremely cool method of doing branch prediction!

  [Table: the 4-bit branch history (e.g. 0010) indexes one of 16 two-bit counters, each of which yields a prediction (0 or 1)]
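Extending the simulation above to this two-level scheme takes one extra variable (again my sketch, with N=4 and an alternating branch as the input): the history of recent outcomes picks which counter to consult, so the alternating pattern becomes perfectly predictable after warm-up.

    #include <cstdio>

    int main() {
        unsigned history = 0;    // last 4 outcomes, newest in bit 0
        int counters[16] = {};   // 2^4 two-bit counters, all starting at "strongly not taken"
        int hits = 0, total = 0;

        for (int i = 0; i < 64; ++i) {
            bool taken = (i % 2 == 0);         // the dreaded alternating branch
            int& c = counters[history];
            bool predicted = (c >= 2);
            hits += (predicted == taken);
            ++total;
            if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
            history = ((history << 1) | (taken ? 1u : 0u)) & 0xFu;   // shift in the outcome
        }
        // The alternating pattern settles into histories 0101 and 1010, whose
        // counters quickly train; only the first few iterations mispredict.
        std::printf("accuracy: %d/%d\n", hits, total);
    }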
• Some predictions are simpler:
  ◦ For each ret instruction, the target is somewhere on the stack (pushed earlier). Modern CPUs keep track of return addresses in an internal return stack buffer: each time a call is executed, an entry is added, and it is later used to predict the return address.
  ◦ On a cold encounter (a.k.a. static prediction), a branch is sometimes predicted to
    ▪ fall through if it goes forward;
    ▪ be taken if it goes backward.
• The best general advice is to arrange your code so that the most common path for branches is “not taken”. This improves the effectiveness of code prefetching and the trace cache.
• Branch prediction, register renaming and speculative execution work extremely well together.
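To see what mispredictions cost, here is the classic experiment in C++ (my sketch): sum the elements above a threshold in a large array, first in random order (the branch is unpredictable), then sorted (the branch becomes a long run of not-taken followed by a long run of taken). At low optimization levels the sorted pass is typically several times faster; at -O2/-O3 the compiler may emit a branchless conditional move and hide the effect.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    static long long sum_over_threshold(const std::vector<int>& v) {
        long long s = 0;
        for (int x : v)
            if (x > 127) s += x;   // hard to predict when the data is random
        return s;
    }

    int main() {
        std::vector<int> data(1 << 22);
        std::mt19937 rng(42);
        for (int& x : data) x = rng() % 256;

        auto time = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long long s = sum_over_threshold(data);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %.1f ms (sum=%lld)\n", label,
                        std::chrono::duration<double, std::milli>(t1 - t0).count(), s);
        };

        time("unsorted");
        std::sort(data.begin(), data.end());
        time("sorted  ");
    }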
The running example for the next few slides:

    mov   RAX,[RBX+16]
    add   RBX,16
    cmp   RAX,0
    je    IsNull
    mov   [RBX-16],RCX
    mov   RCX,[RDX+0]
    mov   RAX,[RAX+8]
Clock 0 – Instruction 0: mov RAX,[RBX+16]
  Load RAX from memory. Assume a cache miss – 300 cycles to load. The instruction starts, and dispatch continues...
Clock 0 – Instruction 1: add RBX,16
  This instruction writes RBX, which conflicts with the read in instruction 0. Rename this instance of RBX and continue…
Clock 0 – Instruction 2: cmp RAX,0
  The value of RAX is not available yet; the value of the Flags register cannot be calculated. Queue up behind instruction 0…
Clock 0 – Instruction 3: je IsNull
  The Flags register is still not available. Predict that this branch is not taken. Assuming 4-wide dispatch, the instruction issue limit is now reached.
Clock 1 – Instruction 4: mov [RBX-16],RCX
  The store is speculative, so its result is kept in the Store Buffer. Also, RBX might not be available yet (from instruction 1). The Load/Store Unit is tied up from now on; no more memory ops can issue this cycle.
Clock 2 – Instruction 5: mov RCX,[RDX+0]
  Had to wait for the L/S Unit. Assume this is another (and unrelated) cache miss; we now have 2 overlapping cache misses. The L/S Unit is busy again.
Clock 3 – Instruction 6: mov RAX,[RAX+8]
  RAX is not ready yet (300-cycle latency, remember?!). This load cannot even start until instruction 0 is done.
Clock 301 – Instruction 2: cmp RAX,0
  At clock 300 (or 301), RAX is finally ready. Do the comparison and update the Flags register.
Clock 301 – Instruction 6: mov RAX,[RAX+8]
  Issue this load too. Assume a cache hit (finally!); the result will be available at clock 304.
Clock 302 – Instruction 3: je IsNull
  Now the Flags register is ready; check the prediction. Assume the prediction was correct.
Clock 302 – Instruction 4: mov [RBX-16],RCX
  This speculative store can now actually be committed to memory (or rather, to cache).
Clock 302 – Instruction 5: mov RCX,[RDX+0]
  At clock 302, the result of this load arrives.
Clock 305 – Instruction 6: mov RAX,[RAX+8]
  The result arrived at clock 304; the instruction is retired at 305.
To summarize:
• In 4 clocks, we started 7 ops and 2 cache misses.
• We retired 7 ops in 306 cycles.
• Cache misses totally dominate performance.
• The only real benefit came from being able to have 2 overlapping cache misses!
What is all this machinery for, then? To get to the next cache miss as early as possible.
• Main memory is slow; S.L.O.W.
• Very slow
• Painfully slow
• And it especially has very bad (high) latency
• But all is not lost! Many (most) references to memory have high temporal and spatial (address) locality.
• So we use a small amount of very fast memory to keep recently-accessed or likely-to-be-accessed chunks of main memory close to the CPU.
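Locality is easy to feel from C++. This sketch (mine) sums the same row-major matrix twice: row by row, consecutive accesses share cache lines; column by column, every access lands on a fresh line, so that pass typically runs several times slower despite doing identical work.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 4096;
        std::vector<int> m(N * N, 1);   // row-major N x N matrix

        auto t0 = std::chrono::steady_clock::now();
        long long rows = 0;
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                rows += m[r * N + c];   // consecutive addresses: cache-friendly

        auto t1 = std::chrono::steady_clock::now();
        long long cols = 0;
        for (int c = 0; c < N; ++c)
            for (int r = 0; r < N; ++r)
                cols += m[r * N + c];   // stride of N ints: cache-hostile

        auto t2 = std::chrono::steady_clock::now();
        using ms = std::chrono::duration<double, std::milli>;
        std::printf("row-major:    %.1f ms (%lld)\n", ms(t1 - t0).count(), rows);
        std::printf("column-major: %.1f ms (%lld)\n", ms(t2 - t1).count(), cols);
    }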
• Caches typically come in several levels (3 these days).
• Each lower level is several times smaller, but several times faster, than the level above it.
• The CPU can see only the L1 cache, each level sees only the level above it, and only the highest level communicates with main memory.
• Data is transferred between memory and cache in fixed-size units called cache lines. The most common size today is 64 bytes.
• When any memory byte is needed, its place in the cache is calculated;
• The CPU asks the cache;
• If it’s there, the cache returns the data;
• If not, the data is pulled in from memory;
• If the calculated cache line is occupied by data with a different tag, that data is evicted;
• If the evicted line is dirty (modified), it is written back to memory first.

  [Diagram: main memory divided into blocks the size of a cache line; the much smaller cache holds a subset of them, and each cache block also holds metadata such as the tag (address) and some flags]
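The “calculate its place” step is just bit slicing. A sketch for a hypothetical direct-mapped 32 KB cache with 64-byte lines (512 lines; the sizes are assumptions for illustration): the low bits of the address select the byte within the line, the middle bits select the line, and the rest becomes the tag. Note how the first two addresses below, exactly 32 KB apart, map to the same line with different tags; that is the seed of the thrashing problem on the next slide.

    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kLineBytes = 64;    // 2^6 bytes per line
    constexpr uint64_t kNumLines  = 512;   // 32 KB / 64 B = 2^9 lines

    struct CachePlace { uint64_t tag, index, offset; };

    CachePlace locate(uint64_t address) {
        return {
            address / (kLineBytes * kNumLines),   // tag: everything above the index bits
            (address / kLineBytes) % kNumLines,   // index: which line slot
            address % kLineBytes                  // offset: byte within the line
        };
    }

    int main() {
        const uint64_t addrs[] = {0x1234, 0x1234 + 32 * 1024, 0xBEEF00};
        for (uint64_t addr : addrs) {
            CachePlace p = locate(addr);
            std::printf("addr 0x%06llx -> tag 0x%llx, index %3llu, offset %2llu\n",
                        (unsigned long long)addr, (unsigned long long)p.tag,
                        (unsigned long long)p.index, (unsigned long long)p.offset);
        }
    }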
• In this basic model, if the CPU periodically accesses memory addresses that differ by a multiple of the cache size, they will constantly evict each other, and most cache accesses will be misses. This is called cache thrashing.
• An application can innocently and very easily trigger this.
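Here is how innocent the triggering code can look (my sketch; kCacheBytes is an assumed size). In the direct-mapped model, each iteration’s two accesses map to the same line index and evict each other. Real caches are set-associative, so in practice you need ways + 1 conflicting addresses to thrash a set, but the access pattern has exactly this shape:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    constexpr size_t kCacheBytes = 32 * 1024;   // hypothetical cache size

    alignas(64) static uint8_t buffer[2 * kCacheBytes];

    int main() {
        uint64_t sum = 0;
        for (size_t i = 0; i < 1'000'000; ++i) {
            size_t offset = (i * 64) % kCacheBytes;   // walk the buffer line by line
            sum += buffer[offset];                    // maps to some line index I...
            sum += buffer[offset + kCacheBytes];      // ...and so does this one: conflict
        }
        std::printf("%llu\n", (unsigned long long)sum);
    }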
• To alleviate this problem, each cache block is turned into an associative memory that can house more than one cache line.
• Each cache block (now called a set) holds several cache lines (2, 4, 8 or more ways), and still uses the tag to find the line requested by the CPU within the set.
• When a new line comes in from memory, an LRU (or similar) policy is used to evict only the least-likely-to-be-needed line.
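A toy model of that lookup (my sketch; the sizes are arbitrary): a 4-way set-associative cache where each set keeps its tags in most-recently-used order and evicts from the back on a miss. Real hardware approximates LRU, but the policy is this idea. Streaming over 512 distinct lines (4 per set) exactly fills the cache, so the second pass hits every time.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    constexpr int kWays = 4, kSets = 128, kLine = 64;

    struct Set { std::vector<uint64_t> tags; };   // front = most recently used

    bool access(std::array<Set, kSets>& cache, uint64_t addr) {
        uint64_t line = addr / kLine;
        Set& set = cache[line % kSets];
        uint64_t tag = line / kSets;
        auto it = std::find(set.tags.begin(), set.tags.end(), tag);
        if (it != set.tags.end()) {               // hit: refresh the LRU position
            set.tags.erase(it);
            set.tags.insert(set.tags.begin(), tag);
            return true;
        }
        if (set.tags.size() == kWays) set.tags.pop_back();   // miss: evict the LRU line
        set.tags.insert(set.tags.begin(), tag);
        return false;
    }

    int main() {
        std::array<Set, kSets> cache;
        int hits = 0, total = 0;
        for (int pass = 0; pass < 2; ++pass)
            for (uint64_t a = 0; a < 4 * kSets * kLine; a += kLine) {
                hits += access(cache, a);
                ++total;
            }
        std::printf("hits: %d / %d\n", hits, total);   // second pass hits everything
    }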
References:
• Patterson & Hennessy - Computer Organization and Design
• Intel 64 and IA-32 Architectures Software Developer’s Manual - vols. 1, 2 and 3
• Click & Goetz - A Crash Course in Modern Hardware
• Agner Fog - The Microarchitecture of Intel, AMD and VIA CPUs
• Drepper - What Every Programmer Should Know About Memory