A short and very cursory look at some of the features that make modern (x86) CPUs "modern".
I wished to include more examples, time-comparisons and more detailed information, but the time allotted to the presentation barely allowed even this.
This was the first time I was presenting the subject, so expect much roughness around the edges.
Also, if you are even remotely interested in modern CPUs and caches and whatnot, don't look at this; Google for Cliff Click's excellent talk "A Crash Course in Modern Hardware".
2. Some notes about the subject
CPUs and their gimmicks
Caches and their importance
How CPU and OS handle memory logically
3. These are very complex subjects
Expect very few details and much simplification
Expect much generalization and omission
No time
Even a full course would be hilariously insufficient
Not an expert
Sorry! Can’t help much.
Just a pile of loosely related stuff
4. Pressure for performance
Backwards compatibility
Cost/power/etc.
The ridiculous “numbers game”
Law of diminishing returns
Latency vs. Throughput
5. You can always solve your bandwidth (throughput) problems with money, but it is rarely so for lag (latency).
Relative rates of improvement, latency vs. bandwidth (from David Patterson’s keynote, HPEC 2004):
CPU, 80286 to Pentium 4: 21x vs. 2250x
Ethernet, 10Mb to 10Gb: 16x vs. 1000x
Disk, 3600 to 15000 rpm: 8x vs. 143x
DRAM, plain to DDR: 4x vs. 120x
6. At the simplest level, the von Neumann model stipulates:
Program is data, and is stored in memory along with data (departing from Turing’s model)
Program is executed sequentially
Not the way computers function anymore…
The abstraction is still used for thinking about programs
But it’s leaky as heck!
“Not Your Father’s von Neumann Machine!”
7. Speed of Light: can’t send and receive signals to
and from all parts of the die in a cycle anymore
Power: more transistors leads to more power,
which leads to much more heat
Memory: the CPU isn’t even close to the
bottleneck anymore. “All your base are belong
to” memory
Complexity: adding more transistors for more
sophisticated operation won’t give much of a
speedup (e.g. doubling transistors might give
2%.)
8. Family introduced with 8086 in 1978
Today, new members are still fully binary
backward-compatible with that puny machine
(5MHz clock, 20-bit addressing, 16-bit regs.)
It had very few registers
It had segmented memory addressing (joy!)
It had many complex instructions and several
addressing modes
10. Registers got expanded from (all 16-bit, none really general-purpose):
AX, BX, CX, DX
SI, DI, BP, SP
CS, DS, ES, SS, Flags, IP
to:
16 x 64-bit GPRs (RAX, RBX, RCX, RDX, RBP, RSP, RSI, RDI, R8-R15), plus RIP, Flags and others
16 x 128-bit XMM regs. (XMM0-...)
▪ or 16 x 256-bit YMM regs. (YMM0-...)
More than a thousand logically different instructions (the usual, plus string processing, cryptography, CRC, complex numbers, etc.)
11. The Fetch-Decode-Execute-Retire Cycle
Strategies for more performance:
More complex instructions, doing more in
hardware (CISCing things up)
Faster CPU clock rates (the free lunch)
Instruction-Level Parallelism (SIMD + gimmicks)
Adding cores (free lunch is over!)
And then, there are gimmicks…
13. Classic sequential execution:
Instruction execution times vary a lot (5-10x is usual; several orders of magnitude also happens.)
[diagram: Instruction 1, Instruction 2, Instruction 3, Instruction 4, executed strictly one after another]
14. It’s really more like this for the CPU:
Instructions may have many sub-parts, and they
engage different parts of the CPU
F1 D1 E1 R1
F2 D2 E2 R2
F3 D3 E3 R3
F4 D4 E4 R4
15. So why not do this:
F1 D1 E1 R1
   F2 D2 E2 R2
      F3 D3 E3 R3
         F4 D4 E4 R4
This is called “pipelining”
It increases throughput (significantly): with an ideal k-stage pipeline, n instructions take n + k - 1 cycles instead of n*k.
Doesn’t decrease latency for single instructions
16. But it has its own share of problems
Hazards, stalls, flushing, etc.
Execution of i2 depends on the result of i1
After i2, we jump, and i3, i4, … are flushed out
i1: add EAX,120
i2: jmp [EAX]
i3: mov [4*EBX+42],EDX
i4: add ECX,[EAX]
17. Instructions are broken up into simple,
orthogonal µ-ops
mov EAX,EDX might generate only one µ-op
mov EAX,[EDX] might generate two:
1. µld tmp0,[EDX]
2. µmov EAX,tmp0
add [EAX],EDX probably generates three:
1. µld tmp0,[EAX]
2. µadd tmp0,EDX
3. µst [EAX],tmp0
18. The CPU, then, gets two layers:
The one that breaks up operations into µ-ops
The one that executes µ-ops
The part that executes µ-ops can be simpler (more RISCy) and therefore faster.
More complex instructions can be supported without (much) complicating the CPU
Pipelining (and the other gimmicks) can happen at the µ-op level
19. CPUs that issue (or retire) more than one
instruction per cycle are called Superscalar
Can be thought of as a pipeline with more
than one line
Simplest form: integer pipe plus floating-point
pipe
These days, CPUs do 4 or more
Obviously requires more of each type of
operational unit in the CPU
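As a minimal C++ sketch of why this matters to software (names and sizes invented; timings vary by machine): summing with one accumulator forms a single dependency chain, while four accumulators give a superscalar, out-of-order CPU independent adds to issue in parallel. Compile with -O2; without -ffast-math the compiler may not reassociate the FP adds, so the single-chain version stays serial.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    double sum1(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;          // one chain: each add waits for the previous
        return s;
    }

    double sum4(const std::vector<double>& v) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0, n = v.size() / 4 * 4;
        for (; i < n; i += 4) {             // four independent dependency chains
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < v.size(); ++i) s0 += v[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main() {
        std::vector<double> v(1 << 22, 1.0);    // 4M doubles
        auto time_it = [&](const char* name, double (*f)(const std::vector<double>&)) {
            auto t0 = std::chrono::steady_clock::now();
            double r = f(v);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %.0f in %lld us\n", name, r,
                (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        };
        time_it("one accumulator  ", sum1);
        time_it("four accumulators", sum4);
    }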
20. To prevent your pipeline from stalling as much as possible, issue the next instructions even if you can’t start the current one.
But of course, only if there are no hazards (dependencies) and there are operational units available.
add RAX,RAX
add RAX,RBX
add RCX,RDX   ; independent: this can be, and is, started before the previous instruction
21. This obviously also applies at the µ-op level:
mov RAX,[mem0]
imul RAX,42
add RAX,[mem1]   ; fetching mem1 starts long before the result of the multiply becomes available
push RAX
call Func        ; pushing RAX is sub RSP,8 and then mov [RSP],RAX; since the call
                 ; instruction needs RSP too, it will only wait for the subtraction,
                 ; and not the store, to finish before it starts
22. Consider this:
mov RAX,[mem0]
imul RAX,42
mov [mem1],RAX
mov RAX,[mem2]
add RAX,7
mov [mem3],RAX
Logically, the two parts are totally separate.
However, the reuse of RAX creates a false dependency that would stall the pipeline.
23. Modern CPUs have a lot of temporary, unnamed registers at their disposal.
They will detect the logical independence, and will use one of those in the second block instead of RAX.
And they will track which reg. is which, where.
In effect, they are renaming another register to RAX.
There might not even be a real RAX!
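As a loose illustration (a toy, not how any real renamer is built): every write to an architectural register simply allocates a fresh physical register. The two blocks above end up in different physical registers, so the false dependency disappears.

    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> alias;  // architectural name -> physical register
        int next_phys = 0;
        auto write = [&](const char* reg) { return alias[reg] = next_phys++; };

        int rax_block1 = write("RAX");     // mov RAX,[mem0]: first block writes RAX
        int rax_block2 = write("RAX");     // mov RAX,[mem2]: second block writes RAX again
        std::printf("block 1 uses p%d, block 2 uses p%d\n", rax_block1, rax_block2);
        // Different physical registers back the same architectural RAX, so the
        // second block no longer has to wait for the first.
    }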
24. This is, for once, simpler than it might seem!
Every time a register is assigned to, a new temporary register is used in its stead.
Consider this:
mov RAX,[cached]
mov RBX,[uncached]
add RBX,RAX
imul RAX,42      ; rename happens here: the multiply won’t clobber the RAX that the
mov [mem0],RAX   ; add (which is still waiting on the load of [uncached]) needs, so we
mov [mem1],RBX   ; can do the multiply and reach the first store much sooner
25. The CPU always depends on knowing where
the next instruction is, so it can go ahead and
work on it.
That’s why branches in code are anathema to
modern, deep pipelines and all the gimmicks
they pull.
If only the CPU could somehow guess where
the target of each branch is going to be…
That’s where branch prediction comes in.
26. So the CPU guesses the target of a jump (if it
doesn’t know for sure,) and continues to
speculatively execute instructions from there.
For a conditional jump, the CPU must also
predict whether the branch is taken or not.
If the CPU is right, the pipeline flows
smoothly. If not, the pipeline must be flushed,
and much time and many resources are wasted
on a misprediction.
27. In this code:
cmp RAX,0
jne [RBX]
both the target and whether the jump is taken must be predicted.
(Strictly speaking, x86 has no conditional indirect jump, so read this as pseudo-code; an indirect jmp [RBX] poses the same target-prediction problem.)
The above can effectively jump anywhere!
But usually branches are closer to this:
cmp RAX,0
jne somewhere_specific
which can only have two possible targets.
28. In a simple form, when a branch is executed,
its target is stored in a table called the BTB (or
Branch Target Buffer.) When that branch is
encountered again, the target address is
predicted to be the value read from the BTB.
As you might guess, this doesn’t work for
many situations (e.g. alternating branch.)
Also, the size of the BTB is limited, so the CPU
will forget about the last target of some jumps.
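A toy sketch of the idea (table size and indexing made up; real BTBs are set-associative and tag-checked):

    #include <cstdint>
    #include <cstdio>

    const int BTB_SIZE = 512;

    struct Entry { std::uint64_t branch_ip = 0, target = 0; };
    Entry btb[BTB_SIZE];

    std::uint64_t predict(std::uint64_t ip) {   // 0 means "no prediction"
        const Entry& e = btb[ip % BTB_SIZE];
        return e.branch_ip == ip ? e.target : 0;
    }

    void update(std::uint64_t ip, std::uint64_t target) {  // after the branch executes
        btb[ip % BTB_SIZE] = Entry{ip, target};
    }

    int main() {
        update(0x401000, 0x402000);
        std::printf("predicted target: %#llx\n",
                    (unsigned long long)predict(0x401000));
        // Another branch whose address maps to the same slot would overwrite
        // this entry: the limited-size, "CPU forgets" case from the slide.
    }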
29. A simple expansion on the previous idea is to use a saturating counter along with each entry of the BTB.
For example, with a 2-bit counter:
The branch is predicted not taken if the counter is 0 or 1.
The branch is predicted taken if the counter is 2 or 3.
Each time it is taken, the counter is incremented (saturating at 3); each time it is not taken, it is decremented (saturating at 0).
[state diagram: 0 Strongly Not Taken <-> 1 Weakly Not Taken <-> 2 Weakly Taken <-> 3 Strongly Taken; T moves right, NT moves left]
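As a minimal sketch (no particular CPU’s implementation), the whole mechanism fits in a few lines. The driver below runs it against the alternating branch discussed on the next slide:

    #include <cstdio>

    struct TwoBitCounter {
        unsigned state = 1;                    // start "weakly not taken"
        bool predict() const { return state >= 2; }
        void update(bool taken) {              // saturate at 0 and 3
            if (taken) { if (state < 3) ++state; }
            else       { if (state > 0) --state; }
        }
    };

    int main() {
        TwoBitCounter c;
        int mispredicts = 0;
        for (int i = 0; i < 100; ++i) {
            bool taken = (i % 2 == 0);         // alternating branch: the worst case
            if (c.predict() != taken) ++mispredicts;
            c.update(taken);
        }
        // 100 trials, so the count doubles as a percentage: prints 100.
        std::printf("alternating branch: %d%% mispredicted\n", mispredicts);
    }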
30. But this behaves very badly in common situations.
For an alternating branch:
If the counter starts at 00 or 11, it will mispredict 50% of the time.
If the counter starts at 01, and the first time the branch is taken, it will mispredict 100%!
As an improvement, we can store the history of the last N occurrences of the branch in the BTB, and use 2^N counters, one for each possible history pattern.
31. For N=4 and 2-bit counters, we’ll have:
[table: the 4-bit branch history (e.g. 0010) indexes one of 16 two-bit counters, each giving a prediction of 0 or 1]
This is an extremely cool method of doing branch prediction!
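A minimal sketch of that two-level scheme (illustrative only; real CPUs combine this with the BTB and much more): the last N outcomes index 2^N saturating counters.

    #include <cstdio>

    struct HistoryPredictor {
        static const int N = 4;
        unsigned history = 0;                  // last N outcomes, one bit each
        unsigned counters[1 << N] = {};        // 2^N two-bit counters, start at 0
        bool predict() const { return counters[history] >= 2; }
        void update(bool taken) {
            unsigned& c = counters[history];
            if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
            history = ((history << 1) | (taken ? 1u : 0u)) & ((1u << N) - 1);
        }
    };

    int main() {
        HistoryPredictor p;
        int mispredicts = 0;
        for (int i = 0; i < 1000; ++i) {       // the alternating branch again
            bool taken = (i % 2 == 0);
            if (p.predict() != taken) ++mispredicts;
            p.update(taken);
        }
        // After a short warm-up, histories 0101 and 1010 each learn their own
        // outcome, so the alternating pattern is predicted almost perfectly.
        std::printf("mispredicts out of 1000: %d\n", mispredicts);
    }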
32. Some predictions are simpler:
For each ret instruction, the target is somewhere on the stack (pushed before.) Modern CPUs keep track of return addresses in an internal return stack buffer: each time a call is executed, an entry is pushed, and it is later used to predict the return address.
On a cold encounter (a.k.a. static prediction,) a branch is sometimes predicted to
▪ fall through if it goes forward.
▪ be taken if it goes backward.
33. Best general advice is to arrange your code so
that the most common path for branches is
“not taken”. This improves the effectiveness
of code prefetching and the trace cache.
Branch prediction, register renaming and
speculative execution work extremely well
together.
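A hedged sketch of that advice in C++ (function names invented; [[gnu::cold]] assumes GCC or Clang): the rare error path branches out of line, and the common case falls straight through.

    #include <cstdio>

    [[gnu::cold]] static void handle_error(int value) {  // kept out of the hot path
        std::fprintf(stderr, "bad value %d\n", value);
    }

    int process(const int* data, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i) {
            if (data[i] < 0) {            // rare: a forward branch, statically
                handle_error(data[i]);    //       predicted (and almost always)
                continue;                 //       not taken
            }
            sum += data[i];               // the common case falls through
        }
        return sum;
    }

    int main() {
        int data[] = {1, 2, 3, -4, 5};
        std::printf("%d\n", process(data, 5));
    }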
35. Clock 0 – Instruction 0: mov RAX,[RBX+16]
The code we will follow, one instruction per step:
0: mov RAX,[RBX+16]
1: add RBX,16
2: cmp RAX,0
3: je IsNull
4: mov [RBX-16],RCX
5: mov RCX,[RDX+0]
6: mov RAX,[RAX+8]
Load RAX from memory. Assume a cache miss: 300 cycles to load. The instruction starts, and dispatch continues...
36. Clock 0 – Instruction 1: add RBX,16
This instruction writes RBX, which conflicts with the read in instruction 0. Rename this instance of RBX and continue…
37. Clock 0 – Instruction 2: cmp RAX,0
The value of RAX is not available yet; the value of the Flags reg. cannot be calculated. Queue up behind instruction 0…
38. Clock 0 – Instruction 3: je IsNull
The Flags reg. is still not available. Predict that this branch is not taken. Assuming 4-wide dispatch, the instruction issue limit for this cycle is reached.
39. Clock 1 – Instruction 4: mov [RBX-16],RCX
The store is speculative, so its result is kept in the Store Buffer. Also, RBX might not be available yet (from instruction 1.) The Load/Store Unit is tied up from now on; no more memory ops can issue in this cycle.
40. Clock 2 – Instruction 5: mov RCX,[RDX+0]
Had to wait for the L/S Unit. Assume this is another (and unrelated) cache miss; we now have 2 overlapping cache misses. The L/S Unit is busy again.
41. Clock 3 – Instruction 6: mov RAX,[RAX+8]
RAX is not ready yet (300-cycle latency, remember?!) This load cannot even start until instruction 0 is done.
42. Clock 301 – Instruction 2: cmp RAX,0
At clock 300 (or 301,) RAX is finally ready. Do the comparison and update the Flags register.
43. Clock 301 – Instruction 6: mov RAX,[RAX+8]
Issue this load too. Assume a cache hit (finally!) The result will be available at clock 304.
44. Clock 302 – Instruction 3: je IsNull
Now the Flags reg. is ready. Check the prediction. Assume the prediction was correct.
45. Clock 302 – Instruction 4: mov [RBX-16],RCX
This speculative store can actually be committed to memory (or cache, actually.)
46. Clock 302 – Instruction 5: mov RCX,[RDX+0]
At clock 302, the result of this load arrives.
47. Clock 305 – Instruction 6: mov RAX,[RAX+8]
The result arrived at clock 304; the instruction is retired at 305.
48. To summarize:
In 4 clocks, we started 7 ops and 2 cache misses.
We retired 7 ops in 306 cycles.
Cache misses totally dominate performance.
The only real benefit came from being able to have 2 overlapping cache misses!
49. The point of all these gimmicks: to get to the next cache miss as early as possible.
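To see why that matters, here is a rough C++ sketch (sizes and timings are illustrative, not from the talk): a pointer chase serializes its cache misses, while an independent walk over the same data lets the CPU keep several misses in flight.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t N = 1 << 24;     // 16M entries (~128 MB): far bigger than cache
        std::vector<std::size_t> next(N);
        std::iota(next.begin(), next.end(), std::size_t(0));
        std::shuffle(next.begin(), next.end(), std::mt19937_64(42));

        auto bench = [](const char* name, auto&& f) {
            auto t0 = std::chrono::steady_clock::now();
            std::size_t r = f();
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %zu, %lld ms\n", name, r,
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        };

        // Dependent loads: each address comes from the previous load,
        // so the misses cannot overlap.
        bench("pointer chase", [&] {
            std::size_t i = 0;
            for (std::size_t k = 0; k < N; ++k) i = next[i];
            return i;
        });

        // Independent loads: all addresses are known up front, so the CPU
        // (and the prefetcher) can overlap many misses.
        bench("independent  ", [&] {
            std::size_t s = 0;
            for (std::size_t k = 0; k < N; ++k) s += next[k];
            return s;
        });
    }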
50. Main memory is slow; S.L.O.W.
Very slow
Painfully slow
And it especially has very bad (high) latency
But all is not lost! Many (most) references to memory have high temporal and spatial (address) locality.
So we use a small amount of very fast memory to keep recently-accessed or likely-to-be-accessed chunks of main memory close to the CPU.
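A small C++ sketch of locality at work (matrix size invented; results vary by machine): walking a matrix row by row touches consecutive bytes, so almost every access hits a line already in cache; walking it column by column jumps a whole row per access and misses far more often.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 4096;                 // 4096x4096 ints = 64 MB
        std::vector<int> m(N * N, 1);
        auto bench = [&](const char* name, bool by_rows) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (std::size_t a = 0; a < N; ++a)
                for (std::size_t b = 0; b < N; ++b)
                    sum += by_rows ? m[a * N + b]   // consecutive addresses
                                   : m[b * N + a];  // stride of N ints
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %lld, %lld ms\n", name, sum,
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
        };
        bench("row-major   ", true);
        bench("column-major", false);
    }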
51. Caches typically come in several levels (3 these days.)
Each lower level is several times smaller, but several times faster, than the level above.
The CPU can only see the L1 cache, each level only sees the level above it, and only the highest level can communicate with main memory.
Data is transferred between memory and cache in units of a fixed size, called a cache line. The most common size today is 64 bytes.
52. When any memory byte is needed:
Its place in the cache is calculated;
The CPU asks the cache;
If the data is there, the cache returns it;
If not, the data is pulled in from memory;
If the calculated cache line is occupied by data with a different tag, that data is evicted;
If the evicted line is dirty (modified,) it is written back to memory first.
[figure: main memory divided into blocks, each the size of a cache line; each block in the cache also holds metadata, like the tag (address) and some flags]
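A loose sketch of that "place is calculated" step, with made-up parameters (a 32 KB direct-mapped cache with 64-byte lines):

    #include <cstdint>
    #include <cstdio>

    const std::uint64_t LINE  = 64;                 // bytes per cache line
    const std::uint64_t LINES = 32 * 1024 / LINE;   // 512 lines in a 32 KB cache

    int main() {
        std::uint64_t addr   = 0x7fff12345678;
        std::uint64_t offset = addr % LINE;           // byte within the line
        std::uint64_t index  = (addr / LINE) % LINES; // which cache line to use
        std::uint64_t tag    = addr / LINE / LINES;   // identifies whose data sits there
        std::printf("offset=%llu index=%llu tag=%#llx\n",
            (unsigned long long)offset, (unsigned long long)index,
            (unsigned long long)tag);
        // Two addresses that differ by a multiple of the cache size (LINE*LINES)
        // get the same index and evict each other: the thrashing on the next slide.
    }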
53. In this basic model, if the CPU periodically accesses memory addresses that differ by a multiple of the cache size, they will constantly evict each other, and most cache accesses will be misses. This is called cache thrashing.
An application can innocently and very easily trigger this.
54. To alleviate this problem, each cache block is
turned into an associative memory that can
house more than one cache line.
Each cache block holds more cache lines (2, 4,
8 or more,) and still uses the tag to look up
the line requested by the CPU in the block.
When a new line comes in from memory, an
LRU (or similar) policy is used to evict only the
least-likely-to-be-needed line.
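To make the mechanism concrete, a toy 4-way set with LRU eviction (a sketch only; real hardware uses pseudo-LRU approximations in parallel logic):

    #include <cstdint>
    #include <cstdio>

    const int WAYS = 4;

    struct Set {
        std::uint64_t tag[WAYS] = {};
        bool valid[WAYS] = {};
        int  age[WAYS] = {};                 // higher = less recently used

        bool access(std::uint64_t t) {       // true on hit; fills the set on miss
            for (int w = 0; w < WAYS; ++w) ++age[w];
            for (int w = 0; w < WAYS; ++w)
                if (valid[w] && tag[w] == t) { age[w] = 0; return true; }
            int victim = 0;                  // pick an empty way, else the oldest
            for (int w = 0; w < WAYS; ++w) {
                if (!valid[w]) { victim = w; break; }
                if (age[w] > age[victim]) victim = w;
            }
            valid[victim] = true; tag[victim] = t; age[victim] = 0;
            return false;
        }
    };

    int main() {
        Set s;
        std::uint64_t tags[] = {1, 2, 3, 4, 1, 2, 3, 4};
        int hits = 0;
        for (std::uint64_t t : tags) hits += s.access(t);
        // Four conflicting lines, touched twice: a direct-mapped cache could
        // miss on every access; four ways hold all of them, so round two hits.
        std::printf("hits: %d of 8\n", hits);   // prints: hits: 4 of 8
    }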
55. References:
Patterson & Hennessy, Computer Organization and Design
Intel 64 and IA-32 Architectures Software Developer’s Manual, vols. 1, 2 and 3
Click & Goetz, A Crash Course in Modern Hardware
Agner Fog, The Microarchitecture of Intel, AMD and VIA CPUs
Drepper, What Every Programmer Should Know About Memory