Advanced Computer Architectures – Part 2.1

2. © V. De Florio
KULeuven 2003
Basic Concepts
Computer Design
Computer Architectures for AI
Computer Architectures in Practice
2.1/2
Course contents
• Basic Concepts
Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
3.
Computer Design
Quantitative assessments
• Instruction sets
• Pipelining
4.
Computer design
• First part of the course: a survey of computer history
• Key aspect of this history:
In the last 60 years computers have experienced formidable growth in performance and a huge decrease in costs
A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties
5.
Computer design
• How was this possible?
• Through
Advances in computer technology
Advances in computer design
6.
Computer design
• The tasks of a computer designer:
Determine key attributes for a new machine
E.g., design a machine that maximizes performance while keeping costs under control
Aspects:
Instruction set design
Functional organization
Logic design
Implementation
(To be defined later)
7.
Significant improvements
• First 25 years:
From both technology and design
• From the Seventies:
Mainly from IC technology
Main concern = compatibility with the past
(to save investments)
Compatibility at the machine-language level
No room for design improvements
Performance growth of 20–30% per year for mainframes and minis
• Late Seventies: advent of the mP
Higher rate (35% per year)
8.
Significant improvements: the mP
• The mP
Mass production → lower costs
Significant changes in the computer marketplace
Higher-level language compatibility (no need for object-code compatibility)
Availability of standard, vendor-independent OSes (fewer risks and costs in producing a new architecture)
allowed the development of a new concept:
RISC architectures
9.
Significant improvements: RISC
RISC architectures
Designed in the Eighties, on the market ca.‘85
Since then, a 50% improvement per year
[Chart: performance of RISC processors, 1987–1995 — Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc. Growth: 1.35×/yr early on, 1.54×/yr later]
10.
Technology Trends
[Chart: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965–2000]
11.
Computer design
• The mP allowed a 50% yearly performance increase. How was that possible?
Enhanced capability for users
IBM Power 2 (1993) ≈ Cray Y-MP (1988)
The fastest supercomputer in 1988 has approx. the same performance as the fastest 1993 workstation
Price: 1/10
Computers became more and more mP-based
Mainframes were disappearing or becoming based on off-the-shelf mPs
12.
Computer design
• Big consequence
No more market urge for
object code compatibility
Freedom from compatibility with old designs
Renaissance in computer design
Again, significant improvements from both
technology and design
50% yearly performance growth!
13.
Computer design
• The highest-performance mP in ’95 is mainly a result of design improvements (1-to-5)
• In this section we focus on the design techniques that allowed this state of affairs
14.
Performance
• What are the aspects to be taken into
account in order to reach a higher
performance?
• How to choose between different
alternatives?
Amdahl’s law
Quantitative assessment
15.
Amdahl’s law
• Speed-up:

S = (Execution time for entire task w/o using the “enhancement”) / (Execution time for entire task using the enhancement when possible)
• Amdahl’s law on speed-up:
• Speed-up depends on the fraction of time that may be affected by the enhancement
16.
Amdahl’s law
Let us call F the fraction of time
affected by the enhancement
For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time
What do we gain by introducing
the enhancement?
Exec-time_NEW = Exec-time_OLD × ((1 − F) + F / S_ENH)
where S_ENH is the speedup in the enhanced mode. Hence,
S = Exec-time_OLD / Exec-time_NEW = 1 / ((1 − F) + F / S_ENH)
17.
Amdahl’s law
[Chart: overall speedup vs. S_ENH for F = 40% — S_ENH grows, but S_OVERALL saturates]
18.
Amdahl’s law
• Law of diminishing returns:
the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added
lim_{S_ENH→∞} S = lim_{S_ENH→∞} 1 / ((1 − F) + F / S_ENH) = 1 / (1 − F) = S_MAX
19.
Amdahl’s law
To reach a maximum speedup = 3,
F must be at least 66%
20.
Amdahl’s law…
• “…can serve as a guide to how much an
enhancement will improve performance
and how to distribute resources to
improve cost/performance.
• The goal, clearly, is to spend resources
proportional to where time is spent.’’
21.
Amdahl’s law
• Example 1 (p.30 P&H)
An enhancement improves performance by a factor of 10
and can be exploited for 40% of the time
speedup_overall = 1 / ((1 − fract._enhanced) + fract._enhanced / speedup_enhanced)
= 1 / ((1 − 0.4) + 0.4 / 10) = 1.56
22.
Amdahl’s law
Example 2 (p.31 P&H)
50% of the instructions of a given benchmark
are floating point instructions
FPSQR applies to 20% of the same benchmark
Alternative 1: extra hardware: FPSQR is 10
times faster
Alternative 2: all the FP instructions go 2 times
faster
speedup_FPSQR = 1 / ((1 − 0.2) + 0.2 / 10) = 1.22
speedup_FP = 1 / ((1 − 0.5) + 0.5 / 2.0) = 1.33
23.
Quantitative assessment
• CPUTIME(p) = Time spent by the CPU to run
program p
• Clock cycle time = t_cc , clock rate = 1 / t_cc
• CPUTIME(p) = # clock cycles × t_cc
= # clock cycles / clock rate
• E.g.: clock cycle time = 2 ns ⇔ clock rate = 500 MHz
• #CC(p) = number of clock cycles spent in
the execution of p
24.
Quantitative assessment
• Instruction count
• IC(c,p) = number of instructions that CPU
c executed during the activity of program
p
• Often abbreviated as IC(p)
25.
Quantitative assessment
• Clock cycles per instruction
• CPI(p) = #CC(p) / IC(p)
average number of clock cycles needed
to execute one instruction of p
26.
Quantitative assessment
• CPUTIME(p) =
= # clock cycles × clock cycle time
= #CC(p) × t_cc
= IC(p) × CPI(p) × t_cc
= IC(p) × CPI(p) / clock rate
We can influence the performance of a given
program p by optimizing the three key
variables IC(p), CPI(p), and clock rate.
27.
Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
Clock rate (the higher, the better)
Clock cycles per instruction (the lower, the better)
Instruction count (the lower, the better)
28.
Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
Clock rate (HW technology & organization)
Clock cycles per instruction
(organization & instruction set architecture)
Instruction count
(instruction set architecture &
compiler technology)
• Note: technologies are not independent of
each other!
29.
Quantitative assessment
CPU time = Seconds/Program = Instructions/Program × Cycles/Instruction × Seconds/Cycle

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X          X
Inst. Set          X          X
Organization                  X        X
Technology                            (X)
30.
Quantitative assessment
• Decades-long challenge: optimizing
CPUTIME(p) = IC(p) × CPI(p) / clock rate
• This is a function of p!
• The choice of benchmarks is
important
31.
Quantitative assessment
• Which methods to use?
CPUTIME(p) = IC(p) × CPI(p) / clock rate
• Method 1: increasing the clock rate
(Note: independent of p!)
• Method 2: those trying to decrease IC(p)
• Method 3: those trying to decrease CPI(p)
• Each factor is equally important
• Some methods are more effective than others
32.
Quantitative assessment:
how to calculate CPI?
CPI = ( Σ_{i=1}^{n} CPI_i × IC_i ) / Instr. count = Σ_{i=1}^{n} CPI_i × ( IC_i / Instr. count )
ICi = number of times instruction i is
executed by p
CPIi = average number of clock cycles
for instruction i
CPIi needs to be measured and not just
read from a table in the Reference
Manual!
That is, we need to take into account
the memory access time! (Cache
misses do count… a lot)
33.
Quantitative assessment
• Example 3: 2 alternatives for a
conditional branch instruction
A: a CMP that sets a condition code (Z bit)
followed by a JZ
B: a single instruction to do CMP and JZ
Arch. A
LD R1, 0
L: INC R1
CMP R1, 5
JZ L
RET
Arch. B
LD R1, 0
L: INC R1
JRZ R1, 5, L
RET
We assume that JZ and JRZ take 2 cycles,
all the other instructions take 1 cycle
34.
Quantitative assessment
Arch. A
LD R1, 0
L: INC R1
CMP R1, 5
JZ L
RET

Arch. B
LD R1, 0
L: INC R1
JRZ R1,5,L
RET
• 20% of the instructions are c.jumps
(instructions such as JZ or JRZ)
• 80% are other instructions
• On A, each c.jump is preceded by a CMP: 20% are c.jumps and another 20% are CMPs
• 60% are other instructions
Because of the extra complexity in B, the
clock of A is faster (CTB = 1.25 CTA)
35.
Quantitative assessment
• CPI_A = Σ_i instr_i × cycles_i / IC_A
= %BR_A × cycles_BR + %OTH_A × cycles_OTH
= 20% × 2 + 80% × 1 = 1.2
• CPU_A = IC_A × CPI_A × CT_A = IC_A × 1.2 × CT_A
• CPI_B = Σ_i instr_i × cycles_i / IC_B
= %BR_B × cycles_BR + %OTH_B × cycles_OTH
36.
Quantitative assessment
• Now, on B:
One spares 20% of the instructions (the extra CMPs), hence:
%BR_B = 20 / (100 − 20) = 0.25 (25%)
Furthermore, IC_B = 0.8 × IC_A
• Hence CPI_B = 0.25 × 2 + 0.75 × 1 = 1.25
• CPU_B = IC_B × CPI_B × CT_B = 0.8 IC_A × 1.25 × 1.25 CT_A
So CPU_B = 1.25 × IC_A × CT_A
CPU_A = 1.2 × IC_A × CT_A
So A is faster (for which program p?)
37.
Performance
• A straightforward enhancement is given
by increasing the clock rate
• The entire program benefits
• Also, independent of the particular
program
• Dependent on the efficiency of the
compiler etc.
38.
Clock Frequency Growth Rate
[Chart: clock rate (MHz) of microprocessors from the i4004 to the R10000 and Pentium, 1970–2005]
• 30% per year
39.
Transistor Count Growth Rate
[Chart: transistor count of microprocessors from the i4004 to the R10000 and Pentium, 1970–2005]
• 100 million transistors on chip in early year 2000.
• Transistor count grows much faster than clock rate
40.
Performance
• Another important factor for performance
is given by
Memory accesses
I/O (disk accesses)
41.
Memory
• Semiconductor DRAM technology
Density: increase of 60% per year (quadruples in 3 years)
Cycle time: improves far more slowly!
        Capacity         Speed
Logic   2× in 3 years    2× in 3 years
DRAM    4× in 3 years    1.4× in 10 years
Disk    2× in 3 years    1.4× in 10 years
Speed increases of memory and I/O have not
kept pace with processor speed increases.
42.
Memory size

[Chart: DRAM chip capacity (bits), 1970–2000]

Year   Size (Mb)   Cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
43.
Basic definitions
1. Bandwidth: the rate at which data can be
transferred. Bandwidth is typically measured in
bytes per second.
2. Block size: the amount of data transferred per
request. Block size is typically measured in bytes.
3. Latency: the time between making a request (e.g.
to read or write a block of data) and completing the
request. Latency is typically measured in seconds.
4. Throughput: The number of requests that can be
completed per unit time. Throughput is typically
measured in requests per second.
44.
Memory
• DRAM: main memory of all computers
Commodity chip industry: no company >20% share
Packaged in SIMM or DIMM (e.g.,16 DRAMs/SIMM)
• Capacity: 4X/3 years (60%/year)
Moore’s Law
• MB/$: + 25%/year
• Latency: – 7%/year,
Bandwidth: + 20%/year (so far)
SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips. Measured in bytes vs bits. 32-bit path to memory
DIMM = dual in-line memory module. 64-bit path to memory
source: www.pricewatch.com, 5/21/98
45.
Processor Limit: DRAM Gap
[Chart: “Moore’s Law” processor vs. DRAM performance, 1980–2000 — µProc improves 60%/yr, DRAM 7%/yr; the processor–memory performance gap grows 50%/yr]
46.
Memory Summary
• DRAM:
rapid improvements in capacity, MB/$, bandwidth;
slow improvement in latency
Processor-memory interface
is a bottleneck to delivered bandwidth
47.
Disk Components
48.
Disk Components: Platters
• Platters: the recording surfaces.
i. 1 to 8 inches in diameter (2.5 to 20 cm).
ii. Stacked on a spindle: typical disks have 1-12
platters.
iii. Data can be stored on one or both surfaces.
iv. Spindle and platters rotate at 3600 - 10000 rpm
(60-165 Hz).
v. Recording density depends on applying a
magnetic film with few defects.
vi. Rotation rate limited by bearings and power
consumption.
49.
Disk Components: Heads
• Heads: write and read data to and from platters.
i. Data stored as presence or absence of magnetization.
Data stored as presence or absence of
magnetization.
ii. Head “floats” on air-film that rotates with the disk.
Bernoulli effect pulls head toward disk but not into
it. A dust particle can cause a “head crash” where
the disk surface is scratched and any data on it is
lost.
iii. Disk heads are manufactured using thin film
technology. Advancing technology allows smaller
heads and therefore more closely spaced tracks
and bits.
50.
Disk Components: Actuators
•
i.
ii.
iii.
Actuators: move heads radially over the platters.
Actuator arm needs to be light to move quickly.
Actuator arm needs to stiff to prevent flexing.
Smaller platters allow shorter arms: therefore
lighter and stiffer.
iv. Actuators limited by
•
•
power of actuator motor and
weight and strength of actuator components
51.
Disks: Data Layout
• Each surface consists of concentric rings called
tracks
• Each track is divided into sectors. Data is written to
and read from the disk a whole sector at a time
• The set of tracks that are at the same relative position on each surface form a cylinder
[Figure: stacked platters; corresponding tracks on each surface form a cylinder]
52.
Three Components of Disk Access Time
1. Seek time: the time to move the heads to the
desired cylinder
Advertised to be 8 to 12 ms. May be lower in real life
2. Rotational latency: the time for the desired sector
to arrive under the head
4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
3. Transfer time: the time to read the data from the
disk and send it over the I/O bus to the processor
2 to 12 MB per second
[Figure: disk access path — Proc → Queue → IOC (Ctrl) → Device]
Response time = Queue + Ctrl + Device Service time
53.
Hard Disks
Disk Latency = Queueing Time +
Controller time +
Seek Time + Rotation Time + Xfer Time
Order of magnitude times for 4K byte transfers:
Average Seek: 8 ms or less
Rotate: 4.2 ms @ 7200 rpm
Xfer: 1 ms @ 7200 rpm
54.
Hard Disks
• Capacity: +60%/year (2× / 1.5 yrs)
• Transfer rate (BW): +40%/year (2× / 2.0 yrs)
• Rotation + Seek time: −8%/year (halves in 10 yrs)
• MB/$: >60%/year (2× / <1.5 yrs)

Latency = Queuing Time + Controller time + Seek Time + Rotation Time (per access) + Size / Bandwidth (per byte)
source: Ed Grochowski, 1996,
“IBM leadership in disk drive technology”;
www.storage.ibm.com/storage/technolo/grochows/grocho01.htm,
55.
Hard disks
1973: 1.7 Mbit/sq. in, 140 MBytes
1979: 7.7 Mbit/sq. in, 2,300 MBytes
56.
Hard Disks
Areal Density
[Chart: areal density (Mbit/sq. in), 1970–2000]
1989: 63 Mbit/sq. in, 60,000 MBytes
1997: 1450 Mbit/sq. in, 1600 MBytes
1997: 3090 Mbit/sq. in, 8100 MBytes
57.
• Continued advance in capacity (60%/yr)
and bandwidth (40%/yr.)
• Slow improvement in seek, rotation
(8%/yr)
• Time to read whole disk:
Year   Sequentially   Randomly
1990   4 minutes      6 hours
2000   12 minutes     1 week
58.
Memory/Disk Summary
• Memory:
DRAM rapid improvements in capacity, MB/$,
bandwidth; slow improvement in latency
• Disk:
Continued advance in capacity, cost/bit,
bandwidth; slow improvement in seek,
rotation
• Huge gap between CPU and external
memories
• How to address this problem?
• Classical way: memory hierarchies
59.
Memory hierarchies
• Axiom of HW designer: smaller is faster
Larger memories => larger signal delay
More levels are required to encode addresses
In a smaller memory the designer can use more
power per cell => shorter access times
• Crucial features for performance
Huge bandwidth (in MB/sec.)
Short access times
• Principle of locality
The data most recently used is very likely to be
accessed again in the near future (temporal l.)
Memory cells close to the most recently used one
are likely to be accessed in the near future (spatial)
• Combining the above with Amdahl’s law, the “best” enhancement is using hierarchies of memories
60.
Typical memory hierarchy (’95)
[Figure: CPU ↔ Registers ↔ Cache ↔ (memory bus) Memory ↔ (I/O bus) I/O devices]

Level        Size     Speed
Registers    200 B    5 ns
Cache        64 KB    10 ns
Memory       32 MB    100 ns
I/O (disk)   2 GB     5 ms
61.
Memory hierarchies
[Figure: memory-hierarchy design space — Input/Output and Storage (disks, WORM, tape; RAID; emerging technologies), DRAM (interleaving, bus protocols), L2 and L1 caches (coherence, bandwidth, latency), VLSI, Instruction Set Architecture (addressing, protection, exception handling), and Pipelining and Instruction Level Parallelism (hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP)]
62.
Memory hierarchies
• Registers: smallest and fastest memory
• Size: less than 1 KB
• Access time: 2-5 ns
• Bandwidth: 4000-32000 MB/sec
• Managed by the compiler (or the assembly programmer)
register int a;
• Special purpose vs. general purpose
• Monolithic or double-shaped
Rx = Rl + Rh
• Backed in cache
• Implemented via custom memory with
multiple ports
63.
Memory hierarchies
• Cache = small, fast memory located close
to the CPU
• The cache holds the most recently
accessed code or data
Managed by HW
No way to tell “put these data in cache” at the SW level
New research: cache-conscious data
structures
• Size: less than 4 MB
• Access time: 3-10 ns
• Bandwidth: 800-5000 MB/sec
• Backed in main memory
• Implemented with (on- or off-chip) CMOS SRAM
64.
Memory hierarchies
• Cache terminology: cache hit, cache
miss, cache block
Cache hit: the CPU has been able to find the requested data in the cache
Cache miss: ¬ cache hit — the requested data is not in the cache
Cache block: the fixed-size buffer used to load a portion of memory into the cache
• A cache miss blocks the CPU until the
corresponding memory block gets cached
65.
Memory hierarchies
• Virtual memory: same principles behind
the use of cache, but implemented
between main memory and disk storage
• At any point in time, not all the data
referenced by p need to be in main
memory
• Address space is partitioned into fixed-size blocks: pages
• A page is either in memory or on disk
• When CPU references an item within a
page
if ( Check-if-in-cache() == CACHE_MISS )
if ( Check-if-in-memory() == MEM_MISS)
PageFault(); // Loads page in memory
CPU doesn’t stall – switches to other tasks
66.
Cache performance
• Example: speedup using a cache
Cache 10 times faster than main memory
Cache is used 90% of the cases
speedup = 1 / ((1 − fract._enhanced) + fract._enhanced / speedup_enhanced)
= 1 / ((1 − 0.9) + 0.9 / 10) = 5.3
67.
Cache performance
CPUtime = (CPU clock cycles + memory stall cycles) × clock cycle time
Memory stall cycles = #(misses) × miss penalty
= IC × #(misses per instruction) × miss penalty
= IC × #(memory references per instr.) × miss rate × miss penalty
68.
Cache performance
• Example (P&H, p.43)
A computer has a CPI = 2 when all data is in cache
Memory access is only required by load and store instructions (40% of the total)
Miss penalty = 25 clock cycles
Cache miss rate = 2%
? How much faster would the machine be if no cache miss ever occurred?
CPU_all-hits = (CPU clock cycles + memory stall cycles) × clock cycle time
= (IC × CPI + 0) × clock cycle time
= IC × 2 × clock cycle time
69.
Cache performance
? How fast is the machine when cache misses do occur?
1. Compute the memory stall cycles (msc):
msc = IC × memory references per instruction × miss rate × miss penalty
= IC × (1 + 0.4) × 0.02 × 25   (1 instruction access + 0.4 data accesses)
= IC × 0.7
2. Compute total performance:
CPU_cache = (CPU clock cycles + msc) × clock cycle time
= (IC × 2 + IC × 0.7) × clock cycle time
= 2.7 × IC × clock cycle time
70.
Computer Design
• Quantitative assessments
Instruction sets
• Pipelining
71.
Computer design
• Instruction-set architecture:
The architecture of the machine level
The boundary between SW and HW
• Organization:
High level aspects: memory system, bus
structure, internal CPU design
• Hardware:
The specifics of a machine: detailed logic
design, packaging technology…
• Architecture = I + O + H
72.
Instruction Sets
• IS = Instruction sets = The architecture of
the machine language
• IS Classification
• Roles of the compilers
• DLX
73.
Computer Design → IS
IS Classification
• Role of the compilers
• DLX
74.
Computer Design → IS
IS Classification
• Key: type of internal storage in the CPU
• Three main classes
Stack architectures
Accumulator architectures
General-purpose register architectures
75. Computer Design → IS
IS Classification → Stack A.
• Stack architecture:
• Operands are implicitly referred to
• Top two items on the system stack
• Example: C = A + B
1. PUSH A
2. PUSH B
3. ADD
[Stack: B on top of A]
ADD = PUSH (POP + POP)
76. Computer Design → IS → IS Classification → Stack A.
(same example, one step later)
ADD = PUSH (POP + POP)
ADD = PUSH (B + POP)
77. Computer Design → IS → IS Classification → Stack A.
(the stack now holds B + A)
ADD = PUSH (POP + POP)
ADD = PUSH (B + POP)
ADD = PUSH (B + A)
78. Computer Design → IS → IS Classification → Stack A.
• Example: C = A + B
1. PUSH A
2. PUSH B
3. ADD
4. POP C
C = TOP OF STACK = A + B
An example: the ARIEL virtual machine (Part 1, Slides 91 –)
79.
Computer Design → IS
IS Classification → Accumulator A.
• Accumulator Architectures
• A special register (the accumulator) plays the role of an implicit argument
• Example: C = A + B
1. LOAD A    ; let Acml = A
2. ADD B     ; let Acml = Acml + B
3. STORE C   ; let C = Acml
80. Computer Design → IS
IS Classification → Register A.
•
•
•
•
General-purpose Register Architecture
Explicit operands only
Either registers or memory locations
Two flavors:
Register-memory architectures (RMA)
Register-register architectures (RRA)
• Example: C = A + B
RMA: Load R1, A
Add R1, B
; in C, R1 += B
Store C, R1
RRA: Load R1, A
Load R2, B
Add R3, R1, R2
Store C, R3
81. Computer Design IS
IS Classification: RRA
• Some old machines used stack or accumulator architectures
  For instance, the T800 (stack) and the 6502/6510 (accumulator)
• Today the de facto standard is RRA
  Registers are fast
  Registers are easier to use (for compiler writers)
    They do not require dealing with associativity issues - stacks do!
  Registers can hold variables:
    register int i;
    for (i = 0; i < 1000000; i++) { do_stgh(i); … }
  Using registers, you don't need a memory address
82. Computer Design IS
IS Classification: Register Architectures
• RRA: no memory operands
  All instructions are similar in size -> they take a similar number of clocks to execute (a very useful property… see later)
  No side effects
  Higher instruction count
• RMA: one memory operand
  One load can be spared
  A register operand is destroyed (R += B)
  Clocks per instruction vary with operand location
• Memory-memory:
  Compact
  Large variation of work per instruction
  Large variation in instruction size
83. Computer Design IS
Memory addressing
• How is memory organized?
• What does it mean, e.g., to read memory at address 512? What do we read?
  Bytes, half-words, words, double words
• How are consecutive bytes stored in a word? (Assumption: a word is 4 bytes)
  Little endian: &word = &LSB
  Big endian: &word = &MSB
  (&word = address of the word)
  XDR routines are needed to exchange data between machines with different byte orders
84. A memory model for didactics
• Memory can be thought of as a finite, long array of cells, each of size 1 byte:
  0 1 2 3 4 5 6 7 …
• Each cell has a label, called its address, and a content, i.e. the byte stored into it
• Think of a chest of drawers, with a label on each drawer and possibly something in it
85. A memory model for didactics
[Figure: a chest of drawers; each drawer carries an Address label (1, 2, 3, 4) and holds a Content]
86. A memory model for didactics
• The character * has a special meaning: it refers to the contents of a cell
• For instance: *(1)
  Here it means we're inspecting the contents of a cell (we open a drawer and see what's in it)
87. A memory model for didactics
• The character * has a special meaning: it refers to the contents of a cell
• For instance: *(1)
  Used as the destination of an assignment, it means we're writing new contents into a cell (we open a drawer and change its contents)
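The drawer metaphor maps directly onto C pointers: the same `*p` inspects a cell when it appears on the right of an assignment and overwrites it when it appears on the left. A minimal sketch (the array `cells` and the function name are ours):

```c
/* A tiny "memory" of 8 one-byte drawers, plus one labelled pointer. */
int drawer_demo(void)
{
    unsigned char cells[8] = {0};   /* eight drawers, all initially empty */
    unsigned char *p = &cells[1];   /* p holds the address (label) of drawer 1 */

    *p = 42;                        /* *p on the left: write into the drawer */
    return *p;                      /* *p on the right: inspect the drawer */
}
```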
88. A memory model for didactics
• Memory is (often) byte addressable, though it is organized into small groups of bytes: the machine word
• A common size for the machine word is 4 bytes (32 bits)
• Two possible organizations for the bytes in a word:
  Little endian
  Big endian
89. Little endian versus Big endian
[Figure: two 8-cell memories, each holding two 4-byte words]
• Big endian (Motorola): in each word, the MSB sits at the lowest address
  word 0: MSB at address 0 … LSB at address 3; word 1: MSB at address 4 … LSB at address 7
• Little endian (Intel): in each word, the LSB sits at the lowest address
  word 0: LSB at address 0 … MSB at address 3; word 1: LSB at address 4 … MSB at address 7
90. Little endian versus Big endian
• Problem: communication between the two
[Figure: the same bytes stored in the 8 memory cells of a little endian (Intel) and a big endian (Motorola) machine]
• So the bytes in memory are the same; though, interpreted as words they differ:
  bytes 01 00 00 00 read as a word: little endian = 1, big endian = 16777216
  bytes 00 00 00 10 (hex) read as a word: little endian = 268435456, big endian = 16
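The two interpretations above can be checked with a short C sketch. The helper names (`read_le4`, `read_be4`, `host_is_little_endian`) are ours, written only to illustrate the byte orders:

```c
#include <stdint.h>
#include <string.h>

/* Interpret four bytes, given in address order, as a little-endian word:
   the LSB sits at the lowest address. */
uint32_t read_le4(unsigned b0, unsigned b1, unsigned b2, unsigned b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

/* Same bytes interpreted as a big-endian word: the MSB sits at the
   lowest address. */
uint32_t read_be4(unsigned b0, unsigned b1, unsigned b2, unsigned b3)
{
    return ((uint32_t)b0 << 24) | ((uint32_t)b1 << 16)
         | ((uint32_t)b2 << 8) | (uint32_t)b3;
}

/* Which convention does the machine running this code use? */
int host_is_little_endian(void)
{
    uint32_t one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);    /* fetch the byte at the lowest address */
    return first == 1;          /* 1 there -> LSB first -> little endian */
}
```

This reproduces the slide's numbers: bytes 01 00 00 00 read as 1 or 16777216, and bytes 00 00 00 10 (hex) as 268435456 or 16, depending on the convention.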
91. Computer Design IS
Memory addressing
• Alignment is mandatory on some machines
  Object O; int t = sizeof(O);
  ALIGNED(O) means: &O modulo t is 0 ("access to O is aligned")
  For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4
  Alignment is sometimes necessary because it prevents hardware complications
  Alignment implies faster access
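The ALIGNED(O) condition, and the standard bit trick for rounding an address up to an aligned one, can be sketched in C (the function names are ours; `t` must be a power of two for the masking trick to work):

```c
#include <stdint.h>

/* ALIGNED(O) for an object of size t at address addr: &O modulo t is 0. */
int is_aligned(uintptr_t addr, uintptr_t t)
{
    return addr % t == 0;
}

/* Round addr up to the next multiple of t (t a power of two):
   add t-1, then clear the low bits. */
uintptr_t align_up(uintptr_t addr, uintptr_t t)
{
    return (addr + t - 1) & ~(t - 1);
}
```

For 4-byte integers this places objects only at addresses divisible by 4, exactly as the slide describes.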
92. Computer Design IS
Memory addressing
• Addressing modes: ways to specify the address of an object in memory
• An addressing mode can specify
  A constant
  A register
  A memory location
• In what follows:
  A += B   means  A = A + B
  *(x)     means  return the contents of memory at address x
  x++      means  "at the end, let x = x + 1"
  --x      means  "at the beginning, let x = x – 1"
  Rx       means  register x
93. Computer Design IS
Memory addressing

Mode           Example               Meaning
Register       Add R4, R3            R4 += R3
Immediate      Add R4, #3            R4 += 3
Displacement   Add R4, 100(R1)       R4 += *(100 + R1)
Indirect       Add R4, (R1)          R4 += *(R1)
Indexed        Add R4, (R1 + R2)     R4 += *(R1 + R2)
Absolute       Add R4, (100)         R4 += *(100)
Deferred       Add R4, @(R3)         R4 += *(*(R3))
Autoincrement  Add R4, (R3)+         Indirect, then R3++
Autodecrement  Add R4, -(R2)         R2--, then indirect
Scaled         Add R4, 100(R2)[R3]   R4 += *(100 + R2 + R3 * d)

d = size of the addressed data (1, 2, 4, 8, or 16)
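Three of the table's modes can be acted out with a toy register file and memory in C. Everything here (the arrays `mem` and `R`, the function names, and the `setup_demo` values) is our own illustration, not any real machine's semantics:

```c
/* A toy memory and register file. */
static int mem[256];
static int R[8];

/* Displacement: *(disp + Rr) */
int displacement(int r, int disp) { return mem[disp + R[r]]; }

/* Indirect: *(Rr) */
int indirect(int r) { return mem[R[r]]; }

/* Scaled: *(disp + Rr2 + Rr3 * d), d = size of the addressed data */
int scaled(int disp, int r2, int r3, int d)
{
    return mem[disp + R[r2] + R[r3] * d];
}

/* Fill registers and memory with hypothetical demo values. */
void setup_demo(void)
{
    R[1] = 10; R[2] = 4; R[3] = 2;
    mem[110] = 7;   /* displacement(1, 100) reads *(100 + 10)        */
    mem[10]  = 5;   /* indirect(1) reads *(10)                       */
    mem[112] = 9;   /* scaled(100, 2, 3, 4) reads *(100 + 4 + 2*4)   */
}
```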
94. Computer Design IS
Memory addressing
• Addressing modes can reduce IC
• Complex addressing modes increase the complexity of the hardware and can increase CPI
• Displacement, immediate and deferred represent between 75% and 99% of the addressing modes used (experiments done with TeX, spice, and gcc)
• IC(p) = number of instructions that the CPU executed during the activity of program p
• CPI(p) = clock cycles per instruction = #CC(p) / IC(p) = the average number of clock cycles needed to execute one instruction of p
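The two definitions combine into the classic CPU performance equation, Exec-time = IC × CPI × clock cycle time. A minimal sketch (function names are ours):

```c
/* CPI(p) = #CC(p) / IC(p): average clock cycles per executed instruction. */
double cpi(double clock_cycles, double instruction_count)
{
    return clock_cycles / instruction_count;
}

/* Execution time = IC * CPI * clock cycle time (seconds). */
double exec_time(double ic, double cpi_val, double cycle_time)
{
    return ic * cpi_val * cycle_time;
}
```

For example, a program executing 1M instructions at 2 cycles each on a 1 GHz clock (1 ns cycle) runs in about 2 ms.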
95. Computer Design IS
Operations
• Arithmetical and logical (add, and, sub...)
• Data transfer (move, store)
• Control (br, jmp, call, ret, iret…)
• System (virtual memory management…)
• Floating point (add, mul, …)
• Decimal (decimal add, decimal mul…)
• String (string move, string compare, string search)
• Graphics (pixel operations)
• Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed (see Fig. 2.11, P&H p. 81)
96. Computer Design IS
Operations
• Control flow instructions:
  Branch (conditional change)
  Jump (unconditional change)
  Procedure calls
  Procedure returns
• Most of the comparisons in conditional branches are simple "==", "!=" with 0!
• In some cases, the address to go to is only known at run-time:
  "Return" uses a stack
  Switch statements
  Dynamic libraries
97. Computer Design IS
Operands
• When we say, e.g., "Add R1, #5", do we work with bytes? Half-words? Words?
• How do we specify the type of the operand?
  1. Classical method: the type of the operand is part of the opcode
     • The Add family is coded as ffff…fffvv, where the f's are fixed bits and the v's are bits that specify the type
98. Computer Design IS
Operands and types
• Example: Add family = 10110101000100vv
  1011010100010000 = Add float words
  1011010100010001 = Add words
  1011010100010010 = Add half-words
  1011010100010011 = Add bytes
• Old fashioned method: operand = data + tag
  The tag describes a type
  The tag is interpreted by the HW
  The operation is chosen accordingly
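Decoding the vv bits of the hypothetical Add family above is a mask-and-compare in C. The enum values follow the slide's pairing (00 = float words … 11 = bytes); all names are ours:

```c
/* The two low "vv" bits of the 16-bit opcode select the operand type. */
enum add_type {
    ADD_FLOAT_WORDS = 0,   /* ...00 */
    ADD_WORDS       = 1,   /* ...01 */
    ADD_HALF_WORDS  = 2,   /* ...10 */
    ADD_BYTES       = 3    /* ...11 */
};

/* Is this opcode a member of the Add family 10110101000100vv?
   The fixed part 1011010100010000 is 0xB510; mask off the vv bits. */
int is_add_family(unsigned opcode)
{
    return (opcode & 0xFFFCu) == 0xB510u;
}

/* Extract the operand type: keep only the vv bits. */
enum add_type decode_add_type(unsigned opcode)
{
    return (enum add_type)(opcode & 0x3u);
}
```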
99. Computer Design IS
Operands and types
• Which types to support?
• Old fashioned solution: all of them (bytes, half-words, words, f.p., double words, double precision f.p., …)
• Current trend: only operations on items greater than or equal to 32 bits
• On the DEC Alpha one needs multiple instructions to access objects smaller than 32 bits
100. Computer Design IS
Operands and types
• Floating point numbers: IEEE standard 754
• In the early '80s, each manufacturer had its own f.p. representation
• Sometimes string operations are available (strcmp, strcpy…)
• Sometimes BCD is used to code numbers
  Four bits are used to code a decimal digit
  A byte codes two decimal digits
  Functions for "packing" and "unpacking" are required
  It is unclear whether this will stay in the future
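The packing and unpacking the slide mentions is straightforward bit manipulation: four bits per decimal digit, two digits per byte. A minimal sketch (function names are ours; digits are assumed to be 0-9):

```c
/* Pack two decimal digits (each 0-9) into one BCD byte. */
unsigned char bcd_pack(unsigned hi, unsigned lo)
{
    return (unsigned char)((hi << 4) | lo);   /* hi digit in the upper nibble */
}

/* Unpack a BCD byte back into its decimal value. */
unsigned bcd_value(unsigned char b)
{
    return ((b >> 4) & 0xF) * 10 + (b & 0xF);
}
```

A pleasant side effect of BCD: the decimal value 42 packs to the byte 0x42, so hex dumps read like decimal.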
101. Computer Design IS
• IS Classification
• Role of the compilers
• DLX
102. Computer Design IS
Role of the compiler
• In the past, the role of assembly language was crucial
• Architectural decisions aimed at easing assembly language programming
• Now, the user interface is a high-level language (C, C++, Java…)
• The user interfaces the machine via the HLL, though the machine actually executes some lower-level code
• This lower-level code is produced by a compiler
  The role of the compiler is fundamental
  The IS architecture needs to take the compiler strongly into account
103. Computer Design IS
Role of the compiler
• Goals of the compiler writer:
  Correctness
  Performance
  … fast compilation, debugging support, …
• Strategy for writing a compiler:
  Use a number of "passes"
  From high-level structures down to lower levels, until machine level
  This way complexity is decomposed into smaller blocks
  Optimizing, however, becomes more difficult
104. Computer Design IS
Role of the compiler

Pass            Dependencies                        Function
Front-end       D(language) high, D(machine) low    Transform into a language-common intermediate form
HL Opt          some D(language)                    Loop transformations, function inlining…
Global Opt      some D(machine)                     Register allocation…
Code generator  D(machine) high                     Instruction selection, machine-dependent optimizations
105. Computer Design IS
Role of the compiler
• HL optimizations: source-level optimizations (code -> code')
• Local optimizations: basic-block optimizations
• Global optimizations: loop optimizations and basic-block optimizations
• Machine-dependent optimizations: using low-level architectural knowledge
• Basic block = a straight-line code fragment
106. Computer Design IS
Role of the compiler
• Compilers have different optimization levels: -O1 .. -On
• Optimization can have a big impact on instruction count and, hence, on performance
107. Computer Design IS
Role of the compiler
[Figure: impact of the optimization levels]
108. Computer Design IS
Role of the compiler
• In some cases, though, optimization may be counterproductive!
• This happens because there might be conflicts between local and global optimization tasks
• Example (the SAME EXPRESSION appears twice):
  a = sqrt(x*x + y*y) + f()… ;
  b = sqrt(x*x + y*y) + g()…;
• Idea:
  tmp = sqrt(x*x + y*y);
  a = tmp + f() …;
  b = tmp + g() …;
109. Computer Design IS
Role of the compiler
• Effective, but only if tmp can be stored in a register
• No register -> tmp lives in memory -> cache misses … bad performance
• The problem is:
  When the compiler performs code transformations like the one in the example, it does not know whether a register will actually be available
  This will only become clear later (at the global optimization level)
• (This is the phase ordering problem)
110. Computer Design IS
Role of the compiler
• The key resource is the register file
• "Intelligent" register allocation techniques are a must
• Current solution: graph coloring (on a graph whose nodes are the candidates for allocation to a register)
• Graph coloring is NP-complete, though effective heuristic algorithms exist
111. Computer Design IS
Role of the compiler
• A special class of compilers: algorithm-driven software generation
  The FFTW approach: a software generation system based on symbolic computation
  Written in Objective CamL
  A sort of FFT compiler that generates optimal C code via symbolic computing
  Possible future steps (project works, theses…): extending the approach down to code generation for, e.g., the TI 'C67 DSP and other VLIW CPUs
112. Exam of 16 Jan 2002
• A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating point instructions)
• The three classes are responsible for r1 = 60%, r2 = 30%, and r3 = 10% of the overall execution time, respectively
• You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3
• The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10
• Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice
113. Solution
• r1 = 60%, r2 = 30%, r3 = 10%
  e1 = 2, e2 = 3, e3 = 10
• By Amdahl's law:
  S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
• s1 = 1.42857
  s2 = 1.25
  s3 = 1.0989
• s1 is the largest speedup, so choose O1: the modest enhancement applied to the dominant fraction of the execution time beats the large enhancement applied to a small fraction
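The three speedups can be verified by evaluating Amdahl's law directly. A minimal sketch (the function name is ours):

```c
/* Amdahl's law: overall speedup when a fraction r of the execution
   time is enhanced by a factor e.
   S = 1 / ((1 - r) + r / e)                                        */
double speedup(double r, double e)
{
    return 1.0 / ((1.0 - r) + r / e);
}
```

Evaluating speedup(0.6, 2), speedup(0.3, 3), and speedup(0.1, 10) reproduces s1 = 1.42857, s2 = 1.25, and s3 = 1.0989, confirming that O1 wins despite having the smallest enhancement factor.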