Advanced Computer Architectures – HB49 – Part 2.1
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio, KULeuven 2003

Course contents
• Basic Concepts
• Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
Computer design
• First part of the course: a survey of computer history
• Key aspect of this history:
 – In the last 60 years computers have experienced a formidable growth in performance and a huge decrease in cost
 – A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties
Computer design
• How was this possible?
• Through
 – Advances in computer technology
 – Advances in computer design
Computer design
• The tasks of a computer designer:
 – Determine the key attributes of a new machine
 – E.g., design a machine that maximizes performance while keeping costs under control
• Aspects:
 – Instruction set design
 – Functional organization
 – Logic design
 – Implementation
(to be defined later)
Significant improvements
• First 25 years:
 – From both technology and design
• From the Seventies:
 – Mainly from IC technology
 – Main concern = compatibility with the past (to save investments)
 – Compatibility at the machine-language level
 – No room for design improvements
 – 20-30% per year for mainframes and minis
• Late Seventies: advent of the µP
 – Higher rate (35% per year)
Significant improvements: the µP
• The µP
 – Mass-produced ⇒ lower costs
 – Significant changes in the computer marketplace
 – Higher-level language compatibility (no need for object-code compatibility)
 – Availability of standard, vendor-independent OSs (fewer risks and costs in producing a new architecture)
⇒ this made it possible to develop a new concept: RISC architectures
Significant improvements: RISC
RISC architectures
 – Designed in the Eighties, on the market ca. '85
 – Since then, a 50% improvement per year

[Figure: performance of workstations, 1987-1995 (Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc). The growth rate rises from 1.35X/yr to 1.54X/yr.]
Technology Trends

[Figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-2000, log scale 0.1-1000. Microprocessors show the steepest growth.]
Computer design
• The µP allowed a 50% yearly performance increase. How was that possible?
 – Enhanced capability for users
 – IBM Power-2 (1993) ≈ Cray Y-MP (1988)
 – The fastest supercomputer in 1988 had approximately the same performance as the fastest 1993 workstation
 – Price: 1/10
 – Computers became more and more µP-based
 – Mainframes were disappearing or becoming based on off-the-shelf µPs
Computer design
• Big consequence
 – No more market urge for object-code compatibility
 ⇒ Freedom from compatibility with old designs
 ⇒ Renaissance in computer design
 ⇒ Again, significant improvements from both technology and design
 ⇒ 50% performance growth per year!
Computer design
• The highest-performance µP in '95 is mainly a result of design improvements (by a factor of 1-to-5)
• In this section we focus on the design techniques that made this possible
Performance
• What aspects must be taken into account in order to reach higher performance?
• How to choose between different alternatives?
 – Amdahl's law
 – Quantitative assessment
Amdahl's law
• Speed-up:

S = (execution time for the entire task without the enhancement) / (execution time for the entire task using the enhancement when possible)

• Amdahl's law: the speed-up depends on the fraction of time that may be affected by the enhancement
Amdahl's law
Let us call F the fraction of time affected by the enhancement.
For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time.
What do we gain by introducing the enhancement?

Exec-timeNEW = Exec-timeOLD × ((1 − F) + F / SENH)

where SENH is the speedup in the enhanced mode. Hence,

S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − F) + F / SENH)
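The formula above can be sketched directly in Python (a minimal sketch; the function name `speedup` is illustrative):

```python
def speedup(f: float, s_enh: float) -> float:
    """Amdahl's law: S = 1 / ((1 - F) + F / S_ENH).

    f     -- fraction of execution time affected by the enhancement
    s_enh -- speedup in enhanced mode
    """
    return 1.0 / ((1.0 - f) + f / s_enh)

# With F = 0.40 and a 10x enhancement:
print(round(speedup(0.40, 10), 2))  # → 1.56
```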
Amdahl's law

[Figure: overall speedup S versus SENH for F = 40% — SENH grows, but SOVER does not keep growing with it.]
Amdahl's law
• Law of diminishing returns
 – the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

lim(SENH → ∞) S = lim(SENH → ∞) 1 / ((1 − F) + F / SENH) = 1 / (1 − F) = SMAX
Amdahl's law

To reach a maximum speedup of 3, F must be at least 66% (F = 1 − 1/3 ≈ 0.67).
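The diminishing-returns limit can be inverted to find the F required for a target maximum speedup (a sketch; function names are illustrative):

```python
def max_speedup(f: float) -> float:
    """S_MAX = 1 / (1 - F): the speedup limit as S_ENH grows without bound."""
    return 1.0 / (1.0 - f)

def required_fraction(s_max: float) -> float:
    """Invert S_MAX = 1 / (1 - F):  F = 1 - 1 / S_MAX."""
    return 1.0 - 1.0 / s_max

# To reach a maximum speedup of 3:
print(round(required_fraction(3), 3))  # → 0.667 (about 66%)
```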
Amdahl's law…
• "…can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance.
• The goal, clearly, is to spend resources proportional to where time is spent."
Amdahl's law
• Example 1 (p. 30 P&H)
 – A method allows an improvement by a factor of 10
 – It can be exploited for 40% of the time

speedupoverall = 1 / ((1 − fractionenhanced) + fractionenhanced / speedupenhanced)
             = 1 / ((1 − 0.4) + 0.4/10) ≈ 1.56
Amdahl's law
• Example 2 (p. 31 P&H)
 – 50% of the instructions of a given benchmark are floating-point instructions
 – FPSQR applies to 20% of the same benchmark
 – Alternative 1: extra hardware makes FPSQR 10 times faster
 – Alternative 2: all FP instructions go 2 times faster

speedupFPSQR = 1 / ((1 − 0.2) + 0.2/10) ≈ 1.22
speedupFP = 1 / ((1 − 0.5) + 0.5/2.0) ≈ 1.33
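The two alternatives can be compared numerically with the same formula (a sketch using the slide's figures):

```python
def speedup(f: float, s_enh: float) -> float:
    """Amdahl's law: S = 1 / ((1 - F) + F / S_ENH)."""
    return 1.0 / ((1.0 - f) + f / s_enh)

# Alternative 1: FPSQR (20% of the time) made 10x faster
# Alternative 2: all FP instructions (50% of the time) made 2x faster
alt1 = speedup(0.2, 10)
alt2 = speedup(0.5, 2.0)
print(round(alt1, 2), round(alt2, 2))  # → 1.22 1.33
```

The smaller per-operation improvement wins because it applies to a larger fraction of the execution time.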
Quantitative assessment
• CPUTIME(p) = time spent by the CPU to run program p
• Clock cycle time = tcc, clock rate = 1/tcc
• CPUTIME(p) = #clock cycles × tcc = #clock cycles / clock rate
• E.g.: clock cycle time = 2 ns ⇒ clock rate = 500 MHz
• #CC(p) = number of clock cycles spent in the execution of p
Quantitative assessment
• Instruction count
• IC(c,p) = number of instructions that CPU c executed during the run of program p
• Often abbreviated to IC(p)
Quantitative assessment
• Clock cycles per instruction
• CPI(p) = #CC(p) / IC(p) = average number of clock cycles needed to execute one instruction of p
Quantitative assessment
• CPUTIME(p) =
  = #clock cycles × clock cycle time
  = #CC(p) × tcc
  = IC(p) × CPI(p) × tcc
  = IC(p) × CPI(p) / clock rate
⇒ We can influence the performance of a given program p by optimizing the three key variables IC(p), CPI(p), and clock rate.
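The three-variable product can be sketched as a one-liner (illustrative names and numbers, not data from the course):

```python
def cpu_time(ic: int, cpi: float, clock_rate_hz: float) -> float:
    """CPUTIME(p) = IC(p) x CPI(p) / clock rate."""
    return ic * cpi / clock_rate_hz

# E.g. 10^9 instructions at CPI 1.2 on a 500 MHz machine:
print(round(cpu_time(1_000_000_000, 1.2, 500e6), 2))  # → 2.4 (seconds)
```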

Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
 Clock rate (the higher, the better)
 Clock cycles per instructions (the lesser, the
better)
 Instruction count (the lesser, the better)

Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
 Clock rate (HW technology & organization)
 Clock cycles per instruction
(organization & instruction set architecture)
 Instruction count
(instruction set architecture &
compiler technology)

• Note: technologies are not independent of
each other!
Quantitative assessment

CPU time = Seconds / Program
         = Instructions / Program × Cycles / Instruction × Seconds / Cycle

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
Quantitative assessment
• Decades-long challenge: optimizing

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• This is a function of p!
• The choice of benchmarks is important
Quantitative assessment
• Which methods to use?

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• Method 1: increasing the clock rate (note: independent of p!)
• Methods 2: those trying to decrease IC(p)
• Methods 3: those trying to decrease CPI(p)
• Each method is equally important
• Some methods are more effective than others
Quantitative assessment: how to calculate CPI?

CPI = #CC / Instr. count = Σ(i=1..n) (CPIi × ICi) / Instr. count = Σ(i=1..n) CPIi × (ICi / Instr. count)

ICi = number of times instruction i is executed by p
CPIi = average number of clock cycles for instruction i
CPIi needs to be measured, not just read from a table in the reference manual!
That is, we need to take the memory access time into account! (Cache misses do count… a lot)
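The weighted-average CPI formula above can be sketched as follows (the `(count, cycles)` profile is an illustrative measurement, not data from the slides):

```python
def average_cpi(profile: list[tuple[int, float]]) -> float:
    """CPI = sum(CPI_i * IC_i) / total instruction count.

    profile -- list of (IC_i, CPI_i) pairs, one per instruction class
    """
    total = sum(count for count, _ in profile)
    return sum(count * cycles for count, cycles in profile) / total

profile = [(800, 1.0),   # e.g. ALU ops: 1 cycle each
           (200, 2.0)]   # e.g. branches: 2 cycles each
print(average_cpi(profile))  # → 1.2
```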
Quantitative assessment
• Example 3: two alternatives for a conditional branch instruction
 – A: a CMP that sets a condition code (Z bit), followed by a JZ
 – B: a single instruction that does both CMP and JZ

Arch. A:
    LD R1, 0
L:  INC R1
    CMP R1, 5
    JZ L
    RET

Arch. B:
    LD R1, 0
L:  INC R1
    JRZ R1, 5, L
    RET

We assume that JZ and JRZ take 2 cycles; all other instructions take 1 cycle.
Quantitative assessment
• 20% of the instructions are conditional jumps (instructions such as JZ or JRZ)
• 80% are other instructions
• On A, for each conditional jump there is a CMP ⇒ on A, 20% are conditional jumps and 20% are CMPs
• 60% are other instructions
• Because of the extra complexity in B, the clock of A is faster (CTB = 1.25 CTA)
Quantitative assessment

• CPIA = Σi (ICi × cyclesi) / ICA
       = fraction of c.jumps × cyclesBR + fraction of others × 1
       = 20% × 2 + 80% × 1 = 1.2
• CPUA = ICA × CPIA × CTA = ICA × 1.2 × CTA

• CPIB = Σi (ICi × cyclesi) / ICB (computed on the next slide)
Quantitative assessment
• Now, on B:
 – One spares 20% of the instructions (the extra CMPs), hence:
   nBRB = 20 / (100 − 20) = 0.25 (25%)
 – Furthermore, ICB = 0.8 ICA

• Hence CPIB = 0.25 × 2 + 0.75 × 1 = 1.25

• CPUB = ICB × CPIB × CTB = 0.8 ICA × 1.25 × 1.25 CTA
So CPUB = 1.25 × ICA × CTA
   CPUA = 1.2 × ICA × CTA
So A is faster (for which p?)
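The comparison above can be re-run numerically (IC_A and CT_A are normalized to 1; the 1.25 clock-time factor for B is from the slides):

```python
# Architecture A: 20% c.jumps (2 cycles), 20% CMPs and 60% others (1 cycle)
ic_a, ct_a = 1.0, 1.0
cpi_a = 0.20 * 2 + 0.80 * 1             # = 1.2
cpu_a = ic_a * cpi_a * ct_a

# Architecture B: the CMPs disappear, so IC shrinks by 20% and the
# c.jump fraction rises to 25%, but the clock is 25% slower.
ic_b = 0.8 * ic_a
cpi_b = 0.25 * 2 + 0.75 * 1             # = 1.25
cpu_b = ic_b * cpi_b * (1.25 * ct_a)

print(round(cpu_a, 2), round(cpu_b, 2))  # → 1.2 1.25  (A is faster)
```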
Performance
• A straightforward enhancement is to increase the clock rate
• The entire program benefits
• It is also independent of the particular program
• (Unlike enhancements that depend on the efficiency of the compiler, etc.)
Clock Frequency Growth Rate

[Figure: clock rate (MHz, log scale 0.1-1,000) vs. year, 1970-2005 — i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000.]
• 30% per year
Transistor Count Growth Rate

[Figure: transistors per chip (log scale 1,000-100,000,000) vs. year, 1970-2005 — i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
• 100 million transistors on chip in early year 2000.
• Transistor count grows much faster than clock rate.
Performance
• Another important factor for performance is given by
 – Memory accesses
 – I/O (disk accesses)
Memory
• Semiconductor DRAM technology
 – Density: increase of 60% per year (quadruples in 3 years)
 – Cycle time: much smaller improvement than this!

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years

Speed increases of memory and I/O have not kept pace with processor speed increases.
Memory size

[Figure: DRAM chip capacity (bits, log scale 1,000-1,000,000,000) vs. year, 1970-2000.]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
Basic definitions
1. Bandwidth: the rate at which data can be
transferred. Bandwidth is typically measured in
bytes per second.
2. Block size: the amount of data transferred per
request. Block size is typically measured in bytes.
3. Latency: the time between making a request (e.g.
to read or write a block of data) and completing the
request. Latency is typically measured in seconds.
4. Throughput: The number of requests that can be
completed per unit time. Throughput is typically
measured in requests per second.
Memory
• DRAM: main memory of all computers
 – Commodity chip industry: no company >20% share
 – Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)
• Capacity: 4X/3 years (60%/year)
 – Moore's Law
• MB/$: +25%/year
• Latency: −7%/year; Bandwidth: +20%/year (so far)

SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips (measured in bytes vs. bits); 32-bit path to memory
DIMM = dual in-line memory module; 64-bit path to memory
source: www.pricewatch.com, 5/21/98
Processor Limit: DRAM Gap

[Figure: performance (log scale 1-1,000) vs. year, 1980-2000. µProc performance ("Moore's Law") grows at 60%/yr, DRAM at 7%/yr; the processor-memory performance gap grows 50% per year.]
Memory Summary
• DRAM:
 – rapid improvements in capacity, MB/$, bandwidth;
 – slow improvement in latency
⇒ The processor-memory interface is a bottleneck to delivered bandwidth
Disk Components
Disk Components: Platters
• Platters: the recording surfaces.
i. 1 to 8 inches in diameter (2.5 to 20 cm).
ii. Stacked on a spindle: typical disks have 1-12
platters.
iii. Data can be stored on one or both surfaces.
iv. Spindle and platters rotate at 3600 - 10000 rpm
(60-165 Hz).
v. Recording density depends on applying a
magnetic film with few defects.
vi. Rotation rate limited by bearings and power
consumption.
Disk Components: Heads
• Heads: write and read data to and from the platters.
i. Data is stored as the presence or absence of magnetization.
ii. The head "floats" on an air film that rotates with the disk. The Bernoulli effect pulls the head toward the disk but not into it. A dust particle can cause a "head crash", where the disk surface is scratched and any data on it is lost.
iii. Disk heads are manufactured using thin-film technology. Advancing technology allows smaller heads and therefore more closely spaced tracks and bits.
Disk Components: Actuators
• Actuators: move the heads radially over the platters.
i. The actuator arm needs to be light to move quickly.
ii. The actuator arm needs to be stiff to prevent flexing.
iii. Smaller platters allow shorter arms: therefore lighter and stiffer.
iv. Actuators are limited by
 – the power of the actuator motor and
 – the weight and strength of the actuator components
Disks: Data Layout
• Each surface consists of concentric rings called tracks
• Each track is divided into sectors. Data is written to and read from the disk a whole sector at a time
• The set of tracks at the same relative position on each surface forms a cylinder

[Figure: a cylinder across the stacked platters.]
Three Components of Disk Access Time
1. Seek time: the time to move the heads to the desired cylinder
 – Advertised to be 8 to 12 ms. May be lower in real life
2. Rotational latency: the time for the desired sector to arrive under the head
 – 4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
3. Transfer time: the time to read the data from the disk and send it over the I/O bus to the processor
 – 2 to 12 MB per second

[Diagram: Proc → Queue → Ctrl (IOC) → Device — "Disk Access Time".]
Response time = Queue + Ctrl + Device service time
Hard Disks

Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

Order-of-magnitude times for 4 KB transfers:
 – Average seek: 8 ms or less
 – Rotate: 4.2 ms @ 7200 rpm
 – Xfer: 1 ms @ 7200 rpm
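A back-of-envelope sketch of the latency breakdown, using the slide's order-of-magnitude figures (queueing and controller time are ignored; the function name is illustrative):

```python
def disk_latency_ms(seek_ms: float, rpm: float, transfer_ms: float) -> float:
    """Seek + average rotational latency + transfer time, in milliseconds."""
    rotation_ms = 0.5 * 60_000 / rpm   # on average, half a revolution
    return seek_ms + rotation_ms + transfer_ms

# 8 ms seek, 7200 rpm, ~1 ms transfer for a 4 KB block:
print(round(disk_latency_ms(8, 7200, 1), 1))  # → 13.2 (ms)
```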
Hard Disks
• Capacity
 – +60%/year (2X / 1.5 yrs)
• Transfer rate (BW)
 – +40%/year (2X / 2.0 yrs)
• Rotation + seek time
 – −8%/year (1/2 in 10 yrs)
• MB/$
 – >60%/year (2X / <1.5 yrs)

Latency = Queuing Time + Controller Time + (Seek Time + Rotation Time, per access) + (Size / Bandwidth, per byte)

source: Ed Grochowski, 1996, "IBM leadership in disk drive technology"; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm
Hard disks

1973: 1.7 Mbit/sq. in, 140 MBytes
1979: 7.7 Mbit/sq. in, 2,300 MBytes
Hard Disks

[Figure: areal density (Mbit/sq. in, log scale 1-10,000) vs. year, 1970-2000.]

1989: 63 Mbit/sq. in, 60,000 MBytes
1997: 1450 Mbit/sq. in, 1600 MBytes
1997: 3090 Mbit/sq. in, 8100 MBytes
Hard Disks
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read the whole disk:

Year   Sequentially   Randomly
1990   4 minutes      6 hours
2000   12 minutes     1 week
Memory/Disk Summary
• Memory:
 – DRAM: rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency
• Disk:
 – Continued advance in capacity, cost/bit, bandwidth; slow improvement in seek, rotation
• Huge gap between the CPU and external memories
• How to address this problem?
• Classical way: memory hierarchies
Memory hierarchies
• Axiom of the HW designer: smaller is faster
 – Larger memories => larger signal delay
 – More logic levels are required to decode addresses
 – In a smaller memory the designer can use more power per cell => shorter access times
• Crucial features for performance
 – Huge bandwidth (in MB/sec)
 – Short access times
• Principle of locality
 – The data most recently used is very likely to be accessed again in the near future (temporal locality)
 – Memory cells close to the most recently used one are likely to be accessed in the near future (spatial locality)

• Combining the above with Amdahl's law, the "best" enhancement is using hierarchies of memories
Typical memory hierarchy ('95)

CPU registers ↔ cache ↔ [memory bus] ↔ memory ↔ [I/O bus] ↔ I/O devices

        Registers   Cache   Memory   I/O devices
Size:   200 B       64 KB   32 MB    2 GB
Speed:  5 ns        10 ns   100 ns   5 ms
Memory hierarchies

[Diagram: the memory hierarchy and its design issues — L1 cache, L2 cache, DRAM, input/output and storage (disks, WORM, tape, RAID, emerging technologies); interleaving and bus protocols; coherence, bandwidth, latency; VLSI; instruction set architecture (addressing, protection, exception handling); pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP — pipelining and instruction-level parallelism.]
Memory hierarchies
• Registers: the smallest and fastest memory
• Size: less than 1 KB
• Access time: 2-5 ns
• Bandwidth: 4,000-32,000 MB/sec
• Managed by the compiler (or the assembly programmer)
 – register int a;
• Special purpose vs. general purpose
• Monolithic or double-shaped
 – Rx = Rl + Rh
• Backed in cache
• Implemented via custom memory with multiple ports
Memory hierarchies
• Cache = small, fast memory located close to the CPU
• The cache holds the most recently accessed code or data
 – Managed by HW
 – No way to say "put these data in cache" at the SW level
 – New research: cache-conscious data structures
• Size: less than 4 MB
• Access time: 3-10 ns
• Bandwidth: 800-5,000 MB/sec
• Backed in main memory
• Implemented with (on- or off-chip) CMOS SRAM
Memory hierarchies
• Cache terminology: cache hit, cache miss, cache block
 – Cache hit: the CPU has been able to find the requested data in the cache
 – Cache miss: ¬ cache hit
 – Cache block: the fixed-size buffer used to load a portion of memory into the cache
• A cache miss blocks the CPU until the corresponding memory block gets cached
Memory hierarchies
• Virtual memory: same principles as behind the use of the cache, but implemented between main memory and disk storage
• At any point in time, not all the data referenced by p needs to be in main memory
• The address space is partitioned into fixed-size blocks: pages
• A page is either in memory or on disk
• When the CPU references an item within a page:
  if ( Check-if-in-cache() == CACHE_MISS )
      if ( Check-if-in-memory() == MEM_MISS )
          PageFault(); // loads the page into memory
 – The CPU doesn't stall – it switches to other tasks
Cache performance
• Example: speedup using a cache
 – Cache 10 times faster than main memory
 – Cache is used in 90% of the cases

speedup = 1 / ((1 − fractionenhanced) + fractionenhanced / speedupenhanced)
        = 1 / ((1 − 0.9) + 0.9/10) ≈ 5.3
Cache performance
CPUtime = (CPU clock cycles + memory stall cycles) × clock cycle time
Memory stall cycles = #(misses) × miss penalty
 = IC × #(misses per instruction) × miss penalty
 = IC × #(memory references per instr.) × miss rate × miss penalty
Cache performance
• Example (P&H, p. 43)
 – A computer has CPI = 2 when all data is in cache
 – Memory access is only required by load and store instructions (40% of the total)
 – Miss penalty = 25 clock cycles
 – Cache miss frequency = 2%
? How fast would the machine be if no cache miss ever occurred?

CPUall-hit = (CPU clock cycles + memory stall cycles) × clock cycle time
 = (IC × CPI + 0) × clock cycle time
 = IC × 2 × clock cycle time
Cache performance
? How fast is the machine when cache misses do occur?

1. Compute the memory stall cycles (msc):
msc = IC × memory references per instruction × miss rate × miss penalty
 = IC × (1 + 0.4) × 0.02 × 25    (1 instruction access + 0.4 data accesses)
 = IC × 0.7

2. Compute total performance:
CPUcache = (CPU clock cycles + msc) × clock cycle time
 = (IC × 2 + IC × 0.7) × clock cycle time
 = 2.7 × IC × clock cycle time
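The same example as code, in units of IC × clock cycle time (a sketch using the slide's figures):

```python
# CPI 2 when all accesses hit; 40% of instructions are loads/stores;
# 2% miss rate; 25-cycle miss penalty.
cpi = 2.0
mem_refs_per_instr = 1 + 0.4            # instruction fetch + data accesses
miss_rate, miss_penalty = 0.02, 25

msc_per_instr = mem_refs_per_instr * miss_rate * miss_penalty  # 0.7
cpu_all_hit = cpi                        # 2.0
cpu_with_misses = cpi + msc_per_instr    # 2.7

print(round(cpu_with_misses / cpu_all_hit, 2))  # → 1.35 (misses cost 35%)
```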
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
Computer design
• Instruction-set architecture:
 – The architecture of the machine level
 – The boundary between SW and HW
• Organization:
 – High-level aspects: memory system, bus structure, internal CPU design
• Hardware:
 – The specifics of a machine: detailed logic design, packaging technology…
• Architecture = I + O + H
Instruction Sets
• IS = instruction set = the architecture of the machine language
• IS classification
• Role of the compilers
• DLX
Computer Design → IS
• IS classification
• Role of the compilers
• DLX
Computer Design → IS → IS Classification
• Key: the type of internal storage in the CPU
• Three main classes:
 – Stack architectures
 – Accumulator architectures
 – General-purpose register architectures

Computer Design → IS → IS Classification → Stack A.
• Stack architecture:
• Operands are implicitly referred to:
• the top two items on the system stack
• Example: C = A + B
 1. PUSH A      ; stack: A
 2. PUSH B      ; stack: B, A
 3. ADD         ; ADD = PUSH (POP + POP) = PUSH (B + A)
 4. POP C       ; C = top of stack = A + B

An example: the ARIEL virtual machine (Part 1, Slides 91 –)
Computer Design  IS 
IS Classification  Accumulator A.
• Accumulator architectures
• A special register (the accumulator) plays the role of an implicit argument
• Example: C = A + B
 1. LOAD A ; let Acml = A
 2. ADD B ; let Acml = Acml + B
 3. STORE C ; let C = Acml
Computer Design  IS 
IS Classification  Register A.
• General-purpose register architecture
• Explicit operands only
• Either registers or memory locations
• Two flavors:
 Register-memory architectures (RMA)
 Register-register architectures (RRA)
• Example: C = A + B
 RMA: Load R1, A
      Add R1, B ; in C: R1 += B
      Store C, R1
 RRA: Load R1, A
      Load R2, B
      Add R3, R1, R2
      Store C, R3
Computer Design  IS 
IS Classification  RRA
• Some old machines used stack or accumulator architectures
 For instance, the T800 and the 6502/6510
• Today the de facto standard is RRA
 Regs are fast
 Regs are easier to use (for compiler writers)
 They do not require dealing with associativity issues
 Stacks do!
 Regs can hold variables
   register int I;
   for (I = 0; I < 1000000; I++)
   { do_stgh(I); … }
 Using regs you don’t need a memory address
Computer Design  IS 
IS Classification  Register A.
• RRA: no memory operands
 All instructions are similar in size  they take a similar number of clocks to execute (a very useful property… see later)
 No side effects
 Higher instruction count
• RMA: one memory operand
 One load can be spared
 A register operand is destroyed ( R += B )
 Clocks per instruction vary with operand location
• Memory-memory:
 Compact
 Large variation of work per instruction
 Large variation in instruction size
Computer Design  IS 
Memory addressing
• How is memory organized?
• What does it mean, e.g., to read memory at address 512?
• What do we read?
 Bytes, half-words, words, double words
• How are consecutive bytes stored in a word? (Assumption: a word is 4 bytes)
 Little endian: &word = &LSB
 Big endian: &word = &MSB
 XDR routines are needed to exchange data
(&word  address of word)
A memory model for didactics
• Memory can be thought of as a finite, long array of cells, each of size 1 byte

 0 1 2 3 4 5 6 7 …

• Each cell has a label, called its address, and a content, i.e. the byte stored into it
• Think of a chest of drawers, with a label on each drawer and possibly something in it
A memory model for didactics
[Figure: a chest of drawers; each drawer carries an address label (…, 4, 3, 2, 1) and holds its content]
A memory model for didactics
• The character * has a special meaning
• It refers to the contents of a cell
• For instance:
 *(1)  inspecting the contents of a cell (we open a drawer and see what’s in it)
 *(1)  writing new contents into a cell (we open a drawer and change its contents)

A memory model for didactics
• Memory is (often) byte-addressable, though it is organized into small groups of bytes: the machine word
• A common size for the machine word is 4 bytes (32 bits)
• Two possible organizations for the bytes in a word
 Little endian
 Big endian

Little endian versus Big endian
[Figure: two 8-byte memories holding two 4-byte words each. Little endian (Intel): within each word the LSB sits at the lowest address and the MSB at the highest (LSB at addresses 0 and 4). Big endian (Motorola): the MSB sits at the lowest address and the LSB at the highest (MSB at addresses 0 and 4).]
Little endian versus Big endian
• Problem: communication between the two
[Figure: the same bytes in memory are interpreted differently by the two organizations. The word whose bytes at increasing addresses are 00 00 00 01 is read as 16777216 by a little-endian machine but as 1 by a big-endian one; the word 10 00 00 00 is read as 16 little-endian but as 268435456 big-endian.]

Computer Design  IS 
Memory addressing
• Alignment is mandatory on some machines
 Object O; int t = sizeof(O);
 ALIGNED(O) means &O modulo t is 0 (“access to O is aligned”)
 For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4
 Alignment is sometimes necessary because it prevents hardware complications
 Alignment implies faster access
Computer Design  IS 
Memory addressing
• Addressing modes: ways to specify the address of an object in memory
• An addressing mode can specify
 A constant
 A register
 A memory location

In what follows,
 A += B means A = A + B
 *(x) means “return the contents of memory at address x”
 x++ means “at the end, let x = x + 1”
 --x means “at the beginning, let x = x – 1”
 Rx means “register x”
Computer Design  IS 
Memory addressing

Mode           | Example              | Meaning
Register       | Add R4, R3           | R4 += R3
Immediate      | Add R4, #3           | R4 += 3
Displacement   | Add R4, 100(R1)      | R4 += *(100 + R1)
Indirect       | Add R4, (R1)         | R4 += *(R1)
Indexed        | Add R4, (R1 + R2)    | R4 += *(R1 + R2)
Absolute       | Add R4, (100)        | R4 += *(100)
Deferred       | Add R4, @(R3)        | R4 += *(*(R3))
Autoincrement  | Add R4, (R3)+        | Indirect, then R3++
Autodecrement  | Add R4, -(R2)        | R2--, then indirect
Scaled         | Add R4, 100(R2)[R3]  | R4 += *(100 + R2 + R3 * d)

d = size of the addressed data (1, 2, 4, 8, or 16)

Computer Design  IS 
Memory addressing
• Addressing modes can reduce IC
• Complex addressing modes increase the complexity of the hardware  they can increase CPI
• Displacement, immediate, and deferred represent between 75% and 99% of the addressing modes used (experiments done with TeX, spice, and gcc)

• IC(p) = number of instructions that the CPU executed during the activity of program p
• CPI(p) = clock cycles per instruction = #CC(p) / IC(p), the average number of clock cycles needed to execute one instruction of p
Computer Design  IS 
Operations
• Arithmetical and logical (add, and, sub...)
• Data transfer (move, store)
• Control (br, jmp, call, ret, iret…)
• System (virtual memory mngt…)
• Floating point (add, mul, …)
• Decimal (decimal add, decimal mul…)
• String (str move, str cmp, str search)
• Graphics (pixel operations)

• Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed (see Fig. 2.11, P&H p. 81)

Computer Design  IS 
Operations
• Control Flow Instructions
 Branch (conditional change)
 Jump (unconditional change)
 Procedure calls
 Procedure returns

• Most of the comparisons in conditional
branches are simple “==“, “!=“ with 0!
• In some cases, the address to go to
is only known at run-time
 “Return” uses a stack
 Switch statements
 Dynamic libraries

Computer Design  IS 
Operands
• When we say, e.g.,
“Add R1, #5”
do we work with bytes? Half-words?
Words?
• How do we specify the type of the
operand?
1. Classical method: the type of operand is
part of the opcode
• Add family is coded as ffff…fffvv
where f are fixed bits and v are bits
that specify the type

Computer Design  IS 
Operands and types
• Example: Add family = 10110101000100vv
• 1011010100010000 = Add float words
 1011010100010001 = Add words
 1011010100010010 = Add half-words
 1011010100010011 = Add bytes

• Old fashioned method: operand = data + tag
• Tag describes a type
• Tag is interpreted by HW
• Operation is chosen accordingly

Computer Design  IS 
Operands and types
• Which types to support?
• Old fashioned solution: all (bytes, semi-words, words, f.p., double words, double precision f.p., …)
• Current trend: only operations on items greater than or equal to 32 bits
• On the DEC Alpha one needs multiple instructions to access objects smaller than 32 bits

Computer Design  IS 
Operands and types
• Floating point numbers: IEEE standard 754
• In the early ’80s, each manufacturer had its own f.p. representation
• Sometimes string operations are available (strcmp, strcpy…)
• Sometimes BCD is used to code numbers
 Four bits are used to code a decimal digit
 A byte codes two decimal digits
 Functions for “packing” and “unpacking” are required
 It is unclear whether this will stay in the future

Computer Design  IS
• IS Classification
Role of the compilers
• DLX

Computer Design  IS 
Role of the compiler
• In the past, the role of Assembly language
was crucial
• Architectural decisions aimed at easing
assembly language programming
• Now, the user interface is a high level
language (C, C++, Java…)
• The user interfaces the machine via the
HLL, though the machine actually
executes some lower level code
• This lower level code is produced by a
compiler
 The role of the compiler is fundamental
 The IS architecture needs to take the
compiler into strong account

Computer Design  IS 
Role of the compiler
• Goals of the compiler writer
 Correctness
 Performance
 … fast compilation, debugging support, …
• Strategy for writing a compiler
 Use a number of “passes”: from high-level structures down to lower levels, until machine level
 This way complexity is decomposed into smaller blocks
 Optimizing becomes more difficult
Computer Design  IS 
Role of the compiler
• Compilation proceeds in passes; dependencies shift from D(language) at the front to D(machine) at the back:
 Front-end: language  common intermediate form
 HL Opt: loop transformations, function inlining…
 Global Opt: register allocation…
 Code generator: instruction selection, D(machine) optimizations

Computer Design  IS 
Role of the compiler
• HL optimizations: source-level optimizations (code  code’)
• Local optimizations: basic-block optimizations
• Global optimizations: loop optimizations and basic-block optimizations
• Machine-dependent optimizations: using low-level architectural knowledge

• Basic block = a straight-line code fragment
Computer Design  IS 
Role of the compiler
• Compilers have different optimization levels
 -O1 .. -On
• Optimization can have a big impact on instruction count  on performance


Computer Design  IS 
Role of the compiler
• In some cases, though, optimization may be counterproductive!
• This happens because there might be conflicts between local and global optimization tasks
• Example (the same expression occurs twice):
 a = sqrt(x*x + y*y) + f()… ;
 b = sqrt(x*x + y*y) + g()…;
• Idea:
 tmp = sqrt(x*x + y*y);
 a = tmp + f() …;
 b = tmp + g() …;

Computer Design  IS 
Role of the compiler
• Effective, but only if tmp can be stored in a register
• No register  in memory  cache misses  … bad performance
• The problem is:
 When the compiler performs code transformations like the one in the example, it does not yet know whether a register will actually be available
 This only becomes clear later (at the global optimization level)
• (The phase-ordering problem)

Computer Design  IS 
Role of the compiler
• Key resource is the register file
• “Intelligent” register allocation
techniques are a must
• Current solution: graph coloring (graph
with possible candidates for allocation to
a register)
• NP-complete, though effective heuristic
algorithms exist

Computer Design  IS 
Role of the compiler
• A special class of compilers – algorithm-driven software generation
 FFTW approach: a software generation system based on symbolic computation
 Objective CamL
 A sort of FFT compiler that generates optimal C code via symbolic computing
 Possible future steps (project works, theses…): extending the approach down to code generation for, e.g., the TI ‘C67 DSP and other VLIW CPUs

Exam of 16 Jan 2002
• A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating point instructions)
• The three classes are responsible for r1 = 60%, r2 = 30%, and r3 = 10% of the overall execution time, respectively
• You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3
• The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10
• Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice
Solution
• r1 = 60%, r2 = 30%, r3 = 10%
 e1 = 2, e2 = 3, e3 = 10
• Amdahl’s law:
 S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
• s1 = 1.42857
 s2 = 1.25
 s3 = 1.0989
• s1 is the largest speedup  choose O1

Contenu connexe

Similaire à Advanced Computer Architectures – Part 2.1

Similaire à Advanced Computer Architectures – Part 2.1 (20)

“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
 
L07_performance and cost in advanced hardware- computer architecture.pptx
L07_performance and cost in advanced hardware- computer architecture.pptxL07_performance and cost in advanced hardware- computer architecture.pptx
L07_performance and cost in advanced hardware- computer architecture.pptx
 
ERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdfERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdf
 
Unit i-introduction
Unit i-introductionUnit i-introduction
Unit i-introduction
 
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- PerformanceLec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
 
L-2 (Computer Performance).ppt
L-2 (Computer Performance).pptL-2 (Computer Performance).ppt
L-2 (Computer Performance).ppt
 
Debate on RISC-CISC
Debate on RISC-CISCDebate on RISC-CISC
Debate on RISC-CISC
 
Computer architecture short note (version 8)
Computer architecture short note (version 8)Computer architecture short note (version 8)
Computer architecture short note (version 8)
 
Renesas DevCon 2010: Starting a QT Application with Minimal Boot
Renesas DevCon 2010: Starting a QT Application with Minimal BootRenesas DevCon 2010: Starting a QT Application with Minimal Boot
Renesas DevCon 2010: Starting a QT Application with Minimal Boot
 
Kiến trúc máy tính-COE 301 - Performance.ppt
Kiến trúc máy tính-COE 301 - Performance.pptKiến trúc máy tính-COE 301 - Performance.ppt
Kiến trúc máy tính-COE 301 - Performance.ppt
 
Introduction into the problems of developing parallel programs
Introduction into the problems of developing parallel programsIntroduction into the problems of developing parallel programs
Introduction into the problems of developing parallel programs
 
Pyconuk2011
Pyconuk2011Pyconuk2011
Pyconuk2011
 
MCS1_SJUSD_Resume
MCS1_SJUSD_ResumeMCS1_SJUSD_Resume
MCS1_SJUSD_Resume
 
ERTS 2008 - Using Linux for industrial projects
ERTS 2008 - Using Linux for industrial projectsERTS 2008 - Using Linux for industrial projects
ERTS 2008 - Using Linux for industrial projects
 
LatestCOCOMO model presentation for college students .pptx
LatestCOCOMO model presentation for college students .pptxLatestCOCOMO model presentation for college students .pptx
LatestCOCOMO model presentation for college students .pptx
 
Ch05
Ch05Ch05
Ch05
 
ELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot TimesELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot Times
 
Computer performance
Computer performanceComputer performance
Computer performance
 
TI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBootTI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBoot
 
Chapter_01.pptx
Chapter_01.pptxChapter_01.pptx
Chapter_01.pptx
 

Plus de Vincenzo De Florio

Considerations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniConsiderations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali Anani
Vincenzo De Florio
 

Plus de Vincenzo De Florio (20)

My little grundgestalten
My little grundgestaltenMy little grundgestalten
My little grundgestalten
 
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
 
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
 
Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...
 
On codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiencesOn codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiences
 
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
 
How Resilient Are Our Societies? Analyses, Models, Preliminary Results
How Resilient Are Our Societies?Analyses, Models, Preliminary ResultsHow Resilient Are Our Societies?Analyses, Models, Preliminary Results
How Resilient Are Our Societies? Analyses, Models, Preliminary Results
 
Advanced C Language for Engineering
Advanced C Language for EngineeringAdvanced C Language for Engineering
Advanced C Language for Engineering
 
A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...
 
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
 
A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...
 
Considerations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniConsiderations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali Anani
 
A Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and AntifragilityA Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and Antifragility
 
Community Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational ModelsCommunity Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational Models
 
On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-ResilienceOn the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
 
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
 
Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...
 
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
 
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMINGTOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
 
A Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a MultisetA Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a Multiset
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Advanced Computer Architectures – Part 2.1

  • 1. Advanced Computer Architectures – HB49 – Part 2.1 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA
  • 2. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/2 Course contents • Basic Concepts Computer Design • Computer Architectures for AI • Computer Architectures in Practice
  • 3. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/3 Computer Design Quantitative assessments • Instruction sets • Pipelining
  • 4. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/4 Computer design • First part of the course: a survey of computer history • Key aspect of this history:  In the last 60 years computers have experienced a formidable growth in performance and a huge decrease in cost  A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties
  • 5. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/5 Computer design • How this could be possible? • Through  Advances in computer technology  Advances in computer design
  • 6. © V. De Florio KULeuven 2003 Basic Concepts Computer design • The tasks of a computer designer:  Determine key attributes for a new machine Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/6  E.g., design a machine that maximizes performance while keeping costs under control  Aspects:  Instruction set design  Functional organization  Logic design  Implementation (To be defined later)
  • 7. © V. De Florio KULeuven 2003 Basic Concepts Significant improvements • First 25 years:  From both technology and design • From the Seventies: Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/7  Mainly from IC technology  Main concern = compatibility with the past (to save investments)  Compatibility at ML  No room for design improvements  20-30% per year for mainframes and minis • Late Seventies: advent of the mP  Higher rate (35% per year)
  • 8. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/8 Significant improvements: the mP • The mP Mass-produced  lower costs Significant changes in the computer marketplace Higher-level language compatibility (no need for object code compatibility) Availability of standard, vendor-independent OSs (less risk and cost in producing a new architecture) made it possible to develop a new concept: RISC architectures
  • 9. © V. De Florio KULeuven 2003 Basic Concepts Significant improvements: RISC  RISC architectures  Designed in the Eighties, on the market ca. ‘85  Since then, a 50% improvement per year Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/9 [Chart: performance, 1987–1995, of MIPS M/120, Sun-4/260, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc; growth rate 1.35X/yr at first, 1.54X/yr later]
  • 10. © V. De Florio KULeuven 2003 Technology Trends Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/10 [Chart: performance trends, 1965–2000, for supercomputers, mainframes, minicomputers, and microprocessors (log scale)]
  • 11. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/11 Computer design • The mP allowed a 50% per year performance increase. How was that possible?  Enhanced capability for users  IBM Power 2 (1993) vs. Cray Y-MP (1988)  The fastest supercomputer in 1988 has approx. the same performance as the fastest 1993 workstation  Price: 1/10  Computers became more and more mP-based  Mainframes were disappearing or becoming based on off-the-shelf mPs
  • 12. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/12 Computer design • Big consequence  No more market urge for object code compatibility  Freedom from compatibility with old designs  Renaissance in computer design  Again, significant improvements from both technology and design  50% per year performance growth!
  • 13. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/13 Computer design • The highest-performance mP in ’95 is mainly a result of design improvements (1-to-5) • In this section we focus on the design techniques that allowed this state of affairs
  • 14. © V. De Florio KULeuven 2003 Performance Computer Design • What are the aspects to be taken into account in order to reach higher performance? • How to choose between different alternatives? Computer Architectures for AI  Amdahl’s law  Quantitative assessment Basic Concepts Computer Architectures In Practice 2.1/14
  • 15. © V. De Florio KULeuven 2003 Basic Concepts Amdahl’s law • Speed-up: S = (Execution time for entire task w/o using the “enhancement”) / (Execution time for entire task using the enhancement when possible) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/15 • Amdahl’s law on speed-up: • Speed-up depends on the fraction of time that may be affected by the enhancement
  • 16. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Amdahl’s law Let us call F the fraction of time affected by the enhancement For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time What do we gain by introducing the enhancement? Exec-timeNEW = Exec-timeOLD × ((1 − F) + F/SENH) where SENH is the speedup in the enhanced mode. Hence, Computer Architectures In Practice 2.1/16 S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − F) + F/SENH)
  • 17. © V. De Florio KULeuven 2003 Amdahl’s law Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/17 [Chart: overall speedup vs. SENH for F = 40%: SENH grows, but SOVER does not]
  • 18. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Amdahl’s law • Law of diminishing returns  the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added Computer Architectures for AI Computer Architectures In Practice lim SENH→∞ S = lim SENH→∞ 1 / ((1 − F) + F/SENH) = 1 / (1 − F) = SMAX 2.1/18
  • 19. © V. De Florio KULeuven 2003 Amdahl’s law Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/19 To reach a maximum speedup = 3, F must be at least 66%
  • 20. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/20 Amdahl’s law… • “…can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. • The goal, clearly, is to spend resources proportional to where time is spent.”
  • 21. © V. De Florio KULeuven 2003 Basic Concepts Amdahl’s law • Example 1 (p.30 P&H)  Method allows an improvement by a factor 10  That can be exploited for 40% of the time Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/21 speedupoverall = 1 / ((1 − fraction enhanced) + fraction enhanced / speedupenhanced) = 1 / ((1 − 0.4) + 0.4/10) ≈ 1.56
  • 22. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Amdahl’s law Example 2 (p.31 P&H)  50% of the instructions of a given benchmark are floating point instructions  FPSQR applies to 20% of the same benchmark  Alternative 1: extra hardware: FPSQR is 10 times faster  Alternative 2: all the FP instructions go 2 times faster 2.1/22 speedupFPSQR = 1 / ((1 − 0.2) + 0.2/10) ≈ 1.22 speedupFP = 1 / ((1 − 0.5) + 0.5/2.0) ≈ 1.33
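The two worked examples above can be reproduced with a small helper. This is an illustrative sketch only; the function name `amdahl_speedup` is ours, not from the slides.

```c
#include <assert.h>
#include <math.h>

/* Overall speedup per Amdahl's law:
   f = fraction of execution time affected by the enhancement,
   s = speedup of the enhanced fraction. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For Example 1, `amdahl_speedup(0.4, 10.0)` gives 1.5625; letting `s` grow without bound reproduces the 1/(1 − F) limit of the diminishing-returns slide.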
  • 23. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/23 Quantitative assessment • CPUTIME(p) = Time spent by the CPU to run program p • Clock cycle time = tcc , clock rate = 1/tcc • CPUTIME(p) = #clock cycles × tcc = #clock cycles / clock rate • E.g.: clock cycle time = 2 ns  clock rate = 500 MHz • #CC(p) = number of clock cycles spent in the execution of p
  • 24. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/24 Quantitative assessment • Instruction count • IC(c,p) = number of instructions that CPU c executed during the activity of program p • Often, IC(p)
  • 25. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/25 Quantitative assessment • Clock cycles per instruction • CPI(p) = #CC(p) / IC(p) average number of clock cycles needed to execute one instruction of p
  • 26. © V. De Florio KULeuven 2003 Quantitative assessment Basic Concepts • CPUTIME(p) = Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/26 = #clock cycles × clock cycle time = #CC(p) × tcc = IC(p) × CPI(p) × tcc = IC(p) × CPI(p) / clock rate  We can influence the performance of a given program p by optimizing the three key variables IC(p), CPI(p), and clock rate.
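The chain of equalities above can be checked numerically. A minimal sketch, with a function name (`cpu_time`) of our own choosing:

```c
#include <assert.h>
#include <math.h>

/* CPUTIME(p) = IC(p) x CPI(p) / clock rate, in seconds. */
double cpu_time(double instruction_count, double cpi, double clock_rate_hz)
{
    return instruction_count * cpi / clock_rate_hz;
}
```

E.g., a hypothetical program of one million instructions at CPI = 2 on the 500 MHz clock of the earlier slide takes 4 ms.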
  • 27. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/27 Quantitative assessment • CPU performance is equally dependent upon three characteristics  Clock rate (the higher, the better)  Clock cycles per instruction (the lower, the better)  Instruction count (the lower, the better)
  • 28. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/28 Quantitative assessment • CPU performance is equally dependent upon three characteristics  Clock rate (HW technology & organization)  Clock cycles per instruction (organization & instruction set architecture)  Instruction count (instruction set architecture & compiler technology) • Note: technologies are not independent of each other!
  • 29. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) Computer Design Computer Architectures for AI Computer Architectures In Practice  Instruction count is affected by the compiler and the instruction set; CPI by the instruction set and the organization; cycle time (clock rate) by the organization and the technology 2.1/29
  • 30. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment • Decades-long challenge: optimizing CPUTIME(p) = IC(p) × CPI(p) / clock rate Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/30 • This is a function of p! • The choice of benchmarks is important
  • 31. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment • Which methods to use? CPUTIME(p) = IC(p) × CPI(p) / clock rate Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/31 • Method 1: increasing the clock rate (Note: independent of p!) • Method 2: those trying to decrease IC(p) • Method 3: those trying to decrease CPI(p) • Each method is equally important • Some methods are more effective
  • 32. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment: how to calculate CPI? CPI = Σi=1..n (CPIi × ICi) / Instr. Count = Σi=1..n CPIi × (ICi / Instr. Count) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/32 ICi = number of times instruction i is executed by p CPIi = average number of clock cycles for instruction i CPIi needs to be measured and not just read from a table in the Reference Manual! That is, we need to take into account the memory access time! (Cache misses do count… a lot)
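The weighted sum above is straightforward to compute. A minimal sketch (the helper name `average_cpi` is ours):

```c
#include <assert.h>
#include <math.h>

/* Average CPI as the instruction-mix-weighted sum:
   CPI = sum_i CPI_i * (IC_i / total instruction count). */
double average_cpi(const double cpi[], const double ic[], int n)
{
    double total_instructions = 0.0, total_cycles = 0.0;
    for (int i = 0; i < n; i++) {
        total_instructions += ic[i];
        total_cycles       += cpi[i] * ic[i];
    }
    return total_cycles / total_instructions;
}
```

With the branch example that follows (20% of instructions at 2 cycles, 80% at 1 cycle), this yields 1.2.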
  • 33. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Quantitative assessment • Example 3: 2 alternatives for a conditional branch instruction  A: a CMP that sets a condition code (Z bit) followed by a JZ  B: a single instruction to do CMP and JZ Arch. A Computer Architectures for AI Computer Architectures In Practice 2.1/33 LD R1, 0 L: INC R1 CMP R1, 5 JZ L RET Arch. B LD R1, 0 L: INC R1 JRZ R1, 5, L RET We assume that JZ and JRZ take 2 cycles, all the other instructions take 1 cycle
  • 34. © V. De Florio KULeuven 2003 Quantitative assessment Arch. A Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice LD R1, 0 L: INC R1 CMP R1, 5 JZ L RET LD R1, 0 L: INC R1 JRZ R1,5,L RET Arch. B • 20% of the instructions are c.jumps (instructions such as JZ or JRZ) • 80% are other instructions • On A, for each c.jump there is a CMP  on A, 20% are c.jumps and 20% are CMP’s • 60% are other instructions Because of the extra complexity in B, the clock of A is faster (CTB = 1.25 CTA) 2.1/34
  • 35. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/35 Quantitative assessment • CPIA = Σi (ICi × cyclesi) / ICA = %c.jumpsA × cyclesBR + %othersA × cyclesother = 20% × 2 + 80% × 1 = 1.2 • CPUA = ICA × CPIA × CTA = ICA × 1.2 × CTA • CPIB = Σi (ICi × cyclesi) / ICB = %c.jumpsB × cyclesBR + %othersB × cyclesother
  • 36. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Quantitative assessment • Now, on B:  One spares 20% of the instructions (the extra cmp’s), hence: nBRB = 20 / (100 – 20) = 0.25 (25%)  Furthermore, ICB = 0.8 ICA • Hence CPIB = 0.25 x 2 + 0.75 x 1 = 1.25 • CPUB = ICB x CPIB x CTB = = 0.8 ICA x 1.25 x 1.25 CTA So CPUB = 1.25 x ICA x CTA CPUA = 1.2 x ICA x CTA So A is faster 2.1/36 (for which P?)
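The comparison worked out over the last two slides condenses into two one-liners. This is a sketch using the slides' figures, normalized to ICA × CTA = 1; the helper names are ours.

```c
#include <assert.h>
#include <math.h>

/* Architecture A: full instruction count, CPI = 0.20*2 + 0.80*1 = 1.2,
   baseline cycle time. */
double cpu_a(void) { return 1.0 * (0.20 * 2 + 0.80 * 1) * 1.0;  }

/* Architecture B: 0.8x the instructions (no separate CMPs),
   CPI = 0.25*2 + 0.75*1 = 1.25, but a 1.25x longer cycle time. */
double cpu_b(void) { return 0.8 * (0.25 * 2 + 0.75 * 1) * 1.25; }
```

As on the slide, CPUA = 1.2 and CPUB = 1.25 (in units of ICA × CTA), so A is faster for this instruction mix.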
  • 37. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/38 Performance • A straightforward enhancement is given by increasing the clock rate • The entire program benefits • Also, independent of the particular program • Dependent on the efficiency of the compiler etc.
  • 38. © V. De Florio KULeuven 2003 Clock Frequency Growth Rate Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: clock rate (MHz, log scale 0.1–1,000) vs. year, 1970–2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, R10000] • 30% per year 2.1/39
  • 39. © V. De Florio KULeuven 2003 Transistor Count Growth Rate Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: transistors per chip (log scale 1,000–100,000,000) vs. year, 1970–2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000] • 100 million transistors on chip in early year 2000. • Transistor count grows much faster than clock rate 2.1/40
  • 40. © V. De Florio KULeuven 2003 Basic Concepts Performance • Another important factor for performance is given by  Memory accesses  I/O (disk accesses) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/43
  • 41. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Memory • Semiconductor DRAM technology  Density: increase of 60% per year (quadruplicates in 3 years)  Cycle time: much less improvement than this! Computer Architectures for AI Computer Architectures In Practice  Capacity / Speed: Logic 2x in 3 years / 2x in 3 years; DRAM 4x in 3 years / 1.4x in 10 years; Disk 2x in 3 years / 1.4x in 10 years  Speed increases of memory and I/O have not kept pace with processor speed increases. 2.1/44
  • 42. © V. De Florio KULeuven 2003 Memory size Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/45 [Chart: DRAM bits per chip vs. year, 1970–2000] year / size (Mb) / cycle time: 1980 0.0625 250 ns; 1983 0.25 220 ns; 1986 1 190 ns; 1989 4 165 ns; 1992 16 145 ns; 1996 64 120 ns; 2000 256 100 ns
  • 43. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/46 Basic definitions 1. Bandwidth: the rate at which data can be transferred. Bandwidth is typically measured in bytes per second. 2. Block size: the amount of data transferred per request. Block size is typically measured in bytes. 3. Latency: the time between making a request (e.g. to read or write a block of data) and completing the request. Latency is typically measured in seconds. 4. Throughput: The number of requests that can be completed per unit time. Throughput is typically measured in requests per second.
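Definitions 1 to 3 above relate in a simple way: the time to service one request is the latency to start it plus the block size divided by the bandwidth. A minimal sketch under that assumption (the function name is ours):

```c
#include <assert.h>
#include <math.h>

/* Time to complete one request: startup latency plus the time
   to move block_bytes at bw_bytes_per_s. */
double request_time(double latency_s, double block_bytes,
                    double bw_bytes_per_s)
{
    return latency_s + block_bytes / bw_bytes_per_s;
}
```

E.g., a hypothetical device with 5 ms latency and 4 MB/s bandwidth completes a 4096-byte request in about 6 ms, so latency dominates for small blocks.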
  • 44. © V. De Florio KULeuven 2003 Basic Concepts Memory • DRAM: main memory of all computers  Commodity chip industry: no company >20% share  Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM) Computer Design Computer Architectures for AI Computer Architectures In Practice • Capacity: 4X/3 years (60%/year)  Moore’s Law • MB/$: + 25%/year • Latency: – 7%/year, Bandwidth: + 20%/year (so far) SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips (measured in bytes, not bits); 32-bit path to memory. DIMM = dual in-line memory module; 64-bit path to memory. source: www.pricewatch.com, 5/21/98 2.1/47
  • 45. © V. De Florio KULeuven 2003 Processor Limit: DRAM Gap Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/48 [Chart: performance (log scale) vs. year, 1980–2000: “Moore’s Law” µProc improves 60%/yr, DRAM 7%/yr; the processor-memory performance gap grows 50%/year]
  • 46. © V. De Florio KULeuven 2003 Memory Summary Basic Concepts • DRAM: Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/49  rapid improvements in capacity, MB/$, bandwidth;  slow improvement in latency  Processor-memory interface is a bottleneck to delivered bandwidth
  • 47. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/50 Disk Components
  • 48. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/51 Disk Components: Platters • Platters: the recording surfaces. i. 1 to 8 inches in diameter (2.5 to 20 cm). ii. Stacked on a spindle: typical disks have 1-12 platters. iii. Data can be stored on one or both surfaces. iv. Spindle and platters rotate at 3600 - 10000 rpm (60-165 Hz). v. Recording density depends on applying a magnetic film with few defects. vi. Rotation rate limited by bearings and power consumption.
  • 49. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/52 Disk Components: Heads • i. Heads: write and read data to and from platters. Data stored as presence or absence of magnetization. ii. Head “floats” on air-film that rotates with the disk. Bernoulli effect pulls head toward disk but not into it. A dust particle can cause a “head crash” where the disk surface is scratched and any data on it is lost. iii. Disk heads are manufactured using thin film technology. Advancing technology allows smaller heads and therefore more closely spaced tracks and bits.
  • 50. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/53 Disk Components: Actuators • Actuators: move heads radially over the platters. i. Actuator arm needs to be light to move quickly. ii. Actuator arm needs to be stiff to prevent flexing. iii. Smaller platters allow shorter arms: therefore lighter and stiffer. iv. Actuators limited by • power of actuator motor and • weight and strength of actuator components
  • 51. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Disks: Data Layout • Each surface consists of concentric rings called tracks • Each track is divided into sectors. Data is written to and read from the disk a whole sector at a time • The set of tracks that are at the same relative position on each surface forms a cylinder Computer Architectures for AI Computer Architectures In Practice 2.1/54
  • 52. © V. De Florio KULeuven 2003 Three Components of Disk Access Time Basic Concepts 1. Seek time: the time to move the heads to the desired cylinder  Advertised to be 8 to 12 ms. May be lower in real life Computer Design 2. Rotational latency: the time for the desired sector to arrive under the head  4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM Computer Architectures for AI 3. Transfer time: the time to read the data from the disk and send it over the I/O bus to the processor  2 to 12 MB per second Computer Architectures In Practice [Diagram: a request passes through queue, controller (IOC), and device] Response time = Queue + Ctrl + Device Service time 2.1/55
  • 53. © V. De Florio KULeuven 2003 Hard Disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Disk Latency = Queueing Time + Controller time + Seek Time + Rotation Time + Xfer Time Order of magnitude times for 4K byte transfers: Average Seek: 8 ms or less Rotate: 4.2 ms @ 7200 rpm Xfer: 1 ms @ 7200 rpm 2.1/56
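The latency breakdown above (ignoring queueing and controller time) can be sketched as follows; the average rotational latency is half a revolution, which reproduces the 4.1/8.3 ms figures from the previous slide. Function names are ours.

```c
#include <assert.h>
#include <math.h>

/* Average rotational latency: half a revolution, in seconds. */
double rotational_latency(double rpm)
{
    return 0.5 * 60.0 / rpm;
}

/* Simplified disk access time: seek + rotation + transfer. */
double disk_access(double seek_s, double rpm, double xfer_s)
{
    return seek_s + rotational_latency(rpm) + xfer_s;
}
```

With an 8 ms seek, 7200 rpm, and a 1 ms transfer, the access takes roughly 13 ms, so seek and rotation dominate small transfers.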
  • 54. © V. De Florio KULeuven 2003 Hard Disks • Capacity: + 60%/year (2X / 1.5 yrs) • Transfer rate (BW): + 40%/year (2X / 2.0 yrs) • Rotation + Seek time per access: – 8%/year (1/2 in 10 yrs) • MB/$: > 60%/year (2X / <1.5 yrs) Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice (Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth per byte) source: Ed Grochowski, 1996, “IBM leadership in disk drive technology”; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm, 2.1/57
  • 55. © V. De Florio KULeuven 2003 Hard disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/58 1973: 1.7 Mbit/sq. in, 140 MBytes 1979: 7.7 Mbit/sq. in, 2,300 MBytes
  • 56. © V. De Florio KULeuven 2003 Hard Disks Areal Density Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: areal density (Mbit/sq. in, log scale 1–10,000) vs. year, 1970–2000] 1989: 63 Mbit/sq. in, 60,000 MBytes 1997: 1450 Mbit/sq. in, 1600 MBytes 1997: 3090 Mbit/sq. in, 8100 MBytes 2.1/59
  • 57. © V. De Florio KULeuven 2003 Hard Disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/60 • Continued advance in capacity (60%/yr) and bandwidth (40%/yr) • Slow improvement in seek, rotation (8%/yr) • Time to read whole disk  Year / Sequentially / Randomly: 1990 4 minutes 6 hours; 2000 12 minutes 1 week
  • 58. © V. De Florio KULeuven 2003 Memory/Disk Summary Basic Concepts • Memory:  DRAM rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/61 • Disk:  Continued advance in capacity, cost/bit, bandwidth; slow improvement in seek, rotation • Huge gap between CPU and external memories • How to address this problem? • Classical way: memory hierarchies
  • 59. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/62 Memory hierarchies • Axiom of HW designer: smaller is faster  Larger memories => larger signal delay  More levels are required to encode addresses  In a smaller memory the designer can use more power per cell => shorter access times • Crucial features for performance  Huge bandwidth (in MB/sec.)  Short access times • Principle of locality  The data most recently used is very likely to be accessed again in the near future (temporal locality)  Memory cells close to the most recently used one are likely to be accessed in the near future (spatial locality) • Combining the above with Amdahl’s law, the “best” enhancement is using hierarchies of memories
  • 60. © V. De Florio KULeuven 2003 Typical memory hierarchy (‘95) Basic Concepts [Diagram: CPU – Registers – Cache – memory bus – Memory – I/O bus – I/O devices] Size / Speed: Registers 200 B / 5 ns; Cache 64 KB / 10 ns; Memory 32 MB / 100 ns; I/O devices 2 GB / 5 ms Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/63
  • 61. © V. De Florio KULeuven 2003 Basic Concepts Memory hierarchies Computer Design Computer Architectures for AI Computer Architectures In Practice [Diagram: the hierarchy and its design issues: Input/Output and Storage (disks, WORM, tape, RAID, emerging technologies); DRAM (coherence, bandwidth, latency, interleaving, bus protocols); L2 cache, L1 cache (VLSI); Instruction Set Architecture (addressing, protection, exception handling); pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP; pipelining and instruction-level parallelism] 2.1/64
  • 62. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/65 Memory hierarchies • • • • • Registers: smallest and fastest memory Size: less than 1KB Access time: 2-5 ns Bandwidth: 4000-32000 MB/sec Managed by the compiler (or the assembly programmer)  register int a; • Special purpose vs. general purpose • Monolithic or double-shaped  Rx = Rl + Rh • Backed in cache • Implemented via custom memory with multiple ports
  • 63. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/66 Memory hierarchies • Cache = small, fast memory located close to the CPU • The cache holds the most recently accessed code or data  Managed by HW  No way to tell “put these data in cache” at SW  New research: cache-conscious data structures • • • • • Size: less than 4 MB Access time: 3-10 ns Bandwidth: 800-5000 MB/sec Backed in main memory Implemented with (on- or off-chip) CMOS SRAM
  • 64. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/67 Memory hierarchies • Cache terminology: cache hit, cache miss, cache block  Cache hit: the CPU has been able to find in cache the requested data  Cache miss: the requested data is not in the cache  Cache block: the fixed-size buffer used to load a portion of memory into the cache • A cache miss blocks the CPU until the corresponding memory block gets cached
  • 65. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/68 Memory hierarchies • Virtual memory: same principles behind the use of cache, but implemented between main memory and disk storage • At any point in time, not all the data referenced by p need to be in main memory • Address space is partitioned into fixedsize blocks: pages • A page is either in memory or on disk • When CPU references an item within a page if ( check_if_in_cache() == CACHE_MISS ) if ( check_if_in_memory() == MEM_MISS ) PageFault(); // Loads page in memory  CPU doesn’t stall – switches to other tasks
  • 66. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/69 Cache performance • Example: speedup using a cache  Cache 10 times faster than main memory  Cache is used 90% of the cases speedup = 1 / ((1 − fract. enhanced) + fract. enhanced / speedupenhanced) = 1 / ((1 − 0.9) + 0.9/10) ≈ 5.3
  • 67. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/70 Cache performance CPUtime = (CPU clock cycles + memory stall cycles) x clock cycle time Memory stall cycles = #misses × miss penalty = IC × #misses per instruction × miss penalty = IC × #memory references per instr. × miss rate × miss penalty
  • 68. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/71 Cache performance • Example (P&H, p.43)  A computer has a CPI = 2 when data is in cache  Memory access is only required by load and store instructions (40% of total #)  Miss penalty = 25 clock cycles  Cache miss rate = 2% ? How faster would the machine be when no cache miss occurs? CPUno-miss = (CPU clock cycles + memory stall cycles)  clock cycle time = (IC  CPI + 0)  clock cycle time = IC  2  clock cycle time
  • 69. © V. De Florio KULeuven 2003 Basic Concepts Cache performance ? How fast would the machine be when cache misses do occur? 1. Compute the memory stall cycles (msc) Computer Design msc = IC × memory references per instruction × miss rate × miss penalty = IC × (1 + 0.4) × 0.02 × 25 = IC × 0.7 (the 1 is for instruction access, the 0.4 for data access) Computer Architectures for AI Computer Architectures In Practice 2.1/72 2. Compute total performance: CPUcache = (CPU clock cycles + msc) × clock cycle time = (IC × 2 + IC × 0.7) × clock cycle time = 2.7 × IC × clock cycle time
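The stall computation above folds neatly into an "effective CPI". This is a sketch of that folding; the helper name `effective_cpi` is ours.

```c
#include <assert.h>
#include <math.h>

/* Effective CPI including memory stalls:
   CPI_eff = CPI_base + refs_per_instr * miss_rate * miss_penalty. */
double effective_cpi(double cpi_base, double refs_per_instr,
                     double miss_rate, double miss_penalty)
{
    return cpi_base + refs_per_instr * miss_rate * miss_penalty;
}
```

With the example's numbers (CPI = 2, 1.4 references per instruction, 2% miss rate, 25-cycle penalty), the effective CPI is 2.7, i.e. the machine with misses is 2.7/2 = 1.35 times slower.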
  • 70. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/73 Computer Design • Quantitative assessments Instruction sets • Pipelining
  • 71. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/74 Computer design • Instruction-set architecture:  The architecture of the machine level  The boundary between SW and HW • Organization:  High level aspects: memory system, bus structure, internal CPU design • Hardware:  The specifics of a machine: detailed logic design, packaging technology… • Architecture = I + O + H
  • 72. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/75 Instruction Sets • IS = Instruction sets = The architecture of the machine language • IS Classification • Roles of the compilers • DLX
  • 73. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/76 Computer Design  IS IS Classification • Role of the compilers • DLX
  • 74. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/77 Computer Design  IS  IS Classification • Key: type of internal storage in the CPU • Three main classes  Stack architectures  Accumulator architectures  General-purpose register architectures
  • 75. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/78 [Stack after 1. PUSH A and 2. PUSH B: A, B; next: 3. ADD] Computer Architectures In Practice ADD = PUSH (POP + POP)
  • 76. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/79 [Stack during 3. ADD: B has been popped, A is on top] Computer Architectures for AI Computer Architectures In Practice ADD = PUSH (POP + POP) ADD = PUSH (B + POP)
  • 77. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/80 [Stack after 3. ADD: B+A on top] Computer Architectures for AI Computer Architectures In Practice ADD = PUSH (POP + POP) ADD = PUSH (B + POP) ADD = PUSH (B + A)
  • 78. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • • • • Stack architecture: Operands are implicitly referred to Top two items on the system stack Example: C = A + B 4. POP C 3. ADD Computer Architectures for AI 2. PUSH B 1. PUSH A Computer Architectures In Practice C = TOP STACK = A+B An example: the ARIEL virtual machine (Part 1, Slides 91 –) 2.1/81
• 79. Computer Design → IS → IS Classification → Accumulator A.
  • Accumulator architectures
  • A special register (the accumulator) plays the role of an implicit argument
  • Example: C = A + B
    1. LOAD A   ; let Acml = A
    2. ADD B    ; let Acml = Acml + B
    3. STORE C  ; let C = Acml
• 80. Computer Design → IS → IS Classification → Register A.
  • General-purpose register architectures
  • Explicit operands only
  • Either registers or memory locations
  • Two flavors:
    – Register-memory architectures (RMA)
    – Register-register architectures (RRA)
  • Example: C = A + B
    – RMA:
      Load R1, A
      Add R1, B       ; in C notation: R1 += B
      Store C, R1
    – RRA:
      Load R1, A
      Load R2, B
      Add R3, R1, R2
      Store C, R3
• 81. Computer Design → IS → IS Classification → RRA
  • Some old machines used stack or accumulator architectures
    – For instance, the T800 and the 6502/6510
  • Today the de facto standard is RRA
    – Registers are fast
    – Registers are easier to use (for compiler writers)
    – They do not require dealing with associativity issues — stacks do!
    – Registers can hold variables:
      register int i;
      for (i = 0; i < 1000000; i++) { do_stgh(i); … }
    – Using registers you don’t need a memory address
• 82. Computer Design → IS → IS Classification → Register A.
  • RRA: no memory operands
    – All instructions are similar in size → they take a similar number of clocks to execute (a very useful property… see later)
    – No side effects
    – Higher instruction count
  • RMA: one memory operand
    – One load can be spared
    – A register operand is destroyed (R1 += B)
    – Clocks per instruction vary with operand location
  • Memory-memory:
    – Compact
    – Large variation of work per instruction
    – Large variation in instruction size
• 83. Computer Design → IS → Memory addressing
  • How is memory organized?
  • What does it mean, e.g., to read memory at address 512?
  • What do we read?
    – Bytes, half-words, words, double words
  • How are consecutive bytes stored in a word? (Assumption: a word is 4 bytes)
    – Little endian: &word = &LSB
    – Big endian: &word = &MSB
    – XDR routines are needed to exchange data between the two (&word denotes the address of the word)
• 84. A memory model for didactics
  • Memory can be thought of as a finite, long array of cells, each of size 1 byte:
    0 1 2 3 4 5 6 7 …
  • Each cell has a label, called its address, and a content, i.e. the byte stored in it
  • Think of a chest of drawers, with a label on each drawer and possibly something in it
• 85. A memory model for didactics
  [Figure: a chest of drawers — each drawer carries an address (1–4) and holds some content]
• 86. A memory model for didactics
  • The character * has a special meaning
  • It refers to the contents of a cell
  • For instance: *(1)
  • Used to read, this character means we’re inspecting the contents of a cell (we open a drawer and see what’s in it)
• 87. A memory model for didactics
  • The character * has a special meaning
  • It refers to the contents of a cell
  • For instance: *(1)
  • Used to write, this character means we’re writing new contents into a cell (we open a drawer and change its contents)
• 88. A memory model for didactics
  • Memory is (often) byte-addressable, though it is organized into small groups of bytes: the machine word
  • A common size for the machine word is 4 bytes (32 bits)
  • Two possible organizations for the bytes in a word:
    – Little endian
    – Big endian
• 89. Little endian versus Big endian
  [Figure: two 8-byte memories laid out word by word]
  • Big endian (e.g. Motorola): within each word, the MSB sits at the lowest address — word 0 occupies addresses 0 (MSB) … 3 (LSB), word 1 occupies 4 (MSB) … 7 (LSB)
  • Little endian (e.g. Intel): within each word, the LSB sits at the lowest address — word 0 occupies addresses 0 (LSB) … 3 (MSB), word 1 occupies 4 (LSB) … 7 (MSB)
• 90. Little endian versus Big endian
  • Problem: communication between the two
  • The stored bytes are the same; interpreted as if they were the other endianness, the values differ:
    – A little-endian machine (Intel) stores 1 as bytes 01 00 00 00; a big-endian machine (Motorola) reads those bytes as 0x01000000 = 16777216
    – A big-endian machine stores 16 as bytes 00 00 00 10; a little-endian machine reads those bytes as 0x10000000 = 268435456
• 91. Computer Design → IS → Memory addressing
  • Alignment is mandatory on some machines
    – Object O; int t = sizeof(O);
    – ALIGNED(O) means “&O modulo t is 0” — “access to O is aligned”
    – For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4
    – Alignment is sometimes necessary because it prevents hardware complications
    – Alignment implies faster access
• 92. Computer Design → IS → Memory addressing
  • Addressing modes: ways to specify the address of an object in memory
  • An addressing mode can specify:
    – A constant
    – A register
    – A memory location
  • Notation used in what follows:
    A += B   means A = A + B
    *(x)     means “return the contents of memory at address x”
    x++      means “at the end, let x = x + 1”
    --x      means “at the beginning, let x = x – 1”
    Rx       means register x
• 93. Computer Design → IS → Memory addressing
  Mode           Example                Meaning
  Register       Add R4, R3             R4 += R3
  Immediate      Add R4, #3             R4 += 3
  Displacement   Add R4, 100(R1)        R4 += *(100 + R1)
  Indirect       Add R4, (R1)           R4 += *(R1)
  Indexed        Add R4, (R1 + R2)      R4 += *(R1 + R2)
  Absolute       Add R4, (100)          R4 += *(100)
  Deferred       Add R4, @(R3)          R4 += *(*(R3))
  Autoincrement  Add R4, (R3)+          Indirect, then R3++
  Autodecrement  Add R4, -(R2)          R2--, then indirect
  Scaled         Add R4, 100(R2)[R3]    R4 += *(100 + R2 + R3 * d)
  where d = size of the addressed data (1, 2, 4, 8, or 16)
• 94. Computer Design → IS → Memory addressing
  • Addressing modes can reduce IC
  • Complex addressing modes increase the complexity of the hardware → they can increase CPI
  • Displacement, immediate and deferred represent between 75% and 99% of the addressing modes used (experiments done with TeX, spice, and gcc)
  • IC(p) = instruction count = the number of instructions that the CPU executes during the activity of program p
  • CPI(p) = clock cycles per instruction = #CC(p) / IC(p) = the average number of clock cycles needed to execute one instruction of p
• 95. Computer Design → IS → Operations
  • Arithmetical and logical (add, and, sub…)
  • Data transfer (move, store)
  • Control (br, jmp, call, ret, iret…)
  • System (virtual memory management…)
  • Floating point (add, mul, …)
  • Decimal (decimal add, decimal mul…)
  • String (string move, string compare, string search)
  • Graphics (pixel operations)
  • Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed (see Fig. 2.11, P&H p. 81)
• 96. Computer Design → IS → Operations
  • Control flow instructions:
    – Branch (conditional change)
    – Jump (unconditional change)
    – Procedure calls
    – Procedure returns
  • Most of the comparisons in conditional branches are simple “==” and “!=” tests against 0!
  • In some cases, the address to go to is only known at run-time:
    – “Return” uses a stack
    – Switch statements
    – Dynamic libraries
• 97. Computer Design → IS → Operands
  • When we say, e.g., “Add R1, #5”, do we work with bytes? Half-words? Words?
  • How do we specify the type of the operand?
  1. Classical method: the type of the operand is part of the opcode
  • The Add family is coded as ffff…fffvv, where the f are fixed bits and the v are bits that specify the type
• 98. Computer Design → IS → Operands and types
  • Example: Add family = 10110101000100vv
    1011010100010000 = Add float words
    1011010100010001 = Add words
    1011010100010010 = Add half-words
    1011010100010011 = Add bytes
  2. Old-fashioned method: operand = data + tag
  • The tag describes a type
  • The tag is interpreted by HW
  • The operation is chosen accordingly
• 99. Computer Design → IS → Operands and types
  • Which types to support?
  • Old-fashioned solution: all of them (bytes, half-words, words, floating point, double words, double-precision floating point, …)
  • Current trend: only operations on items greater than or equal to 32 bits
  • On the DEC Alpha one needs multiple instructions to access objects smaller than 32 bits
• 100. Computer Design → IS → Operands and types
  • Floating-point numbers: IEEE standard 754
  • In the early ’80s, each manufacturer had its own floating-point representation
  • Sometimes string operations are available (strcmp, strcpy…)
  • Sometimes BCD is used to code numbers:
    – Four bits are used to code a decimal digit
    – A byte codes two decimal digits
    – Functions for “packing” and “unpacking” are required
    – It is unclear whether this will stay in the future
• 101. Computer Design → IS
  • IS Classification
  Role of the compilers
  • DLX
• 102. Computer Design → IS → Role of the compiler
  • In the past, the role of assembly language was crucial
  • Architectural decisions aimed at easing assembly-language programming
  • Now, the user interface is a high-level language (C, C++, Java…)
  • The user interfaces with the machine via the HLL, though the machine actually executes some lower-level code
  • This lower-level code is produced by a compiler
    – The role of the compiler is fundamental
    – The IS architecture needs to take the compiler into strong account
• 103. Computer Design → IS → Role of the compiler
  • Goals of the compiler writer:
    – Correctness
    – Performance
    – …fast compilation, debugging support, …
  • Strategy for writing a compiler:
    – Use a number of “passes”, from high-level structures down to lower levels, until machine level
    – This way complexity is decomposed into smaller blocks
    – Optimizing, however, becomes more difficult
• 104. Computer Design → IS → Role of the compiler
  Pass            Dependencies                  Function
  Front-end       D(language)                   Language → common intermediate form
  HL Opt          D(language)                   Loop transformations, function inlining…
  Global Opt      D(language), D(machine)       Register allocation…
  Code generator  D(machine)                    Instruction selection, machine-dependent opt.
• 105. Computer Design → IS → Role of the compiler
  • HL optimizations: source-level optimizations (code → code’)
  • Local optimizations: basic-block optimizations
  • Global optimizations: loop optimizations and basic-block optimizations
  • Machine-dependent optimizations: using low-level architectural knowledge
  • Basic block = a straight-line code fragment
• 106. Computer Design → IS → Role of the compiler
  • Compilers have different optimization levels: -O1 .. -On
  • Optimization can have a big impact on instruction count → on performance
• 107. Computer Design → IS → Role of the compiler
  [Figure-only slide: no text content]
• 108. Computer Design → IS → Role of the compiler
  • In some cases, though, optimization may be counterproductive!
  • This happens because there might be conflicts between local and global optimization tasks
  • Example (the same expression appears twice):
    a = sqrt(x*x + y*y) + f() …;
    b = sqrt(x*x + y*y) + g() …;
  • Idea:
    tmp = sqrt(x*x + y*y);
    a = tmp + f() …;
    b = tmp + g() …;
• 109. Computer Design → IS → Role of the compiler
  • Effective, but only if tmp can be stored in a register
  • No register → in memory → cache misses → … bad performance
  • The problem is:
    – When the compiler performs code transformations like the one in the example, it does not yet know whether a register will actually be available
    – This will only become clear later (at the global optimization level)
  • (The phase-ordering problem)
• 110. Computer Design → IS → Role of the compiler
  • The key resource is the register file
  • “Intelligent” register allocation techniques are a must
  • Current solution: graph coloring (a graph whose nodes are the candidates for allocation to a register)
  • NP-complete, though effective heuristic algorithms exist
• 111. Computer Design → IS → Role of the compiler
  • A special class of compilers: algorithm-driven software generation
    – The FFTW approach: a software generation system based on symbolic computation
    – Written in Objective Caml
    – A sort of FFT compiler that generates optimal C code via symbolic computing
    – Possible future steps (project works, theses…): extending the approach down to code generation for, e.g., the TI ’C67 DSP and other VLIW CPUs
• 112. Exam of 16 Jan 2002
  • A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating-point instructions)
  • The three classes are responsible for r1 = 60%, r2 = 30% and r3 = 10% of the overall execution time, respectively
  • You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3
  • The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10
  • Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice
• 113. Solution
  • r1 = 60%, r2 = 30%, r3 = 10%; e1 = 2, e2 = 3, e3 = 10
  • By Amdahl’s law:
    S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
  • s1 = 1 / (0.4 + 0.6/2)  = 1.42857
    s2 = 1 / (0.7 + 0.3/3)  = 1.25
    s3 = 1 / (0.9 + 0.1/10) = 1.0989
  • Hence O1 is the best choice: it yields the largest overall speedup