Advanced Computer Architectures – HB49 – Part 2.1
Vincenzo De Florio
K.U.Leuven / ESAT / ELECTA
© V. De Florio, KULeuven 2003

Course contents
• Basic Concepts
• Computer Design
• Computer Architectures for AI
• Computer Architectures in Practice
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
Computer design
• First part of the course: a survey of computer history
• Key aspect of this history:
 – In the last 60 years computers have experienced a formidable growth in performance and a huge decrease in cost
 – A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties
Computer design
• How was this possible?
• Through
 – Advances in computer technology
 – Advances in computer design
Computer design
• The tasks of a computer designer:
 – Determine the key attributes of a new machine
 – E.g., design a machine that maximizes performance while keeping costs under control
• Aspects:
 – Instruction set design
 – Functional organization
 – Logic design
 – Implementation
(to be defined later)
Significant improvements
• First 25 years:
 – From both technology and design
• From the Seventies:
 – Mainly from IC technology
 – Main concern = compatibility with the past (to save investments)
 – Compatibility at the machine-language level
 – No room for design improvements
 – 20-30% per year for mainframes and minis
• Late Seventies: advent of the µP
 – Higher rate (35% per year)
Significant improvements: the µP
• The µP
 – Mass-produced ⇒ lower costs
 – Significant changes in the computer marketplace
 – Higher-level language compatibility (no need for object-code compatibility)
 – Availability of standard, vendor-independent OSs (fewer risks and costs in producing a new architecture)
⇒ this made it possible to develop a new concept: RISC architectures
Significant improvements: RISC
RISC architectures
 – Designed in the Eighties, on the market ca. '85
 – Since then, a 50% improvement per year

[Figure: performance of workstations, 1987-1995 (Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc). The growth rate rises from 1.35X/yr to 1.54X/yr.]
Technology Trends

[Figure: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965-2000, log scale 0.1-1000. Microprocessors show the steepest growth.]
Computer design
• The µP allowed a 50% yearly performance increase. How was that possible?
 – Enhanced capability for users
 – IBM Power-2 (1993) ≈ Cray Y-MP (1988)
 – The fastest supercomputer in 1988 had approximately the same performance as the fastest 1993 workstation
 – Price: 1/10
 – Computers became more and more µP-based
 – Mainframes were disappearing or becoming based on off-the-shelf µPs
Computer design
• Big consequence
 – No more market urge for object-code compatibility
 ⇒ Freedom from compatibility with old designs
 ⇒ Renaissance in computer design
 ⇒ Again, significant improvements from both technology and design
 ⇒ 50% performance growth per year!
Computer design
• The highest-performance µP in '95 is mainly a result of design improvements (by a factor of 1-to-5)
• In this section we focus on the design techniques that made this possible
Performance
• What aspects must be taken into account in order to reach higher performance?
• How to choose between different alternatives?
 – Amdahl's law
 – Quantitative assessment
Amdahl's law
• Speed-up:

S = (execution time for the entire task without the enhancement) / (execution time for the entire task using the enhancement when possible)

• Amdahl's law: the speed-up depends on the fraction of time that may be affected by the enhancement
Amdahl's law
Let us call F the fraction of time affected by the enhancement.
For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time.
What do we gain by introducing the enhancement?

Exec-timeNEW = Exec-timeOLD × ((1 − F) + F / SENH)

where SENH is the speedup in the enhanced mode. Hence,

S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − F) + F / SENH)
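The formula above can be sketched directly in Python (a minimal sketch; the function name `speedup` is illustrative):

```python
def speedup(f: float, s_enh: float) -> float:
    """Amdahl's law: S = 1 / ((1 - F) + F / S_ENH).

    f     -- fraction of execution time affected by the enhancement
    s_enh -- speedup in enhanced mode
    """
    return 1.0 / ((1.0 - f) + f / s_enh)

# With F = 0.40 and a 10x enhancement:
print(round(speedup(0.40, 10), 2))  # → 1.56
```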
Amdahl's law

[Figure: overall speedup S versus SENH for F = 40% — SENH grows, but SOVER does not keep growing with it.]
Amdahl's law
• Law of diminishing returns
 – the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

lim(SENH → ∞) S = lim(SENH → ∞) 1 / ((1 − F) + F / SENH) = 1 / (1 − F) = SMAX
Amdahl's law

To reach a maximum speedup of 3, F must be at least 66% (F = 1 − 1/3 ≈ 0.67).
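The diminishing-returns limit can be inverted to find the F required for a target maximum speedup (a sketch; function names are illustrative):

```python
def max_speedup(f: float) -> float:
    """S_MAX = 1 / (1 - F): the speedup limit as S_ENH grows without bound."""
    return 1.0 / (1.0 - f)

def required_fraction(s_max: float) -> float:
    """Invert S_MAX = 1 / (1 - F):  F = 1 - 1 / S_MAX."""
    return 1.0 - 1.0 / s_max

# To reach a maximum speedup of 3:
print(round(required_fraction(3), 3))  # → 0.667 (about 66%)
```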
Amdahl's law…
• "…can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance.
• The goal, clearly, is to spend resources proportional to where time is spent."
Amdahl's law
• Example 1 (p. 30 P&H)
 – A method allows an improvement by a factor of 10
 – It can be exploited for 40% of the time

speedupoverall = 1 / ((1 − fractionenhanced) + fractionenhanced / speedupenhanced)
             = 1 / ((1 − 0.4) + 0.4/10) ≈ 1.56
Amdahl's law
• Example 2 (p. 31 P&H)
 – 50% of the instructions of a given benchmark are floating-point instructions
 – FPSQR applies to 20% of the same benchmark
 – Alternative 1: extra hardware makes FPSQR 10 times faster
 – Alternative 2: all FP instructions go 2 times faster

speedupFPSQR = 1 / ((1 − 0.2) + 0.2/10) ≈ 1.22
speedupFP = 1 / ((1 − 0.5) + 0.5/2.0) ≈ 1.33
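The two alternatives can be compared numerically with the same formula (a sketch using the slide's figures):

```python
def speedup(f: float, s_enh: float) -> float:
    """Amdahl's law: S = 1 / ((1 - F) + F / S_ENH)."""
    return 1.0 / ((1.0 - f) + f / s_enh)

# Alternative 1: FPSQR (20% of the time) made 10x faster
# Alternative 2: all FP instructions (50% of the time) made 2x faster
alt1 = speedup(0.2, 10)
alt2 = speedup(0.5, 2.0)
print(round(alt1, 2), round(alt2, 2))  # → 1.22 1.33
```

The smaller per-operation improvement wins because it applies to a larger fraction of the execution time.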
Quantitative assessment
• CPUTIME(p) = time spent by the CPU to run program p
• Clock cycle time = tcc, clock rate = 1/tcc
• CPUTIME(p) = #clock cycles × tcc = #clock cycles / clock rate
• E.g.: clock cycle time = 2 ns ⇒ clock rate = 500 MHz
• #CC(p) = number of clock cycles spent in the execution of p
Quantitative assessment
• Instruction count
• IC(c,p) = number of instructions that CPU c executed during the run of program p
• Often abbreviated to IC(p)
Quantitative assessment
• Clock cycles per instruction
• CPI(p) = #CC(p) / IC(p) = average number of clock cycles needed to execute one instruction of p
Quantitative assessment
• CPUTIME(p) =
  = #clock cycles × clock cycle time
  = #CC(p) × tcc
  = IC(p) × CPI(p) × tcc
  = IC(p) × CPI(p) / clock rate
⇒ We can influence the performance of a given program p by optimizing the three key variables IC(p), CPI(p), and clock rate.
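The three-variable product can be sketched as a one-liner (illustrative names and numbers, not data from the course):

```python
def cpu_time(ic: int, cpi: float, clock_rate_hz: float) -> float:
    """CPUTIME(p) = IC(p) x CPI(p) / clock rate."""
    return ic * cpi / clock_rate_hz

# E.g. 10^9 instructions at CPI 1.2 on a 500 MHz machine:
print(round(cpu_time(1_000_000_000, 1.2, 500e6), 2))  # → 2.4 (seconds)
```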

Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
 Clock rate (the higher, the better)
 Clock cycles per instructions (the lesser, the
better)
 Instruction count (the lesser, the better)

Quantitative assessment
• CPU performance is equally dependent
upon three characteristics
 Clock rate (HW technology & organization)
 Clock cycles per instruction
(organization & instruction set architecture)
 Instruction count
(instruction set architecture &
compiler technology)

• Note: technologies are not independent of
each other!
Quantitative assessment

CPU time = Seconds / Program
         = Instructions / Program × Cycles / Instruction × Seconds / Cycle

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
Quantitative assessment
• Decades-long challenge: optimizing

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• This is a function of p!
• The choice of benchmarks is important
Quantitative assessment
• Which methods to use?

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• Method 1: increasing the clock rate (note: independent of p!)
• Methods 2: those trying to decrease IC(p)
• Methods 3: those trying to decrease CPI(p)
• Each method is equally important
• Some methods are more effective than others
Quantitative assessment: how to calculate CPI?

CPI = #CC / Instr. count = Σ(i=1..n) (CPIi × ICi) / Instr. count = Σ(i=1..n) CPIi × (ICi / Instr. count)

ICi = number of times instruction i is executed by p
CPIi = average number of clock cycles for instruction i
CPIi needs to be measured, not just read from a table in the reference manual!
That is, we need to take the memory access time into account! (Cache misses do count… a lot)
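The weighted-average CPI formula above can be sketched as follows (the `(count, cycles)` profile is an illustrative measurement, not data from the slides):

```python
def average_cpi(profile: list[tuple[int, float]]) -> float:
    """CPI = sum(CPI_i * IC_i) / total instruction count.

    profile -- list of (IC_i, CPI_i) pairs, one per instruction class
    """
    total = sum(count for count, _ in profile)
    return sum(count * cycles for count, cycles in profile) / total

profile = [(800, 1.0),   # e.g. ALU ops: 1 cycle each
           (200, 2.0)]   # e.g. branches: 2 cycles each
print(average_cpi(profile))  # → 1.2
```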
Quantitative assessment
• Example 3: two alternatives for a conditional branch instruction
 – A: a CMP that sets a condition code (Z bit), followed by a JZ
 – B: a single instruction that does both CMP and JZ

Arch. A:
    LD R1, 0
L:  INC R1
    CMP R1, 5
    JZ L
    RET

Arch. B:
    LD R1, 0
L:  INC R1
    JRZ R1, 5, L
    RET

We assume that JZ and JRZ take 2 cycles; all other instructions take 1 cycle.
Quantitative assessment
• 20% of the instructions are conditional jumps (instructions such as JZ or JRZ)
• 80% are other instructions
• On A, for each conditional jump there is a CMP ⇒ on A, 20% are conditional jumps and 20% are CMPs
• 60% are other instructions
• Because of the extra complexity in B, the clock of A is faster (CTB = 1.25 CTA)
Quantitative assessment

• CPIA = Σi (ICi × cyclesi) / ICA
       = fraction of c.jumps × cyclesBR + fraction of others × 1
       = 20% × 2 + 80% × 1 = 1.2
• CPUA = ICA × CPIA × CTA = ICA × 1.2 × CTA

• CPIB = Σi (ICi × cyclesi) / ICB (computed on the next slide)
Quantitative assessment
• Now, on B:
 – One spares 20% of the instructions (the extra CMPs), hence:
   nBRB = 20 / (100 − 20) = 0.25 (25%)
 – Furthermore, ICB = 0.8 ICA

• Hence CPIB = 0.25 × 2 + 0.75 × 1 = 1.25

• CPUB = ICB × CPIB × CTB = 0.8 ICA × 1.25 × 1.25 CTA
So CPUB = 1.25 × ICA × CTA
   CPUA = 1.2 × ICA × CTA
So A is faster (for which p?)
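The comparison above can be re-run numerically (IC_A and CT_A are normalized to 1; the 1.25 clock-time factor for B is from the slides):

```python
# Architecture A: 20% c.jumps (2 cycles), 20% CMPs and 60% others (1 cycle)
ic_a, ct_a = 1.0, 1.0
cpi_a = 0.20 * 2 + 0.80 * 1             # = 1.2
cpu_a = ic_a * cpi_a * ct_a

# Architecture B: the CMPs disappear, so IC shrinks by 20% and the
# c.jump fraction rises to 25%, but the clock is 25% slower.
ic_b = 0.8 * ic_a
cpi_b = 0.25 * 2 + 0.75 * 1             # = 1.25
cpu_b = ic_b * cpi_b * (1.25 * ct_a)

print(round(cpu_a, 2), round(cpu_b, 2))  # → 1.2 1.25  (A is faster)
```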
Performance
• A straightforward enhancement is to increase the clock rate
• The entire program benefits
• It is also independent of the particular program
• (Unlike enhancements that depend on the efficiency of the compiler, etc.)
Clock Frequency Growth Rate

[Figure: clock rate (MHz, log scale 0.1-1,000) vs. year, 1970-2005 — i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000.]
• 30% per year
Transistor Count Growth Rate

[Figure: transistors per chip (log scale 1,000-100,000,000) vs. year, 1970-2005 — i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
• 100 million transistors on chip in early year 2000.
• Transistor count grows much faster than clock rate.
Performance
• Another important factor for performance is given by
 – Memory accesses
 – I/O (disk accesses)
Memory
• Semiconductor DRAM technology
 – Density: increase of 60% per year (quadruples in 3 years)
 – Cycle time: much smaller improvement than this!

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years

Speed increases of memory and I/O have not kept pace with processor speed increases.
Memory size

[Figure: DRAM chip capacity (bits, log scale 1,000-1,000,000,000) vs. year, 1970-2000.]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
Basic definitions
1. Bandwidth: the rate at which data can be
transferred. Bandwidth is typically measured in
bytes per second.
2. Block size: the amount of data transferred per
request. Block size is typically measured in bytes.
3. Latency: the time between making a request (e.g.
to read or write a block of data) and completing the
request. Latency is typically measured in seconds.
4. Throughput: The number of requests that can be
completed per unit time. Throughput is typically
measured in requests per second.
Memory
• DRAM: main memory of all computers
 – Commodity chip industry: no company >20% share
 – Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)
• Capacity: 4X/3 years (60%/year)
 – Moore's Law
• MB/$: +25%/year
• Latency: −7%/year; Bandwidth: +20%/year (so far)

SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips (measured in bytes vs. bits); 32-bit path to memory
DIMM = dual in-line memory module; 64-bit path to memory
source: www.pricewatch.com, 5/21/98
Processor Limit: DRAM Gap

[Figure: performance (log scale 1-1,000) vs. year, 1980-2000. µProc performance ("Moore's Law") grows at 60%/yr, DRAM at 7%/yr; the processor-memory performance gap grows 50% per year.]
Memory Summary
• DRAM:
 – rapid improvements in capacity, MB/$, bandwidth;
 – slow improvement in latency
⇒ The processor-memory interface is a bottleneck to delivered bandwidth
Disk Components
Disk Components: Platters
• Platters: the recording surfaces.
i. 1 to 8 inches in diameter (2.5 to 20 cm).
ii. Stacked on a spindle: typical disks have 1-12
platters.
iii. Data can be stored on one or both surfaces.
iv. Spindle and platters rotate at 3600 - 10000 rpm
(60-165 Hz).
v. Recording density depends on applying a
magnetic film with few defects.
vi. Rotation rate limited by bearings and power
consumption.
Disk Components: Heads
• Heads: write and read data to and from the platters.
i. Data is stored as the presence or absence of magnetization.
ii. The head "floats" on an air film that rotates with the disk. The Bernoulli effect pulls the head toward the disk but not into it. A dust particle can cause a "head crash", where the disk surface is scratched and any data on it is lost.
iii. Disk heads are manufactured using thin-film technology. Advancing technology allows smaller heads and therefore more closely spaced tracks and bits.
Disk Components: Actuators
• Actuators: move the heads radially over the platters.
i. The actuator arm needs to be light to move quickly.
ii. The actuator arm needs to be stiff to prevent flexing.
iii. Smaller platters allow shorter arms: therefore lighter and stiffer.
iv. Actuators are limited by
 – the power of the actuator motor and
 – the weight and strength of the actuator components
Disks: Data Layout
• Each surface consists of concentric rings called tracks
• Each track is divided into sectors. Data is written to and read from the disk a whole sector at a time
• The set of tracks at the same relative position on each surface forms a cylinder

[Figure: a cylinder across the stacked platters.]
Three Components of Disk Access Time
1. Seek time: the time to move the heads to the desired cylinder
 – Advertised to be 8 to 12 ms. May be lower in real life
2. Rotational latency: the time for the desired sector to arrive under the head
 – 4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM
3. Transfer time: the time to read the data from the disk and send it over the I/O bus to the processor
 – 2 to 12 MB per second

[Diagram: Proc → Queue → Ctrl (IOC) → Device — "Disk Access Time".]
Response time = Queue + Ctrl + Device service time
Hard Disks

Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

Order-of-magnitude times for 4 KB transfers:
 – Average seek: 8 ms or less
 – Rotate: 4.2 ms @ 7200 rpm
 – Xfer: 1 ms @ 7200 rpm
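A back-of-envelope sketch of the latency breakdown, using the slide's order-of-magnitude figures (queueing and controller time are ignored; the function name is illustrative):

```python
def disk_latency_ms(seek_ms: float, rpm: float, transfer_ms: float) -> float:
    """Seek + average rotational latency + transfer time, in milliseconds."""
    rotation_ms = 0.5 * 60_000 / rpm   # on average, half a revolution
    return seek_ms + rotation_ms + transfer_ms

# 8 ms seek, 7200 rpm, ~1 ms transfer for a 4 KB block:
print(round(disk_latency_ms(8, 7200, 1), 1))  # → 13.2 (ms)
```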
Hard Disks
• Capacity
 – +60%/year (2X / 1.5 yrs)
• Transfer rate (BW)
 – +40%/year (2X / 2.0 yrs)
• Rotation + seek time
 – −8%/year (1/2 in 10 yrs)
• MB/$
 – >60%/year (2X / <1.5 yrs)

Latency = Queuing Time + Controller Time + (Seek Time + Rotation Time, per access) + (Size / Bandwidth, per byte)

source: Ed Grochowski, 1996, "IBM leadership in disk drive technology"; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm
Hard disks

1973: 1.7 Mbit/sq. in, 140 MBytes
1979: 7.7 Mbit/sq. in, 2,300 MBytes
Hard Disks

[Figure: areal density (Mbit/sq. in, log scale 1-10,000) vs. year, 1970-2000.]

1989: 63 Mbit/sq. in, 60,000 MBytes
1997: 1450 Mbit/sq. in, 1600 MBytes
1997: 3090 Mbit/sq. in, 8100 MBytes
Hard Disks
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read the whole disk:

Year   Sequentially   Randomly
1990   4 minutes      6 hours
2000   12 minutes     1 week
Memory/Disk Summary
• Memory:
 – DRAM: rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency
• Disk:
 – Continued advance in capacity, cost/bit, bandwidth; slow improvement in seek, rotation
• Huge gap between the CPU and external memories
• How to address this problem?
• Classical way: memory hierarchies
Memory hierarchies
• Axiom of the HW designer: smaller is faster
 – Larger memories => larger signal delay
 – More logic levels are required to decode addresses
 – In a smaller memory the designer can use more power per cell => shorter access times
• Crucial features for performance
 – Huge bandwidth (in MB/sec)
 – Short access times
• Principle of locality
 – The data most recently used is very likely to be accessed again in the near future (temporal locality)
 – Memory cells close to the most recently used one are likely to be accessed in the near future (spatial locality)

• Combining the above with Amdahl's law, the "best" enhancement is using hierarchies of memories
Typical memory hierarchy ('95)

CPU registers ↔ cache ↔ [memory bus] ↔ memory ↔ [I/O bus] ↔ I/O devices

        Registers   Cache   Memory   I/O devices
Size:   200 B       64 KB   32 MB    2 GB
Speed:  5 ns        10 ns   100 ns   5 ms
Memory hierarchies

[Diagram: the memory hierarchy and its design issues — L1 cache, L2 cache, DRAM, input/output and storage (disks, WORM, tape, RAID, emerging technologies); interleaving and bus protocols; coherence, bandwidth, latency; VLSI; instruction set architecture (addressing, protection, exception handling); pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP — pipelining and instruction-level parallelism.]
Memory hierarchies
• Registers: the smallest and fastest memory
• Size: less than 1 KB
• Access time: 2-5 ns
• Bandwidth: 4,000-32,000 MB/sec
• Managed by the compiler (or the assembly programmer)
 – register int a;
• Special purpose vs. general purpose
• Monolithic or double-shaped
 – Rx = Rl + Rh
• Backed in cache
• Implemented via custom memory with multiple ports
Memory hierarchies
• Cache = small, fast memory located close to the CPU
• The cache holds the most recently accessed code or data
 – Managed by HW
 – No way to say "put these data in cache" at the SW level
 – New research: cache-conscious data structures
• Size: less than 4 MB
• Access time: 3-10 ns
• Bandwidth: 800-5,000 MB/sec
• Backed in main memory
• Implemented with (on- or off-chip) CMOS SRAM
Memory hierarchies
• Cache terminology: cache hit, cache miss, cache block
 – Cache hit: the CPU has been able to find the requested data in the cache
 – Cache miss: ¬ cache hit
 – Cache block: the fixed-size buffer used to load a portion of memory into the cache
• A cache miss blocks the CPU until the corresponding memory block gets cached
Memory hierarchies
• Virtual memory: same principles as behind the use of the cache, but implemented between main memory and disk storage
• At any point in time, not all the data referenced by p needs to be in main memory
• The address space is partitioned into fixed-size blocks: pages
• A page is either in memory or on disk
• When the CPU references an item within a page:
  if ( Check-if-in-cache() == CACHE_MISS )
      if ( Check-if-in-memory() == MEM_MISS )
          PageFault(); // loads the page into memory
 – The CPU doesn't stall – it switches to other tasks
Cache performance
• Example: speedup using a cache
 – Cache 10 times faster than main memory
 – Cache is used in 90% of the cases

speedup = 1 / ((1 − fractionenhanced) + fractionenhanced / speedupenhanced)
        = 1 / ((1 − 0.9) + 0.9/10) ≈ 5.3
Cache performance
CPUtime = (CPU clock cycles + memory stall cycles) × clock cycle time
Memory stall cycles = #(misses) × miss penalty
 = IC × #(misses per instruction) × miss penalty
 = IC × #(memory references per instr.) × miss rate × miss penalty
Cache performance
• Example (P&H, p. 43)
 – A computer has CPI = 2 when all data is in cache
 – Memory access is only required by load and store instructions (40% of the total)
 – Miss penalty = 25 clock cycles
 – Cache miss frequency = 2%
? How fast would the machine be if no cache miss ever occurred?

CPUall-hit = (CPU clock cycles + memory stall cycles) × clock cycle time
 = (IC × CPI + 0) × clock cycle time
 = IC × 2 × clock cycle time
Cache performance
? How fast is the machine when cache misses do occur?

1. Compute the memory stall cycles (msc):
msc = IC × memory references per instruction × miss rate × miss penalty
 = IC × (1 + 0.4) × 0.02 × 25    (1 instruction access + 0.4 data accesses)
 = IC × 0.7

2. Compute total performance:
CPUcache = (CPU clock cycles + msc) × clock cycle time
 = (IC × 2 + IC × 0.7) × clock cycle time
 = 2.7 × IC × clock cycle time
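The same example as code, in units of IC × clock cycle time (a sketch using the slide's figures):

```python
# CPI 2 when all accesses hit; 40% of instructions are loads/stores;
# 2% miss rate; 25-cycle miss penalty.
cpi = 2.0
mem_refs_per_instr = 1 + 0.4            # instruction fetch + data accesses
miss_rate, miss_penalty = 0.02, 25

msc_per_instr = mem_refs_per_instr * miss_rate * miss_penalty  # 0.7
cpu_all_hit = cpi                        # 2.0
cpu_with_misses = cpi + msc_per_instr    # 2.7

print(round(cpu_with_misses / cpu_all_hit, 2))  # → 1.35 (misses cost 35%)
```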
Computer Design
• Quantitative assessments
• Instruction sets
• Pipelining
Computer design
• Instruction-set architecture:
 – The architecture of the machine level
 – The boundary between SW and HW
• Organization:
 – High-level aspects: memory system, bus structure, internal CPU design
• Hardware:
 – The specifics of a machine: detailed logic design, packaging technology…
• Architecture = I + O + H
Instruction Sets
• IS = instruction set = the architecture of the machine language
• IS classification
• Role of the compilers
• DLX
Computer Design → IS
• IS classification
• Role of the compilers
• DLX
Computer Design → IS → IS Classification
• Key: the type of internal storage in the CPU
• Three main classes:
 – Stack architectures
 – Accumulator architectures
 – General-purpose register architectures

Computer Design → IS → IS Classification → Stack A.
• Stack architecture:
• Operands are implicitly referred to:
• the top two items on the system stack
• Example: C = A + B
 1. PUSH A      ; stack: A
 2. PUSH B      ; stack: B, A
 3. ADD         ; ADD = PUSH (POP + POP) = PUSH (B + A)
 4. POP C       ; C = top of stack = A + B

An example: the ARIEL virtual machine (Part 1, Slides 91 –)
Computer Design  IS 
IS Classification  Accumulator A.
• Accumulator architectures
• A special register (the accumulator) plays the role of an implicit argument
• Example: C = A + B
 1. LOAD A ; let Acml = A
 2. ADD B ; let Acml = Acml + B
 3. STORE C ; let C = Acml
Computer Design  IS 
IS Classification  Register A.
• General-purpose register architecture
• Explicit operands only
• Either registers or memory locations
• Two flavors:
 Register-memory architectures (RMA)
 Register-register architectures (RRA)
• Example: C = A + B
 RMA: Load R1, A
      Add R1, B ; in C: R1 += B
      Store C, R1
 RRA: Load R1, A
      Load R2, B
      Add R3, R1, R2
      Store C, R3
Computer Design  IS 
IS Classification  RRA
• Some old machines used stack or accumulator architectures
 For instance, the T800 and the 6502/6510
• Today the de facto standard is RRA
 Regs are fast
 Regs are easier to use (for compiler writers)
 They do not require dealing with associativity issues
 Stacks do!
 Regs can hold variables
   register int I;
   for (I = 0; I < 1000000; I++)
   { do_stgh(I); … }
 Using regs you don’t need a memory address
Computer Design  IS 
IS Classification  Register A.
• RRA: no memory operands
 All instructions are similar in size  they take a similar number of clocks to execute (a very useful property… see later)
 No side effects
 Higher instruction count
• RMA: one memory operand
 One load can be spared
 A register operand is destroyed ( R += B )
 Clocks per instruction vary with operand location
• Memory-memory:
 Compact
 Large variation of work per instruction
 Large variation in instruction size
Computer Design  IS 
Memory addressing
• How is memory organized?
• What does it mean, e.g., to read memory at address 512?
• What do we read?
 Bytes, half-words, words, double words
• How are consecutive bytes stored in a word? (Assumption: a word is 4 bytes)
 Little endian: &word = &LSB
 Big endian: &word = &MSB
 XDR routines are needed to exchange data
(&word  address of word)
A memory model for didactics
• Memory can be thought of as a finite, long array of cells, each of size 1 byte

 0 1 2 3 4 5 6 7 …

• Each cell has a label, called its address, and a content, i.e. the byte stored into it
• Think of a chest of drawers, with a label on each drawer and possibly something in it
A memory model for didactics
[Figure: a chest of drawers; each drawer carries an address label (…, 4, 3, 2, 1) and holds its content]
A memory model for didactics
• The character * has a special meaning
• It refers to the contents of a cell
• For instance:
 *(1)  inspecting the contents of a cell (we open a drawer and see what’s in it)
 *(1)  writing new contents into a cell (we open a drawer and change its contents)

A memory model for didactics
• Memory is (often) byte-addressable, though it is organized into small groups of bytes: the machine word
• A common size for the machine word is 4 bytes (32 bits)
• Two possible organizations for the bytes in a word
 Little endian
 Big endian

Little endian versus Big endian
[Figure: two 8-byte memories holding two 4-byte words each. Little endian (Intel): within each word the LSB sits at the lowest address and the MSB at the highest (LSB at addresses 0 and 4). Big endian (Motorola): the MSB sits at the lowest address and the LSB at the highest (MSB at addresses 0 and 4).]
Little endian versus Big endian
• Problem: communication between the two
[Figure: the same bytes in memory are interpreted differently by the two organizations. The word whose bytes at increasing addresses are 00 00 00 01 is read as 16777216 by a little-endian machine but as 1 by a big-endian one; the word 10 00 00 00 is read as 16 little-endian but as 268435456 big-endian.]

Computer Design  IS 
Memory addressing
• Alignment is mandatory on some machines
 Object O; int t = sizeof(O);
 ALIGNED(O) means &O modulo t is 0 (“access to O is aligned”)
 For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4
 Alignment is sometimes necessary because it prevents hardware complications
 Alignment implies faster access
Computer Design  IS 
Memory addressing
• Addressing modes: ways to specify the address of an object in memory
• An addressing mode can specify
 A constant
 A register
 A memory location

In what follows,
 A += B means A = A + B
 *(x) means “return the contents of memory at address x”
 x++ means “at the end, let x = x + 1”
 --x means “at the beginning, let x = x – 1”
 Rx means “register x”
Computer Design  IS 
Memory addressing

Mode           | Example              | Meaning
Register       | Add R4, R3           | R4 += R3
Immediate      | Add R4, #3           | R4 += 3
Displacement   | Add R4, 100(R1)      | R4 += *(100 + R1)
Indirect       | Add R4, (R1)         | R4 += *(R1)
Indexed        | Add R4, (R1 + R2)    | R4 += *(R1 + R2)
Absolute       | Add R4, (100)        | R4 += *(100)
Deferred       | Add R4, @(R3)        | R4 += *(*(R3))
Autoincrement  | Add R4, (R3)+        | Indirect, then R3++
Autodecrement  | Add R4, -(R2)        | R2--, then indirect
Scaled         | Add R4, 100(R2)[R3]  | R4 += *(100 + R2 + R3 * d)

d = size of the addressed data (1, 2, 4, 8, or 16)

Computer Design  IS 
Memory addressing
• Addressing modes can reduce IC
• Complex addressing modes increase the complexity of the hardware  they can increase CPI
• Displacement, immediate, and deferred represent between 75% and 99% of the addressing modes used (experiments done with TeX, spice, and gcc)

• IC(p) = number of instructions that the CPU executed during the activity of program p
• CPI(p) = clock cycles per instruction = #CC(p) / IC(p), the average number of clock cycles needed to execute one instruction of p
Computer Design  IS 
Operations
• Arithmetical and logical (add, and, sub...)
• Data transfer (move, store)
• Control (br, jmp, call, ret, iret…)
• System (virtual memory mngt…)
• Floating point (add, mul, …)
• Decimal (decimal add, decimal mul…)
• String (str move, str cmp, str search)
• Graphics (pixel operations)

• Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed (see Fig. 2.11, P&H p. 81)

Computer Design  IS 
Operations
• Control Flow Instructions
 Branch (conditional change)
 Jump (unconditional change)
 Procedure calls
 Procedure returns

• Most of the comparisons in conditional
branches are simple “==“, “!=“ with 0!
• In some cases, the address to go to
is only known at run-time
 “Return” uses a stack
 Switch statements
 Dynamic libraries

Computer Design  IS 
Operands
• When we say, e.g.,
“Add R1, #5”
do we work with bytes? Half-words?
Words?
• How do we specify the type of the
operand?
1. Classical method: the type of operand is
part of the opcode
• Add family is coded as ffff…fffvv
where f are fixed bits and v are bits
that specify the type

Computer Design  IS 
Operands and types
• Example: Add family = 10110101000100vv
• 1011010100010000 = Add float words
 1011010100010001 = Add words
 1011010100010010 = Add half-words
 1011010100010011 = Add bytes

• Old fashioned method: operand = data + tag
• Tag describes a type
• Tag is interpreted by HW
• Operation is chosen accordingly

Computer Design  IS 
Operands and types
• Which types to support?
• Old fashioned solution: all (bytes, semi-words, words, f.p., double words, double precision f.p., …)
• Current trend: only operations on items greater than or equal to 32 bits
• On the DEC Alpha one needs multiple instructions to access objects smaller than 32 bits

Computer Design  IS 
Operands and types
• Floating point numbers: IEEE standard 754
• In the early ’80s, each manufacturer had its own f.p. representation
• Sometimes string operations are available (strcmp, strcpy…)
• Sometimes BCD is used to code numbers
 Four bits are used to code a decimal digit
 A byte codes two decimal digits
 Functions for “packing” and “unpacking” are required
 It is unclear whether this will stay in the future

Computer Design  IS
• IS Classification
Role of the compilers
• DLX

Computer Design  IS 
Role of the compiler
• In the past, the role of Assembly language
was crucial
• Architectural decisions aimed at easing
assembly language programming
• Now, the user interface is a high level
language (C, C++, Java…)
• The user interfaces the machine via the
HLL, though the machine actually
executes some lower level code
• This lower level code is produced by a
compiler
 The role of the compiler is fundamental
 The IS architecture needs to take the
compiler into strong account

Computer Design  IS 
Role of the compiler
• Goals of the compiler writer
 Correctness
 Performance
 … fast compilation, debugging support, …
• Strategy for writing a compiler
 Use a number of “passes”: from high-level structures down to lower levels, until machine level
 This way complexity is decomposed into smaller blocks
 Optimizing becomes more difficult
Computer Design  IS 
Role of the compiler
• Compilation proceeds in passes; dependencies shift from D(language) at the front to D(machine) at the back:
 Front-end: language  common intermediate form
 HL Opt: loop transformations, function inlining…
 Global Opt: register allocation…
 Code generator: instruction selection, D(machine) optimizations

Computer Design  IS 
Role of the compiler
• HL optimizations: source-level optimizations (code  code’)
• Local optimizations: basic-block optimizations
• Global optimizations: loop optimizations and basic-block optimizations
• Machine-dependent optimizations: using low-level architectural knowledge

• Basic block = a straight-line code fragment
Computer Design  IS 
Role of the compiler
• Compilers have different optimization levels
 -O1 .. -On
• Optimization can have a big impact on instruction count  on performance


Computer Design  IS 
Role of the compiler
• In some cases, though, optimization may be counterproductive!
• This happens because there might be conflicts between local and global optimization tasks
• Example (the same expression occurs twice):
 a = sqrt(x*x + y*y) + f()… ;
 b = sqrt(x*x + y*y) + g()…;
• Idea:
 tmp = sqrt(x*x + y*y);
 a = tmp + f() …;
 b = tmp + g() …;

Computer Design  IS 
Role of the compiler
• Effective, but only if tmp can be stored in a register
• No register  in memory  cache misses  … bad performance
• The problem is:
 When the compiler performs code transformations like the one in the example, it does not yet know whether a register will actually be available
 This only becomes clear later (at the global optimization level)
• (The phase-ordering problem)

Computer Design  IS 
Role of the compiler
• Key resource is the register file
• “Intelligent” register allocation
techniques are a must
• Current solution: graph coloring (graph
with possible candidates for allocation to
a register)
• NP-complete, though effective heuristic
algorithms exist

Computer Design  IS 
Role of the compiler
• A special class of compilers – algorithm-driven software generation
 FFTW approach: a software generation system based on symbolic computation
 Objective CamL
 A sort of FFT compiler that generates optimal C code via symbolic computing
 Possible future steps (project works, theses…): extending the approach down to code generation for, e.g., the TI ‘C67 DSP and other VLIW CPUs

Exam of 16 Jan 2002
• A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating point instructions)
• The three classes are responsible for r1 = 60%, r2 = 30%, and r3 = 10% of the overall execution time, respectively
• You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3
• The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10
• Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice
Solution
• r1 = 60%, r2 = 30%, r3 = 10%
 e1 = 2, e2 = 3, e3 = 10
• Amdahl’s law:
 S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
• s1 = 1.42857
 s2 = 1.25
 s3 = 1.0989
• s1 is the largest speedup  choose O1

Contenu connexe

Similaire à Advanced Computer Architectures – Part 2.1

Similaire à Advanced Computer Architectures – Part 2.1 (20)

“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
“eXtending” the Automation Toolbox: Introduction to TwinCAT 3 Software and eX...
 
L07_performance and cost in advanced hardware- computer architecture.pptx
L07_performance and cost in advanced hardware- computer architecture.pptxL07_performance and cost in advanced hardware- computer architecture.pptx
L07_performance and cost in advanced hardware- computer architecture.pptx
 
ERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdfERTS_Unit 1_PPT.pdf
ERTS_Unit 1_PPT.pdf
 
Unit i-introduction
Unit i-introductionUnit i-introduction
Unit i-introduction
 
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- PerformanceLec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance
 
L-2 (Computer Performance).ppt
L-2 (Computer Performance).pptL-2 (Computer Performance).ppt
L-2 (Computer Performance).ppt
 
Debate on RISC-CISC
Debate on RISC-CISCDebate on RISC-CISC
Debate on RISC-CISC
 
Computer architecture short note (version 8)
Computer architecture short note (version 8)Computer architecture short note (version 8)
Computer architecture short note (version 8)
 
Renesas DevCon 2010: Starting a QT Application with Minimal Boot
Renesas DevCon 2010: Starting a QT Application with Minimal BootRenesas DevCon 2010: Starting a QT Application with Minimal Boot
Renesas DevCon 2010: Starting a QT Application with Minimal Boot
 
Kiến trúc máy tính-COE 301 - Performance.ppt
Kiến trúc máy tính-COE 301 - Performance.pptKiến trúc máy tính-COE 301 - Performance.ppt
Kiến trúc máy tính-COE 301 - Performance.ppt
 
Introduction into the problems of developing parallel programs
Introduction into the problems of developing parallel programsIntroduction into the problems of developing parallel programs
Introduction into the problems of developing parallel programs
 
Pyconuk2011
Pyconuk2011Pyconuk2011
Pyconuk2011
 
MCS1_SJUSD_Resume
MCS1_SJUSD_ResumeMCS1_SJUSD_Resume
MCS1_SJUSD_Resume
 
ERTS 2008 - Using Linux for industrial projects
ERTS 2008 - Using Linux for industrial projectsERTS 2008 - Using Linux for industrial projects
ERTS 2008 - Using Linux for industrial projects
 
LatestCOCOMO model presentation for college students .pptx
LatestCOCOMO model presentation for college students .pptxLatestCOCOMO model presentation for college students .pptx
LatestCOCOMO model presentation for college students .pptx
 
Ch05
Ch05Ch05
Ch05
 
ELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot TimesELC-E 2010: The Right Approach to Minimal Boot Times
ELC-E 2010: The Right Approach to Minimal Boot Times
 
Computer performance
Computer performanceComputer performance
Computer performance
 
TI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBootTI TechDays 2010: swiftBoot
TI TechDays 2010: swiftBoot
 
Chapter_01.pptx
Chapter_01.pptxChapter_01.pptx
Chapter_01.pptx
 

Plus de Vincenzo De Florio

Considerations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniConsiderations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali Anani
Vincenzo De Florio
 

Plus de Vincenzo De Florio (20)

My little grundgestalten
My little grundgestaltenMy little grundgestalten
My little grundgestalten
 
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
Models and Concepts for Socio-technical Complex Systems: Towards Fractal Soci...
 
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
On the Role of Perception and Apperception in Ubiquitous and Pervasive Enviro...
 
Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...Service-oriented Communities: A Novel Organizational Architecture for Smarter...
Service-oriented Communities: A Novel Organizational Architecture for Smarter...
 
On codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiencesOn codes, machines, and environments: reflections and experiences
On codes, machines, and environments: reflections and experiences
 
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
Tapping Into the Wells of Social Energy: A Case Study Based on Falls Identifi...
 
How Resilient Are Our Societies? Analyses, Models, Preliminary Results
How Resilient Are Our Societies?Analyses, Models, Preliminary ResultsHow Resilient Are Our Societies?Analyses, Models, Preliminary Results
How Resilient Are Our Societies? Analyses, Models, Preliminary Results
 
Advanced C Language for Engineering
Advanced C Language for EngineeringAdvanced C Language for Engineering
Advanced C Language for Engineering
 
A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...A framework for trustworthiness assessment based on fidelity in cyber and phy...
A framework for trustworthiness assessment based on fidelity in cyber and phy...
 
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
Fractally-organized Connectionist Networks - Keynote speech @PEWET 2015
 
A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...A behavioural model for the discussion of resilience, elasticity, and antifra...
A behavioural model for the discussion of resilience, elasticity, and antifra...
 
Considerations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali AnaniConsiderations and ideas after reading a presentation by Ali Anani
Considerations and ideas after reading a presentation by Ali Anani
 
A Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and AntifragilityA Behavioral Interpretation of Resilience and Antifragility
A Behavioral Interpretation of Resilience and Antifragility
 
Community Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational ModelsCommunity Resilience: Challenges, Requirements, and Organizational Models
Community Resilience: Challenges, Requirements, and Organizational Models
 
On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-ResilienceOn the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
On the Behavioral Interpretation of System-Environment Fit and Auto-Resilience
 
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
Antifragility = Elasticity + Resilience + Machine Learning. Models and Algori...
 
Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...Service-oriented Communities and Fractal Social Organizations - Models and co...
Service-oriented Communities and Fractal Social Organizations - Models and co...
 
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
Seminarie Computernetwerken 2012-2013: Lecture I, 26-02-2013
 
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMINGTOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
TOWARDS PARSIMONIOUS RESOURCE ALLOCATION IN CONTEXT-AWARE N-VERSION PROGRAMMING
 
A Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a MultisetA Formal Model and an Algorithm for Generating the Permutations of a Multiset
A Formal Model and an Algorithm for Generating the Permutations of a Multiset
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Advanced Computer Architectures – Part 2.1

  • 1. Advanced Computer Architectures – HB49 – Part 2.1 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA
  • 2. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/2 Course contents • Basic Concepts Computer Design • Computer Architectures for AI • Computer Architectures in Practice
  • 3. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/3 Computer Design Quantitative assessments • Instruction sets • Pipelining
  • 4. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/4 Computer design • First part of the course: a survey of computer history • Key aspect of this history:  In the last 60 years computers have experienced a formidable growth in performance and a huge decrease in cost  A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties
  • 5. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/5 Computer design • How this could be possible? • Through  Advances in computer technology  Advances in computer design
  • 6. © V. De Florio KULeuven 2003 Basic Concepts Computer design • The tasks of a computer designer:  Determine key attributes for a new machine Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/6  E.g., design a machine that maximizes performance while keeping costs under control  Aspects:  Instruction set design  Functional organization  Logic design  Implementation (To be defined later)
  • 7. © V. De Florio KULeuven 2003 Basic Concepts Significant improvements • First 25 years:  From both technology and design • From the Seventies: Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/7  Mainly from IC technology  Main concern = compatibility with the past (to save investments)  Compatibility at ML  No room for design improvements  20-30% per year for mainframes and minis • Late Seventies: advent of the mP  Higher rate (35% per year)
  • 8. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/8 Significant improvements: the mP • The mP Mass-produced  lower costs Significant changes in the computer marketplace Higher-level language compatibility (no need for object code compatibility) Availability of standard, vendor-independent OSs (less risk and cost in producing a new architecture) made it possible to develop a new concept: RISC architectures
  • 9. © V. De Florio KULeuven 2003 Basic Concepts Significant improvements: RISC  RISC architectures  Designed in the Eighties, on the market ca. ‘85  Since then, a 50% improvement per year Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/9 [Chart: performance, 1987–1995, of MIPS M/120, Sun-4/260, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc; growth rate 1.35X/yr at first, 1.54X/yr later]
  • 10. © V. De Florio KULeuven 2003 Technology Trends Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/10 [Chart: performance trends, 1965–2000, for supercomputers, mainframes, minicomputers, and microprocessors (log scale)]
  • 11. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/11 Computer design • The mP allowed a 50% per year performance increase. How was that possible?  Enhanced capability for users  IBM Power 2 (1993) vs. Cray Y-MP (1988)  The fastest supercomputer in 1988 has approx. the same performance as the fastest 1993 workstation  Price: 1/10  Computers became more and more mP-based  Mainframes were disappearing or becoming based on off-the-shelf mPs
  • 12. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/12 Computer design • Big consequence  No more market urge for object code compatibility  Freedom from compatibility with old designs  Renaissance in computer design  Again, significant improvements from both technology and design  50% per year performance growth!
  • 13. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/13 Computer design • The highest-performance mP in ’95 is mainly a result of design improvements (1-to-5) • In this section we focus on the design techniques that allowed this state of affairs
  • 14. © V. De Florio KULeuven 2003 Performance Computer Design • What are the aspects to be taken into account in order to reach higher performance? • How to choose between different alternatives? Computer Architectures for AI  Amdahl’s law  Quantitative assessment Basic Concepts Computer Architectures In Practice 2.1/14
  • 15. © V. De Florio KULeuven 2003 Basic Concepts Amdahl’s law • Speed-up: S = (Execution time for entire task w/o using the “enhancement”) / (Execution time for entire task using the enhancement when possible) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/15 • Amdahl’s law on speed-up: • Speed-up depends on the fraction of time that may be affected by the enhancement
  • 16. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Amdahl’s law Let us call F the fraction of time affected by the enhancement For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time What do we gain by introducing the enhancement? Exec-timeNEW = Exec-timeOLD × ((1 − F) + F/SENH) where SENH is the speedup in the enhanced mode. Hence, Computer Architectures In Practice 2.1/16 S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − F) + F/SENH)
  • 17. © V. De Florio KULeuven 2003 Amdahl’s law Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/17 [Chart: overall speedup vs. SENH for F = 40%: SENH grows, but SOVER does not]
  • 18. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Amdahl’s law • Law of diminishing returns  the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added Computer Architectures for AI Computer Architectures In Practice lim SENH→∞ S = lim SENH→∞ 1 / ((1 − F) + F/SENH) = 1 / (1 − F) = SMAX 2.1/18
  • 19. © V. De Florio KULeuven 2003 Amdahl’s law Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/19 To reach a maximum speedup = 3, F must be at least 66%
  • 20. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/20 Amdahl’s law… • “…can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. • The goal, clearly, is to spend resources proportional to where time is spent.”
  • 21. © V. De Florio KULeuven 2003 Basic Concepts Amdahl’s law • Example 1 (p.30 P&H)  Method allows an improvement by a factor 10  That can be exploited for 40% of the time Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/21 speedupoverall = 1 / ((1 − fraction enhanced) + fraction enhanced / speedupenhanced) = 1 / ((1 − 0.4) + 0.4/10) ≈ 1.56
  • 22. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Amdahl’s law Example 2 (p.31 P&H)  50% of the instructions of a given benchmark are floating point instructions  FPSQR applies to 20% of the same benchmark  Alternative 1: extra hardware: FPSQR is 10 times faster  Alternative 2: all the FP instructions go 2 times faster 2.1/22 speedupFPSQR = 1 / ((1 − 0.2) + 0.2/10) ≈ 1.22 speedupFP = 1 / ((1 − 0.5) + 0.5/2.0) ≈ 1.33
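The two worked examples above can be reproduced with a small helper. This is an illustrative sketch only; the function name `amdahl_speedup` is ours, not from the slides.

```c
#include <assert.h>
#include <math.h>

/* Overall speedup per Amdahl's law:
   f = fraction of execution time affected by the enhancement,
   s = speedup of the enhanced fraction. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For Example 1, `amdahl_speedup(0.4, 10.0)` gives 1.5625; letting `s` grow without bound reproduces the 1/(1 − F) limit of the diminishing-returns slide.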
  • 23. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/23 Quantitative assessment • CPUTIME(p) = Time spent by the CPU to run program p • Clock cycle time = tcc , clock rate = 1/tcc • CPUTIME(p) = #clock cycles × tcc = #clock cycles / clock rate • E.g.: clock cycle time = 2 ns  clock rate = 500 MHz • #CC(p) = number of clock cycles spent in the execution of p
  • 24. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/24 Quantitative assessment • Instruction count • IC(c,p) = number of instructions that CPU c executed during the activity of program p • Often, IC(p)
  • 25. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/25 Quantitative assessment • Clock cycles per instruction • CPI(p) = #CC(p) / IC(p) average number of clock cycles needed to execute one instruction of p
  • 26. © V. De Florio KULeuven 2003 Quantitative assessment Basic Concepts • CPUTIME(p) = Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/26 = #clock cycles × clock cycle time = #CC(p) × tcc = IC(p) × CPI(p) × tcc = IC(p) × CPI(p) / clock rate  We can influence the performance of a given program p by optimizing the three key variables IC(p), CPI(p), and clock rate.
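The chain of equalities above can be checked numerically. A minimal sketch, with a function name (`cpu_time`) of our own choosing:

```c
#include <assert.h>
#include <math.h>

/* CPUTIME(p) = IC(p) x CPI(p) / clock rate, in seconds. */
double cpu_time(double instruction_count, double cpi, double clock_rate_hz)
{
    return instruction_count * cpi / clock_rate_hz;
}
```

E.g., a hypothetical program of one million instructions at CPI = 2 on the 500 MHz clock of the earlier slide takes 4 ms.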
  • 27. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/27 Quantitative assessment • CPU performance is equally dependent upon three characteristics  Clock rate (the higher, the better)  Clock cycles per instruction (the lower, the better)  Instruction count (the lower, the better)
  • 28. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/28 Quantitative assessment • CPU performance is equally dependent upon three characteristics  Clock rate (HW technology & organization)  Clock cycles per instruction (organization & instruction set architecture)  Instruction count (instruction set architecture & compiler technology) • Note: technologies are not independent of each other!
  • 29. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle) Computer Design Computer Architectures for AI Computer Architectures In Practice  Instruction count is affected by the compiler and the instruction set; CPI by the instruction set and the organization; cycle time (clock rate) by the organization and the technology 2.1/29
  • 30. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment • Decades-long challenge: optimizing CPUTIME(p) = IC(p) × CPI(p) / clock rate Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/30 • This is a function of p! • The choice of benchmarks is important
  • 31. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment • Which methods to use? CPUTIME(p) = IC(p) × CPI(p) / clock rate Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/31 • Method 1: increasing the clock rate (Note: independent of p!) • Method 2: those trying to decrease IC(p) • Method 3: those trying to decrease CPI(p) • Each method is equally important • Some methods are more effective
  • 32. © V. De Florio KULeuven 2003 Basic Concepts Quantitative assessment: how to calculate CPI? CPI = Σi=1..n (CPIi × ICi) / Instr. Count = Σi=1..n CPIi × (ICi / Instr. Count) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/32 ICi = number of times instruction i is executed by p CPIi = average number of clock cycles for instruction i CPIi needs to be measured and not just read from a table in the Reference Manual! That is, we need to take into account the memory access time! (Cache misses do count… a lot)
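The weighted sum above is straightforward to compute. A minimal sketch (the helper name `average_cpi` is ours):

```c
#include <assert.h>
#include <math.h>

/* Average CPI as the instruction-mix-weighted sum:
   CPI = sum_i CPI_i * (IC_i / total instruction count). */
double average_cpi(const double cpi[], const double ic[], int n)
{
    double total_instructions = 0.0, total_cycles = 0.0;
    for (int i = 0; i < n; i++) {
        total_instructions += ic[i];
        total_cycles       += cpi[i] * ic[i];
    }
    return total_cycles / total_instructions;
}
```

With the branch example that follows (20% of instructions at 2 cycles, 80% at 1 cycle), this yields 1.2.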
  • 33. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Quantitative assessment • Example 3: 2 alternatives for a conditional branch instruction  A: a CMP that sets a condition code (Z bit) followed by a JZ  B: a single instruction to do CMP and JZ Arch. A Computer Architectures for AI Computer Architectures In Practice 2.1/33 LD R1, 0 L: INC R1 CMP R1, 5 JZ L RET Arch. B LD R1, 0 L: INC R1 JRZ R1, 5, L RET We assume that JZ and JRZ take 2 cycles, all the other instructions take 1 cycle
  • 34. © V. De Florio KULeuven 2003 Quantitative assessment Arch. A Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice LD R1, 0 L: INC R1 CMP R1, 5 JZ L RET LD R1, 0 L: INC R1 JRZ R1,5,L RET Arch. B • 20% of the instructions are c.jumps (instructions such as JZ or JRZ) • 80% are other instructions • On A, for each c.jump there is a CMP  on A, 20% are c.jumps and 20% are CMP’s • 60% are other instructions Because of the extra complexity in B, the clock of A is faster (CTB = 1.25 CTA) 2.1/34
  • 35. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/35 Quantitative assessment • CPIA = Σi (ICi × cyclesi) / ICA = %c.jumpsA × cyclesBR + %othersA × cyclesother = 20% × 2 + 80% × 1 = 1.2 • CPUA = ICA × CPIA × CTA = ICA × 1.2 × CTA • CPIB = Σi (ICi × cyclesi) / ICB = %c.jumpsB × cyclesBR + %othersB × cyclesother
  • 36. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Quantitative assessment • Now, on B:  One spares 20% of the instructions (the extra cmp’s), hence: nBRB = 20 / (100 – 20) = 0.25 (25%)  Furthermore, ICB = 0.8 ICA • Hence CPIB = 0.25 x 2 + 0.75 x 1 = 1.25 • CPUB = ICB x CPIB x CTB = = 0.8 ICA x 1.25 x 1.25 CTA So CPUB = 1.25 x ICA x CTA CPUA = 1.2 x ICA x CTA So A is faster 2.1/36 (for which P?)
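The comparison worked out over the last two slides condenses into two one-liners. This is a sketch using the slides' figures, normalized to ICA × CTA = 1; the helper names are ours.

```c
#include <assert.h>
#include <math.h>

/* Architecture A: full instruction count, CPI = 0.20*2 + 0.80*1 = 1.2,
   baseline cycle time. */
double cpu_a(void) { return 1.0 * (0.20 * 2 + 0.80 * 1) * 1.0;  }

/* Architecture B: 0.8x the instructions (no separate CMPs),
   CPI = 0.25*2 + 0.75*1 = 1.25, but a 1.25x longer cycle time. */
double cpu_b(void) { return 0.8 * (0.25 * 2 + 0.75 * 1) * 1.25; }
```

As on the slide, CPUA = 1.2 and CPUB = 1.25 (in units of ICA × CTA), so A is faster for this instruction mix.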
  • 37. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/38 Performance • A straightforward enhancement is given by increasing the clock rate • The entire program benefits • Also, independent of the particular program • Dependent on the efficiency of the compiler etc.
  • 38. © V. De Florio KULeuven 2003 Clock Frequency Growth Rate Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: clock rate (MHz, log scale 0.1–1,000) vs. year, 1970–2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, R10000] • 30% per year 2.1/39
  • 39. © V. De Florio KULeuven 2003 Transistor Count Growth Rate Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: transistors per chip (log scale 1,000–100,000,000) vs. year, 1970–2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000] • 100 million transistors on chip in early year 2000. • Transistor count grows much faster than clock rate 2.1/40
  • 40. © V. De Florio KULeuven 2003 Basic Concepts Performance • Another important factor for performance is given by  Memory accesses  I/O (disk accesses) Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/43
  • 41. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Memory • Semiconductor DRAM technology  Density: increase of 60% per year (quadruplicates in 3 years)  Cycle time: much less improvement than this! Computer Architectures for AI Computer Architectures In Practice  Capacity / Speed: Logic 2x in 3 years / 2x in 3 years; DRAM 4x in 3 years / 1.4x in 10 years; Disk 2x in 3 years / 1.4x in 10 years  Speed increases of memory and I/O have not kept pace with processor speed increases. 2.1/44
  • 42. © V. De Florio KULeuven 2003 Memory size Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/45 [Chart: DRAM bits per chip vs. year, 1970–2000] year / size (Mb) / cycle time: 1980 0.0625 250 ns; 1983 0.25 220 ns; 1986 1 190 ns; 1989 4 165 ns; 1992 16 145 ns; 1996 64 120 ns; 2000 256 100 ns
  • 43. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/46 Basic definitions 1. Bandwidth: the rate at which data can be transferred. Bandwidth is typically measured in bytes per second. 2. Block size: the amount of data transferred per request. Block size is typically measured in bytes. 3. Latency: the time between making a request (e.g. to read or write a block of data) and completing the request. Latency is typically measured in seconds. 4. Throughput: The number of requests that can be completed per unit time. Throughput is typically measured in requests per second.
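Definitions 1 to 3 above relate in a simple way: the time to service one request is the latency to start it plus the block size divided by the bandwidth. A minimal sketch under that assumption (the function name is ours):

```c
#include <assert.h>
#include <math.h>

/* Time to complete one request: startup latency plus the time
   to move block_bytes at bw_bytes_per_s. */
double request_time(double latency_s, double block_bytes,
                    double bw_bytes_per_s)
{
    return latency_s + block_bytes / bw_bytes_per_s;
}
```

E.g., a hypothetical device with 5 ms latency and 4 MB/s bandwidth completes a 4096-byte request in about 6 ms, so latency dominates for small blocks.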
  • 44. © V. De Florio KULeuven 2003 Basic Concepts Memory • DRAM: main memory of all computers  Commodity chip industry: no company >20% share  Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM) Computer Design Computer Architectures for AI Computer Architectures In Practice • Capacity: 4X/3 years (60%/year)  Moore’s Law • MB/$: + 25%/year • Latency: – 7%/year, Bandwidth: + 20%/year (so far) SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips (measured in bytes, not bits); 32-bit path to memory. DIMM = dual in-line memory module; 64-bit path to memory. source: www.pricewatch.com, 5/21/98 2.1/47
  • 45. © V. De Florio KULeuven 2003 Processor Limit: DRAM Gap Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/48 [Chart: performance (log scale) vs. year, 1980–2000: “Moore’s Law” µProc improves 60%/yr, DRAM 7%/yr; the processor-memory performance gap grows 50%/year]
  • 46. © V. De Florio KULeuven 2003 Memory Summary Basic Concepts • DRAM: Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/49  rapid improvements in capacity, MB/$, bandwidth;  slow improvement in latency  Processor-memory interface is a bottleneck to delivered bandwidth
  • 47. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/50 Disk Components
  • 48. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/51 Disk Components: Platters • Platters: the recording surfaces. i. 1 to 8 inches in diameter (2.5 to 20 cm). ii. Stacked on a spindle: typical disks have 1-12 platters. iii. Data can be stored on one or both surfaces. iv. Spindle and platters rotate at 3600 - 10000 rpm (60-165 Hz). v. Recording density depends on applying a magnetic film with few defects. vi. Rotation rate limited by bearings and power consumption.
  • 49. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/52 Disk Components: Heads • i. Heads: write and read data to and from platters. Data stored as presence or absence of magnetization. ii. Head “floats” on air-film that rotates with the disk. Bernoulli effect pulls head toward disk but not into it. A dust particle can cause a “head crash” where the disk surface is scratched and any data on it is lost. iii. Disk heads are manufactured using thin film technology. Advancing technology allows smaller heads and therefore more closely spaced tracks and bits.
  • 50. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/53 Disk Components: Actuators • Actuators: move heads radially over the platters. i. Actuator arm needs to be light to move quickly. ii. Actuator arm needs to be stiff to prevent flexing. iii. Smaller platters allow shorter arms: therefore lighter and stiffer. iv. Actuators limited by • power of actuator motor and • weight and strength of actuator components
  • 51. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Disks: Data Layout • Each surface consists of concentric rings called tracks • Each track is divided into sectors. Data is written to and read from the disk a whole sector at a time • The set of tracks that are at the same relative position on each surface forms a cylinder Computer Architectures for AI Computer Architectures In Practice 2.1/54
  • 52. © V. De Florio KULeuven 2003 Three Components of Disk Access Time Basic Concepts 1. Seek time: the time to move the heads to the desired cylinder  Advertised to be 8 to 12 ms. May be lower in real life Computer Design 2. Rotational latency: the time for the desired sector to arrive under the head  4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM Computer Architectures for AI 3. Transfer time: the time to read the data from the disk and send it over the I/O bus to the processor  2 to 12 MB per second Computer Architectures In Practice [Diagram: a request passes through queue, controller (IOC), and device] Response time = Queue + Ctrl + Device Service time 2.1/55
  • 53. © V. De Florio KULeuven 2003 Hard Disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice Disk Latency = Queueing Time + Controller time + Seek Time + Rotation Time + Xfer Time Order of magnitude times for 4K byte transfers: Average Seek: 8 ms or less Rotate: 4.2 ms @ 7200 rpm Xfer: 1 ms @ 7200 rpm 2.1/56
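The latency breakdown above (ignoring queueing and controller time) can be sketched as follows; the average rotational latency is half a revolution, which reproduces the 4.1/8.3 ms figures from the previous slide. Function names are ours.

```c
#include <assert.h>
#include <math.h>

/* Average rotational latency: half a revolution, in seconds. */
double rotational_latency(double rpm)
{
    return 0.5 * 60.0 / rpm;
}

/* Simplified disk access time: seek + rotation + transfer. */
double disk_access(double seek_s, double rpm, double xfer_s)
{
    return seek_s + rotational_latency(rpm) + xfer_s;
}
```

With an 8 ms seek, 7200 rpm, and a 1 ms transfer, the access takes roughly 13 ms, so seek and rotation dominate small transfers.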
  • 54. © V. De Florio KULeuven 2003 Hard Disks • Capacity: + 60%/year (2X / 1.5 yrs) • Transfer rate (BW): + 40%/year (2X / 2.0 yrs) • Rotation + Seek time per access: – 8%/year (1/2 in 10 yrs) • MB/$: > 60%/year (2X / <1.5 yrs) Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice (Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth per byte) source: Ed Grochowski, 1996, “IBM leadership in disk drive technology”; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm, 2.1/57
  • 55. © V. De Florio KULeuven 2003 Hard disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/58 1973: 1.7 Mbit/sq. in, 140 MBytes 1979: 7.7 Mbit/sq. in, 2,300 MBytes
  • 56. © V. De Florio KULeuven 2003 Hard Disks Areal Density Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice [Chart: areal density (Mbit/sq. in, log scale 1–10,000) vs. year, 1970–2000] 1989: 63 Mbit/sq. in, 60,000 MBytes 1997: 1450 Mbit/sq. in, 1600 MBytes 1997: 3090 Mbit/sq. in, 8100 MBytes 2.1/59
  • 57. © V. De Florio KULeuven 2003 Hard Disks Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/60 • Continued advance in capacity (60%/yr) and bandwidth (40%/yr) • Slow improvement in seek, rotation (8%/yr) • Time to read whole disk  Year / Sequentially / Randomly: 1990 4 minutes 6 hours; 2000 12 minutes 1 week
  • 58. © V. De Florio KULeuven 2003 Memory/Disk Summary Basic Concepts • Memory:  DRAM rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/61 • Disk:  Continued advance in capacity, cost/bit, bandwidth; slow improvement in seek, rotation • Huge gap between CPU and external memories • How to address this problem? • Classical way: memory hierarchies
  • 59. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/62 Memory hierarchies • Axiom of HW designer: smaller is faster  Larger memories => larger signal delay  More levels are required to encode addresses  In a smaller memory the designer can use more power per cell => shorter access times • Crucial features for performance  Huge bandwidth (in MB/sec.)  Short access times • Principle of locality  The data most recently used is very likely to be accessed again in the near future (temporal locality)  Memory cells close to the most recently used one are likely to be accessed in the near future (spatial locality) • Combining the above with Amdahl’s law, the “best” enhancement is using hierarchies of memories
  • 60. © V. De Florio KULeuven 2003 Typical memory hierarchy (‘95) Basic Concepts [Diagram: CPU – Registers – Cache – memory bus – Memory – I/O bus – I/O devices] Size / Speed: Registers 200 B / 5 ns; Cache 64 KB / 10 ns; Memory 32 MB / 100 ns; I/O devices 2 GB / 5 ms Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/63
  • 61. © V. De Florio KULeuven 2003 Basic Concepts Memory hierarchies Computer Design Computer Architectures for AI Computer Architectures In Practice [Diagram: the hierarchy and its design issues: Input/Output and Storage (disks, WORM, tape, RAID, emerging technologies); DRAM (coherence, bandwidth, latency, interleaving, bus protocols); L2 cache, L1 cache (VLSI); Instruction Set Architecture (addressing, protection, exception handling); pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP; pipelining and instruction-level parallelism] 2.1/64
  • 62. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/65 Memory hierarchies • • • • • Registers: smallest and fastest memory Size: less than 1KB Access time: 2-5 ns Bandwidth: 4000-32000 MB/sec Managed by the compiler (or the assembly programmer)  register int a; • Special purpose vs. general purpose • Monolithic or double-shaped  Rx = Rl + Rh • Backed in cache • Implemented via custom memory with multiple ports
  • 63. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/66 Memory hierarchies • Cache = small, fast memory located close to the CPU • The cache holds the most recently accessed code or data  Managed by HW  No way to tell “put these data in cache” at SW  New research: cache-conscious data structures • • • • • Size: less than 4 MB Access time: 3-10 ns Bandwidth: 800-5000 MB/sec Backed in main memory Implemented with (on- or off-chip) CMOS SRAM
  • 64. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/67 Memory hierarchies • Cache terminology: cache hit, cache miss, cache block  Cache hit: the CPU has been able to find in cache the requested data  Cache miss: the requested data is not in the cache  Cache block: the fixed-size buffer used to load a portion of memory into the cache • A cache miss blocks the CPU until the corresponding memory block gets cached
  • 65. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/68 Memory hierarchies • Virtual memory: same principles behind the use of cache, but implemented between main memory and disk storage • At any point in time, not all the data referenced by p need to be in main memory • Address space is partitioned into fixedsize blocks: pages • A page is either in memory or on disk • When CPU references an item within a page if ( check_if_in_cache() == CACHE_MISS ) if ( check_if_in_memory() == MEM_MISS ) PageFault(); // Loads page in memory  CPU doesn’t stall – switches to other tasks
  • 66. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/69 Cache performance • Example: speedup using a cache  Cache 10 times faster than main memory  Cache is used 90% of the cases speedup = 1 / ((1 − fract. enhanced) + fract. enhanced / speedupenhanced) = 1 / ((1 − 0.9) + 0.9/10) ≈ 5.3
  • 67. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/70 Cache performance CPUtime = (CPU clock cycles + memory stall cycles) x clock cycle time Memory stall cycles = #misses × miss penalty = IC × #misses per instruction × miss penalty = IC × #memory references per instr. × miss rate × miss penalty
  • 68. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/71 Cache performance • Example (P&H, p.43)  A computer has a CPI = 2 when data is in cache  Memory access is only required by load and store instructions (40% of total #)  Miss penalty = 25 clock cycles  Cache miss rate = 2% ? How faster would the machine be when no cache miss occurs? CPUno-miss = (CPU clock cycles + memory stall cycles)  clock cycle time = (IC  CPI + 0)  clock cycle time = IC  2  clock cycle time
  • 69. © V. De Florio KULeuven 2003 Basic Concepts Cache performance ? How fast would the machine be when cache misses do occur? 1. Compute the memory stall cycles (msc) Computer Design msc = IC × memory references per instruction × miss rate × miss penalty = IC × (1 + 0.4) × 0.02 × 25 = IC × 0.7 (the 1 is for instruction access, the 0.4 for data access) Computer Architectures for AI Computer Architectures In Practice 2.1/72 2. Compute total performance: CPUcache = (CPU clock cycles + msc) × clock cycle time = (IC × 2 + IC × 0.7) × clock cycle time = 2.7 × IC × clock cycle time
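The stall computation above folds neatly into an "effective CPI". This is a sketch of that folding; the helper name `effective_cpi` is ours.

```c
#include <assert.h>
#include <math.h>

/* Effective CPI including memory stalls:
   CPI_eff = CPI_base + refs_per_instr * miss_rate * miss_penalty. */
double effective_cpi(double cpi_base, double refs_per_instr,
                     double miss_rate, double miss_penalty)
{
    return cpi_base + refs_per_instr * miss_rate * miss_penalty;
}
```

With the example's numbers (CPI = 2, 1.4 references per instruction, 2% miss rate, 25-cycle penalty), the effective CPI is 2.7, i.e. the machine with misses is 2.7/2 = 1.35 times slower.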
  • 70. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/73 Computer Design • Quantitative assessments Instruction sets • Pipelining
  • 71. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/74 Computer design • Instruction-set architecture:  The architecture of the machine level  The boundary between SW and HW • Organization:  High level aspects: memory system, bus structure, internal CPU design • Hardware:  The specifics of a machine: detailed logic design, packaging technology… • Architecture = I + O + H
  • 72. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/75 Instruction Sets • IS = Instruction sets = The architecture of the machine language • IS Classification • Roles of the compilers • DLX
  • 73. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/76 Computer Design  IS IS Classification • Role of the compilers • DLX
  • 74. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.1/77 Computer Design  IS  IS Classification • Key: type of internal storage in the CPU • Three main classes  Stack architectures  Accumulator architectures  General-purpose register architectures
  • 75. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design Computer Architectures for AI • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/78 [Stack after 1. PUSH A and 2. PUSH B: A, B; next: 3. ADD] Computer Architectures In Practice ADD = PUSH (POP + POP)
  • 76. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/79 [Stack during 3. ADD: B has been popped, A is on top] Computer Architectures for AI Computer Architectures In Practice ADD = PUSH (POP + POP) ADD = PUSH (B + POP)
  • 77. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • Stack architecture: • Operands are implicitly referred to • Top two items on the system stack • Example: C = A + B 2.1/80 [Stack after 3. ADD: B+A on top] Computer Architectures for AI Computer Architectures In Practice ADD = PUSH (POP + POP) ADD = PUSH (B + POP) ADD = PUSH (B + A)
  • 78. Computer Design  IS  IS Classification  Stack A. © V. De Florio KULeuven 2003 Basic Concepts Computer Design • • • • Stack architecture: Operands are implicitly referred to Top two items on the system stack Example: C = A + B 4. POP C 3. ADD Computer Architectures for AI 2. PUSH B 1. PUSH A Computer Architectures In Practice C = TOP STACK = A+B An example: the ARIEL virtual machine (Part 1, Slides 91 –) 2.1/81
• 79. Computer Design → IS → IS Classification → Accumulator A.
  • Accumulator architectures
  • A special register (the accumulator) plays the role of an implicit argument
  • Example: C = A + B
    1. LOAD A   ; let Acml = A
    2. ADD B    ; let Acml = Acml + B
    3. STORE C  ; let C = Acml
• 80. Computer Design → IS → IS Classification → Register A.
  • General-purpose register architectures
  • Explicit operands only
  • Either registers or memory locations
  • Two flavors:
    – Register-memory architectures (RMA)
    – Register-register architectures (RRA)
  • Example: C = A + B
    – RMA:
      Load R1, A
      Add R1, B       ; in C notation: R1 += B
      Store C, R1
    – RRA:
      Load R1, A
      Load R2, B
      Add R3, R1, R2
      Store C, R3
• 81. Computer Design → IS → IS Classification → RRA
  • Some old machines used stack or accumulator architectures
    – For instance, the T800 and the 6502/6510
  • Today the de facto standard is RRA
    – Registers are fast
    – Registers are easier to use (for compiler writers)
    – They do not require dealing with associativity issues — stacks do!
    – Registers can hold variables:
      register int i;
      for (i = 0; i < 1000000; i++) { do_stgh(i); … }
    – Using registers you don’t need a memory address
• 82. Computer Design → IS → IS Classification → Register A.
  • RRA: no memory operands
    – All instructions are similar in size → they take a similar number of clocks to execute (a very useful property… see later)
    – No side effects
    – Higher instruction count
  • RMA: one memory operand
    – One load can be spared
    – A register operand is destroyed (R1 += B)
    – Clocks per instruction vary with operand location
  • Memory-memory:
    – Compact
    – Large variation of work per instruction
    – Large variation in instruction size
• 83. Computer Design → IS → Memory addressing
  • How is memory organized?
  • What does it mean, e.g., to read memory at address 512?
  • What do we read?
    – Bytes, half-words, words, double words
  • How are consecutive bytes stored in a word? (Assumption: a word is 4 bytes)
    – Little endian: &word = &LSB
    – Big endian: &word = &MSB
    – XDR routines are needed to exchange data between the two (&word denotes the address of the word)
• 84. A memory model for didactics
  • Memory can be thought of as a finite, long array of cells, each of size 1 byte:
    0 1 2 3 4 5 6 7 …
  • Each cell has a label, called its address, and a content, i.e. the byte stored in it
  • Think of a chest of drawers, with a label on each drawer and possibly something in it
• 85. A memory model for didactics
  [Figure: a chest of drawers — each drawer carries an address (1–4) and holds some content]
• 86. A memory model for didactics
  • The character * has a special meaning
  • It refers to the contents of a cell
  • For instance: *(1)
  • Used to read, this character means we’re inspecting the contents of a cell (we open a drawer and see what’s in it)
• 87. A memory model for didactics
  • The character * has a special meaning
  • It refers to the contents of a cell
  • For instance: *(1)
  • Used to write, this character means we’re writing new contents into a cell (we open a drawer and change its contents)
• 88. A memory model for didactics
  • Memory is (often) byte-addressable, though it is organized into small groups of bytes: the machine word
  • A common size for the machine word is 4 bytes (32 bits)
  • Two possible organizations for the bytes in a word:
    – Little endian
    – Big endian
• 89. Little endian versus Big endian
  [Figure: two 8-byte memories laid out word by word]
  • Big endian (e.g. Motorola): within each word, the MSB sits at the lowest address — word 0 occupies addresses 0 (MSB) … 3 (LSB), word 1 occupies 4 (MSB) … 7 (LSB)
  • Little endian (e.g. Intel): within each word, the LSB sits at the lowest address — word 0 occupies addresses 0 (LSB) … 3 (MSB), word 1 occupies 4 (LSB) … 7 (MSB)
• 90. Little endian versus Big endian
  • Problem: communication between the two
  • The stored bytes are the same; interpreted as if they were the other endianness, the values differ:
    – A little-endian machine (Intel) stores 1 as bytes 01 00 00 00; a big-endian machine (Motorola) reads those bytes as 0x01000000 = 16777216
    – A big-endian machine stores 16 as bytes 00 00 00 10; a little-endian machine reads those bytes as 0x10000000 = 268435456
• 91. Computer Design → IS → Memory addressing
  • Alignment is mandatory on some machines
    – Object O; int t = sizeof(O);
    – ALIGNED(O) means “&O modulo t is 0” — “access to O is aligned”
    – For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4
    – Alignment is sometimes necessary because it prevents hardware complications
    – Alignment implies faster access
• 92. Computer Design → IS → Memory addressing
  • Addressing modes: ways to specify the address of an object in memory
  • An addressing mode can specify:
    – A constant
    – A register
    – A memory location
  • Notation used in what follows:
    A += B   means A = A + B
    *(x)     means “return the contents of memory at address x”
    x++      means “at the end, let x = x + 1”
    --x      means “at the beginning, let x = x – 1”
    Rx       means register x
• 93. Computer Design → IS → Memory addressing
  Mode           Example                Meaning
  Register       Add R4, R3             R4 += R3
  Immediate      Add R4, #3             R4 += 3
  Displacement   Add R4, 100(R1)        R4 += *(100 + R1)
  Indirect       Add R4, (R1)           R4 += *(R1)
  Indexed        Add R4, (R1 + R2)      R4 += *(R1 + R2)
  Absolute       Add R4, (100)          R4 += *(100)
  Deferred       Add R4, @(R3)          R4 += *(*(R3))
  Autoincrement  Add R4, (R3)+          Indirect, then R3++
  Autodecrement  Add R4, -(R2)          R2--, then indirect
  Scaled         Add R4, 100(R2)[R3]    R4 += *(100 + R2 + R3 * d)
  where d = size of the addressed data (1, 2, 4, 8, or 16)
• 94. Computer Design → IS → Memory addressing
  • Addressing modes can reduce IC
  • Complex addressing modes increase the complexity of the hardware → they can increase CPI
  • Displacement, immediate and deferred represent between 75% and 99% of the addressing modes used (experiments done with TeX, spice, and gcc)
  • IC(p) = instruction count = the number of instructions that the CPU executes during the activity of program p
  • CPI(p) = clock cycles per instruction = #CC(p) / IC(p) = the average number of clock cycles needed to execute one instruction of p
• 95. Computer Design → IS → Operations
  • Arithmetical and logical (add, and, sub…)
  • Data transfer (move, store)
  • Control (br, jmp, call, ret, iret…)
  • System (virtual memory management…)
  • Floating point (add, mul, …)
  • Decimal (decimal add, decimal mul…)
  • String (string move, string compare, string search)
  • Graphics (pixel operations)
  • Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed (see Fig. 2.11, P&H p. 81)
• 96. Computer Design → IS → Operations
  • Control flow instructions:
    – Branch (conditional change)
    – Jump (unconditional change)
    – Procedure calls
    – Procedure returns
  • Most of the comparisons in conditional branches are simple “==” and “!=” tests against 0!
  • In some cases, the address to go to is only known at run-time:
    – “Return” uses a stack
    – Switch statements
    – Dynamic libraries
• 97. Computer Design → IS → Operands
  • When we say, e.g., “Add R1, #5”, do we work with bytes? Half-words? Words?
  • How do we specify the type of the operand?
  1. Classical method: the type of the operand is part of the opcode
  • The Add family is coded as ffff…fffvv, where the f are fixed bits and the v are bits that specify the type
• 98. Computer Design → IS → Operands and types
  • Example: Add family = 10110101000100vv
    1011010100010000 = Add float words
    1011010100010001 = Add words
    1011010100010010 = Add half-words
    1011010100010011 = Add bytes
  2. Old-fashioned method: operand = data + tag
  • The tag describes a type
  • The tag is interpreted by HW
  • The operation is chosen accordingly
• 99. Computer Design → IS → Operands and types
  • Which types to support?
  • Old-fashioned solution: all of them (bytes, half-words, words, floating point, double words, double-precision floating point, …)
  • Current trend: only operations on items greater than or equal to 32 bits
  • On the DEC Alpha one needs multiple instructions to access objects smaller than 32 bits
• 100. Computer Design → IS → Operands and types
  • Floating-point numbers: IEEE standard 754
  • In the early ’80s, each manufacturer had its own floating-point representation
  • Sometimes string operations are available (strcmp, strcpy…)
  • Sometimes BCD is used to code numbers:
    – Four bits are used to code a decimal digit
    – A byte codes two decimal digits
    – Functions for “packing” and “unpacking” are required
    – It is unclear whether this will stay in the future
• 101. Computer Design → IS
  • IS Classification
  Role of the compilers
  • DLX
• 102. Computer Design → IS → Role of the compiler
  • In the past, the role of assembly language was crucial
  • Architectural decisions aimed at easing assembly-language programming
  • Now, the user interface is a high-level language (C, C++, Java…)
  • The user interfaces with the machine via the HLL, though the machine actually executes some lower-level code
  • This lower-level code is produced by a compiler
    – The role of the compiler is fundamental
    – The IS architecture needs to take the compiler into strong account
• 103. Computer Design → IS → Role of the compiler
  • Goals of the compiler writer:
    – Correctness
    – Performance
    – …fast compilation, debugging support, …
  • Strategy for writing a compiler:
    – Use a number of “passes”, from high-level structures down to lower levels, until machine level
    – This way complexity is decomposed into smaller blocks
    – Optimizing, however, becomes more difficult
• 104. Computer Design → IS → Role of the compiler
  Pass            Dependencies                  Function
  Front-end       D(language)                   Language → common intermediate form
  HL Opt          D(language)                   Loop transformations, function inlining…
  Global Opt      D(language), D(machine)       Register allocation…
  Code generator  D(machine)                    Instruction selection, machine-dependent opt.
• 105. Computer Design → IS → Role of the compiler
  • HL optimizations: source-level optimizations (code → code’)
  • Local optimizations: basic-block optimizations
  • Global optimizations: loop optimizations and basic-block optimizations
  • Machine-dependent optimizations: using low-level architectural knowledge
  • Basic block = a straight-line code fragment
• 106. Computer Design → IS → Role of the compiler
  • Compilers have different optimization levels: -O1 .. -On
  • Optimization can have a big impact on instruction count → on performance
• 107. Computer Design → IS → Role of the compiler
  [Figure-only slide: no text content]
• 108. Computer Design → IS → Role of the compiler
  • In some cases, though, optimization may be counterproductive!
  • This happens because there might be conflicts between local and global optimization tasks
  • Example (the same expression appears twice):
    a = sqrt(x*x + y*y) + f() …;
    b = sqrt(x*x + y*y) + g() …;
  • Idea:
    tmp = sqrt(x*x + y*y);
    a = tmp + f() …;
    b = tmp + g() …;
• 109. Computer Design → IS → Role of the compiler
  • Effective, but only if tmp can be stored in a register
  • No register → in memory → cache misses → … bad performance
  • The problem is:
    – When the compiler performs code transformations like the one in the example, it does not yet know whether a register will actually be available
    – This will only become clear later (at the global optimization level)
  • (The phase-ordering problem)
• 110. Computer Design → IS → Role of the compiler
  • The key resource is the register file
  • “Intelligent” register allocation techniques are a must
  • Current solution: graph coloring (a graph whose nodes are the candidates for allocation to a register)
  • NP-complete, though effective heuristic algorithms exist
• 111. Computer Design → IS → Role of the compiler
  • A special class of compilers: algorithm-driven software generation
    – The FFTW approach: a software generation system based on symbolic computation
    – Written in Objective Caml
    – A sort of FFT compiler that generates optimal C code via symbolic computing
    – Possible future steps (project works, theses…): extending the approach down to code generation for, e.g., the TI ’C67 DSP and other VLIW CPUs
• 112. Exam of 16 Jan 2002
  • A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating-point instructions)
  • The three classes are responsible for r1 = 60%, r2 = 30% and r3 = 10% of the overall execution time, respectively
  • You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3
  • The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10
  • Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice
• 113. Solution
  • r1 = 60%, r2 = 30%, r3 = 10%; e1 = 2, e2 = 3, e3 = 10
  • By Amdahl’s law:
    S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 - r) + r / e)
  • s1 = 1 / (0.4 + 0.6/2)  = 1.42857
    s2 = 1 / (0.7 + 0.3/3)  = 1.25
    s3 = 1 / (0.9 + 0.1/10) = 1.0989
  • Hence O1 is the best choice: it yields the largest overall speedup