3. Course Overview [contd…]
[Pipeline diagram: successive instructions overlapped across the IFetch, Dcd, Exec, Mem, and WB stages, previewing the course topics of Performance, Pipelining, and Memory Systems]
4. What You Will Learn
• How programs are translated into machine language
– And how the hardware executes them
• The hardware/software interface
• What determines program performance
– And how it can be improved
• How hardware designers improve
performance
• What is parallel processing
5. What’s In It For Me ?
• In-depth understanding of the inner-workings of
modern computers, their evolution, and trade-
offs present at the hardware/software boundary.
– Insight into fast/slow operations that are easy/hard to
implement in hardware
• Experience with the design process in the
context of a large complex (hardware) design.
– Functional Spec --> Control & Datapath --> Physical
implementation
– Modern CAD tools
6. Computer Architecture - Definition
• Computer Architecture = ISA + MO
• Instruction Set Architecture
– What the executable can “see” as underlying hardware
– Logical View
• Machine Organization
– How the hardware implements the ISA
– Physical View
7. Computer Architecture – Changing Definition
• 1950s to 1960s: Computer Architecture Course:
–Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course:
–Instruction Set Design, especially ISA appropriate for compilers
• 1990s: Computer Architecture Course:
–Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2000s: Computer Architecture Course:
–Non-von Neumann architectures, Reconfiguration
• DNA Computing, Quantum Computing ????
8. Some Examples …
° Digital Alpha (v1, v3) 1992-97
° HP PA-RISC (v1.1, v2.0) 1986-96
° Sun SPARC (v8, v9) 1987-95
° SGI MIPS (MIPS I, II, III, IV, V) 1986-96
° IA-16/32 (8086, 286, 386, 486, Pentium, MMX, SSE, …) 1978-1999
° IA-64 (Itanium) 1996-now
° AMD64/EMT64 2002-now
° IBM POWER (PowerPC,…) 1990-now
° Many dead processor architectures live on in
microcontrollers
9. Generations of Computers
• Vacuum tube - 1946-1957
• Transistor - 1958-1964
• Small scale integration - 1965 on
– Up to 100 devices on a chip
• Medium scale integration - to 1971
– 100-3,000 devices on a chip
• Large scale integration - 1971-1977
– 3,000 - 100,000 devices on a chip
• Very large scale integration - 1978 to date
– 100,000 - 100,000,000 devices on a chip
• Ultra large scale integration
– Over 100,000,000 devices on a chip
10. The MIPS R3000 ISA (Summary)
• Instruction Categories
– Load/Store
– Computational
– Jump and Branch
– Floating Point (coprocessor)
– Memory Management
– Special
• Registers: R0 - R31, PC, HI, LO
• 3 Instruction Formats, all 32 bits wide:
– R-type: OP | rs | rt | rd | sa | funct
– I-type: OP | rs | rt | immediate
– J-type: OP | jump target
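The fields above occupy fixed bit positions in the 32-bit word. A minimal C sketch of that field extraction (the shift amounts are the standard MIPS field boundaries; the example encoding of lw $15, 0($2) is worked out by hand and should be treated as illustrative):

    #include <stdint.h>
    #include <stdio.h>

    /* MIPS R3000 fields:
     *   R-type: op(6) rs(5) rt(5) rd(5) sa(5) funct(6)
     *   I-type: op(6) rs(5) rt(5) immediate(16)
     *   J-type: op(6) target(26)                        */
    static void decode(uint32_t instr) {
        unsigned op     = (instr >> 26) & 0x3F;
        unsigned rs     = (instr >> 21) & 0x1F;
        unsigned rt     = (instr >> 16) & 0x1F;
        unsigned rd     = (instr >> 11) & 0x1F;
        unsigned sa     = (instr >>  6) & 0x1F;
        unsigned funct  =  instr        & 0x3F;
        unsigned imm    =  instr        & 0xFFFF;
        unsigned target =  instr        & 0x03FFFFFF;
        printf("op=%u rs=%u rt=%u rd=%u sa=%u funct=%u imm=%u target=%u\n",
               op, rs, rt, rd, sa, funct, imm, target);
    }

    int main(void) {
        decode(0x8C4F0000);   /* lw $15, 0($2): op=35, rs=2, rt=15, imm=0 */
        return 0;
    }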
11. “What” is Computer Architecture ?
[Layered diagram: coordination across levels of abstraction]
Application
Operating System
Compiler / Firmware
Instruction Set Architecture
Instr. Set Proc. / I/O system   (ECE 321 focus)
Datapath & Control
Digital Design
Circuit Design
Layout
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
12. Impact of Changing ISA
• Early 1990’s Apple switched instruction set
architecture of the Macintosh
– From Motorola 68000-based machines
– To PowerPC architecture
• Intel 80x86 Family: many implementations
of same architecture
– program written in 1978 for 8086 can be run
on latest Pentium chip
13. Factors Affecting ISA ???
[Diagram: forces shaping Computer Architecture]
Technology, Programming Languages, Applications, Operating Systems, History, Cleverness
15. The Big Picture
[Diagram: the five classic components]
Processor (Control + Datapath), Memory, Input, Output
Since 1946 all computers have had 5 components!!!
16. Example Organization
• TI SuperSPARC TMS390Z50 in Sun SPARCstation 20
[Block diagram: MBus module holding the SuperSPARC CPU (Integer Unit, Floating-point Unit, Instruction and Data Caches, MMU, Store Buffer, Bus Interface) plus an L2 cache controller (L2 CC) and DRAM; an L64852 MBus-to-SBus (M-S) adapter connects the MBus to SBus cards, SCSI, Ethernet, DMA, STDIO (serial, keyboard, mouse, audio), RTC, and Floppy]
17. Moore’s Law
• Increased density of components on chip
• Gordon Moore - cofounder of Intel
• Number of transistors on a chip doubles roughly every 18-24 months
• Since the 1970s development has slowed a little
– Number of transistors doubles every 24 months
• Cost of a chip has remained almost unchanged
• Higher packing density means shorter electrical paths,
giving higher performance
• Smaller size gives increased flexibility
• Reduced power and cooling requirements
• Fewer interconnections increases reliability
18. Technology Trends
• Processor
– logic capacity: about 30% per year
– clock rate: about 20% per year
• Memory
– DRAM capacity: about 60% per year (4x every 3 years)
– Memory speed: about 10% per year
– Cost per bit: improves about 25% per year
• Disk
– capacity: about 60% per year
– Total use of data: growing about 100% every 9 months!
• Network Bandwidth
– Bandwidth increasing more than 100% per year!
19. Technology Trends
Microprocessor Logic Density and DRAM Chip Capacity
[Chart: transistors per chip, 1965-2005, from the i4004 through the M68K, i8086/i80x86, i80286, i80386, i80486, MIPS R3010, R4400, R10000, Alpha, and Pentium, plotted alongside DRAM capacity]
DRAM chip capacity by year: 1980: 64 Kb; 1983: 256 Kb; 1986: 1 Mb; 1989: 4 Mb; 1992: 16 Mb; 1996: 64 Mb; 1999: 256 Mb; 2002: 1 Gb
° In ~1985 the single-chip processor (32-bit) and the single-board computer emerged
° In ~2002 started having multiple processor cores on a chip (IBM POWER4)
22. Levels of Representation
High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
        ↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw  $15, 0($2)
    lw  $16, 4($2)
    sw  $16, 0($2)
    sw  $15, 4($2)
        ↓ Assembler
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
        ↓ Machine Interpretation
Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK
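You can reproduce the first translation step on any UNIX-like machine: most compilers will stop after generating assembly. For example with gcc (the flags and file name here are illustrative only):

    $ gcc -O1 -S swap.c    # stops after compilation and writes the generated assembly to swap.s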
23. Execution Cycle
Instruction Fetch   – Obtain instruction from program storage
Instruction Decode  – Determine required actions and instruction size
Operand Fetch       – Locate and obtain operand data
Execute             – Compute result value or status
Result Store        – Deposit results in storage for later use
Next Instruction    – Determine successor instruction
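A minimal C sketch of one trip around this cycle for a toy machine; the opcodes, field layout, register count, and memory size are invented for illustration and are not the MIPS encoding:

    #include <stdint.h>

    enum { ADD = 0, LOAD = 1, STORE = 2, HALT = 3 };   /* toy opcodes */

    uint32_t mem[1024];     /* program and data storage        */
    uint32_t reg[8];        /* register file                   */
    uint32_t pc = 0;        /* address of the next instruction */

    void run(void) {
        for (;;) {
            uint32_t instr = mem[pc];              /* Instruction Fetch  */
            uint32_t op = (instr >> 28) & 0xF;     /* Instruction Decode */
            uint32_t a  = (instr >> 24) & 0x7;
            uint32_t b  = (instr >> 20) & 0x7;
            uint32_t va = reg[a], vb = reg[b];     /* Operand Fetch      */

            if (op == HALT)  return;
            if (op == ADD)   reg[a] = va + vb;     /* Execute + Result Store */
            if (op == LOAD)  reg[a] = mem[vb];
            if (op == STORE) mem[vb] = va;

            pc = pc + 1;                           /* Next Instruction   */
        }
    }

    int main(void) { mem[0] = (uint32_t)HALT << 28; run(); return 0; }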
25. Understanding Performance
• Algorithm
– Determines number of operations executed
• Programming language, compiler, architecture
– Determine number of machine instructions executed
per operation
• Processor and memory system
– Determine how fast instructions are executed
• I/O system (including OS)
– Determines how fast I/O operations are executed
27. Performance Metrics
• Response Time
– Delay between start and end time of a task
• Throughput
– Number of tasks completed per unit time
• New: Power/Energy
– Energy per task, power
28. CPU Clocking
• Operation of digital hardware governed by a
constant-rate clock
[Clock waveform: data transfer and computation occur within each clock period; state is updated at the clock edge]
Clock period: duration of a clock cycle
    e.g., 250 ps = 0.25 ns = 250 × 10^-12 s
Clock frequency (rate): cycles per second
    e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz
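Period and frequency are reciprocals, so the two example values above describe the same clock. A tiny C check (purely illustrative):

    #include <stdio.h>

    int main(void) {
        double period_s = 250e-12;            /* 250 ps clock period */
        double freq_hz  = 1.0 / period_s;     /* = 4.0e9 Hz          */
        printf("%.1f GHz\n", freq_hz / 1e9);  /* prints 4.0          */
        return 0;
    }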
29. Examples (Throughput/Performance)
• Replace the processor with a faster
version?
– 3.8 GHz instead of 3.2 GHz
• Add an additional processor to a system?
– Core Duo instead of P4
30. Measuring Performance
• Wall clock time vs. Total execution time
• CPU Time
– User Time
– System Time
Try using the time command on a UNIX system
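For example, with the shell's time built-in (the figures below are made up; they vary with machine and load):

    $ time ./myprogram

    real    0m3.450s
    user    0m2.910s
    sys     0m0.310s

Here real is elapsed wall-clock time, while user and sys are the CPU time spent in the program itself and in the operating system on its behalf.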
31. Relating the Metrics
• Performance = 1/Execution Time
• CPU Execution Time = CPU clock cycles
for program x Clock cycle time
• CPU clock cycles = Instructions for a
program x Average clock cycles per
Instruction
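A minimal C sketch chaining these two relations together (the instruction count, CPI, and cycle time are assumed values, not measurements):

    #include <stdio.h>

    int main(void) {
        double instructions = 2.0e9;     /* instructions executed (assumed)          */
        double cpi          = 1.5;       /* average clock cycles per instruction     */
        double cycle_time_s = 0.25e-9;   /* clock cycle time: 250 ps (a 4 GHz clock) */

        double clock_cycles = instructions * cpi;          /* CPU clock cycles    */
        double cpu_time_s   = clock_cycles * cycle_time_s; /* CPU execution time  */

        printf("CPU time = %.3f s, performance = %.3f programs/s\n",
               cpu_time_s, 1.0 / cpu_time_s);
        return 0;
    }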
32. Performance Summary
The BIG Picture
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
33. SPEC CPU Benchmark
• Programs used to measure performance
– Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
– Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
– Elapsed time to execute a selection of programs
• Negligible I/O, so focuses on CPU performance
– Normalize relative to reference machine
– Summarize as geometric mean of performance ratios
• CINT2006 (integer) and CFP2006 (floating-point)
Overall score = geometric mean of the n ratios = ( ∏ i = 1..n  Execution time ratio_i )^(1/n)
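Each ratio is the reference machine's time divided by the measured execution time (e.g., 9,777 / 637 ≈ 15.3 for perl in the next slide's table). A small C sketch of the geometric mean; the sample ratios are the first few SPECratios from that table and the function name is mine:

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n ratios: (r1 * r2 * ... * rn)^(1/n).
     * Summing logarithms avoids overflow for large products. */
    double geometric_mean(const double *ratio, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ratio[i]);
        return exp(log_sum / n);
    }

    int main(void) {
        double spec_ratio[] = { 15.3, 11.8, 11.1, 6.8 };
        printf("%.1f\n", geometric_mean(spec_ratio, 4));   /* link with -lm */
        return 0;
    }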
34. CINT2006 for Opteron X4 2356
Name | Description | IC (×10^9) | CPI | Tc (ns) | Exec time (s) | Ref time (s) | SPECratio
perl | Interpreted string processing | 2,118 | 0.75 | 0.40 | 637 | 9,777 | 15.3
bzip2 | Block-sorting compression | 2,389 | 0.85 | 0.40 | 817 | 9,650 | 11.8
gcc | GNU C Compiler | 1,050 | 1.72 | 0.47 | 724 | 8,050 | 11.1
mcf | Combinatorial optimization | 336 | 10.00 | 0.40 | 1,345 | 9,120 | 6.8
go | Go game (AI) | 1,658 | 1.09 | 0.40 | 721 | 10,490 | 14.6
hmmer | Search gene sequence | 2,783 | 0.80 | 0.40 | 890 | 9,330 | 10.5
sjeng | Chess game (AI) | 2,176 | 0.96 | 0.48 | 837 | 12,100 | 14.5
libquantum | Quantum computer simulation | 1,623 | 1.61 | 0.40 | 1,047 | 20,720 | 19.8
h264avc | Video compression | 3,102 | 0.80 | 0.40 | 993 | 22,130 | 22.3
omnetpp | Discrete event simulation | 587 | 2.94 | 0.40 | 690 | 6,250 | 9.1
astar | Games/path finding | 1,082 | 1.79 | 0.40 | 773 | 7,020 | 9.1
xalancbmk | XML parsing | 1,058 | 2.70 | 0.40 | 1,143 | 6,900 | 6.0
Geometric mean | | | | | | | 11.7
(The unusually high CPI of mcf reflects high cache miss rates.)
35. Amdahl’s Law
• Pitfall: Expecting the improvement of one aspect
of a machine to increase overall performance by
an amount proportional to the size of
improvement
36. Amdahl’s Law [contd…]
• A program runs in 100 seconds on a machine
• Multiply operations responsible for 80 seconds of this time.
• How much do I have to improve the speed of multiplication if I want
my program to run 5 times faster ?
• Execution Time After improvement =
(exec time affected by improvement/amount of improvement) + exec
time unaffected
exec time after improvement = (80 seconds / n) + (100 – 80 seconds)
We want the program to run 5 times faster, i.e. in 20 seconds =>
20 seconds = (80 / n) seconds + 20 seconds
0 = 80 / n, which no finite n can satisfy: speeding up multiplication alone cannot make this program 5 times faster.
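The same calculation as a small C sketch (names are mine); it shows the overall speedup creeping toward, but never reaching, 5x no matter how much faster multiplication gets:

    #include <stdio.h>

    /* Amdahl's Law: new time = affected / improvement + unaffected */
    double time_after(double total, double affected, double improvement) {
        return affected / improvement + (total - affected);
    }

    int main(void) {
        double total = 100.0, affected = 80.0;   /* seconds, from the example above */
        for (double n = 2.0; n <= 1024.0; n *= 4.0) {
            double t = time_after(total, affected, n);
            printf("multiply %5.0fx faster -> %6.2f s (overall speedup %.2fx)\n",
                   n, t, total / t);
        }
        return 0;
    }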
37. Amdahl’s Law [contd…]
• Opportunity for improvement is affected by
how much time the event consumes
• Make the common case fast
• Very high speedup requires making nearly
every case fast
• Focus on overall performance, not one
aspect
38. Summary
• Computer Architecture = Instruction Set Architecture + Machine
Organization
• All computers consist of five components
– Processor: (1) datapath and (2) control
– (3) Memory
– (4) Input devices and (5) Output devices
• Not all “memory” is created equal
– Cache: fast (expensive) memory placed closer to the
processor
– Main memory: less expensive memory, so we can have more of it
• Interfaces are where the problems are - between functional units
and between the computer and the outside world
• Need to design against constraints of performance, power, area and
cost
39. Summary
• Performance is in the “eye of the beholder”
Seconds/program =
(Instructions/Pgm) × (Clk Cycles/Instruction) × (Seconds/Clk cycle)
• Amdahl’s Law “Make the Common Case
Fast”
40. Homework
• Chapter 1
• 1.3, 1.4, 1.10, 1.15, 1.16 (first 4 parts of
each question)
• Due Next Tuesday