This document discusses various attributes that influence computer system performance. It covers topics like instruction count, cycles per instruction, processor cycle time, memory access latency, and how factors like instruction set architecture, compiler technology, processor implementation, and memory hierarchy can affect these performance attributes and metrics like instructions per second. It also summarizes different types of parallel computer architectures like shared-memory multiprocessors, distributed-memory multicomputers, vector supercomputers and SIMD machines.
2. CPU/Processor driven by:
A clock with a constant cycle time (τ), typically in nanoseconds.
Clock rate: f = 1/τ (e.g., in megahertz).
Ic (Instruction Count): size of the program, i.e., the number of
machine instructions to be executed in the program.
Different machine instructions need different numbers of
clock cycles to execute.
CPI (Cycles per Instruction): number of clock cycles needed to
execute an instruction.
Average CPI: the average over a given instruction set (and instruction mix).
3. Performance Factors:
CPU time (T): time needed to execute a program,
in seconds/program.
T = CPU time = Ic * CPI * τ
Execution of an instruction goes through a cycle of
events:
Instruction fetch
Decode
Operand(s) fetch
Execution
Store results
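As a quick numeric sketch of the formula above (all values are hypothetical, chosen only to illustrate the units):

```python
# CPU time: T = Ic * CPI * tau (hypothetical example values)
Ic  = 200_000_000   # instruction count: machine instructions in the program
CPI = 2.5           # average clock cycles per instruction
tau = 2e-9          # clock cycle time in seconds (2 ns, i.e. f = 500 MHz)

T = Ic * CPI * tau  # seconds per program
print(T)            # 1.0 second
```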
4. Events carried out in the CPU:
Instruction decode
Execution phases
The remaining three (instruction fetch, operand fetch, store
results) require access to the memory.
Memory cycle:
Time needed to complete one memory reference.
Note: a memory cycle is k times the processor cycle τ;
k depends upon the speed of the memory technology.
With p processor cycles and m memory references per
instruction, the effective CPI is p + m*k, so
T = Ic * (p + m*k) * τ.
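A small sketch of how memory references stretch the effective CPI, using the factors p, m, and k that the next slide discusses (the numbers are hypothetical):

```python
# Effective CPI including memory references (hypothetical values):
#   CPI = p + m*k,   T = Ic * (p + m*k) * tau
p   = 4            # processor cycles per instruction (decode + execute)
m   = 2            # memory references per instruction
k   = 10           # one memory cycle = k processor cycles
tau = 1e-9         # processor cycle time (1 ns)
Ic  = 50_000_000   # instructions in the program

CPI = p + m * k        # effective cycles per instruction
T   = Ic * CPI * tau   # CPU time in seconds
print(CPI, T)
```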
6. System Attributes' Influence on the Performance Factors (Ic,
p, m, k, τ):
1. Instruction-set architecture:
Affects the program length (Ic) and the processor
cycles needed per instruction (p).
2. Compiler technology:
Affects the values of Ic, p, and m.
3. CPU implementation & control:
Determines the total processor time (p * τ).
4. Cache & memory hierarchy:
Affects the memory access latency (k * τ).
7. System Attributes vs. Performance Factors:

System Attribute               Ic    p    m    k    τ
Instruction-Set Architecture   X     X
Compiler Technology            X     X    X
Processor Implementation
  & Control                          X              X
Cache & Memory Hierarchy                       X    X

(Ic = instruction count; p = processor cycles per instruction;
m = memory references per instruction; k = memory access latency
in processor cycles; τ = processor cycle time; p, m, k together
give the average CPI.)
8. MIPS Rate: Million Instructions per Second
C = total number of clock cycles needed to execute a program
T = C * τ = C/f
CPI = C/Ic
T = Ic * CPI * τ = (Ic * CPI)/f
MIPS rate = Ic/(T * 10^6) = f/(CPI * 10^6)
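A minimal numeric sketch of the MIPS rate, derived from the clock rate and average CPI (hypothetical values):

```python
# MIPS rate = f / (CPI * 10^6), hypothetical machine parameters
f   = 500e6   # clock rate in Hz (500 MHz)
CPI = 2.5     # average clock cycles per instruction

MIPS = f / (CPI * 1e6)   # million instructions per second
print(MIPS)              # 200.0
```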
9. Throughput Rate (Ws):
Number of programs a system can execute per unit time.
Ws = programs/second
Note: in a multiprogrammed system, the system throughput
(Ws) is often lower than the CPU throughput (Wp).
Wp = f/(Ic * CPI)
   = 1/(Ic * CPI * τ)
   = 1 program/T
Ws = Wp if the CPU is kept busy in a perfect
program-interleaving fashion.
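The CPU throughput formula above can be sketched with the same hypothetical parameters used earlier:

```python
# CPU throughput: Wp = f / (Ic * CPI) programs/second (hypothetical values)
f   = 500e6        # clock rate in Hz (500 MHz)
Ic  = 100_000_000  # instructions per program
CPI = 2.5          # average clock cycles per instruction

Wp = f / (Ic * CPI)   # programs executed per second by the CPU
print(Wp)             # 2.0
```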
10. Two approaches to parallel programming:
Approach 1: a sequentially coded source program; the
compiler detects parallelism and assigns target
machine resources.
Note: this compiler approach is applied in programming
shared-memory multiprocessors.
11. Approach 2: parallel dialects of C …
• Parallelism is specified explicitly in the user program.
Note: this approach is applied in multicomputers.
12. Parallel Computer Architectural Models /
Physical Models
Distinguished by having:
1. Shared common memory:
Three shared-memory multiprocessor models are:
i. UMA (Uniform Memory Access)
ii. NUMA (Non-Uniform Memory Access)
iii. COMA (Cache-Only Memory Architecture)
2. Unshared distributed memory
i. CC-NUMA (Cache-Coherent NUMA)
14. UMA Multiprocessor Model
Physical memory is uniformly shared by all the
processors.
All processors have equal access time to all memory
words, hence the name Uniform Memory Access.
Peripherals are also shared in some fashion.
Also called tightly coupled systems, due to the high
degree of resource sharing.
15. Symmetric vs. Asymmetric Multiprocessors
Symmetric multiprocessor: all processors have equal
access to all peripheral devices.
Asymmetric multiprocessor:
Only one, or a subset, of the processors is executive-capable.
i. MP (Executive or Master Processor):
Can execute the OS and handle I/O.
ii. AP (Attached Processor):
No I/O capability.
APs execute user code under the supervision of the MP.
16. NUMA Multiprocessor Model
A shared-memory system in which the access time varies
with the location of the memory word.
Local memories (LM): the shared memory is physically
distributed to all processors.
Global address space: formed by the collection of all
local memories (LMs) and accessible by all
processors.
Access to a local memory by its local processor is faster;
access to remote memory attached to other processors is
slower, due to the added delay through the
interconnection network.
18. (Figure legend)
P – Processor
CSM – Cluster Shared Memory
CIN – Cluster Interconnection Network
GSM – Global Shared Memory
Access of remote memory may be UMA or NUMA.
19. Three memory-access patterns arise when globally shared
memory (GSM) is added to a multiprocessor system:
i. The fastest is local memory (LM) access.
ii. The next is global memory (GSM) access.
iii. The slowest is access of remote memory
(an LM attached to another processor).
Note:
All clusters have equal access to the GSM.
Access rights among intercluster memories can be specified.
20. COMA Multiprocessor Model
• Distributed main memory is converted to caches.
• The caches form a global address space.
• Remote cache access is assisted by distributed cache directories.
(Figure legend: C – Cache, P – Processor, D – Directory)
21. Multiprocessor systems are suitable for
general-purpose multiuser applications where
programmability is the major concern.
Shortcomings of multiprocessor systems:
Lack of scalability.
Limited latency tolerance for remote memory
access.
29. Steps 1-2: the program and data are first loaded into the main
memory through a host computer.
Step 3: all instructions are first decoded by the scalar
control unit.
Step 4: if the decoded instruction is a scalar operation or
a program-control operation, it is directly
executed by the scalar processor using the scalar
functional pipelines.
Step 5: if the instruction is decoded as a vector
operation, it is sent to the vector control
unit.
Step 6: the vector control unit supervises the flow of
vector data between the main memory and the vector
functional pipelines.
Note: a number of vector functional pipelines may be built into a
vector processor.
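The decode/dispatch flow in Steps 3-6 above can be sketched as follows; the instruction representation and unit interfaces here are hypothetical, not from the source:

```python
# Toy sketch of scalar-vs-vector instruction dispatch (hypothetical format).
class ScalarProcessor:
    """Stands in for the scalar functional pipelines (Step 4)."""
    def __init__(self):
        self.executed = []
    def execute(self, instr):
        self.executed.append(instr)

class VectorControlUnit:
    """Supervises vector data flow between main memory and the
    vector functional pipelines (Step 6)."""
    def __init__(self):
        self.issued = []
    def issue(self, instr):
        self.issued.append(instr)

def dispatch(instr, kind, scalar, vcu):
    # Step 3: every instruction is first decoded by the scalar control unit.
    if kind in ("scalar", "control"):
        scalar.execute(instr)   # Step 4: scalar / program-control op
    else:
        vcu.issue(instr)        # Step 5: vector op goes to the vector unit

scalar, vcu = ScalarProcessor(), VectorControlUnit()
program = [("add r1,r2", "scalar"), ("vadd v1,v2", "vector"), ("branch L", "control")]
for instr, kind in program:
    dispatch(instr, kind, scalar, vcu)
print(scalar.executed)  # ['add r1,r2', 'branch L']
print(vcu.issued)       # ['vadd v1,v2']
```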
30. SIMD Supercomputers
(Abstract model of a SIMD computer; figure legend)
CU – Control Unit
PE – Processing Element
LM – Local Memory
IS – Instruction Stream
DS – Data Stream
32. SIMD Machine Model:
An operational model of an SIMD computer is specified
by a 5-tuple:
M = <N, C, I, M, R>
(1) N = number of processing elements (PEs) in the machine.
(2) C = set of instructions directly executed by the
control unit (CU), including scalar and program-flow-control
instructions.
(3) I = set of instructions broadcast by the CU to all
PEs for parallel execution.
These include arithmetic, logic, data-routing, masking, and
other local operations executed by each active PE
over data within that PE.
33. (4) M = set of masking schemes.
Each mask partitions the set of PEs into enabled and
disabled subsets.
(5) R = set of data-routing functions,
specifying various patterns to be set up in the
interconnection network for inter-PE communications.
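A toy simulation of the 5-tuple model above: one broadcast instruction applied under a mask, followed by one data-routing function. The per-PE data layout and the circular-shift routing function are illustrative assumptions, not from the source:

```python
# Toy SIMD step under the model M = <N, C, I, M, R> (illustrative only).
N    = 4                              # number of processing elements
data = [1, 2, 3, 4]                   # one operand per PE (its local memory)
mask = [True, False, True, True]      # masking scheme: enabled/disabled PEs

# Broadcast instruction from I: each *enabled* PE adds 10 to its
# local operand; disabled PEs sit the instruction out.
data = [x + 10 if on else x for x, on in zip(data, mask)]

# Data-routing function from R: circular shift of operands by one PE
# through the interconnection network.
data = data[-1:] + data[:-1]
print(data)   # [14, 11, 2, 13]
```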