Computer Architecture
Parallel Processors (SIMD)
Contents
• Parallel Processors
• Flynn's taxonomy
• What is SIMD?
• Types of Processing
– Scalar Processing
– Vector Processing
• Architecture for Vector Processing
• Vector processors
– Vector Processor Architectures
– Components of Vector Processors
– Advantages of Vector Processing
• Array processors
– Array Processor Classification
– Array Processor Architecture
• Dedicated Memory Organization
• Global Memory Organization
• ILLIAC IV
– ILLIAC IV Architecture
• Super Computers
– Cray X1
• Multimedia Extension
Parallel Processors
• In computers, parallel processing is the processing
of program instructions by dividing them among
multiple processors with the objective of running a program in less
time.
• In the earliest computers, only one program ran at a time. A
computation-intensive program that took one hour to run and a
tape copying program that took one hour to run would take a total
of two hours to run. An early form of parallel processing allowed
the interleaved execution of both programs together.
• The computer would start an I/O operation, and while it was waiting
for the operation to complete, it would execute the processor-
intensive program. The total execution time for the two jobs would
be a little over one hour.
Flynn's taxonomy
• Flynn's taxonomy is a classification of computer architectures,
proposed by Michael J. Flynn in 1966. The classification system
has stuck, and has been used as a tool in the design of modern
processors and their functionalities.
• The four classifications defined by Flynn are based upon the
number of concurrent instruction (or control) streams and
data streams available in the architecture.
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
• Multiple instruction streams, single data stream (MISD)
• Multiple instruction streams, multiple data streams (MIMD)
Classification
What is SIMD?
• Single instruction, multiple data (SIMD), is a class
of parallel computers in Flynn's taxonomy.
• It describes computers with multiple processing
elements that perform the same operation on
multiple data points simultaneously. Thus, such
machines exploit data level parallelism.
• There are simultaneous (parallel) computations,
but only a single process (instruction) at a given
moment.
Types of Processing
• Scalar Processing
• A CPU that performs computations on one number or set of
data at a time. A scalar processor is known as a "single
instruction stream single data stream" (SISD) CPU.
• Vector Processing
• A vector processor or array processor is a
central processing unit (CPU) that implements an instruction
set containing instructions that operate on 1-D arrays of data
called vectors.
Architecture for Vector Processing
• Two architectures suitable for vector processing are:
• Pipelined vector processors
• Parallel Array processors
Description of Vector Processors
• CPU that implements an instruction set that operates on 1-D
arrays, called vectors
• Vectors contain multiple data elements
• Number of data elements per vector is typically referred to
as the vector length
• Both instructions and data are pipelined to reduce
decoding time
SCALAR (1 operation):
add r3, r1, r2 — the contents of scalar registers r1 and r2
are added and the single result is written to r3.
VECTOR (N operations):
add.vv v3, v1, v2 — corresponding elements of vector registers
v1 and v2 are added, element by element up to the vector
length N, and the results are written to v3.
Vector Processor Architectures
• Memory-to-Memory Architecture (Traditional)
o For all vector operations, operands are fetched
directly from main memory, then routed to the
functional unit
o Results are written back to main memory
o Includes early vector machines through the mid
1980s:
▪ Advanced Scientific Computer (TI), Cyber 200 & ETA-10
o The major reason for its demise was the large startup
time
Memory-to-Memory Architecture
Vector Processor Architectures (cont)
• Register-to-Register Architecture (Modern)
o All vector operations occur between vector
registers
o If necessary, operands are fetched from main
memory into a set of vector registers (load-store
unit)
o Includes all vector machines since the late 1980s:
▪ Convex, Cray, Fujitsu, Hitachi, NEC
o SIMD processors are based on this architecture
Register-to-Register Architecture
Components of Vector Processors
• Vector Registers
o Typically 8-32 vector registers with 64 - 128 64-bit elements
o Each contains a vector of double-precision numbers
o Register size determines the maximum vector length
o Each includes at least two read ports and one write port
• Vector Functional Units (FUs)
o Fully pipelined, new operation every cycle
o Performs arithmetic and logic operations
o Typically 4-8 different units
• Vector Load-Store Units (LSUs)
o Moves vectors between memory and registers
• Scalar Registers
o Single elements for interconnecting FUs, LSUs, and registers
Components of Vector Processors
The Vector Unit
• A vector unit consists of a pipelined functional unit, which performs
ALU operations on vectors in a pipeline
• It also has vector registers, including:
• A set of general-purpose vector registers, each of length s (e.g.,
128);
• A vector length register VL, which stores the length l (0 ≤ l ≤ s) of the
currently processed vector(s)
Advantages of Vector Processing
Advantages:
• Quick fetch and decode of a single instruction for multiple
operations.
• The instruction provides a regular source of data, which arrives
each cycle and can be processed efficiently in a pipelined fashion.
• Easier addressing of main memory
• Elimination of memory wastage
• Simplification of control hazards
• Reduced code size
Array Processors
• An array processor is a processor that performs
computations on a large array of data.
• An array processor is a synchronous parallel
computer with multiple ALUs, called processing
elements (PEs), that can operate in parallel in
lockstep fashion.
• It is composed of N identical PEs under the control of a single
control unit, and a number of memory modules
Array Processors
• Array processors also frequently use a form of parallel
computation called pipelining, where an operation is
divided into smaller steps and the steps are performed
simultaneously.
• This can greatly improve performance on certain
workloads, mainly numerical simulation.
How Array Processor can help?
An example: Consider the simple task of adding two groups of
10 numbers together. In a normal programming language you
might do something like:
• execute this loop 10 times
• read the next instruction and decode it
• fetch this number; fetch that number
• add them
• put the result here
But to an array processor this task looks like:
• read instruction and decode it
• fetch these 10 numbers
• fetch those 10 numbers
• add them
• put the results here
Array Processor Classification
Processing element complexity
Single-bit processors
• Connection Machine (CM-2): 65,536 PEs
connected by a hypercube network (by
Thinking Machines Corporation).
Multi-bit processors
• ILLIAC IV (64-bit), MasPar MP-1 (32-bit)
Array Processor Classification
• SIMD (Single Instruction, Multiple Data)
• is an array processor with a single-instruction,
multiple-data organization.
• It manipulates vector instructions by means of multiple
functional units responding to a common instruction.
• Attached array processor
• is an auxiliary processor attached to a general-purpose
computer.
Its intent is to improve the performance of the host
computer in specific numeric calculation tasks.
SIMD-Array Processor Architecture
• SIMD has two basic configurations
– a. Array processors using random-access memory, also known as
(dedicated memory organization)
• ILLIAC-IV, CM-2, MP-1
– b. Associative processors using content-addressable
memory, also known as
(global memory organization)
• BSP
Control Unit
• A simple CPU
• Can execute instructions without PE intervention
• Coordinates all PEs
• 64 64-bit registers, D0-D63
• 4 64-bit accumulators, A0-A3
• Ops:
– Integer ops
– Shifts
– Boolean
– Loop control
– Index memory
Processing Element
• A PE consists of an ALU with working registers
and a local memory PMEMi, which is used to store
distributed data.
• All PEs perform the same function synchronously under
the supervision of the CU in a lockstep fashion.
• Before execution in a PE, the vector instructions
should be loaded into its PMEM.
• Data can be loaded into the PMEM from an external
source or by the CU.
Processing Element
A PE consists of the following:
• 64-bit registers:
– A: accumulator
– B: second operand for binary ops
– R: routing (inter-PE communication)
– S: status register
• X: index for PMEM (16 bits)
• D: mode (8 bits)
• Communication:
– PMEM is accessible only from the local PE
– PEs communicate amongst themselves via R
Interconnection Network and Host Computer
• Interconnection Network:
All communication between PEs is done by the
interconnection network. It performs all the routing
and manipulation functions. This interconnection
network is under the control of the CU.
• Host Computer:
The array processor is interfaced to the host
computer. The host computer does the resource
management and peripheral and I/O supervision.
Dedicated Memory Organization
(Array processors using RAM )
• Here we have a control unit and multiple synchronized PEs.
• The control unit controls all the PEs below it.
• The control unit decodes all the instructions given to it and
decides where each decoded instruction should be executed.
• The vector instructions are broadcast to all the PEs.
• This broadcasting achieves spatial parallelism through
duplicated PEs.
• The scalar instructions are executed directly inside the CU.
Dedicated Memory Organization
Global Memory Organization
• In this configuration the PEs do not have private memories.
• The memories attached to the PEs are replaced by parallel memory
modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and
the parallel memory.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.
• The number of PEs (N) and the number of memory modules
(K) may not be equal.
• An alignment network should allow conflict-free access to the
shared memories by as many PEs as possible.
Global Memory Organization
Attached Array Processor
• In this configuration the attached array processor has an
input/output interface to a common processor and another
interface with a local memory.
• The local memory connects to the main memory with the
help of a high-speed memory bus.
Performance and Scalability of Array Processors
• To compute Y = Σᵢ₌₁ᴺ A(i) * B(i)
Assuming:
 A dedicated memory organization.
 Elements of A and B are properly and perfectly distributed
among processors (the compiler can help here).
We have:
 The product terms are generated in parallel.
 Additions can be performed in log₂N iterations.
 The speedup factor S (assuming that addition and multiplication
take the same time) is:
• S = (2N − 1) / (1 + log₂N)
ILLIAC IV
• The ILLIAC IV system was the first real attempt to construct a
large-scale parallel machine, and in its time it was the most
powerful computing machine in the world. It was designed and
constructed by academics and scientists from the University of
Illinois and the Burroughs Corporation. A significant amount of
software, including sophisticated compilers, was developed for
ILLIAC IV, and many researchers were able to develop parallel
application software.
• ILLIAC IV grew from a series of ILLIAC machines. Work on
ILLIAC IV began in the 1960s, and the machine became
operational in 1972. The original aim was to produce a 1
GFLOP machine using an SIMD array architecture comprising
256 processors partitioned into four quadrants, each controlled
by an independent control unit.
ILLIAC IV Features
ILLIAC IV (started in the late 1960s; fully
operational in 1975) is a typical example of array
processors.
SIMD computer for array processing.
Control Unit + 64 Processing Elements.
 2K words of memory per PE.
CU can access all memory.
PEs can access local memory and communicate
with neighbors.
CU reads the program and broadcasts instructions to
PEs.
ILLIAC IV Architecture
Super Computers
• Cray Inc. is an American supercomputer manufacturer
headquartered in Seattle, Washington. The company's
predecessor, Cray Research, Inc. (CRI), was founded in
1972 by computer designer Seymour Cray.
• Cray-1
• The Cray-1 was a supercomputer designed, manufactured and
marketed by Cray Research. The first Cray-1 system was installed at
Los Alamos National Laboratory in 1976 and it went on to become
one of the best known and most successful supercomputers in
history.
Cray X1
• The Cray X1 is a non-uniform memory access, vector
processor supercomputer manufactured and sold by Cray
Inc. since 2003. The X1 is often described as the unification
of the Cray T90, Cray SV1, and Cray T3E architectures into
a single machine. The X1 shares the multistreaming
processors, vector caches, and CMOS design of the SV1,
the highly scalable distributed-memory design of the T3E,
and the high memory bandwidth of the T90.
• The X1 uses a 1.25 ns (800 MHz) clock cycle and 8-wide
vector pipes in MSP mode, offering a peak speed of
12.8 gigaflops per processor. A maximum of 4096 processors,
comprising 1024 shared-memory nodes connected in a two-
dimensional network in 32 frames, would supply a peak
speed of about 50 teraflops.
Cray X1
Cray combines several technologies in the X1 machine (2003):
Multi-streaming vector processing.
Multiple node architecture.
Cray X1 System Functional Diagram
• Mainframe
• Node interconnection network
• System Port Channel (SPC)
  • Communicates within nodes
• I/O drawers (IODs)
• Cray Programming Environment Server (CPES)
• Cray Network Subsystem (CNS)
• Storage area network (SAN)
• RAID
Cray X1 System Functional Diagram
Nodes
Nodes are housed in hardware modules called node modules:
• Four multichip modules (MCMs) per node
• One multistreaming processor (MSP) per MCM
• Four SPC I/O ports
• Routing switches control all memory access
Node Processors
• Each node consists of four MCMs
• Each MCM includes one multistreaming processor (MSP)
• Each MSP includes a 2-MB cache
• A single MSP provides 12.8 GF (gigaflops)
• Each MSP has four internal single-streaming processors (SSPs)
• Each SSP contains both a superscalar processing unit and a
two-pipe vector processing unit
• The four SSPs in an MSP share the 2-MB cache of
the MSP
Cray Computers
• Cray-1
• Cray-2
• Cray-3
• Cray-3/SSS
• Cray-4
• Cray C90
• Cray Urika-GD
• Cray X1
• Cray X2
• Cray XC30
• Cray XC40
Multimedia extensions
• A multimedia extension is essentially a
supplementary processing capability that is
supported on recent products. MMX provides
integer operations, and defines eight different
registers, names MM0 through MM7, and the
operations that operate on them.
MMX (instruction set)
• MMX is a single instruction, multiple
data (SIMD) instruction set designed by Intel,
introduced in 1997 with its P5-based Pentium line
of microprocessors, designated as "Pentium with
MMX Technology".
Technical details
• MMX defines eight registers, called MM0 through
MM7, and operations that operate on them. Each
register is 64 bits wide and can be used to hold
either 64-bit integers, or multiple smaller integers
in a "packed" format: a single instruction can then
be applied to two 32-bit integers, four 16-bit
integers, or eight 8-bit integers at once.
Pentium II processor with MMX
technology
Parallel Processors (SIMD)

Contenu connexe

Tendances

Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
Haris456
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMS
koolkampus
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
Mr SMAK
 
Connection Machine
Connection MachineConnection Machine
Connection Machine
butest
 

Tendances (20)

Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMS
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applications
 
Advanced computer architecture
Advanced computer architectureAdvanced computer architecture
Advanced computer architecture
 
Single instruction multiple data
Single instruction multiple dataSingle instruction multiple data
Single instruction multiple data
 
Evaluation of morden computer & system attributes in ACA
Evaluation of morden computer &  system attributes in ACAEvaluation of morden computer &  system attributes in ACA
Evaluation of morden computer & system attributes in ACA
 
Pipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture pptPipeline hazards in computer Architecture ppt
Pipeline hazards in computer Architecture ppt
 
Cache memory
Cache memoryCache memory
Cache memory
 
Memory management
Memory managementMemory management
Memory management
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Computer architecture pipelining
Computer architecture pipeliningComputer architecture pipelining
Computer architecture pipelining
 
Connection Machine
Connection MachineConnection Machine
Connection Machine
 
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptxParallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
 
Parallelism
ParallelismParallelism
Parallelism
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer Architecture
 
Scope of parallelism
Scope of parallelismScope of parallelism
Scope of parallelism
 
Pipelining and vector processing
Pipelining and vector processingPipelining and vector processing
Pipelining and vector processing
 
chapter 2 architecture
chapter 2 architecturechapter 2 architecture
chapter 2 architecture
 

En vedette

Message Signaled Interrupts
Message Signaled InterruptsMessage Signaled Interrupts
Message Signaled Interrupts
Anshuman Biswal
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
Ashik Iqbal
 
কফি সমাচার
কফি সমাচারকফি সমাচার
কফি সমাচার
Sajid Rahat
 

En vedette (20)

Array Processor
Array ProcessorArray Processor
Array Processor
 
Parallel processing Concepts
Parallel processing ConceptsParallel processing Concepts
Parallel processing Concepts
 
Message Signaled Interrupts
Message Signaled InterruptsMessage Signaled Interrupts
Message Signaled Interrupts
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
 
Parallel Processing
Parallel ProcessingParallel Processing
Parallel Processing
 
Observer Pattern
Observer PatternObserver Pattern
Observer Pattern
 
Pipelining In computer
Pipelining In computer Pipelining In computer
Pipelining In computer
 
Arithmatic pipline
Arithmatic piplineArithmatic pipline
Arithmatic pipline
 
Pervasive Computing
Pervasive ComputingPervasive Computing
Pervasive Computing
 
Concept of Pipelining
Concept of PipeliningConcept of Pipelining
Concept of Pipelining
 
Pipeline
PipelinePipeline
Pipeline
 
Fibonacci Heap
Fibonacci HeapFibonacci Heap
Fibonacci Heap
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
কফি সমাচার
কফি সমাচারকফি সমাচার
কফি সমাচার
 
What animal
What animalWhat animal
What animal
 
Compression Type Connector - dongya electronic
Compression Type Connector - dongya electronicCompression Type Connector - dongya electronic
Compression Type Connector - dongya electronic
 
howti
howtihowti
howti
 
What animal
What animalWhat animal
What animal
 
Operating system
Operating systemOperating system
Operating system
 
Question 2 review
Question 2 reviewQuestion 2 review
Question 2 review
 

Similaire à Parallel Processors (SIMD)

BIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.pptBIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
Kadri20
 
Coa swetappt copy
Coa swetappt   copyCoa swetappt   copy
Coa swetappt copy
sweta_pari
 

Similaire à Parallel Processors (SIMD) (20)

CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...
 
CA UNIT IV.pptx
CA UNIT IV.pptxCA UNIT IV.pptx
CA UNIT IV.pptx
 
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.pptBIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
 
Pipelining, processors, risc and cisc
Pipelining, processors, risc and ciscPipelining, processors, risc and cisc
Pipelining, processors, risc and cisc
 
Multiprocessor.pptx
 Multiprocessor.pptx Multiprocessor.pptx
Multiprocessor.pptx
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
unit 4.pptx
unit 4.pptxunit 4.pptx
unit 4.pptx
 
unit 4.pptx
unit 4.pptxunit 4.pptx
unit 4.pptx
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)
 
Basics of micro controllers for biginners
Basics of  micro controllers for biginnersBasics of  micro controllers for biginners
Basics of micro controllers for biginners
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
PILOT Session for Embedded Systems
PILOT Session for Embedded Systems PILOT Session for Embedded Systems
PILOT Session for Embedded Systems
 
Mces MOD 1.pptx
Mces MOD 1.pptxMces MOD 1.pptx
Mces MOD 1.pptx
 
Chap4.ppt
Chap4.pptChap4.ppt
Chap4.ppt
 
Chap4.ppt
Chap4.pptChap4.ppt
Chap4.ppt
 
Chap4.ppt
Chap4.pptChap4.ppt
Chap4.ppt
 
The Central Processing Unit(CPU) for Chapter 4
The Central Processing Unit(CPU) for Chapter 4The Central Processing Unit(CPU) for Chapter 4
The Central Processing Unit(CPU) for Chapter 4
 
Introduction to embedded System.pptx
Introduction to embedded System.pptxIntroduction to embedded System.pptx
Introduction to embedded System.pptx
 
Coa swetappt copy
Coa swetappt   copyCoa swetappt   copy
Coa swetappt copy
 

Plus de Ali Raza (15)

Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD)
 
Hasil Review
Hasil Review Hasil Review
Hasil Review
 
Difference
DifferenceDifference
Difference
 
The mughal empire
The mughal empireThe mughal empire
The mughal empire
 
Psychology
PsychologyPsychology
Psychology
 
Assignment of ict robotics
Assignment of ict roboticsAssignment of ict robotics
Assignment of ict robotics
 
Software programming and development
Software programming and developmentSoftware programming and development
Software programming and development
 
artificial intelligence
artificial intelligence artificial intelligence
artificial intelligence
 
E commrece
E commreceE commrece
E commrece
 
Computer networks7
Computer networks7Computer networks7
Computer networks7
 
Presentation DBMS (1)
Presentation DBMS (1)Presentation DBMS (1)
Presentation DBMS (1)
 
Personal computer
Personal computer Personal computer
Personal computer
 
Assignment of ict robotics
Assignment of ict roboticsAssignment of ict robotics
Assignment of ict robotics
 
Presentation of verb
Presentation of verb Presentation of verb
Presentation of verb
 
Verb
VerbVerb
Verb
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Parallel Processors (SIMD)

  • 2. Contents • Parallel Processors • Flynn's taxonomy • What is SIMD? • Types of Processing – Scalar Processing – Vector Processing • Architecture for Vector Processing • Vector processors – Vector Processor Architectures – Components of Vector Processors – Advantages of Vector Processing • Array processors – Array Processor Classification – Array Processor Architecture • Dedicated Memory Organization • Global Memory Organization • ILLIAC IV – ILLIAC IV Architecture • Super Computers – Cray X1 • Multimedia Extension
  • 3. Parallel Processors • In computers, parallel processing is the processing of program instructions by dividing them among multiple processors with the objective of running a program in less time. • In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together. • The computer would start an I/O operation, and while it was waiting for the operation to complete, it would execute the processor- intensive program. The total execution time for the two jobs would be a little over one hour.
  • 4. Flynn's taxonomy • Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966.The classification system has stuck, and has been used as a tool in design of modern processors and their functionalities. • The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture. • Single instruction stream single data stream (SISD) • Single instruction stream, multiple data streams (SIMD) • Single instruction, multiple threads (SIMT) • Multiple instruction streams, single data stream (MISD) Classification
  • 5. What is SIMD? • Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. • It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism. • There are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
  • 6. Types of Processing • Scalar Processing • A CPU that performs computations on one number or set of data at a time. A scalar processor is known as a "single instruction stream single data stream" (SISD) CPU. • Vector Processing • A vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on 1-D arrays of data called vectors.
  • 7. Architecture for Vector Processing • Two architectures suitable for vector processing are: • Pipelined vector processors • Parallel Array processors
• 9. Description of Vector Processors • CPU that implements an instruction set that operates on 1-D arrays, called vectors • Vectors contain multiple data elements • The number of data elements per vector is typically referred to as the vector length • Both instructions and data are pipelined to reduce decoding time • SCALAR (1 operation): add r3, r1, r2 (r3 ← r1 + r2) • VECTOR (N operations): add.vv v3, v1, v2 (v3[i] ← v1[i] + v2[i] for each element i, up to the vector length)
• 10. Vector Processor Architectures • Memory-to-Memory Architecture (Traditional) o For all vector operations, operands are fetched directly from main memory and then routed to the functional unit o Results are written back to main memory o Includes early vector machines through the mid 1980s: ▪ Advanced Scientific Computer (TI), Cyber 200 & ETA-10 o The major reason for its demise was its large startup time
  • 12. Vector Processor Architectures (cont) • Register-to-Register Architecture (Modern) o All vector operations occur between vector registers o If necessary, operands are fetched from main memory into a set of vector registers (load-store unit) o Includes all vector machines since the late 1980s: ▪ Convex, Cray, Fujitsu, Hitachi, NEC o SIMD processors are based on this architecture
• 14. Components of Vector Processors • Vector Registers o Typically 8-32 vector registers, each with 64-128 64-bit elements o Each contains a vector of double-precision numbers o Register size determines the maximum vector length o Each includes at least 2 read ports and 1 write port • Vector Functional Units (FUs) o Fully pipelined, can start a new operation every cycle o Perform arithmetic and logic operations o Typically 4-8 different units • Vector Load-Store Units (LSUs) o Move vectors between memory and registers • Scalar Registers o Hold single elements for interconnecting FUs, LSUs, and registers
  • 15. Components of Vector Processors
• 16. The Vector Unit • A vector unit consists of a pipelined functional unit, which performs ALU operations on vectors in a pipeline • It also has vector registers, including: • A set of general-purpose vector registers, each of length s (e.g., 128) • A vector length register VL, which stores the length l (0 ≤ l ≤ s) of the currently processed vector(s)
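The effect of the VL register can be sketched as follows (a sketch, assuming a register size of s = 128 as in the example above; names are illustrative):

```python
S = 128  # vector register size s (assumed, per the slide's example)

def vadd_with_vl(v1, v2, vl):
    """Add the first vl elements of two vector registers, 0 <= vl <= s."""
    if not (0 <= vl <= S):
        raise ValueError("VL out of range")
    # Only the first VL element positions pass through the pipeline;
    # the remaining register slots are ignored.
    return [v1[i] + v2[i] for i in range(vl)]

# A vector of logical length 4 held in 128-element registers:
print(vadd_with_vl(list(range(128)), list(range(128)), 4))  # [0, 2, 4, 6]
```

This is why VL exists: the hardware register is always s elements long, but a program's vectors rarely are.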
• 17. Advantages of Vector Processing • Quick fetch and decode of a single instruction for multiple operations. • The instruction provides a regular source of data, which arrives each cycle and can be processed efficiently in a pipelined fashion. • Easier Addressing of Main Memory • Elimination of Memory Wastage • Simplification of Control Hazards • Reduced Code Size
• 19. Array Processors • An ARRAY processor is a processor that performs computations on a large array of data. • An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lockstep fashion. • It is composed of N identical PEs under the control of a single control unit and a number of memory modules
• 20. Array Processors • Array processors also frequently use a form of parallel computation called pipelining, where an operation is divided into smaller steps and the steps are performed simultaneously. • This can greatly improve performance on certain workloads, mainly in numerical simulation.
• 21. How can an Array Processor help? An Example • Consider the simple task of adding two groups of 10 numbers together. In a normal programming language you might have done something like: • execute this loop 10 times: • read the next instruction and decode it • fetch this number; fetch that number • add them • put the result here • But to an array processor this task looks like: • read the instruction and decode it • fetch these 10 numbers • fetch those 10 numbers • add them • put the results here
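The difference in control overhead can be sketched by counting one "decode" per instruction issued (illustrative bookkeeping only, not real hardware):

```python
a = list(range(10))       # first group of 10 numbers
b = list(range(10, 20))   # second group of 10 numbers

# Scalar CPU: the add instruction is fetched and decoded once per iteration.
decodes_scalar = 0
out_scalar = []
for i in range(10):
    decodes_scalar += 1           # decode the add instruction yet again
    out_scalar.append(a[i] + b[i])

# Array processor: decode once, then all 10 PEs add simultaneously.
decodes_array = 1
out_array = [x + y for x, y in zip(a, b)]

print(out_scalar == out_array, decodes_scalar, decodes_array)  # True 10 1
```

Same results, one tenth of the fetch/decode work: that is the win the slide is describing.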
• 22. Array Processor Classification • Processing element complexity: • Single-bit processors • Connection Machine (CM-2): 65,536 PEs connected by a hypercube network (by Thinking Machines Corporation). • Multi-bit processors • ILLIAC IV (64-bit), MasPar MP-1 (32-bit)
• 23. Array Processor Classification • SIMD (Single Instruction Multiple Data) • is an array processor that has a single-instruction, multiple-data organization. • It manipulates vector instructions by means of multiple functional units responding to a common instruction. • Attached array processor • is an auxiliary processor attached to a general-purpose computer. Its intent is to improve the performance of the host computer in specific numeric calculation tasks.
• 24. SIMD Array Processor Architecture • SIMD has two basic configurations: – a. Array processors using RAM, also known as dedicated memory organization • ILLIAC-IV, CM-2, MP-1 – b. Associative processors using content-addressable memory, also known as global memory organization • BSP
• 25. Control Unit • A simple CPU • Can execute instructions without PE intervention • Coordinates all PEs • 64 64-bit registers, D0-D63 • 4 64-bit accumulators, A0-A3 • Operations: – Integer ops – Shifts – Boolean – Loop control – Index memory
• 26. Processing Element • A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data. • All PEs perform the same function synchronously under the supervision of the CU in a lock-step fashion. • Before execution in a PE, the vector instructions should be loaded into its PMEM. • Data can be loaded into the PMEM from an external source or by the CU.
• 27. Processing Element • A PE consists of the following 64-bit registers: • A: Accumulator • B: 2nd operand for binary ops • R: Routing (inter-PE communication) • S: Status register • X: Index for PMEM (16 bits) • D: Mode (8 bits) • Communication: – PMEM is accessible only from the local PE – Among PEs via R
• 28. Interconnection Network and Host Computer • Interconnection Network: All communication between PEs is done by the interconnection network. It does all the routing and manipulation functions. This interconnection network is under the control of the CU. • Host Computer: The array processor is interfaced to the host controller through the host computer. The host computer does the resource management and the peripheral and I/O supervision.
• 29. Dedicated Memory Organization (Array Processors Using RAM) • Here we have a control unit and multiple synchronized PEs. • The control unit controls all the PEs below it. • The control unit decodes all the instructions given to it and decides where the decoded instructions should be executed. • The vector instructions are broadcast to all the PEs. • This broadcasting is done to obtain spatial parallelism through duplicate PEs. • The scalar instructions are executed directly inside the CU.
• 31. Global Memory Organization • In this configuration the PEs do not have private memory. • The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network. • The alignment network does path switching between the PEs and the parallel memories. • PE-to-PE communication is also via the alignment network. • The alignment network is controlled by the CU. • The number of PEs (N) and the number of memory modules (K) may not be equal. • An alignment network should allow conflict-free access to the shared memories by as many PEs as possible.
• 33. Attached Array Processor • In this configuration the attached array processor has an input/output interface to a common processor and another interface with a local memory. • The local memory connects to the main memory with the help of a high-speed memory bus.
• 34. Performance and Scalability of Array Processors • To compute Y = Σ (i = 1 to N) A(i) × B(i) • Assuming: • A dedicated memory organization. • Elements of A and B are properly and perfectly distributed among processors (the compiler can help here). • We have: • The product terms are generated in parallel. Additions can be performed in log₂N iterations. • The speed-up factor S (assuming that addition and multiplication take the same time) is: • S = (2N - 1) / (1 + log₂N)
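Plugging numbers into the speed-up formula shows how the advantage grows with vector length (a direct transcription of the slide's formula, nothing more):

```python
import math

def speedup(n):
    # 2N - 1 sequential operations (N multiplies + N - 1 additions)
    # versus 1 parallel multiply step + log2(N) parallel addition steps.
    return (2 * n - 1) / (1 + math.log2(n))

for n in (16, 64, 1024):
    print(n, round(speedup(n), 2))
# 16 -> 6.2, 64 -> 18.14, 1024 -> 186.09
```

Note the speed-up is sub-linear in N: the log₂N reduction tree for the additions keeps it well below the N-fold ideal.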
• 36. ILLIAC IV • The ILLIAC IV system was the first real attempt to construct a large-scale parallel machine, and in its time it was the most powerful computing machine in the world. It was designed and constructed by academics and scientists from the University of Illinois and the Burroughs Corporation. A significant amount of software, including sophisticated compilers, was developed for ILLIAC IV, and many researchers were able to develop parallel application software. • ILLIAC IV grew from a series of ILLIAC machines. Work on ILLIAC IV began in the 1960s, and the machine became operational in 1972. The original aim was to produce a 1 GFLOP machine using an SIMD array architecture comprising 256 processors partitioned into four quadrants, each controlled by an independent control unit.
• 37. ILLIAC IV Features • ILLIAC IV (started in the late 60's; fully operational in 1975) is a typical example of an array processor. • SIMD computer for array processing. • Control unit + 64 processing elements. • 2K words of memory per PE. • The CU can access all memory. • PEs can access local memory and communicate with neighbors. • The CU reads the program and broadcasts instructions to the PEs.
  • 39. Super Computers • Cray Inc. is an American supercomputer manufacturer headquartered in Seattle, Washington. The company's predecessor, Cray Research, Inc. (CRI), was founded in 1972 by computer designer Seymour Cray. • Cray-1 • The Cray-1 was a supercomputer designed, manufactured and marketed by Cray Research. The first Cray-1 system was installed at Los Alamos National Laboratory in 1976 and it went on to become one of the best known and most successful supercomputers in history.
• 40. Cray X1 • The Cray X1 is a non-uniform memory access, vector processor supercomputer manufactured and sold by Cray Inc. since 2003. The X1 is often described as the unification of the Cray T90, Cray SV1, and Cray T3E architectures in a single machine. The X1 shares the multistreaming processors, vector caches, and CMOS design of the SV1, the highly scalable distributed memory design of the T3E, and the high memory bandwidth of the T90. • The X1 uses a 1.2 ns (800 MHz) clock cycle and 8-wide vector pipes in MSP mode, offering a peak speed of 12.8 gigaflops per processor. A maximum configuration has 4096 processors, comprising 1024 shared-memory nodes connected in a two-dimensional network, in 32 frames. Such a system would supply a peak speed of about 50 teraflops.
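The 12.8 gigaflops figure falls out of the slide's numbers if we assume each of the 8 vector pipes retires two floating-point results (a multiply and an add) per cycle, a common convention for quoting vector peak rates:

```python
clock_hz = 800e6           # 800 MHz clock, per the slide
pipes = 8                  # 8-wide vector pipes in MSP mode
flops_per_pipe_cycle = 2   # assumed: one multiply + one add per cycle

msp_peak = clock_hz * pipes * flops_per_pipe_cycle   # per-MSP peak, flop/s
system_peak = msp_peak * 4096                        # full 4096-MSP system

print(msp_peak / 1e9, system_peak / 1e12)  # 12.8 52.4288
```

The exact product is about 52 teraflops; the round "50 teraflops" on the slide is the usual marketing approximation.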
• 41. Cray X1 • Cray combines several technologies in the X1 machine (2003): • Multi-streaming vector processing. • Multiple-node architecture.
• 42. Cray X1 System Functional Diagram • Mainframe • Node interconnection network • System Port Channel (SPC) • Communication within nodes • I/O drawers (IODs) • Cray Programming Environment Server (CPES) • Cray Network Subsystem (CNS) • Storage area network (SAN) • RAID
  • 43. Cray X1 System Functional Diagram
• 44. Nodes • Nodes are housed in hardware modules called node modules • Four multichip modules (MCMs) per node • One multistreaming processor (MSP) per MCM • Four SPC I/O ports • Routing switches control all memory access
• 45. Node Processors • Each node consists of four MCMs • Each MCM includes one multistreaming processor (MSP) • Each MSP includes a 2-MB cache • A single MSP provides 12.8 GF (gigaflops) • Each MSP has four internal single-streaming processors (SSPs) • Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit • The four SSPs in an MSP share the 2-MB cache of the MSP
  • 46. Cray Computers • Cray-1 • Cray-2 • Cray-3 • Cray-3/SSS • Cray-4 • Cray C90 • Cray Urika-GD • Cray X1 • Cray X2 • Cray XC30 • Cray XC40 Continue ………
• 48. Multimedia extensions • A multimedia extension is essentially a supplementary processing capability that is supported on recent products. MMX provides integer operations, and defines eight registers, named MM0 through MM7, and the operations that operate on them.
• 49. MMX (instruction set) • MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1997 with its P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology".
• 50. Technical details • MMX defines eight registers, called MM0 through MM7, and operations that operate on them. Each register is 64 bits wide and can be used to hold either a 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. • (Figure: Pentium II processor with MMX technology)
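The packed-byte case can be sketched in Python, modeled on MMX's PADDB instruction: eight independent 8-bit additions inside one 64-bit value, with each lane wrapping around on overflow.

```python
def paddb(mm0, mm1):
    """Add two 64-bit registers as eight packed 8-bit lanes (wraparound)."""
    result = 0
    for lane in range(8):
        a = (mm0 >> (8 * lane)) & 0xFF   # extract this lane from mm0
        b = (mm1 >> (8 * lane)) & 0xFF   # extract this lane from mm1
        # Each lane's carry is discarded, so lanes never interfere.
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result

# One "instruction" performs eight byte additions at once:
print(hex(paddb(0x0102030405060708, 0x0101010101010101)))
# 0x203040506070809
```

The key point the code makes explicit: a plain 64-bit add would let carries spill between lanes, so packed arithmetic must mask each lane independently, which is exactly what dedicated SIMD hardware does in parallel.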