한국해양과학기술진흥원
Introduction to Parallel Computing
2013.10.6
Sayed Chhattan Shah, PhD
Senior Researcher
Electronics and Telecommunications Research Institute, Korea
Acknowledgements
Blaise Barney, Lawrence Livermore National Laboratory, “Introduction to Parallel Computing”
Jim Demmel, Parallel Computing Lab at UC Berkeley, “Parallel Computing”
Outline
 Introduction
 Parallel Computer Memory Architectures
 Shared Memory
 Distributed Memory
 Hybrid Distributed-Shared Memory
 Parallel Programming Models
 Shared Memory Model
 Threads Model
 Message Passing Model
 Data Parallel Model
 Designing Parallel Programs
Overview
What is Parallel Computing?
 Traditionally, software has been written
for serial computation
 To be run on a single computer having a single Central
Processing Unit
 A problem is broken into a discrete series of instructions
 Instructions are executed one after another
 Only one instruction may execute at any moment in time
Parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem
 Accomplished by breaking the problem into independent parts
so that each processing element can execute its part of the
algorithm simultaneously with the others
The computational problem should be:
 Solved in less time with multiple compute resources than with a
single compute resource
The compute resources might be
 A single computer with multiple processors
 Several networked computers
 A combination of both
Lawrence Livermore National Laboratory (LLNL) Parallel Computer:
 Each compute node is a multi-processor parallel computer
 Multiple compute nodes are networked together with an
InfiniBand network
The Real World is Massively Parallel
 In the natural world, many complex, interrelated events are
happening at the same time, yet within a temporal sequence
 Compared to serial computing, parallel computing is much
better suited for modeling, simulating and understanding
complex, real world phenomena
Uses for Parallel Computing
 Science and Engineering
 Historically, parallel computing has been used to model difficult
problems in many areas of science and engineering
• Atmosphere, Earth, Environment, Physics, Bioscience, Chemistry, Mechanical
Engineering, Electrical Engineering, Circuit Design, Microelectronics, Defense,
Weapons
 Industrial and Commercial
 Today, commercial applications provide an equal or greater
driving force in the development of faster computers
• Data mining, Oil exploration, Web search engines, Medical imaging and
diagnosis, Pharmaceutical design, Financial and economic modeling, Advanced
graphics and virtual reality, Collaborative work environments
Why Use Parallel Computing?
 Save time and money
 In theory, throwing more resources at a task will shorten its
time to completion, with potential cost savings
 Parallel computers can be built from cheap, commodity
components
Solve larger problems
 Many problems are so large and complex that it is impractical
or impossible to solve them on a single computer, especially
given limited computer memory
• Grand Challenge problems (en.wikipedia.org/wiki/Grand_Challenge) requiring petaFLOPS and petabytes of computing resources
 Web search engines and databases processing millions of
transactions per second
Provide concurrency
 A single compute resource can only do one thing at a time.
Multiple computing resources can be doing many things
simultaneously
• For example, the Access Grid (www.accessgrid.org) provides a
global collaboration network where people from around the world
can meet and conduct work "virtually”
Use of non-local resources
 Using compute resources on a wide area network, or even the
Internet when local compute resources are insufficient
• SETI@home (setiathome.berkeley.edu) has over 1.3 million users and 3.4 million computers in nearly every country in the world.
Source: www.boincsynergy.com/stats/ (June, 2013).
• Folding@home (folding.stanford.edu) uses over 320,000 computers
globally (June, 2013)
Limits to serial computing
 Transmission speeds
• The speed of a serial computer is directly dependent upon how fast
data can move through hardware.
• Absolute limits are the speed of light and the transmission limit of
copper wire
• Increasing speeds necessitate increasing proximity of processing
elements
 Limits to miniaturization
• Processor technology is allowing an increasing number of
transistors to be placed on a chip
• However, even with molecular or atomic-level components, a limit
will be reached on how small components can be
 Economic limitations
• It is increasingly expensive to make a single processor faster
• Using a larger number of moderately fast commodity processors to
achieve the same or better performance is less expensive
 Current computer architectures are increasingly relying upon
hardware level parallelism to improve performance
• Multiple execution units
• Pipelined instructions
• Multi-core
The Future
 Trends indicated by ever faster networks, distributed
systems, and multi-processor computer architectures
clearly show that parallelism is the future of computing
 There has been a greater than 1000x increase in
supercomputer performance, with no end currently in sight
 The race is already on for Exascale Computing!
 1 exaFLOPS = 10^18 floating point operations per second
Who is Using Parallel Computing?
Concepts and Terminology
von Neumann Architecture
Named after the Hungarian mathematician John von
Neumann who first authored the general requirements
for an electronic computer in his 1945 papers
Since then, virtually all computers have followed this
basic design
Comprises of four main components:
 RAM is used to store both program instructions and data
 Control unit fetches instructions or data from memory,
Decodes instructions and then sequentially coordinates operations
to accomplish the programmed task.
 Arithmetic Unit performs basic arithmetic operations
 Input-Output is the interface to the human operator
Flynn's Classical Taxonomy
Different ways to classify parallel computers
A widely used classification is called Flynn's Taxonomy
 Based upon the number of concurrent instruction and data
streams available in the architecture
Single Instruction, Single Data
 A sequential computer which exploits no parallelism in either the
instruction or data streams
• Single Instruction: Only one instruction stream is being acted on by the
CPU during any one clock cycle
• Single Data: Only one data stream is being used as input during any
one clock cycle
• Can have concurrent processing characteristics:
Pipelined execution
Examples include older generation mainframes, minicomputers, workstations, and modern day PCs
Single Instruction, Multiple Data
 A computer which exploits multiple data streams against a single
instruction stream to perform operations
• Single Instruction: All processing units execute the same instruction at
any given clock cycle
• Multiple Data: Each processing unit can operate on a different data
element
 Vector Processor implements an instruction set containing instructions that operate on 1D arrays of data called vectors
Multiple Instruction, Single Data
 Multiple Instruction: Each processing unit operates on the data
independently via separate instruction streams
 Single Data: A single data stream is fed into multiple processing
units
 Some conceivable uses might be:
• Multiple cryptography algorithms attempting to crack a single coded
message
Multiple Instruction, Multiple Data
 Multiple autonomous processors simultaneously executing
different instructions on different data
• Multiple Instruction: Every processor may be executing a different
instruction stream
• Multiple Data: Every processor may be working with a different data
stream
Examples include most current supercomputers, networked parallel computer clusters, grids, clouds, and multi-core PCs
Many MIMD architectures also include SIMD execution sub-components
Some General Parallel Terminology
Parallel computing
 Using a parallel computer to solve a single problem faster
Parallel computer
 Multiple-processor or multi-core system supporting parallel
programming
Parallel programming
 Programming in a language that supports concurrency
explicitly
Supercomputing or High Performance Computing
 Using the world's fastest and largest computers to solve large
problems
Task
 A logically discrete section of computational work.
 A task is typically a program or program-like set of instructions that is
executed by a processor
 A parallel program consists of multiple tasks running on multiple
processors
Shared Memory
 From a hardware point of view, describes a computer architecture where all processors have direct access to common physical memory
 In a programming sense, it describes a model where all parallel tasks
have the same "picture" of memory and can directly address and
access the same logical memory locations regardless of where the
physical memory actually exists
Symmetric Multi-Processor (SMP)
 Hardware architecture where multiple processors share a
single address space and access to all resources; shared
memory computing
Distributed Memory
 In hardware, refers to network based memory access for
physical memory that is not common
 As a programming model, tasks can only logically "see" local
machine memory and must use communications to access
memory on other machines where other tasks are executing
Communications
 Parallel tasks typically need to exchange data. There are several ways this can
be accomplished, such as through a shared memory bus or over a network,
however the actual event of data exchange is commonly referred to as
communications regardless of the method employed
Synchronization
 Coordination of parallel tasks in real time, very often associated with
communications
 Often implemented by establishing a synchronization point within an
application where a task may not proceed further until another task(s) reaches
the same or logically equivalent point
 Synchronization usually involves waiting by at least one task, and can therefore
cause a parallel application's wall clock execution time to increase
Parallel Overhead
 The amount of time required to coordinate parallel tasks, as
opposed to doing useful work.
 Parallel overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel languages, libraries,
operating system, etc.
• Task termination time
Massively Parallel
 Refers to the hardware that comprises a given parallel system - having many
processors.
 The meaning of "many" keeps increasing, but currently, the largest parallel
computers can be comprised of processors numbering in the hundreds of
thousands
Embarrassingly Parallel
 Solving many similar, but independent tasks simultaneously;
little to no need for coordination between the tasks
Scalability
 A parallel system's ability to demonstrate a proportionate
increase in parallel speedup with the addition of more
resources
Multi-core processor
 A single computing component with two or more independent central processing units (cores)
Limits and Costs of Parallel Programming
Potential program speedup is defined by the fraction of code (P) that can be parallelized
 If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup)
 If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast
 Introducing the number of processors N performing the parallel fraction of work P, the relationship can be modeled by:
speedup = 1 / (P/N + S), where S = 1 - P is the serial fraction
 It soon becomes obvious that there are limits to the scalability of parallelism
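The speedup relationship above can be checked numerically. A minimal sketch; the function name and the sample values of P and N are illustrative, not from the slides:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup = 1 / (P/N + S), with serial fraction S = 1 - P."""
    serial = 1.0 - p
    return 1.0 / (p / n + serial)

# Even with a very large N, speedup is capped at 1 / (1 - P):
# P = 0.50 caps at 2x, P = 0.90 at 10x, P = 0.99 at 100x.
for p in (0.50, 0.90, 0.99):
    print(f"P = {p}: N = 1000 gives {amdahl_speedup(p, 1000):.1f}x")
```

Running the loop makes the scalability limit concrete: pushing N higher yields rapidly diminishing returns once the serial fraction dominates.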
Complexity
 Parallel applications are much more complex than
corresponding serial applications
• Not only do you have multiple instruction streams executing at the
same time, but you also have data flowing between them
 The costs of complexity are measured in programmer time in
virtually every aspect of the software development cycle:
• Design
• Coding
• Debugging
• Tuning
• Maintenance
Portability
 Portability issues with parallel programs are not as serious as
in years past due to standardization in several APIs, such as
MPI, POSIX threads, and OpenMP
 All of the usual portability issues associated with serial
programs apply to parallel programs
• Hardware architectures are characteristically highly variable and
can affect portability
Resource Requirements
 Amount of memory required can be greater for parallel codes
than serial codes, due to the need to replicate data and for
overheads associated with parallel support libraries and
subsystems
 For short running parallel programs, there can actually be a
decrease in performance compared to a similar serial
implementation
• The overhead costs associated with setting up the parallel
environment, task creation, communications and task termination
can comprise a significant portion of the total execution time for
short runs
Scalability
 Ability of a parallel program's performance to scale is a result of
a number of interrelated factors. Simply adding more
processors is rarely the answer
 Algorithm may have inherent limits to scalability
 Hardware factors play a significant role in scalability.
Examples:
• Memory-CPU bus bandwidth on an SMP machine
• Communications network bandwidth
• Amount of memory available on any given machine or set of
machines
• Processor clock speed
Parallel Computer Memory Architectures
Shared Memory
All processors access all memory as global address space
 Multiple processors can operate independently but share the same memory resources
 Changes in a memory location effected by one processor are visible to all other processors
Shared memory machines are classified
as UMA and NUMA, based upon memory access times
 Uniform Memory Access (UMA)
• Most commonly represented today by Symmetric Multiprocessor
(SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA
 Cache coherent means if one processor updates a location in shared
memory, all the other processors know about the update. Cache
coherency is accomplished at the hardware level
 Non-Uniform Memory Access (NUMA)
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
Advantages
 Global address space provides a user-friendly programming
perspective to memory
 Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs
Disadvantages
 Primary disadvantage is the lack of scalability between memory
and CPUs
• Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management
 Programmer responsibility for synchronization constructs that
ensure "correct" access of global memory
Distributed Memory
Processors have their own local memory
 Changes to a processor's local memory have no effect on the memory of other processors
 When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated
 Synchronization between tasks is likewise the programmer's responsibility
 The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet
Advantages
 Memory is scalable with the number of processors.
• Increase the number of processors and the size of memory
increases proportionately
 Each processor can rapidly access its own memory without
interference and without the overhead incurred with trying to
maintain global cache coherency.
 Cost effectiveness: can use commodity, off-the-shelf
processors and networking
Disadvantages
 The programmer is responsible for many of the details
associated with data communication between processors.
 Non-uniform memory access times - data residing on a remote
node takes longer to access than node local data.
Hybrid Distributed-Shared Memory
The largest and fastest computers in the world today
employ both shared and distributed memory
architectures
 The shared memory component can be a shared memory
machine or graphics processing units
 The distributed memory component is the networking of
multiple shared memory or GPU machines
Advantages and Disadvantages
 Whatever is common to both shared and distributed memory
architectures
 Increased scalability is an important advantage
 Increased programmer complexity is an important
disadvantage
Parallel Programming Models
Parallel Programming Model
Programming model provides an abstract view of
computing system
 Abstraction above hardware and memory architectures
 Value of a programming model is usually judged on its
generality
• how well a range of different problems can be expressed and
• how well they execute on a range of different architectures
 The implementation of a programming model can take several
forms such as libraries invoked from traditional sequential
languages, language extensions, or complete new execution
models
Parallel programming models in common use:
 Shared Memory (without threads)
 Threads
 Distributed Memory / Message Passing
 Data Parallel
 Hybrid
 Single Program Multiple Data (SPMD)
 Multiple Program Multiple Data (MPMD)
These models are NOT specific to a particular type of
machine or memory architecture
 Any of these models can be implemented on any underlying
hardware
SHARED memory model on a DISTRIBUTED memory
machine
 Machine memory was physically distributed across networked
machines, but appeared to the user as a single shared memory
(global address space).
 This approach is referred to as virtual shared memory
DISTRIBUTED memory model on a SHARED memory
machine
 The SGI Origin 2000 employed the CC-NUMA type of shared
memory architecture, where every task has direct access to
global address space spread across all machines.
 However, the ability to send and receive messages using MPI, as
is commonly done over a network of distributed memory
machines, was implemented and commonly used
Shared Memory Model - Without Threads
Tasks share a common address space
 Efficient means of passing data between programs
 Various mechanisms such as locks / semaphores may be used
to control access to the shared memory
 Programmer's point of view
• The notion of data "ownership" is lacking, so there is no need to
specify explicitly the communication of data between tasks
• Program development can often be simplified
Disadvantage in terms of performance
 It becomes more difficult to understand and manage data locality:
• Keeping data local to the processor that works on it conserves
memory accesses, cache refreshes and bus traffic that occurs when
multiple processors use the same data
Implementations
 Native compilers or hardware translate user program variables
into actual memory addresses, which are global
• On stand-alone shared memory machines, this is straightforward.
• On distributed shared memory machines, memory is physically
distributed across a network of machines, but made global through
specialized hardware and software
Threads Model
Type of shared memory programming model
 A single "heavy weight" process can have multiple "light weight",
concurrent execution paths
 Main program a.out is scheduled by native OS
 a.out loads and acquires all of the necessary system and user
resources to run. This is the "heavy weight" process
 a.out performs some serial work, and then creates a number of
tasks (threads) that can be scheduled and run by the operating
system concurrently
 Each thread has local data, but also, shares the entire resources
of a.out
 This saves the overhead associated with replicating a program's
resources for each thread ("light weight"). Each thread also
benefits from a global memory view because it shares the
memory space of a.out
 Threads communicate with each other through global memory
(updating address locations). This requires synchronization
constructs to ensure that more than one thread is not updating
the same global address at any time
 Threads can come and go, but a.out remains present to provide
the necessary shared resources until the application has
completed
Implementations
• POSIX Threads
 Library based; requires parallel coding
 C Language only
 Commonly referred to as Pthreads.
 Most hardware vendors now offer Pthreads in addition to their proprietary
threads implementations.
 Very explicit parallelism; requires significant programmer attention to detail.
• OpenMP
 Compiler directive based; can use serial code
 Portable / multi-platform, including Unix and Windows platforms
 Available in C/C++ and Fortran implementations
 Can be very easy and simple to use
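The shared-address-space behavior described above can be sketched with Python's standard threading module. This is a stand-in for Pthreads or OpenMP, not an implementation of either; the variable names are illustrative:

```python
import threading

counter = 0              # lives in the shared address space of the process
lock = threading.Lock()  # synchronization construct guarding the global update

def worker(iterations):
    """Each thread has local state but shares the process's global memory."""
    global counter
    for _ in range(iterations):
        with lock:       # only one thread may update the global at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # with the lock this is deterministic; without it, updates can be lost
```

The lock is exactly the synchronization construct the slide mentions: it ensures that no two threads update the same global address at the same time.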
Distributed Memory / Message Passing Model
A set of tasks that use their own local memory during
computation
 Multiple tasks can reside on the same physical machine and/or
across an arbitrary number of machines.
 Tasks exchange data through communications by sending and
receiving messages.
 Data transfer usually requires cooperative operations to be
performed by each process. For example, a send operation must
have a matching receive operation.
Implementations
 From a programming perspective, message passing implementations usually comprise a library of subroutines
 Calls to these subroutines are embedded in source code
 MPI specifications are available on the web at http://www.mpi-forum.org/docs/
 MPI implementations exist for virtually all popular parallel
computing platforms
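The cooperative send/receive pattern can be mimicked in miniature with a queue between two threads. This is an analogy for the model, not real MPI; the channel, tags, and values are illustrative:

```python
import threading
import queue

channel = queue.Queue()  # stands in for the communication fabric between two tasks

def sender():
    """Each "send" must eventually be matched by a "receive" on the other side."""
    for value in [1, 2, 3]:
        channel.put(("data", value))   # send operation
    channel.put(("stop", None))        # tell the receiver we are done

received = []

def receiver():
    """Blocks on each receive until the matching send arrives."""
    while True:
        tag, value = channel.get()     # matching receive operation
        if tag == "stop":
            break
        received.append(value)

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()
```

Each task keeps its own local data (`received` is owned by the receiver); the only way data crosses between them is through explicit messages, which is the defining property of the model.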
Data Parallel Model
Address space is treated globally
A set of tasks work collectively on the same data structure; however, each task works on a different partition of the same data structure
 On shared memory architectures, all tasks may have access to the data structure through global memory
 On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task
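A rough sketch of the data parallel pattern: every task applies the same operation to its own partition of one data structure. The thread pool and the chunking scheme are illustrative choices, not part of the model itself:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))                 # the shared data structure
chunks = [data[i::4] for i in range(4)]    # four partitions of the same structure

def partial_sum(chunk):
    """Every task performs the same operation, each on a different partition."""
    return sum(chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))  # combine the partial results

print(total)
```

On a distributed memory machine the chunks would live in each task's local memory instead of one Python list, but the structure of the computation is the same.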
Implementations
 Unified Parallel C (UPC): an extension to the C programming
language for SPMD parallel programming. Compiler dependent.
More information: http://upc.lbl.gov/
 Global Arrays: provides a shared memory style programming
environment in the context of distributed array data structures.
Public domain library with C and Fortran77 bindings. More
information: http://www.emsl.pnl.gov/docs/global/
 X10: a PGAS based parallel programming language being
developed by IBM at the Thomas J. Watson Research Center.
More information: http://x10-lang.org/
Hybrid Model
A hybrid model combines more than one of the previously
described programming models
Single Program Multiple Data
High level programming model that can be built upon any
combination of the previously mentioned parallel
programming models
SINGLE PROGRAM: All tasks execute their copy of the
same program simultaneously.
 This program can be threads, message passing, data parallel or
hybrid.
MULTIPLE DATA: All tasks may use different data
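The SPMD idea can be sketched as one function that every task executes, with a rank selecting each task's slice of the data. The rank-based round-robin partition is an illustrative convention (borrowed from how MPI programs commonly use ranks), not something the slides prescribe:

```python
from concurrent.futures import ThreadPoolExecutor

DATA = list(range(16))
NUM_TASKS = 4

def spmd_task(rank):
    """Single program: every task runs this same code.
    Multiple data: the rank picks out this task's portion of DATA."""
    chunk = DATA[rank::NUM_TASKS]   # round-robin partition by rank
    return rank, sum(chunk)

with ThreadPoolExecutor(max_workers=NUM_TASKS) as pool:
    results = dict(pool.map(spmd_task, range(NUM_TASKS)))
```

All tasks execute an identical program, yet each produces a different result because its rank steers it to different data.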
Multiple Program Multiple Data
High level programming model that can be built upon any
combination of the previously mentioned parallel
programming models
MULTIPLE PROGRAM: Tasks may execute different
programs simultaneously. The programs can be threads,
message passing, data parallel or hybrid.
MULTIPLE DATA: All tasks may use different data
Designing Parallel Programs
Understand the Problem and the Program
Determine whether the problem can be parallelized
Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation
 This problem can be solved in parallel
• Each of the molecular conformations is independently determinable
• The calculation of the minimum energy conformation is also a parallelizable problem
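The conformation example maps directly onto a parallel map followed by a reduction. The `potential_energy` function here is a hypothetical stand-in for a real molecular-energy calculation; the pool size is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

def potential_energy(conformation):
    """Hypothetical placeholder for a real energy calculation.
    Each call depends only on its own input, so calls are independent."""
    return (conformation - 42) ** 2 + 7.0

conformations = range(1000)   # each conformation is independently determinable

# Parallel part: evaluate every conformation independently.
with ThreadPoolExecutor(max_workers=8) as pool:
    energies = list(pool.map(potential_energy, conformations))

# Reduction part: find the minimum energy conformation.
best = min(range(len(energies)), key=lambda i: energies[i])
```

Because no task needs another task's result, this is close to the embarrassingly parallel case defined earlier.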
Calculation of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) using the formula: F(n) = F(n-1) + F(n-2)
 This problem can't be solved in parallel
• Calculation of the Fibonacci sequence as shown would entail
dependent calculations rather than independent ones.
• The calculation of the F(n) value uses those of both F(n-1) and F(n-2).
These three terms cannot be calculated independently and therefore,
not in parallel
Identify the program's hotspots
 Know where most of the real work is being done
• The majority of scientific and technical programs usually accomplish
most of their work in a few places
• Profilers and performance analysis tools can help here
 Focus on parallelizing the hotspots and ignore those sections of
the program that account for little CPU usage
Identify bottlenecks in the program
 Are there areas that are disproportionately slow, or cause
parallelizable work to halt or be deferred?
• For example, I/O is usually something that slows a program down
 May be possible to restructure the program or use a different
algorithm to reduce or eliminate unnecessary slow areas
Identify inhibitors to parallelism
 One common class of inhibitor is data dependence
Partitioning
Break the problem into discrete "chunks" of work that can be distributed to multiple tasks
Two basic ways to partition computational work among parallel tasks:
 Domain decomposition
• Data associated with a problem is decomposed
• Each parallel task then works on a portion of the data
 Functional decomposition
• Problem is decomposed according to the work
• Each task then performs a portion of the overall work
• Functional decomposition lends itself well to problems that can be
split into different tasks
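Two common ways to carry out domain decomposition are block and cyclic distributions. A small sketch (the function names are illustrative):

```python
def block_partition(data, num_tasks):
    """Domain decomposition: each task gets one contiguous block of the data."""
    size, extra = divmod(len(data), num_tasks)
    chunks, start = [], 0
    for rank in range(num_tasks):
        end = start + size + (1 if rank < extra else 0)  # spread the remainder
        chunks.append(data[start:end])
        start = end
    return chunks

def cyclic_partition(data, num_tasks):
    """Cyclic distribution: element i goes to task i mod num_tasks."""
    return [data[rank::num_tasks] for rank in range(num_tasks)]

print(block_partition(list(range(10)), 3))   # contiguous chunks
print(cyclic_partition(list(range(10)), 3))  # interleaved chunks
```

Block distribution suits computations with neighbor communication (like the ecosystem example below, where adjacent data stays together); cyclic distribution helps when the cost per element varies along the array.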
Ecosystem Modeling
 Each program calculates the population of a given group
• Each group's growth depends on that of its neighbors
• As time progresses, each process calculates its current state, then
exchanges information with the neighbor populations
• All tasks then progress to calculate the state at the next time step
Climate Modeling
 Each model component can be thought of as a separate task
 Arrows represent exchanges of data between components during
computation:
• the atmosphere model generates wind velocity data that are used by
the ocean model, the ocean model generates sea surface
temperature data that are used by the atmosphere model, and so on
Communications
Some types of problems can be decomposed and
executed in parallel with virtually no need for tasks to
share data
 For example, imagine an image processing operation where
every pixel in a black and white image needs to have its color
reversed
 The image data can easily be distributed to multiple tasks that
then act independently of each other to do their portion of the
work
Most parallel applications require tasks to share data with
each other
 For example, a 3-D heat diffusion problem requires a task to
know the temperatures calculated by the tasks that have
neighboring data
 Changes to neighboring data have a direct effect on that task's data
Factors to consider when designing your program's inter-task communications:
 Cost of communications
• Inter-task communication virtually always implies overhead
• Communications frequently require some type of synchronization
between tasks
• Competing communication traffic can saturate the available network
bandwidth, further aggravating performance problems
 Latency vs. Bandwidth
• Latency is the time it takes to send a message from point A to point B
• Bandwidth is the amount of data that can be communicated per unit of time
• Sending many small messages can cause latency to dominate
communication overheads
• Often it is more efficient to package small messages into a larger
message, thus increasing the effective communications bandwidth
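The latency/bandwidth trade-off is easy to quantify with the usual cost model, per-message time = latency + size/bandwidth. The numbers below are assumed for illustration (10 microsecond latency, 1 GB/s bandwidth), not measurements:

```python
def transfer_time(total_bytes, num_messages, latency_s, bandwidth_bps):
    """Simple communication cost model: each message pays the full latency."""
    size = total_bytes / num_messages
    return num_messages * (latency_s + size / bandwidth_bps)

# Moving 1 MB as 1000 small messages vs. one large message:
many_small = transfer_time(1_000_000, 1000, 10e-6, 1e9)  # 1000 x 1 KB
one_large  = transfer_time(1_000_000, 1,    10e-6, 1e9)  # 1 x 1 MB

print(f"1000 small messages: {many_small * 1e3:.2f} ms")
print(f"1 large message:     {one_large * 1e3:.2f} ms")
```

With these numbers the thousand small sends pay the latency a thousand times over, which is exactly why packaging small messages into a larger one raises the effective communications bandwidth.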
 Visibility of communications
• With the Message Passing Model, communications are explicit and
generally quite visible and under the control of the programmer
• With the Data Parallel Model, communications often occur
transparently to the programmer, particularly on distributed memory
architectures
• The programmer may not even be able to know exactly how inter-task
communications are being accomplished
 Synchronous vs. asynchronous communications
• Synchronous communications are often referred to
as blocking communications since other work must wait until the
communications have completed
• Asynchronous communications allow tasks to transfer data
independently from one another
• Interleaving computation with communication is the single greatest
benefit for using asynchronous communications
 Scope of communications
• Knowing which tasks must communicate with each other is critical
during the design stage of a parallel code
• Point-to-point - involves two tasks with one task acting as the
sender/producer of data, and the other acting as the
receiver/consumer
• Collective - involves data sharing between more than two tasks, which
are often specified as being members in a common group
• Some common variations include broadcast, scatter, gather, and reduction
Efficiency of communications
 Which implementation for a given model should be used?
• Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform than
another
• What type of communication operations should be used?
• Asynchronous communication operations can improve overall program performance
Communications
Overhead and Complexity
Synchronization
Types of Synchronization
 Barrier
• Each task performs its work until it reaches the barrier
• When the last task reaches the barrier, all tasks are synchronized
 Lock
• Typically used to protect access to global data or a section of code
• Only one task at a time may use the lock
 Synchronous communication operations
• When a task performs a communication operation, some form of
coordination is required with the other task participating in the
communication
• For example, before a task can perform a send operation, it must first
receive an acknowledgment from the receiving task that it is OK to send
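The barrier and lock mechanisms above can be illustrated with Python's `threading` module (a shared-memory sketch, not a distributed one):

```python
import threading

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(4)
results = []

def worker(i):
    global counter
    with lock:               # only one task at a time updates shared data
        counter += 1
    barrier.wait()           # no task proceeds until all 4 have arrived
    results.append(counter)  # by now every increment has happened

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == [4, 4, 4, 4]  # each task saw the fully updated counter
```

Without the barrier, a fast task could read `counter` before the slower tasks had incremented it; without the lock, two increments could race and lose an update.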
 A dependence exists between program statements when the order of
statement execution affects the results of the program
 Dependencies are important to parallel programming because they
are one of the primary inhibitors to parallelism
 Loop-carried data dependence: the value of A(J-1) must be computed before the value
of A(J), so A(J) exhibits a data dependence on A(J-1)
 How to Handle Data Dependencies
 Distributed memory architectures - communicate required data at synchronization
points
 Shared memory architectures - synchronize read/write operations between tasks
Data Dependencies
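The dependence above can be made concrete with a toy Python version of the two loop shapes (the `* 2` update is illustrative, not from the original deck):

```python
# A(J) = A(J-1) * 2: each element depends on the previous one,
# so the loop must run serially.
def dependent(n):
    a = [1] * n
    for j in range(1, n):
        a[j] = a[j - 1] * 2  # loop-carried dependence on a[j-1]
    return a

# B(J) = B(J) * 2: iterations are independent, so the loop
# could safely be split among parallel tasks.
def independent(n):
    b = list(range(n))
    return [x * 2 for x in b]

assert dependent(4) == [1, 2, 4, 8]
assert independent(4) == [0, 2, 4, 6]
```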
 Practice of distributing approximately equal amounts of work among
tasks
 Important to parallel programs for performance reasons
 For example, if all tasks are subject to a barrier synchronization point, the slowest
task will determine the overall performance.
Load Balancing
How to Achieve Load Balance
 Equally partition the work each task receives
• For array operations where each task performs similar work, evenly
distribute the data set among the tasks
• For loop iterations where the work done in each iteration is similar,
evenly distribute the iterations across the tasks
• For a heterogeneous mix of machines, use a performance analysis tool to
detect load imbalances and adjust the work accordingly
 Use dynamic work assignment
• When the amount of work each task will perform is intentionally
variable, or is unable to be predicted, it may be helpful to use
a scheduler - task pool approach
 As each task finishes its work, it queues to get a new piece of work.
• It may become necessary to design an algorithm which detects and
handles load imbalances as they occur dynamically within the code
Load Balancing
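A minimal sketch of the scheduler / task-pool approach described above, using a thread-safe queue as the pool (squaring each item stands in for the real work; `run_pool` is a hypothetical helper):

```python
import queue
import threading

def run_pool(work_items, n_workers):
    """Task-pool scheme: each worker takes the next item whenever it is free."""
    tasks = queue.Queue()
    for item in work_items:
        tasks.put(item)
    done = {i: [] for i in range(n_workers)}

    def worker(wid):
        while True:
            try:
                item = tasks.get_nowait()  # queue up for a new piece of work
            except queue.Empty:
                return                     # pool drained: worker exits
            done[wid].append(item * item)  # the "work": square the item

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

done = run_pool(range(100), n_workers=4)
# All 100 items were processed exactly once, however they were shared out.
assert sum(len(v) for v in done.values()) == 100
```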
 The number of tasks into which a problem is decomposed
determines granularity
 Fine-grain Parallelism – Decomposition into a large number of tasks
• Individual tasks are relatively small in terms of code size and
execution time
• Data is transferred among processors frequently
• Facilitates load balancing
• Implies high communication overhead
 Coarse-grain Parallelism – Decomposition into a small number of tasks
• Individual tasks are relatively large in terms of code size and
execution time
• Implies more opportunity for performance increase
• Harder to load balance efficiently
Granularity
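A toy cost model makes the granularity trade-off concrete: the total computation is fixed, and each task adds an assumed per-task communication overhead (all constants here are illustrative, and the model ignores the load-balancing benefit of fine grains):

```python
TOTAL_WORK = 1.0          # seconds of computation in the whole problem (assumed)
OVERHEAD_PER_TASK = 0.01  # seconds of communication per task (assumed)

def parallel_time(n_tasks, n_procs=4):
    """Computation divides evenly; every task pays a communication cost."""
    compute = TOTAL_WORK / n_procs
    comm = (n_tasks / n_procs) * OVERHEAD_PER_TASK
    return compute + comm

fine = parallel_time(n_tasks=1000)  # fine-grain: communication dominates
coarse = parallel_time(n_tasks=8)   # coarse-grain: little communication
assert coarse < fine
```

Under these assumptions the coarse decomposition finishes in ~0.27 s while the fine one takes ~2.75 s; in practice the sweet spot depends on how badly the coarse version load-balances.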
 Array Processing
 Example demonstrates calculations on 2-dimensional array
elements, with the computation on each array element being
independent from other array elements
 Serial program calculates one element at a time in sequential
order
Parallel Examples : Array Processing
 Array Processing - Parallel Solution 1
 Independent calculation of array elements ensures there is no
need for communication between tasks
 Each task executes the portion of the loop corresponding to the
data it owns
Parallel Examples : Array Processing
 One Possible Solution
 Implement as a Single Program Multiple Data model
• Master process initializes array, sends info to worker processes and
receives results
• Worker process receives info, performs its share of computation and
sends results to master
• Static load balancing
• Each task has a fixed amount of work to do
• There may be significant idle time for faster or more lightly loaded processors -
the slowest task determines overall performance
Parallel Examples : Array Processing
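The "equally partition" step of the static scheme can be sketched as a helper that splits N array elements into contiguous, near-equal chunks (`static_partition` is a hypothetical helper, not part of the original deck):

```python
def static_partition(n_elements, n_tasks):
    """Split n_elements into n_tasks contiguous, near-equal chunks."""
    base, extra = divmod(n_elements, n_tasks)
    chunks, start = [], 0
    for t in range(n_tasks):
        size = base + (1 if t < extra else 0)  # first `extra` tasks get one more
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = static_partition(10, 3)
assert [len(c) for c in chunks] == [4, 3, 3]    # fixed amount of work per task
assert [c[0] for c in chunks] == [0, 4, 7]      # contiguous index ranges
```

Each worker would then loop over only its own chunk of indices; this is the static load balancing the slide describes, with no run-time adjustment.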
Pool of Tasks Scheme
 Two processes are employed
• Master Process
• Holds pool of tasks for worker processes to do
• Sends worker a task when requested
• Collects results from workers
• Worker Process
• Gets task from master process
• Performs computation
• Sends results to master
• Worker processes do not know before runtime which portion of array they will
handle or how many tasks they will perform
• Dynamic load balancing occurs at run time
• The faster workers get more work to do
Parallel Examples : Array Processing
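The pool-of-tasks scheme can be simulated with two workers of different speeds pulling from a shared queue; the per-item delays are artificial, chosen only to show that the faster worker ends up doing more of the work:

```python
import queue
import threading
import time

tasks = queue.Queue()
for j in range(20):
    tasks.put(j)  # pool of array indices waiting to be processed
counts = {"fast": 0, "slow": 0}

def worker(name, delay):
    while True:
        try:
            j = tasks.get_nowait()  # request the next task from the pool
        except queue.Empty:
            return
        time.sleep(delay)           # simulated per-element computation
        counts[name] += 1           # result would be sent to the master here

fast = threading.Thread(target=worker, args=("fast", 0.001))
slow = threading.Thread(target=worker, args=("slow", 0.01))
fast.start(); slow.start()
fast.join(); slow.join()

# Dynamic balancing: the faster worker grabbed most of the pool.
assert counts["fast"] + counts["slow"] == 20
assert counts["fast"] > counts["slow"]
```

Neither worker knew in advance how many items it would handle; the split emerged at run time, exactly as the slide describes.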
Multicore computing
 A Multicore processor is a processor that includes
multiple execution units ("cores") on the same chip
Symmetric multiprocessing
 A computer system with multiple identical processors that
share memory and connect via a bus
Distributed System
 A software system in which components located on networked
computers communicate and coordinate their actions by
passing messages
 An important goal and challenge of distributed systems is
location transparency
Classes of Parallel Computers
Cluster computing
 A cluster is a group of loosely coupled computers that work
together closely, so that in some respects they can be
regarded as a single computer
Massively parallel processing
 A massively parallel processor (MPP) is a single computer with
many networked processors.
 MPPs have many of the same characteristics as clusters, but
MPPs have specialized interconnect networks
Grid computing
 Compared to clusters, grids tend to be more loosely coupled,
heterogeneous, and geographically dispersed
Classes of Parallel Computers
Exascale computing
 Computer system capable of at least one exaFlops
• 10^18 floating point operations per second
Volunteer Computing
 Computer owners donate their computing resources to one or
more projects
Peer-to-peer Network
 A type of decentralized and distributed network architecture in
which individual nodes in the network act as both suppliers and
consumers of resources
Classes of Parallel Computers
Distributed Systems
A collection of independent computers that appear to the
users of the system as a single computer
A collection of autonomous computers, connected
through a network and distribution middleware which
enables computers to coordinate their activities and to
share the resources of the system, so that users
perceive the system as a single, integrated computing
facility
Distributed Systems
Example - Internet
Example - ATM
Example - Mobile devices in a Distributed System
Why Distributed Systems?
Transparency
Scalability
Fault tolerance
Concurrency
Openness
These challenges can also be seen as the goals or
desired properties of a distributed system
Basic problems and challenges
 Concealment from the user and the application programmer
of the separation of the components of a distributed system
 Access Transparency - Local and remote resources are accessed in
same way
 Location Transparency - Users are unaware of the location of
resources
 Migration Transparency - Resources can migrate without name
change
 Replication Transparency - Users are unaware of the existence of
multiple copies of resources
 Failure Transparency - Users are unaware of the failure of individual
components
 Concurrency Transparency - Users are unaware of sharing resources
with others
Transparency
 Addition of users and resources without suffering a noticeable
loss of performance or increase in administrative complexity
 Adding users and resources causes a system to grow:
 Size - growth with regards to the number of users or resources
• System may become overloaded
• May increase administration cost
 Geography - growth with regards to geography or the distance
between nodes
• Greater communication delays
 Administration – increase in administrative cost
Scalability
Whether the system can be extended in various ways
without disrupting existing system and services
 Hardware extensions
• adding peripherals, memory, communication interfaces
 Software extensions
• Operating System features
• Communication protocols
Openness is supported by:
 Public interfaces
 Standardized communication protocols
Openness
In a single system several processes are interleaved
In distributed systems - there are many systems with one
or more processors
Many users simultaneously invoke commands or
applications, access and update shared data
 Mutual exclusion
 Synchronization
 No global clock
Concurrency
Hardware, software and networks fail
Distributed systems must maintain availability even when
hardware, software, or network reliability is low
Fault tolerance is achieved by
 Recovery
 Redundancy
Issues
 Detecting failures
 Masking failures
 Recovery from failures
 Redundancy
Fault tolerance
Fault tolerance
Omission and Arbitrary Failures
Performance issues
 Responsiveness
 Throughput
 Load sharing, load balancing
 Quality of service
 Correctness
 Reliability, availability, fault tolerance
 Security
 Performance
 Adaptability
Design Requirements

Contenu connexe

Tendances

Parallel Programing Model
Parallel Programing ModelParallel Programing Model
Parallel Programing ModelAdlin Jeena
 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systemssumitjain2013
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization Hafiz faiz
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed SystemSunita Sahu
 
Distributed shred memory architecture
Distributed shred memory architectureDistributed shred memory architecture
Distributed shred memory architectureMaulik Togadiya
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classificationNarayan Kandel
 
Communication primitives
Communication primitivesCommunication primitives
Communication primitivesStudent
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computingSVijaylakshmi
 
Hardware and Software parallelism
Hardware and Software parallelismHardware and Software parallelism
Hardware and Software parallelismprashantdahake
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applicationsBurhan Ahmed
 
key distribution in network security
key distribution in network securitykey distribution in network security
key distribution in network securitybabak danyal
 

Tendances (20)

Parallel processing
Parallel processingParallel processing
Parallel processing
 
Parallel Programing Model
Parallel Programing ModelParallel Programing Model
Parallel Programing Model
 
6.distributed shared memory
6.distributed shared memory6.distributed shared memory
6.distributed shared memory
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systems
 
Query Decomposition and data localization
Query Decomposition and data localization Query Decomposition and data localization
Query Decomposition and data localization
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
 
Cluster computing
Cluster computingCluster computing
Cluster computing
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Distributed shred memory architecture
Distributed shred memory architectureDistributed shred memory architecture
Distributed shred memory architecture
 
Parallel computing persentation
Parallel computing persentationParallel computing persentation
Parallel computing persentation
 
Feng’s classification
Feng’s classificationFeng’s classification
Feng’s classification
 
Communication primitives
Communication primitivesCommunication primitives
Communication primitives
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
Synchronization in distributed computing
Synchronization in distributed computingSynchronization in distributed computing
Synchronization in distributed computing
 
Hardware and Software parallelism
Hardware and Software parallelismHardware and Software parallelism
Hardware and Software parallelism
 
operating system structure
operating system structureoperating system structure
operating system structure
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applications
 
key distribution in network security
key distribution in network securitykey distribution in network security
key distribution in network security
 

Similaire à Introduction to Parallel and Distributed Computing

parallelprogramming-130823023925-phpapp01.pptx
parallelprogramming-130823023925-phpapp01.pptxparallelprogramming-130823023925-phpapp01.pptx
parallelprogramming-130823023925-phpapp01.pptxMarlonMagtibay3
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computingHeman Pathak
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performanceinside-BigData.com
 
Week # 1.pdf
Week # 1.pdfWeek # 1.pdf
Week # 1.pdfgiddy5
 
Parallel & Distributed processing
Parallel & Distributed processingParallel & Distributed processing
Parallel & Distributed processingSyed Zaid Irshad
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computingMehul Patel
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfHasanAfwaaz1
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingRoshan Karunarathna
 
(19-23)CC Unit-1 ppt.pptx
(19-23)CC Unit-1 ppt.pptx(19-23)CC Unit-1 ppt.pptx
(19-23)CC Unit-1 ppt.pptxNithishaYadavv
 
00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdfaminnezarat
 
intro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxintro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxssuser413a98
 
Lecture 2
Lecture 2Lecture 2
Lecture 2Mr SMAK
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.pptssuser413a98
 

Similaire à Introduction to Parallel and Distributed Computing (20)

parallelprogramming-130823023925-phpapp01.pptx
parallelprogramming-130823023925-phpapp01.pptxparallelprogramming-130823023925-phpapp01.pptx
parallelprogramming-130823023925-phpapp01.pptx
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computing
 
Parallel Computing
Parallel Computing Parallel Computing
Parallel Computing
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
 
Week # 1.pdf
Week # 1.pdfWeek # 1.pdf
Week # 1.pdf
 
Parallel & Distributed processing
Parallel & Distributed processingParallel & Distributed processing
Parallel & Distributed processing
 
Introduction to parallel_computing
Introduction to parallel_computingIntroduction to parallel_computing
Introduction to parallel_computing
 
Underlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computingUnderlying principles of parallel and distributed computing
Underlying principles of parallel and distributed computing
 
CC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdfCC LECTURE NOTES (1).pdf
CC LECTURE NOTES (1).pdf
 
CC unit 1.pptx
CC unit 1.pptxCC unit 1.pptx
CC unit 1.pptx
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
(19-23)CC Unit-1 ppt.pptx
(19-23)CC Unit-1 ppt.pptx(19-23)CC Unit-1 ppt.pptx
(19-23)CC Unit-1 ppt.pptx
 
00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf00 - BigData-Chapter_01-PDC.pdf
00 - BigData-Chapter_01-PDC.pdf
 
intro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxintro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptx
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
parallel computing.ppt
parallel computing.pptparallel computing.ppt
parallel computing.ppt
 
Isometric Making Essay
Isometric Making EssayIsometric Making Essay
Isometric Making Essay
 
parallel processing
parallel processingparallel processing
parallel processing
 

Plus de Sayed Chhattan Shah

Introduction to System Programming
Introduction to System ProgrammingIntroduction to System Programming
Introduction to System ProgrammingSayed Chhattan Shah
 
Introduction to Differential Equations
Introduction to Differential EquationsIntroduction to Differential Equations
Introduction to Differential EquationsSayed Chhattan Shah
 
Cloud and Edge Computing Systems
Cloud and Edge Computing SystemsCloud and Edge Computing Systems
Cloud and Edge Computing SystemsSayed Chhattan Shah
 
Introduction to Internet of Things
Introduction to Internet of ThingsIntroduction to Internet of Things
Introduction to Internet of ThingsSayed Chhattan Shah
 
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...Sayed Chhattan Shah
 
IEEE 802.11 Architecture and Services
IEEE 802.11 Architecture and ServicesIEEE 802.11 Architecture and Services
IEEE 802.11 Architecture and ServicesSayed Chhattan Shah
 
Routing in Mobile Ad hoc Networks
Routing in Mobile Ad hoc NetworksRouting in Mobile Ad hoc Networks
Routing in Mobile Ad hoc NetworksSayed Chhattan Shah
 
Keynote Talk on Recent Advances in Mobile Grid and Cloud Computing
Keynote Talk on Recent Advances in Mobile Grid and Cloud ComputingKeynote Talk on Recent Advances in Mobile Grid and Cloud Computing
Keynote Talk on Recent Advances in Mobile Grid and Cloud ComputingSayed Chhattan Shah
 
Keynote on Mobile Grid and Cloud Computing
Keynote on Mobile Grid and Cloud ComputingKeynote on Mobile Grid and Cloud Computing
Keynote on Mobile Grid and Cloud ComputingSayed Chhattan Shah
 
Introduction to Mobile Ad hoc Networks
Introduction to Mobile Ad hoc NetworksIntroduction to Mobile Ad hoc Networks
Introduction to Mobile Ad hoc NetworksSayed Chhattan Shah
 
Tips on Applying for a Scholarship
Tips on Applying for a ScholarshipTips on Applying for a Scholarship
Tips on Applying for a ScholarshipSayed Chhattan Shah
 

Plus de Sayed Chhattan Shah (17)

Introduction to System Programming
Introduction to System ProgrammingIntroduction to System Programming
Introduction to System Programming
 
Introduction to Differential Equations
Introduction to Differential EquationsIntroduction to Differential Equations
Introduction to Differential Equations
 
Algorithm Design and Analysis
Algorithm Design and AnalysisAlgorithm Design and Analysis
Algorithm Design and Analysis
 
Cloud and Edge Computing Systems
Cloud and Edge Computing SystemsCloud and Edge Computing Systems
Cloud and Edge Computing Systems
 
Introduction to Internet of Things
Introduction to Internet of ThingsIntroduction to Internet of Things
Introduction to Internet of Things
 
IoT Network Technologies
IoT Network TechnologiesIoT Network Technologies
IoT Network Technologies
 
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...
5G Network: Requirements, Design Principles, Architectures, and Enabling Tech...
 
Data Center Networks
Data Center NetworksData Center Networks
Data Center Networks
 
IEEE 802.11 Architecture and Services
IEEE 802.11 Architecture and ServicesIEEE 802.11 Architecture and Services
IEEE 802.11 Architecture and Services
 
Routing in Mobile Ad hoc Networks
Routing in Mobile Ad hoc NetworksRouting in Mobile Ad hoc Networks
Routing in Mobile Ad hoc Networks
 
Keynote Talk on Recent Advances in Mobile Grid and Cloud Computing
Keynote Talk on Recent Advances in Mobile Grid and Cloud ComputingKeynote Talk on Recent Advances in Mobile Grid and Cloud Computing
Keynote Talk on Recent Advances in Mobile Grid and Cloud Computing
 
Keynote on Mobile Grid and Cloud Computing
Keynote on Mobile Grid and Cloud ComputingKeynote on Mobile Grid and Cloud Computing
Keynote on Mobile Grid and Cloud Computing
 
Introduction to Mobile Ad hoc Networks
Introduction to Mobile Ad hoc NetworksIntroduction to Mobile Ad hoc Networks
Introduction to Mobile Ad hoc Networks
 
Cloud Robotics
Cloud RoboticsCloud Robotics
Cloud Robotics
 
Introduction to Cloud Computing
Introduction to Cloud ComputingIntroduction to Cloud Computing
Introduction to Cloud Computing
 
Tips on Applying for a Scholarship
Tips on Applying for a ScholarshipTips on Applying for a Scholarship
Tips on Applying for a Scholarship
 
Cluster and Grid Computing
Cluster and Grid ComputingCluster and Grid Computing
Cluster and Grid Computing
 

Dernier

Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxleah joy valeriano
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 

Dernier (20)

Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 

Introduction to Parallel and Distributed Computing

  • 1. 한국해양과학기술진흥원 Introduction to Parallel Computing 2013.10.6 Sayed Chhattan Shah, PhD Senior Researcher Electronics and Telecommunications Research Institute, Korea
  • 2. 한국해양과학기술진흥원 2 Acknowledgements Blaise Barney, Lawrence Livermore National Laboratory, “Introduction to Parallel Computing” Jim Demmel, Parallel Computing Lab at UC Berkeley, “ Parallel Computing”
  • 3. 한국해양과학기술진흥원 Outline  Introduction  Parallel Computer Memory Architectures  Shared Memory  Distributed Memory  Hybrid Distributed-Shared Memory  Parallel Programming Models  Shared Memory Model  Threads Model  Message Passing Model  Data Parallel Model  Design Parallel Programs
  • 5. 한국해양과학기술진흥원 What is Parallel Computing?  Traditionally, software has been written for serial computation  To be run on a single computer having a single Central Processing Unit  A problem is broken into a discrete series of instructions  Instructions are executed one after another  Only one instruction may execute at any moment in time
  • 7. 한국해양과학기술진흥원 What is Parallel Computing? Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem  Accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others
  • 9. 한국해양과학기술진흥원 What is Parallel Computing? The computational problem should be:  Solved in less time with multiple compute resources than with a single compute resource The compute resources might be  A single computer with multiple processors  Several networked computers  A combination of both
  • 10. 한국해양과학기술진흥원 What is Parallel Computing? LLNL Parallel Computer:  Each compute node is a multi-processor parallel computer  Multiple compute nodes are networked together with an InfiniBand network
  • 11. 한국해양과학기술진흥원 What is Parallel Computing? The Real World is Massively Parallel  In the natural world, many complex, interrelated events are happening at the same time, yet within a temporal sequence  Compared to serial computing, parallel computing is much better suited for modeling, simulating and understanding complex, real world phenomena
  • 13. 한국해양과학기술진흥원 Uses for Parallel Computing  Science and Engineering  Historically, parallel computing has been used to model difficult problems in many areas of science and engineering • Atmosphere, Earth, Environment, Physics, Bioscience, Chemistry, Mechanical Engineering, Electrical Engineering, Circuit Design, Microelectronics, Defense, Weapons
  • 14. 한국해양과학기술진흥원 Uses for Parallel Computing  Industrial and Commercial  Today, commercial applications provide an equal or greater driving force in the development of faster computers • Data mining, Oil exploration, Web search engines, Medical imaging and diagnosis, Pharmaceutical design, Financial and economic modeling, Advanced graphics and virtual reality, Collaborative work environments
  • 15. 한국해양과학기술진흥원 Why Use Parallel Computing?  Save time and money  In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings  Parallel computers can be built from cheap, commodity components
  • 16. 한국해양과학기술진흥원 Why Use Parallel Computing? Solve larger problems  Many problems are so large and complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory • "Grand Challenge" problems (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources.  Web search engines and databases processing millions of transactions per second
  • 17. 한국해양과학기술진흥원 Why Use Parallel Computing? Provide concurrency  A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously • For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually”
  • 18. 한국해양과학기술진흥원 Why Use Parallel Computing? Use of non-local resources  Using compute resources on a wide area network, or even the Internet when local compute resources are insufficient • SETI@home (setiathome.berkeley.edu) over 1.3 million users, 3.4 million computers in nearly every country in the world. Source: www.boincsynergy.com/stats/ (June, 2013). • Folding@home (folding.stanford.edu) uses over 320,000 computers globally (June, 2013)
  • 19. 한국해양과학기술진흥원 Why Use Parallel Computing? Limits to serial computing  Transmission speeds • The speed of a serial computer is directly dependent upon how fast data can move through hardware. • Absolute limits are the speed of light and the transmission limit of copper wire • Increasing speeds necessitate increasing proximity of processing elements  Limits to miniaturization • Processor technology is allowing an increasing number of transistors to be placed on a chip • However, even with molecular or atomic-level components, a limit will be reached on how small components can be
  • 20. 한국해양과학기술진흥원 Why Use Parallel Computing? Limits to serial computing  Economic limitations • It is increasingly expensive to make a single processor faster • Using a larger number of moderately fast commodity processors to achieve the same or better performance is less expensive  Current computer architectures are increasingly relying upon hardware level parallelism to improve performance • Multiple execution units • Pipelined instructions • Multi-core
  • 21. 한국해양과학기술진흥원 The Future  Trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures clearly show that parallelism is the future of computing
  • 22. 한국해양과학기술진흥원 The Future  There has been a greater than 1000x increase in supercomputer performance, with no end currently in sight  The race is already on for Exascale Computing!  1 exaFLOPS = 10^18 floating point operations per second
  • 26. 한국해양과학기술진흥원 von Neumann Architecture Named after the Hungarian mathematician John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers Since then, virtually all computers have followed this basic design Comprises four main components:  RAM is used to store both program instructions and data  Control unit fetches instructions or data from memory, decodes instructions, and then sequentially coordinates operations to accomplish the programmed task  Arithmetic Unit performs basic arithmetic operations  Input-Output is the interface to the human operator
  • 27. 한국해양과학기술진흥원 Flynn's Classical Taxonomy There are different ways to classify parallel computers The most widely used classification is Flynn's Taxonomy  Based upon the number of concurrent instruction and data streams available in the architecture
  • 28. 한국해양과학기술진흥원 Single Instruction, Single Data  A sequential computer which exploits no parallelism in either the instruction or data streams • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle • Single Data: Only one data stream is being used as input during any one clock cycle • Can have concurrent processing characteristics: Pipelined execution Flynn's Classical Taxonomy
  • 29. 한국해양과학기술진흥원 Flynn's Classical Taxonomy Older generation mainframes, minicomputers, workstations, modern day PCs
  • 30. 한국해양과학기술진흥원 Flynn's Classical Taxonomy Single Instruction, Multiple Data  A computer which exploits multiple data streams against a single instruction stream to perform operations • Single Instruction: All processing units execute the same instruction at any given clock cycle • Multiple Data: Each processing unit can operate on a different data element
  • 31. 한국해양과학기술진흥원 Flynn's Classical Taxonomy Single Instruction, Multiple Data  Vector Processor implements an instruction set containing instructions that operate on 1D arrays of data called vectors
  • 32. 한국해양과학기술진흥원 Multiple Instruction, Single Data  Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams  Single Data: A single data stream is fed into multiple processing units  Some conceivable uses might be: • Multiple cryptography algorithms attempting to crack a single coded message Flynn's Classical Taxonomy
  • 33. 한국해양과학기술진흥원 Multiple Instruction, Multiple Data  Multiple autonomous processors simultaneously executing different instructions on different data • Multiple Instruction: Every processor may be executing a different instruction stream • Multiple Data: Every processor may be working with a different data stream Flynn's Classical Taxonomy
  • 34. 한국해양과학기술진흥원 Flynn's Classical Taxonomy Most current supercomputers, networked parallel computer clusters, grids, clouds, multi-core PCs Many MIMD architectures also include SIMD execution sub-components
  • 35. 한국해양과학기술진흥원 Some General Parallel Terminology Parallel computing  Using a parallel computer to solve a single problem faster Parallel computer  Multiple-processor or multi-core system supporting parallel programming Parallel programming  Programming in a language that supports concurrency explicitly Supercomputing or High Performance Computing  Using the world's fastest and largest computers to solve large problems
  • 36. 한국해양과학기술진흥원 Some General Parallel Terminology Task  A logically discrete section of computational work.  A task is typically a program or program-like set of instructions that is executed by a processor  A parallel program consists of multiple tasks running on multiple processors Shared Memory  Hardware point of view, describes a computer architecture where all processors have direct access to common physical memory  In a programming sense, it describes a model where all parallel tasks have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists
  • 37. 한국해양과학기술진흥원 Some General Parallel Terminology Symmetric Multi-Processor (SMP)  Hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing Distributed Memory  In hardware, refers to network based memory access for physical memory that is not common  As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing
  • 38. 한국해양과학기술진흥원 Some General Parallel Terminology Communications  Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network, however the actual event of data exchange is commonly referred to as communications regardless of the method employed Synchronization  Coordination of parallel tasks in real time, very often associated with communications  Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point  Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase
  • 39. 한국해양과학기술진흥원 Some General Parallel Terminology Parallel Overhead  The amount of time required to coordinate parallel tasks, as opposed to doing useful work.  Parallel overhead can include factors such as: • Task start-up time • Synchronizations • Data communications • Software overhead imposed by parallel languages, libraries, operating system, etc. • Task termination time Massively Parallel  Refers to the hardware that comprises a given parallel system - having many processors.  The meaning of "many" keeps increasing, but currently, the largest parallel computers can be comprised of processors numbering in the hundreds of thousands
  • 40. 한국해양과학기술진흥원 Some General Parallel Terminology Embarrassingly Parallel  Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks Scalability  A parallel system's ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources
  • 41. 한국해양과학기술진흥원 Some General Parallel Terminology Multi-core processor  A single computing component with two or more independent actual central processing units
  • 42. 한국해양과학기술진흥원 Potential program speedup is defined by the fraction of code (P) that can be parallelized  If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).  If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast. Limits and Costs of Parallel Programming
  • 43. 한국해양과학기술진흥원  Introducing the number of processors N performing the parallel fraction of work P, the relationship can be modeled by speedup = 1 / ((1 - P) + P/N)  It soon becomes obvious that there are limits to the scalability of parallelism. For example Limits and Costs of Parallel Programming
  • 45. 한국해양과학기술진흥원 Some General Parallel Terminology Complexity  Parallel applications are much more complex than corresponding serial applications • Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them  The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle: • Design • Coding • Debugging • Tuning • Maintenance
  • 46. 한국해양과학기술진흥원 Some General Parallel Terminology Portability  Portability issues with parallel programs are not as serious as in years past due to standardization in several APIs, such as MPI, POSIX threads, and OpenMP  All of the usual portability issues associated with serial programs apply to parallel programs • Hardware architectures are characteristically highly variable and can affect portability
  • 47. 한국해양과학기술진흥원 Some General Parallel Terminology Resource Requirements  Amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems  For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation • The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs
  • 48. 한국해양과학기술진흥원 Some General Parallel Terminology Scalability  Ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more processors is rarely the answer  Algorithm may have inherent limits to scalability  Hardware factors play a significant role in scalability. Examples: • Memory-CPU bus bandwidth on an SMP machine • Communications network bandwidth • Amount of memory available on any given machine or set of machines • Processor clock speed
  • 49. Parallel Computer Memory Architectures
  • 50. 한국해양과학기술진흥원 All processors access all memory as global address space  Multiple processors can operate independently but share the same memory resources  Changes in a memory location effected by one processor are visible to all other processors Shared Memory
  • 51. 한국해양과학기술진흥원 Shared memory machines are classified as UMA and NUMA, based upon memory access times  Uniform Memory Access (UMA) • Most commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors • Equal access and access times to memory • Sometimes called CC-UMA - Cache Coherent UMA  Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level Shared Memory
  • 52. 한국해양과학기술진흥원  Non-Uniform Memory Access (NUMA) • Often made by physically linking two or more SMPs • One SMP can directly access memory of another SMP • Not all processors have equal access time to all memories • Memory access across link is slower • If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA Shared Memory
  • 53. 한국해양과학기술진흥원 Advantages  Global address space provides a user-friendly programming perspective to memory  Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs Disadvantages  Primary disadvantage is the lack of scalability between memory and CPUs • Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management  Programmer responsibility for synchronization constructs that ensure "correct" access of global memory Shared Memory
  • 54. 한국해양과학기술진흥원 Processors have their own local memory  Changes to processor’s local memory have no effect on the memory of other processors  When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated  Synchronization between tasks is likewise the programmer's responsibility  The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet Distributed Memory
  • 55. 한국해양과학기술진흥원 Advantages  Memory is scalable with the number of processors. • Increase the number of processors and the size of memory increases proportionately  Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency.  Cost effectiveness: can use commodity, off-the-shelf processors and networking Disadvantages  The programmer is responsible for many of the details associated with data communication between processors.  Non-uniform memory access times - data residing on a remote node takes longer to access than node local data. Distributed Memory
  • 56. 한국해양과학기술진흥원 Hybrid Distributed-Shared Memory The largest and fastest computers in the world today employ both shared and distributed memory architectures  The shared memory component can be a shared memory machine or graphics processing units  The distributed memory component is the networking of multiple shared memory or GPU machines
  • 57. 한국해양과학기술진흥원 Advantages and Disadvantages  Whatever is common to both shared and distributed memory architectures  Increased scalability is an important advantage  Increased programmer complexity is an important disadvantage Hybrid Distributed-Shared Memory
  • 59. 한국해양과학기술진흥원 Parallel Programming Model Programming model provides an abstract view of computing system  Abstraction above hardware and memory architectures  Value of a programming model is usually judged on its generality • how well a range of different problems can be expressed and • how well they execute on a range of different architectures  The implementation of a programming model can take several forms such as libraries invoked from traditional sequential languages, language extensions, or complete new execution models
  • 60. 한국해양과학기술진흥원 Parallel Programming Model Parallel programming models in common use:  Shared Memory (without threads)  Threads  Distributed Memory / Message Passing  Data Parallel  Hybrid  Single Program Multiple Data (SPMD)  Multiple Program Multiple Data (MPMD) These models are NOT specific to a particular type of machine or memory architecture  Any of these models can be implemented on any underlying hardware
  • 61. 한국해양과학기술진흥원 Parallel Programming Model SHARED memory model on a DISTRIBUTED memory machine  Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory (global address space).  This approach is referred to as virtual shared memory DISTRIBUTED memory model on a SHARED memory machine  The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines.  However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used
  • 62. 한국해양과학기술진흥원 Shared Memory Model - Without Threads Tasks share a common address space  Efficient means of passing data between programs  Various mechanisms such as locks / semaphores may be used to control access to the shared memory  Programmer's point of view • The notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks • Program development can often be simplified Disadvantage in terms of performance  It becomes more difficult to understand and manage data locality: • Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors use the same data
  • 63. 한국해양과학기술진흥원 Shared Memory Model - Without Threads Implementations  Native compilers or hardware translate user program variables into actual memory addresses, which are global • On stand-alone shared memory machines, this is straightforward. • On distributed shared memory machines, memory is physically distributed across a network of machines, but made global through specialized hardware and software
  • 64. 한국해양과학기술진흥원 Threads Model Type of shared memory programming model  A single "heavy weight" process can have multiple "light weight", concurrent execution paths  Main program a.out is scheduled by native OS  a.out loads and acquires all of the necessary system and user resources to run. This is the "heavy weight" process  a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently
  • 65. 한국해양과학기술진흥원 Threads Model  Each thread has local data, but also, shares the entire resources of a.out  This saves the overhead associated with replicating a program's resources for each thread ("light weight"). Each thread also benefits from a global memory view because it shares the memory space of a.out  Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time  Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed
  • 66. 한국해양과학기술진흥원 Threads Model Implementations • POSIX Threads  Library based; requires parallel coding  C Language only  Commonly referred to as Pthreads.  Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations.  Very explicit parallelism; requires significant programmer attention to detail. • OpenMP  Compiler directive based; can use serial code  Portable / multi-platform, including Unix and Windows platforms  Available in C/C++ and Fortran implementations  Can be very easy and simple to use
  • 67. 한국해양과학기술진흥원 Distributed Memory / Message Passing Model A set of tasks that use their own local memory during computation  Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.  Tasks exchange data through communications by sending and receiving messages.  Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
  • 68. 한국해양과학기술진흥원 Distributed Memory / Message Passing Model Implementations  From a programming perspective, message passing implementations usually comprise a library of subroutines  Calls to these subroutines are embedded in source code  MPI specifications are available on the web at http://www.mpi-forum.org/docs/  MPI implementations exist for virtually all popular parallel computing platforms
  • 69. 한국해양과학기술진흥원 Address space is treated globally A set of tasks work collectively on the same data structure, however, each task works on a different partition of the same data structure  On shared memory architectures, all tasks may have access to the data structure through global memory  On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task Data Parallel Model
  • 70. 한국해양과학기술진흥원 Implementations  Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming. Compiler dependent. More information: http://upc.lbl.gov/  Global Arrays: provides a shared memory style programming environment in the context of distributed array data structures. Public domain library with C and Fortran77 bindings. More information: http://www.emsl.pnl.gov/docs/global/  X10: a PGAS based parallel programming language being developed by IBM at the Thomas J. Watson Research Center. More information: http://x10-lang.org/ Data Parallel Model
  • 71. 한국해양과학기술진흥원 Hybrid Model A hybrid model combines more than one of the previously described programming models
  • 72. 한국해양과학기술진흥원 Single Program Multiple Data High level programming model that can be built upon any combination of the previously mentioned parallel programming models SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously.  This program can be threads, message passing, data parallel or hybrid. MULTIPLE DATA: All tasks may use different data
  • 73. 한국해양과학기술진흥원 Multiple Program Multiple Data High level programming model that can be built upon any combination of the previously mentioned parallel programming models MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. MULTIPLE DATA: All tasks may use different data
  • 75. 한국해양과학기술진흥원 Determine whether the problem can be parallelized Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation  This problem can be solved in parallel. • Each of the molecular conformations is independently determinable • The calculation of the minimum energy conformation is also a parallelizable problem. Understand the Problem and the Program
  • 76. 한국해양과학기술진흥원 Calculation of the series (0,1,1,2,3,5,8,13,21,...) by use of the formula: F(n) = F(n-1) + F(n-2)  This problem can't be solved in parallel • Calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones. • The calculation of the F(n) value uses those of both F(n-1) and F(n-2). These three terms cannot be calculated independently and therefore, not in parallel Understand the Problem and the Program
  • 77. 한국해양과학기술진흥원 Understand the Problem and the Program Identify the program's hotspots  Know where most of the real work is being done • The majority of scientific and technical programs usually accomplish most of their work in a few places • Profilers and performance analysis tools can help here  Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage
  • 78. 한국해양과학기술진흥원 Understand the Problem and the Program Identify bottlenecks in the program  Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred? • For example, I/O is usually something that slows a program down  May be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas Identify inhibitors to parallelism  One common class of inhibitor is data dependence
  • 79. 한국해양과학기술진흥원 Break the problem into discrete "chunks" of work that can be distributed to multiple tasks Two basic ways to partition computational work among parallel tasks:  Domain decomposition • Data associated with a problem is decomposed • Each parallel task then works on a portion of the data Partitioning
  • 80. 한국해양과학기술진흥원 Partitioning  Functional decomposition • Problem is decomposed according to the work • Each task then performs a portion of the overall work • Functional decomposition lends itself well to problems that can be split into different tasks
  • 81. 한국해양과학기술진흥원 Partitioning Ecosystem Modeling  Each program calculates the population of a given group • Each group's growth depends on that of its neighbors • As time progresses, each process calculates its current state, then exchanges information with the neighbor populations • All tasks then progress to calculate the state at the next time step
  • 82. 한국해양과학기술진흥원 Partitioning Climate Modeling  Each model component can be thought of as a separate task  Arrows represent exchanges of data between components during computation: • the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on
  • 83. 한국해양과학기술진흥원 Communications Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data  For example, imagine an image processing operation where every pixel in a black and white image needs to have its color reversed  The image data can easily be distributed to multiple tasks that then act independently of each other to do their portion of the work Most parallel applications require tasks to share data with each other  For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data  Changes to neighboring data have a direct effect on that task's data
  • 84. 한국해양과학기술진흥원 Communications Factors to Consider when designing your program's inter-task communications  Cost of communications • Inter-task communication virtually always implies overhead • Communications frequently require some type of synchronization between tasks • Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems  Latency vs. Bandwidth • latency is the time it takes to send a message from point A to B • bandwidth is the amount of data that can be communicated per unit of time • Sending many small messages can cause latency to dominate communication overheads • Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth
  • 85. 한국해양과학기술진흥원 Communications  Visibility of communications • With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer • With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures • The programmer may not even be able to know exactly how inter-task communications are being accomplished  Synchronous vs. asynchronous communications • Synchronous communications are often referred to as blocking communications since other work must wait until the communications have completed • Asynchronous communications allow tasks to transfer data independently from one another • Interleaving computation with communication is the single greatest benefit for using asynchronous communications
  • 86. 한국해양과학기술진흥원 Communications  Scope of communications • Knowing which tasks must communicate with each other is critical during the design stage of a parallel code • Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the other acting as the receiver/consumer • Collective - involves data sharing between more than two tasks, which are often specified as being members in a common group • Some common variations
  • 87. 한국해양과학기술진흥원 Communications Efficiency of communications  Which implementation for a given model should be used? • Using the Message Passing Model as an example, one MPI implementation may be faster on a given hardware platform than another • What type of communication operations should be used? • asynchronous communication operations can improve overall program performance
  • 89. 한국해양과학기술진흥원 Synchronization Types of Synchronization  Barrier • Each task performs its work until it reaches the barrier • When the last task reaches the barrier, all tasks are synchronized  Lock • Typically used to protect access to global data or code section • Only one task at a time may use the lock  Synchronous communication operations • When a task performs a communication operation, some form of coordination is required with the other task participating in the communication • For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send
  • 90. 한국해양과학기술진흥원  A dependence exists between program statements when the order of statement execution affects the results of the program  Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism  Loop carried data dependence: The value of A(J-1) must be computed before the value of A(J), therefore A(J) exhibits a data dependency on A(J-1)  How to Handle Data Dependencies  Distributed memory architectures - communicate required data at synchronization points  Shared memory architectures - synchronize read-write operations between tasks Data Dependencies
  • 91. 한국해양과학기술진흥원  Practice of distributing approximately equal amounts of work among tasks  Important to parallel programs for performance reasons  For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance. Load Balancing
  • 92. 한국해양과학기술진흥원 How to Achieve Load Balance  Equally partition the work each task receives • For array operations where each task performs similar work, evenly distribute the data set among the tasks • For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks • For a heterogeneous mix of machines, use a performance analysis tool to detect any load imbalances and adjust work accordingly  Use dynamic work assignment • When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler - task pool approach  As each task finishes its work, it queues to get a new piece of work • It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code Load Balancing
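Evenly distributing loop iterations can be sketched as below; `partition` is a hypothetical helper name, and the item and task counts are arbitrary:

```python
def partition(n_items, n_tasks):
    """Split n_items loop iterations as evenly as possible among n_tasks,
    returning (start, end) index bounds for each task."""
    base, extra = divmod(n_items, n_tasks)
    bounds, start = [], 0
    for t in range(n_tasks):
        size = base + (1 if t < extra else 0)  # first `extra` tasks get one more
        bounds.append((start, start + size))
        start += size
    return bounds

print(partition(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

No task's share differs from another's by more than one iteration, which is the best static balance possible when iterations cost roughly the same.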
  • 93. 한국해양과학기술진흥원  The number of tasks into which a problem is decomposed determines granularity  Fine-grain Parallelism – Decomposition into a large number of tasks • Individual tasks are relatively small in terms of code size and execution time • Data is transferred among processors frequently • Facilitates load balancing • Implies high communication overhead  Coarse-grain Parallelism – Decomposition into a small number of tasks • Individual tasks are relatively large in terms of code size and execution time • Implies more opportunity for performance increase • Harder to load balance efficiently Granularity
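The trade-off can be made concrete by counting tasks for a given grain size; `decompose` is a hypothetical helper and the grain sizes are arbitrary (each task typically implies at least one communication or scheduling event):

```python
def decompose(n_iterations, grain):
    """Break the work into tasks of `grain` iterations each (ceiling division);
    returns the resulting task count."""
    return -(-n_iterations // grain)

n = 1_000_000
fine = decompose(n, 10)         # many tiny tasks: easy to balance,
                                # but ~100,000 communication/scheduling events
coarse = decompose(n, 250_000)  # few large tasks: low overhead,
                                # but an idle processor cannot steal work
print(fine, coarse)  # 100000 4
```

In practice the grain size is tuned so per-task work dominates per-task overhead while still leaving enough tasks to keep all processors busy.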
  • 94. 한국해양과학기술진흥원  Array Processing  Example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent from other array elements  Serial program calculates one element at a time in sequential order Parallel Examples : Array Processing
  • 95. 한국해양과학기술진흥원  Array Processing - Parallel Solution 1  Independent calculation of array elements ensures there is no need for communication between tasks  Each task executes the portion of the loop corresponding to the data it owns Parallel Examples : Array Processing
  • 96. 한국해양과학기술진흥원  One Possible Solution  Implement as a Single Program Multiple Data model • Master process initializes array, sends info to worker processes and receives results • Worker process receives info, performs its share of computation and sends results to master • Static load balancing • Each task has a fixed amount of work to do • May be significant idle time for faster or more lightly loaded processors - the slowest task determines overall performance Parallel Examples : Array Processing
  • 97. 한국해양과학기술진흥원 Pool of Tasks Scheme  Two processes are employed • Master Process • Holds pool of tasks for worker processes to do • Sends worker a task when requested • Collects results from workers • Worker Process • Gets task from master process • Performs computation • Sends results to master • Worker processes do not know before runtime which portion of array they will handle or how many tasks they will perform • Dynamic load balancing occurs at run time • The faster tasks will get more work to do Parallel Examples : Array Processing
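A minimal sketch of the pool-of-tasks scheme, using a shared queue and Python threads in place of master and worker processes; the squaring step stands in for the real per-chunk computation, and the chunk and worker counts are arbitrary:

```python
import queue
import threading

tasks = queue.Queue()
for chunk in range(8):           # "master" fills the pool with work items
    tasks.put(chunk)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            chunk = tasks.get_nowait()  # ask the pool for the next task
        except queue.Empty:
            return                      # no work left: worker finishes
        value = chunk * chunk           # stand-in for the real computation
        with results_lock:
            results.append(value)       # "send" the result back to the master

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers: w.start()
for w in workers: w.join()
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because each worker pulls a new chunk only when it finishes the previous one, faster workers automatically process more chunks — the dynamic load balancing the slide describes.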
  • 98. 한국해양과학기술진흥원 Multicore computing  A multicore processor is a processor that includes multiple execution units ("cores") on the same chip Symmetric multiprocessing  A computer system with multiple identical processors that share memory and connect via a bus Distributed System  A software system in which components located on networked computers communicate and coordinate their actions by passing messages  An important goal and challenge of distributed systems is location transparency Classes of Parallel Computers
  • 99. 한국해양과학기술진흥원 Cluster computing  A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer Massively parallel processing  A massively parallel processor (MPP) is a single computer with many networked processors  MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks Grid computing  Compared to clusters, grids tend to be more loosely coupled, heterogeneous, and geographically dispersed Classes of Parallel Computers
  • 100. 한국해양과학기술진흥원 Exascale computing  Computer system capable of at least one exaFLOPS • 10^18 floating point operations per second Volunteer Computing  Computer owners donate their computing resources to one or more projects Peer-to-peer Network  A type of decentralized and distributed network architecture in which individual nodes in the network act as both suppliers and consumers of resources Classes of Parallel Computers
  • 102. 한국해양과학기술진흥원 A collection of independent computers that appear to the users of the system as a single computer A collection of autonomous computers, connected through a network and distribution middleware which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility Distributed Systems
  • 105. 한국해양과학기술진흥원 Example - Mobile devices in a Distributed System
  • 107. 한국해양과학기술진흥원 Transparency Scalability Fault tolerance Concurrency Openness These challenges can also be seen as the goals or desired properties of a distributed system Basic problems and challenges
  • 108. 한국해양과학기술진흥원  Concealment from the user and the application programmer of the separation of the components of a distributed system  Access Transparency - Local and remote resources are accessed in the same way  Location Transparency - Users are unaware of the location of resources  Migration Transparency - Resources can migrate without a name change  Replication Transparency - Users are unaware of the existence of multiple copies of resources  Failure Transparency - Users are unaware of the failure of individual components  Concurrency Transparency - Users are unaware of sharing resources with others Transparency
  • 109. 한국해양과학기술진흥원  Addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity  Adding users and resources causes a system to grow:  Size - growth with regards to the number of users or resources • System may become overloaded • May increase administration cost  Geography - growth with regards to geography or the distance between nodes • Greater communication delays  Administration – increase in administrative cost Scalability
  • 110. 한국해양과학기술진흥원 Whether the system can be extended in various ways without disrupting existing system and services  Hardware extensions • adding peripherals, memory, communication interfaces  Software extensions • Operating System features • Communication protocols Openness is supported by:  Public interfaces  Standardized communication protocols Openness
  • 111. 한국해양과학기술진흥원 In a single system several processes are interleaved In distributed systems - there are many systems with one or more processors Many users simultaneously invoke commands or applications, access and update shared data  Mutual exclusion  Synchronization  No global clock Concurrency
  • 112. 한국해양과학기술진흥원 Hardware, software and networks fail Distributed systems must maintain availability even at low levels of hardware, software, and network reliability Fault tolerance is achieved by  Recovery  Redundancy Issues  Detecting failures  Masking failures  Recovery from failures  Redundancy Fault tolerance
  • 114. 한국해양과학기술진흥원 Performance issues  Responsiveness  Throughput  Load sharing, load balancing  Quality of service  Correctness  Reliability, availability, fault tolerance  Security  Performance  Adaptability Design Requirements