2. Overview
Traditionally, computer software has been written for serial
computation. To solve a problem, an algorithm is constructed
and implemented as a serial stream of instructions. These
instructions are executed on a central processing unit on one
computer. Only one instruction may execute at a time—after
that instruction is finished, the next is executed.
Nowadays a single machine (PC) can have a multi-core
and/or multi-processor architecture.
In a multiprocessor architecture, two or more identical
processors connect to a single shared main memory. Most
common multiprocessor systems today use an SMP
(symmetric multiprocessing) architecture. In the case of
multi-core processors, the SMP architecture applies to the
cores, treating them as separate processors.
3. Speedup
The amount of performance gained by
the use of a multi-core processor is
strongly dependent on the software
algorithms and implementation. In
particular, the possible gains are limited
by the fraction of the software that can
be "parallelized" to run on multiple cores
simultaneously; this effect is described
by Amdahl's law. In the best case, so-
called embarrassingly parallel problems
may realize speedup factors near the
number of cores. Many typical
applications, however, do not realize
such large speedup factors and thus,
the parallelization of software is a
significant on-going topic of research.
4. Intel Atom
Nokia Booklet 3G - Intel® Atom™ Z530, 1.6 GHz
Intel Atom is the brand name for a line of ultra-low-voltage x86
and x86-64 CPUs (or microprocessors) from Intel, designed in 45
nm CMOS and used mainly in Netbooks, Nettops and MIDs.
Intel Atom can execute up to two instructions per cycle. The
performance of a single-core Atom is around half that of an
equivalent Celeron.
Hyper-threading (officially termed Hyper-Threading Technology
or HTT) is an Intel-proprietary technology used to improve
parallelization of computations (doing multiple tasks at once)
performed on PC microprocessors.
A processor with hyper-threading enabled is treated by the
operating system as two processors instead of one. This means
that only one processor is physically present but the operating
system sees two virtual processors, and shares the workload
between them.
The advantages of hyper-threading are listed as: improved
support for multi-threaded code, allowing multiple threads to run
simultaneously, and improved reaction and response times.
5. Instruction level parallelism
Instruction-level parallelism (ILP) is a measure of how
many of the operations in a computer program can be
performed simultaneously. Consider the following
program:
1. e = a + b
2. f = c + d
3. g = e * f
Operation 3 depends on the results of operations 1 and
2, so it cannot be calculated until both of them are
completed. However, operations 1 and 2 do not depend
on any other operation, so they can be calculated
simultaneously. (See also: Data dependency) If we
assume that each operation can be completed in one unit
of time then these three instructions can be completed in
a total of two units of time, giving an ILP of 3/2.
6. Qt 4's Multithreading
Qt provides thread support in the form of platform-independent threading
classes, a thread-safe way of posting events, and signal-slot connections
across threads. This makes it easy to develop portable multithreaded Qt
applications and take advantage of multiprocessor machines.
QThread provides the means to start a new thread.
QThreadStorage provides per-thread data storage.
QThreadPool manages a pool of threads that run QRunnable objects.
QRunnable is an abstract class representing a runnable object.
QMutex provides a mutual exclusion lock, or mutex.
QMutexLocker is a convenience class that automatically locks and unlocks a
QMutex.
QReadWriteLock provides a lock that allows simultaneous read access.
QReadLocker and QWriteLocker are convenience classes that automatically lock
and unlock a QReadWriteLock.
QSemaphore provides an integer semaphore (a generalization of a mutex).
QWaitCondition provides a way for threads to go to sleep until woken up by
another thread.
QAtomicInt provides atomic operations on integers.
QAtomicPointer provides atomic operations on pointers.
7. OpenMP
The OpenMP Application Program Interface (API) supports multi-platform
shared-memory parallel programming in C/C++ and Fortran on all
architectures, including Unix platforms and Windows NT platforms.
OpenMP is a portable, scalable model that gives shared-memory parallel
programmers a simple and flexible interface for developing parallel
applications for platforms ranging from the desktop to the supercomputer.
The designers of OpenMP wanted to provide an easy method to thread
applications without requiring that the programmer know how to create,
synchronize, and destroy threads or even requiring him or her to determine
how many threads to create. To achieve these ends, the OpenMP designers
developed a platform-independent set of compiler pragmas, directives,
function calls, and environment variables that explicitly instruct the compiler
how and where to insert threads into the application.
Most loops can be threaded by inserting only one pragma right before the
loop. Further, by leaving the nitty-gritty details to the compiler and OpenMP,
you can spend more time determining which loops should be threaded and
how to best restructure the algorithms for maximum performance.
8. OpenMP Example
#include <omp.h>
#include <stdio.h>
int main() {
#pragma omp parallel
  printf("Hello from thread %d, nthreads %d\n",
         omp_get_thread_num(), omp_get_num_threads());
}
//-------------------------------------------
#pragma omp parallel shared(n,a,b)
{
#pragma omp for
  for (int i=0; i<n; i++)
  {
    a[i] = i + 1;
#pragma omp parallel for
    /*-- Okay - This is a parallel region --*/
    for (int j=0; j<n; j++)
      b[i][j] = a[i];
  }
} /*-- End of parallel region --*/
//-------------------------------------------
#pragma omp parallel for
for (i=0; i < numPixels; i++)
{
  pGrayScaleBitmap[i] = (unsigned BYTE)
    (pRGBBitmap[i].red * 0.299 +
     pRGBBitmap[i].green * 0.587 +
     pRGBBitmap[i].blue * 0.114);
}
OpenMP places the following five restrictions on which loops can be threaded:
• The loop variable must be of type signed integer. Unsigned integers, such as DWORDs, will not work.
• The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_integer.
• The third expression, or increment portion, of the for loop must be either integer addition or integer subtraction by a loop-invariant value.
• If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
• The loop must be a basic block, meaning no jumps from the inside of the loop to the outside are permitted, with the exception of the exit statement, which terminates the whole application. If goto or break is used, it must jump within the loop, not outside it. The same goes for exception handling; exceptions must be caught within the loop.
10. Intel Threading Building Blocks (TBB)
Intel® Threading Building Blocks (Intel® TBB) is an award-winning C++ template
library that abstracts threads to tasks to create reliable, portable, and scalable
parallel applications. Just as the C++ Standard Template Library (STL) extends the
core language, Intel TBB offers C++ users a higher level abstraction for parallelism.
To implement Intel TBB, developers use familiar C++ templates and coding style,
leaving low-level threading details to the library. It is also portable between
architectures and operating systems.
Intel® TBB for Windows (Linux, Mac OS) costs $299 per seat.
#include <iostream>
#include <string>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;
using namespace std;
int main() {
//...
parallel_for(blocked_range<size_t>(0, to_scan.size() ),
SubStringFinder( to_scan, max, pos ));
//...
return 0;
}
11. Parallel Pattern Library (PPL)
The Concurrency Runtime is a concurrent programming framework for C++.
The Concurrency Runtime simplifies parallel programming and helps you
write robust, scalable, and responsive parallel applications.
The features that the Concurrency Runtime provides are unified by a
common work scheduler. This work scheduler implements a work-stealing
algorithm that enables your application to scale as the number of available
processors increases.
The Concurrency Runtime enables the following programming patterns and
concepts:
Imperative data parallelism: Parallel algorithms distribute computations on
collections or on sets of data across multiple processors.
Task parallelism: Task objects distribute multiple independent operations across
processors.
Declarative data parallelism: Asynchronous agents and message passing enable
you to declare what computation has to be performed, but not how it is performed.
Asynchrony: Asynchronous agents make productive use of latency by doing work
while waiting for data.
The Concurrency Runtime is provided as part of the C Runtime Library
(CRT).
Only Visual Studio 2010 supports the PPL.
12. Concurrency Runtime Architecture
The Concurrency Runtime is divided into four components: the
Parallel Patterns Library (PPL), the Asynchronous Agents Library,
the work scheduler, and the resource manager. These components
reside between the operating system and applications. The
following illustration shows how the Concurrency Runtime
components interact among the operating system and applications:
struct LongRunningOperationMsg {
    LongRunningOperationMsg(int x, int y)
        : m_x(x), m_y(y) {}
    int m_x;
    int m_y;
};

call<LongRunningOperationMsg>* LongRunningOperationCall =
    new call<LongRunningOperationMsg>(
        [](LongRunningOperationMsg msg)
        {
            LongRunningOperation(msg.m_x, msg.m_y);
        });

void SomeFunction(int x, int y) {
    asend(LongRunningOperationCall,
          LongRunningOperationMsg(x, y));
}
13. References
Parallel computing
Superscalar
Simultaneous multithreading
Hyper-threading
Thread Support in Qt
OpenMP
Intel: Getting Started with OpenMP
Intel® Threading Building Blocks (Intel® TBB)
Intel® Threading Building Blocks 2.2 for Open Source
Concurrency Runtime Library
Four Ways to Use the Concurrency Runtime in Your C++
Projects
Parallel Programming in Native Code blog