Kirill Mavrodiev, Intel – An Overview of Modern Capabilities for Parallelizing and Vectorizing Applications with Intel® Parallel Composer

Software & Services Group, Developer Products Division
Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Parallel Studio 2011
Essential Performance
Design | Build & Debug | Verify | Tune
Kirill Mavrodiev
kirill.mavrodiev@intel.com
EMEA Compiler TCE (Software Engineer)
SSG DPD ICL

AGENDA
•Intel® Parallel Studio 2011
•Intel® Parallel Composer
–Intel® Cilk Plus keywords
–Array Notation
–Guided Auto-Parallelization (GAP)
–Intel® Parallel Debugger Extension
Intel® Parallel Studio 2011 Introduction

Three Product Lines for Diverse Needs
• Essential Performance: C/C++ developers using Microsoft Visual Studio* who want to take advantage of multicore
• Advanced Performance: C++ and Fortran developers on Windows* and Linux* building high performance, cross platform apps
• Distributed Performance: C++ and Fortran developers on Windows* and Linux* targeting high performance MPI clusters
intel.com/software/products

What’s New in
Intel® Parallel Building Blocks – A comprehensive, portable, reliable,
future-proof set of parallel models for both data and task parallelism
• Intel® Threading Building Blocks
• Intel® Cilk Plus
• Intel® Array Building Blocks (Beta)
Intel® Parallel Advisor – Parallelism design guide
• Demystifies and speeds parallel application design
• Gives parallelism design insight and analysis
through Explorer and Modeler analysis tools
Now includes Intel® Premier Support
• Unlimited technical support and upgrades for one
year
Enhancements
• Intel® Threading Building Blocks 3.0
• Compiler improvements
• Microsoft Visual Studio* 2010 integration

• All-in-one toolset for the software development lifecycle
• Microsoft Visual Studio* plug-in
– 2005, 2008 and 2010

Intel® Parallel Advisor
Step by Step Guidance
• Focus on the hot call trees and loops as locations to experiment with parallelism.
• Insert Advisor annotations into the source code to describe each parallel experiment.
• Evaluate each parallel experiment by viewing the performance projection for each parallel site and how each site's performance impacts the entire program.
• Identify data issues (races) in each parallel experiment.

Intel® Parallel Composer
BUILD & DEBUG PHASE
Develop high performance applications with an optimized
C/C++ compiler and comprehensive threaded libraries
Intel® Integrated Performance Primitives
Extensive library of highly optimized software functions for
digital media and data-processing applications
Improve Performance
Easier, faster performance for Windows* apps
Intel® Parallel Building Blocks
Comprehensive set of parallel development models that
support multiple approaches to parallelism.

Intel’s Family of Parallel Programming Models
[Diagram: parallel programming models grouped under four headings: Intel® Parallel Building Blocks (PBB), Fixed Function Libraries, Established Standards, and Research and Exploration. Models shown: Intel® Cilk Plus, Intel® Threading Building Blocks (TBB), Intel® Array Building Blocks (ArBB), Intel® Math Kernel Library (MKL), Intel® Integrated Performance Primitives (IPP), MPI, OpenMP*, OpenCL*, Intel® Concurrent Collections.]
Intel® Cilk Plus, Intel® TBB: part of Intel® Parallel Studio 2011
Intel® Array Building Blocks: known by the code names "Intel Ct" or "Intel Firetown"; public beta started around mid-September 2010

Intel® Parallel Building Blocks
Details

Intel® Parallel Inspector
VERIFY PHASE
Identify memory issues in serial and parallel
applications in addition to threading errors.
Find Memory Errors
Find a wide variety of memory errors
Find Threading Errors
Find data races and deadlocks
Improve Reliability
Ensure application reliability with proactive memory
and threading error checking

Intel® Parallel Amplifier
TUNE PHASE
Hotspot Analysis
Where does my program spend most of the time?
Concurrency Analysis
Where and why doesn't my program utilize all available cores?
Locks & Wait Analysis
Where and why does my program wait?
Optimize serial and parallel application performance
with 3 easy to use, powerful analysis methods

Summary
• For over a year, Intel Parallel Studio has made it easier for
Windows* developers to create fast, reliable applications for
multicore
• This release is a major update
– Intel® Parallel Building Blocks adds significant new parallelism models
– Intel® Parallel Advisor empowers software architects with parallelism
design insight and analysis for building reliable, high performance
applications for multicore
– Other enhancements including support for Visual Studio* 2010
www.intel.com/go/parallel:
Try it Right Now!

Intel® Cilk™ Plus
Language extensions to simplify task & data parallelism
• C++ language extension that provides three simple
keywords to write parallel code
– Loop-type data parallelism using cilk_for
– General parallelism using cilk_spawn and cilk_sync
• Unambiguous semantics, strict fork–join model via
compiler support
• Parallel control flow that is easy for the programmer to understand
• Automatic load balancing via work stealing
• Low-overhead task spawning encourages programmers to create many small tasks
• A program with many small tasks gives the task scheduler the opportunity both to load balance and to scale forward to larger core counts

Intel® Cilk™ Plus
Cilk adds the following new keywords (cilk/cilk.h provides the lowercase aliases cilk_spawn, cilk_sync and cilk_for):
• _Cilk_spawn spawns a function call that executes asynchronously,
• _Cilk_sync is a synchronization point that waits for the children spawned inside that function,
• _Cilk_for is a parallel for-loop that executes its iterations in parallel.
Cilk includes reducers for lock-free access to global data:
• Use built-in reducers for common types: strings, summation, min/max, logical operations, and more.
• Write custom reducers to manage any data type.
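The reducer idea can be sketched in standard C++ without the Cilk Plus runtime: each worker accumulates into a private view, and the views are merged only after all workers finish, so no lock is taken inside the loop. The names `parallel_sum` and `num_workers` are hypothetical, and `std::thread` is swapped in for the Cilk scheduler. This illustrates the principle only; it is not how the Cilk Plus reducers are implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_workers` threads.
// Each thread writes only to its own slot in `partial` (its private "view"),
// so no locking is needed; the views are merged after all threads have joined.
long long parallel_sum(const std::vector<int>& data, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<long long> partial(num_workers, 0);
    std::vector<std::thread> workers;
    // Ceiling division so that every element belongs to exactly one chunk.
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&data, &partial, w, chunk] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[w] = local;  // publish this worker's view
        });
    }
    for (std::thread& t : workers) t.join();

    // Merge step: combine the private views into the final result.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(values, 4)` on a vector holding 1 through 10 returns 55 regardless of how the chunks are scheduled, because addition is associative; a custom reducer needs the same property from its merge operation.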

Simple Divide and Conquer Example
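#include <stdio.h>
#include "cilk/cilk.h"

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    x = cilk_spawn fib(n-1);   // may run in parallel with the call below
    y = fib(n-2);
    cilk_sync;                 // wait for the spawned call before reading x
    return x + y;
}

int main() {
    printf("Fib of 40 is %d\n", fib(40));
    return 0;
}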
cilk_spawn: allow fib(n-1) to run in parallel with fib(n-2)
cilk_sync: ensure that all parallel work is complete before using the result
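For readers without a Cilk Plus compiler, the same fork-join shape can be approximated in standard C++. This is a sketch under assumptions: `std::async` stands in for `cilk_spawn`, `future::get()` stands in for `cilk_sync`, and the names `fib_serial`, `fib_async` and the `depth` cutoff are hypothetical. The cutoff is needed because each `std::async` call here may start an operating-system thread, which is far more expensive than a Cilk spawn.

```cpp
#include <cassert>
#include <future>

// Serial fallback used once the fork depth is exhausted.
long long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join Fibonacci. `depth` bounds how many recursion levels fork,
// so at most 2^depth - 1 extra threads are created.
long long fib_async(int n, int depth) {
    assert(n >= 0);
    if (n < 2 || depth <= 0) return fib_serial(n);

    // Fork: run fib(n-1) asynchronously (the role cilk_spawn plays).
    std::future<long long> x =
        std::async(std::launch::async, fib_async, n - 1, depth - 1);

    // Continue with fib(n-2) on the current thread.
    const long long y = fib_async(n - 2, depth - 1);

    // Join: wait for the forked call (the role cilk_sync plays).
    return x.get() + y;
}
```

A call such as `fib_async(20, 3)` forks on the top three levels only. With Cilk Plus no manual cutoff is written in the slide's example, since spawning is cheap and work stealing balances the load.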

Intel® Cilk Plus Tachyon Implementation

Array Notations for data parallelism
• New array extension to C/C++ language
• Specify parallel operations on arrays (instead of sequential
loops)
• Predictable performance based on mapping parallel constructs
to underlying multi-threading/SIMD hardware
• Works seamlessly with existing C/C++ frameworks and
runtimes: Intel® TBB, OpenMP*, MPI, Intel® Cilk™ Plus,
Pthreads, etc.

Array Section Notation
• Array Section Notation
<array base> [<lower_bound> : <length> : <stride>]*
– ':<stride>' is optional (defaults to stride=1)
– Omitting ':<length>:<stride>' implies length=1
– A bare ':' selects all elements of that dimension
– Note the syntax difference from a Fortran section, which is
lower_bound : upper_bound [: stride]
• Samples:
A[:]          // All elements of vector A
B[2:6]        // Elements 2 to 7 of vector B
C[:][5]       // Column 5 of matrix C
D[0:3:2]      // Elements 0,2,4 of vector D
E[0:3][0:4]   // 12 elements from E[0][0] to E[2][3]
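Because the second field is a length rather than an upper bound, it can help to see which indices a section touches. The sketch below spells out the one-dimensional case as an ordinary loop in standard C++; the helper name `section` is hypothetical and is not part of Cilk Plus, which performs this selection in the compiler and may map it to SIMD instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring a[lower : length : stride]:
// returns `length` elements starting at index `lower`, stepping by `stride`.
std::vector<int> section(const std::vector<int>& a,
                         std::size_t lower,
                         std::size_t length,
                         std::size_t stride = 1) {
    assert(stride > 0);
    std::vector<int> out;
    out.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        const std::size_t idx = lower + i * stride;
        assert(idx < a.size());  // the section must stay inside the array
        out.push_back(a[idx]);
    }
    return out;
}
```

With a vector holding 0 through 9, `section(v, 2, 6)` yields elements 2 to 7 (the `B[2:6]` sample) and `section(v, 0, 3, 2)` yields elements 0, 2, 4 (the `D[0:3:2]` sample).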

Intel® Cilk Plus implementation
cilk_for (int k = 0; k < nz; k++) {
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i += STRIDE) {
            // tmp is assumed to be an array of STRIDE elements local to each iteration
            tmp[:] = rhs[ID(i,j,k) : STRIDE] + x[ID(i,j,k) : STRIDE] * 6.0 -
                     (x[ID(i-1,j,k) : STRIDE] + x[ID(i+1,j,k) : STRIDE] +
                      x[ID(i,j-1,k) : STRIDE] + x[ID(i,j+1,k) : STRIDE] +
                      x[ID(i,j,k-1) : STRIDE] + x[ID(i,j,k+1) : STRIDE]);
            // Reducers combine the per-strand maximum and sum without locks
            residueConvergeStrongCReducer = cilk::max_of(
                residueConvergeStrongCReducer,
                __sec_reduce_max(fabs(tmp[:])));
            residueConvergeStrongL2Reducer +=
                __sec_reduce_add(tmp[:] * tmp[:]);
        }
}

Intel® Cilk Plus implementation

GAP: Guided Auto-parallelization
• Targeted for Mainstream and HPC Users
• Advice to change code for more auto-vectorization, auto-parallelization and
data transformations
• Diagnostic guidance generated when invoked
• Advice may involve
– suggestions for source-change
– adding pragmas
– adding new options
• Simple source changes that assert new properties
– Add a new pragma for loop if semantics are satisfied
– Use a local-variable for the upper-bound of a loop
– Initialize scalar variable unconditionally at top of loop
– Reorder fields of a structure (or split into two)
• Desired behavior
– Each advice is specific using source-level variable names
– User does semantic analysis – apply or reject each advice
– Advice should be as localized as possible
– Following the advice should result in better optimizations
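Two of the source changes listed above can be sketched in standard C++. The function `clamp_and_scale` and its parameters are hypothetical and only illustrate the shape of the advice: copying the upper bound into a local variable makes it visibly loop-invariant, and initializing the scalar at the top of each iteration removes a possible dependence between iterations. Whether such a change preserves the program's meaning is for the user to verify, as the slide says.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: writes 2 * max(in[i], 0) to out[i] for i < *count.
void clamp_and_scale(const std::vector<float>& in,
                     std::vector<float>& out,
                     const int* count) {
    assert(count != nullptr && *count >= 0);
    assert(in.size() >= static_cast<std::size_t>(*count));
    assert(out.size() >= static_cast<std::size_t>(*count));

    // Advice: use a local variable for the upper bound of the loop.
    // The compiler no longer has to assume that a store through `out`
    // could change the value read through `count`.
    const int n = *count;

    for (int i = 0; i < n; ++i) {
        // Advice: initialize the scalar unconditionally at the top of the loop,
        // so no value of `t` is carried over from the previous iteration.
        float t = 0.0f;
        if (in[i] > 0.0f) t = in[i];
        out[i] = 2.0f * t;
    }
}
```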

GAP – How it Works
Selection of most Relevant Switches
Multiple compiler switches to activate and fine-tune
guidance analysis
• Activate messages individually for vectorization,
parallelization, data transformations or all three
-guide-vec[=level]
-guide-par[=level]
-guide-data-trans[=level]
-guide[=level]
Optional argument level=1,2,3,4 controls the extent of
analysis; Intel Composer only supports up to level 3
• Control the source code part for which analysis is done
-guide-opts=<arg>
Samples:
-guide-opts="convert.c,'funca(int)'"
-guide-opts="bar.f90,'module_1::func_solve'"
• Control where the messages go
-guide-file=<file_name> -guide-file-append=<file_name>

GAP Case Study
#include <math.h>

extern int num_nodes;
typedef struct TEST_STRUCT {
    // Coordinates of city1
    float latitude1;
    float longitude1;
    int city_id1;
    int stops[10000]; // Currently unused field
    // Coordinates of city2
    float latitude2;
    float longitude2;
    int city_id2;
} test_struct;
extern float *distances;
extern test_struct** nodes;

void process_nodes(void)
{
    float const R = 3964.0;
    float temp, lat1, lat2, long1, long2, result;
    int temp1 = num_nodes;
    // Changes suggested by GAP, left commented out:
    //#pragma loop count min(16)
    //#pragma parallel
    // for (int k=0; k < temp1; k++) {
    for (int k=0; k < num_nodes; k++) {
        lat1 = nodes[k]->latitude1;
        lat2 = nodes[k]->latitude2;
        long1 = nodes[k]->longitude1;
        long2 = nodes[k]->longitude2;
        // Compute the distance between the two cities
        temp = sin(lat1) * sin(lat2) +
               cos(lat1) * cos(lat2) * cos(long1-long2);
        result = 2.0 * R * atan(sqrt((1.0-temp)/(1.0+temp)));
        // Store the distance computed in the distances array
        distances[k] = result;
    }
}
[c:/test2/usability2] icl -c distance.cpp -Qguide=4 -Qparallel
GAP REPORT LOG OPENED ON Wed Mar 03 18:34:01 2010
c:\test2\usability2\distance.h(2): remark #30755: (DTRANS) Reordering the fields of
  the structure 'TEST_STRUCT' will improve data locality. Suggested field order:
  'stops, latitude1, longitude1, latitude2, longitude2, city_id1, city_id2'. [VERIFY]
  The suggestion is based on the field references in current compilation. Please make
  sure that the restructured code satisfies the original program semantics.
c:\test2\gap_examples\usability2\distance.cpp(30): remark #30534: (LOOP) Add
  -Qansi-alias option for better type-based disambiguation analysis by the compiler
  if appropriate (option will apply for entire compilation). This will improve
  optimizations for the loop at line 30 [VERIFY] Make sure that the semantics of this
  option is obeyed for entire compilation.
c:\test2\usability2\distance.cpp(29): remark #30519: (PAR) Use "#pragma parallel" to
  parallelize the loop at line 29, if these arrays in the loop do not have
  cross-iteration dependencies: nodes, distances. [VERIFY] A cross-iteration
  dependency exists if a memory location is modified in an iteration of the loop and
  accessed (a read or a write) in another iteration of the loop. Make sure that there
  are no such dependencies.
c:\test2\gap_examples\usability2\distance.cpp(29): remark #30525: (PAR) If the trip
  count of the loop at line 29 is greater than 16, then use "#pragma loop count
  min(16)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum
  of 16 iterations.
c:\test2\gap_examples\usability2\distance.cpp(48): remark #30525: (PAR) If the trip
  count of the loop at line 48 is greater than 751, then use "#pragma loop count
  min(751)" to parallelize this loop. [VERIFY] Make sure that the loop has a minimum
  of 751 iterations.
END OF GAP REPORT LOG

Intel® Debugger &
Intel® Parallel Debugger Extension
Platforms: Linux*, Mac OS*, Windows*
Debuggers: Intel® Debugger (IDB), Intel® Parallel Debugger Extension
Products: Intel® C++ Composer XE, Intel® Fortran Composer XE, Intel® Cluster Toolkit Compiler Edition, Intel® Visual Fortran Composer XE

Key Features
• Thread Shared Data Event Detection: break on thread shared data access (read/write)
• Re-entrant Function Detection
• SIMD SSE Registers Window
• Enhanced OpenMP* Support
– Serialize OpenMP threaded application execution on the fly
– Insight into thread groups, barriers, locks, wait lists, etc.

Questions?

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR
IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS
IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR
A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or
components and reflect the approximate performance of Intel products as measured
by those tests. Any difference in system hardware or software design or
configuration may affect actual performance. Buyers should consult other sources of
information to evaluate the performance of systems or components they are
considering purchasing. For more information on performance tests and on the
performance of Intel products, reference www.intel.com/software/products.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other
countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010. Intel Corporation.
http://www.intel.com/software/products