Intel Technologies for High Performance Computing

Leo Borges
Intel Software Conference 2014 Brazil
May 2014

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Legal Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its
customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks
are accurate and reflect performance of systems available for purchase.
Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC
manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors
support HT Technology, visit http://www.intel.com/info/hyperthreading.
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology
performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system
delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different
processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life
sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject
to change without notice
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s
current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice
Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo , Xeon Phi and Xeon Phi logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only
and are subject to change without notice.
*Other names and brands may be claimed as the property of others.

Building Blocks
Many Product Families – Today’s talk: HPC Focus
Product families and building blocks targeting an array of segments
[Figure: Intel® Xeon® v3 (Haswell) product families and building blocks]
Product families: E3-1200 v3; E5-1600 v3, E5-2600 v3 (E5-2400 v3 for Comms & Storage only), E5-4600 v3; E7-2800 v3, E7-4800 v3, E7-8800 v3
Building blocks: co-processors, boards/PDKs, software, SSDs, LAN, RAID, storage, networking
Segments: channel, enterprise, HPC, mission critical, big data, public cloud, cloud storage
Note: For discussion purposes only (not intended to be interpreted as portfolio recommendations or guidance)

Objectives of this session
Recall of a few basics for HPC
What to expect from your code
What to expect from the hardware
Review Vectorization
Xeon + Xeon Phi Example

Review of a few HPC basics
for non-ninja programmers

How it works and where are the bottlenecks
[Figure: two CPUs, each with L1, L2 and L3 caches and its own memory, connected to I/O and to the interconnect]
Core count, size & perf?
Cache size, BW & latency?
Memory size, BW & latency?
Intra-/inter-socket communications?
Inter-node communication?

Unfortunately, you need to be aware
[Figure: CPU with L1, L2 and L3 caches and memory; bandwidth, latency and capacity change along the hierarchy]
From the core to the I/O subsystem:
L1, L2, L3, L4, L5, …, Ln
caches, eDRAM, MCDRAM, DDR, NVM, SSD, PCIe SSD, HDD, tapes

FLOPS and memory bandwidth impact the efficiency & scalability
Performing flops is not the issue
Data movement is the issue (BW, latency, power)
Efficiency (= achieved flops / peak flops) won’t be high enough if stores/loads are not fast enough (GB/s)
First approximation: only a matter of frequency and bandwidth
for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i] * d[i];
Per iteration: 3 loads (a[i], b[i], d[i]), 1 store (c[i]), 1 mul, 1 add
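The "frequency and bandwidth" first approximation can be sketched as a roofline-style bound. This is a minimal illustration, not something taken from the slides: the function names are hypothetical, and it assumes every operand is streamed from memory once (no cache reuse, write-allocate traffic ignored).

```c
/* Flops performed per byte moved between the core and memory. */
double arithmetic_intensity(double flops_per_iter, double bytes_per_iter)
{
    return flops_per_iter / bytes_per_iter;
}

/* Upper bound in Gflop/s: the smaller of the compute peak and what the
 * memory system can feed at this arithmetic intensity. */
double attainable_gflops(double peak_gflops, double bw_gbytes_per_s,
                         double intensity)
{
    double memory_bound = bw_gbytes_per_s * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For the loop above in double precision, each iteration performs 2 flops and moves 4 * 8 = 32 bytes, so `arithmetic_intensity(2.0, 32.0)` is 0.0625 flop/byte. With the demo laptop figures quoted later in this deck (28.8 Gflop/s peak, 25.6 GB/s), `attainable_gflops` returns about 1.6 Gflop/s, roughly 5.6% of peak: the loop is memory bound.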

Performance expectation: upper bounds
CPU bound (“HPL”): Flops/s demanding applications
Memory bound (“Stream”): BW demanding applications
Real world applications: in between
Analyzing this Flop/memory-access ratio will give a first guess for performance prediction
• Our performance metrics are Gflop/s and % of peak (efficiency)
• Elapsed time might not tell the whole story (how far from peak performance?)

Memory bound? Compute bound?

Glossary, “High performance computing”
Peak = number of floating-point operations per cycle * frequency, in “Flops/sec”:
$\text{Peak} = (\text{Flops}/\text{cycle}) \times (\text{cycles}/\text{sec}) = \text{Flops}/\text{sec}$
“Efficiency = % of the peak performance”
Same for bandwidth (but in GBytes/sec)
By the way: what is the peak perf of your laptop?

Anatomy of a Computer Platform

CPU: Core/Uncore - Designed For Modularity
[Figure: cores and L3 cache surrounded by the uncore (IMC, QPI, power & clock), attached to DRAM and to QPI links]
Differentiation in the “Uncore”: # cores, size of cache, # mem channels, # QPI links, type of memory, power management, integrated graphics
QPI: Intel® QuickPath Interconnect

Romley EP/EN Platforms
Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families
[Figure: two Intel® Xeon® processor E5-2400/2600 product family sockets linked by QPI, each with DDR3 channels and PCIe* 3.0 x8 links (plus one PCIe* 2.0 x4), attached over DMI2 to the Intel® C600 series chipset with 3Gb/s SAS and SATA]
Ivy Bridge CPUs
Socket R: Up to 12 cores / socket
Socket B2: Up to 10 cores / socket
QPI
Socket R: 2 QPI links
Socket B2: 1 QPI link
Memory
DDR3 & DDR3L RDIMMs & UDIMMs, LR DIMMs
Socket R: 4 channels per socket, up to 3 DPC; speeds up to DDR3 1866
Socket B2: 3 channels per socket, up to 2 DPC; speeds up to DDR3 1600
PCI Express* 3.0
Socket R: 40 lanes per socket
Socket B2: 24 lanes per socket
Extra Gen 2 x4 on 2nd CPU
Intel® C600 series chipset (Patsburg PCH)
Optimized Server & WS PCH
Integrated Storage: Up to 8 ports 3Gb/s SAS; RAID 5 optional

IvyBridge (IVB) E5-2600 v2 family
The total benefit (at node level) is given by a combination of factors
[Figure: processor diagram with cores (C), LLC cache, memory controller (MC) with four DDR3 channels, QPI links, and I/O with PCIe Gen3 x16, x16 and x8]
Feature: Xeon E5-2600 v2
Process Technology: 22 nm
Cores/Threads: Up to 12 Cores/24 Threads
Last-level Cache: Up to 30 MB
Max Memory Speed (MHz): Up to 1866
Max DIMM Capacity: 12 Slots/Processor
PCIe* Lanes / Controllers / Speed: 40 / 10 (PCIe* 3.0 at 8 GT/s)
TDP (W): 150 (Workstation only), 130, 115, 95, 80, 70, 60
wstream.exe

E5-2600 v2 Product Family
Socket compatible with SNB-EP, top to bottom on the SKU stack
All SKUs, frequencies and features can change without notice
Segments: Advanced, Standard, Basic, Segment Optimized, Low Power, Workstation Only SKU
SKU: cores, frequency, cache, TDP
E5-2697 v2: 12C, 2.7GHz, 30M, 130W
E5-2695 v2: 12C, 2.4GHz, 30M, 115W
E5-2690 v2: 10C, 3.0GHz, 25M, 130W
E5-2680 v2: 10C, 2.8GHz, 25M, 115W
E5-2670 v2: 10C, 2.5GHz, 25M, 115W
E5-2660 v2: 10C, 2.2GHz, 25M, 95W
E5-2650 v2: 8C, 2.6GHz, 20M, 95W
E5-2640 v2: 8C, 2.0GHz, 20M, 95W
E5-2630 v2: 6C, 2.6GHz, 15M, 80W
E5-2620 v2: 6C, 2.1GHz, 15M, 80W
E5-2609 v2: 4C, 2.5GHz, 10M, 80W
E5-2603 v2: 4C, 1.8GHz, 10M, 80W
E5-2667 v2: 8C, 3.3GHz, 25M, 130W
E5-2643 v2: 6C, 3.5GHz, 25M, 130W
E5-2637 v2: 4C, 3.5GHz, 15M, 130W
E5-2650L v2: 10C, 1.7GHz, 25M, 70W
E5-2630L v2: 6C, 2.4GHz, 15M, 60W
E5-2687W v2: 8C, 3.4GHz, 20M, 150W
Feature sets shown per segment group:
• 8.0 GT/s QPI, DDR3-1866, Intel® HT, Intel® Turbo Boost
• 8.0 GT/s QPI, DDR3-1866 (skt R) / DDR3-1600 (skt B2), Intel® HT, Intel® Turbo Boost
• 7.2 GT/s QPI, DDR3-1600, Intel® HT, Intel® Turbo Boost
• 10C: 8.0 GT/s QPI; 6C: 7.2 GT/s QPI; DDR3-1600, Intel® HT, Intel® Turbo Boost
• 6.4 GT/s QPI, DDR3-1333, no Intel® HT, no Intel® Turbo Boost

Parallel Programming for Intel® Architecture
(or, in general, for normal CPUs)
4 considerations when writing an efficient, unconstrained parallel program:
Cores: OpenMP, TBB, Cilk Plus; threads, locks
Vectors: vector loops, vector functions, array notations; intrinsics
Memory, caches: blocking algorithms
Data layout and alignment: AoS/SoA library, directives for alignment; manual layout, ugly code
Performance Analysis
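The AoS/SoA item in the data layout consideration can be illustrated with a small sketch. The struct and function names are hypothetical and the point count is arbitrary; it only shows why a Structure of Arrays gives the unit-stride access that vector loads prefer.

```c
#include <stddef.h>

#define N_POINTS 8

/* Array of Structures (AoS): x, y, z of one point are adjacent, so a loop
 * over x alone strides through memory by sizeof(struct point_aos). */
struct point_aos { float x, y, z; };

/* Structure of Arrays (SoA): all x values are contiguous (unit stride). */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Strided access: only one float out of every three is used. */
float sum_x_aos(const struct point_aos *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Unit-stride access: consecutive elements feed consecutive SIMD lanes. */
float sum_x_soa(const struct points_soa *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

Both functions return the same value for the same data; the difference is only in how much of each cache line fetched is actually used.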

“SIMDization”, so called Vectorization
Single Instruction Multiple Data (SIMD):
Processing vector with a single operation
Provides data level parallelism (DLP)
Vector:
Consists of more than one element
Elements are of same scalar data types (e.g. floats, integers, …)
[Figure: scalar processing computes one C = A + B per operation; vector processing computes Ci = Ai + Bi for all VL elements of the vector with one operation]

Vectorization of Code
• Transform sequential code to exploit vector processing capabilities (SIMD)
– Manually by explicit syntax
– Automatically by tools like a compiler
for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];
[Figure: scalar execution adds one pair a[i] + b[i] into c[i] per instruction; vector execution adds the eight elements a[i..i+7] and b[i..i+7] into c[i..i+7] with a single instruction]
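A minimal sketch of how the loop above might be written so that a compiler can auto-vectorize it. The function name and the use of C99 `restrict` are assumptions for illustration; `restrict` promises the compiler that the three arrays do not overlap, which removes the assumed dependence that would otherwise block vectorization. Note that `n` is an element count, whereas the slide loop runs to `i <= MAX`.

```c
#include <stddef.h>

/* Element-wise c = a + b. With restrict, each iteration is provably
 * independent, so the compiler is free to process several elements
 * per instruction (for example 8 floats with 256-bit AVX). */
void vec_add(float *restrict c, const float *restrict a,
             const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Whether the loop is actually vectorized depends on the compiler and its options; the vectorization report discussed later in this deck is the way to check.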

Reminder about the peak flops
Intel® SandyBridge/Ivy Bridge micro-architecture
Scheduler (port names as used by Intel® Architecture Code Analyzer):
Port 0: ALU, VI MUL, SSE MUL, DIV, AVX FP MUL, AVX FP Blend
Port 1: ALU, VI ADD, SSE ADD, AVX FP ADD
Port 5: ALU/JMP, VI ADD, SSE Shuf, AVX FP Shuf, AVX FP Bool, AVX FP Blend
Port 2: Load, Store Address
Port 3: Load, Store Address
Port 4: Store Data
6 instructions / cycle:
• 3 memory ops
• 3 computational operations
Nehalem/Westmere: two 128-bit SIMD per cycle
4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle
2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle
SandyBridge/Ivy Bridge: two 256-bit SIMD per cycle
4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle

In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U (ark.intel.com):
Processor Number: i5-3427U
# of Cores: 2
# of Threads: 4
Clock Speed: 1.8 GHz
Max Turbo Frequency: 2.8 GHz
Instruction Set Extensions: AVX
SandyBridge/Ivy Bridge: two 256-bit SIMD per cycle
4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle
2 (cores) * 1.8 GHz * 16 Flop/cycle = 57.6 Gflop/s (single precision)
2 (cores) * 1.8 GHz * 8 Flop/cycle = 28.8 Gflop/s (double precision)
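The peak calculation above can be written as two small helpers. This is a sketch with hypothetical function names; it assumes the base clock (no Turbo) and that every lane of every SIMD unit is busy each cycle.

```c
/* Flops per cycle per core: lanes per vector times the number of
 * floating-point operations each lane can complete per cycle
 * (2 for one MUL plus one ADD unit). */
int simd_flops_per_cycle(int vector_bits, int element_bits,
                         int ops_per_lane_per_cycle)
{
    return (vector_bits / element_bits) * ops_per_lane_per_cycle;
}

/* Peak = cores * frequency [GHz] * flops per cycle, in Gflop/s. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}
```

For the i5-3427U, `simd_flops_per_cycle(256, 64, 2)` gives 8 and `peak_gflops(2, 1.8, 8)` gives 28.8 Gflop/s in double precision; with 32-bit elements the same calls give 16 flops per cycle and 57.6 Gflop/s.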

Haswell-EP vs IvyBridge-EP
The total benefit (at node level) is given by a combination of factors
• Benefit from micro-architecture optimization (IPC)
25% IPC improvements
• Benefit from the number of cores
up to 1.16x (at constant frequency)
• Benefit from AVX2
up to 2x (FMA)
• Benefit from memory bandwidth
up to 1.14x (1866 MHz to 2133 MHz)
[Figure: Haswell-EP processor diagram with cores (C), LLC cache, memory controller (MC) with four DDR4 channels, QPI links, and I/O with PCIe Gen3 x16, x16 and x8]

Flops/s, AVX, AVX2 and AVX-512
[Figure: 2013 to 2016 roadmap by half-year (H1/H2): Ivy Bridge-EP, Haswell-EP, and two future products labeled AVX-512]

Haswell Execution Unit Overview
Intel® Microarchitecture (Haswell)
Unified Reservation Station:
Port 0: Integer ALU & Shift, FMA / FP Multiply, Divide, Vector Int Multiply, Vector Logicals, Vector Shifts, Branch
Port 1: Integer ALU & LEA, FMA / FP Mult, FP Add, Vector Int ALU, Vector Logicals
Port 2: Load & Store Address
Port 3: Load & Store Address
Port 4: Store Data
Port 5: Integer ALU & LEA, Vector Shuffle, Vector Int ALU, Vector Logicals
Port 6: Integer ALU & Shift, Branch
Port 7: Store Address
2xFMA
• Doubles peak FLOPs
• Two FP multiplies benefit legacy code
4th ALU
• Great for integer workloads
• Frees Port0 & 1 for vector
New Branch Unit
• Reduces Port0 conflicts
• 2nd EU for high branch code
New AGU for Stores
• Leaves Port 2 & 3 open for loads

Extends 128-bit integer vector instructions to 256-bit
Floating-point fused multiply-add (FMA): A*B + C
Increased FLOPS potential
Increased accuracy: only a single rounding
Enhanced vectorization with gather, shifts and powerful permutes
Intel® AVX2 uses the same 256-bit YMM registers as Intel AVX
Floating-Point Performance (Peak) per Core, in Flops/Cycle:
SSE4 (Nehalem/Westmere), MUL (*) and ADD (+): 4 DP (8 SP)
AVX (SandyBridge/Ivy Bridge), 256b AVX1, MUL (*) and ADD (+): 8 DP (16 SP), 2x over SSE4
AVX2 (Haswell), 256b AVX2, two FMA (*,+): 16 DP (32 SP), 2x over AVX
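The per-core peak figures above follow from a simple product. The symbols below are introduced here for illustration only: $N$ is the number of floating-point units that can issue per cycle, $W_v/W_e$ is the number of lanes (vector width over element width), and $F$ is the number of flops one lane operation performs (1 for a MUL or an ADD, 2 for an FMA).

```latex
\begin{align*}
\text{Flops/cycle/core} &= N \times \frac{W_v}{W_e} \times F \\
\text{SSE4, DP:} \quad & 2 \times \tfrac{128}{64} \times 1 = 4 \\
\text{AVX, DP:} \quad & 2 \times \tfrac{256}{64} \times 1 = 8 \\
\text{AVX2 with FMA, DP:} \quad & 2 \times \tfrac{256}{64} \times 2 = 16
\end{align*}
```

Halving the element width to 32 bits doubles the lane count, which gives the 8, 16 and 32 single-precision figures.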

Use math libs for best use of AVX1, AVX2 & AVX-512
[Figure: speedup (axis 0.0 to 2.0), one-core basis comparison, for application source code, assembly/intrinsics, MKL DGEMM benchmark and MKL FFT benchmark]
Use Intel® Math Kernel Library as much as possible
Use intrinsics or assembly for specific kernels
Use the compiler and Intel tools to optimize your source code

Intel® Math Kernel Library:
Optimized Mathematical Building Blocks
Linear Algebra
• BLAS
• LAPACK
• Sparse Solvers
• Iterative
• Pardiso*
• ScaLAPACK
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential, Log
• Power / Root
Vector RNGs
• Congruential
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Intel® MKL is an integral part of Intel® Parallel Studio XE

Many Ways to Vectorize
From ease of use (top) to programmer control (bottom):
Compiler: Auto-vectorization (no change of code)
Compiler: Auto-vectorization hints (#pragma simd, …)
Compiler: Intel® Cilk™ Plus Array Notation Extensions
SIMD intrinsic class (e.g.: F32vec, F64vec, …)
Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
Assembler code (e.g.: [v]addps, [v]addss, …)

Control Vectorization!
Provides details on vectorization success & failure:
Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>
n: Diagnostic Messages
0: Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.
1: Tells the vectorizer to report on vectorized loops. [default if n missing]
2: Tells the vectorizer to report on vectorized and non-vectorized loops.
3: Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
4: Tells the vectorizer to report on non-vectorized loops.
5: Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.
6*: Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.
*: First available with Intel® Parallel Studio XE

Vectorization Report II
Note:
In case inter-procedural optimization (-ipo or /Qipo) is activated and
compilation and linking are separate compiler invocations, the switch to enable
reporting needs to be added to the link step!
35: subroutine fd( y )
36: integer :: i
37: real, dimension(10), intent(inout) :: y
38: do i=2,10
39: y(i) = y(i-1) + 1
40: end do
41: end subroutine fd
novec.f90(38): (col. 3) remark: loop was not vectorized: existence
of vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW
dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized:
existence of vector dependence
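The flow dependence reported for the Fortran loop above can be reproduced in C, together with one possible way to remove it. This is a sketch with hypothetical function names; the closed-form rewrite is an illustration of "minimizing dependencies by design", not something the compiler does for you.

```c
#include <stddef.h>

/* Same recurrence as the Fortran loop: iteration i reads the value
 * that iteration i-1 just wrote (a flow, read-after-write dependence),
 * so the iterations cannot run side by side in SIMD lanes. */
void fd_recurrence(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Closed form of the same recurrence: every element depends only on
 * y[0] and i, so all iterations are independent and vectorizable. */
void fd_closed_form(float *y, size_t n)
{
    if (n == 0)
        return;
    const float y0 = y[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

For small values that are exactly representable the two versions give identical results; in general the closed form performs one rounding per element instead of an accumulated chain, so results may differ in the last bits.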

Reasons Vectorization Fails & How to Succeed
● Most frequent reason is Dependence:
Minimize dependencies among iterations by design!
● Alignment: Align your arrays/data structures
● Function calls in loop body: Use aggressive in-lining (IPO)
● Complex control flow/conditional branches:
Avoid them in loops by creating multiple versions of loops
● Unsupported loop structure: Use loop invariant expressions
● Not inner loop: Manual loop interchange possible?
● Mixed data types: Avoid type conversions
● Non-unit stride between elements: Possible to change algorithm to
allow linear/consecutive access?
● Loop body too complex reports: Try splitting up the loops!
● Vectorization seems inefficient reports: Enforce vectorization,
benchmark !

IVDEP vs. SIMD Pragma/Directives
Differences between IVDEP & SIMD pragmas/directives:
#pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran)
- Ignore vector dependencies (IVDEP):
Compiler ignores assumed but not proven dependencies for a loop
- Example:
void foo(int *a, int k, int c, int m)
{
#pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
#pragma simd (C/C++) or !DIR$ SIMD (Fortran):
- Aggressive version of IVDEP: ignores all dependencies inside a loop
- It’s an imperative that forces the compiler to try everything to vectorize
- Efficiency heuristic is ignored
- Attention: This can break semantically correct code!
However, it can vectorize code legally in some cases that wouldn’t be possible otherwise!
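Why the compiler has to assume a dependence in the `foo` loop from the IVDEP example, and when ignoring it is safe, can be seen by simulating vector execution by hand. The sketch below is an illustration with hypothetical names: `shift_scale_chunked` mimics a SIMD unit of width `vl` by performing all loads of a chunk before any of its stores.

```c
#include <assert.h>

#define MAX_VL 16

/* Reference semantics: the scalar loop from the slide. */
void shift_scale_scalar(int *a, int k, int c, int m)
{
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}

/* Emulated vector execution: for each chunk of vl iterations, load and
 * multiply everything first, then store everything. */
void shift_scale_chunked(int *a, int k, int c, int m, int vl)
{
    int tmp[MAX_VL];
    assert(vl >= 1 && vl <= MAX_VL);
    for (int i0 = 0; i0 < m; i0 += vl) {
        int n = (m - i0 < vl) ? m - i0 : vl;
        for (int j = 0; j < n; j++)   /* "vector load" and multiply */
            tmp[j] = a[i0 + j + k] * c;
        for (int j = 0; j < n; j++)   /* "vector store" */
            a[i0 + j] = tmp[j];
    }
}
```

With `k >= 0` the loop only reads elements that have not been written yet, so both functions produce the same array and the ivdep hint is harmless. With `k = -1` each iteration needs the value written by the previous one: on an array of ones with `c = 2`, the scalar loop produces 2, 4, 8, 16 while the chunked version with `vl = 4` produces 2, 2, 2, 2. Since `k` is only known at run time, the compiler cannot prove which case applies, and the programmer asserting independence takes responsibility for it.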

Memory Subsystem

Memory Bandwidth update
Basic rule for theoretical memory BW [GB/s]:
8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets
For Sandy Bridge EP platform: 4 channels, 2 sockets and 1600 MHz memory
8 * 1.600 * 4 * 2 = 102.4 GB/s peak (ST: 80 GB/s)
For Ivy Bridge EP platform: 4 channels, 2 sockets and 1866 MHz memory
8 * 1.866 * 4 * 2 = 119.42 GB/s peak (ST: ~98 GB/s)
For Haswell EP platform: 4 channels, 2 sockets and 2133 MHz memory
8 * 2.133 * 4 * 2 = 136.5 GB/s peak (ST: ~114 GB/s)
[Figure: two HSW Socket-R3 LGA processors linked by 2 full-width QPI 1.1, each with four DDR3/4 channels, 40 lanes of PCIe 3.0 and DMI2]

In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U (ark.intel.com):
Memory Types: DDR3/L/-RS 1333/1600
# of Memory Channels: 2
Max Memory Bandwidth: 25.6 GB/s
Basic rule for theoretical memory BW [GB/s]:
8 [Bytes / channel] * Mem freq [Gcycles/sec] * nb of channels * nb of sockets
Platform: 2 channels, 1 socket and 1600 MHz memory
8 * 1.6 * 2 * 1 = 25.6 GB/s peak (ST: 20 GB/s)
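The bandwidth rule above as a small helper, for checking the platform numbers in this deck. The function and parameter names are hypothetical; the 8 bytes per channel correspond to a 64-bit wide memory channel.

```c
/* Theoretical peak memory bandwidth in GB/s:
 * 8 bytes per channel * memory rate [Gcycles/s] * channels * sockets. */
double peak_mem_bw_gbs(double mem_rate_g, int channels, int sockets)
{
    const double bytes_per_channel = 8.0;  /* 64-bit channel */
    return bytes_per_channel * mem_rate_g * channels * sockets;
}
```

`peak_mem_bw_gbs(1.6, 2, 1)` reproduces the 25.6 GB/s of the laptop, and `peak_mem_bw_gbs(1.6, 4, 2)` the 102.4 GB/s of the Sandy Bridge EP platform. Measured bandwidth (the ST figures) stays below these bounds.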

Intel® Many Integrated Core
Architecture

Intel® Xeon Phi™ Product Family
Up to 61 IA cores / 1.2 GHz / 244 threads
Up to 16 GB memory with up to 352 GB/s bandwidth
512-bit SIMD instructions
Open Source Linux operating system
IP addressable
Standard programming languages, tools, clustering
22 nm process
Form factors: Passive Card, Active Card
http://software.intel.com/en-us/mic-developer

Intel® Xeon Phi™ Coprocessor Product Lineup
3 Family: Outstanding Parallel Computing Solution; Performance/$ leadership
6GB GDDR5, 240GB/s, >1TF DP, 300W TDP
3120P (MM# 927501), 3120A (MM# 927500)
5 Family: Optimized for High Density Environments; Performance/Watt leadership
8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP
5110P (MM# 924044), 5120D (no thermal) (MM# 927503)
7 Family: Highest Performance, Most Memory; Performance leadership
16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP
7120P (MM# 927499), 7120X (No Thermal Solution) (MM# 927498), 7120A (MM# 934878), 7120D (Dense Form Factor) (MM# 932330)
Optional 3-year Warranty: Extend to 3-year warranty on any Intel® Xeon Phi™ Coprocessor. Product Code: XPX100WRNTY, MM# 933057
For more information on performance tests go to http://www.intel.com/performance

Core Architecture
Instruction decoder: in order, 4 threads per core
Scalar unit and scalar registers: 64-bit
Vector unit and vector registers: 512-wide
VPU: integer, SP, DP; 3-operand, 16-instruction
L1 Cache (I & D): 32 KB per core
L2 Cache: 512 KB slice per core, L2 hardware prefetching, fully coherent
Interprocessor network

Spectrum of Execution Models
(Offload / Native / Symmetric)
Offload:
Workload is run on host, and highly parallel phases on Coprocessor
!dir$ omp offload target(mic)
!$omp parallel do
do i=1,10
A(i) = B(i) * C(i)
enddo
!$omp end parallel do
MPI example on Host with offload to coprocessors

Spectrum of Execution Models
(Offload / Native / Symmetric)
Native (Coprocessor-only model):
Workload is run solely on coprocessor
icc -mmic … ./bin_mic
Then
ssh mic0
./bin_mic
Or start it from host
micnativeloadex ./bin_mic
MPI example on Coprocessor only

Symmetric Mode
Command Line
Arslan et al. 2013. Rice HPC Conf.
Workload runs on Host AND Coprocessors

Single Node Tests – HW and SW Configuration
Isotropic RTM FD Kernel
Hardware:
2x E5-2697v2 (12C, 2.7GHz) and 64GB DDR3-1866 MHz
4x 7120A (61 Cores, 1.238 GHz, 16GB GDDR5)
The two sockets are linked by QPI; IOH* (*integrated in the processor)
Peer-to-peer via DMA: direct DMA transfers between devices
MPI processes and OpenMP threads:
rank 0 in “mic0”: 244 threads
rank 1 in “mic1”: 244 threads
rank 2 in “cpu0”: 12 threads
rank 3 in “cpu1”: 12 threads
rank 4 in “mic2”: 244 threads
rank 5 in “mic3”: 244 threads

Single Node Tests – Performance & Scalability
Isotropic RTM FD Kernel
High performance and scalability with Intel® Xeon Phi™ coprocessor
Scalability study with one to four Intel® Xeon Phi™ coprocessors
[Figure: Scaling Based on Number of Coprocessors; bars in GCell/s (axis 0.0 to 30.0) with a second axis in TFlops (0.0 to 1.6)]
Configuration: GCell/s
2x Xeon(1) semi-OPT: 1.1
2x Xeon(1): 4.0
2x Xeon(1) + 1x 7120A: 9.3
2x Xeon(1) + 2x 7120A: 14.7
2x Xeon(1) + 3x 7120A: 20.1
2x Xeon(1) + 4x 7120A: 24.4
CUDA K40c and CUDA K10 (single devices): 4.2 and 5.1 GCell/s
• Scaling analysis with each Intel® Xeon Phi™ coprocessor solving a 14GB subdomain and the pair of Intel® Xeon® processors solving a 10GB subdomain
• 16th order 3D space and 2nd order time; 61 Flops per Cell
• 24.4 GCell/s total performance with 2 processors + 4 coprocessors
• The semi-OPT measurement is an OpenMP parallel version implemented with cache-blocking and compiler directives to improve vectorization. The remaining measurements are on code with additional optimizations such as loop unrolling, non-temporal stores, tiling on Y-Z, prefetch tuning, and balance between MULs and ADDs via intrinsics
• CUDA K40c and CUDA K10 are measurements on single devices using code that extended the FDTD3d sample in the CUDA SDK 5.5 to 16th order in space and was further optimized to increase register reuse
Config. Summary: IC 14.0 U1, MPI 4.1.1.036, MPSS 6720-15, ECC off, Turbo on (Xeon & 7120A), CUDA 5.5 (875MHz Boost Enabled)
(1) Xeon = Intel® Xeon® processor E5-2697v2. Source: Intel Measured Results as of April 2014
See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

3DFD comparison: E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A

Single Node Tests – Performance/Watt
High energy efficiency with Xeon Phi
Energy efficiency with multiple Intel® Xeon Phi cards
Note: 3 and 4 Xeon Phi power values are projections based on the data collected for 1 and 2 Xeon Phi.
This data was presented by Petrobras at SC13 and Rice 2014 Oil & Gas HPC Workshop
Source: Petrobras presentation at 2014 RICE Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf

Next Intel® Xeon Phi™ Product Family
(Codenamed Knights Landing)
• “Knights Landing” code name for the 2nd generation Intel® Xeon Phi™ product
• Based on Intel’s 14 nanometer manufacturing process
• Standalone bootable processor (running the host OS) and a PCIe coprocessor (PCIe end-point device)
• Integrated on-package high-bandwidth memory
• Flexible memory modes for the on-package memory include: cache and flat
• Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
• 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single socket node
• Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor
Note that code name above is not the product name
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Programming Resources
Using a familiar programming model and tools means that developers don’t need to start from scratch. Many programming resources are available to further accelerate time to solution.
Intel® Xeon Phi™ Coprocessor Developer’s Quick Start Guide
Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
Access to webinar replays and over 50 training videos
Beginning labs for the Intel® Xeon Phi™ Coprocessor
Programming guides, tools, case studies, labs, code samples, forums & more
http://software.intel.com/mic-developer

Questions?
Are you ready for Multicore and ManyCore?

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel
logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804