Intel Technologies for High Performance
Computing
Leo Borges
Intel Software Conference 2014 Brazil
May 2014
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Legal Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its
customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks
are accurate and reflect performance of systems available for purchase.
Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC
manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors
support HT Technology, visit http://www.intel.com/info/hyperthreading.
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology
performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system
delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different
processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life
sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject
to change without notice
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s
current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice
Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo , Xeon Phi and Xeon Phi logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only
and are subject to change without notice.
*Other names and brands may be claimed as the property of others.
Building Blocks
Many Product Families – Today’s talk: HPC Focus
E5-2600 v3
(E5-2400 v3
for Comms &
Storage only)
E3-1200 v3
E7-4800 v3
E5-4600 v3
E7-2800 v3
E7-8800 v3
Haswell
E7
E5
Efficient
Performance
E3
E5-1600 v3
Boards/PDKs
Software
SSDs
LAN
RAID
Note: For discussion purposes only
(not intended to be interpreted as
portfolio recommendations or
guidance)
Cloud
Storage v3
Segments
Channel
Enterprise
HPC
Mission
Critical
Big Data
Public
Cloud
Co-processors
Product families and building blocks targeting an array of Segments
Storage
Networking
Recall of a few basics for HPC
What to expect from your code
What to expect from the hardware
Review Vectorization
Xeon + Xeon Phi Example
Objectives of this session
Review of a few HPC basics
for non-ninja programmers
How it works and where are the bottlenecks
Diagram: two sockets, each CPU -> L1 -> L2 -> L3 -> memory, plus I/O and interconnect
• Core count, size & perf?
• Cache size, BW & latency?
• Memory size, BW & latency?
• Intra / inter socket communications?
• Inter-node communication?
Unfortunately, you need to be aware
CPU
L 1 L 2 L 3
memory
Bandwidth
Latency
Capacity
From the core to the I/O subsystem: L1, L2, L3, L4, L5, …, Ln
caches, eDRAM, MCDRAM, DDR, NVM, SSD PCIe, SSD, HDD, tapes
FLOPS and memory Bandwidth impact the efficiency & scalability
Performing Flops is not an issue
Data movement is the issue (BW, Latency, Power)
Efficiency (= Achieved flops / Peak flops)
won’t be high enough if stores / loads are not fast enough (GB/s)
First approximation: Only a matter of
Frequency and Bandwidth
for (i=0;i<=MAX;i++)
c[i]= a[i] + b[i]* d[i];
store load load load
add mul
Performance expectation: upper bounds
CPU bound.
“HPL”
Real world applications
Memory bound.
“Stream”
Flops/s demanding
applications
Analyzing this Flop/memory-access ratio will give a first guess
for performance prediction
BW demanding
applications
• Our performance metrics are Gflop/s and % of peak (efficiency)
• Elapsed time alone might not tell the whole story (how far from peak
performance?)
Memory bound? Compute bound?
Glossary, “High performance computing”
Peak = number of floating-point operations per cycle * frequency, in “Flops/sec”
Efficiency = % of the peak performance
Same for bandwidth (but in GBytes/sec)
Peak (Flops/sec) = (Flops/cycle) * (cycles/sec)
By the way : What is the peak perf of your laptop ?
Anatomy of a Computer Platform
CPU: Core/Uncore - Designed For Modularity
Core: the CPU cores
Uncore: L3 cache, IMC (to DRAM), QPI, power & clock
QPI: Intel® QuickPath Interconnect
Differentiation in the “Uncore”:
• # QPI links
• # memory channels
• Size of cache
• # cores
• Power management
• Type of memory
• Integrated graphics
Romley EP/EN Platforms
Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families
Platform diagram: two Intel® Xeon® processors (E5-2400/2600 product families) linked by QPI, each with DDR3 channels and PCIe* 3.0 x8 links; Intel® C600 series chipset attached via DMI2 and PCIe* 2.0 x4, providing 3Gb/s SAS and SATA
Ivy Bridge CPUs
• Socket R: Up to 12 cores / socket
• Socket B2: Up to 10 cores / socket
QPI
• Socket R: 2 QPI links
• Socket B2: 1 QPI link
Memory
• DDR3 & DDR3L RDIMMs & UDIMMs, LR DIMMs
• Socket R: 4 channels per socket, up to 3 DPC; speeds up to DDR3 1866
• Socket B2: 3 channels per socket, up to 2 DPC; speeds up to DDR3 1600
PCI Express* 3.0
• Socket R: 40 lanes per socket
• Socket B2: 24 lanes per socket
• Extra Gen 2 x4 on 2nd CPU
Intel® C600 series chipset (Patsburg PCH)
• Optimized Server & WS PCH
• Integrated Storage: Up to 8 ports 3Gb/s SAS; RAID 5 optional
IvyBridge (IVB) E5-2600 v2 family
The total benefit (at node level) is given by a combination of factors
Diagram: 12 cores (C) sharing the LLC cache; MC with 4x DDR3; 2x QPI; I/O with PCIe Gen3 x16, x16 and x8
Feature: Xeon E5-2600 v2
Process Technology: 22 nm
Cores/Threads: Up to 12 Cores/24 Threads
Last-level Cache: Up to 30 MB
Max Memory Speed (MHz): Up to 1866
Max DIMM Capacity: 12 Slots/Processor
PCIe* Lanes / Controllers / Speed: 40 / 10 (PCIe* 3.0 at 8 GT/s)
TDP (W): 150 (Workstation only), 130, 115, 95, 80, 70, 60
wstream.exe
Advanced
Standard
Workstation
Only SKU
Segment
Optimized
8.0 GT/s QPI
DDR3-1866
Intel® HT
Intel® Turbo Boost
Low Power
Basic
Socket compatible with
SNB-EP top to bottom on the SKU stack
All SKUs, frequencies and features
can change without notice
6C 80W 2.1GHz 15M E5-2620 v2
4C 80W 2.5GHz 10M E5-2609 v2
10C 115W 2.5GHz 25M E5-2670 v2
8C 95W 2.0GHz 20M E5-2640 v2
4C 80W 1.8GHz 10M E5-2603 v2
6C 80W 2.6GHz 15M E5-2630 v2
10C 130W 3.0GHz 25M E5-2690 v2
10C 115W 2.8GHz 25M E5-2680 v2
8C 95W 2.6GHz 20M E5-2650 v2
10C 95W 2.2GHz 25M E5-2660 v2
12C 130W 2.7GHz 30M E5-2697 v2
12C 115W 2.4GHz 30M E5-2695 v2
8C 130W 3.3GHz 25M E5-2667 v2
6C 130W 3.5GHz 25M E5-2643 v2
4C 130W 3.5GHz 15M E5-2637 v2
10C 70W 1.7GHz 25M E5-2650L v2
6C 60W 2.4GHz 15M E5-2630L v2
8C 150W 3.4GHz 20M E5-2687W v2
Feature sets by SKU group:
• 10C: 8.0 GT/s QPI; 6C: 7.2 GT/s QPI; DDR3-1600; Intel® HT; Intel® Turbo Boost
• 7.2 GT/s QPI; DDR3 1600; Intel® HT; Intel® Turbo Boost
• 8.0 GT/s QPI; DDR3-1866 (skt R); DDR3-1600 (skt B2); Intel® HT; Intel® Turbo Boost
• 6.4 GT/s QPI; DDR3 1333; No Intel® HT; No Intel® Turbo
E5-2600 v2 Product Family
Parallel Programming for Intel® Architecture
(or, in general, for normal CPUs)
Cores
Vectors
Memory,
caches
Data layout and
alignment
OpenMP TBB Cilk plus
Vector
loops
Vector
functions
Blocking
algorithms
Manual layout,
ugly code
AoS SoA
library
4 considerations when writing an efficient, unconstrained parallel program
Array
notations
Threads, locks
Intrinsics
Directives for
alignment
Performance
Analysis
“SIMDization”, so called Vectorization
Single Instruction Multiple Data (SIMD):
Processing vector with a single operation
Provides data level parallelism (DLP)
Vector:
Consists of more than one element
Elements are of same scalar data types (e.g. floats, integers, …)
Scalar processing: C = A + B, one pair of operands per instruction
Vector processing: Ci = Ai + Bi, VL pairs of operands per instruction (VL = vector length)
Vectorization of Code
• Transform sequential code to exploit vector processing capabilities (SIMD)
– Manually by explicit syntax
– Automatically by tools like a compiler
for(i = 0; i <= MAX;i++)
c[i] = a[i] + b[i];
Scalar: c[i] = a[i] + b[i], one element per instruction
Vector: c[i] … c[i+7] = a[i] … a[i+7] + b[i] … b[i+7], eight elements per instruction
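The transformation pictured above can be sketched in plain C. This is only an illustration of what a vectorizer does conceptually, not compiler output: the function names and the vector length `VL = 8` (eight floats in one 256-bit register) are assumptions made for the example, and the inner lane loop stands in for a single SIMD add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector length: 8 floats, matching one 256-bit register. */
enum { VL = 8 };

/* Scalar reference: one element per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined form: the inner loop over VL lanes stands in for one SIMD
   add; the trailing loop handles the n % VL leftover elements. */
void add_strip_mined(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (size_t lane = 0; lane < VL; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; i++)              /* remainder loop */
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result for any n; the second simply makes the block structure and the remainder handling visible.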
Reminder about the peak flops
Scheduler (port names as used by Intel® Architecture Code Analyzer ***)
Port 0: ALU; VI* MUL; SSE MUL; DIV**; AVX FP MUL; AVX FP Blend
Port 1: ALU; VI* ADD; SSE ADD; AVX FP ADD
Port 5: ALU/JMP; SSE Shuf; AVX FP Shuf; AVX FP Bool; AVX FP Blend; VI* ADD
Port 2: Load; Store Address
Port 3: Load; Store Address
Port 4: Store Data
6 instructions / cycle:
• 3 memory ops
• 3 computational operations
Nehalem /Westmere: Two 128 bits SIMD per cycle
4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle
2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle
SandyBridge/ Ivy Bridge: Two 256 bits SIMD per cycle
8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle
Intel® SandyBridge/Ivy Bridge micro-architecture
In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U
From ark.intel.com:
Processor Number i5-3427U
# of Cores 2
# of Threads 4
Clock Speed 1.8 GHz
Max Turbo Frequency 2.8 GHz
Instruction Set Extensions AVX
SandyBridge/ Ivy Bridge: Two 256 bits SIMD per cycle
8 MUL (32b) and 8 ADD (32b): 16 Single Precision Flops / cycle
4 MUL (64b) and 4 ADD (64b): 8 Double Precision Flops / cycle
2 (cores) * 1.8GHz * 16 Flop/cycle = 57.6 Gflop/s (single precision)
2 (cores) * 1.8GHz * 8 Flop/cycle = 28.8 Gflop/s (double precision)
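The two lines above apply the glossary formula (peak = cores * frequency * flops per cycle). A minimal sketch in C, with function names chosen only for this example, makes it easy to redo the calculation for another machine:

```c
#include <assert.h>

/* Peak in Gflop/s = cores * frequency (GHz) * flops per cycle per core. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}

/* Efficiency = achieved / peak, expressed as a percentage. */
double efficiency_pct(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved_gflops / peak;
}
```

With the laptop values from the slide, `peak_gflops(2, 1.8, 16)` reproduces the 57.6 Gflop/s single-precision figure and `peak_gflops(2, 1.8, 8)` the 28.8 Gflop/s double-precision figure. The base clock is used here as on the slide; turbo frequencies are ignored.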
Haswell-EP vs IvyBridge-EP
The total benefit (at node level) is given by a combination of factors
• Benefit from microarchitecture optimization (IPC)
25% IPC improvement
• Benefit from the number of cores
up to 1.16x (at constant frequency)
• Benefit from AVX2
up to 2x (FMA)
• Benefit from memory bandwidth
up to 1.14x (1866MHz to 2133MHz)
Diagram: 14 cores (C) sharing the LLC cache; MC with 4x DDR4; 2x QPI; I/O with PCIe Gen3 x16, x16 and x8
Flops/s, AVX, AVX2 and AVX-512
Timeline (2013 to 2016, by half-year H1/H2): Ivy Bridge-EP, then Haswell-EP, then two future products marked “-512”
Haswell Execution Unit Overview
Intel® Microarchitecture (Haswell), Unified Reservation Station:
Port 0: Integer ALU & Shift; FMA / FP Mult; Divide; Vector Int Multiply; Vector Logicals; Vector Shifts; Branch
Port 1: Integer ALU & LEA; FMA / FP Multiply / FP Add; Vector Int ALU; Vector Logicals
Port 2: Load & Store Address
Port 3: Load & Store Address
Port 4: Store Data
Port 5: Integer ALU & LEA; Vector Shuffle; Vector Int ALU; Vector Logicals
Port 6: Integer ALU & Shift; Branch
Port 7: Store Address
2xFMA
• Doubles peak FLOPs
• Two FP multiplies benefit legacy code
4th ALU
• Great for integer workloads
• Frees Port0 & 1 for vector
New Branch Unit
• Reduces Port0 conflicts
• 2nd EU for high branch code
New AGU for Stores
• Leaves Port 2 & 3 open for loads
Extends 128-bit integer vector instructions to 256-bit
Floating Point Fused Multiply Add: A*B + C
Increased FLOPS potential
Increased accuracy – only a single rounding step
Enhanced vectorization with Gather, Shifts and powerful permutes
Intel® AVX2 uses same 256-bit YMM registers as Intel AVX
Floating-Point Performance (Peak) per Core, in Flops/Cycle:
SSE4 (Nehalem/Westmere): MUL (*) and ADD (+), 4 DP (8 SP)
AVX (SandyBridge/Ivy Bridge, 256b AVX1): MUL (*) and ADD (+), 8 DP (16 SP), 2x over SSE4
AVX2 (Haswell, 256b AVX2): two FMA (*,+), 16 DP (32 SP), 2x over AVX
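The per-core figures above follow from three factors: the number of SIMD lanes, the number of floating-point units that can issue each cycle, and the flops each instruction performs (an FMA counts as two). A small sketch, with a function name and parameter list invented for this example:

```c
#include <assert.h>

/* Flops per cycle per core =
     SIMD lanes * FP units issuing per cycle * flops per instruction.
   An FMA counts as 2 flops (one multiply plus one add). */
int flops_per_cycle(int vector_bits, int element_bits,
                    int fp_units, int flops_per_instr)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return lanes * fp_units * flops_per_instr;
}
```

For double precision (64-bit elements): `flops_per_cycle(128, 64, 2, 1)` gives 4 for SSE4 with one MUL and one ADD unit, `flops_per_cycle(256, 64, 2, 1)` gives 8 for AVX, and `flops_per_cycle(256, 64, 2, 2)` gives 16 for AVX2 with two FMA units. Passing 32 as the element size yields the single-precision values.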
Use math libs for best use of AVX1, AVX2 & AVX-512
Chart: speedup (axis 0.0 to 2.0), one core basis comparison, for application source code, assembly/intrinsics, MKL Dgemm benchmark and MKL FFT benchmark
• Use Intel® Math Kernel Library as much as possible
• Use intrinsics or assembly for specific kernels
• Use the compiler and Intel tools to optimize your source code
Intel® Math Kernel Library:
Optimized Mathematical Building Blocks
Linear Algebra
• BLAS
• LAPACK
• Sparse Solvers
• Iterative
• Pardiso*
• ScaLAPACK
Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT
Vector Math
• Trigonometric
• Hyperbolic
• Exponential, Log
• Power / Root
Vector RNGs
• Congruential
• Wichmann-Hill
• Mersenne Twister
• Sobol
• Niederreiter
• Non-deterministic
Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance
And More
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver
Intel® MKL is an integral part of Intel® Parallel Studio XE
Many Ways to Vectorize
From ease of use (top) to programmer control (bottom):
Compiler: Auto-vectorization (no change of code)
Compiler: Auto-vectorization hints (#pragma simd, …)
Compiler: Intel® Cilk™ Plus Array Notation Extensions
SIMD intrinsic class (e.g.: F32vec, F64vec, …)
Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …)
Assembler code (e.g.: [v]addps, [v]addss, …)
Control Vectorization !
Provides details on vectorization success & failure:
Linux*, Mac OS* X: -vec-report<n>, Windows*: /Qvec-report<n>
*: First available with Intel® Parallel Studio XE
n: Diagnostic Messages
0: Tells the vectorizer to report no diagnostic information. Useful for turning off reporting in case it was enabled on command line earlier.
1: Tells the vectorizer to report on vectorized loops. [default if n missing]
2: Tells the vectorizer to report on vectorized and non-vectorized loops.
3: Tells the vectorizer to report on vectorized and non-vectorized loops and any proven or assumed data dependences.
4: Tells the vectorizer to report on non-vectorized loops.
5: Tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized.
6*: Tells the vectorizer to use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences.
Vectorization Report II
Note:
In case inter-procedural optimization (-ipo or /Qipo) is activated and
compilation and linking are separate compiler invocations, the switch to enable
reporting needs to be added to the link step!
35: subroutine fd( y )
36: integer :: i
37: real, dimension(10), intent(inout) :: y
38: do i=2,10
39: y(i) = y(i-1) + 1
40: end do
41: end subroutine fd
novec.f90(38): (col. 3) remark: loop was not vectorized: existence
of vector dependence.
novec.f90(39): (col. 5) remark: vector dependence: proven FLOW
dependence between y line 39, and y line 39.
novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized:
existence of vector dependence
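The flow dependence reported for line 39 can be reproduced in C. The sketch below is an assumption-laden illustration (function names are invented here): the first function is the same recurrence as the Fortran loop, the second is its closed form, in which no iteration reads a value written by another one.

```c
#include <assert.h>
#include <stddef.h>

/* Same recurrence as the Fortran loop: each iteration reads the value the
   previous iteration wrote (a flow dependence), so iterations cannot be
   placed side by side in SIMD lanes. */
void fd_recurrence(float *y, size_t n)
{
    assert(y != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Closed form of the same recurrence: y[i] = y[0] + i. Every iteration
   depends only on the loop-invariant y[0], so the loop becomes a
   candidate for vectorization. */
void fd_closed_form(float *y, size_t n)
{
    assert(y != NULL || n == 0);
    const float y0 = (n > 0) ? y[0] : 0.0f;
    for (size_t i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

For small integer-valued data the two versions give identical results; with general floating-point data they can differ in the last bits, because the additions are rounded differently.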
Reasons Vectorization Fails & How to Succeed
● Most frequent reason is Dependence:
Minimize dependencies among iterations by design!
● Alignment: Align your arrays/data structures
● Function calls in loop body: Use aggressive in-lining (IPO)
● Complex control flow/conditional branches:
Avoid them in loops by creating multiple versions of loops
● Unsupported loop structure: Use loop invariant expressions
● Not inner loop: Manual loop interchange possible?
● Mixed data types: Avoid type conversions
● Non-unit stride between elements: Possible to change algorithm to
allow linear/consecutive access?
● Loop body too complex reports: Try splitting up the loops!
● Vectorization seems inefficient reports: Enforce vectorization,
benchmark !
IVDEP vs. SIMD Pragma/Directives
Differences between IVDEP & SIMD pragmas/directives:
#pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran)
- Ignore vector dependencies (IVDEP):
Compiler ignores assumed but not proven dependencies for a loop
- Example:
void foo(int *a, int k, int c, int m)
{
#pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
#pragma simd (C/C++) or !DIR$ SIMD (Fortran):
- Aggressive version of IVDEP: ignores all dependencies inside a loop
- It is an imperative that forces the compiler to try everything to vectorize
- Efficiency heuristic is ignored
- Attention: This can break semantically correct code!
However, it can legally vectorize code in some cases that would not be possible
otherwise!
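Why the compiler has to assume a dependence in `foo` can be shown with a small model. The sketch below is not what any compiler emits; `shift_vector_model`, the bounds `lo`/`hi` and the vector length `VL = 4` are assumptions made for illustration. It mimics a vectorized loop by loading a whole block before storing it, and compares that with the scalar order of execution.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 4 };   /* hypothetical vector length, in elements */

/* Scalar semantics of the loop in foo(): a[i] = a[i + k] * c.
   The caller must keep every index i + k inside the array. */
void shift_scalar(int *a, ptrdiff_t k, int c, ptrdiff_t lo, ptrdiff_t hi)
{
    assert(lo <= hi);
    for (ptrdiff_t i = lo; i < hi; i++)
        a[i] = a[i + k] * c;
}

/* Model of a vectorized version: load VL elements and multiply,
   then store VL elements. */
void shift_vector_model(int *a, ptrdiff_t k, int c, ptrdiff_t lo, ptrdiff_t hi)
{
    assert(lo <= hi);
    ptrdiff_t i = lo;
    for (; i + VL <= hi; i += VL) {
        int tmp[VL];
        for (int lane = 0; lane < VL; lane++)   /* vector load + multiply */
            tmp[lane] = a[i + lane + k] * c;
        for (int lane = 0; lane < VL; lane++)   /* vector store */
            a[i + lane] = tmp[lane];
    }
    for (; i < hi; i++)                         /* scalar remainder */
        a[i] = a[i + k] * c;
}
```

With k = 1 the loop only reads elements that have not been written yet, and both functions give the same array. With k = -1 each iteration reads the element written by the previous one, and the block-wise version produces different values. Since k is unknown at compile time, the compiler cannot tell these cases apart; `#pragma ivdep` is the programmer's promise that only the safe case occurs.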
Memory Subsystem
Memory Bandwidth update
For Sandy Bridge EP platform: 4 channels , 2 sockets and 1600 MHz memory
8*1.600* 4*2 = 102.4 GB/s peak (ST : 80 GB/s)
For Ivy Bridge EP platform: 4 channels , 2 sockets and 1866 MHz memory
8*1.866* 4*2 = 119.42 GB/s peak (ST : ~98 GB/s)
For Haswell EP platform: 4 channels , 2 sockets and 2133 MHz memory
8*2.133* 4*2 = 136.5 GB/s peak (ST : ~114 GB/s)
Basic rule for theoretical memory BW [GB/s]:
8 [Bytes / channel] * memory frequency [Gtransfers/sec] * number of channels * number of sockets
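The rule above is simple enough to put in a helper. This is a sketch with invented function names; it also assumes that the "ST" values quoted on the slide are measured bandwidths (presumably STREAM Triad), which the slide does not spell out.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s:
   8 bytes per channel per transfer * transfer rate (GT/s)
   * channels per socket * sockets. */
double peak_mem_bw_gbs(double gtransfers_per_s, int channels, int sockets)
{
    const double bytes_per_transfer = 8.0;   /* 64-bit DDR channel */
    return bytes_per_transfer * gtransfers_per_s * channels * sockets;
}

/* Share of the theoretical peak reached by a measured bandwidth, in %. */
double bw_efficiency_pct(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return 100.0 * measured_gbs / peak_gbs;
}
```

`peak_mem_bw_gbs(1.6, 4, 2)` reproduces the 102.4 GB/s Sandy Bridge EP figure, and `bw_efficiency_pct(80.0, 102.4)` shows that the quoted 80 GB/s is about 78% of that peak.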
Diagram: two HSW sockets (Socket-R3, LGA) linked by 2 full width QPI 1.1; each socket with 4 DDR3/4 channels, 40 lanes of PCIe 3.0 and DMI2
In the Laptop We’ll be Using for Demo…
Processor: Intel Core i5-3427U
From ark.intel.com:
Memory Types DDR3/L/-RS 1333/1600
# of Memory Channels 2
Max Memory Bandwidth 25.6 GB/s
Basic rule for theoretical memory BW [GB/s]:
8 [Bytes / channel] * memory frequency [Gtransfers/sec] * number of channels * number of sockets
Platform: 2 channels, 1 socket and 1600 MHz memory
8*1.6*2*1 = 25.6 GB/s peak (ST : 20 GB/s)
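With both the peak flops and the peak bandwidth of the demo laptop known, the "first guess" from the flop/memory-access ratio can be worked out for the earlier loop `c[i] = a[i] + b[i] * d[i]` (2 flops, 3 loads and 1 store per iteration). The sketch below uses function names invented for this example and assumes 8-byte doubles; it ignores caches and any extra traffic caused by stores, so it is an upper bound only.

```c
#include <assert.h>

/* Arithmetic intensity in flop/byte for one loop iteration. */
double arithmetic_intensity(int flops, int loads, int stores, int bytes_per_elem)
{
    assert(loads + stores > 0 && bytes_per_elem > 0);
    return (double)flops / ((loads + stores) * bytes_per_elem);
}

/* Upper bound on achievable Gflop/s: the lower of the compute peak and
   what the memory system can feed (bandwidth * intensity). */
double attainable_gflops(double peak_gflops, double bw_gbs, double intensity)
{
    double memory_bound = bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For the loop above the intensity is 2 / 32 = 0.0625 flop/byte. At 25.6 GB/s that allows at most 1.6 Gflop/s, far below the 28.8 Gflop/s double-precision peak, so the loop is memory bound when its data comes from main memory.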
Intel® Many Integrated Core
Architecture
Up to 61 IA cores/1.2 GHz/ 244
Threads
Up to 16 GB memory with up to 352
GB/s bandwidth
512-bit SIMD instructions
Open Source Linux operating system
IP addressable
Standard programming languages,
tools, clustering
22 nm process
Intel® Xeon Phi™ Product Family
Passive Card
Active Card
http://software.intel.com/en-us/mic-developer
Intel® Xeon Phi™ Coprocessor Product Lineup
3 Family: Outstanding Parallel Computing Solution; Performance/$ leadership
• 6GB GDDR5, 240GB/s, >1TF DP, 300W TDP
• 3120P (MM# 927501), 3120A (MM# 927500)
5 Family: Optimized for High Density Environments; Performance/Watt leadership
• 8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP
• 5110P (MM# 924044), 5120D (no thermal) (MM# 927503)
7 Family: Highest Performance, Most Memory; Performance leadership
• 16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP
• 7120P (MM# 927499), 7120X (No Thermal Solution) (MM# 927498), 7120A (MM# 934878), 7120D (Dense Form Factor) (MM# 932330)
Optional 3-year Warranty: Extend to 3-year warranty on any Intel® Xeon Phi™ Coprocessor. Product Code: XPX100WRNTY, MM# 933057
Core Architecture
Diagram: Instruction decoder -> Scalar Unit (Scalar Registers) and Vector Unit (Vector Registers) -> L1 Cache (I & D) -> L2 Cache -> Interprocessor network
• In order
• 4 threads per core
• Scalar unit: 64-bit; vector unit: 512-wide
• VPU: integer, SP, DP; 3-operand, 16-instruction
• L1 cache (I & D): 32 KB per core
• L2 cache: 512 KB slice per core; L2 hardware prefetching; fully coherent
Spectrum of Execution Models
(Offload / Native / Symmetric)
Offload:
Workload is run on host, and highly
parallel phases on Coprocessor
!dir$ omp offload target(mic)
!$omp parallel do
do i=1,10
   A(i) = B(i) * C(i)
enddo
!$omp end parallel do
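For C code, a comparable offload can be written with the standard OpenMP 4.0 `target` construct. This is a swapped-in, portable analogue and not the Intel-specific directive used in the Fortran snippet; the function name and the fixed size `N` are assumptions for the example.

```c
#include <assert.h>

enum { N = 10 };

/* a[i] = b[i] * c[i], with the loop marked for offload.
   The directives are standard OpenMP 4.0. A compiler without offload
   support, or with OpenMP disabled, simply runs the loop on the host. */
void multiply_offload(double *a, const double *b, const double *c)
{
    assert(a && b && c);
    #pragma omp target map(to: b[0:N], c[0:N]) map(from: a[0:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] * c[i];
}
```

The `map` clauses state which arrays are copied to the device before the region and which are copied back afterwards, which is the data movement an offload model has to pay for.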
MPI Example
on Host with offload to coprocessors
Spectrum of Execution Models
(Offload / Native / Symmetric)
MPI example
on Coprocessor only
Native (Coprocessor-only model):
Workload is run solely on coprocessor
icc -mmic … ./bin_mic
Then
ssh mic0
./bin_mic
Or start it from host
micnativeloadex ./bin_mic
Symmetric Mode
Command Line
Arslan et al. 2013. Rice HPC Conf.
Workload runs on Host AND Coprocessors
Single Node Tests – HW and SW Configuration
Isotropic RTM FD Kernel
Host: 2x E5-2697v2 (12C, 2.7GHz) and 64GB DDR3-1866 MHz, sockets linked by QPI, IOH* on each socket (*integrated in the processor)
Coprocessors: 4x 7120A (61 Cores, 1.238 GHz, 16GB GDDR5)
MPI processes and OpenMP threads per rank:
• rank 0 in “mic0”: 244 threads
• rank 1 in “mic1”: 244 threads
• rank 2 in “cpu0”: 12 threads
• rank 3 in “cpu1”: 12 threads
• rank 4 in “mic2”: 244 threads
• rank 5 in “mic3”: 244 threads
Peer-to-peer via DMA: direct DMA transfers between devices
Scalability study with one to four Intel® Xeon Phi™ coprocessors
Single Node Tests – Performance & Scalability
Isotropic RTM FD Kernel
High performance and scalability with Intel® Xeon Phi™ coprocessor
Chart: Scaling Based on Number of Coprocessors (axes: GCell/s from 0.0 to 30.0, TFlops from 0.0 to 1.6)
Bar values in GCell/s: 1.1, 4.0, 9.3, 14.7, 20.1, 24.4; CUDA K40c and CUDA K10 shown for comparison
Scaling analysis with each Intel® Xeon Phi™ coprocessor
solving a 14GB subdomain and pair of Intel® Xeon®
processors solving a 10GB subdomain
16th order 3D space and 2nd order time; 61 Flops per Cell
24.4 GCell/s total performance with 2 processors + 4
coprocessors
semi-OPT measurement is an OpenMP parallel version
implemented with cache-blocking and compiler directives
to improve vectorization. The remaining measurements are
on code with additional optimizations such as loop
unrolling, non-temporal stores, tiling on Y-Z, prefetch
tuning, and balance between MULs and ADDs via intrinsics
CUDA K40c and CUDA K10 are measurements on single
devices using code that extended the FDTD3d sample in
the CUDA SDK5.5 to 16th order in space and further
optimized to increase register reuse
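The chart has a GCell/s axis and a TFlops axis; the two are linked by the 61 flops per cell quoted above. The helper below is a sketch with invented names. It also assumes that the bar values appear in the same order as the configuration labels (4.0 GCell/s for 2x Xeon alone, 24.4 GCell/s with four coprocessors), which the extracted text suggests but does not state.

```c
#include <assert.h>

/* Convert a stencil throughput in GCell/s into TFlop/s, given the flop
   count per cell update (61 for this 16th-order kernel). */
double gcells_to_tflops(double gcells_per_s, int flops_per_cell)
{
    return gcells_per_s * flops_per_cell / 1000.0;
}

/* Average throughput added per coprocessor, relative to the host-only run. */
double gcells_per_coprocessor(double total, double host_only, int ncards)
{
    assert(ncards > 0);
    return (total - host_only) / ncards;
}
```

`gcells_to_tflops(24.4, 61)` gives about 1.49 TFlop/s for the full node, consistent with a TFlops axis that tops out at 1.6. Under the ordering assumption, `gcells_per_coprocessor(24.4, 4.0, 4)` gives about 5.1 GCell/s added per card.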
4.2
GCell/s
5.1
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to
http://www.intel.com/performance
1. Xeon = Intel® Xeon® processor E5-2697v2 Source: Intel Measured Results as of April 2014
2x Xeon1
semi-OPT
2x Xeon1 2x Xeon1 +
1x 7120A
2x Xeon1 +
2x 7120A
2x Xeon1 +
3x 7120A
2x Xeon1 +
4x 7120A
Config. Summary
IC 14.0 U1 MPI 4.1.1.036
MPSS 6720-15
ECC off,
Turbo on (Xeon & 7120A)
CUDA 5.5
(875MHz Boost Enabled)
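The two chart axes are related by the 61 Flops per cell quoted on this slide: TFlops = GCell/s × 61 / 1000. The following sketch, with the measured values copied from the chart and purely illustrative function names, converts throughput to a floating-point rate and reports the gain from each added coprocessor.

```c
#include <assert.h>
#include <stddef.h>

#define FLOPS_PER_CELL 61.0 /* from the slide: 61 Flops per cell */

/* Measured GCell/s from the chart for 2x Xeon plus 0..4 coprocessors. */
const double gcells_per_s[] = { 4.0, 9.3, 14.7, 20.1, 24.4 };

/* Convert cell throughput (GCell/s) to floating-point rate (TFlop/s). */
double gcells_to_tflops(double gcells)
{
    assert(gcells >= 0.0);
    return gcells * FLOPS_PER_CELL / 1000.0;
}

/* Throughput gained by adding the n-th coprocessor (n = 1..4), in GCell/s. */
double gain_from_coprocessor(size_t n)
{
    assert(n >= 1 && n < sizeof gcells_per_s / sizeof gcells_per_s[0]);
    return gcells_per_s[n] - gcells_per_s[n - 1];
}
```

For the full configuration this gives 24.4 × 61 / 1000 ≈ 1.49 TFlops, consistent with the 0.0 to 1.6 TFlops range of the right axis. The per-coprocessor gains come to roughly 5.3, 5.4, 5.4, and 4.3 GCell/s, so the fourth card adds somewhat less than the first three.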
Parallel Programming for Intel® Architecture
(or, in general, for normal CPUs)
4 considerations when writing an efficient, unconstrained parallel program:
• Cores: OpenMP, TBB, Cilk Plus; threads, locks
• Vectors: vector loops, vector functions, array notations; intrinsics
• Memory, caches: blocking algorithms
• Data layout and alignment: AoS, SoA; directives for alignment; library; manual layout, ugly code
Performance Analysis
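To make the data-layout consideration concrete, here is a minimal sketch contrasting an array of structures (AoS) with a structure of arrays (SoA). The type and function names are hypothetical and do not come from the slides; the point is only that the SoA loop reads the field with unit stride, which is generally easier for a compiler to vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: the fields of one point are adjacent, so consecutive x values
   are separated by the y and z fields (stride of 3 floats). */
struct point_aos { float x, y, z; };

/* SoA: each field is its own contiguous array of n elements. */
struct points_soa { float *x, *y, *z; size_t n; };

/* Sum of x over an AoS array: strided access to x. */
float sum_x_aos(const struct point_aos *p, size_t n)
{
    assert(p != NULL || n == 0);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Sum of x over an SoA layout: unit-stride access to x. */
float sum_x_soa(const struct points_soa *p)
{
    assert(p != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < p->n; i++)
        s += p->x[i];
    return s;
}
```

Both functions compute the same result; only the memory access pattern differs, which is what matters for vector loads and cache-line utilization.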
3DFD comparison: E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A
Single Node Tests – Performance/Watt
High energy efficiency with Xeon Phi
Energy efficiency with multiple Intel® Xeon Phi™ cards
Note: 3 and 4 Xeon Phi power values are projections based on the data collected for 1 and 2 Xeon Phi.
This data was presented by Petrobras at SC13 and the Rice 2014 Oil & Gas HPC Workshop.
Source: Petrobras presentation at 2014 RICE Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf
Next Intel® Xeon Phi™ Product Family
(Codenamed Knights Landing)
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
• “Knights Landing” is the code name for the 2nd generation Intel® Xeon Phi™ product
• Based on Intel’s 14 nanometer manufacturing process
• Available as a standalone bootable processor (running the host OS) and as a PCIe coprocessor (PCIe end-point device)
• Integrated on-package high-bandwidth memory
• Flexible memory modes for the on-package memory include cache and flat
• Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
• 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single-socket node
• Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor
Note that the code name above is not the product name.
Programming Resources
• Intel® Xeon Phi™ Coprocessor Developer’s Quick Start Guide
• Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors
• Access to webinar replays and over 50 training videos
• Beginning labs for the Intel® Xeon Phi™ Coprocessor
• Programming guides, tools, case studies, labs, code samples, forums & more
http://software.intel.com/mic-developer
Using a familiar programming model and tools means that developers don’t need to start from scratch. Many programming resources are available to further accelerate time to solution.
Questions?
Are you ready for Multicore and ManyCore?
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR
OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO
LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS
INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel
logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804
Intel Technologies for High Performance Computing

  • 1. Intel Technologies for High Performance Computing Leo Borges Intel Software Conference 2014 Brazil May 2014
  • 2. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel® Hyper-Threading Technology Available on select Intel® Xeon® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading. Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. 
Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo , Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice. *Other names and brands may be claimed as the property of others. 2
  • 3. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Building Blocks Many Product Families – Today’s talk: HPC Focus 3 E5-2600 v3 (E5-2400 v3 for Comms & Storage only) E3-1200 v3 E7-4800 v3 E5-4600 v3 E7-2800 v3 E7-8800 v3 Haswell E7 E5 Efficient Performance E3 E5-1600 v3 Boards/PDKs Software SSDsLAN RAID Note: For discussion purposes pnly (Not intended to be interpreted as portfolio recommendations or guidance) Cloud Storagev3 Segments Channel Enterprise HPC Mission Critical Big Data Public Cloud Co-processors Product families and building blocks targeting an array of Segments Storage Networking
  • 4. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Recall of a few basics for HPC What to expect from your code What to expect from the hardware Review Vectorization Xeon + Xeon Phi Example Objectives of this session 4
  • 5. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Review of a few HPC basics for non-ninja programmers 5
  • 6. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice How it works and where are the bottlenecks CPUCPUCPUCPU L 1L 1L 1L 1 L 2L 2L 2L 2 L 3L 3L 3L 3 memorymemorymemorymemory CPUCPUCPUCPU L 2L 2L 2L 2 L 3L 3L 3L 3 memorymemorymemorymemory I/OI/OI/OI/O Interconnect.Interconnect.Interconnect.Interconnect. L 1L 1L 1L 1 Memory size, BW & latency ?Memory size, BW & latency ?Memory size, BW & latency ?Memory size, BW & latency ? Cache Size, BW &Cache Size, BW &Cache Size, BW &Cache Size, BW & latencylatencylatencylatency CoreCoreCoreCore count, size & perf ?count, size & perf ?count, size & perf ?count, size & perf ? Intra / Inter socketIntra / Inter socketIntra / Inter socketIntra / Inter socket communicationscommunicationscommunicationscommunications InterInterInterInter nodesnodesnodesnodes communication?communication?communication?communication? 6
  • 7. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Unfortunately, you need to be aware CPU L 1 L 2 L 3 memory Bandwidth Latency Capacity From the core ………………….. ------> ………………………… to the i/o subsystem L1 L2 L3 L4 L5 …. Ln caches eDram MCDram NVM SSD PCIe SSD HDD TapesDDR 7
  • 8. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice FLOPS and memory Bandwidth impact the efficiency & scalability Performing Flops is not an issue Data movement is the issue (BW, Latency, Power) Efficiency (= Peak flops / Achieved flops) won’t be high enough if store / load are not fast enough (GB/s) First approximation: Only a matter of Frequency and Bandwidth for (i=0;i<=MAX;i++) c[i]= a[i] + b[i]* d[i]; store load load load add mul 8
  • 9. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Performance expectation: upper bounds CPU bound. “HPL”Real world applications Memory bound. “Stream” Flops/s demanding applications Analyzing this Flop/memory-access ratio will give a first guess for performance prediction BW demanding applications • Our performance metrics are Gflop/s and % of peak (efficiency) • Elapsed time might not tell all the information (how far of the peak performance?) 9
  • 10. Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice Performance expectation: upper bounds CPU bound. “HPL”Real world applications Memory bound. “Stream” Analyzing this Flop/memory-access ratio will give a first guess for performance prediction • Our performance metrics are Gflop/s and % of peak (efficiency) • Elapsed time might not tell all the information (how far of the peak performance?) 10 Memory Bound? Compute Bound?
  • 11. Glossary, "High performance computing". Peak = number of floating-point operations per cycle × frequency, in "Flops/sec": $\text{Peak} = (\text{Flops}/\text{cycle}) \times (\text{cycles}/\text{sec}) = \text{Flops}/\text{sec}$. "Efficiency" = % of the peak performance. Same for bandwidth (but in GBytes/sec). By the way: what is the peak performance of your laptop?
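The glossary definitions translate directly into two small helpers. This is a sketch with illustrative names; the only inputs are the quantities the slide defines (flops per cycle, frequency, and, as an added assumption, the core count for a multi-core peak).

```c
#include <assert.h>

/* Peak Gflop/s = cores * frequency (GHz) * flops per cycle per core. */
double peak_gflops(int cores, double ghz, int flops_per_cycle)
{
    assert(cores > 0 && ghz > 0.0 && flops_per_cycle > 0);
    return cores * ghz * flops_per_cycle;
}

/* Efficiency = achieved performance as a percentage of the peak. */
double efficiency_percent(double achieved_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * achieved_gflops / peak;
}
```

For a 2-core, 1.8 GHz part that retires 16 single-precision flops per cycle per core, `peak_gflops(2, 1.8, 16)` is about 57.6 Gflop/s, the same figure the deck derives for the demo laptop.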
  • 12. Anatomy of a Computer Platform
  • 13. CPU: Core/Uncore - Designed for Modularity. [Diagram: cores and L3 cache, with IMC, QPI and power & clock blocks; DRAM and QPI links attached.] Differentiation in the "Uncore": number of QPI links, number of memory channels, size of cache, number of cores, power management, type of memory, integrated graphics. QPI: Intel® QuickPath Interconnect.
  • 14. Romley EP/EN Platforms: Intel® Xeon® Processor E5-2600 v2/2400 v2 Product Families. [Diagram: two Intel® Xeon® processors (E5-2400/2600 product families) linked by QPI, each with DDR3 channels and PCIe* 3.0 x8 ports; Intel® C600 series chipset attached via DMI2 and PCIe* 2.0 x4, providing 3Gb/s SAS and SATA.] Ivy Bridge CPUs: Socket R up to 12 cores/socket; Socket B2 up to 10 cores/socket. QPI: Socket R 2 QPI links; Socket B2 1 QPI link. Memory: DDR3 & DDR3L RDIMMs, UDIMMs and LRDIMMs; Socket R: 4 channels per socket, up to 3 DPC, speeds up to DDR3-1866; Socket B2: 3 channels per socket, up to 2 DPC, speeds up to DDR3-1600. PCI Express* 3.0: Socket R 40 lanes per socket; Socket B2 24 lanes per socket; extra Gen 2 x4 on 2nd CPU. Intel® C600 series chipset (Patsburg PCH): optimized server & workstation PCH; integrated storage with up to 8 ports of 3Gb/s SAS; RAID 5 optional.
  • 15. Ivy Bridge (IVB) E5-2600 v2 family. The total benefit (at node level) is given by a combination of factors. [Diagram: 12 cores (C) sharing an LLC, with memory controller (4 DDR3 channels), 2 QPI links and I/O (Gen3 x16, Gen3 x16, Gen3 x8).] Xeon E5-2600 v2 features: Process technology: 22 nm; Cores/threads: up to 12 cores / 24 threads; Last-level cache: up to 30 MB; Max memory speed: up to 1866 MHz; Max DIMM capacity: 12 slots/processor; PCIe* lanes / controllers / speed: 40 / 10 (PCIe* 3.0 at 8 GT/s); TDP (W): 150 (workstation only), 130, 115, 95, 80, 70, 60. wstream.exe
  • 16. E5-2600 v2 Product Family. SKU segments: Advanced, Standard, Basic, Low Power, Segment Optimized, Workstation Only. Socket compatible with SNB-EP, top to bottom of the SKU stack. All SKUs, frequencies and features can change without notice. SKUs (cores, TDP, frequency, cache): E5-2697 v2 (12C, 130W, 2.7GHz, 30M); E5-2695 v2 (12C, 115W, 2.4GHz, 30M); E5-2690 v2 (10C, 130W, 3.0GHz, 25M); E5-2680 v2 (10C, 115W, 2.8GHz, 25M); E5-2670 v2 (10C, 115W, 2.5GHz, 25M); E5-2660 v2 (10C, 95W, 2.2GHz, 25M); E5-2650 v2 (8C, 95W, 2.6GHz, 20M); E5-2640 v2 (8C, 95W, 2.0GHz, 20M); E5-2630 v2 (6C, 80W, 2.6GHz, 15M); E5-2620 v2 (6C, 80W, 2.1GHz, 15M); E5-2609 v2 (4C, 80W, 2.5GHz, 10M); E5-2603 v2 (4C, 80W, 1.8GHz, 10M); E5-2667 v2 (8C, 130W, 3.3GHz, 25M); E5-2643 v2 (6C, 130W, 3.5GHz, 25M); E5-2637 v2 (4C, 130W, 3.5GHz, 15M); E5-2650L v2 (10C, 70W, 1.7GHz, 25M); E5-2630L v2 (6C, 60W, 2.4GHz, 15M); E5-2687W v2 (8C, 150W, 3.4GHz, 20M). Feature tiers shown on the slide: 8.0 GT/s QPI, DDR3-1866, Intel® HT, Intel® Turbo Boost; 7.2 GT/s QPI, DDR3-1600, Intel® HT, Intel® Turbo Boost; 6.4 GT/s QPI, DDR3-1333, no Intel® HT, no Intel® Turbo Boost; 8.0 GT/s QPI (10C) or 7.2 GT/s QPI (6C), DDR3-1600, Intel® HT, Intel® Turbo Boost; 8.0 GT/s QPI, DDR3-1866 (skt R) or DDR3-1600 (skt B2), Intel® HT, Intel® Turbo Boost.
  • 17. Parallel Programming for Intel® Architecture (or, in general, for normal CPUs). 4 considerations when writing an efficient, unconstrained parallel program: Cores (OpenMP, TBB, Cilk Plus; threads, locks); Vectors (vector loops, vector functions, array notations; intrinsics); Memory, caches (blocking algorithms); Data layout and alignment (AoS/SoA library, directives for alignment; manual layout, ugly code). Performance analysis.
  • 18. "SIMDization", also called vectorization. Single Instruction Multiple Data (SIMD): processes a vector with a single operation; provides data-level parallelism (DLP). Vector: consists of more than one element; elements are of the same scalar data type (e.g. floats, integers, …). [Diagram: scalar processing computes C = A + B on one element; vector processing computes Ci = Ai + Bi on VL elements at once.]
  • 19. Vectorization of Code. • Transform sequential code to exploit vector processing capabilities (SIMD): manually by explicit syntax, or automatically by tools like a compiler. Example: for(i = 0; i <= MAX;i++) c[i] = a[i] + b[i]; [Diagram: the scalar version adds a[i] and b[i] into c[i] one element at a time; the vector version adds a[i..i+7] and b[i..i+7] into c[i..i+7] with a single operation.]
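What the compiler does to the loop above can be pictured in portable C as strip-mining: a main loop that handles eight elements per pass (the width of one 256-bit vector of floats) followed by a remainder loop. The sketch below is only a model of the transformation, not the code a compiler actually emits; the function names and the choice of `restrict` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { VL = 8 };  /* 8 floats fill one 256-bit vector register */

/* Plain scalar loop: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined form: 'restrict' promises the arrays do not overlap, which
 * is what lets a compiler turn the inner loop into one vector add. */
void add_strip_mined(const float *restrict a, const float *restrict b,
                     float *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    /* Main loop: each pass covers VL elements. */
    for (; i + VL <= n; i += VL)
        for (size_t j = 0; j < VL; j++)
            c[i + j] = a[i + j] + b[i + j];
    /* Remainder loop for the last n % VL elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result; the second simply exposes the chunk structure that a vectorizing compiler creates automatically when it can prove the iterations are independent.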
  • 20. Reminder about the peak flops: Intel® Sandy Bridge/Ivy Bridge microarchitecture. [Diagram: scheduler with ports 0 to 5 (port names as used by Intel® Architecture Code Analyzer). Ports 0, 1 and 5 carry the ALU, JMP, SSE/AVX FP MUL, ADD, DIV, shuffle, blend, boolean and vector-integer (VI) units over 128-bit and 256-bit widths; ports 2 and 3 carry load and store address; port 4 carries store data.] 6 instructions/cycle: 3 memory ops and 3 computational operations. Nehalem/Westmere: two 128-bit SIMD operations per cycle; 4 MUL (32b) and 4 ADD (32b) = 8 single-precision flops/cycle; 2 MUL (64b) and 2 ADD (64b) = 4 double-precision flops/cycle. Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle; 8 MUL (32b) and 8 ADD (32b) = 16 single-precision flops/cycle; 4 MUL (64b) and 4 ADD (64b) = 8 double-precision flops/cycle.
  • 21. In the Laptop We'll be Using for Demo… Processor: Intel Core i5-3427U (ark.intel.com): Processor number: i5-3427U; # of cores: 2; # of threads: 4; Clock speed: 1.8 GHz; Max turbo frequency: 2.8 GHz; Instruction set extensions: AVX. Sandy Bridge/Ivy Bridge: two 256-bit SIMD operations per cycle; 8 MUL (32b) and 8 ADD (32b) = 16 single-precision flops/cycle; 4 MUL (64b) and 4 ADD (64b) = 8 double-precision flops/cycle. Peak: 2 (cores) * 1.8 GHz * 16 flops/cycle = 57.6 Gflop/s (single precision); 2 (cores) * 1.8 GHz * 8 flops/cycle = 28.8 Gflop/s (double precision).
  • 22. Haswell-EP vs Ivy Bridge-EP. The total benefit (at node level) is given by a combination of factors: • Benefit from microarchitecture optimization (IPC): 25% IPC improvement • Benefit from the number of cores: up to 1.16x (at constant frequency) • Benefit from AVX2: up to 2x (FMA) • Benefit from memory bandwidth: up to 1.14x (1866 MHz to 2133 MHz). [Diagram: 14 cores (C) sharing an LLC, with memory controller (4 DDR4 channels), 2 QPI links and I/O (Gen3 x16, Gen3 x16, Gen3 x8).]
  • 23. Flops/s, AVX, AVX2 and AVX-512. [Timeline figure, 2013 to 2016 by half-year (H1/H2): Ivy Bridge-EP, Haswell-EP, then two future products carrying "-512" labels.]
  • 24. Haswell Execution Unit Overview, Intel® Microarchitecture (Haswell). [Diagram: unified reservation station feeding ports 0 to 7, with integer ALU & shift, integer ALU & LEA, FMA / FP multiply, FP add, divide, branch, vector shuffle, vector int multiply, vector int ALU, vector logicals, vector shifts, load & store address, store data and store address units.] Callouts: 2xFMA • Doubles peak FLOPs • Two FP multiplies benefit legacy. 4th ALU • Great for integer workloads • Frees Port0 & 1 for vector. New AGU for stores • Leaves Port 2 & 3 open for loads. New branch unit • Reduces Port0 conflicts • 2nd EU for high branch code.
  • 25. Intel® AVX2: extends 128-bit integer vector instructions to 256-bit. Floating-point fused multiply-add (A*B + C): increased FLOPS potential; increased accuracy, with only a single rounding. Enhanced vectorization with gather, shifts and powerful permutes. Intel® AVX2 uses the same 256-bit YMM registers as Intel® AVX. Floating-point performance (peak) per core, in flops/cycle: SSE4, Nehalem/Westmere, MUL (*) + ADD (+): 4 DP (8 SP); AVX, Sandy Bridge/Ivy Bridge, MUL (*) + ADD (+): 8 DP (16 SP), 2x; AVX2, Haswell, FMA (*,+) + FMA (*,+): 16 DP (32 SP), 2x again. 256b AVX1: 16 SP / 8 DP flops/cycle. 256b AVX2: 32 SP / 16 DP flops/cycle (FMA).
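The per-core figures on this slide follow from three numbers: vector width, element width, and the number of floating-point units, doubled when each unit performs a fused multiply-add. A small sketch, with a hypothetical function name, reproduces the table:

```c
#include <assert.h>

/* Peak flops per cycle per core.
 * vector_bits:  SIMD register width (128 for SSE, 256 for AVX/AVX2)
 * element_bits: 32 for single precision, 64 for double precision
 * fp_units:     floating-point vector units issuing per cycle
 * fma:          nonzero if each unit does a fused multiply-add (2 flops) */
int flops_per_cycle(int vector_bits, int element_bits, int fp_units, int fma)
{
    assert(vector_bits > 0 && element_bits > 0 && fp_units > 0);
    assert(vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return lanes * fp_units * (fma ? 2 : 1);
}
```

For double precision this gives 4 for Nehalem/Westmere (128-bit, one MUL plus one ADD unit), 8 for Sandy Bridge/Ivy Bridge (256-bit, one MUL plus one ADD unit) and 16 for Haswell (256-bit, two FMA units).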
  • 26. Parallel Programming for Intel® Architecture (or, in general, for normal CPUs). 4 considerations when writing an efficient, unconstrained parallel program: Cores (OpenMP, TBB, Cilk Plus; threads, locks); Vectors (vector loops, vector functions, array notations; intrinsics); Memory, caches (blocking algorithms); Data layout and alignment (AoS/SoA library, directives for alignment; manual layout, ugly code). Performance analysis.
  • 27. Use math libs for best use of AVX1, AVX2 & AVX-512. Use Intel® Math Kernel Library as much as possible. Use intrinsics or assembly for specific kernels. Use the compiler and Intel tools to optimize your source code. [Chart: speedup (axis 0.0 to 2.0), one-core basis comparison, for application source code, assembly/intrinsics, MKL DGEMM benchmark and MKL FFT benchmark.]
  • 28. Intel® Math Kernel Library: Optimized Mathematical Building Blocks. Linear Algebra: • BLAS • LAPACK • Sparse solvers • Iterative • Pardiso* • ScaLAPACK. Fast Fourier Transforms: • Multidimensional • FFTW interfaces • Cluster FFT. Vector Math: • Trigonometric • Hyperbolic • Exponential, log • Power / root. Vector RNGs: • Congruential • Wichmann-Hill • Mersenne Twister • Sobol • Niederreiter • Non-deterministic. Summary Statistics: • Kurtosis • Variation coefficient • Order statistics • Min/max • Variance-covariance. And More: • Splines • Interpolation • Trust region • Fast Poisson solver. Intel® MKL is an integral part of Intel® Parallel Studio XE.
  • 29. Many Ways to Vectorize, spanning ease of use to programmer control: Compiler: auto-vectorization (no change of code); Compiler: auto-vectorization hints (#pragma simd, …); Compiler: Intel® Cilk™ Plus Array Notation Extensions; SIMD intrinsic class (e.g.: F32vec, F64vec, …); Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …); Assembler code (e.g.: [v]addps, [v]addss, …).
  • 30. Control Vectorization! Provides details on vectorization success & failure: Linux*, Mac OS* X: -vec-report<n>; Windows*: /Qvec-report<n>. Diagnostic messages by value of n: 0: Tells the vectorizer to report no diagnostic information; useful for turning off reporting in case it was enabled on the command line earlier. 1: Report on vectorized loops [default if n is missing]. 2: Report on vectorized and non-vectorized loops. 3: Report on vectorized and non-vectorized loops and any proven or assumed data dependences. 4: Report on non-vectorized loops. 5: Report on non-vectorized loops and the reason why they were not vectorized. 6*: Use greater detail when reporting on vectorized and non-vectorized loops and any proven or assumed data dependences. *: First available with Intel® Parallel Studio XE.
  • 31. Vectorization Report II. Note: if inter-procedural optimization (-ipo or /Qipo) is activated and compilation and linking are separate compiler invocations, the switch to enable reporting needs to be added to the link step! Source (novec.f90): 35: subroutine fd( y ) 36: integer :: i 37: real, dimension(10), intent(inout) :: y 38: do i=2,10 39: y(i) = y(i-1) + 1 40: end do 41: end subroutine fd. Report: novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence. novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39. novec.f90(38:3-38:3):VEC:MAIN_: loop was not vectorized: existence of vector dependence
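The flow dependence reported above arises because each element needs the one computed just before it. One way to remove it, sketched here in C rather than the slide's Fortran, is to replace the recurrence by a closed form in which every element depends only on the first one. This rewrite is an illustration, not something the deck proposes; for floating-point data the two versions can round differently, although they agree exactly for the small integer values used here.

```c
#include <assert.h>
#include <stddef.h>

/* Same recurrence as the Fortran loop: y[i] needs y[i - 1], so the
 * iterations must run in order and the loop cannot be vectorized. */
void fd_recurrence(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++)
        y[i] = y[i - 1] + 1.0f;
}

/* Closed form: every element depends only on y[0], so the iterations
 * are independent and the loop is a candidate for vectorization. */
void fd_closed_form(float *y, size_t n)
{
    assert(n > 0);
    const float y0 = y[0];
    for (size_t i = 1; i < n; i++)
        y[i] = y0 + (float)i;
}
```

This is the "minimize dependencies among iterations by design" advice in miniature: the algorithm is changed so that the compiler has nothing left to prove.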
  • 32. Reasons Vectorization Fails & How to Succeed. ● Most frequent reason is dependence: minimize dependencies among iterations by design! ● Alignment: align your arrays/data structures. ● Function calls in loop body: use aggressive inlining (IPO). ● Complex control flow/conditional branches: avoid them in loops by creating multiple versions of loops. ● Unsupported loop structure: use loop-invariant expressions. ● Not inner loop: is manual loop interchange possible? ● Mixed data types: avoid type conversions. ● Non-unit stride between elements: is it possible to change the algorithm to allow linear/consecutive access? ● Report says the loop body is too complex: try splitting up the loops! ● Report says vectorization seems inefficient: enforce vectorization, then benchmark!
  • 33. IVDEP vs. SIMD Pragma/Directives. Differences between IVDEP & SIMD pragmas/directives: #pragma ivdep (C/C++) or !DIR$ IVDEP (Fortran): - Ignore vector dependencies (IVDEP): the compiler ignores assumed but not proven dependencies for a loop. - Example: void foo(int *a, int k, int c, int m) { #pragma ivdep for (int i = 0; i < m; i++) a[i] = a[i + k] * c; } #pragma simd (C/C++) or !DIR$ SIMD (Fortran): - Aggressive version of IVDEP: ignores all dependencies inside a loop. - It is an imperative that forces the compiler to try everything to vectorize. - The efficiency heuristic is ignored. - Attention: this can break semantically correct code! However, it can legally vectorize code in some cases that wouldn't be possible otherwise!
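Whether ignoring the assumed dependence in `foo` is safe depends on the runtime value of `k`, which the compiler cannot see. The sketch below models vector execution in plain C (load a chunk of VL elements, then store VL results) and compares it against the scalar loop. The vector width of 4, the array size and all function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <string.h>

enum { N = 16, VL = 4 };  /* array size and modeled vector length */

/* Scalar reference: a[i] = a[i + k] * c for i in [lo, hi). */
void scale_shift_scalar(int *a, int k, int c, int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        a[i] = a[i + k] * c;
}

/* Models vector execution: all VL loads of a chunk happen before any of
 * its stores. Assumes (hi - lo) is a multiple of VL. */
void scale_shift_vector_model(int *a, int k, int c, int lo, int hi)
{
    assert((hi - lo) % VL == 0);
    for (int i = lo; i < hi; i += VL) {
        int tmp[VL];
        for (int j = 0; j < VL; j++)
            tmp[j] = a[i + j + k] * c;
        for (int j = 0; j < VL; j++)
            a[i + j] = tmp[j];
    }
}

/* Returns 1 if the vector model reproduces the scalar result for shift k.
 * The range [4, 12) keeps a[i + k] in bounds for -4 <= k <= 4. */
int vectorization_is_safe(int k)
{
    int ref[N], vec[N];
    assert(k >= -4 && k <= 4);
    for (int i = 0; i < N; i++)
        ref[i] = vec[i] = i + 1;
    scale_shift_scalar(ref, k, 2, 4, 12);
    scale_shift_vector_model(vec, k, 2, 4, 12);
    return memcmp(ref, vec, sizeof ref) == 0;
}
```

In this model, `k >= 0` is safe because each element is read before it is overwritten, `k = -1` is unsafe because each iteration consumes the previous iteration's result, and `k = -4` is safe again because the dependence distance equals the vector length. This is why such pragmas should only be used when the programmer knows the actual range of `k`.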
  • 34. Memory Subsystem
  • 35. CPU: Core/Uncore - Designed for Modularity. [Diagram: cores and L3 cache, with IMC, QPI and power & clock blocks; DRAM and QPI links attached.] Differentiation in the "Uncore": number of QPI links, number of memory channels, size of cache, number of cores, power management, type of memory, integrated graphics. QPI: Intel® QuickPath Interconnect.
  • 36. Memory Bandwidth update. Basic rule for theoretical memory bandwidth [GB/s]: 8 [bytes / channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets. For the Sandy Bridge EP platform: 4 channels, 2 sockets and 1600 MHz memory: 8 * 1.600 * 4 * 2 = 102.4 GB/s peak (ST: 80 GB/s). For the Ivy Bridge EP platform: 4 channels, 2 sockets and 1866 MHz memory: 8 * 1.866 * 4 * 2 = 119.42 GB/s peak (ST: ~98 GB/s). For the Haswell EP platform: 4 channels, 2 sockets and 2133 MHz memory: 8 * 2.133 * 4 * 2 = 136.5 GB/s peak (ST: ~114 GB/s). [Diagram: two HSW Socket-R3 LGA sockets linked by 2 full-width QPI 1.1 links, each with 4 DDR3/4 channels, DMI2 and 40 lanes of PCIe 3.0.]
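The bandwidth rule on this slide can be written as a one-line function. The sketch below uses illustrative names; the second helper assumes that the "ST" figures are measured sustained bandwidths that can be compared against the theoretical peak.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s:
 * 8 bytes per channel * memory frequency (Gcycles/s) * channels * sockets. */
double peak_bandwidth_gbs(double mem_gcycles_per_s, int channels, int sockets)
{
    const double bytes_per_channel = 8.0;  /* 64-bit wide DDR channel */
    assert(mem_gcycles_per_s > 0.0 && channels > 0 && sockets > 0);
    return bytes_per_channel * mem_gcycles_per_s * channels * sockets;
}

/* Measured bandwidth as a percentage of the theoretical peak. */
double bandwidth_efficiency_percent(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return 100.0 * measured_gbs / peak_gbs;
}
```

For the Sandy Bridge EP case, `peak_bandwidth_gbs(1.6, 4, 2)` gives 102.4 GB/s, and the quoted 80 GB/s corresponds to roughly 78% of that peak.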
  • 37. In the Laptop We'll be Using for Demo… Processor: Intel Core i5-3427U (ark.intel.com): Memory types: DDR3/L/-RS 1333/1600; # of memory channels: 2; Max memory bandwidth: 25.6 GB/s. Basic rule for theoretical memory bandwidth [GB/s]: 8 [bytes / channel] * memory frequency [Gcycles/sec] * number of channels * number of sockets. Platform: 2 channels, 1 socket and 1600 MHz memory: 8 * 1.6 * 2 * 1 = 25.6 GB/s peak (ST: 20 GB/s).
  • 38. Parallel Programming for Intel® Architecture (or, in general, for normal CPUs). 4 considerations when writing an efficient, unconstrained parallel program: Cores (OpenMP, TBB, Cilk Plus; threads, locks); Vectors (vector loops, vector functions, array notations; intrinsics); Memory, caches (blocking algorithms); Data layout and alignment (AoS/SoA library, directives for alignment; manual layout, ugly code). Performance analysis.
  • 39. Intel® Many Integrated Core Architecture
  • 40. Intel® Xeon Phi™ Product Family. Up to 61 IA cores / 1.2 GHz / 244 threads. Up to 16 GB memory with up to 352 GB/s bandwidth. 512-bit SIMD instructions. Open source Linux operating system. IP addressable. Standard programming languages, tools, clustering. 22 nm process. Form factors: passive card, active card. http://software.intel.com/en-us/mic-developer
  • 41. Intel® Xeon Phi™ Coprocessor Product Lineup. 3 Family: outstanding parallel computing solution; performance/$ leadership; 6GB GDDR5, 240GB/s, >1TF DP, 300W TDP; 3120P (MM# 927501), 3120A (MM# 927500). 5 Family: optimized for high-density environments; performance/watt leadership; 8GB GDDR5, >300GB/s, >1TF DP, 225-245W TDP; 5110P (MM# 924044), 5120D (no thermal) (MM# 927503). 7 Family: highest performance, most memory; performance leadership; 16GB GDDR5, 352GB/s, >1.2TF DP, 300W TDP; 7120P (MM# 927499), 7120X (no thermal solution) (MM# 927498), 7120A (MM# 934878), 7120D (dense form factor) (MM# 932330). Optional 3-year warranty: extend to a 3-year warranty on any Intel® Xeon Phi™ coprocessor; product code XPX100WRNTY, MM# 933057. For more information go to http://www.intel.com/performance
  • 42. Core Architecture. [Diagram: instruction decoder; scalar unit with scalar registers; vector unit with vector registers; L1 cache (I & D); L2 cache; interprocessor network.] L1: 32 KB per core. L2: 512 KB slice per core, hardware prefetching, fully coherent. In-order, 64-bit, 4 threads per core. Vector unit: 512-wide VPU: integer, SP, DP; 3-operand, 16-instruction.
  • 43. Spectrum of Execution Models (Offload / Native / Symmetric). Offload: the workload runs on the host, and highly parallel phases run on the coprocessor. Example: !dir$ omp offload target(mic) !$omp parallel do do i=1,10 A(i) = B(i) * C(i) enddo !$omp end parallel do. [Diagram: MPI example on host with offload to coprocessors.]
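For readers working in C rather than Fortran, the same offload pattern might look roughly as follows. This is a sketch that assumes the Intel compiler's `#pragma offload` syntax with `in`/`out` clauses and the `length` modifier; the function name is hypothetical. Compilers that do not recognize the pragma ignore it (possibly with a warning), and the loop then simply runs on the host.

```c
#include <assert.h>

/* Element-wise product a[i] = b[i] * c[i], offloaded when the compiler
 * and platform support it. */
void multiply_offload(double *a, const double *b, const double *c, int n)
{
    assert(n >= 0);
    /* Assumed Intel offload syntax: copy b and c to the coprocessor,
     * run the parallel loop there, copy a back. */
    #pragma offload target(mic) in(b, c : length(n)) out(a : length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```

The data-transfer clauses are the part that has no counterpart in a host-only OpenMP loop: pointers carry no size information, so the array lengths have to be stated explicitly.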
  • 44. Spectrum of Execution Models (Offload / Native / Symmetric). Native (coprocessor-only model): the workload runs solely on the coprocessor. Build: icc -mmic … ./bin_mic. Then run: ssh mic0 ./bin_mic. Or start it from the host: micnativeloadex ./bin_mic. [Diagram: MPI example on coprocessor only.]
  • 45. Symmetric Mode Command Line. Workload runs on host AND coprocessors. Arslan et al. 2013. Rice HPC Conf.
  • 46. Single Node Tests: HW and SW Configuration, Isotropic RTM FD Kernel. Hardware: 2x E5-2697v2 (12C, 2.7GHz) with 64GB DDR3-1866 MHz; 4x 7120A (61 cores, 1.238 GHz, 16GB GDDR5). MPI processes with OpenMP threads: rank 0 in "mic0" (244 threads); rank 1 in "mic1" (244 threads); rank 2 in "cpu0" (12 threads); rank 3 in "cpu1" (12 threads); rank 4 in "mic2" (244 threads); rank 5 in "mic3" (244 threads). [Diagram: two sockets linked by QPI, each with an IOH (integrated in the processor).] Peer-to-peer via DMA: direct DMA transfers between devices.
  • 47. Single Node Tests: Performance & Scalability, Isotropic RTM FD Kernel. High performance and scalability with Intel® Xeon Phi™ coprocessors: scalability study with one to four Intel® Xeon Phi™ coprocessors. [Chart: "Scaling Based on Number of Coprocessors", axes in GCell/s (0.0 to 30.0) and TFlops (0.0 to 1.6). Configurations, in order: 2x Xeon¹ semi-OPT; 2x Xeon¹; 2x Xeon¹ + 1x 7120A; 2x Xeon¹ + 2x 7120A; 2x Xeon¹ + 3x 7120A; 2x Xeon¹ + 4x 7120A. Data labels, in order: 1.1, 4.0, 9.3, 14.7, 20.1, 24.4. CUDA K40c and CUDA K10 reference points labeled 4.2 GCell/s and 5.1.] Scaling analysis with each Intel® Xeon Phi™ coprocessor solving a 14GB subdomain and the pair of Intel® Xeon® processors solving a 10GB subdomain. 16th order in 3D space and 2nd order in time; 61 flops per cell. 24.4 GCell/s total performance with 2 processors + 4 coprocessors. The semi-OPT measurement is an OpenMP parallel version implemented with cache blocking and compiler directives to improve vectorization. The remaining measurements are on code with additional optimizations such as loop unrolling, non-temporal stores, tiling on Y-Z, prefetch tuning, and balance between MULs and ADDs via intrinsics. CUDA K40c and CUDA K10 are measurements on single devices using code that extended the FDTD3d sample in the CUDA SDK 5.5 to 16th order in space and was further optimized to increase register reuse. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance. ¹ Xeon = Intel® Xeon® processor E5-2697v2. Source: Intel measured results as of April 2014. Config. summary: IC 14.0 U1; MPI 4.1.1.036; MPSS 6720-15; ECC off, Turbo on (Xeon & 7120A); CUDA 5.5 (875MHz boost enabled).
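The two axes of the chart are related through the 61 flops per cell quoted on the slide. A minimal sketch of the conversion, with a hypothetical function name, is:

```c
#include <assert.h>

/* Convert stencil throughput in GCell/s to TFlops, given the number of
 * floating-point operations performed per grid cell. */
double gcells_to_tflops(double gcells_per_s, double flops_per_cell)
{
    assert(gcells_per_s >= 0.0 && flops_per_cell > 0.0);
    return gcells_per_s * flops_per_cell / 1000.0;
}
```

With the slide's figures, 24.4 GCell/s at 61 flops per cell corresponds to roughly 1.49 TFlops for the full node, which is consistent with the 1.6 TFlops upper limit of the chart axis.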
  • 48. Parallel Programming for Intel® Architecture (or, in general, for normal CPUs). 4 considerations when writing an efficient, unconstrained parallel program: Cores (OpenMP, TBB, Cilk Plus; threads, locks); Vectors (vector loops, vector functions, array notations; intrinsics); Memory, caches (blocking algorithms); Data layout and alignment (AoS/SoA library, directives for alignment; manual layout, ugly code). Performance analysis.
  • 49. 3DFD comparison: E5-2697v2 (Ivy Bridge) and Xeon Phi 7120A
  • 50. Single Node Tests: Performance/Watt. High energy efficiency with Xeon Phi: energy efficiency with multiple Intel® Xeon Phi™ cards. Note: the power values for 3 and 4 Xeon Phi cards are projections based on the data collected for 1 and 2 Xeon Phi cards. This data was presented by Petrobras at SC13 and at the Rice 2014 Oil & Gas HPC Workshop. Source: Petrobras presentation at 2014 Rice Oil & Gas HPC: http://rice2014oghpc.blogs.rice.edu/files/2014/03/Intel-Rice2014-RTM-XeonPhi-V3.pdf
  • 51. Next Intel® Xeon Phi™ Product Family (Codenamed Knights Landing). All products, computer systems, dates and figures specified are preliminary, based on current expectations, and are subject to change without notice. • "Knights Landing" is the code name for the 2nd generation Intel® Xeon Phi™ product • Based on Intel's 14 nanometer manufacturing process • Standalone bootable processor (running the host OS) and a PCIe coprocessor (PCIe end-point device) • Integrated on-package high-bandwidth memory • Flexible memory modes for the on-package memory include: cache and flat • Support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) • 60+ cores, 3+ TeraFLOPS of double-precision peak performance per single-socket node • Multiple hardware threads per core with improved single-thread performance over the current generation Intel® Xeon Phi™ coprocessor. Note that the code name above is not the product name.
  • 52. Programming Resources. Using a familiar programming model and tools means that developers don't need to start from scratch. Many programming resources are available to further accelerate time to solution: Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide; overview of programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors; access to webinar replays and over 50 training videos; beginning labs for the Intel® Xeon Phi™ coprocessor; programming guides, tools, case studies, labs, code samples, forums & more. http://software.intel.com/mic-developer
  • 53. Questions? Are you ready for Multicore and ManyCore?
  • 54. Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804