SlideShare une entreprise Scribd logo
1  sur  28
Télécharger pour lire hors ligne
© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Performance Tuning
and
Intel® Vtune™ Amplifier
Leo Borges (leonardo.borges@intel.com)
Intel - Software and Services Group
iStep-Brazil, August 2013
1
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
Start tuning on host
Overview of Intel® VTune™ Amplifier XE
Efficiency metrics
Problem areas
2
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Start with host-based profiling to
identify vectorization/ parallelism/
offload candidates
Start with representative/reasonable workloads!
Use Intel® VTune™ Amplifier XE to gather hot spot data
• Tells what functions account for most of the run time
• Often, this is enough
– But it does not tell you much about program structure
Alternately, profile functions & loops using Intel® Composer XE
• Build with options -profile-functions -profile-loops=all -profile-
loops-report=2
• Run the code (which may run slower) to collect profile data
• Look at the resulting dump files, or open the xml file with the data viewer
loopprofileviewer.sh located in the compiler ./bin directory
• Tells you
– which loops and functions account for the most run time
– how many times each loop executes (min, max and average)
3
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Correctness/Performance Analysis of
Parallel code
Intel® Inspector XE and thread-reports in Intel® VTune™
Amplifier XE are not available for the Intel® Xeon Phi™
coprocessor
So…
• Use Intel Inspector XE on your code with offload disabled (on host)
to identify correctness errors (e.g., deadlocks, races)
– Once fixed, then enable offload and continue debugging on the coprocessor
• Use Intel VTune Amplifier XE’s parallel performance analysis tools to
find issues on the host by running your program with offload
disabled
– Fix everything you can
– Then study scaling on the coprocessor using lessons from host tuning to
further optimize parallel performance
– Be wary of synchronization when the number of threads becomes more than a
handful
– Also pay attention to load balance.
4
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
Start tuning on host
Overview of Intel® VTune™ Amplifier XE
Efficiency metrics
Problem areas
5
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Intel® VTune™ Amplifier XE offers a rich GUI
Menu and
Tool bars
Analysis
Type
Viewpoint
currently being
used
Tabs within
each result
Grid area
Stack Pane
Timeline area
Filter area
Current
grouping
6
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Double Click Function
to View Source
Intel® VTune™ Amplifier XE on Intel® Xeon
Phi™ coprocessors
Adjust Data Grouping
… (Partial list shown)
Filter by Module &
Other Controls
Filter
by Timeline Selection
(or by Grid Selection)
No Call Stacks Yet
7
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Intel® VTune™ Amplifier XE displays event data
at function, source & assembly levels
Time on Source / Asm
Quickly scroll to hot spots.
Scroll Bar “Heat Map” is an
overview of hot spots
Click jump to scroll Asm
Quick Asm navigation:
Select source to highlight Asm
Right click for instruction
reference manual
Intel® VTune™ Amplifier XE 2013
8
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Collect events in Intel® VTune™ Amplifier XE on
offload and native program executions
Application settings:
• Application: ssh
• Parameters: mic0 “<app startup>”
• Working directory: Usually does not matter
• Don’t forget to set search directories under “All files” to map source
9
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Coprocessor collections have their
own analysis types
10
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Performance Analysis
Intel® VTune™ Amplifier XE only collects “Lightweight Hotspots” for
Intel® Xeon Phi™ coprocessors
• Uses Event-Based Sampling -
• Uses the Performance Monitoring Unit
• No instrumentation (“Locks & Waits”, “Concurrency”, etc.)
• More Available!
Other analysis types need to be configured
• Use “Lightweight Hotspots” and create a copy of it
• Add the desired counters
• Add some useful name and documentation
• Multi-event collections (over 2) can multiplex or use multiple
runs
11
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Configuring a User-defined Analysis
12
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Collecting Hardware Performance
Data
Hardware counters and events
• 2 counters in core, most are thread specific
• 4 outside the core (uncore) that get no thread or core details
• See PMU documentation for a full list of events
Collection
• Invoke from Intel® VTune™ Amplifier XE
• If collecting more than 2 core events, select multi-run for more
precise results or the default multiplexed collection, all in one run
• Uncore events are limited to 4 at a time in a single run
• Uncore event sampling needs a source of PMU interrupts, e.g.
programming cores to CPU_CLK_UNHALTED
Output files
• Intel VTune Amplifier XE performance database
13
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
Start tuning on host
Overview of Intel® VTune™ Amplifier XE
Efficiency metrics
Problem areas
14
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Efficiency Metric: CPI
• Cycles per Instruction
• Can be calculated per hardware thread or per core
• Is a ratio! So varies widely across applications
and should be used carefully.
Number of
Hardware
Threads / Core
Minimum (Best)
Theoretical CPI per
Core
Minimum (Best)
Theoretical CPI
per Thread
1 1.0 1.0
2 0.5 1.0
3 0.5 1.5
4 0.5 2.0
15
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Efficiency Metric: CPI
• Measures how latency affects your application’s
execution
• Look at how optimizations applied to your
application affect CPI
• Address high CPIs through any optimizations that
aim to reduce latency
Metric Formula Investigate if
CPI Per
Thread
CPU_CLK_UNHALTED/
INSTRUCTIONS_EXECUTED
> 4.0, or increasing
CPI Per
Core
(CPI per Thread) / Number of
hardware threads used
> 1.0, or increasing
16
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Some CPI Data
• Scaling data from a lab workload:
• Scatter plot of observed CPIs from lab workloads:
Metric 1
hardware
thread /
core
2
hardware
threads /
core
3
hardware
threads /
core
4
hardware
threads /
core
CPI per
Thread
5.24 8.80 11.18 13.74
CPI per
Core
5.24 4.40 3.73 3.43
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
17
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Efficiency Metric: Compute to Data
Access Ratio
• Measures an application’s computational density,
and its suitability for Intel® MIC Architecture
• Increase computational density through
vectorization and reducing data access (see cache
issues, also, DATA ALIGNMENT!)
Metric Formula Investigate if
L1 Compute to Data
Access Ratio
VPU_ELEMENTS_ACTIVE /
DATA_READ_OR_WRITE
< Vectorization
Intensity
L2 Compute to Data
Access Ratio
VPU_ELEMENTS_ACTIVE /
DATA_READ_MISS_OR_
WRITE_MISS
< 100x L1 Compute
to Data Access Ratio
18
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
Start tuning on host
Overview of Intel® VTune™ Amplifier XE
Efficiency metrics
Problem areas*
*tuning suggestions requiring deeper understanding of architectural tradeoffs
and application data handling details are highlighted with this “ninja” notation
19
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Problem Area: L1 Cache Usage
• Significantly affects data access latency and
therefore application performance
• Tuning Suggestions:
– Software prefetching
– Tile/block data access for cache size
– Use streaming stores
– If using 4K access stride, may be experiencing conflict
misses
– Examine Compiler prefetching (Compiler-generated L1
prefetches should not miss)
Metric Formula Investigate if
L1
Misses
DATA_READ_MISS_OR_WRITE_MISS +
L1_DATA_HIT_INFLIGHT_PF1
L1 Hit
Rate
(DATA_READ_OR_WRITE – L1 Misses) /
DATA_READ_OR_WRITE
< 95%
20
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Problem Area: Data Access Latency
• Significantly affects application performance
• Tuning Suggestions:
– Software prefetching
– Tile/block data access for cache size
– Use streaming stores
– Check cache locality – turn off prefetching and use
CACHE_FILL events - reduce sharing if needed/possible
– If using 64K access stride, may be experiencing conflict
misses
Metric Formula Investigate if
Estimated
Latency
Impact
(CPU_CLK_UNHALTED
– EXEC_STAGE_CYCLES
– DATA_READ_OR_WRITE)
/ DATA_READ_OR_WRITE_MISS
>145
21
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Problem Area: TLB Usage
• Also affects data access latency and therefore
application performance
• Tuning Suggestions:
– Improve cache usage & data access latency
– If L1 TLB miss/L2 TLB miss is high, try using large pages
– For loops with multiple streams, try splitting into multiple
loops
– If data access stride is a large power of 2, consider
padding between arrays by one 4 KB page
Metric Formula Invest-
igate if
L1 TLB miss ratio DATA_PAGE_WALK/DATA_READ_OR_WRITE > 1%
L2 TLB miss ratio LONG_DATA_PAGE_WALK
/ DATA_READ_OR_WRITE
> .1%
L1 TLB misses per
L2 TLB miss
DATA_PAGE_WALK / LONG_DATA_PAGE_WALK > 100x
22
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Problem Area: VPU Usage
• Indicates whether an application is vectorized
successfully and efficiently
• Tuning Suggestions:
– Use the Compiler vectorization report!
– For data dependencies preventing vectorization, try using
Intel® Cilk™ Plus #pragma SIMD (if safe!)
– Align data and tell the Compiler!
– Re-structure code if possible: Array notations, AOS->SOA
Metric Formula Investigate if
Vectorization
Intensity
VPU_ELEMENTS_ACTIVE /
VPU_INSTRUCTIONS_EXECUTED
<8 (DP),
<16(SP)
23
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Problem Area: Memory Bandwidth
• Can increase data latency in the system or become
a performance bottleneck
• Tuning Suggestions:
– Improve locality in caches
– Use streaming stores
– Improve software prefetching
Metric Formula Investigate if
Memory
Bandwidth
(UNC_F_CH0_NORMAL_READ +
UNC_F_CH0_NORMAL_WRITE+
UNC_F_CH1_NORMAL_READ +
UNC_F_CH1_NORMAL_WRITE) X
64/time
< 80GB/sec
(practical peak
140GB/sec)
(with 8 memory
controllers)
24
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Event collections on the coprocessor can
generate volumes of data
dgemm: on 60+ cores
Tip: Use cpu-mask to reduce data set, while maintaining
the same accuracy.
25
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Summary
• Vectorization, Parallelism, and Data locality are
critical to good performance for the Intel® Xeon
Phi™ Coprocessor
• Event names can be misleading – we recommend
using the metrics given in this presentation or our
tuning guide at http://software.intel.com/en-
us/articles/optimization-and-performance-tuning-
for-intel-xeon-phi-coprocessors-part-2-
understanding
• Intel® VTune™ Amplifier XE supports collecting all
of the above metrics, as well as providing special
analysis types like Memory Bandwidth
26
Obrigado.
Copyright© 2013, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Click to edit Master text styles
• Second level
– Third level
– Fourth level
– Fifth level
Click to edit Master title style
© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core,
VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
28

Contenu connexe

Tendances

Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОДКак выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
Nick Turunov
 
Servidor IBM zEnterprise BC12
Servidor IBM zEnterprise BC12Servidor IBM zEnterprise BC12
Servidor IBM zEnterprise BC12
Anderson Bassani
 
Процессор Intel Xeon
Процессор Intel Xeon Процессор Intel Xeon
Процессор Intel Xeon
Nick Turunov
 
Intel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data CenterIntel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data Center
Dr. Wilfred Lin (Ph.D.)
 

Tendances (20)

Scaling python to_hpc_big_data-maidanov
Scaling python to_hpc_big_data-maidanovScaling python to_hpc_big_data-maidanov
Scaling python to_hpc_big_data-maidanov
 
Real-Time Game Optimization with Intel® GPA
Real-Time Game Optimization with Intel® GPAReal-Time Game Optimization with Intel® GPA
Real-Time Game Optimization with Intel® GPA
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
 
【視覺進化論】AI智慧視覺運算技術論壇_2_ChungYeh
【視覺進化論】AI智慧視覺運算技術論壇_2_ChungYeh【視覺進化論】AI智慧視覺運算技術論壇_2_ChungYeh
【視覺進化論】AI智慧視覺運算技術論壇_2_ChungYeh
 
Lakefield: Hybrid Cores in 3D Package
Lakefield: Hybrid Cores in 3D PackageLakefield: Hybrid Cores in 3D Package
Lakefield: Hybrid Cores in 3D Package
 
Intel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big DataIntel Distribution for Python - Scaling for HPC and Big Data
Intel Distribution for Python - Scaling for HPC and Big Data
 
6 profiling tools
6 profiling tools6 profiling tools
6 profiling tools
 
Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОДКак выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
Как выбрать оптимальную серверную архитектуру для создания высокоэффективных ЦОД
 
Технологии Intel для виртуализации сетей операторов связи
Технологии Intel для виртуализации сетей операторов связиТехнологии Intel для виртуализации сетей операторов связи
Технологии Intel для виртуализации сетей операторов связи
 
Intel Roadmap
Intel RoadmapIntel Roadmap
Intel Roadmap
 
Servidor IBM zEnterprise BC12
Servidor IBM zEnterprise BC12Servidor IBM zEnterprise BC12
Servidor IBM zEnterprise BC12
 
Процессор Intel Xeon
Процессор Intel Xeon Процессор Intel Xeon
Процессор Intel Xeon
 
Intel python 2017
Intel python 2017Intel python 2017
Intel python 2017
 
Clear Linux OS - Architecture Overview
Clear Linux OS - Architecture OverviewClear Linux OS - Architecture Overview
Clear Linux OS - Architecture Overview
 
Sys cat i181e-en-07+sysmac studio
Sys cat i181e-en-07+sysmac studioSys cat i181e-en-07+sysmac studio
Sys cat i181e-en-07+sysmac studio
 
Intel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data CenterIntel Public Roadmap for Desktop, Mobile, Data Center
Intel Public Roadmap for Desktop, Mobile, Data Center
 
Clear Linux OS - Introduction
Clear Linux OS - IntroductionClear Linux OS - Introduction
Clear Linux OS - Introduction
 
Clear Linux Overview and Engagement
Clear Linux Overview and EngagementClear Linux Overview and Engagement
Clear Linux Overview and Engagement
 
Intel Roadmap
Intel RoadmapIntel Roadmap
Intel Roadmap
 
Public roadmap-article
Public roadmap-articlePublic roadmap-article
Public roadmap-article
 

Similaire à Intel® VTune™ Amplifier - Intel Software Conference 2013

Accelerating Insights in the Technical Computing Transformation
Accelerating Insights in the Technical Computing TransformationAccelerating Insights in the Technical Computing Transformation
Accelerating Insights in the Technical Computing Transformation
Intel IT Center
 
Cloud Technology: Now Entering the Business Process Phase
Cloud Technology: Now Entering the Business Process PhaseCloud Technology: Now Entering the Business Process Phase
Cloud Technology: Now Entering the Business Process Phase
finteligent
 
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Media Gorod
 

Similaire à Intel® VTune™ Amplifier - Intel Software Conference 2013 (20)

Getting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XEGetting the maximum performance in distributed clusters Intel Cluster Studio XE
Getting the maximum performance in distributed clusters Intel Cluster Studio XE
 
Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...Methods and practices to analyze the performance of your application with Int...
Methods and practices to analyze the performance of your application with Int...
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Accelerate Ceph performance via SPDK related techniques
Accelerate Ceph performance via SPDK related techniques Accelerate Ceph performance via SPDK related techniques
Accelerate Ceph performance via SPDK related techniques
 
OPENMP ANALYSIS IN VTUNE AMPLIFIER XE
OPENMP ANALYSIS IN VTUNE AMPLIFIER XEOPENMP ANALYSIS IN VTUNE AMPLIFIER XE
OPENMP ANALYSIS IN VTUNE AMPLIFIER XE
 
E5 Intel Xeon Processor E5 Family Making the Business Case
E5 Intel Xeon Processor E5 Family Making the Business Case E5 Intel Xeon Processor E5 Family Making the Business Case
E5 Intel Xeon Processor E5 Family Making the Business Case
 
Intel tools to optimize HPC systems
Intel tools to optimize HPC systemsIntel tools to optimize HPC systems
Intel tools to optimize HPC systems
 
Software Development Tools for Intel® IoT Platforms
Software Development Tools for Intel® IoT PlatformsSoftware Development Tools for Intel® IoT Platforms
Software Development Tools for Intel® IoT Platforms
 
Embree Ray Tracing Kernels
Embree Ray Tracing KernelsEmbree Ray Tracing Kernels
Embree Ray Tracing Kernels
 
Accelerating Insights in the Technical Computing Transformation
Accelerating Insights in the Technical Computing TransformationAccelerating Insights in the Technical Computing Transformation
Accelerating Insights in the Technical Computing Transformation
 
Ready access to high performance Python with Intel Distribution for Python 2018
Ready access to high performance Python with Intel Distribution for Python 2018Ready access to high performance Python with Intel Distribution for Python 2018
Ready access to high performance Python with Intel Distribution for Python 2018
 
Develop, Deploy, and Innovate with Intel® Cluster Ready
Develop, Deploy, and Innovate with Intel® Cluster ReadyDevelop, Deploy, and Innovate with Intel® Cluster Ready
Develop, Deploy, and Innovate with Intel® Cluster Ready
 
Cloud Technology: Now Entering the Business Process Phase
Cloud Technology: Now Entering the Business Process PhaseCloud Technology: Now Entering the Business Process Phase
Cloud Technology: Now Entering the Business Process Phase
 
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
Кирилл Мавродиев, Intel – Обзор современных возможностей по распараллеливанию...
 
Altair on Intel Xeon Phi: Optimizing HPC for Breakthrough Performance
Altair on Intel Xeon Phi:  Optimizing HPC for Breakthrough PerformanceAltair on Intel Xeon Phi:  Optimizing HPC for Breakthrough Performance
Altair on Intel Xeon Phi: Optimizing HPC for Breakthrough Performance
 
Explore, design and implement threading parallelism with Intel® Advisor XE
Explore, design and implement threading parallelism with Intel® Advisor XEExplore, design and implement threading parallelism with Intel® Advisor XE
Explore, design and implement threading parallelism with Intel® Advisor XE
 
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
TDC2017 | São Paulo - Trilha Machine Learning How we figured out we had a SRE...
 
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
Tendências da junção entre Big Data Analytics, Machine Learning e Supercomput...
 
Introduction to container networking in K8s - SDN/NFV London meetup
Introduction to container networking in K8s - SDN/NFV  London meetupIntroduction to container networking in K8s - SDN/NFV  London meetup
Introduction to container networking in K8s - SDN/NFV London meetup
 
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
 

Plus de Intel Software Brasil

Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Intel Software Brasil
 
Using multitouch and sensors in Java
Using multitouch and sensors in JavaUsing multitouch and sensors in Java
Using multitouch and sensors in Java
Intel Software Brasil
 

Plus de Intel Software Brasil (20)

Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™  Modernização de código em Xeon® e Xeon Phi™
Modernização de código em Xeon® e Xeon Phi™
 
Escreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKatEscreva sua App sem gastar energia, agora no KitKat
Escreva sua App sem gastar energia, agora no KitKat
 
Desafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento MultiplataformaDesafios do Desenvolvimento Multiplataforma
Desafios do Desenvolvimento Multiplataforma
 
Desafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataformaDesafios do Desenvolvimento Multi-plataforma
Desafios do Desenvolvimento Multi-plataforma
 
Yocto - 7 masters
Yocto - 7 mastersYocto - 7 masters
Yocto - 7 masters
 
Principais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralelaPrincipais conceitos técnicas e modelos de programação paralela
Principais conceitos técnicas e modelos de programação paralela
 
Principais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorizaçãoPrincipais conceitos e técnicas em vetorização
Principais conceitos e técnicas em vetorização
 
Notes on NUMA architecture
Notes on NUMA architectureNotes on NUMA architecture
Notes on NUMA architecture
 
Benchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenhoBenchmarking para sistemas de alto desempenho
Benchmarking para sistemas de alto desempenho
 
Yocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/VivoYocto no 1 IoT Day da Telefonica/Vivo
Yocto no 1 IoT Day da Telefonica/Vivo
 
Html5 fisl15
Html5 fisl15Html5 fisl15
Html5 fisl15
 
IoT FISL15
IoT FISL15IoT FISL15
IoT FISL15
 
IoT TDC Floripa 2014
IoT TDC Floripa 2014IoT TDC Floripa 2014
IoT TDC Floripa 2014
 
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...Otávio Salvador - Yocto project  reduzindo -time to market- do seu próximo pr...
Otávio Salvador - Yocto project reduzindo -time to market- do seu próximo pr...
 
Html5 tdc floripa_2014
Html5 tdc floripa_2014Html5 tdc floripa_2014
Html5 tdc floripa_2014
 
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
Desenvolvimento e análise de performance de jogos Android com Coco2d-HTML5
 
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenhoO uso de tecnologias Intel na implantação de sistemas de alto desempenho
O uso de tecnologias Intel na implantação de sistemas de alto desempenho
 
Escreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw DayEscreva sua App Android sem gastar energia - Intel Sw Day
Escreva sua App Android sem gastar energia - Intel Sw Day
 
Using multitouch and sensors in Java
Using multitouch and sensors in JavaUsing multitouch and sensors in Java
Using multitouch and sensors in Java
 
Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™ Entenda de onde vem toda a potência do Intel® Xeon Phi™
Entenda de onde vem toda a potência do Intel® Xeon Phi™
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Intel® VTune™ Amplifier - Intel Software Conference 2013

  • 1. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Performance Tuning and Intel® Vtune™ Amplifier Leo Borges (leonardo.borges@intel.com) Intel - Software and Services Group iStep-Brazil, August 2013 1
  • 2. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Agenda Start tuning on host Overview of Intel® VTune™ Amplifier XE Efficiency metrics Problem areas 2
  • 3. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Start with host-based profiling to identify vectorization/ parallelism/ offload candidates Start with representative/reasonable workloads! Use Intel® VTune™ Amplifier XE to gather hot spot data • Tells what functions account for most of the run time • Often, this is enough – But it does not tell you much about program structure Alternately, profile functions & loops using Intel® Composer XE • Build with options -profile-functions -profile-loops=all -profile- loops-report=2 • Run the code (which may run slower) to collect profile data • Look at the resulting dump files, or open the xml file with the data viewer loopprofileviewer.sh located in the compiler ./bin directory • Tells you – which loops and functions account for the most run time – how many times each loop executes (min, max and average) 3
  • 4. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Correctness/Performance Analysis of Parallel code Intel® Inspector XE and thread-reports in Intel® VTune™ Amplifier XE are not available for the Intel® Xeon Phi™ coprocessor So… • Use Intel Inspector XE on your code with offload disabled (on host) to identify correctness errors (e.g., deadlocks, races) – Once fixed, then enable offload and continue debugging on the coprocessor • Use Intel VTune Amplifier XE’s parallel performance analysis tools to find issues on the host by running your program with offload disabled – Fix everything you can – Then study scaling on the coprocessor using lessons from host tuning to further optimize parallel performance – Be wary of synchronization when the number of threads becomes more than a handful – Also pay attention to load balance. 4
  • 5. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Agenda Start tuning on host Overview of Intel® VTune™ Amplifier XE Efficiency metrics Problem areas 5
  • 6. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Intel® VTune™ Amplifier XE offers a rich GUI Menu and Tool bars Analysis Type Viewpoint currently being used Tabs within each result Grid area Stack Pane Timeline area Filter area Current grouping 6
  • 7. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Double Click Function to View Source Intel® VTune™ Amplifier XE on Intel® Xeon Phi™ coprocessors Adjust Data Grouping … (Partial list shown) Filter by Module & Other Controls Filter by Timeline Selection (or by Grid Selection) No Call Stacks Yet 7
  • 8. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Intel® VTune™ Amplifier XE displays event data at function, source & assembly levels Time on Source / Asm Quickly scroll to hot spots. Scroll Bar “Heat Map” is an overview of hot spots Click jump to scroll Asm Quick Asm navigation: Select source to highlight Asm Right click for instruction reference manual Intel® VTune™ Amplifier XE 2013 8
  • 9. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Collect events in Intel® VTune™ Amplifier XE on offload and native program executions Application settings: • Application: ssh • Parameters: mic0 “<app startup>” • Working directory: Usually does not matter • Don’t forget to set search directories under “All files” to map source 9
  • 10. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Coprocessor collections have their own analysis types 10
  • 11. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Performance Analysis Intel® VTune™ Amplifier XE only collects “Lightweight Hotspots” for Intel® Xeon Phi™ coprocessors • Uses Event-Based Sampling - • Uses the Performance Monitoring Unit • No instrumentation (“Locks & Waits”, “Concurrency”, etc.) • More Available! Other analysis types need to be configured • Use “Lightweight Hotspots” and create a copy of it • Add the desired counters • Add some useful name and documentation • Multi-event collections (over 2) can multiplex or use multiple runs 11
  • 12. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Configuring a User-defined Analysis 12
  • 13. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Collecting Hardware Performance Data Hardware counters and events • 2 counters in core, most are thread specific • 4 outside the core (uncore) that get no thread or core details • See PMU documentation for a full list of events Collection • Invoke from Intel® VTune™ Amplifier XE • If collecting more than 2 core events, select multi-run for more precise results or the default multiplexed collection, all in one run • Uncore events are limited to 4 at a time in a single run • Uncore event sampling needs a source of PMU interrupts, e.g. programming cores to CPU_CLK_UNHALTED Output files • Intel VTune Amplifier XE performance database 13
  • 14. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Agenda Start tuning on host Overview of Intel® VTune™ Amplifier XE Efficiency metrics Problem areas 14
  • 15. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Efficiency Metric: CPI • Cycles per Instruction • Can be calculated per hardware thread or per core • Is a ratio! So varies widely across applications and should be used carefully. Number of Hardware Threads / Core Minimum (Best) Theoretical CPI per Core Minimum (Best) Theoretical CPI per Thread 1 1.0 1.0 2 0.5 1.0 3 0.5 1.5 4 0.5 2.0 15
  • 16. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Efficiency Metric: CPI • Measures how latency affects your application’s execution • Look at how optimizations applied to your application affect CPI • Address high CPIs through any optimizations that aim to reduce latency Metric Formula Investigate if CPI Per Thread CPU_CLK_UNHALTED/ INSTRUCTIONS_EXECUTED > 4.0, or increasing CPI Per Core (CPI per Thread) / Number of hardware threads used > 1.0, or increasing 16
  • 17. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Some CPI Data • Scaling data from a lab workload: • Scatter plot of observed CPIs from lab workloads: Metric 1 hardware thread / core 2 hardware threads / core 3 hardware threads / core 4 hardware threads / core CPI per Thread 5.24 8.80 11.18 13.74 CPI per Core 5.24 4.40 3.73 3.43 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 17
  • 18. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Efficiency Metric: Compute to Data Access Ratio • Measures an application’s computational density, and its suitability for Intel® MIC Architecture • Increase computational density through vectorization and reducing data access (see cache issues, also, DATA ALIGNMENT!) Metric Formula Investigate if L1 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE < Vectorization Intensity L2 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_ WRITE_MISS < 100x L1 Compute to Data Access Ratio 18
  • 19. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Agenda Start tuning on host Overview of Intel® VTune™ Amplifier XE Efficiency metrics Problem areas* *tuning suggestions requiring deeper understanding of architectural tradeoffs and application data handling details are highlighted with this “ninja” notation 19
  • 20. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Problem Area: L1 Cache Usage • Significantly affects data access latency and therefore application performance • Tuning Suggestions: – Software prefetching – Tile/block data access for cache size – Use streaming stores – If using 4K access stride, may be experiencing conflict misses – Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss) Metric Formula Investigate if L1 Misses DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1 L1 Hit Rate (DATA_READ_OR_WRITE – L1 Misses) / DATA_READ_OR_WRITE < 95% 20
  • 21. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Problem Area: Data Access Latency • Significantly affects application performance • Tuning Suggestions: – Software prefetching – Tile/block data access for cache size – Use streaming stores – Check cache locality – turn off prefetching and use CACHE_FILL events - reduce sharing if needed/possible – If using 64K access stride, may be experiencing conflict misses Metric Formula Investigate if Estimated Latency Impact (CPU_CLK_UNHALTED – EXEC_STAGE_CYCLES – DATA_READ_OR_WRITE) / DATA_READ_OR_WRITE_MISS >145 21
  • 22. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Problem Area: TLB Usage • Also affects data access latency and therefore application performance • Tuning Suggestions: – Improve cache usage & data access latency – If L1 TLB miss/L2 TLB miss is high, try using large pages – For loops with multiple streams, try splitting into multiple loops – If data access stride is a large power of 2, consider padding between arrays by one 4 KB page Metric Formula Invest- igate if L1 TLB miss ratio DATA_PAGE_WALK/DATA_READ_OR_WRITE > 1% L2 TLB miss ratio LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE > .1% L1 TLB misses per L2 TLB miss DATA_PAGE_WALK / LONG_DATA_PAGE_WALK > 100x 22
  • 23. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Problem Area: VPU Usage • Indicates whether an application is vectorized successfully and efficiently • Tuning Suggestions: – Use the Compiler vectorization report! – For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma SIMD (if safe!) – Align data and tell the Compiler! – Re-structure code if possible: Array notations, AOS->SOA Metric Formula Investigate if Vectorization Intensity VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED <8 (DP), <16(SP) 23
  • 24. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Problem Area: Memory Bandwidth • Can increase data latency in the system or become a performance bottleneck • Tuning Suggestions: – Improve locality in caches – Use streaming stores – Improve software prefetching Metric Formula Investigate if Memory Bandwidth (UNC_F_CH0_NORMAL_READ + UNC_F_CH0_NORMAL_WRITE+ UNC_F_CH1_NORMAL_READ + UNC_F_CH1_NORMAL_WRITE) X 64/time < 80GB/sec (practical peak 140GB/sec) (with 8 memory controllers) 24
  • 25. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Event collections on the coprocessor can generate volumes of data dgemm: on 60+ cores Tip: Use cpu-mask to reduce data set, while maintaining the same accuracy. 25
  • 26. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Summary • Vectorization, Parallelism, and Data locality are critical to good performance for the Intel® Xeon Phi™ Coprocessor • Event names can be misleading – we recommend using the metrics given in this presentation or our tuning guide at http://software.intel.com/en- us/articles/optimization-and-performance-tuning- for-intel-xeon-phi-coprocessors-part-2- understanding • Intel® VTune™ Amplifier XE supports collecting all of the above metrics, as well as providing special analysis types like Memory Bandwidth 26
  • 28. Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Click to edit Master text styles • Second level – Third level – Fourth level – Fifth level Click to edit Master title style © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 Legal Disclaimer & Optimization Notice Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. 28