Amrita Mathuriya
HPC Application Engineer
DCG/Intel Corporation
November 2016
Optimizations of B-spline based SPO evaluations in QMC for
Multi/many-core Shared Memory Processors
In collaboration with
Jeongnim Kim (Intel), Victor Lee (Intel)
Ye Luo, Anouar Benali (Argonne National Laboratory)
Luke Shulenburger (Sandia National Laboratories)
11/12/2016
Presenter: Amrita Mathuriya
§  HPC Application Engineer on the HPC Ecosystem Application Engineering Team; working on code
modernization and optimization on Xeon and Xeon Phi™
§  Working at Intel for the past 8 years.
–  Expert at algorithms and optimizations for IA architectures.
–  Worked on HPC applications in areas of Computational Geometry, Optical Proximity Correction
(OPC), Electromagnetics, Computational Biology, Quantum Monte Carlo.
–  Working on code modernization for Intel® Xeon and Xeon Phi™ architectures.
§  MS in Computer Science with a specialization in Computational Science and Engineering from
Georgia Tech, USA, under the guidance of Professor David Bader.
§  B.Tech degree in Computer Science from Indian Institute of Technology (IIT) Roorkee, India.
Systems
§  KNC: Intel® Xeon Phi™ coprocessor 7120P
•  61 cores @ 1.238 GHz, 4-way Intel® Hyper-Threading Technology, Memory: 15872 MB
•  Intel® Many-core Platform Software Stack Version 3.6.1
•  OS Version : 3.10.0-229.el7.x86_64
§  KNL: Intel® Xeon Phi™ 7250P (code-named Knights Landing), 68 cores @ 1.4 GHz with
16 GB MCDRAM (used in flat mode), cluster boot mode = Quad, Turbo enabled.
§  BDW: Intel® Xeon® E5-2697v4, single-socket node, 18 cores @ 2.3 GHz, HT enabled, 145 W,
128 GB DDR4-2400 RAM (8 × 16 GB DIMMs).
§  BG/Q: Blue Gene/Q processor from the Mira supercomputer at the Argonne National Laboratory facility.
§  Compilers, MPI, and math library.
•  icc version 16.0.2 (gcc version 4.8.3 compatibility)
•  Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
Agenda
§  KNL overview and motivation
§  Intro to quantum Monte Carlo and QMCPACK
§  Current status of QMCPACK
§  Analysis of CORAL graphite benchmark
§  Optimizations to B-spline based SPO evaluations for QMC
§  Summary
Important Characteristics of KNL
§  Increasing core count per node on both Intel® Xeon® and Xeon Phi™ processors.
§  Large SIMD units: AVX-512 operates on 16 single-precision floating-point values simultaneously.
§  Two-level cache system (L1/L2) and high memory bandwidth.
How to gain performance?
§  Scalability
–  Enable data sharing with hybrid parallelism using MPI + threading.
–  Design and implement scalable algorithms
§  SIMD Parallelism – adapt Data layouts to enable efficient vectorization.
§  Efficiently utilize caches and memory bandwidth with Tiling (cache-blocking).
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems
or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
Roofline Performance Analysis on KNL
7% of peak GFLOPS achieved with the current AoS version
[Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI. Annotations: peak GFLOPS at 0.22 Flops/Byte; current GFLOPS, 7%; scalar add peak.]
Performance Portable on Intel® Xeon®, Intel Xeon Phi™
and BG/Q Processors
§  Optimizations for efficiently utilizing SIMD units and caches.
–  SoA data layout transformation.
–  Tiling or AoSoA data layout transformation.
–  Nested thread parallelization to reduce time-to-solution and memory usage.
§  Optimization work done on KNC.
§  Later ported to KNL; works out of the box.
§  Optimizations result in significant performance improvement on BG/Q.
QMCPACK
An open-source US-DOE flagship many-body ab initio quantum Monte Carlo (QMC) code for
computing the electronic structure of atoms, molecules, and solids. http://qmcpack.org/
[Figure: parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform.]
[Figure: (a) DMC charge density of AB-stacked graphite and (b) the ball-and-stick rendering and the 4-carbon unit cell in blue.]
J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley,
“Hybrid algorithms in quantum monte carlo,” Journal of Physics: Conference Series, vol. 402, no.
1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008
Diffusion Monte Carlo Schematics
Ensemble evolves according to
•  Diffusion
•  Drift
•  Branching
[Figure: old configurations undergo random walking to possible new configurations; walkers carrying weights such as w=0.8, 1.6, 2.4, and 0.3 form the new configurations ensemble.]
How is QMCPACK parallelized	
QMCPACK utilizes OpenMP to optimize memory usage and to take
advantage of the growing number of cores per SMP node.	
§  Walkers within an MPI task are distributed among the cores of the CPU.
§  Large common data, such as the wave function coefficients, is shared by all the walkers.
§  Frequency stops increasing.
§  Node count stops growing.
§  Nodes are getting more powerful but require applications to expose more concurrency.
The free lunch is over. On-node performance is challenging.
QMCPACK status	
§  Excellent MPI & OpenMP parallel efficiency at the walker level
§  All in double precision except 3D cubic B-Spline.
–  Work done recently to implement mixed precision. Speeds up by 1.2-1.5x.
§  SIMD efficiency low
§  Basically scalar performance with few exceptions
–  B-Spline – SSE/SSE2/QPX
–  Distance tables with QPX
§  Array of Structures (AoS) for D-dim N-particle attributes, e.g., R (N,3),
Gradients (N,3), Hessian matrices (N,9)
Pretty good and we can even do better!
CORAL Benchmark – KNL Profiling
[Figure: pie chart "Coral Benchmark Profile on KNL". Legend: Einspline, Distance Table, Jastrow, Others. Slice values as extracted: 29%, 34%, 18%, 18%.]
4x4x1 AB-stacked graphite
64 carbon
256 electrons
The three compute kernels account for about 80% of the run time in QMCPACK on KNL.
QMC: Single particle orbital (SPO) representation with B-spline basis set
One-dimensional cubic B-spline function
Tensor product in each Cartesian direction: representation for 3D orbital
Precomputed coefficients
§  4D read-only array.
§  Stored in SoA format, P[nx][ny][nz][N]
§  Provided by DFT or HF computations using Quantum Espresso
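The equations on this slide did not survive extraction. As a hedged sketch, the standard uniform-grid cubic B-spline form is written below. The symbols b, t, and the Greek summation indices are assumptions for illustration, and the exact index offset depends on how the coefficient grid is padded. Each evaluation touches 4 x 4 x 4 = 64 coefficients per orbital, which is consistent with the "4x4x4 block" and "64N values" mentioned on later slides.

```latex
% One-dimensional cubic B-spline: four non-zero basis functions per point
f(x) = \sum_{\alpha=0}^{3} b_{\alpha}(t_x)\, p_{i_0+\alpha},
\qquad i_0 = \lfloor x/\Delta x \rfloor,\quad t_x = x/\Delta x - i_0

% Tensor product in each Cartesian direction for the n-th 3D orbital
\phi_n(x,y,z) = \sum_{\alpha=0}^{3}\sum_{\beta=0}^{3}\sum_{\gamma=0}^{3}
  b_{\alpha}(t_x)\, b_{\beta}(t_y)\, b_{\gamma}(t_z)\,
  P[i_0+\alpha][j_0+\beta][k_0+\gamma][n]
```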
Simplified miniQMC
§  Only contains B-spline evaluation routines.
§  Mimics the computational and data access patterns of B-spline SPO evaluations in QMC.
[Figure labels: B-spline SPO evaluation kernels; random position generation.]
Data Layout – Performance Considerations
Array-of-Structs (AoS)
§  Pros: Logical for expression of physical abstractions in 3D or higher dimensions.
Struct-of-Arrays (SoA)
§  Pros: Contiguous loads/stores for efficient vectorization.
Hybrid (AoSoA)
§  Pros: Potentially useful for increasing cache locality. Also supports efficient vectorization.
[Figure: memory layout diagrams of the x, y, z components for the AoS, SoA, and AoSoA layouts.]
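The three layouts can be sketched in C++ as follows. This is a minimal illustration, not QMCPACK's actual containers: the type names, the conversion helpers, and the tile width `kTile` are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: one struct per particle; x, y, z of the same particle are adjacent.
struct Pos3 { float x, y, z; };
using PositionsAoS = std::vector<Pos3>;

// SoA: one contiguous array per component, so a loop over particles
// reads each component with unit stride.
struct PositionsSoA {
  std::vector<float> x, y, z;
  explicit PositionsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// AoSoA: an array of small fixed-width SoA tiles (hypothetical width).
constexpr std::size_t kTile = 8;
struct PosTile { std::array<float, kTile> x, y, z; };
using PositionsAoSoA = std::vector<PosTile>;

PositionsSoA to_soa(const PositionsAoS& in) {
  PositionsSoA out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out.x[i] = in[i].x;
    out.y[i] = in[i].y;
    out.z[i] = in[i].z;
  }
  return out;
}

PositionsAoSoA to_aosoa(const PositionsAoS& in) {
  // Round up to whole tiles; unused slots stay zero-initialized.
  PositionsAoSoA out((in.size() + kTile - 1) / kTile);
  for (std::size_t i = 0; i < in.size(); ++i) {
    PosTile& t = out[i / kTile];
    t.x[i % kTile] = in[i].x;
    t.y[i % kTile] = in[i].y;
    t.z[i % kTile] = in[i].z;
  }
  assert(out.size() * kTile >= in.size());
  return out;
}
```

In the AoSoA form, each tile is small enough to stay cache-resident while still offering unit-stride access within the tile.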
Pseudocode - VGH
Computes value, gradient, Hessian at random (x,y,z)
[Figure: data access pattern of read-only B-spline coefficients P at a random position (x, y, z), with j0=floor(y/dy) etc. The outermost x dimension is not shown. Annotations: "Random"; "Strided access for output arrays".]
SoA transformation for output arrays
Output arrays in SoA (structure of arrays) format
[Figure: the x, y, z components of the output stored as separate contiguous arrays.]
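The VGH pseudocode itself is not recoverable from the extracted text. The sketch below shows one way a VGH kernel with SoA outputs could look, assuming the coefficient table is flattened as P[nx][ny][nz][N] and that the per-direction basis weights (value, first and second derivative, already scaled by the grid spacing) are computed elsewhere. The names `Basis1D`, `VGHSoA`, and `eval_vgh` are hypothetical, and the real kernel may fuse the component loops differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Weights of the four non-zero basis functions in one direction.
struct Basis1D { std::array<float, 4> v, d1, d2; };

// Value, 3 gradient components, 6 unique Hessian components.
enum Comp { V, GX, GY, GZ, HXX, HXY, HXZ, HYY, HYZ, HZZ, NCOMP };

// SoA output: one contiguous array of length N per component.
struct VGHSoA {
  std::array<std::vector<float>, NCOMP> comp;
  explicit VGHSoA(std::size_t n) { for (auto& a : comp) a.assign(n, 0.0f); }
};

// Accumulates into `out` over the 4x4x4 block starting at (i0, j0, k0).
void eval_vgh(const float* P, std::size_t ny, std::size_t nz, std::size_t N,
              std::size_t i0, std::size_t j0, std::size_t k0,
              const Basis1D& bx, const Basis1D& by, const Basis1D& bz,
              VGHSoA& out) {
  assert(out.comp[V].size() == N);
  for (std::size_t i = 0; i < 4; ++i)
    for (std::size_t j = 0; j < 4; ++j)
      for (std::size_t k = 0; k < 4; ++k) {
        // N contiguous coefficients for this grid point.
        const float* p = P + (((i0 + i) * ny + (j0 + j)) * nz + (k0 + k)) * N;
        const float w[NCOMP] = {
            bx.v[i] * by.v[j] * bz.v[k],    // V
            bx.d1[i] * by.v[j] * bz.v[k],   // GX
            bx.v[i] * by.d1[j] * bz.v[k],   // GY
            bx.v[i] * by.v[j] * bz.d1[k],   // GZ
            bx.d2[i] * by.v[j] * bz.v[k],   // HXX
            bx.d1[i] * by.d1[j] * bz.v[k],  // HXY
            bx.d1[i] * by.v[j] * bz.d1[k],  // HXZ
            bx.v[i] * by.d2[j] * bz.v[k],   // HYY
            bx.v[i] * by.d1[j] * bz.d1[k],  // HYZ
            bx.v[i] * by.v[j] * bz.d2[k]};  // HZZ
        for (int q = 0; q < NCOMP; ++q) {
          float* o = out.comp[q].data();
          // Unit-stride loop over splines: the target for SIMD.
          for (std::size_t n = 0; n < N; ++n) o[n] += w[q] * p[n];
        }
      }
}
```

Because every output component is its own contiguous array, the innermost loop reads and writes with unit stride, in contrast to the strided writes of the AoS output layout.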
How to evaluate performance of QMC
§  Rate of Monte Carlo sample generations (throughputs) per resource
§  For the miniapp,
Throughput = (Number of evaluations) / T
Number of evaluations = (Number of walkers) × (Number of iterations) × (Number of splines)
T = Time per call of a function (such as VGH)
§  Throughput represents work done on a node.
§  Ideally, it should stay constant across problem sizes.
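The throughput definition above can be written as a small helper. This is a sketch under the assumption that `seconds` is the measured time T for the kernel over the counted evaluations; the function and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Throughput = (walkers x iterations x splines) / T.
double throughput(std::uint64_t walkers, std::uint64_t iterations,
                  std::uint64_t splines, double seconds) {
  assert(seconds > 0.0);
  const double evaluations = static_cast<double>(walkers) *
                             static_cast<double>(iterations) *
                             static_cast<double>(splines);
  return evaluations / seconds;
}
```

If throughput stays flat as the number of splines grows, the kernel is scaling with problem size as the slide describes; a drop at large N points to cache or memory pressure.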
VGH throughput by AoS-to-SoA transformation
Higher the better
2x-4x Performance improvement for small to medium problem sizes.
Why low performance for large N?
•  AoS-to-SoA improves SIMD efficiency
•  But caches can be utilized better
•  Reduction on the arrays G & H of size N
•  Streaming access at 4x4x4 block
•  Pressure on resources with large N, e.g., TLB
•  How to keep the write data in L1/L2?
•  How to maximize LLC sharing?
[Figure: two cores connected through a hub. Annotation: reduction of output arrays over 64N values.]
AoSoA Data Layout Transformation
[Figure: data access pattern of the read-only B-spline table, a) current and b) tiled, showing the tiled input array and the tiled output arrays.]
Efficient cache utilization, by tiling both input and output arrays along the innermost dimension.
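A minimal sketch of tiling along the innermost (spline) dimension, assuming a table flattened as P[G][N] with G grid points and N splines, split into N/Nb tiles of width Nb. The function names and the simplified value-only evaluation are illustrative assumptions, not the miniQMC code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Untiled: P[g*N + n].  Tiled: T[(t*G + g)*Nb + m] with n = t*Nb + m.
std::vector<float> tile_table(const std::vector<float>& P, std::size_t G,
                              std::size_t N, std::size_t Nb) {
  assert(Nb > 0 && N % Nb == 0 && P.size() == G * N);
  std::vector<float> T(P.size());
  for (std::size_t t = 0; t < N / Nb; ++t)
    for (std::size_t g = 0; g < G; ++g)
      for (std::size_t m = 0; m < Nb; ++m)
        T[(t * G + g) * Nb + m] = P[g * N + t * Nb + m];
  return T;
}

// Weighted sum over the grid points `pts` for all splines, one tile at a
// time, so each tile's Nb outputs stay hot while they are accumulated.
void eval_tiled(const std::vector<float>& T, std::size_t G, std::size_t N,
                std::size_t Nb, const std::vector<std::size_t>& pts,
                const std::vector<float>& w, std::vector<float>& val) {
  assert(pts.size() == w.size() && T.size() == G * N);
  val.assign(N, 0.0f);
  for (std::size_t t = 0; t < N / Nb; ++t) {
    const float* tab = T.data() + t * G * Nb;  // this tile's private table
    float* out = val.data() + t * Nb;          // this tile's private output
    for (std::size_t q = 0; q < pts.size(); ++q) {
      const float* p = tab + pts[q] * Nb;
      for (std::size_t m = 0; m < Nb; ++m) out[m] += w[q] * p[m];
    }
  }
}
```

Tiles are independent of each other, which is also what allows them to be handed to different threads.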
Performance gain with tiling/AoSoA - Higher the better
AoSoA helps achieve sustained throughput across problem sizes for all architectures.
VGH Performance with SoA to AoSoA transformation (tiling)
VGH throughput with tiling, higher the better
Tiling improves performance for all three processors.
Performance of VGH at N = 2048 with respect to tile size.
§  BDW – peak at 64
–  The tiled input array fits in L3 cache.
§  KNC, KNL – peak at 512
–  For tile size > 512, output arrays fall out of caches.
Hybrid OpenMP/MPI Parallelism in QMCPACK
§  Current parallelism over walkers (Nw).
§  Working set size in QMCPACK grows with the number of walkers.
§  Next level: parallelizing each walker update.
–  Specifically, for Intel Xeon Phi, with its large number of cores/threads, a next level of
parallelism becomes essential for strong scaling.
Parallelism within a walker – nested threading
Strong scaling: independent execution of tiles in different threads.
•  Reduces memory requirement and time to solution on a node, by reducing the number of
walkers on a node.
•  miniQMC replaces OpenMP nested threading with manual assignment of work.
[Figure: code listing containing a "#pragma omp parallel" region.]
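One way to picture the manual assignment of work is the mapping below: a flat thread id is split into a walker index and a rank within that walker's team, and each rank takes a contiguous range of tiles. This is a sketch under assumed conventions; `Work` and `assign_work` are hypothetical names, and the actual miniQMC partitioning may differ.

```cpp
#include <cassert>
#include <cstddef>

struct Work {
  std::size_t walker;      // which walker this thread helps update
  std::size_t tile_begin;  // first tile owned by this thread
  std::size_t tile_end;    // one past the last tile owned by this thread
};

// Maps a flat thread id to (walker, tile range). Threads are grouped into
// teams of `threads_per_walker`; tiles are split as evenly as possible.
Work assign_work(std::size_t thread_id, std::size_t threads_per_walker,
                 std::size_t num_tiles) {
  assert(threads_per_walker > 0);
  const std::size_t walker = thread_id / threads_per_walker;
  const std::size_t rank = thread_id % threads_per_walker;
  const std::size_t begin = rank * num_tiles / threads_per_walker;
  const std::size_t end = (rank + 1) * num_tiles / threads_per_walker;
  return {walker, begin, end};
}
```

Inside a single `#pragma omp parallel` region, each thread could call this with its `omp_get_thread_num()` value and then loop over its own tiles, which avoids nested parallel regions.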
Strong Scaling Results on KNL
Reduces time to solution by ~14x with 16 threads per walker
Speedup on KNL w.r.t. number of walkers per thread.
Roofline Performance Analysis on KNL
§  SoA data layout conversion
–  Increases cache-aware AI from 0.22 to 0.32.
–  ~7% of the achievable peak.
–  1.5x speedup w.r.t. the AoS version.
[Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI; X marks the best performance (AoSoA) on DDR. Annotations: AI 0.22 → 0.32; SoA, ~7% of peak GFLOPS.]
§  AoSoA version increases cache reuse with the same AI.
–  Better cache utilization.
–  ~2.25x gain in performance.
[Figure annotation: AoSoA, 11% of peak GFLOPS.]
§  AoSoA version with MCDRAM is ~3.3x faster than with DDR.
[Figure annotation: AoSoA, 3.3x speedup with MCDRAM.]
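For readers unfamiliar with the roofline model, attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity (AI). The helper below is a generic sketch; the function name is hypothetical and no peak or bandwidth figures for KNL are assumed here.

```cpp
#include <algorithm>
#include <cassert>

// Roofline ceiling in GFLOPS for a kernel with arithmetic intensity `ai`
// (Flops/Byte) on a machine with the given peak and memory bandwidth.
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double ai) {
  assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && ai > 0.0);
  return std::min(peak_gflops, bandwidth_gbs * ai);
}
```

In the bandwidth-bound regime the ceiling scales linearly with AI, so raising AI from 0.22 to 0.32 lifts the ceiling by about 1.45x, and swapping in a higher-bandwidth memory such as MCDRAM raises it in proportion to the bandwidth ratio.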
Roofline performance analysis on BDW
Performance improved to ~50% of peak GFLOPS with the AoSoA version.
[Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI. Annotations: 660 GFLOPS SP vector FMA peak; AoSoA, ~50% of achievable GFLOPS.]
Performance Summary
§  The improvements are portable to 4 types of CPUs, even from different vendors.
§  Significant speedups even on BG/Q.
On VGH routine                           BGQ      BDW      KNC      KNL
SoA and basic                            1.9x     1.7x     2.6x     1.7x
AoSoA/Tiling                             2.7x     3.7x     5.2x     2.3x
Strong scaling                           5.2x     6.4x     35.2x    33.1x
Threads per walker (optimal tile size)   2(32)    2(32)    8(256)   16(128)
Symmetric Distance table computation
AoS to SoA transformation of particle positions.
Speedup vs. Problem Size (speedup w.r.t. BDW baseline; higher is better)
Number of electrons       256      512      800
KNL Baseline (256 TH)     0.8      0.6      0.6
BDW Opt (2 MPI/36 TH)     7.5      13.1     13.0
KNL Opt (256 TH)          18       30       30
KNL 50x faster with SoA data layout
•  KNL used in Quad/Cache mode for these experiments.
•  Here, TH = threads.
•  BDW has 2 sockets for these experiments.
Results
§  Array of structures (AoS) to structure of arrays (SoA) transformation helps
achieve efficient vectorization.
§  Tiling for better memory access helps achieve approximately constant
throughput across problem sizes.
§  Nested parallelism over the AoSoA objects on KNL reduces the time-to-solution
by ~14x with 16 threads.
§  Optimizations result in significant performance gain on all four
cache-coherent architectures tested (BG/Q, BDW, KNC, KNL).
Ways we increased the performance!
§  SIMD Parallelism
–  SoA data layout adaptation.
§  Efficient cache utilization
–  Tiling/Cache-Blocking.
§  Scalability
–  Next level of threading to reduce time to solution.
–  Takes advantage of reduced working set size.
Reference
Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim,
“Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors,”
arXiv:1611.02665
Legal Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.  NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.  EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death.  SHOULD YOU PURCHASE OR
USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND
AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS'
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL
APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS
PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice.  Designers must not rely on the absence or characteristics of any features or
instructions marked "reserved" or "undefined".  Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them.  The information here is subject to change without notice.  Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.  Current
characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: 
http://www.intel.com/design/literature.htm
Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release.
Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of
Intel's internal code names is at the sole risk of the user
Intel, Look Inside, Xeon, Intel Xeon Phi, Pentium, Cilk, VTune and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2016 Intel Corporation
Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel
does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel
microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved
for Intel microprocessors. Please refer to the applicable product User and Reference Guides
for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimers
Optimization Notice
§  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark*
and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products. For more information go to http://www.intel.com/performance.
§  Estimated Results Benchmark Disclaimer:
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or
configuration may affect actual performance.
§  Software Source Code Disclaimer:
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
§  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the
Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to the following conditions:
§  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Legal Disclaimers
Thank you for your time
Amrita Mathuriya
amrita.mathuriya@intel.com
www.intel.com/hpcdevcon
Backup
SymmetricDTD::moveonsphere – Code Snippet
§  For efficient auto-vectorization with the compiler
§  Three separate arrays for X, Y and Z instead of a single
array with (x, y, z) as a data member.
§  Similar SOA (structure of arrays) data layout for the output
array.
[Figure: AoS and SoA versions of the code snippet.]
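The code snippet on this backup slide was an image and is not recoverable. The sketch below only illustrates the idea stated in the bullets: separate X, Y, and Z arrays and an SoA-style output array, so the compiler can auto-vectorize the loop. `ParticlesSoA` and `distances_to` are hypothetical names, and periodic boundary handling is deliberately omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Particle positions stored as three separate contiguous arrays.
struct ParticlesSoA { std::vector<float> x, y, z; };

// Distance from the point (px, py, pz) to every particle.
void distances_to(const ParticlesSoA& r, float px, float py, float pz,
                  std::vector<float>& dist) {
  const std::size_t n = r.x.size();
  assert(r.y.size() == n && r.z.size() == n);
  dist.resize(n);
  // Unit-stride reads of x, y, z and unit-stride writes of dist.
  for (std::size_t i = 0; i < n; ++i) {
    const float dx = r.x[i] - px;
    const float dy = r.y[i] - py;
    const float dz = r.z[i] - pz;
    dist[i] = std::sqrt(dx * dx + dy * dy + dz * dz);
  }
}
```

With an AoS layout the same loop would load x, y, z with a stride of three floats, which typically forces gather-style accesses or shuffles.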
Current Trends in HPC
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
 

Similaire à Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdfRioCarthiis
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscapeugur candan
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platforma3labdsp
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPCRCCSRENKEI
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkLiang Men
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Ahsan Javed Awan
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE
 
The Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCThe Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCinside-BigData.com
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resumehet shah
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIinside-BigData.com
 

Similaire à Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines (20)

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdf
 
The Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing LandscapeThe Berkeley View on the Parallel Computing Landscape
The Berkeley View on the Parallel Computing Landscape
 
Low Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard PlatformLow Power High-Performance Computing on the BeagleBoard Platform
Low Power High-Performance Computing on the BeagleBoard Platform
 
ECP Application Development
ECP Application DevelopmentECP Application Development
ECP Application Development
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC05 Preparing for Extreme Geterogeneity in HPC
05 Preparing for Extreme Geterogeneity in HPC
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration Framework
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA Solutions
 
The Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCThe Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPC
 
OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019OpenACC Monthly Highlights Summer 2019
OpenACC Monthly Highlights Summer 2019
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resume
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 

Plus de Intel® Software

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology Intel® Software
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciIntel® Software
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.Intel® Software
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Intel® Software
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Intel® Software
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Intel® Software
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchIntel® Software
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel® Software
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019Intel® Software
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019Intel® Software
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Intel® Software
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Intel® Software
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Intel® Software
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...Intel® Software
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesIntel® Software
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision SlidesIntel® Software
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Intel® Software
 

Plus de Intel® Software (20)

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 

Dernier

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines

  • 2. Optimizations of B-spline based SPO evaluations in QMC for Multi/many-core Shared Memory Processors
      Amrita Mathuriya, HPC Application Engineer, DCG/Intel Corporation. November 2016.
      In collaboration with Jeongnim Kim (Intel), Victor Lee (Intel), Ye Luo and Anouar Benali (Argonne National Laboratory), and Luke Shulenburger (Sandia National Laboratories).
  • 3. Presenter: Amrita Mathuriya (11/12/2016)
      §  HPC Application Engineer on the HPC Ecosystem Application Engineering Team, working on code modernization and optimization for Intel® Xeon® and Xeon Phi™.
      §  At Intel for the past 8 years.
          –  Expert in algorithms and optimizations for Intel architecture (IA).
          –  Worked on HPC applications in computational geometry, optical proximity correction (OPC), electromagnetics, computational biology, and quantum Monte Carlo.
          –  Working on code modernization for Intel® Xeon® and Xeon Phi™ architectures.
      §  MS in Computer Science with a specialization in Computational Science and Engineering from Georgia Tech, USA, under the guidance of Professor David Bader.
      §  B.Tech. in Computer Science from the Indian Institute of Technology (IIT) Roorkee, India.
  • 4. Systems
      §  KNC: Intel® Xeon Phi™ coprocessor 7120P
          •  61 cores @ 1.238 GHz, 4-way Intel® Hyper-Threading Technology, memory: 15872 MB
          •  Intel® Many-core Platform Software Stack version 3.6.1
          •  OS version: 3.10.0-229.el7.x86_64
      §  KNL: Intel® Xeon Phi™ 7250P (code-named Knights Landing), 68 cores @ 1.4 GHz with 16 GB MCDRAM, used in Quad cluster mode and Flat memory mode, Turbo enabled.
      §  BDW: Intel® Xeon® E5-2697v4, single socket, 18 cores @ 2.3 GHz, HT enabled, 145 W, 128 GB DDR4-2400 RAM (8 × 16 GB DIMMs).
      §  BG/Q: Blue Gene/Q processor from the Mira supercomputer at Argonne National Laboratory.
      §  Compilers, MPI, and math library:
          •  icc version 16.0.2 (gcc version 4.8.3 compatibility)
          •  Intel® MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)
  • 5. Agenda §  KNL overview and motivation §  Intro to quantum Monte Carlo and QMCPACK §  Current status of QMCPACK §  Analysis of CORAL graphite benchmark §  Optimizations to B-spline based SPO evaluations for QMC §  Summary
  • 7. Important Characteristics of KNL §  Increasing core count per node on both Intel® Xeon® and Xeon Phi™ processors. §  Large SIMD units: AVX-512 operates on 16 single-precision floating-point values simultaneously. §  Two-level cache system (L1/L2) and high memory bandwidth.
  • 8. How to gain performance? §  Scalability –  Enable data sharing with hybrid parallelism using MPI + threading. –  Design and implement scalable algorithms. §  SIMD parallelism: adapt data layouts to enable efficient vectorization. §  Efficiently utilize caches and memory bandwidth with tiling (cache-blocking).
  • 9. Roofline Performance Analysis on KNL: 7% of peak GFLOPS achieved with the current AoS version. [Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI. Plot annotations: 0.22 Flops/Byte; NOW, 7%; scalar add peak.] Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.
  • 10. Performance Portable on Intel® Xeon®, Intel Xeon Phi™ and BG/Q Processors §  Optimizations for efficiently utilizing SIMD units and caches. –  SoA data layout transformation. –  Tiling or AoSoA data layout transformation. –  Nested thread parallelization to reduce time-to-solution and memory usage. §  Optimization work was done on KNC. §  Later ported to KNL, where it works out of the box. §  Optimizations result in significant performance improvement on BG/Q.
  • 11. QMCPACK: an open-source US-DOE flagship many-body ab initio quantum Monte Carlo (QMC) code for computing the electronic structure of atoms, molecules, and solids. http://qmcpack.org/ [Figure: (a) DMC charge density of AB-stacked graphite and (b) the ball-and-stick rendering, with the 4-carbon unit cell in blue.] [Figure: Parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform.] J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, “Hybrid algorithms in quantum Monte Carlo,” Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008
  • 12. Diffusion Monte Carlo Schematics. The ensemble evolves according to •  Diffusion •  Drift •  Branching [Figure: old configurations, random walking, possible new configurations, and the new configurations ensemble, with example walker weights w=0.8, w=1.6, w=2.4, w=0.3.]
  • 13. How is QMCPACK parallelized? QMCPACK utilizes OpenMP to optimize memory usage and to take advantage of the growing number of cores per SMP node. §  Walkers within an MPI task are distributed among the cores of the CPU. §  Big common data, such as wave function coefficients, is shared by all the walkers. §  Frequency has stopped increasing. §  Node count has stopped growing. §  Nodes are getting more powerful but require applications to expose more concurrency. The free lunch is over. On-node performance is challenging.
  • 14. QMCPACK status §  Excellent MPI & OpenMP parallel efficiency at the walker level. §  All in double precision except 3D cubic B-spline. –  Recent work implements mixed precision, which speeds up by 1.2-1.5x. §  SIMD efficiency is low. §  Basically scalar performance, with few exceptions: –  B-spline: SSE/SSE2/QPX –  Distance tables with QPX §  Array of Structures (AoS) for D-dim N-particle attributes, e.g., R (N,3), Gradients (N,3), Hessian matrices (N,9). Pretty good and we can even do better!
  • 15. CORAL Benchmark – KNL Profiling. Benchmark system: 4x4x1 AB-stacked graphite, 64 carbon atoms, 256 electrons. [Pie chart: CORAL benchmark profile on KNL; slices labeled Einspline, Distance Table, Jastrow, Others; percentages shown: 29%, 34%, 18%, 18%.] The three compute kernels account for 80% of the run time in QMCPACK on KNL.
  • 16. QMC: Single particle orbital (SPO) representation with B-spline basis set. One-dimensional cubic B-spline function [equation image]. Representation for the 3D orbital: tensor product in each Cartesian direction [equation image]. Precomputed coefficients: 4D read-only array, stored in SoA format, P[nx][ny][nz][N], provided by DFT or HF computations using Quantum Espresso.
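A sketch of the tensor-product form, written to be consistent with the P[nx][ny][nz][N] storage on this slide and the 4x4x4 coefficient block mentioned on later slides. The symbols $b_\alpha$, $t_x$, $i_0$ and the zero-based index offsets are notational assumptions, not taken from the slide.

```latex
\phi_n(x,y,z) \;=\; \sum_{\alpha=0}^{3}\sum_{\beta=0}^{3}\sum_{\gamma=0}^{3}
  b_\alpha(t_x)\, b_\beta(t_y)\, b_\gamma(t_z)\;
  P[i_0+\alpha][j_0+\beta][k_0+\gamma][n],
\qquad
i_0 = \lfloor x/\Delta x \rfloor,\quad t_x = x/\Delta x - i_0
```

Here $j_0, t_y$ and $k_0, t_z$ are defined analogously for $y$ and $z$, and $b_0,\dots,b_3$ are the four nonzero cubic B-spline basis weights on the interval containing the point, so each evaluation touches 64 coefficient vectors of length N.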
  • 17. Simplified miniQMC §  Only contains B-spline evaluation routines. §  Mimics the computational and data access patterns of B-spline SPO evaluations in QMC. [Figure labels: B-spline SPO evaluation kernels; random position generation.]
  • 18. Data Layout – Performance Considerations. Array-of-Structs (AoS) §  Pros: Logical for expression of physical abstractions in 3D or higher dimensions. Struct-of-Arrays (SoA) §  Pros: Contiguous loads/stores for efficient vectorization. Hybrid (AoSoA) §  Pros: Potentially useful for increasing cache locality. Also supports efficient vectorization. [Figure: diagrams of how the x, y, z components are ordered in memory for the three layouts.]
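The difference between the AoS and SoA layouts named on this slide can be sketched in a few lines of C++. This is a minimal illustration, not QMCPACK code: `PosAoS`, `PosSoA`, `to_soa`, and `sum_r2` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: one struct per particle, so x, y, z of a particle are adjacent in memory.
struct PosAoS { float x, y, z; };

// SoA: one contiguous array per component (hypothetical container for this sketch).
struct PosSoA {
  std::vector<float> x, y, z;
  explicit PosSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Copy an AoS particle list into three separate component arrays.
inline PosSoA to_soa(const std::vector<PosAoS>& aos) {
  PosSoA soa(aos.size());
  for (std::size_t i = 0; i < aos.size(); ++i) {
    soa.x[i] = aos[i].x;
    soa.y[i] = aos[i].y;
    soa.z[i] = aos[i].z;
  }
  return soa;
}

// Sum of squared distances from the origin. Each array is read with unit
// stride, which is the access pattern auto-vectorizers handle best.
inline float sum_r2(const PosSoA& p) {
  float s = 0.0f;
  for (std::size_t i = 0; i < p.x.size(); ++i)
    s += p.x[i] * p.x[i] + p.y[i] * p.y[i] + p.z[i] * p.z[i];
  return s;
}
```

With the AoS layout the same loop would read `aos[i].x`, `aos[i].y`, `aos[i].z` at a stride of three floats, which typically forces gather-style loads when vectorized.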
  • 19. Pseudocode – VGH: computes value, gradient, and Hessian at a random (x,y,z). [Pseudocode figure.] Random data access pattern of the read-only B-spline coefficients P at a random position (x, y, z), with j0=floor(y/dy) etc. The outermost x dimension is not shown. Strided access for output arrays.
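The access pattern described on this slide can be sketched for the value part only. This is a simplified illustration under stated assumptions: a uniform grid, no boundary handling, gradient and Hessian omitted, and the names `bspline_weights` and `eval_v` are hypothetical (they are not the einspline or QMCPACK API).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Uniform cubic B-spline weights for a fractional coordinate t in [0, 1).
inline void bspline_weights(float t, float w[4]) {
  const float t2 = t * t, t3 = t2 * t;
  w[0] = (1.0f - 3.0f * t + 3.0f * t2 - t3) / 6.0f;
  w[1] = (4.0f - 6.0f * t2 + 3.0f * t3) / 6.0f;
  w[2] = (1.0f + 3.0f * t + 3.0f * t2 - 3.0f * t3) / 6.0f;
  w[3] = t3 / 6.0f;
}

// Hypothetical value-only kernel. P has logical shape [nx][ny][nz][N] with the
// spline index n innermost; v receives the N orbital values at (x, y, z).
inline void eval_v(const std::vector<float>& P, int nx, int ny, int nz, int N,
                   float x, float y, float z, float dx, float dy, float dz,
                   std::vector<float>& v) {
  const int i0 = static_cast<int>(std::floor(x / dx));
  const int j0 = static_cast<int>(std::floor(y / dy));
  const int k0 = static_cast<int>(std::floor(z / dz));
  // The sketch assumes the 4x4x4 block lies fully inside the grid.
  assert(i0 >= 0 && i0 + 3 < nx && j0 >= 0 && j0 + 3 < ny && k0 >= 0 && k0 + 3 < nz);

  float a[4], b[4], c[4];
  bspline_weights(x / dx - i0, a);
  bspline_weights(y / dy - j0, b);
  bspline_weights(z / dz - k0, c);

  v.assign(N, 0.0f);
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      for (int k = 0; k < 4; ++k) {
        const float w = a[i] * b[j] * c[k];
        // Start of the length-N coefficient vector at grid point (i0+i, j0+j, k0+k).
        const std::size_t base =
            ((static_cast<std::size_t>(i0 + i) * ny + (j0 + j)) * nz + (k0 + k)) * N;
        // Unit-stride loop over splines: the part a compiler can vectorize.
        for (int n = 0; n < N; ++n) v[n] += w * P[base + n];
      }
}
```

The 64 starting offsets depend on the random position, but each inner loop streams through N contiguous coefficients and accumulates into the same N outputs, which is the 64N-value reduction that later slides try to keep in cache.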
  • 20. SoA transformation for output arrays. Output arrays in SoA (structure of arrays) format. [Figure: diagram of the x, y, z components stored as separate contiguous arrays.]
  • 21. How to evaluate performance of QMC §  Rate of Monte Carlo sample generation (throughput) per resource. §  For the miniapp: Throughput = (number of evaluations) / T, where Evaluations = (number of walkers) x (number of iterations) x (number of splines) and T = time per call of a function (such as VGH). §  Throughput represents work done on a node. §  Ideally, it should stay constant across problem sizes.
  • 22. VGH throughput by AoS-to-SoA transformation. [Figure: VGH throughput; higher is better.] 2x-4x performance improvement for small to medium problem sizes.
  • 24. Why low performance for large N? •  AoS-to-SoA improves SIMD efficiency. •  But caches can be utilized better. •  Reduction on the arrays G and H of size N: the output arrays are reduced over 64N values. •  Streaming access at each 4x4x4 block. •  Pressure on resources with large N, e.g., the TLB. •  How to keep the write data in L1/L2? •  How to maximize LLC sharing? [Figure: two cores connected to a hub.]
  • 25. AoSoA Data Layout Transformation. Efficient cache utilization by tiling both input and output arrays along the innermost dimension. [Figure: data access pattern of the read-only B-spline table, a) current and b) tiled, showing the tiled input array and tiled output arrays.]
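Tiling along the innermost (spline) dimension can be sketched as splitting one [G][N] table into N/Nb independent tables of shape [G][Nb], where G stands for the number of grid points (nx*ny*nz). This is an illustrative sketch only; `tile_table`, `G`, and `Nb` are hypothetical names, and the sketch assumes N is a multiple of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a flat [G][N] coefficient table (spline index innermost) into
// N / Nb tiles, each a flat [G][Nb] table that can be evaluated on its own.
inline std::vector<std::vector<float>> tile_table(const std::vector<float>& P,
                                                  std::size_t G, std::size_t N,
                                                  std::size_t Nb) {
  assert(Nb > 0 && N % Nb == 0 && P.size() == G * N);
  const std::size_t M = N / Nb;  // number of tiles
  std::vector<std::vector<float>> tiles(M, std::vector<float>(G * Nb));
  for (std::size_t t = 0; t < M; ++t)
    for (std::size_t g = 0; g < G; ++g)
      for (std::size_t n = 0; n < Nb; ++n)
        tiles[t][g * Nb + n] = P[g * N + t * Nb + n];
  return tiles;
}
```

Each tile then has its own short output arrays of length Nb, so the repeated accumulation over the 64 grid points works on a smaller set of data than one pass over all N splines.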
  • 26. Performance gain with tiling/AoSoA. AoSoA helps achieve sustained throughput across problem sizes for all architectures. [Figure: VGH performance with SoA to AoSoA transformation (tiling); higher is better.]
  • 27. VGH throughput with tiling (higher is better). Tiling improves performance for all three processors. [Figure: Performance of VGH at N = 2048 with respect to tile size.] §  BDW: peak at 64. §  The tiled input array fits in L3 cache. §  KNC, KNL: peak at 512. §  For tile size > 512, output arrays fall out of caches.
  • 28. Hybrid OpenMP/MPI Parallelism in QMCPACK §  Current parallelism is over walkers (Nw). §  Working set size in QMCPACK grows with the number of walkers. §  Next step: parallelizing each walker update. §  Specifically, for Intel Xeon Phi, with its large number of cores/threads, the next level of parallelism becomes essential for strong scaling. [Figure: Parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform.] J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, “Hybrid algorithms in quantum Monte Carlo,” Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008
  • 29. Parallelism within a walker – nested threading. [Code figure beginning with #pragma omp parallel.] Strong scaling: independent execution of tiles in different threads. •  Reduces memory requirement and time to solution on a node by reducing the number of walkers on a node. •  miniQMC replaces OpenMP nested threading with manual assignment of work.
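One way to picture "manual assignment of work" is a helper that maps a thread's position within its walker team to a contiguous range of tiles. This is a sketch of the general idea, not the miniQMC implementation; `tile_range` and its parameters are hypothetical, and nth is assumed to be at least 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [first, last) of tile indices owned by member `tid` of a
// team of `nth` threads that share one walker. Any remainder is spread over
// the lowest-numbered team members so that loads differ by at most one tile.
inline std::pair<std::size_t, std::size_t> tile_range(std::size_t tid,
                                                      std::size_t nth,
                                                      std::size_t ntiles) {
  const std::size_t base = ntiles / nth;
  const std::size_t extra = ntiles % nth;
  const std::size_t first = tid * base + std::min(tid, extra);
  const std::size_t last = first + base + (tid < extra ? 1 : 0);
  return {first, last};
}
```

Inside a single flat `#pragma omp parallel` region, a thread could derive its walker index and team-local `tid` from its global thread number (for example by integer division and remainder by the team size) and then loop only over its own tiles, which avoids nested parallel regions.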
  • 30. Strong Scaling Results on KNL. Reduces time to solution by ~14x with 16 threads per walker. [Figure: Speedup on KNL w.r.t. number of walkers per thread.]
  • 32. Roofline Performance Analysis on KNL §  SoA data layout conversion –  Increases cache-aware AI from 0.22 to 0.32. –  ~7% of the achievable peak. –  1.5x speedup w.r.t. the AoS version. [Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X (b) the best performance (AoSoA) on DDR. Plot annotations: 0.22 → 0.32; SoA, ~7% of peak GFlops.]
  • 33. Roofline Performance Analysis on KNL §  AoSoA version increases cache reuse with the same AI. –  Better cache utilization. –  ~2.25x gain in performance. [Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X (b) the best performance (AoSoA) on DDR. Plot annotations: 0.22 → 0.32; AoSoA, 11% of peak GFlops.]
  • 34. Roofline Performance Analysis on KNL §  AoSoA version with MCDRAM is ~3.3x faster than with DDR. [Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X (b) the best performance (AoSoA) on DDR. Plot annotation: AoSoA, 3.3x speedup with MCDRAM.]
  • 35. Roofline performance analysis on BDW. Performance improved to ~50% of peak GFLOPS with the AoSoA version. [Figure: VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI. Plot annotations: 660 GFLOPS SP vector FMA peak; AoSoA, ~50% of achievable GFlops.]
  • 36. Performance Summary §  The improvements are portable to 4 types of CPUs, even from different vendors. §  Significant speedups even on BG/Q.
      On VGH routine                            BGQ      BDW      KNC      KNL
      SoA and basic                             1.9x     1.7x     2.6x     1.7x
      AoSoA/Tiling                              2.7x     3.7x     5.2x     2.3x
      Strong scaling                            5.2x     6.4x     35.2x    33.1x
      Threads per walker (optimal tile size)    2 (32)   2 (32)   8 (256)  16 (128)
  • 37. Symmetric Distance table computation: AoS to SoA transformation of particle positions. KNL is 50x faster with the SoA data layout. [Figure: Speedup vs. Problem Size; speedup w.r.t. BDW baseline, higher is better.]
      Number of electrons       256    512    800
      KNL Baseline (256 TH)     0.8    0.6    0.6
      BDW Opt (2 MPI/36 TH)     7.5    13.1   13.0
      KNL Opt (256 TH)          18     30     30
    •  KNL used in Quad/Cache mode for these experiments. •  Here, TH = threads. •  BDW has 2 sockets for these experiments.
  • 38. Results §  The array of structures (AoS) to structure of arrays (SoA) transformation helps achieve efficient vectorization. §  Tiling for better memory access helps achieve approximately constant throughput across problem sizes. §  Nested parallelism over the AoSoA objects on KNL reduces the time-to-solution by ~14x with 16 threads. §  Optimizations result in significant performance gain on all three distinct cache-coherent architectures.
  • 39. Ways we increased the performance! §  SIMD parallelism –  SoA data layout adaptation. §  Efficient cache utilization –  Tiling/cache-blocking. §  Scalability –  Next level of threading to reduce time to solution. –  Takes advantage of reduced working set size.
  • 40. Reference: Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim, “Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors,” arXiv:1611.02665
  • 41. 11/12/2016 Legal Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.  NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.  EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death.  SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice.  Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined".  Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.  The information here is subject to change without notice.  Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.  Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:  http://www.intel.com/design/literature.htm Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user Intel, Look Inside, Xeon, Intel Xeon Phi, Pentium, Cilk, VTune and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2016 Intel Corporation 41
  • 42. Legal Disclaimers: Optimization Notice. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
  • 43. 11/12/2016 §  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. §  Estimated Results Benchmark Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. §  Software Source Code Disclaimer: Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. §  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: §  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. Legal Disclaimers 43
  • 44. Thank you for your time Amrita Mathuriya amrita.mathuriya@intel.com www.intel.com/hpcdevcon
  • 46. SymmetricDTD::moveonsphere – Code Snippet §  For efficient auto-vectorization by the compiler: §  Three separate arrays for X, Y and Z instead of a single array with (x, y, z) as data members. §  Similar SoA (structure of arrays) data layout for the output array. [Code figure: AoS and SoA versions side by side.]