Experts in numerical algorithms
and HPC services
Accelerators: the good, the bad and the ugly!
Dr Ian Reid
Ian.Reid@nag.co.uk

Agenda
• NAG Introduction
• Accelerators – NAG experience
• NAG on Intel Xeon Phi
• Summary

NAG Background
• Founded 1970
• Not-for-profit organisation
  • Surpluses fund on-going R&D
• Mathematical and Statistical Expertise
  • Libraries of components
  • Consulting
• HPC Services
  • Computational Science and Engineering (CSE) support
  • Procurement advice, market watch, benchmarking

Where has my Escalator gone?
• Escalator?: Want more performance? Buy the next processor!
• To get performance/efficiency we have to go (massively) parallel
• Disruption causing serious look at ‘other’ technologies (and algorithms!)
• Even CPUs with tens of cores
• Hybrid, shared-memory and distributed-memory parallelism
• Painful whichever way we turn!

Accelerators
• Loose definition: hardware on which to run your software better than on your (general purpose) CPU
• Generally NOT an easy win
  • Significant learning curve and effort
  • Offload disadvantages…
• The good: put some effort in; get a great result!
• The bad: put effort in, get an OK result, but learn lessons which can be re-used (often good!)
• The ugly: put significant effort in, get a poor result and don’t learn anything substantive

Intel Xeon Phi
• The Intel Xeon Phi is a co-processor attached to a host system via the PCI Express bus
• Highly parallel architecture
• Compiler support for OpenMP parallelism
• It has a distinct memory system from the host
• Several use cases to consider:
  • Automatic Offloading
  • Explicit Offloading
  • Native Applications

NAG Experience with Intel Xeon Phi
• Relatively easy to take existing OpenMP-based code and port it to Phi
• Tuning for Phi takes some learning and expertise
  • … but feedback into Xeon code is often very strong
• NAG Library for Intel Xeon Phi supports all models
  • Offload (supports automatic and explicit) and Native libs
  • Windows version for Intel Xeon Phi now in beta

Automatic Offload
• Offload OpenMP regions to Phi when problem sizes are above some threshold
  • Estimating problem size can be complex
• Required data is transferred to/from the host before/after executing the OpenMP region
  • Data transfer takes time and eats into the benefit of running the OpenMP region on the Phi
• Transparent to the user of the Library
  • Just recompile code containing NAG Library function calls to benefit
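
The threshold idea can be sketched with a toy cost model: offload only when the time on the Phi, plus the transfer and start-up cost, is lower than the time on the host. This is an illustration only, not how the NAG Library decides; the function names, cost curves, bandwidth and start-up figures below are all made-up assumptions.

```python
def offload_pays_off(host_seconds, phi_seconds, bytes_moved,
                     bus_bytes_per_second, launch_seconds=0.0):
    """True when Phi compute plus transfer and launch cost beats the host."""
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return phi_seconds + transfer_seconds + launch_seconds < host_seconds


def smallest_offload_size(sizes, host_time, phi_time, bytes_for,
                          bus_bytes_per_second, launch_seconds=0.0):
    """First problem size in `sizes` (ascending) where offload wins, else None."""
    for n in sizes:
        if offload_pays_off(host_time(n), phi_time(n), bytes_for(n),
                            bus_bytes_per_second, launch_seconds):
            return n
    return None


# Made-up cost curves for an n-by-n matrix of doubles (not measured data).
def host_time(n):
    return 4e-9 * n * n      # hypothetical host run time in seconds


def phi_time(n):
    return 1e-9 * n * n      # hypothetical Phi run time in seconds


def bytes_for(n):
    return 8 * n * n         # 8 bytes per double, whole matrix moved once


threshold = smallest_offload_size(
    [1000, 2000, 4000, 8000, 16000],
    host_time, phi_time, bytes_for,
    bus_bytes_per_second=6e9,   # assumed effective bus bandwidth
    launch_seconds=0.05,        # assumed fixed offload start-up cost
)
print(threshold)
```

With these invented figures the fixed start-up cost dominates at small sizes, so offload only wins from n=8000 upwards. Real thresholds depend on the routine and on more than one size parameter, which is why estimating them can be complex.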

Explicit Offload
• All NAG functions can be explicitly offloaded by the user
  • user code modified to include relevant offload statements
  • allows control of which functions are offloaded
• Data transfers to Phi can be decoupled from function offloading, allowing data to remain on the Phi
  • user responsible for data movement
  • reduces the penalty of offloading data by allowing its use by multiple offloaded function calls before returning to host
• Effort required by the user to re-code application
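
The benefit of leaving data on the Phi can be sketched with simple arithmetic: if several offloaded calls use the same data, transferring once before the first call and once after the last avoids paying the transfer on every call. The function name and the timings below are hypothetical and chosen only to make the effect visible.

```python
def total_time(calls, compute_seconds, transfer_seconds, keep_resident):
    """Wall time for `calls` offloaded functions that share the same data."""
    if keep_resident:
        # One upload before the first call, one download after the last.
        transfers = 2
    else:
        # Every call uploads its input and downloads its result.
        transfers = 2 * calls
    return calls * compute_seconds + transfers * transfer_seconds


# Hypothetical figures: 5 calls, 1.0 s of compute each, 0.5 s per transfer.
per_call = total_time(5, 1.0, 0.5, keep_resident=False)
resident = total_time(5, 1.0, 0.5, keep_resident=True)
print(per_call, resident)
```

In this made-up case the transfer overhead falls from 5.0 s to 1.0 s. The price is that the user's code has to manage data movement explicitly.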

Native Applications
• Users may choose to port their entire application
  • user code modified to include relevant offload statements
  • allows complete control of which functions are offloaded
• Data transfers to Phi can be decoupled from function offloading, allowing data to remain on the Phi
  • user responsible for data movement
  • reduces the penalty of offloading data by allowing its use by multiple offloaded function calls before returning to host
• Effort required by the user to re-code application

Performance Examples and Lessons
• Sandy Bridge CPUs (typically using 32 threads)
• Knights Corner Phi processor (typically using 240 threads)

Hierarchical Cluster Analysis (g03ec)
[Chart: Time (s) against Problem Size (n); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
• n=30k; m=3k
• Xeon 32t: 1,412s
• Phi 240t*: 1,259s
• Xeon 32t*: 1,073s
• For this size of problem, best to stay on the CPU, but take the 25%!

Distance Matrix (g03ea)
[Chart: Time (s) against Problem Size (n); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
• n=30k; m=3k
• Xeon 32t: 192s
• Phi 240t*: 40.6s
• Xeon 32t*: 75.7s
• Phi gain ~2x (~5x over original)

Uniform RNG - Mersenne Twister (g05sa)
[Chart: Time (s) against Size of problem (n, log scale); series: 8 threads original, Native Phi original, Native Phi opt, 8 threads opt]
• n=500m
• Xeon 8t: 0.25s
• Phi 240t*: 0.08s
• Xeon 8t*: 0.22s
• Phi gain ~3x

Maximum Likelihood Estimates (g03ca)
[Chart: Time (s) against Problem Size (weighted); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
• n=2500; m=2500; nfac=30; nvar=200
• Xeon 32t: 256s
• Phi 240t*: 53.6s
• Xeon 32t*: 54.7s
• Phi gain 4x, but also Xeon speed-up (green line under red)

Solve real symmetric positive definite simultaneous linear equations using iterative refinement (f04af)
[Chart: Time (s) against Problem Size (n); series: 32 threads original, Phi offload original, Phi offload opt, 32 threads opt]
• n=6,000; nrhs=1,000
• Xeon 32t: 171s
• Phi 240t*: 66s
• Xeon 32t*: 86s
• Phi gain ~1.3x (~3x over original)
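
The quoted gains can be re-derived from the timings on the five performance slides. The sketch below assumes that the asterisk marks the optimised code (matching the "opt" series in the chart legends) and that the unmarked Xeon figure is the original code; the dictionary layout and function name are illustrative only.

```python
# Timings in seconds as quoted on the slides:
# (Xeon original, Phi optimised, Xeon optimised).
# Assumption: "*" on the slides marks the optimised code.
timings = {
    "g03ec": (1412.0, 1259.0, 1073.0),
    "g03ea": (192.0, 40.6, 75.7),
    "g05sa": (0.25, 0.08, 0.22),
    "g03ca": (256.0, 53.6, 54.7),
    "f04af": (171.0, 66.0, 86.0),
}


def speedups(xeon_original, phi_optimised, xeon_optimised):
    """Ratios greater than 1 mean the first-named configuration is faster."""
    return {
        "phi_vs_xeon_opt": xeon_optimised / phi_optimised,
        "phi_vs_xeon_orig": xeon_original / phi_optimised,
        "xeon_opt_vs_orig": xeon_original / xeon_optimised,
    }


ratios = {name: speedups(*t) for name, t in timings.items()}

for name, r in ratios.items():
    print(f"{name}: Phi vs optimised Xeon {r['phi_vs_xeon_opt']:.2f}x, "
          f"Phi vs original Xeon {r['phi_vs_xeon_orig']:.2f}x, "
          f"optimised vs original Xeon {r['xeon_opt_vs_orig']:.2f}x")
```

Under that reading, g03ec is the one case where the optimised Xeon code beats the Phi (a ratio below 1), and every routine shows some gain on the Xeon from the tuning work alone.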

Summary
• Parallelism is a real issue we all face
  • Exciting for some. Challenging for others!
• Accelerators are interesting and can offer spectacular wins
• Intel Phi claiming less spectacular performance gains
  • Less effort than on other Accelerators
  • … and often repays on CPU as well!
• Acid test is always solving your (complete) problem!
• NAG can help you try out this technology
  • NAG Library for Phi
  • NAG expertise

Thank You
Questions?
