NICS, Adaptive Computing, and Intel: Leadership in HPC

Troy Baer
Senior HPC System Administrator
NICS
Overview
• Introduction to NICS
• NICS and Adaptive Computing
• NICS and Intel
  – SC'12 Green 500 effort
• Going Forward
National Institute for Computational Sciences:
A University of Tennessee / ORNL Partnership

• NICS is an NSF-funded HPC center
  – Founded in 2007
  – Operated by the University of Tennessee, located at ORNL
  – XSEDE Partner and Service Provider
• XSEDE Systems
  – Kraken (Cray XT5, 112,984 Opteron cores)
  – Nautilus (SGI UV, 1,152 Nehalem cores + 16 M2070 GPUs)
  – Keeneland final system (HP GPU cluster, 4,224 Sandy Bridge cores + 792 M2090 GPUs) in conjunction with Georgia Tech
Other Systems and Projects at NICS

• Non-XSEDE Systems
  – Keeneland initial delivery system (HP GPU cluster) in conjunction with Georgia Tech
  – Ares (Cray XE/XK6)
  – Beacon (Appro/Cray cluster; more on this later...)
  – Darter (Cray XC30; more on this later...)
• Associated Centers and Projects
  – Application Acceleration Center of Excellence (AACE)
    • Parent project for Beacon
  – Remote Data Analysis and Visualization (RDAV) project
    • Parent project for Nautilus
NICS and Adaptive Computing

• NICS and Adaptive have been working together literally since the founding of the center
• Achievements
  – Kraken: 90-95% utilization on a petaflop-class system for 3 years and counting!
    • Over 3 billion core-hours delivered in total, 965 million delivered in CY2012
    • Delivered ~65% of all XSEDE computing cycles until very recently
    • Bi-modal scheduling for capability vs. capacity (see the sketch after this list)
  – Athena (Cray XT4): Dedicated access for the COLA climate modeling group for ~6 months
  – Kraken/Athena: Annual OU CAPS Spring Experiment (storm forecasting)
  – Nautilus: NUMA+GPU scheduling
  – KIDS and KFS: GPU scheduling test bed
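
To make the bi-modal idea concrete, here is a toy Python router showing the shape of the policy: jobs above some fraction of the machine are held for dedicated capability windows, while everything else backfills as capacity work. The cutoff and mode names are assumptions for illustration, not Kraken's actual Moab configuration.

```python
# Toy sketch of bi-modal (capability vs. capacity) job routing.
# The 50% cutoff and the mode names are illustrative assumptions,
# not Kraken's actual Moab configuration.

TOTAL_CORES = 112984        # Kraken's core count, from the slide above
CAPABILITY_CUTOFF = 0.5     # assumed fraction of the machine

def route(job_cores):
    """Pick a scheduling mode for a job of the given width."""
    if job_cores >= CAPABILITY_CUTOFF * TOTAL_CORES:
        # Very wide "hero" jobs wait for a dedicated capability window.
        return "capability"
    # Everything else backfills around those windows, keeping the
    # machine busy in between.
    return "capacity"

if __name__ == "__main__":
    print(route(98304))   # capability
    print(route(4096))    # capacity
```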
NICS and Intel

• AACE was born of conversations between NICS, ORNL, and Intel in early 2011
• Beacon project
  – Application readiness for Intel Xeon Phi
  – NSF STCI award provided funding for personnel and the initial hardware
    • 8 funded science teams
    • Open call for more science teams just ended
  – Second phase of hardware funded by the University of Tennessee system and the state of Tennessee
    • Data-intensive computing
    • Power efficiency research
BEACON                        Phase 1                  Phase 2
Compute Nodes                 16 Appro Grizzly Pass    48 Appro GreenBlade GBN814N
Node Processor                2x 8-core Sandy Bridge   2x 8-core Sandy Bridge
Memory/Node (GB)              64                       256
SSD/Node (GB)                 160                      960
Xeon Phis/Node                2                        4
Interconnect                  QDR InfiniBand           FDR InfiniBand
Bandwidth to Storage (GB/s)   ~2.5                     ~15
OS                            CentOS 6.2               CentOS 6.2
Installation                  NFS-root                 Diskful
Batch Environment             TORQUE/Moab              TORQUE/Moab
SC'12 Green 500 Effort

• In the run-up to the Supercomputing 2012 conference, NICS, Intel, and Appro (now Cray) decided to take a shot at #1 on the Green 500 list
• People worked on the system literally around the clock in Tennessee, California, India, and Germany for a month to make this happen!
• Result: New record of 112.2 TF/s @ 44.89 kW (i.e. 2.499 GF/W)
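
For the record, the efficiency figure is just the measured LINPACK rate divided by the measured power:

\[
\frac{112.2~\mathrm{TF/s}}{44.89~\mathrm{kW}}
= \frac{112{,}200~\mathrm{GF/s}}{44{,}890~\mathrm{W}}
\approx 2.499~\mathrm{GF/W}
\]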
Stupid Phi Tricks

• Xeon Phis have a number of programming models
  – Offload (like GPUs)
  – Reverse offload (i.e. Phis offloading to the host)
  – Native mode (i.e. running MPI ranks on Phis)
  – Various hybrids thereof
• Xeon Phis are basically embedded x86_64 Linux boxes, complete with SSH, NFS, etc., which allows you to do all sorts of clever and/or hilarious things in job prologues and epilogues (see the sketch after this list)
  – NFS-export Lustre and/or local scratch from host to Phis
    • The Phis' BusyBox NFS client currently doesn't support NFSv3 locking – Intel is working on this
  – Provision the job owner's uid (and only the job owner's uid) on MICs at job start
  – Reboot Phis between jobs
    • A bit slower than one might like – Intel is working on this as well
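
The provisioning tricks above map naturally onto a batch prologue. Below is a minimal sketch of what such a TORQUE prologue might look like, in Python for readability. Everything specific is an assumption: the Phi hostnames, the scratch path, the host's name as seen from a Phi, and the BusyBox adduser flags are illustrative, not Beacon's production scripts.

```python
#!/usr/bin/env python
# Minimal sketch of a TORQUE job prologue doing the Phi tricks above.
# Illustrative only: hostnames, paths, and adduser flags are assumptions,
# not Beacon's production scripts.
import pwd
import subprocess
import sys

MICS = ["mic0", "mic1"]        # coprocessor hostnames (assumed)
SCRATCH = "/lustre/scratch"    # host filesystem to export (assumed)
HOST = "host"                  # host's name as seen from a Phi (assumed)

def run(cmd):
    # Fail loudly; TORQUE treats a non-zero prologue exit as a job abort.
    subprocess.check_call(cmd)

def main():
    # TORQUE passes the job id as argv[1] and the user name as argv[2].
    user = sys.argv[2]
    uid = pwd.getpwnam(user).pw_uid

    for mic in MICS:
        # Provision the job owner (and only the job owner) on the Phi,
        # matching the host uid so NFS file ownership lines up.
        # BusyBox adduser: -H = no home directory, -D = no password.
        run(["ssh", mic, "adduser", "-H", "-D", "-u", str(uid), user])

        # NFS-export scratch from the host to this Phi, then mount it
        # at the same path on the Phi side.
        run(["exportfs", "-o", "rw,no_root_squash",
             "{0}:{1}".format(mic, SCRATCH)])
        run(["ssh", mic, "mkdir", "-p", SCRATCH])
        run(["ssh", mic, "mount", "-t", "nfs",
             "{0}:{1}".format(HOST, SCRATCH), SCRATCH])

if __name__ == "__main__":
    main()
```

A matching epilogue would undo the exports and reboot the Phis to return them to a clean state for the next job.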
Going Forward

• New systems
  – Beacon Phase 2 (just accepted)
  – Darter (Cray XC30, just received and accepted)
  – Hopefully more in the future...
• New architectures make for interesting challenges with respect to allocations and accounting
  – With GPUs and MICs becoming more commonplace, the notion of a “CPU-hour” or “core-hour” is even less meaningful than it was before.
  – Should the new accounting unit be the “node-hour”? (See the toy comparison below.)
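
As a toy illustration, here is a Python comparison of the two charging units on a node shaped like Beacon Phase 2 (16 host cores + 4 Xeon Phis, per the table above). The charging policy itself is hypothetical, not an actual NICS accounting rule.

```python
# Toy comparison of core-hour vs. node-hour charging on a heterogeneous
# node shaped like Beacon Phase 2 (2x 8-core Sandy Bridge + 4 Xeon Phis).
# A hypothetical illustration, not an actual NICS accounting policy.

CPU_CORES_PER_NODE = 16
PHIS_PER_NODE = 4    # ignored by core-hour charging, which is the problem

def core_hours(nodes, hours):
    # Counts only host cores: a job saturating 4 Phis per node is
    # charged exactly the same as one that never touches them.
    return nodes * CPU_CORES_PER_NODE * hours

def node_hours(nodes, hours):
    # Charges for the whole node, however its cores and accelerators
    # are divided up, which is what gets dedicated to the job anyway.
    return nodes * hours

if __name__ == "__main__":
    print(core_hours(8, 2.0))   # 256.0 core-hours
    print(node_hours(8, 2.0))   # 16.0 node-hours
```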
• Growing gap between capability/hero users and capacity/canned-code users needs to be addressed somehow
