Design Considerations, Installation, and Commissioning of the RedRaider Cluster at the Texas Tech University High Performance Computing Center

Alan F. Sill, PhD
Managing Director, High Performance Computing Center, Texas Tech University
On behalf of the TTU IT Division and HPCC Staff
AMD HPC User Forum — September 15-17, 2020
Outline of this talk
HPCC Staff and Students
Previous clusters
• History, performance, usage patterns, and experience
Motivation for Upgrades
• Compute Capacity Goals
• Related Considerations
Installation and Benchmarks
Conclusions and Q&A
HPCC Staff and Students (Fall 2020)
These people provide and support the TTU HPCC resources.
Staff members:
Alan Sill, PhD — Managing Director
Eric Rees, PhD — Assistant Managing Director
Chris Turner, PhD — Research Associate
Tom Brown, PhD — Research Associate
Amanda McConnell, BA — Administrative Business Assistant
Amy Wang, MSc — Programmer Analyst IV (Lead System Administrator)
Nandini Ramanathan, MSc — Programmer Analyst III
Sowmith Lakki-Reddy, MSc — System Administrator III
Graduate students:
Misha Ahmadian, Graduate Research Assistant
Undergraduate students:
Arthur Jones, Student Assistant
Nhi Nguyen, Student Assistant
Travis Turner, Student Assistant
Quanah I cluster (Commissioned March 2017)
4 racks, 243 nodes, 36 cores/node, 8,748 total cores (Intel Broadwell)
100 Gbps non-blocking Omni-Path fabric. Benchmarked at 253 TF.
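Since the slide gives both the core count and the HPL result, a rough efficiency check is possible. The sketch below assumes the usual Broadwell figure of 16 double-precision flops per cycle per core at the nominal 2.1 GHz clock (an architectural assumption, not a number from the slides):

```python
# Rough check of Quanah I HPL efficiency (a sketch; 16 DP flops/cycle per
# Broadwell core at the nominal 2.1 GHz clock is an assumption, and AVX
# frequency reduction is ignored).

CORES = 243 * 36            # 8,748 Broadwell cores
GHZ = 2.1                   # nominal clock of the E5-2695 v4
FLOPS_PER_CYCLE = 16        # 2 x 256-bit FMA units, double precision

peak_tf = CORES * GHZ * FLOPS_PER_CYCLE / 1000   # theoretical peak, Tflop/s
hpl_tf = 253                                     # benchmarked value from the slide

print(f"Theoretical peak ~{peak_tf:.0f} TF; HPL efficiency ~{hpl_tf / peak_tf:.0%}")
# -> Theoretical peak ~294 TF; HPL efficiency ~86%
```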
Quanah II Cluster (As Upgraded Nov. 2017)
• 10 racks, 467 Dell™ C6320 nodes
- 36 CPU cores/node Intel Xeon E5-2695
v4 (two 18-core sockets per node)
- 192 GB of RAM per node
- 16,812 worker node cores total
- Compute power: ~1 Tflop/s per node
- Benchmarked at 485 TF
• Operating System:
- CentOS 7.4.1708, 64-bit, Kernel 3.10
• High Speed Fabric:
- Intel™ Omni-Path, 100 Gbps/connection
- Non-blocking fat tree topology
- 12 core switches, 48 ports/switch
- 57.6 Tbit/s core throughput capacity (see the check below)
• Management/Control Network:
- Ethernet, 10 Gbps, sequentially chained switches, 36 ports per switch
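A quick consistency check of the fabric capacity and per-node compute figures quoted above; the per-core flop rate is an architectural assumption for Broadwell, not a number from the slide:

```python
# Consistency checks for the Quanah II figures above (a sketch; 16 DP flops/cycle
# per Broadwell core is an architectural assumption, not a slide figure).

# Core fabric capacity: 12 core switches x 48 ports/switch x 100 Gbps/port
core_capacity_tbps = 12 * 48 * 100 / 1000
print(f"Omni-Path core capacity: {core_capacity_tbps:.1f} Tbit/s")  # 57.6 Tbit/s

# Per-node peak: 36 cores x 2.1 GHz x 16 DP flops/cycle
per_node_tf = 36 * 2.1 * 16 / 1000
print(f"Theoretical per-node peak: ~{per_node_tf:.2f} Tflop/s")     # ~1.21 Tflop/s
# The slide's "~1 Tflop/s per node" and the 485 TF HPL result over 467 nodes
# (~1.04 Tflop/s per node) sit a bit below this theoretical figure, as expected.
```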
Uptime and Utilization - Previous Cluster (Quanah)
(Chart: calendar uptime and utilization over time, spanning the Quanah I era through the Quanah II upgrade.)
Job Size Patterns - Previous Cluster (Quanah)
(Charts: histograms of job counts ("Jobs in Range", log scale from 1 to 100,000) and of queue slots consumed ("Slots Taken in Range", 0 to 600,000), binned by job size: 1, 2-10, 11-100, 101-1000, and 1001+ cores.)
Typical usage pattern for jobs:
• Charts above show most recent month of
job activity
• Large number of small jobs (note log scale)
• Most jobs below 1000 cores
• Not unusual to see requests for several
thousand cores
Typical usage pattern for queue slots:
• Most cores consumed by jobs in the middle
(11-1000 cores/job) range
• Scheduling a job of more than 2000 cores
allocates ~1/8 of the cluster
• Some evidence of users self-limiting job
sizes to avoid long scheduling queue waits
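The binning shown in the charts above is easy to reproduce from scheduler accounting data. A minimal sketch, assuming the per-job core counts have already been extracted from the accounting records (the input list in the example is made up):

```python
# Minimal sketch of the job-size binning behind the charts above. The input
# list of per-job core counts is hypothetical; in practice it would come from
# the scheduler's accounting records.
from collections import Counter

BINS = ["1", "2-10", "11-100", "101-1000", "1001+"]

def bin_label(cores: int) -> str:
    if cores <= 1:
        return "1"
    if cores <= 10:
        return "2-10"
    if cores <= 100:
        return "11-100"
    if cores <= 1000:
        return "101-1000"
    return "1001+"

def summarize(job_core_counts):
    jobs = Counter()
    slots = Counter()
    for cores in job_core_counts:
        label = bin_label(cores)
        jobs[label] += 1          # "Jobs in Range"
        slots[label] += cores     # "Slots Taken in Range"
    return [(b, jobs[b], slots[b]) for b in BINS]

# Example with made-up job sizes:
for label, njobs, nslots in summarize([1, 4, 36, 36, 144, 720, 2048]):
    print(f"{label:>9}: {njobs} jobs, {nslots} slots")
```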
RedRaider Design Goals
1. Add at least 1 petaflops total computing capacity beyond existing Quanah cluster.
2. Fit within existing cooling capacity and recently expanded power limits, which means that approximately 2/3rds of the new power used by the cluster must be removed through direct liquid cooling to stay within room air-cooling limits (see the sketch after this list).
3. Coalesce operation of the existing Quanah and older Ivy Bridge and Community
Cluster nodes with the addition above, to be operated as a single cluster.
4. Streamline storage network utilizing LNet routers to connect all components to
expanded central storage based on 200 Gbps Mellanox HDR fabric.
5. Chosen path: Addition of 240 nodes (30,720 cores) of dual 64-core AMD Rome processors + 20 GPU nodes with 2 Nvidia GPU accelerators per node.
This new cluster will eventually include the new AMD Rome CPU, NVidia GPU, previous Intel Broadwell cluster, and other previous specialty queues operated as Slurm partitions.
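A rough version of the cooling arithmetic behind goal 2, using the approximate RedRaider power figure quoted in the backup slides (~180 kW including LNet routers); the two-thirds split is the design target, not a measurement:

```python
# Back-of-the-envelope cooling split for the RedRaider addition (a sketch using
# the approximate power figure from the backup slides; the 2/3 liquid-cooling
# fraction is the design target stated in goal 2).

total_new_kw = 180          # RedRaider total incl. LNet routers (backup slide)
liquid_fraction = 2 / 3     # share that must be removed by direct liquid cooling

liquid_kw = total_new_kw * liquid_fraction
air_kw = total_new_kw - liquid_kw
print(f"~{liquid_kw:.0f} kW to liquid cooling, ~{air_kw:.0f} kW left to room air")
# -> ~120 kW to liquid cooling, ~60 kW left to room air
```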
RedRaider cluster (Delivered July 2020)
CPUs: 256 physical cores per rack unit; 200 Gbps non-blocking HDR200 Infiniband fabric
GPUs: 2 NVidia V100s per two rack units; 100 Gbps HDR100 Infiniband to HDR200 core
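The 256-cores-per-rack-unit figure follows from the node packaging described later in the deck (Dell C6525 chassis: four dual-socket, 64-core Rome nodes per 2U). A trivial check:

```python
# Density check for the CPU portion (a sketch based on the C6525 packaging
# described later in the deck: 4 nodes per 2U chassis, 2 x 64-core Rome per node).

nodes_per_2u = 4
cores_per_node = 2 * 64
cores_per_rack_unit = nodes_per_2u * cores_per_node / 2
print(f"{cores_per_rack_unit:.0f} physical cores per rack unit")   # 256
```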
Why non-blocking HDR200?
• Previous experience with Quanah and earlier clusters shows that not having to schedule jobs into islands in the fabric keeps scheduling simple and allows a high degree of utilization of the cluster.
• The increase in density places high demands on the fabric in terms of bandwidth per core. This figure of merit is actually lower for the RedRaider Nocona CPUs than for the previous Quanah cluster (see the comparison sketch after this list).
• Simple fat-tree non-blocking arrangements with multiple core-to-leaf links per switch
provide redundancy and resilience in the event of cable or connector failures.
• User community has sometimes asked for jobs of many thousands of cores.
• Compare and contrast to Expanse design: RedRaider uses full non-blocking fabric;
Expanse is non-blocking within racks with modest oversubscription between rack groups.
Overall cost at our scale is not that different.
• Given the density of our racks and relatively small size of the overall Infiniband fabric,
reaching full non-blocking did not add very much to the cost.
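The bandwidth-per-core comparison mentioned in the second bullet, using the link speeds and core counts quoted elsewhere in this deck:

```python
# Bandwidth-per-core comparison referenced above (a sketch using the link
# speeds and core counts quoted elsewhere in this deck).

quanah_gbps_per_core = 100 / 36      # one 100 Gbps Omni-Path link per 36-core Quanah node
nocona_gbps_per_core = 200 / 128     # one 200 Gbps HDR200 link per 128-core Nocona node

print(f"Quanah: {quanah_gbps_per_core:.2f} Gbps/core")   # ~2.78
print(f"Nocona: {nocona_gbps_per_core:.2f} Gbps/core")   # ~1.56
# Despite the faster HDR200 links, bandwidth per core is lower on the Nocona
# nodes, which is the point made in the second bullet above.
```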
RedRaider cluster initial installation
(Photos: front view; back view of the CPU racks.)
40 GPU nodes: Dell R740, 2 GPUs/node, 256 GB main memory/node, air cooled
240 CPU nodes: Dell C6525, 2 Rome 7702s w/ 512 GB memory/node, liquid cooled
RedRaider cluster initial installation - close-ups
(Photos: secondary cooling line installation under the floor; back-view close-up of cooling lines in a CPU rack; interior of an example 1/2-U C6525 CPU worker node.)
RedRaider final installation w/ cold-aisle enclosure
Benchmark (CPUs): 804 Tflops, 240 nodes, 81.4% efficiency
Benchmark (GPUs): 226 Tflops, 20 nodes, 80.6% efficiency
Total cluster performance: 1030 Tflops
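The quoted efficiencies let the implied theoretical peaks be back-computed from the HPL numbers alone; a quick sketch using only this slide's figures:

```python
# Back-computing implied theoretical peaks from the HPL results and quoted
# efficiencies on this slide (a sketch; only the slide's own numbers are used).

cpu_hpl_tf, cpu_eff = 804, 0.814
gpu_hpl_tf, gpu_eff = 226, 0.806

print(f"Implied CPU theoretical peak: ~{cpu_hpl_tf / cpu_eff:.0f} TF")   # ~988 TF
print(f"Implied GPU theoretical peak: ~{gpu_hpl_tf / gpu_eff:.0f} TF")   # ~280 TF
print(f"Total measured:               {cpu_hpl_tf + gpu_hpl_tf} TF")     # 1030 TF
```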
Software Environment (in progress, starting deployment)
Operating System
• CentOS 8.1 *
Cluster Management Software
• OpenHPC 2.0
Infiniband Drivers
• Mellanox OFED 5.0-2.1.8.0 *
Cluster-Wide File System
• Lustre 2.12.5 *
BMC Firmware
• Dell iDRAC 4.10.10.10 *
Image and Node Provisioning
• Warewulf 3.9.0
• rpmbuild 4.14.2
Job Resource Manager (batch scheduler)
• Slurm 20.02.3-1
Package Build Environment
• Spack 0.15.4
Software Deployment Environment
• LMod 8.2.10
Other Conditions and Tools
• Single-instance Slurm DBD and job status area
(investigating shared-mount NVMEoF for job status)
• Dual NFS 2.3.3 (HA mode) for applications
• GNU compilers made available through Spack and LMod. Others (NVidia HPC-X, AOCC) also loadable as alternatives through LMod.
• Cluster will also have Open OnDemand access.
* Had to fall back to previous version in each starred case above to get consistent deployable conditions
Total Compute Capacity versus Fiscal Year: 2011 - 2021 (All Clusters)
(Chart 1: HPCC Total Theoretical Capacity By Cluster (Teraflops), TTU-Owned and Researcher-Owned Systems, FY2011-FY2021. Clusters shown: Campus Grid, Janus, Weland, Hrothgar Westmere, Hrothgar Ivy Bridge, Hrothgar CC, Quanah (public), Quanah HEP, Lonestar 4/5, Antaeus HEP, Nepag, Discfarm, Realtime/Realtime2, RedRaider NoconaCPU (public), RedRaider NoconaCPU (researchers), and RedRaider Matador GPU. Yearly totals: 105.0, 105.1, 109.9, 132.2, 150.6, 152.4, 417.5, 641.0, 629.0, 581.0, and 1536.0 TF.)
(Chart 2: HPCC Total Usable Capacity (Teraflops, 80% of Theoretical Peak), FY2011-FY2021, showing the successive additions of Hrothgar, Ivy Bridge, Community Cluster, Quanah I, Quanah II, and RedRaider, with gradual retirement of Hrothgar Westmere.)
Design goals for the RedRaider cluster:
• Add at least 1 PF of overall compute capacity
• Allow retirement of older Hrothgar cluster
• Merge operation of primary clusters
• Support GPU computing
Practical restrictions:
• < 300 kVA power usage
• Fit within 8 racks of available floor space
• Stay within existing cooling capacity
• Commission by early FY 2021
Questions for this group
We look to this forum to help with the following community topics and issues:
• Application building
• Spack, EasyBuild recipes
• Compatible compilers and libraries
• Benchmarking and Workload Optimization
• Processor Settings
• NUMA and Memory Settings
• I/O and PCIe Settings
• On-the-fly versus fixed permanent choice settings
• Safe conditions for liquid vs. air-cooled operation
Conclusions
We have designed, installed, and commissioned a cluster that delivers the desired
performance level of >1 PF in less than 7 racks, with space left over for future expansion.
The cluster was designed to allow an increase in the average job size for large jobs while
still delivering good performance for small and medium-size jobs, with simple scheduling
due to non-blocking fat tree Infiniband.
Overall, adding a new cluster based on AMD Rome CPUs and NVidia GPUs allowed us to put more than twice as much computing capacity in 2/3rds of the space compared to our existing cluster and stay within our desired power and cooling profile. Since we also retain and still run the previous cluster, the overall capacity has nearly tripled as a result of this addition.
Future expansions are possible.
Backup slides
Primary Cluster Utilization
Hrothgar cluster averaged about 80% utilization before addition of Quanah phase I,
commissioned in April 2017.
Addition of Quanah I roughly doubled the total number of utilized cores to ~16,000, with a
total number of available cores of ~18,000 across all of HPCC.
Addition of Quanah II in November 2017 required decreasing size of Hrothgar to make
everything fit, but still led to over 20,000 cores in regular use. Power limits and campus
research network bandwidth restrictions prevented running all of former Hrothgar.
Power limits in ESB solved with new UPS and generator upgrade in FY 2019. (Campus
research network limits will be addressed in upcoming upgrades to be discussed here.)
Quanah utilization has been extremely good, beginning with Quanah I in early 2017 and
extending on to current state with nearly 95% calendar uptime and near-100% usage.
Former Hrothgar systems exceeded expected end of life (oldest components 8 years old).
Primary HPCC Clusters With RedRaider Expansion
Quanah Cluster - Omni CPU
• 16,812 cores (467 nodes, 36 cores/node, 192 GB/node), Intel Xeon E5-2695 v4 @ 2.1 GHz
• Dell C6300 4-node 2U enclosures
• Benchmarked 486 Teraflop/s (464/467 nodes)
• Omni-Path non-blocking 100 Gbit/s fabric for MPI communication & storage
• 10 Gb/s Ethernet network for cluster management
• Total power drawn by Quanah cluster: ~200 kW

RedRaider Cluster - Nocona CPU
• 30,720 cores (240 nodes, 128 cores/node, 512 GB/node), AMD Rome 7702 @ 2.0 GHz
• Dell C6525 4-node 2U enclosures, direct liquid-cooled CPUs
• Benchmarked 804 Teraflop/s (240 nodes)
• Mellanox non-blocking 200 Gbit/s fabric for MPI communication & storage
• 25 Gb/s Ethernet network for cluster management
• Total power drawn: ~150 kW

RedRaider Cluster - Matador GPU
• 40 NVidia V100, 640 CPU + 25,600 tensor + 204,800 CUDA cores total
• Dell R740 2U host nodes, 2 GPUs/node
• Benchmarked 226 Teraflop/s (20 nodes)
• Mellanox non-blocking 100 Gbit/s fabric for MPI communication & storage
• 25 Gb/s Ethernet network for cluster management
• Total power drawn: ~21 kW
• Total power for RedRaider cluster including LNet routers: ~180 kW
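As a closing cross-check, the per-cluster figures above can be aggregated to show how the combined system compares to the pre-RedRaider state; the sketch below only re-uses numbers from this slide:

```python
# Aggregating the per-cluster figures quoted on this slide (a sketch; all
# numbers are taken directly from the bullets above).

clusters = {
    #  name                 cores   HPL TF   approx. power (kW)
    "Quanah (Omni CPU)":   (16812,  486,     200),
    "RedRaider Nocona":    (30720,  804,     150),
    "RedRaider Matador":   (640,    226,     21),   # CPU cores listed for the GPU partition
}

total_cores = sum(c for c, _, _ in clusters.values())
total_tf = sum(tf for _, tf, _ in clusters.values())
total_kw = sum(kw for _, _, kw in clusters.values())

print(f"Combined: ~{total_cores:,} CPU cores, ~{total_tf} TF benchmarked, "
      f"~{total_kw} kW (before LNet routers and other infrastructure)")
# RedRaider alone (Nocona + Matador + LNet routers) draws ~180 kW per the slide,
# yet more than doubles the benchmarked capacity relative to Quanah's 486 TF.
```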