The Why and How of HPC-Cloud Hybrids
with OpenStack
OpenStack Australia Day Melbourne
June, 2017
Lev Lafayette, HPC Support and Training Officer, University of Melbourne
lev.lafayette@unimelb.edu.au
1.0 Management Layer
1.1 HPC for performance.
High-performance computing (HPC) is any computer system whose architecture allows for
above-average performance. The main use case is the compute cluster: a clear separation
between head node and worker nodes, joined by a high-speed interconnect, acting as a
single system.
1.2 Clouds for flexibility.
The cloud's precursor is virtualised hardware. Cloud VMs always have lower performance than
bare-metal HPC; the question is whether the flexibility is worth the overhead.
1.3 Hybrid HPC/Clouds.
The University of Melbourne model, "the chimera": cloud VMs deployed as HPC nodes. The
Freiburg University model, "the cyborg": HPC nodes deploying cloud VMs.
1.4 Reviewing user preferences and usage.
Users always want more of 'x'; the real issue identified was queue times. Usage indicated a
high proportion of single-node jobs.
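
As an illustration only (a minimal sketch of the measurement idea, not necessarily how the
review was conducted; the date range is an assumption), the share of single-node jobs can be
estimated from Slurm's accounting records:

    import subprocess

    # Query Slurm accounting for all jobs (allocations only, no job steps).
    out = subprocess.run(
        ["sacct", "--allusers", "--allocations", "--noheader", "--parsable2",
         "--format=JobID,NNodes", "--starttime=2017-01-01"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Count how many jobs ran on exactly one node.
    nnodes = [int(line.split("|")[1]) for line in out.splitlines() if line]
    single = sum(1 for n in nnodes if n == 1)
    print(f"{single}/{len(nnodes)} jobs "
          f"({100 * single / len(nnodes):.1f}%) were single-node")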
1.5 Review and Architecture.
The review considered whether UoM needed HPC at all; the resulting architecture uses the
existing NeCTAR Research Cloud, with an expansion of general cloud compute provisioning,
plus a smaller "true HPC" system on bare-metal nodes.
2.0 Physical Layer
2.1 Physical Partitions.
"Real" HPC is a mere c276 cores, 21 GB per core. 2 socket Intel E5-2643 v3 E5-2643,
3.4GHz CPU with 6-core per socket, 192GB memory, 2x 1.2TB SAS drives, 2x 40GbE
network. “Cloud” partitions is almost 400 virtual machines with over 3,000 2.3GHz Haswell
cores with 8GB per core and . There is also a GPU partition with Dual Nvidia Tesla K80s (big
expansion this year), and departmental partitions (water and ashley). Management and login
nodes are VMs as is I/O for transferring data.
2.2 Network.
The system network includes: cloud nodes on Cisco Nexus 10 GbE TCP/IP with 60 usec latency
(mpi-pingpong); bare-metal nodes on Mellanox SN2100 switches running Cumulus Linux, 40 GbE
TCP/IP with 6.85 usec latency, and RDMA over Ethernet with 1.15 usec latency.
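
The latency figures come from an MPI ping-pong benchmark. A minimal sketch of the
measurement idea, assuming mpi4py is available (run across two nodes with, e.g.,
mpirun -np 2 python pingpong.py):

    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    iters = 10000
    buf = bytearray(1)  # 1-byte payload to approximate pure latency

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        # Each iteration is a round trip (two messages): report one-way latency.
        print(f"one-way latency: {elapsed / iters / 2 * 1e6:.2f} usec")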
2.3 Storage.
Mountpoints for home and project directories (/home and /project, for user data and scripts;
NetApp SAS aggregate, 70 TB usable) and the applications directory across all nodes.
Additional mountpoints for VicNode Aspera Shares. The applications directory is currently on
the management node and needs to be decoupled. Bare-metal nodes have /scratch shared
storage for MPI jobs (a Dell R730 with 14x 800 GB mixed-use SSDs providing 8 TB of usable
storage, NFS over RDMA) and /var/local/tmp for single-node jobs (1.6 TB PCIe SSD).
3.0 Operating System and Scheduler Layer
3.1 Red Hat Linux.
A scalable FOSS operating system, high performance, very well suited to research
applications. In the November 2016 Top 500 list of supercomputers worldwide, every single
machine used a "UNIX-like" operating system, and 99.6% used Linux.
3.2 Slurm Workload Manager.
Job schedulers and resource managers run unattended background tasks, expressed as
batch jobs, across the available resources; they allow multicore, multinode, array, dependency,
and interactive submissions. The scheduler provides parameterisation of compute
resources, automatic submission of execution tasks, and a notification system for incidents.
Slurm (originally the Simple Linux Utility for Resource Management), developed by Lawrence
Livermore et al., is FOSS and used by the majority of the world's top systems. It is scalable
and offers many optional plugins, power-saving features, accounting features, etc. The system
is divided into logical partitions which correlate with the hardware partitions.
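
For example, a minimal batch-submission sketch (the partition name and resource requests
are illustrative, not the site's actual configuration):

    import subprocess

    # A single-node, multicore batch job; sbatch also accepts the script on stdin.
    job_script = """#!/bin/bash
    #SBATCH --partition=cloud        # logical partition (assumed name)
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=01:00:00
    #SBATCH --job-name=example

    srun echo "Hello from job $SLURM_JOB_ID on $SLURM_NODELIST"
    """

    result = subprocess.run(["sbatch"], input=job_script,
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())  # e.g. "Submitted batch job 12345"

Dependencies and arrays are expressed the same way (e.g. --dependency=afterok:<jobid>,
--array=1-10).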
3.3 Git, Gerrit, and Puppet.
Version control, paired systems administration, configuration management.
3.4 OpenStack Node Deployment.
Significant use of the Nova (compute) service for provisioning and decommissioning virtual
machines on demand.
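
A minimal sketch of what on-demand provisioning looks like with the OpenStack SDK; the
cloud, image, flavor, and network names here are placeholders, not the site's actual
configuration:

    import openstack

    # Credentials come from a clouds.yaml entry (assumed to be named "nectar").
    conn = openstack.connect(cloud="nectar")

    image = conn.compute.find_image("CentOS 7")          # placeholder image
    flavor = conn.compute.find_flavor("m2.large")        # placeholder flavor
    network = conn.network.find_network("cluster-net")   # placeholder network

    # Provision a compute node...
    server = conn.compute.create_server(
        name="cloud-node-001",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(server.status)  # "ACTIVE" once booted

    # ...and decommission it when no longer needed.
    conn.compute.delete_server(server)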
4.0 Application Layer
4.1 Source Code and EasyBuild.
Building from source code provides better control over security updates, integration,
development, and much better performance. It is absolutely essential for reproducibility in a
research environment. EasyBuild makes source installs easier with build scripts
("easyconfigs") that specify a compilation block (e.g., configure/make, CMake, etc.) and a
toolchain (GCC, Intel, etc.), and that generate environment modules (Lmod). Modulefiles allow
dynamic changes to a user's environment and make it easy to keep multiple versions of a
software application on the system.
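
Easyconfig files are themselves Python syntax. A minimal sketch (the package, version, and
toolchain version here are illustrative, not one of the site's actual builds):

    # example-app-1.0.0-GCC-4.9.2.eb -- an illustrative easyconfig
    easyblock = 'ConfigureMake'      # the configure/make compilation block

    name = 'example-app'
    version = '1.0.0'

    homepage = 'https://example.org'
    description = "An illustrative application built from source."

    toolchain = {'name': 'GCC', 'version': '4.9.2'}  # assumed toolchain version

    source_urls = ['https://example.org/downloads/']
    sources = ['%(name)s-%(version)s.tar.gz']

    moduleclass = 'tools'

Running "eb example-app-1.0.0-GCC-4.9.2.eb --robot" resolves dependencies, builds from
source, and generates the modulefile, after which "module load example-app" adjusts the
user's environment.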
4.2 Compilers, Scripting Languages, and Applications.
The usual range of suspects: Intel and GCC for compilers (and a little bit of PGI); Python, Ruby,
and Perl for scripting languages; OpenMPI wrappers. Major applications include MATLAB,
Gaussian, NAMD, R, OpenFOAM, Octave, etc.
Almost 1,000 applications/versions installed from source, plus packages.
4.3 Containers with Singularity.
A container in a cloud virtual machine on an HPC! Wait, what?
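
Singularity is a container runtime designed to run unprivileged inside batch jobs, so a
containerised workload can run in a Slurm job on a worker node that is itself a cloud VM. A
minimal sketch (the image path and command are placeholders):

    import subprocess

    # Run a command inside a Singularity image from within a job script; the
    # container runs as the submitting user with the usual home directory bound.
    subprocess.run(
        ["singularity", "exec", "/usr/local/containers/analysis.img",
         "python", "analysis.py"],
        check=True,
    )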
5.0 User Layer
5.1 Karaage.
Spartan uses its own LDAP authentication, tied to the university's Security Assertion
Markup Language (SAML) identity provider. Users on Spartan must belong to a project.
Projects must be led by a University of Melbourne researcher (the "Principal Investigator")
and are subject to approval by the Head of Research Compute Services. Participants in a
project can be researchers or research support staff from anywhere. Karaage is a
Django-based application for user, project, and cluster reporting and management.
5.2 Freshdesk.
Helpdesk and ticketing for user support. OMG Users!
5.3 Online Instructions and Training.
Many users (even post-doctoral researchers) require basic training in the Linux command line,
a requisite skill for HPC use. An extensive training programme for researchers is available,
using andragogical methods, including the day-long courses "Introduction to Linux and HPC
Using Spartan", "Linux Shell Scripting for High Performance Computing", and "Parallel
Programming On Spartan".
Documentation is online (GitHub, website, and man pages) and there are plenty of Slurm
examples on the system.
6.0 Future Developments
6.1 Cloudbursting with Azure.
Slurm allows cloudbursting via its power-saving feature; there have been successful
experiments (and bug discovery) within the NeCTAR research cloud.
Azure is about to be added through the same login node. Azure nodes do not mount the
applications directory; necessary data is wrapped for transfer in the job script.
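
A minimal sketch of how the power-saving hooks can burst to a cloud: slurm.conf points
ResumeProgram at a script like this, and Slurm invokes it with a hostlist of nodes to bring up
(the cloud, image, and flavor names are assumptions):

    import subprocess
    import sys

    import openstack

    conn = openstack.connect(cloud="azure")  # assumed clouds.yaml entry

    # Slurm passes a compact hostlist, e.g. "cloud[001-003]"; expand it.
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", sys.argv[1]],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    image = conn.compute.find_image("compute-node")   # placeholder image
    flavor = conn.compute.find_flavor("m2.xlarge")    # placeholder flavor

    # Boot one VM per node name; a matching SuspendProgram deletes them.
    for host in hosts:
        conn.compute.create_server(name=host, image_id=image.id,
                                   flavor_id=flavor.id)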
6.2 GPU Expansion.
Plans for a significant increase in the GPU allocation.
6.3 Test cluster (Thespian).
Everyone has a test environment; some people also have a production and a test
environment.
Test nodes already exist for the Cloud and Physical partitions. The management and login
nodes will be replicated.
6.4 New Architectures.
New architectures can be added to the system with a separate build node (another VM) and
software built for that architecture; an entirely new system is not needed.