©Bull 2012
Overlay HPC Information
1
Christian Kniep
R&D HPC Engineer
2014-06-25
©Bull 2014
About Me
2
‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D
- @CQnib
- http://blog.qnib.org
- https://github.com/ChristianKniep
My ISC History - Motivation
3
My ISC History - Description
4
HPC Software Stack (rough estimate)
5
Hardware: HW-sensors/-errors
OS: Kernel, Userland tools
MiddleWare: MPI, ISV-libs
Software: End user application
Excel: KPI, SLA
Services: Storage, Job Scheduler
[Figure labels: Mgmt, SysOps, User, PowerUser/ISV, ISV, HW]
HPC Software Stack (goal)
6
Hardware: HW-sensors/-errors
OS: Kernel, Userland tools
MiddleWare: MPI, ISV-libs
Software: End user application
Excel: KPI, SLA
Services: Storage, Job Scheduler
Log/Events
Perf
QNIBTerminal - History
7
QNIB
‣ Cluster of n*1000+ IB nodes
• Hard to debug
• No useful tools in sight
• Created my own - Graphite-Update in late 2013
Achieved HPC Software Stack
10
Hardware: IB-sensors/-errors
OS: Kernel, Userland tools
MiddleWare: MPI, ISV-libs
Software: End user application
Excel: KPI, SLA
Services: Storage, Job Scheduler
Log/Events
Perf
QNIBTerminal - Implementation
11
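QNIBTerminal - blog.qnib.org
12
[Figure: container layout, shown as container (service), with the group labels Log/Events, Services, Performance and Compute]
‣ haproxy (haproxy)
‣ dns (helixdns)
‣ elk (kibana, logstash, elasticsearch)
‣ etcd
‣ carbon (carbon)
‣ graphite-web (graphite-web)
‣ graphite-api (graphite-api)
‣ grafana (grafana)
‣ slurmctld (slurmctld)
‣ compute0 (slurmd)
‣ compute<N> (slurmd)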
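The performance side of this layout is built around carbon, which accepts metrics over Graphite's plaintext protocol: one `path value timestamp` line per sample, by default on TCP port 2003. The following is a minimal sketch of how a compute node could report a value, not the code used in QNIBTerminal. The metric path `cluster.compute0.load` and the host name `carbon` are assumptions made for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: '<path> <value> <timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="carbon", port=2003, timeout=2.0):
    """Send one plaintext line to a carbon daemon (host name is an assumption)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; this only prints, so no carbon daemon is needed.
    print(format_metric("cluster.compute0.load", 0.42, 1403654400), end="")
```

Keeping the formatting separate from the socket call makes the line format easy to check without a running carbon instance.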
DEMONSTRATION
13
Future Work
14
More Services
15
‣ Improve workflow for log events
‣ Nagios(-like) node is missing
‣ Cluster-FileSystem
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ Using InfiniBand for communication traffic
Graph Representation
16
‣ Graph inventory needed
• Hierarchical view is not enough
• GraphDB seems to be a good idea
[Figure: graph with the nodes comp0, comp1, comp2, ibsw0, ibsw2, eth1, eth10, ldap12, lustre0]
RETURN node=comp* WHERE
ROUTE TO lustre_service INCLUDES ibsw2
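The pseudo-query above can be approximated without a graph database. The sketch below is only an illustration under stated assumptions: the slide names the nodes but not the links between them, so the `EDGES` list is hypothetical; `lustre0` stands in for `lustre_service`; and a "route" is taken to be the shortest path found by breadth-first search, which need not match real InfiniBand routing.

```python
from collections import deque

# Hypothetical topology: node names come from the slide, the links do not.
EDGES = [
    ("comp0", "ibsw0"), ("comp1", "ibsw2"), ("comp2", "ibsw2"),
    ("ibsw0", "lustre0"), ("ibsw2", "lustre0"),
    ("comp0", "eth1"), ("eth1", "eth10"), ("eth10", "ldap12"),
]


def build_graph(edges):
    """Turn an undirected edge list into an adjacency mapping."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_route(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def nodes_routed_via(graph, prefix, target, via):
    """Nodes named `prefix*` whose route to `target` includes `via`."""
    hits = []
    for node in sorted(graph):
        if not node.startswith(prefix):
            continue
        route = shortest_route(graph, node, target)
        if route and via in route:
            hits.append(node)
    return hits


GRAPH = build_graph(EDGES)
print(nodes_routed_via(GRAPH, "comp", "lustre0", "ibsw2"))
```

With the assumed links, comp1 and comp2 reach lustre0 through ibsw2, while comp0 goes through ibsw0. A hierarchical inventory cannot answer this kind of question directly, which is the point of the slide.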
Conclusion
18
‣ n*100 containers are easy (50 on my laptop)
• Running a 300 node cluster stack
‣ n*1000 containers through clustering
‣ Complete toolchain could be automated
• Testing
• Verification
• Q&A
‣ Showcase
• Showing a customer their (to-be) software stack
• Convince the SysOps team 'they have nothing to fear'
‣ Training
• New SysOps could start on a virtual cluster
• 'Strangulate' a node to replay an error
Log AND Performance Management
20
‣ Metrics without logs are useless!
• and the other way around…
• overlapping is king
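One way to picture the overlap of logs and metrics is to attach to each log event the metric sample closest to it in time. The sketch below is only an illustration of that idea, not the mechanism used in the talk. The sample values, the log messages and the 60 second `max_gap` are invented for the example.

```python
from bisect import bisect_left


def overlay(metrics, events, max_gap=60):
    """Attach to each log event the metric sample closest in time.

    metrics: list of (timestamp, value), sorted by timestamp
    events:  list of (timestamp, message)
    max_gap: largest accepted distance in seconds (an assumption, tune as needed)
    """
    stamps = [ts for ts, _ in metrics]
    merged = []
    for ts, message in events:
        idx = bisect_left(stamps, ts)
        # Candidates are the samples just before and just after the event.
        candidates = [i for i in (idx - 1, idx) if 0 <= i < len(metrics)]
        best = min(candidates, key=lambda i: abs(stamps[i] - ts), default=None)
        if best is not None and abs(stamps[best] - ts) <= max_gap:
            merged.append((ts, message, metrics[best][1]))
        else:
            merged.append((ts, message, None))
    return merged


# Invented sample data: an error counter sampled every 60 s and two log lines.
METRICS = [(0, 10.0), (60, 12.0), (120, 95.0), (180, 11.0)]
EVENTS = [(118, "link down on port 7"), (400, "job finished")]

for row in overlay(METRICS, EVENTS):
    print(row)
```

The first event lands next to the spike at t=120, so the log line explains the metric and the metric quantifies the log line. The second event has no sample within `max_gap` and is reported without a value.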
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Overlay HPC Information

  • 1. Overlay HPC Information. Christian Kniep, R&D HPC Engineer, 2014-06-25. ©Bull 2012
  • 2. About Me
      ‣ 10y+ SysAdmin
      ‣ 8y+ SysOps
      ‣ B.Sc. (2008-2011)
      ‣ 6y+ DevOps
      ‣ 1y+ R&D
      - @CQnib
      - http://blog.qnib.org
      - https://github.com/ChristianKniep
  • 3. My ISC History - Motivation
  • 4. My ISC History - Description
  • 5. HPC Software Stack (rough estimate)
      ‣ Hardware: HW-sensors/-errors
      ‣ OS: Kernel, Userland tools
      ‣ MiddleWare: MPI, ISV-libs
      ‣ Software: End user application
      ‣ Excel: KPI, SLA
      ‣ Services: Storage, Job Scheduler
      ‣ Labels: MgmtSysOps, SysOpsMgmt, User, PowerUser/ISV, ISVMgmt, HW
  • 6. HPC Software Stack (goal)
      ‣ Hardware: HW-sensors/-errors
      ‣ OS: Kernel, Userland tools
      ‣ MiddleWare: MPI, ISV-libs
      ‣ Software: End user application
      ‣ Excel: KPI, SLA
      ‣ Services: Storage, Job Scheduler
      ‣ Overlays: Log/Events, Perf
  • 8. QNIB
      ‣ Cluster of n*1000+ IB nodes
          • Hard to debug
          • No useful tools in sight
          • Created my own
  • 9. QNIB
      ‣ Cluster of n*1000+ IB nodes
          • Hard to debug
          • No useful tools in sight
          • Created my own - Graphite-Update in late 2013
  • 10. Achieved HPC Software Stack
      ‣ Hardware: IB-sensors/-errors
      ‣ OS: Kernel, Userland tools
      ‣ MiddleWare: MPI, ISV-libs
      ‣ Software: End user application
      ‣ Excel: KPI, SLA
      ‣ Services: Storage, Job Scheduler
      ‣ Overlays: Log/Events, Perf
  • 11. QNIBTerminal - Implementation
  • 12. QNIBTerminal - blog.qnib.org
      ‣ Groups: Log/Events, Services, Performance, Compute
      ‣ Containers (service): haproxy (haproxy), dns (helixdns), elk (kibana, logstash, elasticsearch), etcd, carbon (carbon), graphite-web (graphite-web), graphite-api (graphite-api), grafana (grafana), slurmctld (slurmctld), compute0 (slurmd), compute<N> (slurmd)
  • 15. More Services
      ‣ Improve work-flow for log-events
      ‣ Nagios(-like) node is missing
      ‣ Cluster-FileSystem
      ‣ LDAP
      ‣ Additional dashboards
      ‣ Inventory
      ‣ using InfiniBand for communication traffic
  • 16. Graph Representation
      ‣ Graph inventory needed
          • Hierarchical view is not enough
  • 17. Graph Representation
      ‣ Graph inventory needed
          • Hierarchical view is not enough
          • GraphDB seems to be a good idea
      ‣ Nodes shown: comp0, comp1, comp2, ibsw0, eth1, eth10, ldap12, lustre0, ibsw2
      ‣ Example query: RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
  • 19. Conclusion
      ‣ n*100 containers are easy (50 on my laptop)
          • Running a 300 node cluster stack
      ‣ n*1000 containers through clustering
      ‣ complete toolchain could be automated
          • Testing
          • Verification
          • Q&A
      ‣ Showcase
          • Showing a customer his (to-be) software stack
          • Convince the SysOps-Team 'they have nothing to fear'
      ‣ Training
          • New SysOps could start on virtual cluster
          • 'Strangulate' node to replay an error.
  • 20. Log AND Performance Management
      ‣ Metrics w/o Logs are useless!
  • 21. Log AND Performance Management
      ‣ Metrics w/o Logs are useless!
          • and the other way around…
  • 22. Log AND Performance Management
      ‣ Metrics w/o Logs are useless!
          • and the other way around…
          • overlapping is king
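
The query on slide 17 (RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2) can be sketched in plain Python to show what a graph inventory would have to answer. This is only an illustration under stated assumptions: the slide names the nodes but not how they are wired, so the links below are hypothetical; `lustre0` stands in for `lustre_service`; and "route" is taken to mean the shortest path by hop count, which is not necessarily how an InfiniBand fabric routes traffic.

```python
from collections import deque

# Hypothetical wiring: the slide lists node names only, not their links.
LINKS = [
    ("comp0", "ibsw0"),
    ("comp1", "ibsw0"),
    ("comp2", "ibsw2"),
    ("ibsw0", "lustre0"),
    ("ibsw2", "lustre0"),
]


def build_adjacency(links):
    """Turn a list of undirected links into an adjacency mapping."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_route(adjacency, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            route = []
            while node is not None:
                route.append(node)
                node = previous[node]
            return route[::-1]
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def nodes_routed_via(adjacency, prefix, target, via):
    """Nodes named prefix* whose (assumed shortest) route to target includes via."""
    result = []
    for node in sorted(adjacency):
        if not node.startswith(prefix):
            continue
        route = shortest_route(adjacency, node, target)
        if route is not None and via in route:
            result.append(node)
    return result


ADJACENCY = build_adjacency(LINKS)
affected = nodes_routed_via(ADJACENCY, "comp", "lustre0", "ibsw2")
print(affected)
```

A hierarchical inventory (rack, chassis, node) cannot answer this kind of question directly, because the answer depends on the links between nodes rather than on where they are mounted.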
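
The "overlapping is king" point from slides 20 to 22 can be illustrated with a small sketch that annotates metric samples with the log events that occurred close to them in time. It is not how QNIBTerminal implements the overlay; the sample data, the threshold, and the function names are hypothetical and only show the idea.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample data: (timestamp in seconds, value) and (timestamp, message).
METRICS = [(0, 0.2), (10, 0.3), (20, 4.8), (30, 5.1), (40, 0.4)]
LOGS = [(18, "ib0: link down"), (33, "ib0: link up")]  # must be sorted by timestamp


def events_near(logs, timestamp, window):
    """Return the log entries within +/- window seconds of timestamp."""
    times = [t for t, _ in logs]
    lo = bisect_left(times, timestamp - window)
    hi = bisect_right(times, timestamp + window)
    return logs[lo:hi]


def overlay(metrics, logs, threshold, window):
    """Attach nearby log messages to every metric sample at or above threshold."""
    rows = []
    for timestamp, value in metrics:
        if value >= threshold:
            messages = [msg for _, msg in events_near(logs, timestamp, window)]
            rows.append((timestamp, value, messages))
    return rows


annotated = overlay(METRICS, LOGS, threshold=1.0, window=5)
for row in annotated:
    print(row)
```

On their own, the two high samples only say that something changed; the log lines next to them suggest why, and the metric values in turn show how much the logged event mattered.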