SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
©Bull 2012
Overlay HPC Information
1
Christian Kniep
R&D HPC Engineer
2014-06-25
©Bull 2014
About Me
2
‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D
- @CQnib
- http://blog.qnib.org
- https://github.com/ChristianKniep
©Bull 2014
My ISC History - Motivation
3
©Bull 2014
My ISC History - Description
4
©Bull 2014
HPC Software Stack (rough estimate)
5
Hardware:! ! HW-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
MgmtSysOps
SysOpsMgmt
User
PowerUser/ISV
ISVMgmt
Services:! ! Storage, Job Scheduler
HW
©Bull 2014
HPC Software Stack (goal)
6
Hardware:! ! HW-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
Services:! ! Storage, Job Scheduler
Log/Events
Perf
©Bull 2012
QNIBTerminal - History
7
!
!
!
• Created my own
!
!
• No useful tools in sight
©Bull 2014
QNIB
8
‣ Cluster of n*1000+ IB nodes
• Hard to debug
!
!
!
• Created my own - Graphite-Update in late 2013
!
!
• No useful tools in sight
©Bull 2014
QNIB
9
‣ Cluster of n*1000+ IB nodes
• Hard to debug
©Bull 2014
Achieved HPC Software Stack
10
Hardware:! ! IB-sensors/-errors
OS:! ! ! Kernel, Userland tools
MiddleWare:! MPI, ISV-libs
Software:! ! End user application
Excel:!! ! KPI, SLA
Services:! ! Storage, Job Scheduler
Log/Events
Perf
©Bull 2012
QNIBTerminal - Implementation
11
©Bull 2014
QNIBTerminal -blog.qnib.org
12
haproxy haproxy
dns
helixdns
elk
kibana
logstash
etcd
carboncarbon
graphite-webgraphite-web
graphite-apigraphite-api
grafanagrafana
slurmctldslurmctld
compute0slurmd
compute<N>slurmd
Log/Events
Services Performance
Compute
elasticsearch
©Bull 2012
DEMONSTRATION
13
©Bull 2012
Future Work
14
©Bull 2014
More Services
15
‣ Improve work-flow for log-events
‣ Nagios(-like) node is missing
‣ Cluster-FileSystem
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ using InfiniBand for communication traffic
©Bull 2014
Graph Representation
16
‣ Graph inventory needed
• Hierarchical view is not enough
©Bull 2014
Graph Representation
17
!
!
• GraphDB seems to be a good idea
comp0 comp1 comp2
ibsw0
eth1
eth10
ldap12
lustre0
ibsw2
‣ Graph inventory needed
• Hierarchical view is not enough
RETURN node=comp* WHERE 
ROUTE TO lustre_service INCLUDES ibsw2
©Bull 2012
Conclusion
18
!
!
!
!
!
!
!
!
!
!
‣ Training
• New SysOps could start on virtual cluster
• ‚Strangulate’ node to replay an error.
!
!
!
!
!
!
!
‣ Showcase
• Showing a customer his (to-be) software stack
• Convince the SysOps-Team ‚they have nothing to fear‘
!
!
!
‣ complete toolchain could be automated
• Testing
• Verification
• Q&A
!
!
‣ n*1000 containers through clustering
©Bull 2014
Conclusion
19
‣ n*100 of containers are easy (50 on my laptop)
• Running a 300 node cluster stack
©Bull 2014
Log AND Performance Management
20
‣ Metric w/o Logs are useless!
©Bull 2014
Log AND Performance Management
21
‣ Metric w/o Logs are useless!!
• and the other way around…
©Bull 2014
Log AND Performance Management
22
!
!
• overlapping is king
‣ Metric w/o Logs are useless!!
• and the other way around…
©Bull 2012 23

Contenu connexe

Tendances

mabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platformmabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud PlatformJoseph Lust
 
Replacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with GulpReplacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with GulpTomasz Bak
 
Taipei City Bike Prediction
Taipei City Bike PredictionTaipei City Bike Prediction
Taipei City Bike Predictionyaoch29
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scalemabl
 
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...COMAQA.BY
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Chris Fregly
 
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018Carlos Mendible
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan Misirli
 
Transition to Infrastructure as Code
Transition to Infrastructure as CodeTransition to Infrastructure as Code
Transition to Infrastructure as CodeWise Engineering
 
Making Angular2 lean and Fast
Making Angular2 lean and FastMaking Angular2 lean and Fast
Making Angular2 lean and FastVinci Rufus
 
Continuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryContinuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryPlatform CF
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...Itai Yaffe
 
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作ったAlieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作ったRyota Suenaga
 
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP InnovationsArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP InnovationsSSP Innovations
 

Tendances (20)

mabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platformmabl's Machine Learning Implementation on Google Cloud Platform
mabl's Machine Learning Implementation on Google Cloud Platform
 
Replacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with GulpReplacing Rails asset pipeline with Gulp
Replacing Rails asset pipeline with Gulp
 
Taipei City Bike Prediction
Taipei City Bike PredictionTaipei City Bike Prediction
Taipei City Bike Prediction
 
End to-end test automation at scale
End to-end test automation at scaleEnd to-end test automation at scale
End to-end test automation at scale
 
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
Дмитрий Лемешко. Comaqa Spring 2018. Continuous mobile automation in build pi...
 
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
Advanced Spark and Tensorflow Meetup - London - Nov 15, 2016 - Deploy Spark M...
 
Go
GoGo
Go
 
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018The State of the Developer Ecosystem - .NET Conf Barcelona 2018
The State of the Developer Ecosystem - .NET Conf Barcelona 2018
 
Balkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusionBalkan - data eng meetup - data fusion
Balkan - data eng meetup - data fusion
 
Transition to Infrastructure as Code
Transition to Infrastructure as CodeTransition to Infrastructure as Code
Transition to Infrastructure as Code
 
Making Angular2 lean and Fast
Making Angular2 lean and FastMaking Angular2 lean and Fast
Making Angular2 lean and Fast
 
Front end architecture patterns
Front end architecture patternsFront end architecture patterns
Front end architecture patterns
 
Continuous Delivery with Cloud Foundry
Continuous Delivery with Cloud FoundryContinuous Delivery with Cloud Foundry
Continuous Delivery with Cloud Foundry
 
GitLab Product update July 25
GitLab Product update July 25GitLab Product update July 25
GitLab Product update July 25
 
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
From AWS Data Pipeline to Airflow - managing data pipelines in Nielsen Market...
 
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作ったAlieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
Alieaters#8 アリババクラウドで
サーバレス頑張って
モバイルアプリを作った
 
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP InnovationsArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
ArcGIS 10.1 Upgrade Cowlitz PUD & SSP Innovations
 
Big datalab
Big datalabBig datalab
Big datalab
 
Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014Released WEBridge 4 SAP R 3 on 9/9 of 2014
Released WEBridge 4 SAP R 3 on 9/9 of 2014
 
Serverless
ServerlessServerless
Serverless
 

Similaire à Overlay HPC Information

Deploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPHDeploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPHDamien Cavaillès
 
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...NETWAYS
 
Building Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSIBuilding Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSILuke Han
 
SkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid CloudSkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid CloudVlad Kuusk
 
Hadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fHadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fSATOSHI TAGOMORI
 
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and VirtualizationSAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and VirtualizationSAP Analytics
 
Why use Gitlab
Why use GitlabWhy use Gitlab
Why use Gitlababenyeung1
 
Why and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company GrowsWhy and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company GrowsNGINX, Inc.
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Josh Ghiloni
 
To be or not to be serverless
To be or not to be serverlessTo be or not to be serverless
To be or not to be serverlessSteve Houël
 
Aws, play! couch db scaling soa in the cloud
Aws, play! couch db  scaling soa in the cloudAws, play! couch db  scaling soa in the cloud
Aws, play! couch db scaling soa in the cloudChristophe Marchal
 
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginnersTobias Koprowski
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieDataWorks Summit
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Mats Uddenfeldt
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computingThe BioTeam Inc.
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in PracticeC4Media
 

Similaire à Overlay HPC Information (20)

Deploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPHDeploying Machine Learning in production without servers - #serverlessCPH
Deploying Machine Learning in production without servers - #serverlessCPH
 
Hive et Hadoop Usage chez Square
Hive et Hadoop Usage chez SquareHive et Hadoop Usage chez Square
Hive et Hadoop Usage chez Square
 
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...OSDC 2014: Christian Kniep -  Understand your data center by overlaying multi...
OSDC 2014: Christian Kniep - Understand your data center by overlaying multi...
 
Building Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSIBuilding Enterprise OLAP on Hadoop for FSI
Building Enterprise OLAP on Hadoop for FSI
 
SkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid CloudSkyBase - a Devops Platform for Hybrid Cloud
SkyBase - a Devops Platform for Hybrid Cloud
 
Hadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11fHadoop and subsystems in livedoor #Hcj11f
Hadoop and subsystems in livedoor #Hcj11f
 
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and VirtualizationSAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
SAP #BOBJ #BI 4.1 Upgrade Webcast Series 3: BI 4.1 Sizing and Virtualization
 
Why use Gitlab
Why use GitlabWhy use Gitlab
Why use Gitlab
 
Why and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company GrowsWhy and How to Run Your Own Gitlab Runners as Your Company Grows
Why and How to Run Your Own Gitlab Runners as Your Company Grows
 
Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016Denver Cloud Foundry Meetup - February 2016
Denver Cloud Foundry Meetup - February 2016
 
To be or not to be serverless
To be or not to be serverlessTo be or not to be serverless
To be or not to be serverless
 
Aws, play! couch db scaling soa in the cloud
Aws, play! couch db  scaling soa in the cloudAws, play! couch db  scaling soa in the cloud
Aws, play! couch db scaling soa in the cloud
 
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginnersKoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
KoprowskiT_SQLRelay2014#2_Southampton_MaintenancePlansForBeginners
 
re:Invent re:cap 2020
re:Invent re:cap 2020re:Invent re:cap 2020
re:Invent re:cap 2020
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and Oozie
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Berlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWSBerlin AWS meetup: here.com on AWS
Berlin AWS meetup: here.com on AWS
 
Scaling systems for research computing
Scaling systems for research computingScaling systems for research computing
Scaling systems for research computing
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 

Plus de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Plus de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Dernier

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Dernier (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Overlay HPC Information

  • 1. ©Bull 2012 Overlay HPC Information 1 Christian Kniep R&D HPC Engineer 2014-06-25
  • 2. ©Bull 2014 About Me 2 ‣ 10y+ SysAdmin ‣ 8y+ SysOps ‣ B.Sc. (2008-2011) ‣ 6y+ DevOps ‣ 1y+ R&D - @CQnib - http://blog.qnib.org - https://github.com/ChristianKniep
  • 3. ©Bull 2014 My ISC History - Motivation 3
  • 4. ©Bull 2014 My ISC History - Description 4
  • 5. ©Bull 2014 HPC Software Stack (rough estimate) 5 Hardware:! ! HW-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA MgmtSysOps SysOpsMgmt User PowerUser/ISV ISVMgmt Services:! ! Storage, Job Scheduler HW
  • 6. ©Bull 2014 HPC Software Stack (goal) 6 Hardware:! ! HW-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA Services:! ! Storage, Job Scheduler Log/Events Perf
  • 8. ! ! ! • Created my own ! ! • No useful tools in sight ©Bull 2014 QNIB 8 ‣ Cluster of n*1000+ IB nodes • Hard to debug
  • 9. ! ! ! • Created my own - Graphite-Update in late 2013 ! ! • No useful tools in sight ©Bull 2014 QNIB 9 ‣ Cluster of n*1000+ IB nodes • Hard to debug
  • 10. ©Bull 2014 Achieved HPC Software Stack 10 Hardware:! ! IB-sensors/-errors OS:! ! ! Kernel, Userland tools MiddleWare:! MPI, ISV-libs Software:! ! End user application Excel:!! ! KPI, SLA Services:! ! Storage, Job Scheduler Log/Events Perf
  • 11. ©Bull 2012 QNIBTerminal - Implementation 11
  • 12. ©Bull 2014 QNIBTerminal -blog.qnib.org 12 haproxy haproxy dns helixdns elk kibana logstash etcd carboncarbon graphite-webgraphite-web graphite-apigraphite-api grafanagrafana slurmctldslurmctld compute0slurmd compute<N>slurmd Log/Events Services Performance Compute elasticsearch
  • 15. ©Bull 2014 More Services 15 ‣ Improve work-flow for log-events ‣ Nagios(-like) node is missing ‣ Cluster-FileSystem ‣ LDAP ‣ Additional dashboards ‣ Inventory ‣ using InfiniBand for communication traffic
  • 16. ©Bull 2014 Graph Representation 16 ‣ Graph inventory needed • Hierarchical view is not enough
  • 17. ©Bull 2014 Graph Representation 17 ! ! • GraphDB seems to be a good idea comp0 comp1 comp2 ibsw0 eth1 eth10 ldap12 lustre0 ibsw2 ‣ Graph inventory needed • Hierarchical view is not enough RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
  • 19. ! ! ! ! ! ! ! ! ! ! ‣ Training • New SysOps could start on virtual cluster • ‚Strangulate’ node to replay an error. ! ! ! ! ! ! ! ‣ Showcase • Showing a customer his (to-be) software stack • Convince the SysOps-Team ‚they have nothing to fear‘ ! ! ! ‣ complete toolchain could be automated • Testing • Verification • Q&A ! ! ‣ n*1000 containers through clustering ©Bull 2014 Conclusion 19 ‣ n*100 of containers are easy (50 on my laptop) • Running a 300 node cluster stack
  • 20. ©Bull 2014 Log AND Performance Management 20 ‣ Metric w/o Logs are useless!
  • 21. ©Bull 2014 Log AND Performance Management 21 ‣ Metric w/o Logs are useless!! • and the other way around…
  • 22. ©Bull 2014 Log AND Performance Management 22 ! ! • overlapping is king ‣ Metric w/o Logs are useless!! • and the other way around…