Andrea Bartolini, Prof
University of Bologna – DEI, Italy
Combining out-of-band monitoring with AI and big data
for datacenter automation in OpenPOWER
Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
Performance
Analysis
Scalable Monitoring
Framework
Machine
Learning
Data
Visualization
Resources
Management
Energy
efficiency
Job
Scheduling
Heterogeneous
Sensors
Common Interface
CRAC
PDU
CLUSTER
Reactive and Proactive
Feedbacks
ENV.
A New Trend: Datacentre Automation
Usage Scenarios
Fine Grain Power and
Performance Measurements:
- Verify and classify node performance
- In spec / out of spec behaviour
- Misconfiguration
- Aging and wear out
- Detect security hazards
- Predictive maintenance
Coarse grain
Fine grain
CPU
CPU
ACC ACC
Node
DIMM DIMM DIMM
Performance Counters:
- Node components
- Microarchitectural events
AI
Datacenter Automation Design and Bottlenecks
Centralized
Monitoring
&
Analytics
Edge Monitoring
& Analytics
Low
Data-Rate
Huge
Data-Rate
Centralized
Monitoring
&
Analytics
Bottlenecks:
• Network BW
• Storage
• SW Overhead
Infrastructure Sensors (e.g., CRAC, PDU, Etc.)
Node-1 Node-n Node-1 Node-n
Infrastructure Sensors (e.g., CRAC, PDU, Etc.)
D.A.V.I.D.E. SUPERCOMPUTER
(Development of an Added Value Infrastructure Designed in Europe)
OCP form-factor compute node
based on IBM Minsky
2xIB EDR
LIQUID COOLING
4x Tesla P100 SXM2
University of Bologna / ETH Zurich
DiG: FINE GRAIN POWER AND
PERFORMANCE MONITORING & ANALYTICS
BusBar
2 x POWER8 with NVLink
• Power measuring block, placed between the power supply unit (PSU) and the
DC-DC converter, provides overall node power consumption
• Embedded system (BeagleBone Black) reads node performance metrics
through pass-through commands
• Scalable interface to the data analysis point
through the MQTT protocol
• Embedded system powerful enough to
support edge analytics & inference
DiG: Out-of-band Power & Performance
Monitoring
Architecture
• Sub-Watt precision
• Power monitoring sampling rate: 50 kS/s (T = 20 µs)
• Performance monitoring: 242 Amester metrics every 10 s
• Time synchronized (±3σ)
• NTP < 35us
• PTP < 1.3us
State-of-the-art systems (HDEEM and PowerInsight)
• Max. 1 kS/s sampling rate (1 ms sampling period)
• Data usable only offline (no possibility of real-time computing)
Framework Fsmax [kHz]
E4 PPBB 50
HDEEM 1
PowerInsight 1
Nice, but how do I use it?
Application 1
Application 2
Spectral signature of an
application!
Real-time Frequency analysis on power supply and more…a live oscilloscope
• For instance, using the FFT we plot the power spectral density of the power
benchmark of two applications, and we can distinguish them by the harmonics
present in each of the signals
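The spectral-signature idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: it uses NumPy, two synthetic power traces stand in for real applications, the 50 kS/s rate is taken from the monitoring slides, and the function names are hypothetical.

```python
import numpy as np

FS = 50_000  # sampling rate in S/s (the 50 kS/s figure quoted in the slides)


def power_spectral_density(trace, fs=FS):
    """Return (frequencies, PSD) of a power trace using a windowed periodogram."""
    trace = np.asarray(trace, dtype=float)
    trace = trace - trace.mean()        # remove the DC component (average power)
    window = np.hanning(trace.size)     # taper the trace to limit spectral leakage
    spectrum = np.fft.rfft(trace * window)
    # Periodogram estimate; the constant scale is irrelevant when comparing peaks.
    psd = np.abs(spectrum) ** 2 / (fs * np.sum(window ** 2))
    freqs = np.fft.rfftfreq(trace.size, d=1.0 / fs)
    return freqs, psd


def dominant_frequency(trace, fs=FS):
    """Frequency (Hz) of the strongest harmonic in the trace."""
    freqs, psd = power_spectral_density(trace, fs)
    return freqs[np.argmax(psd)]


# Synthetic stand-ins for two applications: same mean power, different harmonics.
t = np.arange(FS) / FS                  # one second of samples
rng = np.random.default_rng(0)
app1 = 300 + 20 * np.sin(2 * np.pi * 1_000 * t) + rng.normal(0, 1, t.size)
app2 = 300 + 20 * np.sin(2 * np.pi * 2_500 * t) + rng.normal(0, 1, t.size)

f1 = dominant_frequency(app1)
f2 = dominant_frequency(app2)
```

Both traces have the same average power, so a coarse-grain power reading could not separate them, while the position of the dominant harmonic does.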
Low overhead, accurate monitoring
Interesting feature for node
level and system level
Intrusion Detection System!
OK, but how do I automate?
Scalable Data Collection, Analytics
Sens_pub
Broker1
Sens_pub Sens_pub
Cassandra
node1
MQTT
Sens_pub
BrokerM
Sens_pub Sens_pub
Cassandra
nodeM
Grafana
Back-end
• MQTT–enabled sensor
collectors
Front-end
• MQTT Brokers
• Data Visualization
• NoSQL Storage
• Big Data Analytics
Apache
Spark
Target Facility
MQTT Brokers
Applications
NoSQL
ADMIN
MQTT2Kairosdb MQTT2Kairosdb
KairosDB
Pandas Matlab
MQTT: MQ Telemetry Transport
• Lightweight message queuing and transport protocol
• Developed by IBM and Eurotech
• Well suited for resource-constrained scenarios like M2M,
WSN and IoT applications
• Basic features:
  • PubSub model
  • Async communication protocol (messages)
  • Low overhead packet (2-byte header)
  • QoS (3 levels)
• Open source implementation:
  • https://mosquitto.org/
Publisher (mosquitto_pub) → Topic (Broker) (mosquitto) → Subscriber (mosquitto_sub)
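The "2-byte header" figure can be checked by building a PUBLISH packet by hand. The sketch below follows the MQTT 3.1.1 packet layout for QoS 0 only. The function names, the topic, and the "value;timestamp" payload are assumptions chosen for illustration. A real deployment would use a client library rather than hand-built packets.

```python
def encode_remaining_length(length):
    """Encode the MQTT 'remaining length' field (7 data bits per byte)."""
    if not 0 <= length <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        byte = length % 128
        length //= 128
        if length > 0:
            byte |= 0x80  # continuation bit: more length bytes follow
        out.append(byte)
        if length == 0:
            return bytes(out)


def publish_packet(topic, payload, retain=False):
    """Build an MQTT 3.1.1 PUBLISH packet at QoS 0 (no packet identifier)."""
    topic_bytes = topic.encode("utf-8")
    # Variable header: 2-byte big-endian topic length followed by the topic name.
    variable_header = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    body = variable_header + payload
    first_byte = (3 << 4) | int(retain)  # packet type 3 = PUBLISH, QoS bits = 0
    return bytes([first_byte]) + encode_remaining_length(len(body)) + body


# Hypothetical sensor reading; the payload format is an assumption.
packet = publish_packet("facility/sensors/B", b"23.5;1542038400")
fixed_header = packet[:2]
```

For bodies shorter than 128 bytes the fixed header is exactly two bytes: one for the packet type and flags, one for the remaining length. Longer bodies need up to three extra length bytes.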
MQTT to NoSQL Storage: MQTT2Kairosdb
MQTT Publishers: Sens_pub_A, Sens_pub_B, Sens_pub_C (e.g., topic facility/sensors/B)
MQTT Broker
MQTT2Kairosdb (subscribed to facility/sensors/#)
Cassandra Column Family
Metric: A, Tags: facility, sensors
Metric: B, Tags: facility, sensors
Metric: C, Tags: facility, sensors
Each data point = {Value;Timestamp}
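The mapping from an MQTT topic to a metric name plus tags can be sketched as follows. This is an illustration under assumptions, not the MQTT2Kairosdb implementation: the function names are hypothetical, the payload is assumed to be the text "value;timestamp", and the wildcard matching follows the standard MQTT rules for `+` and `#`.

```python
def topic_matches(topic_filter, topic):
    """Check an MQTT topic against a filter containing '+' and '#' wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: valid only as the last level, matches the rest.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def to_data_point(topic, payload):
    """Map 'facility/sensors/B' and 'value;timestamp' to a metric record."""
    *tag_levels, metric = topic.split("/")  # last level is the metric name
    value, timestamp = payload.split(";")
    return {
        "name": metric,
        "tags": tag_levels,
        "value": float(value),
        "timestamp": float(timestamp),
    }


subscription = "facility/sensors/#"
topic = "facility/sensors/B"
if topic_matches(subscription, topic):
    point = to_data_point(topic, "23.5;1542038400")
```

Here the last topic level becomes the metric name and the preceding levels become its tags, so a single `facility/sensors/#` subscription covers every publisher under that prefix.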
12-Nov-2018
DiG: Power & Perf Meas on D.A.V.I.D.E.
Broker
MQTT
DiG
MQTT
Pow_pub IPMI_pub OCC_pub
DiG SW Daemons
PSU_pub Cooling_pub
D.A.V.I.D.E. Front-End
Power:
• 20μs → on-board Analysis
• 1s,1ms → to Central Unit (45 kS/s)
IPMI: 89 metrics per node every 5s
OCC: 242 metrics per node every 10s
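The power streams listed above trade resolution for data rate: raw samples stay on the board, and only reduced streams go to the central unit. The slides do not say how the reduction is done. The sketch below assumes plain block averaging, uses NumPy, and uses hypothetical names and a synthetic trace.

```python
import numpy as np

FS = 50_000  # raw sampling rate in S/s (T = 20 us)


def block_average(samples, fs, window_s):
    """Average consecutive samples over windows of `window_s` seconds."""
    per_window = int(round(fs * window_s))
    if per_window < 1:
        raise ValueError("window shorter than one sample period")
    samples = np.asarray(samples, dtype=float)
    usable = (samples.size // per_window) * per_window  # drop an incomplete tail
    return samples[:usable].reshape(-1, per_window).mean(axis=1)


# Two seconds of a synthetic trace: 260 W in the first second, 250 W in the second.
raw = np.full(2 * FS, 250.0)
raw[:FS] += 10.0

per_ms = block_average(raw, FS, 1e-3)  # 50 raw samples per output value
per_s = block_average(raw, FS, 1.0)    # 50,000 raw samples per output value
```

At 1 ms granularity the stream carries 1,000 values per second instead of 50,000, and at 1 s granularity a single value. This is the reason the fine-grain analysis is kept on the embedded board.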
IPMI
Liteon
Overall Rack info
(e.g., Total Power)
Asetek
Info Liquid Cooling
Antonio Libri
Slurm
Slurm_pub
Examon Analytics: Batch training + Edge inference
examon-client
(REST)
(Batch)
Pandas
dataframe
Embedded Board:
Monitoring +
Anomaly Detection
Monitoring
Infrastructure
Historical Data
collect data
1
DL Model
train
2
Computing
Node 1
Embedded
Board
Computing
Node 2
Embedded
Board
Computing
Node N
…
3
load trained
model in
boards
DL
4
online anomaly detection on live, new data
Normal Behaviour
Anomaly
AI+Big Data on D.A.V.I.D.E.:
Example of Anomaly detection
1. Collect Data
2. AI Train
3. Edge
4. AI Inference
Does it work?
Anomaly Detection
X Y
AUTOENCODER
…
Z
encoder decoder
• An autoencoder tries to copy its input (X) into its output (Y)
• An autoencoder learns to represent its input in the latent space,
extracting the important characteristics of the input set X
IDEA: train an autoencoder with the normal behavior of an HPC
system and use its reconstruction error to detect anomalies
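The reconstruction-error idea can be shown with a deliberately simplified stand-in. Instead of the neural network described in the slides, the sketch below uses a linear autoencoder fitted in closed form with an SVD, which is equivalent to PCA. The data are synthetic, the names are hypothetical, and the 99th-percentile threshold on the training error is an assumption.

```python
import numpy as np


class LinearAutoencoder:
    """Linear encoder/decoder pair fitted in closed form (equivalent to PCA)."""

    def __init__(self, latent_dim):
        self.latent_dim = latent_dim

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions; keep the leading ones.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.latent_dim]
        return self

    def reconstruct(self, X):
        Z = (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T  # encode
        return Z @ self.components_ + self.mean_                            # decode

    def reconstruction_error(self, X):
        X = np.asarray(X, dtype=float)
        return np.mean((X - self.reconstruct(X)) ** 2, axis=1)


# Synthetic "normal" data: 6 correlated metrics driven by 2 hidden factors.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 6))
normal = rng.normal(size=(5000, 2)) @ mixing + 0.05 * rng.normal(size=(5000, 6))

model = LinearAutoencoder(latent_dim=2).fit(normal)
train_error = model.reconstruction_error(normal)
threshold = np.percentile(train_error, 99)  # assumed decision threshold

# Synthetic anomaly: one metric stuck at a fixed value regardless of the others.
anomalous = normal[:200].copy()
anomalous[:, 0] = 5.0
flagged = model.reconstruction_error(anomalous) > threshold
```

The model only learns the correlations present in normal data, so a sample that breaks them reconstructs poorly. No labelled anomalies are needed for training.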
AI+Big Data on D.A.V.I.D.E.: Anomaly
detection
• Autoencoder: neural network with 3 layers
• Sparse layers (dimension = n_features x 10)
• We trained the autoencoder on D.A.V.I.D.E. using ~two months of normal data,
collected with Examon
• To test our approach we injected anomalies in a subset of nodes
• Misconfigurations → change of frequency governor policy
• Default policy, conservative: CPU frequency depends on load
• Anomaly 1, policy changed to powersave: CPU frequency always at min value
• Anomaly 2, policy changed to performance: CPU frequency always at max value
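For reference, on Linux the frequency governor of each core is exposed through the cpufreq sysfs interface, which is one way to confirm that an injected misconfiguration is in place. The sketch below is an in-band check, not the out-of-band detection the slides describe. The helper names are hypothetical, and a temporary directory stands in for `/sys/devices/system/cpu` so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def read_governors(root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every core exposing a cpufreq governor."""
    governors = {}
    for path in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        governors[path.parts[-3]] = path.read_text().strip()
    return governors


def misconfigured(governors, expected="conservative"):
    """List the cores whose governor differs from the expected policy."""
    return sorted(cpu for cpu, gov in governors.items() if gov != expected)


# Fake sysfs tree for illustration: cpu1 was switched to powersave.
with tempfile.TemporaryDirectory() as root:
    for cpu, gov in [("cpu0", "conservative"), ("cpu1", "powersave")]:
        cpufreq_dir = Path(root) / cpu / "cpufreq"
        cpufreq_dir.mkdir(parents=True)
        (cpufreq_dir / "scaling_governor").write_text(gov + "\n")
    found = read_governors(root)
    bad = misconfigured(found)
```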
Train on «normal» data
Validation
Fault!!
AI+Big Data on D.A.V.I.D.E.: Anomaly
detection
Inference Accuracy
F-Score (99th percentile)
Normal: 0.99
Anomaly: 0.97
Edge inference with TensorFlow on embedded computers (BBB): 11 ms
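The per-class F-scores above combine precision and recall into a single number. As a reminder of how such a score is computed, here is a minimal sketch. The labels are made up for illustration and are not the D.A.V.I.D.E. results, and the function name is hypothetical.

```python
def f_score(y_true, y_pred, positive):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 8 normal samples and 2 anomalous ones.
y_true = ["normal"] * 8 + ["anomaly"] * 2
y_pred = ["normal"] * 7 + ["anomaly"] + ["anomaly", "normal"]

f_normal = f_score(y_true, y_pred, positive="normal")
f_anomaly = f_score(y_true, y_pred, positive="anomaly")
```

Reporting the score separately for each class matters here because anomalies are rare: a detector that always answers "normal" would score well on the normal class and zero on the anomaly class.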
Conclusion & Future Works
• We presented an approach to combine out-of-band monitoring with
big data and AI to enable Datacenter Automation
• We demonstrated the effectiveness of our approach in enabling
automated anomaly detection on computing nodes
• Future Works:
• Extending the approach toward security and housekeeping tasks in
Datacenters
• Leverage OpenBMC and custom firmware to deploy it as part of the BMC
• Looking for partnerships to bring it to large-scale P9 systems
ACKNOWLEDGEMENTS
The Datacenter Automation TEAM
• Luca Benini, Michela Milano, Andrea
Borghesi, Antonio Libri, Francesco
Beneventi, Alessandro Petrella

Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER

  • 1. Andrea Bartolini, Prof University of Bologna – DEI, Italy Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER
  • 2. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 4. Usage Scenarios Fine Grain Power and Performance Measurements: - Verify and classify node performance - In spec / out of spec behaviour - Misconfiguration - Aging and wear out - Detect security hazards - Predictive maintenance Coarse grain Fine grain CPU CPU ACC ACC Node DIMM DIMM DIMM Performance Counters: - Node components - Microarchitectural events AI
  • 5. Datacenter Automation Design and Bottlenecks Centralized Monitoring & Analytics Edge Monitoring & Analytics Low Data-Rate Huge Data-Rate Centralized Monitoring & Analytics Bottlenecks: • Network BW • Storage • SW Overhead Infrastructure Sensors (e.g., CRAC, PDU, etc.) Node-1 Node-n Node-n Node-1 Infrastructure Sensors (e.g., CRAC, PDU, etc.)
  • 6. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 8. D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) OCP form-factor compute node based on IBM Minsky 2x IB EDR LIQUID COOLING 4x Tesla P100 SXM2 University of Bologna / ETH Zurich DiG: FINE GRAIN POWER AND PERFORMANCE MONITORING & ANALYTICS BusBar 2x POWER8 with NVLink
  • 9. DiG: Out-of-band Power & Performance Monitoring Architecture • Power measuring block, placed between the power supply unit (PSU) and the DC-DC converter, provides overall node power consumption • Embedded system (BeagleBone Black) reads node performance metrics through pass-through commands • Scalable interface to the data analysis point through the MQTT protocol • Embedded system powerful enough to support edge analytics & inference
  • 10. DiG: Out-of-band Power & Performance Monitoring Framework • Sub-Watt precision • Power monitoring sampling rate: 50 kS/s (T = 20 µs) • Performance monitoring: 242 Amester metrics every 10 s • Time synchronized (±3σ): NTP < 35 µs, PTP < 1.3 µs • State-of-the-art systems (HDEEM and PowerInsight): sampling period no shorter than 1 ms; data used only offline (no possibility for real-time computing) • Fsmax [kHz]: E4 PPBB 50, HDEEM 1, PowerInsight 1 • Nice, but how do I use it?
  • 11. Real-time frequency analysis on power supply and more… a live oscilloscope • Application 1, Application 2: spectral signature of an application! • For instance, using the FFT we plot the power spectral density of the power benchmark of two applications, and we can distinguish them by the harmonics present in each of the signals • Low overhead, accurate monitoring • Interesting feature for node-level and system-level Intrusion Detection Systems! • OK, but how do I automate?
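The spectral-signature idea on slide 11 can be sketched with a small FFT over two synthetic power traces sampled at the 50 kS/s rate quoted on slide 10. Everything except that rate is an assumption made for illustration: the 300 W baseline, the 1 kHz and 3 kHz ripple frequencies, and the helper names are hypothetical and do not come from the talk. A production version would more likely use an optimized FFT library than this pure-Python radix-2 routine.

```python
import cmath
import math

FS = 50_000  # samples per second (power sampling rate from slide 10)


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(samples):
    """One-sided power spectrum of a real trace after removing its mean."""
    n = len(samples)
    mean = sum(samples) / n
    spectrum = fft([s - mean for s in samples])
    return [abs(c) ** 2 / n for c in spectrum[: n // 2]]


def dominant_frequency(samples, fs=FS):
    """Frequency (Hz) of the strongest non-DC bin."""
    psd = power_spectrum(samples)
    k = max(range(1, len(psd)), key=psd.__getitem__)
    return k * fs / len(samples)


def synthetic_trace(base_watts, ripple_hz, n=1024, fs=FS):
    """Hypothetical node power trace: constant baseline plus one harmonic."""
    return [base_watts + 5.0 * math.sin(2 * math.pi * ripple_hz * i / fs)
            for i in range(n)]


# Two made-up "applications" that differ only in their ripple frequency.
f1 = dominant_frequency(synthetic_trace(300.0, 1_000))
f2 = dominant_frequency(synthetic_trace(300.0, 3_000))
print(f"application 1 peak near {f1:.0f} Hz, application 2 peak near {f2:.0f} Hz")
```

With 1024 samples the bin width is FS / 1024, about 49 Hz, so each peak is located to within one bin of the injected ripple. Real traces carry several harmonics, and a signature would compare the whole spectrum, not a single peak.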
  • 12. Outline • Datacenter Automation • D.A.V.I.D.E. Out-of-Band and Big Data Monitoring • Big Data/AI enabled Anomaly Detection • Future Works
  • 13. Scalable Data Collection, Analytics Sens_pub Broker1 Sens_pub Sens_pub Cassandra node1 MQTT Sens_pub BrokerM Sens_pub Sens_pub Cassandra nodeM Grafana Back-end • MQTT-enabled sensor collectors Front-end • MQTT Brokers • Data Visualization • NoSQL Storage • Big Data Analytics Apache Spark Target Facility MQTT Brokers Applications NoSQL ADMIN MQTT2Kairos MQTT2Kairos KairosDB Pandas Matlab
  • 14. MQTT: MQ Telemetry Transport • Lightweight message queuing and transport protocol • Developed by IBM and Eurotech • Well suited for low resource demanding scenarios like M2M, WSN and IoT applications • Basic features: • PubSub model • Async communication protocol (messages) • Low overhead packet (2 bytes header) • QoS (3 levels) • Open source implementation: • https://mosquitto.org/ • Publisher (mosquitto_pub) → Topic (Broker: mosquitto) → Subscriber (mosquitto_sub)
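To make the "2 bytes header" point concrete, here is a minimal sketch that assembles a QoS 0 PUBLISH packet by hand, following the MQTT 3.1.1 packet layout. The topic is the example from slide 15. The payload text and the function names are hypothetical, and the `value;timestamp` payload shape is only a guess based on the `{Value;Timestamp}` label on slide 15. In practice a client library or `mosquitto_pub` builds these bytes.

```python
def encode_remaining_length(n):
    """MQTT variable-length integer: 7 data bits per byte, top bit = more follows."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # continuation bit
        out.append(digit)
        if n == 0:
            return bytes(out)


def publish_packet(topic, payload, retain=False):
    """Build a QoS 0 PUBLISH packet (QoS 0 carries no packet identifier)."""
    topic_bytes = topic.encode("utf-8")
    # Variable header: 2-byte big-endian topic length, then the topic itself.
    body = len(topic_bytes).to_bytes(2, "big") + topic_bytes + payload
    first_byte = 0x30 | (0x01 if retain else 0x00)  # packet type 3 = PUBLISH
    return bytes([first_byte]) + encode_remaining_length(len(body)) + body


TOPIC = "facility/sensors/B"         # topic shown on slide 15
PAYLOAD = b"312.5;1542019200"        # hypothetical "value;timestamp" payload

packet = publish_packet(TOPIC, PAYLOAD)
fixed_header_len = len(packet) - (2 + len(TOPIC) + len(PAYLOAD))
print(f"{len(packet)} bytes on the wire, {fixed_header_len} of them fixed header")
```

The fixed header stays at two bytes as long as the topic plus payload fit in 127 bytes. Longer messages add one to three bytes to the remaining-length field.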
  • 15. MQTT to NoSQL Storage: MQTT2Kairosdb • MQTT Publishers: Sens_pub_A, Sens_pub_B, Sens_pub_C (e.g., topic facility/sensors/B) • MQTT Broker • MQTT2Kairosdb subscribes to facility/sensors/# • Cassandra Column Family: Metric: A, Tags: facility, sensors; Metric: B, Tags: facility, sensors; Metric: C, Tags: facility, sensors • {Value;Timestamp}
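The topic-to-metric mapping on slide 15 can be illustrated as follows. This is a sketch under stated assumptions, not the MQTT2Kairosdb implementation: the payload is assumed to be the text `value;timestamp`, the tag keys `level0`, `level1` are invented here, and the output dictionary only imitates the general shape of a KairosDB data point. The wildcard matcher follows standard MQTT filter semantics but ignores the special handling of topics that start with `$`.

```python
def topic_matches(topic_filter, topic):
    """MQTT filter matching: '+' matches one level, '#' matches all remaining levels."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # '#' also matches the parent level itself
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def to_datapoint(topic, payload):
    """Map 'tag/.../metric' plus 'value;timestamp' to a KairosDB-style dict.

    The last topic level becomes the metric name and the leading levels
    become tags. The tag key names are hypothetical.
    """
    *tag_levels, metric = topic.split("/")
    value_text, timestamp_text = payload.split(";")
    return {
        "name": metric,
        "tags": {f"level{i}": level for i, level in enumerate(tag_levels)},
        "datapoints": [[int(timestamp_text), float(value_text)]],
    }


SUBSCRIPTION = "facility/sensors/#"  # wildcard subscription from slide 15

incoming = [
    ("facility/sensors/B", "312.5;1542019200"),    # made-up reading
    ("facility/other/X", "1.0;1542019200"),        # not covered by the filter
]
stored = [to_datapoint(t, p) for t, p in incoming if topic_matches(SUBSCRIPTION, t)]
print(stored)
```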
  • 16. 12-Nov-2018 DiG: Power & Perf Meas on D.A.V.I.D.E. Broker MQTT DiG MQTT Pow_pub IPMI_pub OCC_pub DiG SW Daemons PSU_pub Cooling_pub D.A.V.I.D.E. Front-End Power: • 20 μs → on-board analysis • 1 s, 1 ms → to Central Unit (45 kS/s) IPMI: 89 metrics per node every 5 s OCC: 242 metrics per node every 10 s IPMI Liteon Overall Rack info (e.g., Total Power) Asetek Info Liquid Cooling Antonio Libri Slurm Slurm_pub
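Slide 16 keeps the 20 µs power samples on the board and forwards only 1 ms and 1 s series to the central unit. One plausible way to derive those coarser series is block averaging, sketched below. The slide does not say which reduction DiG actually applies, so the averaging step, the synthetic trace and the function name are all assumptions. The last lines work out the per-node rate of the slower IPMI and OCC metrics from the figures on the slide.

```python
def block_average(samples, factor):
    """Average consecutive groups of `factor` samples; a trailing partial group is dropped."""
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


RAW_PERIOD_US = 20                                   # raw sampling period from the slide
SAMPLES_PER_MS = 1_000 // RAW_PERIOD_US              # 50 raw samples per 1 ms value
SAMPLES_PER_S = 1_000_000 // RAW_PERIOD_US           # 50,000 raw samples per 1 s value

# One second of made-up power readings (watts) with a repeating sawtooth.
raw = [300.0 + (i % 50) * 0.1 for i in range(SAMPLES_PER_S)]

ms_series = block_average(raw, SAMPLES_PER_MS)       # forwarded at 1 kS/s
s_series = block_average(raw, SAMPLES_PER_S)         # forwarded at 1 S/s

# Slow telemetry per node: 89 IPMI metrics every 5 s plus 242 OCC metrics every 10 s.
slow_values_per_second = 89 / 5 + 242 / 10

print(len(ms_series), len(s_series), slow_values_per_second)
```

Averaging on the board cuts the forwarded power stream by a factor of 50 for the 1 ms series, which is what eases the network and storage bottlenecks listed on slide 5.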
  • 17. Examon Analytics: Batch training + Edge inference examon-client (REST) (Batch) Pandas dataframe
  • 18. AI+Big Data on D.A.V.I.D.E.: Example of Anomaly detection • Embedded Board: Monitoring + Anomaly Detection • 1. Collect Data: the monitoring infrastructure collects historical data • 2. AI Train: train the DL model • 3. Edge: load the trained model onto the embedded boards of Computing Nodes 1 … N • 4. AI Inference: online anomaly detection on live, new data (Normal Behaviour vs. Anomaly) • Does it work?
  • 19. Anomaly Detection X Y AUTOENCODER … Z encoder decoder • An autoencoder tries to copy its input (X) into its output (Y) • An autoencoder learns to represent its input in the latent space, extracting the important characteristics of the input set X IDEA: train an autoencoder with the normal behavior of an HPC system and use its reconstruction error to detect anomalies
  • 20. AI+Big Data on D.A.V.I.D.E.: Anomaly detection • Autoencoder: neural network with 3 layers • Sparse layers (dimension = n_features x 10) • We trained on D.A.V.I.D.E. the autoencoder using ~two months of normal data, collected with Examon • To test our approach we injected anomalies in a subset of nodes • Misconfigurations → change of frequency governor policy • Default policy conservative: cpu frequency depends on load • Anomaly 1 policy changed to powersave: cpu freq always at min value • Anomaly 2 policy changed to performance: cpu freq always at max value Train on «normal» data Validation
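The detection rule from slides 19 and 20, flagging samples whose reconstruction error is unusually high, can be sketched with a toy two-feature model. This is not the network used in the talk. The "autoencoder" here is a linear encoder/decoder whose weights are set by hand as a stand-in for a trained model, the two features (normalized CPU load and normalized CPU frequency) and all data are synthetic, and the 99th-percentile threshold is an assumption inspired by the "99th percentile" label on slide 21.

```python
import random
import statistics

INV_SQRT2 = 2 ** -0.5
# Hand-set, hypothetical weights: project onto the direction "freq tracks load".
ENCODER = (INV_SQRT2, INV_SQRT2)
DECODER = (INV_SQRT2, INV_SQRT2)


def reconstruct(x):
    """Encode (load, freq) into a 1-D latent value, then decode it back."""
    z = ENCODER[0] * x[0] + ENCODER[1] * x[1]
    return (DECODER[0] * z, DECODER[1] * z)


def reconstruction_error(x):
    """Squared distance between a sample and its reconstruction."""
    y = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Synthetic "normal" behaviour: frequency follows load, plus small noise.
rng = random.Random(0)
normal = []
for _ in range(2000):
    load = rng.random()
    freq = min(1.0, max(0.0, load + rng.gauss(0.0, 0.02)))
    normal.append((load, freq))

# Threshold: 99th percentile of the error seen on normal data.
errors = [reconstruction_error(x) for x in normal]
THRESHOLD = statistics.quantiles(errors, n=100)[98]


def is_anomaly(x):
    return reconstruction_error(x) > THRESHOLD


print(is_anomaly((0.9, 0.0)))  # high load, frequency pinned at minimum (powersave-like)
print(is_anomaly((0.1, 1.0)))  # low load, frequency pinned at maximum (performance-like)
print(is_anomaly((0.5, 0.5)))  # frequency consistent with load
```

By construction about 1% of normal samples exceed the threshold, so the percentile sets the trade-off between false alarms and missed anomalies.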
  • 21. AI+Big Data on D.A.V.I.D.E.: Anomaly detection • Fault!! • Inference accuracy, F-Score (99th percentile): Normal 0.99, Anomaly 0.97 • Edge inference with TensorFlow on embedded computers (BBB): 11 ms
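For readers unfamiliar with the metric on slide 21, the sketch below computes a per-class F-score as the harmonic mean of precision and recall. It assumes the slide reports the usual F1 variant, which the transcript does not state. The labels are a tiny made-up example chosen so the arithmetic is easy to follow, and they are unrelated to the D.A.V.I.D.E. results.

```python
def f_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 8 normal and 2 anomalous samples, one false alarm.
y_true = ["normal"] * 8 + ["anomaly"] * 2
y_pred = ["normal"] * 7 + ["anomaly"] * 3

f_normal = f_score(y_true, y_pred, "normal")
f_anomaly = f_score(y_true, y_pred, "anomaly")
print(f"normal: {f_normal:.3f}, anomaly: {f_anomaly:.3f}")
```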
  • 22. Conclusion & Future Works • We presented an approach that combines out-of-band monitoring, big data and AI to enable Datacenter Automation • We demonstrated the effectiveness of our approach in enabling automated anomaly detection on computing nodes • Future Works: • Extending the approach toward security and housekeeping tasks in datacenters • Leveraging OpenBMC and custom firmware to deploy it as part of the BMC • Looking for partnerships to bring it to large-scale P9 systems
  • 23. ACKNOWLEDGE The Datacenter Automation TEAM • Luca Benini, Michela Milano, Andrea Borghesi, Antonio Libri, Francesco Beneventi, Alessandro Petrella