Andrea Bartolini presented a method for combining out-of-band monitoring with artificial intelligence and big data analytics to enable datacenter automation. Their system, called D.A.V.I.D.E., uses fine-grained power and performance monitoring of nodes through an embedded system. Data is collected and analyzed using MQTT, Cassandra, and Apache Spark. An autoencoder was trained on historical monitoring data to learn normal behavior and is used to detect anomalies through reconstruction error at the edge in real-time. Future work includes extending this approach for security and expanding it to larger systems.
Combining out-of-band monitoring with AI and big data for datacenter automation in OpenPOWER
1. Andrea Bartolini, Prof
University of Bologna – DEI, Italy
Combining out-of-band monitoring with AI and big data
for datacenter automation in OpenPOWER
2. Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
4. Usage Scenarios
Fine Grain Power and
Performance Measurements:
- Verify and classify node performance
- In spec / out of spec behaviour
- Misconfiguration
- Aging and wear out
- Detect security hazards
- Predictive maintenance
[Figure: node diagram with CPUs, accelerators (ACC) and DIMMs; labels: Coarse grain, Fine grain, AI]
Performance Counters:
- Node components
- Microarchitectural events
8. D.A.V.I.D.E. SUPERCOMPUTER
(Development of an Added Value Infrastructure Designed in Europe)
• OCP form-factor compute node based on IBM Minsky
• 2 x POWER8 with NVLink
• 4 x Tesla P100 SXM2
• 2 x IB EDR
• Liquid cooling
• BusBar
• DiG: fine grain power and performance monitoring & analytics (University of Bologna / ETH Zurich)
9. DiG: Out-of-band Power & Performance Monitoring Architecture
• Power measuring block, placed between the power supply unit (PSU) and the DC-DC converter, provides overall node power consumption
• Embedded system (BeagleBone Black) reads node performance metrics through pass-through commands
• Scalable interface to the data analysis point through the MQTT protocol
• Embedded system powerful enough to support edge analytics & inference
10. DiG: Out-of-band Power & Performance Monitoring Architecture
• Sub-Watt precision
• Power monitoring sampling rate: 50 kS/s (T = 20 µs)
• Performance monitoring: 242 Amester metrics every 10 s
• Time synchronized (±3σ)
  • NTP < 35 µs
  • PTP < 1.3 µs
State-of-the-art systems (HDEEM and PowerInsight)
• Sampling period no shorter than 1 ms (max. 1 kS/s)
• Use data only offline (no possibility for real-time computing)
Framework      Fsmax [kHz]
E4 PPBB        50
HDEEM          1
PowerInsight   1
Nice, but how do I use it?
11. Real-time Frequency analysis on power supply and more… a live oscilloscope
[Figure: power spectra of Application 1 and Application 2, showing the spectral signature of an application]
• For instance, using the FFT we plot the power spectral density of the power
benchmark of two applications, and we can distinguish them by the harmonics
present in each of the signals
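The spectral-signature idea can be sketched in a few lines. This is a minimal illustration, not the D.A.V.I.D.E. implementation: both traces are synthetic, the harmonic frequencies (1 kHz, 2.5 kHz, 5 kHz) and amplitudes are made-up values, and a naive DFT stands in for an optimized FFT. Only the 50 kS/s sampling rate comes from the slides.

```python
import cmath
import math

SAMPLE_RATE_HZ = 50_000  # sampling rate quoted in the slides (T = 20 us)
N_SAMPLES = 500          # assumption: 10 ms window, 100 Hz frequency resolution


def power_spectrum(samples, sample_rate_hz):
    """Return (frequency_hz, power) pairs computed with a naive O(n^2) DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the DC component
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        spectrum.append((k * sample_rate_hz / n, abs(acc) ** 2 / n))
    return spectrum


def dominant_frequency(samples, sample_rate_hz):
    """Frequency of the strongest harmonic in the trace."""
    return max(power_spectrum(samples, sample_rate_hz), key=lambda fp: fp[1])[0]


def synthetic_trace(harmonics, base_watts=100.0):
    """Hypothetical power trace: constant draw plus sinusoidal harmonics."""
    return [
        base_watts + sum(amp * math.sin(2 * math.pi * freq * t / SAMPLE_RATE_HZ)
                         for freq, amp in harmonics)
        for t in range(N_SAMPLES)
    ]


# Two made-up applications with different harmonic content.
app1 = synthetic_trace([(1_000, 5.0)])
app2 = synthetic_trace([(2_500, 5.0), (5_000, 2.0)])

print(dominant_frequency(app1, SAMPLE_RATE_HZ))
print(dominant_frequency(app2, SAMPLE_RATE_HZ))
```

The two traces have the same average power, so a coarse-grain power reading could not tell them apart; only the harmonics distinguish them.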
Low overhead, accurate monitoring
Interesting feature for node
level and system level
Intrusion Detection System!
OK, but how do I automate?
12. Outline
• Datacenter Automation
• D.A.V.I.D.E. Out-of-Band and Big Data Monitoring
• Big Data/AI enabled Anomaly Detection
• Future Works
14. MQTT: MQ Telemetry Transport
• Lightweight message queuing and transport protocol
• Developed by IBM and Eurotech
• Well suited to resource-constrained scenarios such as M2M, WSN and IoT applications
• Basic features:
  • PubSub model
  • Async communication protocol (messages)
  • Low overhead packet (2-byte header)
  • QoS (3 levels)
• Open source implementation:
  • https://mosquitto.org/
[Figure: Publisher (mosquitto_pub) → Topic on the Broker (mosquitto) → Subscriber (mosquitto_sub)]
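The publish/subscribe model can be illustrated without a network. The sketch below is a toy in-memory stand-in, not an MQTT client and not mosquitto: `ToyBroker` and `topic_matches` are hypothetical names, QoS and retained messages are left out, and the special handling of topics starting with `$` is ignored. The filter rules follow the MQTT convention that `+` matches one topic level and `#` matches all remaining levels.

```python
from collections import defaultdict


def topic_matches(topic_filter, topic):
    """MQTT-style filter match: '+' is one level, '#' is all remaining levels."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # also matches the parent level itself
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class ToyBroker:
    """In-memory stand-in for a broker: no network, no QoS, no retained messages."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic_filter, callback):
        self._subscribers[topic_filter].append(callback)

    def publish(self, topic, payload):
        # Deliver the message to every subscriber whose filter matches the topic.
        for topic_filter, callbacks in self._subscribers.items():
            if topic_matches(topic_filter, topic):
                for callback in callbacks:
                    callback(topic, payload)


received = []
broker = ToyBroker()
broker.subscribe("facility/sensors/#",
                 lambda topic, payload: received.append((topic, payload)))
broker.publish("facility/sensors/B", "42.5")  # delivered
broker.publish("facility/cooling/A", "17.0")  # hypothetical topic, not delivered
print(received)
```

Publishers and subscribers never reference each other, only topics, which is what lets new sensors or consumers be added without touching existing ones.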
15. MQTT to NoSQL Storage: MQTT2Kairosdb
[Figure: MQTT publishers Sens_pub_A, Sens_pub_B and Sens_pub_C publish to the MQTT broker (e.g. on topic facility/sensors/B); MQTT2Kairosdb subscribes to facility/sensors/# and stores the data in a Cassandra column family]
• Metric: A, Tags: facility, sensors
• Metric: B, Tags: facility, sensors
• Metric: C, Tags: facility, sensors
• Each stored entry = {Value;Timestamp}
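The topic-to-metric mapping shown on this slide can be sketched as follows. This is an assumption-laden illustration rather than the actual MQTT2Kairosdb code: `to_datapoint` is a hypothetical helper, the payload is assumed to be the text `value;timestamp`, and the output dictionary layout is made up for the example (it is not the KairosDB wire format).

```python
def to_datapoint(topic, payload):
    """Map an MQTT message to a metric record (hypothetical layout).

    The last topic level becomes the metric name, the preceding levels
    become tags, and the payload is split into value and timestamp.
    """
    *tags, metric = topic.split("/")
    value_text, timestamp_text = payload.split(";")
    return {
        "metric": metric,
        "tags": tags,
        "value": float(value_text),
        "timestamp": float(timestamp_text),
    }


# Made-up sample: 230.5 (e.g. watts) at a Unix timestamp in seconds.
point = to_datapoint("facility/sensors/B", "230.5;1542038400")
print(point)
```

Keeping the hierarchy in tags means a query can later select all metrics under facility/sensors without knowing the individual sensor names.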
16. DiG: Power & Perf Meas on D.A.V.I.D.E.
12-Nov-2018
Broker
MQTT
DiG
MQTT
Pow_pub IPMI_pub OCC_pub
DiG SW Daemons
PSU_pub Cooling_pub
D.A.V.I.D.E. Front-End
Power:
• 20μs → on-board Analysis
• 1s,1ms → to Central Unit (45 kS/s)
IPMI: 89 metrics per node every 5 s
OCC: 242 metrics per node every 10 s
IPMI
Liteon
Overall Rack info
(e.g., Total Power)
Asetek
Info Liquid Cooling
Antonio Libri
Slurm
Slurm_pub
18. AI+Big Data on D.A.V.I.D.E.: Example of Anomaly detection
[Figure: the monitoring infrastructure feeds historical data to a DL model; each computing node (1 to N) has an embedded board for monitoring + anomaly detection, which classifies live data as Normal Behaviour or Anomaly]
1. Collect Data: collect historical data from the monitoring infrastructure
2. AI Train: train the DL model
3. Edge: load the trained model in the embedded boards
4. AI Inference: online anomaly detection on live, new data
Does it work?
19. Anomaly Detection
[Figure: AUTOENCODER: input X → encoder → latent space Z → decoder → output Y]
• An autoencoder tries to copy its input (X) into its output (Y)
• An autoencoder learns to represent its input in the latent space,
extracting the important characteristics of the input set X
IDEA: train an autoencoder on the normal behavior of an HPC
system and use its reconstruction error to detect anomalies
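The reconstruction-error idea can be shown with a deliberately simplified stand-in. The sketch below does not use a neural network: it swaps in a linear autoencoder with a 1-D latent space, solved in closed form (equivalent to keeping the first principal axis). The two features (CPU load, CPU frequency in GHz), the data and the max-error threshold are all hypothetical.

```python
import math


def fit_linear_autoencoder(samples):
    """Closed-form 2-D -> 1-D linear autoencoder (first principal axis)."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cxx = sum((x - mean_x) ** 2 for x, _ in samples) / n
    cyy = sum((y - mean_y) ** 2 for _, y in samples) / n
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # angle of the major axis
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def reconstruction_error(model, sample):
    """Squared distance between a sample and its encode/decode round trip."""
    (mean_x, mean_y), (ux, uy) = model
    dx, dy = sample[0] - mean_x, sample[1] - mean_y
    z = dx * ux + dy * uy          # encoder: project onto the latent axis
    rec_x, rec_y = z * ux, z * uy  # decoder: map the latent code back
    return (dx - rec_x) ** 2 + (dy - rec_y) ** 2


# Hypothetical "normal" data: frequency rises with load, plus a small ripple.
normal = [(i / 99, 1.2 + 2.0 * (i / 99) + 0.02 * math.sin(i)) for i in range(100)]
model = fit_linear_autoencoder(normal)

# Assumption: flag anything reconstructed worse than the worst training sample.
threshold = max(reconstruction_error(model, s) for s in normal)


def is_anomaly(sample):
    return reconstruction_error(model, sample) > threshold


print(is_anomaly((0.5, 2.2)))  # load and frequency follow the learned relation
print(is_anomaly((0.9, 1.2)))  # high load but frequency stuck low
print(is_anomaly((0.1, 3.2)))  # low load but frequency stuck high
```

The model only ever sees normal samples, so no labelled faults are needed for training; anything that breaks the learned relation between the features reconstructs poorly.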
20. AI+Big Data on D.A.V.I.D.E.: Anomaly
detection
• Autoencoder: neural network with 3 layers
• Sparse layers (dimension = n_features x 10)
• We trained the autoencoder on D.A.V.I.D.E. using roughly two months of normal data,
collected with Examon
• To test our approach we injected anomalies in a subset of nodes
• Misconfigurations → change of frequency governor policy
  • Default policy (conservative): CPU frequency depends on load
  • Anomaly 1 (policy changed to powersave): CPU frequency always at min value
  • Anomaly 2 (policy changed to performance): CPU frequency always at max value
Train on «normal» data
Validation
21. AI+Big Data on D.A.V.I.D.E.: Anomaly detection
[Figure: reconstruction error plot with a region annotated "Fault!!"]
Inference Accuracy (F-Score, 99th percentile)
Normal   0.99
Anomaly  0.97
Edge Inference with TensorFlow on Embedded Computers (BBB): 11 ms
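One way to read "F-Score, 99th percentile" is that the detection threshold is set at the 99th percentile of the reconstruction error on normal data, and F-scores are then computed per class. That reading is an assumption, as are the error values below; the sketch only shows how such a threshold and the two F-scores could be computed.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def f_score(true_positives, false_positives, false_negatives):
    """F1 score; assumes at least one of the three counts is non-zero."""
    return 2 * true_positives / (2 * true_positives + false_positives + false_negatives)


def evaluate(normal_errors, anomaly_errors, threshold):
    """Per-class F-scores when errors above the threshold are flagged."""
    normal_flagged = sum(e > threshold for e in normal_errors)
    anomaly_flagged = sum(e > threshold for e in anomaly_errors)
    normal_passed = len(normal_errors) - normal_flagged
    anomaly_missed = len(anomaly_errors) - anomaly_flagged
    return {
        "normal": f_score(normal_passed, anomaly_missed, normal_flagged),
        "anomaly": f_score(anomaly_flagged, normal_flagged, anomaly_missed),
    }


# Made-up reconstruction errors from a calibration set of normal samples.
calibration_errors = [i / 100 for i in range(1, 101)]
threshold = percentile(calibration_errors, 99)

# Made-up test errors: three normal samples and three injected anomalies.
scores = evaluate([0.1, 0.5, 1.0], [2.0, 3.0, 0.5], threshold)
print(threshold, scores)
```

A percentile threshold fixes the expected false-alarm rate on normal data (about 1% here) without needing any anomalous samples to calibrate against.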
22. Conclusion & Future Works
• We presented an approach that combines out-of-band monitoring with
big data and AI to enable Datacenter Automation
• We demonstrated the effectiveness of our approach for
automated anomaly detection on computing nodes
• Future Works:
• Extending the approach toward Security and house-keeping tasks in
Datacenters
• Leverage OpenBMC and custom firmware to deploy it as part of BMC
• Looking for partnership for bringing it to Large scale P9 systems