SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
AIOps: Anomaly Detection of Distributed Traces
Prof. Dr. Jorge Cardoso
E-mail: jorge.cardoso@huawei.com
Planet-scale AIOps / Chief Architect
Ireland and Munich Research Centers
Definition (Gartner) [AIOps]
AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly
enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight.
National University of Ireland, Galway (NUI Galway)
2019.08.20
HUAWEI TECHNOLOGIES CO., LTD. 1
AIOps: Anomaly Detection of Distributed Traces
Abstract and bio
In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms cannot be done any longer
manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting the use of AI
to provide tools for autonomous cloud operations. This talk will introduce the field of AIOps (artificial intelligence for IT
operations) and present what the next generation of AI-driven IT Operations tools and platforms will be. Our current research
in Munich explores how AIOps, and, namely, deep learning, machine learning, distributed traces, graph analysis, time-series
analysis, sequence analysis, and log analysis can be used to effectively detect and localize failures in cloud infrastructures to
reduce the workload of SRE operators. In particular, I will describe how distributed traces can be analyzed using deep
learning to effectively identify anomalies.
Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Ireland and Munich
Research Centers. Previously he worked for several major companies such as SAP Research
(Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise
Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology
(Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal).
He recently published his latest book titled “Fundamentals of Service Systems” with Springer. His
current research involves the development of the next generation of Cloud Operations and Analytics
using AI, Cloud Reliability and Resilience, and High Performance Business Process Management
systems. He has a Ph.D. in Computer Science from the University of Georgia (USA).
Jorge Cardoso
GitHub | Slideshare.net | GoogleScholar
Previous Positions
Interests AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
HUAWEI TECHNOLOGIES CO., LTD. 2
Operations and Maintenance
Global Trends
Core
Networks
IoT
Infrastructure
Network
Appliances
Radio
Access
Network
Content
Delivery
Networks
Virtual
CPE
Mobile
Edge
Computing
Energy Manufacturing Smart Cities/BuildingsHealthcare
Networking
East Environment Energy PHILIPS MedicalGeely
The cloud has evolved from a disruptor to an obvious solution for enterprise IT. However, the
increasing complexity of public, private, edge, and hybrid cloud environments presents new
challenges for cost-efficient operations and maintenance (O&M) tasks
HUAWEI TECHNOLOGIES CO., LTD. 3
Underlying Problem
Increasing Complexity of Systems
Every year the management of IT O&M is more complex
► Digitalization with cloud, mobile, and edge
► Large-scale microservice and serverless systems
► Increase in IT size, and event/alert volumes
► SLA guarantee for critical applications
► Nano/microservices break existing monitoring tools
Microservices/serverless add complexity to communications between
services making systems far more difficult to maintain and troubleshoot.
With a monolith application, when there is a problem, only one product
owner needs to be contacted.
HC Cloud
Data Center
1
Data Center
2
Server 1 Server 2
Socket 1 Socket 2
Core 1 Core 2
VM 1
VM 2
VM 3
Core .. Core 16
Server ..
Server
20.000
Data Center
..
Data Center
10
Huawei Cloud is an example of a planet-
scale complex system:
40 availability zones, 23 geographical
regions
(Germany, France, South/Central America, Hong
Kong and Russia to Thailand and South Africa)
20M VMs (3 VM/core), billion metrics
(for illustrative purpose only)
However, existing O&M tools still rely on old approaches which
involve manual rule configuration, averages and std, and data
wrangling to manage physical assets
HUAWEI TECHNOLOGIES CO., LTD. 4
Industry Trends
Bringing AI to O&M
Startup in the field of AIOps
The market is seeing interesting mergers and acquisitions by important vendors
► Cloud providers are also emerging as ITIM tool providers purely for their IaaS offerings
At the moment, these tools have visibility limited to the cloud providers' own environment
► Nonetheless, they are valuable assets which can also be offered to external customers
 VMware’s acquisition of Wavefront is interesting from an analytics perspective
 ScienceLogic has acquired AppFirst, enabling it to acquire granular application metrics
 CISCO has acquired AppDynamics and Perspica, enabling it to acquire rea-time monitoring
and AI
 CA Technologies’ acquisition of Automic brings in an additional automation technology to its
existing portfolio
 Spinoff of Hewlett Packard Enterprise’s (HPE's) software business to Micro Focus
VMware acquired Wavefront (2017)
Cisco acquired AppDynamics (2017)
Cisco acquired Perspica for ML (Oct 2017)
CA acquired Automic (2017)
Google acquired
StackDriver (2014)
Amazon AWS
HUAWEI TECHNOLOGIES CO., LTD. 5
Industry Trends
Bringing AI to O&M
“Virtyt Koshi, the EMEA general manager for virtualization
vendor Mavenir, reckons Google is able to run all of its data
centers in the Asia-Pacific with only about 30 people, and that a
telco would typically have about 3,000 employees to manage
infrastructure across the same area."
Reducing O&M Costs
Google: 30 people
Others: 3000 people
Automation can
reduce the number
of incidents by 50%
“We began applying machine learning two years ago (2016) to operate our data centers more efficiently… over the past few
months, DeepMind researchers began working with Google’s data center team to significantly improve the system’s utility. Using
neural networks trained on different operating scenarios and parameters, we created a more efficient and adaptive framework to
understand data center dynamics and optimize efficiency.” Eric Schmidt, December 2018
HUAWEI TECHNOLOGIES CO., LTD. 6
AIOps @ Huawei Cloud
Components and Troubleshooting Scenarios
Troubleshooting Scenarios
 Response Time Analysis
 A service started responding to requests more slowly than normal
 The change happened suddenly as a consequence of regression in
the latest deployment
 System Load
 The demand placed on the system, e.g., REST API requests per
second, increase since yesterday
 Error Analysis
 The rate of requests that fail -- either explicitly (HTTP 5xx) or
implicitly (HTTP 2xx with wrong content) -- is increasing slowly, but
steadily
 System Saturation
 The resources (e.g., memory, I/O, disk) used by key controller
services is rapidly reaching threshold levels
https://intl.huaweicloud.com/
System’s Components
► OBS. Object Storage Service is a stable, secure, efficient, cloud
storage service
► EVS. Elastic Volume Service offers scalable block storage for
servers
► VPC. Virtual Private Cloud enables to create private, isolated
virtual networks
► ECS. Elastic Cloud Server is a cloud server that provides
scalable, on-demand computing resources
HUAWEI TECHNOLOGIES CO., LTD. 7
AIOps @ Huawei Cloud
Monitoring
Logs. Service, microservices, and applications
generate logs, composed of timestamped
records with a structure and free-form text,
which are stored in system files.
Metrics. Examples of metrics include CPU
load, memory available, and the response
time of a HTTP request.
Traces. Traces records the workflow and tasks
executed in response to an HTTP request.
Events. Major milestones which occur within a
data center can be exposed as events.
Examples include alarms, service upgrades,
and software releases.
{"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332",
"duration": 124957, “annotations": [{"key": "http.status_code", "value": "200"}, {"key":
"http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value":
"HTTP"}, "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}]
{“tags": [“mem”, “192.196.0.2”, “AZ01”], “data”: [2483, 2669, 2576, 2560, 2549, 2506,
2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]}
{"id": "dns_address_match“, "timestamp": 1529029301238, …}
{"id": "ping_packet_loss“, "timestamp": 152902933452, …}
{"id": "tcp_connection_time“, "timestamp": 15290294516578, …}
{"id": "cpu_usage_average “, "timestamp": 1529023098976, …}
2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace -
default default] Exception during message handling
System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data:
Logs, Metrics, Traces, Events, Topologies
HUAWEI TECHNOLOGIES CO., LTD. 8
Anomaly
Detection
Fault
Diagnosis
Fault
Prediction
Fault
Recovery
AIOps @ Huawei Cloud
O&M Troubleshooting Tasks
Data Source Connectors
Distribution Layer
Time Series Registry
Stability Analysis: Machine Learning and Analytics
Visualization, Exploration, Reporting
Real-time IngestionHistorical Data Storage
Stream ProcessingBatch Processing
Analytical Data Store
DevOpsBusiness Operators Developers
Anomaly Detection Fault Recovery
Fault Prediction Fault Diagnosis Topology Analysis
Fault Prevention
TracingMetrics Events
Reference architecture for AIOps platforms
App Logs Topology
► Anomaly detection. Determine what constitutes normal system
behavior, and then to discern departures from that normal
system behavior
► Fault diagnosis (root cause analysis). Identify links of
dependency that represent causal relationships to discover the
true source of an anomaly
► Fault Prediction. Use of historical or streaming data to predict
incidents with varying degrees of probability
► Fault recovery (Incident Remediation). Explore how decision
support systems can manage and select recovery processes to
repair failed systems
O&M Troubleshooting Tasks
iForesight is an intelligent new tool aimed at SRE cloud maintenance teams.
It enables them to quickly detect anomalies thanks to the use of artificial
intelligence when cloud services are slow or unresponsive.
iForesight3.0
Operational AI & AIOps
HUAWEI TECHNOLOGIES CO., LTD. 9
SAP HANA In-Memory Predictive Analytics
Typical toolbox of algorithms for AIOps
https://www.slideshare.net/SAPTechnology/hana-sps08-newpredictiveanalysislibrary
► Generic algorithms are building blocks for AIOps, but they rarely
work as ready to use solutions
► Their applicability is often illustrated with toy examples which do
not hold in production
► Algorithms need to be wisely combined, modified, complemented
to build new techniques (and methods) which are useful for O&M
for a particular context
‒ e.g., unstable systems, high noise-signal ratio, multi-modal
HUAWEI TECHNOLOGIES CO., LTD. 10
AIOps @ Huawei Cloud
Trace Analysis
N
1
N
2
N
3
N
4
Distributed Traces
User Commands: image delete, image list, flavor set, flavor
show, flavor unset, …
keystone -> nova -> glance -> network
Services
30%
70%
Warning Y
Warning A
Warning B
Warning C
Error Z
A B C ZTrace t1
Sequence of spans
Tracing has applicability in many fields besides microservices, e.g., end-to-end enterprise applications orchestrating mobile
apps, edges, websites, storage, and centralized systems.
HUAWEI TECHNOLOGIES CO., LTD. 11
► Monitoring and Incident Detection
 Anomaly detection in large-scale system is widely studied
 IT monitoring systems typically use application logs and
resource metrics to detect failures
⎻ Distributed tracing is becoming the third pillar of
microservices observability
⎻ Logs and metrics were previously investigated (see [1-4])
► Single and Bi-Models
 Use of single and bi-models to capture traces as
sequences of events and their response time
 Use sequential model representation by utilizing long-
short-term memory (LSTM) networks
► Results
 Detect anomalies in Huawei Cloud infrastructure
 The novel approach outperforms other deep-learning
methods based on traditional architectures
Trace Analysis
Using Multimodal Deep Learning
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.
S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019, 14-17.05, Cyprus.
J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Long Short Term Memory (LSTM)
LSTMs [1] are models which capture sequential data by
using an internal memory. They are very good for analyzing
sequences of values and predicting the next point of a
given time series. This makes LSTMs adequate for Machine
Learning problems that involve sequential data (see [3])
such speech recognition, machine translation, visual
recognition and description, and distributed traces
analysis.
[1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation
9.8 (1997): 1735-1780.
[2] Sequential processing in LSTM (from: http://colah.github.io/posts/2015-08-Understanding-
LSTMs/
[3] LSTM model description (from Andrej Karpathy. http://karpathy.github.io/2015/05/21/rnn-
effectiveness/
HUAWEI TECHNOLOGIES CO., LTD. 12
 A span (event) is a vector of key-value pairs
𝑘𝑖, 𝑣𝑖 describing a characteristic of a
microservice at time 𝑡𝑖
 A trace 𝑇 is an enumerated collection of events
(i.e., spans) sorted by timestamps 𝑒0, 𝑒1, … , 𝑒𝑖
[16]
 An event contains
⎻ trace id, event id, parent id
⎻ protocol, host ip, status code, url
⎻ response time, timestamp
 Trace can have different lengths
⎻ 𝑇𝑝 = 𝑒0
𝑝
, 𝑒1
𝑝
, 𝑒2
𝑝
, … , 𝑒𝑖
𝑝
and 𝑇𝑞 =
𝑒0
𝑞
, 𝑒2
𝑞
, 𝑒1
𝑞
, … , 𝑒𝑖
𝑞
are different (𝑒1 and 𝑒2 are
swapped) but originate from the same activity
⎻ Possibly caused by concurrent systems
 Label each trace with
⎻ concat(url, status code, host ip)
⎻ 𝑁𝑙 lables
 Pad trace vector up to 𝑇𝑙 or truncate traces
Trace Analysis
Approach
 Trace structure
⎻ Trace one-hot categorical encoding [17]
⎻ 𝐷1 = 𝑁𝑡, 𝑇𝑙, 𝑁𝑙
⎻ 𝑁𝑡 = # traces, 𝑇𝑙 = max length, 𝑁𝑙 = # labels
 Response time
⎻ Group response time by label
⎻ Min-max scaling [0, 1]
⎻ 𝐷2 = 𝑁𝑡, 𝑇𝑙, 1
⎻ Last dimension is the response time
HUAWEI TECHNOLOGIES CO., LTD. 13
 We model structural anomaly detection in traces
as a sequence-to-sequence, multi-classification
problem
 LSTM network architecture
⎻ SAD = Structural Anomaly Detection (𝐷1)
 Model input
⎻ 𝑇𝑘 = 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙
 Model output
⎻ 𝑒0, 𝑒1, … , 𝑒𝑖−1
⎻ Probability distribution over the 𝑁𝑙 unique
labels from 𝐿, representing the probability for
the next label 𝑒𝑖 in the sequence
 Compare the predicted output against the
observed label value
⎻ input 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙
→ output 𝑒1, … , 𝑒 𝑇 𝑙
, ′! 0′
⎻ The output is shifted by one event and
padded
 Update network weights using categorical cross-
entropy loss minimization via gradient descent
Trace Analysis
Single-Modality LSTM (1)
SAD = Structural Anomaly Detection (𝐷1)
HUAWEI TECHNOLOGIES CO., LTD. 14
 Evaluate if trace 𝑇𝑡𝑒𝑠𝑡 is anomalous
⎻ 𝑇𝑡𝑒𝑠𝑡 = 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙
 The network calculates
⎻ Probability distribution 𝑃
⎻ 𝑃 = 𝑙0: 𝑝0, 𝑙1: 𝑝1, … , 𝑙 𝑁𝑡
: 𝑝 𝑁𝑡
⎻ Probability of a label of 𝐿 to appear as the
next label value in a trace, given the previous
values
 The output layer has a softmax function to
distribute the probability over labels
⎻ 𝜎(𝑧)𝑖 =
𝑒 𝑧 𝑖
𝑗=1
𝐾
𝑒
𝑧 𝑗
 Such that
⎻ 𝑖
𝑁 𝑙
𝑝𝑖 = 1
 Detection
⎻ Accept top-𝑘 predicted labels as behaviorally
correct
⎻ Otherwise, report an anomaly along with the
events which contributed to the decision
Trace Analysis
Single-Modality LSTM (2)
SAD = Structural Anomaly Detection (𝐷1)
HUAWEI TECHNOLOGIES CO., LTD. 15
 We model response time anomaly detection in
traces as a regression task
 LSTM network architecture
⎻ RTAD = Response Time Anomaly Detection (𝐷2)
 Approach
⎻ A each timestep 𝑡𝑖𝑚𝑒 = 𝑖 with 𝑟𝑡0, 𝑟𝑡1, … , 𝑟𝑡𝑖−1
⎻ Predict the response time 𝑟𝑡𝑖 of the next event
 Update network weights using mean squared loss
via gradient descent
 Detection
⎻ Compute the squared error distance
⎻ 𝑒𝑟𝑟𝑜𝑟𝑖 = 𝑟𝑡𝑖 − 𝑟𝑡𝑖
𝑝 2
, where 𝑟𝑡𝑖
𝑝
is the predicted
value at timestep 𝑖
⎻ errors are fitted by a Gaussian 𝑁(0, 𝜎2
)
⎻ report trace as anomalous, If squared error
between prediction and input at time 𝑖 is out of
95% confidence interval
Trace Analysis
Single-Modality LSTM (3)
RTAD = Response Time Anomaly Detection (𝐷2)
HUAWEI TECHNOLOGIES CO., LTD. 16
 The correlation between trace
structure and response time can be
explored to improve accuracy/recall
 LSTM network architecture
⎻ Concatenation of both single-
modality architectures in the
second hidden layer
 Approach
⎻ input
⎻ 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙
, 𝑟𝑡0, 𝑟𝑡1, … , 𝑟𝑡 𝑇 𝑙
⎻ →
⎻ output
⎻ 𝑒1, … , 𝑒 𝑇 𝑙
, ′! 0′ , 𝑟𝑡1, 𝑟𝑡2, … , 𝑟𝑡 𝑇 𝑙
, 0
 Update network weights using sum
mean squared error and categorical
cross-entropy
Trace Analysis
Multimodal LSTM (1)
 Detection
⎻ Performed by comparing the output element-wise
with the input for both modalities using the strategy
developed in the single-modality architectures
⎻ Anomaly source: 1) response time, 2) structural, or 3)
both
SAD (𝐷1) and RTAD (𝐷2)
HUAWEI TECHNOLOGIES CO., LTD. 17
► Datasets
 System under study has 1000+ services running production Openstack-based cloud [19]
 Traces collected using Zipkin [16] over 50 days: >4.5M events; 1M traces
► Evaluation Platform
 Python using Keras, model with batch size = 512, learning rate of 0.001, and 400 epochs
 PC using GPU-NVIDIA GTX 1060
► Preprocessing
 To avoid outliers, select labels that appear more than 1000 times (105 unique labels)
 Distribution of trace lengths is imbalanced: >90% have lengths <10 events
 Select only 1000 samples of each trace length
⎻ Requires <1% of all the recorded data
⎻ Efficient and fast for training
 For robustness, we also select the traces with lengths between 4 and 20
► Performance
 Time to train multimodal LSTM on 1% traces (1M): approx. 30 min
 Prediction time per trace: <50 ms
Trace Analysis
Evaluation
HUAWEI TECHNOLOGIES CO., LTD. 18
Trace Analysis
Evaluation
Best results in terms of accuracy of structural anomaly
detection are achieved using the multimodal LSTM predictions
Multimodal LSTMSingle-model LSTM
Multimodal DenseSingle-model Dense
Single-modality LSTM achieves a comparable accuracy, while
single- and multimodal dense architectures have low
accuracies
Multimodal LSTM achieves a better accuracy than the single-
modality LSTM for k=1, otherwise it is similar
Accuracy when the anomaly is injected in traces with
different sizes
Multimodal slightly outperforms the single-modality
approach in 9 out of 15 trace lengths for k = 3 and k = 5
Significantly better results are achieved for k = 1 for
almost all of the trace lengths
HUAWEI TECHNOLOGIES CO., LTD. 19
Trace Analysis
Evaluation
For SAD, significantly better performance of the multimodal
approach for k = 1
Single-modality LSTM achieves a comparable accuracy, while
single- and multimodal dense architectures have low
accuracies
Multimodal LSTM achieves a better accuracy than the single-
modality LSTM for k=1, otherwise it is similar
For RTAD, single-task models have low performance
than those of both multimodal models
Multimodal approach achieves a higher accuracy for
RTAD
Models have high accuracy even when the length of the
trace increases. This is because the LSTMs are able to
learn long-term dependencies in sequential tasks
HUAWEI TECHNOLOGIES CO., LTD. 20
 [1] F. Schmidt, A. Gulenko, M. Wallschlger, A. Acker, V. Hennig, F. Liu, and O. Kao, “Iftm - unsupervised anomaly detection for
virtualized network function services,” in 2018 IEEE International Conference on Web Services (ICWS), July 2018, pp. 187–194.
 [2] A. Gulenko, F. Schmidt, A. Acker, M. Wallschlager, O. Kao, and F. Liu, “Detecting anomalous behavior of black-box services
modeled with distance-based online clustering,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018,
pp. 912–915.
 [3] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communica- tions Security. ACM, 2017, pp. 1285–1298.
 [4] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschlger, A. Acker, and O. Kao, “Unsupervised anomaly event detection for cloud
monitoring using online arima,” in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC
Companion), Dec 2018, pp. 71–76.
 [16] OpenZipkin, “openzipkin/zipkin,” 2018. [Online]. Available: https://github.com/openzipkin/zipkin
 [19] A. Shrivastwa, S. Sarat, K. Jackson, C. Bunch, E. Sigler, and T. Campbell, OpenStack: Building a Cloud Environment. Packt
Publishing, 2016.
Further reading
 S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud
2019, July 3-8, 2019, Milan, Italy.
 S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019,
14-17.05, Cyprus.
 J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany.
Trace Analysis
References
Copyright©2019 Huawei Technologies Co., Ltd.
All Rights Reserved.
The information in this document may contain predictive
statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
Bring digital to every person, home and
organization for a fully connected,
intelligent world.
Thank you.

Contenu connexe

Tendances

AIOps, IT Analytics, and Business Performance: What’s Needed and What Works
AIOps, IT Analytics, and Business Performance: What’s Needed and What Works AIOps, IT Analytics, and Business Performance: What’s Needed and What Works
AIOps, IT Analytics, and Business Performance: What’s Needed and What Works Enterprise Management Associates
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for DinnerKent Graziano
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Data strategy demistifying data
Data strategy demistifying dataData strategy demistifying data
Data strategy demistifying dataHans Verstraeten
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsDATAVERSITY
 
Journey data driven organization
Journey data driven organizationJourney data driven organization
Journey data driven organizationInbalraanan
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Rajesh Kumar
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingDATAVERSITY
 
Value analysis with Value Stream and Capability modeling
Value analysis with Value Stream and Capability modelingValue analysis with Value Stream and Capability modeling
Value analysis with Value Stream and Capability modelingCOMPETENSIS
 
ArchiSurance Case Study
ArchiSurance Case StudyArchiSurance Case Study
ArchiSurance Case StudyIver Band
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
data-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxdata-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxMohamedHendawy17
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as ProductDATAVERSITY
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise AnalyticsDATAVERSITY
 
Designing An Enterprise Data Fabric
Designing An Enterprise Data FabricDesigning An Enterprise Data Fabric
Designing An Enterprise Data FabricAlan McSweeney
 
Slides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data GovernanceSlides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data GovernanceDATAVERSITY
 
BI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyBI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyShivam Dhawan
 

Tendances (20)

AIOps, IT Analytics, and Business Performance: What’s Needed and What Works
AIOps, IT Analytics, and Business Performance: What’s Needed and What Works AIOps, IT Analytics, and Business Performance: What’s Needed and What Works
AIOps, IT Analytics, and Business Performance: What’s Needed and What Works
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Data strategy demistifying data
Data strategy demistifying dataData strategy demistifying data
Data strategy demistifying data
 
Building a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business GoalsBuilding a Data Strategy – Practical Steps for Aligning with Business Goals
Building a Data Strategy – Practical Steps for Aligning with Business Goals
 
Journey data driven organization
Journey data driven organizationJourney data driven organization
Journey data driven organization
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
 
Emerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big ThingEmerging Trends in Data Architecture – What’s the Next Big Thing
Emerging Trends in Data Architecture – What’s the Next Big Thing
 
Value analysis with Value Stream and Capability modeling
Value analysis with Value Stream and Capability modelingValue analysis with Value Stream and Capability modeling
Value analysis with Value Stream and Capability modeling
 
ArchiSurance Case Study
ArchiSurance Case StudyArchiSurance Case Study
ArchiSurance Case Study
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 
Big Data analytics
Big Data analyticsBig Data analytics
Big Data analytics
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
data-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxdata-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptx
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as Product
 
2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics2023 Trends in Enterprise Analytics
2023 Trends in Enterprise Analytics
 
Designing An Enterprise Data Fabric
Designing An Enterprise Data FabricDesigning An Enterprise Data Fabric
Designing An Enterprise Data Fabric
 
Slides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data GovernanceSlides: Taking an Active Approach to Data Governance
Slides: Taking an Active Approach to Data Governance
 
BI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and StrategyBI Consultancy - Data, Analytics and Strategy
BI Consultancy - Data, Analytics and Strategy
 

Similaire à AIOps: Anomalies Detection of Distributed Traces

For linked in part 2 no template
For linked in part 2  no templateFor linked in part 2  no template
For linked in part 2 no templatePankaj Tomar
 
Dell NVIDIA AI Roadshow - South Western Ontario
Dell NVIDIA AI Roadshow - South Western OntarioDell NVIDIA AI Roadshow - South Western Ontario
Dell NVIDIA AI Roadshow - South Western OntarioBill Wong
 
Cloud Computing Nedc Wp 28 May
Cloud Computing Nedc Wp 28 MayCloud Computing Nedc Wp 28 May
Cloud Computing Nedc Wp 28 MayGovCloud Network
 
Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...William Santiago
 
Predictive Maintenance by analysing acoustic data in an industrial environment
Predictive Maintenance by analysing acoustic data in an industrial environmentPredictive Maintenance by analysing acoustic data in an industrial environment
Predictive Maintenance by analysing acoustic data in an industrial environmentCapgemini
 
InTTrust -IBM Artificial Intelligence Event
InTTrust -IBM Artificial Intelligence  EventInTTrust -IBM Artificial Intelligence  Event
InTTrust -IBM Artificial Intelligence EventMichail Pagiatakis
 
Activeeon technology for Big Compute and cloud migration
Activeeon technology for Big Compute and cloud migrationActiveeon technology for Big Compute and cloud migration
Activeeon technology for Big Compute and cloud migrationActiveeon
 
Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...Analytics India Magazine
 
Dell AI Telecom Webinar
Dell AI Telecom WebinarDell AI Telecom Webinar
Dell AI Telecom WebinarBill Wong
 
Dell AI Oil and Gas Webinar
Dell AI Oil and Gas WebinarDell AI Oil and Gas Webinar
Dell AI Oil and Gas WebinarBill Wong
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsData Driven Innovation
 
9 Digital Transformation Trends for 2023.pdf
9 Digital Transformation Trends for 2023.pdf9 Digital Transformation Trends for 2023.pdf
9 Digital Transformation Trends for 2023.pdfGriffonWebstudios1
 
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X Kai Wähner
 
Flexible and Scalable Integration in the Automation Industry/Industrial IoT
Flexible and Scalable Integration in the Automation Industry/Industrial IoTFlexible and Scalable Integration in the Automation Industry/Industrial IoT
Flexible and Scalable Integration in the Automation Industry/Industrial IoTconfluent
 
ELEKS Switzerland office opening, Oct 2021
ELEKS Switzerland office opening, Oct 2021ELEKS Switzerland office opening, Oct 2021
ELEKS Switzerland office opening, Oct 2021ELEKS
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edgeConference Papers
 
GARE du MIDIH MIDIH, towards a flexible, modular and open source reference ...
GARE du MIDIH   MIDIH, towards a flexible, modular and open source reference ...GARE du MIDIH   MIDIH, towards a flexible, modular and open source reference ...
GARE du MIDIH MIDIH, towards a flexible, modular and open source reference ...MIDIH_EU
 

Similaire à AIOps: Anomalies Detection of Distributed Traces (20)

For linked in part 2 no template
For linked in part 2  no templateFor linked in part 2  no template
For linked in part 2 no template
 
Dell NVIDIA AI Roadshow - South Western Ontario
Dell NVIDIA AI Roadshow - South Western OntarioDell NVIDIA AI Roadshow - South Western Ontario
Dell NVIDIA AI Roadshow - South Western Ontario
 
Cloud Computing Nedc Wp 28 May
Cloud Computing Nedc Wp 28 MayCloud Computing Nedc Wp 28 May
Cloud Computing Nedc Wp 28 May
 
Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...Cloud Computing - A collection of predictions, principles and providers - Feb...
Cloud Computing - A collection of predictions, principles and providers - Feb...
 
Predictive Maintenance by analysing acoustic data in an industrial environment
Predictive Maintenance by analysing acoustic data in an industrial environmentPredictive Maintenance by analysing acoustic data in an industrial environment
Predictive Maintenance by analysing acoustic data in an industrial environment
 
InTTrust -IBM Artificial Intelligence Event
InTTrust -IBM Artificial Intelligence  EventInTTrust -IBM Artificial Intelligence  Event
InTTrust -IBM Artificial Intelligence Event
 
IBM Think Milano
IBM Think MilanoIBM Think Milano
IBM Think Milano
 
Activeeon technology for Big Compute and cloud migration
Activeeon technology for Big Compute and cloud migrationActiveeon technology for Big Compute and cloud migration
Activeeon technology for Big Compute and cloud migration
 
AE Foyer: Information Management in the Digital Enterprise
AE Foyer: Information Management in the Digital EnterpriseAE Foyer: Information Management in the Digital Enterprise
AE Foyer: Information Management in the Digital Enterprise
 
Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...Emerging engineering issues for building large scale AI systems By Srinivas P...
Emerging engineering issues for building large scale AI systems By Srinivas P...
 
Dell AI Telecom Webinar
Dell AI Telecom WebinarDell AI Telecom Webinar
Dell AI Telecom Webinar
 
Dell AI Oil and Gas Webinar
Dell AI Oil and Gas WebinarDell AI Oil and Gas Webinar
Dell AI Oil and Gas Webinar
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and Analytics
 
9 Digital Transformation Trends for 2023.pdf
9 Digital Transformation Trends for 2023.pdf9 Digital Transformation Trends for 2023.pdf
9 Digital Transformation Trends for 2023.pdf
 
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X
IIoT / Industry 4.0 with Apache Kafka, Connect, KSQL, Apache PLC4X
 
Flexible and Scalable Integration in the Automation Industry/Industrial IoT
Flexible and Scalable Integration in the Automation Industry/Industrial IoTFlexible and Scalable Integration in the Automation Industry/Industrial IoT
Flexible and Scalable Integration in the Automation Industry/Industrial IoT
 
ELEKS Switzerland office opening, Oct 2021
ELEKS Switzerland office opening, Oct 2021ELEKS Switzerland office opening, Oct 2021
ELEKS Switzerland office opening, Oct 2021
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
 
GARE du MIDIH MIDIH, towards a flexible, modular and open source reference ...
GARE du MIDIH   MIDIH, towards a flexible, modular and open source reference ...GARE du MIDIH   MIDIH, towards a flexible, modular and open source reference ...
GARE du MIDIH MIDIH, towards a flexible, modular and open source reference ...
 
50120140502008
5012014050200850120140502008
50120140502008
 

Plus de Jorge Cardoso

Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep LearningJorge Cardoso
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionJorge Cardoso
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Jorge Cardoso
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Jorge Cardoso
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresJorge Cardoso
 
Cloud Resilience with Open Stack
Cloud Resilience with Open StackCloud Resilience with Open Stack
Cloud Resilience with Open StackJorge Cardoso
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDLJorge Cardoso
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveJorge Cardoso
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCAJorge Cardoso
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network AnalysisJorge Cardoso
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisJorge Cardoso
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksJorge Cardoso
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications Jorge Cardoso
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCAJorge Cardoso
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service NetworksJorge Cardoso
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksJorge Cardoso
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Jorge Cardoso
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesJorge Cardoso
 

Plus de Jorge Cardoso (20)

Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep Learning
 
Cloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injectionCloud Reliability: Decreasing outage frequency using fault injection
Cloud Reliability: Decreasing outage frequency using fault injection
 
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
Cloud Operations and Analytics: Improving Distributed Systems Reliability usi...
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016
 
Shape the Cloud
Shape the CloudShape the Cloud
Shape the Cloud
 
DOST 2016 Cloud Without Failures
DOST 2016 Cloud Without FailuresDOST 2016 Cloud Without Failures
DOST 2016 Cloud Without Failures
 
Cloud Resilience with Open Stack
Cloud Resilience with Open StackCloud Resilience with Open Stack
Cloud Resilience with Open Stack
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDL
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspective
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCA
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network Analysis
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and Analysis
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service Networks
 
Linked USDL
Linked USDLLinked USDL
Linked USDL
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCA
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service Networks
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service Networks
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-services
 

Dernier

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 

Dernier (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

AIOps: Anomalies Detection of Distributed Traces

  • 1. AIOps: Anomaly Detection of Distributed Traces Prof. Dr. Jorge Cardoso E-mail: jorge.cardoso@huawei.com Planet-scale AIOps / Chief Architect Ireland and Munich Research Centers Definition (Gartner) [AIOps] AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. National University of Ireland, Galway (NUI Galway) 2019.08.20
  • 2. HUAWEI TECHNOLOGIES CO., LTD. 1 AIOps: Anomaly Detection of Distributed Traces Abstract and bio In planet-scale deployments, the Operation and Maintenance (O&M) of cloud platforms cannot be done any longer manually or simply with off-the-shelf solutions. It requires self-developed automated systems, ideally exploiting the use of AI to provide tools for autonomous cloud operations. This talk will introduce the field of AIOps (artificial intelligence for IT operations) and present what the next generation of AI-driven IT Operations tools and platforms will be. Our current research in Munich explores how AIOps, and, namely, deep learning, machine learning, distributed traces, graph analysis, time-series analysis, sequence analysis, and log analysis can be used to effectively detect and localize failures in cloud infrastructures to reduce the workload of SRE operators. In particular, I will describe how distributed traces can be analyzed using deep learning to effectively identify anomalies. Dr. Jorge Cardoso is Chief Architect for Planet-scale AIOps at Huawei’s Ireland and Munich Research Centers. Previously he worked for several major companies such as SAP Research (Germany) on the Internet of Services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. He previously gave lectures at the Karlsruhe Institute of Technology (Germany), University of Georgia (USA), University of Coimbra and University of Madeira (Portugal). He recently published his latest book titled “Fundamentals of Service Systems” with Springer. His current research involves the development of the next generation of Cloud Operations and Analytics using AI, Cloud Reliability and Resilience, and High Performance Business Process Management systems. He has a Ph.D. in Computer Science from the University of Georgia (USA). Jorge Cardoso GitHub | Slideshare.net | GoogleScholar Previous Positions Interests AIOps, Service Reliability Engineering, Cloud Computing, Distributed Systems, Business Process Management
  • 3. HUAWEI TECHNOLOGIES CO., LTD. 2 Operations and Maintenance Global Trends Core Networks IoT Infrastructure Network Appliances Radio Access Network Content Delivery Networks Virtual CPE Mobile Edge Computing Energy Manufacturing Smart Cities/BuildingsHealthcare Networking East Environment Energy PHILIPS MedicalGeely The cloud has evolved from a disruptor to an obvious solution for enterprise IT. However, the increasing complexity of public, private, edge, and hybrid cloud environments presents new challenges for cost-efficient operations and maintenance (O&M) tasks
  • 4. HUAWEI TECHNOLOGIES CO., LTD. 3 Underlying Problem Increasing Complexity of Systems Every year the management of IT O&M is more complex ► Digitalization with cloud, mobile, and edge ► Large-scale microservice and serverless systems ► Increase in IT size, and event/alert volumes ► SLA guarantee for critical applications ► Nano/microservices break existing monitoring tools Microservices/serverless add complexity to communications between services making systems far more difficult to maintain and troubleshoot. With a monolith application, when there is a problem, only one product owner needs to be contacted. HC Cloud Data Center 1 Data Center 2 Server 1 Server 2 Socket 1 Socket 2 Core 1 Core 2 VM 1 VM 2 VM 3 Core .. Core 16 Server .. Server 20.000 Data Center .. Data Center 10 Huawei Cloud is an example of a planet- scale complex system: 40 availability zones, 23 geographical regions (Germany, France, South/Central America, Hong Kong and Russia to Thailand and South Africa) 20M VMs (3 VM/core), billion metrics (for illustrative purpose only) However, existing O&M tools still rely on old approaches which involve manual rule configuration, averages and std, and data wrangling to manage physical assets
  • 5. HUAWEI TECHNOLOGIES CO., LTD. 4 Industry Trends Bringing AI to O&M Startup in the field of AIOps The market is seeing interesting mergers and acquisitions by important vendors ► Cloud providers are also emerging as ITIM tool providers purely for their IaaS offerings At the moment, these tools have visibility limited to the cloud providers' own environment ► Nonetheless, they are valuable assets which can also be offered to external customers  VMware’s acquisition of Wavefront is interesting from an analytics perspective  ScienceLogic has acquired AppFirst, enabling it to acquire granular application metrics  CISCO has acquired AppDynamics and Perspica, enabling it to acquire rea-time monitoring and AI  CA Technologies’ acquisition of Automic brings in an additional automation technology to its existing portfolio  Spinoff of Hewlett Packard Enterprise’s (HPE's) software business to Micro Focus VMware acquired Wavefront (2017) Cisco acquired AppDynamics (2017) Cisco acquired Perspica for ML (Oct 2017) CA acquired Automic (2017) Google acquired StackDriver (2014) Amazon AWS
  • 6. HUAWEI TECHNOLOGIES CO., LTD. 5 Industry Trends Bringing AI to O&M “Virtyt Koshi, the EMEA general manager for virtualization vendor Mavenir, reckons Google is able to run all of its data centers in the Asia-Pacific with only about 30 people, and that a telco would typically have about 3,000 employees to manage infrastructure across the same area." Reducing O&M Costs Google: 30 people Others: 3000 people Automation can reduce the number of incidents by 50% “We began applying machine learning two years ago (2016) to operate our data centers more efficiently… over the past few months, DeepMind researchers began working with Google’s data center team to significantly improve the system’s utility. Using neural networks trained on different operating scenarios and parameters, we created a more efficient and adaptive framework to understand data center dynamics and optimize efficiency.” Eric Schmidt, December 2018
  • 7. HUAWEI TECHNOLOGIES CO., LTD. 6 AIOps @ Huawei Cloud Components and Troubleshooting Scenarios Troubleshooting Scenarios  Response Time Analysis  A service started responding to requests more slowly than normal  The change happened suddenly as a consequence of regression in the latest deployment  System Load  The demand placed on the system, e.g., REST API requests per second, increase since yesterday  Error Analysis  The rate of requests that fail -- either explicitly (HTTP 5xx) or implicitly (HTTP 2xx with wrong content) -- is increasing slowly, but steadily  System Saturation  The resources (e.g., memory, I/O, disk) used by key controller services is rapidly reaching threshold levels https://intl.huaweicloud.com/ System’s Components ► OBS. Object Storage Service is a stable, secure, efficient, cloud storage service ► EVS. Elastic Volume Service offers scalable block storage for servers ► VPC. Virtual Private Cloud enables to create private, isolated virtual networks ► ECS. Elastic Cloud Server is a cloud server that provides scalable, on-demand computing resources
  • 8. HUAWEI TECHNOLOGIES CO., LTD. 7 AIOps @ Huawei Cloud Monitoring Logs. Service, microservices, and applications generate logs, composed of timestamped records with a structure and free-form text, which are stored in system files. Metrics. Examples of metrics include CPU load, memory available, and the response time of a HTTP request. Traces. Traces records the workflow and tasks executed in response to an HTTP request. Events. Major milestones which occur within a data center can be exposed as events. Examples include alarms, service upgrades, and software releases. {"traceId": "72c53", "name": "get", "timestamp": 1529029301238, "id": "df332", "duration": 124957, “annotations": [{"key": "http.status_code", "value": "200"}, {"key": "http.url", "value": "https://v2/e5/servers/detail?limit=200"}, {"key": "protocol", "value": "HTTP"}, "endpoint": {"serviceName": "hss", "ipv4": "126.75.191.253"}] {“tags": [“mem”, “192.196.0.2”, “AZ01”], “data”: [2483, 2669, 2576, 2560, 2549, 2506, 2480, 2565, 3140, …, 2542, 2636, 2638, 2538, 2521, 2614, 2514, 2574, 2519]} {"id": "dns_address_match“, "timestamp": 1529029301238, …} {"id": "ping_packet_loss“, "timestamp": 152902933452, …} {"id": "tcp_connection_time“, "timestamp": 15290294516578, …} {"id": "cpu_usage_average “, "timestamp": 1529023098976, …} 2017-01-18 15:54:00.467 32552 ERROR oslo_messaging.rpc.server [req-c0b38ace - default default] Exception during message handling System’s Components (e.g., OBS, EVS, VPC, ECS) are monitored and generate various types of data: Logs, Metrics, Traces, Events, Topologies
  • 9. HUAWEI TECHNOLOGIES CO., LTD. 8 Anomaly Detection Fault Diagnosis Fault Prediction Fault Recovery AIOps @ Huawei Cloud O&M Troubleshooting Tasks Data Source Connectors Distribution Layer Time Series Registry Stability Analysis: Machine Learning and Analytics Visualization, Exploration, Reporting Real-time IngestionHistorical Data Storage Stream ProcessingBatch Processing Analytical Data Store DevOpsBusiness Operators Developers Anomaly Detection Fault Recovery Fault Prediction Fault Diagnosis Topology Analysis Fault Prevention TracingMetrics Events Reference architecture for AIOps platforms App Logs Topology ► Anomaly detection. Determine what constitutes normal system behavior, and then to discern departures from that normal system behavior ► Fault diagnosis (root cause analysis). Identify links of dependency that represent causal relationships to discover the true source of an anomaly ► Fault Prediction. Use of historical or streaming data to predict incidents with varying degrees of probability ► Fault recovery (Incident Remediation). Explore how decision support systems can manage and select recovery processes to repair failed systems O&M Troubleshooting Tasks iForesight is an intelligent new tool aimed at SRE cloud maintenance teams. It enables them to quickly detect anomalies thanks to the use of artificial intelligence when cloud services are slow or unresponsive. iForesight3.0 Operational AI & AIOps
  • 10. HUAWEI TECHNOLOGIES CO., LTD. 9 SAP HANA In-Memory Predictive Analytics Typical toolbox of algorithms for AIOps https://www.slideshare.net/SAPTechnology/hana-sps08-newpredictiveanalysislibrary ► Generic algorithms are building blocks for AIOps, but they rarely work as ready to use solutions ► Their applicability is often illustrated with toy examples which do not hold in production ► Algorithms need to be wisely combined, modified, complemented to build new techniques (and methods) which are useful for O&M for a particular context ‒ e.g., unstable systems, high noise-signal ratio, multi-modal
  • 11. HUAWEI TECHNOLOGIES CO., LTD. 10 AIOps @ Huawei Cloud Trace Analysis N 1 N 2 N 3 N 4 Distributed Traces User Commands: image delete, image list, flavor set, flavor show, flavor unset, … keystone -> nova -> glance -> network Services 30% 70% Warning Y Warning A Warning B Warning C Error Z A B C ZTrace t1 Sequence of spans Tracing has applicability in many fields besides microservices, e.g., end-to-end enterprise applications orchestrating mobile apps, edges, websites, storage, and centralized systems.
  • 12. HUAWEI TECHNOLOGIES CO., LTD. 11 ► Monitoring and Incident Detection  Anomaly detection in large-scale system is widely studied  IT monitoring systems typically use application logs and resource metrics to detect failures ⎻ Distributed tracing is becoming the third pillar of microservices observability ⎻ Logs and metrics were previously investigated (see [1-4]) ► Single and Bi-Models  Use of single and bi-models to capture traces as sequences of events and their response time  Use sequential model representation by utilizing long- short-term memory (LSTM) networks ► Results  Detect anomalies in Huawei Cloud infrastructure  The novel approach outperforms other deep-learning methods based on traditional architectures Trace Analysis Using Multimodal Deep Learning S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy. S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019, 14-17.05, Cyprus. J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany. Long Short Term Memory (LSTM) LSTMs [1] are models which capture sequential data by using an internal memory. They are very good for analyzing sequences of values and predicting the next point of a given time series. This makes LSTMs adequate for Machine Learning problems that involve sequential data (see [3]) such speech recognition, machine translation, visual recognition and description, and distributed traces analysis. [1] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. [2] Sequential processing in LSTM (from: http://colah.github.io/posts/2015-08-Understanding- LSTMs/ [3] LSTM model description (from Andrej Karpathy. http://karpathy.github.io/2015/05/21/rnn- effectiveness/
  • 13. HUAWEI TECHNOLOGIES CO., LTD. 12  A span (event) is a vector of key-value pairs 𝑘𝑖, 𝑣𝑖 describing a characteristic of a microservice at time 𝑡𝑖  A trace 𝑇 is an enumerated collection of events (i.e., spans) sorted by timestamps 𝑒0, 𝑒1, … , 𝑒𝑖 [16]  An event contains ⎻ trace id, event id, parent id ⎻ protocol, host ip, status code, url ⎻ response time, timestamp  Trace can have different lengths ⎻ 𝑇𝑝 = 𝑒0 𝑝 , 𝑒1 𝑝 , 𝑒2 𝑝 , … , 𝑒𝑖 𝑝 and 𝑇𝑞 = 𝑒0 𝑞 , 𝑒2 𝑞 , 𝑒1 𝑞 , … , 𝑒𝑖 𝑞 are different (𝑒1 and 𝑒2 are swapped) but originate from the same activity ⎻ Possibly caused by concurrent systems  Label each trace with ⎻ concat(url, status code, host ip) ⎻ 𝑁𝑙 lables  Pad trace vector up to 𝑇𝑙 or truncate traces Trace Analysis Approach  Trace structure ⎻ Trace one-hot categorical encoding [17] ⎻ 𝐷1 = 𝑁𝑡, 𝑇𝑙, 𝑁𝑙 ⎻ 𝑁𝑡 = # traces, 𝑇𝑙 = max length, 𝑁𝑙 = # labels  Response time ⎻ Group response time by label ⎻ Min-max scaling [0, 1] ⎻ 𝐷2 = 𝑁𝑡, 𝑇𝑙, 1 ⎻ Last dimension is the response time
  • 14. HUAWEI TECHNOLOGIES CO., LTD. 13  We model structural anomaly detection in traces as a sequence-to-sequence, multi-classification problem  LSTM network architecture ⎻ SAD = Structural Anomaly Detection (𝐷1)  Model input ⎻ 𝑇𝑘 = 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙  Model output ⎻ 𝑒0, 𝑒1, … , 𝑒𝑖−1 ⎻ Probability distribution over the 𝑁𝑙 unique labels from 𝐿, representing the probability for the next label 𝑒𝑖 in the sequence  Compare the predicted output against the observed label value ⎻ input 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙 → output 𝑒1, … , 𝑒 𝑇 𝑙 , ′! 0′ ⎻ The output is shifted by one event and padded  Update network weights using categorical cross- entropy loss minimization via gradient descent Trace Analysis Single-Modality LSTM (1) SAD = Structural Anomaly Detection (𝐷1)
  • 15. HUAWEI TECHNOLOGIES CO., LTD. 14  Evaluate if trace 𝑇𝑡𝑒𝑠𝑡 is anomalous ⎻ 𝑇𝑡𝑒𝑠𝑡 = 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙  The network calculates ⎻ Probability distribution 𝑃 ⎻ 𝑃 = 𝑙0: 𝑝0, 𝑙1: 𝑝1, … , 𝑙 𝑁𝑡 : 𝑝 𝑁𝑡 ⎻ Probability of a label of 𝐿 to appear as the next label value in a trace, given the previous values  The output layer has a softmax function to distribute the probability over labels ⎻ 𝜎(𝑧)𝑖 = 𝑒 𝑧 𝑖 𝑗=1 𝐾 𝑒 𝑧 𝑗  Such that ⎻ 𝑖 𝑁 𝑙 𝑝𝑖 = 1  Detection ⎻ Accept top-𝑘 predicted labels as behaviorally correct ⎻ Otherwise, report an anomaly along with the events which contributed to the decision Trace Analysis Single-Modality LSTM (2) SAD = Structural Anomaly Detection (𝐷1)
  • 16. HUAWEI TECHNOLOGIES CO., LTD. 15  We model response time anomaly detection in traces as a regression task  LSTM network architecture ⎻ RTAD = Response Time Anomaly Detection (𝐷2)  Approach ⎻ A each timestep 𝑡𝑖𝑚𝑒 = 𝑖 with 𝑟𝑡0, 𝑟𝑡1, … , 𝑟𝑡𝑖−1 ⎻ Predict the response time 𝑟𝑡𝑖 of the next event  Update network weights using mean squared loss via gradient descent  Detection ⎻ Compute the squared error distance ⎻ 𝑒𝑟𝑟𝑜𝑟𝑖 = 𝑟𝑡𝑖 − 𝑟𝑡𝑖 𝑝 2 , where 𝑟𝑡𝑖 𝑝 is the predicted value at timestep 𝑖 ⎻ errors are fitted by a Gaussian 𝑁(0, 𝜎2 ) ⎻ report trace as anomalous, If squared error between prediction and input at time 𝑖 is out of 95% confidence interval Trace Analysis Single-Modality LSTM (3) RTAD = Response Time Anomaly Detection (𝐷2)
  • 17. HUAWEI TECHNOLOGIES CO., LTD. 16  The correlation between trace structure and response time can be explored to improve accuracy/recall  LSTM network architecture ⎻ Concatenation of both single- modality architectures in the second hidden layer  Approach ⎻ input ⎻ 𝑒0, 𝑒1, … , 𝑒 𝑇 𝑙 , 𝑟𝑡0, 𝑟𝑡1, … , 𝑟𝑡 𝑇 𝑙 ⎻ → ⎻ output ⎻ 𝑒1, … , 𝑒 𝑇 𝑙 , ′! 0′ , 𝑟𝑡1, 𝑟𝑡2, … , 𝑟𝑡 𝑇 𝑙 , 0  Update network weights using sum mean squared error and categorical cross-entropy Trace Analysis Multimodal LSTM (1)  Detection ⎻ Performed by comparing the output element-wise with the input for both modalities using the strategy developed in the single-modality architectures ⎻ Anomaly source: 1) response time, 2) structural, or 3) both SAD (𝐷1) and RTAD (𝐷2)
  • 18. HUAWEI TECHNOLOGIES CO., LTD. 17 ► Datasets  System under study has 1000+ services running production Openstack-based cloud [19]  Traces collected using Zipkin [16] over 50 days: >4.5M events; 1M traces ► Evaluation Platform  Python using Keras, model with batch size = 512, learning rate of 0.001, and 400 epochs  PC using GPU-NVIDIA GTX 1060 ► Preprocessing  To avoid outliers, select labels that appear more than 1000 times (105 unique labels)  Distribution of trace lengths is imbalanced: >90% have lengths <10 events  Select only 1000 samples of each trace length ⎻ Requires <1% of all the recorded data ⎻ Efficient and fast for training  For robustness, we also select the traces with lengths between 4 and 20 ► Performance  Time to train multimodal LSTM on 1% traces (1M): approx. 30 min  Prediction time per trace: <50 ms Trace Analysis Evaluation
  • 19. HUAWEI TECHNOLOGIES CO., LTD. 18 Trace Analysis Evaluation Best results in terms of accuracy of structural anomaly detection are achieved using the multimodal LSTM predictions Multimodal LSTMSingle-model LSTM Multimodal DenseSingle-model Dense Single-modality LSTM achieves a comparable accuracy, while single- and multimodal dense architectures have low accuracies Multimodal LSTM achieves a better accuracy than the single- modality LSTM for k=1, otherwise it is similar Accuracy when the anomaly is injected in traces with different sizes Multimodal slightly outperforms the single-modality approach in 9 out of 15 trace lengths for k = 3 and k = 5 Significantly better results are achieved for k = 1 for almost all of the trace lengths
  • 20. HUAWEI TECHNOLOGIES CO., LTD. 19 Trace Analysis Evaluation For SAD, significantly better performance of the multimodal approach for k = 1 Single-modality LSTM achieves a comparable accuracy, while single- and multimodal dense architectures have low accuracies Multimodal LSTM achieves a better accuracy than the single- modality LSTM for k=1, otherwise it is similar For RTAD, single-task models have low performance than those of both multimodal models Multimodal approach achieves a higher accuracy for RTAD Models have high accuracy even when the length of the trace increases. This is because the LSTMs are able to learn long-term dependencies in sequential tasks
  • 21. HUAWEI TECHNOLOGIES CO., LTD. 20  [1] F. Schmidt, A. Gulenko, M. Wallschlger, A. Acker, V. Hennig, F. Liu, and O. Kao, “Iftm - unsupervised anomaly detection for virtualized network function services,” in 2018 IEEE International Conference on Web Services (ICWS), July 2018, pp. 187–194.  [2] A. Gulenko, F. Schmidt, A. Acker, M. Wallschlager, O. Kao, and F. Liu, “Detecting anomalous behavior of black-box services modeled with distance-based online clustering,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018, pp. 912–915.  [3] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communica- tions Security. ACM, 2017, pp. 1285–1298.  [4] F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschlger, A. Acker, and O. Kao, “Unsupervised anomaly event detection for cloud monitoring using online arima,” in 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), Dec 2018, pp. 71–76.  [16] OpenZipkin, “openzipkin/zipkin,” 2018. [Online]. Available: https://github.com/openzipkin/zipkin  [19] A. Shrivastwa, S. Sarat, K. Jackson, C. Bunch, E. Sigler, and T. Campbell, OpenStack: Building a Cloud Environment. Packt Publishing, 2016. Further reading  S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection from System Tracing Data using Multimodal Deep Learning, IEEE Cloud 2019, July 3-8, 2019, Milan, Italy.  S. Nedelkoski, J. Cardoso, O. Kao, Anomaly Detection and Classification using Distributed Tracing and Deep Learning, CCGrid 2019, 14-17.05, Cyprus.  J. Cardoso, Mastering AIOps with Deep Learning, Presentation at SRECon18, 29–31 August 2018, Dusseldorf, Germany. Trace Analysis References
  • 22. Copyright©2019 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice. Bring digital to every person, home and organization for a fully connected, intelligent world. Thank you.