Machine Learning for Log-Based Anomaly Detection
Mohammed Bekkouche
m.bekkouche@esi-sba.dz
LabRI-SBA Laboratory
École Supérieure en Informatique, Sidi Bel Abbès, Algeria
July 14th, 2025
Rome, Italy
M. Bekkouche (ESI-SBA), Log-Based Anomaly Detection, July 14th, 2025, Rome
Outline
1 Introduction
2 Log-Based Anomaly Detection
3 Machine Learning Approaches
4 Common Models
5 Case Study
6 Conclusion and Future Work
Introduction
Context and Motivation
Modern distributed systems (e.g., Hadoop) consist of thousands of commodity machines.
These software systems generate massive volumes of logs.
Log data is essential for monitoring system behavior.
Anomalies (abnormal behaviors) may indicate failures, attacks, or bugs.
Manual log inspection is impractical due to the scale and complexity of modern distributed infrastructures.
Need for automated methods to detect anomalies from logs.
Log-Based Anomaly Detection
Definition
Anomaly: behavior that deviates from normal patterns.
Logs: time-ordered sequences of textual messages.
Log-based anomaly detection is used to identify abnormal behaviors (anomalies) that may signal potential system failures.
Typical Challenges
Heterogeneous log formats.
Noise and redundancy.
Manual inspection is tedious and error-prone.
Machine Learning Approaches
Supervised vs Unsupervised
Supervised: learns from labeled data (normal / anomaly).
→ Supervised learning uses labeled data to train models, with labels indicating whether data instances are normal or anomalous.
Unsupervised: learns patterns without labels.
→ Unsupervised learning handles data without labels and is appropriate for production settings (real-world scenarios) lacking annotated datasets.
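The contrast between the two settings can be sketched on a toy feature. This is a minimal illustration, not a method from the talk: the feature (error-level events per log sequence), the sample values, and the helper names `fit_supervised_threshold` and `unsupervised_flags` are all assumptions made for the example. The supervised path reads the labels to pick a cut-off; the unsupervised path only looks at how far each value lies from the bulk of the data.

```python
from statistics import mean, pstdev

# Hypothetical feature: number of error-level events per log sequence.
error_counts = [1, 2, 1, 2, 1, 2, 1, 9]
# 1 = anomaly. Only the supervised path is allowed to read these labels.
labels = [0, 0, 0, 0, 0, 0, 0, 1]


def fit_supervised_threshold(values, labels):
    """Pick the cut-off that best separates the labeled classes (a decision stump)."""
    candidates = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]

    def accuracy(t):
        return sum((v > t) == bool(y) for v, y in zip(values, labels)) / len(values)

    return max(midpoints, key=accuracy)


def unsupervised_flags(values, z_cutoff=2.0):
    """Flag values far from the mean in standard-deviation units; no labels used."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma > z_cutoff for v in values]


threshold = fit_supervised_threshold(error_counts, labels)
supervised_pred = [v > threshold for v in error_counts]
unsupervised_pred = unsupervised_flags(error_counts)

print("learned threshold:", threshold)
print("supervised:  ", supervised_pred)
print("unsupervised:", unsupervised_pred)
```

On this toy data both paths flag only the last sequence. The z-score cut-off of 2.0 is an arbitrary choice for the sketch and would need tuning on real logs.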
Detection Approach
• Log Parsing • Feature Extraction • Model Training • Anomaly Detection
Log Parsing: identifying constant patterns and extracting variable parameters.
  Input: unstructured log messages
  Output: structured templates (log events) and the extracted variable parameters
Feature Extraction:
  - Identifier-based partitioning
  - Converting log sequences into feature vectors (using TF-IDF or Word2Vec)
  Input: log events
  Output: numerical feature vectors
Anomaly Detection: identifying abnormal log sequences.
  Input: log feature vectors and an ML model
  Output: a prediction (i.e., anomaly or not) for each vector
Figure: The Approach Used for Anomaly Detection.
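The first two stages of the approach above can be sketched with the standard library only. This is a simplified illustration under stated assumptions: the sample lines are invented in the style of HDFS logs, the regular expressions stand in for a real log parser, and the helper names `parse` and `tfidf_vectors` are hypothetical. Parsing masks variable parameters to obtain templates, partitioning groups events by block identifier, and each group becomes a TF-IDF weighted event-count vector.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sample lines in the style of HDFS logs (not taken from the dataset).
RAW_LOGS = [
    "Receiving block blk_1 src: /10.0.0.1:5000 dest: /10.0.0.2:5001",
    "Received block blk_1 of size 1024 from /10.0.0.1",
    "Receiving block blk_2 src: /10.0.0.3:5000 dest: /10.0.0.4:5001",
    "Exception in receiveBlock for block blk_2",
]

BLOCK_ID = re.compile(r"blk_-?\d+")
# IPv4 addresses (optional leading slash and port) or standalone integers.
VARIABLE = re.compile(r"/?\d+(?:\.\d+){3}(?::\d+)?|\b\d+\b")


def parse(line):
    """Log parsing: keep the constant text, mask the variable parameters."""
    match = BLOCK_ID.search(line)
    block_id = match.group(0) if match else None
    template = VARIABLE.sub("<*>", BLOCK_ID.sub("<BLK>", line))
    return template, block_id


# Identifier-based partitioning: one event sequence per block ID.
sessions = defaultdict(list)
for line in RAW_LOGS:
    template, block_id = parse(line)
    sessions[block_id].append(template)

# Event vocabulary in order of first appearance.
vocabulary = list(dict.fromkeys(t for seq in sessions.values() for t in seq))


def tfidf_vectors(sessions, vocabulary):
    """Feature extraction: event counts weighted by inverse document frequency."""
    n = len(sessions)
    df = Counter(t for seq in sessions.values() for t in set(seq))
    idf = {t: math.log(n / df[t]) for t in vocabulary}
    return {
        block_id: [Counter(seq)[t] * idf[t] for t in vocabulary]
        for block_id, seq in sessions.items()
    }


vectors = tfidf_vectors(sessions, vocabulary)
for block_id, vector in vectors.items():
    print(block_id, [round(x, 3) for x in vector])
```

The "Receiving block" event occurs in every session, so its IDF weight is zero and the two vectors differ only in the events that distinguish them. Real parsers have to cope with far more variable formats than these two patterns.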
Common Models
Typical Models
Anomaly detection is applied to the log feature vectors generated in the Feature Extraction step.
Machine learning methods usually output one prediction per log sequence.
The goal is to identify anomalous log instances within the data.
Machine learning models that can be used for anomaly detection:
Supervised: SVM, Random Forest, Decision Tree, Logistic Regression
Unsupervised: K-Means, DBSCAN, Isolation Forest, PCA
Deep learning: Autoencoders, LSTM, Transformers
Example: Autoencoder
Learns to reconstruct normal sequences.
Low reconstruction error → normal
High reconstruction error → anomaly
Trained only on normal data to model expected behavior.
Effective for unsupervised anomaly detection in logs.
Figure: Autoencoder architecture with an input layer, output layer, and one hidden layer.
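The reconstruction-error rule can be sketched with NumPy. To keep the example short and deterministic, it swaps the gradient-trained network for a linear autoencoder with one hidden unit whose weights are obtained in closed form from an SVD (this is the same projection PCA computes, not a nonlinear autoencoder). The event-count vectors, the helper names, and the "mean plus three standard deviations" threshold are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical event-count vectors for normal sequences (rows = sequences).
X_normal = np.array([
    [2, 1, 0], [4, 2, 0], [6, 3, 1],
    [8, 4, 0], [4, 2, 1], [6, 3, 0],
], dtype=float)


def fit_linear_autoencoder(X, hidden_dim):
    """Fit on normal data only; rows of W span the top principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:hidden_dim]


def reconstruction_error(X, mean, W):
    """Encode to the hidden layer, decode back, return squared error per row."""
    code = (X - mean) @ W.T   # encoder
    X_hat = code @ W + mean   # decoder (tied weights)
    return ((X - X_hat) ** 2).sum(axis=1)


mean, W = fit_linear_autoencoder(X_normal, hidden_dim=1)
train_errors = reconstruction_error(X_normal, mean, W)
# Assumed rule: errors well above those seen on normal data count as anomalies.
threshold = train_errors.mean() + 3 * train_errors.std()

# First row follows the normal pattern, second row does not.
X_new = np.array([[10, 5, 0], [1, 0, 9]], dtype=float)
new_errors = reconstruction_error(X_new, mean, W)
is_anomaly = new_errors > threshold

print("threshold:", round(float(threshold), 3))
print("errors:   ", np.round(new_errors, 3))
print("anomaly:  ", is_anomaly.tolist())
```

The first new sequence is reconstructed almost perfectly and stays below the threshold, while the second has a large error and is flagged. A deep autoencoder would replace the two matrix products with learned nonlinear layers, but the thresholding step stays the same.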
Case Study
HDFS Dataset (Hadoop)
The dataset contains 11,175,629 log messages.
These messages are grouped into 575,062 log sequences using Block ID.
Anomaly labels are based on manual annotations by Hadoop experts.
Around 2.9% of the sequences are labeled as anomalous.
Popular benchmark for comparison.
Dataset        # Log messages  Grouping strategy          Train ratio  # Sequences       # Anomalies (SS)  # Anomalies (US)  # Events
                                                                       train    test     train    test     train    test
Reduced HDFS   104,815         Session Window (block_id)  0.5          3,970    3,970    118      195      156      157      14
Complete HDFS  11,175,629      Session Window (block_id)  0.9          517,554  57,507   16,044   794      15,154   1,684    48
Table: HDFS dataset.
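As a quick consistency check, the figures in the Complete HDFS row can be recombined to recover the anomaly rate and the train ratio. The snippet only does arithmetic on numbers copied from the table; the variable names are illustrative. Note that the train and test sequence counts sum to 575,061, one fewer than the 575,062 sequences quoted on the slide.

```python
# Counts copied from the "Complete HDFS" row of the table.
train_sequences, test_sequences = 517_554, 57_507
ss_train_anomalies, ss_test_anomalies = 16_044, 794
us_train_anomalies, us_test_anomalies = 15_154, 1_684

total_sequences = train_sequences + test_sequences
total_anomalies = ss_train_anomalies + ss_test_anomalies
# Both splits distribute the same set of anomalous sequences.
same_total = total_anomalies == us_train_anomalies + us_test_anomalies

anomaly_ratio = total_anomalies / total_sequences
train_ratio = train_sequences / total_sequences

print(f"{total_sequences:,} sequences, {total_anomalies:,} anomalous "
      f"({anomaly_ratio:.1%}); train ratio {train_ratio:.2f}")
# prints: 575,061 sequences, 16,838 anomalous (2.9%); train ratio 0.90
```

With roughly 2.9% anomalies, a detector that always answers "normal" would already reach about 97% accuracy, so accuracy alone says little on this dataset.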
Example Results (LogAnomaly)
Precision, Recall, and F1-Score are used to evaluate anomaly detection models.
Precision: 96% (percentage of true anomalies among all predicted anomalies)
Recall: 94% (percentage of actual anomalies correctly identified)
F1-Score: 95% (harmonic mean of Precision and Recall)
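\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]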
Results reported in: LogAnomaly: Unsupervised Detection of Sequential and
Quantitative Anomalies in Unstructured Logs, IJCAI, 2019.
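Equations (1) to (3) translate directly into a few lines of Python. The confusion counts below are hypothetical, chosen only so that the results land near the reported percentages; they are not taken from the LogAnomaly paper. The function name `precision_recall_f1` is likewise an assumption for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1-Score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, NOT from the paper: 94 anomalies caught,
# 4 false alarms, 6 anomalies missed.
tp, fp, fn = 94, 4, 6
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.96 recall=0.94 f1=0.95
```

True negatives do not appear in any of the three formulas, which keeps the scores meaningful when normal sequences vastly outnumber anomalous ones.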
Conclusion and Future Work
Conclusion
Logs are one of the most valuable data sources for system operation.
ML improves log anomaly detection significantly.
Preprocessing and parsing are crucial.
Need for robust and interpretable models.
Future Directions
Leverage large language models (e.g., Transformer-based models).
Hybrid approaches (deep learning + clustering).
Real-time anomaly detection.
Thank You!
Questions?
