In this talk, Senior Consultant Maartens Lourens introduces machine learning in a pragmatic way, aiming to make the basic machine learning process easy to understand by using only the essential tools and libraries.
31. Complex Adaptive System
“Entity consisting of many diverse and autonomous
components or parts (called agents) which are interrelated,
interdependent, linked through many (dense)
interconnections, and behave as a unified whole in learning
from experience and in adjusting (not just reacting) to
changes in the environment”
- BusinessDictionary.com
49. Train and Test Data directories
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_data_dir', type=str,
                    default='data/train/laptop',
                    help='data directory containing training logs')
parser.add_argument('--test_data_dir', type=str,
                    default='data/test/laptop',
                    help='data directory containing test logs')
args = parser.parse_args()
50. Two log dicts, each with a list for data and type.
Our corresponding wrapper function calls:
train_log_collection = create_log_dict(args.train_data_dir)
test_log_collection = create_log_dict(args.test_data_dir)
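The shape of such a dict can be sketched with hypothetical values (the actual lines and labels depend on the log files found in the directory):

```python
# Hypothetical log collection dict, as returned by create_log_dict:
# 'data' holds one entry per log line, 'type' holds the numeric label
# of the source log file (its index in the glob result).
log_collection = {
    'data': ["wifi interface up", "wifi interface down", "disk almost full"],
    'type': [0, 0, 1],
}

# The two lists stay aligned: line i belongs to log file type[i].
assert len(log_collection['data']) == len(log_collection['type'])
```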
52. import glob

def create_log_dict(logfile_path):
    log_collection = {}
    logfiles = glob.glob(logfile_path + "/*.log")  # Get list of log files
    for logfile in logfiles:
        with open(logfile, "r") as file_handle:
            filedata_array = file_handle.read().split('\n')
        # Remove empty lines
        filedata_array = [line for line in filedata_array if len(line) > 0]
        # Numerise the log type: one label (the file's index) per line
        temp_types = [logfiles.index(logfile)] * len(filedata_array)
        # Add log file data and type
        if 'data' in log_collection:
            log_collection['data'] = log_collection['data'] + filedata_array
            log_collection['type'] = log_collection['type'] + temp_types
        else:
            # Cater for first-time iteration
            log_collection['data'] = filedata_array
            log_collection['type'] = temp_types
    return log_collection
53. Text to Numeric:
- CountVectorizer()
- TfidfTransformer()
Chain them together with Pipeline()
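A minimal sketch of that chain on a few made-up log lines (the line text is illustrative only):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

lines = [
    "kernel: wifi interface up",
    "kernel: wifi interface down",
    "sshd: accepted password for user",
]

# CountVectorizer turns text into token counts; TfidfTransformer
# re-weights those counts by inverse document frequency.
text_to_numeric = Pipeline([('vect', CountVectorizer()),
                            ('tfidf', TfidfTransformer())])
matrix = text_to_numeric.fit_transform(lines)

# One sparse row per input line, one column per vocabulary token.
print(matrix.shape)
```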
58. model = train(algorithm, feature_data, target_data)
Which becomes:
from sklearn import naive_bayes
model = train(naive_bayes.MultinomialNB(),
              train_log_collection['data'],
              train_log_collection['type'])
60. from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def train(algorithm, training_feature_data, training_target_data):
    model = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', algorithm)])
    model.fit(training_feature_data, training_target_data)
    return model
65. print(test_log_collection['data'][321] + "\n" +
          glob.glob(args.train_data_dir + "/*.log")[
              model.predict(test_log_collection['data'][321:322])[0].astype(int)])
The prediction:
Oct 20 20:20:15 CCFile::captureLogRun Skipping current file Dir file
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt, Current File
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt
data/train/laptop/corecaptured.log
66. import numpy as np
print(np.mean(model.predict(test_log_collection['data']) ==
test_log_collection['type']))
Accuracy:
0.986352045397
98.64% = Not bad!
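The np.mean comparison above is the standard accuracy computation; for reference, it matches sklearn's accuracy_score, shown here on hypothetical predictions and labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

predicted = np.array([0, 1, 1, 0])
actual = np.array([0, 1, 0, 0])

# Fraction of predictions that match the true labels: 3 of 4 here.
assert np.mean(predicted == actual) == accuracy_score(actual, predicted)
print(accuracy_score(actual, predicted))
```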