In this talk, Senior Consultant Maartens Lourens introduces machine learning in a pragmatic way, aiming to make the basic machine learning process easy to understand by using only the essential tools and libraries.
31. Complex Adaptive System
“Entity consisting of many diverse and autonomous
components or parts (called agents) which are interrelated,
interdependent, linked through many (dense)
interconnections, and behave as a unified whole in learning
from experience and in adjusting (not just reacting) to
changes in the environment”
- BusinessDictionary.com
49. Train and Test Data directories
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_data_dir', type=str,
                    default='data/train/laptop',
                    help='data directory containing training logs')
parser.add_argument('--test_data_dir', type=str,
                    default='data/test/laptop',
                    help='data directory containing test logs')
args = parser.parse_args()
50. Two log dicts, each with a list for data and type.
Our corresponding wrapper function calls:
train_log_collection = create_log_dict(args.train_data_dir)
test_log_collection = create_log_dict(args.test_data_dir)
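The shape of such a dict can be sketched with hypothetical values (the actual lines and labels depend on the log files found in the directory):

```python
# Hypothetical log collection dict, as returned by create_log_dict:
# 'data' holds one entry per log line, 'type' holds the numeric label
# of the source log file (its index in the glob result).
log_collection = {
    'data': ["wifi interface up", "wifi interface down", "disk almost full"],
    'type': [0, 0, 1],
}

# The two lists stay aligned: line i belongs to log file type[i].
assert len(log_collection['data']) == len(log_collection['type'])
```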
52. import glob

def create_log_dict(logfile_path):
    log_collection = {}
    logfiles = glob.glob(logfile_path + "/*.log")  # Get list of log files
    for logfile in logfiles:
        with open(logfile, "r") as file_handle:
            filedata_array = file_handle.read().split('\n')
        # Remove empty lines
        filedata_array = [line for line in filedata_array if len(line) > 0]
        # Numerise the log type: one label (the file's index) per line
        temp_types = [logfiles.index(logfile)] * len(filedata_array)
        # Add log file data and type
        if 'data' in log_collection:
            log_collection['data'] = log_collection['data'] + filedata_array
            log_collection['type'] = log_collection['type'] + temp_types
        else:
            # Cater for first-time iteration
            log_collection['data'] = filedata_array
            log_collection['type'] = temp_types
    return log_collection
53. Text to Numeric:
- CountVectorizer()
- TfidfTransformer()
Chain them together with Pipeline()
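A minimal sketch of that chain on a few made-up log lines (the line text is illustrative only):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

lines = [
    "kernel: wifi interface up",
    "kernel: wifi interface down",
    "sshd: accepted password for user",
]

# CountVectorizer turns text into token counts; TfidfTransformer
# re-weights those counts by inverse document frequency.
text_to_numeric = Pipeline([('vect', CountVectorizer()),
                            ('tfidf', TfidfTransformer())])
matrix = text_to_numeric.fit_transform(lines)

# One sparse row per input line, one column per vocabulary token.
print(matrix.shape)
```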
58. model = train(algorithm, feature_data, target_data)
Which becomes:
from sklearn import naive_bayes
model = train(naive_bayes.MultinomialNB(),
              train_log_collection['data'],
              train_log_collection['type'])
60. from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def train(algorithm, training_feature_data, training_target_data):
    model = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', algorithm)])
    model.fit(training_feature_data, training_target_data)
    return model
65. print(test_log_collection['data'][321] + "\n" +
          glob.glob(args.train_data_dir + "/*.log")[
              model.predict(test_log_collection['data'][321:322])[0].astype(int)])
The prediction:
Oct 20 20:20:15 CCFile::captureLogRun Skipping current file Dir file
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt, Current File
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt
data/train/laptop/corecaptured.log
66. import numpy as np
print(np.mean(model.predict(test_log_collection['data']) ==
test_log_collection['type']))
Accuracy:
0.986352045397
98.64% = Not bad!
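The np.mean comparison above is the standard accuracy computation; for reference, it matches sklearn's accuracy_score, shown here on hypothetical predictions and labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

predicted = np.array([0, 1, 1, 0])
actual = np.array([0, 1, 0, 0])

# Fraction of predictions that match the true labels: 3 of 4 here.
assert np.mean(predicted == actual) == accuracy_score(actual, predicted)
print(accuracy_score(actual, predicted))
```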