SlideShare une entreprise Scribd logo
1  sur  72
Télécharger pour lire hors ligne
●
●
●
●
●
●
Business Feedback
Loop
Image copyright: 20th Century Fox
Complex Adaptive System
“Entity consisting of many diverse and autonomous
components or parts (called agents) which are interrelated,
interdependent, linked through many (dense)
interconnections, and behave as a unified whole in learning
from experience and in adjusting (not just reacting) to
changes in the environment”
- BusinessDictionary.com
●
●
●
Business AI
Image copyright: Manga Entertainment
…
Training Process
Prediction Process
Train Test
mkdir -p data/{train,test}/laptop
find /var/log -type f -size +10k -name "*.log" 2>/dev/null | while read log
do
rows=$(wc -l "$log" | awk '{ print $1 }')
head -$(($rows - ($rows / 10))) "$log" > data/train/laptop/"${log##*/}"
tail -$(($rows / 10)) "$log" > data/test/laptop/"${log##*/}"
done
$ wc -l data/train/laptop/*.log
15033 data/train/laptop/corecaptured.log
28257 data/train/laptop/debuglog.log
607 data/train/laptop/displaypolicyd.log
258 data/train/laptop/displaypolicyd.stdout.log
4401 data/train/laptop/fsck_hfs.log
614 data/train/laptop/failfast.log
44 data/train/laptop/httpd.log
129561 data/train/laptop/install.log
13671 data/train/laptop/system.log
55905 data/train/laptop/trac.log
3065 data/train/laptop/wifi.log
251416 total
With Python 2.7 and Python pip in place:
pip install numpy sklearn
Boilerplate for input parameters
Train and Test Data directories
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--train_data_dir', type=str,
default='data/train/laptop',
help='data directory containing training logs')
parser.add_argument('--test_data_dir', type=str, default='data/test/laptop',
help='data directory containing training logs')
args = parser.parse_args()
Two log dicts, each with a list for data and type.
Our corresponding wrapper function calls:
train_log_collection = create_log_dict(args.train_data_dir)
test_log_collection = create_log_dict(args.test_data_dir)
create_log_dict
import glob
def create_log_dict(logfile_path):
log_collection = {}
logfiles = glob.glob(logfile_path + "/*.log") # Get list of log files
for logfile in logfiles:
file_handle = open(logfile, "r")
filedata_array = file_handle.read().split('n')
file_handle.close()
# Remove empty lines
for line in filedata_array:
if len(line) == 0:
del filedata_array[filedata_array.index(line)]
# Add log file data and type
if log_collection.has_key('data'):
log_collection['data'] = log_collection['data'] + filedata_array
# numerise log type for each line
temp_types = [logfiles.index(logfile)] * len(filedata_array)
log_collection['type'] = log_collection['type'] + temp_types # Add log type array
# Cater for first time iteration
else:
log_collection['data'] = filedata_array
temp_types = [logfiles.index(logfile)] * len(filedata_array)
log_collection['type'] = temp_types
return log_collection
create_log_dict
Text to Numeric:
- CountVectorizer()
- TfidfTransformer()
Chain them together with Pipeline()
train
model = train(algorithm, feature_data, target_data)
model = train(algorithm, feature_data, target_data)
Which becomes:
from sklearn import naive_bayes
model = train(naive_bayes.MultinomialNB(),
train_log_collection['data'],
train_log_collection['type'])
train
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
def train(algorithm, training_feature_data, training_target_data):
model = Pipeline([('vect', CountVectorizer()), ('tfidf',
TfidfTransformer()), ('clf', algorithm)])
model.fit(training_feature_data,training_target_data)
return model
print(test_log_collection['data'][321] + "n" +
glob.glob(args.train_data_dir +
"/*.log")[model.predict( test_log_collection['data'][321:322])[0].astype(in
t)])
The prediction:
Oct 20 20:20:15 CCFile::captureLogRun Skipping current file Dir file
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt, Current File
[2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt
data/train/laptop/ corecaptured.log
import numpy as np
print(np.mean(model.predict(test_log_collection['data']) ==
test_log_collection['type']))
Accuracy:
0.986352045397
98.64% = Not bad!
def predict(model, new_docs):
predicted = model.predict(new_docs['data'])
accuracy = np.mean(predicted == new_docs['type'])
return accuracy
def report(clf_type,accuracy):
print("033[1m" + clf_type + "033[0m033[92m")
print("Accuracy: " + str(round(accuracy * 100,2)) + "%n")
Print
algorithms = [ naive_bayes.MultinomialNB() ]
for algorithm in algorithms:
model = train(algorithm, log_collection['data'], log_collection['type'])
accuracy = predict(model,test_log_collection)
report((str(algorithm).split('(')[0]),accuracy)
algorithms = [
linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
random_state=42, max_iter=5, tol=None),
naive_bayes.MultinomialNB(),
tree.DecisionTreeClassifier(max_depth=1000),
ensemble.ExtraTreesClassifier(),
svm.LinearSVC(),
]
Training log collection => 250587 data entries
Testing log collection => 27843 data entries
SGDClassifier
Accuracy: 97.38%
MultinomialNB
Accuracy: 98.64%
DecisionTreeClassifier
Accuracy: 95.33%
ExtraTreesClassifier
Accuracy: 99.15%
LinearSVC
Accuracy: 99.17%
Log File Classification Using Machine Learning
Log File Classification Using Machine Learning

Contenu connexe

Tendances

Two dimensional array
Two dimensional arrayTwo dimensional array
Two dimensional arrayRajendran
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regressionAkhilesh Joshi
 
Dev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and AlgorithmsDev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and AlgorithmsSvetlin Nakov
 
multiple linear regression
multiple linear regressionmultiple linear regression
multiple linear regressionAkhilesh Joshi
 
simple linear regression
simple linear regressionsimple linear regression
simple linear regressionAkhilesh Joshi
 
Reasoning about laziness
Reasoning about lazinessReasoning about laziness
Reasoning about lazinessJohan Tibell
 
polynomial linear regression
polynomial linear regressionpolynomial linear regression
polynomial linear regressionAkhilesh Joshi
 
Munihac 2018 - Beautiful Template Haskell
Munihac 2018 - Beautiful Template HaskellMunihac 2018 - Beautiful Template Haskell
Munihac 2018 - Beautiful Template HaskellMatthew Pickering
 
Multi dimensional array
Multi dimensional arrayMulti dimensional array
Multi dimensional arrayRajendran
 
Advanced Tagless Final - Saying Farewell to Free
Advanced Tagless Final - Saying Farewell to FreeAdvanced Tagless Final - Saying Farewell to Free
Advanced Tagless Final - Saying Farewell to FreeLuka Jacobowitz
 
Principled Error Handling with FP
Principled Error Handling with FPPrincipled Error Handling with FP
Principled Error Handling with FPLuka Jacobowitz
 
Functional Programming in Swift
Functional Programming in SwiftFunctional Programming in Swift
Functional Programming in SwiftSaugat Gautam
 
Building l10n Payroll Structures from the Ground up
Building l10n Payroll Structures from the Ground upBuilding l10n Payroll Structures from the Ground up
Building l10n Payroll Structures from the Ground upOdoo
 
Odoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new apiOdoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new apiOdoo
 
support vector regression
support vector regressionsupport vector regression
support vector regressionAkhilesh Joshi
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and RAkhilesh Joshi
 

Tendances (20)

Two dimensional array
Two dimensional arrayTwo dimensional array
Two dimensional array
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
Lec2+lec3
Lec2+lec3Lec2+lec3
Lec2+lec3
 
Dev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and AlgorithmsDev Concepts: Data Structures and Algorithms
Dev Concepts: Data Structures and Algorithms
 
multiple linear regression
multiple linear regressionmultiple linear regression
multiple linear regression
 
simple linear regression
simple linear regressionsimple linear regression
simple linear regression
 
Reasoning about laziness
Reasoning about lazinessReasoning about laziness
Reasoning about laziness
 
polynomial linear regression
polynomial linear regressionpolynomial linear regression
polynomial linear regression
 
Munihac 2018 - Beautiful Template Haskell
Munihac 2018 - Beautiful Template HaskellMunihac 2018 - Beautiful Template Haskell
Munihac 2018 - Beautiful Template Haskell
 
Multi dimensional array
Multi dimensional arrayMulti dimensional array
Multi dimensional array
 
Advanced Tagless Final - Saying Farewell to Free
Advanced Tagless Final - Saying Farewell to FreeAdvanced Tagless Final - Saying Farewell to Free
Advanced Tagless Final - Saying Farewell to Free
 
Principled Error Handling with FP
Principled Error Handling with FPPrincipled Error Handling with FP
Principled Error Handling with FP
 
knn classification
knn classificationknn classification
knn classification
 
Functional Programming in Swift
Functional Programming in SwiftFunctional Programming in Swift
Functional Programming in Swift
 
Xiicsmonth
XiicsmonthXiicsmonth
Xiicsmonth
 
R basics
R basicsR basics
R basics
 
Building l10n Payroll Structures from the Ground up
Building l10n Payroll Structures from the Ground upBuilding l10n Payroll Structures from the Ground up
Building l10n Payroll Structures from the Ground up
 
Odoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new apiOdoo - From v7 to v8: the new api
Odoo - From v7 to v8: the new api
 
support vector regression
support vector regressionsupport vector regression
support vector regression
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
 

Similaire à Log File Classification Using Machine Learning

R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
GHC Participant Training
GHC Participant TrainingGHC Participant Training
GHC Participant TrainingAidIQ
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Where's My SQL? Designing Databases with ActiveRecord Migrations
Where's My SQL? Designing Databases with ActiveRecord MigrationsWhere's My SQL? Designing Databases with ActiveRecord Migrations
Where's My SQL? Designing Databases with ActiveRecord MigrationsEleanor McHugh
 
Creating Domain Specific Languages in Python
Creating Domain Specific Languages in PythonCreating Domain Specific Languages in Python
Creating Domain Specific Languages in PythonSiddhi
 
TensorFlow.Data 및 TensorFlow Hub
TensorFlow.Data 및 TensorFlow HubTensorFlow.Data 및 TensorFlow Hub
TensorFlow.Data 및 TensorFlow HubJeongkyu Shin
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3San Kim
 
Hacking ansible
Hacking ansibleHacking ansible
Hacking ansiblebcoca
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.pptssuserd64918
 
Using Scala Slick at FortyTwo
Using Scala Slick at FortyTwoUsing Scala Slick at FortyTwo
Using Scala Slick at FortyTwoEishay Smith
 
PyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialPyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialjbellis
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisUniversity of Illinois,Chicago
 
Swift Sequences & Collections
Swift Sequences & CollectionsSwift Sequences & Collections
Swift Sequences & CollectionsCocoaHeads France
 
Stat Design3 18 09
Stat Design3 18 09Stat Design3 18 09
Stat Design3 18 09stat
 

Similaire à Log File Classification Using Machine Learning (20)

Practical cats
Practical catsPractical cats
Practical cats
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
GHC Participant Training
GHC Participant TrainingGHC Participant Training
GHC Participant Training
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Where's My SQL? Designing Databases with ActiveRecord Migrations
Where's My SQL? Designing Databases with ActiveRecord MigrationsWhere's My SQL? Designing Databases with ActiveRecord Migrations
Where's My SQL? Designing Databases with ActiveRecord Migrations
 
Creating Domain Specific Languages in Python
Creating Domain Specific Languages in PythonCreating Domain Specific Languages in Python
Creating Domain Specific Languages in Python
 
TensorFlow.Data 및 TensorFlow Hub
TensorFlow.Data 및 TensorFlow HubTensorFlow.Data 및 TensorFlow Hub
TensorFlow.Data 및 TensorFlow Hub
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Slickdemo
SlickdemoSlickdemo
Slickdemo
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3
 
Hacking ansible
Hacking ansibleHacking ansible
Hacking ansible
 
Monads in Swift
Monads in SwiftMonads in Swift
Monads in Swift
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Using Scala Slick at FortyTwo
Using Scala Slick at FortyTwoUsing Scala Slick at FortyTwo
Using Scala Slick at FortyTwo
 
PyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialPyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorial
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Pumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency AnalysisPumps, Compressors and Turbine Fault Frequency Analysis
Pumps, Compressors and Turbine Fault Frequency Analysis
 
Swift Sequences & Collections
Swift Sequences & CollectionsSwift Sequences & Collections
Swift Sequences & Collections
 
Stat Design3 18 09
Stat Design3 18 09Stat Design3 18 09
Stat Design3 18 09
 
Unit 3
Unit 3Unit 3
Unit 3
 

Plus de OpenCredo

Webinar - Design Thinking for Platform Engineering
Webinar - Design Thinking for Platform EngineeringWebinar - Design Thinking for Platform Engineering
Webinar - Design Thinking for Platform EngineeringOpenCredo
 
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...OpenCredo
 
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...OpenCredo
 
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...OpenCredo
 
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki Watt
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki WattJourneys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki Watt
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki WattOpenCredo
 
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto Fernandez
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto FernandezKafka Summit 2018: A Journey Building Kafka Connectors - Pegerto Fernandez
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto FernandezOpenCredo
 
MuCon 2017: A not So(A) Trivial Question by Tareq Abedrabbo
MuCon 2017: A not So(A) Trivial Question by Tareq AbedrabboMuCon 2017: A not So(A) Trivial Question by Tareq Abedrabbo
MuCon 2017: A not So(A) Trivial Question by Tareq AbedrabboOpenCredo
 
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps By Antoni...
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps  By Antoni...DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps  By Antoni...
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps By Antoni...OpenCredo
 
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...OpenCredo
 
Succeeding with DevOps Transformation - Rafal Gancarz
Succeeding with DevOps Transformation - Rafal GancarzSucceeding with DevOps Transformation - Rafal Gancarz
Succeeding with DevOps Transformation - Rafal GancarzOpenCredo
 
Progscon 2017: Serverless Architectures - Rafal Gancarz
Progscon 2017: Serverless Architectures - Rafal GancarzProgscon 2017: Serverless Architectures - Rafal Gancarz
Progscon 2017: Serverless Architectures - Rafal GancarzOpenCredo
 
QCON London 2017 - Monitoring Serverless Architectures by Rafal Gancarz
QCON London 2017 - Monitoring Serverless Architectures by Rafal GancarzQCON London 2017 - Monitoring Serverless Architectures by Rafal Gancarz
QCON London 2017 - Monitoring Serverless Architectures by Rafal GancarzOpenCredo
 
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...OpenCredo
 
London Hashicorp Meetup #8 - Testing Programmable Infrastructure By Matt Long
London Hashicorp Meetup #8 -  Testing Programmable Infrastructure By Matt LongLondon Hashicorp Meetup #8 -  Testing Programmable Infrastructure By Matt Long
London Hashicorp Meetup #8 - Testing Programmable Infrastructure By Matt LongOpenCredo
 
ServerlessConf: Serverless for the Enterprise - Rafal Gancarz
ServerlessConf: Serverless for the Enterprise - Rafal GancarzServerlessConf: Serverless for the Enterprise - Rafal Gancarz
ServerlessConf: Serverless for the Enterprise - Rafal GancarzOpenCredo
 
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...OpenCredo
 
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...OpenCredo
 
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant OpenCredo
 
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant OpenCredo
 
A Visual Introduction to Event Sourcing and CQRS by Lorenzo Nicora
A Visual Introduction to Event Sourcing and CQRS by Lorenzo NicoraA Visual Introduction to Event Sourcing and CQRS by Lorenzo Nicora
A Visual Introduction to Event Sourcing and CQRS by Lorenzo NicoraOpenCredo
 

Plus de OpenCredo (20)

Webinar - Design Thinking for Platform Engineering
Webinar - Design Thinking for Platform EngineeringWebinar - Design Thinking for Platform Engineering
Webinar - Design Thinking for Platform Engineering
 
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
MuCon 2019: Exploring Your Microservices Architecture Through Network Science...
 
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...
Goto Chicago; Journeys To Cloud Native Architecture: Sun, Sea And Emergencies...
 
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...
Mucon 2018: Heuristics for Identifying Microservice Boundaries By Erich Eichi...
 
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki Watt
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki WattJourneys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki Watt
Journeys To Cloud Native Architecture: Sun, Sea And Emergencies - Nicki Watt
 
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto Fernandez
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto FernandezKafka Summit 2018: A Journey Building Kafka Connectors - Pegerto Fernandez
Kafka Summit 2018: A Journey Building Kafka Connectors - Pegerto Fernandez
 
MuCon 2017: A not So(A) Trivial Question by Tareq Abedrabbo
MuCon 2017: A not So(A) Trivial Question by Tareq AbedrabboMuCon 2017: A not So(A) Trivial Question by Tareq Abedrabbo
MuCon 2017: A not So(A) Trivial Question by Tareq Abedrabbo
 
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps By Antoni...
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps  By Antoni...DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps  By Antoni...
DevOpsCon Berlin 2017: Project Management from Stone Age to DevOps By Antoni...
 
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
 
Succeeding with DevOps Transformation - Rafal Gancarz
Succeeding with DevOps Transformation - Rafal GancarzSucceeding with DevOps Transformation - Rafal Gancarz
Succeeding with DevOps Transformation - Rafal Gancarz
 
Progscon 2017: Serverless Architectures - Rafal Gancarz
Progscon 2017: Serverless Architectures - Rafal GancarzProgscon 2017: Serverless Architectures - Rafal Gancarz
Progscon 2017: Serverless Architectures - Rafal Gancarz
 
QCON London 2017 - Monitoring Serverless Architectures by Rafal Gancarz
QCON London 2017 - Monitoring Serverless Architectures by Rafal GancarzQCON London 2017 - Monitoring Serverless Architectures by Rafal Gancarz
QCON London 2017 - Monitoring Serverless Architectures by Rafal Gancarz
 
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...
Voxxed Bristol 2017 - From C to Q, one event at a time: Event Sourcing illust...
 
London Hashicorp Meetup #8 - Testing Programmable Infrastructure By Matt Long
London Hashicorp Meetup #8 -  Testing Programmable Infrastructure By Matt LongLondon Hashicorp Meetup #8 -  Testing Programmable Infrastructure By Matt Long
London Hashicorp Meetup #8 - Testing Programmable Infrastructure By Matt Long
 
ServerlessConf: Serverless for the Enterprise - Rafal Gancarz
ServerlessConf: Serverless for the Enterprise - Rafal GancarzServerlessConf: Serverless for the Enterprise - Rafal Gancarz
ServerlessConf: Serverless for the Enterprise - Rafal Gancarz
 
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...
O'Reilly 2016: "Continuous Delivery with Containers: The Trials and Tribulati...
 
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...
Haufe #msaday - The Actor model: an alternative approach to concurrency By Lo...
 
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant
Haufe #msaday - Seven More Deadly Sins of Microservices by Daniel Bryant
 
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant
Haufe #msaday - Building a Microservice Ecosystem by Daniel Bryant
 
A Visual Introduction to Event Sourcing and CQRS by Lorenzo Nicora
A Visual Introduction to Event Sourcing and CQRS by Lorenzo NicoraA Visual Introduction to Event Sourcing and CQRS by Lorenzo Nicora
A Visual Introduction to Event Sourcing and CQRS by Lorenzo Nicora
 

Dernier

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Dernier (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Log File Classification Using Machine Learning

  • 1.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 23.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30. Image copyright: 20th Century Fox
  • 31. Complex Adaptive System “Entity consisting of many diverse and autonomous components or parts (called agents) which are interrelated, interdependent, linked through many (dense) interconnections, and behave as a unified whole in learning from experience and in adjusting (not just reacting) to changes in the environment” - BusinessDictionary.com
  • 33. Business AI Image copyright: Manga Entertainment
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 40.
  • 41.
  • 42.
  • 44.
  • 45. mkdir -p data/{train,test}/laptop find /var/log -type f -size +10k -name "*.log" 2>/dev/null | while read log do rows=$(wc -l "$log" | awk '{ print $1 }') head -$(($rows - ($rows / 10))) "$log" > data/train/laptop/"${log##*/}" tail -$(($rows / 10)) "$log" > data/test/laptop/"${log##*/}" done
  • 46. $ wc -l data/train/laptop/*.log 15033 data/train/laptop/corecaptured.log 28257 data/train/laptop/debuglog.log 607 data/train/laptop/displaypolicyd.log 258 data/train/laptop/displaypolicyd.stdout.log 4401 data/train/laptop/fsck_hfs.log 614 data/train/laptop/failfast.log 44 data/train/laptop/httpd.log 129561 data/train/laptop/install.log 13671 data/train/laptop/system.log 55905 data/train/laptop/trac.log 3065 data/train/laptop/wifi.log 251416 total
  • 47. With Python 2.7 and Python pip in place: pip install numpy sklearn
  • 48. Boilerplate for input parameters
  • 49. Train and Test Data directories import argparse parser = argparse.ArgumentParser() parser.add_argument('--train_data_dir', type=str, default='data/train/laptop', help='data directory containing training logs') parser.add_argument('--test_data_dir', type=str, default='data/test/laptop', help='data directory containing training logs') args = parser.parse_args()
  • 50. Two log dicts, each with a list for data and type. Our corresponding wrapper function calls: train_log_collection = create_log_dict(args.train_data_dir) test_log_collection = create_log_dict(args.test_data_dir)
  • 52. import glob def create_log_dict(logfile_path): log_collection = {} logfiles = glob.glob(logfile_path + "/*.log") # Get list of log files for logfile in logfiles: file_handle = open(logfile, "r") filedata_array = file_handle.read().split('n') file_handle.close() # Remove empty lines for line in filedata_array: if len(line) == 0: del filedata_array[filedata_array.index(line)] # Add log file data and type if log_collection.has_key('data'): log_collection['data'] = log_collection['data'] + filedata_array # numerise log type for each line temp_types = [logfiles.index(logfile)] * len(filedata_array) log_collection['type'] = log_collection['type'] + temp_types # Add log type array # Cater for first time iteration else: log_collection['data'] = filedata_array temp_types = [logfiles.index(logfile)] * len(filedata_array) log_collection['type'] = temp_types return log_collection create_log_dict
  • 53. Text to Numeric: - CountVectorizer() - TfidfTransformer() Chain them together with Pipeline()
  • 54.
  • 55.
  • 56. train
  • 57. model = train(algorithm, feature_data, target_data)
  • 58. model = train(algorithm, feature_data, target_data) Which becomes: from sklearn import naive_bayes model = train(naive_bayes.MultinomialNB(), train_log_collection['data'], train_log_collection['type'])
  • 59. train
  • 60. from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer def train(algorithm, training_feature_data, training_target_data): model = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', algorithm)]) model.fit(training_feature_data,training_target_data) return model
  • 61.
  • 62.
  • 63.
  • 64.
  • 65. print(test_log_collection['data'][321] + "n" + glob.glob(args.train_data_dir + "/*.log")[model.predict( test_log_collection['data'][321:322])[0].astype(in t)]) The prediction: Oct 20 20:20:15 CCFile::captureLogRun Skipping current file Dir file [2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt, Current File [2017-10-20_20,20,15.846930]-AirPortBrcm4360_Logs-004.txt data/train/laptop/ corecaptured.log
  • 66. import numpy as np print(np.mean(model.predict(test_log_collection['data']) == test_log_collection['type'])) Accuracy: 0.986352045397 98.64% = Not bad!
  • 67.
  • 68. def predict(model, new_docs): predicted = model.predict(new_docs['data']) accuracy = np.mean(predicted == new_docs['type']) return accuracy def report(clf_type,accuracy): print("033[1m" + clf_type + "033[0m033[92m") print("Accuracy: " + str(round(accuracy * 100,2)) + "%n") Print algorithms = [ naive_bayes.MultinomialNB() ] for algorithm in algorithms: model = train(algorithm, log_collection['data'], log_collection['type']) accuracy = predict(model,test_log_collection) report((str(algorithm).split('(')[0]),accuracy)
  • 69. algorithms = [ linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None), naive_bayes.MultinomialNB(), tree.DecisionTreeClassifier(max_depth=1000), ensemble.ExtraTreesClassifier(), svm.LinearSVC(), ]
  • 70. Training log collection => 250587 data entries Testing log collection => 27843 data entries SGDClassifier Accuracy: 97.38% MultinomialNB Accuracy: 98.64% DecisionTreeClassifier Accuracy: 95.33% ExtraTreesClassifier Accuracy: 99.15% LinearSVC Accuracy: 99.17%