FLORIDA INSTITUTE OF TECHNOLOGY
USING THE DATA MINING CLASSIFIER SELECTION MODEL
TO IMPROVE NETWORK INTRUSION DETECTION
DATA MINING FOR CYBERSECURITY
FINAL PROJECT
SUBMITTED TO:
DR. LIAM MAYRON
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
CYB 5675: DATA MINING
BY
CRAIG CANNON
MELBOURNE, FLORIDA
JUNE 15, 2014
USING THE DATA MINING CLASSIFIER SELECTION MODEL
TO IMPROVE NETWORK INTRUSION DETECTION
INTRODUCTION:
These days there are massive amounts of data being collected and stored in databases all
over the world. This trend has been occurring now for decades. Databases are reaching sizes in
the Terabyte range of data and could be rapidly approaching the Petabytes range in a few years.
Inside these enormous data warehouses lies invaluable information and knowledge that
can only be extracted using creative data mining algorithms. These algorithms allow us to
extract so-called nuggets of knowledge from large sets of data; in essence, they turn data into
information. Although these algorithms were created for this purpose and have performed it
quite well, over the years they have proven even more useful in the area of network intrusion
detection.
In the last few decades cyber attacks have been on the rise worldwide. Security breaches
have plagued governments, individuals, and even corporations for some time now and seem to be
an increasing trend with no end in sight. To defend against these attacks network systems such as
firewalls, endpoint security servers, intrusion prevention systems, and intrusion detection
systems have been placed on many networks within the global internet infrastructure. Out of all
of these defense systems, network intrusion detection systems that use data mining techniques
are proving to be the most promising defense against cyber attacks. When paired with data
mining techniques, network intrusion detection systems can analyze massive amounts of audit
data while simultaneously improving detection efficiency.
In the past few years researchers have been doing a lot of experimenting with various
algorithms on network intrusion detection systems. Different approaches have been tried in an
effort to improve both the effectiveness and efficiency of NIDS. As a result, detection has
improved tremendously over the years and continues to be a major weapon in the cyber war
arsenal. For this reason, choosing the best data mining classification method for network
intrusion detection systems has become even more vital for future NID systems designs.
Many NID system designers have begun to take a closer look at the results of this
research. One particular research technique that comes to mind involves discovering algorithms
that distinctively work better on certain types of attacks. Findings in this area could help
immensely in enhancing intrusion detection performance. A Network Intrusion Detection
System (NIDS) plays the vital role of identifying various types of attacks; its main purpose is to
secure the network and its information infrastructure.
The main purpose of a NIDS is to detect intrusions among normal audit data, which in
essence can be treated as a classification problem. In classification, we are given a collection of
records known as a training set, where each record has a set of attributes, one of which is called
the class. Classification's goal is to assign a class to previously unseen records. To do this, the
data is divided into a training set and a test set: the training set is used to build the model and
the test set is used to validate it. (Kumar, 2002) Below is a typical framework of a classification
model, which gives a big-picture overview of the experiment being undertaken using the
KDD99 dataset. Basically, two random samples of the data will be taken; one will serve as the
training dataset and the other as the test set. See Figure 1 below.
Figure 1
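The framework in Figure 1 can be sketched in a few lines of scikit-learn. This is an illustrative stand-in only: the synthetic data and the choice of a decision tree are my assumptions, not the experiment's actual WEKA setup.

```python
# Sketch of the train/test classification framework: build the model on a
# training set, validate it on held-out records it has never seen.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic "audit records": 41 features mirroring the KDD99 layout,
# binary class (normal vs. attack)
X, y = make_classification(n_samples=1000, n_features=41, random_state=42)

# Split into a training set (to build the model) and a test set (to validate it)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                          # induce the model
acc = accuracy_score(y_test, model.predict(X_test))  # validate on unseen records
```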
The purpose of this project is to evaluate a wide-ranging set of classifier algorithms
against the KDD dataset in order to determine whether particular algorithms demonstrate
better performance in particular attack categories. Each algorithm will be compared and its
performance results matched to specific attack categories. Detecting attacks in the Probe, DOS,
U2R, and R2L categories will serve as the baseline foundation of this project; each of these
attack categories involves distinctive attack types that are commonly used by attackers when
hacking a network. Once we have identified which algorithm best detects which attack
category, we will propose a model for designing an ideal network intrusion detection system for
future use.
BACKGROUND:
It is a well-known fact that learning classifier models is a critical problem in data mining.
Audit type data from real world networks is often presented as a set of records with multiple
attributes. One of those attributes is a special category attribute called class. Initially, a training
set of records with known class labels are extracted. The idea is to learn a model for the class in
terms of other attributes. Once this model has been identified the objective is to use this newly
found model based on both the predicted and actual classes to predict the class of any given set
of records in order to optimize the system overall.
Traditionally, the goal has been to reduce the number of misclassified records, thereby in
essence maximizing accuracy. Many techniques have been attempted over the years to build
efficient and effective classifier models. Three methods in particular are especially popular in
data mining.
The first method is the Transductive Confidence Machines for K-Nearest Neighbors
(TCM-KNN) technique. The TCM-KNN technique is a commonly used machine learning and
data mining algorithm used for intrusion detection. It is highly effective in the areas of fraud
detection, pattern recognition and outlier detection. Historically the TCM-KNN algorithm wasn’t
considered an algorithm fit for intrusion detection. Nevertheless, that attitude has changed
immensely over the past decade as research has proven the algorithm's validity. Results have
been outstanding: TCM-KNN achieves high detection rates and low false positives even when
trained on a limited data set. For instance, an average true positive rate of 99.6% and a false
positive rate of 0.1% was reported on a random sampling subset of the KDD dataset. (Nguyen,
2008) TCM-KNN computes its confidence based on a test of randomness. Unlike conventional
machine learning methods, transduction can offer measures of reliability for individual points,
and it makes few assumptions beyond the i.i.d. assumption (that the training points and the
new, unlabeled points are independently and identically distributed). This makes TCM-KNN an
ideal method for pattern recognition, fraud detection, and outlier detection. (Yang, 2007)
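The core of the transductive approach is a k-NN "strangeness" score: how far a point sits from its own class relative to the other classes. The sketch below is a simplified illustration of that measure only, not the full TCM-KNN algorithm; the toy data and the choice of k are my assumptions.

```python
# k-NN strangeness: ratio of distances to the k nearest same-class
# neighbors over the k nearest other-class neighbors. A high ratio means
# the point fits the proposed label poorly.
import numpy as np

def strangeness(point, label, X, y, k=2):
    d = np.linalg.norm(X - point, axis=1)
    same = np.sort(d[y == label])[:k]    # k nearest same-class distances
    other = np.sort(d[y != label])[:k]   # k nearest other-class distances
    return same.sum() / other.sum()

# Toy data: two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# A point near cluster 0 is far "stranger" under label 1 than under label 0
a0 = strangeness(np.array([0.05, 0.05]), 0, X, y)
a1 = strangeness(np.array([0.05, 0.05]), 1, X, y)
```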
The next method is the PNrule technique. PNrule was assessed on the KDD testing
dataset and performed quite well in its detection efforts. Returning a true positive rate of over
96% against denial-of-service (DOS) attacks and a false positive rate of less than 10% overall,
PNrule proved itself a competent player in the data mining space. (Nguyen, 2008) Basically,
PNrule is a two-stage, general-to-specific framework based on finding rules that predict the
presence of a target class (P-rules) and the absence of a target class (N-rules). The main idea is
to learn a set of P-rules that together cover most of the positive examples while still maintaining
high support. Initially, highly accurate rules are selected, but later accuracy is compromised in
favor of support. This lessening of accuracy causes some negative examples, or false positives,
to be covered as well. All of the true positives and false positives collectively supported by the
union of the P-rules are then combined, and N-rules are learned on this reduced data set to
remove the false positives. This two-phase approach is what makes PNrule different. PNrule
also uses a method that allows it to avoid overfitting while still maintaining its flexibility.
(Agarwal, 2000)
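PNrule's two-stage idea can be approximated in a small sketch. Note the approximation: real PNrule learns symbolic rules, whereas here two logistic regressions stand in for the P-rule and N-rule stages, and the 0.2 threshold is an illustrative assumption.

```python
# Two-stage sketch of the PNrule idea: stage one favors recall (cover most
# positives, tolerating false positives); stage two is trained only on the
# records stage one flagged, to filter the false positives back out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)

# Stage 1 ("P-rules"): low decision threshold -> high recall, more FPs
stage1 = LogisticRegression(max_iter=1000).fit(X, y)
p_flag = stage1.predict_proba(X)[:, 1] > 0.2

# Stage 2 ("N-rules"): learn to separate TPs from FPs within the flagged set
stage2 = LogisticRegression(max_iter=1000).fit(X[p_flag], y[p_flag])
final = np.zeros_like(y)
final[p_flag] = stage2.predict(X[p_flag])

fp_stage1 = int(((y == 0) & p_flag).sum())        # FPs after stage one
fp_final = int(((y == 0) & (final == 1)).sum())   # FPs after stage two
```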
The final method is the Multiclass Support Vector Machine (SVM) method. The SVM,
technically a binary classifier, is based on the idea of constructing optimal hyperplanes to
improve generalization ability. Because most real-world pattern recognition applications are
multi-class classification cases, multi-class SVM algorithms are ideal for solving these types of
real-world problems. Multi-class SVM is a decomposition-based method that divides a
multi-class problem into multiple binary problems, constructs a two-class SVM classifier for
each, and combines their classification results. There are three common ways to build a
multi-class SVM from binary SVM algorithms: one-vs-all, one-vs-one, and error-correcting
output coding. Among the existing decomposition approaches, the one-vs-all strategy has been
regarded as a simple method with somewhat lower accuracy than other multi-class SVMs;
nevertheless, it remains powerful and can produce results that are often at least as accurate as
other methods. (Xu, 2006) When SVMs were evaluated on the KDD99 dataset in past work, the
results were promising: false positives were maintained at a very low level, averaging about
0.6%, while true positives were in the range of 75% for DOS attacks and 82% for probe attacks
using a small dataset of 10,000 records. (Nguyen, 2008)
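The one-vs-all decomposition described above can be sketched directly with scikit-learn, which wraps one binary SVM per class and picks the highest-scoring one. The synthetic four-class dataset standing in for Probe, DOS, U2R, and R2L is an illustrative assumption.

```python
# One-vs-all multi-class SVM: one binary classifier per class; at prediction
# time, the class whose binary classifier scores highest wins.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Four classes standing in for the Probe, DOS, U2R, and R2L categories
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=1)

ovr = OneVsRestClassifier(LinearSVC(dual=False))  # one binary SVM per class
ovr.fit(X, y)
n_binary = len(ovr.estimators_)   # 4 classes -> 4 underlying binary SVMs
```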
SYSTEM DESIGN AND IMPLEMENTATION:
The system design experiment evaluation atmosphere consists of three steps: First the
environment is setup with the required hardware and software, the data mining software is
selected and installed, and the dataset itself is chosen and preprocessed. Next, five discrete
classifier algorithms are chosen from a wide-ranging set of accepted classifier algorithms. These
were chosen because they represent a number of different industries, they are well-known
conventional classifier algorithms, and they are already included in the WEKA software. The
five classifier algorithms chosen are: Naïve Bayes, J48 (a C4.5 revision), Decision Table, JRip
(RIPPER), and SMO. Finally, the experiments are run using the different classifier algorithms,
the results are recorded, and the performances of each classifier algorithm are compared. Based
on the information obtained, an ideal model is suggested.
All experimental iterations were done on the following hardware and software: Dell
Latitude e6410 with Intel Core i5 2.53GHz processor, 4GB RAM, running on Microsoft
Windows 7 operating system platform. The open source data mining machine learning package
called WEKA, version 3.6 for Windows, was used. WEKA is a data mining workbench that
contains a large number of algorithms for classification as well as many algorithms for data
preprocessing, feature selection, clustering, and finding association rules. It is an all-inclusive
workbench, written in Java, and runs on multiple platforms. (Witten, 2011)
The experiment uses the KDD99 dataset which came about as a direct result of the 1998
DARPA Intrusion Detection Evaluation Program whose sole objective was to survey and
evaluate research in intrusion detection. This standard set of pre audited data includes a wide
variety of intrusions simulated in a military network environment. Nine weeks of raw TCP dump
data for a local-area network (LAN) simulating a typical U.S. Air Force LAN was interspersed
with multiple types of attacks. The raw training data contained seven weeks of TCP dump
network traffic resulting in about four million connection records. (Stolfo, 1999)
A connection is a series of TCP packets starting and ending at some distinct times,
between which data flows to and from a source IP address to a destination IP address. Labeled as
either normal or as an attack, each connection contains exactly one specific attack type. As a
result, four main categories of attacks were identified: DOS (denial of service), R2L (remote to
local), U2R (user to root), and Probing. (Stolfo, 1999) These are the four categories of
attacks that will be used in this classifier experiment’s dataset. The actual type of attacks in the
dataset as well as their categories can be seen here in Figure 2 below.
Figure 2
As our test dataset, the KDD99 dataset contains one type of normal data and 22 different types
of attacks. Each instance of the dataset contains 41 features plus one final feature for
classification. (Nguyen, 2008) In order to make the experiment tractable, we filtered the original
KDD99 dataset down to about 10% of the data. This data was randomly chosen and represents
a cross-section of attack types. The original data set had 494,020 records, and we reduced our
dataset to 49,402 records. To do this we loaded the original dataset into WEKA and applied
Choose → Filters → Unsupervised → Instance → RemovePercentage. We then double-clicked
the name of the filter and entered 90% in the percentage box, which leaves only 10% of the data
remaining. Screenshots in Figures 3 and 4 below show the before and after results of the filter
being applied to the original dataset. This new dataset will be used as the training set. The next
random sampling removes 97% of the data, leaving 3%, or about 14,821 instances, which will
be used as the test set.
Figure 3
Figure 4
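The two RemovePercentage steps above amount to random subsampling, which can be sketched outside WEKA as well. The record count mirrors the report; the placeholder records and seeds are illustrative assumptions.

```python
# Random subsampling in the spirit of WEKA's RemovePercentage filter:
# keep a random 10% of records for training and a separately seeded
# random 3% for testing.
import random

records = list(range(494020))   # stand-ins for the KDD99 connection records

rng = random.Random(1)
train = rng.sample(records, int(len(records) * 0.10))   # 49,402 records

rng2 = random.Random(2)
test = rng2.sample(records, int(len(records) * 0.03))   # ~14,820 records
```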
After the filter was applied, the different types of attacks had to be categorized into the
four chosen categories: DOS, Probe, U2R, and R2L. This required looking at the total number
of each attack and putting each one in its respective category. This was all done manually after
WEKA identified each attack type. An attack-by-category table was formed for each dataset.
Afterwards, a distribution-of-connection-type classifier evaluation table was created in Excel
for each dataset as well. Figures 5 and 6 below show screenshots of these Excel attack tables.
These tables will be used later for some calculations.
Figure 5
Figure 6
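The manual attack-to-category bucketing described above is essentially a lookup table. The sketch below uses the standard KDD99 attack labels; the sample list of connections is an illustrative assumption.

```python
# Map each KDD99 attack label to its category, then tally instances.
CATEGORY = {
    "smurf": "DOS", "neptune": "DOS", "back": "DOS", "teardrop": "DOS",
    "pod": "DOS", "land": "DOS",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R", "perl": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "warezmaster": "R2L", "warezclient": "R2L", "spy": "R2L",
}

def categorize(labels):
    """Tally attack instances into the four categories (normal is skipped)."""
    totals = {"DOS": 0, "Probe": 0, "U2R": 0, "R2L": 0}
    for lbl in labels:
        if lbl in CATEGORY:
            totals[CATEGORY[lbl]] += 1
    return totals

sample = ["smurf", "smurf", "nmap", "normal", "rootkit", "guess_passwd"]
totals = categorize(sample)
```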
Once the training and test sets were created, it was time to select the classifiers and
begin training. The cross-validation method was chosen as the test option because it avoids
overlapping test sets. Using one group for testing and the rest for training allows time to
fine-tune the model building process. The experiment was started with the cross-validation
option and continued until all the algorithms were tested. Keeping a consistent testing method
across the board is really the only way to have a true comparison and establish a baseline
standard prior to improving the results through further iterations and recommending a model.
Once that is done, an accurate estimate can be identified and the results of each algorithm
compared.
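The cross-validation test option works by splitting the data into k folds, letting each fold serve once as the test set, and averaging the k accuracies. A minimal sketch, assuming synthetic data and a Naive Bayes model purely for illustration:

```python
# 10-fold cross-validation: each fold is the test set exactly once, so
# test sets never overlap; the mean score estimates generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=7)

scores = cross_val_score(GaussianNB(), X, y, cv=10)  # WEKA's default is 10 folds
mean_acc = scores.mean()
```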
The first classifier chosen was the NaiveBayes algorithm. NaiveBayes is a powerful
probabilistic algorithm that learns from training data the conditional probability of each attribute.
It is called naïve because it relies on two simplifying assumptions: it assumes that the predictive
attributes are conditionally independent given the class, and it posits that no hidden attributes
influence the prediction process. (Nguyen, 2008) NaïveBayes performs its classification by
applying Bayes' rule to compute the probability of each class given a particular instance, and
then predicts the class with the highest posterior probability. The goal of classification is to
correctly forecast the value of a selected discrete class variable given a vector of predictors or
attributes. NaïveBayes is very effective on many real-world datasets; nevertheless, attribute
redundancy sometimes reduces its predictive accuracy. (Baitharu, 2013) Figures 7 and 8 below
show the process of running the NaïveBayes algorithm on the KDD99 training dataset using
cross-validation.
Figure 7
Figure 8
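Bayes' rule as NaïveBayes applies it is simple enough to work by hand: the posterior for each class is proportional to the class prior times the product of per-attribute conditional probabilities. The probabilities below are made-up illustrations, not values from the KDD99 data.

```python
# Naive Bayes posterior: P(class | instance) proportional to
# P(class) * product over attributes of P(value | class).
def naive_bayes_posterior(priors, cond_probs, instance):
    """cond_probs[cls][attr][value] = P(value | cls); returns normalized posteriors."""
    scores = {}
    for cls, prior in priors.items():
        p = prior
        for attr, value in instance.items():
            p *= cond_probs[cls][attr][value]   # conditional-independence assumption
        scores[cls] = p
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}

priors = {"normal": 0.8, "attack": 0.2}
cond_probs = {
    "normal": {"flag": {"SF": 0.9, "REJ": 0.1}},
    "attack": {"flag": {"SF": 0.3, "REJ": 0.7}},
}
# A rejected connection: 0.2*0.7 = 0.14 beats 0.8*0.1 = 0.08, so "attack" wins
post = naive_bayes_posterior(priors, cond_probs, {"flag": "REJ"})
```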
The second classifier chosen was the J48 algorithm. J48 is an optimized implementation
of C4.5 revision 8. (Nguyen, 2008) J48 can make precise forecasts from data as
well as explain the patterns in it. J48 can deal effectively with many problems of numeric
attributes such as: missing values, pruning, estimating error rates, complexity of decision tree
induction, and generating rules from trees. (Baitharu, 2013) In terms of predictive accuracy, J48
performs slightly better than CART. The learning and classification steps of J48 are very fast;
however, J48 lacks scalability and efficiency when it’s applied to large datasets. This was clearly
shown in the experiment when J48 was applied to the training dataset. The system running the
software came to a complete halt when attempting to process the entire KDD99 dataset with over
490,000 instances. Nevertheless, on both the training and test datasets J48 performed quite
well. Screenshots in Figures 9 and 10 below show the process of running the J48 algorithm on
the KDD99 dataset using cross-validation.
Figure 9
Figure 10
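J48 is WEKA's Java implementation of C4.5; scikit-learn has no C4.5, so the sketch below uses its related CART-style tree only to illustrate the fast train/classify steps and the pruning control mentioned above. The depth limit as a stand-in for C4.5's pruning is my assumption.

```python
# Decision-tree induction with a depth cap acting as a crude stand-in for
# C4.5's pruning of overgrown trees.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=15, random_state=3)

tree = DecisionTreeClassifier(max_depth=5, random_state=3).fit(X, y)
train_acc = tree.score(X, y)   # training and classification are both fast
depth = tree.get_depth()       # pruning keeps the tree shallow
```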
The third classifier chosen was the Decision Table algorithm, which builds a simple
decision table majority classifier. It summarizes the dataset with a 'decision table' containing
the same number of attributes as the original dataset. A new data item is then assigned to a
category by finding the line in the decision table that matches the non-class values of the data
item. Decision Table uses the wrapper method to find a good subset of attributes for inclusion
in the table. By eliminating attributes that contribute little or nothing to a model of the dataset,
the algorithm creates a smaller, condensed decision table. (Kohavi, 1995) Screenshots in
Figures 11 and 12 below show the process of running the Decision Table algorithm on the
KDD99 dataset using cross-validation.
Figure 11
Figure 12
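A minimal decision-table majority classifier can be sketched directly, assuming a pre-chosen attribute subset (the wrapper search that finds that subset is omitted). The toy connection records and attribute names are illustrative assumptions.

```python
# Decision table majority classifier: key the table on selected attributes;
# match a new item against the table, falling back to the majority class
# when its key is not in the table.
from collections import Counter, defaultdict

def build_table(rows, labels, attrs):
    table = defaultdict(Counter)
    for row, lbl in zip(rows, labels):
        key = tuple(row[a] for a in attrs)   # only the selected attributes
        table[key][lbl] += 1
    majority = Counter(labels).most_common(1)[0][0]
    return table, majority

def classify(table, majority, row, attrs):
    key = tuple(row[a] for a in attrs)
    return table[key].most_common(1)[0][0] if key in table else majority

rows = [{"proto": "tcp", "flag": "SF"}, {"proto": "tcp", "flag": "REJ"},
        {"proto": "udp", "flag": "SF"}, {"proto": "tcp", "flag": "REJ"}]
labels = ["normal", "attack", "normal", "attack"]
table, majority = build_table(rows, labels, attrs=("proto", "flag"))

pred = classify(table, majority, {"proto": "tcp", "flag": "REJ"}, ("proto", "flag"))
```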
The fourth classifier chosen was the JRip algorithm, which implements the
propositional rule learner Repeated Incremental Pruning to Produce Error Reduction
(RIPPER). RIPPER builds a rule set by repeatedly adding rules to an empty rule set until all
positive examples are covered. Rules are formed by greedily adding conditions to the
antecedent of a rule until no negative examples are covered. JRip proceeds by treating all the
examples of a particular class in the training data as one class and finding a set of rules that
covers all the members of that class; it then proceeds to the next class and does the same,
repeating this until all classes have been covered. (Rajput, 2000) JRip is similar to the
commercial rule learner RIPPER. The RIPPER algorithm assembles a rule set in which all
positive examples are covered, and it performs efficiently on noisy datasets. Before a rule is
built, the current set of training examples is partitioned into two subsets, a growing set and a
pruning set. The rule is grown from examples in the growing set, and rules are added
incrementally to the rule set until no negative examples are covered. To prune a rule, RIPPER
chooses the deletion that maximizes its pruning evaluation function. (Witten, 2005)
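The grow/prune partition RIPPER performs before building each rule can be sketched on its own; the rule induction itself is omitted here. The 2:1 split ratio is the commonly cited one and the toy example list is an assumption.

```python
# RIPPER-style partitioning: before each rule is built, shuffle the current
# training examples and split them into a growing set (to grow the rule)
# and a pruning set (to simplify it).
import random

def grow_prune_split(examples, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 2 // 3     # commonly a 2:1 grow/prune split
    return shuffled[:cut], shuffled[cut:]

examples = list(range(90))
grow, prune = grow_prune_split(examples)
```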
The fifth and final classifier is the SMO algorithm, WEKA's implementation of the
sequential minimal optimization method for training a support vector machine; it can be either
linear or non-linear. SMO is a supervised classifier that uses two different datasets: a training
set and a test set. In the best case the classes are linearly separable; although many lines may
separate the dataset, the best one becomes the separating line. The best line is found by
maximizing the distance to the nearest points of both classes in the training set. The complexity
of the model has to be balanced between fitting the training data and remaining generic. This is
achieved by allowing models that can make errors: SMO tolerates some errors in order to avoid
over-fitting, while still minimizing the number of errors that will be made. (Baitharu, 2013)
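The fit-versus-generality trade-off described above is governed by the SVM's C parameter. The sketch below uses scikit-learn's SVC (an SMO-style trainer) on noisy synthetic data; the specific C values and noise level are illustrative assumptions.

```python
# Soft-margin trade-off: a small C tolerates training errors (wider margin,
# more generic model); a large C penalizes errors and fits the data harder.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.2,
                           random_state=5)   # flip_y injects label noise

loose = SVC(kernel="linear", C=0.01).fit(X, y)   # tolerates errors
tight = SVC(kernel="linear", C=100.0).fit(X, y)  # penalizes errors heavily

acc_loose = loose.score(X, y)
acc_tight = tight.score(X, y)
```

With a small C, far more training points end up inside or violating the margin, which shows up as a larger set of support vectors.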
DISCUSSION:
The five chosen algorithms were assessed on the KDD dataset, and Figure 13 gives the
results of our findings. Both the true positive (TP) and false positive (FP) rates of each
algorithm were recorded; these are the factors used to identify the best algorithm for a
particular category. The total time to build each model was also recorded in order to gain some
insight into whether these models are actually applicable under real-world time constraints. I
must say I was somewhat surprised at the results. The J48 classifier had both the highest true
positive rate and the lowest false positive rate for the DOS category. This in essence
demonstrates that certain algorithms do indeed work better for certain attack categories, which
confirms the findings of other researchers such as Nguyen and Choi. Another important insight
involved the long total time taken to build the model by the JRip algorithm: JRip took 914.82
seconds to build its model, while NaiveBayes took only 10.55 seconds. The Decision Table
algorithm had the second-highest total time, taking 642.55 seconds to build its model. Decision
Table also had only a 0.70, or 70%, true positive rate for DOS, indicating it would not be the
best algorithm for finding denial-of-service attacks. Another interesting point is that very few
U2R and R2L attack records were included in either of my datasets. This made it very difficult
to identify which algorithm would be best for these attacks; as shown in Figure 13 below, most
of the rates for these two categories are zeroes. Crunching the numbers by building
spreadsheets helped me make more sense of what was actually happening. Based on all of the
findings, I feel that the J48 classifier is best for DOS attacks while Decision Table is best for
Probe attacks, each having the highest true positive rate as well as the lowest false positive rate
in its respective attack category. U2R and R2L findings were inconclusive due to a lack of data
for these attacks. As a result, the proposed model would combine both the J48 and Decision
Table classifiers into a NID system to improve its efficiency. If the system proves too slow in
the real world, Decision Table would be replaced with the SMO algorithm, which is fairly good
at identifying Probe attacks in less time than Decision Table. In essence, an IDS designed with
these two algorithms working in parallel would be a well-designed system.
Figure 13
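The proposed parallel model can be sketched as two specialist detectors whose alerts are unioned. The stand-in classifiers (two scikit-learn trees playing the roles of J48 and Decision Table) and the synthetic data are my assumptions; the routing-and-union logic is the point of the sketch.

```python
# Proposed NIDS model: two classifiers run in parallel, one tuned for DOS
# and one for Probe; a connection is flagged if EITHER raises an alert.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=9)

# Stand-in specialists (J48 role and Decision Table role respectively)
dos_detector = DecisionTreeClassifier(random_state=9).fit(X, y)
probe_detector = DecisionTreeClassifier(max_depth=3, random_state=9).fit(X, y)

# Parallel evaluation: union of the two detectors' alerts (labels are 0/1)
alerts = dos_detector.predict(X) | probe_detector.predict(X)
n_alerts = int(alerts.sum())
```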
CONCLUSION:
In this final report I have attempted to provide an understanding of research on data
mining classifier selection models and their impact on intrusion detection systems. I have also
tried to give some introductory understanding into different classifier algorithms and techniques
used to build efficient intrusion detection models. The WEKA data mining workbench was used
as the tool of choice to demonstrate how these models would actually be designed as well as how
they would work in identifying certain types of attacks. Background information was given
describing different classifier methodologies and citing how researchers are progressing in this
area of data mining. A system design and implementation methodology was presented,
outlining the project I chose in order to give an understanding of some of the current research
being done today in the area of data mining and cyber security. Overall, the research in this
project, in my
opinion clearly demonstrated that a correlation exists between specific data mining algorithms
and specific attack types. Understanding these relationships as well as other factors such as total
time and average accuracy can prove invaluable in future designs of intrusion detection systems.
Nevertheless, there is much more research that can be done in this area. From my findings I have
learned that identifying the relationship between certain algorithms paired with certain attack
types is an important topic for both future IDS designs as well as future security risks mitigation
efforts. Having a clearer understanding of this topic has made me more aware of both the
intricacies and complexities of data mining algorithms when used in cyber security applications.
As ubiquitous computing continues to change the technology infrastructure landscape, one thing
is for sure, data mining techniques being used in cyber security technologies will continue to
gain even more importance in the ongoing cyber war for many years to come.
ACKNOWLEDGEMENTS:
This project has been one of the most challenging research projects that I have ever done, yet it
has been one of the most rewarding projects I have ever done as well. I would like to thank my
professor, Dr. Liam Mayron, whose encouragement, guidance and support from the beginning to
the end has motivated me to do my very best and never accept anything less. His wisdom and
insight have not only allowed me to develop a better understanding of this subject but have
also taught me to never give up, and for that I am eternally grateful.
REFERENCES
Agarwal, R., Joshi, M., (2000)
PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in
Network Intrusion Detection) retrieved from:
http://www.siam.org/meetings/sdm01/pdf/sdm01_30.pdf
Baitharu, T., Pani, S., (2013)
A Survey on Application of Machine Learning Algorithms on Data Mining: International Journal
of Innovative Technology and Exploring Engineering
retrieved from: http://www.ijitee.org/attachments/File/v3i7/G1363123713.pdf
Kohavi, R. (1996)
Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid: KDD-96
Proceedings retrieved from: http://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf
Kumar, V. (2002)
Data Mining Algorithms: Tutorial Presented at IPAM 2002 Workshop on Mathematical
Challenges in Scientific Data Mining retrieved from:
http://www.ipam.ucla.edu/publications/sdm2002/sdm2002_4007.pdf
Nguyen, H., Choi, D., (2008)
Application of Data Mining to Network Intrusion Detection: Classifier Selection Model
retrieved from: http://arxiv.org/ftp/arxiv/papers/1007/1007.1268.pdf
Rajput, A., Aharwal, R., (2000) J48 and JRIP Rules for E-Governance Data - retrieved from:
http://www.cscjournals.org/csc/manuscript/Journals/IJCSS/volume5/Issue2/IJCSS-448.pdf
Stolfo, S., Fan, W., Lee, W., Prodromidis, A., Cost-based Modeling and Evaluation
for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM
Project retrieved from: https://kdd.ics.uci.edu/databases/kddcup99/task.html
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd
edn. Morgan Kaufmann, San Francisco (2005)
Xu, X., (2006) Adaptive Intrusion Detection Based on Machine Learning: Feature Extraction,
Classifier Construction and Sequential Pattern Prediction retrieved from:
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=8CD0CFADEE68079885E5FD4F71B
C4F7E?doi=10.1.1.107.9575&rep=rep1&type=pdf
Yang, L., Guo, L. (2007)
An active learning based TCM-KNN algorithm for supervised network intrusion detection
retrieved from: http://sourcedb.ict.cas.cn/cn/ictthesis/200907/P020090722621827712247.pdf