Noorbehbahani data preprocessing for anomaly based network intrusion

By: F. Noorbehbahani
Fall 2013
Data preprocessing for anomaly based network intrusion detection: A review

u Dataset creation
u involves identifying representative network traffic for training and
testing. These datasets should be labeled indicating whether the
connection is normal or anomalous.
u Feature construction
u creates additional features with a better discriminative ability than the
initial feature set. This can bring significant improvement to
machine learning algorithms. Features can be constructed manually, or
by using data mining methods such as sequence analysis, association
mining, and frequent-episode mining.
u Reduction
u is commonly used to decrease the dimensionality of the dataset by
discarding any redundant or irrelevant features (feature selection, FS).
Data preprocessing

u comprehensively reviewing the features derived from
network traffic, and the related data preprocessing
techniques which have been used in anomaly-based NIDS
since 1999.
u grouping anomaly-based NIDS based on the types of
network traffic features used for detection. The aim is to
show where the majority of research has been focused.
The groups show a trend from previously using packet
header features exclusively, to using more payload
features.
paper main contributions

Anomaly-based features
  Packet Header
    Basic
    Single Connection
    Multiple Connection
  Protocol Based
    Specification Based
    Parser Based
    AP Keyboard Based
  KDD Cup 99
  Payload Based
    N-gram analysis of request to server
    Analysis of request to Web App
    General payload pattern matching
    Analysis of web content to clients

u Minimize data preprocessing requirements
u Real-time, High bandwidth links
u Summarizing a series of network packet headers into a
single flow record, such as NetFlow, further reduces
resource requirements
u Packet header approaches also have the advantage of
remaining valid when traffic payloads are encrypted, such
as with SSL sessions.
Packet header anomaly detection

u Data preprocessing to extract packet headers is
straightforward.
u Many software programs and libraries already exist to
process network traffic, e.g. libpcap, tcpdump, tshark,
tcptrace, Softflowd, NetFlow, and IPFIX implementations.
u The complex part of the data preprocessing is using
appropriate feature construction to derive more
discriminative features (e.g. time-based statistical
measures) from this basic traffic information.

u Only three papers use the basic features extracted directly
from individual packet headers without further feature
construction.
u PHAD
u to detect attacks against the TCP/IP stack, IDS evasion techniques,
imperfect attack code, and anomalous traffic from victim machines
u learns normal ranges for each packet header field at the data link
(Ethernet), Network (IP), and Transport/control (TCP, UDP, ICMP)
layers
u The result is 33 packet header fields used as basic features. The
possible numeric range of each packet header field is very large, so
to reduce this space, clustering is used.
u a univariate approach which cannot model dependencies
between features.
Packet header basic features
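
The per-field learning described for PHAD can be illustrated with a minimal sketch. It assumes packets have already been parsed into dictionaries of numeric header fields (the field names below are hypothetical) and keeps a single minimum/maximum range per field, which is a simplification: the slide notes that PHAD clusters values into ranges, and its actual scoring function is not reproduced here. Each field is checked on its own, which is what makes the approach univariate.

```python
from typing import Dict, Iterable, List, Tuple

Packet = Dict[str, int]  # hypothetical: header field name -> numeric value


def learn_ranges(training: Iterable[Packet]) -> Dict[str, Tuple[int, int]]:
    """Record the smallest and largest value seen for each header field."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for packet in training:
        for name, value in packet.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def anomalous_fields(packet: Packet, ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the fields whose value falls outside the learned range.

    Fields are tested independently, so dependencies between fields
    cannot be modeled.
    """
    flagged = []
    for name, value in packet.items():
        if name not in ranges:
            flagged.append(name)
            continue
        low, high = ranges[name]
        if not low <= value <= high:
            flagged.append(name)
    return flagged


# Toy training data; the values are invented for illustration.
normal = [
    {"ip_ttl": 64, "tcp_dport": 80, "ip_len": 60},
    {"ip_ttl": 128, "tcp_dport": 443, "ip_len": 1500},
]
model = learn_ranges(normal)
print(anomalous_fields({"ip_ttl": 255, "tcp_dport": 80, "ip_len": 40}, model))
# prints ['ip_ttl', 'ip_len']
```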

u SPADE : one of the first attempts to use an anomaly method for
portscan detection
u the basic features are instead used to build a normal traffic
distribution model for the monitored network.
u Traffic distributions are maintained in real time by tracking joint
probability measurements, e.g. P (source address, destination
address, destination port), or using a Bayes Network.
u During detection, packets are compared to the probability
distribution to calculate an anomaly score.
u By retaining these unusual packets, it is possible to look for
portscans over a much wider time window.
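
As a rough sketch of the joint-probability idea, the code below counts (source address, destination address, destination port) tuples and scores a packet by the negative log of its estimated probability, so rarer tuples score higher. The class name, the add-one smoothing, and the log-based score are assumptions made for this illustration; the slides do not give SPADE's exact formula, and the Bayes network variant is not shown.

```python
import math
from collections import Counter
from typing import Tuple

Key = Tuple[str, str, int]  # (source address, destination address, destination port)


class JointModel:
    """Running estimate of P(source address, destination address, destination port)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def observe(self, key: Key) -> None:
        """Update the traffic distribution with one packet."""
        self.counts[key] += 1
        self.total += 1

    def score(self, key: Key) -> float:
        """Negative log2 of the smoothed probability; higher means rarer."""
        # Add-one smoothing (an assumption) keeps unseen tuples finite.
        p = (self.counts[key] + 1) / (self.total + 1)
        return -math.log2(p)


model = JointModel()
for _ in range(7):
    model.observe(("10.0.0.5", "10.0.0.1", 80))

common = model.score(("10.0.0.5", "10.0.0.1", 80))
rare = model.score(("10.0.0.9", "10.0.0.1", 23))
print(common, rare)
```

Packets whose score exceeds a chosen threshold would be the "unusual packets" retained for later portscan analysis.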

u Attacks against wireless networks have also been detected using
packet headers, in this case from the MAC layer frame header.
u The approach requires tapping the local wireless network.
u Guennoun et al. (2008) perform preprocessing to extract all the frame
headers, convert any continuous features to categorical ones, and
derive new features.
u A wrapper approach is then used to find the best set of features. It
uses a forward search algorithm which starts with the single most
relevant feature, tests it with a k-means classifier, and then iteratively
adds the next most relevant feature to the set. It was found that the
top eight ranked features produced a classifier with the best
accuracy.
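
The ranked forward search can be sketched as follows. This is an illustration under assumptions, not the authors' code: the features are taken to be already ranked by relevance, and `evaluate` is a hypothetical callback standing in for training and testing the classifier on a candidate feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def ranked_forward_search(
    ranked: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Grow the feature set in rank order and keep the best-scoring prefix."""
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked:
        current.append(feature)
        score = evaluate(current)  # e.g. classifier accuracy on held-out data
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical accuracies keyed by subset size, for illustration only.
toy_accuracy = {1: 0.70, 2: 0.82, 3: 0.91, 4: 0.88}
subset, accuracy = ranked_forward_search(
    ["f1", "f2", "f3", "f4"], lambda s: toy_accuracy[len(s)]
)
print(subset, accuracy)  # prints ['f1', 'f2', 'f3'] 0.91
```

Because the classifier itself is used to judge each candidate set, this is a wrapper method rather than a filter method.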

u use complete network flows as data instances rather than
individual packet data.
u Analyzing flows provides more context than analyzing individual
packets standalone.
u Flows are unidirectional sequences of packets sharing a
common key such as the same source address and port, and
destination address and port.
u A flow is complete after a timeout period, or for TCP when end-of-session
flags (e.g. FIN or RST) are seen.
u A convenient way of obtaining flow information is to use
NetFlow records.
Single connection derived features
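
A minimal sketch of flow assembly, assuming packets arrive in time order and have already been parsed into a small record type (the `Packet` class, its field names, and the 60 s default timeout are hypothetical choices for this example). Packets are grouped by the unidirectional key described above, and a flow is closed by an idle timeout or by a FIN or RST flag.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

FlowKey = Tuple[str, int, str, int]  # (src addr, src port, dst addr, dst port)


@dataclass
class Packet:
    ts: float
    src: str
    sport: int
    dst: str
    dport: int
    length: int
    flags: frozenset = frozenset()


def assemble_flows(packets: Iterable[Packet], timeout: float = 60.0) -> List[List[Packet]]:
    """Group time-ordered packets into unidirectional flows."""
    active: Dict[FlowKey, List[Packet]] = {}
    done: List[List[Packet]] = []
    for pkt in packets:
        key = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
        flow = active.get(key)
        # An idle gap longer than the timeout closes the previous flow.
        if flow is not None and pkt.ts - flow[-1].ts > timeout:
            done.append(active.pop(key))
            flow = None
        if flow is None:
            flow = active[key] = []
        flow.append(pkt)
        # TCP end-of-session flags close the flow immediately.
        if pkt.flags & {"FIN", "RST"}:
            done.append(active.pop(key))
    done.extend(active.values())  # flush flows still open at end of input
    return done


packets = [
    Packet(0.0, "a", 1000, "b", 80, 60, frozenset({"SYN"})),
    Packet(0.1, "b", 80, "a", 1000, 60, frozenset({"SYN", "ACK"})),
    Packet(0.2, "a", 1000, "b", 80, 500),
    Packet(0.3, "a", 1000, "b", 80, 40, frozenset({"FIN"})),
]
flows = assemble_flows(packets)
print([len(f) for f in flows])  # prints [3, 1]
```

The reply from "b" to "a" lands in its own flow because the key is directional.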

u Having a router generate NetFlow data saves the NIDS
from doing its own data preprocessing tasks such as
parsing of IP headers, maintaining packet counts, and
stream (flow) reassembly.
u Alternatively, NetFlow records can be produced on a
computer host using software such as softflowd.
u NetFlow records also significantly reduce the storage requirements
compared to full packet capture.
u NetFlow information is only based on packet headers, so
the transport payload is ignored.
SCD features

u The most common and important SCD features:
time-based statistical measures obtained by monitoring basic
features over the duration of the flow.
u Examples
u counts of packets and bytes in the flow (as per NetFlow records),
u the average inter-packet arrival time,
u the mean packet length.
u These features are useful for fingerprinting sessions,
detecting unusual data flows, or finding other anomalies
within a single session.
SCD features
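
The example features listed above can be computed from a single flow with a few lines of code. This sketch assumes the flow is a non-empty, time-ordered list of (timestamp, packet length) pairs; the function name and the dictionary keys are chosen for illustration.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def scd_features(flow: Sequence[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize one flow given (timestamp, packet length) pairs in time order."""
    times = [t for t, _ in flow]
    sizes = [n for _, n in flow]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packets": len(flow),
        "bytes": sum(sizes),
        "mean_packet_length": mean(sizes),
        # A single-packet flow has no inter-arrival gaps.
        "mean_interarrival": mean(gaps) if gaps else 0.0,
        "duration": times[-1] - times[0],
    }


features = scd_features([(0.0, 100), (0.5, 300), (2.0, 200)])
print(features)
```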

u ANDSOM
u Data preprocessing first segments the dataset by service type (TCP or UDP) and
the application protocol (HTTP or SMTP).
u For each data segment a different model is created. In this case self-organizing
maps (SOM) are used.
u The calculated SCD features are quad, start time, end time, whether the session
had a valid start (2 SYN packets), whether the connection was closed properly
(FINs) or improperly (RST), number of queries per second, average size of questions,
average size of answers, question answer idle time, answer question idle time, and
the duration of the connection.
u These features provide a fingerprint for the session. During the detection phase the
data instances were compared to the appropriate SOM model to detect
anomalies in that service.
u Testing successfully found an injected BIND attack and an HTTP tunnel, both of
which are detectable within a single flow.
SCD features

u Yamada et al.
u use SCD features to find attacks against webservers when the traffic is
encrypted by SSL or TLS.
u only use information from the unencrypted protocol headers for
detection.
u The features used are :
u the HTTP request and response sizes, calculated across each continuous activity
of each user.
u Since using size features alone would produce many false positives,
frequency analysis is also performed to eliminate alerts common to the
webserver.
u Statistically rare alerts are flagged as anomalies.
SCD features

u Anomaly detection using only TCP flags as SCD features
u TCP flags are extracted from packets within each TCP session,
and each flag combination is quantized as a symbol.
u A separate model is produced for each of the observed protocols:
SSH, HTTP and FTP.
u During the detection phase, network traffic is evaluated against
the appropriate model for anomaly detection.
u The approach was found to detect scans initiated by nmap, and
SSH and HTTP misuse.
u While this approach detects attacks which modify TCP
characteristics, it is not likely to detect payload-based attacks.
SCD features
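
To illustrate quantizing flag combinations into symbols, the sketch below encodes each packet's flag set as an integer bitmask and models a protocol's normal sessions as the set of symbol-to-symbol transitions seen in training. The transition-set model is a deliberately simple stand-in chosen for this example; the slides do not specify the model used in the original work.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Bit values follow the TCP header flag bits.
FLAG_BITS = {"FIN": 1, "SYN": 2, "RST": 4, "PSH": 8, "ACK": 16, "URG": 32}

Transition = Tuple[int, int]


def to_symbol(flags: Iterable[str]) -> int:
    """Quantize a flag combination into one integer symbol (a bitmask)."""
    symbol = 0
    for name in flags:
        symbol |= FLAG_BITS[name]
    return symbol


def transitions(session: Sequence[Iterable[str]]) -> List[Transition]:
    """Consecutive symbol pairs for one TCP session."""
    symbols = [to_symbol(flags) for flags in session]
    return list(zip(symbols, symbols[1:]))


def train(sessions: Iterable[Sequence[Iterable[str]]]) -> Set[Transition]:
    """Collect every transition observed in normal sessions of one protocol."""
    seen: Set[Transition] = set()
    for session in sessions:
        seen.update(transitions(session))
    return seen


def unseen_fraction(session: Sequence[Iterable[str]], model: Set[Transition]) -> float:
    """Share of a session's transitions that never occurred in training."""
    pairs = transitions(session)
    if not pairs:
        return 0.0
    return sum(pair not in model for pair in pairs) / len(pairs)


normal = [("SYN",), ("SYN", "ACK"), ("ACK",), ("PSH", "ACK"), ("FIN", "ACK")]
http_model = train([normal])  # one model per protocol
probe = [("FIN", "PSH", "URG"), ("RST",)]
print(unseen_fraction(normal, http_model), unseen_fraction(probe, http_model))
```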

u SCD features have been used to detect connections
which pass through multiple stepping stones (Yang and
Huang, 2007).
u SCD features are also used by Early and Brodley (2006).
Their aim is to automatically detect which application
protocol (e.g. SSH, telnet, SMTP, or HTTP) is being used
without using the destination port as a guide.
SCD features

u Are useful for finding anomalous behavior within a single
session, such as an unexpected protocol, unusual data
sizes, unusual packet timing, or unusual TCP flag
sequences.
u Particular detection capabilities include backdoors, HTTP
tunnels, stepping stones, BIND attacks, and command and
control channels.
u However, by themselves they cannot be used to find
activity spanning multiple flows such as DoS attacks or
network probes. For that, MCD features are required.
SCD features

u Are constructed by monitoring base features over multiple
flows or connections
u They enable detection of anomalies which manifest
themselves as unusual patterns of traffic, such as network
probes and DoS attacks.
u Domain knowledge is used to choose a window of data to
consider.
u The time windows range from 5 s to 24 h, with shorter time
windows detecting bursty attacks, and long time windows
more likely to detect slow and stealthy attacks.
u Connection based windows are also used, such as
analyzing the most recent 100 connections.
Multiple connection derived features
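
A time-window MCD feature can be sketched as a sliding count. The example assumes connection records sorted by start time and reduced to (start time, destination address, destination port) tuples; the function name and record layout are hypothetical. For each connection it counts how many connections to the same destination started within the preceding window, including the connection itself.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Conn = Tuple[float, str, int]  # (start time, destination address, destination port)


def same_destination_counts(conns: Iterable[Conn], window: float) -> List[int]:
    """Count recent connections to the same destination for each connection."""
    recent: Deque[Conn] = deque()
    counts: List[int] = []
    for conn in conns:  # assumed sorted by start time
        ts, dst, dport = conn
        recent.append(conn)
        # Drop connections that started before the window opened.
        while recent[0][0] < ts - window:
            recent.popleft()
        counts.append(sum(1 for _, d, p in recent if (d, p) == (dst, dport)))
    return counts


conns = [
    (0.0, "10.0.0.1", 80),
    (1.0, "10.0.0.1", 80),
    (2.0, "10.0.0.2", 22),
    (3.0, "10.0.0.1", 80),
    (10.0, "10.0.0.1", 80),
]
print(same_destination_counts(conns, window=5.0))  # prints [1, 2, 1, 3, 1]
```

A connection-based window could be approximated in the same way by replacing the time check with a `deque(maxlen=100)`.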

u The KDD Cup 99 dataset has known limitations.
u Advantages
u being publicly available, labeled, and preprocessed ready for
machine learning.
u Each network connection was processed into a labeled
vector of 41 features constructed using data mining
techniques and expert domain knowledge when creating
a machine learning misuse-based NIDS.
KDD Cup 99

u 9 basic and SCD header features for each connection
(similar to NetFlow)
u 9 time-based MCD header features constructed over a 2
s window
u 10 host-based MCD header features constructed over a
100 connection window to detect slow probes.
u 13 content-based features were constructed from the
traffic payloads using domain knowledge. Data mining
algorithms could not be used since the payloads were
unprocessed and therefore unstructured. They were
designed to specifically detect U2R and R2L attacks.
KDD 99 data preprocessing produced

u Many remote attacks on computers place the exploit
code inside the payload of network packets. Hence these
attacks are not directly detectable by packet header
approaches.
u Payload attacks are more computationally expensive to
detect due to requiring deeper searches into network
sessions.
Content anomaly detection

u The SANS “Top Cyber Security Risks” 2009 report lists the top two
cyber risks as client side software which remains
unpatched, and vulnerable Internet-facing websites.
u The first risk can be exploited using malicious content
destined for a client, while the second can be exploited
using crafted content in requests to servers.
u In these cases, bytes containing the exploit code are
contained within network packet payloads beyond the
TCP/IP headers, such as within downloaded files.
Content anomaly detection

u PAYL
u uses 1-grams and unsupervised learning to build a byte-frequency
distribution model of network traffic payloads.
u A 1-gram is simply a single byte with value in the range 0 to 255. The
result of preprocessing a packet payload this way is a feature
vector containing the relative frequency count of each of the
256 possible 1-grams (bytes) in the payload.
u The model also includes the average frequency, as well as the
variance and standard deviation as other features.
u Separate models of normal traffic are created for each
combination of destination port and length of the flow.
N-gram analysis of requests to servers
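
The 1-gram preprocessing can be sketched directly: each payload becomes a 256-element vector of relative byte frequencies, and a model keeps the per-byte mean and standard deviation over training payloads. The distance function below, a simplified Mahalanobis-style sum with a smoothing constant `alpha`, is an assumption added for illustration since the slides do not state how payloads are compared to the model. Per-port and per-length models are omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def byte_frequencies(payload: bytes) -> List[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = [0] * 256
    for value in payload:
        counts[value] += 1
    n = len(payload) or 1  # avoid dividing by zero on empty payloads
    return [c / n for c in counts]


def fit(payloads: Sequence[bytes]) -> Tuple[List[float], List[float]]:
    """Per-byte mean and standard deviation over the training payloads."""
    vectors = [byte_frequencies(p) for p in payloads]
    columns = list(zip(*vectors))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def distance(payload: bytes, means: List[float], stds: List[float],
             alpha: float = 0.001) -> float:
    """Sum of per-byte deviations scaled by (std + alpha); alpha is an assumed smoothing term."""
    x = byte_frequencies(payload)
    return sum(abs(xi - m) / (s + alpha) for xi, m, s in zip(x, means, stds))


# Toy payloads invented for illustration.
means, stds = fit([b"GET /index.html", b"GET /about.html"])
normal_score = distance(b"GET /index.html", means, stds)
odd_score = distance(b"\x90" * 16, means, stds)
print(normal_score < odd_score)  # prints True
```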

u PAYL was designed to detect zero-day worms, since flows with
worm payloads can produce an unusual byte-frequency
distribution.
u Testing was performed on all attacks in the DARPA 1999 dataset
using individual packets as data units (connection data units
were also attempted).
u The overall detection rate was close to 60% at a false positive
rate less than 1%.
u The authors point to a large non-overlap between PAYL and
PHAD, with one modeling header data and the other modeling
payloads. The two approaches could complement each other.

u ANAGRAM also builds on PAYL, but uses a mixture of high-
order N-grams with N > 1.
u This reduces its susceptibility to mimicry attacks since
higher order N-grams are harder to emulate in padded
bytes.
u By contrast, PAYL can be easily evaded if normal byte
frequencies are known to an attacker since malicious
payloads can be padded with bytes to match it.
u ANAGRAM uses supervised learning to model normal
traffic by storing N-grams of normal packets into one
bloom filter.
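
The idea of storing N-grams of normal traffic in a Bloom filter can be sketched with the standard library. Everything here is an illustrative assumption rather than ANAGRAM's implementation: the filter size, the SHA-256 based hash positions, the choice of N = 5, and scoring a payload by the fraction of its N-grams that are absent from the filter.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 16, hashes: int = 3) -> None:
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several bit positions by salting the hash input.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(payload: bytes, n: int) -> Iterator[bytes]:
    """All overlapping n-byte substrings of the payload."""
    return (payload[i:i + n] for i in range(len(payload) - n + 1))


def train(normal_payloads: Iterable[bytes], n: int = 5) -> BloomFilter:
    model = BloomFilter()
    for payload in normal_payloads:
        for gram in ngrams(payload, n):
            model.add(gram)
    return model


def unseen_fraction(payload: bytes, model: BloomFilter, n: int = 5) -> float:
    """Fraction of the payload's N-grams not found in the normal model."""
    grams = list(ngrams(payload, n))
    if not grams:
        return 0.0
    return sum(gram not in model for gram in grams) / len(grams)


model = train([b"GET /index.html HTTP/1.1"])
print(unseen_fraction(b"GET /index.html HTTP/1.1", model))  # prints 0.0
print(unseen_fraction(b"\x90" * 12 + b"/bin/sh", model))
```

Padding an attack with common single bytes does not help here, because the padded payload still has to reproduce whole N-byte sequences seen in normal traffic.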

u Similarly, McPAD creates 2v-grams and uses a sliding window to
cover all sets of 2 bytes, v positions apart in network traffic
payloads.
u Since each byte can have values in the range 0 to 255, and n =
2, the feature space is 256^2 = 65,536. By varying v , different
feature spaces are constructed, each handled by a different
classifier.
u The dimensionality of the feature space is then reduced using a
clustering algorithm.
u Multiple one-class SVMs are used for classification, and a meta-
classifier combines these outputs into a final classification
prediction. The results of testing McPAD showed it could detect
shellcode attacks in HTTP requests.
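
The gapped byte-pair features can be sketched as below. The sketch assumes the gap parameter (called `v` here) is the number of bytes skipped between the two bytes of a pair, so that `v = 0` gives ordinary 2-grams; that reading, and the function name, are assumptions for illustration. The clustering step and the one-class SVM ensemble are not shown.

```python
from collections import Counter

FEATURE_SPACE = 256 ** 2  # 65,536 possible (first byte, second byte) pairs


def gapped_pairs(payload: bytes, v: int) -> Counter:
    """Count byte pairs with `v` bytes skipped between the two bytes."""
    step = v + 1
    # A sliding window visits every position that has a partner byte.
    return Counter(
        (payload[i], payload[i + step]) for i in range(len(payload) - step)
    )


# One feature vector per gap value; each would feed its own classifier.
views = {v: gapped_pairs(b"abcd", v) for v in range(3)}
print(views[0])
print(views[1])
```

With the payload `b"abcd"`, a gap of 0 yields the pairs (a, b), (b, c), (c, d), a gap of 1 yields (a, c), (b, d), and a gap of 2 yields only (a, d).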

u Organizations may require additional monitoring of critical
applications.
u One method is to create an application-specific anomaly
detector, such as for web applications.
u One example is an anomaly-based SQL injection detector that is
host based and relies on the interception of SQL statements between the
web application and the database.
Analysis of requests to web applications

u Common network architectures ensure client hosts
(workstations) within an organization are not directly
exposed to the Internet at the network layer. This protects
the client hosts from external threats such as probes, DoS,
network worms and other attacks against open ports
(services).
u However, many other threats are faced by these clients,
particularly when they are exposed to untrusted code or
data.
Analysis of web content to clients

u This review has identified the various feature sets used by
anomaly-based NIDS.
u When designing a NIDS, the choice of network traffic
features is largely driven by the detection requirements.
u If broad anomaly detection is desired, then separate
anomaly detectors should be built for each of the feature
sets.
u For more targeted anomaly detection, a single feature
set can be used.
Conclusion and Feature set
recommendation

u Packet header features have the advantages of
u being fast, with relatively low computation and memory overheads,
and avoiding some of the privacy and legal concerns regarding
network data analysis.
u Basic features can be used to
u flag single packets which are anomalous with respect to a normal
training model (e.g. PHAD),
u or as a filtering mechanism so only unusual packets are fed to
downstream algorithms (e.g. SPADE).
u Individual packets cannot be used to identify unusual trends or
patterns over time.
recommendation

u To identify anomalous patterns across multiple packets, but
within a single connection, SCD header features are used.
u e.g. if all connections to port 80 on the local network are
expected to be HTTP traffic, but the timing of packets
within a monitored port 80 connection does not match an
HTTP profile, then an anomaly can be raised.
recommendation

u MCD features are generally derived over a time window
of connections.
u Most MCD features are volume-based, such as the count
of connections to a particular destination IP address and
port in a given time window.
u MCD features can be easily used to detect unusual traffic
volumes associated with DoS attacks or scanning
behavior, but at the cost of overlooking individual
anomalous packets (since these will not meet the volume-
based threshold).
recommendation

u Packet header features also have limitations:
u packet header approaches cannot be used to directly detect
attacks aimed at applications, since the attack bytes are
embedded in the packet body.
u many of today’s exploits are directed at applications rather than
network services.
u E.g. buffer overflow attacks against web servers, web
application exploits, and attacks targeting web clients
such as drive-by-downloads.
recommendation

u NIDS must use payload-based features extracted from packet
bodies to detect these types of attacks, since the packet
headers can remain completely normal.
u Payload analysis is more computationally expensive than
header analysis. This is due to requiring deeper packet
inspection, dealing with a variety of payload types (HTML, XML,
pdf, jpg, etc.), transfer encoding (gzip, Base64), and
obfuscation techniques.
u The advantage of payload analysis is having access to all bytes
transferred between network devices.
u This allows a rich set of payload-based features to be
constructed for anomaly detection.
recommendation

u Due to the complexity of payload analysis, many techniques focus on
small subsets of the payload, e.g. the HTTP request, or only the
JavaScript sections of downloaded web content.
u The anomaly-based techniques do not try to match signatures of
known malware, however they can apply heuristics such as pattern
matching for the presence of shellcode, or highlighting suspiciously
long strings which may indicate a buffer overflow attempt.
u The reviewed payload based approaches derive features from either
the payload of a single connection or a user application session, and
compare the features to a normal model.
u In effect these are SCD payload-based features. Extending this
approach to multiple connections to produce MCD payload-based
features could allow different types of anomalies to stand out, e.g.
detecting an unusually large number of HTTP redirects in a network
could indicate a widespread infection attempt.
recommendation
