Fast and Effective Worm Fingerprinting via Machine Learning

Stewart Yang, Jianping Song, Harish Rajamani, Taewon Cho, Yin Zhang and Raymond Mooney
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA
{windtown, sjp, harishr, khatz, yzhang, mooney}@cs.utexas.edu



   Abstract: As Internet worms become ever faster and more sophisticated, it is important
to be able to extract worm signatures in an accurate and timely manner. In this paper, we
apply machine learning to automatically fingerprint polymorphic worms, which are able to
change their appearance across every instance. Using real Internet traces and synthetic
polymorphic worms, we evaluated the performance of several advanced machine learning
algorithms, including naive Bayes, decision-tree induction, rule learning (RIPPER), and
support vector machines. The results are very promising. Compared with Polygraph, the
state of the art in polymorphic worm fingerprinting, several machine learning algorithms
are able to generate more accurate signatures, tolerate more noise in the training data,
and require much shorter training time. These results open the possibility of applying
machine learning to build a fast and accurate online worm fingerprinting system.

I. INTRODUCTION AND BACKGROUND WORK

   A typical intrusion detection system monitors all incoming and outgoing traffic and
removes traffic flows that match predefined rules (signatures). However, worm
fingerprinting currently requires security experts to manually analyze captured worm
instances and thus can be very slow. Meanwhile, recent studies have shown that new worms
such as SQL Slammer can compromise all vulnerable hosts in the network in as little as
10 minutes. Moreover, worms released in the past few years have become even more powerful
by using polymorphic techniques to avoid detection. As a result, in order to effectively
stop worm outbreaks, new automated and robust worm fingerprinting techniques need to be
developed.
   Among the first content-based worm fingerprinting systems, Autograph [1] uses a
predetermined heuristic to pre-classify input flows into suspicious and unsuspicious
flows, which are then fed as training data to a feature extractor based on Rabin
fingerprints and a greedy signature-generation algorithm.
   In Polygraph [2], Newsome et al. propose three algorithms that focus on detecting and
generating signatures for polymorphic worms: sequential signatures, conjunctive
signatures, and Bayesian signatures. Polygraph employs a common substring finder instead
of Rabin's fingerprint algorithm to construct features.

II. WORM FINGERPRINTING VIA MACHINE LEARNING

   The task of worm fingerprinting can be abstracted to the problem of constructing a
classifier to separate a specific type of flow (worms) from all other flows (innocuous
flows) based on their content. There is a range of classification algorithms in the
machine learning literature that optimize for classification accuracy (the percentage of
instances that are correctly classified) as well as for other criteria, such as training
time and noise resistance. In particular, in terms of time complexity, most of the
methods tested here are linear in the size of the training data, compared to the higher
complexity of Polygraph.
   As the main contribution of this paper, we conducted extensive experimental studies to
verify our conjecture that other standard machine learning methods would outperform those
used in Polygraph. The algorithms we tested included Naive Bayes learners (NaiveBayes and
SparseNB), Support Vector Machines (SVM), Decision Trees (J48), and Rule learners (JRip).
These learners were all used "right out of the box" from the Weka data-mining package
[3], except for sequential-signature Polygraph (Seq-Poly), which we implemented following
the best-performing method from [2].

III. EXPERIMENTAL RESULTS

A. Experimental Design

   Our experimental comparisons were conducted on a combination of network traffic traces
and self-generated polymorphic worm instances. The two network traces, referred to here
as the day trace and the week trace, were collected from a 100 Mbps fiber link at ICSI,
recorded over the span of one day and one week respectively. These two traces were
previously used in experiments on Autograph. As preprocessing, we reassembled packets in
the two traces into flows and filtered out flows that were labeled as worms by the Bro
intrusion detection system; the resulting pool of flows contained only innocuous flows.
Following the studies on Polygraph, we generated polymorphic worm flows for the
Apache-Knacker worm and the ATPhttpd worm. Headers for these worms were sampled uniformly
from the pool of previously constructed innocuous flows, and bodies were constructed from
known signatures of these two worms by filling random characters into the wildcard slots
of these signatures.
   As the next step, we transformed the string-based flows into feature-vector
representations by employing one of two feature construction techniques: a common
substring finder like that used in [2] (COM) and an n-gram finder like that used in [4]
(n-GRAM). As in Polygraph, COM looks for all substrings within a predetermined length
limit that appear in more than a given percentage of flows. Following [4], n-GRAM finds
all n-grams in the payloads and retains the 500 n-grams with the highest information gain
with respect to discriminating between suspicious and unsuspicious flows. In order to
find the best parameters for the two methods, we conducted development experiments on the
day trace.
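
   The n-GRAM feature construction described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the code used in the
experiments: the function names, the default n-gram length of 4, and the treatment of each
n-gram as a binary presence/absence feature are choices made here for illustration. Only
the cutoff of 500 retained n-grams comes from the text.

```python
import math
from collections import Counter


def ngrams(payload: bytes, n: int) -> set:
    """Return the set of distinct byte n-grams occurring in one payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a pool holding pos and neg flows of the two classes."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def top_ngrams_by_info_gain(suspicious, unsuspicious, n=4, k=500):
    """Rank n-grams by the information gain of their presence/absence.

    Assumes both pools are non-empty lists of bytes payloads.
    """
    s_counts = Counter(g for flow in suspicious for g in ngrams(flow, n))
    u_counts = Counter(g for flow in unsuspicious for g in ngrams(flow, n))
    n_s, n_u = len(suspicious), len(unsuspicious)
    total = n_s + n_u
    base = entropy(n_s, n_u)
    gains = {}
    for gram in set(s_counts) | set(u_counts):
        s_with, u_with = s_counts[gram], u_counts[gram]
        with_total = s_with + u_with
        without_total = total - with_total
        conditional = (
            (with_total / total) * entropy(s_with, u_with)
            + (without_total / total) * entropy(n_s - s_with, n_u - u_with)
        )
        gains[gram] = base - conditional
    # Highest gain first; ties are broken by byte order so the result is stable.
    ranked = sorted(gains, key=lambda g: (-gains[g], g))
    return ranked[:k]


def to_feature_vector(flow: bytes, selected) -> list:
    """Encode one flow as 0/1 indicators over the selected n-grams."""
    present = ngrams(flow, len(selected[0])) if selected else set()
    return [1 if g in present else 0 for g in selected]
```

   In this sketch an n-gram that appears in every suspicious flow and in no unsuspicious
flow receives the maximum possible gain, which is the kind of feature a rule learner can
turn directly into a signature.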




   The experiments were carried out on desktop machines with 3.0 GHz Intel Pentium IV
processors running Linux kernel 2.6.13. We compared all six algorithms based on the
following criteria. To measure the accuracy of generated signatures, we recorded the
cross-validated false positive rate (the percentage of innocuous flows incorrectly
classified as worms) as well as the false negative rate (the percentage of worm flows
misclassified as innocuous). To evaluate the resilience of these algorithms to unavoidable
class noise in the suspicious pool, we computed noise curves by varying the ratio of
innocuous flows in the suspicious pool and recording the error rates at each point.
Finally, to evaluate the detection speed, we recorded the time required by each algorithm
to process the training sets. To ensure the reliability of the results, for each setting
we report the results averaged over ten runs.
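
   The two error rates and the noise-curve procedure described above can be sketched as
follows. This is an illustrative outline only: the function names and the train_and_test
callback (which stands in for training a learner on a suspicious pool and returning its
false positive and false negative rates) are hypothetical, and the cross-validation step
is left out for brevity.

```python
import random


def error_rates(predictions, labels):
    """Return (false_positive_rate, false_negative_rate).

    Both arguments are sequences of booleans, where True means "worm".
    """
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    innocuous = sum(1 for y in labels if not y)
    worms = sum(1 for y in labels if y)
    fpr = fp / innocuous if innocuous else 0.0
    fnr = fn / worms if worms else 0.0
    return fpr, fnr


def noisy_suspicious_pool(worm_flows, innocuous_flows, size, noise_pct, rng):
    """Build a suspicious pool in which noise_pct percent of flows are innocuous."""
    n_noise = round(size * noise_pct / 100)
    return rng.sample(innocuous_flows, n_noise) + rng.sample(worm_flows, size - n_noise)


def noise_curve(train_and_test, worm_flows, innocuous_flows,
                pool_size, noise_levels, runs=10, seed=0):
    """Average the error rates over several runs at each noise level."""
    rng = random.Random(seed)
    curve = {}
    for pct in noise_levels:
        results = [
            train_and_test(
                noisy_suspicious_pool(worm_flows, innocuous_flows, pool_size, pct, rng)
            )
            for _ in range(runs)
        ]
        curve[pct] = (
            sum(r[0] for r in results) / runs,
            sum(r[1] for r in results) / runs,
        )
    return curve
```

   Passing an explicit seed keeps the sampled pools reproducible, which makes it easier to
compare learners on identical noise conditions.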

B. Accuracy of Generated Signatures

   1) With a worm-free unsuspicious pool: Following the experimental design used to test
Polygraph, for different suspicious pool sizes from the week trace, we generated a noise
curve for different levels of classification noise in the suspicious pool. From the
experiments, we observed that Polygraph started to generate false positives when the level
of noise in the suspicious pool increased, which agrees with the results presented in [2].
   JRip (the Weka implementation of RIPPER) performed the best among the contending
algorithms, as it achieved zero false positive rates consistently; in fact, for most runs,
JRip just produced single rules that directly encoded the address of the security flaw in
the server system, required for any worm to break into it (and not exploited in innocuous
flow payloads). In addition, JRip successfully generated such signatures when there were
only five true worm flows in a suspicious pool of size 50, which suggests it would be able
to detect a new worm early in its outbreak. Since there seem to be small "smoking gun"
signatures for such worms, it is not surprising that symbolic rule learning algorithms
like JRip are more accurate than more numerical and probabilistic methods, since their
bias for finding simple symbolic descriptions of categories seems to be a good match for
this problem.
   2) When the unsuspicious pool contains worms: The authors of Polygraph make the
assumption that the unsuspicious pool is free of worm flows, in order to use flows from a
few days earlier to form pure unsuspicious pools. To verify the conjecture that Polygraph
will be rendered ineffective if this assumption is relaxed, we repeated our previous
experimental setting, additionally blending twenty simulated worm flows into the
unsuspicious pool.
   As shown in Figure 1, Polygraph had a consistently high false negative rate, while JRip
generated zero false negative rates even when the number of innocuous flows increased, and
only began to mislabel worm flows as innocuous when the number of worm flows in the
suspicious pool dropped below that of the unsuspicious pool.
   Worm authors can disable worm detection algorithms by slow-poisoning the unsuspicious
pool before launching the attack. On the other hand, the use of recorded innocuous flows
to form the unsuspicious pool may cause worm fingerprinting algorithms to generate
erroneous worm signatures for new legitimate innocuous flows (with different
characteristic payloads) right after their release, because these flows only exist in the
suspicious pool and not in the unsuspicious pool.

Fig. 1. False positive and false negative rates under varying percentage of noise in the
suspicious pool (200 flows). Unsuspicious pool (45,000 flows) contains 20 worm flows.
[Plots not reproduced: false positive rate (top) and false negative rate (bottom) versus
percentage of innocuous flows in the suspicious pool, for Seq-Poly, JRip, J48, and SVM.]

C. Training Time

   1) Training time for the accuracy experiments: Figure 2 presents training times for the
two experiments presented in the previous subsection.
   The time complexity for sequential Polygraph is O(n^2 m), where n and m are the numbers
of suspicious and unsuspicious flows respectively. This, we believe, is one of the major
limitations of the Polygraph algorithm, because in the outbreak of a new worm, the
suspicious pool can easily grow to hundreds or even thousands of flows, and consequently
the time required by Polygraph to train on this suspicious pool will be too long to
effectively quarantine the worm. In contrast, the time complexity for JRip is O(m + n),
which is also observed in the graphs: when the size of the suspicious pool increases from
50 to 200, the training time stays roughly the same because the number of unsuspicious
flows (45,000) dominates the total number of flows. The training time of J48 lies between
that of JRip and Polygraph.

Fig. 2. Training time under varying percentage of noise in suspicious pool. The top graph
depicts a suspicious pool of size 50 and a pure unsuspicious pool. The bottom graph
depicts a suspicious pool of size 200 and an unsuspicious pool with 20 worm flows.
[Plots not reproduced: training time (sec) versus percentage of innocuous flows in the
suspicious pool; the top panel shows Seq-Poly, JRip, J48, SVM, NaiveBayes, and MNNB, the
bottom panel shows Seq-Poly, JRip, J48, and SVM.]

   2) End-to-end training time for production use: Given the training time of JRip (3 to
10 minutes), we wanted to verify that our approach can be made fast enough to effectively
quarantine a worm outbreak. With a more efficient C implementation of RIPPER, n-grams as
feature extractors, and a more streamlined process, we conducted additional experiments to
explore how fast the RIPPER algorithm could train on the minimum number of examples
necessary to identify a worm. We then measured the end-to-end time required to fingerprint
a new worm with this "production level" implementation of our approach.
   From Figure 3 we can see that end-to-end training time is linear in the number of flows
used in training: with 1,000 unsuspicious flows it is 18 seconds, and with 2,000
unsuspicious flows it increases to 34 seconds. Moreover, when there are 1,000 or more
unsuspicious flows in training, both false positive and false negative rates stay at zero
consistently. This result suggests that we can safely bring the number of unsuspicious
flows down to 1,000 and thus reduce the end-to-end training time to 18 seconds.

Fig. 3. False positive rate, false negative rate and training time under varying number of
unsuspicious flows.
[Plot not reproduced: false positive/negative rate (left axis) and training time in
seconds (right axis) versus number of unsuspicious flows, from 200 to 2,000.]

D. Introducing a few purer labeled flows

   Part of our ongoing research involves replaying randomly sampled traffic flows on
virtual hosts to see if they are worms. The virtual hosts are equipped with the latest
server software, so a worm flow, when replayed on a host, will reveal its malicious nature
by exploiting flaws in the software. This approach, when compared with the suspicious flow
capturing algorithms introduced in Autograph and Polygraph, can obtain suspicious and
unsuspicious flows that are much purer in nature, but it is prohibitive due to the cost of
establishing and maintaining the replay engine, as well as the time needed to replay each
flow.
   Hence, we propose to augment existing fingerprinting algorithms by taking the fewer but
purer labeled flows into consideration. One way to incorporate the new flows is to give
them larger weights in training compared to the weights given to the less pure suspicious
flows. We conducted the same experiments as those done in Section III-C.2, while
introducing a set of 10 labeled flows that were all worms. These flows were added to the
original training set but were given a weight of 5.0 instead of 1.0. The results indicated
that the minimum number of unsuspicious flows to ensure zero false positive/negative rates
was lowered from 1,000 to 750, which in turn reduced the minimum end-to-end training time
by 11.1%, down to 16 seconds.

IV. CONCLUSIONS AND FUTURE WORK

   We verified in this paper that certain machine learning algorithms work well for the
problem of worm fingerprinting. In particular, we compared the performance of five machine
learning algorithms against the best existing worm fingerprinting algorithm (Polygraph) on
a blend of network traces and simulated polymorphic worm flows. Results showed that two
machine learning algorithms perform significantly better than Polygraph in terms of
resilience to noise and detection speed. More specifically, RIPPER produced zero false
negative rates consistently on noisy training data and was able to capture new worms with
very few worm instances in the suspicious pool. Moreover, the algorithm runs in time
linear in the total number of training flows, which makes it tractable for containing a
large-scale worm outbreak. As future work, we plan to test our techniques on worms with
even greater polymorphism using more advanced worm construction ideas.

REFERENCES

[1] Kim, H.A., Karp, B.: Autograph: Toward automated, distributed worm signature
    detection. In: Proceedings of the USENIX Security Symposium. (2004)
[2] Newsome, J., Karp, B., Song, D.: Polygraph: Automatically generating signatures for
    polymorphic worms. In: Proceedings of the IEEE Symposium on Security and Privacy.
    (2005)
[3] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques
    with Java Implementations. Morgan Kaufmann Publishers, San Francisco (1999)
[4] Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In:
    KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining, New York, NY, USA, ACM Press (2004) 470-478

Fast and Effective Worm Fingerprinting via Machine Learning

  • 1. Fast and Effective Worm Fingerprinting via Machine Learning Stewart Yang, Jianping Song, Harish Rajamani, Taewon Cho Yin Zhang and Raymond Mooney Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA {windtown, sjp, harishr, khatz, yzhang, mooney}@cs.utexas.edu Abstract— As Internet worms become ever faster and more on their content. There is a range of classification algorithms in sophisticated, it is important to be able to extract worm sig- the machine learning literature that optimize for classification natures in an accurate and timely manner. In this paper, we accuracy – the percentage of instances that are correctly apply machine learning to automatically fingerprint polymorphic worms, which are able to change their appearance across every classified – as well as for other criteria, such as training time instance. Using real Internet traces and synthetic polymorphic and noise resistance. In particular, in terms of time complexity, worms, we evaluated the performance of several advanced most of the methods tested here are linear in the size of the machine learning algorithms, including naive Bayes, decision-tree training data, compared to the higher complexity of Polygraph. induction, rule learning (RIPPER), and support vector machines. As the main contribution of this paper, we conducted ex- The results are very promising. Compared with Polygraph, the state of the art in polymorphic worm fingerprinting, several tensive experimental studies to verify our conjecture that other machine learning algorithms are able to generate more accurate standard machine learning methods would outperform those signatures, tolerate more noise in the training data, and require used in Polygraph. The algorithms we tested included Naive much shorter training time. 
These results open the possibility of Bayes learners (NaiveBayes and SparseNB), Support Vector applying machine learning to build a fast and accurate online Machines (SVM), Decision Trees (J48), and Rule learners worm fingerprinting system. (JRip). These learners were all used “right out of the box” from the Weka data-mining package [3], except for sequential- I. I NTRODUCTION AND BACKGROUND W ORK signature Polygraph (Seq-Poly), which we implemented fol- lowing the best-performing method from [2]. A typical intrusion detection system monitors all the in- coming and outgoing traffic, while removing traffic flows III. E XPERIMENTAL RESULTS that match predefined rules (signatures). However, worm fin- A. Experimental Design gerprinting currently requires security experts to manually Our experimental comparisons were conducted on a combi- analyze captured worm instances and thus can be very slow. nation of network traffic traces and self-generated polymorphic Meanwhile, recent studies have shown that new worms such as worm instances. The two network traces – referred to here the SQL SLammer can compromise all vulnerable hosts in the as the day trace and the week trace – were collected from a network in as short as 10 minutes. Moreover, worms released 100Mbps fiber link at ICSI, recorded over the span of one day in the past few years have become even more powerful by us- and one week respectively. These two traces were previously ing polymorphic techniques to avoid detection. As a result, in used in experiments on Autograph. As preprocessing, we order to effectively stop worm outbreaks, new automated and reassembled packets in the two traces into flows and filtered robust worm fingerprinting techniques need to be developed. 
out flows that were labeled as worms by the Bro intrusion Among the first content-based worm fingerprinting systems, detection system; the resulting pool of flows only contained Autograph [1] uses a predetermined heuristic to pre-classify innocuous flows. Following the studies on Polygraph, we input flows into suspicious and unsuspicious flows, which are generated polymorphic worm flows for the Apache-knacker then fed as training data to a Rabin’s fingerprint based feature worm and the Atphttpd worm, headers for these worms were extractor and greedy signature generating algorithm. sampled uniformly from the pool of previously constructed In Polygraph [2] Newsome et al. propose three algorithms innocuous flows, and bodies were constructed from known which focus on detecting and generating signatures for poly- signatures of these two worms by filling random characters morphic worms: sequential signatures, conjunctive signatures into the wildcard slots of these signatures. and Bayesian signatures. Polygraph employs a common sub- As the next step, we transformed the string-based flows string finder instead of Rabin’s fingerprint algorithm to con- into feature-vector representations by employing one of two struct features. feature construction techniques: a common substring finder like that used in [2] (COM) and an n-gram finder like that II. W ORM FINGERPRINTING VIA MACHINE LEARNING used in [4] (n-GRAM). As in Polygraph, COM looks for The task of worm fingerprinting can be abstracted to the all substrings within a predetermined length limit that appear problem of constructing a classifier to separate a specific type in more than a given percentage of flows. Following [4], n- of flow (worms) from all other flows (innocuous flows) based GRAM finds all n-grams in the payloads and retains the 500
  • 2. 6e-05 n-grams with the highest information gain with respect to Seq-Poly JRip J48 discriminating between suspicious and unsuspicious flows. In 5e-05 SVM order to find the best parameters for the two methods, we 4e-05 conducted development experiments on the day trace. False Positives Rate The experiments were carried out on desktop machines with 3e-05 3.0GHz Intel Pentium IV processors and running Linux kernel 2e-05 2.6.13. We compared all six algorithms based on the following criteria. To measure the accuracy of generated signatures, we 1e-05 recorded the cross-validated false positive rate (the percentage of innocuous flows incorrectly classified as worms) as well 0 0 10 20 30 40 50 60 70 80 90 Percentage of Innocuous Flows in Suspicious Pool as the false negative rate (the percentage of worm flows 0.9 Seq-Poly misclassified as innocuous). To evaluate the resilience of these 0.8 JRip J48 SVM algorithms to unavoidable class noise in the suspicious pool, 0.7 we computed noise curves by varying the ratio of innocuous 0.6 False Negative Rate flows in the suspicious pool and recording the error rates 0.5 at each point. Finally, to evaluate the detection speed, we 0.4 recorded the time required by each algorithm to process the 0.3 training sets. To ensure the reliability of the results, for each 0.2 setting we report the results averaged over ten runs. 0.1 0 0 10 20 30 40 50 60 70 80 90 B. Accuracy of Generated Signatures Percentage of Innocuous Flows in Suspicious Pool 1) With a worm-free unsuspicious pool.: Following the Fig. 1. False positive and false negative rates under varying percentage of experimental design used to test Polygraph, for different noise in the suspicious pool (200 flows). Unsuspicious pool (45,000 flows) suspicious pool sizes from the week trace, we generated contains 20 worm flows. a noise curve for different levels of classification noise in the suspicious pool. 
From the experiments, we observed that Worm authors can disable worm detection algorithms by Polygraph started to generate false positives when the level of slow-poisoning the unsuspicious pool before launching the noise in the suspicious pool increased, which agrees with the attack. On the other hand, the use of recorded innocuous flows results presented in [2]. JRip (the Weka implementation of RIPPER) performed the to form the unsuspicious pool may cause worm fingerprinting best among the contending algorithms, as it achieved zero false algorithms to generate erroneous worm signatures for new positive rates consistently; in fact, for most runs, JRip just legitimate innocuous flows (with different characteristic pay- produced single rules that directly encoded the address of the loads) right after their release, because these flows only exist security flaw in the server system, required for any worm to in the suspicious pool and not in the unsuspicious pool. break into it (and not exploited in innocuous flow payloads). In addition, JRip successfully generated such signatures when C. Training Time there were only five true worm flows in a suspicious pool of 1) Training time for the accuracy experiments.: Figure 2 size 50, which suggests it would be able to detect a new worm presents training times for the two experiments presented in early in its outbreak. Since there seem to be small “smoking the previous subsection. gun” signatures for such worms, it is not surprising that The time complexity for sequential Polygraph is O(n2 m), symbolic rule learning algorithms like JRip are more accurate where n and m are the numbers of suspicious and unsuspicious than more numerical and probabilistic methods since their bias flows respectively. This, we believe, is one of the major for finding simple symbolic descriptions of categories seems limitations of the Polygraph algorithm, because in the outbreak to be a good match for this problem. 
of a new worm, the suspicious pool can easily grow to hundreds or even thousands of flows, and consequently the time required by Polygraph to train on this suspicious pool will be too long to effectively quarantine the worm. In contrast, the time complexity of JRip is O(m + n), which is also observed in the graphs: when the size of the suspicious pool increases from 50 to 200, the training time stays roughly the same because the number of unsuspicious flows (45,000) dominates the total number of flows. The training time of J48 lies between that of JRip and Polygraph.

2) When the unsuspicious pool contains worms: The authors of Polygraph assume that the unsuspicious pool is free of worm flows, so that flows from a few days earlier can be used to form pure unsuspicious pools. To verify the conjecture that Polygraph will be rendered ineffective if this assumption is relaxed, we repeated our previous experimental setting, additionally blending twenty simulated worm flows into the unsuspicious pool.

As shown in Figure 1, Polygraph had a consistently high false negative rate, while JRip kept the false negative rate at zero even when the number of innocuous flows increased, and only began to mislabel worm flows as innocuous when the number of worm flows in the suspicious pool dropped below that of the unsuspicious pool.

2) End-to-end training time for production use: Given the training time of JRip (3 to 10 minutes), we wanted to verify that our approach can be made fast enough to effectively quarantine a worm outbreak. With a more efficient C implementation of RIPPER, n-grams as feature extractors, and a more streamlined process, we conducted additional experiments to explore how fast the RIPPER algorithm could train on the minimum number of examples necessary to identify a worm. We then measured the end-to-end time required to fingerprint a new worm with this “production level” implementation of our approach.

From Figure 3 we can see that end-to-end training time is linear in the number of flows used in training: with 1,000 unsuspicious flows it is 18 seconds, and with 2,000 unsuspicious flows it increases to 34 seconds. Moreover, when there are 1,000 or more unsuspicious flows in training, both the false positive and false negative rates stay at zero consistently. This result suggests that we can safely bring the number of unsuspicious flows down to 1,000 and thus reduce the end-to-end training time to 18 seconds.

Fig. 2. Training time under varying percentage of noise in suspicious pool. The top graph depicts a suspicious pool of size 50 and a pure unsuspicious pool. The bottom graph depicts a suspicious pool of size 200 and an unsuspicious pool with 20 worm flows. (x-axis: Percentage of Innocuous Flows in Suspicious Pool; y-axis: Training Time (sec); curves: Seq-Poly, JRip, J48, SVM, NaiveBayes, MNNB in the top graph and Seq-Poly, JRip, J48, SVM in the bottom graph.)

Fig. 3. False positive rate, false negative rate and training time under varying number of unsuspicious flows. (x-axis: Number of Unsuspicious Flows; y-axes: False Positive/Negative Rate and Training Time (sec).)

D. Introducing a few purer labeled flows

Part of our ongoing research involves replaying randomly sampled traffic flows on virtual hosts to see if they are worms. The virtual hosts are equipped with the latest server software, so a worm flow, when replayed on a host, will reveal its malicious nature by exploiting flaws in the software. Compared with the suspicious flow capturing algorithms introduced in Autograph and Polygraph, this approach can obtain suspicious and unsuspicious flows that are much purer in nature, but it is prohibitive due to the cost of establishing and maintaining the replay engine, as well as the time needed to replay each flow.

Hence, we propose to augment existing fingerprinting algorithms by taking the fewer but purer labeled flows into consideration. One way to incorporate the new flows is to give them larger weights in training than the weights given to the less pure suspicious flows. We conducted the same experiments as those in Section III-C.2, while introducing a set of 10 labeled flows that were all worms. These flows were added to the original training set but were given a weight of 5.0 instead of 1.0. The results indicated that the minimum number of unsuspicious flows needed to ensure zero false positive and false negative rates was lowered from 1,000 to 750, which in turn reduced the minimum end-to-end training time by 11.1%, from 18 seconds down to 16 seconds.

IV. CONCLUSIONS AND FUTURE WORK

We verified in this paper that certain machine learning algorithms work well for the problem of worm fingerprinting. In particular, we compared the performance of five machine learning algorithms against the best existing worm fingerprinting algorithm (Polygraph) on a blend of network traces and simulated polymorphic worm flows. Results showed that two machine learning algorithms perform significantly better than Polygraph in terms of resilience to noise and detection speed. More specifically, RIPPER consistently produced zero false negative rates on noisy training data and was able to capture new worms with very few worm instances in the suspicious pool. Moreover, the algorithm runs in time linear in the total number of training flows, which makes it tractable for containing a large-scale worm outbreak. As future work, we plan to test our techniques on worms with even greater polymorphism using more advanced worm construction ideas.

REFERENCES

[1] Kim, H.A., Karp, B.: Autograph: Toward automated, distributed worm signature detection. In: Proceedings of the USENIX Security Symposium. (2004)
[2] Newsome, J., Karp, B., Song, D.: Polygraph: Automatically generating signatures for polymorphic worms. In: Proceedings of the IEEE Symposium on Security and Privacy. (2005)
[3] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (1999)
[4] Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: KDD ’04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM Press (2004) 470–478
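The weighting idea of Section III-D (pure labeled worm flows given a weight of 5.0 instead of 1.0) can be illustrated with a minimal sketch. This is not the authors' RIPPER-based implementation: it assumes a naive-Bayes-style log-likelihood ratio over byte n-grams with a uniform class prior and add-one smoothing, and the function names (`ngrams`, `train`, `worm_score`, `build_training_set`) and the toy payloads are hypothetical.

```python
import math
from collections import Counter


def ngrams(payload, n=4):
    """Return the set of distinct byte n-grams in one flow payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}


def train(flows, n=4):
    """Accumulate weighted n-gram presence counts per class.

    flows is a list of (payload, is_worm, weight) tuples; a flow with
    weight 5.0 counts as much as five identical flows of weight 1.0.
    """
    total = {True: 0.0, False: 0.0}
    counts = {True: Counter(), False: Counter()}
    for payload, is_worm, weight in flows:
        total[is_worm] += weight
        for gram in ngrams(payload, n):
            counts[is_worm][gram] += weight
    return total, counts


def worm_score(model, payload, n=4):
    """Log-likelihood ratio (worm vs. innocuous); positive means worm-like.

    Assumptions: uniform class prior, add-one smoothing, and n-grams never
    seen in training are ignored.
    """
    total, counts = model
    score = 0.0
    for gram in ngrams(payload, n):
        if gram not in counts[True] and gram not in counts[False]:
            continue
        p_worm = (counts[True][gram] + 1.0) / (total[True] + 2.0)
        p_benign = (counts[False][gram] + 1.0) / (total[False] + 2.0)
        score += math.log(p_worm / p_benign)
    return score


def build_training_set(pure_weight):
    """Toy pools: a noisy suspicious pool, pure worm flows, unsuspicious pool."""
    suspicious = [b"EVIL3", b"HTTP1", b"HTTP2"]  # two innocuous flows mixed in
    pure_worms = [b"EVIL1", b"EVIL2"]            # e.g. confirmed by replay
    unsuspicious = [b"HTTP3", b"HTTP4", b"HTTP5"]
    return ([(p, True, 1.0) for p in suspicious]
            + [(p, True, pure_weight) for p in pure_worms]
            + [(p, False, 1.0) for p in unsuspicious])


baseline = train(build_training_set(pure_weight=1.0))
weighted = train(build_training_set(pure_weight=5.0))

for name, model in (("weight 1.0", baseline), ("weight 5.0", weighted)):
    print(name,
          "worm:", round(worm_score(model, b"EVIL9"), 3),
          "innocuous:", round(worm_score(model, b"HTTP9"), 3))
```

In this toy setting, raising the weight of the pure worm flows pushes the score of an unseen worm variant further above zero and the score of an unseen innocuous flow further below it, which is the intuition behind needing fewer unsuspicious flows to reach zero error rates.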