What is malware?
Different malware analysis techniques.
What’s wrong with those techniques?
What’s this paper about?
Proposed malware classification system.
Evaluation and validation.
Experimental result analysis.
Comparing the accuracy of classifiers: BFS and AFS.
Comparing model building time: BFS and AFS.
A software program that purposefully fulfils the harmful
intent of an attacker is usually known as malicious software,
or malware.
The suspicious program is scanned with fully-automated
tools.
These tools can quickly assess what a malware sample is
capable of if it infiltrates the system.
Even though a fully-automated analysis does not provide as
much information as an analyst, it is still the fastest method
to sift through large quantities of malware.
The static properties include hashes, embedded strings,
embedded resources, and header information. These
properties should be able to show elementary indicators of
compromise.
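As a rough illustration of what collecting static properties can look like, the sketch below hashes a file's bytes, pulls out printable strings, and checks the two-byte "MZ" magic number that starts Windows executables. The function name `static_properties`, the `min_len` parameter, and the sample bytes are assumptions made for this example; it is not the tooling used in the paper.

```python
import hashlib
import re


def static_properties(data: bytes, min_len: int = 4) -> dict:
    """Return a few basic static properties of a file's raw bytes."""
    # A cryptographic hash identifies the sample without executing it.
    sha256 = hashlib.sha256(data).hexdigest()
    # Embedded strings: runs of printable ASCII at least min_len bytes long.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    strings = [m.decode("ascii") for m in re.findall(pattern, data)]
    # Header information: "MZ" is the magic number of Windows PE/DOS executables.
    return {"sha256": sha256, "strings": strings, "is_pe": data[:2] == b"MZ"}


# Made-up bytes standing in for a real sample read from disk.
sample = b"MZ\x90\x00\x03http://example.test/payload\x00\x01ab\x00cmd.exe /c\x00"
props = static_properties(sample)
print(props["strings"])
```

Short runs such as "MZ" and "ab" fall below `min_len` and are dropped, which keeps the string list focused on candidates like URLs and command lines.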
To observe a malicious file, it is often put in an
isolated laboratory to see if it directly infects the
laboratory.
Analysts will frequently monitor these laboratories to see if
the malicious file tries to attach to any hosts.
With this information, the analyst will then be able to
replicate the situation.
Reversing the code of the malicious file can decode
encrypted data that was stored by the sample, and see
other capabilities of the file that did not show up during the
In order to manually reverse the code, malware analysis
tools such as a debugger and disassembler are needed.
The main problems with these techniques are:
High false positive and false negative rates.
The process of building a classification model takes
time which hinders the early detection of malware.
This paper presents a system that addresses both of the
issues mentioned above.
It integrates static and dynamic analysis features of
malware binaries with machine learning to detect
zero-day malware.
Given the pros and cons of the techniques mentioned before,
a relevant set of features needs to be selected
so that the classification model can be built in less time
with high accuracy.
Feature selection is a method of identifying the top-ranked
features.
It detects the relevant features thus making it easy to
discard the irrelevant ones.
A good selection of features can improve the learning
speed as well as the generalization capacity of the model.
A large corpus of malicious samples is collected and then
scanned using AVG AV to confirm their maliciousness.
The clean files used are collected manually from system
directories of successive versions of the respective
All the collected specimens are then executed in an
automated analysis environment using a modified version
of Cuckoo sandbox.
The system is configured to generate the analysis reports
in JSON format after executing a specimen in it.
The JSON reports are then parsed to obtain the various
malware features, including both static and dynamic
ones.
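The parsing step can be sketched as flattening a report into a feature mapping. The report layout used here (`static.imports`, `static.sections`, `behavior.api_calls`, `network.domains`) is a hypothetical, heavily simplified stand-in: real sandbox reports are much larger, and their key names depend on the sandbox version and configuration. The `extract_features` helper and the feature-name prefixes are likewise assumptions for illustration.

```python
import json

# Hypothetical, simplified report; not the actual Cuckoo report schema.
REPORT_JSON = """
{
  "static": {"imports": ["CreateFileW", "WriteFile"], "sections": [".text", ".rsrc"]},
  "behavior": {"api_calls": ["NtCreateFile", "NtWriteFile", "NtCreateFile"]},
  "network": {"domains": ["example.test"]}
}
"""


def extract_features(report: dict) -> dict:
    """Flatten one analysis report into a {feature_name: value} mapping."""
    features = {}
    # Static features: presence flags for imported functions and section names.
    static = report.get("static", {})
    for name in static.get("imports", []):
        features["static.import." + name] = 1
    for name in static.get("sections", []):
        features["static.section." + name] = 1
    # Dynamic features: how often each API call was observed at run time.
    for call in report.get("behavior", {}).get("api_calls", []):
        key = "dynamic.api." + call
        features[key] = features.get(key, 0) + 1
    # Dynamic feature: number of distinct domains contacted.
    domains = report.get("network", {}).get("domains", [])
    features["dynamic.domain_count"] = len(set(domains))
    return features


features = extract_features(json.loads(REPORT_JSON))
for name, value in sorted(features.items()):
    print(name, value)
```

Using `dict.get` with empty defaults lets the same function handle reports in which a whole section is missing, for example when a sample produced no network traffic.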
The dataset so obtained contains a very large number of
features and is not suitable for building the classification
model.
This data is prepared to have a reduced set of malware
attributes which can be used for building the classification
model.
Building a classification model from the training data is
a time-consuming task.
So, the top-ranked features are selected from this reduced
data set using the Information Gain (IG) method.
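To make the ranking step concrete, here is a minimal sketch of Information Gain for discrete features, using the standard definition IG(Y; X) = H(Y) - Σ_v P(X = v) · H(Y | X = v), where H is Shannon entropy. The toy feature names, the four-sample dataset, and the helper `top_k_features` are invented for illustration and say nothing about the features or tooling used in the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Rank features by information gain and keep the k best."""
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: each row is one sample, each key a hypothetical binary feature.
rows = [
    {"writes_registry": 1, "has_rsrc": 1, "calls_sleep": 0},
    {"writes_registry": 1, "has_rsrc": 0, "calls_sleep": 1},
    {"writes_registry": 0, "has_rsrc": 1, "calls_sleep": 0},
    {"writes_registry": 0, "has_rsrc": 0, "calls_sleep": 0},
]
labels = ["malware", "malware", "benign", "benign"]
print(top_k_features(rows, labels, 2))  # ['writes_registry', 'calls_sleep']
```

In this toy set `writes_registry` separates the classes perfectly (gain of 1 bit), while `has_rsrc` is evenly split across both classes (gain of 0) and would be discarded.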
The selected features are then used to build the
classification model using ML algorithms.
These classifiers are used for distinguishing malicious files
from benign ones.
The model build time is observed while conducting the
experiments using both datasets, i.e., BFS and AFS.
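One simple way to observe build time is to wrap model construction in a timer and run it on both feature sets. The sketch below assumes that BFS and AFS denote the data before and after feature selection, which the slides do not spell out. The counting "trainer" is a toy stand-in for a real classifier, and the synthetic data (200 samples, 500 features named `f0`..`f499`, reduced to the first 20) is invented for the example.

```python
import time


def train_counts(rows, labels):
    """Toy stand-in for a classifier: per-class counts of each active feature."""
    model = {}
    for row, label in zip(rows, labels):
        for name, value in row.items():
            if value:
                model[(name, label)] = model.get((name, label), 0) + 1
    return model


def timed_build(rows, labels):
    """Build the model and return it together with the elapsed wall-clock time."""
    start = time.perf_counter()
    model = train_counts(rows, labels)
    return model, time.perf_counter() - start


# Synthetic data with hypothetical feature names f0..f499.
labels = ["malware" if i % 2 else "benign" for i in range(200)]
bfs_rows = [{f"f{j}": (i + j) % 2 for j in range(500)} for i in range(200)]
# Keep only the first 20 features, standing in for the selected subset.
afs_rows = [{f"f{j}": row[f"f{j}"] for j in range(20)} for row in bfs_rows]

_, bfs_seconds = timed_build(bfs_rows, labels)
_, afs_seconds = timed_build(afs_rows, labels)
print(f"BFS build: {bfs_seconds:.4f} s, AFS build: {afs_seconds:.4f} s")
```

Measured times vary from run to run, so in practice the measurement is usually repeated and averaged before the two feature sets are compared.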
The training data is required by the classification
algorithms to build the model while testing data is required
to test the models so built.
Validation is done using the cross-validation technique, which is
used for evaluating the results generated by the
The machine learning algorithms are evaluated using the
following performance measures:
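True positive rate (TPR): Rate of correctly identified
malicious files (also known as recall or sensitivity). It is a
measure of completeness.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

where TP is the number of malicious files correctly identified
as malicious and FN is the number of malicious files wrongly
classified as benign.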
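As a small worked example, TPR can be computed from true and predicted labels as TP / (TP + FN). The function name, the label strings, and the six toy predictions below are assumptions made for illustration.

```python
def true_positive_rate(actual, predicted, positive="malware"):
    """TPR = TP / (TP + FN), computed over paired actual/predicted labels."""
    pairs = list(zip(actual, predicted))
    # TP: malicious files that the classifier also flagged as malicious.
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    # FN: malicious files that the classifier missed.
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in 'actual'")
    return tp / (tp + fn)


actual = ["malware", "malware", "malware", "malware", "benign", "benign"]
predicted = ["malware", "malware", "malware", "benign", "benign", "malware"]
tpr = true_positive_rate(actual, predicted)
print(tpr)  # 3 / (3 + 1) = 0.75
```

The benign file flagged as malware is a false positive; it does not enter the TPR formula, which only looks at the files that are actually malicious.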