2. • With the rapid development of the Internet,
malware has become one of the major cyber
threats.
• Any software performing malicious actions,
including information stealing, espionage, etc.
can be referred to as malware. Kaspersky Labs
(2017) defines malware as “a type of computer
program designed to infect a legitimate user's
computer and inflict harm on it in multiple
ways.”
• Malware can do more than compromise a
computer by destroying data or making the
machine unusable. It can also steal sensitive
details such as credit card numbers and bank
account information and send them to the
attacker without the user’s knowledge.
• Attackers exploit vulnerabilities in web services,
browsers and operating systems, or use social
engineering techniques to make users run the
3. • A security practitioner is not only
interested in how accurately a learning
system performs, but also needs to
understand how such performance is
achieved – a requirement not satisfied
by many “black-box” applications of
machine learning. In this section we
supplement our proposed methodology
and provide a procedure for explaining
classification results obtained using our
method.
• To develop a proof of concept for
machine-learning-based malware
classification built on Cuckoo
Sandbox.
• To determine the best feature
representation method and how the
features should be extracted, the most
accurate algorithm that can distinguish
4. • While the diversity of malware is increasing, anti-virus scanners cannot meet
protection needs, resulting in millions of hosts being attacked.
• According to Kaspersky Labs (2016), 6,563,145 different hosts were attacked,
and 4,000,000 unique malware objects were detected in 2015.
• The skill level required for malware development has decreased
because attacking tools are now widely available on the
Internet.
• The wide availability of anti-detection techniques and the ability to buy malware
on the black market allow anyone to become an attacker,
regardless of skill level.
5. 1. To propose a framework for Malware Classification System (MCS) to analyse
malware behavior dynamically using a concept of information theory and a
machine learning technique.
2. To extract behavioral patterns from execution reports of malware in terms of its
features and generate a data repository.
3. To select the most promising features using information theory based concepts
6. • Malware, or malicious software, is any
program or file that is harmful to a computer
user.
• These malicious programs can perform a
variety of functions, including stealing,
encrypting or deleting sensitive data, altering
or hijacking core computing functions and
monitoring users' computer activity without
their permission.
• Malware includes computer viruses, worms,
Trojan horses, spyware, etc.
7. • A. Viruses
• B. Worms
• C. Trojan horses
• D. Spyware
• E. Adware
• F. Backdoors
• G. Keyloggers
• H. Ransomware
8. • Effectively capture knowledge of
the malware being represented.
• The representation can enable
classifiers to efficiently and
effectively correlate data across
a large number of objects.
• Malicious software is classified into
families, each family originating
from a single source base and
exhibiting a set of consistent
behaviors.
9. • Malware analysis is the process of
identifying malware behaviour: what
it does, what it is after, and
what its main goals are.
• Malware analysis is a complex
process. Forensics, reverse
engineering, disassembly and
debugging all take a lot of
time.
• The goal of malware analysis is to gain
an understanding of how a piece of malware
works, so that we can protect our
organization by preventing malware
attacks.
10. • Analysing malicious software without executing it is called static analysis.
• The detection patterns used in static analysis include:
1. String signatures
2. Byte-sequence n-grams
3. Syntactic library calls
4. Control flow graphs
5. Opcode (operation code) frequency distributions, etc.
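As an illustration of one of the patterns listed above, the following is a minimal sketch of byte-sequence n-gram counting. The function name `byte_ngrams` and the toy byte string are assumptions made for this example; a real pipeline would read the bytes of an unpacked executable instead.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in data."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Toy bytes chosen for illustration only, not a real executable.
sample = b"\x4d\x5a\x90\x00\x4d\x5a"
counts = byte_ngrams(sample, n=2)

# The most frequent n-grams can serve as static features for a classifier.
top = counts.most_common(3)
```

The same counting idea applies to opcode frequency distributions once a disassembler has produced a sequence of opcodes in place of raw bytes.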
11. The executable has to be unpacked and decrypted before static
analysis.
• Tools for unpacking and decrypting:
1. Disassembler/debugger tools: IDA Pro and OllyDbg, which provide a lot
of insight into what the malware is doing and provide patterns to identify the
attackers.
2. Memory dumper tools: LordPE and OllyDump, which are used to obtain protected
code located in the system’s memory and dump it to a file.
12. • Dynamic malware analysis is the analysis of an infected file during its
execution. The infected file is run in a simulated environment, such as a
virtual machine, and malware researchers use tools such as System Analyzer
and Process Explorer to identify its general behaviour. While the file
executes, its interaction with the system, its behaviour and its effect on
the system are observed.
• The advantage of dynamic analysis is that it can accurately analyse both known
and unknown malware. However, it is more time consuming than static analysis,
partly because an analysis environment, such as a virtual machine, has to be
prepared first.
13. STATIC ANALYSIS
1. Fast and safe
2. Good at analyzing multipath
malware (global view)
3. Can't analyze obfuscated and
polymorphic malware
4. Can't detect new, unknown malware
5. Low level of false positives (accuracy
is high)
DYNAMIC ANALYSIS
1. Time consuming and vulnerable
2. Difficult to analyze multipath
malware
3. Can analyze obfuscated and
polymorphic malware
4. Detects known as well as unknown
malware
5. High level of false positives (accuracy
is low)
14. • Binary Collection
Maltrieve Installation
• Dynamic Analysis
Cuckoo Sandbox Installation
• Analytics
Feature Extraction
• Classification
Machine Learning Algorithm
• Label
Labelling of Malware
• Final Result
Evaluation of Algorithm
15. • Maltrieve originated as a fork of mwcrawler. It retrieves malware directly from the
sources listed at a number of sites. It currently crawls the following:
• Malc0de
• Malware Domain List
• Malware URLs
• VX Vault
• URLquery
• CleanMX
• ZeusTracker
17. Malware binaries are collected via
honeypots and spam-traps, and
malware family labels are generated
by running an anti-virus tool on each
binary.
To assess behavioural patterns
shared by instances of the same
malware family, the behaviour of
each binary is monitored in a
sandbox environment and behavior-
based analysis reports summarizing
operations, such as opening an
outgoing IRC connection or stopping
18. • VirusTotal is a free online service that
analyses suspicious files and URLs and
facilitates the quick detection of viruses,
worms, Trojans and other kinds of malicious
content detected by antivirus engines
and website scanners.
• It may be used as a means to detect
false positives, i.e. innocuous
resources detected as malicious by
19. • Cuckoo is a malware sandboxing utility
that applies the dynamic analysis
approach in practice. Instead of being
statically analyzed, the binary file is
executed and monitored in real time.
• Cuckoo is an open source automated
malware analysis system that allows you
to perform analysis on sandboxed
malware.
• Cuckoo Sandbox started as a Google
Summer of Code project in 2010 within
the Honeynet Project. After the initial
work during the summer of 2010, the first
beta release was published on February
5th, 2011, when Cuckoo was publicly
announced and distributed for the first
time.
20. Cuckoo is designed for use in analyzing the following
kinds of files:
• Generic Windows executables
• DLL files
• PDF documents
• Microsoft Office documents
• URLs
• PHP scripts
• Almost everything else
21. • Traces of Win32 API calls performed
by all processes spawned by the
malware
• Files being created, deleted, and
downloaded by the malware during
its execution
• Memory dumps of the malware
processes
• Network traffic trace in PCAP format
• Screenshots of the Windows desktop
taken during the execution of
the malware
• Full memory dumps of the machines
22. • The process of extracting data
from the files is called feature
extraction.
• The goal of feature extraction is
to obtain a set of informative and
non-redundant data. It is
essential to understand that
features should represent the
important and relevant
information about our dataset
since without it we cannot make
an accurate prediction.
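A minimal sketch of feature extraction from a sandbox report follows. It assumes the report is a JSON-like dictionary in which `behavior.processes[].calls[].api` holds API names; this resembles the layout of Cuckoo 2.x `report.json` files, but it is an assumption and should be checked against your own reports. The function names, the toy report and the vocabulary are hypothetical.

```python
from collections import Counter


def api_call_counts(report):
    """Count how often each API name appears across all processes."""
    counts = Counter()
    # Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                counts[api] += 1
    return counts


def to_feature_vector(counts, vocabulary):
    """Turn the counts into a fixed-length vector, one slot per known API."""
    return [counts.get(api, 0) for api in vocabulary]


# Hand-written toy report; a real one would be loaded with json.load().
toy_report = {
    "behavior": {
        "processes": [
            {"calls": [{"api": "CreateFileW"}, {"api": "WriteFile"}]},
            {"calls": [{"api": "CreateFileW"}, {"api": "RegSetValueExW"}]},
        ]
    }
}

# Hypothetical vocabulary of API names chosen for this example.
vocabulary = ["CreateFileW", "RegSetValueExW", "WriteFile", "connect"]
vector = to_feature_vector(api_call_counts(toy_report), vocabulary)
```

Using a fixed vocabulary keeps every sample's vector the same length, which most learning algorithms require.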
23. • An excessive number of raw features is available (image classification,
spam detection).
• Learning algorithms are already well defined.
• No machine learning algorithm can perform well without feature
extraction, but if features are extracted well, even linear methods
show great results.
• Companies invest in feature extraction pipelines.
25. • Various machine learning approaches, such as Association Rules, Support Vector
Machines, Decision Trees, Random Forests, Naive Bayes and Clustering, have
been proposed for either classifying unknown samples into
known malware families or flagging samples that exhibit unseen
behavior for detailed analysis.
• The basic idea of any machine learning task is to train a model, based on
some algorithm, to perform a certain task: classification, clustering,
regression, etc.
26. Machine Learning Workflow Process: Data intake → Data transformation → Model training →
Model testing (using the test dataset) → Model deployment
27. 1. Data intake: First, the dataset is loaded from the file and saved in
memory.
2. Data transformation: The data loaded in step 1 is transformed,
cleaned and normalized so that it lies in the same range, has the same
format, etc., and feature extraction and selection are performed. The data
is then split into two sets: a ‘training set’ and a ‘test set’.
3. Model training: At this stage, a model is built using the selected
algorithm.
28. 4. Model testing: The model built in step 3 is tested
using the test data set, and the result is used for building a new
model that takes the previous models into account, i.e. “learns” from them.
5. Model deployment: At this stage, the best model is selected (either after
a defined number of iterations or as soon as the required result is achieved).
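The workflow steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the method used in the presentation: the toy feature values and family labels are invented, and a nearest-centroid classifier stands in for "the selected algorithm" to keep the example short.

```python
import random


def min_max_normalise(rows):
    """Scale every feature column into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle the sample indices and split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def train_centroids(rows, labels):
    """'Training' here means averaging the feature vectors of each class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in grouped.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))


def accuracy(centroids, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(rows)


# 1. Data intake: an invented in-memory dataset (two features per sample).
features = [[1, 200], [2, 180], [1, 220], [3, 210],
            [9, 20], [8, 35], [10, 10], [9, 30]]
labels = ["family_a"] * 4 + ["family_b"] * 4

# 2. Data transformation: normalise, then split into training and test sets.
scaled = min_max_normalise(features)
x_train, y_train, x_test, y_test = train_test_split(scaled, labels)

# 3. Model training.
model = train_centroids(x_train, y_train)

# 4. Model testing on the held-out test set.
score = accuracy(model, x_test, y_test)
```

Step 5, deployment, would amount to keeping whichever trained model scored best across the repeated runs.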
29. • K-Nearest Neighbours (KNN) is one of the simplest, yet accurate,
machine learning algorithms. KNN is a non-parametric algorithm, meaning
that it does not make any assumptions about the underlying data distribution.
• In real world problems, data rarely obeys the general theoretical
assumptions, making non-parametric algorithms a good solution for such
problems.
• The KNN model representation is as simple as the dataset itself: no learning is
required, and the entire training set is stored.
• KNN can be used for both classification and regression problems.
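A minimal sketch of KNN classification is shown below. The stored training set is the whole "model", and a prediction is a majority vote among the k closest stored samples. The function name `knn_predict`, the Euclidean distance choice and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training rows."""
    # No training step: distances are computed against the stored dataset.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented two-feature samples for two hypothetical malware families.
train_rows = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_labels = ["family_a"] * 3 + ["family_b"] * 3

near_a = knn_predict(train_rows, train_labels, [2, 2])
near_b = knn_predict(train_rows, train_labels, [8.5, 8.5])
```

For regression, the vote would be replaced by the average of the k neighbours' target values.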
31. • In Support Vector Machines (SVM), the term ‘support vectors’ refers to the
points lying closest to the hyperplane, which would change the hyperplane
position if removed. The distance between the support vectors and the
hyperplane is referred to as the margin.
• The further from the hyperplane our classes lie, the more accurate predictions
we can make. That is why, although multiple hyperplanes can be found per
problem, the goal of the SVM algorithm is to find the hyperplane that
results in the maximum margin.
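To make the margin concrete, the sketch below computes the distance from each point to a hyperplane w·x + b = 0 and takes the smallest one as the margin. It is not an SVM trainer: it only compares two hand-picked candidate hyperplanes on invented data to show why one is preferable. All names and values are assumptions made for this example.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Smallest distance from any point to the hyperplane."""
    return min(distance_to_hyperplane(w, b, p) for p in points)


# Invented 2-D samples: the first two belong to one class, the last two to another.
points = [[1, 1], [2, 1], [4, 4], [5, 4]]

# Two candidate separating hyperplanes (both split the classes correctly).
margin_diagonal = margin([1, 1], -5, points)    # x + y - 5 = 0
margin_vertical = margin([1, 0], -2.5, points)  # x - 2.5 = 0

# An SVM would prefer the candidate with the larger margin.
better = "diagonal" if margin_diagonal > margin_vertical else "vertical"
```

The points that achieve the minimum distance play the role of the support vectors for that hyperplane.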
33. • Users can protect themselves or take precautions only if they have a way to
recognize malware when it enters their system.
• Despite all the anti-virus packages currently available, malware still finds its way
into our personal computers.
• Signature-based antivirus products can detect only malware that
has already caused damage and been registered.
• The reports generated by dynamic analysis can be compiled into behavioural
profiles that can be clustered to combine samples with similar behaviour into
coherent families.
• The machine learning technologies currently used for detecting and
classifying malware are not adequate to handle the challenges arising from the
huge amount of dynamic and severely imbalanced network data.