SlideShare une entreprise Scribd logo
1  sur  46
MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH
RUNNING WINDOW ENTROPY AND MACHINE LEARNING
Student: Keith J. Jones, Chair: Dr. Yong Wang
Dakota State University - keith.j.jones@ieee.org
COMMITTEE
 Dr. Yong Wang (Chair)
 Dr. Jun Liu
 Dr. Wayne Pauli
 Dr. Joshua Pauli
 Dr. Joshua Stroschein
OUTLINE
1. Introduction & Motivation
2. Literature Review
3. Research Methodology
4. Optimizing Running Window Entropy
5. Qualitative Data Exploration
6. Malgazer: An RWE-Based Malware Classifier
7. Comparison of Malgazer to Prior Literature
8. The Malgazer Web Application
9. Conclusions
1. INTRODUCTION & MOTIVATION
 Malware: 200,000+ new threats released into the internet each day,
and that number general grows each year (Panda Security, 2017)
 Malware detection, with machine learning methods, claim efficacy
rates of 95% (Saxe & Berlin, 2015) or beyond (Cylance Inc. 2016)
 Malware detection and classification often used interchangeably in
prior research.
 Malware detection is a subset of the malware classification problem:
THEMALWARE CLASSIFICATION PROBLEM
THE ML CLASSIFIER MODEL
 The machine learning based model can be represented by
the set:
C: {T(), P(), MLM(), E(), TU(), PP()}
 T() – Transformation function
 P() – Preprocessing function
 MLM() – The machine learning model
 E() – Error measurement function
 TU() – Tuning function
 PP() – Post processing function
TRAINING THE MODEL
RUNNING WINDOW ENTROPY
 This is a measurement that shows the localized entropy at each
offset within the malware.
 I was able to optimize the original running window entropy algorithm
to less than 2% of the running time for the original algorithm, on
average.
 Jones, K., & Wang, Y. (2018).
WHAT DOES R.W.E. GIVE US?
 A view of “randomness”.
 0 is not random, 1 is totally random.
 A view of “information gain”.
 A view of “relevancy”.
 Running windows of entropy present the entropic
structure of a malware sample.
RUNNING WINDOW ENTROPY (CONT)
RUNNING WINDOW ENTROPY (CONT)
RUNNING WINDOW ENTROPY (CONT)
THE RESEARCH PROBLEM
 Develop an automated machine learning based
classifier (Malgazer) using running window
entropy that will classify new malware samples
functionally. This problem will be solved using a
design science research methodology (Arnould,
Hevner, March, & Park, 2004).
 Null Hypothesis: A classifier cannot be built to perform
better than random guessing (or in this case random
classifications).
SIX DSRM CONTRIBUTIONS
1. Model Artifact: The ML based classifier model.
2. Method Artifact: Optimization of R.W.E. Algorithm
3. Method Artifact: Collect and normalize functional classification
information from publicly available resources
4. Method Artifact: Scaling the training of machine learning classifiers
5. Instantiation Artifact: An actual ML based classifier.
6. Instantiation Artifact: A user application to access the classifier
2. LITERATURE REVIEW
 While reading, began categorizing the prior literature.
 Malware Classification Method Categories:
 Malware Detector:
 Boot sectors
 Dynamic analysis
 Entropy statistics
 Images from bytes and n-grams
 Opcode statistics
 PDF streams
 Static analysis
 Static analysis and opcodes
 Static analysis and n-grams
 Static and dynamic analysis
 Structural Entropy
 N-grams
2. LITERATURE REVIEW
 Malware Classification Method Categories:
 Malware Classifier:
 Byte code
 Dynamic analysis
 HMM
 Images from bytes
 Images from bytes and opcodes
 Intermediate language of opcodes
 Static analysis
 Static analysis and images from bytes
 Static analysis and opcodes
 Static and dynamic analysis
 Structural entropy
2. LITERATURE REVIEW
 Reviewed 71 prior methods
 Spans 1996 through 2018
 41 malware detectors
 30 malware classifiers
 Only 2 classifiers used features similar to running window entropy.
 The first method only classifies samples if they were packed.
 The second method only reported an accuracy of 87.2%. It also used
structural entropy.
 One method, “GIST”, was implemented for deeper comparisons.
 GIST = filters on a 2D image of the malware sample, 320 values
3. RESEARCH METHODOLOGY
3. RESEARCH METHODOLOGY
 Comparison to prior literature
 Confusion Matrices:
 Per classification accuracy
 ROC/AUC
 Empirical comparison to GIST method.
 Theoretical comparisons to other methods.
TRAINING THE MODEL
 What training set?
 Publicly available malware through VirusTotal.
 Windows PE files
 60,000 samples of classified and known malware, 10,000 samples from
each of 6 classifications:
 Virus, Backdoor, Worm, Trojan, Ransom, PUA
 What testing set?
 Approximately 10-20% of the collected malware samples.
 What input data? (features)
 RWE
 GIST
 Scale the 200+ experiments using Terraform.
VIRUSTOTAL YARA RULE
1. import "math"
2. import "pe"
3.
4. /* 20180312.1 */
5.
6. rule peexe_files_20180312_1 {
7. condition:
8. file_type contains "peexe"
9. }
ANOTHER VIRUSTOTAL YARA RULE
1. import "math"
2. import "pe”
3.
4. /* 20180501.1 */
5.
6. rule peexe_backdoor_files_20180501_1 {
7. condition:
8. (file_type contains "peexe") and
(microsoft contains "Backdoor")
9. }
TRAINING THE MODEL (CONT)
 Machine learning algorithms:
 Decision Trees
 Random Forests
 K Nearest Neighbors
 SVM
 Naïve Bayes
 Nearest Centroid
 Adaboost
 OneVRest
 Deep Learning:
 Artificial Neural Networks
 Convolutional Neural Networks
TRAINING PHASES
1. Grid Search
2. 10-Fold Validation
3. Final Training
4. OPTIMIZING R.W.E.
 I was able to optimize the original running window entropy algorithm
to less than 2% of the running time for the original algorithm, on
average.
 This topic was published in a peer reviewed conference, June 2018.
 Jones, K., & Wang, Y. (2018).
ORIGINAL ALGORITHM RUNNING TIME
OPTIMIZED ALGORITHM RUNNING TIME
ORIGINAL V. OPTIMIZED ALGORITHMS
OPTIMIZED V. ORIGINAL
PERCENTAGE OF RUNNING TIME
Malware Size
(bytes)
256 Byte
Entropy
Window
512 Byte Entropy
Window
1,024 Byte
Entropy Window
2,048 Byte
Entropy
Window
256 93.55% N/A N/A N/A
512 1.99% 97.61% N/A N/A
1,024 1.74% 1.38% 89.77% N/A
2,048 1.79% 1.26% 0.93% 87.23%
4,096 1.76% 1.24% 0.87% 0.58%
8,192 1.65% 1.14% 0.85% 0.56%
16,384 1.58% 1.24% 0.87% 0.56%
32,768 1.69% 1.20% 0.86% 0.57%
65,536 1.67% 1.18% 0.87% 0.55%
131,072 1.66% 1.16% 0.92% 0.54%
5. QUALITATIVE DATA EXPLORATION
 Experimental data set was collected.
 Approximately 25,190 publicly available samples from VirusTotal
 The RWE was computed.
 RWE was down sampled because malware binary size is a variable
 The classifications were explored.
 The unified functional classifications were determined:
 Virus, Worm, Backdoor, Trojan, Ransom, PUA
 The VirusTotal products were chosen:
 Microsoft, Symantec, Kaspersky, Sophos, TrendMicro, ClamAV
 Classifier experimentation.
 Source the final malware data set.
5. QUALITATIVE DATA EXPLORATION
SHA256 Microsoft Symantec Kaspersky Sophos TrendMicro ClamAV
4009b81841c893eceaf2d92e1816a968fad66db69f7
0429724e6d6f7bb9a4ca7
Trojan.Gen.2 HT_DYNAMER_GA25051D.UV
PM
Win.Trojan.Ag
ent-1149280
79b155f341a6c3311b4ded9fc2c239f0e124b705b62
7cfe62a9b6b6620d80c5b
Trojan.Gen.2 Mal/Gene
ric-S
HT_DYNAMER_GA25051D.UV
PM
Win.Trojan.Ag
ent-1149280
eea41fa0f53c20ab2302fdfef59dd4b3dba80f14ec41
9f9d9c0fb18eac484c1a
Virus:Win32/Shodi.I Trojan.Gen.N
PE.2
Virus.Win32.Virut.ce W32/Scri
bble-B
TROJ_OBFUSCATOR_GC1402
62.UVPM
Win.Trojan.Sh
ohdi-6136104-
0
60dcb577124f01cbfcadbe7a86b61d06ad3783083a
04f4847a2a255f210ffb28
SMG.Heur!g
en
HEUR:Trojan.Win32.Start
Page
CloudWra
pper
(PUA)
Win.Trojan.Fa
m-6454574-1
fbee9ad67e71ac0e109503a2acd0dd7be1e5f91c47
3bcb349fbbd5e0050a1d3f
Trojan:Win32/Qhost HEUR:Trojan.Win32.Gen
eric
HT_AGENT_HC050083.UVPM Win.Trojan.Qh
ost-160
472257d7fc92f035a50b37ec2536b24bf83ec66ef35
6df4c03afd8f297138bcf
Virus:Win32/Shodi.I Trojan.Gen.N
PE.2
Virus.Win32.Virut.ce W32/Sho
di-I
TROJ_OBFUSCATOR_GC1402
62.UVPM
Win.Trojan.Sh
ohdi-6136104-
0
f308ad39249c32127653c45e20eb318aca5db84849
4812d07b688573c128f90a
Trojan:Win32/Dorv.A Trojan Horse Trojan.Win32.Inject.azg
w
Troj/Simb
ot-J
TROJ_KRYPTK.SMS Win.Trojan.Inj
ector-
6297684-0
2c1b5dabfc0d3646d27814e8e0092e856debbb704a
53de402a9b49ed3a81130a
Trojan.Gen.2 Mal/Gene
ric-S
HT_DYNAMER_GA25051D.UV
PM
Win.Trojan.Ag
ent-1149280
5c9888db9bf4aebe7102dc086505aed8afc33ccc7e4
bb75257e8052f77f77061
TrojanDropper:Win32/D
inwod.B!bit
Suspicious.IR
CBot
HEUR:Trojan.Win32.Gen
eric
HT_AGENT_GL26002B.UVPM
6d38be3380671eaa11f1dec1758012b6a2e7b8f48c
ed5bd4edc37ea48c640da4
Trojan.Gen.2 not-a-
virus:RiskTool.Win32.Fly
Studio.bjxf
Generic
PUA JE
(PUA)
b897a0fb1865e016bb88806300ddd9ee739f791944
2f40d1b5d22cc974923996
Worm:Win32/Ludbarum
a.A
Trojan.Gen.2 Trojan-
Ransom.Win32.Blocker.k
puo
W32/Mat
o-N
TSPY_LUDBARUMA_BK083ED
B.TOMC
Win.Worm.Un
tukmu-
5949608-0
5. QUALITATIVE DATA EXPLORATION
Search String Malgazer Classification
Trj Trojan
PUP PUA
Wrm Worm
Bkdr Backdoor
Cryp Ransom
PUA PUA
RiskTool PUA
VirTool Virus
VirLock Ransom
HackTool PUA
RemoteAdmin PUA
Trojan Trojan
Worm Worm
Troj Trojan
6. MALGAZER
 Over 200 experiments
 1 experiment was 1 training session.
 Computed the features:
 RWE
 GIST
 Scaled the computations
 Terraform
 Grid searching
 Provided hyper-parameters
 Cross 10-Fold validation scoring
 Provided average accuracy statistics
 Final Training
 Provided ROC/AUC and an actual classifier
Model Base Model /
Optimizer
Features RWE
Window
(Bytes)
RWE Data
Points
Accuracy
Mean
Accuracy
Variance
adaboost rf RWE 1,024 1,024 95.00% 0.0045
adaboost dt RWE 1,024 1,024 94.96% 0.0047
adaboost dt RWE 256 4,096 94.95% 0.0046
adaboost rf RWE 256 4,096 94.84% 0.0041
adaboost rf RWE 256 1,024 94.69% 0.0043
adaboost dt RWE 256 1,024 94.63% 0.0042
ann-5 adadelta RWE 256 512 94.38% 0.0050
ann-5 adam RWE 256 512 94.31% 0.0038
ann-4 adadelta RWE 256 512 94.29% 0.0052
ann-4 adam RWE 256 512 94.28% 0.0048
ovr knn GIST N/A N/A 94.24% 0.0073
knn N/A GIST N/A N/A 94.24% 0.0069
ann-5 adamax RWE 256 512 94.22% 0.0059
ann-3 nadam RWE 256 512 94.19% 0.0054
ann-4 adamax RWE 256 512 94.19% 0.0057
ann-3 rmsprop RWE 256 512 94.12% 0.0050
ann-3 adadelta RWE 256 512 94.10% 0.0051
ann-3 adagrad RWE 256 512 94.09% 0.0043
ann-4 adadelta GIST N/A N/A 94.09% 0.0065
ann-3 adadelta GIST N/A N/A 94.08% 0.0052
ann-4 adamax GIST N/A N/A 94.05% 0.0060
ann-1 rmsprop RWE 256 512 94.04% 0.0073
ann-3 adam RWE 256 512 94.01% 0.0060
ann-1 adadelta RWE 256 512 93.98% 0.0066
ann-1 adagrad RWE 256 512 93.96% 0.0046
ann-4 adam GIST N/A N/A 93.96% 0.0043
ann-2 adadelta RWE 256 512 93.95% 0.0037
ann-2 nadam RWE 256 512 93.94% 0.0051
7. MALGAZER COMPARISON
 The Malgazer classifier slightly outperformed the GIST classifier,
therefore they are at a minimum comparable for classification
accuracy.
 Using this comparison, it is possible to theoretically compare
Malgazer to other prior literature.
 Malgazer was comparable to the prior literature reviewed in this
dissertation.
 Additional observations were also made…
ADABOOST/RF RWE 1,024/1,024
ADABOOST/DT RWE 1,024/1,024
KNN GIST
THE MALGAZER CLASSIFIER “C”
 CATEGORY: Malware Classifier – Running Window Entropy
 EFFICACY: accuracy of approximately 95%
 TRAINING DATA: 60,000 malware samples and functional classifications from
VirusTotal, where there were six functional classifications developed from the
publicly available detection information: Virus, Trojan, Worm, PUA, Ransom, and
Backdoor
 TRANSFORMATION FUNCTION T(): The RWE data
 PRE-PROCESSING FUNCTION P(): The re-sampled RWE to a fixed sample size
 MACHINE LEARNING MODEL FUNCTION MLM():
 Adaboost/DT
 Adaboost/RF
 Artificial neural networks
 ERROR FUNCTION E(): Confusion matrices
 TUNING FUNCTION TU(): Window and data point sampling sizes for running
window entropy.
 POST PROCESSING FUNCTION PP(): Not discussed.
8. MALGAZER WEB APPLICATION
8. MALGAZER WEB APPLICATION
8. MALGAZER WEB APPLICATION
8. MALGAZER WEB APPLICATION
8. MALGAZER WEB APPLICATION
9. CONCLUSIONS
 The Six Research Contributions
 The generalized classification model
 RWE optimization
 Collection and normalization of functional classification information
 Scaling the computations
 The most accurate classifier (slightly more than GIST)
 The web application
 Recommendations
 RWE is slow, but if you are already using it, this could be a great method for
automated classification. Otherwise, GIST is faster and provides about the
same accuracy.
 Future research opportunities
 Continue generalized model application
 Functional classification determination (Problematic Trojan?)
 Optimal RWE window size and data sample length
OPEN SOURCE
http://github.com/keithjjones/malgazer
QUESTIONS
?

Contenu connexe

Tendances

Android Malware 2020 (CCCS-CIC-AndMal-2020)
Android Malware 2020 (CCCS-CIC-AndMal-2020)Android Malware 2020 (CCCS-CIC-AndMal-2020)
Android Malware 2020 (CCCS-CIC-AndMal-2020)Indraneel Dabhade
 
AI-Driven Personalized Email Marketing
AI-Driven Personalized Email MarketingAI-Driven Personalized Email Marketing
AI-Driven Personalized Email MarketingDatabricks
 
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...DataStax Academy
 
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender SystemRecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender SystemEhsan38
 
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016DataStax
 
Threat hunting and achieving security maturity
Threat hunting and achieving security maturityThreat hunting and achieving security maturity
Threat hunting and achieving security maturityDNIF
 
An Introduction to Locks in Go
An Introduction to Locks in GoAn Introduction to Locks in Go
An Introduction to Locks in GoYu-Shuan Hsieh
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...Flink Forward
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersSeunghyun Hwang
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introductionMostafa Abdel-sallam
 
Blotting techniques
Blotting techniquesBlotting techniques
Blotting techniquesKinzaHaroon1
 
Knowledge Distillation for Federated Learning: a Practical Guide
Knowledge Distillation for Federated Learning: a Practical GuideKnowledge Distillation for Federated Learning: a Practical Guide
Knowledge Distillation for Federated Learning: a Practical GuideXiachongFeng
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache KafkaPaul Brebner
 

Tendances (20)

Android Malware 2020 (CCCS-CIC-AndMal-2020)
Android Malware 2020 (CCCS-CIC-AndMal-2020)Android Malware 2020 (CCCS-CIC-AndMal-2020)
Android Malware 2020 (CCCS-CIC-AndMal-2020)
 
AI-Driven Personalized Email Marketing
AI-Driven Personalized Email MarketingAI-Driven Personalized Email Marketing
AI-Driven Personalized Email Marketing
 
PIC your malware
PIC your malwarePIC your malware
PIC your malware
 
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...
Cassandra Summit 2014: Novel Multi-Region Clusters — Cassandra Deployments Sp...
 
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender SystemRecSysOps: Best Practices for Operating a Large-Scale Recommender System
RecSysOps: Best Practices for Operating a Large-Scale Recommender System
 
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016
Troubleshooting Cassandra (J.B. Langston, DataStax) | C* Summit 2016
 
Threat hunting and achieving security maturity
Threat hunting and achieving security maturityThreat hunting and achieving security maturity
Threat hunting and achieving security maturity
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
How can Optical Mapping accelerate my research?
How can Optical Mapping accelerate my research?How can Optical Mapping accelerate my research?
How can Optical Mapping accelerate my research?
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
An Introduction to Locks in Go
An Introduction to Locks in GoAn Introduction to Locks in Go
An Introduction to Locks in Go
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
 
01 Metasploit kung fu introduction
01 Metasploit kung fu introduction01 Metasploit kung fu introduction
01 Metasploit kung fu introduction
 
Blotting techniques
Blotting techniquesBlotting techniques
Blotting techniques
 
Dna sequencing
Dna sequencingDna sequencing
Dna sequencing
 
Knowledge Distillation for Federated Learning: a Practical Guide
Knowledge Distillation for Federated Learning: a Practical GuideKnowledge Distillation for Federated Learning: a Practical Guide
Knowledge Distillation for Federated Learning: a Practical Guide
 
A visual introduction to Apache Kafka
A visual introduction to Apache KafkaA visual introduction to Apache Kafka
A visual introduction to Apache Kafka
 
Nmap basics
Nmap basicsNmap basics
Nmap basics
 

Similaire à Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNING WINDOW ENTROPY AND MACHINE LEARNING

Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...UltraUploader
 
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...IJCSIS Research Publications
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning MethodsIRJET Journal
 
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...CSCJournals
 
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSA STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSijaia
 
Design and Development of an Efficient Malware Detection Using ML
Design and Development of an Efficient Malware Detection Using MLDesign and Development of an Efficient Malware Detection Using ML
Design and Development of an Efficient Malware Detection Using MLSiva krishnam raju Patsamatla
 
Machine Learning in Malware Detection
Machine Learning in Malware DetectionMachine Learning in Malware Detection
Machine Learning in Malware DetectionKaspersky
 
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...Analysis of Malware Infected Systems & Classification with Gradient-boosted T...
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...Darshan Gorasiya
 
Zero day malware detection
Zero day malware detectionZero day malware detection
Zero day malware detectionsujeeshkumarj
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....Research Publish Journals (Publisher)
 
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...IJCI JOURNAL
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesArshadRaja786
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationUltraUploader
 
Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Akash Karwande
 
Predict Android ransomware using categorical classifiaction.pptx
Predict Android ransomware using categorical classifiaction.pptxPredict Android ransomware using categorical classifiaction.pptx
Predict Android ransomware using categorical classifiaction.pptxlaharisai03
 
Malware evolution and Endpoint Detection and Response
Malware evolution and Endpoint Detection and Response Malware evolution and Endpoint Detection and Response
Malware evolution and Endpoint Detection and Response Adrian Guthrie
 
Malware evolution and Endpoint Detection and Response Technology
Malware evolution and Endpoint Detection and Response  TechnologyMalware evolution and Endpoint Detection and Response  Technology
Malware evolution and Endpoint Detection and Response TechnologyAdrian Guthrie
 
A feature selection and evaluation scheme for computer virus detection
A feature selection and evaluation scheme for computer virus detectionA feature selection and evaluation scheme for computer virus detection
A feature selection and evaluation scheme for computer virus detectionUltraUploader
 

Similaire à Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNING WINDOW ENTROPY AND MACHINE LEARNING (20)

Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...Application of data mining based malicious code detection techniques for dete...
Application of data mining based malicious code detection techniques for dete...
 
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...
Selecting Prominent API Calls and Labeling Malicious Samples for Effective Ma...
 
IRJET - Survey on Malware Detection using Deep Learning Methods
IRJET -  	  Survey on Malware Detection using Deep Learning MethodsIRJET -  	  Survey on Malware Detection using Deep Learning Methods
IRJET - Survey on Malware Detection using Deep Learning Methods
 
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
Integrated Feature Extraction Approach Towards Detection of Polymorphic Malwa...
 
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODSA STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
A STATIC MALWARE DETECTION SYSTEM USING DATA MINING METHODS
 
Design and Development of an Efficient Malware Detection Using ML
Design and Development of an Efficient Malware Detection Using MLDesign and Development of an Efficient Malware Detection Using ML
Design and Development of an Efficient Malware Detection Using ML
 
Machine Learning in Malware Detection
Machine Learning in Malware DetectionMachine Learning in Malware Detection
Machine Learning in Malware Detection
 
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
[IJET-V1I6P6] Authors: Ms. Neeta D. Birajdar, Mr. Madhav N. Dhuppe, Ms. Trupt...
 
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...Analysis of Malware Infected Systems & Classification with Gradient-boosted T...
Analysis of Malware Infected Systems & Classification with Gradient-boosted T...
 
Zero day malware detection
Zero day malware detectionZero day malware detection
Zero day malware detection
 
Malware analysis and detection using reverse Engineering, Available at: www....
Malware analysis and detection using reverse Engineering,  Available at: www....Malware analysis and detection using reverse Engineering,  Available at: www....
Malware analysis and detection using reverse Engineering, Available at: www....
 
Malware1
Malware1Malware1
Malware1
 
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...
MACHINE LEARNING APPLICATIONS IN MALWARE CLASSIFICATION: A METAANALYSIS LITER...
 
Malware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning TechniquesMalware Detection Using Machine Learning Techniques
Malware Detection Using Machine Learning Techniques
 
Applications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creationApplications of genetic algorithms to malware detection and creation
Applications of genetic algorithms to malware detection and creation
 
Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques Malware Detection Using Data Mining Techniques
Malware Detection Using Data Mining Techniques
 
Predict Android ransomware using categorical classifiaction.pptx
Predict Android ransomware using categorical classifiaction.pptxPredict Android ransomware using categorical classifiaction.pptx
Predict Android ransomware using categorical classifiaction.pptx
 
Malware evolution and Endpoint Detection and Response
Malware evolution and Endpoint Detection and Response Malware evolution and Endpoint Detection and Response
Malware evolution and Endpoint Detection and Response
 
Malware evolution and Endpoint Detection and Response Technology
Malware evolution and Endpoint Detection and Response  TechnologyMalware evolution and Endpoint Detection and Response  Technology
Malware evolution and Endpoint Detection and Response Technology
 
A feature selection and evaluation scheme for computer virus detection
A feature selection and evaluation scheme for computer virus detectionA feature selection and evaluation scheme for computer virus detection
A feature selection and evaluation scheme for computer virus detection
 

Dernier

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Dernier (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Keith J. Jones, Ph.D. - MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNING WINDOW ENTROPY AND MACHINE LEARNING

  • 1. MALGAZER: AN AUTOMATED MALWARE CLASSIFIER WITH RUNNING WINDOW ENTROPY AND MACHINE LEARNING Student: Keith J. Jones, Chair: Dr. Yong Wang Dakota State University - keith.j.jones@ieee.org
  • 2. COMMITTEE  Dr. Yong Wang (Chair)  Dr. Jun Liu  Dr. Wayne Pauli  Dr. Joshua Pauli  Dr. Joshua Stroschein
  • 3. OUTLINE 1. Introduction & Motivation 2. Literature Review 3. Research Methodology 4. Optimizing Running Window Entropy 5. Qualitative Data Exploration 6. Malgazer: An RWE-Based Malware Classifier 7. Comparison of Malgazer to Prior Literature 8. The Malgazer Web Application 9. Conclusions
  • 4. 1. INTRODUCTION & MOTIVATION  Malware: 200,000+ new threats released into the internet each day, and that number general grows each year (Panda Security, 2017)  Malware detection, with machine learning methods, claim efficacy rates of 95% (Saxe & Berlin, 2015) or beyond (Cylance Inc. 2016)  Malware detection and classification often used interchangeably in prior research.  Malware detection is a subset of the malware classification problem:
  • 6. THE ML CLASSIFIER MODEL  The machine learning based model can be represented by the set: C: {T(), P(), MLM(), E(), TU(), PP()}  T() – Transformation function  P() – Preprocessing function  MLM() – The machine learning model  E() – Error measurement function  TU() – Tuning function  PP() – Post processing function
  • 8. RUNNING WINDOW ENTROPY  This is a measurement that shows the localized entropy at each offset within the malware.  I was able to optimize the original running window entropy algorithm to less than 2% of the running time for the original algorithm, on average.  Jones, K., & Wang, Y. (2018).
  • 9. WHAT DOES R.W.E. GIVE US?  A view of “randomness”.  0 is not random, 1 is totally random.  A view of “information gain”.  A view of “relevancy”.  Running windows of entropy present the entropic structure of a malware sample.
  • 13. THE RESEARCH PROBLEM  Develop an automated machine learning based classifier (Malgazer) using running window entropy that will classify new malware samples functionally. This problem will be solved using a design science research methodology (Arnould, Hevner, March, & Park, 2004).  Null Hypothesis: A classifier cannot be built to perform better than random guessing (or in this case random classifications).
  • 14. SIX DSRM CONTRIBUTIONS 1. Model Artifact: The ML based classifier model. 2. Method Artifact: Optimization of R.W.E. Algorithm 3. Method Artifact: Collect and normalize functional classification information from publicly available resources 4. Method Artifact: Scaling the training of machine learning classifiers 5. Instantiation Artifact: An actual ML based classifier. 6. Instantiation Artifact: A user application to access the classifier
  • 15. 2. LITERATURE REVIEW  While reading, began categorizing the prior literature.  Malware Classification Method Categories:  Malware Detector:  Boot sectors  Dynamic analysis  Entropy statistics  Images from bytes and n-grams  Opcode statistics  PDF streams  Static analysis  Static analysis and opcodes  Static analysis and n-grams  Static and dynamic analysis  Structural Entropy  N-grams
  • 16. 2. LITERATURE REVIEW  Malware Classification Method Categories:  Malware Classifier:  Byte code  Dynamic analysis  HMM  Images from bytes  Images from bytes and opcodes  Intermediate language of opcodes  Static analysis  Static analysis and images from bytes  Static analysis and opcodes  Static and dynamic analysis  Structural entropy
  • 17. 2. LITERATURE REVIEW  Reviewed 71 prior methods  Spans 1996 through 2018  41 malware detectors  30 malware classifiers  Only 2 classifiers used features similar to running window entropy.  The first method only classifies samples if they were packed.  The second method only reported an accuracy of 87.2%. It also used structural entropy.  One method, “GIST”, was implemented for deeper comparisons.  GIST = filters on a 2D image of the malware sample, 320 values
  • 19. 3. RESEARCH METHODOLOGY  Comparison to prior literature  Confusion Matrices:  Per classification accuracy  ROC/AUC  Empirical comparison to GIST method.  Theoretical comparisons to other methods.
  • 20. TRAINING THE MODEL  What training set?  Publicly available malware through VirusTotal.  Windows PE files  60,000 samples of classified and known malware, 10,000 samples from each of 6 classifications:  Virus, Backdoor, Worm, Trojan, Ransom, PUA  What testing set?  Approximately 10-20% of the collected malware samples.  What input data? (features)  RWE  GIST  Scale the 200+ experiments using Terraform.
  • 21. VIRUSTOTAL YARA RULE 1. import "math" 2. import "pe" 3. 4. /* 20180312.1 */ 5. 6. rule peexe_files_20180312_1 { 7. condition: 8. file_type contains "peexe" 9. }
  • 22. ANOTHER VIRUSTOTAL YARA RULE 1. import "math" 2. import "pe” 3. 4. /* 20180501.1 */ 5. 6. rule peexe_backdoor_files_20180501_1 { 7. condition: 8. (file_type contains "peexe") and (microsoft contains "Backdoor") 9. }
  • 23. TRAINING THE MODEL (CONT)  Machine learning algorithms:  Decision Trees  Random Forests  K Nearest Neighbors  SVM  Naïve Bayes  Nearest Centroid  Adaboost  OneVRest  Deep Learning:  Artificial Neural Networks  Convolutional Neural Networks
  • 24. TRAINING PHASES 1. Grid Search 2. 10-Fold Validation 3. Final Training
  • 25. 4. OPTIMIZING R.W.E.  I was able to optimize the original running window entropy algorithm to less than 2% of the running time for the original algorithm, on average.  This topic was published in a peer reviewed conference, June 2018.  Jones, K., & Wang, Y. (2018).
  • 28. ORIGINAL V. OPTIMIZED ALGORITHMS OPTIMIZED V. ORIGINAL PERCENTAGE OF RUNNING TIME Malware Size (bytes) 256 Byte Entropy Window 512 Byte Entropy Window 1,024 Byte Entropy Window 2,048 Byte Entropy Window 256 93.55% N/A N/A N/A 512 1.99% 97.61% N/A N/A 1,024 1.74% 1.38% 89.77% N/A 2,048 1.79% 1.26% 0.93% 87.23% 4,096 1.76% 1.24% 0.87% 0.58% 8,192 1.65% 1.14% 0.85% 0.56% 16,384 1.58% 1.24% 0.87% 0.56% 32,768 1.69% 1.20% 0.86% 0.57% 65,536 1.67% 1.18% 0.87% 0.55% 131,072 1.66% 1.16% 0.92% 0.54%
  • 29. 5. QUALITATIVE DATA EXPLORATION  Experimental data set was collected.  Approximately 25,190 publicly available samples from VirusTotal  The RWE was computed.  RWE was down sampled because malware binary size is a variable  The classifications were explored.  The unified functional classifications were determined:  Virus, Worm, Backdoor, Trojan, Ransom, PUA  The VirusTotal products were chosen:  Microsoft, Symantec, Kaspersky, Sophos, TrendMicro, ClamAV  Classifier experimentation.  Source the final malware data set.
  • 30. 5. QUALITATIVE DATA EXPLORATION SHA256 Microsoft Symantec Kaspersky Sophos TrendMicro ClamAV 4009b81841c893eceaf2d92e1816a968fad66db69f7 0429724e6d6f7bb9a4ca7 Trojan.Gen.2 HT_DYNAMER_GA25051D.UV PM Win.Trojan.Ag ent-1149280 79b155f341a6c3311b4ded9fc2c239f0e124b705b62 7cfe62a9b6b6620d80c5b Trojan.Gen.2 Mal/Gene ric-S HT_DYNAMER_GA25051D.UV PM Win.Trojan.Ag ent-1149280 eea41fa0f53c20ab2302fdfef59dd4b3dba80f14ec41 9f9d9c0fb18eac484c1a Virus:Win32/Shodi.I Trojan.Gen.N PE.2 Virus.Win32.Virut.ce W32/Scri bble-B TROJ_OBFUSCATOR_GC1402 62.UVPM Win.Trojan.Sh ohdi-6136104- 0 60dcb577124f01cbfcadbe7a86b61d06ad3783083a 04f4847a2a255f210ffb28 SMG.Heur!g en HEUR:Trojan.Win32.Start Page CloudWra pper (PUA) Win.Trojan.Fa m-6454574-1 fbee9ad67e71ac0e109503a2acd0dd7be1e5f91c47 3bcb349fbbd5e0050a1d3f Trojan:Win32/Qhost HEUR:Trojan.Win32.Gen eric HT_AGENT_HC050083.UVPM Win.Trojan.Qh ost-160 472257d7fc92f035a50b37ec2536b24bf83ec66ef35 6df4c03afd8f297138bcf Virus:Win32/Shodi.I Trojan.Gen.N PE.2 Virus.Win32.Virut.ce W32/Sho di-I TROJ_OBFUSCATOR_GC1402 62.UVPM Win.Trojan.Sh ohdi-6136104- 0 f308ad39249c32127653c45e20eb318aca5db84849 4812d07b688573c128f90a Trojan:Win32/Dorv.A Trojan Horse Trojan.Win32.Inject.azg w Troj/Simb ot-J TROJ_KRYPTK.SMS Win.Trojan.Inj ector- 6297684-0 2c1b5dabfc0d3646d27814e8e0092e856debbb704a 53de402a9b49ed3a81130a Trojan.Gen.2 Mal/Gene ric-S HT_DYNAMER_GA25051D.UV PM Win.Trojan.Ag ent-1149280 5c9888db9bf4aebe7102dc086505aed8afc33ccc7e4 bb75257e8052f77f77061 TrojanDropper:Win32/D inwod.B!bit Suspicious.IR CBot HEUR:Trojan.Win32.Gen eric HT_AGENT_GL26002B.UVPM 6d38be3380671eaa11f1dec1758012b6a2e7b8f48c ed5bd4edc37ea48c640da4 Trojan.Gen.2 not-a- virus:RiskTool.Win32.Fly Studio.bjxf Generic PUA JE (PUA) b897a0fb1865e016bb88806300ddd9ee739f791944 2f40d1b5d22cc974923996 Worm:Win32/Ludbarum a.A Trojan.Gen.2 Trojan- Ransom.Win32.Blocker.k puo W32/Mat o-N TSPY_LUDBARUMA_BK083ED B.TOMC Win.Worm.Un tukmu- 5949608-0
  • 31. 5. QUALITATIVE DATA EXPLORATION Search String Malgazer Classification Trj Trojan PUP PUA Wrm Worm Bkdr Backdoor Cryp Ransom PUA PUA RiskTool PUA VirTool Virus VirLock Ransom HackTool PUA RemoteAdmin PUA Trojan Trojan Worm Worm Troj Trojan
  • 32. 6. MALGAZER  Over 200 experiments  1 experiment was 1 training session.  Computed the features:  RWE  GIST  Scaled the computations  Terraform  Grid searching  Provided hyper-parameters  Cross 10-Fold validation scoring  Provided average accuracy statistics  Final Training  Provided ROC/AUC and an actual classifier
  • 33. Model Base Model / Optimizer Features RWE Window (Bytes) RWE Data Points Accuracy Mean Accuracy Variance adaboost rf RWE 1,024 1,024 95.00% 0.0045 adaboost dt RWE 1,024 1,024 94.96% 0.0047 adaboost dt RWE 256 4,096 94.95% 0.0046 adaboost rf RWE 256 4,096 94.84% 0.0041 adaboost rf RWE 256 1,024 94.69% 0.0043 adaboost dt RWE 256 1,024 94.63% 0.0042 ann-5 adadelta RWE 256 512 94.38% 0.0050 ann-5 adam RWE 256 512 94.31% 0.0038 ann-4 adadelta RWE 256 512 94.29% 0.0052 ann-4 adam RWE 256 512 94.28% 0.0048 ovr knn GIST N/A N/A 94.24% 0.0073 knn N/A GIST N/A N/A 94.24% 0.0069 ann-5 adamax RWE 256 512 94.22% 0.0059 ann-3 nadam RWE 256 512 94.19% 0.0054 ann-4 adamax RWE 256 512 94.19% 0.0057 ann-3 rmsprop RWE 256 512 94.12% 0.0050 ann-3 adadelta RWE 256 512 94.10% 0.0051 ann-3 adagrad RWE 256 512 94.09% 0.0043 ann-4 adadelta GIST N/A N/A 94.09% 0.0065 ann-3 adadelta GIST N/A N/A 94.08% 0.0052 ann-4 adamax GIST N/A N/A 94.05% 0.0060 ann-1 rmsprop RWE 256 512 94.04% 0.0073 ann-3 adam RWE 256 512 94.01% 0.0060 ann-1 adadelta RWE 256 512 93.98% 0.0066 ann-1 adagrad RWE 256 512 93.96% 0.0046 ann-4 adam GIST N/A N/A 93.96% 0.0043 ann-2 adadelta RWE 256 512 93.95% 0.0037 ann-2 nadam RWE 256 512 93.94% 0.0051
  • 34. 7. MALGAZER COMPARISON  The Malgazer classifier slightly outperformed the GIST classifier, therefore they are at a minimum comparable for classification accuracy.  Using this comparison, it is possible to theoretically compare Malgazer to other prior literature.  Malgazer was comparable to the prior literature reviewed in this dissertation.  Additional observations were also made…
  • 38. THE MALGAZER CLASSIFIER “C”  CATEGORY: Malware Classifier – Running Window Entropy  EFFICACY: accuracy of approximately 95%  TRAINING DATA: 60,000 malware samples and functional classifications from VirusTotal, where there were six functional classifications developed from the publicly available detection information: Virus, Trojan, Worm, PUA, Ransom, and Backdoor  TRANSFORMATION FUNCTION T(): The RWE data  PRE-PROCESSING FUNCTION P(): The re-sampled RWE to a fixed sample size  MACHINE LEARNING MODEL FUNCTION MLM():  Adaboost/DT  Adaboost/RF  Artificial neural networks  ERROR FUNCTION E(): Confusion matrices  TUNING FUNCTION TU(): Window and data point sampling sizes for running window entropy.  POST PROCESSING FUNCTION PP(): Not discussed.
  • 39. 8. MALGAZER WEB APPLICATION
  • 40. 8. MALGAZER WEB APPLICATION
  • 41. 8. MALGAZER WEB APPLICATION
  • 42. 8. MALGAZER WEB APPLICATION
  • 43. 8. MALGAZER WEB APPLICATION
  • 44. 9. CONCLUSIONS  The Six Research Contributions  The generalized classification model  RWE optimization  Collection and normalization of functional classification information  Scaling the computations  The most accurate classifier (slightly more than GIST)  The web application  Recommendations  RWE is slow, but if you are already using it, this could be a great method for automated classification. Otherwise, GIST is faster and provides about the same accuracy.  Future research opportunities  Continue generalized model application  Functional classification determination (Problematic Trojan?)  Optimal RWE window size and data sample length