International Journal of Security, Privacy and Trust Management ( IJSPTM) Vol 2, No 5, October 2013

MALICIOUS PDF DOCUMENT DETECTION BASED
ON FEATURE EXTRACTION AND ENTROPY
Himanshu Pareek
Center for Development of Advanced Computing, Hyderabad, India

ABSTRACT
In this paper we present a machine learning based approach for the detection of malicious PDF
documents. We identify various features of PDF documents that malware authors use to construct
a malicious file. Based on this feature set we derive models that are used to detect malicious
PDF documents. With this feature set, the detection rate is high compared to approaches that
depend on analysis of the JavaScript embedded in the PDF document.

KEYWORDS
Malware Detection, Malicious PDF Document, Heuristics

1. INTRODUCTION
PDF documents are heavily used by cyber criminals for launching attacks [1]. A PDF document
on an enticing topic is sent to the victim, and when the document is opened, a particular
vulnerability in the implementation or configuration of the rendering software is exploited to
launch the next step of the attack. Examples include direct execution of a native executable (if
the code was embedded in the PDF document itself), injection of code into a running process, or
downloading a binary from the internet and then executing it. In this work, we analyze a data set
of benign and malicious PDF documents and observe the differences between them using machine
learning techniques. A given document can then be analyzed to determine whether it is malicious
or benign.

2. RELATED WORK
Pavel Laskov and Nedim Šrndić [2] implemented an approach for static detection of malicious
JavaScript-bearing PDF documents. This approach is useful because it does not involve any
sandbox, but it detects malicious PDF documents mainly when they use obfuscated JavaScript.
Zacharias Tzermias et al. [3] designed and implemented MDScan, which combines static and
dynamic analysis for malicious PDF detection. This combination pushes towards reliable detection
of malicious PDF documents, but the emulation technique increases processing time and requires
emulation software on the user's system. Florian Schmitt et al. [4] also present an approach to
detect malicious PDF documents based on JavaScript analysis. Our work takes these previous
approaches to detecting obfuscated JavaScript into account and adds other parameters such as
OpenAction and deflate. The work most similar to ours is that of Charles Smutz and Angelos
Stavrou [10]. They use a very large set of features, including font and creation date, but do not
disclose whether these features really affect the classification results. In our work, we target
only a few crucial features.

DOI: 10.5121/ijsptm.2013.2504

3. DATASETS
We collected malicious PDF files from various malware hosting websites such as
offensivecomputing.net, contagiodump.blogspot.com, malware.lu and virusshare.com. We
placed all of these samples in the training set (3000 benign and 3000 malicious documents) and
continued to collect new samples from various users for the test set. The test set contains 700
PDF documents.

4. OUR APPROACH
Our approach is based on machine learning. We first took labeled sets of malicious and benign
PDF files and extracted various features from them. We then built models using standard machine
learning classification algorithms. These models are used to classify a given unknown file as
malicious or benign. We manually analyzed all the PDF documents in the training set for the
ten features shown in Table 1. Based on our own analysis of malicious PDF documents, we also
added parameter 7, which is the presence of both JavaScript and OpenAction.
This feature set was chosen based on the work of Paul Baccas at Sophos [5], which indicates the
features that are prominent in malicious PDF documents as compared to genuine documents.
The importance of these features can be understood from the Adobe PDF Specification [6]. For
example:
OpenAction, if present in an object header line, specifies a destination page that is displayed
when the document is opened, or an action to be performed when the document is opened. It can
also be used in attacks [7]. Similarly, 'Launch' specifies that an application should be launched,
usually to open a file. '/JS' and '/JavaScript' indicate the presence of JavaScript inside the
object. The PDF Specification supports many data encoding filters, such as 'FlateDecode', which
compresses data and helps to keep the document size small. In our approach, we pick up such
filters and uncompress the data to discover any JavaScript code. If JavaScript is found after
decompressing the data, we assign a severity of 2. We rely on the approach given by Laskov and
Šrndić [2] to detect obfuscated JavaScript; if it is found, it is given a severity of 3.
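The keyword and filter checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the `JS_MARKERS` list used to decide that decompressed data looks like JavaScript, and the regular expression used to locate streams are all assumptions made for the example, and a real parser would also have to handle name obfuscation, object streams and other filters.

```python
import re
import zlib

# Name keywords counted in the raw file bytes (a subset of the features above).
KEYWORDS = (b"/JS", b"/JavaScript", b"/OpenAction", b"/Launch")

# Hypothetical markers: used here only to decide that data "looks like" JavaScript.
JS_MARKERS = (b"eval(", b"unescape(", b"function")

# Bytes between a "stream" keyword and the next "endstream" keyword.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def count_keywords(data: bytes) -> dict:
    """Count each keyword, skipping longer names that merely start with it."""
    counts = {}
    for keyword in KEYWORDS:
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, data))
    return counts


def inflate_streams(data: bytes) -> list:
    """Return the zlib-decompressed content of every stream that can be inflated."""
    inflated = []
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # stream uses another filter or is not compressed
    return inflated


def javascript_after_inflating(data: bytes) -> bool:
    """True if any decompressed stream contains one of the assumed JS markers."""
    return any(
        marker in stream
        for stream in inflate_streams(data)
        for marker in JS_MARKERS
    )


def javascript_with_auto_action(counts: dict) -> bool:
    """Combined feature: (/JS or /JavaScript) and (/OpenAction or /Launch)."""
    has_js = counts["/JS"] > 0 or counts["/JavaScript"] > 0
    has_trigger = counts["/OpenAction"] > 0 or counts["/Launch"] > 0
    return has_js and has_trigger
```

Calling `count_keywords(open(path, "rb").read())` on a file gives the raw counts, and the two boolean helpers map onto the combined JavaScript/OpenAction feature and the JavaScript-after-decompression feature.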
We also took into account that a PDF document intended to exploit a vulnerability may contain
garbage strings, which reduce the document entropy. Because this may not be a reliable method
for detecting malicious documents, we used it as a parameter only under a very strict restriction:
a flag is raised and a severity of 2 is assigned only if the entropy of a document falls below 2.
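The entropy check can be illustrated with a short Python sketch. It assumes that entropy means Shannon entropy measured in bits per byte (so values range from 0 to 8); the text does not state the unit, and the function names are hypothetical.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 2.0  # cut-off from the text; bits per byte is an assumption


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def low_entropy_flag(data: bytes) -> bool:
    """Raise the flag only when the entropy falls below the strict threshold."""
    return shannon_entropy(data) < ENTROPY_THRESHOLD
```

A long run of a single repeated byte, such as padding, has an entropy of 0 and raises the flag, while compressed or otherwise varied content sits well above the threshold. Note that this sketch also flags empty input, which a real implementation may prefer to skip.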
The following feature vector is computed for every PDF file and used in the subsequent calculations.


Table 1: Features extracted from a PDF file

S. No.  Feature                                               Severity Number
1       /JS - The number of JavaScript launched.              1
2       /JavaScript - The number of embedded JavaScript.      1
3       /OpenAction - The presence of an open action          1
4       /Launch - The use of the /Launch statement.           1
5       AsciiHexDecode - The use of the asciihexdecode filter 1
6       Chars After Last EOF                                  1
7       (/JS OR /JavaScript) AND (/OpenAction OR /Launch)     3
8       If Stream Entropy less than 2                         2
9       JavaScript after deflating                            2
10      Obfuscated JavaScript                                 3

We calculate a Euclidean distance (ED) score using the following formula:

$$ED = \sqrt{\sum_{i=0}^{N} S_i^{2}}$$

where $S_i$ is the severity number of the $i$-th feature in Table 1.
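As a worked illustration, the ED score can be read as the square root of the sum of squared severity numbers. The sketch below assumes that a feature contributes its Table 1 severity number when it is present in a document and zero otherwise; the text does not spell out this mapping, and the feature names are hypothetical labels for the ten rows of Table 1.

```python
import math

# Severity numbers from Table 1, keyed by hypothetical feature labels.
SEVERITY = {
    "js": 1,
    "javascript": 1,
    "open_action": 1,
    "launch": 1,
    "ascii_hex_decode": 1,
    "chars_after_last_eof": 1,
    "js_with_auto_action": 3,
    "low_stream_entropy": 2,
    "js_after_deflating": 2,
    "obfuscated_js": 3,
}


def ed_score(present_features: set) -> float:
    """Square root of the summed squared severities of the features present."""
    return math.sqrt(sum(SEVERITY[name] ** 2 for name in present_features))
```

Under this reading, a document with /JS, /OpenAction and the combined feature 7 scores sqrt(1 + 1 + 9) = sqrt(11), and a document with none of the features scores 0.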

We first examined whether the Euclidean distance score could be used to classify a document as
benign or malicious, but it showed a high false positive rate. We then turned to specialized
machine learning algorithms to classify PDF documents based on our feature set.

5. IMPLEMENTATION AND EVALUATION
We used pdfid.py from Stevens [8] as a starting point to identify various features in a PDF
file and to calculate the ED score of various malicious PDF documents. To apply the machine
learning algorithms, we used the Weka implementations [9] of various classifiers. The true
positive rates of the classifiers are depicted in Figure 1 and the false positive rates are
depicted in Figure 2.
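For reference, the two rates reported in the figures follow the standard definitions: the true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN), with malicious documents as the positive class. A minimal sketch with hypothetical function names:

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Fraction of malicious documents that the classifier flags as malicious."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of benign documents that the classifier wrongly flags as malicious."""
    return fp / (fp + tn) if (fp + tn) else 0.0
```

The counts passed to these helpers would come from the confusion matrix that a classifier produces on the test set.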


Figure 1: True Positive Rate of Various Classifiers

Figure 2: False Positive Rate of Various Classifiers

We also measured how much time is taken to process a single PDF file. Because the approach is
based on feature extraction, there is no explicit difference between the processing times for
benign and malicious files. Processing time increases with file size, as shown in Figure 3.


Figure 3: Processing Time for Files.

6. CONCLUSIONS
In this paper, we presented an approach to detect malicious PDF documents based on feature
extraction and machine learning algorithms. Our approach does not use any JavaScript
emulation methods. We also showed that a carefully chosen feature set for the machine learning
algorithm can provide better results. With our set of features, LMT performed slightly better
than J48, and both performed better than Naive Bayes and Bayes Net. If we remove a few parameters
such as JS and OpenAction, the detection rate falls by a noticeable margin.

7. REFERENCES
[1] Bruce Schneier, "PDF most common malware vector", Available at
https://www.schneier.com/blog/archives/2010/03/pdf_the_most_co.html
[2] Laskov, Pavel, and Nedim Šrndić. "Static detection of malicious JavaScript-bearing PDF documents."
Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011.
[3] Tzermias, Zacharias, et al. "Combining static and dynamic analysis for the detection of malicious
documents." Proceedings of the Fourth European Workshop on System Security. ACM, 2011.
[4] Schmitt, Florian, Jan Gassen, and Elmar Gerhards-Padilla. "PDF Scrutinizer: Detecting JavaScript-based
attacks in PDF documents." Privacy, Security and Trust (PST), 2012 Tenth Annual
International Conference on. IEEE, 2012.
[5] Paul Baccas, "Finding Rules For Heuristic Detection Of Malicious PDFs: With Analysis Of
Embedded Exploit Code", Virus Bulletin Conference September 2010
[6] Adobe PDF Specification. Available at http://www.adobe.com/devnet/pdf/pdf_reference.html
[7] Exploit Action with PDF OpenAction. Available at
http://securitylabs.websense.com/content/Blogs/3202.aspx
[8] Didier Stevens, PDF Tools. "pdfid.py", Available at http://blog.didierstevens.com/programs/pdf-tools/
[9] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten
(2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
[10] Smutz, Charles, and Angelos Stavrou. "Malicious PDF detection using metadata and structural
features." Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012.
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Malicious pdf document detection based on feature extraction and entropy

International Journal of Security, Privacy and Trust Management (IJSPTM) Vol 2, No 5, October 2013

MALICIOUS PDF DOCUMENT DETECTION BASED ON FEATURE EXTRACTION AND ENTROPY

Himanshu Pareek
Center for Development of Advanced Computing, Hyderabad, India

ABSTRACT
In this paper we present a machine learning based approach for detecting malicious PDF documents. We identify features of PDF documents that malware authors use to construct a malicious file. From this feature set we derive models that are used to detect malicious PDF documents. With this feature set, the detection rate is higher than that of approaches which depend on analysis of the JavaScript embedded in the PDF document.

KEYWORDS
Malware Detection, Malicious PDF Document, Heuristics

1. INTRODUCTION
PDF documents are heavily used by cyber criminals for launching attacks [1]. A PDF document on an enticing topic is sent to the victim, and when the document is opened, a particular vulnerability in the implementation or configuration of the rendering software is exploited to launch the next step of the attack. Examples include direct execution of a native executable (if the code was embedded in the PDF document itself), injection of code into a running process, or downloading a binary from the internet and then executing it. In this work, we analyze a data set of benign and malicious PDF documents and observe the differences between them using machine learning techniques. A document can then be analyzed to determine whether it is malicious or benign.

2. RELATED WORK
Pavel Laskov and Nedim Šrndić [2] implemented an approach for static detection of malicious JavaScript-bearing PDF documents. This approach is useful because it does not involve a sandbox, but it is able to detect malicious PDF documents only if they mostly use obfuscated JavaScript. Zacharias Tzermias et al. [3] designed and implemented MDScan, which combines static and dynamic analysis for malicious PDF detection.
This combination pushes towards reliable detection of malicious PDF documents, but the emulation technique increases processing time and requires emulation software on the user's system. Florian Schmitt et al. [4] also present an approach to detect malicious PDF documents based on JavaScript analysis. Our work takes these earlier approaches to detecting obfuscated JavaScript into account but also adds other parameters such as OpenAction and deflate. Similar to our work is that of Charles Smutz and Angelos Stavrou [10]. They use a large set of features, including font and creation date, but do not disclose whether these features really affect the classification results. In our work, we target only a few crucial features.

DOI: 10.5121/ijsptm.2013.2504

3. DATASETS
We collected malicious PDF files from various malware hosting websites such as offensivecomputing.net, contagiodump.blogspot.com, malware.lu and virusshare.com. We kept all of these samples in the training set (3000 benign and 3000 malicious), continued to collect new malware samples from various users, and placed those in the test set. The test set contains 700 PDF documents.

4. OUR APPROACH
Our approach is based on machine learning. We first took labelled sets of malicious and benign PDF files and extracted various features from them. We then built models using standard machine learning classification algorithms. These models are used to classify an unknown file as malicious or benign.

We manually analyzed all the PDF documents in the training set for the ten features shown in Table 1. Based on our own analysis of malicious PDF documents, we also added parameter 7, the presence of both JavaScript and OpenAction. The feature set is based on the work of Paul Baccas at Sophos [5], which indicates the features that are prominent in malicious PDF documents compared to genuine ones. The importance of these features can be understood from the Adobe PDF Specification [6]. For example, OpenAction, if present in an object header line, specifies a destination page to be displayed or an action to be performed when the document is opened. It can also be used in attacks [7]. Similarly, 'Launch' specifies that an application should be launched, usually to open a file. '/JS' and '/JavaScript' indicate the presence of JavaScript inside the object.
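The keyword features described above can be approximated by counting PDF name tokens in the raw file bytes. The following is a minimal sketch in that spirit, not the pdfid.py implementation; the function name `count_keywords` and the sample bytes are hypothetical, and the sketch assumes the names appear unobfuscated in the file.

```python
import re

# Keywords named in the text. Matching the raw names is an assumption of this
# sketch: it ignores PDF name obfuscation such as "/J#53".
KEYWORDS = (b"/JS", b"/JavaScript", b"/OpenAction", b"/Launch")


def count_keywords(pdf_bytes: bytes) -> dict:
    """Count occurrences of each keyword in the raw bytes of a PDF file."""
    counts = {}
    for kw in KEYWORDS:
        # The negative lookahead stops "/JS" from matching a longer name
        # that merely starts with the same characters, e.g. "/JSFoo".
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


# Hypothetical two-object fragment used only for illustration.
sample = (
    b"1 0 obj << /OpenAction 2 0 R >> endobj "
    b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
)
print(count_keywords(sample))
```

Scanning raw bytes keeps the extraction fast and independent of a full PDF parser, at the cost of missing names hidden inside compressed streams.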
The PDF Specification supports many data encoding filters, such as 'FlateDecode', which compresses data and helps keep the document size small. In our approach, we pick up such filters and decompress the data to discover any JavaScript code. If JavaScript is found after decompressing the data, we assign a severity of two. We rely on the approach given by Pavel Laskov [4] to detect obfuscated JavaScript; if it is found, it is given a severity of three.

We also took into account that a PDF document built with the intent of exploitation may contain garbage strings, which reduce the document entropy. Because this may not be a reliable method for detecting malicious documents, we use it as a parameter only under a very strict restriction: the flag is raised, and a severity of two assigned, only if the entropy of a document falls below 2.

The feature vector shown in Table 1 is calculated for every PDF file for further calculation.
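The decompression and entropy checks described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: entropy is taken to be Shannon entropy in bits per byte, streams are located with a simple regular expression instead of a PDF parser, and the helper names `shannon_entropy`, `inflated_streams` and `entropy_flag` are hypothetical.

```python
import math
import re
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def entropy_flag(data: bytes, threshold: float = 2.0) -> bool:
    """Raise the flag only when entropy falls below the threshold of 2."""
    return shannon_entropy(data) < threshold


# Assumption: stream bodies sit between "stream" + EOL and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflated_streams(pdf_bytes: bytes):
    """Yield the decompressed body of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or encoded with another filter


# Hypothetical document fragment with one FlateDecode stream.
payload = b"<< /S /JavaScript /JS (app.alert(1)) >>"
pdf = (
    b"1 0 obj << /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(payload)
    + b"\nendstream\nendobj"
)
hidden_js = any(b"/JS" in body or b"/JavaScript" in body for body in inflated_streams(pdf))
```

Here `hidden_js` stands in for the "JavaScript after deflating" feature; searching the inflated data for the `/JS` and `/JavaScript` names is an assumption of this sketch, since the text does not state how the JavaScript is recognized.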
Table 1: Features Extracted From a PDF File

S. No.  Feature                                                  Severity Number
1       /JS - The number of JavaScript launched.                 1
2       /JavaScript - The number of embedded JavaScript.         1
3       /OpenAction - The presence of an open action.            1
4       /Launch - The use of the /Launch statement.              1
5       AsciiHexDecode - The use of the asciihexdecode filter.   1
6       Chars After Last EOF                                     1
7       (/JS OR /JavaScript) AND (/OpenAction OR /Launch)        3
8       If Stream Entropy less than 2                            2
9       JavaScript after deflating                               2
10      Obfuscated JavaScript                                    3

We calculate the Euclidean distance (ED) score using the following formula:

$$ED = \sqrt{\sum_{i=0}^{N} S_i^2}$$

where $S_i$ denotes the severity number associated with feature $i$.

We first tested whether the Euclidean distance can be used to classify a document as benign or malicious, but it showed a high false positive rate. We then turned to specialized machine learning algorithms to classify PDF documents based on our feature set.

5. IMPLEMENTATION AND EVALUATION
We used pdfid.py from Stevens [8] as a starting point for identifying features in a PDF file and calculated the ED score of various malicious PDF documents. To apply the machine learning algorithms, we used the Weka implementations [9] of various algorithms. The true positive rates of the classifiers are depicted in Figure 1 and the false positive rates in Figure 2.
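The ED score from Section 4 can be illustrated with a short sketch. It assumes the score is the square root of the summed squared severity numbers of the features found in a document, with absent features contributing zero; the feature keys and the function name `ed_score` are hypothetical labels for the rows of Table 1.

```python
import math

# Severity numbers from Table 1, keyed by illustrative (hypothetical) names.
SEVERITY = {
    "js": 1,
    "javascript": 1,
    "open_action": 1,
    "launch": 1,
    "ascii_hex_decode": 1,
    "chars_after_last_eof": 1,
    "script_and_auto_action": 3,    # (/JS OR /JavaScript) AND (/OpenAction OR /Launch)
    "low_entropy": 2,               # stream entropy less than 2
    "javascript_after_inflate": 2,
    "obfuscated_javascript": 3,
}


def ed_score(present_features) -> float:
    """Square root of the summed squared severities of the features present."""
    return math.sqrt(sum(SEVERITY[name] ** 2 for name in present_features))


# A document with /JS, /OpenAction and therefore the combined feature 7:
# sqrt(1^2 + 1^2 + 3^2) = sqrt(11).
score = ed_score(["js", "open_action", "script_and_auto_action"])
```

A fixed threshold on such a score treats every feature combination with the same total as equivalent, which may be one reason a learned classifier separates the two classes better.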
Figure 1: True Positive Rate of Various Classifiers

Figure 2: False Positive Rate of Various Classifiers

We also measured how much time is taken to process a single PDF file. Because the approach is based on feature extraction, there is no significant difference between the processing times for benign and malicious files. Processing time increases with file size, as shown in Figure 3.
Figure 3: Processing Time for Files.

6. CONCLUSIONS
In this paper, we presented an approach to detect malicious PDF documents based on feature extraction and machine learning algorithms. Our approach does not use any JavaScript emulation methods. We also show that a carefully chosen feature set for the machine learning algorithm can provide better results. With our feature set, LMT performed slightly better than J48, and both performed better than Naive Bayes and Bayes Net. If we remove a few parameters such as JS and OpenAction, the detection rate falls by a considerable margin.

7. REFERENCES
[1] Bruce Schneier, "PDF most common malware vector", Available at https://www.schneier.com/blog/archives/2010/03/pdf_the_most_co.html
[2] Laskov, Pavel, and Nedim Šrndić. "Static detection of malicious JavaScript-bearing PDF documents." Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011.
[3] Tzermias, Zacharias, et al. "Combining static and dynamic analysis for the detection of malicious documents." Proceedings of the Fourth European Workshop on System Security. ACM, 2011.
[4] Schmitt, Florian, Jan Gassen, and Elmar Gerhards-Padilla. "PDF Scrutinizer: Detecting JavaScript-based attacks in PDF documents." Privacy, Security and Trust (PST), 2012 Tenth Annual International Conference on. IEEE, 2012.
[5] Paul Baccas, "Finding Rules For Heuristic Detection Of Malicious PDFs: With Analysis Of Embedded Exploit Code", Virus Bulletin Conference, September 2010.
[6] Adobe PDF Specification. Available at http://www.adobe.com/devnet/pdf/pdf_reference.html
[7] Exploit Action with PDF OpenAction. Available at http://securitylabs.websense.com/content/Blogs/3202.aspx
[8] Didier Stevens, PDF Tools. "pdfid.py", Available at http://blog.didierstevens.com/programs/pdf-tools/
[9] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
[10] Smutz, Charles, and Angelos Stavrou. "Malicious PDF detection using metadata and structural features." Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012.