©2019 FireEye
What is a String?
Persists in compilation
ASCII/Narrow
– N characters + NULL
– No file format, context
– 0x31 0x33 0x33 0x37 0x00
– ‘1337’, right?
– Not necessarily:
– Memory addresses
– CPU instructions
– Data used by the program
 Unicode/Wide
– 2 bytes per character, double-NULL terminated
Source Code
#include <stdio.h>
int main() {
    printf("1337");
    return 0;
}
Object File
"1337"
.EXE Binary
.data
0x56000:
"1337"
The Strings Program
!This program cannot be run in DOS mode.
??3@YAXPAX@Z
??2@YAPAXI@Z
__CxxFrameHandler
_except_handler3
WSAStartup() error: %d
User-Agent: Mozilla/4.0 (compatible; MSIE 6.00; Windows NT 5.1)
GetLastInputInfo
SeShutdownPrivilege
%sIEXPLORE.EXE
SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\IEXPLORE.EXE
[Machine IdleTime:] %d days + %.2d:%.2d:%.2d
[Machine UpTime:] %-.2d Days %-.2d Hours %-.2d Minutes %-.2d Seconds
ServiceDll
SYSTEM\CurrentControlSet\Services\%s\Parameters
if exist "%s" goto selfkill
del "%s"
attrib -a -r -s -h "%s"
Inject '%s' to PID '%d' Successfully!
cmd.exe /c
Hi,Master [%d/%d/%d %d:%d:%d]
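Output like the listing above comes from scanning raw bytes for runs of printable characters, with no knowledge of file format or context. A minimal sketch of that idea follows. The minimum length of 4, the printable range 0x20 to 0x7e, and the helper name `extract_strings` are assumptions for illustration, not the behavior of any particular Strings implementation.

```python
import re

MIN_LEN = 4  # assumption: a conventional minimum run length

# ASCII/narrow: a run of printable bytes.
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# Unicode/wide: printable bytes, each followed by a NULL byte (UTF-16LE).
WIDE_RE = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def extract_strings(data):
    """Return (ascii_strings, wide_strings) found in a bytes object."""
    ascii_strings = [m.group().decode("ascii") for m in ASCII_RE.finditer(data)]
    wide_strings = [m.group().decode("utf-16-le") for m in WIDE_RE.finditer(data)]
    return ascii_strings, wide_strings


# The bytes 0x31 0x33 0x33 0x37 0x00 are reported as "1337" whether they
# are really text, a memory address, or CPU instructions.
sample = b"\x90\x90" + b"1337\x00" + b"\xff" + "cmd.exe".encode("utf-16-le") + b"\x00\x00"
narrow, wide = extract_strings(sample)
print(narrow, wide)
```

Because the scan is context-free, every byte run that happens to look printable is emitted, which is where the noise discussed in the following slides comes from.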
Malware Triage
Customer: suspected compromise
Incident Response: forensic analysis, identify malware sample
Reverse Engineer: binary triage, malware analysis
+ SOC analysts, red teamers, malware researchers, n00bs, experts
Strings in Practice – Static Analysis
Running Strings on larger binaries produces tens of thousands of strings.
Strings produces a ton of noise mixed in with important information.
Knowing which strings are relevant often requires highly experienced analysts.
Relevance is subjective and its definition can vary significantly across analysts.
Hypothesis and Goals
 Develop a tool that can:
– efficiently identify and prioritize strings
– based on relevance for malware analysis
StringSifter should:
– be easy to use
– generalize across:
– roles, use cases, downstream apps
– save time and money
 How does it work?
Learning to Rank (LTR)
 Create optimal ordering of a list of items
 Precise individual item scores less important than their relative ordering
 In classification, regression, clustering we predict a class or single score
 LTR rarely applied in security applications
LTR as Supervised Learning
 Rank items within unseen lists in a similar way to rankings within training lists
 Each item associated with a set of features and an ordinal integer label
 Ordinal label is the teaching signal that encodes relevance level
EMBER Dataset
 Endgame Malware BEnchmark for Research
– v1 (1.1 million PE files scanned on or before 2017)
 https://arxiv.org/abs/1804.04637
 https://github.com/endgameinc/ember
– 400k train + test malware binaries from v1
 malware defined as flagged malicious by more than 40 VirusTotal (VT) vendors
 Ran Strings on 400k malware binaries
– produced 3+ billion ASCII + Unicode strings (24+ GB)
– performed sampling, stratified by malware family
– labeled according to weak supervision
Weak Supervision
https://github.com/snorkel-team/snorkel
 Data Labeling Bottleneck
 Ordinal Labeling Functions
– ABSTAIN = -1
– VERY_IRRELEVANT = 0
– IRRELEVANT = 1
– SEMI_IRRELEVANT = 2
– NEUTRAL = 3
– SEMI_RELEVANT = 4
– RELEVANT = 5
– VERY_RELEVANT = 6
 cardinality = 7, ties go to NEUTRAL
 Apply 70+ labeling functions (LFs) over input strings, generate probabilistic labels
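The ordinal labeling scheme above can be sketched in plain Python. The three labeling functions and the majority vote below are hypothetical stand-ins: the talk applies 70+ LFs and learns a Snorkel LabelModel to produce probabilistic labels, whereas this sketch only shows the label values, abstention, and the rule that ties go to NEUTRAL.

```python
import re
from collections import Counter

ABSTAIN = -1
(VERY_IRRELEVANT, IRRELEVANT, SEMI_IRRELEVANT, NEUTRAL,
 SEMI_RELEVANT, RELEVANT, VERY_RELEVANT) = range(7)  # cardinality = 7


# Hypothetical labeling functions: each votes for one ordinal label or abstains.
def lf_url(s):
    return VERY_RELEVANT if re.search(r"https?://", s) else ABSTAIN


def lf_format_specifier(s):
    return RELEVANT if re.search(r"%[-.\d]*[sdx]", s) else ABSTAIN


def lf_repeated_char(s):
    return VERY_IRRELEVANT if len(s) > 1 and len(set(s)) == 1 else ABSTAIN


LFS = [lf_url, lf_format_specifier, lf_repeated_char]


def majority_label(s, lfs=LFS):
    """Majority vote over non-abstaining LFs; a tie resolves to NEUTRAL."""
    votes = [v for v in (lf(s) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # assumption: no LF fired, so the string stays unlabeled
    counts = Counter(votes).most_common()
    best = counts[0][1]
    winners = [label for label, n in counts if n == best]
    return winners[0] if len(winners) == 1 else NEUTRAL


for s in ["http://evil.com", "QQQQ", "WSAStartup() error: %d", "GetLastInputInfo"]:
    print(repr(s), majority_label(s))
```

A plain majority vote treats every LF as equally accurate and independent, which is the simplification a learned label model is meant to remove.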
LF Examples
 Natural Language Processing
– Entropy rate
– KL Divergence
– Markov model
 Host, Network IoCs
 Malware Regexes
– encodings (base64)
– format specifiers
– user agents
[Figure: Markov model character transition probabilities; example strings http://evil.com, SOFTWAREincludeevil.pdb, t%Ft, Vr}Y with scores 0.018, 0.014, 0.007, 0.001]
[(‘QQQQ’, 0),
(’QQQQ’, 0),
(’PPPPP’, 0),
(’PPPPP’, 0),
(’WWWWW’, 0),
(‘SSSSS’, 0),
(’SSSSS’, 0),
…
(’VVVVV’, 0)]
[(‘~!@#$%^&*()_+`1234567890-=qwertyuiop[]QWERTYUIOP{}|asdfghjkl;`ASDFGHJKL:”zxcvbnm,./ZXCVBNM<> ?’,4.55),
(‘!”#$%&’()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~’, 4.54),
(‘!”#$%&’()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~’, 4.54),
…
(‘!”#$%&’()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~’, 4.54)]
[(‘ttt’, 0),
(’_&t`’, 0),
(’toti’, 0),
(’at=Q’, 0),
…
(’oti0{’, 0)]
[(‘winter’, 0.15),
(‘Finger’, 0.16),
(‘Inter’, 0.17),
…
(’Conte’, 0.18)]
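Some of the example scores above are consistent with a character-level Shannon entropy computed with the natural logarithm: a string of one repeated character scores 0, and a string of 94 distinct characters scores ln 94 ≈ 4.54. The sketch below illustrates that reading. It is an assumption about how such a feature could be computed, and `shannon_entropy` is a hypothetical helper name, not a function from StringSifter.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of the character distribution of s, in nats."""
    if not s:
        return 0.0
    n = len(s)
    return sum(-(c / n) * math.log(c / n) for c in Counter(s).values())


# The 94 printable, non-space ASCII characters, 0x21 ('!') through 0x7e ('~').
printable = "".join(chr(c) for c in range(0x21, 0x7F))

print(shannon_entropy("QQQQ"))
print(round(shannon_entropy(printable), 2))
```

A labeling function could then map very low or very high entropy to the irrelevant end of the ordinal scale and abstain otherwise; where those thresholds sit is a tuning decision the slides do not state.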
LFApplier
Learning a LabelModel
https://github.com/snorkel-team/snorkel
 LFs have different:
– Accuracies
– Correlations
– Certainties
 Learning a LabelModel
– Inverse generalized covariance matrix of LFs
– Matrix completion (Robust PCA)
 Snorkel @ ICML ’19
– https://arxiv.org/abs/1903.05844
LTR Models
 Gradient Boosted Decision Trees (GBDTs)
– combine outputs from multiple Decision Trees
– reduce loss using gradient descent
– weighted sum of trees’ predictions as ensemble
– LightGBM (https://github.com/microsoft/LightGBM)
 Histogram-binned GBDTs with LTR objective function
 Neural networks
– tf-ranking (https://github.com/tensorflow/ranking)
– Scoring Function: defines the network
– Loss (e.g. pairwise logistic), Metrics (e.g. precision)
Evaluation
 Normalized Discounted Cumulative Gain (NDCG)
– Normalized: divide DCG by ideal DCG on a ground truth holdout dataset
– Discounted: divides each string’s predicted relevance by a monotonically increasing function (log of its ranked position)
– Cumulative: the cumulative gain or summed total of every string’s relevance
– Gain: the magnitude of each string’s relevance
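The four parts of the metric can be sketched as a short function. This version assumes a linear gain (the relevance label itself) and a log2(position + 1) discount with positions counted from 1. An exponential gain of 2^label - 1 is a common variant, and the slides do not say which form was used.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    # enumerate() counts from 0, so ranked position p is index + 1
    # and the discount is log2(p + 1) = log2(index + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(scores, labels):
    """NDCG of the ranking induced by scores against ground-truth labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = [6, 3, 0]                    # ground-truth ordinal relevance
print(ndcg([0.9, 0.5, 0.1], labels))  # ranking agrees with the labels
print(ndcg([0.1, 0.5, 0.9], labels))  # ranking is reversed
```

Dividing by the ideal DCG bounds the result between 0 and 1, so a sample with a few hundred strings and one with tens of thousands can be compared on the same scale.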
Kernel Density of Test NDCG Scores
StringSifter performs well on a holdout set of 7+ years of FLARE malware reports.
Putting it All Together
Demo
Open Sourcing StringSifter
 The tool is now live
– Command line and Docker tools
– flarestrings <my_sample> | rank_strings
 FLOSS outputs, live memory dumps
 Weak Supervision for Cybersecurity
– Other label-starved problems?
 In the works
– more labeling functions, Mach-O + ELF files
https://github.com/fireeye/stringsifter
pip install stringsifter
@phtully

Learning to Rank Relevant Malware Strings Using Weak Supervision


Editor's notes

  1. Starts at hex 21 / 94 printable characters
  2. Traditional ML solves a prediction problem (classification or regression) on a single instance at a time. E.g. if you are doing spam detection on email, you will look at all the features associated with that email and classify it as spam or not. The aim of traditional ML is to come up with a class (spam or no-spam) or a single numerical score for that instance. LTR solves a ranking problem on a list of items. The aim of LTR is to come up with an optimal ordering of those items. As such, LTR doesn’t care much about the exact score that each item gets, but cares more about the relative ordering among all the items.
  7. Learning to rank:
     - Learning to rank learns to directly rank items by training a model to predict the probability of a certain item ranking over another item.
     - This is done by learning a scoring function where items ranked higher should have higher scores. The model can be trained via gradient descent on a loss function defined over these scores.
     - For each item, gradient descent pushes its score up for every item that should rank below it and pushes its score down for every item that should rank above it. The “strength” of the push is determined by the difference in scores.
     - To ensure that the model focuses on getting the higher ranks (which are generally more important) correct, we can weight the “strength” of the push by a factor that accounts for how important the ranking is.
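The push described in this note can be sketched with a pairwise logistic loss on a linear scoring function. This is an illustrative toy, not the GBDT or neural models named in the slides. The features, learning rate, and the helper name `pairwise_logistic_step` are assumptions, and the rank-importance weighting from the last point is left out.

```python
import math


def score(weights, features):
    """Linear scoring function: higher score means ranked higher."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_logistic_step(weights, items, lr=0.1):
    """One gradient-descent step over all ordered pairs in a single list.

    items is a list of (features, label) tuples. For each pair where item i
    has the higher label, the loss is log(1 + exp(-(s_i - s_j))).
    """
    scores = [score(weights, f) for f, _ in items]
    grad = [0.0] * len(weights)
    for i, (fi, yi) in enumerate(items):
        for j, (fj, yj) in enumerate(items):
            if yi > yj:
                diff = scores[i] - scores[j]
                # Strength of the push: near 1 when the pair is badly
                # mis-ordered, near 0 when it is already well separated.
                push = 1.0 / (1.0 + math.exp(diff))
                for k in range(len(weights)):
                    grad[k] -= push * (fi[k] - fj[k])
    return [w - lr * g for w, g in zip(weights, grad)]


# Toy list: two made-up features per string, with ordinal labels 2 > 1 > 0.
items = [([1.0, 0.0], 2), ([0.5, 0.5], 1), ([0.0, 1.0], 0)]
weights = [0.0, 0.0]
for _ in range(100):
    weights = pairwise_logistic_step(weights, items)
print([round(score(weights, f), 3) for f, _ in items])
```

Only score differences enter the loss, so the absolute scores carry no meaning; what the model learns is the relative order.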
  8. Discounting reflects the goal of having the most relevant strings ranked towards the top of our predictions. Normalization makes it possible to compare scores across samples, since the number of strings within different Strings outputs can vary widely. The ground truth is obtained from FLARE-identified relevant strings contained within historical malware reports.