A WHIRLWIND TOUR
OF ACADEMIC TECHNIQUES
FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare, Deakin University
Introduction








Started off in industry (Qualys, now Volvent).
Have a Masters by Research.
About to receive a PhD from Deakin University.
Last 5 years in post-graduate University research.
Learnt some cool things along the way.
What did I do at University?


Malwise v1 (Masters)
 Malware variant detection system.
Malwise v2
 Improved version.
Simseer
 Binary comparison and visualization service.
Clonewise
 Automated detection of embedded libraries in source.
Simseer Cluster
 Binary clustering service.
Simseer Search
 Further improved malware variant search service.
Bugalyze
 Detection of bugs using data flow analysis.
Outline








Mathematical Objects
Comparing
Similarity Searching
Classification
Clustering
Program Analysis
An incomplete list of mathematical
objects








Strings
Vectors
Sets
Sets of Objects
Trees
Graphs
Objects




Different objects have different comparison costs.
Example
 Comparing two vectors is fairly fast.
 Exact matching of two strings is fairly fast.
 Inexact matching of two strings is moderately slow (sequence alignment is O(mn)).
 Comparing two graphs is slow.
[Figure: sequence alignment of two strings, "A K T KT K" above "A TK TT T K"; cost O(mn).]
Transforming one object to another


Problem
 Comparing two 100 KB strings using the edit distance is impractically slow.
 Example: ed(“hello”, “ggello”) = 2
Solution
 Transform the strings into vectors.
 Then, use a vector comparison, which is fast.
Examples
 Comparing malware samples
 Finding near duplicate web pages
 Comparing E-Mails
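To make the cost of the edit distance concrete, here is a minimal sketch of the textbook dynamic programme (Levenshtein distance). The function name edit_distance is illustrative and not taken from the talk. It does O(mn) work for strings of length m and n, which is why two 100 KB strings are impractical to compare this way.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2, as on the slide
```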
N-Grams







Extract all N-length substrings (N-Grams) from original string.
From training set of strings, choose best N-Grams.
Each unique N-Gram is an index in a vector.
The value of the element is the number of times it occurs.
W|IEH}R

W|IE
|IEH
IEH}
EH}R
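The N-Gram steps above can be sketched as follows. The helper names and the vocabulary are assumptions for illustration; how the "best" N-Grams are chosen from a training set is not shown here.

```python
from collections import Counter


def ngrams(s: str, n: int) -> list:
    """All substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_vector(s: str, vocabulary: list, n: int) -> list:
    """Count vector: element k is how often vocabulary[k] occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in vocabulary]


# The string from the slide, split into 4-grams.
print(ngrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']

# Hypothetical vocabulary of selected 4-grams.
vocabulary = ["W|IE", "IEH}", "BSSR"]
print(ngram_vector("W|IEH}R", vocabulary, 4))  # [1, 1, 0]
```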
Another N-Gram example





Extract N-Grams
Represent new object as a ‘Set of N-Grams’
Compare sets using set similarity metrics
A Graph problem










Graph problems like approximate similarity are slow to solve.
Decompose graph into subgraphs of at most k nodes.
Canonicalize small graphs, represent by adjacency matrix, transform to string.
Graph is now a ‘Set of Strings’.
Optionally represent as vector of ‘important k-subgraphs’.
Use vector distance metrics to compare, index, and search.
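A rough sketch of the decomposition, under stated assumptions: each subgraph is grown by a breadth-first walk of at most k nodes from every start node, and the walk order stands in for canonicalization (a real canonical form would need more care). The function name and the dict-of-lists graph encoding are hypothetical, not the talk's implementation.

```python
from collections import deque


def k_subgraph_strings(graph: dict, k: int) -> set:
    """graph maps each node to its list of successors (every node is a key)."""
    result = set()
    for start in graph:
        # Breadth-first walk collecting at most k nodes.
        order, seen, queue = [start], {start}, deque([start])
        while queue and len(order) < k:
            node = queue.popleft()
            for succ in graph[node]:
                if succ not in seen and len(order) < k:
                    seen.add(succ)
                    order.append(succ)
                    queue.append(succ)
        # Adjacency matrix of the induced subgraph, flattened row by row.
        index = {n: i for i, n in enumerate(order)}
        size = len(order)
        bits = ["0"] * (size * size)
        for n in order:
            for succ in graph[n]:
                if succ in index:
                    bits[index[n] * size + index[succ]] = "1"
        result.add("".join(bits))
    return result


# Small example graph: 1 -> 2, 1 -> 4, 2 -> 3, 4 -> 3.
cfg = {1: [2, 4], 2: [3], 3: [], 4: [3]}
print(sorted(k_subgraph_strings(cfg, 3)))
```

The resulting set of strings can then be compared with a set similarity measure, or turned into a count vector over a chosen list of subgraph strings.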
K-subgraph decomposition
[Figure: a control flow graph with nodes L_0 to L_7 and edges labelled true is decomposed into k-node subgraphs; three example subgraphs are shown with their 7x7 adjacency matrices of 0/1 entries.]
Graphs – Case Study






Implemented in Malwise and Simseer
Take control flow graphs of programs.
Decompile into strings.
One:
 Consider program as a vector of N-Grams of decompiled strings.
Two:
 Consider program as a set of strings.
[Figure: control flow graph with nodes L_0 to L_7 and edges labelled true, shown beside its decompiled form.]
proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}
Final Remarks on Objects




Know how to represent your problem.
Look into how the representation can be approximated
 By transforming it into another object
Vectors are often a good choice.
Comparing


Problem
 Measure the similarity (or distance) between two objects.
Solution
 Represent objects mathematically.
 Use one of the many mathematical measures.
Examples
 Malware similarity
 Near duplicate web pages
Comparing Sets






A set is a collection of elements.
Given an equality function between elements, we can measure set similarity.
Inexact matching
 Dice coefficient: $s = \frac{2\,|A \cap B|}{|A| + |B|}$
 Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Comparing Vectors – Ugh, math.


Euclidean Distance
 $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
Manhattan Distance
 $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$
Cosine Similarity
 $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
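The three vector measures can be written in a few lines each. This is a plain-Python sketch with illustrative function names; vectors are assumed to be equal-length sequences of numbers.

```python
import math


def euclidean(p, q) -> float:
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def manhattan(p, q) -> float:
    return sum(abs(b - a) for a, b in zip(p, q))


def cosine_similarity(p, q) -> float:
    # Undefined for an all-zero vector (raises ZeroDivisionError).
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


p = [1, 0, 2]
q = [0, 1, 2]
print(euclidean(p, q))          # sqrt(2)
print(manhattan(p, q))          # 2
print(cosine_similarity(p, q))  # 4/5, up to floating-point rounding
```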
Vector distance – a different look




A vector is an n-dimensional point in space.
E.g., a 2-d vector is <x,y>
Cosine similarity






Line from origin to n-dimensional point.
Given 2 lines, what’s the angle (theta) between
them?
The smaller the angle, the more similar.
[Figure: two lines from the origin to Point A and Point B, with the angle Theta between them.]
Comparing Vectors – Case Study


Malwise v2
 Feature vector of N-Grams of decompiled flowgraphs
 Manhattan Distance
Simseer Search
 Same feature vector
 Euclidean Distance
Comparing Sets – Case Study








Malwise v1
An element is a graph invariant of the control flow
graph, represented as an integer.
A program is a set of integers.
Compare similarity between two programs using
Dice coefficient.
Malwise v1 - Comparing Sets

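[Figure: control flow graph with nodes 1 to 4 and branch edges labelled T and F. Out-edges of nodes 1 to 4, in order:]
(1 -> 2), (1 -> 4)
(2 -> 3), ()
(), ()
(4 -> 3), ()
$s(A, B) = \frac{2 \sum_i w_i \, |A_i \cap B_i|}{\sum_i w_i \, |A_i| + \sum_i w_i \, |B_i|}$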
Comparing Sets of Strings in Malwise
v2 – Case Study






String is a decompiled flowgraph.
Program is a set of strings.
Edit distance between strings.
Construct 1:1 mapping between elements of sets:
 Such that the sum of distances is minimized.
Solved using ‘combinatorial optimisation’
 Assignment Problem
 Solution by “graph matching”
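As an illustration of the assignment problem, the sketch below tries every 1:1 mapping by brute force and keeps the cheapest one. This only works for tiny sets; it is not the graph-matching solver the talk refers to, and the function names are assumptions. Leftover strings in the larger set are simply ignored here, which is also an assumption. The example inputs reuse strings that appear in this deck.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_cost_matching(p: list, q: list):
    """Map each string of the smaller list to a distinct string of the
    larger list so that the summed edit distance is minimal."""
    small, large = (p, q) if len(p) <= len(q) else (q, p)
    best_cost, best_pairs = None, None
    for perm in permutations(large, len(small)):
        cost = sum(edit_distance(s, t) for s, t in zip(small, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(small, perm))
    return best_cost, best_pairs


p = ["BR", "BW|{B}BR", "BI{B}BR", "BSSR", "BSR", "BSSSR"]
q = ["BR", "BW|{B}BR", "BSSR"]
cost, pairs = min_cost_matching(p, q)
print(cost)   # 0: every string of q also occurs in p
print(pairs)
```

For realistic set sizes a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the permutation loop.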
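Malwise v2 - Comparing Sets of Strings
[Figure: control flow graph with nodes L_0 to L_7 and edges labelled true, for the procedure below.]
proc(){
L_0:
while (v1 || v2) {
L_1:
if (v3) {
L_2:
} else {
L_4:
}
L_5:
}
L_7:
return;
}
Decompiled string: W|IEH}R
[Figure: the strings of program p are mapped 1:1 onto the strings of program q; each mapped pair has distance d=ed(p,q).]
p
 BR
 BW|{B}BR
 BI{B}BR
 BSSR
 BSR
 BSSSR
q
 BR
 BW|{B}BR
 BSSR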
Final Remarks on Comparing




Inexact matching is your friend.
Try to use known distance metrics.
 They have useful properties and index better.
If it’s too slow to compare, transform the object.
Similarity Searching


Problem
 Find all ‘similar’ objects to my query in a database
Example
 Find all words in a dictionary with at most 3 differences to my query word.
This problem is known as a ‘similarity search’
Solution
 Naive exhaustive search.
 Better to use ‘Metric Trees’
Similarity Search Constraints


Variations
 K-nearest neighbours: the k closest objects to the query.
 Range search: all objects within a specific distance of the query.
Search based on using a ‘metric distance’.
Metric distances satisfy mathematical properties: non-negativity, identity, symmetry, and the triangle inequality.
Examples
 Euclidean Distance
 Jaccard Distance
 Counter-example: Cosine Distance is not a metric
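A small numeric check shows why cosine distance fails as a metric. The sketch assumes the common definition of cosine distance as 1 minus cosine similarity, and picks three 2-d vectors at 0, 45 and 90 degrees; the triangle inequality d(a,c) <= d(a,b) + d(b,c) does not hold for them.

```python
import math


def cosine_distance(p, q) -> float:
    """1 - cos(theta) between p and q (assumed definition)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
direct = cosine_distance(a, c)                          # 1.0
detour = cosine_distance(a, b) + cosine_distance(b, c)  # about 0.586
print(direct, detour)
print(direct <= detour)  # False: the triangle inequality is violated
```

Metric trees prune using the triangle inequality, so a distance that violates it can make them miss results.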
Searching – Case Study


Malwise v2
 Distance metric is Manhattan Distance.
 Use VP-Trees to index and search in stage 1.
 Use DBM-Trees to index and search in stage 2.
 Implemented using open source GBDI Arboretum library.
[Figure: a query q with search radius r and a database point p at distance d(p,q); panels labelled Query Benign and Query Malicious, with points marked Query and Malware.]
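To show how a metric tree prunes a range search, here is a minimal vantage-point tree sketch using Manhattan distance. It is an illustration only, not the GBDI Arboretum API; the class and function names are assumptions, and the vantage point is simply the first point in each list.

```python
import statistics


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class VPNode:
    def __init__(self, points, dist):
        self.vantage = points[0]  # assumed choice; random picks are also common
        self.radius = 0.0
        self.inside = None        # points with distance <= radius
        self.outside = None       # points with distance > radius
        rest = points[1:]
        if not rest:
            return
        dists = [dist(self.vantage, p) for p in rest]
        self.radius = statistics.median(dists)
        inner = [p for p, d in zip(rest, dists) if d <= self.radius]
        outer = [p for p, d in zip(rest, dists) if d > self.radius]
        if inner:
            self.inside = VPNode(inner, dist)
        if outer:
            self.outside = VPNode(outer, dist)


def range_search(node, query, r, dist):
    """All stored points within distance r of query."""
    if node is None:
        return []
    d = dist(node.vantage, query)
    found = [node.vantage] if d <= r else []
    # Triangle inequality: a match lies between d - r and d + r from the
    # vantage point, so a subtree outside that band can be skipped.
    if d - r <= node.radius:
        found += range_search(node.inside, query, r, dist)
    if d + r > node.radius:
        found += range_search(node.outside, query, r, dist)
    return found


points = [(0, 0), (1, 2), (5, 5), (6, 5), (2, 1), (9, 9)]
tree = VPNode(points, manhattan)
print(sorted(range_search(tree, (1, 1), 2, manhattan)))
```

The pruning is only sound because Manhattan distance is a metric.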
Final Remarks on Searching





Searching for inexact matches is useful.
Use good distance metrics.
Use open source libraries.
Classification


The problem:
 Given a set of N classes.
 And a query object.
 Assign one of the classes to the object.
[Figure: objects separated into Class A and Class B.]
Examples
 Is this binary (malicious, not malicious)?
 Is this Gmail email (primary, social, promotional)?
 Is this web page (defaced, not defaced)?
Classification Methodology


Supervised Learning
 Given a training set of objects labelled by their class.
 Build a model.
 Then use the model to classify unknown objects.
Unsupervised Learning
 No labelled data exists.
 “Cluster” objects into classes.
 Use clusters to train model.
 Then classify as normal.
Classification – What do I have to do?








Represent objects using “feature vectors”
A vector is an array.
Each element represents a “feature”.
The value of the element tends to be a count of something, or a size.
Feature examples
 The number of times a dictionary word such as “Hello” appears in an E-Mail.
 The size of a binary.
 The number of times LoadLibraryA is executed.
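A feature vector for the E-Mail example might be built as below. The dictionary words, the tokenising regular expression and the function name are all assumptions made for illustration; the last element is the message size in characters.

```python
import re
from collections import Counter


def email_features(text: str, dictionary: list) -> list:
    """One count per dictionary word, followed by the message length."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return [words[w] for w in dictionary] + [len(text)]


dictionary = ["hello", "invoice", "free"]  # hypothetical feature words
text = "Hello, hello! Your FREE invoice is attached."
print(email_features(text, dictionary))
```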
Classification – WEKA?







Put the feature vectors into the text-based ARFF file format.
Plug into the WEKA machine learning toolkit.
Experiment with different classifiers.
Part of your labelled data can be held out to evaluate the accuracy.
Weka ARFF file
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,?
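An ARFF file like the one above is plain text, so it can be produced with a few lines of Python. This sketch handles only numeric attributes plus one nominal class attribute; the function name and argument layout are assumptions. A label of None is written as "?", which ARFF uses for a missing value.

```python
def to_arff(relation, attributes, class_values, rows) -> str:
    """rows is a list of (feature_list, label) pairs; label may be None."""
    lines = [f"@RELATION {relation}"]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines.append("@DATA")
    for features, label in rows:
        values = [str(v) for v in features]
        values.append(label if label is not None else "?")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


arff = to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    [([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
     ([5.0, 3.6, 1.4, 0.2], None)],  # unlabelled object to classify
)
print(arff)
```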
WEKA
[Figure: screenshot of the WEKA toolkit; slide footer reads 10/25/2013, University of Waikato.]
Classification – Case Study


Clonewise
 Feature vector is a set of features extracted from a pair of packages.
 Classify: do these packages share code (yes, no)?
 Classify: is the 1st package embedded in the 2nd package (yes, no)?
Final Remarks on Classification





Lots of problems can be framed as classification.
Learn how to use WEKA.
Vectors are very good representations.
Clustering


Problem
 To group together “similar” objects under some notion of similarity.
Easy solution
 Represent objects using “feature vectors”.
 Plug into WEKA.
[Figure: packages in Fedora Linux, clustered.]
Clustering - Case Study


Simseer Cluster
 Represent binaries using N-Grams of decompiled flowgraphs.
 Use most frequent N-Grams as features.
 Distance measure is cosine distance.
Final Remarks on Clustering




A classic machine learning problem.
Again, learn to use WEKA.
Program Analysis





An incredibly large and deep field.
This section skims the surface.
Main approaches
 Theorem Proving
 Model Checking
 Abstract Interpretation
 Data Flow Analysis
Model Checking





Looks at program states generated by a program.
Some states indicate bugs.
Try BLAST, a model checker for small C programs.
 Caveat: it’s pretty old now.
Theorem Proving - SMT


SMT (Satisfiability Modulo Theories) – what is it?
 An equation solver that covers the types of operations seen in machine code.
Approach for Bug Detection
 User input can generally be anything, so treat it as a “symbolic” variable.
 The rest is concrete.
 Simulate execution of the program, plugging all the machine code that is executed into the solver formulas.
Concolic execution
 Combining symbolic execution with concrete execution.
Concolic Execution







At branches, can we have user input that forces us to go down each path?
Use the SMT solver to tell us.
Launch execution down ‘feasible’ paths.
Use the solver to tell us if bugs are present.
 What user input, if any, can make this pointer NULL?
Concolic path-sensitive analysis
[Figure: control flow graph of _main; the basic blocks are numbered 1 to 4.]
Block 1
lea 0x4(%esp),%ecx
and $0xfffffff0,%esp
pushl -0x4(%ecx)
push %ebp
mov %esp,%ebp
push %ecx
sub $0x24,%esp
call 4011b0 <___main>
movl $0x0,-0x8(%ebp)
jmp 40115f <_main+0x2f>
Block 3
movl $0x4020a0,(%esp)
call 4011b8 <_puts>
addl $0x1,-0x8(%ebp)
Block 2
cmpl $0x9,-0x8(%ebp)
jle 40114f <_main+0x1f>
Block 4
add $0x24,%esp
pop %ecx
pop %ebp
lea -0x4(%ecx),%esp
ret
Abstract Interpretation




Abstract the execution of the program.
Example
 Only consider the sign of a variable, not the actual value.
Requires a transfer function
 What an instruction does to the abstract data.
And a Join/Meet function
 How data is combined when it meets from different control flow.
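The sign example can be sketched as a tiny abstract domain. The names below (NEG, ZERO, POS, TOP and the helper functions) are illustrative assumptions; TOP means "sign unknown", and the sketch leaves out a bottom element for unreachable code.

```python
NEG, ZERO, POS, TOP = "-", "0", "+", "?"


def abstract(n: int) -> str:
    """Map a concrete integer to its sign."""
    return NEG if n < 0 else POS if n > 0 else ZERO


def mul(a: str, b: str) -> str:
    """Transfer function for multiplication."""
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def add(a: str, b: str) -> str:
    """Transfer function for addition."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else TOP  # mixed signs: result unknown


def join(a: str, b: str) -> str:
    """Combine facts arriving from two control flow paths."""
    return a if a == b else TOP


# if (c) x = 3; else x = 7;   y = x * -2;   z = y + x;
x = join(abstract(3), abstract(7))
y = mul(x, abstract(-2))
z = add(y, x)
print(x, y, z)  # + - ?
```

The analysis proves that y is negative on every path without running the program, but it cannot decide the sign of z.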
Data Flow Analysis


Similar to abstract interpretation.
 Uses a transfer function and a join.
 Implement both using a monotone framework.
Data Flow analysis is used by compilers.
Classic data flow problems
 Reaching definitions: how far a definition of (assignment to) a variable reaches.
 Liveness: knowing if a variable will be read again before being assigned a new value.
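Liveness, the second classic problem, can be sketched as a fixed-point iteration over a control flow graph. The block names, the use/def encoding and the function name are assumptions for illustration. The transfer function is in = use | (out - def), and the join is the union over successors.

```python
def liveness(blocks: dict, successors: dict):
    """blocks: name -> (use, define) sets, where use holds variables read
    before any write in the block; successors: name -> list of names."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate until nothing changes (fixed point)
        changed = False
        for b, (use, define) in blocks.items():
            out = set().union(*(live_in[s] for s in successors[b]))  # join
            new_in = use | (out - define)                            # transfer
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return live_in, live_out


# i = 0; while (i <= 9) { puts(...); i += 1; } return;
blocks = {
    "entry": (set(), {"i"}),
    "test": ({"i"}, set()),
    "body": ({"i"}, {"i"}),
    "exit": (set(), set()),
}
successors = {"entry": ["test"], "test": ["body", "exit"],
              "body": ["test"], "exit": []}
live_in, live_out = liveness(blocks, successors)
print(live_in)
print(live_out)
```

The sets only ever grow, which is the monotone property that guarantees the loop terminates.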
Data Flow Analysis – Case Study




Implemented in Bugalyze.
Example bug detection
 After free(ptr), where is ptr used before it is reassigned (use-after-free), and is it passed to free again (double free)?
Has found real bugs in Debian Linux.
Still a work-in-progress.
Bugalyze – Case Study
Final Remarks on Program Analysis





A wide and deep field.
Good to know the basic approaches.
Reversing is becoming more rigorous (think HexRays).
Conclusion






Academia has some useful techniques.
It’s good to know some of the basic methods.
Will improve industrial programs.
Any questions?

Contenu connexe

Tendances

KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...Simplilearn
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierNeha Kulkarni
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiersKrish_ver2
 
Neural network for machine learning
Neural network for machine learningNeural network for machine learning
Neural network for machine learningUjjawal
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmIRJET Journal
 
25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centers25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centersAndres Mendez-Vazquez
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval byIJNSA Journal
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptationtaeseon ryu
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 

Tendances (20)

LectureNotes-05-DSA
LectureNotes-05-DSALectureNotes-05-DSA
LectureNotes-05-DSA
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
K nearest neighbor
K nearest neighborK nearest neighbor
K nearest neighbor
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERINGCOMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
COMPUTER LABORATORY-4 LAB MANUAL BE COMPUTER ENGINEERING
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
 
Neural network for machine learning
Neural network for machine learningNeural network for machine learning
Neural network for machine learning
 
A Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means AlgorithmA Study of Efficiency Improvements Technique for K-Means Algorithm
A Study of Efficiency Improvements Technique for K-Means Algorithm
 
25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centers25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centers
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval by
 
CarroNatali
CarroNataliCarroNatali
CarroNatali
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
LectureNotes-02-DSA
LectureNotes-02-DSALectureNotes-02-DSA
LectureNotes-02-DSA
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 

En vedette

新浪内部对腾讯公司的深度解析
新浪内部对腾讯公司的深度解析新浪内部对腾讯公司的深度解析
新浪内部对腾讯公司的深度解析Vianne Cai
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource KernelsSilvio Cesare
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisSilvio Cesare
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisSilvio Cesare
 
異種・協調・分散ロボットに関する研究
異種・協調・分散ロボットに関する研究異種・協調・分散ロボットに関する研究
異種・協調・分散ロボットに関する研究haganemetal
 
微博合作介绍 V0.2
微博合作介绍 V0.2微博合作介绍 V0.2
微博合作介绍 V0.2Vianne Cai
 
Moto%20 x%20project
Moto%20 x%20projectMoto%20 x%20project
Moto%20 x%20projectgeneralvee
 

En vedette (7)

新浪内部对腾讯公司的深度解析
新浪内部对腾讯公司的深度解析新浪内部对腾讯公司的深度解析
新浪内部对腾讯公司的深度解析
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
 
異種・協調・分散ロボットに関する研究
異種・協調・分散ロボットに関する研究異種・協調・分散ロボットに関する研究
異種・協調・分散ロボットに関する研究
 
微博合作介绍 V0.2
微博合作介绍 V0.2微博合作介绍 V0.2
微博合作介绍 V0.2
 
Moto%20 x%20project
Moto%20 x%20projectMoto%20 x%20project
Moto%20 x%20project
 

Similaire à A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionSiddharth Shrivastava
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docx
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docxThe Java Learning Kit Chapter 6 – Arrays Copyrigh.docx
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docxarnoldmeredith47041
 
Python for data science
Python for data sciencePython for data science
Python for data sciencebotsplash.com
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRitesh Sawant
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)Learnbay Datascience
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical EquationsIRJET Journal
 
Linear Regression Parameters
Linear Regression ParametersLinear Regression Parameters
Linear Regression Parameterscamposer
 
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Matthew Rowe
 
Clustering
ClusteringClustering
ClusteringMeme Hei
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Simplilearn
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 

Similaire à A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS (20)

Machine Learning - Simple Linear Regression
Machine Learning - Simple Linear RegressionMachine Learning - Simple Linear Regression
Machine Learning - Simple Linear Regression
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
Download
DownloadDownload
Download
 
Download
DownloadDownload
Download
 
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docx
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docxThe Java Learning Kit Chapter 6 – Arrays Copyrigh.docx
The Java Learning Kit Chapter 6 – Arrays Copyrigh.docx
 
Python for data science
Python for data sciencePython for data science
Python for data science
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learning
 
PythonML.pptx
PythonML.pptxPythonML.pptx
PythonML.pptx
 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical Equations
 
Linear Regression Parameters
Linear Regression ParametersLinear Regression Parameters
Linear Regression Parameters
 
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
 
Datamining with R
Datamining with RDatamining with R
Datamining with R
 
Clustering
ClusteringClustering
Clustering
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
 
Bj24390398
Bj24390398Bj24390398
Bj24390398
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 

Plus de Silvio Cesare

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGSilvio Cesare
 
Simseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySimseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySilvio Cesare
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Silvio Cesare
 
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...Silvio Cesare
 
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...Silvio Cesare
 
Effective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionEffective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionSilvio Cesare
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSilvio Cesare
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationSilvio Cesare
 
Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxSilvio Cesare
 
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Silvio Cesare
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSilvio Cesare
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareSilvio Cesare
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowSilvio Cesare
 
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...Silvio Cesare
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For EmulationSilvio Cesare
 

A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

  • 1. A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS Silvio Cesare, Deakin University
  • 2. Introduction: Started off in industry (Qualys, now Volvent). Have a Masters by Research. About to receive a PhD from Deakin University. Last 5 years in post-graduate University research. Learnt some cool things along the way.
  • 3. What did I do at University? Malwise v1 (Masters): malware variant detection system. Malwise v2: improved version. Clonewise: automated detection of embedded libraries in source. Simseer: binary comparison and visualization service. Simseer Cluster: binary clustering service. Simseer Search: a further improved malware variant search service. Bugalyze: detection of bugs using data flow analysis.
  • 5. An incomplete list of mathematical objects: Strings, Vectors, Sets, Sets of Objects, Trees, Graphs.
  • 6. Objects: Different objects have different performance characteristics. Examples: Comparing two vectors is fairly fast. Exact matching of two strings is fairly fast. Inexact matching of two strings is of medium speed (sequence alignment is O(mn)). Comparing two graphs is slow. [Figure: sequence alignment of two strings.]
  • 7. Transforming one object to another. Problem: Comparing two 100kb strings using the edit distance, e.g. ed(“hello”, “ggello”) = 2, is impractically slow. Solution: Transform the strings into vectors. Then use a vector comparison, which is fast. Examples: comparing malware samples, finding near-duplicate web pages, comparing e-mails.
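
The edit distance on slide 7 can be sketched with the textbook dynamic-programming recurrence. This is a minimal illustration, not the code used in Malwise or Simseer; the function name `edit_distance` is chosen here for the example. Its O(mn) cost is what makes it impractical for 100kb inputs.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn a into b. O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2, as on the slide
```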
  • 8. N-Grams: Extract all N-length substrings (N-Grams) from the original string. From a training set of strings, choose the best N-Grams. Each unique N-Gram is an index in a vector. The value of the element is the number of times it occurs. Example: the 4-Grams of W|IEH}R are W|IE, |IEH, IEH} and EH}R.
  • 9. Another N-Gram example: Extract N-Grams. Represent the new object as a ‘Set of N-Grams’. Compare sets using set similarity metrics.
  • 10. A Graph problem: Graph problems like approximate similarity are slow to solve. Decompose the graph into subgraphs of at most k nodes. Canonicalize the small graphs, represent each by its adjacency matrix, and transform it to a string. The graph is now a ‘Set of Strings’. Optionally represent it as a vector of ‘important k-subgraphs’. Use vector distance metrics to compare, index, and search.
  • 12. Graphs – Case Study: Implemented in Malwise and Simseer. Take control flow graphs of programs. Decompile into strings. One: consider the program as a vector of N-Grams of decompiled strings. Two: consider the program as a set of strings. [Figure: control flow graph with nodes L_0 to L_7 and true-branch edges, alongside its decompiled form: proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; }]
  • 13. Final Remarks on Objects: Know how to represent your problem. Look into how the representation can be approximated by transforming it into another object. Vectors are often a good choice.
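
The N-Gram steps on slide 8 might look like the sketch below. The helper names (`ngrams`, `top_ngrams`, `feature_vector`) are made up for this example, and "best" N-Grams is interpreted here as "most frequent in the training set", which is an assumption.

```python
from collections import Counter


def ngrams(s, n):
    """All n-length substrings of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def top_ngrams(training, n, k):
    """Assumption: the 'best' N-Grams are the k most frequent ones
    across the training strings."""
    counts = Counter(g for s in training for g in ngrams(s, n))
    return [g for g, _ in counts.most_common(k)]


def feature_vector(s, vocabulary, n):
    """One element per vocabulary N-Gram; the value is its count in s."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in vocabulary]


print(ngrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
print(feature_vector("abab", ["ab", "ba", "zz"], 2))  # [2, 1, 0]
```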
  • 14. Comparing. Problem: Measure the similarity of (or distance between) two objects. Solution: Represent objects mathematically. Use one of the multitude of mathematical measures. Examples: malware similarity, near-duplicate web pages.
  • 15. Comparing Sets: A set is a collection of elements. Given an equality function between elements, we can measure set similarity. Measures for inexact matching: Dice coefficient, $s = \frac{2|A \cap B|}{|A| + |B|}$; Jaccard index, $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
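
Both set measures are a few lines each with Python sets. This is a sketch; treating two empty sets as identical (similarity 1.0) is a convention chosen here, not something the slides specify.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A n B| / |A u B|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)


a, b = {1, 2, 3}, {2, 3, 4}
print(dice(a, b), jaccard(a, b))
```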
  • 16. Comparing Vectors – Ugh, math. Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$. Manhattan Distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$. Cosine Similarity: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$.
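
The three vector measures translate directly into code. A minimal sketch using only the standard library; the function names are chosen for this example, and cosine similarity is left undefined (raises an error) for a zero vector.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # ZeroDivisionError for a zero vector


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```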
  • 17. Vector distance – a different look   A vector is an n-dimensional point in space. E.g., a 2-d vector is <x,y>
  • 18. Cosine similarity: Draw a line from the origin to each n-dimensional point. Given 2 lines, what’s the angle (theta) between them? The smaller the angle, the more similar. [Figure: points A and B, with the angle theta between the lines from the origin.]
  • 19. Comparing Vectors – Case Study. Malwise v2: feature vector of N-Grams of decompiled flowgraphs, compared with Manhattan Distance. Simseer Search: same feature vector, compared with Euclidean Distance.
  • 20. Comparing Sets – Case Study     Malwise v1 An element is a graph invariant of the control flow graph, represented as an integer. A program is a set of integers. Compare similarity between two programs using Dice coefficient.
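  • 21. Malwise v1 - Comparing Sets. [Figure: control flow graph with nodes 1 to 4 and true/false edges, with its ordered edge list: (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ().] $s(A, B) = \frac{2 \sum_i w_i \times (A_i \cap B_i)}{\sum_i w_i \times A_i + \sum_i w_i \times B_i}$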
  • 22. Comparing Sets of Strings in Malwise v2 – Case Study: A string is a decompiled flowgraph. A program is a set of strings. Use the edit distance between strings. Construct a 1:1 mapping between the elements of the two sets such that the sum of distances is minimized. Solved using ‘combinatorial optimisation’: this is the Assignment Problem, with a solution by “graph matching”.
  • 23. Malwise v2 - Comparing Sets of Strings. [Figure: the control flow graph of proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; } decompiles to the string W|IEH}R. Two programs p and q, each a set of strings such as BR, BW|{B}BR, BI{B}BR, BSSR, BSR and BSSSR, are matched element by element with d = ed(p, q).]
  • 24. Final Remarks on Comparing: Inexact matching is your friend. Try to use known distance metrics. They have useful properties and index better. If it’s too slow to compare, transform the object.
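
To make the assignment problem on slide 22 concrete, here is a brute-force sketch that tries every 1:1 mapping and keeps the cheapest. It is only usable for tiny sets, since the number of mappings grows factorially; polynomial-time methods such as the Hungarian algorithm exist for real workloads. The function names are made up for this example and this is not the Malwise implementation.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_cost_matching(p, q, dist=edit_distance):
    """Map each element of p to a distinct element of q so that the
    summed distance is minimal. Brute force over all mappings."""
    if len(p) > len(q):
        raise ValueError("p must not be larger than q")
    best_cost, best_pairs = None, []
    for perm in permutations(range(len(q)), len(p)):
        cost = sum(dist(p[i], q[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_pairs = [(p[i], q[j]) for i, j in enumerate(perm)]
    return best_cost, best_pairs


cost, pairs = min_cost_matching(["BR", "BSSR"], ["BSSSR", "BR"])
print(cost, pairs)  # 1 [('BR', 'BR'), ('BSSR', 'BSSSR')]
```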
  • 25. Similarity Searching. Problem: Find all objects in a database that are ‘similar’ to my query. Example: Find all words in a dictionary with at most 3 differences to my query word. This problem is known as a ‘similarity search’. Solution: Naive exhaustive search works, but it is better to use ‘Metric Trees’.
  • 26. Similarity Search Constraints. Variations: K-nearest neighbours (the k closest objects to the query), or all objects within a specific distance of the query. The search is based on a ‘metric distance’. Metric distances satisfy mathematical properties such as the triangle inequality. Examples: Euclidean Distance and Jaccard Distance are metrics. Cosine Distance is not a metric.
  • 27. Searching – Case Study. Malwise v2: Distance metric is Manhattan Distance. Use VP-Trees to index and search in stage 1. Use DBM-Trees to index and search in stage 2. Implemented using the open source GBDI Arboretum library. [Figure: range query of radius r around query q, with distance d(p,q) to a database object p; queries are labelled benign or malicious.]
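
One kind of metric tree is the vantage-point (VP) tree. The sketch below indexes points under the Manhattan distance and answers "all objects within distance r" queries, using the triangle inequality to skip subtrees. It is an illustrative toy with names invented here (`VPNode`, `build`, `range_search`), not the Arboretum library's API.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point
        self.radius = radius    # median distance to the other points
        self.inside = inside    # subtree with dist <= radius
        self.outside = outside  # subtree with dist > radius


def build(points, dist, rng=None):
    """Recursively split points around a randomly chosen vantage point."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    radius = sorted(dist(vp, p) for p in points)[len(points) // 2]
    inside = [p for p in points if dist(vp, p) <= radius]
    outside = [p for p in points if dist(vp, p) > radius]
    return VPNode(vp, radius, build(inside, dist, rng),
                  build(outside, dist, rng))


def range_search(node, query, r, dist, out):
    """Append to out every indexed point within distance r of query."""
    if node is None:
        return out
    d = dist(query, node.point)
    if d <= r:
        out.append(node.point)
    if d - r <= node.radius:   # a match may lie inside the ball
        range_search(node.inside, query, r, dist, out)
    if d + r > node.radius:    # a match may lie outside the ball
        range_search(node.outside, query, r, dist, out)
    return out


grid = [(x, y) for x in range(5) for y in range(5)]
tree = build(grid, manhattan)
print(sorted(range_search(tree, (2, 2), 1, manhattan, [])))
```

The pruning is only sound because the distance is a metric, which is why the slide insists on metric distances.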
  • 28. Final Remarks on Searching    Searching for inexact matches is useful. Use good distance metrics. Use open source libraries.
  • 29. Classification. The problem: Given a set of N classes and a query object, assign one of the classes to the object. Examples: Is this binary (malicious, not malicious)? Is this Gmail email (primary, social, promotional)? Is this web page (defaced, not defaced)? [Figure: objects separated into Class A and Class B.]
  • 30. Classification Methodology. Supervised Learning: Given a training set of objects labelled by their class, build a model. Then use the model to classify unknown objects. Unsupervised Learning: No labelled data exists. “Cluster” objects into classes. Use the clusters to train a model. Then classify as per normal.
  • 31. Classification – What do I have to do?      Represent objects using “feature vectors” A vector is an array. Each element represents a “feature”. The value of the element tends to be a count of something, or a size. Feature examples  The number of times a dictionary word such as “Hello” appears in an E-Mail.  The size of a binary.  The number of times LoadLibraryA is executed.
  • 32. Classification – WEKA?     Put the feature vectors into the text-based ARFF file format. Plug into the WEKA machine learning toolkit. Experiment with different classifiers. Part of your labelled data can be used to evaluate the accuracy.
  • 33. Weka ARFF file:
      @RELATION iris
      @ATTRIBUTE sepallength NUMERIC
      @ATTRIBUTE sepalwidth NUMERIC
      @ATTRIBUTE petallength NUMERIC
      @ATTRIBUTE petalwidth NUMERIC
      @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
      @DATA
      5.1,3.5,1.4,0.2,Iris-setosa
      4.9,3.0,1.4,0.2,Iris-setosa
      4.7,3.2,1.3,0.2,Iris-setosa
      4.6,3.1,1.5,0.2,Iris-setosa
      5.0,3.6,1.4,0.2,?
  • 35. Classification – Case Study. Clonewise: The feature vector is a set of features extracted from a pair of packages. Classify: do these packages share code (yes, no)? Classify: is the 1st package embedded in the 2nd package (yes, no)?
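
Writing feature vectors out in the ARFF layout shown on slide 33 takes only a few lines. This sketch assumes numeric features, a nominal class attribute, and attribute names without spaces or special characters; the function name `to_arff` and the example attribute names are hypothetical.

```python
def to_arff(relation, attributes, class_values, rows):
    """rows is a list of (feature_list, label); a label of None is
    written as '?', the ARFF marker for an unknown value."""
    lines = [f"@RELATION {relation}"]
    for name in attributes:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines.append("@DATA")
    for features, label in rows:
        values = [str(v) for v in features]
        values.append("?" if label is None else label)
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


text = to_arff(
    "emails",
    ["count_hello", "size_bytes"],       # hypothetical feature names
    ["spam", "ham"],
    [([3, 1024], "ham"), ([0, 90210], "spam"), ([1, 512], None)],
)
print(text)
```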
  • 36. Final Remarks on Classification    Lots of problems can be considered as this. Learn how to use WEKA. Vectors are very good representations.
  • 37. Clustering. Problem: To group together “similar” objects under some notion of similarity. Easy solution: Represent objects using “feature vectors”. Plug into WEKA. [Figure: clusters of packages in Fedora Linux.]
  • 38. Clustering - Case Study  Simseer Cluster  Represent binaries using N-Grams of decompiled flowgraphs.  Use most frequent N-Grams as features.  Distance measure is cosine distance.
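
The slide does not say which clustering algorithm Simseer Cluster uses. As a stand-in, the sketch below applies a very simple greedy "leader" clustering with cosine distance: each vector joins the first cluster whose first member is within a threshold, or else starts a new cluster. The algorithm choice, the threshold and the names are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def leader_cluster(vectors, threshold):
    """Greedy clustering: the first member of each cluster is its leader."""
    clusters = []
    for v in vectors:
        for cluster in clusters:
            if cosine_distance(v, cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:  # no existing leader is close enough
            clusters.append([v])
    return clusters


vectors = [[1, 0], [2, 0.1], [0, 1], [0.1, 3]]
print(leader_cluster(vectors, threshold=0.1))
```

The result depends on input order, which is one reason a toolkit such as WEKA, with established algorithms, is the more practical route.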
  • 39. Final Remarks on Clustering   A classic machine learning problem. Again, learn to use WEKA.
  • 40. Program Analysis: An incredibly large and deep field. This section skims the surface. Main approaches: Theorem Proving, Model Checking, Abstract Interpretation, Data Flow Analysis.
  • 41. Model Checking    Looks at program states generated by a program. Some states indicate bugs. Try BLAST, a model checker for small C programs.  Caveat - it’s pretty old now.
  • 42. Theorem Proving - SMT. SMT, what is it? An equation solver that covers the types of operations seen in machine code. Approach for bug detection: User input can generally be anything, so treat it as a “symbolic” variable. The rest is concrete. Simulate execution of the program, plugging all the machine code that is executed into the solver formulae. Concolic execution: combining symbolic execution with concrete execution.
  • 43. Concolic Execution     At branches, can we have user input that forces us to go down each path? Use the SMT solver to tell us. Launch execution down ‘feasible’ paths. Use the solver to tell us if bugs are present.  What user input, if any, can make this pointer NULL?
  • 44. Concolic path-sensitive analysis. [Figure: control flow graph of main, with basic blocks numbered 0 to 4.]
      lea    0x4(%esp),%ecx
      and    $0xfffffff0,%esp
      pushl  -0x4(%ecx)
      push   %ebp
      mov    %esp,%ebp
      push   %ecx
      sub    $0x24,%esp
      call   4011b0 <___main>
      movl   $0x0,-0x8(%ebp)
      jmp    40115f <_main+0x2f>

      movl   $0x4020a0,(%esp)
      call   4011b8 <_puts>
      addl   $0x1,-0x8(%ebp)

      cmpl   $0x9,-0x8(%ebp)
      jle    40114f <_main+0x1f>

      add    $0x24,%esp
      pop    %ecx
      pop    %ebp
      lea    -0x4(%ecx),%esp
      ret
  • 45. Abstract Interpretation: Abstract the execution of the program. Example: Only consider the sign of a variable, not the actual value. Requires a transfer function: what an instruction does to the abstract data. And a Join/Meet function: how data is combined when it meets from different control flow paths.
  • 46. Data Flow Analysis: Similar to abstract interpretation. Uses a transfer function and a join. Implement both using a monotone framework. Data flow analysis is used by compilers. Classic data flow problems: the reach of a definition of, or assignment to, a variable (reaching definitions); knowing if a variable will be read again before being assigned a new value (liveness).
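
The sign example from slide 45 can be sketched as a tiny abstract domain. The abstract values, the helper names (`alpha`, `join`, `abs_add`, `abs_mul`) and the choice of operations are illustrative assumptions; a real analyser would cover a full instruction set.

```python
# Abstract values: negative, zero, positive, unknown sign, unreachable.
NEG, ZERO, POS, TOP, BOTTOM = "-", "0", "+", "T", "_"


def alpha(n):
    """Abstraction: map a concrete integer to its sign."""
    return NEG if n < 0 else POS if n > 0 else ZERO


def join(a, b):
    """Combine values arriving from different control flow paths."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    return a if a == b else TOP


def abs_mul(a, b):
    """Transfer function for multiplication on signs."""
    if BOTTOM in (a, b):
        return BOTTOM
    if ZERO in (a, b):
        return ZERO          # anything times zero is zero
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def abs_add(a, b):
    """Transfer function for addition on signs."""
    if BOTTOM in (a, b):
        return BOTTOM
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else TOP  # e.g. POS + NEG has unknown sign


# if (c) v = 5; else v = 3 * 2;   both branches give a positive v
print(join(alpha(5), abs_mul(alpha(3), alpha(2))))  # +
# if (c) v = 5; else v = -1;      the sign is unknown after the join
print(join(alpha(5), alpha(-1)))                    # T
```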
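
The second classic problem on slide 46 (will a variable be read again before it is reassigned?) is liveness analysis. The sketch below iterates the standard backward equations to a fixed point over a small control flow graph. The block names and the use/def sets are a hypothetical model of a counting loop, and this is not Bugalyze's code.

```python
def liveness(blocks, succ):
    """blocks: name -> (use set, def set); succ: name -> successor names.
    Returns (live_in, live_out) for every block."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate until nothing changes (fixed point)
        changed = False
        for b, (use, define) in blocks.items():
            # Join: live-out is the union of the successors' live-in sets.
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            # Transfer: read here, or live afterwards and not overwritten.
            inn = use | (out - define)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


# i = 0; do { puts(...); i += 1; } while (i <= 9);
blocks = {
    "entry": (set(), {"i"}),   # i = 0
    "body": ({"i"}, {"i"}),    # i += 1
    "cond": ({"i"}, set()),    # i <= 9 ?
    "exit": (set(), set()),
}
succ = {"entry": ["cond"], "body": ["cond"], "cond": ["body", "exit"]}
live_in, live_out = liveness(blocks, succ)
print(live_in)
print(live_out)
```

Set union is monotone, so the sets only ever grow and the loop terminates.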
  • 47. Data Flow Analysis – Case Study   Implemented in Bugalyze. Example bug detection  In free(ptr), where is ptr used before it is reassigned, and is it used in a free?   Has found real bugs in Debian Linux. Still a work-in-progress.
  • 49. Final Remarks on Program Analysis: A wide and deep field. Good to know the basic approaches. Reversing is becoming more rigorous (think HexRays).
  • 50. Conclusion     Academia has some useful techniques. It’s good to know some of the basic methods. Will improve industrial programs. Any questions?