SlideShare une entreprise Scribd logo
1  sur  46
Automatically Detecting Package Clones and Inferring Security
                                              Vulnerabilities

                                                       Silvio Cesare
                                                  Deakin University
                                          <silvio.cesare@gmail.com>
   PhD Candidate at Deakin University, AU.


   Research interests:
     Malware detection
     Automated vulnerability detection



   Book author
     Software similarity and classification, Springer.
     http://www.springer.com/computer/security+and+cryptology/
      book/978-1-4471-2908-0


   Spoke at Black Hat in 2003 on Open source kernel auditing.


   http://www.FooCodeChu.com
 Developers may “embed” or “clone” code
 from 3rd party sources
    Static linking
    Maintaining a internal copy of a library.
    Forking a library.


 Lots   of examples
    XML parsing  libxml in various programs
    Image processing  libpng in Firefox
    Networking  Open SSL in Cisco IOS
    Compression  zlib everywhere
 Linuxpolicies generally disallow (image
  below).

 It   still happens.

 Multiple   versions of packages now exist.

 Each    copy needs patches from upstream.

 Copies become insecure over time from
  unapplied patches.
 Scan   binaries for version strings.

 Done in 2005 on mass scale for zlib in Debian
  Linux.

bzlib_private.h:#define BZ_VERSION   "1.0.5, 10-Dec-2007"


tiffvers.h:#define TIFFLIB_VERSION_STR "LIBTIFF, Version
3.8.2nCopyright (c) 1988-1996 Sam LefflernCopyright (c)
1991-1996 Silicon Graphics, Inc."


png.h:#define PNG_HEADER_VERSION_STRING 
  " libpng version 1.2.27 - April 29, 2008n"
 10,000   – 20,000 packages in Linux distros.

 Debian   tracks over 420 libraries (see below).

 Most   distros don‟t track at all.

 How    many vulnerabilities are there?

 How    to automate?
1.   Problem definition and our approach
2.   Statistical classification
3.   Scaling the analysis
4.   Inferring security vulnerabilities
5.   Implementation and evaluation
6.   Discussion
7.   Related work
8.   Future work and conclusion

Remember to complete the Black Hat speaker feedback survey.
 Find package code re-use in sources.
 Infer vulns caused by out-of-date code .




                Firefox Source

                                 libpng Source
 Consider code re-use detection a binary
 classification problem:
    Do packages A and B share code? Yes or no?


 Features   for classification:
    Common filenames
    Hashes
    Fuzzy content


 Askthe question does package A share code
 with package X, forall X in the repository.
 Classification   assigns classes to objects.

 Supervised   learning.

 Unsupervised     learning.               Class 1 - Spam



                           object   ?



                                         Class 2 – Not Spam
 Feature   vector   1.    N_Filenames_A
                     2.    N_Filenames_Source_A
                     3.    N_Filenames_B
                     4.    N_Filenames_Source_B
                     5.    N_Common_Filenames
                     6.    N_Common_Similar_Filenames
                     7.    N_Common_FilenameHashes
                     8.    N_Common_FilenameHash80
                     9.    N_Common_ExactFilenameHash
                     10.   N_Score_of_Common_Filename
                     11.   N_Score_of_Common_Similar_Filename
                     12.   N_Score_of_Common_FilenameHash
                     13.   N_Score_of_Common_FilenameHash80
                     14.   N_Score_of_Common_ExactFilenameHash80
                     15.   N_Data_Common_Filenames
                     16.   N_Data_Common_Similar_Filenames
                     17.   N_Data_Common_FilenameHashes
                     18.   N_Data_Common_FilenameHash80
                     19.   N_Data_Common_ExactFilenameHash
                     20.   N_Data_Score_of_Common_Filename
                     21.   N_Data_Score_of_Common_Similar_Filename
                     22.   N_Data_Score_of_Common_FilenameHash
                     23.   N_Data_Score_of_Common_FilenameHash80
                     24.   N_Data_Score_of_Common_ExactFilenameHash80
                     25.   N_Common_ExactHash
                     26.   N_Common_DataExactHash
expat-2.0.1/lib     tla-1.3.5+dfsg/src/expat/lib/
 Sourceand data.    amigaconfig.h
                     ascii.h             ascii.h

 Normalize names.   asciitab.h
                     expat.dsp
                                         asciitab.h
                                         expat.dsp
                     expat_external.h    expat_external.h
                     expat.h             expat.h
                     expat_static.dsp    expat_static.dsp
c                    expatw.dsp          expatw.dsp
                     expatw_static.dsp   expatw_static.dsp
cpp
                     iasciitab.h         iasciitab.h
cxx                  internal.h          internal.h
cc                   latin1tab.h         latin1tab.h
php                  libexpat.def        libexpat.def
inc                  libexpatw.def       libexpatw.def
                     macconfig.h         macconfig.h
java                 Makefile.MPW        Makefile.MPW
py                   nametab.h           nametab.h
rb                   utf8tab.h           utf8tab.h
js                   winconfig.h         winconfig.h
                     xmlparse.c          xmlparse.c
pl
                     xmlrole.c           xmlrole.c
pm                   xmlrole.h           xmlrole.h
ml                   xmltok.c            xmltok.c
mli                  xmltok.h            xmltok.h
lua                  xmltok_impl.c       xmltok_impl.c
                     xmltok_impl.h       xmltok_impl.h
                     xmltok_ns.c         xmltok_ns.c
 Edit   distance between filenames.

 Similarity   >= 85%

                            edit _ dist ( s, t )
     similarity ( s, t ) 1
                           max( len( s), len(t ))
 Use      fuzzy hashing (ssdeep).

 Number          of identical hashes.

 Number          of > 80% similar hashes.

 Number          of > 0% similar hashes.
ssdeep,1.0--blocksize:hash:hash,filename
96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"config.h"
96:MD9fHjsEuddrg31904l8bgx5ROg2MQZHZqpAlycowOsexbHDbk:MJwz/l2PqGqqbr2yk6pVgrwPV,"INSTALL"
96:EQOJvOl4ab3hhiNFXc4wwcweomr0cNJDBoqXjmAHKX8dEt001nfEhVIuX0dDcs:3mzpAsZpprbshfu3oujjdEN
dp21,"README"
 README     filenames less important.

 libpng.c   more important .

 Scorefilenames using „inverse document
 frequency.‟
                                           D
                  idf (t , D)   log
                                      {d   D:t   d}

 Sum   scores of matching filenames.
   Which similar filenames to match?


   Each matching has a cost – the filename score.


   Choose matchings to maximize sum of costs.

                                        p
                                            Weight(q)
                                                        q
           Makefile

           Makefile.ca
                         png44.c
           png43.c
                         png.h
           png.h
                         Makefile
           README

           rules
   Given two sets, A and T, of equal size, together with a weight function C: A × T → R. Find a
    bijection f: A →T such that the cost function:



                     a A
                            C (a, f (a))
    is optimal.




   Known in combinatorial optimisation as „the
    assignment problem.‟

   Solved optimally in cubic time.

   Greedy solution is faster.
 Not   all features are important.


                         1. Feature1        1. Feature3 (0.80)
 Feature   ranking.     2. Feature2       2. Feature1 (0.60)
                         3. Feature3        3. Feature2 (0.01)




 Subset   selection.     1. Feature1        1. Feature1
                          2. Feature2       2. Feature2
                          3. Feature3



 We    chose not to use it.
 Consider   feature vectors as N-dimensional
 points.

 Linear   classifiers.

 Non   linear classifiers.
                              Class A

                              Class B

 Decision   trees.
 Speedup      clone detection on a package.

 Open   MP.

                                            Classify(Package_X, Package_1)

 Embarrisingly    parallel.
                        Clone Detection –
                                            Classify(Package_X, Package_2)
                           Package_X




                                            Classify(Package_X, Package_N)
 Open   MPI.

 Single   job is clone detection on package.
                                                 Clone Detection – Package_1



 Slaves   consume jobs.
                               Clone Detection   Clone Detection - Package_2




 Embarrassingly   parallel.

                                                 Clone Detection - Package_N
4   Node Amazon EC2 Cluster
    Dual CPU
    8 cores per CPU
    88 EC2 compute units
    60.5G memory per node


 Clonedetection on embedded libs known by
 Debian.

 Store   the results for later use.
Summary: Off-by-one error in the __opiereadrec
function in readrec.c in libopie in OPIE 2.4.1-test1
and earlier, as used on FreeBSD 6.4 through 8.1-
PRERELEASE and other platforms, allows remote
attackers to cause a denial of service (daemon crash)
or possibly execute arbitrary code via a long
username, as demonstrated by a long USER
command to the FreeBSD 8.0 ftpd.
 By   package
1.   Take CVE, match CPE name to Debian package.

2.   Parse CVE summary and extract vuln filename.

3.   Find clones of package with similar filename.

4.   Trim dynamically linked clones.

5.   Is vuln affected clone already being tracked?
 By   CVE
 3,500Lines of C++ and shell scripts.
 Open Source
 http://www.github.com/silviocesare/Clonewise
 Ubuntu    Linux

 3,077,063   unique filenames.

 Follows   inverse power law distribution.

R   square value of regression analysis 0.928.
 Debian   Linux embedded-code-copies.txt.
    Not really machine readable.
    Cull entries which we can‟t match to packages.
    761 labelled positives.


 Negatives   any packages not in positives
    47578 generated labelled negatives.
 Identified    34 previously unknown clones in
  Debian.
     Lots more to do.


 Statistical   classification
     4 classifiers - Random Forest gave best accuracy.
     Increasing the decision threshold reduces FPs.
     Predict 3 FPs in 10,000 classifications.
     More likely an upper limit.
Classifier      TP/FN     FP/TN       TP Rate   FP Rate

Naïve Bayes     439/322   484/56296   57.69%    0.85%
Multilayer
Perceptron      204/557   48/56732    26.81%    0.08%

C4.5            523/238   86/56694    68.73%    0.15%

Random Forest   533/228   60/56720    70.04%    0.11%
Random Forest
(0.8)           446/315   15/56765    58.61%    0.03%
4   hours on an Amazon HPC cluster.

 MPI_Scatter   to do static job assignment was
 inefficient.
    Better to consume from a work queue.


 Need   to use multicore to balance load.
Package             Embedded Package        Package         Embedded Package
OpenSceneGraph                  lib3ds              boson                lib3ds
mrpt-opengl                     lib3ds              libopenscenegraph7   lib3ds
mingw32-OpenSceneGraph          lib3ds              libfreeimage         libpng
libtlen                         expat               libfreeimage         libtiff
centerim                        expat               libfreeimage         openexr
mcabber                         expat               r-base-core          libbz2
                                                    r-base-core-ra       libbz2
udunits2                        expat
                                                    lsb-rpm              libbz2
libnodeupdown-backend-ganglia   expat
                                                    criticalmass         libcurl
libwmf                          gd                  albert               expat
kadu                            mimetex             mcabber              expat
cgit                            git                 centerim             expat
tkimg                           libpng              wengophone           gaim
tkimg                           libtiff             libpam-opie          libopie
ser                             php-Smarty          pysol-sound-server   libmikod
pgpoolAdmin                     php-Smarty          gnome-xcf-thumnailer xcftool
sepostgresql                    postgresql          plt-scheme           libgd
 Write   access to Debian‟s security tracker.

 Red   Hat embedded code copies wiki created.

 Debian plan to integrate Clonewise into
 infrastructure.
 Red    Hat reference CVEs of embedded libs.
 Not   every vendor does.
 It   would be nice if CVE supported this.
   Clonewise detects code reuse.

   If zlib embedded in packages X and Y:
       Clonewise detects clones between all X, Y, and zlib.


   What we really want to know is:
     X is not cloned in Y.
     Zlib is cloned in X and Y.



   Mitigation
       Clone detection on known embedded libraries.
 Debian   Linux zlib audit in 2005

 Plagiarism   detection
    Attribute counting
    Structure-based


 Code   clone detection
     Tokenization
                                                  if
 
                                       condition then    else

    Abstract syntax trees        ==            return          =




                             x    0                             x   1
 Source   repositories
    Sourceforge
    Github

 Other   OSs – BSD etc

 Integration   into build/packaging systems?

 Integration   into Debian Linux infrastructure.
   More than just Clonewise..

   Simseer – Free flowgraph-based malware similarity and
    families.

   110,000 LOC C++. Happy to talk to vendors.
   Vendors have 10,000+ packages.

   How to audit for clones?

   Clonewise can provide a solution.

   And help improve security.

   http://www.FooCodeChu.com

Remember to complete the Black Hat speaker feedback survey.

Contenu connexe

Tendances

Write Your Own JVM Compiler
Write Your Own JVM CompilerWrite Your Own JVM Compiler
Write Your Own JVM Compiler
Erin Dees
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
Joe Stein
 

Tendances (19)

Write Your Own JVM Compiler
Write Your Own JVM CompilerWrite Your Own JVM Compiler
Write Your Own JVM Compiler
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
Playfulness at Work
Playfulness at WorkPlayfulness at Work
Playfulness at Work
 
Unix lab
Unix labUnix lab
Unix lab
 
2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge2015 bioinformatics python_strings_wim_vancriekinge
2015 bioinformatics python_strings_wim_vancriekinge
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
Fuzzing - Part 1
Fuzzing - Part 1Fuzzing - Part 1
Fuzzing - Part 1
 
Howto argparse
Howto argparseHowto argparse
Howto argparse
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
Python for Linux System Administration
Python for Linux System AdministrationPython for Linux System Administration
Python for Linux System Administration
 
Introduction to Python for Bioinformatics
Introduction to Python for BioinformaticsIntroduction to Python for Bioinformatics
Introduction to Python for Bioinformatics
 
Biopython
BiopythonBiopython
Biopython
 
EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
Memory Management In Python The Basics
Memory Management In Python The BasicsMemory Management In Python The Basics
Memory Management In Python The Basics
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Bioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekingeBioinformatics v2014 wim_vancriekinge
Bioinformatics v2014 wim_vancriekinge
 
21bUc8YeDzZpE
21bUc8YeDzZpE21bUc8YeDzZpE
21bUc8YeDzZpE
 

En vedette

Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
imanmahsa
 
BIZI- OHITURAK
BIZI- OHITURAKBIZI- OHITURAK
BIZI- OHITURAK
olatz
 

En vedette (6)

PhD Proposal
PhD ProposalPhD Proposal
PhD Proposal
 
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)
 
ICPC Demo
ICPC DemoICPC Demo
ICPC Demo
 
Empirical Results on Cloning and Clone Detection
Empirical Results on Cloning and Clone DetectionEmpirical Results on Cloning and Clone Detection
Empirical Results on Cloning and Clone Detection
 
Clone detection in Python
Clone detection in PythonClone detection in Python
Clone detection in Python
 
BIZI- OHITURAK
BIZI- OHITURAKBIZI- OHITURAK
BIZI- OHITURAK
 

Similaire à Clonewise - Automatically Detecting Package Clones and Inferring Security Vulnerabilities

Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in Linux
Silvio Cesare
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev Ops
Eric Chiang
 
CSCI  132  Practical  Unix  and  Programming   .docx
CSCI  132  Practical  Unix  and  Programming   .docxCSCI  132  Practical  Unix  and  Programming   .docx
CSCI  132  Practical  Unix  and  Programming   .docx
mydrynan
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
Kushaal Singla
 
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdfReaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
fashionbigchennai
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
Revolution Analytics
 
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docxCOMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
donnajames55
 

Similaire à Clonewise - Automatically Detecting Package Clones and Inferring Security Vulnerabilities (20)

Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in Linux
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev Ops
 
Exploitation Crash Course
Exploitation Crash CourseExploitation Crash Course
Exploitation Crash Course
 
CSCI  132  Practical  Unix  and  Programming   .docx
CSCI  132  Practical  Unix  and  Programming   .docxCSCI  132  Practical  Unix  and  Programming   .docx
CSCI  132  Practical  Unix  and  Programming   .docx
 
Assignment 1 MapReduce With Hadoop
Assignment 1  MapReduce With HadoopAssignment 1  MapReduce With Hadoop
Assignment 1 MapReduce With Hadoop
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control Flow
 
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdfReaction StatisticsBackgroundWhen collecting experimental data f.pdf
Reaction StatisticsBackgroundWhen collecting experimental data f.pdf
 
BioMake BOSC 2004
BioMake BOSC 2004BioMake BOSC 2004
BioMake BOSC 2004
 
Dive into exploit development
Dive into exploit developmentDive into exploit development
Dive into exploit development
 
The Ring programming language version 1.7 book - Part 43 of 196
The Ring programming language version 1.7 book - Part 43 of 196The Ring programming language version 1.7 book - Part 43 of 196
The Ring programming language version 1.7 book - Part 43 of 196
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 
7.-Download_CS201-Solved-Subjective-with-Reference-by-Aqib.doc
7.-Download_CS201-Solved-Subjective-with-Reference-by-Aqib.doc7.-Download_CS201-Solved-Subjective-with-Reference-by-Aqib.doc
7.-Download_CS201-Solved-Subjective-with-Reference-by-Aqib.doc
 
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docxCOMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
COMP 2103X1 Assignment 2Due Thursday, January 26 by 700 PM.docx
 
Lab 1 Essay
Lab 1 EssayLab 1 Essay
Lab 1 Essay
 
C, C++ Interview Questions Part - 1
C, C++ Interview Questions Part - 1C, C++ Interview Questions Part - 1
C, C++ Interview Questions Part - 1
 
Unit VI
Unit VI Unit VI
Unit VI
 

Plus de Silvio Cesare

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
Silvio Cesare
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare
 
Simseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySimseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made Easy
Silvio Cesare
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Silvio Cesare
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Silvio Cesare
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
Silvio Cesare
 
Effective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionEffective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detection
Silvio Cesare
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web Service
Silvio Cesare
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware Classification
Silvio Cesare
 
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Silvio Cesare
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux Distributions
Silvio Cesare
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of Malware
Silvio Cesare
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For Emulation
Silvio Cesare
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
Silvio Cesare
 

Plus de Silvio Cesare (15)

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
 
Simseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySimseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made Easy
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
 
Effective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionEffective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detection
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web Service
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware Classification
 
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux Distributions
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of Malware
 
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For Emulation
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
 

Dernier

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Dernier (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

Clonewise - Automatically Detecting Package Clones and Inferring Security Vulnerabilities

  • 1. Automatically Detecting Package Clones and Inferring Security Vulnerabilities Silvio Cesare Deakin University <silvio.cesare@gmail.com>
  • 2. PhD Candidate at Deakin University, AU.  Research interests:  Malware detection  Automated vulnerability detection  Book author  Software similarity and classification, Springer.  http://www.springer.com/computer/security+and+cryptology/ book/978-1-4471-2908-0  Spoke at Black Hat in 2003 on Open source kernel auditing.  http://www.FooCodeChu.com
  • 3.  Developers may “embed” or “clone” code from 3rd party sources  Static linking  Maintaining a internal copy of a library.  Forking a library.  Lots of examples  XML parsing  libxml in various programs  Image processing  libpng in Firefox  Networking  Open SSL in Cisco IOS  Compression  zlib everywhere
  • 4.  Linuxpolicies generally disallow (image below).  It still happens.  Multiple versions of packages now exist.  Each copy needs patches from upstream.  Copies become insecure over time from unapplied patches.
  • 5.  Scan binaries for version strings.  Done in 2005 on mass scale for zlib in Debian Linux. bzlib_private.h:#define BZ_VERSION "1.0.5, 10-Dec-2007" tiffvers.h:#define TIFFLIB_VERSION_STR "LIBTIFF, Version 3.8.2nCopyright (c) 1988-1996 Sam LefflernCopyright (c) 1991-1996 Silicon Graphics, Inc." png.h:#define PNG_HEADER_VERSION_STRING " libpng version 1.2.27 - April 29, 2008n"
  • 6.  10,000 – 20,000 packages in Linux distros.  Debian tracks over 420 libraries (see below).  Most distros don‟t track at all.  How many vulnerabilities are there?  How to automate?
  • 7. 1. Problem definition and our approach 2. Statistical classification 3. Scaling the analysis 4. Inferring security vulnerabilities 5. Implementation and evaluation 6. Discussion 7. Related work 8. Future work and conclusion Remember to complete the Black Hat speaker feedback survey.
  • 8.  Find package code re-use in sources.  Infer vulns caused by out-of-date code . Firefox Source libpng Source
  • 9.  Consider code re-use detection a binary classification problem:  Do packages A and B share code? Yes or no?  Features for classification:  Common filenames  Hashes  Fuzzy content  Askthe question does package A share code with package X, forall X in the repository.
  • 10.
  • 11.  Classification assigns classes to objects.  Supervised learning.  Unsupervised learning. Class 1 - Spam object ? Class 2 – Not Spam
  • 12.  Feature vector 1. N_Filenames_A 2. N_Filenames_Source_A 3. N_Filenames_B 4. N_Filenames_Source_B 5. N_Common_Filenames 6. N_Common_Similar_Filenames 7. N_Common_FilenameHashes 8. N_Common_FilenameHash80 9. N_Common_ExactFilenameHash 10. N_Score_of_Common_Filename 11. N_Score_of_Common_Similar_Filename 12. N_Score_of_Common_FilenameHash 13. N_Score_of_Common_FilenameHash80 14. N_Score_of_Common_ExactFilenameHash80 15. N_Data_Common_Filenames 16. N_Data_Common_Similar_Filenames 17. N_Data_Common_FilenameHashes 18. N_Data_Common_FilenameHash80 19. N_Data_Common_ExactFilenameHash 20. N_Data_Score_of_Common_Filename 21. N_Data_Score_of_Common_Similar_Filename 22. N_Data_Score_of_Common_FilenameHash 23. N_Data_Score_of_Common_FilenameHash80 24. N_Data_Score_of_Common_ExactFilenameHash80 25. N_Common_ExactHash 26. N_Common_DataExactHash
  • 13. expat-2.0.1/lib tla-1.3.5+dfsg/src/expat/lib/  Sourceand data. amigaconfig.h ascii.h ascii.h  Normalize names. asciitab.h expat.dsp asciitab.h expat.dsp expat_external.h expat_external.h expat.h expat.h expat_static.dsp expat_static.dsp c expatw.dsp expatw.dsp expatw_static.dsp expatw_static.dsp cpp iasciitab.h iasciitab.h cxx internal.h internal.h cc latin1tab.h latin1tab.h php libexpat.def libexpat.def inc libexpatw.def libexpatw.def macconfig.h macconfig.h java Makefile.MPW Makefile.MPW py nametab.h nametab.h rb utf8tab.h utf8tab.h js winconfig.h winconfig.h xmlparse.c xmlparse.c pl xmlrole.c xmlrole.c pm xmlrole.h xmlrole.h ml xmltok.c xmltok.c mli xmltok.h xmltok.h lua xmltok_impl.c xmltok_impl.c xmltok_impl.h xmltok_impl.h xmltok_ns.c xmltok_ns.c
  • 14.  Edit distance between filenames.  Similarity >= 85% edit _ dist ( s, t ) similarity ( s, t ) 1 max( len( s), len(t ))
  • 15.  Use fuzzy hashing (ssdeep).  Number of identical hashes.  Number of > 80% similar hashes.  Number of > 0% similar hashes. ssdeep,1.0--blocksize:hash:hash,filename 96:KQhaGCVZGhr83h3bc0ok3892m12wzgnH5w2pw+sxNEI58:FIVkH4x73h39LH+2w+sxaD,"config.h" 96:MD9fHjsEuddrg31904l8bgx5ROg2MQZHZqpAlycowOsexbHDbk:MJwz/l2PqGqqbr2yk6pVgrwPV,"INSTALL" 96:EQOJvOl4ab3hhiNFXc4wwcweomr0cNJDBoqXjmAHKX8dEt001nfEhVIuX0dDcs:3mzpAsZpprbshfu3oujjdEN dp21,"README"
  • 16.  README filenames less important.  libpng.c more important .  Scorefilenames using „inverse document frequency.‟ D idf (t , D) log {d D:t d}  Sum scores of matching filenames.
  • 17. Which similar filenames to match?  Each matching has a cost – the filename score.  Choose matchings to maximize sum of costs. p Weight(q) q Makefile Makefile.ca png44.c png43.c png.h png.h Makefile README rules
  • 18. Given two sets, A and T, of equal size, together with a weight function C: A × T → R. Find a bijection f: A →T such that the cost function: a A C (a, f (a)) is optimal.  Known in combinatorial optimisation as „the assignment problem.‟  Solved optimally in cubic time.  Greedy solution is faster.
  • 19.  Not all features are important. 1. Feature1 1. Feature3 (0.80)  Feature ranking. 2. Feature2  2. Feature1 (0.60) 3. Feature3 3. Feature2 (0.01)  Subset selection. 1. Feature1 1. Feature1 2. Feature2  2. Feature2 3. Feature3  We chose not to use it.
  • 20.  Consider feature vectors as N-dimensional points.  Linear classifiers.  Non linear classifiers. Class A Class B  Decision trees.
  • 21.
  • 22.  Speedup clone detection on a package.  Open MP. Classify(Package_X, Package_1)  Embarrisingly parallel. Clone Detection – Classify(Package_X, Package_2) Package_X Classify(Package_X, Package_N)
  • 23.  Open MPI.  Single job is clone detection on package. Clone Detection – Package_1  Slaves consume jobs. Clone Detection Clone Detection - Package_2  Embarrassingly parallel. Clone Detection - Package_N
  • 24. 4 Node Amazon EC2 Cluster  Dual CPU  8 cores per CPU  88 EC2 compute units  60.5G memory per node  Clonedetection on embedded libs known by Debian.  Store the results for later use.
  • 25.
  • 26. Summary: Off-by-one error in the __opiereadrec function in readrec.c in libopie in OPIE 2.4.1-test1 and earlier, as used on FreeBSD 6.4 through 8.1- PRERELEASE and other platforms, allows remote attackers to cause a denial of service (daemon crash) or possibly execute arbitrary code via a long username, as demonstrated by a long USER command to the FreeBSD 8.0 ftpd.
  • 27.
  • 28.  By package
  • 29. 1. Take CVE, match CPE name to Debian package. 2. Parse CVE summary and extract vuln filename. 3. Find clones of package with similar filename. 4. Trim dynamically linked clones. 5. Is vuln affected clone already being tracked?
  • 30.  By CVE
  • 31.
  • 32.  3,500Lines of C++ and shell scripts.  Open Source http://www.github.com/silviocesare/Clonewise
  • 33.  Ubuntu Linux  3,077,063 unique filenames.  Follows inverse power law distribution. R square value of regression analysis 0.928.
  • 34.  Debian Linux embedded-code-copies.txt.  Not really machine readable.  Cull entries which we can‟t match to packages.  761 labelled positives.  Negatives any packages not in positives  47578 generated labelled negatives.
  • 35.  Identified 34 previously unknown clones in Debian.  Lots more to do.  Statistical classification  4 classifiers - Random Forest gave best accuracy.  Increasing the decision threshold reduces FPs.  Predict 3 FPs in 10,000 classifications.  More likely an upper limit.
  • 36. Classifier TP/FN FP/TN TP Rate FP Rate Naïve Bayes 439/322 484/56296 57.69% 0.85% Multilayer Perceptron 204/557 48/56732 26.81% 0.08% C4.5 523/238 86/56694 68.73% 0.15% Random Forest 533/228 60/56720 70.04% 0.11% Random Forest (0.8) 446/315 15/56765 58.61% 0.03%
  • 37. 4 hours on an Amazon HPC cluster.  MPI_Scatter to do static job assignment was inefficient.  Better to consume from a work queue.  Need to use multicore to balance load.
  • 38. Package Embedded Package Package Embedded Package OpenSceneGraph lib3ds boson lib3ds mrpt-opengl lib3ds libopenscenegraph7 lib3ds mingw32-OpenSceneGraph lib3ds libfreeimage libpng libtlen expat libfreeimage libtiff centerim expat libfreeimage openexr mcabber expat r-base-core libbz2 r-base-core-ra libbz2 udunits2 expat lsb-rpm libbz2 libnodeupdown-backend-ganglia expat criticalmass libcurl libwmf gd albert expat kadu mimetex mcabber expat cgit git centerim expat tkimg libpng wengophone gaim tkimg libtiff libpam-opie libopie ser php-Smarty pysol-sound-server libmikod pgpoolAdmin php-Smarty gnome-xcf-thumnailer xcftool sepostgresql postgresql plt-scheme libgd
  • 39.
  • 40.  Write access to Debian‟s security tracker.  Red Hat embedded code copies wiki created.  Debian plan to integrate Clonewise into infrastructure.
  • 41.  Red Hat reference CVEs of embedded libs.  Not every vendor does.  It would be nice if CVE supported this.
  • 42. Clonewise detects code reuse.  If zlib embedded in packages X and Y:  Clonewise detects clones between all X, Y, and zlib.  What we really want to know is:  X is not cloned in Y.  Zlib is cloned in X and Y.  Mitigation  Clone detection on known embedded libraries.
  • 43.  Debian Linux zlib audit in 2005  Plagiarism detection  Attribute counting  Structure-based  Code clone detection Tokenization if  condition then else  Abstract syntax trees == return = x 0 x 1
  • 44.  Source repositories  Sourceforge  Github  Other OSs – BSD etc  Integration into build/packaging systems?  Integration into Debian Linux infrastructure.
  • 45. More than just Clonewise..  Simseer – Free flowgraph-based malware similarity and families.  110,000 LOC C++. Happy to talk to vendors.
  • 46. Vendors have 10,000+ packages.  How to audit for clones?  Clonewise can provide a solution.  And help improve security.  http://www.FooCodeChu.com Remember to complete the Black Hat speaker feedback survey.