This document discusses a methodology for detecting vulnerabilities in software based on analysis of the project's Git history. It proposes an approach called HVD that considers whether lines of code were added or removed in code changes, which could improve precision over existing techniques. An evaluation using a dataset of over 350,000 commits found that HVD increased the area under the precision-recall curve by 18.8% compared to a baseline that ignores line additions and removals. Features related to computer resources like memory, CPU and networking were found to most significantly contribute to the classification model. The study demonstrates that automatically detecting vulnerabilities from Git data can produce results aligned with human intuition.
3. Introduction
Research
Background
The Trend of Security Incidents
Key facts. Why this research is important:
In Quantity
# of CVE reports: 1,020 (2000) → 14,643 (2017) [NVD]
In Quality
• Equifax exposed 143M consumers’ data due to website application
vulnerability (2017)
• Yahoo breached 3B users’ account information (2013)
4. The Century of Vulnerability
# OF VULNERABILITIES
As information technology is broadly
adopted, the impact of security
incidents is getting extensive and
critical.
5. Introduction To Help Code Reviewer
We know how to deliver software in proper quality. Code review!
Best Practice is Well-Known
Review patches before release and fix bugs before deployment. Still, however,
even the famous OSS projects struggle with the lack of code reviewers.
A Trade-Off of Automation Techniques
Software projects widely adopt a variety of automation approaches. Vulnerability
detection techniques faces a contradictory:
• (a) High precision. Useless if the tool outputs a billions of false positives.
• (b) Adaptability. No one wants to make efforts only for ensuring security such
as annotating unsafe user inputs.
Research
Background
# Example of taint annotation
int printf(/*@untainted@*/ char *fmt,
...);
6. Git is somewhat difficult.
No worries, it’s not only you!
WHAT’S GIT?
7. “(Git is) expressly designed
to make you feel
less intelligent than you
thought you were”
– Andrew Morton
The Greatness of Git -
www.linuxfoundation.org
8. Introduction
What’s Git?
But Git is Always Stay With You
Trust me, or try this command on your terminal:
# List up how much you rely on Git
history | awk '{ print $2 }' | sort |
uniq -c | sort -r | head
9. Introduction
What’s Git?
Git for Machine Learning
Git provides what machine learning requires; good data:
• Adopted by 69.2% of 30K developers [StackOverfow]
• Trusted by most prominent OSS projects such as Linux
Kernel, OpenSSL, FFmpeg, PostgreSQL, Chrome V8, and
Apache HTTPD.
10. Introduction
What’s Git?
CVE-ID and Security Fix on Git
A sufficient number of reliable security fixes:
• Refers CVE-IDs in their commit message
• Or, fixed commits are referred by CVE database
13. A static analysis to detect
suspicious vulnerabilities based
on Git history.
METHODOLOGY - HVD
14. Methodology Proposal Approach
Concept
• This research proposes the approach which aims to
reduce the false positive rate compared to VCCFinder
[Perl et al] without sacrificing adaptability.
• The data source is the same to VCCFinder but this
approach takes account of added-lines and removed-
lines in patch feature while VCCFinder doesn’t.
15. Methodology VCCFinder: a Novel Approach
Concept
Generally, it’s hard to apply machine learning to source code
because most high-level programming languages such as
C/C++ are less redundant compared to natural languages
and assembly languages. To address this difficulty, Perl et
al.:
• Narrowed down the problem to the quantifiable lemma.
The quality of source code can be hardly quantified but
vulnerability can be expressed as 0 or 1.
• Leveraged the legacies. CVE database and the prominent
OSS projects.
16. “I really never wanted to do
source control
management at all and felt
that it was just about the
least interesting thing in
the computing world”
– Linus Torvalds
10 Years of Git -
www.linuxfoundation.org
22. Evaluation Dataset Provided by Perl et al.
Experiment
• This dataset contains commits labelled by VCC and UC and associated with
their CVE-IDs.
• It comprises 714 VCCs out of 350k commits in total from 66 OSS repositories
implemented in C/C++.
• The number of unique tokens counts 170k.
• Compressed size is 525mb (npz).
23. Evaluation Implementation in Python
Experiment
To make the experiment reliable, I adopted a variety of libraries including:
• Numpy
• SciPy
• Scikit-learn
• Unidiff
LT-I: note that the reproducibility is limited since the source of VCCFinder is not
publicly available.
24. Evaluation Environment Specs
Experiment
The computation was performed at the one of CX250 Cluster (MPC):
• CPU: Intel Xeon E5-2680v2 2.80GHz (10-core) x2
• Memory: 64GB (4GB DDR3-1866 ECC x16)
26. Evaluation Trade-off
• Execution time x3: (LT-I, LT-S) = (17m06s, 45m36s)
• Note: the vast majority of the processing time is occupied by learning phase.
In the practical use case, the learnt model is dumped and shared with future
predictions for a while once calculated. Then, it takes a few seconds to parse a
given unknown commit and perform prediction by using the shared model.
Hence, the execution time of learning phase should not influence the
development process.
Precision
27. Evaluation The most contributing features
Effective Features
To gain more profound insights from the
experiment, this study also reveals that
valuables consisting of words related to
computer resource most significantly
contributed to the classification model.
For instance:
• (RAM) structors: memory allocation with
complex structures
• (RAM) vmalloc: virtual memory allocation
• (CPU) skbuf_head: a spin-lock of threads
• (network) tso: TCP Segmentation Offload
• (network) if_ether: a flag of Ethernet
availability
28. Evaluation Findings & insights
Effective Features
Findings:
• The valuable tokens which are relevant to computer resources such as CPU,
memory, and network
• The figure also shows most contributing valuables are added-tokens.
Insights:
• These findings do not surprise us because it’s obvious that vulnerability occurs
correlating closely with side effects with computer resource management and
adding code.
• However, it’s worth verifying that automatic detection approach makes no
difference with the experiential intuition of human.
30. Despite the difficulty that the features acquirable via Git are limited, this study shows LT-
S improved AUC of the precision-recall curve by 18.8% compared to LT-I without losing
the original advantages:
• (a) Scalability
• (b) Generality
• (c) Explainability
CONCLUSION