The document discusses the history and development of static code analyzers. It describes how early tools used regular expressions that were ineffective for complex code analysis. Modern static analyzers overcome these limitations through techniques like type inference, data flow analysis, symbolic execution, and pattern-based analysis. They also leverage method annotations and a mixture of analysis approaches. While machine learning is hyped, static analysis remains very challenging due to the complexity of code and rapid language evolution.
8. Greetings from the past: simple tools and bad standards
• RATS
• Cppcheck
• MISRA C
9. Regular expressions don’t work
• It’s difficult to find even the simplest patterns, such as swapped operands: (A + B == B + A);
• Macros: who will expand them?
• Types: who will evaluate the typedef chain?
• Values: how to figure out that an index is out of array bounds?
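The first bullet can be illustrated with a small sketch. The pattern and the function name below are assumptions made for this example and are not taken from any real tool. The regular expression catches the textbook form of the comparison but fails as soon as an operand is more than a single identifier.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Naive textual check for "A + B == B + A" using backreferences.
// Hypothetical helper: it only understands single-word operands.
bool regexFindsSwappedSum(const std::string& code)
{
    static const std::regex pattern(
        R"(\b(\w+)\s*\+\s*(\w+)\s*==\s*\2\s*\+\s*\1\b)");
    return std::regex_search(code, pattern);
}
```

The call `regexFindsSwappedSum("if (A + B == B + A)")` succeeds, while the equivalent `x[i] + y == y + x[i]` or `(A) + B == B + A` slips through. Macros and typedefs are entirely out of reach for this approach.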
10. Regular expressions don’t work
My patience ran out in 2010, and I wrote a critical article:
«Static analysis and regular expressions»
https://www.viva64.com/en/b/0087/
11. What is inside modern static code analyzers, using PVS-Studio as an example
12. Type inference
• Type information is needed to implement the majority of diagnostics
• The ability to resolve a type through a typedef chain is needed
• The ability to substitute types (and constants) is needed to analyze templates
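A minimal sketch of why the typedef chain matters. The type names and the `isNegative` helper are hypothetical and chosen only for illustration. Until the chain is resolved, nothing in the function body reveals that the comparison is meaningless.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical typedef chain, as it might be spread over several headers.
typedef unsigned char u8;
typedef u8            byte_t;
typedef byte_t        pixel_t;

// Only after resolving the chain is it clear that pixel_t is unsigned char.
static_assert(std::is_same<pixel_t, unsigned char>::value,
              "pixel_t resolves to unsigned char");

// With the type known, an analyzer could report that "p < 0" is always false.
inline bool isNegative(pixel_t p)
{
    return p < 0;
}
```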
14. Type inference
template<class T, size_t N> struct X
{
  T A[N];
  void Foo()
  {
    memset(A, 0, sizeof(T) * 10);
  }
};
void Do()
{
  X<int, 5> a;
  a.Foo();
}
PVS-Studio: V512 CWE-119 Instantiate X < int, 5 >: A call of
the 'memset' function will lead to overflow of the buffer
'A'. test.cpp 127
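The warning appears only once T = int and N = 5 are substituted: the buffer holds 5 elements, but memset clears 10. One possible correction, shown here as a sketch rather than the only fix, is to let the size follow the array itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

template <class T, std::size_t N>
struct X
{
    T A[N];

    void Foo()
    {
        // sizeof(A) equals sizeof(T) * N, so it tracks the template arguments.
        std::memset(A, 0, sizeof(A));
    }
};
```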
15. Data-flow analysis
int cache_lookup_path(...., vnode_t dp, ....)
{
  ....
  if (dp && (dp->v_flag & VISHARDLINK)) {
    break;
  }
  if ((dp->v_flag & VROOT) ||
      dp == ndp->ni_rootdir ||
      dp->v_parent == NULLVP)
    break;
  ....
}
Error in the XNU kernel project
PVS-Studio: V522 CWE-690 There might be dereferencing of a
potential null pointer 'dp'. vfs_cache.c 1449
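The first condition checks dp for null, so data flow analysis records that dp may be null when the second condition dereferences it. The sketch below shows a consistent version of the check. `Vnode`, the flag values, and `shouldStop` are stand-ins invented for this example, and treating a null dp as "stop" is an assumption, since the snippet does not show what the kernel should do in that case.

```cpp
#include <cassert>

// Minimal stand-in for the kernel's vnode; field names mirror the snippet,
// the flag values are made up for this sketch.
struct Vnode
{
    unsigned v_flag;
    Vnode*   v_parent;
};

const unsigned VROOT       = 0x1;
const unsigned VISHARDLINK = 0x2;

// Returns true when the lookup loop should stop at dp.
bool shouldStop(const Vnode* dp, const Vnode* rootdir)
{
    if (dp == nullptr)
        return true;  // assumed policy: nothing more to walk

    if (dp->v_flag & VISHARDLINK)
        return true;

    return (dp->v_flag & VROOT) || dp == rootdir || dp->v_parent == nullptr;
}
```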
20. Data-flow analysis
• CoreHard Spring 2018. Pavel Belikov. How Data Flow works in a static code analyzer
https://youtu.be/nrQUpGM9vYQ
21. Symbolic execution
void F(int X)
{
  int A = X;
  int B = X + 10;
  int Q[5];
  Q[B - A] = 1;
}
PVS-Studio: V557 CWE-787 Array overrun is possible. The 'B - A' index is pointing
beyond array bound. test.cpp 126
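The value of X is unknown, yet the index is not: B - A is (X + 10) - X, which is always 10. A toy sketch of that reasoning follows. The `Sym` type and its helpers are invented for this example and are far simpler than what a real analyzer does.

```cpp
#include <cassert>

// A value of the form coeff * X + offset, where X is an unknown input.
struct Sym
{
    int coeff;
    int offset;
};

Sym operator+(Sym a, int c) { return {a.coeff, a.offset + c}; }
Sym operator-(Sym a, Sym b) { return {a.coeff - b.coeff, a.offset - b.offset}; }

// If X cancels out, the expression is a known constant.
bool isConstant(Sym s) { return s.coeff == 0; }

// True when the index is provably outside [0, size).
bool provablyOutOfBounds(Sym index, int size)
{
    return isConstant(index) && (index.offset < 0 || index.offset >= size);
}
```

With `Sym X{1, 0}`, `A = X` and `B = X + 10`, the difference `B - A` has a zero coefficient and offset 10, so `provablyOutOfBounds(B - A, 5)` holds.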
22. Symbolic execution
PVS-Studio: V547 CWE-571 Expression 'A < C' is always true. test.cpp 137
void F(int A, int B, int C)
{
  if (A < B)
    if (B < C)
      if (A < C)
        foo();
}
23. Pattern-based analysis
Error in the Linux Kernel project
static ssize_t lp8788_show_eoc_time(struct device *dev,
                struct device_attribute *attr, char *buf)
{
  struct lp8788_charger *pchg = dev_get_drvdata(dev);
  char *stime[] = { "400ms", "5min", "10min", "15min",
                    "20min", "25min", "30min" "No timeout" };
  ....
}
PVS-Studio: V653 A suspicious string consisting of two parts is used for
array initialization. It is possible that a comma is missing. Consider
inspecting this literal: "30min" "No timeout". lp8788-charger.c 657
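Adjacent string literals are concatenated by the compiler, so the array silently gets one element fewer. The sketch below reuses the initializer from the snippet; the `const` qualifiers and the `stimeCount` name are additions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Same initializer as in the kernel snippet: no comma after "30min".
const char* const stime[] = { "400ms", "5min", "10min", "15min",
                              "20min", "25min", "30min" "No timeout" };

const std::size_t stimeCount = sizeof(stime) / sizeof(stime[0]);

// Seven elements; the last one is the merged literal.
static_assert(sizeof(stime) / sizeof(stime[0]) == 7,
              "adjacent literals were merged");
```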
24. Pattern-based analysis
Error in the WebRTC project
void AsyncSocksProxySocket::SendAuth() {
  ....
  char * sensitive = new char[len];
  pass_.CopyTo(sensitive, true);
  request.WriteString(sensitive); // Password
  memset(sensitive, 0, len);
  delete [] sensitive;
  DirectSend(request.Data(), request.Length());
  state_ = SS_AUTH;
}
PVS-Studio: V597 CWE-14 The compiler could delete the 'memset' function
call, which is used to flush 'sensitive' object. The RtlSecureZeroMemory()
function should be used to erase the private data. socketadapters.cc 677
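Because the buffer is freed right after memset, the compiler may treat the call as a dead store and drop it. Where RtlSecureZeroMemory or a similar platform function is not available, a commonly used fallback is to write through a volatile pointer. The sketch below is a best-effort illustration; `secureZero` is a hypothetical helper and not part of WebRTC.

```cpp
#include <cassert>
#include <cstddef>

// Best-effort wipe: writes through a volatile pointer count as observable
// side effects, so compilers do not normally remove them.
void secureZero(void* data, std::size_t size)
{
    volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
    while (size--)
        *p++ = 0;
}
```

In the snippet above, `secureZero(sensitive, len)` would take the place of the memset call before `delete []`.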
26. Method annotations
• Static analysis is not magic, but a great deal of work
• For example, 7140 functions are annotated in PVS-Studio (for C and C++ alone)
27. Method annotations
• WinAPI
• Standard C library
• Standard template library
• glibc (GNU C Library)
• Qt
• MFC
• zlib
• libpng
• OpenSSL
• And so on.
30. Example of the function fread annotation
#define MAX_AVISYNTH_SCRIPT_LENGTH 16384
void TavisynthPage::onLoad(void)
{
  ....
  char script[MAX_AVISYNTH_SCRIPT_LENGTH];
  size_t len = fread(script, 1, MAX_AVISYNTH_SCRIPT_LENGTH, f);
  fclose(f);
  script[len] = '\0';
  ....
}
Error in the Ffdshow project
PVS-Studio: V557 Array overrun is possible. The value of 'len' index
could reach 16384. cavisynth.cpp 129
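The annotation tells the analyzer that fread may return as many elements as were requested, so len can equal the buffer size and the terminator lands one byte past the end. A possible correction is to request one byte fewer. The sketch below uses a hypothetical helper, `readText`, which is not part of ffdshow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Reads at most cap - 1 bytes so that the terminator always fits.
std::size_t readText(std::FILE* f, char* buf, std::size_t cap)
{
    if (f == nullptr || buf == nullptr || cap == 0)
        return 0;

    std::size_t len = std::fread(buf, 1, cap - 1, f);
    buf[len] = '\0';  // len <= cap - 1, so the index is in bounds
    return len;
}
```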
33. Example of the function bswap32 annotation
Error in the Android project
bool ELFAttribute::merge(....) {
  ....
  uint32_t subsection_length =
    *reinterpret_cast<const uint32_t*>(subsection_data);
  if (llvm::sys::IsLittleEndianHost !=
      m_Config.targets().isLittleEndian())
    bswap32(subsection_length);
  ....
}
PVS-Studio: V530 CWE-252 The return value of function 'bswap32' is required to be
utilized. ELFAttribute.cpp 84
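The warning implies that bswap32 returns the swapped value instead of modifying its argument, so the call above has no effect. The sketch below uses a stand-in implementation in a `sketch` namespace (the real function's signature is assumed, not quoted) and shows the corrected call pattern. The C++17 `[[nodiscard]]` attribute lets the compiler itself flag a discarded result.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in for the library function: returns the byte-swapped value
// and leaves its argument untouched.
[[nodiscard]] inline std::uint32_t bswap32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// Corrected call pattern: the result is assigned back.
inline std::uint32_t toHostOrder(std::uint32_t value, bool needsSwap)
{
    if (needsSwap)
        value = bswap32(value);
    return value;
}

} // namespace sketch
```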
34. Mixture of techniques
int Div(int X)
{
  return 10 / X;
}
void Foo()
{
  for (int i = 0; i < 5; ++i)
    Div(i);
}
PVS-Studio: V609 CWE-628 Divide by zero. Denominator 'X' == 0.
The 'Div' function processes value '[0..4]'. Inspect the first
argument. Check lines: 106, 110. test.cpp 106
Automated annotation + data flow analysis
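How the two techniques combine can be sketched roughly as follows. Data flow analysis knows that the loop variable lies in [0..4], and the automatically derived annotation for Div says that its argument is used as a divisor. `Range` and `mayDivideByZero` are hypothetical names for this illustration and do not describe the analyzer's internals.

```cpp
#include <cassert>

// Range of values an argument may take, as data flow analysis might record it.
struct Range
{
    int lo;
    int hi;
};

// Assumed annotation for Div: the argument is a divisor, so a range
// that contains zero is an error at the call site.
bool mayDivideByZero(Range divisor)
{
    return divisor.lo <= 0 && 0 <= divisor.hi;
}
```

For the loop above the argument range is `Range{0, 4}`, which contains zero. Starting the loop at 1 would give `Range{1, 4}` and no warning.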
38. Incredible number of ways to get a null pointer
• p = x ? array : nullptr;
• if (x) p = array; else p = nullptr;
• p = malloc(n);
char *p;
switch (x)
{
  case 1: p = "foo"; break;
  default: p = strstr(str, "tag"); break;
}
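The switch is a good example of a less obvious source: only the default branch can produce a null pointer, because strstr returns null when the substring is absent. A small sketch with hypothetical wrapper functions:

```cpp
#include <cassert>
#include <cstring>

// Mirrors the switch above: the result is null only on the default
// branch, and only when "tag" does not occur in str.
const char* pick(int x, const char* str)
{
    const char* p;
    switch (x)
    {
    case 1:  p = "foo"; break;
    default: p = std::strstr(str, "tag"); break;
    }
    return p;
}

// A caller has to check before dereferencing in any form.
char firstCharOrZero(const char* p)
{
    return p != nullptr ? p[0] : '\0';
}
```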
39. Huge number of ways to dereference a null pointer
• *p
• p[i]
• p->foo()
• memset(p, 0, n);
• int *x = p; *x = 123;
• T* x = new(p) T;
40. Why learn when you can accurately evaluate?
• Division by 0;
• Null pointer dereference;
• Index out of array bounds;
• Overflows;
• Condition is always true/false;
• And so on.
• Moreover, developers themselves search for similar errors by «executing code in their heads»
42. Second problem: lack of examples
• Yes, you can search for some cases that fit pattern-based analysis
• But where would you get so many examples?
43. The C++ language is evolving rapidly. How do you search for errors in code that uses new syntax?
45. Is machine learning useless?
• No, but there is too much hype
• In my opinion, one interesting application is false positive suppression
46. Conclusion
• Static analysis is complicated and exciting
• Analyzers today and analyzers 10 years ago are two very different things
• Adopting static analysis is inevitable as projects grow in size and complexity
• The same thing happened with version control systems
• And with bug trackers
47. Time for your questions!
E-Mail: karpov@viva64.com
Twitter: @Code_Analysis
Instagram: @pvs_studio_unicorn