How We Analyzed 1000 Dumps in One Day - Dina Goldshtein, Brightsource - DevOpsDays Tel Aviv 2015
1. How We Analyzed 1000 Dumps in One Day
DINA GOLDSHTEIN
EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY
BLOGS.MICROSOFT.CO.IL/DINAZIL/
@DINAGOZIL
2. Agenda
What we do and why we need dumps
Manual analysis process
The holy grail: automatic dump analysis
Our automatic triage workflow
3. About Us
BrightSource Energy builds solar power plants
Power plants have control software
Control software crashes
4. Our Production Environment
The office (development) network is connected to the Internet
The production (power plant) network is isolated
There is a (very slow) one-way link from production to development
5. In the Beginning…
Mask all crashes with a nice error dialog and an “orderly” shut-down
Analyze errors using very extensive log files from all components
Alas, the last error in the log doesn’t always correspond to the real culprit
Need to know exact exception, when it occurred and where!
6. Crash Dumps
A dump is a snapshot of a process’s memory: threads, heap, exceptions,
locks, etc.
Various tools can open dump files and see what’s inside
7. How???
An executable can be compiled with debug information - the symbols
Symbol files (.PDB) contain information which allows debuggers to
match addresses and other information in the file to names of DLLs,
functions, variables, lines of code, etc.
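The address-to-name matching that symbols enable can be pictured as a lookup in a sorted table. The sketch below is a toy model, not the PDB format: the `SYMBOLS` table, its addresses, the function names, and the `resolve` helper are all invented for illustration.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
# A real PDB stores far more (types, locals, line numbers); this is only a toy.
SYMBOLS = [
    (0x1000, "Main"),
    (0x1400, "ReadSensors"),
    (0x1A00, "UpdateMirrors"),
]
STARTS = [start for start, _ in SYMBOLS]


def resolve(address):
    """Return 'Function+0xOFFSET' for the symbol that contains the address."""
    index = bisect.bisect_right(STARTS, address) - 1
    if index < 0:
        return hex(address)  # below the first known symbol: leave it raw
    start, name = SYMBOLS[index]
    return f"{name}+{address - start:#x}"


print(resolve(0x1A2C))  # UpdateMirrors+0x2c
```

Without the table, a debugger could only show the raw address `0x1a2c`, which is why a dump is of little use without matching symbols.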
9. Symbol Server
Symbols can be provided to the debugger explicitly
But they can also reside in a Symbol Server (stored by name and hash)
The debugger can download debugging symbols automatically for the
right product version
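"Stored by name and hash" can be made concrete. For PDB files, a symbol store conventionally uses the layout `<file name>/<GUID><age>/<file name>`, where the GUID and age are recorded in both the binary and its PDB, so the debugger can ask for exactly the matching build. The sketch below builds such a path; the server URL, file name, GUID, and age are made-up values, and `symbol_path` is a hypothetical helper.

```python
import uuid


def symbol_path(pdb_name, guid, age):
    """Build the relative symbol-store path for a PDB: name/GUID+age/name."""
    # GUID as 32 uppercase hex digits, followed by the age in uppercase hex.
    index = guid.hex.upper() + format(age, "X")
    return f"{pdb_name}/{index}/{pdb_name}"


# Made-up values for illustration only.
guid = uuid.UUID("3f2504e0-4f89-11d3-9a0c-0305e82c3301")
print("http://symbols.example/" + symbol_path("ControlApp.pdb", guid, 2))
```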
10. Production Crashes
We can’t attach a debugger, or do remote analysis of production errors
Windows can be configured to automatically save a dump when a
process crashes
When crashes occur, dump files are generated and transmitted to a
central location and then to the office network
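One way to have Windows save a dump automatically is the Windows Error Reporting `LocalDumps` registry key. The slides do not say which mechanism was used, so treat this as an assumption. The sketch only builds the `reg add` command lines, which would then be run from an elevated prompt on the target machine; the executable name and dump folder are hypothetical.

```python
import subprocess

LOCAL_DUMPS = r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dumps_commands(exe_name, dump_folder, max_dumps=10):
    """Return `reg add` argument lists enabling full crash dumps for one executable."""
    key = LOCAL_DUMPS + "\\" + exe_name  # per-application subkey
    values = [
        ("DumpFolder", "REG_EXPAND_SZ", dump_folder),
        ("DumpCount", "REG_DWORD", str(max_dumps)),  # oldest dump is replaced beyond this
        ("DumpType", "REG_DWORD", "2"),  # 2 = full dump, 1 = mini dump
    ]
    return [
        ["reg", "add", key, "/v", name, "/t", kind, "/d", data, "/f"]
        for name, kind, data in values
    ]


# Print the commands with proper quoting instead of running them.
for command in local_dumps_commands("ControlApp.exe", r"D:\CrashDumps"):
    print(subprocess.list2cmdline(command))
```

Full dumps (`DumpType` 2) capture the whole process memory, which is what makes heap and exception inspection possible, at the cost of multi-gigabyte files.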
11. Manual Dump Analysis
With high failure rates, we’re talking dozens of dumps per day from a
single facility
Many errors are exact duplicates
Manual analysis means:
◦ Copy dump to my machine (it’s not uncommon for a dump to be 2-3GB)
◦ Copy debugger support files and symbols (if no symbol server is present)
◦ Open dump in debugger (Visual Studio/WinDbg)
◦ Locate the exception and call stack
◦ Triage and open a bug for the relevant developer
◦ Probably around 10 minutes per dump…
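Because many errors are exact duplicates, triage effort drops sharply once dumps are grouped by a crash signature. A minimal sketch, assuming the exception type and call stack have already been extracted from each dump; the `signature` helper, the sample records, and the choice of three frames are illustrative assumptions, not the talk's implementation.

```python
from collections import defaultdict


def signature(exception_type, frames, depth=3):
    """Crash signature: exception type plus the top `depth` stack frames."""
    return (exception_type, tuple(frames[:depth]))


def group_duplicates(crashes):
    """Map each signature to the dump files that share it."""
    groups = defaultdict(list)
    for dump_file, exception_type, frames in crashes:
        groups[signature(exception_type, frames)].append(dump_file)
    return dict(groups)


# Invented sample data: (dump file, exception type, call stack from the top).
crashes = [
    ("a.dmp", "NullReferenceException", ["Tracker.Aim", "Loop.Step", "Main"]),
    ("b.dmp", "NullReferenceException", ["Tracker.Aim", "Loop.Step", "Main"]),
    ("c.dmp", "TimeoutException", ["Bus.Read", "Loop.Step", "Main"]),
]
for sig, dumps in group_duplicates(crashes).items():
    print(sig[0], len(dumps), dumps)
```

Only one dump per group then needs a developer's attention; the rest can be attached to the same bug.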
12. Automatic Dump Analysis
ClrMD is a NuGet package which provides a debugger API for dumps
and live processes
◦ Works with both native and managed code
The core of our automatic solution uses ClrMD for automatic dump
analysis and triage:
◦ Exception information
◦ Call stack
◦ Likely faulting component
ClrMD recently became open source on GitHub
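The "likely faulting component" can be estimated from the call stack with a simple heuristic: walk the frames from the top and blame the first module that is not framework code. This is a guess at one reasonable rule, not necessarily the one the talk's tool uses; the module names and the `FRAMEWORK_PREFIXES` list are hypothetical.

```python
# Assumed list of module-name prefixes treated as framework or OS code.
FRAMEWORK_PREFIXES = ("System", "Microsoft", "mscorlib", "clr", "ntdll", "kernel32")


def likely_faulting_module(frames):
    """Return the module of the first non-framework frame, or None if all are framework."""
    for module, _method in frames:
        if not module.startswith(FRAMEWORK_PREFIXES):
            return module
    return None


# Invented call stack, top frame first: (module, method).
stack = [
    ("mscorlib", "System.Collections.Generic.Dictionary.get_Item"),
    ("Plant.Tracking", "Tracker.Aim"),
    ("Plant.Control", "Loop.Step"),
]
print(likely_faulting_module(stack))  # Plant.Tracking
```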
13. Some Code…
// Load the dump; if it contains a CLR, create a runtime for managed inspection
target = DataTarget.LoadCrashDump(dumpPath);
if (target.ClrVersions.Count > 0) {
    ClrInfo dacVersion = target.ClrVersions[0];
    string dacLocation = dacVersion.TryDownloadDac();
    runtime = target.CreateRuntime(dacLocation);
}

// Ask the debugger engine about the last event (for a crash dump, the crash itself)
var dc = (IDebugControl)target.DebuggerInterface;
dc.GetLastEventInformation(out eventType, out processId,
    out threadIndex, extraInformation, extraInformationSize,
    out extraInformationUsed, description, descriptionSize,
    out descriptionUsed);

// Translate the faulting thread's index into its OS thread ID
// (only sysIds[0], the faulting thread, is used below)
var dso = (IDebugSystemObjects)target.DebuggerInterface;
var sysIds = new uint[count];
dso.GetThreadIdsByIndex(threadIndex, count, null, sysIds);

// If that thread runs managed code, fetch its current CLR exception
if (IsThreadManaged(sysIds[0])) {
    var td = runtime.Threads.First(t => t.OSThreadId == sysIds[0]);
    clrException = td.CurrentException;
}
14. Our Dump Analysis Workflow
At the end of a shift, operators copy dumps to a network share in the
office network
A script goes over the dumps one by one and uses ClrMD to find the
root cause of the error
According to a configuration file, the faulting module’s owner is alerted
and a ticket is opened in Redmine
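The routing step can be sketched as a lookup from faulting module to owner, followed by building a Redmine issue. The payload shape follows Redmine's REST API (`POST /issues.json` with an `issue` object); the configuration format, project identifier, user IDs, and module names are assumptions, since the slides do not show them. The sketch builds the request body and leaves sending it out.

```python
import json

# Hypothetical configuration: faulting module -> Redmine user id of its owner.
OWNERS = {"Plant.Tracking": 12, "Plant.Control": 7}
DEFAULT_OWNER = 1  # assumed fallback assignee when no owner is configured


def build_issue(module, exception_type, stack, dump_file):
    """Build the JSON body for Redmine's POST /issues.json."""
    return {
        "issue": {
            "project_id": "control-software",  # assumed project identifier
            "subject": f"{exception_type} in {module}",
            "description": f"Dump: {dump_file}\n\n" + "\n".join(stack),
            "assigned_to_id": OWNERS.get(module, DEFAULT_OWNER),
        }
    }


body = build_issue("Plant.Tracking", "NullReferenceException",
                   ["Tracker.Aim", "Loop.Step", "Main"], "a.dmp")
print(json.dumps(body, indent=2))
```

Keeping the module-to-owner map in configuration rather than code means reassigning a component does not require touching the script.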
15. From Hours to Seconds
Manual, tedious, error-prone dump analysis by red-eyed developers…
…Automatic, happy, untiring ninja script
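A quick back-of-the-envelope calculation shows what the manual route would cost for the number of dumps in the talk's title, using the roughly 10 minutes per dump estimated for manual analysis. The 8-hour working day is an assumption added here.

```python
MINUTES_PER_DUMP = 10   # the manual estimate quoted in the talk
DUMPS = 1000            # the figure from the talk's title

total_hours = DUMPS * MINUTES_PER_DUMP / 60
working_days = total_hours / 8  # assuming 8-hour working days

print(f"{total_hours:.0f} hours, about {working_days:.0f} working days")
```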
17. Summary
What we do and why we need dumps
Manual analysis process
The holy grail: automatic dump analysis
Our automatic triage workflow
Resources:
◦ The slides: http://tinyurl.com/dumpstlv
◦ ClrMD on GitHub
◦ DumpAnalyzer on GitHub
◦ msos on GitHub
18. Questions?
Thank You!
DINA GOLDSHTEIN
EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY
BLOGS.MICROSOFT.CO.IL/DINAZIL/
@DINAGOZIL
"Retouched Kitty" by Ozan Kilic is licensed under Creative Commons Attribution 2.0