Debugging complex systems can be difficult. Luckily, the Erlang ecosystem is full of tools to help you out. With the right mindset and the right tools, debugging complex Erlang systems can be easy. In this talk, I'll share the debugging methodology I've developed over the years.
Understand the system
1. Be familiar with your stack: OS, VM, application, protocols, external services, etc.
2. Know your tools:
• Take time to experiment with new tools.
• Match the tool to the bug.
3. What are/were the requirements?
4. Is it really a bug?
Reproduce the bug
1. What are the conditions that trigger the bug:
• function input?
• invalid state?
• environment variable?
• OS settings?
2. Try reproducing locally.
3. Try reproducing in production.
4. Reproducibility will greatly simplify the debugging process and the validation of the fix!
Collect data
1. Search on Google, it might be a known bug!
2. Don’t jump to conclusions; use observations to guide your intuition (you’re always wrong!).
3. Use debugging tools to get more insights.
4. Filter out the noise (especially important for performance bugs!!!).
5. Too much data is as bad as not enough.
Use process of elimination
1. Divide and conquer!
2. Start with macro observations:
• Are all servers affected?
• Are all datacenters affected?
• Is there an external service involved?
3. From there, narrow down your search using data!
4. Watch out: one bug might be hiding another one.
Change one thing at a time
1. Don’t take any shortcuts; if you change too many variables, you won’t be able to correlate the results.
2. Be smart; pick the one change that cuts the search space the most.
3. If you want to test different theories in parallel, create a branch for each change and deploy on different nodes.
Keep an audit trail
1. Don’t trust your memory; you might get some facts wrong.
2. It can help if you’re debugging a similar problem in the future.
3. Useful for post-mortems!
4. Allows you to collaborate with co-workers (e.g. via Google Docs).
5. Can be used to coach teammates.
Verify your assumptions
1. Check the basics:
• What code is deployed?
• Same VM version?
• Same application config?
• Same kernel version?
• Same system config?
2. Is the tool lying? Validate your tools!
3. Backtrack and go over your audit trail; you might have missed something!
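Most of these checks can be run from a remote shell on the node itself. A minimal sketch using only standard OTP calls (my_module and my_app are placeholders):
1> erlang:system_info(otp_release).    %% VM release this node is running
2> erlang:system_info(system_version). %% full version banner
3> code:which(my_module).              %% which .beam file is actually loaded?
4> my_module:module_info(md5).         %% does it match the build you think you deployed?
5> application:get_all_env(my_app).    %% effective application config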
Take a step back
1. Step on your ego and ask a co-worker for help!
2. Ask an expert.
3. Sleep on it; your thoughts will be clearer in the morning.
Validate your fix
1. Is it a side effect? Heisenbug?
2. Did you really fix the root cause, or just work around it?
3. Validate in production:
1. start with one node per datacenter
2. slowly roll out the fix and monitor
4. Add regression tests if you can.
5. …
6. Go back to bed Zzzz.
Systematic debugging rules
1. Understand the system
2. Reproduce the bug
3. Collect data
4. Use process of elimination
5. Change only one thing at a time
6. Keep an audit trail
7. Verify your assumptions
8. Take a step back
9. Validate your fix
Profilers
1. fprof / eprof
2. eflame
3. SystemTap / DTrace / LTTng / perf
4. many others!
Good for: performance bugs
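For illustration, a minimal eprof session (standard OTP API; Pid is whichever process you suspect):
1> eprof:start().
2> eprof:start_profiling([Pid]). %% start tracing the suspect process
%% ... exercise the system ...
3> eprof:stop_profiling().
4> eprof:analyze(total).         %% aggregate runtime per function
fprof follows the same profile/analyse pattern but traces every call, so expect more overhead and more detail.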
System utilities
Good for: resources and performance bugs
Some of my favourite ones:
1. top / htop
2. ngrep / netstat / tcpdump
3. strace
4. iotop / lsof
Example #1
1. Are the numbers in the database good?
No => Not a web application bug.
2. Are the numbers in the logs good?
No => Not an ETL bug.
3. Are there other services on the same box affected by missing log events?
No => Probably not a filesystem bug.
Example #1
1> Tid = rtb_gateway_counter:table_id(bid_metrics_counters).
2> ets:tab2list(Tid).
[{{6,4,undefined},0,0,6,6},{{6,3,2},6,6,12,12}]
Is the data being aggregated? Yes.
Bug is most likely in the function that serializes the ETS table to JSON.
Example #1
Let’s trace!
1> redbug:start("rtb_gateway_counter:map_counters(bid_metrics_counters, '_', '_', '_')").
09:07:20 <dead> {rtb_gateway_counter,map_counters,
    [bid_metrics_counters,
     [{{6,4,undefined},0,0,6,6},{{6,3,2},7,7,14,14}],
     <0.215.0>, …]}
09:08:20 <dead> {rtb_gateway_counter,map_counters,
    [bid_metrics_counters,
     [{{6,4,undefined},0,0,12,12},{{6,3,2},12,12,24,24}],
     <0.215.0>, …]}
Function should be calling itself recursively…
Example #2
2014-07-31 19:50:39.915 [error] emulator Error in process <0.23676.390> on node 'rtb-gateway' with exit value:
{function_clause,[{cowboy_protocol,parse_uri_path,[<<0 bytes>>,
    {state,#Port<0.13093005>,ranch_tcp,[cowboy_router,cowboy_handler],false,
    [{listener,rtbgw_lb},{dispatch,[{'_',[],[{[<<13 bytes>>,exchange],
    [],rtb_gateway_notification_handler,[ewr,<<4 bytes>>]},{[<<5 bytes>>,exchange...
Where do we start?
Example #2
Bug is non-trivial to reproduce, so let’s start by collecting data…
1. Google the error in case it’s a known bug. Not really…
2. Add extra logging in ranch to print out the request state when the bug occurs.
3. Use ngrep to validate that some HTTP requests are malformed (e.g. an improper Content-Length).
Example #2
1. Capture TCP streams of failing requests using tcpdump.
2. Build a tool to replay TCP dumps (httpreplay).
3. Replay the traffic capture…
4. Can reproduce… but not deterministically.
5. Stepped on my ego and passed the flag to a teammate.
Example #2
1. Teammate tried tracing the problem… no dice.
2. Teammate took a step back…
Take a step back
Teammate had a eureka moment while driving…
The socket in the cowboy req record is mutable!
p.s. Thanks, Jeremie! :)
Example #3
1. Receive a “service trouble” email from Dynect Concierge (DNS)…
2. Receive a “DOWN/PROBLEM” email from Nagios…
3. SSH to the server to find beam dumping an erl_crash.dump…
Where do we start?
Example #3
While we’re waiting for the VM to finish writing the erl_crash.dump, let’s check Graphite.
Example #3
lpgauth # ./erl_crashdump_analyzer.sh /erl_crash.dump
analyzing /erl_crash.dump, generated on: Mon Feb 16 10:57:22 2015
Slogan: eheap_alloc: Cannot allocate 71302344 bytes of memory (of type "heap", thread 4).
…
Different message queue lengths (5 largest different):
===
1 1357844
7 2
22 1
10180 0
…
cat /erl_crash.dump | grep -10 1357844
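Reading the two columns as process count and mailbox length: one process is sitting on a backlog of ~1.4M messages. On a live node the same culprit can be found with plain BIFs; a quick sketch:
1> lists:max([{Len, Pid} || Pid <- erlang:processes(),
                            {message_queue_len, Len} <- [erlang:process_info(Pid, message_queue_len)]]).
%% The second generator skips processes that exit mid-scan,
%% since process_info/2 returns undefined for them.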
Example #3
1. Validate that gen_tcp can actually block.
2. Mitigate by using the gen_tcp option send_timeout.
3. Fix the problem by adding back-pressure (the joys of unbounded queues!).
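A sketch of the mitigation using real inet options (host, port and timeout values are invented):
{ok, Socket} = gen_tcp:connect("metrics.example.com", 2003,
    [binary,
     {send_timeout, 5000},          %% gen_tcp:send/2 returns {error, timeout}
     {send_timeout_close, true}]).  %% ...and the socket gets closed, instead of blocking forever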
Example #3
%% safe_update_counter/2 (not shown) presumably wraps ets:update_counter/3,
%% returning {error, tid_missing} when the table no longer exists.
check(Tid, MaxBacklog) ->
    case increment(Tid, MaxBacklog) of
        [MaxBacklog, MaxBacklog] ->
            %% counter was already at the limit: reject
            false;
        [_, Value] when Value =< MaxBacklog ->
            true;
        {error, tid_missing} ->
            false
    end.

decrement(Tid) ->
    %% {Pos, Incr, Threshold, SetValue}: decrement by 1, floor at 0
    safe_update_counter(Tid, {2, -1, 0, 0}).

increment(Tid, MaxBacklog) ->
    %% {2, 0} reads the current value; {2, 1, MaxBacklog, MaxBacklog}
    %% increments it by 1, capped at MaxBacklog
    safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).
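A hypothetical call site (the backlog module name and message shape are made up): the producer checks before enqueueing, and the consumer decrements once the work is done, so the mailbox can never grow past MaxBacklog.
%% producer: shed load instead of queueing unboundedly
case backlog:check(Tid, MaxBacklog) of
    true  -> Worker ! {send, Data};
    false -> {error, backlog_full}
end.
%% consumer (Worker): call backlog:decrement(Tid) after each
%% message has been written to the socket.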
Tips
Take the time to build your own tools!
1. If you find yourself repeating the same commands often, write a script!
2. Add debugging functions to your modules (sketched below):
• ETS table tid accessor
• gen_server state accessor
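A minimal sketch of both accessors: sys:get_state/1 is standard OTP; the tid accessor assumes the table owner registered its tid in a public named table (?TABLE_IDS is hypothetical):
-define(TABLE_IDS, my_app_table_ids).

%% Return the tid of a (possibly private) ETS table, looked up from
%% a registry table populated by the owner at startup.
table_id(Name) ->
    ets:lookup_element(?TABLE_IDS, Name, 2).

%% Peek at a running gen_server's state without touching its code.
state(ServerRef) ->
    sys:get_state(ServerRef).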