Debugging complex systems can be difficult. Luckily, the Erlang ecosystem is full of tools to help you out. With the right mindset and the right tools, debugging complex Erlang systems can be easy. In this talk, I'll share the debugging methodology I've developed over the years.
Understand the system
1. Be familiar with your stack: OS, VM, application, protocols, external services, etc.
2. Know your tools:
• Take time to experiment with new tools.
• Match the tool to the bug.
3. What are/were the requirements?
4. Is it really a bug?
Reproduce the bug
1. What are the conditions that trigger the bug:
• function input?
• invalid state?
• environment variable?
• OS settings?
2. Try reproducing locally.
3. Try reproducing in production.
4. Reproducibility will greatly simplify the debugging process and the validation of the fix!
Collect data
1. Search on Google, it might be a known bug!
2. Don’t jump to conclusions; use observations to guide your intuition (you’re always wrong!).
3. Use debugging tools to get more insights.
4. Filter out the noise (especially important for performance bugs!!!).
5. Too much data is as bad as not enough.
Use process of elimination
1. Divide and conquer!
2. Start with macro observations:
• Are all servers affected?
• Are all datacenters affected?
• Is there an external service involved?
3. From there, narrow down your search using data!
4. Watch out: one bug might be hiding another one.
Change one thing at a time
1. Don’t take any shortcuts; if you change too many variables, you won’t be able to correlate the results.
2. Be smart; pick the one change that cuts the search space the most.
3. If you want to test different theories in parallel, create a branch for each change and deploy on different nodes.
Keep an audit trail
1. Don’t trust your memory; you might get some facts wrong.
2. It can help if you’re debugging a similar problem in the future.
3. Useful for post-mortems!
4. Allows you to collaborate with co-workers (e.g. via Google Docs).
5. Can be used to coach teammates.
Verify your assumptions
1. Check the basics:
• What code is deployed?
• Same VM version?
• Same application config?
• Same kernel version?
• Same system config?
2. Is the tool lying? Validate your tools!
3. Backtrack and go over your audit trail; you might have missed something!
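Most of these checks can be run from a remote shell on the node itself. A minimal sketch using only standard OTP calls (my_module and my_app are placeholders):
1> erlang:system_info(otp_release).    %% VM release this node is running
2> erlang:system_info(system_version). %% full version banner
3> code:which(my_module).              %% which .beam file is actually loaded?
4> my_module:module_info(md5).         %% does it match the build you think you deployed?
5> application:get_all_env(my_app).    %% effective application config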
Take a step back
1. Step on your ego and ask a co-worker for help!
2. Ask an expert.
3. Sleep on it; your thoughts will be clearer in the morning.
Validate your fix
1. Is it a side effect? Heisenbug?
2. Did you really fix the root cause, or just work around it?
3. Validate in production:
1. start with one node per datacenter
2. slowly roll out the fix and monitor
4. Add regression tests if you can.
5. …
6. Go back to bed Zzzz.
Systematic debugging rules
1. Understand the system
2. Reproduce the bug
3. Collect data
4. Use process of elimination
5. Change only one thing at a time
6. Keep an audit trail
7. Verify your assumptions
8. Take a step back
9. Validate your fix
Profilers
1. fprof / eprof
2. eflame
3. SystemTap / DTrace / LTTng / perf
4. many others!
Good for: performance bugs
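For illustration, a minimal eprof session (standard OTP API; Pid is whichever process you suspect):
1> eprof:start().
2> eprof:start_profiling([Pid]). %% start tracing the suspect process
%% ... exercise the system ...
3> eprof:stop_profiling().
4> eprof:analyze(total).         %% aggregate runtime per function
fprof follows the same profile/analyse pattern but traces every call, so expect more overhead and more detail.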
System utilities
Good for: resources and performance bugs
Some of my favourite ones:
1. top / htop
2. ngrep / netstat / tcpdump
3. strace
4. iotop / lsof
Example #1
1. Are the numbers in the database good?
No => Not a web application bug.
2. Are the numbers in the logs good?
No => Not an ETL bug.
3. Are there other services on the same box affected by missing log events?
No => Probably not a filesystem bug.
Example #1
1> Tid = rtb_gateway_counter:table_id(bid_metrics_counters).
2> ets:tab2list(Tid).
[{{6,4,undefined},0,0,6,6},{{6,3,2},6,6,12,12}]
Is the data being aggregated? Yes.
Bug is most likely in the function that serializes the ETS table to JSON.
Example #1
Let’s trace!
1> redbug:start("rtb_gateway_counter:map_counters(bid_metrics_counters, '_', '_', '_')").
09:07:20 <dead> {rtb_gateway_counter,map_counters,
    [bid_metrics_counters,
     [{{6,4,undefined},0,0,6,6},{{6,3,2},7,7,14,14}],
     <0.215.0>, …]}
09:08:20 <dead> {rtb_gateway_counter,map_counters,
    [bid_metrics_counters,
     [{{6,4,undefined},0,0,12,12},{{6,3,2},12,12,24,24}],
     <0.215.0>, …]}
Function should be calling itself recursively…
Example #2
2014-07-31 19:50:39.915 [error] emulator Error in process <0.23676.390> on node 'rtb-gateway' with exit value:
{function_clause,[{cowboy_protocol,parse_uri_path,[<<0 bytes>>,
    {state,#Port<0.13093005>,ranch_tcp,[cowboy_router,cowboy_handler],false,
    [{listener,rtbgw_lb},{dispatch,[{'_',[],[{[<<13 bytes>>,exchange],
    [],rtb_gateway_notification_handler,[ewr,<<4 bytes>>]},{[<<5 bytes>>,exchange...
Where do we start?
Example #2
Bug is non-trivial to reproduce, so let’s start by collecting data…
1. Google the error in case it’s a known bug. Not really…
2. Add extra logging in ranch to print out the request state when the bug occurs.
3. Use ngrep to validate that some HTTP requests are malformed (e.g. an improper Content-Length).
Example #2
1. Capture TCP streams of failing requests using tcpdump.
2. Build a tool to replay TCP dumps (httpreplay).
3. Replay the traffic capture…
4. Can reproduce… but not deterministically.
5. Stepped on my ego and passed the flag to a teammate.
Example #2
1. Teammate tried tracing the problem… no dice.
2. Teammate took a step back…
Take a step back
Teammate had a eureka moment while driving…
The socket in the cowboy req record is mutable!
p.s. Thanks, Jeremie! :)
Example #3
1. Receive a “service trouble” email from Dynect Concierge (DNS)…
2. Receive a “DOWN/PROBLEM” email from Nagios…
3. SSH to the server to find beam dumping an erl_crash.dump…
Where do we start?
Example #3
While we’re waiting for the VM to finish writing the erl_crash.dump, let’s check Graphite.
Example #3
lpgauth # ./erl_crashdump_analyzer.sh /erl_crash.dump
analyzing /erl_crash.dump, generated on: Mon Feb 16 10:57:22 2015
Slogan: eheap_alloc: Cannot allocate 71302344 bytes of memory (of type "heap", thread 4).
…
Different message queue lengths (5 largest different):
===
1 1357844
7 2
22 1
10180 0
…
cat /erl_crash.dump | grep -10 1357844
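Reading the two columns as process count and mailbox length: one process is sitting on a backlog of ~1.4M messages. On a live node the same culprit can be found with plain BIFs; a quick sketch:
1> lists:max([{Len, Pid} || Pid <- erlang:processes(),
                            {message_queue_len, Len} <- [erlang:process_info(Pid, message_queue_len)]]).
%% The second generator skips processes that exit mid-scan,
%% since process_info/2 returns undefined for them.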
Example #3
1. Validate that gen_tcp can actually block.
2. Mitigate by using the gen_tcp option send_timeout.
3. Fix the problem by adding back-pressure (the joys of unbounded queues!).
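A sketch of the mitigation using real inet options (host, port and timeout values are invented):
{ok, Socket} = gen_tcp:connect("metrics.example.com", 2003,
    [binary,
     {send_timeout, 5000},          %% gen_tcp:send/2 returns {error, timeout}
     {send_timeout_close, true}]).  %% ...and the socket gets closed, instead of blocking forever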
Example #3
%% safe_update_counter/2 (not shown) presumably wraps ets:update_counter/3,
%% returning {error, tid_missing} when the table no longer exists.
check(Tid, MaxBacklog) ->
    case increment(Tid, MaxBacklog) of
        [MaxBacklog, MaxBacklog] ->
            %% counter was already at the limit: reject
            false;
        [_, Value] when Value =< MaxBacklog ->
            true;
        {error, tid_missing} ->
            false
    end.

decrement(Tid) ->
    %% {Pos, Incr, Threshold, SetValue}: decrement by 1, floor at 0
    safe_update_counter(Tid, {2, -1, 0, 0}).

increment(Tid, MaxBacklog) ->
    %% {2, 0} reads the current value; {2, 1, MaxBacklog, MaxBacklog}
    %% increments it by 1, capped at MaxBacklog
    safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).
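A hypothetical call site (the backlog module name and message shape are made up): the producer checks before enqueueing, and the consumer decrements once the work is done, so the mailbox can never grow past MaxBacklog.
%% producer: shed load instead of queueing unboundedly
case backlog:check(Tid, MaxBacklog) of
    true  -> Worker ! {send, Data};
    false -> {error, backlog_full}
end.
%% consumer (Worker): call backlog:decrement(Tid) after each
%% message has been written to the socket.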
Tips
Take the time to build your own tools!
1. If you find yourself repeating the same commands often, write a script!
2. Add debugging functions to your modules (sketched below):
• ETS table tid accessor
• gen_server state accessor
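A minimal sketch of both accessors: sys:get_state/1 is standard OTP; the tid accessor assumes the table owner registered its tid in a public named table (?TABLE_IDS is hypothetical):
-define(TABLE_IDS, my_app_table_ids).

%% Return the tid of a (possibly private) ETS table, looked up from
%% a registry table populated by the owner at startup.
table_id(Name) ->
    ets:lookup_element(?TABLE_IDS, Name, 2).

%% Peek at a running gen_server's state without touching its code.
state(ServerRef) ->
    sys:get_state(ServerRef).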