Louis-Philippe Gauthier
Director, Product Engineering
Debugging Complex Systems - Erlang Factory SF 2015
2
Let’s fix the process.
Systematic debugging!
3
Understand the system
1. Be familiar with your stack: OS, VM, application,
protocols, external services, etc.
2. Know your tools:
• Take time to experiment with new tools.
• Match the tool to the bug.
3. What are/were the requirements?
4. Is it really a bug?
4
Reproduce the bug
1. What are the conditions that trigger the bug:
• function input?
• invalid state?
• environment variable?
• OS settings?
2. Try reproducing locally.
3. Try reproducing in production.
4. Reproducibility will greatly simplify the debugging
process and the validation of the fix!
5
Collect data
1. Search on Google, it might be a known bug!
2. Don’t jump to conclusions, use observations to
guide your intuition (you’re always wrong!).
3. Use debugging tools to get more insights.
4. Filter out the noise (especially important for
performance bugs!!!).
5. Too much data is as unhelpful as not enough.
6
Use process of elimination
1. Divide and conquer!
2. Start with macro observations:
• Are all servers affected?
• Are all datacenters affected?
• Is there an external service involved?
3. From there, narrow down your search using data!
4. Watch out, one bug might be hiding another one.
7
Change one thing at a time
1. Don’t take any shortcuts; if you change too many
variables, you won’t be able to correlate the
results.
2. Be smart; pick the one change that cuts the
search space the most.
3. If you want to test different theories in parallel,
create a branch for each change and deploy on
different nodes.
8
Keep an audit trail
1. Don’t trust your memory, you might get some
facts wrong.
2. Can help if you’re debugging a similar problem in
the future.
3. Useful for post-mortems!
4. Allows you to collaborate with co-workers (e.g.
via Google Docs).
5. Can be used to coach teammates.
9
Verify your assumptions
1. Check the basics:
• What code is deployed?
• Same VM version?
• Same application config?
• Same kernel version?
• Same system config?
2. Is the tool lying? Validate your tools!
3. Backtrack and go over your audit trail, you might
have missed something!
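The basics listed above can be captured in one call and compared between a healthy node and a misbehaving one. This is a minimal sketch, assuming a hypothetical module named node_fingerprint that is not part of the talk. It relies on the vsn attribute of loaded modules, which defaults to the module's MD5 checksum unless the source sets it explicitly.

```erlang
-module(node_fingerprint).
-export([collect/1, diff/2]).

%% Gather the basics worth comparing between two nodes.
%% Modules is the list of modules whose loaded code you want to check.
collect(Modules) ->
    [{otp_release, erlang:system_info(otp_release)},
     {erts_version, erlang:system_info(version)},
     {os_type, os:type()},
     {os_version, os:version()},
     {module_vsn, [{M, loaded_vsn(M)} || M <- Modules]}].

%% The vsn attribute describes the code that is actually loaded,
%% which may differ from the beam file on disk.
loaded_vsn(M) ->
    try proplists:get_value(vsn, M:module_info(attributes))
    catch error:undef -> not_found
    end.

%% Return {Key, ValueA, ValueB} for every key whose value differs.
diff(A, B) ->
    [{K, V, proplists:get_value(K, B)}
     || {K, V} <- A, V =/= proplists:get_value(K, B)].
```

On a live cluster the remote side could be fetched with rpc:call(Node, node_fingerprint, collect, [Modules]) and passed to diff/2 together with the local result.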
10
Take a step back
1. Swallow your pride and ask a co-worker for help!
2. Ask an expert.
3. Sleep on it; your thoughts will be clearer in the
morning.
11
Validate your fix
1. Is it a side effect? Heisenbug?
2. Did you really fix the root cause or just work
around it?
3. Validate in production:
1. start with one node per datacenter
2. slowly roll out the fix and monitor
4. Add regression tests if you can.
5. …
6. Go back to bed Zzzz.
12
Systematic debugging rules
1. Understand the system
2. Reproduce the bug
3. Collect data
4. Use process of elimination
5. Change only one thing at a time
6. Keep an audit trail
7. Verify your assumptions
8. Take a step back
9. Validate your fix
13
Tools of the trade
14
Erlang interactive shell
lpgauth # erl -remsh rtb-gateway -name shell -setcookie monster
Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
Eshell V6.3 (abort with ^G)
1> Version = erlang:system_info(otp_release).
"17"
2> b().
Version = "17"
ok
3> f(Version).
ok
4> b().
ok
15
Erlang interactive shell
lpgauth # /etc/init.d/rtb-delivery shell
Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
Eshell V6.3 (abort with ^G)
1> ets:i().
 id              name            type  size  mem   owner
 ----------------------------------------------------------------------------
 …
 swirl_flows     swirl_flows     set   15    1611  swirl_tracker
 swirl_mappers   swirl_mappers   set   15    481   swirl_tracker
 swirl_reducers  swirl_reducers  set   0     316   swirl_tracker
 swirl_streams   swirl_streams   set   15    908   swirl_tracker
2> ets:i(swirl_streams).
3> ets:tab2list(swirl_streams).
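When ets:i() prints too many rows to scan by eye, the same information can be sorted programmatically. This is a rough sketch, assuming a hypothetical helper module ets_top that is not from the talk. It uses only ets:all/0 and ets:info/1.

```erlang
-module(ets_top).
-export([by_memory/1]).

%% Return the N largest ETS tables as {Name, Size, MemoryInWords},
%% biggest first. ets:info/1 returns undefined when a table is deleted
%% between ets:all() and the lookup, so those tables are skipped.
by_memory(N) ->
    Rows = [{proplists:get_value(name, Info),
             proplists:get_value(size, Info),
             proplists:get_value(memory, Info)}
            || T <- ets:all(),
               Info <- [ets:info(T)],
               Info =/= undefined],
    lists:sublist(lists:reverse(lists:keysort(3, Rows)), N).
```

The memory column is in words, as in the ets:i() output, so multiply by erlang:system_info(wordsize) to get bytes.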
16
Erlang interactive shell
lpgauth # /etc/init.d/rtb-gateway shell
Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
Eshell V6.3 (abort with ^G)
1> F = fun Loop() ->
1> rp(process_info(whereis(code_server), message_queue_len)),
1> timer:sleep(1000),
1> Loop()
1> end.
#Fun<erl_eval.44.90072148>
2> F().
{message_queue_len,0}
{message_queue_len,2}
{message_queue_len,0}
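The loop above watches one known process. When it is unclear which process is falling behind, a scan over all processes can help. This is a sketch under that assumption, using a hypothetical module mq_top. The recon library listed later in the deck offers a similar recon:proc_count/2.

```erlang
-module(mq_top).
-export([top/1]).

%% Return the N processes with the longest message queues as
%% {Pid, RegisteredNameOrUndefined, QueueLen}, longest first.
%% Processes that exit during the scan are skipped.
top(N) ->
    Rows = [{P, name(P), Len}
            || P <- erlang:processes(),
               {message_queue_len, Len} <- [process_info(P, message_queue_len)]],
    lists:sublist(lists:reverse(lists:keysort(3, Rows)), N).

name(P) ->
    case process_info(P, registered_name) of
        {registered_name, Name} -> Name;
        _ -> undefined  % [] when unregistered, undefined when the process died
    end.
```

Scanning every process costs time on nodes with very large process counts, so it is better suited to an occasional check than to a tight loop.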
17
Loggers
1. io:format/2
2. error_logger + SASL
3. Lager
Don’t forget system logs:
/var/log/messages,
/var/log/external_service.
Good for: logic and typing bugs
18
Metric collectors
1. vmstats + statsderl + statsd
2. folsom / statsman / exometer
3. collectd
4. carbon + graphite
5. many others!
Good for: resources and performance bugs
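Before a full metrics pipeline is in place, a few VM counters can be sampled by hand from the shell. This is a small sketch, assuming a hypothetical module vm_snapshot. The selection of counters is illustrative and is not the set that vmstats reports.

```erlang
-module(vm_snapshot).
-export([collect/0]).

%% One point-in-time sample of a few VM-level counters.
%% Memory values are in bytes.
collect() ->
    [{memory_total, erlang:memory(total)},
     {memory_processes, erlang:memory(processes)},
     {memory_ets, erlang:memory(ets)},
     {memory_binary, erlang:memory(binary)},
     {process_count, erlang:system_info(process_count)},
     {run_queue, erlang:statistics(run_queue)}].
```

Taking two samples a few seconds apart and comparing them is often enough to see whether memory or the run queue is growing.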
19
Debuggers
1. erlang:trace/3
2. dbg
3. redbug
4. SystemTap / DTrace / LTTng
Good for: logic and typing bugs
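To make the first item concrete, here is a minimal sketch of call tracing with erlang:trace/3 and erlang:trace_pattern/3. The module name trace_sketch and its calls/3 function are hypothetical, not from the talk. It traces a throwaway worker process rather than a live one, which keeps the example safe to run anywhere.

```erlang
-module(trace_sketch).
-export([calls/3]).

%% Run Fun in a fresh process and return the argument lists of every
%% external call it makes to M:F/Arity, in call order.
calls({M, F, Arity}, Fun, Timeout) ->
    Parent = self(),
    Worker = spawn(fun() ->
                       receive go -> ok end,
                       Fun(),
                       Parent ! {done, self()},
                       receive stop -> ok end
                   end),
    %% Enable global call tracing for the function, then trace the worker.
    erlang:trace_pattern({M, F, Arity}, true, []),
    erlang:trace(Worker, true, [call]),
    Worker ! go,
    receive {done, Worker} -> ok after Timeout -> ok end,
    %% Wait until every pending trace message has reached this process.
    Ref = erlang:trace_delivered(Worker),
    receive {trace_delivered, Worker, Ref} -> ok end,
    erlang:trace_pattern({M, F, Arity}, false, []),
    exit(Worker, kill),
    collect(Worker, []).

collect(Worker, Acc) ->
    receive
        {trace, Worker, call, {_M, _F, Args}} -> collect(Worker, [Args | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.
```

On a production node, dbg and redbug wrap the same primitives and add safety limits on message count and duration, which is usually the better choice there.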
20
Program dumps
1. erl_crash.dump
• observer
• recon / script
• cat / grep / strings
2. core dumps
Good for: resources bugs
21
Profilers
1. fprof / eprof
2. eflame
3. SystemTap / DTrace / LTTng / perf
4. many others!
Good for: performance bugs
22
System utilities
Good for: resources and performance bugs
Some of my favourite ones:
1. top / htop
2. ngrep / netstat / tcpdump
3. strace
4. iotop / lsof
23
System utilities
24
Static analyzers
1. Dialyzer
Good for: typing bugs
25
Tools for the bug
1. Logic: debuggers, loggers, shell
2. Typing: debuggers, shell, static analyzers
3. Resources: metric collectors, program dumps,
shell, system utilities
4. Performance: metric collectors, shell, system
utilities
26
If a request crashes and no one is around to
monitor it, does it trigger an alert?
27
Example #1
Where do we start?
28
Example #1
1. Are the numbers in the database good?
No => Not a web application bug.
2. Are the numbers in the logs good?
No => Not an ETL bug.
3. Are there other services on the same box
affected by missing log events?
No => Probably not a filesystem bug.
29
Example #1
1> Tid = rtb_gateway_counter:table_id(bid_metrics_counters).
2> ets:tab2list(Tid).
[{{6,4,undefined},0,0,6,6},{{6,3,2},6,6,12,12}]
Is the data being aggregated? Yes.
Bug is most likely in the function that serializes the
ETS table to JSON.
30
Example #1
Let’s trace!
1> redbug:start("rtb_gateway_counter:map_counters(bid_metrics_counters, '_', '_', '_')").
09:07:20 <dead> {rtb_gateway_counter,map_counters, [bid_metrics_counters,
[{{6,4,undefined},0,0,6,6},{{6,3,2},7,7,14,14}],
<0.215.0>, …]}
09:08:20 <dead> {rtb_gateway_counter,map_counters, [bid_metrics_counters,
[{{6,4,undefined},0,0,12,12},{{6,3,2},12,12,24,24}],
<0.215.0>, …]}
Function should be calling itself recursively…
31
32
Example #2
2014-07-31 19:50:39.915 [error] emulator Error in process <0.23676.390> on node 'rtb-gateway'
with exit value:
{function_clause,[{cowboy_protocol,parse_uri_path,[<<0 bytes>>,
{state,#Port<0.13093005>,ranch_tcp,[cowboy_router,cowboy_handler],false,
[{listener,rtbgw_lb},{dispatch,[{'_',[],[{[<<13 bytes>>,exchange],
[],rtb_gateway_notification_handler,[ewr,<<4 bytes>>]},{[<<5 bytes>>,exchange...
Where do we start?
33
Example #2
Bug is non-trivial to reproduce so let’s start by
collecting data…
1. Google the error in case it’s a known bug. Not
really…
2. Add extra logging in ranch to print out the
request state when the bug occurs.
3. Use ngrep to validate that some HTTP requests
are malformed (e.g. improper content-size).
34
Example #2
1. Capture TCP streams of failing requests using
tcpdump.
2. Build tool to replay TCP dump (httpreplay).
3. Replay traffic capture…
4. Can reproduce… but not deterministically.
5. Swallowed my pride and passed the baton to a
teammate.
35
Example #2
1. Teammate tried tracing the problem… no dice.
2. Teammate took a step back…
36
Take a step back
Teammate has a eureka moment while driving…
The socket in the cowboy req record is mutable!
p.s. Thanks, Jeremie! :)
37
38
Example #3
1. Receive a “service trouble” email from Dynect
Concierge (DNS)…
2. Receive a “DOWN/PROBLEM” email from Nagios…
3. SSH to the server to find out beam is writing an
erl_crash.dump…
Where do we start?
39
Example #3
While we’re waiting for the VM to finish writing the
erl_crash.dump, let’s check graphite.
40
Example #3
lpgauth # ./erl_crashdump_analyzer.sh /erl_crash.dump
analyzing /erl_crash.dump, generated on: Mon Feb 16 10:57:22 2015
Slogan: eheap_alloc: Cannot allocate 71302344 bytes of memory (of type "heap", thread 4).
…
Different message queue lengths (5 largest different):
===
1 1357844
7 2
22 1
10180 0
…
cat /erl_crash.dump | grep -10 1357844
41
Example #3
loop(State) ->
    receive Msg ->
        {ok, State2} = handle_msg(Msg, State),
        loop(State2)
    end.

handle_msg({call, Ref, From, Msg}, State) ->
    gen_tcp:send(Socket, Packet)
    …
handle_msg({tcp, Socket, Data}, State) ->
    decode_data(Data2, State)
    …
What happens if gen_tcp:send/2 blocks?
42
Example #3
1. Validate that gen_tcp can actually block.
2. Mitigate by using gen_tcp option send_timeout.
3. Fix the problem by adding back-pressure (the
joys of unbounded queues!).
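The mitigation in step 2 amounts to two socket options. This is a minimal sketch, assuming a hypothetical wrapper module send_timeout_sketch. The timeout value is the caller's choice and the talk does not say which value was used.

```erlang
-module(send_timeout_sketch).
-export([connect/3]).

%% Connect with a bounded send. With send_timeout set, gen_tcp:send/2
%% returns {error, timeout} instead of blocking indefinitely when the
%% peer stops reading. send_timeout_close closes the socket when that
%% happens, since a partially sent packet leaves the stream unusable.
connect(Host, Port, TimeoutMs) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {active, false},
                     {send_timeout, TimeoutMs},
                     {send_timeout_close, true}]).
```

This only bounds how long the owning process stays stuck. The mailbox can still grow while sends are slow, which is why step 3 adds back-pressure.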
43
Example #3
check(Tid, MaxBacklog) ->
    case increment(Tid, MaxBacklog) of
        [MaxBacklog, MaxBacklog] ->
            false;
        [_, Value] when Value =< MaxBacklog ->
            true;
        {error, tid_missing} ->
            false
    end.

decrement(Tid) ->
    safe_update_counter(Tid, {2, -1, 0, 0}).

increment(Tid, MaxBacklog) ->
    safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).
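The slide does not show safe_update_counter or how the table is created. The sketch below fills those in so the counter can be tried in a shell. The new/0 function, the backlog key and the body of safe_update_counter are assumptions, not the author's code. The update operations themselves are copied from the slide: the first reads the current value, the second increments it and clamps it at MaxBacklog.

```erlang
-module(backlog_sketch).
-export([new/0, check/2, decrement/1]).

-define(KEY, backlog).  % assumed key name

%% Create a table holding a single {backlog, 0} counter row.
new() ->
    Tid = ets:new(?MODULE, [set, public]),
    true = ets:insert(Tid, {?KEY, 0}),
    Tid.

%% true if a slot was reserved, false if the backlog is already full
%% or the table no longer exists.
check(Tid, MaxBacklog) ->
    case increment(Tid, MaxBacklog) of
        [MaxBacklog, MaxBacklog] -> false;
        [_, Value] when Value =< MaxBacklog -> true;
        {error, tid_missing} -> false
    end.

%% Release a slot, never going below zero.
decrement(Tid) ->
    safe_update_counter(Tid, {2, -1, 0, 0}).

%% Returns [ValueBefore, ValueAfter], with ValueAfter capped at MaxBacklog.
increment(Tid, MaxBacklog) ->
    safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).

%% ets:update_counter/3 raises badarg when the table or key is missing.
safe_update_counter(Tid, Ops) ->
    try ets:update_counter(Tid, ?KEY, Ops)
    catch error:badarg -> {error, tid_missing}
    end.
```

A caller would invoke check/2 before sending a request to the server and decrement/1 once the request has been handled, so the mailbox length stays bounded by MaxBacklog.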
44
Example #3
1> F = fun Loop() ->
1> [{backlog, Backlog}] = ets:tab2list(anchor_backlog),
1> {message_queue_len, Messages} = process_info(whereis(anchor_server), message_queue_len),
1> io:format("~p: ~p ~p~n", [time(), Backlog, Messages]),
1> timer:sleep(1000),
1> Loop()
1> end.
#Fun<erl_eval.44.90072148>
2> F().
{17,4,31}: 5 1
{17,4,32}: 1 0
{17,4,33}: 6 0
45
Tips
%% monitoring
{lager, "2.0.0", {git, "http://github.com/basho/lager.git", {tag, "2.0.0"}}},
{lager_logstash, "0.1.3", {git, "https://github.com/rpt/lager_logstash.git", {tag, "0.1.3"}}},
{riak_sysmon, "1.1.3", {git, "http://github.com/lpgauth/riak_sysmon", {branch, "adgear"}}},
{statsderl, "0.3.4", {git, "http://github.com/lpgauth/statsderl.git", {branch, "adgear"}}},
{vmstats, "0.3.4", {git, "http://github.com/lpgauth/vmstats.git", {branch, "adgear"}}},
%% debuggers
{eflame, ".*", {git, "http://github.com/lpgauth/eflame.git", {branch, "adgear"}}},
{eper, ".*", {git, "http://github.com/massemanet/eper.git", {branch, "master"}}},
{recon, ".*", {git, "http://github.com/ferd/recon.git", {branch, "master"}}},
{timing, ".*", {git, "http://github.com/lpgauth/timing.git", {branch, "master"}}}
Have your debugging tools ready on production
nodes!
46
Tips
Take the time to build your own tools!
1. If you find yourself repeating the same commands
often, write a script!
2. Add debugging functions to your modules:
• ETS table tid accessor
• gen_server state accessor
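The two accessors in item 2 could look like the following. This is a minimal sketch with a hypothetical gen_server named debug_accessors. The table name and state record are illustrative only.

```erlang
-module(debug_accessors).
-behaviour(gen_server).

-export([start_link/0, table_id/0, state/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(state, {tid}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Debugging helper: tid of the unnamed ETS table owned by the server.
table_id() ->
    gen_server:call(?MODULE, table_id).

%% Debugging helper: the server's current state record.
state() ->
    gen_server:call(?MODULE, state).

init([]) ->
    {ok, #state{tid = ets:new(counters, [set, public])}}.

handle_call(table_id, _From, #state{tid = Tid} = State) ->
    {reply, Tid, State};
handle_call(state, _From, State) ->
    {reply, State, State};
handle_call(_Request, _From, State) ->
    {reply, {error, unknown_call}, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

With the tid in hand, ets:tab2list(debug_accessors:table_id()) works from a remote shell even though the table is unnamed. For servers that lack such an accessor, sys:get_state/1 returns the state of any process that follows the OTP behaviours.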
47
Thank you!
github: lpgauth
twitter: lpgauth
dilbert.com
48

Contenu connexe

Tendances

Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)Tim Bunce
 
PL/Perl - New Features in PostgreSQL 9.0 201012
PL/Perl - New Features in PostgreSQL 9.0 201012PL/Perl - New Features in PostgreSQL 9.0 201012
PL/Perl - New Features in PostgreSQL 9.0 201012Tim Bunce
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyTim Bunce
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
[231] the simplicity of cluster apps with circuit
[231] the simplicity of cluster apps with circuit[231] the simplicity of cluster apps with circuit
[231] the simplicity of cluster apps with circuitNAVER D2
 
Perl Memory Use 201209
Perl Memory Use 201209Perl Memory Use 201209
Perl Memory Use 201209Tim Bunce
 
Odoo Online platform: architecture and challenges
Odoo Online platform: architecture and challengesOdoo Online platform: architecture and challenges
Odoo Online platform: architecture and challengesOdoo
 
Testing Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsTesting Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsAll Things Open
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesMichael Klishin
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
Devel::NYTProf v5 at YAPC::NA 201406
Devel::NYTProf v5 at YAPC::NA 201406Devel::NYTProf v5 at YAPC::NA 201406
Devel::NYTProf v5 at YAPC::NA 201406Tim Bunce
 
Profiling with Devel::NYTProf
Profiling with Devel::NYTProfProfiling with Devel::NYTProf
Profiling with Devel::NYTProfbobcatfish
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time ComputationSonal Raj
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014Amazon Web Services
 

Tendances (20)

Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)Devel::NYTProf 2009-07 (OUTDATED, see 201008)
Devel::NYTProf 2009-07 (OUTDATED, see 201008)
 
PL/Perl - New Features in PostgreSQL 9.0 201012
PL/Perl - New Features in PostgreSQL 9.0 201012PL/Perl - New Features in PostgreSQL 9.0 201012
PL/Perl - New Features in PostgreSQL 9.0 201012
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.key
 
Storm Anatomy
Storm AnatomyStorm Anatomy
Storm Anatomy
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
[231] the simplicity of cluster apps with circuit
[231] the simplicity of cluster apps with circuit[231] the simplicity of cluster apps with circuit
[231] the simplicity of cluster apps with circuit
 
Perl Memory Use 201209
Perl Memory Use 201209Perl Memory Use 201209
Perl Memory Use 201209
 
Odoo Online platform: architecture and challenges
Odoo Online platform: architecture and challengesOdoo Online platform: architecture and challenges
Odoo Online platform: architecture and challenges
 
Testing Wi-Fi with OSS Tools
Testing Wi-Fi with OSS ToolsTesting Wi-Fi with OSS Tools
Testing Wi-Fi with OSS Tools
 
Troubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issuesTroubleshooting common oslo.messaging and RabbitMQ issues
Troubleshooting common oslo.messaging and RabbitMQ issues
 
Storm
StormStorm
Storm
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Devel::NYTProf v5 at YAPC::NA 201406
Devel::NYTProf v5 at YAPC::NA 201406Devel::NYTProf v5 at YAPC::NA 201406
Devel::NYTProf v5 at YAPC::NA 201406
 
Profiling with Devel::NYTProf
Profiling with Devel::NYTProfProfiling with Devel::NYTProf
Profiling with Devel::NYTProf
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
 
Storm
StormStorm
Storm
 
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
(WEB401) Optimizing Your Web Server on AWS | AWS re:Invent 2014
 

En vedette

Staying Afloat with Buoy: A High-Performance HTTP Client
Staying Afloat with Buoy: A High-Performance HTTP ClientStaying Afloat with Buoy: A High-Performance HTTP Client
Staying Afloat with Buoy: A High-Performance HTTP Clientlpgauth
 
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...Lightbend
 
Digital literacies: setting the scene
Digital literacies: setting the sceneDigital literacies: setting the scene
Digital literacies: setting the sceneLis Parcell
 
re RainbowTwtr - 構造化テキストの安全なエスケープ手法について
re RainbowTwtr - 構造化テキストの安全なエスケープ手法についてre RainbowTwtr - 構造化テキストの安全なエスケープ手法について
re RainbowTwtr - 構造化テキストの安全なエスケープ手法についてKazuho Oku
 
A Long Walk to Water: Lesson9 unit2
A Long Walk to Water: Lesson9 unit2A Long Walk to Water: Lesson9 unit2
A Long Walk to Water: Lesson9 unit2Terri Weiss
 
A Long Walk to Water: Lesson1 unit2
A Long Walk to Water: Lesson1 unit2A Long Walk to Water: Lesson1 unit2
A Long Walk to Water: Lesson1 unit2Terri Weiss
 
Hoe ziet de toekomst van Learning Analytics er uit?
Hoe ziet de toekomst van Learning Analytics er uit?Hoe ziet de toekomst van Learning Analytics er uit?
Hoe ziet de toekomst van Learning Analytics er uit?Hendrik Drachsler
 
Donde esta el Dios de justicia
Donde esta el Dios de justiciaDonde esta el Dios de justicia
Donde esta el Dios de justiciaPaulo Arieu
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleKazuho Oku
 
Using Q4M - a message queue for MySQL #osdc.tw
Using Q4M - a message queue for MySQL #osdc.twUsing Q4M - a message queue for MySQL #osdc.tw
Using Q4M - a message queue for MySQL #osdc.twKazuho Oku
 
Groovy every day
Groovy every dayGroovy every day
Groovy every dayPaul Woods
 
Cytoscape プロジェクト現状報告 2011年2月
Cytoscape プロジェクト現状報告 2011年2月Cytoscape プロジェクト現状報告 2011年2月
Cytoscape プロジェクト現状報告 2011年2月Keiichiro Ono
 
Orientamenti di social media marketing
Orientamenti di social media marketingOrientamenti di social media marketing
Orientamenti di social media marketingCommunication Village
 
Little Ones Learning Math Using Technology
Little Ones Learning Math Using TechnologyLittle Ones Learning Math Using Technology
Little Ones Learning Math Using TechnologyJennifer Orr
 
Online Business Trends beyond 2010
Online Business Trends beyond 2010Online Business Trends beyond 2010
Online Business Trends beyond 2010Jesper Åström
 
Workplace etiquettes
Workplace etiquettesWorkplace etiquettes
Workplace etiquettesSIVA GOPAL
 

En vedette (20)

Staying Afloat with Buoy: A High-Performance HTTP Client
Staying Afloat with Buoy: A High-Performance HTTP ClientStaying Afloat with Buoy: A High-Performance HTTP Client
Staying Afloat with Buoy: A High-Performance HTTP Client
 
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...
Benefits Of The Actor Model For Cloud Computing: A Pragmatic Overview For Jav...
 
Deep parking
Deep parkingDeep parking
Deep parking
 
Digital literacies: setting the scene
Digital literacies: setting the sceneDigital literacies: setting the scene
Digital literacies: setting the scene
 
re RainbowTwtr - 構造化テキストの安全なエスケープ手法について
re RainbowTwtr - 構造化テキストの安全なエスケープ手法についてre RainbowTwtr - 構造化テキストの安全なエスケープ手法について
re RainbowTwtr - 構造化テキストの安全なエスケープ手法について
 
A Long Walk to Water: Lesson9 unit2
A Long Walk to Water: Lesson9 unit2A Long Walk to Water: Lesson9 unit2
A Long Walk to Water: Lesson9 unit2
 
A Long Walk to Water: Lesson1 unit2
A Long Walk to Water: Lesson1 unit2A Long Walk to Water: Lesson1 unit2
A Long Walk to Water: Lesson1 unit2
 
Spasovo
SpasovoSpasovo
Spasovo
 
Hoe ziet de toekomst van Learning Analytics er uit?
Hoe ziet de toekomst van Learning Analytics er uit?Hoe ziet de toekomst van Learning Analytics er uit?
Hoe ziet de toekomst van Learning Analytics er uit?
 
Donde esta el Dios de justicia
Donde esta el Dios de justiciaDonde esta el Dios de justicia
Donde esta el Dios de justicia
 
HTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS moduleHTTP::Parser::XS - writing a fast & secure XS module
HTTP::Parser::XS - writing a fast & secure XS module
 
Using Q4M - a message queue for MySQL #osdc.tw
Using Q4M - a message queue for MySQL #osdc.twUsing Q4M - a message queue for MySQL #osdc.tw
Using Q4M - a message queue for MySQL #osdc.tw
 
Groovy every day
Groovy every dayGroovy every day
Groovy every day
 
Cytoscape プロジェクト現状報告 2011年2月
Cytoscape プロジェクト現状報告 2011年2月Cytoscape プロジェクト現状報告 2011年2月
Cytoscape プロジェクト現状報告 2011年2月
 
Orientamenti di social media marketing
Orientamenti di social media marketingOrientamenti di social media marketing
Orientamenti di social media marketing
 
Unit 2.7 Images
Unit 2.7 ImagesUnit 2.7 Images
Unit 2.7 Images
 
Little Ones Learning Math Using Technology
Little Ones Learning Math Using TechnologyLittle Ones Learning Math Using Technology
Little Ones Learning Math Using Technology
 
The Roots of Innovation
The Roots of InnovationThe Roots of Innovation
The Roots of Innovation
 
Online Business Trends beyond 2010
Online Business Trends beyond 2010Online Business Trends beyond 2010
Online Business Trends beyond 2010
 
Workplace etiquettes
Workplace etiquettesWorkplace etiquettes
Workplace etiquettes
 

Similaire à Debugging Complex Systems - Erlang Factory SF 2015

Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2ice799
 
Lab 1 reference manual
Lab 1 reference manualLab 1 reference manual
Lab 1 reference manualtrayyoo
 
Debugging multiplayer games
Debugging multiplayer gamesDebugging multiplayer games
Debugging multiplayer gamesMaciej Siniło
 
6 ways to hack your JavaScript application by Viktor Turskyi
6 ways to hack your JavaScript application by Viktor Turskyi   6 ways to hack your JavaScript application by Viktor Turskyi
6 ways to hack your JavaScript application by Viktor Turskyi OdessaJS Conf
 
Static Code Analysis PHP[tek] 2023
Static Code Analysis PHP[tek] 2023Static Code Analysis PHP[tek] 2023
Static Code Analysis PHP[tek] 2023Scott Keck-Warren
 
Let's write a Debugger!
Let's write a Debugger!Let's write a Debugger!
Let's write a Debugger!Levente Kurusa
 
Building Hermetic Systems (without Docker)
Building Hermetic Systems (without Docker)Building Hermetic Systems (without Docker)
Building Hermetic Systems (without Docker)William Farrell
 
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013midnite_runr
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010Clay Helberg
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Debugging With Id
Debugging With IdDebugging With Id
Debugging With Idguest215c4e
 
Introduzione allo Unit Testing
Introduzione allo Unit TestingIntroduzione allo Unit Testing
Introduzione allo Unit TestingStefano Ottaviani
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksDatabricks
 
Joxean Koret - Database Security Paradise [Rooted CON 2011]
Joxean Koret - Database Security Paradise [Rooted CON 2011]Joxean Koret - Database Security Paradise [Rooted CON 2011]
Joxean Koret - Database Security Paradise [Rooted CON 2011]RootedCON
 
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложение
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложениеJS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложение
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложениеJSFestUA
 
Penetrating Windows 8 with syringe utility
Penetrating Windows 8 with syringe utilityPenetrating Windows 8 with syringe utility
Penetrating Windows 8 with syringe utilityIOSR Journals
 
Interview questions
Interview questionsInterview questions
Interview questionsxavier john
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversSatpal Parmar
 
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Jagadisha Maiya
 

Similaire à Debugging Complex Systems - Erlang Factory SF 2015 (20)

Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2Infrastructure as code might be literally impossible part 2
Infrastructure as code might be literally impossible part 2
 
Lab 1 reference manual
Lab 1 reference manualLab 1 reference manual
Lab 1 reference manual
 
Debugging multiplayer games
Debugging multiplayer gamesDebugging multiplayer games
Debugging multiplayer games
 
6 ways to hack your JavaScript application by Viktor Turskyi
6 ways to hack your JavaScript application by Viktor Turskyi   6 ways to hack your JavaScript application by Viktor Turskyi
6 ways to hack your JavaScript application by Viktor Turskyi
 
Static Code Analysis PHP[tek] 2023
Static Code Analysis PHP[tek] 2023Static Code Analysis PHP[tek] 2023
Static Code Analysis PHP[tek] 2023
 
Let's write a Debugger!
Let's write a Debugger!Let's write a Debugger!
Let's write a Debugger!
 
Building Hermetic Systems (without Docker)
Building Hermetic Systems (without Docker)Building Hermetic Systems (without Docker)
Building Hermetic Systems (without Docker)
 
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
 
Chelberg ptcuser 2010
Chelberg ptcuser 2010Chelberg ptcuser 2010
Chelberg ptcuser 2010
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Debugging With Id
Debugging With IdDebugging With Id
Debugging With Id
 
Introduzione allo Unit Testing
Introduzione allo Unit TestingIntroduzione allo Unit Testing
Introduzione allo Unit Testing
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and Databricks
 
Joxean Koret - Database Security Paradise [Rooted CON 2011]
Joxean Koret - Database Security Paradise [Rooted CON 2011]Joxean Koret - Database Security Paradise [Rooted CON 2011]
Joxean Koret - Database Security Paradise [Rooted CON 2011]
 
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложение
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложениеJS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложение
JS Fest 2019. Виктор Турский. 6 способов взломать твое JavaScript приложение
 
Penetrating Windows 8 with syringe utility
Penetrating Windows 8 with syringe utilityPenetrating Windows 8 with syringe utility
Penetrating Windows 8 with syringe utility
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 
Interview questions
Interview questionsInterview questions
Interview questions
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device Drivers
 
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Debugging Complex Systems - Erlang Factory SF 2015

  • 1. Debugging Complex Systems. Louis-Philippe Gauthier, Director, Product Engineering
  • 3. Let’s fix the process. Systematic debugging!
  • 4. Understand the system
      1. Be familiar with your stack: OS, VM, application, protocols, external services, etc.
      2. Know your tools:
         • Take time to experiment with new tools.
         • Match the tool to the bug.
      3. What are/were the requirements?
      4. Is it really a bug?
  • 5. Reproduce the bug
      1. What are the conditions that trigger the bug:
         • function input?
         • invalid state?
         • environment variable?
         • OS settings?
      2. Try reproducing locally.
      3. Try reproducing in production.
      4. Reproducibility will greatly simplify the debugging process and the validation of the fix!
  • 6. Collect data
      1. Search on Google, it might be a known bug!
      2. Don’t jump to conclusions, use observations to guide your intuition (you’re always wrong!).
      3. Use debugging tools to get more insights.
      4. Filter out the noise (especially important for performance bugs!!!).
      5. Too much is like not enough.
  • 7. Use process of elimination
      1. Divide and conquer!
      2. Start with macro observations:
         • Are all servers affected?
         • Are all datacenters affected?
         • Is there an external service involved?
      3. From there, narrow down your search using data!
      4. Watch out, one bug might be hiding another one.
  • 8. Change one thing at a time
      1. Don’t take any shortcuts; if you change too many variables, you won’t be able to correlate the results.
      2. Be smart; pick the one change that cuts the search space the most.
      3. If you want to test different theories in parallel, create a branch for each change and deploy on different nodes.
  • 9. Keep an audit trail
      1. Don’t trust your memory, you might get some facts wrong.
      2. Can help if you’re debugging a similar problem in the future.
      3. Useful for post-mortems!
      4. Allows you to collaborate with co-workers (e.g. via Google Docs).
      5. Can be used to coach teammates.
  • 10. Verify your assumptions
      1. Check the basics:
         • What code is deployed?
         • Same VM version?
         • Same application config?
         • Same kernel version?
         • Same system config?
      2. Is the tool lying? Validate your tools!
      3. Backtrack and go over your audit trail, you might have missed something!
  • 11. Take a step back
      1. Step on your ego and ask a co-worker for help!
      2. Ask an expert.
      3. Sleep on it; your thoughts will be clearer in the morning.
  • 12. Validate your fix
      1. Is it a side effect? Heisenbug?
      2. Did you really fix the root cause or just work around it?
      3. Validate in production:
         1. start with one node per datacenter
         2. slowly roll out the fix and monitor
      4. Add regression tests if you can.
      5. …
      6. Go back to bed Zzzz.
  • 13. Systematic debugging rules
      1. Understand the system
      2. Reproduce the bug
      3. Collect data
      4. Use process of elimination
      5. Change only one thing at a time
      6. Keep an audit trail
      7. Verify your assumptions
      8. Take a step back
      9. Validate your fix
  • 14. Tools of the trade
  • 15. Erlang interactive shell
      lpgauth # erl -remsh rtb-gateway -name shell -setcookie monster
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]

      Eshell V6.3 (abort with ^G)
      1> Version = erlang:system_info(otp_release).
      "17"
      2> b().
      Version = "17"
      ok
      3> f(Version).
      ok
      4> b().
      ok
  • 16. Erlang interactive shell
      lpgauth # /etc/init.d/rtb-delivery shell
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]

      Eshell V6.3 (abort with ^G)
      1> ets:i().
       id              name            type  size  mem   owner
       ----------------------------------------------------------------------------
       …
       swirl_flows     swirl_flows     set   15    1611  swirl_tracker
       swirl_mappers   swirl_mappers   set   15    481   swirl_tracker
       swirl_reducers  swirl_reducers  set   0     316   swirl_tracker
       swirl_streams   swirl_streams   set   15    908   swirl_tracker
      2> ets:i(swirl_streams).
      3> ets:tab2list(swirl_streams).
  • 17. Erlang interactive shell
      lpgauth # /etc/init.d/rtb-gateway shell
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]

      Eshell V6.3 (abort with ^G)
      1> F = fun Loop() ->
      1>     rp(process_info(whereis(code_server), message_queue_len)),
      1>     timer:sleep(1000),
      1>     Loop()
      1> end.
      #Fun<erl_eval.44.90072148>
      2> F().
      {message_queue_len,0}
      {message_queue_len,2}
      {message_queue_len,0}
  • 18. Loggers
      1. io:format/2
      2. error_logger + SASL
      3. Lager
      Don’t forget system logs: /var/log/messages, /var/log/external_service.
      Good for: logic and typing bugs
  • 19. Metric collectors
      1. vmstats + statsderl + statsd
      2. folsom / statsman / exometer
      3. collectd
      4. carbon + graphite
      5. many others!
      Good for: resource and performance bugs
  • 20. Debuggers
      1. erlang:trace/3
      2. dbg
      3. redbug
      4. SystemTap / dtrace / lttng
      Good for: logic and typing bugs
  • 21. Program dumps
      1. erl_crash.dump
         • observer
         • recon / script
         • cat / grep / strings
      2. core dumps
      Good for: resource bugs
  • 22. Profilers
      1. fprof / eprof
      2. eflame
      3. SystemTap / dtrace / lttng / perf
      4. many others!
      Good for: performance bugs
  • 23. System utilities
      Good for: resource and performance bugs
      Some of my favourite ones:
      1. top / htop
      2. ngrep / netstat / tcpdump
      3. strace
      4. iotop / lsof
  • 25. Static analyzers
      1. Dialyzer
      Good for: typing bugs
  • 26. Tools for the bug
      1. Logic: debuggers, loggers, shell
      2. Typing: debuggers, shell, static analyzers
      3. Resources: metric collectors, program dumps, shell, system utilities
      4. Performance: metric collectors, shell, system utilities
  • 27. If a request crashes and no one is around to monitor it, does it trigger an alert?
  • 28. Example #1
      Where do we start?
  • 29. Example #1
      1. Are the numbers in the database good? No => Not a web application bug.
      2. Are the numbers in the logs good? No => Not an ETL bug.
      3. Are there other services on the same box affected by missing log events? No => Probably not a filesystem bug.
  • 30. Example #1
      1> Tid = rtb_gateway_counter:table_id(bid_metrics_counters).
      2> ets:tab2list(Tid).
      [{{6,4,undefined},0,0,6,6},{{6,3,2},6,6,12,12}]
      Is the data being aggregated? Yes.
      The bug is most likely in the function that serializes the ETS table to JSON.
  • 31. Example #1
      1> redbug:start("rtb_gateway_counter:map_counters(bid_metrics_counters, '_', '_', '_')").
      09:07:20 <dead> {rtb_gateway_counter,map_counters,
                       [bid_metrics_counters,
                        [{{6,4,undefined},0,0,6,6},{{6,3,2},7,7,14,14}],
                        <0.215.0>, …]}
      09:08:20 <dead> {rtb_gateway_counter,map_counters,
                       [bid_metrics_counters,
                        [{{6,4,undefined},0,0,12,12},{{6,3,2},12,12,24,24}],
                        <0.215.0>, …]}
      Let’s trace! The function should be calling itself recursively…
  • 33. Example #2
      2014-07-31 19:50:39.915 [error] emulator Error in process <0.23676.390> on node 'rtb-gateway' with exit value: {function_clause,[{cowboy_protocol,parse_uri_path,[<<0 bytes>>, {state,#Port<0.13093005>,ranch_tcp,[cowboy_router,cowboy_handler],false, [{listener,rtbgw_lb},{dispatch,[{'_',[],[{[<<13 bytes>>,exchange], [],rtb_gateway_notification_handler,[ewr,<<4 bytes>>]},{[<<5 bytes>>,exchange...
      Where do we start?
  • 34. Example #2
      The bug is non-trivial to reproduce, so let’s start by collecting data…
      1. Google the error in case it’s a known bug. Not really…
      2. Add extra logging in ranch to print out the request state when the bug occurs.
      3. Use ngrep to validate that some HTTP requests are malformed (e.g. improper content-size).
  • 35. Example #2
      1. Capture TCP streams of failing requests using tcpdump.
      2. Build tool to replay TCP dump (httpreplay).
      3. Replay traffic capture…
      4. Can reproduce… but not deterministically.
      5. Stepped on my ego and passed the flag to a teammate.
  • 36. Example #2
      1. Teammate tried tracing the problem… no dice.
      2. Teammate took a step back…
  • 37. Take a step back
      Teammate has a eureka moment while driving…
      The socket in the cowboy req record is mutable!
      p.s. Thanks, Jeremie! :)
  • 39. Example #3
      1. Receive a “service trouble” email from Dynect Concierge (DNS)…
      2. Receive a “DOWN/PROBLEM” email from Nagios…
      3. SSH to server to find out beam is dumping an erl_crash.dump…
      Where do we start?
  • 40. Example #3
      While we’re waiting for the VM to finish writing the erl_crash.dump, let’s check graphite.
  • 41. Example #3
      lpgauth # ./erl_crashdump_analyzer.sh /erl_crash.dump
      analyzing /erl_crash.dump, generated on: Mon Feb 16 10:57:22 2015
      Slogan: eheap_alloc: Cannot allocate 71302344 bytes of memory (of type "heap", thread 4).
      …
      Different message queue lengths (5 largest different):
      ===
          1 1357844
          7 2
         22 1
      10180 0
      …
      cat /erl_crash.dump | grep -10 1357844
  • 42. Example #3
      loop(State) ->
          receive Msg ->
              {ok, State2} = handle_msg(Msg, State),
              loop(State2)
          end.

      handle_msg({call, Ref, From, Msg}, State) ->
          gen_tcp:send(Socket, Packet)
          …
      handle_msg({tcp, Socket, Data}, State) ->
          decode_data(Data2, State)
          …

      What happens if gen_tcp:send/2 blocks?
  • 43. Example #3
      1. Validate that gen_tcp can actually block.
      2. Mitigate by using gen_tcp option send_timeout.
      3. Fix the problem by adding back-pressure (the joys of unbounded queues!).
  • 44. Example #3
      check(Tid, MaxBacklog) ->
          case increment(Tid, MaxBacklog) of
              [MaxBacklog, MaxBacklog] -> false;
              [_, Value] when Value =< MaxBacklog -> true;
              {error, tid_missing} -> false
          end.

      decrement(Tid) ->
          safe_update_counter(Tid, {2, -1, 0, 0}).

      increment(Tid, MaxBacklog) ->
          safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).
  • 45. Example #3
      1> F = fun Loop() ->
      1>     [{backlog, Backlog}] = ets:tab2list(anchor_backlog),
      1>     {message_queue_len, Messages} = process_info(whereis(anchor_server), message_queue_len),
      1>     io:format("~p: ~p ~p~n", [time(), Backlog, Messages]),
      1>     timer:sleep(1000),
      1>     Loop()
      1> end.
      #Fun<erl_eval.44.90072148>
      2> F().
      {17,4,31}: 5 1
      {17,4,32}: 1 0
      {17,4,33}: 6 0
  • 46. Tips
      %% monitoring
      {lager, "2.0.0", {git, "http://github.com/basho/lager.git", {tag, "2.0.0"}}},
      {lager_logstash, "0.1.3", {git, "https://github.com/rpt/lager_logstash.git", {tag, "0.1.3"}}},
      {riak_sysmon, "1.1.3", {git, "http://github.com/lpgauth/riak_sysmon", {branch, "adgear"}}},
      {statsderl, "0.3.4", {git, "http://github.com/lpgauth/statsderl.git", {branch, "adgear"}}},
      {vmstats, "0.3.4", {git, "http://github.com/lpgauth/vmstats.git", {branch, "adgear"}}},
      %% debuggers
      {eflame, ".*", {git, "http://github.com/lpgauth/eflame.git", {branch, "adgear"}}},
      {eper, ".*", {git, "http://github.com/massemanet/eper.git", {branch, "master"}}},
      {recon, ".*", {git, "http://github.com/ferd/recon.git", {branch, "master"}}},
      {timing, ".*", {git, "http://github.com/lpgauth/timing.git", {branch, "master"}}}
      Have your debugging tools ready on production nodes!
  • 47. Tips
      Take the time to build your own tools!
      1. If you find yourself repeating the same commands often, write a script!
      2. Add debugging functions to your modules:
         • ETS table tid accessor
         • gen_server state accessor
  • 48. Thank you!
      github: lpgauth
      twitter: lpgauth
      dilbert.com
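
The back-pressure counter shown in Example #3 calls a safe_update_counter helper that the slides never define. Below is a minimal sketch of how the pieces might fit together, not the author's actual implementation. The module name backlog_sketch, the new/0 constructor, and the body of safe_update_counter/2 are assumptions for illustration. The {backlog, Count} row layout is inferred from the shell session on the last Example #3 slide. The check, decrement and increment functions are copied from the slide.

```erlang
-module(backlog_sketch).
-export([new/0, check/2, decrement/1]).

%% Hypothetical constructor: a public ETS table holding a single
%% {backlog, Count} row, so the counter lives at tuple position 2.
new() ->
    Tid = ets:new(backlog, [public, set]),
    true = ets:insert(Tid, {backlog, 0}),
    Tid.

%% Returns true if the caller may enqueue one more request,
%% false if the backlog is already at MaxBacklog.
check(Tid, MaxBacklog) ->
    case increment(Tid, MaxBacklog) of
        [MaxBacklog, MaxBacklog] -> false;
        [_, Value] when Value =< MaxBacklog -> true;
        {error, tid_missing} -> false
    end.

%% Subtracts one, never going below zero.
decrement(Tid) ->
    safe_update_counter(Tid, {2, -1, 0, 0}).

%% First operation reads the current value (adds 0); the second adds one
%% and clamps the result to MaxBacklog.
increment(Tid, MaxBacklog) ->
    safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).

%% Assumed helper: ets:update_counter/3 raises badarg when the table or
%% the key is missing, which is mapped to the error tuple check/2 expects.
safe_update_counter(Tid, UpdateOp) ->
    try
        ets:update_counter(Tid, backlog, UpdateOp)
    catch
        error:badarg -> {error, tid_missing}
    end.
```

With a MaxBacklog of 2, the first two calls to check/2 return true and the third returns false, because the counter was already at the limit before the increment. Calling decrement/1 once the work completes frees a slot again. Both updates are single ets:update_counter/3 calls, so callers in different processes need no extra locking.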