Role of Information and technology in banking and finance .pptx
Os Schlossnagle Theo
1. Advanced
Production Troubleshooting
Embracing Systemic Analysis Techniques
Theo Schlossnagle
1
1
2. Production Troubleshooting
• What does production mean?
• What is troubleshooting?
• Why would we troubleshoot in production?
• Methods and Techniques.
2
2
3. “Production”
• Your business depends on it.
• Your livelihood depends on it.
• Others depend on you fixing it.
• It matters that you fix it and fix it now.
3
3
4. “Troubleshooting”
The @#$% site is ^%$# down!!!
• It’s 3am.
• You’re losing money.
• You’re responsible.
• Everyone wants it fixed now!
4
4
5. Choose no life. Choose no career. Choose no family. Choose a fucking
big computer, choose disk arrays the size of washing machines,
modem racks, CD-ROM writers, and electrical coffee makers. Choose
no sleep, high caffeine and mental insurance. Choose no friends.
Choose black jeans and matching combat boots. Choose chairs for
your office in a range of fucking fabrics. Choose SMTP and wondering
why the fuck you are logged on on a Sunday morning. Choose sitting
in that swivel chair looking at mind-numbing, spirit-crushing web
sites, stuffing fucking junk food into your mouth. Choose rotting away
at the end of it all, pishing your last in some miserable newsgroup,
nothing more than an embarrassment to the selfish, fucked up lusers
Gates spawned to replace the computer-literate.
Choose your future.
Choose to sysadmin.
5
5
6. The Scope of the Problem
• The problem scope doesn’t get any larger.
• Absolutely anything could be the culprit.
• Who “owns” the offending component is irrelevant.
• The problem owns you until you own the solution.
6
6
7. The Rules of Engagement
• There must be process.
• Diagnosis has weak process.
• Resolution has strong process.
7
7
8. Diagnosis
• If there was strong process:
• there would be no problem.
l’s :
a
um
en dr
ev n
• Lu
You need skills.
e;
on n
ca
C
”
ri
r
• ?
u
You need smarts. y
h rfl
e e
n bu
e
iv
• e
g
“
“Systems Thinking”
is
re
e
h
• w
95% art; 5% science
8
8
10. Glossary
• Problem (n.): the specific, unambiguous issues.
• Solution (n.): the exact process of fixing the problem.
• Victim (n.): the entity experiencing the problem.
• Witness (v.): the observe problem manifest.
• Offender (n.): the entity causing the problem.
10
10
11. Methods Techniques
Don’t make things worse.
• Instrumenting applications during troubleshooting is flawed.
• It wastes precious time.
• It is not purely observational and can influence the outcome.
• Diagnosis should be an observational effort.
• By interfering with the systems,
one can alter the problem without solving it.
11
11
12. Methods Techniques
How to observe without changing the problem.
• Systems (operating) tools are good.
• Most tools are application agnostic.
• Don’t lose sight of the problem.
• Let the problem drive your diagnosis, not purely your intuition.
• When in doubt, use brute force.
12
12
13. Methods Techniques
Things I can remember during a crisis
• I started out as an SA; I like systems tool.
• I don’t want to muck with code to find a problem.
• There are a lot of observational analysis tools:
• Network: snoop, tcpdump, ethereal
• System call tracers: strace, ktrace, truss
• Stack introspection: gdb, gstack, pstack, mdb, dbx
• Dynamic tracers: DTrace
13
13
14. Before we Embark. Protect.
• Troubleshooting sometimes involves hacking.
• Changes to production code and/or configuration are always risky.
• Not understanding what you have changed afterwards is
irresponsible.
14
14
15. Version Control Systems
Not good, not cool, not neat. Fundamental necessity.
• Keep track of the changes you make.
• Understand what is running production:
• now
• yesterday
• last week...
15
15
16. Example One
Web Application Hangs...
sometimes.
not other times.
0 to 60 second delays other times.
16
16
18. Where does one begin?
• Where to start looking?
• This should be influenced by the tools you know best.
• Speed is of the essence.
• Repeat the problem first.
• There are a lot of variables, fixate one by becoming the victim.
• Tight, simple, repeatable tests are best.
18
18
19. Architecture
•
Router
It’s where “closest to the user”
Sweet Spot “usable tools” meet. For me at least.
GigE switch
Cisco CS150 Cisco CS150
• But with eight web servers which do
we choose?
GigE switch
www-0-1 www-0-5
www-0-2 www-0-6
• Randomly...
www-0-3 www-0-7
www-0-4 www-0-8
GigE switch
• Perhaps we can do better?
MySQL
19
21. Understanding Evidence
• There is something wrong with www-0-7
• It is different from the rest.
• Right?
www-0-7
• Not necessarily. It is problematic because:
• It is different from the rest, and...
• it is different from itself as of last week. www-0-8
21
21
22. Simple System Tracing
; ps auxww | grep httpd
nobody 417 0.0 0.0 26904 344 ? S 2005 149:34 /opt/apache_1.3.33/bin/httpd -DSSL
root 416 0.0 0.0 26952 120 ? S 2005 5:53 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 19436 0.0 0.5 33212 6136 ? S 16:40 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 20416 0.0 0.6 33292 6540 ? S 17:30 0:01 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 20494 0.0 0.5 32572 5512 ? S 17:34 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 20500 0.0 0.6 33312 6616 ? S 17:35 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 20501 0.0 0.6 33224 6304 ? S 17:35 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 23718 0.0 0.6 33068 6592 ? S 20:23 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 23729 0.0 0.2 29396 2468 ? S 20:24 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 24646 0.0 0.7 32832 7792 ? S 21:13 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 25407 0.0 0.3 29368 3800 ? S 21:54 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 25735 0.0 0.3 29260 3424 ? S 22:11 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
nobody 26062 0.0 0.4 29356 4596 ? S 22:29 0:00 /opt/apache_1.3.33/bin/httpd -DSSL
Which process?
Hopefully they all exhibit signs of the problem.
22
24. Identifying the culprit
; lsof -p 20500
...
httpd 22282 nobody 4u IPv4 517989990 TCP 66.249.65.15:47451-www-va-1:http
(ESTABLISHED)
...
httpd 22282 nobody 17u IPv4 497473156 TCP *:80 (LISTEN)
...
File descriptor 4 is the connection to our client.
And we are stuck reading from it.
And we read nothing and then return to servicing others.
24
25. Guessing the problem
• Revisiting the “evil” lull:
• It is exactly 15 seconds. Every time.
• This can be confirmed with strace -ttt
• What is so special about 15 seconds?
• Our application?
• Apache?
• libc?
• kernel?
25
25
26. My guess: Apache
Apache talks directly to the client and issued
the “read” system call on which we are stuck.
; cd ~/src/apache_1.3.33/
; grep '#define.* 15$' `find . -name *.h`
./src/include/httpd.h:#define DEFAULT_KEEPALIVE_TIMEOUT 15
./src/include/httpd.h:#define M_INVALID 15
./src/lib/expat-lite/xmltok.h:#define XML_TOK_PROLOG_S 15
./src/modules/standard/mod_rewrite.h:#define MAX_ENV_FLAGS 15
26
27. Recap
• Keep-alives were our problems:
• Apache children were tied up waiting.
• A limited number of children.
• When all children are used:
• We have to wait until a child is free.
• Up to 15 seconds.
• Much longer if a back queue exists.
27
27
30. What’s up here?
• Sending data to client gets “stuck”
• writev() sticking is due to kernel buffers filling up.
• make the buffers bigger
• kernel buffer enlargement
• SendBufferSize in Apache
• Install a web accelerator
30
30
31. We’ve looked outward (toward clients)
It has been quick and painless
Looking Inward
The same technique can work for inward problems
at least some of them
31
33. Digging again
; lsof -p 20500
...
httpd 22282 nobody 4u IPv4 517989990 TCP 66.249.65.15:47451-www-va-1:http (ESTABLISHED)
...
httpd 22282 nobody 7u IPv4 517989990 TCP www-va-1.int:47451-dbhost:5432 (ESTABLISHED)
...
httpd 22282 nobody 17u IPv4 497473156 TCP *:80 (LISTEN)
...
File descriptor 7 is the connection to our database.
And we are stuck reading from it.
Hmm... that’s a query with a slow(ish) response
33
34. Not where we want to be.
• Did we successfully locate the problem?
• No...
• We know it is behind the web application.
• Not necessarily behind the web server.
34
34
35. An over-simplified web
transaction
database httpd web
(postgres) daemon browser
database web server client
server machine machine machine
The outward problem space
Disk
The inward problem space
The perspective from analysis on a web server
35
36. Jumping to the next step.
• •
In the outward case, In the inward case,
jumping outside (to the client) is not an jumping inside (to the database) is an
option. option.
• •
Jumping to a load balancer, switch or The web server is delayed reading
router is. from the DB
• •
The tools there are not so good. Confirm the DB is indeed working.
• Best to stay on the web server.
36
36
37. Looking at the system (database)
• What’s wrong with the system?
• Rephrased: what looks wrong?
• Wrong and right relate to a system returning correct results.
• What looks “different” is a better question.
• That requires a basis of comparison.
37
37
38. Looking at processes (database)
• So, what’s running?
• Clearly, a database.
• The site works (albeit slow),
so the database as functioning correctly.
• It’s not performing well.
• Which queries are running slow and why?
• Don’t assume anything.
38
38
42. So, we know the situation is bad.
• Knowing the situation is bad is good.
• We already knew that.
• We have made progress, we know disk service latency on the
database is a likely cause.
• Which queries are running slowly?
42
42
43. Which is the offending process?
• It is pretty hard to tell this on Linux.
• my technique usually involvess peeking at top
• taking a guess
• stracing
• repeat.
43
43
44. This Postgres instance is on Solaris
• I have DTrace
• I do not suffer from inadequacies.
• Next step... world domination.
44
44
48. Database: the offender
; echo “SELECT now() - query_start as duration, current_query
FROM pg_stat_activity
WHERE procpid = 4589” | psql pgods
duration | current_query
-----------------+------------------------------------------------------
00:01:23.345221 | UPDATE users SET emailaddress = lower(emailaddress);
• It should be rather obvious now.
• Someone issued an enormous update.
• It is inducing enormous disk I/O
• Slowing everything else in the system.
48
48
49. The importance of historical data
• •
Trend Analysis. At a bare minimum
• •
Provides a control for experiments. bandwidth
• •
Provides a foundation for conjectures. load
• •
Troubleshooting is about finding the disk I/O and service times
problem.
• memory usage
• Fixing it is a trivial exercise left to the
competent onlooker.
49
49
50. Thanks
Go OmniTI!
OmniTI is hiring:
Smart, fun, driven people
50
50