You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…
Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!
In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.
25. @leinweber
cpu mem disk parallelism
credentials wrong
networking broken
locking issue, check pg_locks
idle in transaction
26. @leinweber
cpu mem disk parallelism
application submitting backlogged workload
connection leak
pool sizes set too large
pg_lock issue + application backlog
27. @leinweber
cpu mem disk parallelism
workload skew causing thrashing
unusual sequential scan workload
failover or restart => no cache
pg_prewarm
28. @leinweber
cpu mem disk parallelism
same as just disk,
but also the application is piling on
29. @leinweber
cpu mem disk parallelism
large GROUP BYs
high disk latency due to unusual page
dispersion pattern in the workload
30. @leinweber
cpu mem disk parallelism
workload has high mem (GROUP BY)
+ app adding backlog
lock contention slowing mem release
60. @leinweber
flirting with disaster
Velocity NY 2013: Richard Cook
"Resilience In Complex Adaptive Systems”
Jens Rasmussen:
Risk management in a dynamic society: a
modeling problem
68. @leinweber
flirting with disaster
Velocity NY 2013: Richard Cook
"Resilience In Complex Adaptive Systems”
Jens Rasmussen:
Risk management in a dynamic society: a
modeling problem