Every company likes to brag about its successes, but not many are willing to talk about their failures. At PagerDuty we have been rigorously tracking downtime in order to analyze it and learn from our mistakes, and we even blog about these failures publicly.
Although PagerDuty is a highly available system, we have had three outages caused by problems with our production Cassandra clusters over the past year. We'll take a look at each of these outages: what we saw from the inside, the actions we took to recover, and most importantly the procedures and monitoring that will help prevent the same thing from happening to you.
ONE YEAR OF CASSANDRA FAILURES (2015-09-30)
PagerDuty (simplified, circa early 2014)
[Architecture diagram. Labels: Customer, Monitoring system, events.pagerduty.com, Cassandra, Enqueuer, Dequeuer, Event Processing, Notifier, XtraDB, PagerDuty; notification channels: Phone, SMS, Email, Push, HTTP]
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 1
Background
• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s
Outage 1 - Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes
[Figure: Coordinator Read Latency (in ms, by host). Annotations: 6 seconds; ~25 ms]
The Plan
• Trigger repair…
… with lots of people watching
• Use our load shedding strategies for any problems:
• Proactively disable non-critical services
• Disable thrift
Surprise!
• Cron triggers a different repair
• Plus a compaction for a large column family (CF)
[Figure: Outgoing Notification Backlog Size. Annotated levels: Normal, Bad, Horrible :(]
[Figure: Cassandra Pending Tasks: ReadStage (by host). Annotation: Over 9000]
Aftermath: The Investigation
• Huge investigation
• Silver lining: learned a lot
• Host metrics (CPU, network, etc) fine most of the time
• Need to look at Cassandra metrics for leading indicators
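One way to watch a Cassandra-level leading indicator such as pending ReadStage tasks is to parse thread pool statistics. The sketch below is only an illustration: it assumes text in the column layout printed by `nodetool tpstats` (pool name, active, pending, ...), and the function names, the sample text, and the threshold of 1000 are hypothetical, not values from the talk.

```python
# Minimal sketch: extract per-pool pending task counts from text that is
# assumed to follow the `nodetool tpstats` table layout.

SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9001         123456         0                 0
MutationStage                     0         0         654321         0                 0

Message type           Dropped
READ                         5
"""


def parse_tpstats(text):
    """Return {pool_name: pending_count} for the thread pool table."""
    pending = {}
    in_pools = False
    for line in text.splitlines():
        if line.startswith("Pool Name"):
            in_pools = True  # header row; pool rows follow
            continue
        if not in_pools:
            continue
        parts = line.split()
        if not parts:
            break  # a blank line ends the pool table
        # Assumed columns: name, active, pending, completed, ...
        if len(parts) >= 3 and parts[2].isdigit():
            pending[parts[0]] = int(parts[2])
    return pending


def pools_over_threshold(pending, threshold):
    """Return the sorted names of pools whose pending count exceeds threshold."""
    return sorted(name for name, count in pending.items() if count > threshold)


if __name__ == "__main__":
    counts = parse_tpstats(SAMPLE_TPSTATS)
    # 1000 is an arbitrary, hypothetical alert threshold.
    print(pools_over_threshold(counts, 1000))
```

In practice the text would come from running the tool on each host and the result would feed whatever alerting system is already in place; the right threshold depends on the cluster.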
Investigation Conclusion
• Under-provisioned (mainly CPU)
• No partial progress
Lessons
• Capacity planning
• Important even with low volume
• Cassandra-specific monitoring
• Isolation
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 2
Changes
• Isolated clusters for everyone
• New service: heaviest Cassandra user so far
• Upgrade Cassandra version
Application Logs
ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:(String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
[Figure: “Cassandra Danger Metrics” (partial)]
Lessons
• Isolated clusters pay off
• How to do schema changes:
1. describe cluster;
2. <schema change for one CF>
3. describe cluster;
• Monitor for schema disagreement
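The "check, change one CF, check again" routine and the schema-disagreement monitor can both be built on a small parser. The sketch below is a hedged illustration: it assumes cluster description text with a "Schema versions:" section listing one `<uuid>: [hosts]` line per version (the layout printed by `nodetool describecluster`, with unreachable hosts listed under `UNREACHABLE`), and the function names and sample text are hypothetical.

```python
import re

# Minimal sketch: detect schema disagreement from cluster description text.
# A version line is assumed to look like "<36-char uuid>: [host, host]"
# or "UNREACHABLE: [host]".
_VERSION_LINE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")

SAMPLE_DISAGREEMENT = """\
Cluster Information:
    Name: Example Cluster
    Schema versions:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]
        e84b6a60-24cf-30ca-9b58-452d92911703: [10.0.0.3]
"""


def schema_versions(text):
    """Return {schema_version: [hosts]} parsed from the description text."""
    versions = {}
    for line in text.splitlines():
        match = _VERSION_LINE.match(line)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def schema_agrees(text):
    """True only if every host reports one schema version and none is unreachable."""
    versions = schema_versions(text)
    return len(versions) == 1 and "UNREACHABLE" not in versions


if __name__ == "__main__":
    print(schema_agrees(SAMPLE_DISAGREEMENT))
```

A wrapper could refuse to apply the next schema change until `schema_agrees` returns true, and a periodic job could alert when it stays false.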
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 3
[Figure: Application-Measured Cassandra Call Latency (ms, by CF). Annotations: 8 seconds; Normal: ~25 ms]
[Figure: “Cassandra Danger Metrics” (partial)]
Cassandra Logs (on working hosts)
INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)
The Culprit
Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db
sstable2json
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
export CASSANDRA_CONF=/etc/cassandra
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
export CASSANDRA_CONF=/etc/cassandra
sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Success!
Cassandra Thread Dump
"MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
at org.apache.cassandra.db.Table.apply(Table.java:375)
at org.apache.cassandra.db.Table.apply(Table.java:354)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Cassandra Thread Dump
"COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Cassandra Logs - Commit Log Allocator
Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
... 4 more
Lessons
• Be careful what habits you develop
• Tools should be as isolated & focused as possible
• Process startup code can create time bombs
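The slides do not spell out the mechanism, but the logs above are consistent with a tool run under sudo touching the commit log directory and leaving behind files that the Cassandra process could not later write or rename. Under that assumption, a periodic ownership check is one cheap safeguard. The sketch below is illustrative only: the function name is hypothetical, and which directory and which user to check are choices for the operator.

```python
import os


def files_not_owned_by(directory, expected_uid):
    """Return the sorted paths under `directory` whose owner is not expected_uid."""
    mismatched = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # lstat inspects the entry itself rather than following symlinks.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

As a usage example, on a host where Cassandra runs as a user named `cassandra` (an assumption), the check could be called as `files_not_owned_by("/var/lib/cassandra/commitlog", pwd.getpwnam("cassandra").pw_uid)` after importing `pwd`, with an alert raised on any non-empty result.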