2015-09-23
One Year of Cassandra Failures
donny@pagerduty.com
#CassandraSummit
2015-09-30
PagerDuty (simplified, circa early 2014)
[Architecture diagram. Labels, in extraction order: Monitoring system, events.pagerduty.com, Cassandra, Enqueuer, Dequeuer, Event Processing, Notifier, XtraDB, Phone, SMS, Email, Push, HTTP, PagerDuty, Customer]
Span the WAN? Yes you can!
Tomorrow at 9:50 AM
Paul Rechsteiner
Outage 1
“The Backlog”
2015-09-30
Background
ONE YEAR OF CASSANDRA FAILURES - OUTAGE 1
• Shared cluster, 5 machines (with replication factor = 5)
• 10s of GBs of data
• In-flight data: 10s of MBs, maybe 100s
Outage 1 - Foreshadowing
• Series of small outages / degradations
• Repair process started
• High load, high latency
• Response: disable thrift, turn off nodes
[Figure: Coordinator Read Latency (in ms, by host). Annotated values: 6 seconds, ~25 ms]
The Next Day…
The Plan
• Trigger repair… with lots of people watching
• Use our load shedding strategies for any problems:
  • Proactively disable non-critical services
  • Disable thrift
Surprise!
• Cron triggers a different repair
• Plus a compaction for a large CF
[Figure: Outgoing Notification Backlog Size. Annotated levels: Normal, Bad, Horrible, :( ]
[Figure: Cassandra Pending Tasks: ReadStage (by host). Annotated value: Over 9000]
[Figure: Cassandra CPU (by host). Annotated value: 100%]
Factory Reset
Success… kind of
Aftermath: The Investigation
• Huge investigation
• Silver lining: learned a lot
• Host metrics (CPU, network, etc) fine most of the time
• Need to look at Cassandra metrics for leading indicators
Investigation Conclusion
• Under-provisioned (mainly CPU)
• No partial progress
Lessons
• Capacity planning
  • Important even with low volume
• Cassandra-specific monitoring
• Isolation
Lessons - Metrics For Cassandra
• Dropped messages (leading)
• Blocked flush writers (leading)
• GC behavior (leading)
• Pending tasks: ReadStage, ResponseStage, etc (lagging)
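One way to watch the metrics listed above is to scrape thread-pool statistics from each node. The sketch below is illustrative only: it assumes text laid out like `nodetool tpstats` output from Cassandra releases of this era (a pool table followed by a dropped-message table). The function names, the `SAMPLE` text, and its numbers are invented for the example; the default threshold of 5 borrows the "should be small, < 5" rule of thumb from the MutationStage slide.

```python
def parse_tpstats(text):
    """Parse tpstats-shaped text into (pending_by_pool, dropped_by_message_type)."""
    pending, dropped = {}, {}
    section = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
            continue
        if line.startswith("Message type"):
            section = "dropped"
            continue
        # Pool rows: name, active, pending, completed, blocked, all-time blocked.
        if section == "pools" and len(parts) >= 3 and parts[2].isdigit():
            pending[parts[0]] = int(parts[2])
        # Dropped rows: message type, dropped count.
        elif section == "dropped" and len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return pending, dropped


def danger_signals(pending, dropped, max_pending=5):
    """Return human-readable alerts; the threshold is an assumption, tune per cluster."""
    alerts = []
    for message_type, count in sorted(dropped.items()):
        if count > 0:
            alerts.append(f"dropped {message_type}: {count}")
    for pool, count in sorted(pending.items()):
        if count >= max_pending:
            alerts.append(f"pending {pool}: {count}")
    return alerts


# Invented sample text, not captured from a real cluster.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9001        1234567         0                 0
MutationStage                     0         0        7654321         0                 0
FlushWriter                       0         0           1234         0                12

Message type           Dropped
READ                        42
MUTATION                     0
"""

pending, dropped = parse_tpstats(SAMPLE)
for alert in danger_signals(pending, dropped):
    print(alert)
```

Dropped messages are listed first because the slide marks them as a leading indicator, while pending tasks lag.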
Outage 2
“Aliens”
2015-09-30 ONE YEAR OF CASSANDRA FAILURES - OUTAGE 2
Changes
• Isolated clusters for everyone
• New service: heaviest Cassandra user so far
• Upgrade Cassandra version
Application Logs
ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:(String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
[Figure: “Cassandra Danger Metrics” (Partial)]
cassandra-cli - “describe cluster” - Bad Output
[default@Artemis] describe cluster;
Cluster Information:
Name: prod-artemis
Snitch: org.apache.cassandra.locator.PropertyFileSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
cassandra-cli - “describe cluster” - Good Output
[default@unknown] describe cluster;
Cluster Information:
Name: prod-artemis
Snitch: org.apache.cassandra.locator.PropertyFileSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
[Figure: Notifications Sent]
[Figure: Application-Measured Cassandra Call Latency (in ms, by CF). Annotated value: 15 seconds]
[Figure: Pending Tasks: MutationStage. Annotated value: 22,000 (should be small, < 5)]
Actions
17:01:21 disable thrift
17:02:08 kill repair
17:02:35 kill -9
[Figure: Cassandra Operations (cluster-wide, by CF). Annotated events: disable thrift, kill repair, kill -9]
Puzzle
• Why did one bad Cassandra node have such a huge effect?
Bad Coordinator
Timeout vs average request
10,000 ms / 25 ms = 400
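The ratio above can be turned into a rough throughput estimate. This is a back-of-the-envelope model, not the speaker's analysis: it assumes a synchronous client worker that picks coordinators round-robin, 7 hosts (the count in the describe cluster output), and that every request sent to the bad coordinator waits out the full 10,000 ms timeout. The function names are invented for the illustration.

```python
def cycle_ms(n_hosts, n_bad, normal_ms=25.0, timeout_ms=10_000.0):
    """Time for one worker to send one request to each coordinator in turn."""
    return (n_hosts - n_bad) * normal_ms + n_bad * timeout_ms


def worker_throughput(n_hosts, n_bad, normal_ms=25.0, timeout_ms=10_000.0):
    """Requests per second completed (or timed out) by one synchronous worker."""
    return n_hosts / cycle_ms(n_hosts, n_bad, normal_ms, timeout_ms) * 1000.0


def time_fraction_on_bad(n_hosts, n_bad, normal_ms=25.0, timeout_ms=10_000.0):
    """Share of a worker's wall-clock time spent waiting on bad coordinators."""
    return n_bad * timeout_ms / cycle_ms(n_hosts, n_bad, normal_ms, timeout_ms)


healthy = worker_throughput(7, 0)
degraded = worker_throughput(7, 1)
print(f"healthy:  {healthy:.2f} req/s per worker")
print(f"degraded: {degraded:.2f} req/s per worker")
print(f"time spent waiting on the bad node: {time_fraction_on_bad(7, 1):.1%}")
```

Under these assumptions a single bad coordinator out of seven absorbs nearly all of each worker's time, because one timed-out request costs as much as 400 normal ones.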
What Happened To Cassandra?
Lessons
• Isolated clusters pay off
• How to do schema changes:
1. describe cluster;
2. <schema change for one CF>
3. describe cluster;
• Monitor for schema disagreement
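Monitoring for schema disagreement could be as simple as counting the entries under "Schema versions:" in `describe cluster;` output. The sketch below parses text shaped like the cassandra-cli output shown on the earlier slides; the function names are made up for the example, and how the text is collected from a node is left out.

```python
import re

# A key line looks like "<schema-uuid>: [host, host, ...]".
KEY_LINE = re.compile(r"^([^\s:\[]+):\s*(.*)$")


def schema_versions(describe_output):
    """Map each schema version listed under 'Schema versions:' to its hosts."""
    versions = {}
    current = None
    in_section = False
    for line in describe_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Schema versions:"):
            in_section = True
            continue
        if not in_section or not stripped:
            continue
        match = KEY_LINE.match(stripped)
        if match:
            current, rest = match.group(1), match.group(2)
            versions[current] = []
        elif current is not None:
            rest = stripped  # host list continued on the next line
        else:
            continue
        versions[current].extend(h for h in re.split(r"[\[\],\s]+", rest) if h)
    return versions


def in_agreement(versions):
    """True only when every listed host reports the same single schema version."""
    return len(versions) == 1


# The "bad output" from the slides, hosts masked as in the original.
BAD_OUTPUT = """\
Cluster Information:
Name: prod-artemis
Schema versions:
52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2,
##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""

versions = schema_versions(BAD_OUTPUT)
if not in_agreement(versions):
    for version, hosts in versions.items():
        print(version, hosts)
```

Running the same check before and after each schema change mirrors the three-step procedure in the list above.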
Outage 3
“Human Error”
2015-09-30 ONE YEAR OF CASSANDRA FAILURES - OUTAGE 3
[Figure: Application-Measured Cassandra Call Latency (ms, by CF). Annotated values: 8 seconds; normal: ~25 ms]
[Figure: “Cassandra Danger Metrics” (partial)]
Cassandra Logs (on working hosts)
INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)
commitlog Directory
ls -la /var/lib/cassandra/commitlog/
total 1015360
drwxr-xr-x 2 cassandra root 4096 2014-12-18 03:36 .
drwxr-xr-x 6 cassandra root 4096 2014-08-19 17:00 ..
-- SNIP --
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533553.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533554.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737851.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737852.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:39 CommitLog-2-1418873737853.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873800630.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873812840.log
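The two root-owned segments in the listing above are the kind of thing a small check can catch early. The sketch below scans text shaped like this `ls -la` listing (eight whitespace-separated fields per file row, as shown) and reports files not owned by the expected user. The function name and the `cassandra` default are assumptions for the example.

```python
def foreign_owned(ls_output, expected_owner="cassandra"):
    """Return (file name, owner) pairs for regular files with an unexpected owner."""
    flagged = []
    for line in ls_output.splitlines():
        parts = line.split()
        # Regular-file rows: mode, links, owner, group, size, date, time, name.
        if len(parts) != 8 or not parts[0].startswith("-"):
            continue
        owner, name = parts[2], parts[7]
        if owner != expected_owner:
            flagged.append((name, owner))
    return flagged


# A few rows copied from the listing above.
LISTING = """\
total 1015360
drwxr-xr-x 2 cassandra root 4096 2014-12-18 03:36 .
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
-rw-r--r-- 1 root root 33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
-rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
"""

for name, owner in foreign_owned(LISTING):
    print(f"{name} is owned by {owner}")
```

On a live host, reading ownership directly (for example with `os.stat` and the `pwd` module) would likely be more robust than parsing `ls` text; the text form is used here only so the example stays self-contained.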
The Culprit
Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db
sstable2json
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
sstable2json
export CASSANDRA_CONF=/etc/cassandra
sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
sstable2json
export CASSANDRA_CONF=/etc/cassandra
sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
Success!
Cassandra Thread Dump
"MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
at org.apache.cassandra.db.Table.apply(Table.java:375)
at org.apache.cassandra.db.Table.apply(Table.java:354)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Cassandra Thread Dump
"COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Cassandra Logs - Commit Log Allocator
Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
... 4 more
Lessons
• Be careful what habits you develop
• Tools should be as isolated & focused as possible
• Process startup code can create time bombs
Concluding Thoughts
donny@pagerduty.com
Thank you.

Contenu connexe

Tendances

Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
DataStax
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
DataStax
 
TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWS
Pythian
 
Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
Axel Liljencrantz
 

Tendances (20)

The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
The Best and Worst of Cassandra-stress Tool (Christopher Batey, The Last Pick...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
 
C* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick BransonC* Summit 2013: Cassandra at Instagram by Rick Branson
C* Summit 2013: Cassandra at Instagram by Rick Branson
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Advanced Operations
Advanced OperationsAdvanced Operations
Advanced Operations
 
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
Lessons Learned on Java Tuning for Our Cassandra Clusters (Carlos Monroy, Kne...
 
Cassandra at teads
Cassandra at teadsCassandra at teads
Cassandra at teads
 
Bulk Loading into Cassandra
Bulk Loading into CassandraBulk Loading into Cassandra
Bulk Loading into Cassandra
 
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Ca...
 
DataStax: Backup and Restore in Cassandra and OpsCenter
DataStax: Backup and Restore in Cassandra and OpsCenterDataStax: Backup and Restore in Cassandra and OpsCenter
DataStax: Backup and Restore in Cassandra and OpsCenter
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 
Load testing Cassandra applications
Load testing Cassandra applicationsLoad testing Cassandra applications
Load testing Cassandra applications
 
TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWS
 
Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
 
Performance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
 
Introduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and ConsistencyIntroduction to Cassandra: Replication and Consistency
Introduction to Cassandra: Replication and Consistency
 
Mesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run CassandraMesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run Cassandra
 
Plmce2k15 15 tips galera cluster
Plmce2k15   15 tips galera clusterPlmce2k15   15 tips galera cluster
Plmce2k15 15 tips galera cluster
 

En vedette

The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
DataStax
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
DataStax
 
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
DataStax
 
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
DataStax
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
DataStax
 

En vedette (8)

The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
The Promise and Perils of Encrypting Cassandra Data (Ameesh Divatia, Baffle, ...
 
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
Building a Distributed Reservation System with Cassandra (Andrew Baker & Jeff...
 
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
 
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
 
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
Tales From the Field: The Wrong Way of Using Cassandra (Carlos Rolo, Pythian)...
 
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Always On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on CassandraAlways On: Building Highly Available Applications on Cassandra
Always On: Building Highly Available Applications on Cassandra
 

Similaire à PagerDuty: One Year of Cassandra Failures

Similaire à PagerDuty: One Year of Cassandra Failures (20)

Watching Your Cassandra Cluster Melt
Watching Your Cassandra Cluster MeltWatching Your Cassandra Cluster Melt
Watching Your Cassandra Cluster Melt
 
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
 
Operating CloudStack: the easy way (automation!)
Operating CloudStack: the easy way (automation!)Operating CloudStack: the easy way (automation!)
Operating CloudStack: the easy way (automation!)
 
Apache Cassandra: building a production app on an eventually-consistent DB
Apache Cassandra: building a production app on an eventually-consistent DBApache Cassandra: building a production app on an eventually-consistent DB
Apache Cassandra: building a production app on an eventually-consistent DB
 
Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and Hailo
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
 
Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
Save Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud CostsSave Money by Uncovering Kafka’s Hidden Cloud Costs
Save Money by Uncovering Kafka’s Hidden Cloud Costs
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
Getting started with Cassandra 2.1
Getting started with Cassandra 2.1Getting started with Cassandra 2.1
Getting started with Cassandra 2.1
 
MEETUP - Unboxing Apache Cassandra 3.10
MEETUP - Unboxing Apache Cassandra 3.10MEETUP - Unboxing Apache Cassandra 3.10
MEETUP - Unboxing Apache Cassandra 3.10
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Devops kc
Devops kcDevops kc
Devops kc
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Introduction to cloudforecast
Introduction to cloudforecastIntroduction to cloudforecast
Introduction to cloudforecast
 
Maximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra ConnectorMaximum Overdrive: Tuning the Spark Cassandra Connector
Maximum Overdrive: Tuning the Spark Cassandra Connector
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
TechEvent Apache Cassandra
TechEvent Apache CassandraTechEvent Apache Cassandra
TechEvent Apache Cassandra
 
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
 

Plus de DataStax Academy

Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 

Plus de DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
PagerDuty: One Year of Cassandra Failures

  • 1. 2015−09−23 One Year of Cassandra Failures donny@pagerduty.com #CassandraSummit
  • 2. 2015-09-30 PagerDuty (simplified, circa early 2014) ONE YEAR OF CASSANDRA FAILURES Monitoring system events.pagerduty.com Cassandra Enqueuer Dequeuer Event Processing Notifier XtraDB Phone SMS Email Push HTTP PagerDuty Customer
  • 3. 2015−09−23 Span the WAN? Yes you can! Tomorrow at 9:50 AM Paul Rechsteiner
  • 5. Outage 1: Background
      • Shared cluster, 5 machines (with replication factor = 5)
      • 10s of GBs of data
      • In-flight data: 10s of MBs, maybe 100s
  • 6. Outage 1: Foreshadowing
      • Series of small outages / degradations
      • Repair process started
      • High load, high latency
      • Response: disable thrift, turn off nodes
  • 7. Outage 1: [Figure: Coordinator Read Latency (in ms, by host); annotations: 6 seconds, ~25 ms]
  • 8-11. Outage 1: [Figures: Coordinator Read Latency (in ms, by host)]
  • 13. Outage 1: The Plan
      • Trigger repair… with lots of people watching
      • Use our load shedding strategies for any problems:
          • Proactively disable non-critical services
          • Disable thrift
  • 14. Outage 1: Surprise!
      • Cron triggers a different repair
      • Plus a compaction for a large CF
  • 15. Outage 1: [Figure: Outgoing Notification Backlog Size; annotations: Normal, Bad, Horrible]
  • 16. Outage 1: [Figure: Outgoing Notification Backlog Size; annotations: Normal, Bad, Horrible, :(]
  • 17. Outage 1: [Figure: Cassandra Pending Tasks: ReadStage (by host); annotation: Over 9000]
  • 18. Outage 1: [Figure: Cassandra CPU (by host); annotation: 100%]
  • 19. Outage 1: Factory Reset
      • Success… kind of
  • 20. Outage 1: Aftermath: The Investigation
      • Huge investigation
      • Silver lining: learned a lot
      • Host metrics (CPU, network, etc.) fine most of the time
      • Need to look at Cassandra metrics for leading indicators
  • 21. Outage 1: Investigation Conclusion
      • Under-provisioned (mainly CPU)
      • No partial progress
  • 22. Outage 1: Lessons
      • Capacity planning
      • Important even with low volume
      • Cassandra-specific monitoring
      • Isolation
  • 23. Outage 1: Lessons - Metrics For Cassandra
      • Dropped messages (leading)
      • Blocked flush writers (leading)
      • GC behavior (leading)
      • Pending tasks: ReadStage, ResponseStage, etc. (lagging)
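The leading and lagging indicators listed on slide 23 could be checked roughly as follows. This is a minimal sketch, not the tooling used in the talk: the sample text only imitates the column layout of `nodetool tpstats`, its numbers are made up, and the names `parse_tpstats`, `danger_signals`, and the `max_pending` threshold (borrowed from the "should be small, < 5" remark on slide 33) are assumptions for illustration. GC behavior is not covered here.

```python
# Illustrative sample only: the layout imitates `nodetool tpstats`; the numbers are invented.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32      9001        1234567         0                 0
MutationStage                     0         2        7654321         0                 0
FlushWriter                       1         0           4321         0                12

Message type           Dropped
READ                       250
MUTATION                     0
"""


def parse_tpstats(text):
    """Split tpstats-style text into per-pool stats and dropped-message counts."""
    pools, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(fields) == 6:
            name, _active, pending, _completed, _blocked, all_blocked = fields
            pools[name] = {"pending": int(pending), "all_time_blocked": int(all_blocked)}
        elif section == "dropped" and len(fields) == 2:
            dropped[fields[0]] = int(fields[1])
    return pools, dropped


def danger_signals(pools, dropped, max_pending=5):
    """Return human-readable warnings, leading indicators first."""
    signals = []
    for kind, count in sorted(dropped.items()):
        if count > 0:
            signals.append(f"leading: {count} dropped {kind} messages")
    flush = pools.get("FlushWriter")
    # The counter is cumulative, so a real check would compare successive samples.
    if flush and flush["all_time_blocked"] > 0:
        signals.append(f"leading: FlushWriter blocked {flush['all_time_blocked']} times (cumulative)")
    for name, stats in sorted(pools.items()):
        if stats["pending"] > max_pending:
            signals.append(f"lagging: {name} has {stats['pending']} pending tasks")
    return signals


pools, dropped = parse_tpstats(SAMPLE_TPSTATS)
signals = danger_signals(pools, dropped)
for signal in signals:
    print(signal)
```

In practice the text would come from each node rather than a constant, and the thresholds would need tuning per cluster.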
  • 25. Outage 2: Changes
      • Isolated clusters for everyone
      • New service: heaviest Cassandra user so far
      • Upgrade Cassandra version
  • 26. Outage 2: Application Logs
      ERR [20141202-23:14:02.808] #222 -- queue: There was a problem running the workqueue task for SimpleQueueable[entityId=deliveryProcessor_XXXXXXX]
      com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=##.###.##.1(##.###.##.1):9160, latency=24(24), attempts=1]InvalidRequestException(why:( String didn't validate.) [Artemis][MaterializedNotification][artemisAcceptedAt] failed validation)
          at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
          at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
          at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
          at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
  • 27. Outage 2: [Figure: “Cassandra Danger Metrics” (Partial)]
  • 28. Outage 2: cassandra-cli - “describe cluster” - Bad Output
      [default@Artemis] describe cluster;
      Cluster Information:
         Name: prod-artemis
         Snitch: org.apache.cassandra.locator.PropertyFileSnitch
         Partitioner: org.apache.cassandra.dht.RandomPartitioner
         Schema versions:
            52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
            676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
  • 30. Outage 2: cassandra-cli - “describe cluster” - Good Output
      [default@unknown] describe cluster;
      Cluster Information:
         Name: prod-artemis
         Snitch: org.apache.cassandra.locator.PropertyFileSnitch
         Partitioner: org.apache.cassandra.dht.RandomPartitioner
         Schema versions:
            676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
  • 31. Outage 2: [Figure: Notifications Sent]
  • 32. Outage 2: [Figure: Application-Measured Cassandra Call Latency (in ms, by CF); annotation: 15 seconds]
  • 33. Outage 2: [Figure: Pending Tasks: MutationStage; annotations: 22,000; should be small, < 5]
  • 34. Outage 2: Actions
      • 17:01:21 disable thrift
      • 17:02:08 kill repair
      • 17:02:35 kill dash nine
  • 35. Outage 2: [Figure: Cassandra Operations (cluster-wide, by CF)]
  • 36. Outage 2: [Figure: Cassandra Operations (cluster-wide, by CF); annotations: disable thrift, kill repair, kill -9]
  • 37. Outage 2: Puzzle
      • Why did one bad Cassandra node have such a huge effect?
  • 38. Outage 2: Bad Coordinator
      • Timeout vs average request: 10,000 ms / 25 ms = 400
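The ratio on slide 38 can be extended into a rough worked example of why a single bad coordinator hurts so much. The sketch below rests on assumptions that the slides do not state: the client spreads requests evenly over all 7 hosts (the host count comes from the "describe cluster" output), every request sent to the bad coordinator waits for the full 10,000 ms timeout, and each worker issues requests one at a time. The function name `mean_latency_ms` is hypothetical.

```python
def mean_latency_ms(nodes, bad_nodes, normal_ms, timeout_ms):
    """Expected latency per request when requests are spread evenly over coordinators."""
    p_bad = bad_nodes / nodes
    return (1 - p_bad) * normal_ms + p_bad * timeout_ms


normal_ms = 25.0       # typical request latency from the slides
timeout_ms = 10_000.0  # client timeout from the slides

# One timed-out request ties up a worker for as long as 400 normal requests.
cost_ratio = timeout_ms / normal_ms

# Assumed: 1 bad coordinator out of 7, chosen with equal probability.
mean_ms = mean_latency_ms(nodes=7, bad_nodes=1, normal_ms=normal_ms, timeout_ms=timeout_ms)
slowdown = mean_ms / normal_ms

print(f"cost ratio: {cost_ratio:.0f}")
print(f"mean latency: {mean_ms:.0f} ms")
print(f"slowdown per worker: {slowdown:.0f}x")
```

Under these assumptions a synchronous worker spends most of its time waiting on the one bad host, even though only about 14% of its requests go there.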
  • 39-40. Outage 2: [Figures: What Happened To Cassandra?]
  • 41. Outage 2: Lessons
      • Isolated clusters pay off
      • How to do schema changes:
          1. describe cluster;
          2. <schema change for one CF>
          3. describe cluster;
      • Monitor for schema disagreement
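The "describe cluster; before and after each schema change" routine from slide 41 lends itself to an automated check. The sketch below only parses text shaped like the cassandra-cli output shown on slides 28 and 30; how that text is captured from a live cluster is left out. The function names `schema_versions` and `schema_agrees` are hypothetical, and keys that are not UUIDs are ignored by this sketch.

```python
import re

# Matches lines such as "676d41bc-...-b1056dea1ca6: [host1, host2]".
SCHEMA_LINE = re.compile(
    r"^\s*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}):\s*\[(.*)\]\s*$"
)

# Sample outputs copied from the slides (addresses are masked in the original).
BAD_OUTPUT = """\
Cluster Information:
   Name: prod-artemis
   Schema versions:
      52eee0b6-dabb-3c44-af80-970b0e7f63ff: [##.###.##.1]
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""

GOOD_OUTPUT = """\
Cluster Information:
   Name: prod-artemis
   Schema versions:
      676d41bc-b9ce-3513-a232-b1056dea1ca6: [##.###.##.1, ##.###.##.2, ##.###.##.3, ##.###.##.4, ##.###.##.5, ##.###.##.6, ##.###.##.7]
"""


def schema_versions(describe_output):
    """Map each schema version UUID to the list of hosts reporting it."""
    versions = {}
    for line in describe_output.splitlines():
        match = SCHEMA_LINE.match(line)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def schema_agrees(describe_output):
    """True only when exactly one schema version is reported."""
    return len(schema_versions(describe_output)) == 1


print("bad output agrees: ", schema_agrees(BAD_OUTPUT))
print("good output agrees:", schema_agrees(GOOD_OUTPUT))
```

A schema-change script could refuse to proceed unless `schema_agrees` holds both before and after each change, and the same check could run periodically as a monitor.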
  • 43. Outage 3: [Figure: Application-Measured Cassandra Call Latency (ms, by CF); annotations: 8 seconds; Normal: ~25 ms]
  • 44. Outage 3: [Figure: “Cassandra Danger Metrics” (partial)]
  • 45. Outage 3: Cassandra Logs (on working hosts)
      INFO [HintedHandoff:2] 2014-12-18 03:21:39,396 HintedHandOffManager.java (line 427) Timed out replaying hints to /##.###.##.6; aborting (9079 delivered)
  • 46. Outage 3: commitlog Directory
      ls -la /var/lib/cassandra/commitlog/
      total 1015360
      drwxr-xr-x 2 cassandra root      4096     2014-12-18 03:36 .
      drwxr-xr-x 6 cassandra root      4096     2014-08-19 17:00 ..
      -- SNIP --
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533553.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:33 CommitLog-2-1418873533554.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:34 CommitLog-2-1418873533555.log
      -rw-r--r-- 1 root      root      33554432 2014-11-26 21:40 CommitLog-2-1418873533556.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737850.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737851.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873737852.log
      -rw-r--r-- 1 root      root      33554432 2014-11-26 21:39 CommitLog-2-1418873737853.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873800630.log
      -rw-r--r-- 1 cassandra cassandra 33554432 2014-12-18 03:36 CommitLog-2-1418873812840.log
  • 47. Outage 3: The Culprit
      Nov 26 21:39:53 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10035-Data.db
      Nov 26 21:40:12 prod-artemis-cass06 sudo: donny : TTY=pts/0 ; PWD=/var/lib/cassandra/data/ArtemisQueue/WorkQueue ; USER=root ; COMMAND=/usr/local/share/cassandra/bin/sstable2json ArtemisQueue-WorkQueue-ic-10037-Data.db
  • 48. Outage 3: sstable2json
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
  • 49. Outage 3: sstable2json
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      ERROR 14:11:08,067 Cannot open /var/lib/cassandra/data/system/peer_events/system-peer_events-ic-57; partitioner org.apache.cassandra.dht.RandomPartitioner does not match system partitioner org.apache.cassandra.dht.Murmur3Partitioner. Note that the default partitioner starting with Cassandra 1.2 is Murmur3Partitioner, so you will need to edit that to match your old partitioner if upgrading.
  • 50. Outage 3: sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
  • 51. Outage 3: sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      Exception in thread "COMMIT-LOG-ALLOCATOR" FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
          at org.apache.cassandra.db.commitlog.CommitLogSegment.freshSegment(CommitLogSegment.java:84)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.createFreshSegment(CommitLogAllocator.java:251)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.access$500(CommitLogAllocator.java:49)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:105)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.FileNotFoundException: /var/lib/cassandra/commitlog/CommitLog-2-1441980887051.log (Permission denied)
          at java.io.RandomAccessFile.open(Native Method)
          at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:119)
  • 52. Outage 3: sstable2json
      export CASSANDRA_CONF=/etc/cassandra
      sudo sstable2json ArtemisQueue-WorkQueue-ic-2211-Data.db
      Success!
  • 54. Outage 3: Cassandra Thread Dump
      "MutationStage:30" daemon prio=10 tid=0x00007fec64ed9000 nid=0x1fe3 waiting on condition [0x00007fe3b56da000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for <0x000000061406ffe8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
          at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:349)
          at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService.add(PeriodicCommitLogExecutorService.java:93)
          at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:191)
          at org.apache.cassandra.db.Table.apply(Table.java:375)
          at org.apache.cassandra.db.Table.apply(Table.java:354)
          at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:283)
          at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:56)
          at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
  • 55. Outage 3: Cassandra Thread Dump
      "COMMIT-LOG-WRITER" prio=10 tid=0x00007fec64293800 nid=0x1f8b waiting on condition [0x00007fec687d0000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for <0x000000061417d0d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
          at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator.fetchSegment(CommitLogAllocator.java:126)
          at org.apache.cassandra.db.commitlog.CommitLog.activateNextSegment(CommitLog.java:305)
          at org.apache.cassandra.db.commitlog.CommitLog.access$100(CommitLog.java:44)
          at org.apache.cassandra.db.commitlog.CommitLog$LogRecordAdder.run(CommitLog.java:356)
          at org.apache.cassandra.db.commitlog.PeriodicCommitLogExecutorService$1.runMayThrow(PeriodicCommitLogExecutorService.java:46)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
  • 56. Outage 3: Cassandra Logs - Commit Log Allocator
      Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main] FSWriteError in /var/lib/cassandra/commitlog/CommitLog-2-1442099692080.log
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:135)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:197)
          at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:95)
          at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Rename from /var/lib/cassandra/commitlog/CommitLog-2-1418868735344.log to 1418873812840 failed
          at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:113)
          ... 4 more
  • 57. Outage 3: Lessons
      • Be careful what habits you develop
      • Tools should be as isolated & focused as possible
      • Process startup code can create time bombs
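The root-owned commit log segments from Outage 3 (slide 46) sat unnoticed for weeks before the allocator tried to reuse them. A periodic ownership check along the following lines might surface that kind of time bomb earlier. This is a sketch under assumptions: it targets POSIX systems, the function name `files_not_owned_by` is hypothetical, and the demonstration runs against a throwaway temporary directory rather than a real Cassandra data directory.

```python
import os
import tempfile
from pathlib import Path


def files_not_owned_by(directory, expected_uid):
    """Return names of regular files in `directory` whose owner uid differs from `expected_uid`."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.stat().st_uid != expected_uid
    )


# Demonstration on a temporary directory; the file name is made up.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "CommitLog-demo.log").write_bytes(b"")
    my_uid = os.geteuid()
    # The file was just created by this process, so nothing is flagged.
    clean = files_not_owned_by(demo_dir, my_uid)
    # Pretend the expected owner is some other account: the file is flagged.
    flagged = files_not_owned_by(demo_dir, my_uid + 1)

print("owned by us:      ", clean)
print("foreign ownership:", flagged)
```

Against a real node one would presumably pass the commit log path from the listing (`/var/lib/cassandra/commitlog`) and the uid of the service account, for example via `pwd.getpwnam("cassandra").pw_uid`, and alert on any non-empty result.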