Cassandra Operations at Netflix
Gregg Ulrich


Agenda
• Who we are
• How much we use Cassandra
• How we do it
• What we learned

Who we are
• Cloud Database Engineering
    • Development – Cassandra and related tools
    • Architecture – data modeling and sizing
    • Operations – availability, performance and maintenance
• Operations
    • 24x7 on-call support for all Cassandra clusters
    • Cassandra operations tools
    • Proactive problem hunting
    • Routine and non-routine maintenance

How much we use Cassandra

30         Number of production clusters
12         Number of multi-region clusters
3          Max regions, one cluster
65         Total TB of data across all clusters
472        Number of Cassandra nodes
72/28      Largest Cassandra cluster (nodes/data in TB)
50k/250k   Max read/writes per second on a single cluster
3*         Size of Operations team

                   * Open position for an additional engineer
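
The figures above imply some rough per-node averages. The short sketch below only derives them from the numbers on this slide; it assumes decimal terabytes (1 TB = 1000 GB) and says nothing about how data is actually distributed across clusters.

```python
# Figures quoted on the slide; the derived averages are illustrative only.
TOTAL_DATA_TB = 65
TOTAL_NODES = 472
LARGEST_CLUSTER_NODES = 72
LARGEST_CLUSTER_TB = 28
OPS_TEAM_SIZE = 3

GB_PER_TB = 1000  # assumption: decimal terabytes


def per_node_gb(data_tb, nodes):
    """Average data per node in GB, assuming an even spread."""
    return data_tb * GB_PER_TB / nodes


fleet_avg_gb = per_node_gb(TOTAL_DATA_TB, TOTAL_NODES)
largest_avg_gb = per_node_gb(LARGEST_CLUSTER_TB, LARGEST_CLUSTER_NODES)
nodes_per_engineer = TOTAL_NODES / OPS_TEAM_SIZE

print(f"fleet average:   {fleet_avg_gb:.0f} GB per node")
print(f"largest cluster: {largest_avg_gb:.0f} GB per node")
print(f"nodes per operations engineer: {nodes_per_engineer:.0f}")
```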
I read that Netflix doesn’t have operations
• Extension of Amazon’s PaaS
• Decentralized Cassandra ops is expensive at scale
• Immature product that changes rapidly (and drastically)
• Easily apply best practices across all clusters

How we configure Cassandra in AWS
• Most services get their own Cassandra cluster
• Mostly m2.4xlarge instances, but considering others
• Cassandra and supporting tools baked into the AMI
• Data stored on ephemeral drives
• Data durability – all writes to all availability zones
    • Alternate AZs in a replication set
    • RF = 3

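
A minimal sketch of why alternating AZs around the ring puts one copy of every row in each zone when RF = 3. It assumes replicas land on consecutive ring positions, which is a simplification and not necessarily the placement strategy used in production; the helper names are hypothetical.

```python
RF = 3
AZS = ["AZ1", "AZ2", "AZ3"]


def build_ring(node_count):
    """Assign AZs round-robin so ring neighbours sit in different AZs."""
    return [AZS[i % len(AZS)] for i in range(node_count)]


def replica_azs(ring, start, rf=RF):
    """AZs of the rf consecutive nodes starting at ring position `start`
    (assumed placement: walk the ring clockwise)."""
    return [ring[(start + k) % len(ring)] for k in range(rf)]


ring = build_ring(6)
for position in range(len(ring)):
    azs = replica_azs(ring, position)
    # Every replica set covers all three availability zones.
    assert set(azs) == set(AZS)
    print(position, azs)
```

The property holds here because the node count is a multiple of the number of AZs; a ring that breaks the alternation would place two replicas in one zone.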
Minimum cluster configuration
• Minimum production cluster configuration – 6 nodes
    • 3 auto-scaling groups
    • 2 instances per auto-scaling group
    • 1 availability zone per auto-scaling group

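
The layout above can be sketched as data, together with a rough check of what losing one AZ costs. The dictionary fields are hypothetical, and the quorum check assumes quorum-level reads and writes, which the slides do not state.

```python
from collections import Counter

ASG_COUNT = 3
INSTANCES_PER_ASG = 2
RF = 3


def minimum_cluster():
    """Build the 6-node minimum layout: one ASG per AZ, two instances each."""
    nodes = []
    for slot in range(INSTANCES_PER_ASG):
        for asg in range(1, ASG_COUNT + 1):
            nodes.append({"asg": f"ASG{asg}", "az": f"AZ{asg}", "slot": slot})
    return nodes


def quorum(rf):
    """Number of replicas that must respond for a quorum operation."""
    return rf // 2 + 1


cluster = minimum_cluster()
nodes_per_az = Counter(node["az"] for node in cluster)

# Simulate losing one availability zone.
survivors = [node for node in cluster if node["az"] != "AZ2"]
# With one replica per AZ, losing an AZ removes exactly one replica of each row.
replicas_left = RF - 1

print(len(cluster), dict(nodes_per_az))
print(len(survivors), "nodes survive;", replicas_left, "replicas remain per row")
print("quorum still reachable:", replicas_left >= quorum(RF))
```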
Minimum cluster configuration, illustrated
[Figure: ASG1 in AZ1, ASG2 in AZ2 and ASG3 in AZ3; labels: RF=3, PRIAM]

Tools we use
• Administration
    • Priam
    • Jenkins
• Monitoring and alerting
    • Cassandra Explorer
    • Dashboards
    • Epic

Tools we use – Priam
• Open-sourced Tomcat webapp running on each instance
• Multi-region token management via SimpleDB
• Node replacement and ring expansion
• Backup and restore
    • Full nightly snapshot backup to S3
    • Incremental backup of flushed SSTables to S3 every 30 seconds
• Metrics collected via JMX
• REST API to most nodetool functions

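
The speaker notes for this deck mention that Priam only supports doubling the ring. A rough sketch of what doubling means for token assignment follows, assuming a RandomPartitioner-style token space of 2^127 and evenly spaced tokens. This is not Priam's code, and it ignores any per-region token offsets; the function names are hypothetical.

```python
RING_SIZE = 2 ** 127  # assumption: RandomPartitioner-style token space


def initial_tokens(node_count):
    """Evenly spaced tokens for a ring of `node_count` nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def double_ring(tokens):
    """Tokens for the new nodes: each one bisects an existing token range,
    so no existing node has to change its token."""
    ordered = sorted(tokens)
    new_tokens = []
    for i, token in enumerate(ordered):
        next_token = ordered[(i + 1) % len(ordered)]
        span = (next_token - token) % RING_SIZE
        if span == 0:  # single-node ring owns the whole token space
            span = RING_SIZE
        new_tokens.append((token + span // 2) % RING_SIZE)
    return new_tokens


old = initial_tokens(4)
added = double_ring(old)
# After doubling, old and new tokens together are evenly spaced again.
assert sorted(old + added) == initial_tokens(8)
```

Doubling keeps the ring balanced without moving existing tokens, which may be one reason to accept the restriction; growing by any other factor would require reassigning tokens on existing nodes to stay evenly spaced.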
Tools we use – Cassandra Explorer
• Kiosk mode – no alerting
• High level cluster status (thrift, gossip)
• Warns on a small set of metrics

Tools we use – Epic
• Netflix-wide monitoring and alerting tool based on RRD
• Priam proxies all JMX data to Epic
• Very useful for finding specific issues

Tools we use – Dashboards
• Next level cluster metrics
    • Throughput
    • Latency
    • Gossip status
    • Maintenance operations
    • Trouble indicators
• Useful for finding anomalies
• Most investigations start here

Tools we use – Jenkins
• Scheduling tool for additional monitors and maintenance tasks
• Push-button automation for recurring tasks
• Repairs, upgrades, and other tasks are only performed through Jenkins to preserve history of actions
• On-call dashboard displays current issues and maintenance required

Things we monitor
Cassandra                 System
   Throughput               Disk space
   Latency                  Load average
   Compactions              I/O errors
   Repairs                  Network errors
   Pending threads
   Dropped operations
   Java heap
   SSTable counts
   Cassandra log files
Other things we monitor
• Compaction predictions
• Backup failures
• Recent restarts
• Schema changes
• Monitors

What we learned
• Having Cassandra developers in house is crucial
• Repairs are incredibly expensive
• Multi-tenanted clusters are challenging
• A down node is better than a slow node
• Better to compact on our terms and not Cassandra’s
• Sizing and tuning is difficult and often done live
• Smaller per-node data size is better

Q&A (and Recommended viewing)
     The Best of Times
     Taft and Bakersfield are real places


     South Park
     Later season episodes like F-Word and Elementary School Musical


     Caillou
     My kids love this show; I don’t know why


     Until the Light Takes Us
     Scary documentary on Norwegian Black Metal



Editor's notes

  1. Keywords – Agenda
  2. Centralized Cassandra team used as a resource for other teams.
  3. Minimum cluster size = 6
  4. Don’t developers do everything? True for most of the services, but Cassandra is an exception. We needed a team focused on Cassandra so that services could adopt it quickly.
  5. m2.4xlarge specifications:
       • 68.4 GB of memory
       • 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
       • 1690 GB of instance storage
       • 64-bit platform
       • I/O performance: High
       • API name: m2.4xlarge
     Ephemeral drives mean that we have to bootstrap new nodes.
  6. Brief overview on this slide; go into detail on the next one.
  7. Things to cover on this slide:
       • How AWS balances between AZs
       • What happens when an AZ goes away
       • How Priam alternates nodes around the ring, even in multi-region (MR) clusters
  8. (Vijay should have covered a lot of this.) Refer back to the previous slide. REST is useful for automation: we do not have to connect to nodes directly or use JMX. Priam only supports doubling the ring.
  9. Node, AZ and cluster level metrics. Time series metrics with extensive history. Can compare multiple metrics on one graph. Can also be configured to send alerts.
  10. Extension of Epic, using preconfigured dashboards for each cluster. We add additional metrics as we learn which to monitor.
  11. Cluster level monitoring, or things that we cannot easily derive from JMX or Epic.
  12. Try to anticipate when a large minor compaction is going to happen. Freedom and responsibility has forced us to monitor schema changes. We want to understand every time Cassandra restarts. AWS very infrequently swaps out bad nodes; nodes usually become non-responsive.
  13. Details behind each lesson:
       • Developers in house: quickly find problems by looking into the code; documentation and tools for troubleshooting are scarce.
       • Repairs: affect the entire replication set and cause very high latency in an I/O constrained environment.
       • Multi-tenant: hard to track changes being made; shared resources mean that one service can affect another; individual usage only grows; moving a service to a new cluster while the service is live is non-trivial.
       • Smaller per-node data: instance level operations (bootstrap, compact, etc.) are faster.
  13. … Developer in house …Quickly find problems by looking into codeDocumentation/tools for troubleshooting are scarce… repairs …Affect entire replication set, cause very high latency in I/O constrained environment… multi-tenant …Hard to track changes being madeShared resources mean that one service can affect another oneIndividual usage only growsMoving services to a new cluster with the service live is non-trivial… smaller per-node data …Instance level operations (bootstrap, compact, etc) are faster