Netflix has updated its Cassandra tools and benchmarks and added new ones over the last year. In this talk we cover the latest additions and recipes for the Astyanax Java client, updates to Priam to support Cassandra 1.2 Vnodes, plus newly released and upcoming tools that are all part of the NetflixOSS platform. Following on from the Cassandra-on-SSD-on-AWS benchmark that was run live during the 2012 Summit, we've been benchmarking a large write-intensive multi-region cluster to see how far we can push it. Cassandra is the data storage and global replication foundation for the Cloud Native architecture that runs Netflix streaming for 36 million users. Netflix is also offering a Cloud Prize for open source contributions to NetflixOSS; there are ten categories, including Best Datastore Integration and Best Contribution to Performance Improvements, with $10K cash and $5K of AWS credits for each winner. We'd like to pay you to use our free software!
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adrian Cockcroft
1. Netflix Open Source Tools and Benchmarks for Cassandra
June 2013
Adrian Cockcroft
@adrianco #cassandra13 @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
11. Netflix Platform Evolution
2009-2010: Bleeding Edge Innovation -> 2011-2012: Common Pattern -> 2013-2014: Shared Pattern
Netflix started out several years ahead of the industry, but it's becoming commoditized now
12. Goals
• Establish our solutions as Best Practices / Standards
• Hire, Retain and Engage Top Engineers
• Build up Netflix Technology Brand
• Benefit from a shared ecosystem
15. Judges
• Aino Corry – Program Chair for QCon/GOTO
• Martin Fowler – Chief Scientist, Thoughtworks
• Simon Wardley – Strategist
• Yury Izrailevsky – VP Cloud, Netflix
• Werner Vogels – CTO, Amazon
• Joe Weinman – SVP Telx, Author of "Cloudonomics"
16. What do you win?
• One winner in each of the 10 categories
• Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas
• A trophy
17. Entrants
Registration opened March 13. Entrants submit nominations as Apache-licensed contributions on Github. Netflix Engineering checks that each entry conforms to the rules, has working code and community traction, and fits one of the categories. Entries close September 15. Six judges then pick the winners of the ten prize categories, each worth $10K cash, $5K of AWS credits, AWS Re:Invent tickets and a trophy, awarded at a ceremony dinner in November at AWS Re:Invent.
19. Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
20. How Netflix Streaming Works
Customer devices (PC, PS3, TV…) running consumer electronics clients talk to two back ends:
• AWS Cloud Services – Web Site or Discovery API, User Data, Personalization, Streaming API, DRM, QoS Logging, CDN Management and Steering, Content Encoding
• CDN Edge Locations – OpenConnect CDN boxes
21. [Chart: streaming bandwidth share, Nov 2012 to March 2013]
Netflix streaming bandwidth vs. Amazon Video at 1.31% (Netflix at 18x to 25x Amazon Prime); Netflix mean bandwidth grew +39% in 6 months.
22. Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
Starting from the web server, the flow fans out to memcached, Cassandra, web services and an S3 bucket, including the personalization movie group choosers (for US, Canada and Latam). Each icon is three to a few hundred instances across three AWS zones.
24. Three Balanced Availability Zones
Test with Chaos Gorilla: load balancers feed Cassandra and Evcache replicas in Zones A, B and C
25. Triple Replicated Persistence
Cassandra maintenance affects individual replicas; the other Cassandra and Evcache replicas in Zones A, B and C keep serving behind the load balancers
26. Isolated Regions
US-East load balancers route to Cassandra replicas in Zones A, B and C; EU-West load balancers route to a separate set of Cassandra replicas in Zones A, B and C
27. Failure Modes and Effects

| Failure Mode       | Probability | Current Mitigation Plan              |
|--------------------|-------------|--------------------------------------|
| Application failure | High       | Automatic degraded response          |
| AWS region failure  | Low        | Switch traffic between regions       |
| AWS zone failure    | Medium     | Continue to run on 2 out of 3 zones  |
| Datacenter failure  | Medium     | Migrate more functions to cloud      |
| Data store failure  | Low        | Restore from S3 backups              |
| S3 failure          | Low        | Restore from remote archive          |

Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn't make sense. Working on it now.
28. Highly Available Storage
A highly scalable, available and durable deployment pattern based on Apache Cassandra
29. Single Function Micro-Service Pattern
One keyspace per cluster, replacing a single table or materialized view. Each single-function Cassandra cluster is managed by Priam and runs between 6 and 144 nodes, fronted by a stateless data access REST service built on the Astyanax Cassandra client, with an optional datacenter update flow. Many different single-function REST clients call each service. In the AppDynamics service flow visualization, each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. A sketch of such a data access service follows below.
• Over 50 Cassandra clusters
• Over 1000 nodes
• Over 30TB of backups
• Over 1M writes/s/cluster
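As a rough illustration (not Netflix's actual code), a stateless data access REST service in this pattern might look like the following JAX-RS resource wrapping an Astyanax Keyspace; the resource path, column family and field names are hypothetical.

```java
// Hypothetical single-function data access service: a stateless JAX-RS
// resource fronting one Cassandra keyspace through Astyanax.
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/users")
public class UserResource {
    // One column family for this single-function cluster ("users" is made up)
    private static final ColumnFamily<String, String> CF_USERS =
            new ColumnFamily<String, String>("users",
                    StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace; // wired up at startup; see the Astyanax setup sketch later

    public UserResource(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public String getUser(@PathParam("id") String id) throws Exception {
        // Read one row; the service itself holds no state
        ColumnList<String> row = keyspace.prepareQuery(CF_USERS)
                .getKey(id).execute().getResult();
        return row.getColumnByName("json").getStringValue();
    }

    @PUT
    @Path("/{id}")
    @Consumes(MediaType.APPLICATION_JSON)
    public void putUser(@PathParam("id") String id, String json) throws Exception {
        // Write through a mutation batch; null TTL means the column never expires
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF_USERS, id).putColumn("json", json, null);
        m.execute();
    }
}
```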
30. Stateless Micro-Service Architecture
• Linux Base AMI (CentOS or Ubuntu)
• Optional Apache frontend, memcached, non-java apps
• Monitoring – log rotation to S3, AppDynamics machineagent, Epic/Atlas
• Java (JDK 6 or 7) – AppDynamics appagent monitoring, GC and thread dump logging
• Tomcat – application war file, base servlet, platform, client interface jars, Astyanax; healthcheck, status servlets, JMX interface, Servo autoscale
31. Cassandra Instance Architecture
• Linux Base AMI (CentOS or Ubuntu)
• Tomcat and Priam on JDK – healthcheck, status
• Monitoring – AppDynamics machineagent, Epic/Atlas
• Java (JDK 7) – AppDynamics appagent monitoring, GC and thread dump logging
• Cassandra Server
• Local ephemeral disk space – 2TB of SSD or 1.6TB disk holding commit log and SSTables
32. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
33. Priam for C* 1.2 Vnodes
• Prototype work started by Jason Brown
• Completed
– Restructured Priam for Vnode management
• ToDo
– Re-think SSTable backup/restore strategy
34. Cloud Native Big Data
Size the cluster to the data
Size the cluster to the questions
Never wait for space or answers
35. Netflix Dataoven
Data pipelines feed a data warehouse of over 2 Petabytes: Ursula brings in ~100 billion events/day from cloud services, and Aegisthus extracts terabytes of dimension data from C*. Hadoop clusters run on AWS EMR (1300 nodes, 800 nodes, and multiple 150-node nightly clusters) alongside RDS metadata, gateways and tools.
36. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
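For illustration only, a rough sketch of the kind of Hadoop mapper that could consume Aegisthus output; it assumes the SSTables have already been converted to one JSON row per line, and the key/payload handling here is a made-up simplification of what the real pipeline does (see the techblog post above for the actual design).

```java
// Sketch of a Hadoop mapper that re-normalizes Aegisthus JSON output.
// Assumes one Cassandra row per input line (key plus columns as JSON);
// the parsing below is an illustrative simplification.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SSTableRowMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text jsonRow, Context ctx)
            throws IOException, InterruptedException {
        // A real job would parse the JSON properly; here we just split
        // the row key from the payload at the first colon.
        String line = jsonRow.toString();
        int sep = line.indexOf(':');
        if (sep < 0) return; // skip malformed lines
        String rowKey = line.substring(0, sep).replaceAll("[\"{\\s]", "");
        // Emit rowKey -> payload so the reducer can merge the many
        // de-normalized cluster views into one consistent record.
        ctx.write(new Text(rowKey), new Text(line.substring(sep + 1)));
    }
}
```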
37. Global Architecture
Local Client Traffic to Cassandra
Synchronous Replication Across Zones
Asynchronous Replication Across Regions
38. Astyanax Cassandra Client for Java
Available at http://github.com/netflix
• Features
– Abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
– Batch manager
– Many useful recipes
– New: Entity Mapper based on JPA annotations
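To make the fluent style concrete, here is a minimal setup and read/write sketch based on the Astyanax getting-started examples; the cluster, keyspace, seed and column family names are placeholders, not Netflix's.

```java
// Minimal Astyanax sketch: token-aware connection pool plus a fluent
// write and read. All names here are placeholders.
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxExample {
    public static void main(String[] args) throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")       // placeholder
                .forKeyspace("TestKeyspace")     // placeholder
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
                .withConnectionPoolConfiguration(
                        new ConnectionPoolConfigurationImpl("myPool")
                                .setPort(9160)
                                .setMaxConnsPerHost(3)
                                .setSeeds("127.0.0.1:9160")) // placeholder seed
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        ColumnFamily<String, String> CF = new ColumnFamily<String, String>(
                "Standard1", StringSerializer.get(), StringSerializer.get());

        // Fluent-style batched write (retries with backoff are handled
        // inside the client according to its retry policy)
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF, "user-1").putColumn("email", "user@example.com", null);
        m.execute();

        // Fluent-style read of a single column
        String email = keyspace.prepareQuery(CF)
                .getKey("user-1")
                .getColumn("email")
                .execute().getResult().getStringValue();
        System.out.println(email);
        context.shutdown();
    }
}
```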
39. Recipes
• Distributed row lock (without needing zookeeper)
• Multi-region row lock
• Uniqueness constraint
• Multi-row uniqueness constraint
• Chunked and multi-threaded large file storage
• Reverse index search
• All rows query
• Durable message queue
• Contributed: High cardinality reverse index
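As an example of the recipes, a hedged sketch of the distributed row lock, which needs no ZooKeeper because the lock is just a column written to Cassandra itself; the column family name and timing values are made up, and the recipes wiki documents the real options.

```java
// Sketch of the Astyanax distributed row lock recipe. Keyspace wiring
// and the "locks" column family name are placeholders.
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.recipes.locks.ColumnPrefixDistributedRowLock;
import com.netflix.astyanax.serializers.StringSerializer;
import java.util.concurrent.TimeUnit;

public class RowLockExample {
    static void updateWithLock(Keyspace keyspace, String rowKey) throws Exception {
        ColumnFamily<String, String> CF_LOCKS = new ColumnFamily<String, String>(
                "locks", StringSerializer.get(), StringSerializer.get());

        ColumnPrefixDistributedRowLock<String> lock =
                new ColumnPrefixDistributedRowLock<String>(keyspace, CF_LOCKS, rowKey)
                        .expireLockAfter(30, TimeUnit.SECONDS); // auto-expire stale locks
        try {
            lock.acquire(); // throws if another client already holds the lock
            // ... critical section: read-modify-write the row safely ...
        } finally {
            lock.release(); // delete the lock column
        }
    }
}
```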
40. Astyanax Futures
• Maintain backwards compatibility
• Wrapper for C* 1.2 Netty driver
• More CQL support
• NetflixOSS Cloud Prize Ideas
– DynamoDB Backend?
– More recipes?
41. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
Token-aware clients write to Cassandra nodes (each with local disks) spread across Zones A, B and C:
1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return acks
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously.
42. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
US clients write to Cassandra nodes (each with local disks) in Zones A, B and C of the local region; the data then replicates to the EU region's nodes in Zones A, B and C across a 100+ms latency link:
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
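A hedged sketch of what step 2 looks like from the client side with Astyanax: the mutation is sent at LOCAL_QUORUM, so the caller returns once 2 of 3 local replicas ack, leaving the remote region to catch up asynchronously. The column family and keys are placeholders.

```java
// Sketch: a token-aware local-quorum write matching the flow above.
// The names used here are made up for illustration.
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class LocalQuorumWrite {
    static void write(Keyspace keyspace) throws Exception {
        ColumnFamily<String, String> CF = new ColumnFamily<String, String>(
                "subscribers", StringSerializer.get(), StringSerializer.get());

        MutationBatch m = keyspace.prepareMutationBatch()
                // Continue once the local quorum acks, without waiting
                // the 100+ms for the remote region to catch up
                .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
        m.withRow(CF, "subscriber-42").putColumn("plan", "streaming", null);
        m.execute();
    }
}
```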
43. Scalability from 48 to 288 nodes on AWS
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
[Chart: Client writes/s by node count, Replication Factor = 3 – 48 nodes: 174,373; 96 nodes: 366,828; 144 nodes: 537,172; 288 nodes: 1,099,837]
Used 288 m1.xlarge instances (4 CPU, 15 GB RAM, 8 ECU) running Cassandra 0.8.6. The benchmark config only existed for about 1 hour.
44. Cassandra Disk vs. SSD Benchmark
Same Throughput, Lower Latency, Half Cost
45. 2013 - Cross Region Use Cases
• Geographic Isolation
– US to Europe replication of subscriber data
– Read intensive, low update rate
– Production use since late 2011
• Redundancy for regional failover
– US East to US West replication of everything
– Includes write intensive data, high update rate
– Testing now
46. 2013 - Benchmarking Global Cassandra
Write intensive test of cross region capacity: 16 x hi1.4xlarge SSD nodes per zone = 96 total. Test loads drive Cassandra replicas in Zones A, B and C of both US-East-1 (Virginia) and US-West-2 (Oregon), with inter-zone traffic within each region and inter-region traffic between them; a validation load reads the replicated data back, and backups go to S3.
• 1 million writes at CL.ONE
• 1 million reads at CL.ONE with no data loss
47. Copying 18TB from East to West
Cassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes
Thanks to boundary.com for these network analysis plots
48. Inter Region Traffic Test
Verified at desired capacity, no problems, 339 MB/s, 83ms latency
49. Ramp Up Load Until It Breaks!
With unmodified tuning, Cassandra started dropping client data at 1.93 GB/s of inter-region traffic. There was spare CPU, IOPS and network capacity; it just needs some Cassandra tuning to get more.
50. Managing Multi-Region Availability
Denominator manages traffic via multiple DNS providers (UltraDNS, DynECT DNS, AWS Route53) that sit in front of the regional load balancers; each region's load balancers route to Cassandra replicas in Zones A, B and C.
54. NetflixOSS Services Scope
• AWS Account level – Asgard console, Archaius config service, cross-region Priam C*, Pytheas dashboards, Atlas monitoring, Genie and Lipstick Hadoop services, AWS usage cost monitoring
• Multiple AWS Regions – Eureka registry, Exhibitor ZK, Edda history, Simian Army, Zuul traffic manager
• 3 AWS Zones – application clusters, autoscale groups and instances; Priam/Cassandra persistent storage; Evcache/Memcached ephemeral storage
55. NetflixOSS Instance Libraries
Initialization:
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka – service registration client
Service Requests:
• Karyon – base server for inbound requests
• RxJava – reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon – REST client for outbound calls
Data Access:
• Astyanax – Cassandra client and pattern library
• Evcache – zone-aware Memcached client
• Curator – Zookeeper patterns
• Denominator – DNS routing abstraction
Logging:
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
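As a small taste of these libraries, a sketch of the Archaius dynamic property pattern; the property name is made up. Values can change at runtime without a redeploy.

```java
// Sketch of an Archaius dynamic property: the value is re-read when
// configuration sources change, with no redeploy needed.
import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class ArchaiusExample {
    private static final DynamicIntProperty MAX_CONNS =
            DynamicPropertyFactory.getInstance()
                    .getIntProperty("example.cassandra.maxConnsPerHost", 3);

    public static void main(String[] args) {
        // Register a callback that fires when the property changes at runtime
        MAX_CONNS.addCallback(new Runnable() {
            public void run() {
                System.out.println("maxConns now " + MAX_CONNS.get());
            }
        });
        System.out.println("maxConns = " + MAX_CONNS.get());
    }
}
```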
56. NetflixOSS Testing and Automation
Test Tools:
• CassJmeter – load testing for Cassandra
• Circus Monkey – test account reservation rebalancing
• Gcviz – garbage collection visualization
Maintenance:
• Janitor Monkey – cleans up unused resources
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey – complains about AWS limits
Availability:
• Chaos Monkey – kills instances
• Chaos Gorilla – kills availability zones
• Chaos Kong – kills regions
• Latency Monkey – latency and error injection
Security:
• Security Monkey – security group and S3 bucket permissions
• Conformity Monkey – architectural pattern warnings
57. Dashboards with Pytheas (Explorers)
http://techblog.netflix.com/2013/05/announcing-pytheas.html
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• Base Server Explorer
– Browse service endpoints configuration, perf
• Anything else you want to build
59. AWS Usage (coming soon)
Reservation-aware cost monitoring and reporting
60. What's Coming Next?
• More use cases
• More features
• Better portability
• Higher availability
• Easier to deploy
• Contributions from end users
• Contributions from vendors
61. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering a cloud native ecosystem
Rapid Evolution - Low MTBIAMSH
(Mean Time Between Idea And Making Stuff Happen)
63. Slideshare NetflixOSS Details
• Lightning Talks Feb S1E1
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1
– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Lightning Talks March S1E2
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-roadmap
• Security Architecture
– http://www.slideshare.net/jason_chan/
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
– http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-varia-aws-and-adrian-cockroft-netflix
Editor's Notes
When Netflix first moved to cloud it was bleeding edge innovation; we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
Hive – thin metadata layer on top of S3. Used for ad-hoc analytics (Ursula for merge ETL). HiveQL gets compiled into a set of MR jobs (1 -> many). It is a CLI that runs on the gateways, not like a relational DB server or a service that the query gets shipped to.
Pig – used for ETL (can create DAGs, workflows for Hadoop processes). Pig scripts also get compiled into MR jobs.
Java – straight up Hadoop, not for the faint of heart. Some recommendation algorithms are in Hadoop.
Python/Java – UDFs.
Applications such as Sting use the tools on some gateway to access all the various components.
Next – focus on two key components: Data & Clusters.