Apache Cassandra makes it possible to execute millions of operations per second in a scalable fashion. Harnessing the power of C* leaves many developers pondering questions like the following:
- Is my data model appropriate and not going to end up as wide partition(s) causing heap pressure and other issues?
- How do I tune my connection pool configuration? What are the optimal settings for my environment?
- What is my C* cluster's capacity in terms of IOPS at given 95th- and 99th-percentile latencies?
- How do I perf-test my data access layer?
In this talk, Vinay Chella, Cloud Data Architect @ Netflix, will share the open source tools, techniques, and platform (NDBench) that Netflix uses to perf-test its C* fleet by simulating millions of operations per second.
About the Speaker
Vinay Chella Cloud Data Architect, NETFLIX Inc
Vinay Chella is a Cloud Data Architect at Netflix with a deep understanding of Cassandra as well as various RDBMS systems. As an engineer and architect, he works extensively on data modeling, performance tuning, and guiding best practices for various persistence stores, helping teams across Netflix build next-generation data access layers.
6. Who are we?
• CDE, Cloud Database Engineering
• Providing data stores as a service
  ○ Cassandra
  ○ Dynomite
  ○ Elasticsearch and RDS
7. Cassandra @ Netflix
• 98% of streaming data is stored in Cassandra
• Data ranges from customer details to viewing history / streaming bookmarks to billing and payment
It was September 2014 when we met one of our app teams to coordinate efforts on their new A/B test. Every service in the execution path of that A/B test was in the meeting room, and they went around the table planning the expected TPS increase for each service and how to scale for it. When my turn came, my reaction was: wait, you're going to send another half-million TPS to the C* cluster? Let me see what my fortune teller told me this week, because I have no clue how my C* cluster will behave under that increased traffic. Well, that is the promise of C*: add nodes in the cloud and you get what you want. But the question is how many nodes I need for that increased traffic.
So, the unanswered questions are: "What is the available capacity in my cluster?" and "Will increased load affect my SLAs?"
That is when I took a step back, thought about the actual problem, and started working on the NDBench project.
Before getting into details of NDBench, let me introduce myself.
Today we will cover the background of NDBench, its architecture and usage, and what it has achieved at Netflix over the two years we have been using it.
So, getting back to our basic issue: do we have a tool to perf-test just the persistence layer?
Is there a way for me to measure the remaining capacity of my existing fleet? Because C* is cloud-native and distributed by nature, it comes with its own issues around the predictability of its latencies and capacity. In a traditional single-machine RDBMS, things are fairly stable most of the time and easy to predict. But with C*, once the cloud and VMs get into the middle of those predictions, it gets much more complicated.
Before we started this project, we looked into why the existing perf-testing tools were not enough.
We looked into various benchmark tools as well as REST-based performance tools. While some tools covered a subset of our requirements, we were interested in a tool that could achieve the following:
Dynamically change the benchmark configurations
Be able to integrate with platform cloud services such as dynamic configurations, discovery, metrics, etc.
Run for an infinite duration in order to introduce failure scenarios and test long-running maintenance operations such as database repairs.
Provide pluggable patterns and loads.
Support different client APIs.
Deploy, manage and monitor multiple instances from a single entry point.
Well, at a high level, what is NDBench? What are the advantages of NDBench? Let's go through them one by one.
NDBench gives us a side-by-side comparison of performance test runs, so that it is easy to make a decision.
You can compare the performance of different driver versions or software versions with the help of NDBench.
You can also compare the cost and performance of new instance types as they come onto the market.
It also gives us the ability to change the load parameters while the perf test is running, which is one of the rarest features you will find out there.
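To make that last point concrete, here is a minimal, hypothetical sketch in plain Java (not NDBench's actual code; `TunableLoad`, `setTargetRps`, and `tick` are illustrative names) of how a workload loop can pick up a retuned rate mid-run: the generator re-reads an atomically updated setting on every tick, so a control plane can change the load without restarting the test.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a workload loop that re-reads its rate setting on
// every iteration, so an operator (or a REST call) can retune the load
// while the perf test is still running.
public class TunableLoad {
    private final AtomicInteger targetRps = new AtomicInteger(1000);
    private final AtomicLong opsIssued = new AtomicLong();

    // Called from the control plane (e.g. a UI or REST handler) at any time.
    public void setTargetRps(int rps) { targetRps.set(rps); }

    // One "tick" of the generator: issue ops at the currently configured rate.
    public void tick() {
        int rps = targetRps.get();           // picked up fresh on each tick
        for (int i = 0; i < rps; i++) {
            opsIssued.incrementAndGet();     // stand-in for a real C* read/write
        }
    }

    public long opsIssued() { return opsIssued.get(); }

    public static void main(String[] args) {
        TunableLoad load = new TunableLoad();
        load.tick();               // 1000 ops at the default rate
        load.setTargetRps(2500);   // retune mid-run, no restart needed
        load.tick();               // 2500 more ops
        System.out.println(load.opsIssued()); // prints 3500
    }
}
```

The key design point is that the rate lives in shared, atomically updated state rather than in a config read once at startup.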
One of the challenges with C* is its data model. With the years of experience and resources we have acquired in the RDBMS space, it is comparatively easy to model your data there for better performance. But with newer data stores like C* and Dynomite, partitions and clustering columns can be confusing. Making a decision backed by supporting data points on which data model works better is always good, so NDBench comes in handy when comparing data models, varying them in terms of payload, shard, and comparators.
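As a toy illustration of varying a data model by shard count (illustrative Java only; the key scheme and the names `ShardedKeys` and `rowsPerPartition` are assumptions, not Netflix's code), the same 10,000 logical rows can land in one wide partition or be spread across N narrower partitions, and each variant can then be benchmarked side by side:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: generate the same logical rows under two candidate
// data models -- one wide partition per user versus the user's rows
// sharded across N partitions -- so each can be perf-tested side by side.
public class ShardedKeys {
    // Partition key for row 'i' of a user's data under a given shard count.
    static String partitionKey(String user, int row, int shards) {
        return shards <= 1 ? user : user + ":" + (row % shards);
    }

    static Map<String, Integer> rowsPerPartition(String user, int rows, int shards) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < rows; i++) {
            counts.merge(partitionKey(user, i, shards), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Unsharded: one partition holding all 10,000 rows (a wide partition).
        System.out.println(rowsPerPartition("user42", 10_000, 1).size()); // 1
        // Sharded 8 ways: 8 partitions of 1,250 rows each.
        System.out.println(rowsPerPartition("user42", 10_000, 8).size()); // 8
    }
}
```

Running the same workload against both layouts makes the wide-partition heap-pressure trade-off visible in measured latencies instead of guesswork.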
And it is built with a pluggable architecture, so you can pretty much plug in anything you want.
Because it is pluggable by architecture, it ships today with C*, Dynomite, and ES plugins, but it can be extended to any datastore out there.
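A minimal sketch of what such a pluggable contract might look like, loosely in the spirit of NDBench's plugin model; the real `NdBenchClient` API differs in its details, and `BenchClient`/`InMemoryClient` here are hypothetical names. Supporting a new datastore means dropping in one implementation while the workload generator stays unchanged.

```java
// Hypothetical sketch of a pluggable benchmark-client contract; the real
// NDBench plugin interface differs in detail.
interface BenchClient {
    void init() throws Exception;
    String readSingle(String key) throws Exception;   // one benchmark read
    String writeSingle(String key) throws Exception;  // one benchmark write
    void shutdown() throws Exception;
}

// A new datastore is supported by adding one implementation; the workload
// generator driving readSingle/writeSingle never changes.
public class InMemoryClient implements BenchClient {
    private final java.util.Map<String, String> store = new java.util.HashMap<>();

    @Override public void init() { }
    @Override public String readSingle(String key) { return store.get(key); }
    @Override public String writeSingle(String key) {
        store.put(key, "payload-" + key);
        return "ok";
    }
    @Override public void shutdown() { store.clear(); }

    public static void main(String[] args) throws Exception {
        BenchClient client = new InMemoryClient();
        client.init();
        client.writeSingle("k1");
        System.out.println(client.readSingle("k1")); // prints payload-k1
        client.shutdown();
    }
}
```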
And it is well baked into the Netflix ecosystem, and those integrations are pluggable as well.
The following diagram shows the architecture of NDBench. The framework consists of three components:
Core: The workload generator
API: Allowing multiple plugins to be developed against NDBench
Web: The UI and servlet context listener
NDBench-core is the core component of NDBench, where one can further tune workload settings.
NDBench can be used from either the command line (using REST calls), or from a web-based user interface (UI).