High performance, as measured by sub-millisecond query response times, is a key characteristic of Redis and one of the main reasons it is the most popular key-value database in the world.
To keep improving performance across all of the different Redis components, we’ve developed a framework for automatically triggering performance tests, telemetry gathering, profiling, and data visualization on each code commit.
In this talk, we describe how this automation and “zero-touch” profiling scaled our ability to pursue performance regressions and to find opportunities to make our code more efficient, helping us as a company start shifting from a reactive to a more proactive performance mindset.
Open Source Experience Conference 2022
1. E2E Performance Testing, Profiling, and Analysis at Redis
Open Source Experience Conference, Nov 2022
Filipe Oliveira
Senior Performance Engineer
@Redis
4. PERFORMANCE @REDIS
OSS REDIS + Redis Ltd Projects
1. foster benchmark and observability standards
2. support contributions to the OSS projects
3. optimize an industry-leading solution
5. Ordinarily, on Our Company’s Core Products
We have...
● extensive automated tests to catch functional failures
...but when
● we accidentally commit a performance regression, nothing intercepts it*!
7. A Real Case From 2019
Simple request
1. RediSearch minor version bump
2. Required multiple patches
a. Feedback cycle took us at least 1 day
b. prioritized over other projects
c. Siloed
d. Jul. 30, Nov. 27, 2019
You can relate to...
● your team runs performance tests before releasing
8. Ordinarily, on Our Company’s Core Products
You can state...
● your team runs performance tests before releasing
...but solving slowdowns just before releasing is...
● dangerous
● time-consuming
● one of the most difficult tasks to estimate
...is just buffering potential issues!
9. Goal: Reduce Feedback Cycle. Avoid Silos
Requirements for valid tests
- Stable testing environment
- Deterministic testing tools
- Deterministic outcomes
- Reduced testing/probing overhead
- Reduce tested changes to the minimum
Requirements for acceptance in products
- Acceptable duration
- No manual work
- Actionable items
- Well-defined key performance indicators (see the sketch below)
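To make “actionable items” and “well-defined KPIs” concrete, here is a minimal sketch of what an automated acceptance gate could look like. The metric names (ops_per_sec, p50_latency_ms), the file layout, and the 5% threshold are illustrative assumptions, not the actual Redis framework.

```python
# Minimal sketch of an automated KPI gate; names and threshold are assumptions.
import json
import sys

REGRESSION_THRESHOLD = 0.05  # fail if a KPI worsens by more than 5% vs. baseline

def check_kpis(baseline_path: str, result_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(result_path) as f:
        result = json.load(f)

    failures = []
    for metric in ("ops_per_sec", "p50_latency_ms"):  # hypothetical KPI names
        base, new = baseline[metric], result[metric]
        # Higher throughput is better; lower latency is better.
        delta = (new - base) / base if metric == "ops_per_sec" else (base - new) / base
        if delta < -REGRESSION_THRESHOLD:
            failures.append(f"{metric}: {base} -> {new} ({delta:+.1%})")

    if failures:
        print("PERFORMANCE REGRESSION DETECTED:\n  " + "\n  ".join(failures))
        return 1
    print("All KPIs within threshold.")
    return 0

if __name__ == "__main__":
    sys.exit(check_kpis(sys.argv[1], sys.argv[2]))
```

A non-zero exit code makes the result actionable: CI can block the merge and point the author at the exact metric that moved.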
(Pipeline diagrams: before — CODE REVIEW → PREVIEW/UNSTABLE → RELEASE, with a single MANUAL PERF CHECK at the end; after — a ZERO TOUCH PERF CHECK at each of the CODE REVIEW, PREVIEW/UNSTABLE, and RELEASE stages.)
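The “zero touch” boxes above amount to running the same check automatically at every stage instead of a manual pass just before release. A minimal sketch of such a hook, with stub helpers standing in for the real build/benchmark/compare stages (none of these names are from the actual Redis CI):

```python
# Illustrative "zero touch" perf check: every commit gets built, benchmarked,
# and compared against a baseline with no human in the loop.
import subprocess

def build(sha: str) -> None:
    subprocess.run(["git", "checkout", sha], check=True)
    subprocess.run(["make", "-j"], check=True)  # build the server at this commit

def run_benchmarks(sha: str) -> dict:
    # Stub: the real framework would run the full registered suite here.
    return {"ops_per_sec": 180_000.0}

def compare_to_baseline(results: dict, baseline: dict) -> list[str]:
    # Flag any metric that drops more than 5% below the baseline.
    return [k for k, v in results.items() if v < baseline[k] * 0.95]

def zero_touch_perf_check(sha: str, baseline: dict) -> None:
    build(sha)
    regressions = compare_to_baseline(run_benchmarks(sha), baseline)
    if regressions:
        raise SystemExit(f"commit {sha} regressed: {regressions}")
```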
10. This is Not New/Disruptive
Elastic
https://elasticsearch-benchmarks.elastic.co/#
Lucene
https://home.apache.org/~mikemccand/lucenebench/
13. Our Approach
(Charts: benchmark results by branch, by version, and scalability analysis)
1. Initial focus on OSS deployments
2. local and remote triggers (see the sketch below)
3. Used for testing, profiling
a. Regression analysis and fixes
b. Approval of features
c. Proactive optimization
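As an illustration of what one locally triggered test run might execute, here is a sketch that drives Redis with memtier_benchmark (the open source load generator commonly used with Redis). The flags shown are real memtier options; the thread/connection counts, the duration, and the JSON handling are assumptions, since the output layout varies by version.

```python
# Sketch of a locally triggered benchmark run against a Redis instance.
import json
import subprocess

def run_local_benchmark(host: str = "127.0.0.1", port: int = 6379) -> dict:
    subprocess.run(
        ["memtier_benchmark",
         "-s", host, "-p", str(port),
         "-t", "4", "-c", "50",          # 4 threads x 50 connections (illustrative)
         "--test-time", "60",            # fixed-duration run for stable numbers
         "--hide-histogram",
         "--json-out-file", "run.json"],
        check=True,
    )
    with open("run.json") as f:
        return json.load(f)   # exact JSON layout depends on memtier version
```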
16. Our Approach
Profiling artifacts produced on each run:
1. Full process Flame Graph + main thread Flame Graph
2. perf report per dso
3. perf report per dso,sym (w/wout callgraph)
4. perf report per dso,sym,srcline (w/wout callgraph)
5. identical stacks collapsed
6. hotpath callgraph
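All of the artifacts above can be derived from a single perf recording. A hedged sketch of how such a pipeline might produce them: the --sort keys are real perf options, while the FlameGraph script paths assume Brendan Gregg’s FlameGraph repo is checked out alongside.

```python
# Sketch: turn one perf recording into per-dso/sym/srcline reports + flame graph.
import subprocess

def profile(pid: int, seconds: int = 60) -> None:
    # Sample on-CPU stacks of the running redis-server process.
    subprocess.run(
        ["perf", "record", "-g", "--pid", str(pid), "--", "sleep", str(seconds)],
        check=True,
    )
    # Artifacts 2-4: text reports aggregated per dso / symbol / source line.
    for keys in ("dso", "dso,symbol", "dso,symbol,srcline"):
        with open(f"report_{keys.replace(',', '_')}.txt", "w") as out:
            subprocess.run(["perf", "report", "--stdio", "--sort", keys],
                           stdout=out, check=True)
    # Artifacts 1 and 5: collapse identical stacks, then render a flame graph.
    script = subprocess.run(["perf", "script"], capture_output=True, check=True)
    collapsed = subprocess.run(["./FlameGraph/stackcollapse-perf.pl"],
                               input=script.stdout, capture_output=True, check=True)
    with open("flame.svg", "wb") as svg:
        subprocess.run(["./FlameGraph/flamegraph.pl"],
                       input=collapsed.stdout, stdout=svg, check=True)
```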
18. What We’ve Gained
● Up to 68% performance boost on the covered commands
● Dramatically reduced the feedback cycle (days → 1 hour)
● Devs can easily add tests (243 full suites)
● Scaled with the team and took on more challenging work!
19. What We’ve Gained
● Finding performance improvements is now everyone’s power/responsibility
● A/B test new tech / state-of-the-art HW/SW components
● Continuous, up-to-date numbers for the use cases that matter
● Foster openness
20. What’s Next
● aggregate performance data across a group of benchmarks
● better statistical analysis methods (see the sketch below)
● more visibility across the API
● Increase OSS / Company adoption
○ expose data in the docs
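“Better statistical analysis methods” could mean, for example, treating each benchmark as repeated samples and flagging a regression only when the difference clears the measurement noise. An illustrative sketch using the standard library; the choice of test (a one-sided two-sample z-test) and the alpha level are assumptions, not the methods the team has committed to.

```python
# Flag a regression only when the throughput drop is statistically significant.
from statistics import NormalDist, mean, stdev

def significantly_slower(baseline: list[float], candidate: list[float],
                         alpha: float = 0.05) -> bool:
    """One-sided two-sample z-test on throughput samples (normal approximation)."""
    n1, n2 = len(baseline), len(candidate)
    se = (stdev(baseline) ** 2 / n1 + stdev(candidate) ** 2 / n2) ** 0.5
    z = (mean(baseline) - mean(candidate)) / se
    # p-value for "candidate's mean throughput is lower than baseline's"
    return (1 - NormalDist().cdf(z)) < alpha

# e.g. significantly_slower([201e3, 199e3, 200e3], [188e3, 190e3, 189e3]) -> True
```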