Erik Krogen of LinkedIn presents Dynamometer, a system open sourced by LinkedIn for scale- and performance-testing HDFS. He covers one major use case for Dynamometer, tuning NameNode GC, and explains why NameNode GC matters and how it interacts with various current and future GC algorithms.
This talk is from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.
2. Dynamometer
• Realistic performance benchmark and stress test for HDFS
• Open sourced on LinkedIn GitHub; contributing to Apache
• Evaluate scalability limits
• Provide confidence before new feature/config deployment
3. What’s a Dynamometer?
"A dynamometer or 'dyno' for short, is a device for measuring force, torque, or power. For example, the power produced by an engine…" – Wikipedia
Image taken from https://flic.kr/p/dtkCRU and redistributed under the CC BY-SA 2.0 license
4. Main Goals
High Fidelity
• Accurate namespace: Namespace characteristics have a big impact
• Accurate client workload: Request types and the timing of requests both have a big impact
• Accurate system workload: Load imposed by system management (block reports, etc.) has a big impact
Efficiency
• Low cost: Offline infra has high utilization; can’t afford to keep around unused machines for testing
• Low developer effort: Deploying to a large number of machines can be cumbersome; make it easy
• Fast iteration cycle: Should be able to iterate quickly
6. Dynamometer
SIMULATED HDFS CLUSTER RUNS IN YARN CONTAINERS
• How to schedule and coordinate? Use YARN!
• Real NameNode, fake DataNodes to run on ~1% the hardware (sketched below)
[Diagram: the Dynamometer Driver and a DynoAM on the host YARN cluster coordinate YARN containers running one real NameNode and many simulated DataNodes; the FsImage and block listings are read from the host HDFS cluster]
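The "fake DataNodes" point is what makes the ~1% hardware figure possible: a simulated DataNode only has to convince the NameNode that its blocks exist, it never stores the data. A conceptual sketch of that idea follows; this is illustrative only, not Dynamometer's actual implementation, and the block listing format shown is made up:

```python
# Conceptual sketch (not Dynamometer's actual code): a simulated DataNode keeps
# only block metadata in memory -- no block files on disk -- so many of them can
# share one machine while still sending realistic block reports to the NameNode.
class SimulatedDataNode:
    def __init__(self, block_listing_path):
        # Hypothetical listing format: one "<blockId> <length> <genStamp>" per line.
        self.blocks = []
        with open(block_listing_path) as f:
            for line in f:
                block_id, length, gen_stamp = line.split()
                self.blocks.append((int(block_id), int(length), int(gen_stamp)))

    def block_report(self):
        """Return just the metadata the NameNode needs to believe the blocks exist."""
        return list(self.blocks)
```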
7. Dynamometer
SIMULATED HDFS CLIENTS RUN IN YARN CONTAINERS
• Clients can run on YARN too!
• Replay real traces from production cluster audit logs (see the sketch below)
[Diagram: alongside the Dynamometer infrastructure application (NameNode plus simulated DataNodes on YARN nodes), a workload MapReduce job runs simulated clients in YARN containers, replaying audit logs read from the host HDFS cluster]
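A minimal sketch of the trace-replay idea, assuming the standard key=value layout of HDFS audit log lines (cmd=, src=, ugi=, and so on). The client call at the end is a placeholder, not Dynamometer's actual replay code:

```python
import re
import time

# Example HDFS audit log line (fields are key=value pairs).
SAMPLE = ("2019-01-30 10:00:00,123 INFO FSNamesystem.audit: allowed=true "
          "ugi=someuser ip=/10.0.0.1 cmd=getfileinfo src=/data/foo dst=null perm=null")

FIELD = re.compile(r"(\w+)=(\S+)")

def parse_audit_line(line):
    """Extract the key=value fields (cmd, src, dst, ugi, ...) from one audit log line."""
    return dict(FIELD.findall(line))

def replay(lines, start_offsets):
    """Replay operations, preserving the relative timing from the original trace.
    start_offsets[i] is the seconds-from-trace-start of lines[i]."""
    t0 = time.time()
    for line, offset in zip(lines, start_offsets):
        op = parse_audit_line(line)
        # Sleep until this request's original point in the trace.
        time.sleep(max(0.0, offset - (time.time() - t0)))
        # Placeholder for issuing the same RPC against the NameNode under test.
        print(f"issue {op['cmd']} on {op['src']} as {op['ugi']}")

replay([SAMPLE], [0.0])
```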
8. Contributing to Apache
• Working to put it into hadoop-tools
• Easier place for community to access and contribute
• Increased chance of others helping to maintain it
• Follow HDFS-12345 (actual ticket, not a placeholder)
10. NameNode GC Primer
• Why do we care?
  • NameNode heaps are huge (multi-hundred GB)
  • GC is a big factor in performance
• What’s special about NameNode GC?
  • Huge working set: can have over 100 GB of long-lived objects
  • Massive young gen churn (from RPC requests)
11. QUESTION
Can we use a new GC algorithm to squeeze more performance out of the NameNode?
12. Experimental Setup
• 16-hour production trace: Long enough to experience 2 rounds of mixed GC
• Measure performance via standard metrics (client latency, RPC queue time)
• Measure GC pauses during startup and normal workloads
• Let’s try G1GC even though we know we’re pushing the limits:
  • “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” – Oracle*
  • This implies that the heap should be 64 GB and under, but NameNode heaps are far larger (see the arithmetic sketch below)
* Garbage First Garbage Collector Tuning - https://www.oracle.com/technetwork/articles/java/g1gc-198453
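To make the 64 GB figure concrete, here is the back-of-the-envelope arithmetic implied by the Oracle guidance; the only inputs are the documented region-count goal and the maximum region size:

```python
# G1 region sizes range from 1 MB to 32 MB, and G1 aims for no more than ~2048 regions.
max_region_size_mb = 32
target_region_count = 2048

# Largest heap that stays within the region-count goal at the largest region size.
max_recommended_heap_gb = max_region_size_mb * target_region_count / 1024
print(max_recommended_heap_gb)  # 64.0 GB -- far below a multi-hundred-GB NameNode heap
```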
14. Tuning G1GC with Dynamometer
• G1GC has lots of tunables – how do we optimize all of them without hurting our production system?
• Dynamometer to the rescue
  • Easily set up experiments sweeping over different values for a param
  • Fire-and-forget: test with many combinations and analyze later (see the sketch after this list)
• Main parameters needing significant tuning were for the remembered sets
  • (details to follow in appendix)
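A minimal sketch of what such a fire-and-forget sweep can look like. The flags and the 4096/8 values come from the appendix; the other candidate values and the `submit_dynamometer_run` helper are hypothetical placeholders for however you actually launch a Dynamometer experiment:

```python
import itertools

# Candidate values for a few of the G1GC tunables discussed in the appendix.
sweep = {
    "-XX:G1RSetRegionEntries":  [1536, 2048, 4096],
    "-XX:MaxTenuringThreshold": [1, 2, 4, 8],
    "-XX:ConcGCThreads":        [4, 8, 12],
}

def submit_dynamometer_run(jvm_flags):
    """Hypothetical placeholder: launch one Dynamometer experiment with these
    NameNode JVM flags and let it replay the production trace."""
    print("would submit run with:", " ".join(jvm_flags))

# Fire-and-forget: enqueue every combination now, analyze the results later.
for combo in itertools.product(*sweep.values()):
    flags = [f"{flag}={value}" for flag, value in zip(sweep.keys(), combo)]
    submit_dynamometer_run(flags)
```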
15. How Much Does G1GC Help?

METRIC                   | Startup: CMS | Startup: G1GC | Normal Op.: CMS | Normal Op.: G1GC
Avg Client Latency (ms)  |      –       |       –       |       19        |        18
Total Pause Time (s)     |     200      |      180      |       550       |       160
Median Pause Time (s)    |     1.1      |      0.5      |      0.12       |       0.06
Max Pause Time (s)       |     13.4     |      3.3      |       1.4       |       0.6

* Values are approximate and provided primarily to give a sense of scale
• Excellent reduction in pause times
• Not much impact on throughput
16. Looking towards the future…
• Question: How does G1GC fare when extrapolating to future workloads?
  • 600 GB+ heap size, 1 billion blocks, 1 billion files
• Answer: Not so well
  • RSet entry count has to be increased even further to obtain reasonable performance
  • Off-heap overheads in the hundreds of gigabytes
  • Wouldn’t recommend it
17. Looking towards the future…
• Anything we can do besides G1GC?
• Extensive testing with Azul’s C4 GC, available in the Zing® JVM
  • Good performance with no tuning
• Results in a test environment:
  • 99th percentile pause time ~1 ms, max in the tens of ms
  • Average client latency dropped ~30%
  • Continued to see good performance up to 600 GB heap size
Zing JVM: https://www.azul.com/products/zing/
Azul C4 GC: https://www.azul.com/resources/azul-technology/azul-c4-garbage-collec
18. Looking towards the future…
• Anything we can do that isn’t proprietary?
• Wait for OpenJDK next gen GC algorithms to mature:
• Shenandoah
• ZGC
19. Appendix: Detailed G1GC Tuning Tips
• -XX:G1RSetRegionEntries: Solves the RSet scanning problem discussed earlier. 4096 worked well (default of 1536)
  • Comes with high off-heap memory overheads
• -XX:G1RSetUpdatingPauseTimePercent: Reduce this to reduce the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent – extra cores are better used by the GC algorithm)
• -XX:G1NewSizePercent: Default of 5% is unreasonably large for heaps > 100 GB; reducing it helps shorten pauses during high-churn periods (startup, failover)
• -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: Set empirically based on experiments sweeping over values. This is where Dynamometer really shines
  • MaxTenuringThreshold is particularly interesting: Based on the NN usage pattern (objects are either very long lived or very short lived), you would expect low values (1 or 2) to be best, but in practice values closer to the default, around 8, perform better (see the combined option string sketched below)
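Putting the appendix values together, here is a sketch of what a tuned NameNode GC option string could look like. The G1RSetRegionEntries and MaxTenuringThreshold values are the ones quoted above; the two percentage values are hypothetical placeholders, not numbers from the talk, and the right values for any cluster have to come from experiments like the Dynamometer sweeps described earlier:

```python
# Illustrative assembly of the G1GC options discussed above; values differ per cluster.
tuned_g1_options = [
    "-XX:+UseG1GC",
    "-XX:+UnlockExperimentalVMOptions",      # some of the flags below are experimental on older JDKs
    "-XX:G1RSetRegionEntries=4096",          # up from the default of 1536; costs off-heap memory
    "-XX:G1RSetUpdatingPauseTimePercent=5",  # hypothetical value: lower moves "Update RS" work off the pause
    "-XX:G1NewSizePercent=2",                # hypothetical value: the 5% default is too large for >100 GB heaps
    "-XX:MaxTenuringThreshold=8",            # low values (1-2) look attractive, but ~8 performed better in testing
]
print(" ".join(tuned_g1_options))
```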
Quick show of hands, who has heard of Dynamometer?
How can we know that deploying a new config won’t hurt us? How do we know if working on a project is worth it?
First key insight: focus on NN only
Greatly reduces size of problem: only a single “real” node is needed for NameNode
Quick background for those not familiar
Enable it on a small cluster and see what happens? Sure, but without the load (heap and RPC) it won’t really tell us much
Enable it on production? Sure, but how many apologies do we need to make if something goes wrong?
Try it on NNThroughputBenchmark? Sure, but block reports and varying client workloads contribute heavily
Dynamometer!
Macro effect: performance as viewed by the client/server
Micro effect: the GC pauses themselves
GC characteristics very different during startup / failover, measure these separately
Can you see what’s wrong here?
Object copying time? No….
RSet scanning! Essentially the set of references. Trade-off between the overhead of maintaining them and how expensive they are to scan
NN performance ground to a halt. This is why we can’t test it out on production!
Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between