Dynamometer
and
A Case Study in NameNode GC
Erik Krogen
Senior Software Engineer, Hadoop & HDFS
Dynamometer
• Realistic performance benchmark & stress test for HDFS
• Open sourced on LinkedIn GitHub, contributing to Apache
• Evaluate scalability limits
• Provide confidence before new feature/config deployment
What’s a Dynamometer?
"A dynamometer, or 'dyno' for short, is a device for measuring force, torque, or power. For example, the power produced by an engine…"
- Wikipedia
Image taken from https://flic.kr/p/dtkCRU and redistributed under the CC BY-SA 2.0 license
Main Goals
• Accurate namespace: Namespace characteristics have a big impact
• Accurate client workload: Request types and the timing of requests both have a big impact
• Accurate system workload: Load imposed by system management (block reports, etc.) has a big impact
High Fidelity

Efficiency
• Low cost: Offline infra has high utilization; can’t afford to keep around unused machines for testing
• Low developer effort: Deploying to a large number of machines can be cumbersome; make it easy
• Fast iteration cycle: Should be able to iterate quickly
Simplify the Problem
The NameNode is the central component and the most frequent bottleneck: focus here
Dynamometer
SIMULATED HDFS CLUSTER RUNS IN YARN CONTAINERS
• How to schedule and coordinate? Use YARN!
• Real NameNode, fake DataNodes to run on ~1% of the hardware
[Diagram: the Dynamometer Driver launches a DynoAM on the host YARN cluster; the DynoAM runs the real NameNode and many simulated DataNodes in YARN containers, fed by an FsImage and block listings from the host HDFS cluster.]
Dynamometer
SIMULATED HDFS CLIENTS RUN IN YARN CONTAINERS
• Clients can run on YARN too!
• Replay real traces from production cluster audit logs
[Diagram: on the host YARN cluster, a workload MapReduce job launched by the Dynamometer Driver runs simulated clients that replay audit logs against the simulated NameNode and DataNodes.]
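Audit log replay is the heart of the client workload. As a rough illustration — not Dynamometer's actual parser; the field handling here is an assumption — one HDFS audit log line can be split into the fields needed for replay:

```python
import re

# Illustrative sketch only: pull the timestamp, command, and source path
# from one HDFS audit log line -- the fields needed to replay the request
# with realistic timing.
AUDIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"cmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)"
)

def parse_audit_line(line):
    """Return {ts, cmd, src} for an audit line, or None if it doesn't match."""
    m = AUDIT_RE.search(line)
    if m is None:
        return None
    return {"ts": m.group("ts"), "cmd": m.group("cmd"), "src": m.group("src")}

sample = ("2019-01-15 10:23:45,123 INFO FSNamesystem.audit: allowed=true\t"
          "ugi=bob (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/part-0\t"
          "dst=null\tperm=null")
print(parse_audit_line(sample))
# -> {'ts': '2019-01-15 10:23:45,123', 'cmd': 'open', 'src': '/data/part-0'}
```

Replaying at the original timestamps is what preserves the request-timing fidelity called out in the goals above.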
Contributing to Apache
• Working to move it into hadoop-tools
• Easier place for community to access and contribute
• Increased chance of others helping to maintain it
• Follow HDFS-12345 (actual ticket, not a placeholder)
NameNode GC: A Dynamometer Case Study
NameNode GC Primer
• Why do we care?
• NameNode heaps are huge (multi-hundred GB)
• GC is a big factor in performance
• What’s special about NameNode GC?
• Huge working set: can have over 100GB of long-lived objects
• Massive young gen churn (from RPC requests)
Q U E S T I O N
Can we use a new GC algorithm to squeeze more performance out of the NameNode?
Experimental Setup
• 16-hour production trace: Long enough to experience 2 rounds of mixed GC
• Measure performance via standard metrics (client latency, RPC queue time)
• Measure GC pauses during startup and normal workloads
• Let’s try G1GC even though we know we’re pushing the limits:
• “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” – Oracle*
• This implies that the heap should be 64 GB or under, but our NameNode heap is far larger (150 GB in these tests)

*Garbage First Garbage Collector Tuning - https://www.oracle.com/technetwork/articles/java/g1gc-198453
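The 64 GB figure follows directly from the quoted guidance; a quick back-of-the-envelope check (plain arithmetic, not any G1 internal API):

```python
# G1 picks a power-of-two region size between 1 MB and 32 MB, aiming for no
# more than ~2048 regions, so the recommendation tops out at
# 32 MB * 2048 = 64 GB of heap.
GiB = 1024 ** 3
MiB = 1024 ** 2

def g1_region_count(heap_bytes, region_bytes=32 * MiB):
    """Region count at the maximum 32 MB region size."""
    return heap_bytes // region_bytes

print(g1_region_count(64 * GiB))   # 2048 -> right at the recommended limit
print(g1_region_count(150 * GiB))  # 4800 -> well past it, as in our tests
```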
Can You Spot the Issue?
[Parallel Time: 17676.0 ms, GC Workers: 16]
[GC Worker Start (ms): Min: 883574.6, Avg: 883574.8, Max: 883575.0, Diff: 0.3]
[Ext Root Scanning (ms): Min: 1.0, Avg: 1.2, Max: 2.1, Diff: 1.1, Sum: 18.8]
[Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
[Processed Buffers: Min: 25, Avg: 30.2, Max: 38, Diff: 13, Sum: 484]
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 1.0]
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
[Termination (ms): Min: 0.0, Avg: 88.8, Max: 96.5, Diff: 96.5, Sum: 1421.5]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
[GC Worker Total (ms): Min: 17675.5, Avg: 17675.6, Max: 17675.8, Diff: 0.3, Sum: 282810.3]
[GC Worker End (ms): Min: 901250.4, Avg: 901250.4, Max: 901250.5, Diff: 0.0]
[Code Root Fixup: 0.7 ms]
[Code Root Migration: 2.3 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 6.7 ms]
[Other: 1194.8 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 2.8 ms]
[Ref Enq: 0.4 ms]
[Redirty Cards: 468.4 ms]
[Free CSet: 4.0 ms]
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]
[Times: user=223.20 sys=0.22, real=18.88 secs]
902.102: Total time for which application threads were stopped: 18.8815330 seconds
[Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]  ← 17.5s pause due to “Scan RS”!
[Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]  ← ~500ms pause due to object copy
[Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]  ← a few GB of Eden cleared: big, but not huge
902.102: Total time for which application threads were stopped: 18.8815330 seconds  ← Huge pause!
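Pauses like this stand out when you scan the GC log for its safepoint lines; a hypothetical helper along those lines (the log lines below mirror the format shown above):

```python
import re

# Hypothetical log-scanning helper: find safepoint lines and report the
# worst pauses, so an 18.9 s outlier is obvious in a multi-hour GC log.
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([\d.]+) seconds")

def worst_pauses(lines, n=3):
    pauses = [float(m.group(1)) for line in lines
              if (m := PAUSE_RE.search(line))]
    return sorted(pauses, reverse=True)[:n]

log = [
    "10.001: Total time for which application threads were stopped: 0.0421380 seconds",
    "902.102: Total time for which application threads were stopped: 18.8815330 seconds",
    "950.310: Total time for which application threads were stopped: 0.5120000 seconds",
]
print(worst_pauses(log))  # -> [18.881533, 0.512, 0.042138]
```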
Tuning G1GC with Dynamometer
• G1GC has lots of tunables – how do we optimize all of them without hurting our production system?
• Dynamometer to the rescue
• Easily set up experiments sweeping over different values for a param
• Fire-and-forget: test with many combinations and analyze later
• Main parameters needing significant tuning were for the remembered sets
• (details to follow in appendix)
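Such a sweep boils down to enumerating flag combinations and firing off one Dynamometer run per combination; a minimal sketch, where both the flag grid and the way flags would reach the NameNode JVM are assumptions for illustration:

```python
import itertools

# Illustrative only: generate one run's worth of -XX flags per combination
# of GC tunables, so runs can be launched fire-and-forget and analyzed later.
SWEEP = {
    "-XX:G1RSetRegionEntries": [1536, 2048, 4096],
    "-XX:MaxTenuringThreshold": [2, 8, 15],
}

def sweep_args(grid):
    """Yield one list of -XX:Flag=value strings per combination."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield ["%s=%d" % (k, v) for k, v in zip(keys, values)]

combos = list(sweep_args(SWEEP))
print(len(combos))   # 9 runs to launch
print(combos[0])     # -> ['-XX:G1RSetRegionEntries=1536', '-XX:MaxTenuringThreshold=2']
```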
How Much Does G1GC Help?

                         |   Startup    | Normal Operation
METRIC                   | CMS  | G1GC  | CMS  | G1GC
Avg Client Latency (ms)  |  –   |  –    | 19   | 18
Total Pause Time (s)     | 200  | 180   | 550  | 160
Median Pause Time (s)    | 1.1  | 0.5   | 0.12 | 0.06
Max Pause Time (s)       | 13.4 | 3.3   | 1.4  | 0.6

* Values are approximate and provided primarily to give a sense of scale

Takeaways: excellent reduction in pause times; not much impact on throughput.
Looking towards the future…
• Question: How does G1GC fare when extrapolated to future workloads?
• 600GB+ heap size, 1 billion blocks, 1 billion files
• Answer: Not so well
• RSet entry count has to be increased even further to obtain reasonable performance
• Off-heap overheads in the hundreds of gigabytes
• Wouldn’t recommend it
Looking towards the future…
• Anything we can do besides G1GC?
• Extensive testing with Azul’s C4 GC, available in the Zing® JVM
• Good performance with no tuning
• Results in a test environment:
• 99th percentile pause time ~1ms, max in tens of ms
• Average client latency dropped ~30%
• Continued to see good performance up to 600GB heap size
Zing JVM: https://www.azul.com/products/zing/
Azul C4 GC: https://www.azul.com/resources/azul-technology/azul-c4-garbage-collec
Looking towards the future…
• Anything we can do that isn’t proprietary?
• Wait for OpenJDK next gen GC algorithms to mature:
• Shenandoah
• ZGC
Appendix: Detailed G1GC Tuning Tips
• -XX:G1RSetRegionEntries: Solves the problem from the previous slide. 4096 worked well (default of 1536)
  • Comes with high off-heap memory overheads
• -XX:G1RSetUpdatingPauseTimePercent: Reduce this to reduce the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent – extra cores are better used by the GC algorithm)
• -XX:G1NewSizePercent: Default of 5% is unreasonably large for heaps > 100GB; reducing it will help shorten pauses during high-churn periods (startup, failover)
• -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: Set empirically based on experiments sweeping over values. This is where Dynamometer really shines
  • MaxTenuringThreshold is particularly interesting: based on the NameNode usage pattern (objects are either very long-lived or very short-lived), you would expect low values (1 or 2) to be best, but in practice values closer to the default of 8 perform better
Thank you!
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC

  • 1. Dynamometer and A Case Study in NameNode GC Erik Krogen Senior Software Engineer, Hadoop & HDFS
  • 2. Dynamometer • Realistic performance benchmark & stress test for HDFS • Open sourced on LinkedIn’s GitHub; contributing to Apache • Evaluate scalability limits • Provide confidence before new feature/config deployment
  • 3. What’s a Dynamometer? A dynamometer, or "dyno" for short, is a device for measuring force, torque, or power. For example, the power produced by an engine … - Wikipedia Image taken from https://flic.kr/p/dtkCRU and redistributed under the CC BY-SA 2.0 license
  • 4. Main Goals
    High Fidelity:
    • Accurate namespace: Namespace characteristics have a big impact
    • Accurate client workload: Request types and the timing of requests both have a big impact
    • Accurate system workload: Load imposed by system management (block reports, etc.) has a big impact
    Efficiency:
    • Low cost: Offline infra has high utilization; can’t afford to keep around unused machines for testing
    • Low developer effort: Deploying to a large number of machines can be cumbersome; make it easy
    • Fast iteration cycle: Should be able to iterate quickly
  • 5. Simplify the Problem NameNode is the central component, most frequent bottleneck: focus here
  • 6. Dynamometer: SIMULATED HDFS CLUSTER RUNS IN YARN CONTAINERS
    • How to schedule and coordinate? Use YARN!
    • Real NameNode, fake DataNodes to run on ~1% of the hardware
    [Diagram: the Dynamometer Driver, fed an FsImage and block listings from the host HDFS cluster, launches a DynoAM on the host YARN cluster; the AM runs the real NameNode and many fake DataNodes inside YARN containers]
  • 7. Dynamometer: SIMULATED HDFS CLIENTS RUN IN YARN CONTAINERS
    • Clients can run on YARN too!
    • Replay real traces from production cluster audit logs
    [Diagram: alongside the Dynamometer infrastructure application, a workload MapReduce job runs simulated clients in YARN containers, replaying audit logs from the host HDFS cluster against the simulated NameNode]
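As a rough sketch of how these two pieces are driven (the script and flag names below reflect the linkedin/dynamometer CLI as I understand it and should be verified against the project README; all paths and host names are hypothetical):

```shell
# Illustrative only: verify script and flag names against the
# Dynamometer README for your version.

# 1. Launch the simulated cluster: one real NameNode plus fake
#    DataNodes, all scheduled as containers on the host YARN cluster.
./bin/start-dynamometer-cluster.sh \
  -hadoop_binary_path hadoop-2.7.4.tar.gz \
  -conf_path my-namenode-conf/ \
  -fs_image_dir hdfs:///dyno/fsimage \
  -block_list_path hdfs:///dyno/blocks

# 2. Replay a production audit-log trace against the simulated
#    NameNode via the workload MapReduce job.
./bin/start-workload.sh \
  -Dauditreplay.input-path=hdfs:///dyno/audit-logs/ \
  -Dauditreplay.num-threads=50 \
  -nn_uri hdfs://<simulated-nn-host>:<port>/
```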
  • 8. Contributing to Apache • Working to put into hadoop-tools • Easier place for community to access and contribute • Increased chance of others helping to maintain it • Follow HDFS-12345 (actual ticket, not a placeholder)
  • 10. NameNode GC Primer • Why do we care? • NameNode heaps are huge (multi-hundred GB) • GC is a big factor in performance • What’s special about NameNode GC? • Huge working set: can have over 100GB of long-lived objects • Massive young gen churn (from RPC requests)
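None of the GC analysis that follows is possible without detailed GC logs; a minimal sketch of enabling them on a JDK 8 NameNode (the log path is hypothetical):

```shell
# JDK 8 GC logging flags: per-phase detail (including the "Scan RS"
# breakdown shown later) plus total application-stopped time.
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
 -XX:+PrintGCDetails \
 -XX:+PrintGCDateStamps \
 -XX:+PrintGCApplicationStoppedTime \
 -Xloggc:/var/log/hadoop/namenode-gc.log"
```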
  • 11. Can we use a new GC algorithm to squeeze more performance out of the NameNode? Q U E S T I O N
  • 12. Experimental Setup • 16-hour production trace: long enough to experience 2 rounds of mixed GC • Measure performance via standard metrics (client latency, RPC queue time) • Measure GC pauses during startup and normal workloads • Let’s try G1GC even though we know we’re pushing the limits: “The region sizes can vary from 1 MB to 32 MB depending on the heap size. The goal is to have no more than 2048 regions.” – Oracle* • This implies that the heap should be 64 GB and under, but at this… (*Garbage First Garbage Collector Tuning – https://www.oracle.com/technetwork/articles/java/g1gc-198453)
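The 64 GB figure follows directly from the numbers in the quote: at the maximum 32 MB region size, 2048 regions cap out at 64 GB of heap. A one-line sanity check:

```shell
# Largest heap (GB) that stays within ~2048 regions at the 32 MB max region size
max_region_mb=32
target_regions=2048
echo "$(( max_region_mb * target_regions / 1024 )) GB"   # prints: 64 GB
```

Anything larger (like the 150 GB heap in the log on the next slide) necessarily blows past the region-count goal.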
  • 13. Can You Spot the Issue?
    [Parallel Time: 17676.0 ms, GC Workers: 16]
      [GC Worker Start (ms): Min: 883574.6, Avg: 883574.8, Max: 883575.0, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 1.0, Avg: 1.2, Max: 2.1, Diff: 1.1, Sum: 18.8]
      [Update RS (ms): Min: 31.7, Avg: 32.2, Max: 32.8, Diff: 1.1, Sum: 514.7]
        [Processed Buffers: Min: 25, Avg: 30.2, Max: 38, Diff: 13, Sum: 484]
      [Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.6, Diff: 0.6, Sum: 1.0]
      [Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3]
      [Termination (ms): Min: 0.0, Avg: 88.8, Max: 96.5, Diff: 96.5, Sum: 1421.5]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
      [GC Worker Total (ms): Min: 17675.5, Avg: 17675.6, Max: 17675.8, Diff: 0.3, Sum: 282810.3]
      [GC Worker End (ms): Min: 901250.4, Avg: 901250.4, Max: 901250.5, Diff: 0.0]
    [Code Root Fixup: 0.7 ms]
    [Code Root Migration: 2.3 ms]
    [Code Root Purge: 0.0 ms]
    [Clear CT: 6.7 ms]
    [Other: 1194.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.8 ms]
      [Ref Enq: 0.4 ms]
      [Redirty Cards: 468.4 ms]
      [Free CSet: 4.0 ms]
    [Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)]
    [Times: user=223.20 sys=0.22, real=18.88 secs]
    902.102: Total time for which application threads were stopped: 18.8815330 seconds
    Highlighted on the slide:
      [Scan RS (ms): Min: 17011.1, Avg: 17052.9, Max: 17400.5, Diff: 389.4, Sum: 272846.4] ← 17.5s pause due to “Scan RS”!
      [Object Copy (ms): Min: 169.8, Avg: 500.5, Max: 534.3, Diff: 364.5, Sum: 8007.3] ← ~500ms pause due to object copy
      [Eden: 7360.0M(7360.0M)->0.0B(6720.0M) Survivors: 320.0M->960.0M Heap: 92.6G(150.0G)->87.2G(150.0G)] ← a few GB of Eden cleared: big, but not huge
      902.102: Total time for which application threads were stopped: 18.8815330 seconds ← huge pause!
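With per-phase logging like the above, a quick way to surface the worst collections is to pull out the "Scan RS" averages; a hedged one-liner (the log file name is hypothetical, and the format assumed is the JDK 8 -XX:+PrintGCDetails layout shown above):

```shell
# List the five worst "Scan RS" phase averages (ms) from a JDK 8 G1 log;
# an outlier like the 17s pause above jumps straight to the top.
grep 'Scan RS (ms)' namenode-gc.log \
  | sed -E 's/.*Avg: ([0-9.]+).*/\1/' \
  | sort -rn | head -5
```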
  • 14. Tuning G1GC with Dynamometer • G1GC has lots of tunables – how do we optimize all of them without hurting our production system? • Dynamometer to the rescue • Easily set up experiments sweeping over different values for a parameter • Fire-and-forget: test with many combinations and analyze later • Main parameters needing significant tuning were for the remembered sets • (details to follow in appendix)
  • 15. How Much Does G1GC Help?

                                  Startup           Normal Operation
      METRIC                    CMS     G1GC        CMS     G1GC
      Avg Client Latency (ms)    –       –           19      18
      Total Pause Time (s)      200     180          550     160
      Median Pause Time (s)     1.1     0.5          0.12    0.06
      Max Pause Time (s)        13.4    3.3          1.4     0.6

    * Values are approximate and provided primarily to give a sense of scale
    Excellent reduction in pause times • Not much impact on throughput
  • 16. Looking towards the future… • Question: How does G1GC fare extrapolating to future workloads: • 600GB+ heap size, 1 billion blocks, 1 billion files • Answer: Not so well • RSet entry count has to be increased even further to obtain reasonable performance • Off-heap overheads in the hundreds of gigabytes • Wouldn’t recommend it
  • 17. Looking towards the future… • Anything we can do besides G1GC? • Extensive testing with Azul’s C4 GC available in Zing® JVM • Good performance with no tuning • Results in a test environment: • 99th percentile pause time ~1ms, max in tens of ms • Average client latency dropped ~30% • Continued to see good performance up to 600GB heap size Zing JVM: https://www.azul.com/products/zing/ Azul C4 GC: https://www.azul.com/resources/azul-technology/azul-c4-garbage-collec
  • 18. Looking towards the future… • Anything we can do that isn’t proprietary? • Wait for OpenJDK next gen GC algorithms to mature: • Shenandoah • ZGC
  • 19. Appendix: Detailed G1GC Tuning Tips • -XX:G1RSetRegionEntries: Solves the problem from the previous slide; 4096 worked well (default of 1536) • Comes with high off-heap memory overheads • -XX:G1RSetUpdatingPauseTimePercent: Reduce this to reduce the “Update RS” pause time and push more work to concurrent threads (the NameNode is not really that concurrent – extra cores are better used by the GC algorithm) • -XX:G1NewSizePercent: The default of 5% is unreasonably large for heaps > 100GB; reducing it will help shorten pauses during high-churn periods (startup, failover) • -XX:MaxTenuringThreshold, -XX:ParallelGCThreads, -XX:ConcGCThreads: Set empirically based on experiments sweeping over values. This is where Dynamometer really shines • MaxTenuringThreshold is particularly interesting: Based on the NN usage pattern (objects are either very long-lived or very short-lived), you would expect low values (1 or 2) to be best, but in practice values closer to the default of 8 perform better
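Assembled in one place, a hedged example of what the resulting flag set might look like. Only G1RSetRegionEntries=4096 is a value the slides commit to; the other values are illustrative starting points to re-derive with your own Dynamometer sweeps:

```shell
# Hedged example flag set; only G1RSetRegionEntries=4096 comes from the
# slides (default 1536). The other values are illustrative.
# On JDK 8, G1RSetRegionEntries and G1NewSizePercent are experimental
# flags and require -XX:+UnlockExperimentalVMOptions.
GC_OPTS="-XX:+UseG1GC"
GC_OPTS="$GC_OPTS -XX:+UnlockExperimentalVMOptions"
GC_OPTS="$GC_OPTS -XX:G1RSetRegionEntries=4096"          # fixes the Scan RS blowup
GC_OPTS="$GC_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"  # illustrative value
GC_OPTS="$GC_OPTS -XX:G1NewSizePercent=2"                # illustrative; default 5% is large on 100GB+ heaps
GC_OPTS="$GC_OPTS -XX:MaxTenuringThreshold=8"            # near-default won empirically
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS $GC_OPTS"
```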

Editor’s notes

  1. Quick show of hands, who has heard of Dynamometer? How can we know that deploying a new config won’t hurt us? How to know if working on a project is worth it?
  2. First key insight: focus on the NN only. This greatly reduces the size of the problem: only a single “real” node is needed, for the NameNode
  3. Quick background for those not familiar
  4. Enable it on a small cluster and see what happens? Sure, but without the load (heap and RPC) it won’t really tell us much. Enable it on production? Sure, but how many apologies do we need to make if something goes wrong? Try it on NNThroughputBenchmark? Sure, but block reports and varying client workloads contribute heavily. Dynamometer!
  5. Macro effect: performance viewed by client/server. Micro effect: GC pauses themselves. GC characteristics are very different during startup / failover, so measure these separately
  6. Can you see what’s wrong here? Object copying time? No… RSet scanning! The remembered set is essentially the set of references into a region; there is a tradeoff between its overhead and how expensive it is to scan. NN performance ground to a halt. This is why we can’t test it out on production!
  7. Getting these values is where Dynamometer really comes through strong. We tried probably around 50 different combinations of parameters. Dynamometer allowed us to set up an experiment where we could sweep over a number of parameters, let it run over a long weekend, and come back at the end to a bunch of data – no human necessary in between
  9. - Don’t forget to show the appendix!