This document provides an overview of measuring and tuning performance for the Puppet Enterprise (PE) platform. It discusses gathering data from PE services like Puppet Server and PuppetDB through JVM logging, metrics, and configurations. Important metrics for Puppet Server include JRuby usage and catalog compilation times. Tuning options involve adjusting JRuby capacity and rebalancing agent checkins. The document also covers monitoring PuppetDB for storage usage and command processing, as well as optimizing PostgreSQL query performance.
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Charlie Sharpsteen, Puppet
1. Keeping an Eye on the PE Stack
An Introduction to Measuring and Tuning PE Performance
Charlie Sharpsteen, Puppet Inc.
2. Overview
Overview
• How do I measure PE performance? What sources of
data are available?
• What numbers are actually important?
• What settings can I adjust when important metrics
start showing unhealthy trends?
4. PE Server Components
• TrapperKeeper JVMs: Puppet Server, PuppetDB, Console Services, Orchestration Services
• Other JVM: ActiveMQ
• Other: PostgreSQL, NGINX
Mostly Java based with shared logging and metrics interfaces.
5. TrapperKeeper Logging
• Configuration for main logs can be found in:
/etc/puppetlabs/<service name>/logback.xml
• Controls output destinations, log levels and message formatting.
• Ship to a log aggregator to provide context for investigations.
• Default log pattern is:
Date Level [Java Namespace] message
• Puppet Server also includes thread ID:
Date Level [thread] [Java Namespace] message
• Thread ID is useful for grouping activity related to a single request.
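Because every line carries its thread ID, a plain grep can reassemble all activity for one request. A minimal sketch using fabricated sample lines (the thread name qtp100-42 is invented for illustration):

```shell
# Fabricated sample lines following the documented pattern:
# Date Level [thread] [Java Namespace] message
cat > /tmp/puppetserver-sample.log <<'EOF'
2016-10-20 13:30:01,123 INFO  [qtp100-42] [puppetserver] Compiling catalog for node1
2016-10-20 13:30:01,456 INFO  [qtp100-77] [puppetserver] Compiling catalog for node2
2016-10-20 13:30:02,789 INFO  [qtp100-42] [puppetserver] Finished catalog for node1
EOF

# Pull every line belonging to a single request's thread:
grep -F '[qtp100-42]' /tmp/puppetserver-sample.log
```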
6. TrapperKeeper Logging
• Configuration for main logs can be found in:
/etc/puppetlabs/<service name>/request-logging.xml
• Default format is Apache Combined Log + request duration
• Easily parsed by most log processors.
• Can add additional bits of information such as request headers.
7. TrapperKeeper Metrics
• Metrics are recorded using JMX MBeans.
• Metrics that measure activity over time are weighted to represent the last 5 minutes.
• Metrics can be retrieved via the JMX protocol.
• Full access to all available metrics and all available measurements.
• Can attach tools such as JConsole and JVisualVM.
• Requires opening additional ports; configuration can be complex. Java tools only.
• Metrics can be retrieved as JSON over HTTP:
• For a curated set of common metrics: status/v1?level=debug
• For access to all available metrics: metrics/v1/mbeans
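From the command line, both endpoints can be queried with curl. A sketch assuming the default port 8140 and a local master; exact paths vary by version (status/v1/services on recent Puppet Server), and depending on auth.conf rules a client certificate may be required:

```shell
# Curated set of common metrics, at debug level:
curl -k "https://localhost:8140/status/v1/services?level=debug"

# All available metrics as JSON (where the metrics web service is enabled):
curl -k "https://localhost:8140/metrics/v1/mbeans"
```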
8. TrapperKeeper Configuration
• Configuration files are stored under:
/etc/puppetlabs/<service name>/conf.d
• Most important settings are managed by puppet_enterprise::profile classes and are
tunable via the Console and Hiera.
• JVM settings are specified in /etc/sysconfig or /etc/default
• The JVM memory limit, -Xmx, is the primary tunable setting. Enable the G1 garbage
collector when using limits higher than 10 GB: -XX:+UseG1GC
• These flags are configurable via the java_args parameter on profile classes.
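In Hiera, that might look like the following sketch (heap sizes are illustrative, not recommendations):

```yaml
# Hypothetical Hiera data; size the heap to your hardware.
puppet_enterprise::profile::master::java_args:
  Xmx: '4096m'
  Xms: '4096m'
```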
10. Puppet Server Metrics Overview
Puppet Server Metrics Overview
● JVM resource usage: status-service
● JMX namespace: java.lang:*
● HTTP request times per endpoint: pe-master
● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.http.*
● Catalog Compilation metrics: pe-puppet-profiler
● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.compiler.*
puppetserver:name=puppetlabs.<fqdn>.functions.*
puppetserver:name=puppetlabs.<fqdn>.puppetdb.*
● JRuby Metrics: pe-jruby-metrics
● JMX namespace: puppetserver:name=puppetlabs.<fqdn>.jruby.*
11. New PE 2016.4.0 Features
New PE 2016.4.0 Features
● The metrics/v1/mbeans endpoint has been added to Puppet Server. Must be enabled via Hiera:
puppet_enterprise::master::puppetserver::metrics_webservice_enabled: true
● The Graphite metrics reporter has been optimized and extended:
● Only a subset of available metrics are reported by default.
● Reported metrics can be customized using the metrics_puppetserver_metrics_allowed
parameter of the puppet_enterprise::profile::master class.
12. JRuby Metrics
JRuby Metrics
● Almost all Puppet Server requests must be handled by a JRuby instance — this makes JRuby
availability the primary performance bottleneck.
● num-free-jrubies
● Measures spare capacity for incoming requests.
● average-wait-time
● Should never grow to a significant fraction of HTTP request times.
● Impacted by agent checkin distribution, resource availability, Puppet plugins and code.
13. Agent Checkin Activity
Agent Checkin Activity
● Agents check in one runinterval after the start of their previous run, which can lead to pile-ups or
“thundering herds”. Be careful of:
● Starting or re-starting a group of agents without the splay setting enabled.
● Triggering a group of agent runs via: mco puppet runonce
● Monitor average-requested-jrubies and Puppet Server access logs for spikes in agent activity.
● Use PostgreSQL to pull a histogram of Agent start times from report data:
sudo su - pe-postgres -s /bin/bash -c "psql -d pe-puppetdb"
SELECT date_part('minute', start_time), count(*)
FROM reports
WHERE start_time BETWEEN '2016-10-20 13:30:00' AND '2016-10-20 14:30:00'
GROUP BY date_part('minute', start_time)
ORDER BY date_part('minute', start_time) ASC;
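The same minute-by-minute histogram can be pulled straight from Puppet Server access logs with awk. A sketch assuming the default combined-log timestamp format, with fabricated sample lines for illustration:

```shell
# Fabricated access-log lines with combined-log timestamps:
cat > /tmp/access-sample.log <<'EOF'
10.0.0.1 - - [20/Oct/2016:13:30:01 +0000] "PUT /puppet/v3/report/node1 HTTP/1.1" 200 9 10
10.0.0.2 - - [20/Oct/2016:13:30:59 +0000] "PUT /puppet/v3/report/node2 HTTP/1.1" 200 9 12
10.0.0.3 - - [20/Oct/2016:13:31:05 +0000] "PUT /puppet/v3/report/node3 HTTP/1.1" 200 9 11
EOF

# Field 4 is "[day/month/year:HH:MM:SS"; bucket requests per HH:MM:
awk '{ split($4, t, ":"); print t[2] ":" t[3] }' /tmp/access-sample.log | sort | uniq -c
# uniq -c reports 2 requests at 13:30 and 1 at 13:31
```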
14. Re-balancing Agent Checkins
Re-balancing Agent Checkins
● Use MCollective to orchestrate a batched re-start:
su - peadmin -c "mco rpc service stop service=puppet"
su - peadmin -c "mco rpc service start service=puppet --batch 1
--batch-sleep <runinterval in seconds / #nodes>"
● Batching is not necessary if the agents have splay enabled.
● For a stable distribution that isn’t affected by re-starts, puppet agent -t can be run on a schedule
determined by the fqdn_rand() function instead of using the service.
● Load due to agent activity can be cut dramatically by shifting to the Direct Puppet workflow where
Orchestrator or MCollective are used to push catalog updates.
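A cron-based schedule driven by fqdn_rand() might look like this sketch in Puppet code (the twice-hourly interval is illustrative):

```puppet
# Hypothetical profile: run the agent twice an hour at a stable per-node offset.
cron { 'puppet-agent-run':
  command => '/opt/puppetlabs/bin/puppet agent -t > /dev/null 2>&1',
  user    => 'root',
  minute  => [fqdn_rand(30), fqdn_rand(30) + 30],
}
```

Because fqdn_rand() seeds from the node's FQDN, the offset survives agent and service restarts.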
15. Adding More JRuby Capacity
Adding More JRuby Capacity
● JRuby count is set via jruby_max_active_instances, constrained by available CPU and RAM:
● Compile masters tend to top out around NCPU - 1. Monolithic masters need to share with
PuppetDB and tend more towards (NCPU / 2 - 1).
● RAM requirements are 512 MB per JRuby, but may need to be increased if catalog compilation
uses large datasets or dozens of environments are in use.
● The environment_timeout setting can be used to reduce the CPU requirements of catalog
compilation. Set to 0 globally and unlimited for long-lived environments with lots of agents.
● Each environment using an unlimited timeout will add to the per-JRuby RAM requirements.
Monitor memory usage of pre-2016.4.0 installations closely when using unlimited timeouts.
● Code Manager should be enabled when an unlimited timeout is used so that caches are flushed
when new code is deployed.
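In configuration terms, the recommendation above might look like this sketch:

```ini
# puppet.conf on the master: default for all environments
[master]
environment_timeout = 0

# environment.conf in a long-lived, heavily used environment: cache indefinitely
environment_timeout = unlimited
```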
16. Investigating Compile Times
Investigating Compile Times
● PE Puppet Server tracks compilation time on several different levels: per-node, per-environment, per-resource, per-function, and more.
● Top 10 resources and functions are available via the status API and Puppet Server performance
dashboard:
https://<puppetmaster>:8140/puppet/experimental/dashboard.html
● Full access available through JMX and the metrics API.
● Detailed timing on catalog compilation can be obtained by setting the Puppet Server log level to
DEBUG and running puppet agent -t --profile on nodes of interest.
17. Investigating Agent Run Times
Investigating Agent Run Times
● Agent run summaries are stored at:
/opt/puppetlabs/puppet/cache/state/last_run_summary.yaml
● Summaries are also stored by PuppetDB and can be viewed from the PE Console, or queried:
reports[metrics] {
latest_report? = true and certname = '<node name>'
}
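The same query can be issued against the PuppetDB API with curl; a sketch assuming PuppetDB listening on its default plain-text port 8080 on the local host:

```shell
curl -G http://localhost:8080/pdb/query/v4 \
  --data-urlencode "query=reports[metrics] { latest_report? = true and certname = '<node name>' }"
```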
● The time section shows the time taken per resource type, along with config_retrieval, which
measures how long it took to receive the catalog.
● Per-resource timing can be logged by running: puppet agent -t --evaltrace
19. PuppetDB Storage Usage
PuppetDB Storage Usage
● Monitor disk space!
/opt/puppetlabs/server/data/postgresql/
/opt/puppetlabs/server/data/puppetdb/
● If disk space runs out, there are two options for returning space to the operating system:
● The existing volume can be enlarged so that a VACUUM FULL can be run.
● Alternately, a new volume can be attached for a database backup and restore.
● The primary source of disk usage is report storage; this can be tuned by setting: report-ttl
● For infrastructure with high node turnover, consider setting node-purge-ttl to remove data related
to decommissioned nodes.
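In PE these TTLs are best managed via the puppet_enterprise profile classes, but the underlying PuppetDB setting looks roughly like this sketch (values illustrative):

```ini
# /etc/puppetlabs/puppetdb/conf.d/database.ini
[database]
report-ttl = 14d
node-purge-ttl = 14d
```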
20. PuppetDB Command Processing
PuppetDB Command Processing
● Every PuppetDB operation, aside from queries, is executed by an asynchronous command
processing queue. This queue is managed by an internal ActiveMQ server:
org.apache.activemq:type=Broker,brokerName=localhost,
destinationType=Queue,destinationName=puppetlabs.puppetdb.commands
● Important metrics:
● Backlog of commands waiting for processing: QueueSize
● Largest command seen: MaxMessageSize
● Available memory for in-flight commands: MemoryPercentUsage
● Increase PuppetDB heap size along with the command-processing.memory-usage setting if the
percentage spikes close to 100%. This will prevent ActiveMQ from paging commands to disk.
21. PuppetDB Command Processing
PuppetDB Command Processing
● Command processing rates:
puppetlabs.puppetdb.mq:name=global.processing-time
puppetlabs.puppetdb.storage:name=replace-facts-time
puppetlabs.puppetdb.storage:name=replace-catalog-time
puppetlabs.puppetdb.storage:name=store-report-time
● Additional processing threads can be added using the command-processing.threads setting.
● On a monolithic install, PuppetDB processing threads must be balanced against Puppet Server
JRubies and the number of CPU cores available.
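Both knobs live in the [command-processing] section of PuppetDB's config; a sketch with illustrative values:

```ini
# /etc/puppetlabs/puppetdb/conf.d/config.ini
[command-processing]
# Megabytes of broker memory for in-flight commands; raise alongside the JVM heap.
memory-usage = 1024
# Worker threads; balance against JRubies and CPU cores on a monolithic master.
threads = 4
```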
22. PostgreSQL Query Performance
PostgreSQL Query Performance
● PostgreSQL configuration can be found in:
/opt/puppetlabs/server/data/postgresql/9.4/data/postgresql.conf
● Add settings to improve logging around slow queries:
log_min_duration_statement = 3000ms
log_temp_files = 0
● If a temp file shows up in the logs, that means Postgres had to perform an operation outside of
RAM, which is slow. Consider increasing the work_mem setting to be greater than the size of the
temp files used.
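For example, if the logs show temp files around 10 MB, a work_mem above that size avoids the spill (value illustrative; work_mem is allocated per sort or hash operation, so raise it conservatively):

```ini
# postgresql.conf
work_mem = 16MB
```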
● If query performance has been dropping over time, a database VACUUM may be needed:
su - pe-postgres -s /bin/bash -c "vacuumdb --analyze --verbose --all"