This document summarizes one organization's experience scaling Puppet Enterprise (PE) to manage 5,100 nodes over 9 months. Some key points:
- PE was deployed across 14 supported operating systems spanning 7 families on 4,157 production nodes and 72 development nodes.
- The production PE deployment consisted of 11 servers fulfilling roles like the certificate authority (CA), PuppetDB, and Puppet Console.
- Scaling challenges included load balancing puppet masters, sizing PuppetDB and the Console for performance, and tuning JVM settings as nodes grew to 4,000.
- Lessons included using the Roles and Profiles pattern, learning what the nanliu/staging module does, and setting top-level resource defaults in site.pp.
2. WHO AM I?
• DevOps and Cloud Admin* at TE Connectivity
• ~9 years of assorted technical operations experience
• ~1 year of PE usage/administration
• Puppet Featured Community Member (for most verbose complaints by a Test Pilot 2014)
• Puppet Certified Professional 2015 (sample scores: Puppet Language 94%, Console 40%)
• Can’t be bothered to take the internal “Making compelling presentations” training
<= LIAR =>
3. PE DEPLOYMENT STATS
• 5100 PE licenses
  • Prod => 4157 agents
  • Dev => 72 agents
  • 871 licenses purchased for systems of stubborn people.
• 14 supported OSes spanning 7 OS families
• Prod PE deployment consists of 11 servers:
  • 1 CA / Filebucket server
  • 1 PuppetDB server (using embedded PostgreSQL)
  • 1 Puppet Console
  • 4 Puppet compile masters
  • 1 ActiveMQ hub
  • 3 ActiveMQ brokers
4. THE CRUELEST LIES ARE OFTEN TOLD WHEN TRYING TO GET MANAGERS TO BUY THE RIGHT TOOLS
• Compliance reporting (without remediation)
• Application code deployment
• Service discovery
• DNS?!
• Any phrase that includes “I’m sure there is a way puppet can…”
5. NO-OP (AKA MY ARCH NEMESIS)
• No-Op is a tool, not a solution.
• No-Op != Operational Intelligence
• Pandora’s Box full of excuses not to embrace change (see also: “brownfield”, “legacy”, “near-EoL”)
• Make sure you enforce enough code to control your agent configuration…
6. THE FASTEST WAY TO CAUSE 4000 AGENT RUNS TO FAIL
• Custom Facter facts are your friend, until they aren’t.
• The #1 culprit for massive agent failures is bad confines in custom facts that were not tested against enough canary nodes.
• “It worked when I tested it, the fact even returns the right value”.
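As an illustration of the point about confines, here is a minimal sketch of a defensively written custom fact. The fact name `exampleapp_version` and the path `/opt/exampleapp/VERSION` are hypothetical and do not come from the talk. The idea is to keep the confine narrow and to return nil on any failure rather than raise, so a node that looks different from the one you tested on simply lacks the fact.

```ruby
# Hypothetical custom fact: exampleapp_version.
# The fact name and file path are illustrative assumptions, not from the talk.

# Resolve the value defensively: any failure yields nil instead of raising,
# so an unexpected node layout results in a missing fact, not an exception.
def exampleapp_version(path = '/opt/exampleapp/VERSION')
  return nil unless File.readable?(path)

  value = File.read(path).strip
  value.empty? ? nil : value
rescue StandardError
  nil
end

# Register the fact only when Facter is loaded (that is, on a real agent).
if defined?(Facter)
  Facter.add(:exampleapp_version) do
    # Narrow confine: only attempt resolution on Linux nodes.
    confine :kernel => 'Linux'
    setcode { exampleapp_version }
  end
end
```

Because the resolution logic lives in a plain method, it can be exercised on a handful of canary nodes (or in a unit test) without a full agent run.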
Important
8. Evolution of puppet.conf (all services on one hostname)
#puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net
#puppetdb.conf.stub
[main]
server = puppet.example.net
#console.conf.stub
[main]
server = puppet.example.net
9. Evolution of puppet.conf (PuppetDB and console split out)
#puppet.conf.stub
[main]
server = puppet.example.net
archive_files = true
archive_file_server = puppet.example.net
ca_server = puppet.example.net
#puppetdb.conf.stub
[main]
server = puppetdb.example.net
#console.conf.stub
[main]
server = puppetconsole.example.net
10. Evolution of puppet.conf (load-balanced masters; CA and filebucket split out)
#puppet.conf.stub
[main]
# puppet.example.net is now a load balancer
server = puppet.example.net
archive_files = true
archive_file_server = puppetfb.example.net*
ca_server = puppetca.example.net*
#puppetdb.conf.stub
[main]
server = puppetdb.example.net
#console.conf.stub
[main]
server = puppetconsole.example.net
11. LOAD BALANCING PITFALLS
• Do load balance:
  • Port 8140 between compile masters
    • If you use connection stickiness > 30 minutes (the default agent run interval), agents will never change masters.
  • Port 61613 between ActiveMQ brokers
• Don’t load balance:
  • Puppet CA, or any cert signing requests
  • File bucket (archive_file_server)
  • ActiveMQ hub (more split-brain SSL)
13. SIZING RECOMMENDATIONS REVISED
• PuppetDB needs far more RAM than is recommended once you scale. (Recommended: 30GB. Ours at present: 50GB, and it should be higher.)
• PostgreSQL best practices claim memory of 3x the database size for best performance. At 4000 nodes, puppetdb is ~50GB and consoledb is ~40GB with 3 days of retention.
• ConsoleDB needs to be pruned aggressively (reports = nodes * 48 * days of retention, where 48 is the number of agent runs per day at the default 30-minute run interval). That much information is not useful in the console.
• Console uses less RAM than expected. (Recommended: 30GB. Ours at present: 10GB.)
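To make the report-count formula concrete, here is a small worked example in Ruby. The method name `stored_reports` is a hypothetical helper, and the 30-day figure is only an illustrative comparison; the formula and the 4000-node, 3-day numbers are from the slide.

```ruby
# Agent runs per node per day at a 30-minute run interval (24 h * 2).
RUNS_PER_DAY = 48

# Estimated number of reports held in the console database.
# Hypothetical helper; implements reports = nodes * 48 * days of retention.
def stored_reports(nodes, retention_days, runs_per_day = RUNS_PER_DAY)
  nodes * runs_per_day * retention_days
end

puts stored_reports(4000, 3)   # 4000 * 48 * 3  = 576000
puts stored_reports(4000, 30)  # 4000 * 48 * 30 = 5760000
```

Even at 3 days of retention the console holds more than half a million reports, which is why pruning matters long before disk space does.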
15. CONSOLE CONFIGURATIONS
• At 4000 nodes we use 8 dashboard workers.
• As the number of nodes grows, the default page of the console can become very sluggish. Edit /opt/puppet/share/puppet-dashboard/config/routes.rb to adjust the root route:
PuppetDashboard::Application.routes do
  # root :to => 'pages#home'
  root :to => 'reports#index'
16. JVM TUNING
• Problem: Service stops, logs show Out of Memory exceptions.
• Heap sizes:
  • puppetserver - 4GB
  • puppetdb - 1GB
  • PE console - 2GB
  • ActiveMQ hub - 1.5GB
  • ActiveMQ broker - 1GB
• PuppetDB (server component) has run on the JVM for a while, so most GC settings can be tuned as Puppet parameters.
18. THE “READ THIS LATER” SLIDE
• Use r10k. Use a Puppetfile. Use Roles and Profiles.
• Learn what nanliu/staging does. Then use it.
• exec { 'horrible_idea':
    command => 'dostuff.sh && touch /tmp/didstuff.proof',
    creates => '/tmp/didstuff.proof',
  }
• PuppetLabs, myself, and most of our profession are absolutely terrible at naming things.
  • Problem:
    ('Environment' && 'Deployment' && 'Tier' && 'Branches' && 'Forks') => ['Production', 'Dev', 'QA']
  • Result:
    cats.all? { cats.content[:name] == 'Selso' } => true
• Proxy servers are evil. Spaceship operators have a cool name.
  • Problem: universally_respected_proxy_variables.exists? => false
  • Solution: Use site.pp + resource collection to set top-level resource defaults.
19. “IF I HAVE SEEN FURTHER IT IS BY STANDING ON YE SHOULDERS OF GIANTS” ~ ISAAC NEWTON
Resources that have gotten me by:
• https://docs.puppetlabs.com/references/latest/type.html
• Puppet Types and Providers by Dan Bode and Nan Liu
• Puppet Practitioner’s Training
• Gary Larizza’s Blog (aka nsfw missing puppet documentation)
• PuppetLabs Support
• Puppet Professional Services
And most importantly:
• A healthy mixture of ambition, stubbornness and stupidity.