As a leading developer of large-scale Web services, Google was forced early on to build systems that support the deployment and management of diverse workloads at immense scale. As the broader developer community embraces cloud technologies, we see significant parallels between the internal management infrastructure that Google has built over the last decade and the open source management technologies of today. This talk will describe Google's experience in managing large-scale compute services, draw parallels to open source efforts underway today, and sketch out how our past experience shapes our future development of the Google Cloud Platform.
SaltConf14 - Brendan Burns, Google - Management at Google Scale
1. Google confidential │ Do not distribute
Management at Google Scale
Converging managed infrastructure between Google and the Cloud community
Brendan Burns
Staff Software Engineer
For the past 15 years, Google has been building the world’s fastest, most powerful, highest quality cloud infrastructure on the planet.
Images by Connie Zhou
5. So, what have we learned?
Declarative Management for sanity
Containers for idempotency and reproducibility
Task Introspection (or how I learned to forget about SSH)
6. A view into my life
• Google engineer for 6 years
• Search Infrastructure (Realtime Search, Google+ Search …)
• Cloud Infrastructure
• Build software to expect failure
• Never had root@google.com, despite being on call for web search for 4+ years
14. Containers
Google has a long history with containers (process cgroups, LMCTFY [https://github.com/google/lmctfy])
What are containers good for?
16. Introspection (or how I learned to forget about SSH)
Containers don’t really have SSH (well, they can, but…)
Still want containers to be self-contained
20. The Top Six Things
You Didn’t Know About SaltStack
21. 1. Fast, flexible comms protocol
• SaltStack provides options
• Different solutions for different problems
• Flexibility and plug-ability
• ØMQ
– Super fast
• SSH
– For certain use cases
– 50x faster than other SSH-based tools
• RAET
– UDP or TCP
– Even faster
– More control over job queuing and prioritization
– More infrastructure visibility
22. 2. Salt Virt
• Doesn’t get much attention
• Salt originally designed as a cloud controller (Butter)
• A completely different approach to cloud management
– Database free
– Evolving but being used in production
23. 3. Declarative or imperative? Yes.
• Stick a fork in this debate
• Most flexible configuration management
• Finite order execution is a core Salt design principle
• 0.17 introduced more state ordering choice
• Compiler and run time
– Salt modularity
– No sacrifice or compromise of speed
24. 4. Generic device automation
• Minion proxy for network devices (Juniper, Arista, Broadcom, F5, etc.)
• Not just executing CM routines
• Finite device control w/ remote execution
• Easy to communicate with and control these typically dumb devices
• Stateful configuration and one-off queries
• Integrated with standard Salt workflows and methodologies
25. 5. The Salt test suite
• More stable Salt releases
• Pedro Algarvio!
• Running live tests constantly on real infra
– Jenkins
– Spinning up VMs on Rackspace to run tests
– Hooked into Docker containers
• PyLint coverage (thx Hulu & LogiLab)
• Test coverage doubled in three months
26. 6. The SaltStack name
• Not SLC
• FLOSS Weekly realization
• Gimli, son of Gloin
• Ubiquitous nature of Salt
By-product of N different commands from M different users
This is no good for countless reasons.
e.g. Class vs. Object
Reasoning in a declaration unlocks tremendous potential
Audit Trail
Code Review
Roll forward / Roll back
Reproducibility
Portable environments
Self describing systems
Separation of concerns
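The contrast above, between state that is the by-product of many ad hoc commands and state that is declared, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's or Salt's implementation: the `reconcile` and `apply_actions` functions, the service names, and the version strings are all hypothetical. The sketch compares a declared desired state with the observed state and derives the actions needed to converge.

```python
def reconcile(desired, actual):
    """Return the actions that move `actual` toward `desired`.

    Both arguments map a service name to a version string.
    The names and versions used below are hypothetical examples.
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("start", name, desired[name]))
        elif actual[name] != desired[name]:
            actions.append(("update", name, desired[name]))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("stop", name, actual[name]))
    return actions


def apply_actions(actual, actions):
    """Apply actions to a copy of `actual` and return the resulting state."""
    state = dict(actual)
    for verb, name, version in actions:
        if verb == "stop":
            del state[name]
        else:  # "start" and "update" both set the declared version
            state[name] = version
    return state


# The declaration is plain data, so it can be code reviewed and audited.
desired = {"frontend": "v2", "search": "v7"}
actual = {"frontend": "v1", "batch": "v3"}

plan = reconcile(desired, actual)
converged = apply_actions(actual, plan)
```

Because the declaration is data rather than a command history, rolling back in this sketch is the same operation as rolling forward: reconcile against an earlier declaration.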
chroot, package management, process cgroups
Containers are
Introspective
Declarative (or can be)
Contain everything you want, nothing you don’t
Focused
Limited
Self-Contained
Extensible
Isolated (still work to do here)
Containers should carry their debug access with them
Logging
Monitoring
status pages, threadz, heapz, etc.
Should get some “for free”
This is where the community comes in (and container extensibility is useful)
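One way a task could carry its own debug access, instead of relying on SSH, is to serve status pages over HTTP from inside the process. The following is a minimal sketch using only the Python standard library; it is not how Google's pages are implemented. The `/statusz` path, the JSON layout, the `render` and `serve` names, and the default port are assumptions made for illustration; only the "threadz" name comes from the slide.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def render(path):
    """Return (status_code, json_body) for an introspection path."""
    if path == "/statusz":  # hypothetical path for a generic status page
        body = {"uptime_seconds": round(time.time() - START_TIME, 1)}
    elif path == "/threadz":
        body = {"threads": sorted(t.name for t in threading.enumerate())}
    else:
        return 404, json.dumps({"error": "unknown path"})
    return 200, json.dumps(body)


class IntrospectionHandler(BaseHTTPRequestHandler):
    """Serves the pages produced by render() over HTTP GET."""

    def do_GET(self):
        status, body = render(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the debug pages until interrupted (the port is an assumption)."""
    HTTPServer(("", port), IntrospectionHandler).serve_forever()
```

Calling `serve()` from a background thread of the task would expose the pages alongside the real workload. A shared library of such handlers is one way tasks could get some of this "for free".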
Basic takeaway: we have some services now.
We’re going to have more services.
We are excited about open source services and will partner with what comes next.