The document discusses Kubernetes adoption at Squarespace as their engineering organization grew. It describes the challenges of a monolithic architecture and how microservices addressed these challenges. It then discusses how Kubernetes helped solve operational challenges of provisioning and scaling microservices. Key Kubernetes concepts like pods, deployments, services and namespaces are explained. Monitoring, networking and security with Kubernetes are also covered.
2. Agenda
01 The problem with static infrastructure
02 Kubernetes Fundamentals
03 Adapting Microservices to Kubernetes
04 Kubernetes in a datacenter?
3. Microservices Journey: A Story of Growth
2013: small (< 50 engineers)
build product & grow customer base
whatever works
2014: medium (< 100 engineers)
we have a lot of customers now!
whatever works doesn't work anymore
2016: large (100+ engineers)
architect for scalability and reliability
organizational structures
?: XL (200+ engineers)
4. Challenges with a Monolith
● Reliability
● Performance
● Engineering agility/speed, cross-team coupling
● Engineering time spent fire fighting rather than building new
functionality
What were the increasingly difficult challenges with a
monolith?
5. Challenges with a Monolith
● Minimize failure domains
● Developers are more confident in their changes
● Squarespace can move faster
Solution: Microservices!
6. Operational Challenges
● Engineering org grows…
● More features...
● More services…
● More infrastructure to spin up…
● Ops becomes a blocker...
Stuck in a loop
7. Traditional Provisioning Process
● Pick ESX with available resources
● Pick IP
● Register host to Cobbler
● Register DNS entry
● Create new VM on ESX
● PXE boot VM and install OS and base configuration
● Install system dependencies (LDAP, NTP, CollectD, Sensu…)
● Install app dependencies (Java, FluentD/Filebeat, Consul, Mongo-
S…)
● Install the app
● App registers with discovery system and begins receiving traffic
8. Containerization & Kubernetes Orchestration
● Difficult to find resources
● Slow to provision and scale
● Discovery is a must
● Metrics system must support short lived metrics
● Alerts are usually per instance
Static infrastructure and microservices do not mix!
13. Common Objects: Deployments
● Declarative
● Defines a type of pod to run
● Defines desired #
● Supports basic operations
○ Can be rolled back quickly!
○ Can be scaled up/down
● Meant to be stateless apps!
kind: Deployment
spec:
replicas: 3
selector:
matchLabels:
service: location
strategy:
rollingUpdate:
maxSurge: 100%
maxUnavailable: 0
type: RollingUpdate
template:
... pod info here ...
14. Common Objects: Services
● Make pods addressable
● Assigned an IP
● Addressable DNS entries!
apiVersion: v1
kind: Service
metadata:
name: location
namespace: core-services
spec:
type: ClusterIP
clusterIP: 10.123.79.211
selector:
service: location
ports:
- name: traffic
port: 8080
- name: admin
port: 8081
15. Common Objects: Namespaces
● Namespaces
○ Isolates groups of objects
■ Developer
■ Team
■ System or Service
○ Good for permission boundaries
○ Good for network boundaries
● Most objects are namespaced
apiVersion: v1
kind: Namespace
metadata:
name: core-services
annotations:
squarespace.net/contact: |
team@squarespace.com
creationTimestamp: 2017-06-14T..
spec:
finalizers:
- kubernetes
status:
phase: Active
17. Future Work: Updating Common Dependencies
● Custom Initializers
○ Inject container dependencies into deployments (consul, fluentd)
○ Configure Prometheus instances for each namespace
● Trigger rescheduling of pods when dependencies need updating
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: location
namespace: core-services
annotations:
initializer.squarespace.net/consul: "true"
18. Future Work: Enforce Squarespace Standards
● Custom Admission Controller requires all services, deployments, etc.
meet certain standards
○ Resource requests/limits
○ Owner annotations
○ Service labels
19. Quality of Service Classes
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● BestEffort
○ No resource constraints
○ First to be killed under pressure
● Guaranteed
○ Requests == Limits
○ Last to kill under pressure
○ Easier to reason about resources
● Burstable
○ Take advantage of unused resources!
○ Can be tricky with some languages
20. Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Kubernetes assumes no other processes are
consuming significant resources
● Completely Fair Scheduler (CFS)
○ Schedules a task based on CPU Shares
○ Throttles a task once it hits CPU Quota
● OOM Killed when memory limit exceeded
21. Microservice Pod Definition
resources:
requests:
cpu: 2
memory: 4Gi
limits:
cpu: 2
memory: 4Gi
● Shares = CPU Request * 1024
● Total Kubernetes Shares = # Cores * 1024
● Quota = CPU Limit * 100ms
● Period = 100ms
22. Java in a Container
● JVM is able to detect # of cores via sysconf(_SC_NPROCESSORS_ONLN)
● Many libraries rely on Runtime.getRuntime.availableProcessors()
○ Jetty
○ ForkJoinPool
○ GC Threads
○ That mystery dependency...
23. Java in a Container
● Provide a base container that calculates the container’s resources!
● Detect # of “cores” assigned
○ /sys/fs/cgroup/cpu/cpu.cfs_quota_us divided by
/sys/fs/cgroup/cpu/cpu.cfs_period_us
● Automatically tune the JVM:
○ -XX:ParallelGCThreads=${core_limit}
○ -XX:ConcGCThreads=${core_limit}
○ -Djava.util.concurrent.ForkJoinPool.common.parallelism=${core_limit}
24. Java in a Container
● Use Linux preloading to override availableProcessors()
#include <stdlib.h>
#include <unistd.h>
int JVM_ActiveProcessorCount(void) {
char* val = getenv("CONTAINER_CORE_LIMIT");
return val != NULL ? atoi(val) : sysconf(_SC_NPROCESSORS_ONLN);
}
https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
27. ● Graphite does not scale well with ephemeral instances
● Easy to have combinatoric explosion of metrics
Traditional Monitoring & Alerting
● Application and system alerts are tightly coupled
● Difficult to create alerts on SLAs
● Difficult to route alerts
29. Falco: Centralized Service Management
● Kubernetes Dashboard is too complex and powerful
● Centralized deployment status and history
● Manual rollbacks of deploys
● Quick access to scaling controls
32. ● Efficient for ephemeral instances
● Stores tagged data
● Easy to have many smaller instances (per team or complex system)
● Prometheus Operator runs everything in Kubernetes!
Kubernetes Monitoring & Alerting
● Alerts are defined with the application code!
● Easy to define SLA alerts
● Routing is still difficult
40. Spine and Leaf Layer 3 Clos Topology
● All work is performed at the leaf/ToR switch
● Each leaf switch is separate Layer 3 domain
● Each leaf is a separate BGP domain (ASN)
● No Spanning Tree Protocol issues seen in L2 networks (convergence
time, loops)
Leaf Leaf Leaf Leaf
Spine Spine
41. Spine and Leaf Layer 3 Clos Topology
● Simple to understand
● Easy to scale
● Predictable and consistent latency (hops = 2)
● Allows for Anycast IPs
Leaf Leaf Leaf Leaf
Spine Spine
42. Calico Networking
● No network overlay required!
○ No nasty MTU issues
○ No performance impact
● Communicates directly with existing L3 network
● BGP Peering with Top of Rack switch
43. Calico Networking
● Engineers can think of Pod IPs as normal hosts
(they’re not)
○ Ping works
○ Consul works normally
○ Browser communication works
○ Shell sorta works (kubectl exec -it pod sh)
44. Calico Networking
● Each worker announces it’s pod IP ranges
○ Aggregated to /26
● Each master announces an External Anycast IP
○ Used for component communication
● Each ingress tier announces the Service IP range
ip addr add 10.123.0.0/17 dev lo
etcdctl set
/calico/bgp/v1/global/custom_filters/v4/services
'if ( net = 10.123.0.0/17 ) then { accept; }'
45. Calico Networking: Firewalls
● Calico supports NetworkPolicy firewall rules
○ We aren’t using this yet!
● Add DefaultDeny to block traffic into namespace
● Add Ingress rules for whitelisted communication
○ Works across namespaces
○ Works with raw IP ranges
47. Future Work: Security!
● PodSecurityPolicy
● Mutual TLS in our environment:
○ Kubernetes WG Components WG
○ SIG Auth
○ SPIFFE
○ ISTIO
48. How do I connect to the cluster?
● Look at Getting Started guide on the wiki
● Generate a kubeconfig file
○ curl --user $(whoami) https://kubeconfig-generator.squarespace.net
● Uses KeyCloak OIDC to authenticate users
● Automatically refreshes credentials!
● Audit Logs are sent to logs.squarespace.net
51. Communication With External Services
● Environment specific services should not be encoded in application
● Single deployment for all environments and datacenters
● Federation API expects same deployment
● Not all applications are using consul
last year less than a dozen services existed, today more than 50 are in production or actively developed
Typical workflow for provisioning a VM at Squarespace
Currently takes about 15 minutes to provision a VM
There are definitily some optimizations to be made here:
Use VM templates (hard to generalize space constraints in general, but not so much of a problem for microservices)
Use VMware vMotion and other tools for auto migrating and finding free resources
The big takeaway
Requires a robust discovery mechanism for services; can’t easily get by with static names
This can be as simple DNS or load balancers
or something more complex (zookeeper, etcd, Consul)
Each has tradeoffs
Metrics:
Graphite metrics are not meant to be ephemeral
long lived metrics that are expensive to create, and are not efficiently aggregated (no tagging support!)
Difficult to control where data is coming from and how much data is coming in
Easy to blow out disk, or send faulty metrics
Centralized metrics can lead to
Alerts
Sensu alerts are per instance; system
A bit simplified, as there are a lot of moving parts
Declarative Infrastructure
All objects are represented by YAML descriptions
Kubernetes resource constraints aren’t enough
Need an understand of CGroups
Kubernetes resource constraints aren’t enough
Need an understand of CGroups
Kubernetes resource constraints aren’t enough
Push vs Pull metrics
Same Grafana
Same ELK
Sensu: app and system alerts are tightly coupled
Overwhelming & confusing to everyone except the guy who designed the system
does not present a sense of ownership
Hard to get a single view: graphite checks vs instance checks
Alerts are defined with code
Encourages developer ownership
only relevent alerts are defined: active requests, error rates, response times, # of instances up
Deployment logic is not colocated with code
Depends on Networking
Very SIMPLE
Each leaf is a Top of Rack switch
All devices are exactly the same number of segments away
Calico is backed by Etcd…
It’s super easy to leverage this
TODO: add graphic of KeyCloak interaction
We’re not moving all infrastructure to Kubernetes anytime soon
stateful systems and hardware dependent services like Databases, Kafka, ELK will remain statically provisioned
We need a way to automatically update these endpoints