4. Components
● Control plane
○ Storage (etcd)
○ API
○ Scheduler
○ Controller-manager
● Nodes
○ Container runtime
○ Node agent (kubelet)
○ Service proxy
○ Network agent
5. Components : storage
● Key-value store
● Raft-based distributed storage
● Client-to-server & server-to-server TLS support
● Incubating at the CNCF
Project page : https://etcd.io/
6. Components : API server
● Stores data in etcd
● Stateless REST API
● HTTP/2 + TLS
● gRPC support:
○ WATCH events over HTTP
○ Reactive, event-based triggers on Kubernetes components
7. Components : Scheduler
● Connected to the API server only
● Watches for pod objects
● Selects the node to run on based on criteria:
○ Hardware (available CPU, CPU architecture, available memory, disk space)
○ (Anti-)affinity patterns
○ Policy constraints (labels)
● Single active leader per quorum (token held in etcd)
8. Components : Controller manager
● Core controllers:
○ Node: monitors node status responses
○ Replication: ensures the pod count on replication controllers
○ Endpoints: maintains Endpoints objects for Services
○ Namespace: creates default ServiceAccounts & tokens
● Single active leader per quorum (token held in etcd)
9. Node components
● Container runtime: runs containers (Docker, containerd.io…)
● Node agent (kubelet): connects to the API server to handle containers & volumes
● Service proxy: load-balances service IPs to pod endpoints
● Network agent: connects nodes together (flannel, calico, kube-router…)
11. Datacenter deployment
● 3 Kubernetes clusters per datacenter:
○ Benchmark
○ Staging
○ Production
● No cross-DC cluster: no DC split-brain situation to manage
12. Etcd deployment
● 3 etcd per datacenter
○ TLSv1.2 enabled
○ Authentication over TLSv1.2 enabled
○ Hardware : 4 CPU, 32GB RAM
○ OS : Debian 10.1
○ Version 3.4 enabled:
■ Reduced latency
■ Large write-performance improvements
■ Reads not affected by commits
■ Will be the default version as of K8S 1.17
■ See : https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/
13. API server deployment
● API version: 1.15.x (old clusters) and 1.16.x (new clusters)
● 2 API servers load-balanced by HAProxy (TCP mode)
○ Horizontally scalable
○ Vertically scalable
○ Current setup : 4 CPU, 32GB RAM
○ OS : Debian 10.1
● The API servers load-balance etcd themselves
○ We discovered a bug in k8s < 1.16.3 when using TLS; ensure you run at least this version
○ Issue: https://github.com/kubernetes/kubernetes/issues/83028
15. API server deployment
● Enabled/enforced features (admission controllers):
○ LimitRanger: resource limitation validator
○ NodeRestriction: limits kubelet permissions on node/pod objects
○ PodSecurityPolicy: security policies to run pods
○ PodNodeSelector: limits node selection for pods
● See the full list of admission controllers here:
○ https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers
● Enabled extra feature: secret encryption in etcd using AES-256
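The slides don't show the actual API server configuration; a minimal sketch of what the etcd secret-encryption setup could look like (key name and placeholder secret are illustrative — the file is passed to the API server via `--encryption-provider-config`, and `aescbc` with a 32-byte key gives AES-256):

```yaml
# Hypothetical EncryptionConfiguration enabling AES-256-CBC for Secrets.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      # identity last: still able to read pre-existing unencrypted secrets
      - identity: {}
```

The admission controllers listed above would be enabled with a flag such as `--enable-admission-plugins=LimitRanger,NodeRestriction,PodSecurityPolicy,PodNodeSelector` on the API server.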
16. Controller-Manager & scheduler deployment
● 3 nodes per DC
○ Each runs a scheduler
○ Each runs a controller manager
○ Hardware: 2 CPU, 8GB RAM
○ OS: Debian 10.1
17. Controller-Manager & scheduler deployment
● Enabled features on the controller-manager: all defaults, plus
○ BootstrapSigner: authenticates kubelets on cluster join
○ TokenCleaner: cleans expired tokens
● Supplementary features on the scheduler:
○ NodeRestrictions: restrict pods to some nodes
20. Node architecture: container runtime
● Valid choice: Docker (https://www.docker.com/)
○ The default one
○ Known by “everyone” in the container world
○ Owned by a company
○ Simple to use
21. Node architecture: container runtime
● Valid choice: containerd (https://containerd.io/)
○ Younger than Docker
○ Extracted from Docker
○ CNCF-hosted project
○ Some limitations:
■ No Docker API v1!
■ K8S integration is poorly documented
22. Node architecture: container runtime
● Veepee choice: containerd
○ Supported by the CNCF and the community
○ Used by Docker as its underlying container runtime
○ We use Artifactory, and Docker API v2 is fully supported
○ Smaller footprint, less code, lower latency for the kubelet
23. Node architecture: system configuration
● Pod DNS configuration
○ clusterDomain: root DNS name for the pods/services
○ clusterDNS: DNS servers configured on pods
■ except if hostNetwork: true and the pod DNS policy is the default
● Protect the system from pods: ensure node system daemons can run
○ 128MiB memory reserved
○ 0.2 CPU reserved
○ Disk soft & hard limits
■ Soft: don't allow new pods to run if the limit is reached
■ Hard: evict pods if the limit is reached
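A kubelet configuration implementing the points above could look roughly like this sketch (the cluster DNS IP and the eviction thresholds are illustrative assumptions; only the 128MiB/0.2 CPU reservations come from the slide):

```yaml
# Hypothetical KubeletConfiguration: pod DNS + system protection.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDomain: cluster.local        # root DNS name for pods/services
clusterDNS:
  - 10.96.0.10                      # illustrative in-cluster DNS service IP
systemReserved:
  memory: 128Mi                     # keep node daemons alive
  cpu: 200m                         # 0.2 CPU reserved
evictionSoft:
  nodefs.available: "15%"           # soft limit: stop admitting pods
evictionSoftGracePeriod:
  nodefs.available: 2m
evictionHard:
  nodefs.available: "10%"           # hard limit: evict pods
```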
24. Node architecture: service proxy
● Exposes K8S service IPs on nodes to access pods
● Multiple mechanisms
○ iptables
○ IPVS
○ External load balancer (e.g. AWS ELB in layer 4 or layer 7)
● Multiple implementations
○ kube-proxy (iptables, ipvs)
○ kube-router (ipvs)
○ Calico
○ ...
25. Node architecture: service proxy
● Veepee choice: kube-proxy
○ Stay close to the Kubernetes distribution: don't add more complexity
○ No default need for layer-7 load balancing (service type: LoadBalancer); it can be added as an extra proxy in the future
○ Next challenge: iptables vs IPVS
26. Node architecture: kube-proxy mode
● kube-proxy: iptables mode
○ Default recommended mode (faster)
○ Works quite well… but:
■ Doesn't integrate with Debian 10 and later (thanks to Debian's iptables-nftables tooling) => restore the legacy iptables mode
■ Has locking problems when multiple programs need it
● https://github.com/weaveworks/weave/issues/3351
● https://github.com/kubernetes/kubernetes/issues/82587
● https://github.com/kubernetes/kubernetes/issues/46103
■ We need kube-proxy and Kubernetes Network Policies
■ We have to take care of conntrack :(
27. Node architecture: kube-proxy mode
● kube-proxy: ipvs mode
○ Works well technically (no locking issues/hacks!)
○ ipvsadm is a much better friend than iptables -t nat
○ IPVS is also chosen by some other tools like kube-router
○ Calico's performance comparison convinced us (https://www.projectcalico.org/comparing-kube-proxy-modes-iptables-or-ipvs/)
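Switching kube-proxy to IPVS is a one-line change in its configuration; a minimal sketch (the scheduler choice is illustrative, round-robin being the default):

```yaml
# Hypothetical KubeProxyConfiguration selecting IPVS mode.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin across pod endpoints
```

The resulting virtual services can then be inspected on a node with `ipvsadm -Ln`, which is far more readable than dumping the NAT table.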
29. Node architecture: network layer
● Interconnects nodes
○ Ensures pod-to-pod and pod-to-service communication
○ Can be fully private (our choice) or shared with the regular network
● Various ways to achieve it
○ Static routing
○ Dynamic routing (generally BGP)
○ VXLan VPN
○ IPIP VPN
● Multiple ways to allocate node CIDRs
○ Statically (enjoy)
○ Dynamically
30. Node architecture: network layer
Warning: reading this slide can drive your network engineers crazy
● Allocate two CIDRs for your cluster
○ 1 for nodes and pods
○ 1 for service IPs
● Don't be conservative; give thousands of IPs to K8S, each node requires a /24
○ CIDR /14 for nodes (up to 1024 nodes)
○ CIDR /16 for services (service IP randomness party)
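Expressed as controller-manager flags, the sizing above could look like this sketch (the actual ranges are illustrative private-range assumptions; a /14 carved into /24s yields 2^10 = 1024 node subnets):

```shell
# Hypothetical kube-controller-manager flags for the /14 + /16 split.
kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.64.0.0/14 \             # pod CIDR: up to 1024 nodes
  --node-cidr-mask-size=24 \                # one /24 per node
  --service-cluster-ip-range=10.96.0.0/16   # service IPs
```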
31. Node architecture: network layer
● Needs:
○ Each solution must learn the current node's CIDR through the API
○ The network mesh setup should be automagic
● Select the right solution
○ Flannel (default recommended one): VXLan, host-gw
○ Kube-router: IPIP or BGP
○ Calico: IPIP
○ WeaveNet: VXLan
32. Node architecture: network layer
First test: flannel in VXLan
● Works quite well
● Very easy setup
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
● Yes, it's like curl blah | bash
● No, we didn't install it like this :)
33. Node architecture: network layer
First test: flannel in VXLan (https://github.com/coreos/flannel)
● Before a big sale, we load-tested an app and… saw very bad network performance on nodes
○ iperf showed the outside network was good, around 9.8Gbps out of 10Gbps
○ Node-to-pod performance was at maximum too
○ Node-to-node over the regular network is around 9.7Gbps
○ Node-to-node over VXLan is around 3.2Gbps, and kernel load is very high
○ Investigation into the recommended way to run VXLan: offload VXLan to the network cards
○ That's not possible in our case since we use Libvirt/KVM VMs; discard VXLan
34. Node architecture: network layer
Second test: kube-router in BGP mode (https://www.kube-router.io/)
● Drops the need for network-card offloading
● Easy setup too
kubectl apply -f https://raw.githubusercontent.com/cloudnativelabs/kube-router/master/daemonset/kube-router-all-service-daemonset.yaml
● Don't forget to read the yaml and ensure you publish to the right cluster :)
● As suspected, using BGP restores the full bandwidth
● Other interesting features:
○ Service proxy (IPVS)
○ Network Policy support
○ Network LB using BGP
35. Node architecture: network layer
● Our choice:
○ The BGP choice is very nice
○ We can extend BGP to the fabric if needed in the future
○ We need network policy isolation for some sensitive apps
○ One binary for both network mesh and policies: less maintenance
37. Kubernetes is not magic: tooling
With the previous setup we have:
● An API
● Container scheduling
● Network communication
We still have some limits:
● No access from outside
● No DNS resolution
● No metrology/alerting
● Volatile logging on nodes
38. Tooling: DNS resolution
Two methods:
● External, using the host's resolv.conf: no DNS for intra-cluster communication; DNS works for external resources only
● Internal: in-cluster DNS records, enables service discovery
○ We need it, go ahead
39. Tooling: DNS resolution
Two main solutions:
● kube-dns: the legacy one, should not be used for new clusters
○ dnsmasq C layer, single-threaded
○ 3 containers for a single daemon?
● CoreDNS: the modern one
○ Multithreaded Golang implementation (goroutines)
○ 1 container only
● Some benchmarks (from the CoreDNS team, be careful)
○ https://coredns.io/2018/11/27/cluster-dns-coredns-vs-kube-dns/
40. Tooling: DNS resolution
● CoreDNS is the more reasonable choice.
● Our deployment
○ Deployed as a Kubernetes Deployment
○ Runs on master nodes (3 pods)
○ Configured as the default DNS service on every kubelet
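The slides don't show our actual CoreDNS configuration; a sketch close to the stock Corefile shipped with Kubernetes deployments (the cluster domain matches the kubelet's clusterDomain; all plugin names are standard CoreDNS plugins):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153          # expose /metrics for scraping
    forward . /etc/resolv.conf # external names go to the host resolvers
    cache 30
    loop
    reload
}
```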
41. Tooling: Access from outside
Ingress: access from outside of the cluster
Various choices on the market:
● Nginx (the default one)
● Traefik
● Envoy
● Kong
● Ambassador
● Haproxy
● And more...
42. Tooling: Access from outside
We studied five:
● Ambassador: promising but very young (https://www.getambassador.io/)
● Nginx: the OSS model of Nginx is unclear since F5 bought Nginx, Inc. (http://nginx.org/)
● HAProxy: a mature product, but its ingress is very young, and so are HTTP/2 and gRPC support (http://www.haproxy.org/)
● Kong: built on top of Nginx, it's not general-purpose but can be a very nice API gateway (https://konghq.com/kong/)
● Traefik: good licensing, mature and updated regularly (https://traefik.io/)
43. Tooling: Access from outside
Because of the risks on some products, we benchmarked Traefik:
● Kubernetes API ready
● HTTP/2 ready
● TLS 1.3 ready (Veepee minimum: TLS 1.2)
● Scalable & reactive configuration deployments
● TLS certificate reconfiguration in less than 10s
● Raw TCP/UDP balancing (Traefik v2)
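Routing traffic through Traefik v2's Kubernetes integration could be sketched with its IngressRoute CRD (all names, the hostname, and the port are hypothetical):

```yaml
# Hypothetical Traefik v2 IngressRoute exposing a service over TLS.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-app            # hypothetical
spec:
  entryPoints:
    - websecure           # TLS entry point
  routes:
    - match: Host(`my-app.example.com`)
      kind: Rule
      services:
        - name: my-app    # backing Kubernetes Service
          port: 80
  tls: {}                 # use the default certificate resolver/store
```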
44. Tooling: Access from outside
Traefik bench:
● Very good performance in the lab:
○ Tested using the k6 and ab tools
○ The test backend was a raw Golang HTTP service
○ HTTP: up to 10krps with 2 pods on a VM with 1 CPU and 2GB RAM
○ HTTPS: up to 6.3krps with 2 pods on a VM with 1 CPU and 2GB RAM
○ Scaling pods doesn't increase performance; it's sufficient anyway
45. Tooling: Access from outside
Traefik bench:
● Load testing with a real product:
○ More than 1krps
○ A not-so-recent .NET Core app
○ The .NET Core app isn't container-aware and suffers from some contention
○ Anyway, the rate is sufficient for the sale: go ahead to prod
○ During a big sale event we sold ~32k concert tickets in 1h40 without problems
46. Tooling: Access from outside
Traefik bench:
● Before the production sale:
○ We increased nodes from 2 to 3
○ We increased the application from 2 to 10 instances
● Production sale day (starting at 7am):
○ No incident
○ We sold 32k concert tickets in 1h40
48. Tooling: metrology/alerting
Implementation:
● Pods expose a /metrics endpoint through their HTTP listener
● Prometheus scrapes it
● Writing the Prometheus scraping configuration by hand is painful
● Thankfully there is: https://github.com/coreos/kube-prometheus
49. Tooling: metrology/alerting
● kube-prometheus implementation:
○ HA Prometheus instances
○ HA Alertmanager instances
○ Grafana for a local metrics view (not reusable for anything else)
○ Gathers node metrics
○ ServiceMonitor Kubernetes API extension object
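With kube-prometheus, scrape targets are declared through ServiceMonitor objects rather than hand-written scrape configs; a minimal sketch (all names, labels, and the interval are hypothetical):

```yaml
# Hypothetical ServiceMonitor: scrape /metrics of every Service
# labelled app: my-app, every 30 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app            # hypothetical
spec:
  selector:
    matchLabels:
      app: my-app         # matches the target Service's labels
  endpoints:
    - port: http          # named Service port exposing /metrics
      path: /metrics
      interval: 30s
```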
54. Tooling: logging
How to retrieve logs properly?
● Logging is volatile on containers
● On Docker hosts: just mount a volume from the host and write to it
● On K8S: I don't know where my container runs, I don't know the host, the host doesn't want me to write on it, help me doctor!
55. Tooling: logging
● You can prevent open heart surgery in production by knowing the rules
56. Tooling: logging
● Never write logs to disk
○ If you need to, use a sidecar to read them, and don't forget rotation!
● Write to stdout/stderr in a parsable way
○ JSON comes to the rescue: known by every development language, easy to serialize & implement
● Choose a software to gather container logs and push them:
○ filebeat
○ fluentd
○ fluentbit
○ logstash
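The "JSON on stdout/stderr" rule above takes only a few lines in any language; a minimal sketch in Python (the field names ts/level/message are illustrative, not a Veepee convention):

```python
import datetime
import json
import sys


def log(level, message, **fields):
    """Emit one JSON log line on stdout; collectors like fluentd parse it."""
    record = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "level": level,
        "message": message,
        **fields,  # arbitrary structured context
    }
    sys.stdout.write(json.dumps(record) + "\n")


log("info", "order created", order_id=42)
```

One line per event keeps the stream trivially splittable, and the collector only needs a JSON parser instead of per-app regexes.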
57. Tooling: logging
● Our choice: fluentd
○ CNCF graduated (https://www.cncf.io/announcement/2019/04/11/cncf-announces-fluentd-graduation/)
○ Some features we need are in fluentd but not in fluentbit
○ Already used by many SREs at Veepee
● Our deployment model: K8S DaemonSet
○ Rolling-upgrade flexibility
○ Ensures logs are gathered on each running node
○ Ensures the configuration is the same everywhere
60. Tooling: client/product isolation
Need:
● Ensure a client or product will not steal another's CPU/memory/disk resources
Two work axes:
● Node-level isolation
● Pod-level isolation
61. Tooling: client/product isolation
Work axis: node level
● Ensure a client (tribe) or a product owns the underlying node
● Billing per customer
● Resources per customer, then SRE team
Solution:
● Use an enforced NodeSelector on namespaces
scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform
○ Pods can only be scheduled on nodes carrying at least those labels
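The annotation above sits on the Namespace object and is enforced by the PodNodeSelector admission controller from slide 15; a sketch (the namespace name is hypothetical, the annotation value comes from the slide):

```yaml
# Hypothetical Namespace whose pods are pinned to tribe-owned nodes
# via the PodNodeSelector admission controller.
apiVersion: v1
kind: Namespace
metadata:
  name: foundation-platform   # hypothetical
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform
```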
62. Tooling: client/product isolation
Work axis: pod level
● Ensure pods don't steal other pods' resources
● Ensure the scheduler makes the right node choice according to available resources
● Forbid pod allocation if no resources are available (no overcommit)
Solution:
● LimitRanges
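A LimitRange lives in a namespace and gives every container defaults and hard caps; a sketch with illustrative values (none of the numbers come from the slides):

```yaml
# Hypothetical LimitRange: default requests/limits plus hard caps
# for every container in the namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: pod-limits            # hypothetical
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when the pod declares no request
        cpu: 100m
        memory: 128Mi
      default:                # applied when the pod declares no limit
        cpu: 500m
        memory: 512Mi
      max:                    # hard cap: pods asking for more are rejected
        cpu: "2"
        memory: 2Gi
```

Combined with requests equal to limits, this prevents overcommit: the scheduler only places a pod where the requested resources actually fit.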