4. Components
● Control plane
○ Storage (etcd)
○ API
○ Scheduler
○ Controller-manager
● Nodes
○ Container runtime
○ Node agent (kubelet)
○ Service proxy
○ Network agent
5. Components : storage
● Key-value store
● Raft-based distributed storage
● Client-to-server & server-to-server TLS support
● Incubating at the CNCF
Project page : https://etcd.io/
6. Components : API server
● Stores data in etcd
● Stateless REST API
● HTTP/2 + TLS
● gRPC support:
○ WATCH events over HTTP
○ Reactive, event-based triggers on Kubernetes components
7. Components : Scheduler
● Connected to the API server only
● Watches for pod objects
● Selects the node to run on based on criteria:
○ Hardware (available CPU, CPU architecture, available memory, disk space)
○ (Anti-)affinity patterns
○ Policy constraints (labels)
● Single active leader per quorum (token held in etcd)
8. Components : Controller manager
● Core controllers:
○ Node: monitors node status responses
○ Replication: ensures the pod count on replication controllers
○ Endpoints: maintains Endpoints objects for Services
○ Namespace: creates default ServiceAccounts & tokens
● Single active leader per quorum (token held in etcd)
9. Node components
● Container runtime: runs containers (Docker, containerd.io…)
● Node agent (kubelet): connects to the API server to handle containers & volumes
● Service proxy: load-balances service IPs to pod endpoints
● Network agent: connects nodes together (flannel, calico, kube-router…)
11. Datacenter deployment
● 3 Kubernetes clusters per datacenter:
○ Benchmark
○ Staging
○ Production
● No cross-DC cluster: no DC split-brain situation to manage
12. Etcd deployment
● 3 etcd per datacenter
○ TLSv1.2 enabled
○ Authentication over TLSv1.2 enabled
○ Hardware : 4 CPU, 32GB RAM
○ OS : Debian 10.1
○ Version 3.4 enabled:
■ Reduced latency
■ Large write-performance improvements
■ Reads not affected by commits
■ Will be the default version as of K8S 1.17
■ See : https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/
13. API server deployment
● API version: 1.15.x (old clusters) and 1.16.x (new clusters)
● 2 API servers load-balanced by HAProxy (TCP mode)
○ Horizontally scalable
○ Vertically scalable
○ Current setup : 4 CPU, 32GB RAM
○ OS : Debian 10.1
● The API servers load-balance etcd themselves
○ We discovered a bug in k8s < 1.16.3 when using TLS; ensure you run at least this version
○ Issue: https://github.com/kubernetes/kubernetes/issues/83028
15. API server deployment
● Enabled/enforced features (admission controllers):
○ LimitRanger: resource limitation validator
○ NodeRestriction: limits kubelet permissions on node/pod objects
○ PodSecurityPolicy: security policies to run pods
○ PodNodeSelector: limits node selection for pods
● See the full list of admission controllers here:
○ https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers
● Enabled extra feature: secret encryption in etcd using AES-256
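The slides don't show the actual API server configuration; a minimal sketch of what the etcd secret-encryption setup could look like (key name and placeholder secret are illustrative — the file is passed to the API server via `--encryption-provider-config`, and `aescbc` with a 32-byte key gives AES-256):

```yaml
# Hypothetical EncryptionConfiguration enabling AES-256-CBC for Secrets.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      # identity last: still able to read pre-existing unencrypted secrets
      - identity: {}
```

The admission controllers listed above would be enabled with a flag such as `--enable-admission-plugins=LimitRanger,NodeRestriction,PodSecurityPolicy,PodNodeSelector` on the API server.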
16. Controller-Manager & scheduler deployment
● 3 nodes per DC
○ Each runs a scheduler
○ Each runs a controller manager
○ Hardware: 2 CPU, 8GB RAM
○ OS: Debian 10.1
17. Controller-Manager & scheduler deployment
● Enabled features on the controller-manager: all defaults, plus
○ BootstrapSigner: authenticates kubelets on cluster join
○ TokenCleaner: cleans expired tokens
● Supplementary features on the scheduler:
○ NodeRestrictions: restrict pods to some nodes
20. Node architecture: container runtime
● Valid choice: Docker (https://www.docker.com/)
○ The default one
○ Known by “everyone” in the container world
○ Owned by a company
○ Simple to use
21. Node architecture: container runtime
● Valid choice: containerd (https://containerd.io/)
○ Younger than Docker
○ Extracted from Docker
○ CNCF-hosted project
○ Some limitations:
■ No Docker API v1!
■ K8S integration is poorly documented
22. Node architecture: container runtime
● Veepee choice: containerd
○ Supported by the CNCF and the community
○ Used by Docker as its underlying container runtime
○ We use Artifactory, and Docker API v2 is fully supported
○ Smaller footprint, less code, lower latency for the kubelet
23. Node architecture: system configuration
● Pod DNS configuration
○ clusterDomain: root DNS name for the pods/services
○ clusterDNS: DNS servers configured on pods
■ except if hostNetwork: true and the pod DNS policy is the default
● Protect the system from pods: ensure node system daemons can run
○ 128MiB memory reserved
○ 0.2 CPU reserved
○ Disk soft & hard limits
■ Soft: don't allow new pods to run if the limit is reached
■ Hard: evict pods if the limit is reached
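A kubelet configuration implementing the points above could look roughly like this sketch (the cluster DNS IP and the eviction thresholds are illustrative assumptions; only the 128MiB/0.2 CPU reservations come from the slide):

```yaml
# Hypothetical KubeletConfiguration: pod DNS + system protection.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDomain: cluster.local        # root DNS name for pods/services
clusterDNS:
  - 10.96.0.10                      # illustrative in-cluster DNS service IP
systemReserved:
  memory: 128Mi                     # keep node daemons alive
  cpu: 200m                         # 0.2 CPU reserved
evictionSoft:
  nodefs.available: "15%"           # soft limit: stop admitting pods
evictionSoftGracePeriod:
  nodefs.available: 2m
evictionHard:
  nodefs.available: "10%"           # hard limit: evict pods
```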
24. Node architecture: service proxy
● Exposes K8S service IPs on nodes to access pods
● Multiple mechanisms
○ iptables
○ IPVS
○ External load balancer (e.g. AWS ELB in layer 4 or layer 7)
● Multiple implementations
○ kube-proxy (iptables, ipvs)
○ kube-router (ipvs)
○ Calico
○ ...
25. Node architecture: service proxy
● Veepee choice: kube-proxy
○ Stay close to the Kubernetes distribution: don't add more complexity
○ No default need for layer-7 load balancing (service type: LoadBalancer); it can be added as an extra proxy in the future
○ Next challenge: iptables vs IPVS
26. Node architecture: kube-proxy mode
● kube-proxy: iptables mode
○ Default recommended mode (faster)
○ Works quite well… but:
■ Doesn't integrate with Debian 10 and later (thanks to Debian's iptables-nftables tooling) => restore the legacy iptables mode
■ Has locking problems when multiple programs need it
● https://github.com/weaveworks/weave/issues/3351
● https://github.com/kubernetes/kubernetes/issues/82587
● https://github.com/kubernetes/kubernetes/issues/46103
■ We need kube-proxy and Kubernetes Network Policies
■ We have to take care of conntrack :(
27. Node architecture: kube-proxy mode
● kube-proxy: ipvs mode
○ Works well technically (no locking issues/hacks!)
○ ipvsadm is a much better friend than iptables -t nat
○ IPVS is also chosen by some other tools like kube-router
○ Calico's performance comparison convinced us (https://www.projectcalico.org/comparing-kube-proxy-modes-iptables-or-ipvs/)
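Switching kube-proxy to IPVS is a one-line change in its configuration; a minimal sketch (the scheduler choice is illustrative, round-robin being the default):

```yaml
# Hypothetical KubeProxyConfiguration selecting IPVS mode.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin across pod endpoints
```

The resulting virtual services can then be inspected on a node with `ipvsadm -Ln`, which is far more readable than dumping the NAT table.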
29. Node architecture: network layer
● Interconnects nodes
○ Ensures pod-to-pod and pod-to-service communication
○ Can be fully private (our choice) or shared with the regular network
● Various ways to achieve it
○ Static routing
○ Dynamic routing (generally BGP)
○ VXLan VPN
○ IPIP VPN
● Multiple ways to allocate node CIDRs
○ Statically (enjoy)
○ Dynamically
30. Node architecture: network layer
Warning: reading this slide can drive your network engineers crazy
● Allocate two CIDRs for your cluster
○ 1 for nodes and pods
○ 1 for service IPs
● Don't be conservative; give thousands of IPs to K8S, each node requires a /24
○ CIDR /14 for nodes (up to 1024 nodes)
○ CIDR /16 for services (service IP randomness party)
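Expressed as controller-manager flags, the sizing above could look like this sketch (the actual ranges are illustrative private-range assumptions; a /14 carved into /24s yields 2^10 = 1024 node subnets):

```shell
# Hypothetical kube-controller-manager flags for the /14 + /16 split.
kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.64.0.0/14 \             # pod CIDR: up to 1024 nodes
  --node-cidr-mask-size=24 \                # one /24 per node
  --service-cluster-ip-range=10.96.0.0/16   # service IPs
```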
31. Node architecture: network layer
● Needs:
○ Each solution must learn the current node's CIDR through the API
○ The network mesh setup should be automagic
● Select the right solution
○ Flannel (default recommended one): VXLan, host-gw
○ Kube-router: IPIP or BGP
○ Calico: IPIP
○ WeaveNet: VXLan
32. Node architecture: network layer
First test: flannel in VXLan
● Works quite well
● Very easy setup
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
● Yes, it's like curl blah | bash
● No, we didn't install it like this :)
33. Node architecture: network layer
First test: flannel in VXLan (https://github.com/coreos/flannel)
● Before a big sale, we load-tested an app and… saw very bad network performance on nodes
○ iperf showed the outside network was good, around 9.8Gbps out of 10Gbps
○ Node-to-pod performance was at maximum too
○ Node-to-node over the regular network is around 9.7Gbps
○ Node-to-node over VXLan is around 3.2Gbps, and kernel load is very high
○ Investigation into the recommended way to run VXLan: offload VXLan to the network cards
○ That's not possible in our case since we use Libvirt/KVM VMs; discard VXLan
34. Node architecture: network layer
Second test: kube-router in BGP mode (https://www.kube-router.io/)
● Drops the need for network-card offloading
● Easy setup too
kubectl apply -f https://raw.githubusercontent.com/cloudnativelabs/kube-router/master/daemonset/kube-router-all-service-daemonset.yaml
● Don't forget to read the yaml and ensure you publish to the right cluster :)
● As suspected, using BGP restores the full bandwidth
● Other interesting features:
○ Service proxy (IPVS)
○ Network Policy support
○ Network LB using BGP
35. Node architecture: network layer
● Our choice:
○ The BGP choice is very nice
○ We can extend BGP to the fabric if needed in the future
○ We need network policy isolation for some sensitive apps
○ One binary for both network mesh and policies: less maintenance
37. Kubernetes is not magic: tooling
With the previous setup we have:
● An API
● Container scheduling
● Network communication
We still have some limits:
● No access from outside
● No DNS resolution
● No metrology/alerting
● Volatile logging on nodes
38. Tooling: DNS resolution
Two methods:
● External, using the host's resolv.conf: no DNS for intra-cluster communication; DNS works for external resources only
● Internal: in-cluster DNS records, enables service discovery
○ We need it, go ahead
39. Tooling: DNS resolution
Two main solutions:
● kube-dns: the legacy one, should not be used for new clusters
○ dnsmasq C layer, single-threaded
○ 3 containers for a single daemon?
● CoreDNS: the modern one
○ Multithreaded Golang implementation (goroutines)
○ 1 container only
● Some benchmarks (from the CoreDNS team, be careful)
○ https://coredns.io/2018/11/27/cluster-dns-coredns-vs-kube-dns/
40. Tooling: DNS resolution
● CoreDNS is the more reasonable choice.
● Our deployment
○ Deployed as a Kubernetes Deployment
○ Runs on master nodes (3 pods)
○ Configured as the default DNS service on every kubelet
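The slides don't show our actual CoreDNS configuration; a sketch close to the stock Corefile shipped with Kubernetes deployments (the cluster domain matches the kubelet's clusterDomain; all plugin names are standard CoreDNS plugins):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153          # expose /metrics for scraping
    forward . /etc/resolv.conf # external names go to the host resolvers
    cache 30
    loop
    reload
}
```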
41. Tooling: Access from outside
Ingress: access from outside of the cluster
Various choices on the market:
● Nginx (the default one)
● Traefik
● Envoy
● Kong
● Ambassador
● Haproxy
● And more...
42. Tooling: Access from outside
We studied five:
● Ambassador: promising but very young (https://www.getambassador.io/)
● Nginx: the OSS model of Nginx is unclear since F5 bought Nginx, Inc. (http://nginx.org/)
● HAProxy: a mature product, but its ingress is very young, and so are HTTP/2 and gRPC support (http://www.haproxy.org/)
● Kong: built on top of Nginx, it's not general-purpose but can be a very nice API gateway (https://konghq.com/kong/)
● Traefik: good licensing, mature and updated regularly (https://traefik.io/)
43. Tooling: Access from outside
Because of the risks on some products, we benchmarked Traefik:
● Kubernetes API ready
● HTTP/2 ready
● TLS 1.3 ready (Veepee minimum: TLS 1.2)
● Scalable & reactive configuration deployments
● TLS certificate reconfiguration in less than 10s
● Raw TCP/UDP balancing (Traefik v2)
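Routing traffic through Traefik v2's Kubernetes integration could be sketched with its IngressRoute CRD (all names, the hostname, and the port are hypothetical):

```yaml
# Hypothetical Traefik v2 IngressRoute exposing a service over TLS.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-app            # hypothetical
spec:
  entryPoints:
    - websecure           # TLS entry point
  routes:
    - match: Host(`my-app.example.com`)
      kind: Rule
      services:
        - name: my-app    # backing Kubernetes Service
          port: 80
  tls: {}                 # use the default certificate resolver/store
```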
44. Tooling: Access from outside
Traefik bench:
● Very good performance in the lab:
○ Tested using the k6 and ab tools
○ The test backend was a raw Golang HTTP service
○ HTTP: up to 10krps with 2 pods on a VM with 1 CPU and 2GB RAM
○ HTTPS: up to 6.3krps with 2 pods on a VM with 1 CPU and 2GB RAM
○ Scaling pods doesn't increase performance; it's sufficient anyway
45. Tooling: Access from outside
Traefik bench:
● Load testing with a real product:
○ More than 1krps
○ A not-so-recent .NET Core app
○ The .NET Core app isn't container-aware and suffers from some contention
○ Anyway, the rate is sufficient for the sale: go ahead to prod
○ During a big sale event we sold ~32k concert tickets in 1h40 without problems
46. Tooling: Access from outside
Traefik bench:
● Before the production sale:
○ We increased nodes from 2 to 3
○ We increased the application from 2 to 10 instances
● Production sale day (starting at 7am):
○ No incident
○ We sold 32k concert tickets in 1h40
48. Tooling: metrology/alerting
Implementation:
● Pods expose a /metrics endpoint through their HTTP listener
● Prometheus scrapes it
● Writing the Prometheus scraping configuration by hand is painful
● Thankfully there is: https://github.com/coreos/kube-prometheus
49. Tooling: metrology/alerting
● kube-prometheus implementation:
○ HA Prometheus instances
○ HA Alertmanager instances
○ Grafana for a local metrics view (not reusable for anything else)
○ Gathers node metrics
○ ServiceMonitor Kubernetes API extension object
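With kube-prometheus, scrape targets are declared through ServiceMonitor objects rather than hand-written scrape configs; a minimal sketch (all names, labels, and the interval are hypothetical):

```yaml
# Hypothetical ServiceMonitor: scrape /metrics of every Service
# labelled app: my-app, every 30 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app            # hypothetical
spec:
  selector:
    matchLabels:
      app: my-app         # matches the target Service's labels
  endpoints:
    - port: http          # named Service port exposing /metrics
      path: /metrics
      interval: 30s
```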
54. Tooling: logging
How to retrieve logs properly?
● Logging is volatile on containers
● On Docker hosts: just mount a volume from the host and write to it
● On K8S: I don't know where my container runs, I don't know the host, the host doesn't want me to write on it, help me doctor!
55. Tooling: logging
● You can prevent open heart surgery in production by knowing the rules
56. Tooling: logging
● Never write logs to disk
○ If you need to, use a sidecar to read them, and don't forget rotation!
● Write to stdout/stderr in a parsable way
○ JSON comes to the rescue: known by every development language, easy to serialize & implement
● Choose a software to gather container logs and push them:
○ filebeat
○ fluentd
○ fluentbit
○ logstash
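The "JSON on stdout/stderr" rule above takes only a few lines in any language; a minimal sketch in Python (the field names ts/level/message are illustrative, not a Veepee convention):

```python
import datetime
import json
import sys


def log(level, message, **fields):
    """Emit one JSON log line on stdout; collectors like fluentd parse it."""
    record = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "level": level,
        "message": message,
        **fields,  # arbitrary structured context
    }
    sys.stdout.write(json.dumps(record) + "\n")


log("info", "order created", order_id=42)
```

One line per event keeps the stream trivially splittable, and the collector only needs a JSON parser instead of per-app regexes.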
57. Tooling: logging
● Our choice: fluentd
○ CNCF graduated (https://www.cncf.io/announcement/2019/04/11/cncf-announces-fluentd-graduation/)
○ Some features we need are in fluentd but not in fluentbit
○ Already used by many SREs at Veepee
● Our deployment model: K8S DaemonSet
○ Rolling-upgrade flexibility
○ Ensures logs are gathered on each running node
○ Ensures the configuration is the same everywhere
60. Tooling: client/product isolation
Need:
● Ensure a client or product will not steal another's CPU/memory/disk resources
Two work axes:
● Node-level isolation
● Pod-level isolation
61. Tooling: client/product isolation
Work axis: node level
● Ensure a client (tribe) or a product owns the underlying node
● Billing per customer
● Resources per customer, then SRE team
Solution:
● Use an enforced NodeSelector on namespaces
scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform
○ Pods can only be scheduled on nodes carrying at least those labels
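The annotation above sits on the Namespace object and is enforced by the PodNodeSelector admission controller from slide 15; a sketch (the namespace name is hypothetical, the annotation value comes from the slide):

```yaml
# Hypothetical Namespace whose pods are pinned to tribe-owned nodes
# via the PodNodeSelector admission controller.
apiVersion: v1
kind: Namespace
metadata:
  name: foundation-platform   # hypothetical
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform
```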
62. Tooling: client/product isolation
Work axis: pod level
● Ensure pods don't steal other pods' resources
● Ensure the scheduler makes the right node choice according to available resources
● Forbid pod allocation if no resources are available (no overcommit)
Solution:
● LimitRanges
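A LimitRange lives in a namespace and gives every container defaults and hard caps; a sketch with illustrative values (none of the numbers come from the slides):

```yaml
# Hypothetical LimitRange: default requests/limits plus hard caps
# for every container in the namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: pod-limits            # hypothetical
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when the pod declares no request
        cpu: 100m
        memory: 128Mi
      default:                # applied when the pod declares no limit
        cpu: 500m
        memory: 512Mi
      max:                    # hard cap: pods asking for more are rejected
        cpu: "2"
        memory: 2Gi
```

Combined with requests equal to limits, this prevents overcommit: the scheduler only places a pod where the requested resources actually fit.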