Kubernetes and OpenStack at Scale at OpenStack Summit Boston 2017
Imagine being able to stand up thousands of tenants with thousands of apps, running thousands of Docker-formatted container images and routes, all on a self-healing cluster and elastic infrastructure. Now take that one step further: all of those images updatable through a single upload to the registry, with zero downtime. In this session, you will see just that.
In this presentation, we will walk through a recent benchmarking deployment on the Cloud Native Computing Foundation's (CNCF's) 1,000-node cluster using OpenStack and Red Hat's OpenShift Container Platform, the enterprise-ready Kubernetes for developers.
You'll also hear what's been happening in subsequent rounds of testing in Red Hat's own scale lab and the CNCF cluster, and how we are working with the relevant open source communities, including OpenStack, Kubernetes, and Ansible, to continue to raise the bar for horizontal scaling of these platforms via community-powered innovation.
1. KUBERNETES AND OPENSTACK AT SCALE
Will it blend?
Stephen Gordon (@xsgordon)
Principal Product Manager, Red Hat
May 8th, 2017
2. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT2
ONCE UPON A TIME...
Part 1
● 1000 OpenShift Container Platform 3.3 / Kubernetes 1.3 nodes on OpenStack infrastructure
● Presented methodology and results in Barcelona:
○ https://www.cncf.io/blog/2016/08/23/deploying-1000-nodes-of-openshift-on-the-cncf-cluster-part-1/
● Goals were:
○ Push limits
○ Identify best practices
○ Document best practices
○ Fix issues
3. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT3
FOR OUR NEXT TRICK!
Part 2
● Goals:
○ 2048 OpenShift Container Platform 3.5 / Kubernetes 1.5 nodes on OpenStack infrastructure
○ Network ingress tier saturation test
○ Overlay2 graph driver w/ SELinux test
○ Persistent volume scalability and performance test of Container Native Storage (glusterfs)
4. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT4
KUBERNETES SCALABILITY SIG
Scalability SIG SLAs:
● API responsiveness
○ 99% of calls return in < 1 s
● Pod startup time
○ 99% of pods start within 5s*
The SIG also defines a number of other primary and derived metrics.
* With pre-pulled images
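The SLAs above are percentile checks. A minimal sketch of how such a check could be evaluated, with made-up latency samples purely for illustration:

```python
# Check Scalability SIG-style SLAs against collected samples:
# 99% of API calls must return in under 1 s, and 99% of pods must
# start within 5 s. The sample data below is illustrative only.

def percentile(samples, pct):
    """Return the value at the given percentile (nearest-rank method)."""
    ordered = sorted(samples)
    k = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[k]

api_latencies_s = [0.05, 0.08, 0.12, 0.30, 0.45, 0.70, 0.90, 0.95, 0.98, 1.20]
pod_startup_s = [1.1, 1.4, 2.0, 2.2, 2.8, 3.0, 3.3, 3.9, 4.2, 4.8]

api_ok = percentile(api_latencies_s, 99) < 1.0   # fails: slowest call is 1.2 s
pods_ok = percentile(pod_startup_s, 99) < 5.0    # passes: slowest start is 4.8 s
print(api_ok, pods_ok)
```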
5. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT5
A CONTAINER STACK FOR OPENSTACK
OPENSTACK + KUBERNETES
A wild solution appears...
Consumption of resources: able to easily access new environments to quickly build new apps and move on.
Exposition of resources: provide necessary environments to developers in minutes, not weeks or months.
6. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT6
A CONTAINER STACK FOR OPENSTACK
A wild solution appears...
OPENSTACK + OPENSHIFT
Consumption of resources: integrated platform to run, orchestrate, monitor, and scale containers. Built around Kubernetes and Docker.
Exposition of resources: provide necessary environments to developers in minutes, not weeks or months.
10. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT10
HOW TO TEST?
System Verification Test suite (SVT)
● Red Hat OpenShift Performance and Scalability team’s
upstream test suites:
○ Application Performance
○ Application Scalability
○ OpenShift Performance
○ OpenShift Scalability (incl. cluster-loader)
○ Networking Performance
○ Reliability/Longevity
● Also includes some additional tools e.g. image provisioner
● https://github.com/openshift/svt
11. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT11
ARCHITECTURE
Baremetal Cluster (100 nodes)
OpenShift-on-OpenStack Cluster (2048 nodes)
12. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT12
ARCHITECTURE (cont.)
● Software:
○ Red Hat OpenStack Platform 10, based on “Newton”
○ OpenShift Container Platform 3.5 (built around K8S 1.5)
○ Red Hat Enterprise Linux 7.3 (mostly…)
● Deployment:
○ Deployed OpenStack + Ceph using TripleO
○ Deployed OpenShift Container Platform using openshift-ansible.
● Applying previous learnings
○ Storage architecture
○ Image formatting
○ Pre-baked images (see image_provisioner tool)
14. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT14
NETWORK INGRESS/ROUTING TIER
Testing HAProxy Performance
● Load generator itself runs in a pod.
● Added SNI and TLS variants to the test suite.
● Configuration by passing in configmaps.
● Focused in on HTTP with keepalive and TLS terminated at the edge.
projects:
- num: 1
basename: centos-stress
ifexists: delete
tuning: default
templates:
- num: 1
file: ./content/quickstarts/stress/stress-pod.json
parameters:
- RUN: "wrk" # which app to execute inside WLG pod
- RUN_TIME: "120" # benchmark run-time in seconds
- PLACEMENT: "test" # Placement of the WLG pods based on node label
- WRK_DELAY: "100" # maximum delay between client requests in ms
- WRK_TARGETS: "^cakephp-" # extended RE (egrep) to filter target routes
- WRK_CONNS_PER_THREAD: "1" # how many connections per worker thread/route
- WRK_KEEPALIVE: "y" # use HTTP keepalive [yn]
- WRK_TLS_SESSION_REUSE: "y" # use TLS session reuse [yn]
- URL_PATH: "/" # target path for HTTP(S) requests
15. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT15
NETWORK INGRESS/ROUTING TIER
Testing HAProxy Performance (cont.)
● 1p-mix-cpu*: nbproc=1, run on any CPU
● 1p-mix-cpu0: nbproc=1, run on core 0
● 1p-mix-cpu1: nbproc=1, run on core 1
● 1p-mix-cpu2: nbproc=1, run on core 2
● 1p-mix-cpu3: nbproc=1, run on core 3
● 1p-mix-mc10x: nbproc=1, run on any core,
sched_migration_cost=5000000
● 2p-mix-cpu*: nbproc=2, run on any core
● 4p-mix-cpu02: nbproc=4, run on core 2
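The nbproc and pinning variants above correspond to HAProxy's global configuration. As a sketch, the 2-process run with explicit core pinning might look like the following (values illustrative, not the exact test configuration):

```haproxy
global
    # spawn two HAProxy worker processes (the "2p" variants above)
    nbproc 2
    # optionally pin each process to a core, as in the "cpu0".."cpu3" runs
    cpu-map 1 0
    cpu-map 2 1
```

Leaving `cpu-map` out reproduces the "cpu*" (run on any core) variants, where placement is left to the kernel scheduler.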
17. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT17
NETWORK PERFORMANCE
Testing OpenShift-sdn (OVS+VXLAN) Performance
● OpenShift includes and uses OpenShift-sdn (OpenvSwitch + VXLAN) by
default:
○ Provides full multi-tenancy
○ Is fully pluggable (as is ingress/routing tier)
○ Supports all four footprints (physical/virtual/private/public)
● Web-based workloads are mostly transactional
● Focused microbenchmark on a ping-pong test of varying payload sizes
18. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT18
NETWORK PERFORMANCE
Testing OpenShift-sdn (OVS+VXLAN) Performance (cont.)
● Tested mix of payload sizes and stream counts.
● tcp_rr-XXB-Yi
○ XX = # of bytes
○ Y = # of instances (streams)
● Slimmed down version of RFC2544
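The tcp_rr-style ping-pong exchange can be illustrated with a minimal sketch over loopback; this is a stand-in for the actual benchmarking tooling, with payload size and transaction count chosen purely for illustration:

```python
# Minimal tcp_rr-style ping-pong over loopback: the client sends a fixed-size
# payload, the server echoes it back, and each completed round trip counts as
# one "transaction".
import socket
import threading

PAYLOAD = 64        # bytes per request, like the "64B" variants above
TRANSACTIONS = 500  # fixed transaction count instead of a fixed run time

def recv_exact(conn, n):
    """Read exactly n bytes (TCP recv may return partial data)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

def echo_server(listener):
    conn, _ = listener.accept()
    with conn:
        for _ in range(TRANSACTIONS):
            conn.sendall(recv_exact(conn, PAYLOAD))  # echo the payload back

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # ephemeral port on loopback
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

payload = b"x" * PAYLOAD
completed = 0
with socket.create_connection(("127.0.0.1", port)) as client:
    for _ in range(TRANSACTIONS):
        client.sendall(payload)
        if recv_exact(client, PAYLOAD) == payload:
            completed += 1  # one full request/response = one transaction
print(completed)
```

Timing a fixed run instead of a fixed transaction count, and varying PAYLOAD and the number of parallel streams, yields the tcp_rr-XXB-Yi matrix described above.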
20. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT20
OVERLAY2 w/ SELINUX
Next on storage wars...
● Until recently RHEL used Device Mapper for docker’s storage graph driver
○ Overlay support added in RHEL 7.2
○ Overlay2 supported added in RHEL 7.3
○ Overlay2 support w/ SELinux added upstream and expected in RHEL 7.4
■ https://lkml.org/lkml/2016/7/5/409
○ Device Mapper remains default in RHEL for now, Overlay2 default in Fedora 26
■ https://fedoraproject.org/wiki/Changes/DockerOverlay2
● Let’s try it out!
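Switching the graph driver is a daemon-level setting. One common way to express it is an /etc/docker/daemon.json fragment like the following (illustrative, not necessarily how the test hosts were configured, which may have used /etc/sysconfig/docker-storage instead):

```json
{
  "storage-driver": "overlay2",
  "selinux-enabled": true
}
```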
21. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT21
OVERLAY2 w/ SELINUX
Results
● Single base image for all pods
● 240 pods on the node (rate limited creation)
● Reasonable memory savings
23. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT23
CONTAINER NATIVE STORAGE
Approach
● OpenShift Container Platform supports a wide variety of volume providers via the standard Kubernetes volume interface
● Red Hat Container Native Storage is a Gluster-based persistent volume provider deployed on OpenShift
● Used the NVMe disks as “bricks” for Gluster, exposed 1G persistent volumes
● Container Native Storage nodes marked unschedulable for other OpenShift pods
● Ran throughput numbers for create/delete operations, as well as API parallelism
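The 1G persistent volumes were requested through the standard Kubernetes claim flow. A sketch of such a claim follows; the claim name, storage class name, and use of the beta annotation (the Kubernetes 1.5-era mechanism for dynamic provisioning) are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cns-claim
  annotations:
    # pre-1.6 dynamic provisioning used the beta storage-class annotation
    volume.beta.kubernetes.io/storage-class: glusterfs-cns
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```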
24. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT24
CONTAINER NATIVE STORAGE
Results
● CNS allocated volumes in constant time
● Consistent with results for other persistent volume providers
26. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT26
NEXT STEPS
To infinity, and beyond!
● Filed 40+ bugs across a variety of projects and components
● Scaling and Performance Guide, new with OpenShift Container Platform 3.5
● Getting Involved
○ “Kubernetes Ops on OpenStack” forum session
■ Wednesday, May 10, 1:50pm-2:30pm
■ Hynes Convention Center MR102
○ K8S SIG Scalability
○ K8S SIG OpenStack
27. KUBERNETES AND OPENSTACK AT SCALE #OPENSTACKSUMMIT #REDHAT27
REFERENCES
● Part 1: https://www.cncf.io/blog/2016/08/23/deploying-1000-nodes-of-openshift-on-the-cncf-cluster-part-1/
● Part 2: https://www.cncf.io/blog/2017/03/28/deploying-2048-openshift-nodes-cncf-cluster-part-2/
● Overlay2 and Device Mapper: https://developers.redhat.com/blog/2016/10/25/docker-project-can-you-have-overlay2-speed-and-density-with-devicemapper-yep/
● Red Hat Performance and Scale Trello:
https://trello.com/b/M1bpo55E/scalability
Goals were:
Push the system to its limit, incl. ensuring we can reproduce work done in the community with Kubernetes upstream, incl. SIG Scalability (will come to this in a minute)
Identify config changes and best practices to increase capacity and performance
Document and file issues upstream and send patches where applicable
Saturation test for OpenShift’s HAProxy-based network ingress tier
Overlay2 graph driver and SELinux support from kernel v4.9
Persistent volume scalability and performance using Red Hat’s Container-Native Storage (CNS) product (Gluster-based)
Saturation test for OpenShift’s integrated container registry and CI/CD pipeline
Primary metrics include:
Max cores per cluster
Max pods per core
Management overhead per node
Management overhead per cluster
Derived metrics include:
Max cores per node
Max pods per machine
Max machines per cluster
Max pods per cluster
End-to-end pod startup time
Scheduler throughput
Max cluster saturation time
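Several of the derived metrics are simple products of the primary limits. A sketch using figures that appear elsewhere in this deck (250 pods per node, 2,048 nodes); the scheduler throughput value is an assumption for illustration only:

```python
# Derived cluster-level capacity from per-node limits.
# 250 pods/node is the figure quoted in the scaling notes later in this deck;
# 2,048 nodes is the Part 2 cluster size.
pods_per_node = 250
nodes_per_cluster = 2048

max_pods_per_cluster = pods_per_node * nodes_per_cluster
print(max_pods_per_cluster)  # 512000

scheduler_throughput = 100  # pods/second, assumed purely for illustration
saturation_time_s = max_pods_per_cluster / scheduler_throughput
print(saturation_time_s)  # 5120.0 seconds to saturate the cluster at that rate
```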
Pre-pulled images are used because image pulls introduce a high degree of variability (network throughput, size of image, etc.) that is unrelated to k8s performance.
Why IaaS and PaaS
Exposition versus Consumption
Current state (VMs) versus future state (BM)
Culture/people challenges (developer versus operations, who is driving)
Isolation concerns
Scaling concerns
OpenStack
Open source cloud computing platform for building massively scalable clouds.
Kubernetes
Open source system for automating deployment, scaling and management of containerized applications. Provides framework for building distributed platforms.
Kubernetes container management/orchestration
Red Hat is the biggest contributor outside of Google
How did Red Hat end up on the Kubernetes horse?
We bet on a simple idea: that an open source community is the best place to build the future of application orchestration, and that only an open source community could successfully integrate the diverse range of capabilities necessary to succeed.
OpenShift
An integrated infrastructure platform to run, orchestrate, monitor and scale containers. Built around Kubernetes and Docker.
OpenShift application platform
Acquired Makara in Nov 2010
OpenShift Origin launched in Apr 2012
Docker Open Source Mar 2013
First Kubernetes commit on github Jun 2014
OpenShift v3 re-architected around Docker and Kubernetes Jun 2015 building on operational experiences obtained by OpenShift Online team with v2.
LDK!
Sandwich:
Your applications
OpenShift masters, nodes, registry
Infrastructure services (LBaaS, Neutron, Nova, Cinder, etc.)
Architectural tenets:
Technical independence: Ensure that containers are defined such that they remain independent of the underlying infrastructure. Containers must continue to be portable across host environments.
Contextual awareness: Allow containers to easily take advantage of OpenStack shared services beyond compute (i.e. networking and storage). To do this, Red Hat Atomic Enterprise (and other Red Hat container offerings) must be context aware.
Avoid redundancy: Limit redundancies where possible to minimize performance and other resource hits. This includes limiting the number of layers between the container and the hardware.
Simplified management: Simplify management by delivering a holistic, integrated view across platforms.
Currently contextual awareness comes via the cloud provider implementation (all or nothing)
Expect to see increased experimentation with using services piecemeal/a la carte (e.g. Cinder)
Storage:
Container hosts consume OpenStack storage
Tenant isolation
Application storage managed by Kubernetes
Stateful applications
Containerized distributed storage services
Networking:
Use OpenShift-SDN to have full application isolation but get double encapsulation when using Neutron with GRE or VXLAN tunnels.
Tenant isolation via OpenStack SDN using Kuryr eventually
Use Flannel with host-gw backend to avoid double encapsulation.
Load Balancing provided by LBaaS V1 by default. Other options:
External load balancer (recommended for production)
Dedicated load balancer node - create a dedicated node for HAProxy. Good for demo/test but no HA.
None - if using single master node.
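The Flannel host-gw option mentioned above is selected in Flannel's network configuration (stored in etcd). An illustrative fragment; the subnet shown is an assumption, not the deployment's actual value:

```json
{
  "Network": "10.128.0.0/14",
  "Backend": {
    "Type": "host-gw"
  }
}
```

With host-gw, pod traffic is routed between hosts without a second encapsulation layer, which is what avoids the double-encapsulation penalty when Neutron is already tunnelling with GRE or VXLAN.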
Authenticate OpenShift users using LDAP.
Re-validate Kubernetes SIG Scalability findings on equivalent OpenShift Container Platform release.
The CNCF cluster is made up of 1000 nodes deployed at Switch, Las Vegas by Intel for the use of the CNCF community.
We were using ~300
NVMe storage will come in handy later
Not product supported
application_performance: JMeter-based performance testing of applications hosted on OpenShift.
applications_scalability: Performance and scalability testing of the OpenShift web UI.
conformance: Wrappers to run a subset of e2e/conformance tests in an SVT environment (work in progress)
image_provisioner: Ansible playbooks for building AMI and qcow2 images with OpenShift rpms and Docker images baked in.
networking: Performance tests for the OpenShift SDN and kube-proxy.
openshift_performance: Performance tests for container build parallelism, projects and persistent storage (EBS, Ceph, Gluster and NFS)
openshift_scalability: Home of the infamous "cluster-loader", details in openshift_scalability/README.md
reliability: Run tests over long periods of time (weeks), cycle object quantity up and down.
Why both?
For the foreseeable future we envisage there will be baremetal, virtualized, containerized workloads
Current state is most people we see are running containers in VMs.
Cultural/people issues:
Easiest way to get going without rocking the organization-wide IT boat in some cases
Concerns about potential for breakout (contrast to QEMU and use of similar constructs there)
Scale issues: # of pods per node (currently 250 and rising), workload dependent.
Availability: Ability to live migrate VMs, not impossible to live migrate a container but also not really the way things should work long term.
The Overcloud usually consists of nodes in predefined roles such as Controller nodes, Compute nodes, and different storage node types. Each of these default roles contains a set of services defined in the core Heat template collection on the director node. However, the architecture of the core Heat templates provides a method to:
Create custom roles
Add and remove services from each role
Storage Layout
Each storage node includes 2 SSDs and 10 SAS disks.
Passed NVMe to VMs for Container Native Storage (Gluster)
Ceph performs significantly better when deployed with write-journals on SSDs.
Created two write-journals on the SSDs and allocated 5 of the spinning disks to each SSD.
In all, we had 90 Ceph OSDs, equating to 158 TB of available disk space.
Image Upload
Converted to RAW for upload to glance
Use snapshot/boot-from-volume flow
Consumed ~ 700MB per VM
VM pool in Ceph this time around ~1.5 TB for 2048 VMs versus 22 TB last time for 1,000 VMs.
Reduced I/O and time to boot VMs, < 15 mins for the 2048 VMs.
Ceph’s role in this environment is to provide boot-from-volume service for our VMs (via Cinder).
Routing tier consists of nodes running HAProxy for ingress into the cluster.
Identified that there are, on average, a large number of low-throughput cluster ingress connections from clients (i.e. web browsers) to HAProxy versus a small number of high-throughput connections.
Already some changes in this space based on previous iterations:
Default connection limit of 2000 leaves plenty of room on commonly available CPU cores for additional connections.
Thus, bumped the default connection limit to 20,000 in OpenShift 3.5 out of the box.
If you have other needs to customize the configuration for HAProxy, our networking folks have made it significantly easier: as of OpenShift 3.4, the router pod now uses a configmap, making tweaks to the config that much simpler.
Load generator configured via passing in ConfigMaps
Queries Kubernetes API for list of routes.
Builds list of test targets dynamically
Zoomed in on a particularly representative workload mix
Combination of HTTP with keepalive and TLS terminated at the edge.
Chose this because it represents how most OpenShift production deployments are used - serving large numbers of web applications for internal and external use, with a range of security postures.
Graph shows throughput test with a Y-axis of Requests Per Second, higher is better.
nbproc refers to number of HAProxy processes spawned.
sched_migration_cost is a kernel tunable that weights processes when deciding if/how the kernel should load balance them amongst available cores.
What we learned:
CPU affinity matters. But why are certain cores nearly 2x faster? This is because HAProxy is now hitting the CPU cache more often due to NUMA/PCI locality with the network adapter.
Increasing nbproc helps throughput. nbproc=2 is ~2x faster than nbproc=1, BUT we get no more boost from going to 4 cores, and in fact nbproc=4 is slower than nbproc=2. This is because there were 4 cores in this guest, and 4 busy HAProxy threads left no room for the OS to do its thing (like process interrupts).
Can improve performance by over 20% from baseline with no changes other than sched_migration_cost.
By increasing it by a factor of 10, we keep HAProxy on the CPU longer, and increase our likelihood of CPU cache hits by doing so.
This is a common technique amongst the low-latency networking crowd, and is in fact recommended tuning in our Low Latency Performance Tuning Guide for RHEL7.
Provides full multi-tenancy
Encapsulation comes with tradeoffs in CPU cycles to wrap/unwrap packets
Can be mitigated via VXLAN offloading with commonly available NICs, incl. those in the CNCF cluster.
Pluggable, so like OpenStack you can use other SDN solutions where integration has been done
Also expect to use Kuryr in future
Allows it to be used on any public/private footprint incl. OpenStack
RFC2544 - Benchmarking Methodology for Network Interconnect Devices
Discusses and defines a number of tests that may be used to describe the performance characteristics of a network interconnecting device.
Also describes specific formats for reporting the results of the tests.
As you would expect, adding more streams for same payload provides a notable increase.
Difference between baremetal/baremetal+pod and vm/vm+pod only becomes pronounced at largest payload size.
Bonus tuning: Large clusters with over 1000 routes or nodes require increasing the default kernel arp cache size.
We’ve increased it by a factor of 8x, and are including that tuning out of the box in OpenShift 3.5.
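The neighbour (ARP) cache size is controlled by the kernel's gc_thresh sysctls. An illustrative drop-in fragment for /etc/sysctl.d scaling the common stock defaults (128/512/1024) by 8x; these are assumed values, not necessarily the exact numbers shipped in OpenShift 3.5:

```ini
# Illustrative: raise the neighbour-table thresholds to 8x their
# common defaults of 128/512/1024.
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
```

Once the table exceeds gc_thresh3 the kernel refuses new entries, so on clusters with over a thousand routes or nodes the defaults become a hard ceiling.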
Reasons:
Maturity
Supportability
Security
POSIX compliance
Overlay/Overlay2
Density improvements gained by page cache sharing are very important for certain environments where there is significant overlap in base image content.
Overlay2 w/ SELinux in Linux kernel 4.9
Rate limited pod creation using “tuningset” w/ cluster-loader
Each of the 6 bumps is a batch of 40 pods.
Before it moves to the next batch, cluster-loader makes sure the previous batch is in running state.
In this way we avoid crushing the API server with requests, and can examine the system’s profiles at each plateau.
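Rate limiting in cluster-loader is expressed through a tuningset. A sketch of the shape such a stanza takes, using the batch size of 40 described above; the exact field names and pause value are assumptions based on the svt repo's cluster-loader and may differ from the configuration actually used:

```yaml
tuningsets:
  - name: default
    templates:
      stepping:
        stepsize: 40   # create pods in batches of 40
        pause: 10 s    # wait between batches; illustrative value
```

cluster-loader waits for each batch to reach the running state before moving on, producing the stepped "bumps" visible in the results.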
The savings in terms of memory is reasonable (again, this is a “perfect world” scenario and your mileage may vary).
The reduction in disk operations below is due to subsequent container starts leveraging the kernel’s page cache rather than having to repeatedly fetch base image content from storage:
Overall found overlay2 to be very stable, and it becomes even more interesting with the addition of SELinux support.
Deployed in pods, scheduled like any application
Used Kubernetes dynamic provisioning to expose volumes to applications.
Marked unschedulable to control variability.
Roughly 6 seconds from submit to the PVC going into “Bound” state.
This number does not vary when CNS is deployed on bare metal or virtualized.
Not pictured here are our tests verifying that several other persistent volume providers respond in a very similar timeframe.