Kubernetes
Lessons Learned
Blake Barnett, Staff Software Engineer
What even is a Kubernete?
● In Greek: Κυβερνήτη
● Translated to English: Commander
● Some say Helmsman…?
Why Kubernetes?
● Previous tool (homegrown)
○ Built images (AMIs) based on composable “layers”.
○ Orchestrated AWS primitives (ASGs, LCs, ELBs).
○ Slow to build, the base image was hard to maintain.
○ Little documentation.
○ Maintained by 1 person primarily.
○ Became a large collection of Jenkins wrapper scripts over time.
● Kubernetes
○ Leverage a growing, active community for support.
○ Leverage the shared knowledge and expertise of many.
○ Good documentation!?
○ Enable hiring people who already know it.
○ Build upon and use well-known CI and deployment patterns.
A few downsides to Kubernetes
● The rate of change is a bit challenging.
● Many related projects come and go; keeping current is hard.
● Cloud provider specifics are often trickier than they first look.
● Only the Kubernetes core has been load-tested for scale. Finding out which
other pieces don’t scale is super “fun”.
Initial requirements
● Must be HA.
● Shared cluster across multiple teams.
○ Authorization based on groups managed elsewhere.
○ Network policy support for defense in depth.
● Low-latency networking.
● AWS IAM integration for applications.
● High level of instrumentation & introspection.
So simple in theory...in reality it was way more complicated than this
Initial design choices
● Management & disaster recovery
● Networking
● PaaS
● User tools
Management & disaster recovery
● Chronological order
○ kube-up.sh
○ kube-aws (CloudFormation)
○ Terraform
○ Troposphere
○ CoreOS Tectonic (Terraform+)
○ Kops <- what we’re still using today
● Kops + Terraform (network level)
○ Infrastructure as code
○ Kops has cluster introspection; rolling updates are possible.
○ Lesson learned: managing upgrades at a per-instance-group level is safest.
Networking
So many options! A wide variety of use cases.
● CNI (Container Network Interface) was still a new standard.
● We didn’t need Layer 2 features.
● Performance (especially low latency) was important.
● NetworkPolicy support.
● We chose Calico.
○ A bit daunting but closest to standard networking.
○ Met our requirements.
○ Fast moving target.
● How do we make cluster debugging and connectivity easier?
○ VPN vs. Bastions
○ DNS for cluster internals while on VPN?
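
The NetworkPolicy requirement above can be illustrated with a minimal sketch. It builds a default-deny ingress policy for one namespace as a plain dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The function name, the policy name, and the `team-a` namespace are hypothetical; the deck does not show the policies that were actually used.

```python
import json


def default_deny_ingress(namespace):
    """Build a NetworkPolicy that selects every pod in the namespace and
    allows no ingress, so traffic must be opened by additional policies."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules denies all inbound traffic.
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    # "team-a" is a placeholder namespace for illustration only.
    print(json.dumps(default_deny_ingress("team-a"), indent=2))
```

A policy like this only takes effect when the CNI plugin enforces NetworkPolicy, which is one of the reasons given above for choosing Calico.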
PaaS? & user tools
● Shall it come to PaaS?
○ Deis Workflow? Nope.
○ OpenShift? Nope.
○ Cloud Foundry? Nope.
○ Knative? Maybe…
● User tools
○ Helm
○ Kustomize
○ Jsonnet
Problems we encountered
● DNS
● DNS
● DNS
● Resource requests & limits
● Workload isolation
● Etcd v2
● “Bad” nodes
DNS issues as you scale
● Autoscaling
○ Early kube-dns didn’t autoscale.
● Too many queries!
○ 1 DNS query turns into 10, every time.
● Node DNS cache
○ Lightweight build of CoreDNS that runs on every node and forwards cluster queries to the central CoreDNS.
○ Co-presented at KubeCon EU about it.
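
The deck does not say why one query becomes ten. One likely explanation is search-list expansion: pods using the default cluster DNS policy get `ndots:5` plus several search domains in `resolv.conf`, so a short external name is tried against every search domain before it is tried as-is, and each attempt may be issued for both A and AAAA records. The sketch below models that worst case. The search list, including the `ec2.internal` entry inherited from the node, is an assumed example, not taken from the slides.

```python
def candidate_queries(name, search_domains, ndots=5, record_types=("A", "AAAA")):
    """Return the (qname, qtype) pairs a stub resolver may send, in order,
    if every attempt before the last one fails to resolve."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        names = [name]
    else:
        searched = [f"{name}.{domain}." for domain in search_domains]
        absolute = f"{name}."
        if name.count(".") >= ndots:
            # Enough dots: try the name as-is first, then the search list.
            names = [absolute] + searched
        else:
            # Too few dots: walk the search list first, then the name as-is.
            names = searched + [absolute]
    return [(qname, qtype) for qname in names for qtype in record_types]


if __name__ == "__main__":
    # Assumed search list for a pod in the "default" namespace on an AWS node.
    search = [
        "default.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
        "ec2.internal",
    ]
    queries = candidate_queries("api.example.com", search)
    for qname, qtype in queries:
        print(qtype, qname)
    print(len(queries))  # 10
```

Under these assumptions, writing the name with a trailing dot (`api.example.com.`) reduces the lookup to two queries. A node-local cache like the one described above instead absorbs the extra load close to the pod.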
Workload isolation & resources
Resource requests & limits
● Requires a lot of training.
● Kubernetes admins start feeling like resource cops.
● Good metrics and alerting are crucial.
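
Part of the training burden is that requests and limits together determine a pod's QoS class, which affects which pods are evicted first under node pressure. The sketch below is a simplified model of that rule, not the kubelet's implementation: it compares quantities as plain strings (so `"500m"` and `"0.5"` would not match) and considers only CPU and memory. The input shape is a hypothetical simplification of a pod spec.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    `containers` is a list of dicts, each with optional "requests" and
    "limits" dicts keyed by "cpu" and "memory".
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
```

The three calls print `BestEffort`, `Burstable`, and `Guaranteed` respectively.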
Workload Isolation
● Helps with the above, but is a heavy-handed solution.
● Limits the efficiency gains from bin-packing.
● Required for safety and reliability in some cases.
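
A toy calculation shows how isolation cuts into bin-packing gains. The sketch packs pod CPU requests (in millicores) onto fixed-size nodes with a first-fit-decreasing heuristic, once as a shared pool and once with a dedicated pool per team. The heuristic, the team names, and the numbers are illustrative assumptions; the real scheduler uses different scoring.

```python
def nodes_needed(pod_requests, node_capacity):
    """Count the nodes used when packing requests first-fit-decreasing."""
    free = []  # remaining capacity of each node opened so far
    for request in sorted(pod_requests, reverse=True):
        if request > node_capacity:
            raise ValueError("a pod requests more than one node can offer")
        for index, remaining in enumerate(free):
            if request <= remaining:
                free[index] -= request
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_capacity - request)
    return len(free)


if __name__ == "__main__":
    capacity = 4000  # millicores per node (assumed)
    teams = {"team-a": [3000, 3000], "team-b": [1000, 1000]}

    shared = nodes_needed([r for pods in teams.values() for r in pods], capacity)
    isolated = sum(nodes_needed(pods, capacity) for pods in teams.values())
    print(shared, isolated)  # 2 3
```

In this example the shared pool fits on two nodes, while per-team pools need three because team-b's small pods can no longer fill the gaps left by team-a's large ones.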
Etcd v2 & “Bad” nodes
Etcd version 2
● We got stuck on v2 because of Kops & Calico.
● Kubernetes has removed support for etcd v2 as of v1.13.
“Bad” nodes
● A very unspecific term for a large number of problems.
● We use node-problem-detector with custom monitors to catch a handful of
these.
● We regularly add new use cases.
Links & Questions?
Kubernetes Failure Stories:
https://github.com/hjacobs/kubernetes-failure-stories
Node-local-dns cache talk:
https://static.sched.com/hosted_files/kccnceu19/4b/KubeCon-Europe-2019-nodelocaldns.pdf
An early comparison of CNI providers:
https://docs.google.com/spreadsheets/d/1polIS2pvjOxCZ7hpXbra68CluwOZybsP1IYfr-HrAXc/edit#gid=0
