Video of the talk: https://www.youtube.com/watch?v=MTHj0_NdeeM
When you run Kubernetes in production and at scale, you encounter many issues, both on the infrastructure side and in user space. Some of these issues come with time, increased usage, cluster size, and the number of workloads; some might only appear once you go global and into regions with vastly different technology landscapes, like China.
This talk goes into detail on learnings from concurrently operating 100+ clusters for big enterprises in production, on different clouds and in data centers around the globe. Over the years we have worked through hundreds of postmortems and want to share both operations and development best practices that can help avoid the issues we ran into. A big focus of this talk is arriving at a hardened and reliable cluster setup and handling multi-tenancy in clusters that are used by a multitude of teams.
9. Postmortem Philosophy
“The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s) are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence.”
- Google SRE book
@puja108 9
10. Single Product
1. Gather Issues
2. Fix in Code
3. Roll out continuously
4. Profit 😉
11. Postmortem Practice
- Issue Template
- High Priority
- Assigned to a cross-functional team
18. Customer Load Test goes bad? You take the blame!
- “Must be Calico, kube-proxy, the Ingress Controller!”
- Turns out EC2 network saturation was the bottleneck
- Solution: More workers!
20. Postmortem Hotspots
- Old versions
- Ingress (~15%)
- Networking & DNS
- Resource Pressure
- Multi-tenancy
21. Old versions
- Issues might have been solved already
- CVEs
- Test upgrades extensively
- Automate upgrades (or have a process)
22. Ingress
- NGINX IC: Newer versions are less prone to misconfiguration
- Separate controllers
- Load- and failover-testing
- Last resort: Service of type LoadBalancer
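Running separate ingress controllers (e.g. one per tenant or traffic class) can be expressed with ingress classes, so each controller only serves the Ingress objects addressed to it. A minimal sketch, with hypothetical class, host, and service names:

```yaml
# Sketch: an Ingress pinned to one specific controller via its class.
# "nginx-internal", the namespace, host, and service names are illustrative;
# a second controller deployment would watch a different class.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-a-app
  namespace: team-a
spec:
  ingressClassName: nginx-internal   # only the "internal" controller picks this up
  rules:
  - host: app.team-a.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
```

Splitting traffic across controllers this way keeps one misbehaving tenant or a bad config reload from taking down ingress for everyone.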
23. Networking & DNS
- Monitor network health
- Monitor DNS latency
- Check for known issues
- Apply best practices
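One of the known issues alluded to here (and linked on the last slide) is multi-second in-cluster DNS lookups, driven by the default `ndots:5` search-path expansion and a conntrack race with parallel A/AAAA queries. A commonly cited mitigation, sketched as a hedged example with an illustrative pod:

```yaml
# Sketch: per-pod DNS tuning to mitigate slow lookups.
# Values are illustrative; test against your own resolver setup.
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned-pod   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
  dnsConfig:
    options:
    - name: ndots
      value: "2"                   # avoid needless search-domain expansion
    - name: single-request-reopen  # glibc resolver option; serializes A/AAAA
                                   # queries to sidestep the conntrack race
```

Whether `single-request-reopen` helps depends on the libc in the container image (it is a glibc option; musl-based images behave differently), so measure DNS latency before and after.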
24. Resource Pressure
- Resource Management!
- Include buffers (lots of them)
- Protect K8s and critical addons (priority)
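Protecting critical addons by priority can be sketched with a `PriorityClass` plus explicit resource requests and limits; all names and values below are illustrative:

```yaml
# Sketch: give cluster-critical addons a high priority so they are
# scheduled first and evicted last under resource pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-addons
value: 1000000
globalDefault: false
description: "For cluster-critical addons only."
---
apiVersion: v1
kind: Pod
metadata:
  name: addon-example   # illustrative; normally part of a Deployment/DaemonSet
  namespace: kube-system
spec:
  priorityClassName: critical-addons
  containers:
  - name: addon
    image: coredns/coredns
    resources:
      requests:          # reserved share the scheduler accounts for
        cpu: 100m
        memory: 70Mi
      limits:            # limits == requests yields Guaranteed QoS,
        cpu: 100m        # the last class to be evicted under pressure
        memory: 70Mi
```

The buffers mentioned above then come on top: keep node allocatable headroom so system daemons and these addons never compete with tenant workloads.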
25. Multi-tenancy
- Separate and isolate namespaces with RBAC
- No cluster-admins!
- Separate clusters if possible
- Automate with CI/CD
- Minimize manual ops
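Namespace isolation with RBAC, as advocated above, can be sketched with a namespace-scoped `Role` and `RoleBinding`; the team name and resource list are illustrative and would be narrowed to what each team actually needs:

```yaml
# Sketch: confine a team to its own namespace instead of cluster-admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-edit
  namespace: team-a
rules:
- apiGroups: ["", "apps", "networking.k8s.io"]
  resources: ["pods", "services", "configmaps", "secrets", "deployments", "ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit
  namespace: team-a
subjects:
- kind: Group
  name: team-a          # as asserted by your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role            # Role, not ClusterRole: stays namespace-scoped
  name: team-a-edit
  apiGroup: rbac.authorization.k8s.io
```

Because the binding references a `Role` rather than a `ClusterRole`, nothing here grants access outside `team-a`, which is the point of the "no cluster-admins" rule.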
26. Best Practices
- Preemptive Monitoring & Alerting are key!
- Logging (and Tracing) help debugging
- Fix issues fast
- Educate users
- Have a postmortem process
- Train recovery
27. Stand on the Shoulders of Giants!
- Kubernetes the very hard way - Datadog
- Scaling Kubernetes to 2,500 Nodes - OpenAI
- 5-15s DNS lookups on Kubernetes? - BitMEX
- Scaling CoreDNS in Kubernetes Clusters
- Inside Kubernetes Resource Management (QoS) - Michael Gasch
- List of Kubernetes Best Practice talks/blogs
- Kubernetes Office Hours