SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
Calico のデプロイを
ミスって本番クラスタを
壊しそうになった話
2021/03/12
Cloud Native Days Online
WHO I AM?
- Name: Kawabe Katsuya
- Team: CyberAgent Group Infrastructure Unit
- Position: Software, Infra Engineer, 2020 New Graduate
- Hobby: Music, Comic
ABOUT US
@KKawabe108
1. What Happen
2. How To Resolve
TABLE
CONTENTS
問題発覚編: calico のデプロイをミスったことに
よって発生した事象について
解決編: プロダクト側への説明、監視の見直し、
再発防止への取り組み
図参照: https://docs.projectcalico.org/reference/architecture/overview
CNI: calico
アラートはいつも突然に
AKE での監視
Victoria Metrics で複数のクラスタを監視しています
What Happened
Ingress とノードの
BGP ピアがダウンしたというアラートが大量発生
Kubectl get po -A をすると Master ノードに乗っている
Pod がほとんど Evicted されていた 😇
Master ノードに ssh すると、どうやらディスク領域が
圧迫され、Eviction の閾値に到達していた
What Happened: Ingress の実装
Node
calico-node
Node
exporter
Node
calico-node
Node
calico-node
nginx ctrl nginx ctrl nginx ctrl
Node
calico-node
Node
calico-node
calico-node
exporter exporter
Big IP VS
BGP Routing
What Happened: Ingress の実装
Node
calico-node
Node
exporter
Node
calico-node
Node
calico-node
nginx ctrl nginx ctrl nginx ctrl
Node
calico-node
Node
calico-node
calico-node
exporter exporter
Big IP VS
BGP Routing
BGP Link is
Down
なぜ、Diskが圧迫されたのか
What Happened : ipamhandles リソースの爆発
calico-ipam が使用する Pod と IPを紐づけるリソース
通常、calico が Pod の作成と削除に合わせ
制御するリソースのはずだったが・・・
What Happened : ipamhandles リソースの爆発
😇
kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる
ちなみに、Pod の数はおよそ30個ぐらい
What Happened : ipamhandles リソースの爆発
😇
kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる
ちなみに、Pod の数はおよそ30個ぐらい
APIサーバが操作できない =
クラスタが操作不能 =
ヤバイ
What Happened : ipamhandles リソースの爆発
etcd backup
2GB
/var/backup/etcd
etcd backup
2GB
etcd backup
2GB
systemd
🧨
🧨 🧨
ipamhandles の爆発によって、etcd のバックアップデータが肥大化し、
20GB しかないディスクの圧迫へと繋がった
What Happened : まとめ
Step 1 Step 2 Step 3 Step 4
calico の ipamhandles が
爆発する
etcd のバックアップデータが
増加する (2GB)
Master のディスクが圧迫されて
calico-node とその他が Evicted
される
calico-node がダウンしたことに
より、BGPピアが切断され、
アラート発砲
原因と対応
How To Resolve
calico-kube-controllers というコンポーネント
をデプロイしていなかった
calico-node の ClusterRole の権限が
間違っていた
図参照: https://docs.projectcalico.org/reference/architecture/overview
How To Resolve: 反省点
calico-node に delete の権限を渡していないせいで
GCが発生していなかった
元々、3.8 のマニフェストをベースに弄っていたので
発生したミス
新しいマニフェストを公式から落としてそれをベースにすれば
今回のようなミスは発生しなかった
How To Resolve : プロダクトへの対応
今回、アラートが上がったのは監視用に立てているクラスタでプロダクトが利用しているクラスタでは
Pod がそこまで頻繁に作成削除されていなかったので、肥大化はディスクに影響が出るほどではなかった
すぐに事情を説明して、マニフェストの修正を行なった
発生するリスクは抑えたということを確認して、対応終了
How To Resolve: 監視基盤の対応
監視基盤で全クラスタで etcd の db size
と、オブジェクト数を監視するようにした
ディスクサイズのアラートが
Eviction Policy と同等だったので、
それより低く設定し直す
How To Resolve: calico のアップデートについて
極力アップデートしなくていいならしない
マニフェストは基本公式のものをそのまま使うので問題はないはず (Pod CIDR や IPIP の有効化フラグぐらい)
マニフェストが大きいので修正のレビューは三重チェックで丁度いい (それぐらいCNIはクリティカル)
CNI のアップデートには細心の
注意を払いましょう!
Thank you for listening !

Contenu connexe

Tendances

Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動するStargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動するKohei Tokunaga
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterKohei Tokunaga
 
Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionKohei Tokunaga
 
Starting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of ImagesStarting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of ImagesKohei Tokunaga
 
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...Kohei Tokunaga
 
Shifter singularity - june 7, 2018 - bw symposium
Shifter  singularity - june 7, 2018 - bw symposiumShifter  singularity - june 7, 2018 - bw symposium
Shifter singularity - june 7, 2018 - bw symposiuminside-BigData.com
 
Distributed tensorflow on kubernetes
Distributed tensorflow on kubernetesDistributed tensorflow on kubernetes
Distributed tensorflow on kubernetesinwin stack
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdKohei Tokunaga
 
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeAcademy
 
Kubernetes for Java developers
Kubernetes for Java developersKubernetes for Java developers
Kubernetes for Java developersRobert Barr
 
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeAcademy
 
SCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with ChefSCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with ChefMatt Ray
 
Kubernetes extensibility
Kubernetes extensibilityKubernetes extensibility
Kubernetes extensibilityDocker, Inc.
 
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech TalkArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech TalkRed Hat Developers
 
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKitAkihiro Suda
 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radioEueung Mulyana
 
What's new in FreeBSD 10
What's new in FreeBSD 10What's new in FreeBSD 10
What's new in FreeBSD 10Gleb Smirnoff
 
Cantainer CI/ CD with Kubernetes
Cantainer CI/ CD with KubernetesCantainer CI/ CD with Kubernetes
Cantainer CI/ CD with Kubernetesinwin stack
 

Tendances (20)

Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動するStargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
 
Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image Distribution
 
Starting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of ImagesStarting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of Images
 
Learning kubernetes
Learning kubernetesLearning kubernetes
Learning kubernetes
 
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
 
Shifter singularity - june 7, 2018 - bw symposium
Shifter  singularity - june 7, 2018 - bw symposiumShifter  singularity - june 7, 2018 - bw symposium
Shifter singularity - june 7, 2018 - bw symposium
 
Distributed tensorflow on kubernetes
Distributed tensorflow on kubernetesDistributed tensorflow on kubernetes
Distributed tensorflow on kubernetes
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
 
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
 
Kubernetes for Java developers
Kubernetes for Java developersKubernetes for Java developers
Kubernetes for Java developers
 
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
 
SCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with ChefSCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with Chef
 
Kubernetes extensibility
Kubernetes extensibilityKubernetes extensibility
Kubernetes extensibility
 
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech TalkArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
 
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radio
 
Using Qt under LGPLv3
Using Qt under LGPLv3Using Qt under LGPLv3
Using Qt under LGPLv3
 
What's new in FreeBSD 10
What's new in FreeBSD 10What's new in FreeBSD 10
What's new in FreeBSD 10
 
Cantainer CI/ CD with Kubernetes
Cantainer CI/ CD with KubernetesCantainer CI/ CD with Kubernetes
Cantainer CI/ CD with Kubernetes
 

Similaire à 【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話

[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?OWASP
 
Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020Mauricio (Salaboy) Salatino
 
Qa in production singular 2019
Qa in production   singular 2019Qa in production   singular 2019
Qa in production singular 2019rouanw
 
Future of WCM - CM Forum Belgium
Future of WCM - CM Forum BelgiumFuture of WCM - CM Forum Belgium
Future of WCM - CM Forum BelgiumDavid Nuescheler
 
Cms forum, future of Web Content Management
Cms forum, future of Web Content ManagementCms forum, future of Web Content Management
Cms forum, future of Web Content Managementguest88136a
 
KubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch ProvisionKubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch ProvisionRackN
 
Kubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch KubernetesKubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch Kubernetesrhirschfeld
 
Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019RackN
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudTobias Schmidt
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudKubeAcademy
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesPeter Hlavaty
 
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】Hacks in Taiwan (HITCON)
 
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018Giulio Vian
 
Testability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testableTestability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testableAlexander Tarlinder
 
Recreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingRecreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingKP Kaiser
 
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDCBasics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDCMatt McNeeney
 
Debugging Go in Kubernetes
Debugging Go in KubernetesDebugging Go in Kubernetes
Debugging Go in KubernetesAlexei Ledenev
 
Simplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & CloudSimplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & CloudAjeet Singh Raina
 
Build and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in KubernetesBuild and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in KubernetesKP Kaiser
 
ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」Satoshi Goto
 

Similaire à 【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話 (20)

[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?
 
Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020
 
Qa in production singular 2019
Qa in production   singular 2019Qa in production   singular 2019
Qa in production singular 2019
 
Future of WCM - CM Forum Belgium
Future of WCM - CM Forum BelgiumFuture of WCM - CM Forum Belgium
Future of WCM - CM Forum Belgium
 
Cms forum, future of Web Content Management
Cms forum, future of Web Content ManagementCms forum, future of Web Content Management
Cms forum, future of Web Content Management
 
KubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch ProvisionKubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch Provision
 
Kubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch KubernetesKubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch Kubernetes
 
Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
 
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
 
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
 
Testability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testableTestability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testable
 
Recreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingRecreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web Scraping
 
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDCBasics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
 
Debugging Go in Kubernetes
Debugging Go in KubernetesDebugging Go in Kubernetes
Debugging Go in Kubernetes
 
Simplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & CloudSimplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & Cloud
 
Build and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in KubernetesBuild and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in Kubernetes
 
ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」
 

Dernier

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxalwaysnagaraju26
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456KiaraTiradoMicha
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 

Dernier (20)

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 

【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話

  • 2. WHO I AM? - Name: Kawabe Katsuya - Team: CyberAgent Group Infrastructure Unit - Position: Software, Infra Engineer, 2020 New Graduate - Hobby: Music, Comic ABOUT US @KKawabe108
  • 3. 1. What Happen 2. How To Resolve TABLE CONTENTS 問題発覚編: calico のデプロイをミスったことに よって発生した事象について 解決編: プロダクト側への説明、監視の見直し、 再発防止への取り組み
  • 4.
  • 7. AKE での監視 Victoria Metrics で複数のクラスタを監視しています
  • 8. What Happened Ingress とノードの BGP ピアがダウンしたというアラートが大量発生 Kubectl get po -A をすると Master ノードに乗っている Pod がほとんど Evicted されていた 😇 Master ノードに ssh すると、どうやらディスク領域が 圧迫され、Eviction の閾値に到達していた
  • 9. What Happened: Ingress の実装 Node calico-node Node exporter Node calico-node Node calico-node nginx ctrl nginx ctrl nginx ctrl Node calico-node Node calico-node calico-node exporter exporter Big IP VS BGP Routing
  • 10. What Happened: Ingress の実装 Node calico-node Node exporter Node calico-node Node calico-node nginx ctrl nginx ctrl nginx ctrl Node calico-node Node calico-node calico-node exporter exporter Big IP VS BGP Routing BGP Link is Down
  • 12. What Happened : ipamhandles リソースの爆発 calico-ipam が使用する Pod と IPを紐づけるリソース 通常、calico が Pod の作成と削除に合わせ 制御するリソースのはずだったが・・・
  • 13. What Happened : ipamhandles リソースの爆発 😇 kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる ちなみに、Pod の数はおよそ30個ぐらい
  • 14. What Happened : ipamhandles リソースの爆発 😇 kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる ちなみに、Pod の数はおよそ30個ぐらい APIサーバが操作できない = クラスタが操作不能 = ヤバイ
  • 15. What Happened : ipamhandles リソースの爆発 etcd backup 2GB /var/backup/etcd etcd backup 2GB etcd backup 2GB systemd 🧨 🧨 🧨 ipamhandles の爆発によって、etcd のバックアップデータが肥大化し、 20GB しかないディスクの圧迫へと繋がった
  • 16. What Happened : まとめ Step 1 Step 2 Step 3 Step 4 calico の ipamhandles が 爆発する etcd のバックアップデータが 増加する (2GB) Master のディスクが圧迫されて calico-node とその他が Evicted される calico-node がダウンしたことに より、BGPピアが切断され、 アラート発砲
  • 18. How To Resolve calico-kube-controllers というコンポーネント をデプロイしていなかった calico-node の ClusterRole の権限が 間違っていた 図参照: https://docs.projectcalico.org/reference/architecture/overview
  • 19. How To Resolve: 反省点 calico-node に delete の権限を渡していないせいで GCが発生していなかった 元々、3.8 のマニフェストをベースに弄っていたので 発生したミス 新しいマニフェストを公式から落としてそれをベースにすれば 今回のようなミスは発生しなかった
  • 20. How To Resolve : プロダクトへの対応 今回、アラートが上がったのは監視用に立てているクラスタでプロダクトが利用しているクラスタでは Pod がそこまで頻繁に作成削除されていなかったので、肥大化はディスクに影響が出るほどではなかった すぐに事情を説明して、マニフェストの修正を行なった 発生するリスクは抑えたということを確認して、対応終了
  • 21. How To Resolve: 監視基盤の対応 監視基盤で全クラスタで etcd の db size と、オブジェクト数を監視するようにした ディスクサイズのアラートが Eviction Policy と同等だったので、 それより低く設定し直す
  • 22. How To Resolve: calico のアップデートについて 極力アップデートしなくていいならしない マニフェストは基本公式のものをそのまま使うので問題はないはず (Pod CIDR や IPIP の有効化フラグぐらい) マニフェストが大きいので修正のレビューは三重チェックで丁度いい (それぐらいCNIはクリティカル)
  • 24. Thank you for listening !