SlideShare une entreprise Scribd logo
1  sur  47
Télécharger pour lire hors ligne
you should watch
Jorge Salamero - @bencerillo
15 Kubernetes
failure points
Jorge Salamero
Tech Marketing aka container gamer @ Sysdig
github.com/bencer
@bencerillo
OSS fan
Monitoring, containers, IoT/home-automation, cars
About me
Monitoring & Security Platform for Containers
Monitoring 15 Kubernetes failure points
- Apps
- Hosts
- Orchestration
- Containers
- Yourself
https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/
https://sysdig.com/blog/alerting-kubernetes/
The holy service metrics
- KPI / biz metrics / synthetic
monitoring / user metrics
- Google SRE book:
“The Four Golden Signals”
Latency+Traffic+Errors+Saturation
USE method
- Utilization
(how busy we are, close to 100% bottleneck)
- Saturation
(amount of work waiting on the queue)
- Errors
RED method
- Request Rate
- Request Errors
- Request Duration
The holy service metrics
- Code instrumentation (statsd, JMX
or Prometheus metrics):
var httpDurationsHistogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "http_durations_histogram_seconds",
Help: "Seconds spent serving HTTP requests.",
Buckets: prometheus.DefBuckets,
}, []string{"method", "route", "status_code"})
prometheus.MustRegister(httpDurationsHistogram)
- or Sysdig autodiscovery ;-)
1. connections per second
net.request.count
2. response time
net.response.time
3. errors
net.request.error.count
Prometheus + Grafana UI
Kubernetes orchestration
Kubernetes hierarchy
Services vs hosts+containers
Kubernetes metadata: labels
Pod
app: shopping
tier: api
Pod
app: shopping
tier: db
Pod
app: social
tier: api
role: search
Pod
app: social
tier: api
role: search
Leverage metadata (by service)
Leverage metadata (by pod)
Health vs state monitoring
- Health:
- CPU, memory, disk
- connections, response time,
errors
Health vs state monitoring
- State (orchestration):
- Are containers up and
running properly?
Health vs state monitoring
- kube-state-metrics
https://github.com/kubernetes/kube-state-metrics
https://sysdig.com/blog/introducing-kube-state-metrics/
calculate new metrics based on
the state of Kubernetes
resources
Container scheduling
- Need to deploy a container:
- given the requirements,
where can we run it?
and let’s ignore affinity, taints and tolerations:
https://sysdig.com/blog/kubernetes-scheduler/
- capacity planning
4. node availability
Based on the host or the kubelet component status:
kube_node_status_condition{condition="Ready",status="true"} == 0
count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and
(count(kube_node_status_condition{condition="Ready",status="true"} == 0) /
count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
kube_node_status_condition: kube_node_status_ready,
kube_node_status_out_of_disk, kube_node_status_memory_pressure,
kube_node_status_disk_pressure, and kube_node_status_network_unavailable
Sysdig alert UI
Container resource requirements
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
https://github.com/kubernetes-incubator/cluster-capacity
5. CPU resources
6. memory resources
kube_node_status_capacity_pods
kube_node_status_allocatable_pods
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
capacity - used (by OS and kube services) = allocatable
Container disk requirements
here things get more complicated...
- ephemeral disk usage
- persistent volumes claims
7. disk resources
predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
kube_node_status_condition: kube_node_status_out_of_disk
but within containers this is still WIP, at least Kubernetes 1.8:
container_fs_* doesn’t work with PV
https://github.com/kubernetes/kubernetes/pull/59170
https://github.com/kubernetes/kubernetes/pull/51553
https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
Container orchestration
- ReplicationController
- ReplicaSet
- Deployment
- DaemonSet
- StatefulSet
Kubernetes deployments
Is Kubernetes doing what is
supposed to to?
Orchestration needs monitoring too.
8. running instances
9. desired instances
((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or
(kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
10. deployment updates glitches
kube_deployment_status_observed_generation !=
kube_deployment_metadata_generation
kube_deployment_spec_paused
kube_deployment_spec_strategy_rollingupdate_max_unavailable
Container livecycle state
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
Liveness probes
To know when to restart a container:
livenessProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: X-Custom-Header
value: Awesome
initialDelaySeconds: 3
periodSeconds: 3
Ready-ness probes
To know when a container is ready to start accepting traffic:
readinessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
11. pod status
kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown
kube_pod_status_ready
kube_pod_status_scheduled
kube_pod_container_status_waiting
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_ready
12. pod restarts
You can look at this as a metric or as an event:
ALERT PodRestartingTooMuch
IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60)
FOR 1h
LABELS { severity="warning" }
ANNOTATIONS {
summary = "Pod {{$labels.namespace}}/{{$label.name}} restarting too
much.",
description = "Pod {{$labels.namespace}}/{{$label.name}} restarting too
much.",
}
CrashLoopBackOff event
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/
Sysdig Inspect
https://github.com/draios/sysdig-inspect
Kubernetes internals
- APIserver
- KubeDNS / Istio
- container registry
- any other piece of Kubernetes
https://sysdig.com/blog/monitor-etcd/
13. APIserver
rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) /
rate(apiserver_request_count[5m])* 100 > 5
apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!
~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}> 4
Or just do Golden signals on APIserver endpoint too :-)
14. KubeDNS / Istio
histogram_quantile(0.95,
sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le,
kubernetes_pod_name)) > 1000
All export native metrics in Prometheus format, just scrape them!
https://sysdig.com/blog/monitor-istio/
What are we deploying?
- CI/CD and commits
- Manual deploys
You need to validate what you
tell Kubernetes too!
15. monitor your commands
kubeval: validates YAML and JSON config files
https://github.com/garethr/kubeval
kube-diff: show differences between running state and version controlled configuration
https://github.com/weaveworks/kubediff
Configuration reconciliation discussion:
https://github.com/kubernetes/kubernetes/issues/1702
Although this is getting automated too:
https://sysdig.com/blog/kubernetes-scaler/
Recap
1. connections per second
2. response time
3. errors
4. node availability
5. CPU resources
6. memory resources
7. disk and external resources
Recap (2)
8. running instances
9. desired instances
10. deployment updates glitches
Recap (3)
11. pod status
12. pod restarts
13. APIserver health
14. KubeDNS / Istio health
15. monitor your commands
Grazie!
Jorge Salamero - @bencerillo
https://sysdig.com/blog/

Contenu connexe

Tendances

Virtualization 101: Everything You Need To Know To Get Started With VMware
Virtualization 101: Everything You Need To Know To Get Started With VMwareVirtualization 101: Everything You Need To Know To Get Started With VMware
Virtualization 101: Everything You Need To Know To Get Started With VMwareDatapath Consulting
 
파이썬 TDD 101
파이썬 TDD 101파이썬 TDD 101
파이썬 TDD 101정주 김
 
Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Idan Atias
 
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨Amazon Web Services Korea
 
Automated Deployments with Ansible
Automated Deployments with AnsibleAutomated Deployments with Ansible
Automated Deployments with AnsibleMartin Etmajer
 
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...VMware Tanzu
 
K8s network policy bypass
K8s network policy bypassK8s network policy bypass
K8s network policy bypassKaizhe Huang
 
Introduction to Istio on Kubernetes
Introduction to Istio on KubernetesIntroduction to Istio on Kubernetes
Introduction to Istio on KubernetesJonh Wendell
 
Openshift serverless Solution
Openshift serverless SolutionOpenshift serverless Solution
Openshift serverless SolutionRyan ZhangCheng
 
Swiftで、Webサーバにデータを送信・登録しよう!
Swiftで、Webサーバにデータを送信・登録しよう!Swiftで、Webサーバにデータを送信・登録しよう!
Swiftで、Webサーバにデータを送信・登録しよう!Kanako Kobayashi
 
Fault Tolerance 소프트웨어 패턴
Fault Tolerance 소프트웨어 패턴Fault Tolerance 소프트웨어 패턴
Fault Tolerance 소프트웨어 패턴IMQA
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VRISC-V International
 
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディ
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディCisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディ
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディシスコシステムズ合同会社
 
eBPF in the view of a storage developer
eBPF in the view of a storage developereBPF in the view of a storage developer
eBPF in the view of a storage developerRichárd Kovács
 
Postgres on OpenStack
Postgres on OpenStackPostgres on OpenStack
Postgres on OpenStackEDB
 

Tendances (20)

Virtualization 101: Everything You Need To Know To Get Started With VMware
Virtualization 101: Everything You Need To Know To Get Started With VMwareVirtualization 101: Everything You Need To Know To Get Started With VMware
Virtualization 101: Everything You Need To Know To Get Started With VMware
 
Jitsi
JitsiJitsi
Jitsi
 
파이썬 TDD 101
파이썬 TDD 101파이썬 TDD 101
파이썬 TDD 101
 
Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)
 
Sikuli Slides
Sikuli SlidesSikuli Slides
Sikuli Slides
 
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨
[Gaming on AWS] AWS에서 실시간 멀티플레이 게임 구현하기 - 넥슨
 
Ansible
AnsibleAnsible
Ansible
 
Automated Deployments with Ansible
Automated Deployments with AnsibleAutomated Deployments with Ansible
Automated Deployments with Ansible
 
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...
Winning the Lottery with Spring: A Microservices Case Study for the Dutch Lot...
 
K8s network policy bypass
K8s network policy bypassK8s network policy bypass
K8s network policy bypass
 
Introduction to Istio on Kubernetes
Introduction to Istio on KubernetesIntroduction to Istio on Kubernetes
Introduction to Istio on Kubernetes
 
Openshift serverless Solution
Openshift serverless SolutionOpenshift serverless Solution
Openshift serverless Solution
 
Behavior Driven Development Testing (BDD)
Behavior Driven Development Testing (BDD)Behavior Driven Development Testing (BDD)
Behavior Driven Development Testing (BDD)
 
Swiftで、Webサーバにデータを送信・登録しよう!
Swiftで、Webサーバにデータを送信・登録しよう!Swiftで、Webサーバにデータを送信・登録しよう!
Swiftで、Webサーバにデータを送信・登録しよう!
 
Fault Tolerance 소프트웨어 패턴
Fault Tolerance 소프트웨어 패턴Fault Tolerance 소프트웨어 패턴
Fault Tolerance 소프트웨어 패턴
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディ
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディCisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディ
Cisco Connect Japan 2014:高密度環境におけるシスコ無線デザイン ケース スタディ
 
eBPF in the view of a storage developer
eBPF in the view of a storage developereBPF in the view of a storage developer
eBPF in the view of a storage developer
 
Automated deployment
Automated deploymentAutomated deployment
Automated deployment
 
Postgres on OpenStack
Postgres on OpenStackPostgres on OpenStack
Postgres on OpenStack
 

Similaire à 15 kubernetes failure points you should watch

DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDays Riga
 
Monitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesMonitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesSeva Dolgopolov
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKel Cecil
 
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Michael Man
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Ryan Jarvinen
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)HungWei Chiu
 
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Paris Open Source Summit
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Giacomo Tirabassi
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsSIGHUP
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersjosfuecas
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3Nilesh Gule
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusTobias Schmidt
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSWeaveworks
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'sAntônio Roberto Silva
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster inwin stack
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetDevOpsDaysJKT
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Codemotion
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with KubernetesSatnam Singh
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfStavros Kontopoulos
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Yoichi Kawasaki
 

Similaire à 15 kubernetes failure points you should watch (20)

DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
DevOpsDaysRiga 2018: Andrew Martin - Continuous Kubernetes Security
 
Monitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetesMonitoring akka cluster on kubernetes
Monitoring akka cluster on kubernetes
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of Containers
 
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
Control Plane: Continuous Kubernetes Security (DevSecOps - London Gathering, ...
 
Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17
 
Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)Build Your Own CaaS (Container as a Service)
Build Your Own CaaS (Container as a Service)
 
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
Keynote #Tech - Google : aperçu de la gestion des services distribués chez Go...
 
Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators Kubernetes extensibility: crd & operators
Kubernetes extensibility: crd & operators
 
Kubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & OperatorsKubernetes extensibility: CRDs & Operators
Kubernetes extensibility: CRDs & Operators
 
Using kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containersUsing kubernetes to lose your fear of using containers
Using kubernetes to lose your fear of using containers
 
Azure kubernetes service (aks) part 3
Azure kubernetes service (aks)   part 3Azure kubernetes service (aks)   part 3
Azure kubernetes service (aks) part 3
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
A GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKSA GitOps model for High Availability and Disaster Recovery on EKS
A GitOps model for High Availability and Disaster Recovery on EKS
 
using Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API'susing Mithril.js + postgREST to build and consume API's
using Mithril.js + postgREST to build and consume API's
 
使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster 使用 Prometheus 監控 Kubernetes Cluster
使用 Prometheus 監控 Kubernetes Cluster
 
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmetHow Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
How Honestbee Does CI/CD on Kubernetes - Vincent DeSmet
 
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod...
 
Cluster management with Kubernetes
Cluster management with KubernetesCluster management with Kubernetes
Cluster management with Kubernetes
 
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
 
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
Kubernetes x PaaS – コンテナアプリケーションのNoOpsへの挑戦
 

Plus de Sysdig

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionSysdig
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsSysdig
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime SecuritySysdig
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesSysdig
 
Continuous Security
Continuous SecurityContinuous Security
Continuous SecuritySysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorSysdig
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoSysdig
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor MicroservicesSysdig
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!Sysdig
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsSysdig
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongSysdig
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishSysdig
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Sysdig
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy ContainersSysdig
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system callsSysdig
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with ChiselSysdig
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutesSysdig
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting KubernetesSysdig
 

Plus de Sysdig (20)

Wordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccionWordpress y Docker, de desarrollo a produccion
Wordpress y Docker, de desarrollo a produccion
 
What Prometheus means for monitoring vendors
What Prometheus means for monitoring vendorsWhat Prometheus means for monitoring vendors
What Prometheus means for monitoring vendors
 
Docker Runtime Security
Docker Runtime SecurityDocker Runtime Security
Docker Runtime Security
 
CI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in KubernetesCI / CD / CS - Continuous Security in Kubernetes
CI / CD / CS - Continuous Security in Kubernetes
 
Continuous Security
Continuous SecurityContinuous Security
Continuous Security
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
The top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitorThe top 5 Kubernetes metrics to monitor
The top 5 Kubernetes metrics to monitor
 
Behavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig FalcoBehavioural activity monitoring on CoreOS with Sysdig Falco
Behavioural activity monitoring on CoreOS with Sysdig Falco
 
How to Monitor Microservices
How to Monitor MicroservicesHow to Monitor Microservices
How to Monitor Microservices
 
WTF my container just spawned a shell!
WTF my container just spawned a shell!WTF my container just spawned a shell!
WTF my container just spawned a shell!
 
Trace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdminsTrace everything, when APM meets SysAdmins
Trace everything, when APM meets SysAdmins
 
You're monitoring Kubernetes Wrong
You're monitoring Kubernetes WrongYou're monitoring Kubernetes Wrong
You're monitoring Kubernetes Wrong
 
The Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - SpanishThe Dark Art of Container Monitoring - Spanish
The Dark Art of Container Monitoring - Spanish
 
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...Lions, Tigers and Deers: What building zoos can teach us about securing micro...
Lions, Tigers and Deers: What building zoos can teach us about securing micro...
 
Building Trustworthy Containers
Building Trustworthy ContainersBuilding Trustworthy Containers
Building Trustworthy Containers
 
A brief history of system calls
A brief history of system callsA brief history of system calls
A brief history of system calls
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with Chisel
 
Intro to sysdig in 15 minutes
Intro to sysdig in 15 minutesIntro to sysdig in 15 minutes
Intro to sysdig in 15 minutes
 
Troubleshooting Kubernetes
Troubleshooting KubernetesTroubleshooting Kubernetes
Troubleshooting Kubernetes
 

Dernier

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 

Dernier (20)

9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 

15 kubernetes failure points you should watch

  • 1. you should watch Jorge Salamero - @bencerillo 15 Kubernetes failure points
  • 2. Jorge Salamero Tech Marketing aka container gamer @ Sysdig github.com/bencer @bencerillo OSS fan Monitoring, containers, IoT/home-automation, cars About me
  • 3. Monitoring & Security Platform for Containers
  • 4. Monitoring 15 Kubernetes failure points - Apps - Hosts - Orchestration - Containers - Yourself https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/ https://sysdig.com/blog/alerting-kubernetes/
  • 5. The holy service metrics - KPI / biz metrics / synthetic monitoring / user metrics - Google SRE book: “The Four Golden Signals” Latency+Traffic+Errors+Saturation
  • 6. USE method - Utilization (how busy we are, close to 100% bottleneck) - Saturation (amount of work waiting on the queue) - Errors
  • 7. RED method - Request Rate - Request Errors - Request Duration
  • 8. The holy service metrics - Code instrumentation (statsd, JMX or Prometheus metrics): var httpDurationsHistogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{ Name: "http_durations_histogram_seconds", Help: "Seconds spent serving HTTP requests.", Buckets: prometheus.DefBuckets, }, []string{"method", "route", "status_code"}) prometheus.MustRegister(httpDurationsHistogram) - or Sysdig autodiscovery ;-)
  • 9. 1. connections per second net.request.count 2. response time net.response.time 3. errors net.request.error.count
  • 14. Kubernetes metadata: labels Pod app: shopping tier: api Pod app: shopping tier: db Pod app: social tier: api role: search Pod app: social tier: api role: search
  • 17. Health vs state monitoring - Health: - CPU, memory, disk - connections, response time, errors
  • 18. Health vs state monitoring - State (orchestration): - Are containers up and running properly?
  • 19. Health vs state monitoring - kube-state-metrics https://github.com/kubernetes/kube-state-metrics https://sysdig.com/blog/introducing-kube-state-metrics/ calculate new metrics based on the state of Kubernetes resources
  • 20. Container scheduling - Need to deploy a container: - given the requirements, where can we run it? and let’s ignore affinity, taints and tolerations: https://sysdig.com/blog/kubernetes-scheduler/ - capacity planning
  • 21. 4. node availability Based on the host or the kubelet component status: kube_node_status_condition{condition="Ready",status="true"} == 0 count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2 count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3 kube_node_status_condition: kube_node_status_ready, kube_node_status_out_of_disk, kube_node_status_memory_pressure, kube_node_status_disk_pressure, and kube_node_status_network_unavailable
  • 23. Container resource requirements resources: requests: memory: "256Mi" cpu: "250m" limits: memory: "512Mi" cpu: "500m" https://github.com/kubernetes-incubator/cluster-capacity
  • 24. 5. CPU resources 6. memory resources kube_node_status_capacity_pods kube_node_status_allocatable_pods kube_node_status_capacity_cpu_cores kube_node_status_capacity_memory_bytes kube_node_status_allocatable_cpu_cores kube_node_status_allocatable_memory_bytes capacity - used (by OS and kube services) = allocatable
  • 25. Container disk requirements here things get more complicated... - ephemeral disk usage - persistent volumes claims
  • 26. 7. disk resources predict_linear(node_filesystem_free[30m], 3600 * 2) < 0 kube_node_status_condition: kube_node_status_out_of_disk but within containers this is still WIP, at least Kubernetes 1.8: container_fs_* doesn’t work with PV https://github.com/kubernetes/kubernetes/pull/59170 https://github.com/kubernetes/kubernetes/pull/51553 https://kubernetes.io/docs/concepts/cluster-administration/controller-metrics/
  • 27. Container orchestration - ReplicationController - ReplicaSet - Deployment - DaemonSet - StatefulSet
  • 28. Kubernetes deployments Is Kubernetes doing what is supposed to to? Orchestration needs monitoring too.
  • 30. 9. desired instances ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
  • 31. 10. deployment updates glitches kube_deployment_status_observed_generation != kube_deployment_metadata_generation kube_deployment_spec_paused kube_deployment_spec_strategy_rollingupdate_max_unavailable
  • 33. Liveness probes To know when to restart a container: livenessProbe: httpGet: path: /healthz port: 8080 httpHeaders: - name: X-Custom-Header value: Awesome initialDelaySeconds: 3 periodSeconds: 3
  • 34. Ready-ness probes To know when a container is ready to start accepting traffic: readinessProbe: exec: command: - cat - /tmp/healthy initialDelaySeconds: 5 periodSeconds: 5
  • 35. 11. pod status kube_pod_status_phase: Pending|Running|Succeeded|Failed|Unknown kube_pod_status_ready kube_pod_status_scheduled kube_pod_container_status_waiting kube_pod_container_status_running kube_pod_container_status_terminated kube_pod_container_status_ready
  • 36. 12. pod restarts You can look at this as a metric or as an event: ALERT PodRestartingTooMuch IF rate(k8s_pod_status_restartCount[1m]) > 1/(5*60) FOR 1h LABELS { severity="warning" } ANNOTATIONS { summary = "Pod {{$labels.namespace}}/{{$label.name}} restarting too much.", description = "Pod {{$labels.namespace}}/{{$label.name}} restarting too much.", }
  • 39. Kubernetes internals - APIserver - KubeDNS / Istio - container registry - any other piece of Kubernetes https://sysdig.com/blog/monitor-etcd/
  • 40. 13. APIserver rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])* 100 > 5 apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb! ~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}> 4 Or just do Golden signals on APIserver endpoint too :-)
  • 41. 14. KubeDNS / Istio histogram_quantile(0.95, sum(rate(kubedns_probe_kubedns_latency_ms_bucket[1m])) BY (le, kubernetes_pod_name)) > 1000 All export native metrics in Prometheus format, just scrape them! https://sysdig.com/blog/monitor-istio/
  • 42. What are we deploying? - CI/CD and commits - Manual deploys You need to validate what you tell Kubernetes too!
  • 43. 15. monitor your commands kubeval: validates YAML and JSON config files https://github.com/garethr/kubeval kube-diff: show differences between running state and version controlled configuration https://github.com/weaveworks/kubediff Configuration reconciliation discussion: https://github.com/kubernetes/kubernetes/issues/1702 Although this is getting automated too: https://sysdig.com/blog/kubernetes-scaler/
  • 44. Recap 1. connections per second 2. response time 3. errors 4. node availability 5. CPU resources 6. memory resources 7. disk and external resources
  • 45. Recap (2) 8. running instances 9. desired instances 10. deployment updates glitches
  • 46. Recap (3) 11. pod status 12. pod restarts 13. APIserver health 14. KubeDNS / Istio health 15. monitor your commands
  • 47. Grazie! Jorge Salamero - @bencerillo https://sysdig.com/blog/