SlideShare une entreprise Scribd logo
1  sur  51
2019.07
KOCOON
KAKAO Automatic k8s Monitoring
@@cloud.telemetry / issac.lim(임성국)
Before We Running
Who am I?
Telemetry ?
Telemetry is an automated communications process by which meas
urements and other data are collected at remote or inaccessible points
and transmitted to receiving equipment for monitoring.
ref: https://en.wikipedia.org/wiki/Telemetry
cloud.telemetry ?
• Remote & Inaccessible points
• Baremetal, IaaS, CaaS …
• We Develop & Provide automated communications process
• MaaS (Monitoring as a Service)
• KEMI Stats, KEMI Logs, KOCOON
ref: https://en.wikipedia.org/wiki/Telemetry
KEMI Stats
KEMI Logs
Is it enough?
Head: KEMI-*
Longtail : ?
ref: https://mgcabral.wordpress.com/2012/03/04/thelongtaileconomics/
Longtail
• Users want
 Get own resource
 Deal with resources in their own way
• We want
 Divide resources by users
 Provide self-monitoring service
DKOSv3
• k8s based container orchestrator @KAKAO
• Kubernetes v1.11.5
• 카카오 T 택시 사례를 통해 살펴보는 카카오 클라우드의 Kubernetes as a
Service (openinfradays days 2 Track 2 12:00 ~ 12:40)
그렇다면 …
서비스 클러스터 별로 모니터링 리소스를 분리하자!
KOCOON
img ref: https://www.treehugger.com/green-architecture/cocoon-tree-prefab-spherical-treehouse-pod.html
KOCOON
• KakaO COntainer based service mONitoring
 서비스 리소스 안에서 수집/조회/알람 등 모든 것을 해결
KOCOON-* Overview
Metric based Self Monitoring
Log Routing
Rule based Log Event
How to Use?
(KOCOON-Prometheus)
KOCOON-Prometheus Install
 Prometheus-operator, Prometheus, Alertmanager
 PrometheusRule(for alarm & recreate metric)
 Cupido (Prometheus webhook manager for kakao)
 kube state metrics, node exporter
 Grafana & Dashboard
 kakao etcd & kakao ingress controller service monitor
helm install kakao-stable/kocoon-prometheus --name $(kubectl config current-context) --namespace
monitoring
KOCOON-Prometheus Set LB
## grafana ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val
ues --set grafana.ingress.enabled=true --set grafana.ingress.hosts[0]=example-grafana.dev.
9rum.cc
## prometheus ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val
ues --set prometheus.ingress.enabled=true --set prometheus.ingress.hosts[0]=example-pro
metheus.dev.9rum.cc
## alertmanager ingress
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val
ues --set alertmanager.ingress.enabled=true --set alertmanager.ingress.hosts[0]=example-al
ertmanager.dev.9rum.cc
KOCOON-Prometheus Set Kakaotalk
## 받아야 할 watchcenter group id가 6663, 4443 등 다수인 경우
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-va
lues --set cupido.watchcenterGroups="6663;4443;”
## 받아야 할 watchcenter group id가 1개인(6663) 경우
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-va
lues --set cupido.watchcenterGroups="6663;"
## watchcenter group id를 설정하지 않아도 문제가 생길 시에 dkosv3 운영하는 cloud.deploy 셀에
함께 알람이 갑니다.
k8s Status View
k8s DrillDown View
k8s Ingress Controller View
k8s ETCD View
KOCOON-Prometheus Alarm
Any Problems?
Goals
• Workers: 500
• Metric scrapes interval: 30s ~ 60s
• Check resource status
• service namespace (default)
• kube-system
• ingress
Step 1
• Basic Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 2GB, limit 6GB
• it’s ok ~ 100 workers but…
• Internal node IP issue (max 128 nodes…)
Step 2
Workers
Scrape
Interval
CPU Memory
# of
metrics/sec
12 30s 0.1 780MB 4605
271 30s OOM
• Basic Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 2GB, limit 6GB
Out Of Memory…
• Prometheus killed due to OOM in large scale
• https://groups.google.com/forum/#!topic/prometheus-users/DE
LLNNSVCSw
• https://github.com/prometheus/prometheus/issues/4553
• https://github.com/prometheus/prometheus/issues/1358
We considered …
• # of targets
• # of metrics
• # of rules
• frequency of scrapes & rule evaluations
• …
Must item
• # of targets
• # of rules
• frequency of scrapes & rule evaluations (every minutes)
# of metrics & topk
• # of metrics per sec
rate(prometheus_tsdb_head_samples_appended_total[5m])
• Top 10 metrics
topk(10, count({job=~".+"}) by(__name__))
Step 3
Workers
Scrape
Interval
CPU Memory
# of
metrics/sec
topk
271 30s OOM
271 60s 1.15 OOM 27500
container_network_tcp_usage_total:
246676
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 4GB, limit 8GB
• Scrape interval: 30s -> 60s
Useless Metrics
• cadvisor
• container_(network_tcp_usage_total|network_udp_usage_total
|tasks_state|cpu_load_average_10s|memory_failures_total)
• container_(cpu_schedstat_run_seconds_total|cpu_schedstat_ru
nqueue_seconds_total|cpu_schedstat_run_periods_total|cpu_s
ystem_seconds_total|cpu_user_seconds_total)
• container_(last_seen|memory_working_set_bytes|memory_cac
he|memory_failcnt|memory_max_usage_bytes|memory_swap|
start_time_seconds)
• container_(network_receive_packets_total|network_transmit_p
ackets_dropped_total|network_receive_errors_total|network_r
eceive_packets_dropped_total|network_transmit_packets_total
|network_transmit_errors_total)
Useless Metrics
• cadvisor
• container_(spec_([a-z_]+)|fs_([a-z_]+))
• kubelet_(runtime_operations_latency_microseconds|docker_op
erations_latency_microseconds)
Useless Metrics
• kube api
• apiserver_(admission_controller_admission_latencies_seconds_
bucket|admission_step_admission_latencies_seconds_bucket|a
dmission_controller_admission_latencies_seconds_sum|admissi
on_controller_admission_latencies_seconds_count|admission_s
tep_admission_latencies_seconds_summary)
• apiserver_(request_latencies_bucket|response_sizes_bucket|re
quest_latencies_summary)
Useless Metrics
• kube state metrics
• kube_pod_container_status_waiting_reason
Step 4
Workers
Scrape
Interval
CPU Memory
# of
metrics/sec
topk
271 60s 1.15 OOM 27500
container_network_tcp_usage_total:
246676
310 60s 0.4 2.9GB 9748
storage_operation_duration_seconds_bucket:
31834
container_memory_usage_bytes: 25737
container_memory_rss: 25737
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 4GB, limit 8GB
• Drop useless metrics: kube api, cadvisor, kube state metrics
• 1629 pods
Step 5
Workers
Scrape
Interval
CPU Memory
# of
metrics/sec
topk
500 60s 1.1
12.88GB
8.35GB(RSS)
18324
storage_operation_duration_seconds_bucket:
50402
container_memory_rss: 45149
container_memory_usage_bytes: 45149
• Upgrade Resources for Prometheus
• CPU request 1, limit 4
• Memory : request 9GB, limit 14GB
• 4107 pods
KOCOON-Prometheus Default SLA
• Metric 보관 주기 : 3 days
• Metric 수집 주기 : Every 60s
• Alarm 주기:
• 특정 시간(5M)동안 이벤트가 반복되면 1시간에 한 번 알림
• Target Cluster
• ~ 200 nodes, ~ 1500 pods
• 5분 평균 append되는 metric 수 : ~ 10,000/sec
• 현재 설정 그대로 사용 가능
• cpu 1~4 Core, memory 4~6GB
• 더 많은 node / pod를 돌리고 싶다면?
KOCOON-Prometheus SLA for 500
• Target Cluster
• ~ 500 nodes(worker+ingress), ~ 4000 pods
• 5분 평균 append되는 metric 수 : 18,000/sec
• upgrade memory!
## Upgrade prometheus memory request & limit
helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prom
KOCOON-Prometheus Requirements
• ~ 200 nodes / ~1500 pods
• krane VM / PM: 8 Core, 8GB * 2
• ~ 500 nodes / ~4000 pods
• krane VM / PM : 8 Core, 16GB * 2
Consideration for Service..
• Resource Request & Limit
• cpu, memory request는 최소로 주고 limit 제한을 없앰
• 성능 시험을 진행하면서 앞에 공유했던 drill down (namespace to pod)
view를 보고 필요한 cpu, memory를 확인
• 실제 Production 투입 때는 앞에 실험한 결과를 가지고 request, limit을
설정
• 500 node cluster는 가능하지만
• 100 node 단위로 서비스 / 서비스 그룹들을 넣고,
• 5개의 master node, 2개의 Prometheus & Alertmanager,
• LB를 통해서 각 cluster에 RoundRobin로 접근하게 하는게 좀 더 안정적
으로 운영
Sum Up
• KOCOON-* is provided with helm charts
• KOCOON-Prometheus -> today’s main topic
• KOCOON-Cupido : included in KOCOON-Prometheus
• KOCOON-Hermes -> will be present at ifKakao 2019
• KOCOON-DIKE -> developing…
helm? : package manager for k8s https://helm.sh/
Thanks
@@cloud.telemetry with andi, beemo, issac, jenny, joanne & cloud part
KOCOON – KAKAO Automatic K8S Monitoring

Contenu connexe

Tendances

OpenStackユーザ会資料 - Masakari
OpenStackユーザ会資料 - MasakariOpenStackユーザ会資料 - Masakari
OpenStackユーザ会資料 - Masakarimasahito12
 
Neutron-to-Neutron: interconnecting multiple OpenStack deployments
Neutron-to-Neutron: interconnecting multiple OpenStack deploymentsNeutron-to-Neutron: interconnecting multiple OpenStack deployments
Neutron-to-Neutron: interconnecting multiple OpenStack deploymentsThomas Morin
 
OpenStack Networking
OpenStack NetworkingOpenStack Networking
OpenStack NetworkingIlya Shakhat
 
Docker Networking - Common Issues and Troubleshooting Techniques
Docker Networking - Common Issues and Troubleshooting TechniquesDocker Networking - Common Issues and Troubleshooting Techniques
Docker Networking - Common Issues and Troubleshooting TechniquesSreenivas Makam
 
OVN - Basics and deep dive
OVN - Basics and deep diveOVN - Basics and deep dive
OVN - Basics and deep diveTrinath Somanchi
 
Kubernetes API - deep dive into the kube-apiserver
Kubernetes API - deep dive into the kube-apiserverKubernetes API - deep dive into the kube-apiserver
Kubernetes API - deep dive into the kube-apiserverStefan Schimanski
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and TuningNGINX, Inc.
 
Nx: Angular CLI Power-ups for your modern Development
Nx: Angular CLI Power-ups for your modern DevelopmentNx: Angular CLI Power-ups for your modern Development
Nx: Angular CLI Power-ups for your modern DevelopmentSeven Peaks Speaks
 
Rancher 2.0 - Complete Container Management Platform
Rancher 2.0 - Complete Container Management PlatformRancher 2.0 - Complete Container Management Platform
Rancher 2.0 - Complete Container Management PlatformSebastiaan van Steenis
 
Learning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNILearning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNIHungWei Chiu
 
Cloud Native Days Korea 2019 - kakao's k8s_as_a_service
Cloud Native Days Korea 2019 - kakao's k8s_as_a_serviceCloud Native Days Korea 2019 - kakao's k8s_as_a_service
Cloud Native Days Korea 2019 - kakao's k8s_as_a_serviceDennis Hong
 
Go Programming Patterns
Go Programming PatternsGo Programming Patterns
Go Programming PatternsHao Chen
 
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트OpenStack Korea Community
 
[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법Open Source Consulting
 
Building a scalable microservice architecture with envoy, kubernetes and istio
Building a scalable microservice architecture with envoy, kubernetes and istioBuilding a scalable microservice architecture with envoy, kubernetes and istio
Building a scalable microservice architecture with envoy, kubernetes and istioSAMIR BEHARA
 
Docker Advanced registry usage
Docker Advanced registry usageDocker Advanced registry usage
Docker Advanced registry usageDocker, Inc.
 
Traffic Control with Envoy Proxy
Traffic Control with Envoy ProxyTraffic Control with Envoy Proxy
Traffic Control with Envoy ProxyMark McBride
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka confluent
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack ArchitectureMirantis
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Hao H. Zhang
 

Tendances (20)

OpenStackユーザ会資料 - Masakari
OpenStackユーザ会資料 - MasakariOpenStackユーザ会資料 - Masakari
OpenStackユーザ会資料 - Masakari
 
Neutron-to-Neutron: interconnecting multiple OpenStack deployments
Neutron-to-Neutron: interconnecting multiple OpenStack deploymentsNeutron-to-Neutron: interconnecting multiple OpenStack deployments
Neutron-to-Neutron: interconnecting multiple OpenStack deployments
 
OpenStack Networking
OpenStack NetworkingOpenStack Networking
OpenStack Networking
 
Docker Networking - Common Issues and Troubleshooting Techniques
Docker Networking - Common Issues and Troubleshooting TechniquesDocker Networking - Common Issues and Troubleshooting Techniques
Docker Networking - Common Issues and Troubleshooting Techniques
 
OVN - Basics and deep dive
OVN - Basics and deep diveOVN - Basics and deep dive
OVN - Basics and deep dive
 
Kubernetes API - deep dive into the kube-apiserver
Kubernetes API - deep dive into the kube-apiserverKubernetes API - deep dive into the kube-apiserver
Kubernetes API - deep dive into the kube-apiserver
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
 
Nx: Angular CLI Power-ups for your modern Development
Nx: Angular CLI Power-ups for your modern DevelopmentNx: Angular CLI Power-ups for your modern Development
Nx: Angular CLI Power-ups for your modern Development
 
Rancher 2.0 - Complete Container Management Platform
Rancher 2.0 - Complete Container Management PlatformRancher 2.0 - Complete Container Management Platform
Rancher 2.0 - Complete Container Management Platform
 
Learning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNILearning how AWS implement AWS VPC CNI
Learning how AWS implement AWS VPC CNI
 
Cloud Native Days Korea 2019 - kakao's k8s_as_a_service
Cloud Native Days Korea 2019 - kakao's k8s_as_a_serviceCloud Native Days Korea 2019 - kakao's k8s_as_a_service
Cloud Native Days Korea 2019 - kakao's k8s_as_a_service
 
Go Programming Patterns
Go Programming PatternsGo Programming Patterns
Go Programming Patterns
 
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트
[OpenStack Days Korea 2016] Track3 - 오픈스택 환경에서 공유 파일 시스템 구현하기: 마닐라(Manila) 프로젝트
 
[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법[오픈소스컨설팅] EFK Stack 소개와 설치 방법
[오픈소스컨설팅] EFK Stack 소개와 설치 방법
 
Building a scalable microservice architecture with envoy, kubernetes and istio
Building a scalable microservice architecture with envoy, kubernetes and istioBuilding a scalable microservice architecture with envoy, kubernetes and istio
Building a scalable microservice architecture with envoy, kubernetes and istio
 
Docker Advanced registry usage
Docker Advanced registry usageDocker Advanced registry usage
Docker Advanced registry usage
 
Traffic Control with Envoy Proxy
Traffic Control with Envoy ProxyTraffic Control with Envoy Proxy
Traffic Control with Envoy Proxy
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack Architecture
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1
 

Similaire à KOCOON – KAKAO Automatic K8S Monitoring

GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerAndrew Yongjoon Kong
 
Toward 10,000 Containers on OpenStack
Toward 10,000 Containers on OpenStackToward 10,000 Containers on OpenStack
Toward 10,000 Containers on OpenStackTon Ngo
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheusBob Cotton
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceAnil Nair
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)Tibo Beijen
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…Sergey Dzyuban
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefGaurav "GP" Pal
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4Gaurav "GP" Pal
 
Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Xavier Lucas
 
Devoxx France 2018 : Mes Applications en Production sur Kubernetes
Devoxx France 2018 : Mes Applications en Production sur KubernetesDevoxx France 2018 : Mes Applications en Production sur Kubernetes
Devoxx France 2018 : Mes Applications en Production sur KubernetesMichaël Morello
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceEnkitec
 
20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWSAmazon Web Services Korea
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)Mathew Beane
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaSahdev Zala
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Bobby Curtis
 
DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...
 DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and... DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...
DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...PROIDEA
 

Similaire à KOCOON – KAKAO Automatic K8S Monitoring (20)

GPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and ContainerGPU cloud with Job scheduler and Container
GPU cloud with Job scheduler and Container
 
Toward 10,000 Containers on OpenStack
Toward 10,000 Containers on OpenStackToward 10,000 Containers on OpenStack
Toward 10,000 Containers on OpenStack
 
K8s monitoring with elk
K8s monitoring with elkK8s monitoring with elk
K8s monitoring with elk
 
Time series denver an introduction to prometheus
Time series denver   an introduction to prometheusTime series denver   an introduction to prometheus
Time series denver an introduction to prometheus
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC Performance
 
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)Kubernetes at NU.nl   (Kubernetes meetup 2019-09-05)
Kubernetes at NU.nl (Kubernetes meetup 2019-09-05)
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
 
Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28Openstack meetup lyon_2017-09-28
Openstack meetup lyon_2017-09-28
 
Devoxx France 2018 : Mes Applications en Production sur Kubernetes
Devoxx France 2018 : Mes Applications en Production sur KubernetesDevoxx France 2018 : Mes Applications en Production sur Kubernetes
Devoxx France 2018 : Mes Applications en Production sur Kubernetes
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
Oracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture PerformanceOracle GoldenGate Architecture Performance
Oracle GoldenGate Architecture Performance
 
20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS20160503 Amazed by AWS | Tips about Performance on AWS
20160503 Amazed by AWS | Tips about Performance on AWS
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangya
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
 
DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...
 DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and... DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...
DOD 2016 - Stefan Thies - Monitoring and Log Management for Docker Swarm and...
 

Dernier

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

KOCOON – KAKAO Automatic K8S Monitoring

  • 1. 2019.07 KOCOON KAKAO Automatic k8s Monitoring @@cloud.telemetry / issac.lim(임성국)
  • 4. Telemetry ? Telemetry is an automated communications process by which meas urements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. ref: https://en.wikipedia.org/wiki/Telemetry
  • 5. cloud.telemetry ? • Remote & Inaccessible points • Baremetal, IaaS, CaaS … • We Develop & Provide automated communications process • MaaS (Monitoring as a Service) • KEMI Stats, KEMI Logs, KOCOON ref: https://en.wikipedia.org/wiki/Telemetry
  • 9. Head: KEMI-* Longtail : ? ref: https://mgcabral.wordpress.com/2012/03/04/thelongtaileconomics/
  • 10. Longtail • Users want  Get own resource  Deal with resources in their own way • We want  Divide resources by users  Provide self-monitoring service
  • 11. DKOSv3 • k8s based container orchestrator @KAKAO • Kubernetes v1.11.5 • 카카오 T 택시 사례를 통해 살펴보는 카카오 클라우드의 Kubernetes as a Service (openinfradays days 2 Track 2 12:00 ~ 12:40)
  • 12. 그렇다면 … 서비스 클러스터 별로 모니터링 리소스를 분리하자!
  • 14. KOCOON • KakaO COntainer based service mONitoring  서비스 리소스 안에서 수집/조회/알람 등 모든 것을 해결
  • 16. Metric based Self Monitoring
  • 18. Rule based Log Event
  • 19.
  • 21. KOCOON-Prometheus Install  Prometheus-operator, Prometheus, Alertmanager  PrometheusRule(for alarm & recreate metric)  Cupido (Prometheus webhook manager for kakao)  kube state metrics, node exporter  Grafana & Dashboard  kakao etcd & kakao ingress controller service monitor helm install kakao-stable/kocoon-prometheus --name $(kubectl config current-context) --namespace monitoring
  • 22.
  • 23. KOCOON-Prometheus Set LB ## grafana ingress helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val ues --set grafana.ingress.enabled=true --set grafana.ingress.hosts[0]=example-grafana.dev. 9rum.cc ## prometheus ingress helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val ues --set prometheus.ingress.enabled=true --set prometheus.ingress.hosts[0]=example-pro metheus.dev.9rum.cc ## alertmanager ingress helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-val ues --set alertmanager.ingress.enabled=true --set alertmanager.ingress.hosts[0]=example-al ertmanager.dev.9rum.cc
  • 24. KOCOON-Prometheus Set Kakaotalk ## 받아야 할 watchcenter group id가 6663, 4443 등 다수인 경우 helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-va lues --set cupido.watchcenterGroups="6663;4443;” ## 받아야 할 watchcenter group id가 1개인(6663) 경우 helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-va lues --set cupido.watchcenterGroups="6663;" ## watchcenter group id를 설정하지 않아도 문제가 생길 시에 dkosv3 운영하는 cloud.deploy 셀에 함께 알람이 갑니다.
  • 31. Goals • Workers: 500 • Metric scrapes interval: 30s ~ 60s • Check resource status • service namespace (default) • kube-system • ingress
  • 32. Step 1 • Basic Resources for Prometheus • CPU request 1, limit 4 • Memory : request 2GB, limit 6GB • it’s ok ~ 100 workers but… • Internal node IP issue (max 128 nodes…)
  • 33. Step 2 Workers Scrape Interval CPU Memory # of metrics/sec 12 30s 0.1 780MB 4605 271 30s OOM • Basic Resources for Prometheus • CPU request 1, limit 4 • Memory : request 2GB, limit 6GB
  • 34. Out Of Memory… • Prometheus killed due to OOM in large scale • https://groups.google.com/forum/#!topic/prometheus-users/DE LLNNSVCSw • https://github.com/prometheus/prometheus/issues/4553 • https://github.com/prometheus/prometheus/issues/1358
  • 35. We considered … • # of targets • # of metrics • # of rules • frequency of scrapes & rule evaluations • …
  • 36. Must item • # of targets • # of rules • frequency of scrapes & rule evaluations (every minutes)
  • 37. # of metrics & topk • # of metrics per sec rate(prometheus_tsdb_head_samples_appended_total[5m]) • Top 10 metrics topk(10, count({job=~".+"}) by(__name__))
  • 38. Step 3 Workers Scrape Interval CPU Memory # of metrics/sec topk 271 30s OOM 271 60s 1.15 OOM 27500 container_network_tcp_usage_total: 246676 • Upgrade Resources for Prometheus • CPU request 1, limit 4 • Memory : request 4GB, limit 8GB • Scrape interval: 30s -> 60s
  • 39. Useless Metrics • cadvisor • container_(network_tcp_usage_total|network_udp_usage_total |tasks_state|cpu_load_average_10s|memory_failures_total) • container_(cpu_schedstat_run_seconds_total|cpu_schedstat_ru nqueue_seconds_total|cpu_schedstat_run_periods_total|cpu_s ystem_seconds_total|cpu_user_seconds_total) • container_(last_seen|memory_working_set_bytes|memory_cac he|memory_failcnt|memory_max_usage_bytes|memory_swap| start_time_seconds) • container_(network_receive_packets_total|network_transmit_p ackets_dropped_total|network_receive_errors_total|network_r eceive_packets_dropped_total|network_transmit_packets_total |network_transmit_errors_total)
  • 40. Useless Metrics • cadvisor • container_(spec_([a-z_]+)|fs_([a-z_]+)) • kubelet_(runtime_operations_latency_microseconds|docker_op erations_latency_microseconds)
  • 41. Useless Metrics • kube api • apiserver_(admission_controller_admission_latencies_seconds_ bucket|admission_step_admission_latencies_seconds_bucket|a dmission_controller_admission_latencies_seconds_sum|admissi on_controller_admission_latencies_seconds_count|admission_s tep_admission_latencies_seconds_summary) • apiserver_(request_latencies_bucket|response_sizes_bucket|re quest_latencies_summary)
  • 42. Useless Metrics • kube state metrics • kube_pod_container_status_waiting_reason
  • 43. Step 4 Workers Scrape Interval CPU Memory # of metrics/sec topk 271 60s 1.15 OOM 27500 container_network_tcp_usage_total: 246676 310 60s 0.4 2.9GB 9748 storage_operation_duration_seconds_bucket: 31834 container_memory_usage_bytes: 25737 container_memory_rss: 25737 • Upgrade Resources for Prometheus • CPU request 1, limit 4 • Memory : request 4GB, limit 8GB • Drop useless metrics: kube api, cadvisor, kube state metrics • 1629 pods
  • 44. Step 5 Workers Scrape Interval CPU Memory # of metrics/sec topk 500 60s 1.1 12.88GB 8.35GB(RSS) 18324 storage_operation_duration_seconds_bucket: 50402 container_memory_rss: 45149 container_memory_usage_bytes: 45149 • Upgrade Resources for Prometheus • CPU request 1, limit 4 • Memory : request 9GB, limit 14GB • 4107 pods
  • 45. KOCOON-Prometheus Default SLA • Metric 보관 주기 : 3 days • Metric 수집 주기 : Every 60s • Alarm 주기: • 특정 시간(5M)동안 이벤트가 반복되면 1시간에 한 번 알림 • Target Cluster • ~ 200 nodes, ~ 1500 pods • 5분 평균 append되는 metric 수 : ~ 10,000/sec • 현재 설정 그대로 사용 가능 • cpu 1~4 Core, memory 4~6GB • 더 많은 node / pod를 돌리고 싶다면?
  • 46. KOCOON-Prometheus SLA for 500 • Target Cluster • ~ 500 nodes(worker+ingress), ~ 4000 pods • 5분 평균 append되는 metric 수 : 18,000/sec • upgrade memory! ## Upgrade prometheus memory request & limit helm upgrade $(kubectl config current-context) kocoon-stg/kocoon-prometheus --reuse-values --set prometheus.prom
  • 47. KOCOON-Prometheus Requirements • ~ 200 nodes / ~1500 pods • krane VM / PM: 8 Core, 8GB * 2 • ~ 500 nodes / ~4000 pods • krane VM / PM : 8 Core, 16GB * 2
  • 48. Consideration for Service.. • Resource Request & Limit • cpu, memory request는 최소로 주고 limit 제한을 없앰 • 성능 시험을 진행하면서 앞에 공유했던 drill down (namespace to pod) view를 보고 필요한 cpu, memory를 확인 • 실제 Production 투입 때는 앞에 실험한 결과를 가지고 request, limit을 설정 • 500 node cluster는 가능하지만 • 100 node 단위로 서비스 / 서비스 그룹들을 넣고, • 5개의 master node, 2개의 Prometheus & Alertmanager, • LB를 통해서 각 cluster에 RoundRobin로 접근하게 하는게 좀 더 안정적 으로 운영
  • 49. Sum Up • KOCOON-* is provided with helm charts • KOCOON-Prometheus -> today’s main topic • KOCOON-Cupido : included in KOCOON-Prometheus • KOCOON-Hermes -> will be present at ifKakao 2019 • KOCOON-DIKE -> developing… helm? : package manager for k8s https://helm.sh/
  • 50. Thanks @@cloud.telemetry with andi, beemo, issac, jenny, joanne & cloud part

Notes de l'éditeur

  1. 롱테일 파레토