4. Telemetry?
Telemetry is an automated communications process by which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring.
ref: https://en.wikipedia.org/wiki/Telemetry
5. cloud.telemetry?
• Remote & inaccessible points
• Baremetal, IaaS, CaaS …
• We develop & provide the automated communications process
• MaaS (Monitoring as a Service)
• KEMI Stats, KEMI Logs, KOCOON
10. Longtail
• Users want to
  • Get their own resources
  • Deal with resources in their own way
• We want to
  • Divide resources by user
  • Provide a self-monitoring service
11. DKOSv3
• k8s based container orchestrator @KAKAO
• Kubernetes v1.11.5
Kakao Cloud's Kubernetes as a Service, explored through the Kakao T Taxi case
(OpenInfra Days Track 2, 12:00 ~ 12:40)
24. KOCOON-Prometheus Set Kakaotalk
## When alerts should go to multiple watchcenter group ids (e.g. 6663 and 4443)
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;4443;"
## When alerts should go to a single watchcenter group id (6663)
helm upgrade $(kubectl config current-context) kakao-stable/kocoon-prometheus --reuse-values --set cupido.watchcenterGroups="6663;"
## Even if no watchcenter group id is set, the cloud.deploy cell operating dkosv3 also receives an alarm when a problem occurs.
32. Step 1
• Basic resources for Prometheus
• CPU: request 1, limit 4
• Memory: request 2GB, limit 6GB
• It's OK up to ~100 workers, but…
• Internal node IP issue (max 128 nodes…)
33. Step 2

Workers | Scrape Interval | CPU | Memory | # of metrics/sec
12      | 30s             | 0.1 | 780MB  | 4605
271     | 30s             | OOM |        |

• Basic resources for Prometheus
• CPU: request 1, limit 4
• Memory: request 2GB, limit 6GB
34. Out Of Memory…
• Prometheus is killed due to OOM at large scale
• https://groups.google.com/forum/#!topic/prometheus-users/DELLNNSVCSw
• https://github.com/prometheus/prometheus/issues/4553
• https://github.com/prometheus/prometheus/issues/1358
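The OOM at 271 workers can be reasoned about with a back-of-the-envelope estimate: Prometheus keeps recent samples in its in-memory head block, so memory grows with both the number of active series and the ingestion rate. The sketch below illustrates this; the per-series and per-sample overhead constants are illustrative assumptions, not measured values, and the 271-worker rate is a linear extrapolation from the 12-worker measurement in the table above.

```python
# Back-of-the-envelope estimate of Prometheus head-block memory.
# The overhead constants below are illustrative assumptions, not measured values.

BYTES_PER_SERIES = 8_000      # assumed per-series overhead (labels, index, chunks)
BYTES_PER_SAMPLE = 2          # assumed bytes/sample in compressed head chunks
HEAD_WINDOW_SEC = 2 * 3600    # head block holds roughly the last ~2h of samples

def head_memory_gb(active_series: int, samples_per_sec: float) -> float:
    """Rough head-block memory in GB for a given series count and ingest rate."""
    series_bytes = active_series * BYTES_PER_SERIES
    sample_bytes = samples_per_sec * HEAD_WINDOW_SEC * BYTES_PER_SAMPLE
    return (series_bytes + sample_bytes) / 1e9

# With a 30s scrape interval, each series yields one sample per scrape,
# so active_series ~= samples_per_sec * 30.
small = head_memory_gb(active_series=4605 * 30, samples_per_sec=4605)

# 271 workers: scale the 12-worker rate linearly (assumption).
big_rate = 4605 * 271 / 12
big = head_memory_gb(active_series=int(big_rate) * 30, samples_per_sec=big_rate)

print(f"12 workers : ~{small:.1f} GB")
print(f"271 workers: ~{big:.1f} GB")
```

Even with generous error bars on the constants, the 271-worker estimate lands far above the 6GB memory limit from Step 1, which matches the observed OOM.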
35. We considered …
• # of targets
• # of metrics
• # of rules
• frequency of scrapes & rule evaluations
• …
36. Must-have items
• # of targets
• # of rules
• frequency of scrapes & rule evaluations (every minute)
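These factors combine into a rough ingestion-rate estimate: each target contributes its metric count once per scrape interval. A minimal sketch, where the per-target metric count is a hypothetical example value:

```python
# Rough ingestion-rate estimate from target count, metrics per target,
# and scrape interval. metrics_per_target is a hypothetical example value.

def samples_per_sec(targets: int, metrics_per_target: int,
                    scrape_interval_sec: float) -> float:
    """Total samples appended per second across all scrape targets."""
    return targets * metrics_per_target / scrape_interval_sec

# e.g. 200 targets, each exposing ~1000 metrics, scraped every 60s
rate = samples_per_sec(targets=200, metrics_per_target=1000, scrape_interval_sec=60)
print(f"~{rate:.0f} samples/sec")  # → ~3333 samples/sec
```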
37. # of metrics & topk
• # of metrics per sec
rate(prometheus_tsdb_head_samples_appended_total[5m])
• Top 10 metrics
topk(10, count({job=~".+"}) by(__name__))
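The `rate()` query above can be illustrated with a toy computation. This is a simplified sketch of what PromQL's `rate()` does over a window; real Prometheus additionally handles counter resets and extrapolates to the window boundaries, which the sketch omits.

```python
# Simplified per-second rate over a window of (timestamp, counter_value) samples.
# Real PromQL rate() also handles counter resets and extrapolates to the window
# boundaries; this sketch omits both.

def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    t0, v0 = samples[0]
    t1, v1 = samples[-1]
    return (v1 - v0) / (t1 - t0)

# prometheus_tsdb_head_samples_appended_total sampled every 60s over a 5m window,
# growing by 4605 each second (the rate measured in Step 2 above):
window = [(0, 0.0), (60, 276300.0), (120, 552600.0),
          (180, 828900.0), (240, 1105200.0), (300, 1381500.0)]
print(simple_rate(window))  # → 4605.0
```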
45. KOCOON-Prometheus Default SLA
• Metric retention period: 3 days
• Metric scrape interval: every 60s
• Alarm interval:
• If an event keeps repeating for a given period (5m), alert once per hour
• Target cluster
• ~ 200 nodes, ~ 1500 pods
• 5-minute average of appended metrics: ~ 10,000/sec
• Usable with the current settings as-is
• CPU 1~4 cores, memory 4~6GB
• Want to run more nodes / pods?
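The default SLA above also implies a rough disk footprint. A sketch using the Prometheus documentation's rule of thumb of roughly 1–2 bytes per sample on disk (the 1.7 figure below is an assumption within that range):

```python
# Rough disk usage for the default SLA: 3-day retention at ~10,000 samples/sec.
# Uses the ~1-2 bytes/sample on-disk rule of thumb from the Prometheus docs;
# 1.7 bytes/sample is an assumed value within that range.

RETENTION_SEC = 3 * 24 * 3600   # 3-day metric retention
SAMPLES_PER_SEC = 10_000        # 5-minute average append rate from the SLA
BYTES_PER_SAMPLE = 1.7          # assumed on-disk cost per sample

disk_bytes = RETENTION_SEC * SAMPLES_PER_SEC * BYTES_PER_SAMPLE
print(f"~{disk_bytes / 1e9:.1f} GB")  # → ~4.4 GB
```

So at the default retention and scrape interval, disk is cheap relative to the CPU/memory budget; memory, not storage, is the scaling constraint seen earlier.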
48. Considerations for Service
• Resource request & limit
• Set CPU and memory requests to the minimum and remove the limit
• While running performance tests, check the required CPU and memory using the drill-down (namespace to pod) view shared earlier
• For actual production, set request and limit based on those experiment results
• A 500-node cluster is possible, but it is more stable to operate by:
• Putting services / service groups into 100-node units,
• each with 5 master nodes and 2 Prometheus & Alertmanager instances,
• and accessing each cluster via round-robin through an LB
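The LB round-robin access described above can be sketched minimally; the cluster endpoint names below are hypothetical placeholders.

```python
# Minimal round-robin selection over Prometheus clusters behind an LB.
# The endpoint names are hypothetical placeholders.
from itertools import cycle

clusters = ["prom-cluster-a:9090", "prom-cluster-b:9090", "prom-cluster-c:9090"]
rr = cycle(clusters)

# Each incoming query is routed to the next cluster in turn.
picks = [next(rr) for _ in range(5)]
print(picks)
# → ['prom-cluster-a:9090', 'prom-cluster-b:9090', 'prom-cluster-c:9090',
#    'prom-cluster-a:9090', 'prom-cluster-b:9090']
```

In practice the round-robin lives in the LB itself, not application code; the sketch only shows the distribution pattern.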
49. Sum Up
• KOCOON-* is provided as Helm charts
• KOCOON-Prometheus -> today's main topic
• KOCOON-Cupido : included in KOCOON-Prometheus
• KOCOON-Hermes -> will be presented at ifKakao 2019
• KOCOON-DIKE -> in development…
helm? : package manager for k8s https://helm.sh/