SlideShare une entreprise Scribd logo
1  sur  50
Télécharger pour lire hors ligne
WWW.HAPROXY.COM
Observability
with HAProxy
WWW.HAPROXY.COM
● Introduction to “observability”
● Deep dive in HAProxy
● Tips (to get more from HAProxy)
● RoX (Return On eXperience)
● Conclusion (because we need one)
Agenda
WWW.HAPROXY.COMWWW.HAPROXY.COM
Introduction
WWW.HAPROXY.COM
Observability:
"control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs"
(wikipedia)
In short for us: WTF is going on ?
Definition
WWW.HAPROXY.COM
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
Observability vs monitoring
WWW.HAPROXY.COM
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve users experience where it matters
Observability is crucial
WWW.HAPROXY.COM
● central place
● in distributed systems, one LB per layer, even better when
sidecar
● sees multiple targets, eases comparisons
=> draw references
● Trusted low level component & excellent transparency
● many LB decisions actually depend on metrics related to
performance and observability
● again, logs provide long-term references
The LB as an observation tower
WWW.HAPROXY.COM
● Maintain two types of TCP connections
○ From client to HAPRoxy (1)
○ From HAProxy to server (2)
○ Only access to data payload (not the “packet”)
LB in Reverse-proxy mode
HAProxyClient Server
(1) (2)
WWW.HAPROXY.COM
● you should! Especially in distributed systems (microservices,
etc)!
● but often it's too late when the first incident happens!
● with existing LB's logs, it's already possible to do a lot
Note: check OpenTracing and Prometheus
Why not looking from other points?
WWW.HAPROXY.COM
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning (eg:
conntrack)
● connection slowdowns caused by inefficient firewall policies
(#rules)
● Non exhaustive list ...
What does the LB see?
WWW.HAPROXY.COM
● client-side issues (BW limitations)
● per-URL processing time (application issues, svc partners)
● per-node vs per-cluster variations
=> narrow down to individual node or shared resource
● deployment issues : new occasional error on a specific page,
can be addressed before going full-scale
What does the LB see?
WWW.HAPROXY.COMWWW.HAPROXY.COM
Deep dive in
HAProxy
WWW.HAPROXY.COM
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {}
"GET /index.html HTTP/1.1"
I need help !!!
WWW.HAPROXY.COM
● Logs :
○ Halog, ELK, Splunk, datadog…
○ Provides unique-id for tracing/event correlation
● Statistics :
○ Stats page, CLI, hatop
○ Prometheus, statsd, ...
● Stick-tables (per arbitrary key like IP, URL, cookie, ...) :
○ Byte count, cumulated/concurrent conns, errors, rates…
Accessing metrics in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
● HAProxy now supports heavier per-request workloads (Lua,
device identification, …)
● Processing times over 200 µs can become noticeable
● Actions:
○ log per-request total CPU time spent in analysers
○ log per-request total CPU time spent in TLS handshake
○ log per-request total latency added by other tasks
○ Ability to kill offending tasks
○ Ability to alert on high latencies
=> make HAProxy as observable as other components
And more to come in HAProxy 1.9 (Nov. 2018)
WWW.HAPROXY.COM
● Timers are averaged in the stats
● Each timer appears in the logs
● halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Event timing reports
Timers Term code Cookie code
WWW.HAPROXY.COM
● Who, when, why and how the session has been terminated
● Distinguish between timeout and abort
● Indicate who (client, server, haproxy, kill, ...)
● Indicate when (req, queue, connect, response, data...)
● Completed by persistence cookie indications
● Filtered and sorted by halog :
halog -tcn|-TCN ... # for filtering
halog -tc # for sorting
=> Graphing the number of errors reported by second helps
detecting an issue is in progress somewhere
Termination codes
WWW.HAPROXY.COM
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations between
application deployments
HTTP status code distribution
WWW.HAPROXY.COM
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
Other relevant metrics: queues
WWW.HAPROXY.COM
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates abnormally
slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
Other relevant metrics: LB fairness
WWW.HAPROXY.COM
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
Other relevant metrics: Error rate
WWW.HAPROXY.COM
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Useful entries in log-format
Timers
HTTP status Byte
count
Term
Code
Cookie
Code
Conn
Count
Queue
Length
WWW.HAPROXY.COMWWW.HAPROXY.COM
Tips !!!!
WWW.HAPROXY.COM
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s without
sweating
● that's 1.7B events/day or 350GB of uncompressed haproxy
logs/day
● compresses to 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic / not interested in this level of detail ?
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
Sampling: why / When
Tips !!!!
WWW.HAPROXY.COM
● you only want to catch suspicious events
● enable logs per url, source IP, etc...
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
WARNING: you'll lose any valid reference
Selective logging: why / When
Tips !!!!
WWW.HAPROXY.COM
● Poorly documented, use halog --help
● response time per url: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB of log file per second)
=> Use it in production to figure the relevant metrics
Other halog goodies
Tips !!!!
WWW.HAPROXY.COMWWW.HAPROXY.COM
ROX
(Return On eXperience)
WWW.HAPROXY.COM
● Many ops and dev do this every day all around the world
● Users are complaining that the application is slow
● Devops team check HAProxy logs (using halog or ELK) in order to isolate
potential sources of issue:
○ Check server Tc
○ Sort servers by response time
○ Sort URLs by response time
● HAProxy won’t fix the problem (in most cases), but will drastically reduce the
troubleshooting time by isolating the potential issues
Web application is slow!!!!!!!!
Return On eXperience
WWW.HAPROXY.COM
● Retail customer, with shops everywhere in France
● Cash register reports the following error message “Unable to reach central
system” when searching people in the loyalty card database, after waiting up to
10s.
● Argument between customer and “central system” software provider:
○ Customer: your software is slow
○ Provider: your network triggers this error
● After installing HAProxy, its logs reported the following:
○ Time to get the query from the client: 300ms (from North of France to Paris)
○ Server response time: 50s
● Conclusion: provider turned off debug mode in the “central system” software
Unavailable central system ?!?
Return On eXperience
WWW.HAPROXY.COM
● Customer with big TLS traffic, mostly API based
● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if
there seem to be a lot of unused resources on the box
● At first, this was qualified as an “HAProxy” issue (including the HW and OS
itself)
● Hard tuning the box did not improve anything
● halog Tc percentile reported higher values when the load increased (up to
30ms, on the LAN!!!)
● After asking customer to double check the Spanning Tree, it seemed a small 24
ports ToR switch became root bridge….
● Fixing spanning tree also fixed HAProxy’s TLS performance
HAProxy TLS performance issue
Return On eXperience
WWW.HAPROXY.COM
● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct
● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct
=> both haproxy and servers out of cause
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently at cause but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, origin could even be identified
Spotting a broken fiber channel between 2
core switches
Return On eXperience
WWW.HAPROXY.COM
● Tc abnormally high with lots of random values to several
seconds, and only for TLS
● Tc timer also covers TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> system was regularly running out of entropy due to
mistakenly using /dev/random as a random source for SSL
TIP: use /dev/urandom ;)
Wrong web server configuration using
/dev/random
Return On eXperience
WWW.HAPROXY.COMWWW.HAPROXY.COM
Conclusion
WWW.HAPROXY.COM
● exploit your stats
● Choose the right LB
● enable logs on LBs, no excuse for not doing it!
● process them automatically, manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are witnessed
● Enjoy :-)
Conclusion
Conclusion
WWW.HAPROXY.COM
QUESTION &
ANSWER

Contenu connexe

Tendances

Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)NTT DATA Technology & Innovation
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer GuideDeon Huang
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For ArchitectsKevin Brockhoff
 
DPDK Acceleration with Arkville
DPDK Acceleration with ArkvilleDPDK Acceleration with Arkville
DPDK Acceleration with ArkvilleShepard Siegel
 
Presto on YARNの導入・運用
Presto on YARNの導入・運用Presto on YARNの導入・運用
Presto on YARNの導入・運用cyberagent
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For OperatorsKevin Brockhoff
 
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...Vietnam Open Infrastructure User Group
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsWeaveworks
 
Prometheus and Thanos
Prometheus and ThanosPrometheus and Thanos
Prometheus and ThanosCloudOps2005
 
Alluxio: Data Orchestration on Multi-Cloud
Alluxio: Data Orchestration on Multi-CloudAlluxio: Data Orchestration on Multi-Cloud
Alluxio: Data Orchestration on Multi-CloudJinwook Chung
 
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Thomas Riley
 
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)NTT DATA Technology & Innovation
 
Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelThomas Graf
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouseAltinity Ltd
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Deploy Prometheus - Grafana and EFK stack on Kubic k8s Clusters
Deploy Prometheus - Grafana and EFK stack on Kubic k8s ClustersDeploy Prometheus - Grafana and EFK stack on Kubic k8s Clusters
Deploy Prometheus - Grafana and EFK stack on Kubic k8s ClustersSyah Dwi Prihatmoko
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021InfluxData
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 

Tendances (20)

Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
Apache Bigtop3.2 (仮)(Open Source Conference 2022 Online/Hiroshima 発表資料)
 
NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For Architects
 
DPDK Acceleration with Arkville
DPDK Acceleration with ArkvilleDPDK Acceleration with Arkville
DPDK Acceleration with Arkville
 
Presto on YARNの導入・運用
Presto on YARNの導入・運用Presto on YARNの導入・運用
Presto on YARNの導入・運用
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For Operators
 
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...
Room 3 - 2 - Trần Tuấn Anh - Defending Software Supply Chain Security in Bank...
 
HBase at LINE 2017
HBase at LINE 2017HBase at LINE 2017
HBase at LINE 2017
 
Free GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
 
Prometheus and Thanos
Prometheus and ThanosPrometheus and Thanos
Prometheus and Thanos
 
MapReduce/YARNの仕組みを知る
MapReduce/YARNの仕組みを知るMapReduce/YARNの仕組みを知る
MapReduce/YARNの仕組みを知る
 
Alluxio: Data Orchestration on Multi-Cloud
Alluxio: Data Orchestration on Multi-CloudAlluxio: Data Orchestration on Multi-Cloud
Alluxio: Data Orchestration on Multi-Cloud
 
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2...
 
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
Kubernetes環境に対する性能試験(Kubernetes Novice Tokyo #2 発表資料)
 
Accelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux KernelAccelerating Envoy and Istio with Cilium and the Linux Kernel
Accelerating Envoy and Istio with Cilium and the Linux Kernel
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouse
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Deploy Prometheus - Grafana and EFK stack on Kubic k8s Clusters
Deploy Prometheus - Grafana and EFK stack on Kubic k8s ClustersDeploy Prometheus - Grafana and EFK stack on Kubic k8s Clusters
Deploy Prometheus - Grafana and EFK stack on Kubic k8s Clusters
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 

Similaire à Observability with HAProxy: Logs, Metrics and Troubleshooting

Observability tips for HAProxy
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxyWilly Tarreau
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
Website & Internet + Performance testing
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testingRoman Ananev
 
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Ontico
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0Mike Belshe
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetJuan Berner
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackJakub Hajek
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaborationJulien Pivotto
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020OdessaJS Conf
 
University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)Mike Belshe
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Brian Brazil
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...NETWAYS
 
WTF is Sensu and Monitoring
WTF is Sensu and MonitoringWTF is Sensu and Monitoring
WTF is Sensu and MonitoringToby Jackson
 
Approaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandApproaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandMaarten Balliauw
 

Similaire à Observability with HAProxy: Logs, Metrics and Troubleshooting (20)

Observability tips for HAProxy
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxy
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Website & Internet + Performance testing
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testing
 
Log Files
Log FilesLog Files
Log Files
 
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
 
University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
 
WTF is Sensu and Monitoring
WTF is Sensu and MonitoringWTF is Sensu and Monitoring
WTF is Sensu and Monitoring
 
Approaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days PolandApproaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days Poland
 

Dernier

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Dernier (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Observability with HAProxy: Logs, Metrics and Troubleshooting

  • 2. WWW.HAPROXY.COM ● Introduction to “observability” ● Deep dive in HAProxy ● Tips (to get more from HAProxy) ● RoX (Return On eXperience) ● Conclusion (because we need one) Agenda
  • 4. WWW.HAPROXY.COM Observability: "control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" (wikipedia) In short for us: WTF is going on ? Definition
  • 5. WWW.HAPROXY.COM ● Monitoring tells you how well something works (or not) ● Observability helps you detect what is not working and why => You monitor an observable system. Observability vs monitoring
  • 6. WWW.HAPROXY.COM ● can’t rely on impatient users’ complaints anymore => detect and fix trouble not reported yet ● stop adding duct tape, address root causes! ● improve users experience where it matters Observability is crucial
  • 7. WWW.HAPROXY.COM ● central place ● in distributed systems, one LB per layer, even better when sidecar ● sees multiple targets, eases comparisons => draw references ● Trusted low level component & excellent transparency ● many LB decisions actually depend on metrics related to performance and observability ● again, logs provide long-term references The LB as an observation tower
  • 8. WWW.HAPROXY.COM ● Maintain two types of TCP connections ○ From client to HAPRoxy (1) ○ From HAProxy to server (2) ○ Only access to data payload (not the “packet”) LB in Reverse-proxy mode HAProxyClient Server (1) (2)
  • 9. WWW.HAPROXY.COM ● you should! Especially in distributed systems (microservices, etc)! ● but often it's too late when the first incident happens! ● with existing LB's logs, it's already possible to do a lot Note: check OpenTracing and Prometheus Why not looking from other points?
  • 10. WWW.HAPROXY.COM ● global failures (aborts, timeouts) ● abnormal delays caused by network retransmits ● connection failures and retries caused by bad tuning (eg: conntrack) ● connection slowdowns caused by inefficient firewall policies (#rules) ● Non exhaustive list ... What does the LB see?
  • 11. WWW.HAPROXY.COM ● client-side issues (BW limitations) ● per-URL processing time (application issues, svc partners) ● per-node vs per-cluster variations => narrow down to individual node or shared resource ● deployment issues : new occasional error on a specific page, can be addressed before going full-scale What does the LB see?
  • 13. WWW.HAPROXY.COM haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" I need help !!!
  • 14. WWW.HAPROXY.COM ● Logs : ○ Halog, ELK, Splunk, datadog… ○ Provides unique-id for tracing/event correlation ● Statistics : ○ Stats page, CLI, hatop ○ Prometheus, statsd, ... ● Stick-tables (per arbitrary key like IP, URL, cookie, ...) : ○ Byte count, cumulated/concurrent conns, errors, rates… Accessing metrics in HAProxy
  • 30. WWW.HAPROXY.COM ● HAProxy now supports heavier per-request workloads (Lua, device identification, …) ● Processing times over 200 µs can become noticeable ● Actions: ○ log per-request total CPU time spent in analysers ○ log per-request total CPU time spent in TLS handshake ○ log per-request total latency added by other tasks ○ Ability to kill offending tasks ○ Ability to alert on high latencies => make HAProxy as observable as other components And more to come in HAProxy 1.9 (Nov. 2018)
  • 31. WWW.HAPROXY.COM ● Timers are averaged in the stats ● Each timer appears in the logs ● halog -rt/-RT/-pct for quick analysis ● Each timer crossing a limit triggers a timeout ● Each abort at a specific step causes a hard error => termination codes haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Event timing reports Timers Term code Cookie code
  • 32. WWW.HAPROXY.COM ● Who, when, why and how the session has been terminated ● Distinguish between timeout and abort ● Indicate who (client, server, haproxy, kill, ...) ● Indicate when (req, queue, connect, response, data...) ● Completed by persistence cookie indications ● Filtered and sorted by halog : halog -tcn|-TCN ... # for filtering halog -tc # for sorting => Graphing the number of errors reported by second helps detecting an issue is in progress somewhere Termination codes
  • 33. WWW.HAPROXY.COM ● Stats page: distribution per frontend/backend/server ● Filter by ranges: halog -hs/-HS ● Sorted output: halog -st => graph the distribution and watch for variations between application deployments HTTP status code distribution
  • 34. WWW.HAPROXY.COM ● Uses server maxconn ● Grows exponentially with slowdowns : easy to detect! ● Tells you how many extra servers you need ● Reported by halog -Q/-QS ● Shown in real time on the stats page per backend/srv => If you watch only one metric, watch this one! Other relevant metrics: queues
  • 35. WWW.HAPROXY.COM LB algorithm implies fairness between servers : ● Equal request count with roundrobin => Higher than average concurrency indicates abnormally slow server ● Equal load with leastconn => Low req count indicates abnormally slow server => graph relevant values within the farm Other relevant metrics: LB fairness
  • 36. WWW.HAPROXY.COM ● Global: halog -e ● Per server: halog -srv ● Per client IP: halog -e -ic (detect bad CDN nodes) ● Per URL: halog -ue ● Stats page: per frontend/backend/server ● Stick-tables: per arbitrary key using http_err_rate() => no threshold, watch for variations Other relevant metrics: Error rate
  • 37. WWW.HAPROXY.COM ● Default httplog format is quite rich ● Can be improved using the log-format directive ● Hint: log stick-table stats for similar keys haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Useful entries in log-format Timers HTTP status Byte count Term Code Cookie Code Conn Count Queue Length
  • 39. WWW.HAPROXY.COM "I can't enable logs, I have too much traffic!" ● an average syslog server can store 20k events/s without sweating ● that's 1.7B events/day or 350GB of uncompressed haproxy logs/day ● compresses to 1TB/month ● for $100 you can store 4 months with no loss ● have more traffic / not interested in this level of detail ? # log only 5% of requests http-request set-log-level silent unless { rand(100) -lt 5 } Sampling: why / When Tips !!!!
  • 40. WWW.HAPROXY.COM ● you only want to catch suspicious events ● enable logs per url, source IP, etc... ● disable logging unless Tc/Tq/Tr/Tw/... is above a certain threshold ● on the fly for selected keys from the CLI + stick-table ● also see "option dontlognormal" WARNING: you'll lose any valid reference Selective logging: why / When Tips !!!!
  • 41. WWW.HAPROXY.COM ● Poorly documented, use halog --help ● response time per url: halog -uat ● errors per server: halog -srv ● Percentiles on req/queue/conn/resp times: halog -pct ● detect stolen CPU / swap : halog -ac … -ad … ● very fast (1-2 GB of log file per second) => Use it in production to figure the relevant metrics Other halog goodies Tips !!!!
  • 43. WWW.HAPROXY.COM ● Many ops and dev do this every day all around the world ● Users are complaining that the application is slow ● Devops team check HAProxy logs (using halog or ELK) in order to isolate potential sources of issue: ○ Check server Tc ○ Sort servers by response time ○ Sort URLs by response time ● HAProxy won’t fix the problem (in most cases), but will drastically reduce the troubleshooting time by isolating the potential issues Web application is slow!!!!!!!! Return On eXperience
  • 44. WWW.HAPROXY.COM ● Retail customer, with shops everywhere in France ● Cash register reports the following error message “Unable to reach central system” when searching people in the loyalty card database, after waiting up to 10s. ● Argument between customer and “central system” software provider: ○ Customer: your software is slow ○ Provider: your network triggers this error ● After installing HAProxy, its logs reported the following: ○ Time to get the query from the client: 300ms (from North of France to Paris) ○ Server response time: 50s ● Conclusion: provider turned off debug mode in the “central system” software Unavailable central system ?!? Return On eXperience
  • 45. WWW.HAPROXY.COM ● Customer with big TLS traffic, mostly API based ● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if there seem to be a lot of unused resources on the box ● At first, this was qualified as an “HAProxy” issue (including the HW and OS itself) ● Hard tuning the box did not improve anything ● halog Tc percentile reported higher values when the load increased (up to 30ms, on the LAN!!!) ● After asking customer to double check the Spanning Tree, it seemed a small 24 ports ToR switch became root bridge…. ● Fixing spanning tree also fixed HAProxy’s TLS performance HAProxy TLS performance issue Return On eXperience
  • 46. WWW.HAPROXY.COM ● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct ● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct => both haproxy and servers out of cause ● issue rate stable at various traffic levels => not congestion ● inter-switch link apparently at cause but not for all flows ● inter-switch link made of two fibers balanced on MAC tuple ● thanks to long-term logs, origin could even be identified Spotting a broken fiber channel between 2 core switches Return On eXperience
  • 47. WWW.HAPROXY.COM ● Tc abnormally high with lots of random values to several seconds, and only for TLS ● Tc timer also covers TLS handshake => not a network, hardware or performance issue, only server config. => system was regularly running out of entropy due to mistakenly using /dev/random as a random source for SSL TIP: use /dev/urandom ;) Wrong web server configuration using /dev/random Return On eXperience
  • 49. WWW.HAPROXY.COM ● exploit your stats ● Choose the right LB ● enable logs on LBs, no excuse for not doing it! ● process them automatically, manually once in a while ● compare numbers between similar objects ● detect anomalies ● fix problems before they are witnessed ● Enjoy :-) Conclusion Conclusion