Talk by Tom Wilkie at KubeCon 2016.
Abstract:
The rise of “cloud native” applications (many targeting Kubernetes) has had knock-on effects on the way development teams operate and on the complexity of monitoring your application. An order-of-magnitude increase in the number of moving parts and in the rate of change of applications requires us to reassess traditional techniques.
In this talk we will discuss some of the challenges raised and different approaches to solving them. We’ll discuss:
- the necessity of a common set of monitoring paradigms across all your microservices, and how this can reduce cognitive load and improve time to recover from failure.
- the increased importance of automated, interactive visualization tools now that responsibility for system architecture has been devolved to individuals and teams.
- exciting new ideas in distributed tracing, and how these can help deploy tracing across an increasingly heterogeneous production environment.
This talk will be hands-on at the command line, using open source tools, and will include anecdotes from real-world experience.
Outline:
- "monitoring" - i.e. timeseries metrics, considerations, etc.
- "visualisation" - interactive visualisation of the architecture of running systems
- "tracing" - distributed tracing, its importance, and how to deploy it
10. USE Method* - for every resource, check:
• utilization,
• saturation, and
• errors

RED Method - for every service, check request:
• rate,
• error (rate), and
• duration (distributions)

* http://www.brendangregg.com/usemethod.html

An alternative view
11. Okay, but how?
var rpcDurations = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "rpc_durations_histogram_microseconds",
	Help:    "RPC latency distributions.",
	Buckets: prometheus.LinearBuckets(0, 100, 20),
}, []string{"method"})

func init() {
	prometheus.MustRegister(rpcDurations)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()
	// ...
	rpcDurations.WithLabelValues(r.Method).Observe(
		float64(time.Since(begin)) / float64(time.Microsecond))
}
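The RED signals then fall out of queries against this histogram. A sketch in PromQL (metric names follow the code above; an error rate would additionally need a status-code label the snippet doesn't record, so only rate and duration are shown):

```promql
# Rate: requests per second, per method.
sum by (method) (rate(rpc_durations_histogram_microseconds_count[1m]))

# Duration: e.g. the 99th-percentile latency across all methods.
histogram_quantile(0.99,
  sum by (le) (rate(rpc_durations_histogram_microseconds_bucket[1m])))
```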
13. There must be a better way…
[diagram: incoming traffic from other services flowing through Kubeproxy to a set of Replicas]
28. Not a new topic
• Lots of literature
• Existing open source projects, e.g. Zipkin, originally from Twitter
29. • Challenge: detecting causality between incoming and outgoing requests
• Existing solutions require propagation of some unique ID (Dapper, Zipkin)
• This requires application-specific modifications
[diagram: some service, with an incoming request and outgoing requests whose causal link is unknown]
30. Can this be done without application modifications?
31. By intercepting system calls, build up a data structure of:
• which threads are reading from / writing to which FDs
• which FDs are talking to which IPs & ports

Use this to infer causality between incoming and outgoing connections.
[diagram: a service's system calls intercepted at the kernel boundary to infer the link]