Teach Application Telemetry

Teach your application eloquence.
Logs, metrics, traces.
Dmytro Shapovalov
Infrastructure Engineer @ Cossack Labs

Who are we?
• UK-based data security products and services
company 
• Building security tools to prevent sensitive data
leakage and to comply with data security
regulations 
• Cryptographic tools, security consulting, training 
• We are cryptographers, system engineers,
applied engineers, infrastructure engineers 
• We support the community, speak, teach, and open-source a lot

What we are going to talk about
• Why do we need telemetry?
• What are the different kinds of telemetry?
• Limits of applicability of the various types of
telemetry
• Approaches and mistakes
• Implementation

What is telemetry?
«Gathering data on the use of applications and
application components, measurements of start-up
time and processing time, hardware, application
crashes, and general usage statistics.»

Why do we need telemetry at all?
Who are the consumers? 
− developers 
− devops/sysadmins 
− analysts 
− security staff
What purposes? 
− debug 
− monitor state and health 
− measure and tune performance 
− business analysis 
− intrusion detection

It is worthwhile, indeed
• speed up the development process
• increase overall stability
• reduce the reaction time to crashes and intrusions
• adequate business planning
• COST of development
• COST of use

What data do we have to export?
… we can ask any specialist.
— ALL!… will be their answer.

Classification of information
technical: 
− state 
− health 
− errors 
− performance 
− debug 
− events
business: 
− SLI 
− user actions
developers
devops/sysadmins
analysts
security staff

SIEM: security staff’s main instrument
Complex analysis: 
− correlation 
− threats 
− patterns 
− compliance
Applications
Network devices
Servers
Environment

Telemetry evolution
Logs
• each application has an individual log file
• syslog: 
− message standard (RFC 3164, 2001) 
− aggregation
• ELK (agents, collectors)
• HTTP, JSON, protobuf
Metrics
• reports into logs
• agents, collectors, stores with proprietary protocols
• SNMP
• HTTP, protobuf
• custom implementations
Traces
• stores with proprietary protocols
• custom implementations

Telemetry applicability
Logs
• simplest
• no external tools required
• human readable
• UNIX-style
• compatible with tons of tools
• queries
• alerts
Metrics
• minimal storage size
• low performance impact
• performance measurement
• health and state observation
• special structures
• queries
• alerts
Traces
• low performance impact
• per-query metrics
• low-level information
• precise debugging and performance tuning
+ SIEM systems

Telemetry flow
creation
transport
aggregation
normalization
storage
analysis + alerting
visualization
archive

Logs : kinds of data
• initial information about the application
• state changes (start/ready/…/stop)
• health changes
• audit trail (security-relevant list of activities: financial
operations, health care data transactions, changing keys,
changing configuration)
• user sessions (sign-in attempts, sign-out, actions)
• unexpected actions (wrong URLs, failed sign-ins, etc.)
• various information in string format

Logs : on start
• new state: starting
• application name
• component name
• commit hash / build number
• configuration in use
• deprecation warnings
• running mode

Logs : on ready
• new state: ready
• listen interfaces, ports and sockets
• health

Logs : on state or health change
• new state
• reason
• URL to documentation
Use a traffic-light highlight system for health states: 
● red: completely unhealthy 
● yellow: partially healthy, reduced functionality 
● green: completely healthy

Logs : on shutdown
• reason
• status of shutdown preparation
• new state: stopped (final goodbye)

Logs : each line
• timestamps (ISO 8601, TZ, reasonable precision)
• PID
• application/component short name
• application version (JSON, CEF, protobuf)
• severity (CEF: 0→10, rfc5427: 7→0)
• event code (HTTP style)
• human-readable message
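A minimal sketch of how the per-line fields above could be produced with Ruby's standard Logger and a JSON formatter. The component name, the `event_code` key and the `build_logger` helper are assumptions made for illustration; they are not from the slides.

```ruby
require "json"
require "logger"
require "stringio"
require "time"

# Hypothetical component name, used only for this sketch.
COMPONENT = "billing-api"

# Returns a Logger that writes one JSON object per line with the fields
# listed above: timestamp, PID, component, severity, event code, message.
def build_logger(io)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, msg|
    # Accept a plain string or a hash that carries an event code.
    event = msg.is_a?(Hash) ? msg : { message: msg.to_s }
    JSON.generate(
      timestamp: time.utc.iso8601(3), # ISO 8601, UTC, millisecond precision
      pid: Process.pid,
      component: COMPONENT,
      severity: severity,
      event_code: event[:code],
      message: event[:message]
    ) + "\n"
  end
  logger
end

out = StringIO.new
log = build_logger(out)
log.warn({ code: 404, message: "unknown URL requested: /admin.php" })
puts out.string
```

Passing `$stderr` or a file path instead of the `StringIO` sends the same lines to the console or to a file.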

Logs : do not export!
• passwords, tokens, any sensitive data — security risks
• private data — legal risks
Use:
− masking
− anonymisation / pseudonymisation
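Masking and pseudonymisation can be sketched as a filter applied to the event fields before they reach the logger. The key lists and the `scrub` helper below are hypothetical; a real application would choose its own lists and keep the HMAC key secret.

```ruby
require "openssl"

# Assumed key lists for this sketch; adjust to the data your application handles.
SENSITIVE_KEYS = %w[password token secret authorization].freeze
PSEUDONYM_KEYS = %w[email user_id].freeze

# Returns a copy of the fields that is safe to log:
# - sensitive values are masked completely;
# - private identifiers are replaced by a keyed hash, so events of one user
#   can still be correlated without exposing the identifier itself.
def scrub(fields, hmac_key)
  fields.each_with_object({}) do |(key, value), clean|
    name = key.to_s.downcase
    clean[key] =
      if SENSITIVE_KEYS.include?(name)
        "***"
      elsif PSEUDONYM_KEYS.include?(name)
        OpenSSL::HMAC.hexdigest("SHA256", hmac_key, value.to_s)[0, 16]
      else
        value
      end
  end
end

p scrub({ "user_id" => 42, "password" => "hunter2", "path" => "/login" }, "demo-key")
```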

Logs : consumers
• Console
• Files
• General-purpose collector/store/alert/search system
• SIEM

Logs : consumers and formats
Consumers (columns): console/STDERR | file | syslog | ELK | SIEM | socket, HTTP, custom
plain: ✓
syslog (RFC 3164): ✓ ✓ ✓ ✓ ✓ ✓
JSON: ✓ ✓ ✓ ✓ ✓ ✓
CEF: ✓ ✓ ✓ ✓ ✓ ✓
protobuf: ✓ ✓

CEF naming, data formats
+
JSON/protobuf/… transport
=
painless logging
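One way to read the formula above: keep a single event structure that follows CEF naming and serialize it either as JSON or as a classic CEF line. The sketch below assumes snake_case JSON keys for the CEF header fields and uses a few common CEF extension keys (`src`, `suser`, `msg`); the vendor, product and event values are invented.

```ruby
require "json"

# CEF header fields in their fixed order. The snake_case key names are an
# assumption of this sketch, not part of the CEF specification.
CEF_HEADER = %i[device_vendor device_product device_version
                device_event_class_id name severity].freeze

# Header values escape the pipe and the backslash.
def escape_header(value)
  value.to_s.gsub(/[\\|]/) { |c| "\\#{c}" }
end

# Extension values escape the equals sign, the backslash and newlines.
def escape_extension(value)
  value.to_s.gsub(/[\\=]/) { |c| "\\#{c}" }.gsub("\n") { "\\n" }
end

# Renders the event as "CEF:0|vendor|product|version|class id|name|severity|extension".
def to_cef(event)
  header = CEF_HEADER.map { |key| escape_header(event.fetch(key)) }
  extension = event.fetch(:extension, {}).map { |k, v| "#{k}=#{escape_extension(v)}" }
  "CEF:0|#{header.join('|')}|#{extension.join(' ')}"
end

event = {
  device_vendor: "Example Corp", device_product: "billing-api",
  device_version: "1.4.2", device_event_class_id: 401,
  name: "Sign-in failed", severity: 5,
  extension: { src: "203.0.113.7", suser: "alice", msg: "bad password" }
}

puts JSON.generate(event) # JSON transport with CEF naming
puts to_cef(event)        # the same event as a CEF line
```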

Logs : bear in mind [1/3]
• Logs will be read by humans, often during a failure and with
limited time to react. Be brief and eloquent. Give information
that may help to solve the problem.
• Logs will be searched. Don’t be a poet, be a technical
specialist. Use expected words.
• Logs will be parsed automatically; indeed, they will.
There are too many different systems that want telemetry
from your application.
• Carefully classify the severity of events. Many error
messages instead of warnings in non-critical situations
will lead to ignoring information from the logs.

Logs : bear in mind [2/3]
• Whenever possible, build on existing standards.
Grouping event codes according to the HTTP
status code table is not a bad idea.
• Logs are the first resource for analyzing security
incidents.
• Logs will be archived and stored for a long period
of time. It will be almost impossible to cut out
pieces of data afterwards.
• Logging should be configurable: formats, transport
protocols, paths, severity.

Logs : bear in mind [3/3]
• Your application may run in many different
environments with different logging standards (VM,
Docker). The application should be able to direct all logs
into one channel. Splitting may be an option.
• Do not implement log file rotation yourself. Provide a way
to tell your application to gracefully reopen its log file
after an external service has rotated it.
• When big trouble occurs and nothing works, your
application should be able to print readable logs in the
simplest manner: to stderr/stdout.
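The last two points can be sketched with the standard Logger: an external tool (logrotate, for example) renames the file and sends a signal, and the application reopens the same path. The `ReopenableLog` class, the choice of SIGHUP and the `open_logger` fallback are assumptions for illustration.

```ruby
require "logger"

# Falls back to STDERR when the log file cannot be opened.
def open_logger(path)
  Logger.new(path)
rescue SystemCallError
  Logger.new($stderr)
end

class ReopenableLog
  attr_reader :logger

  def initialize(path)
    @logger = open_logger(path)
    @reopen_requested = false
    # The handler only sets a flag: Logger takes a lock internally,
    # which is not allowed inside a signal handler.
    Signal.trap("HUP") { request_reopen }
  end

  def request_reopen
    @reopen_requested = true
  end

  # Call from the main loop, for example once per request or tick.
  def reopen_if_requested
    return unless @reopen_requested

    @reopen_requested = false
    @logger.reopen # reopens the original path, creating a fresh file
    @logger.info("log file reopened after rotation")
  end
end
```

With this in place the rotation tool only needs to rename the file and send `HUP` to the process; the application never rotates files on its own.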

Logs : implementation
• native Ruby methods
• semantic_logger 
https://github.com/rocketjob/semantic_logger 
(a lot of destinations: DBs, HTTP, UDP, syslog)
• ougai 
https://github.com/tilfin/ougai 
(JSON)
• httplog 
https://github.com/trusche/httplog 
(HTTP logging, JSON support)

Metrics : approaches
• USE method 
Utilization, Saturation, Errors
• Google SRE book 
Latency, Traffic, Errors, Saturation
• RED method 
Rate, Errors, Duration

Metrics : utilization
• Hardware resources: CPU, disk system, network
interfaces
• File system: capacity, usage
• Memory: capacity, cache, heap, queue
• Resources: file descriptors, threads, sockets, connections
The average time that the resource was busy
servicing work.
Usage of resource.

Metrics : traffic, rate
• normal operations:  
− requests 
− queries 
− transactions 
− sending network packets 
− processing flow bytes
A measure of how much demand is being placed
on your system. (Google SRE book)
The number of requests, per second, your services
are serving. (RED Method)

Metrics : latency, duration
The time it takes to service a request. (Google SRE
book)
• latency of operations:  
− requests 
− queries 
− transactions 
− sending network packets 
− processing flow bytes

Metrics : errors
• error events: 
− hardware errors 
− software exceptions 
− invalid requests / input 
− authentication failures 
− invalid URLs
The count of error events. (USE Method)
The rate of requests that fail, either explicitly,
implicitly, or by policy. (Google SRE book)

Metrics : saturation
• calculated value, measure of current load
The degree to which the resource has extra work
which it can't service, often queued. (USE Method)
How "full" your service is. A measure of your
system fraction, emphasizing the resources that are
most constrained. (Google SRE book)

Metrics : saturation
• can be calculated internally or measured
externally
• high utilization is a problem
• high saturation is a problem
• low utilization level does not guarantee that
everything is OK
• low saturation (in the case of a correct calculation)
most likely indicates that everything is OK
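A small sketch of the difference between the two values for a hypothetical worker pool: utilization says how busy the workers are, saturation says how much work is waiting because no worker is free. The `PoolStats` structure and the per-worker normalization of the queue length are assumptions of this example.

```ruby
# Snapshot of a hypothetical worker pool.
PoolStats = Struct.new(:workers, :busy, :queued, keyword_init: true) do
  # Share of workers that are doing work right now (0.0..1.0).
  def utilization
    busy.to_f / workers
  end

  # Queued work per worker: anything above zero means requests are waiting.
  def saturation
    queued.to_f / workers
  end
end

full_but_keeping_up = PoolStats.new(workers: 4, busy: 4, queued: 0)
overloaded          = PoolStats.new(workers: 4, busy: 4, queued: 12)

puts full_but_keeping_up.utilization # 1.0
puts full_but_keeping_up.saturation  # 0.0
puts overloaded.utilization          # 1.0
puts overloaded.saturation           # 3.0
```

Both snapshots show the same utilization; only saturation tells them apart, which is why the slides recommend exporting it.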

OpenMetrics : based on Prometheus metric types
• Gauge 
single numerical value 
− memory used 
− fan speed 
− connections count
• Counter 
single monotonically increasing counter 
− operations done 
− errors occurred 
− requests processed
• Histogram 
counters incremented per bucket 
− requests count per latency buckets 
− CPU load values count per range buckets
• Summary 
similar to the Histogram, but φ-quantiles are calculated on the client side;
other quantiles cannot be derived afterwards
https://openmetrics.io/
https://prometheus.io/docs/concepts/metric_types/
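To make the first three types concrete, here is a plain-Ruby, in-memory sketch. It is not the Prometheus client API; the class and method names are chosen for this example, and thread safety is left out.

```ruby
# Counter: a value that only goes up (operations done, errors occurred).
class Counter
  attr_reader :value

  def initialize
    @value = 0
  end

  def increment(by = 1)
    raise ArgumentError, "counters only go up" if by.negative?

    @value += by
  end
end

# Gauge: a single value that can go up and down (memory used, connections).
class Gauge
  attr_accessor :value

  def initialize
    @value = 0
  end
end

# Histogram: observation counts per bucket, plus the sum and total count.
class Histogram
  attr_reader :counts, :sum, :count

  def initialize(upper_bounds)
    @bounds = upper_bounds.sort + [Float::INFINITY]
    @counts = @bounds.to_h { |bound| [bound, 0] }
    @sum = 0.0
    @count = 0
  end

  def observe(value)
    # Prometheus-style cumulative buckets: every bucket whose upper
    # bound is >= the observed value is incremented.
    @bounds.each { |bound| @counts[bound] += 1 if value <= bound }
    @sum += value
    @count += 1
  end
end

latency = Histogram.new([0.1, 0.5, 1.0])
[0.05, 0.3, 2.0].each { |seconds| latency.observe(seconds) }
p latency.counts
```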

OpenMetrics : Average vs Percentile
[Figure: average compared with the 99th percentile]
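A small numeric illustration of why the two differ, using invented latency samples and the nearest-rank definition of a percentile (other definitions exist and give slightly different values).

```ruby
def mean(values)
  values.sum.to_f / values.size
end

# Nearest-rank percentile: the smallest sample such that at least
# p percent of all samples are less than or equal to it.
def percentile(values, p)
  sorted = values.sort
  rank = (p * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Invented data: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20] * 98 + [3000] * 2

puts mean(latencies_ms)           # 79.6
puts percentile(latencies_ms, 50) # 20
puts percentile(latencies_ms, 99) # 3000
```

The average matches no real request, while the 99th percentile exposes the slow tail that some users actually experience.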

Metrics : buckets
< 10 < 20 < 30 < 40 < 50 < 60 < 70 < 80 < 90 < 100
[Figure: observation counts per bucket, with the 50th and 90th percentiles marked]
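A sketch of how a percentile can be estimated from bucket counts alone, using linear interpolation inside the bucket that contains the requested rank (similar in spirit to what Prometheus does on the server side). The function name is hypothetical, and the lowest bucket is assumed to start at 0.

```ruby
# buckets: array of [upper_bound, cumulative_count], sorted by upper bound.
# Returns an estimate of the q-quantile (0.0..1.0), or nil without data.
def quantile_from_buckets(q, buckets)
  total = buckets.last[1]
  return nil if total.zero?

  rank = q * total
  prev_bound = 0.0 # assumption: the lowest bucket starts at zero
  prev_count = 0
  buckets.each do |bound, cumulative|
    if cumulative >= rank
      in_bucket = cumulative - prev_count
      return prev_bound if in_bucket.zero?

      # Assume observations are spread evenly inside the bucket.
      return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
    end
    prev_bound = bound.to_f
    prev_count = cumulative
  end
end

# Ten buckets (< 10 ... < 100) with one observation in each.
buckets = (1..10).map { |i| [i * 10, i] }
puts quantile_from_buckets(0.5, buckets) # 50.0
puts quantile_from_buckets(0.9, buckets) # 90.0
```

The precision of the estimate depends on the bucket boundaries, which is one reason to make them configurable.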

Metrics : export data
• current state
• current health
• event counters: 
− AAA events 
− unexpected actions (wrong URLs, failed sign-ins) 
− errors during normal operations
• performance metrics 
− normal operations 
− queues 
− utilization, saturation 
− query latency
• application info: 
− version 
− warnings/notifications gauge

Metrics : formats
• suggest using Prometheus format 
− native for Prometheus 
− OpenMetrics — open source specification 
− simple and clear 
− HTTP-based 
− can be easily converted 
− libraries exist
• Influx or similar format if you really need to implement
push model
• protobuf / gRPC 
− custom 
− high load
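To show how simple the Prometheus text format is, here is a hand-rolled sketch that renders one metric family. In practice a client library would do this; `render_metric` and the metric below are invented for illustration.

```ruby
# Label values escape the backslash, the double quote and newlines.
def escape_label(value)
  value.to_s.gsub(/[\\"]/) { |c| "\\#{c}" }.gsub("\n") { "\\n" }
end

# samples: hash of { labels_hash => numeric_value }.
def render_metric(name, type, help, samples)
  lines = ["# HELP #{name} #{help}", "# TYPE #{name} #{type}"]
  samples.each do |labels, value|
    label_text = labels.map { |k, v| %(#{k}="#{escape_label(v)}") }.join(",")
    label_text = "{#{label_text}}" unless label_text.empty?
    lines << "#{name}#{label_text} #{value}"
  end
  lines.join("\n") + "\n"
end

puts render_metric(
  "http_requests_total", "counter", "Requests processed.",
  { { method: "get", code: "200" } => 1027, { method: "post", code: "500" } => 3 }
)
```

Serving this text from an HTTP endpoint is all a pull-based collector needs to scrape the application.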

Metrics : implementation
• Prometheus Ruby client 
https://github.com/prometheus/client_ruby
• native Ruby methods

Metrics : bear in mind [1/2]
• Split statistics by type. For example, aggregating
successful (relatively long) and failed (relatively
short) durations may create the illusion of a
performance increase when multiple failures occur.
• Whenever possible, use saturation to determine
system load. Utilization alone is incomplete
information.
• Be sure to export the metrics of the component
closest to the user. This makes it possible to evaluate the SLI.
• Implement configurable bucket sizes.
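The first point can be shown with invented numbers: when half of the requests start failing fast, the combined average duration looks better than before, while the average of successful requests has not changed.

```ruby
Request = Struct.new(:ok, :duration_s)

def mean_duration(requests)
  return nil if requests.empty?

  requests.sum(&:duration_s) / requests.size
end

# Invented data: successes take 0.2 s, failures return after 0.01 s.
healthy  = Array.new(100) { Request.new(true, 0.2) }
degraded = Array.new(50) { Request.new(true, 0.2) } +
           Array.new(50) { Request.new(false, 0.01) }

puts mean_duration(healthy)                 # about 0.2
puts mean_duration(degraded)                # about 0.105, looks "faster"
puts mean_duration(degraded.select(&:ok))   # still about 0.2
puts mean_duration(degraded.reject(&:ok))   # about 0.01
```

Exporting the duration with a label such as success/failure keeps the two populations apart.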

Metrics : bear in mind [2/2]
• Export appropriate metrics as buckets. This lowers the
required polling rate and makes it possible to get statistics
in percentiles.
• Add units to metric names.
• Whenever possible, use SI units.
• Follow the naming standard. The Prometheus
“Metric and label naming” document is a good
base.

Traces : definition
In software engineering, tracing involves a
specialized use of logging to record information
about a program's execution.
…
There is not always a clear distinction between
tracing and other forms of logging, except that the
term tracing is almost never applied to logging that is
a functional requirement of a program.
— Wikipedia

Traces : use cases
• Debugging during development
• Measuring and tuning performance
• Analyzing failures and security incidents
https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
• Approaches
• Library comparison
• Implementation example
• Use cases

Traces : principles
• Low overhead
• Application-level transparency
• Scalability

Traces : spans in trace tree
https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf

Traces : kinds of data
Per request/query tracking:
• trace id
• span id
• parent span id
• application info (product, component)
• module name
• method name
• context data (session/request id, user id, …)
• operation name and code
• start time
• end time
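The fields above can be sketched as a span record plus a tiny in-process tracer that links child spans to their parent. This is an illustration only: `Span`, `Tracer` and the id lengths are assumptions, and real libraries also handle export and cross-process propagation.

```ruby
require "securerandom"

Span = Struct.new(:trace_id, :span_id, :parent_span_id, :operation,
                  :context, :start_time, :end_time, keyword_init: true) do
  def duration
    end_time - start_time
  end
end

class Tracer
  attr_reader :finished

  def initialize
    @finished = [] # completed spans, ready to be exported
    @stack = []    # spans that are currently open
  end

  # Runs the block inside a new span. A nested call becomes a child
  # of the span that is open at that moment and shares its trace id.
  def span(operation, context = {})
    parent = @stack.last
    current = Span.new(
      trace_id: parent ? parent.trace_id : SecureRandom.hex(16),
      span_id: SecureRandom.hex(8),
      parent_span_id: parent&.span_id,
      operation: operation,
      context: context,
      start_time: Time.now
    )
    @stack.push(current)
    yield current
  ensure
    if current
      current.end_time = Time.now
      @stack.pop
      @finished << current
    end
  end
end

tracer = Tracer.new
tracer.span("GET /orders", { request_id: "req-1" }) do
  tracer.span("SELECT orders") { sleep 0.01 }
end
tracer.finished.each { |s| puts "#{s.operation}: #{s.duration.round(3)} s" }
```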

Traces : consumers
• General purpose collectors: 
− Jaeger 
− Zipkin
• Cloud collectors: 
− Google StackDriver 
− AWS X-Ray 
− Azure Application Insights
• SIEM

Traces : formats
• Proprietary protocols: 
− Jaeger 
− Zipkin 
− Google StackDriver 
− AWS X-Ray 
− Azure Application Insights
• JSON: 
− SIEM
• protobuf/gRPC: 
− custom

Traces : implementation
• OpenCensus 
https://www.rubydoc.info/gems/opencensus 
(Zipkin, GC Stackdriver, JSON)
• OpenTracing 
https://opentracing.io/guides/ruby/
• Jaeger client 
https://github.com/salemove/jaeger-client-ruby

Checklist : Logs
□ Each line: 
□ timestamps (ISO 8601, TZ, reasonable precision) 
□ PID 
□ component name 
□ severity 
□ event code 
□ human-readable message
□ Events to log: 
□ state changes (start/ready/pause/stop) 
□ health changes (new state, reason, doc URL) 
□ user sign-in attempts (including failed with reasons), actions, sign-out 
□ audit trail 
□ errors
□ On start: 
□ product name, component name 
□ version (+build, +commit hash) 
□ running mode (debug/normal, daemon/) 
□ deprecation warnings 
□ which configuration is in use (ENV, file, configuration service)
□ On ready: communication sockets and ports
□ On exit: reason
□ Do not log: 
□ passwords, tokens 
□ personal data

Checklist : Metrics
□ Data to export: 
□ application (version, warning/notification) 
□ utilization (resources, capacities, usage) 
□ saturation (internally calculated or appropriate metrics) 
□ rate (operations) 
□ errors 
□ latencies
□ Split metrics by types
□ Export as buckets when reasonable
□ Configure size of buckets
□ Export metrics for SLI
□ Determine required resolution
□ Normalize, use SI units, add units to names
□ Prefer the poll model if possible
□ Clear counters on restart

Links [1/2]
• Dapper, a Large-Scale Distributed Systems Tracing Infrastructure 
https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf
• How to Implement Tracing in a Modern Distributed Application 
https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
• OpenTracing 
https://opentracing.io/
• OpenMetrics 
https://github.com/RichiH/OpenMetrics
• OpenCensus 
https://opencensus.io

Links [2/2]
• CEF 
https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf
• Metrics : USE method 
http://www.brendangregg.com/usemethod.html
• Google SRE book 
https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
• Metrics : RED method 
https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
• MS Azure : monitoring and diagnostic 
https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring
• Prometheus : Metrics and label names 
https://prometheus.io/docs/practices/naming/

Dmytro Shapovalov
Infrastructure Engineer @ Cossack Labs
Thank you!
shadinua
shad.in.ua
