2. Content:
1. Whois Travix
2. Elastic Stack for logging - Travix use cases
3. Brief history - DC to GCP
4. Issues and Solutions
5. Detailed overview of current setup
6. Cost
7. Formats and conventions
8. Tricks and tools
9. Next Steps
10. Questions
5. Travix in numbers:
Bookings/year: 4 million
Countries: 39
50+ planes filled every day
Searches/second: 300-500
Services in production: 300-400
6. Development and System Engineering:
Product teams: 12-14
System Engineers (SWAT): 10
Development Locations: Amsterdam, London, Minsk
Languages: .Net, Java, Golang, NodeJS, Python
10. “Netflix microservices architecture is a
metrics generator that occasionally streams
movies”
https://medium.com/netflix-techblog/stream-processing-with-mantis-78af913f51a6
12. “SLA” Travix logging:
- Does not guarantee 100% delivery (BEST EFFORT).
In some cases it is better to drop 4 hours of queued log events to recover the system
and let recent logs reach the cluster.
- ElasticSearch may drop an entire log message when the type of a field in the message
clashes with the type defined in the index.
(Example: “response” was a String and became an Object - see the sketch below.)
- Normally delivers and stores an event only once (no duplication).
We had a number of incidents where bugs in the pipelines resulted in the same log message
being stored several times.
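A minimal sketch of that type clash (index name and values are hypothetical; the exact API path varies by ES version): once dynamic mapping has typed “response” as a string, a later document carrying an object in the same field is rejected.

  # first document: dynamic mapping types "response" as a string
  POST /logs-2017.02.01/doc
  { "response": "OK" }

  # later document: "response" is now an object, the whole event is rejected
  POST /logs-2017.02.01/doc
  { "response": { "status": 200 } }
  # -> 400 mapper_parsing_exception: failed to parse field [response]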
13. Logging use cases:
a) Access to detailed context of real time events (<2h) - incident
handling/distributed debugging
b) Platform issues GCP/GKE (<24h) - investigation
c) Regression/bugs introduced by new releases (<24h) - investigation
d) Analysis of historical data ( > 48h) - statistics/trends comparison
14. Use cases:
Main role of the logging system: real-time incident handling and distributed
debugging - seeing the detailed context of events as they happen across systems.
This may be opinionated, but when the Logging Cluster is down Travix IT is 99.9% blind -
no deployments, no application-level incident handling.
15. Examples of usage:
- AIR team (Bangalore): investigating failed bookings from the website, relying mainly
on the XML payloads used for escalation to the GDS
- RulesAPI Development (Amsterdam): monitoring of rules execution - requests,
timings, response codes, triggered versions of InRule
- Pricing (San Jose): daily monitoring using landing/search and book/landing ratios
- System Engineers: monitoring, incident handling, investigation
20. Logging 3.0 February 2017
Version: ES 3.x
Hardware: 7 x CPU: 16, Memory: 64GB, Disk System: 2x120GB, Disk Data: 8x960GB
Volume: 400-800 GB/day
Configuration:
path.data: /mnt/sdc, /mnt/sdd, /mnt/sde, /mnt/sdf, /mnt/sdg, /mnt/sdh, /mnt/sdi, /mnt/sdj
JVM Heap (ES data node ): 30GB
Indexing: single daily index, Curator optimise scheduled by a Linux cronjob at 1:00 AM
Traffic distribution: 95% DC, 5% GCP
Pipelines:
(DC) APP Travix Logger -> (DC) RMQ -> (DC) LS -> (DC) ES
(GCP) Fluentd -> (GCP) RMQ -> (DC) RMQ -> (DC) LS -> (DC) ES
21. Issues:
● Sudden increases in logging volume slow down ingestion and search.
● No easy way to see what type of data is DDoSing the cluster.
A Kibana dashboard reporting the number of documents stored and their size, grouped by application name -
this heavy search adds even more load during a meltdown.
● RabbitMQ configured in HA cluster mode is huge and unstable - memory, shovels, slow to recover.
● Curator optimise (index merge) on the daily index takes around 12-18 hours, affecting cluster functionality.
● People treat Kibana like a Google search - full-text search on most of the fields and all possible queries with
“*”.
● Disk recovery causes large disruption - many disks are used by a single ES process, so a malfunctioning disk affects
the performance of the entire process, and to replace a disk the ES process has to be restarted, taking out 1/7 of cluster
capacity.
● ES index datatype collisions - dynamic mappings and different applications storing data in the same index
may cause events to be silently dropped by ES; an index rotation changing the type of a field causes
even more confusion.
● Size of log messages up to 10MB (SOAP, XML binary blobs). ES can handle this, but it feels very wrong.
22. Logging 5.0 May - August 2017
Version: ES 5.x
Volume: 800-1000 GB/day
Hardware: 4 x CPU: 48, RAM: 264GB, Disk System: 2 x 120GB, Disk Data: 4 x 2TB
Configuration:
4 x docker containers (ES data nodes) per physical server
JVM Heap (ES data node ): 28GB
path.data: each data node has own dedicated 2TB SSD
Total data storage: 32 TB
Replica anti-affinity (elasticsearch.yml):
node.attr.rack_id: "${rack_id}"
cluster.routing.allocation.awareness.attributes: rack_id
Indexing: hourly index, Curator optimise scheduled by Rundeck
Traffic distribution: 60% DC, 40% GCP
23. Issues:
● Change management: people had a hard time saying goodbye to Kibana 3; mess with stored
searches/dashboards (hard to find owners).
● “Snowball effect” during major production issues - a 2x-3x increase in write/search volume. People run
random heavy queries selecting weeks of data.
● Impossible to see which application/system is DDoSing the cluster. The slow query log in ES is hard to read - no
easy way to see who executed a certain search.
● When DC networking is slow/saturated, RMQ is unable to keep up with delivery, causing a large backlog that is slow
to process and stresses the ES cluster even more once the link is restored.
● Fluentd under load also needs some attention (memory/buffer tuning) - an OOM in fluentd results in
duplication of log messages and RMQ memory issues; bugs in plugins may result in the content of log files not
being read/shipped.
● People messing up timezone settings and logging events “in the future”.
● The ES data node JVM periodically crashes after a week of running, with a kernel core dump on Linux/Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1644056
● Max log entry size limit: Stackdriver Logging - 100KB, Docker introduces 16KB (thanks Docker!).
24. Issue: people messing up timezone settings and logging events “in the future”.
Solution: a Logstash custom ruby filter that compares timestamps and logs a warning message about
timestamp issues:
filter {
  ruby {
    # threshold can be configured with "threshold_delta_minutes"
    code => '
      threshold_delta_minutes = 30
      event_message = event.to_hash
      time_now = Time.now.to_f
      time_event = event.get("@timestamp").to_f
      log_message = "Time control - log event in future >30 minutes [ logstash_time #{Time.at(time_now)} < event_time #{Time.at(time_event)} ] event: #{event_message}"
      # warn when the event timestamp is more than the threshold ahead of logstash time
      @logger.warn(log_message) if (time_event - time_now) > (60 * threshold_delta_minutes)
      # 20160 minutes = 2 weeks: drop anything older than that
      event.cancel if (time_now - time_event) > (60 * 20160)
      @logger.warn("Dropped event older than 2 weeks") if (time_now - time_event) > (60 * 20160)
    '
  }
}
25. Issue: max log entry size limit of 16KB.
Solution: store large payloads (up to 10MB) directly in BQ and log only the URL of the payload.
A custom application logger writes the payload to a local log file and a lightweight log message to stdout; both share the
same UUID key to link the events. Use fluentd to store the message in BQ, and a URL-formatted field in Kibana to let users
access the payload:
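A minimal sketch of the BQ leg, assuming the fluent-plugin-bigquery output plugin (v2 syntax); the tag, credentials path, project, dataset and table names are hypothetical:

  # ship large payload records (matched by tag) straight to BigQuery
  <match payload.**>
    @type bigquery_insert
    auth_method json_key
    json_key /etc/fluent/bq-keyfile.json    # service account credentials
    project travix-logging                  # hypothetical project/dataset/table
    dataset logs
    table payloads
    fetch_schema true                       # take the schema from the existing BQ table
  </match>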
26. Issue: misbehaving applications and test deployments may produce a huge volume of logs, affecting the delivery of
application logs without adding real value.
Solution: be pragmatic and filter out all unknown (non-“travix”-formatted) events at the filebeat level,
leaving a “backdoor” - containers with the label “elasticsearch_debug”:
- drop_event.when.not:
    or:
      - equals:
          type: "v2"
      - equals:
          logformat: "v3"
      - equals:
          kubernetes.namespace: "kube-system"
      - equals:
          kubernetes.labels.elasticsearch_debug: "true"
27. Issue: impossible to see which application/system is DDoSing the cluster.
Solution: measure and display logging rate/size per application at the earliest possible stage (before the pipelines or ES).
Fluentd plugins: “flowcounter” + “prometheus”.
Collecting the logging rate in Prometheus allows us to set alerts and automate the detection of misbehaving
applications.
Unfortunately we could not find an easy way to achieve something similar in filebeat.
For simple DDoS prevention Logstash has the throttle filter plugin (which we use), but it has no easy way
to integrate with Prometheus.
29. A Fluentd record has a tag:
"kubernetes.var.log.containers.cfe-14080-vay-us-31935-gksb3_frontend-acceptance_cfe-14080-vay-us-4bnj.log"

<store>
  type flowcounter
  count_keys *
  unit minute
  aggregate tag
</store>

<filter flowcount>
  @type record_transformer
  enable_ruby true
  <record>
    # parse container name and namespace out of the tag
    container_name ${record['tag'].sub("kubernetes.var.log.containers.","").sub(".log","").split("_")[2].rpartition("-")[0]}
    namespace ${record['tag'].sub("kubernetes.var.log.containers.","").sub(".log","").split('_')[1]}
  </record>
</filter>
30.
<filter flowcount>
  @type prometheus
  <labels>
    container_name ${container_name}
    namespace ${namespace}
  </labels>
  <metric>
    name fluentd_flowcounter_logging_lines_per_minute
    type gauge
    desc logged lines per minute
    key count
  </metric>
</filter>

<match flowcount>
  type null
</match>
31. Issue: log events “silently” dropped by ES as a result of ES index data type collisions.
Solution: move away from dynamic mappings to static mappings and a rigid logging format (see the template
sketch below); for the unlucky ones - a DLQ (dead letter queue) with alerting.
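A minimal sketch of a static mapping via an index template (ES 6.x syntax; the template name, index pattern and properties are hypothetical, with field names borrowed from the V3 format):

  PUT _template/logs-v3
  {
    "index_patterns": ["v3-*"],
    "mappings": {
      "doc": {
        "dynamic": "strict",
        "properties": {
          "messagetype":        { "type": "keyword" },
          "messagetypeversion": { "type": "integer" },
          "payload":            { "type": "object", "dynamic": true }
        }
      }
    }
  }

With "dynamic": "strict" a document carrying an unknown top-level field is rejected explicitly (and can be routed to the DLQ) instead of silently mutating the mapping.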
32. Version: ES 6.x
Volume: 1-1.5 TB/day
VMs: 42 x n1-highmem-8, CPU: 8, Memory: 52 GB, SSD persistent disk: 1TB
Configuration:
24x data, 6x client, 3x ingest, 3x master, 8x logstash nodes
Total data storage: 24 TB
JVM Heap (ES data node): 28GB
Indexing: multiple indexes (one per log message type), rollover index (by number of documents, as sketched
below), Curator run by a k8s cronjob to rotate/delete indexes, optimise disabled.
Traffic distribution: 100% GCP
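A minimal sketch of document-count-based rollover (ES 6.x API; the alias, index name and threshold are hypothetical):

  # bootstrap the first index behind a write alias
  PUT /v3-000001
  { "aliases": { "v3-write": {} } }

  # called periodically (e.g. by the k8s cronjob): roll over once the index grows too big
  POST /v3-write/_rollover
  { "conditions": { "max_docs": 1000000000 } }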
33. Issues:
● GCP-specific issues with node availability and health checks in large clusters.
Solution: keep ES in an isolated GKE cluster and (if possible) an isolated project.
Previous experience in running ES in Docker helps in proving your point to Google Support.
● GCP preemptible nodes are cheap (80% cheaper than regular instances) but not reliable enough to run a large
production ES cluster.
● Hard to perform a graceful rolling update on a live cluster. The k8s container readiness check based on cluster
state (yellow, green) is limited.
To update manually - updateStrategy: OnDelete (see the sketch below).
● Missing canary deployment; staging cannot provide load similar to production.
● https://github.com/pires/docker-elasticsearch-kubernetes is not maintained.
● Components of the Elastic Stack have no native Prometheus support - third-party plugins only.
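A fragment (not a complete manifest) showing the manual update strategy on the ES StatefulSet; the resource name is hypothetical:

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: es-data              # hypothetical
  spec:
    updateStrategy:
      type: OnDelete           # pods restart only when deleted by an operator,
                               # so ES nodes can be drained and replaced one at a time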
34. Logging infrastructure (diagram):
● Sources: apps in Kubernetes; Enossis (Windows VMs); MOAP (Windows VMs)
● Delivery pipelines: asynchronous delivery, queueing, validation, transformation
● ElasticSearch: API for storage, indexing, searching
● Kibana: real-time discovery, search, visualisation
● Grafana: visualisation
● ElastAlert: alerting
All running on Google Cloud Platform.
36. Production ES Index numbers:
indexes: 569
active indexes: 14
shards: 11,178
documents: 8,526,909,237
37. Production ES Index settings:
index.number_of_replicas: 1
index.number_of_shards: 10
index.routing.allocation.total_shards_per_node: 1
index.refresh_interval: 60s
index.translog.durability: async
(with async the translog is committed every 5 seconds; with request, the default, at the end of every index, delete, update, or bulk request)
index.unassigned.node_left.delayed_timeout: 30m
(“the allocation of replica shards which become unassigned because a node has left can be delayed with the
index.unassigned.node_left.delayed_timeout dynamic setting, which defaults to 1m”)
index.search.slowlog.level: debug
index.search.slowlog.threshold.fetch.debug: 5s
index.search.slowlog.threshold.query.debug: 5s
38. Production ES - system resources:
QOS Burstable: CPU request: 6, CPU limit: 8.
Pay attention to CPU throttling! container_cpu_cfs_throttled_seconds_total{job="kubernetes-cadvisor"}
Consider QOS Guaranteed, or do not set a CPU limit at all to disable the kernel Completely Fair Scheduler (CFS)
quota (see the fragment below):
https://github.com/kubernetes/kubernetes/issues/67577
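A minimal sketch (hypothetical pod spec fragment) of the Burstable shape above; dropping the limits block, or setting requests equal to limits for Guaranteed QoS, avoids CFS throttling:

  resources:
    requests:
      cpu: "6"       # CPU request from the slide
    limits:
      cpu: "8"       # with a limit set, CFS quota throttling can kick in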
39. Production ES - system resources:
QOS Burstable: CPU request: 0.5, CPU limit: 1. Pay attention to CPU throttling!
Good presentation from Zalando:
https://www.slideshare.net/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload
40. Production Logstash:
replicas: 8 (CPU request: 5, CPU limit: 8, JVM Heap: 2GB)
queue.type: persisted
path.queue: /logstash-queue
queue.max_bytes: 61440mb (approximately 1 hour of traffic)
pipeline.workers: 2
input.beats.executor_threads: 4
Logstash persistent queue vs a Message Broker/Queueing system:
Pros: simple to understand, easy to scale up/down, provides good visibility (Prometheus exporter on top of the
Monitoring API).
Cons: no way to pause delivery or manage the backlog.
41. Retention period: 6 days (some indexes have retention of around 1 month)
Dataset size: 15TB (data + replica)
Total disk size: 24TB
Cluster size: 42 VMs (ELK cluster + pipelines + telemetry)
42. Cost of current solution
Monthly cost of the complete solution: ELK cluster + pipelines + storage + monitoring
(cost of human resources not included):
EUR 12,000 / month
With a 1-year Google “committed use” discount:
EUR 9,500 / month
43. Alternatives to a Travix-controlled logging cluster
Logging as a service:
- Google Stackdriver Logging: EUR 12,800/month ($0.50/GiB and 30 days of logs)
- Elastic Cloud: EUR 9,000/month (6-node cluster) without pipelines and orchestration
Bottom line: without a radical change in the way we use logging we cannot migrate to an
external “logging as a service” provider and cut costs.
45. Standardise error logging. SOS!!!
We see error events as one of the main indicators for monitoring the health of the Travix
infrastructure.
As a general rule we should aim to keep the number of error events close to zero.
Use the same log message format across ALL components.
A common format will allow us to automate error handling - detection of issues,
classification of known/unknown incidents, creation of graphs/dashboards.
46. It’s all about conventions
Travix logging formats:
HALS - JSON with a loosely defined structure (deprecated)
V2 - loosely defined, introduced by one team to address its own needs, later adopted
across the Travix stack (to be deprecated)
V3 - the new recommended unified format
Elastic Common Schema - an open source specification that defines a common set of
document fields for data ingested into Elasticsearch. ECS is designed to support uniform data
modeling, enabling you to centrally analyze data from diverse sources with both interactive and
automated techniques.
https://github.com/elastic/ecs
47. V3 logging format
Treat the log message as an application API - defined schema, versioned.
messagetype - an enum-style value showing what kind of event was logged; it can be
used for filtering for certain events. It uses the Java enum casing style:
"GET_ORDER_BY_NUMBER", "WORKER_RUN_STARTED", etc.
messagetypeversion - the version of the message type. This is present to handle
possible breaking changes in the contents of the payload.
payload - an object with custom application-specific fields. It is up to the application to
decide what to store here.
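A sketch of what a V3 event could look like; only messagetype, messagetypeversion and payload are defined above (plus logformat from the filebeat filter), the remaining envelope fields and payload contents are hypothetical:

  {
    "logformat": "v3",
    "@timestamp": "2019-05-01T12:00:00Z",
    "messagetype": "GET_ORDER_BY_NUMBER",
    "messagetypeversion": 1,
    "payload": {
      "ordernumber": "ABC123",
      "durationms": 42
    }
  }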
49. Tricks to reduce unnecessary load:
Time picker defaults rounded to the last minute:
timepicker:timeDefaults: { "from": "now-15m/m", "to": "now-1m/m", "mode": "quick" }
Search rounded dates:
“Queries on date fields that use now are typically not cacheable since the range that is being matched changes all the time. However,
switching to a rounded date is often acceptable in terms of user experience, and has the benefit of making better use of the query cache.”
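A minimal sketch of such a rounded range query (the index pattern is hypothetical): now/m only changes once a minute, so the filter can be served from the query cache:

  GET /v3-*/_search
  {
    "query": {
      "range": {
        "@timestamp": { "gte": "now-1h/m", "lte": "now/m" }
      }
    }
  }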
50. Tricks to reduce unnecessary load:
Use a default dummy index with a single document pointing users to the “Best practices” page.
This helps us avoid all users hitting a large index when they land on the Discover page.
51. Stick to naming convention for Kibana searches/visualisations:
${team_name}.${product}.${description}
${user_name}.${description}
52. Tools we use:
Elastalert - a simple framework for alerting on anomalies, spikes, or other patterns of
interest from data in Elasticsearch. https://github.com/Yelp/elastalert
Several rule types with common monitoring paradigms are included with ElastAlert:
● Match where there are at least X events in Y time (frequency type)
● Match when the rate of events increases or decreases (spike type)
● Match when there are less than X events in Y time (flatline type)
● Match when a certain field matches a blacklist/whitelist (blacklist and whitelist type)
● Match on any event matching a given filter (any type)
● Match when a field has two different values within some time (change type)
● Match when a never before seen term appears in a field (new_term type)
● Match when the number of unique values for a field is above or below a threshold (cardinality type)
53. InboundTrafficFlugladenAT.yaml:
name: Inbound traffic increased by 50% compared to the last 15 minutes for Flugladen.at
index: v3
type: spike
timeframe:
  minutes: 15
spike_height: 50
spike_type: up
threshold_cur: 500
filter:
- query_string:
    query: 'app: iis_adv AND host:api.flugladen.at'
alert:
- "opsgenie"
generate_kibana_link: True
kibana_url: 'https://kibana.mysite.org'
include:
- "kibana_link"
54. Tools we use:
Cerebro - open source elasticsearch web admin tool
Quick way to discover/filter indexes, check shard allocation, active mappings:
56. Next steps:
- Migrate to the official Helm charts supported by Elastic, canary deployment:
https://github.com/elastic/helm-charts/tree/master/elasticsearch
The Helm charts are tested on different Kubernetes versions with “Goss” - a YAML-based serverspec alternative for validating a
server’s configuration.
https://github.com/aelsabbahy/goss
- Static mappings and custom retention.
Provide automation and templating around the Curator config.
Use the ILM (index lifecycle management) API introduced in 6.6, as sketched below.
- Tune ES clusters to optimise resource usage and increase retention without
increasing cost - reduce the number of data nodes, increase the size of data volumes.
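A minimal sketch of an ILM policy replacing the Curator cronjobs (6.6+ API; the policy name and rollover threshold are hypothetical, the 6-day retention taken from the earlier slide):

  PUT _ilm/policy/logs-v3
  {
    "policy": {
      "phases": {
        "hot": {
          "actions": { "rollover": { "max_docs": 1000000000 } }
        },
        "delete": {
          "min_age": "6d",
          "actions": { "delete": {} }
        }
      }
    }
  }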
57. Next steps:
- With a unified error format, automate issue detection, alerting and dashboarding.
Elastic Stack ML POC.
- Treat local users/applications as potential DDoSers - provide a quota per service
and an automated way to remove abusing applications from the logging pipeline.