Migrating the Elastic Stack to the Cloud
Application Logging @ Travix
March 13, 2019
Content:
1. Whois Travix
2. Elastic Stack for logging - Travix use cases
3. Brief history - DC to GCP
4. Issues and Solutions
5. Detailed overview of current setup
6. Cost
7. Formats and conventions
8. Tricks and tools
9. Next Steps
10. Questions
Travix in numbers:
Bookings/year: 4 million
Countries: 39
50+ planes filled every day
Searches/second: 300-500
Services in production: 300-400
Development and System Engineering:
Product teams: 12-14
System Engineers (SWAT): 10
Development Locations: Amsterdam, London, Minsk
Languages: .Net, Java, Golang, NodeJS, Python
ES log delivery load pattern: 48h
Production environment in numbers:
Ingestion rate: 1.5 TB/day or 1000 MB/minute (average)
“Netflix microservices architecture is a
metrics generator that occasionally streams
movies”
https://medium.com/netflix-techblog/stream-processing-with-mantis-78af913f51a6
“Travix microservices architecture is a
logs generator that occasionally sells
tickets”
“SLA” Travix logging:
- Does not guarantee 100% delivery (BEST EFFORT).
In some cases it is better to drop 4 hours of queued log events to recover the system
and let recent logs reach the cluster.
- Elasticsearch may drop an entire log message when the type of a field in the message
clashes with the type defined in the index.
(Example: “response” was a String and became an Object)
- Normally an event is delivered and stored only once (no duplication).
We have had a number of incidents where bugs in the pipelines caused the same log message
to be stored multiple times.
Logging use cases:
a) Access to detailed context of real time events (<2h) - incident
handling/distributed debugging
b) Platform issues GCP/GKE (<24h) - investigation
c) Regression/bugs introduced by new releases (<24h) - investigation
d) Analysis of historical data ( > 48h) - statistics/trends comparison
Use cases:
Main role of the logging system: real-time incident handling and distributed
debugging - seeing the detailed context of events as they happen across systems.
It may be an opinionated view, but when the Logging Cluster is down Travix IT is 99.9% blind -
no deployments, no application-level incident handling
Examples of usage:
- AIR team (Bangalore): investigates failed bookings from the website, relying mainly
on the XML payloads used for escalation to the GDS
- RulesAPI Development (Amsterdam): monitoring of rules execution - requests,
timings, response codes, triggered versions of InRule
- Pricing (San Jose): daily monitoring using landing/search and book/landing ratios
- System Engineers: monitoring, incident handling, investigation
Alerting (prometheus alertmanager + slack)
Pricing
System Engineers
Brief history
Logging 3.0 February 2017
Version: ES 3.x
Hardware: 7 x (CPU: 16, Memory: 64GB, System disks: 2x120GB, Data disks: 8x960GB)
Volume: 400-800 GB/day
Configuration:
path.data: /mnt/sdc, /mnt/sdd, /mnt/sde, /mnt/sdf, /mnt/sdg, /mnt/sdh, /mnt/sdi, /mnt/sdj
JVM Heap (ES data node ): 30GB
Indexing: Single daily index, curator optimise scheduled by linux cronjob at 1:00 AM
Traffic distribution: 95% DC, 5% GCP
Pipelines:
(DC) APP Travix Logger -> (DC) RMQ -> (DC) LS -> (DC) ES
(GCP) Fluentd -> (GCP) RMQ -> (DC) RMQ -> (DC) LS -> (DC) ES
Issues:
● Sudden increases in logging volume slow down ingestion and search.
● No easy way to see what type of data is DDOSing the cluster.
A Kibana dashboard reports the number and size of stored documents grouped by application name, but such a
heavy search adds even more load during a meltdown.
● RabbitMQ configured in HA cluster mode is huge and unstable - memory pressure, shovels, slow to recover.
● Curator optimise (index merge) on the daily index takes around 12-18 hours, affecting cluster functionality.
● People treat Kibana like Google search - full-text search on most fields and every possible query with “*”.
● Disk recovery causes large disruption - many disks are used by a single ES process, a malfunctioning disk affects
the performance of the entire process, and to replace a disk the ES process has to be restarted, affecting 1/7 of
cluster capacity.
● ES index datatype collisions - dynamic mappings and different applications storing data in the same index
may cause events to be silently dropped by ES; a field changing type on index rotation causes even more
confusion.
● Size of log messages is up to 10MB (SOAP, XML binary blobs). ES can handle this, but it feels very wrong.
Logging 3.0 February 2017
Logging 5.0 May - August 2017
Version: ES 5.x
Volume: 800-1000 GB/day
Hardware: 4 x (CPU: 48, RAM: 264GB, System disks: 2 x 120GB, Data disks: 4 x 2TB)
Configuration:
4 x docker containers (ES data nodes) per physical server
JVM Heap (ES data node ): 28GB
path.data: each data node has its own dedicated 2TB SSD
Total data storage: 32 TB
Replica anti-affinity:
cluster.routing.allocation.awareness.attributes: rack_id
node.attr.rack_id: "${rack_id}"
Indexing: hourly index, curator optimise scheduled by rundeck
Traffic distribution: 60% DC, 40% GCP
Issues:
● Change management: people had a hard time saying goodbye to Kibana 3; a mess with stored
searches/dashboards (hard to find owners).
● “Snowball effect” during major production issues - a 2x-3x increase in write/search volume. People run
random heavy queries selecting weeks of data.
● Impossible to see which application/system is DDOSing the cluster. The slow query log in ES is hard to read - no
easy way to see who executed a certain search.
● When DC networking is slow/saturated, RMQ is unable to keep up with delivery, causing a large backlog that is
slow to process and stresses the ES cluster even more when the link is restored.
● Fluentd under load also needs some attention (memory/buffer tuning) - an OOM on Fluentd results in
duplication of log messages and RMQ memory issues; bugs in plugins may result in the content of log files not
being read/shipped.
● People mess up timezone settings and log events “in the future”.
● The ES data node JVM periodically crashes after a week of running, with a kernel core dump in Linux/Ubuntu:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1644056
● Max log entry size limit: Stackdriver Logging - 100KB, Docker introduces 16KB (thanks Docker!)
Logging 5.0 May - August 2017
Issue: people mess up timezone settings and log events “in the future”
Solution: a Logstash custom Ruby filter that compares timestamps and logs a warning message about
timestamp issues
filter {
  ruby {
    # threshold can be configured with "threshold_delta_minutes"
    code => '
      threshold_delta_minutes = 30;
      event_message = event.to_hash;
      time_now = Time.now.to_f;
      time_event = event.get("@timestamp").to_f;
      log_message = "Time control - log event in future >30 minutes [ logstash_time #{Time.at(time_now)} < event_time #{Time.at(time_event)} ] event: #{event_message}";
      @logger.warn(log_message) if (time_event - time_now) > (60 * threshold_delta_minutes);
      event.cancel if (time_now - time_event) > (60 * 20160);
      @logger.warn("Dropped event older than 2 weeks") if (time_now - time_event) > (60 * 20160)
    '
  }
}
Logging 5.0 Issues
Issue: max log entry size limit of 16KB
Solution: store large payloads (up to 10MB) directly in BigQuery (BQ) and log only the URL of the payload.
A custom application logger writes payloads to a local log file and writes a separate lightweight log message to
stdout; both share the same UUID key to link the events. Fluentd stores the payload message in BQ, and a
URL-formatted field in Kibana lets users access the payload:
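A minimal sketch of this pattern in Python (the field names, payload file path and payload-viewer URL are
illustrative assumptions, not the actual Travix logger):

import json
import uuid
from datetime import datetime, timezone

PAYLOAD_LOG = "/var/log/app/payloads.log"  # hypothetical path; picked up by Fluentd and shipped to BQ

def log_with_payload(message, payload):
    correlation_id = str(uuid.uuid4())
    timestamp = datetime.now(timezone.utc).isoformat()
    # 1. The large payload goes only to the local file (never to stdout / Elasticsearch).
    with open(PAYLOAD_LOG, "a") as f:
        f.write(json.dumps({
            "timestamp": timestamp,
            "correlation_id": correlation_id,
            "payload": payload,  # may be several MB of SOAP/XML
        }) + "\n")
    # 2. A lightweight event goes to stdout; Kibana renders payload_url as a clickable link.
    print(json.dumps({
        "timestamp": timestamp,
        "logformat": "v3",
        "message": message,
        "correlation_id": correlation_id,
        "payload_url": "https://bq-viewer.example/payload/" + correlation_id,  # hypothetical viewer
    }))

log_with_payload("GDS escalation sent", "<soap:Envelope>...</soap:Envelope>")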
Logging 5.0 Issues
Issue: misbehaving applications and test deployments may produce a huge volume of logs, affecting the delivery of
application logs without adding real value
Solution: be pragmatic and filter out all unknown (non-“travix”-formatted) events at the Filebeat level;
leave a “backdoor” - containers with the label “elasticsearch_debug”:
- drop_event.when.not:
or:
- equals:
type: "v2"
- equals:
logformat: "v3"
- equals:
kubernetes.namespace: "kube-system"
- equals:
kubernetes.labels.elasticsearch_debug: "true"
Logging 5.0 Issues
Issue: impossible to see which application/system is DDOSing the cluster.
Solution: measure and display the logging rate/size per application at the earliest possible stage (before the pipelines or ES).
Fluentd plugins: “flowcounter” + “prometheus”.
Collecting the logging rate in Prometheus allows us to set alerts and automate the detection of misbehaving
applications.
Unfortunately we could not find an easy way to achieve something similar in Filebeat.
For simple DDOS prevention Logstash has the Throttle filter plugin (which we use - see the sketch below), but it
does not have an easy way to integrate with Prometheus.
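For reference, a minimal sketch of such a Throttle filter (the key field, limits and tag are illustrative
assumptions, not our production values):

filter {
  # Tag everything above ~10k events per container per minute, then drop it.
  throttle {
    key         => "%{[kubernetes][container_name]}"
    after_count => 10000
    period      => 60
    max_age     => 120
    add_tag     => "throttled"
  }
  if "throttled" in [tags] {
    drop { }
  }
}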
Logging 5.0 Issues
Fluentd record has a tag:
"kubernetes.var.log.containers.cfe-14080-vay-us-31935-gksb3_frontend-acceptance_cfe-14080-vay-us-4bnj.log"
<store>
type flowcounter
count_keys *
unit minute
aggregate tag
</store>
<filter flowcount>
@type record_transformer
enable_ruby true
<record>
# Parse tag
container_name ${record['tag'].sub("kubernetes.var.log.containers.","").sub(".log","").split("_")[2].rpartition("-")[0]}
namespace ${record['tag'].sub("kubernetes.var.log.containers.","").sub(".log","").split('_')[1]}
</record>
</filter>
Logging 5.0 Issues
<filter flowcount>
@type prometheus
<labels>
container_name ${container_name}
namespace ${namespace}
</labels>
<metric>
name fluentd_flowcounter_logging_lines_per_minute
type gauge
desc logged lines per minute
key count
</metric>
</filter>
<match flowcount>
type null
</match>
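With the gauge above scraped by Prometheus, a simple alerting rule can flag misbehaving applications. A hedged
sketch (the threshold and rule names are assumptions):

groups:
- name: logging
  rules:
  - alert: ApplicationFloodingLogs
    # Fires when a single container keeps logging more than 50k lines per minute.
    expr: fluentd_flowcounter_logging_lines_per_minute > 50000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: '{{ $labels.container_name }} in {{ $labels.namespace }} logs over 50k lines/minute'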
Logging 5.0 Issues
Issue: log events “silently” dropped by ES as a result of ES index datatype collisions
Solution: move away from dynamic mappings to static mappings and a rigid logging format;
for the unlucky ones - a DLQ with alerting (see the template sketch below)
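A minimal sketch of such a static mapping as an ES 6.x index template (the template name, index pattern, type
name and fields are assumptions); with "dynamic": "strict" a document introducing unknown fields is rejected and,
when shipped via Logstash with the dead letter queue enabled, ends up in the DLQ where it can be alerted on:

# hypothetical template name, index pattern and field list
PUT _template/logs-v3
{
  "index_patterns": ["v3-*"],
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "properties": {
        "@timestamp":         { "type": "date" },
        "messagetype":        { "type": "keyword" },
        "messagetypeversion": { "type": "keyword" },
        "message":            { "type": "text" },
        "payload":            { "type": "object", "dynamic": true }
      }
    }
  }
}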
Logging 5.0 Issues
Version: ES 6.x
Volume: 1-1.5 TB/day
VM’s: 42 x n1-highmem-8 (CPU: 8, Memory: 52 GB, SSD persistent disk: 1TB)
Configuration:
24x data, 6x client, 3x ingest, 3x master, 8x logstash nodes
Total data storage: 24 TB
JVM Heap (ES data node ): 28GB
Indexing: multiple indexes (one per log message type), rollover indexes (by number of documents, see the sketch
below), Curator run by a k8s cronjob to rotate/delete indexes, optimise disabled.
Traffic distribution: 100% GCP
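A hedged sketch of the rollover call behind “rollover index (number of documents)” - the write alias and the
document threshold are assumptions:

# "v3-write" is a hypothetical write alias; rolls over to a new index after 100M documents
POST /v3-write/_rollover
{
  "conditions": {
    "max_docs": 100000000
  }
}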
Logging GCP February 2018 - 2019
Issues:
● GCP-specific issues with node availability and health checks in large clusters.
Solution: keep ES in an isolated GKE cluster and (if possible) an isolated project.
Previous experience in running ES in Docker helps in proving your point to Google Support.
● GCP preemptible nodes are cheap (80% cheaper than regular instances) but not reliable enough to run a large
production ES cluster.
● Hard to perform a graceful rolling update on a live cluster. A k8s container readiness check based on cluster
state (yellow, green) is limited.
To update manually - updateStrategy: OnDelete
● Missing canary deployment; staging cannot provide load similar to production.
● https://github.com/pires/docker-elasticsearch-kubernetes is not maintained.
● Components of the Elastic Stack have no native Prometheus support - third-party plugins are needed.
Logging GCP February 2018 - 2019
Logging Infrastructure (Google Cloud Platform):
Log sources: Apps in Kubernetes, Enossis (Windows VM’s), MOAP (Windows VM’s)
Delivery pipelines: asynchronous delivery, queueing, validation, transformation
ElasticSearch: API for storage, indexing, searching
Kibana: real-time discovery, search, visualisation
Grafana: visualisation
ElastAlert: alerting
Log delivery pipelines:
Local layout (logging cluster directly accessible), project “travix-logging”:
filebeat -> logstash processor
Distributed layout (logging cluster not directly accessible):
filebeat -> logstash forwarder (project “travix-myapp”) -> Google PubSub -> logstash processor (project “travix-logging”)
Production ES Index numbers:
indexes: 569
active indexes: 14
shards: 11,178
documents: 8,526,909,237
Production ES Index settings:
index.number_of_replicas: 1
index.number_of_shards: 10
index.routing.allocation.total_shards_per_node: 1
index.refresh_interval: 60s
index.translog.durability: async
“commits the translog every 5 seconds if set to async or if set to request (default) at the end of every index, delete, update, or bulk request”
index.unassigned.node_left.delayed_timeout: 30m
"the allocation of replica shards which become unassigned because a node has left can be delayed with the
index.unassigned.node_left.delayed_timeout dynamic setting, which defaults to 1m”
index.search.slowlog.level: debug
index.search.slowlog.threshold.fetch.debug: 5s
index.search.slowlog.threshold.query.debug: 5s
Production ES - system resources:
QOS Burstable: CPU request: 6, CPU limit: 8.
Pay attention to CPU throttling! container_cpu_cfs_throttled_seconds_total{job="kubernetes-cadvisor"}
Consider QoS Guaranteed, or do not set a CPU limit at all, to avoid throttling by the kernel's Completely Fair Scheduler (CFS) quota - see the sketch below.
https://github.com/kubernetes/kubernetes/issues/67577
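A minimal sketch of the Guaranteed variant (assumed values): with requests equal to limits the pod gets the
Guaranteed QoS class; alternatively, leave the CPU limit unset so CFS quota throttling does not apply.

resources:
  requests:
    cpu: "6"        # requests == limits -> QoS class Guaranteed
    memory: 40Gi
  limits:
    cpu: "6"
    memory: 40Gi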
Production ES - system resources:
QOS Burstable: CPU request: 0.5, CPU limit: 1. Pay attention to CPU throttling!
Good presentation from Zalando:
https://www.slideshare.net/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload
Production Logstash:
replicas: 8 (CPU req: 5, CPU limit: 8, JVM Heap: 2GB)
queue.type: persisted
path.queue: /logstash-queue
queue.max_bytes: 61440mb (60 GiB, roughly 1 hour of ingest at ~1000 MB/minute)
pipeline.workers: 2
input.beats.executor_threads: 4
Logstash vs Message Broker/Queueing system:
Pros: simple to understand, easy to scale up/down, provides good visibility (prometheus exporter on top of
Monitoring API)
Cons: no way to pause delivery or manage backlog
Retention period: 6 days (some indexes have retention around 1 month)
Dataset size: 15TB (data+replica)
Total disk size: 24TB
Cluster size: 42 VM’s (ELK cluster + pipelines + telemetry)
Cost of current solution
Cost of current solution
Monthly cost of complete solution: ELK cluster + pipelines + storage + monitoring
(cost of human resource not included)
EUR 12,000 / month
with a 1-year “Committed use” discount from Google™:
EUR 9,500 / month
Alternatives to Travix controlled logging cluster
Logging as a service:
- Google Stackdriver Logging: EUR 12,800/month ($0.50/GiB and 30 days of logs)
- Elastic Cloud: EUR 9,000/month (6 node cluster) without pipelines and orchestration
Summary: without a radical change in the way we use logging, we cannot migrate to an
external “logging as a service” provider and cut costs
Error logs
Standardise error logging. SOS!!!
We see error events as one of the main indicators for monitoring the health of the Travix
infrastructure.
As a general rule we should aim to keep the number of error events close to zero.
Use the same log message format across ALL components.
A common format will allow us to automate error handling - detection of issues,
classification of known/unknown incidents, and creation of graphs/dashboards.
It’s all about conventions
Travix logging formats:
HALS - JSON with a loosely defined structure (deprecated)
V2 - loosely defined, introduced by one team to address its own needs, later adopted
across the Travix stack (to be deprecated)
V3 - new recommended unified format
Elastic Common Schema - open source specification that defines a common set of
document fields for data ingested into Elasticsearch. ECS is designed to support uniform data
modeling, enabling you to centrally analyze data from diverse sources with both interactive and
automated techniques.
https://github.com/elastic/ecs
V3 logging format
Treat log messages as an application API - a defined, versioned schema
messagetype - An enum-style value showing what kind of event was logged; it can be
used for filtering for certain events. It uses the Java enum casing style:
"GET_ORDER_BY_NUMBER", "WORKER_RUN_STARTED", etc.
messagetypeversion - The version of the message type. This is present to handle
possible breaking changes in the contents of the payload.
payload - Object with custom application-specific fields. It's up to the application to
decide what to store here.
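A hedged example of a V3 log message (the payload fields and values are illustrative assumptions):

{
  "logformat": "v3",
  "messagetype": "GET_ORDER_BY_NUMBER",
  "messagetypeversion": "1",
  "payload": {
    "ordernumber": "ABC123456",
    "durationms": 42
  }
}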
Tricks to reduce unnecessary load:
Time picker defaults to round to the last minute.
timepicker:timeDefaults: { "from": "now-15m/m", "to": "now-1m/m", "mode": "quick" }
Search rounded dates
Queries on date fields that use now are typically not cacheable since the range that is being matched changes all the time. However
switching to a rounded date is often acceptable in terms of user experience, and has the benefit of making better use of the query cache.
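A minimal sketch of such a rounded range query (the index pattern is an assumption); rounding both ends to the
minute keeps the query identical for up to a minute, so ES can serve it from the query cache:

# hypothetical index pattern "v3-*"
GET v3-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-15m/m",
        "lt": "now-1m/m"
      }
    }
  }
}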
Tricks to reduce unnecessary load:
Use a default dummy index with a single document pointing users to the “Best practices” page.
This helps us avoid all users hitting a large index when they land on the Discover page.
Stick to naming convention for Kibana searches/visualisations:
${team_name}.${product}.${description}
${user_name}.${description}
Tools we use:
Elastalert - simple framework for alerting on anomalies, spikes, or other patterns of
interest from data in Elasticsearch. https://github.com/Yelp/elastalert
Several rule types with common monitoring paradigms are included with ElastAlert:
● "Match where there are at least X events in Y time" (frequency type)
● "Match when the rate of events increases or decreases" (spike type)
● "Match when there are less than X events in Y time" (flatline type)
● "Match when a certain field matches a blacklist/whitelist" (blacklist and whitelist type)
● "Match on any event matching a given filter" (any type)
● "Match when a field has two different values within some time" (change type)
● "Match when a never before seen term appears in a field" (new_term type)
● "Match when the number of unique values for a field is above or below a threshold" (cardinality type)
InboundTrafficFlugladenAT.yaml:
name: Inbound traffic Increased by 50% compared to the last 15 minutes for Flugladen.at
index: v3
type: spike
timeframe:
minutes: 15
spike_height: 50
spike_type: up
threshold_cur: 500
filter:
- query_string:
query: 'app: iis_adv AND host:api.flugladen.at'
alert:
- "opsgenie"
generate_kibana_link: True
kibana_url: 'https://kibana.mysite.org'
include:
- "kibana_link"
Tools we use:
Cerebro - open source elasticsearch web admin tool
Quick way to discover/filter indexes, check shard allocation, active mappings:
Tools we use:
Quick way to check/modify index aliases:
Next steps:
- Migrate to official helm charts supported by Elastic, canary deployment:
https://github.com/elastic/helm-charts/tree/master/elasticsearch
Helm charts are tested on different Kubernetes versions with “Goss” - a YAML-based serverspec alternative for
validating a server’s configuration.
https://github.com/aelsabbahy/goss
- Static mappings and custom retention.
Provide automation and templating around the Curator config.
Use the ILM (index lifecycle management) API introduced in 6.6 (a hedged policy sketch follows this list).
- Tune ES clusters to optimise resource usage and increase retention without
increasing cost - reduce the number of data nodes, increase the size of data volumes.
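A hedged sketch of an ILM policy along these lines (the policy name, rollover conditions and retention are
assumptions, not a final config):

# hypothetical policy: roll over hot indexes daily or at 50GB, delete 6 days after rollover
PUT _ilm/policy/logs-v3
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "6d",
        "actions": { "delete": {} }
      }
    }
  }
}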
Next steps:
- With the unified error format, automate issue detection, alerting and dashboarding.
Elastic Stack ML POC.
- Treat local users/applications as potential DDOSers - provide a quota per service
and an automated way to remove abusive applications from the logging pipeline.