Monitoring at section.io
Operational visibility for both the platform and our users
• Runs on your local machine and pre-production
• Configuration and deployment via git
• Fast global cache management
• HTTPS and HTTP/2 by default
A modern CDN
• Integrates popular open-source systems
• API driven
• Near real-time log access
• Consistent operational interface
Open platform
• Delivery Proxies
• Varnish Cache
• ModSecurity
• Kibana
• Graphite
• Umpire
Containers
• Web access logs, syslog, performance data
• Docker Volumes
• Elastic Beats
• Log rotation
Gathering data
• 600 million web access logs per week
• 60,000 log entries processed per minute
• 7 days of logs are searchable
Log volume
Log flow
[Diagram: delivery networks, Logstash receivers, Redis, Logstash processors, Logstash senders, Redis, Ops Elasticsearch cluster, Apps Elasticsearch cluster, StatsD/Carbon]
Between about 5 seconds and 2 minutes
• Kibana
• Elasticsearch API
• Traces
Log visibility
• Metrics can optimise common log queries
• Metrics retention:
• 1 minute granularity for 1 month
• 1 hour granularity for 13 months
• Graphite, Tessera, and Grafana
• Heroku Umpire
Beyond logs
• CPU utilisation, memory usage, disk space
• Traffic: connections, requests, packets, bytes
• By partition, node, geo-region, and domain
• By HTTP response status code
• Log latency, queue depth, processing rate
• Message counts, errors, processing time
Platform monitoring
• Cache hit, miss, pass
• By content-type
• Response time (median, mean, upper 95%)
• WAF intercepts
• By rule
• By country
Website monitoring
• Every staff member does on-call
• Every alert is actionable
• Every incident feeds the product backlog
Internal processes
• Yelp Elastalert
• Custom log fields
• A `tail -f` UI
• Automated anomaly detection
Beyond today
Jason Stangroome
Twitter: @jstangroome
https://blog.stangroome.com
https://www.section.io/blog
Thank you

Monitoring at section.io - Operational Intelligence Meetup May 2016

Editor's notes

  1. Hi. I am Jason Stangroome. I work for section.io and I’d like to talk to you about how we handle operational visibility. I am definitely not here to claim that the way we do things is the best way. Some days I wonder if it’s even one of the good ways. But hopefully, through sharing, we can all improve. Now, to provide some context for this presentation, I’d like to give you some background on what section.io does: section.io is a new class of Content Delivery Network. Prior to section.io, the CDN market focused on solutions that operate in front of, or beside, your production website. Furthermore, the market dominators were very closed and slow to respond to both industry change and customer requests. In contrast, at section.io, we have three driving tenets: be open, be easy, and give users control. These tenets have manifested in our CDN today in several ways:
  2. Firstly, you can run our CDN in local development and other environments prior to production. This is important because even a basic CDN has an impact on which browser requests actually reach your origin web server and how those requests and responses are modified in transit. An advanced CDN configuration is going to be manipulating the traffic significantly if it is delivering the best value possible for your particular site. Being able to see how this affects your site’s behaviour before production is critical to reproducing and understanding reported issues and even catching new issues before a feature deployment becomes a production incident. See Menulog’s customer data leakage earlier this month for an example of what can happen when your CDN configuration is insufficiently tested. -- The configuration for every website on section.io is tracked in a per-site git version control repository, providing natural change tracking. We provide a friendly web GUI over most of it but you can also just clone the git repository locally and use your preferred tools. Deployments happen simply by pushing to the remote git branch that corresponds to your environment, and moments later the new configuration is active at all the delivery nodes. -- Similarly, a request to flush your entire site cache, or just specific URLs, or some combination in between happens in seconds. -- We provide a strong Qualys Grade A HTTPS configuration by default at no extra charge if you bring your own certificate. Our team are also currently working on Let’s Encrypt integration so you’ll even get a free HTTPS certificate as soon as you point your DNS records at our edge nodes. Once HTTPS is activated, you’ll find HTTP/2 is also enabled out-of-the-box, even if your origin doesn’t support it.
  3. section.io consists of quite a few popular open-source systems that we have integrated together and we believe this gives at least two important benefits to our users: If you have existing skills with these products, these skills are immediately useful when working with the section.io platform. If you don’t have these skills, there is already a wealth of existing content, and an established community for these products, on top of what our section.io documentation and support team already provide. If you later find section.io is not the right fit for your website, you haven’t coupled to some section.io-specific implementation. All our architectural decisions err away from building a section.io custom build that could result in vendor lock-in. -- Everything in our web management portal is built API-first. If you can perform an operation through our web UI, you can also do it via our REST API. This makes it easy to automate tasks and include section.io deployments in your internal deployment pipelines. -- We focus on providing access to logs as close as possible to the time the event occurred. You shouldn’t have to raise a support ticket to request access to detailed data on the traffic for your website. -- Lastly, we don’t manage our customers’ origin web servers – they have their own operations staff. But to keep the origin web servers running smoothly, and to understand anomalies, those staff need access to much of the same data that we need at section.io to ensure the CDN platform itself is operating as expected. As such, we’ve built our system so that the data and techniques that we use to run section.io are also available to our users. The primary difference is permissions – we don’t give our users visibility into the data of websites they don’t own.
  4. section.io in its current form began in the second half of 2014, just after Docker reached version 1.0. Prior to that we only operated a fully-managed CDN service and wanted to find a way to put the control back in the hands of our users. Docker’s approach to containerisation proved to be the catalyst and, after watching it slowly mature, we seized the opportunity to re-architect. Containers enable us to more easily build a multi-tenant system, giving each user their own isolated environment for handling their website’s traffic, the CDN configuration, and the operational data. As just a few examples: we’re using Varnish Cache as our CDN’s caching solution. If you don’t know Varnish, it is the caching solution used by Wikipedia, The New York Times, Pinterest, some competing CDNs, and *many* others. On section.io each site gets its own Varnish instance in its own Docker container with dedicated configuration. Similarly we provide ModSecurity as our Web Application Firewall offering and we have a content-rewriting proxy in development right now. We use Kibana for querying logs, Graphite for metrics, and Umpire for alerting, and these are all containerised per website too. There’s very little left in our platform that isn’t in a container and that list diminishes with each iteration.
  5. The first step to monitoring is gathering the data, and containers brought some challenges. Most of the data is web access logs and syslog. We also run various processes and jobs to capture additional data in a useful log format. We are then running multiple Docker containers per customer website. There’s a whole debate raging about just how much a single container should do. On one end of the spectrum you have the one-process-per-container crowd, and on the other end you have an init daemon, various system services, a handful of other processes and the kitchen sink. There are merits to each perspective but for us the sheer process count is a driver toward the minimalist end. We have over 300 containers actively running on some nodes and if each container is running its own log shipping process, that’s another 300 processes fighting for a slice of CPU time and another 300 connections to our log ingestion system, *per node*. Instead we leverage Docker Volumes to map the log directories of each container out to where a single per-node log shipping process can harvest them all. Today we’re using Elastic Filebeat for shipping log files. We like that it uses TLS, that it batches logs together and gets some good compression from repeated values, and that it requires acknowledgement from the receiver before proceeding. Filebeat is a fairly new product though and we’ve been hitting a number of edge cases. Luckily, the Elastic team has been responsive to our bug reports (once we started including repro scripts). We’re also interested in adopting Elastic’s Topbeat and Packetbeat solutions where we have previously used collectd. Log rotation is also a little more involved in this world. Again, we don’t want 300 cron daemons running to handle each container but at the same time we do need to signal all the container processes that have open file descriptors to close and re-open the log files after rotation. For now we’ve integrated `docker exec` calls into shared logrotate configurations. Our container hosts are short-lived – days, sometimes weeks in the quieter partitions. This happens for two reasons: Platform deployments are implemented by provisioning new hosts, bringing them into service, and retiring the old hosts. We scale horizontally in response to increased load but whenever we scale in as the load declines we retire the oldest hosts – just one more nail in the coffin for configuration drift.
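The rotation approach in note 5 can be sketched as a small generator for a shared logrotate stanza. This is only an illustration, not section.io's actual configuration: the host path, the container name, the `daily`/`rotate 7` policy, and the choice of `USR1` as the reopen signal are all assumptions (the right signal depends on the process writing the logs).

```python
from textwrap import dedent


def logrotate_stanza(log_glob: str, container: str, signal: str = "USR1") -> str:
    """Return a logrotate stanza that rotates host-side log files and then
    asks the process inside the container to reopen them.

    Hypothetical layout: log_glob is the host path that a Docker volume maps
    the container's log directory to. Assumes PID 1 in the container is the
    logging process, that it reopens its logs on `signal`, and that the
    image ships a `kill` binary.
    """
    return dedent(f"""\
        {log_glob} {{
            daily
            rotate 7
            missingok
            notifempty
            sharedscripts
            postrotate
                docker exec {container} kill -{signal} 1
            endscript
        }}
        """)


# Hypothetical container name and host path, for illustration only.
print(logrotate_stanza("/var/log/containers/site-a-varnish/*.log", "site-a-varnish"))
```

One host-level logrotate run can then cover every container on the node, which avoids running a cron daemon inside each container.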
  6. Just how much data are we dealing with at the moment? Focused only on our self-managed customers and a portion of our fully-managed customers, and only on the web traffic logs, we’re handling about 600 million new logs a week. Many of our fully-managed customers are still logging through our previous generation system and are being migrated incrementally. Our non-web logs are not included in this number. This is only expected to grow with our user base. Our users can extract their logs via the ES API for their own archives before the 7 days pass. We are investigating other options for shipping logs directly to our users’ own systems.
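As a quick sanity check, the weekly figure and the per-minute figure on the "Log volume" slide agree with each other. The short calculation below derives the average rate from the weekly total; it says nothing about peak load, which will be higher than the average.

```python
WEEKLY_LOGS = 600_000_000          # from the "Log volume" slide
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080 minutes

per_minute = WEEKLY_LOGS / MINUTES_PER_WEEK
per_second = per_minute / 60

print(f"{per_minute:,.0f} logs/minute, {per_second:,.0f} logs/second")
```

The average works out to roughly 59,500 entries per minute, or just under 1,000 per second, which matches the slide's rounded 60,000 per minute.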
  7. From the moment a web request is handled in a delivery node and written to the local log file it is typically as little as 5 seconds until that log is searchable in the Kibana UI for ES. Under peak load the latency can reach 2 minutes for some log sources and this is our acceptable upper bound. We found that we needed to split the Logstash pipeline into separate processes for resiliency, especially due to the design of the Elasticsearch output plugins blocking the pipeline and signalling back pressure all the way to the delivery nodes. Redis helps to decouple the performance and availability of the components in the pipeline. We autoscale our Logstash machines based on log flow rate. Essentially it’s a combination of CPU usage and queue depth.
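The autoscaling rule mentioned in note 7 (a combination of CPU usage and queue depth) might look something like the sketch below. The thresholds, the pool limits, and the `PoolStats` type are all invented for illustration; the talk does not give the real values or the real mechanism.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    cpu_percent: float   # average CPU across the Logstash pool
    queue_depth: int     # pending entries in the Redis queue feeding the pool
    instances: int       # current pool size


def desired_instances(stats: PoolStats,
                      cpu_high: float = 75.0,
                      cpu_low: float = 30.0,
                      queue_high: int = 50_000,
                      min_instances: int = 2,
                      max_instances: int = 20) -> int:
    """Scale out when either signal is hot; scale in only when both are calm.

    All thresholds are hypothetical defaults, not section.io's settings.
    """
    target = stats.instances
    if stats.cpu_percent > cpu_high or stats.queue_depth > queue_high:
        target += 1
    elif stats.cpu_percent < cpu_low and stats.queue_depth == 0:
        target -= 1
    # Clamp to the allowed pool size.
    return max(min_instances, min(max_instances, target))
```

Using queue depth alongside CPU matters because a pipeline blocked on a slow output can have a growing backlog while its CPU usage stays low.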
  8. Kibana containers use nginx auth subrequests to ensure containers are running, so that we can limit the number of Kibana containers running concurrently to the number of users actively using the Kibana UI – a much smaller number than the total number of users on the platform. We run a single Elasticsearch cluster but we then use an nginx proxy with Lua to parse the requests and whitelist which indexes are permitted per user. On-demand messages allow varnishlog to be run in cache containers to grab a snapshot of recent requests in much greater detail than we can currently ship. This is very useful for diagnosing issues.
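The per-user index whitelist from note 8 is implemented in nginx with Lua; the Python below is only a sketch of the same kind of check, under the assumption that the index names appear as the first path segment of the Elasticsearch request. The index naming scheme (`site-a-*`) is hypothetical, and a real proxy would also need to inspect request bodies for multi-index APIs.

```python
from fnmatch import fnmatchcase


def indexes_permitted(path: str, allowed_patterns: list[str]) -> bool:
    """Return True only if every index named in an Elasticsearch request
    path matches one of the caller's allowed patterns."""
    # Drop any query string, then take the first path segment.
    first_segment = path.split("?", 1)[0].lstrip("/").split("/", 1)[0]
    # Reject cluster-level endpoints (/_search, /_cat, ...) and empty paths,
    # which would otherwise span every index.
    if not first_segment or first_segment.startswith("_"):
        return False
    for index in first_segment.split(","):
        # Refuse wildcards in the request itself: "*" could expand to
        # indexes owned by other users.
        if "*" in index:
            return False
        if not any(fnmatchcase(index, pattern) for pattern in allowed_patterns):
            return False
    return True


# Hypothetical index names, for illustration only.
print(indexes_permitted("/site-a-2016.05.01/_search", ["site-a-*"]))
print(indexes_permitted("/site-a-2016.05.01,site-b-2016.05.01/_search", ["site-a-*"]))
```

The check is deliberately conservative: anything it cannot positively match against the user's patterns is denied.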
  9. Everything in metrics could be queried from Elasticsearch, but metrics make many queries more efficient and we can get better retention. -- Umpire allows smart alerting by leveraging existing synthetics platforms instead of building our own.
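To see why metrics give better retention than raw logs, it helps to count the data points implied by the retention policy on the "Beyond logs" slide. The calculation below assumes 30 days for "1 month" and 395 days for "13 months"; those day counts are approximations, not figures from the talk.

```python
RETENTION = [
    # (seconds per point, days kept) -- day counts are assumed approximations
    (60, 30),       # 1 minute granularity for about 1 month
    (3600, 395),    # 1 hour granularity for about 13 months
]


def points_per_metric(retention):
    """Number of stored data points per metric for each retention tier."""
    return [days * 86_400 // step for step, days in retention]


counts = points_per_metric(RETENTION)
full_resolution = 395 * 86_400 // 60   # if 1-minute data were kept for 13 months

print(counts, sum(counts), full_resolution)
```

Under these assumptions each metric needs 43,200 plus 9,480 points, about a tenth of the 568,800 points that 13 months of 1-minute data would take. In Graphite's carbon configuration a policy like this is typically written as a `retentions` line such as `60s:30d,1h:395d`.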
  10. Boring stuff like CPU, etc
  11. All the same traffic data as mentioned on the last slide plus…
  12. Front-line with a buddy. Including the CEO. Both customer support and platform support. Gives a great range of perspectives. Alerts are all actionable and documented. If it can’t be actioned, the alert is removed. The documentation lists impacted systems and user experiences, possible causes and sometimes how the failure may cascade. When incidents occur the immediate focus is on rectifying the issue, ideally without destroying any diagnostic data. Then post-mortems are performed to document the series of conditions that allowed the situation to exist – careful not to target individual actions. The post-mortems then become a basis for identifying room for improvement in the platform and workflows that would circumvent similar incidents in the future, and those ideas go into our product backlog for consideration in the next iteration of development.
  13. Like Umpire but for Elasticsearch data -- But we don’t want string_1, integer_7, custom_field_9 -- Sometimes you just want to tail a log and get all Matrix-y -- It is common for a site to have high traffic during the business day and drop to low overnight, typically 3am Sydney time for an Australian website. Weekends have similar shapes. But the absolute numbers of these peaks and troughs change over time, by season, and as the business grows, so it’s difficult to establish a baseline from which to trigger alerts. We’d like to investigate options to be notified, quickly, when traffic is not following the trend.
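One simple way to approach the trend-following alert described in note 13 is to compare the current traffic with the same hour-of-week slot in recent weeks, so the daily and weekly shape is built into the baseline and the baseline drifts as the business grows. The sketch below is one possible starting point, not something the talk describes; the function name, the use of a median, and the 50% tolerance are all assumptions.

```python
from statistics import median


def off_trend(current: float, same_slot_history: list[float],
              tolerance: float = 0.5) -> bool:
    """Flag `current` if it deviates from the median of the same
    hour-of-week slot in previous weeks by more than `tolerance`
    (0.5 means plus or minus 50%)."""
    if not same_slot_history:
        return False          # no baseline yet, so nothing to compare with
    baseline = median(same_slot_history)
    if baseline == 0:
        return current > 0    # any traffic is unusual for a silent slot
    return abs(current - baseline) / baseline > tolerance


# Requests per minute at 3am on the last four Mondays (made-up numbers).
history = [950, 1020, 990, 1100]
print(off_trend(1000, history), off_trend(300, history))
```

A relative tolerance means the same rule works for the overnight trough and the daytime peak, although very low-traffic slots would likely need a minimum absolute threshold as well.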
  14. Thank you all for listening. I hope you found at least some of what I’ve shared today to be useful and I’d love to hear back from you about anything I’ve mentioned tonight.