Tim E. Hall @thallinflux
VP, Products InfluxData
Lessons Learned: Running
InfluxDB Cloud at Scale
Discussion Topics
Brief History of InfluxDB Cloud
Gathering Metrics...and Logs
Visualization, Monitoring, and Alerting
Troubleshooting Scenarios
What did we miss? So many things…
A Brief History of InfluxDB Cloud 1.0…
May 2014
• Open Source DBaaS
• Hosted on Digital Ocean
April 2016
• Enterprise Edition DBaaS
• Kapacitor Add-On
• Hosted on AWS
August 2017
• Enterprise Edition DBaaS
• Chronograf and limited Kapacitor included
• Co-monitoring
• Pay-as-you-go storage
From development
to production
• Establish monitoring baselines
• Ensure visibility into health of the system
• Notifications for most common issues, before
they become outages
From OSS to Enterprise
[Diagram: a single InfluxDB OSS instance next to an InfluxDB Enterprise cluster with three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2).]
InfluxDB Cloud 1: Deployment Diagram
[Diagram. Labels recoverable from the slide:]
• AWS Account (separate accounts for Development/Acceptance and Production)
• Bastion (running procs: ssh); ssh access on :22
• Kubernetes cluster: Cluster Manager, Cluster Backup Service
• Cluster Manager API access and Chronograf UI access, each through TLS listeners on :443
• Subscriptions (single tenant): InfluxDB Enterprise Meta Nodes, InfluxDB Enterprise Data Nodes, Chronograf + Kapacitor (running procs: Docker, ssh, etcd)
• Add-ons: Kapacitor, Grafana
• Monitoring Cluster
• Quay.io software image repository
• Papertrail (log archival)
InfluxDB Cloud 1: Deployment Diagram
[Diagram. Labels recoverable from the slide:]
• Node types: Meta Nodes (Meta Node Quorum), Data Nodes, Kapacitor Node (optional add-on), Kach Node
• Running procs on each node: Docker, ssh, etcd
• Docker containers common to the nodes: Automatron, LogSpout, Telegraf, SkyDNS
• Service containers shown: InfluxEnterprise Meta, InfluxEnterprise Data, Kapacitor, Chronograf, Kapacitor (Chronograf access only)
• ALB (shared across n clusters) with TLS listeners on :443
• Browser-based access: :8088 (Chronograf)
• CLI and/or programmatic access: :8086 (Data Node), :9092 (Kapacitor Node)
• Shared security group (open ports between nodes): :3000, :4001, :7001, :8083, :8086, :8088, :8089, :8091, :9092
• Other port access: :46939 – Provisioning System; :22 – open to bastion host only (for ssh)
• External services: Papertrail (log archival), InfluxData Monitoring, InfluxData Provisioning
Description of common processes and services within InfluxCloud
Running processes
– Each node has the following processes running
• Docker – container infrastructure within which ALL InfluxEnterprise components execute
• ssh – secure shell to allow for secure, remote login
• etcd – provides a common rendezvous point for InfluxDB Enterprise components in the event of changes in the underlying infrastructure
– Docker containers common across nodes
• LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage, archival, and search.
• Telegraf gathers metrics and events from the system services and InfluxEnterprise components to facilitate remote monitoring.
• Automatron is a custom-built provisioning infrastructure that delivers software updates to any of the containers deployed across the nodes.
Deploy Telegraf on all nodes (meta and data)
Enabling these plugins measures the KPIs routinely associated with infrastructure and database performance
and serves as a good starting point for monitoring.
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load, uptime, and number of users logged in
3. Processes: counts processes grouped by state
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: Network related metrics
8. http_response: set up a local ping
9. filestat: Files to gather stats about (meta node only)
10. InfluxDB: gather stats from the InfluxDB Instance. (data node only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gather Telegraf related stats
4. Docker: if deployed in containers
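As a rough illustration of the minimum recommendation, the following Python sketch checks which of the recommended input plugins are enabled in a telegraf.conf for a given node role. The helper name missing_inputs and the sample config text are hypothetical; only the plugin names come from the list above.

```python
import re

# Plugin names taken from the minimum-recommendation list above.
COMMON = {"cpu", "system", "processes", "diskio", "disk", "mem",
          "netstat", "http_response"}
ROLE_SPECIFIC = {"meta": {"filestat"}, "data": {"influxdb"}}

# Matches an uncommented [[inputs.<name>]] table header at the start of a line.
INPUT_HEADER = re.compile(r"^\s*\[\[inputs\.([A-Za-z0-9_]+)\]\]", re.MULTILINE)


def missing_inputs(conf_text, role):
    """Return the recommended input plugins that conf_text does not enable."""
    enabled = set(INPUT_HEADER.findall(conf_text))
    required = COMMON | ROLE_SPECIFIC[role]
    return required - enabled


# Hypothetical config fragment: netstat is commented out, so it counts as missing.
sample = """
[[inputs.cpu]]
[[inputs.mem]]
# [[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
"""
print(sorted(missing_inputs(sample, "data")))
```

A check like this could run on each host during provisioning so that meta and data nodes do not drift apart in what they report.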
Telegraf Configuration: Global
[global_tags]
cluster_id = "$CLUSTER_ID"
environment = "$ENVIRONMENT"
[agent]
interval = "10s"
round_interval = true
metric_buffer_limit = 10000
metric_batch_size = 1000
collection_jitter = "0s"
flush_interval = "30s"
flush_jitter = "30s"
debug = false
hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators enable or disable plugins and their options by
editing it.
Global tags are specified in the [global_tags]
section of the config file in key = "value" format. Use
a GUID that uniquely identifies each “cluster”, and
ensure that the environment variable exists consistently
on all hosts (meta and data). Optionally, add other
tags, for example dev or prod for environment.
Agent configuration: recommended settings
for InfluxDB data collection. Adjust interval and
flush_interval based on:
● the desired “speed of observability”
● the retention policy for the data
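To make the "speed of observability" trade-off concrete, here is a small Python sketch of the arithmetic behind the agent settings. It assumes that a flush happens at most flush_interval plus flush_jitter after the previous one and that nothing is written earlier, which is a simplification; the function names are illustrative only.

```python
import math


def worst_case_flush_gap(flush_interval_s, flush_jitter_s):
    """Longest gap between two flushes: the interval plus the full jitter."""
    return flush_interval_s + flush_jitter_s


def rounds_between_flushes(interval_s, flush_interval_s, flush_jitter_s):
    """Upper bound on collection rounds that accumulate before a flush."""
    gap = worst_case_flush_gap(flush_interval_s, flush_jitter_s)
    return math.ceil(gap / interval_s)


def max_metrics_per_round(buffer_limit, interval_s, flush_interval_s, flush_jitter_s):
    """Largest per-round metric count that still fits in the buffer."""
    rounds = rounds_between_flushes(interval_s, flush_interval_s, flush_jitter_s)
    return buffer_limit // rounds


# Values from the [agent] section shown above.
print(worst_case_flush_gap(30, 30))              # 60 seconds
print(rounds_between_flushes(10, 30, 30))        # 6 rounds
print(max_metrics_per_round(10000, 10, 30, 30))  # 1666 metrics per round
```

Under these assumptions a data point can take up to about a minute to reach the database, and a host that produces more than roughly 1,666 metrics per collection round risks overrunning metric_buffer_limit before the next flush.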
Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
percpu = false
totalcpu = true
fieldpass = ["usage_idle",
"usage_user", "usage_system",
"usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
The input configuration collects metrics
from the infrastructure, database, and
system components in play.
Only the cpu plug-in is customized here; for the other plug-ins, the default config is sufficient.
Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
interval = "15s"
urls = ["http://<localhost>:8086/debug/vars"]
timeout = "15s"
[[inputs.http_response]] #DATA
address = "http://<localhost>:8086/ping"
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
The influxdb input collects all metrics from the
exposed /debug/vars endpoint.
http_response allows you to ping
individual data nodes and track
the response.
You can also set up a separate Telegraf
agent elsewhere within your
infrastructure to ping the available
cluster(s) through the load balancer.
disk allows you to configure the
various volumes/mount points on
disk (the locations of data, wal, and hinted
handoff) as well as root. (default config
options shown)
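For readers who want to see what the ping check amounts to, the following Python sketch issues the same kind of request by hand. It assumes an InfluxDB 1.x node, which answers /ping with 204 No Content when it is up; the function names and the example address are hypothetical.

```python
import time
import urllib.request


def ping(url, timeout_s=5.0):
    """GET the URL and return (HTTP status code, elapsed seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        status = resp.status
    return status, time.monotonic() - start


def is_healthy(status):
    """Treat only 204 No Content as a healthy /ping response."""
    return status == 204


# Example call (hypothetical address; not executed here):
#   status, elapsed = ping("http://localhost:8086/ping")
#   print(status, round(elapsed, 3), is_healthy(status))
```

Pointing the same call at the load balancer address instead of a single node gives the "front door" view mentioned above.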
Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] #META
address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
files =
["/var/lib/influxdb/meta/snapshots/*/state.bin"]
md5 = false
[[inputs.disk]]
mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping
individual meta nodes and track response
output.
filestat allows you to monitor metadata
snapshots.
disk allows you to configure the
various volumes/mount points on
disk -- locations of meta store -- and
root. (default config options shown)
Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
urls = [ "<target URL of DB>" ]
database = "telegraf"
retention_policy = "autogen"
timeout = "10s"
username = "<uname>"
password = "<pword>"
content_encoding = "gzip"
The output configuration tells Telegraf which
output sink to send the data to. Multiple
output sinks can be specified in the
configuration file.
** NOTE: This should point to the load
balancer if you are storing the metrics in
a cluster.
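The sketch below shows, in Python, roughly what a gzip-encoded write to the database and retention policy named above looks like at the HTTP level. It assumes the InfluxDB 1.x /write endpoint with db and rp query parameters; the helper names, the example host lb.example, and the sample point are hypothetical, and the line-protocol formatter handles float fields only with no escaping.

```python
import gzip
import urllib.parse
import urllib.request


def to_line(measurement, tags, fields, ts_ns):
    """Format one point as line protocol (float fields only, no escaping)."""
    tag_part = "".join(",%s=%s" % (k, tags[k]) for k in sorted(tags))
    field_part = ",".join("%s=%s" % (k, float(v)) for k, v in sorted(fields.items()))
    return "%s%s %s %d" % (measurement, tag_part, field_part, ts_ns)


def build_write_request(base_url, database, retention_policy, lines):
    """Build (but do not send) a gzip-compressed POST to the /write endpoint."""
    query = urllib.parse.urlencode({"db": database, "rp": retention_policy})
    body = gzip.compress("\n".join(lines).encode("utf-8"))
    return urllib.request.Request(
        url="%s/write?%s" % (base_url, query),
        data=body,
        headers={"Content-Encoding": "gzip"},
        method="POST",
    )


line = to_line("disk", {"host": "prod-1", "path": "/"}, {"used_percent": 42.5}, 1)
req = build_write_request("http://lb.example:8086", "telegraf", "autogen", [line])
print(line)
print(req.full_url)
```

Sending the request (for example with urllib.request.urlopen) and adding credentials are left out on purpose; Telegraf handles both when configured as shown.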
Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
# Drop all measurements that start with "syslog"
namedrop = [ "syslog*" ]
[[outputs.influxdb]]
urls = [ "http://localhost:8086" ]
database = "telegraf"
retention_policy = "14days"
# Only accept syslog data:
namepass = [ "syslog*" ]
Input configuration: add
the syslog input plug-in and
review its settings for
your environment.
Output configuration: use
namepass/namedrop to
direct metrics and logs to
different db.rp targets.
** NOTE: The output URLs should point
to the load balancer if you
are storing the metrics in
a cluster.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
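The routing effect of namepass and namedrop can be illustrated with a few lines of Python. This is a simplified model using shell-style glob matching, not Telegraf's implementation; the target labels and the function names are made up for the example.

```python
from fnmatch import fnmatchcase

# Mirrors the two outputs above: everything except syslog goes to the default
# retention policy, and syslog data goes to the "14days" retention policy.
OUTPUTS = [
    {"target": "telegraf/default-rp", "namedrop": ["syslog*"]},
    {"target": "telegraf/14days", "namepass": ["syslog*"]},
]


def accepts(output, name):
    """Apply namepass first (if set), then namedrop, to a measurement name."""
    passes = output.get("namepass")
    if passes and not any(fnmatchcase(name, p) for p in passes):
        return False
    drops = output.get("namedrop", [])
    return not any(fnmatchcase(name, p) for p in drops)


def route(name):
    """Return the targets that would receive a measurement with this name."""
    return [o["target"] for o in OUTPUTS if accepts(o, name)]


print(route("cpu"))
print(route("syslog"))
```

Because the two filters are complementary, each measurement lands in exactly one of the two targets.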
Visualization, Monitoring,
Alerting
We’ve gathered a wide variety of metrics...so now
what?
Dashboards!
Alerting: Common Metrics to Watch
Disk Usage
Hinted Handoff Queue
No metrics…. aka Deadman
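The deadman idea from the list above is simple enough to sketch in plain Python: raise an alert when the number of points seen in the most recent interval falls to or below a threshold. Kapacitor provides a deadman helper for this; the function below only illustrates the logic, and its name and parameters are hypothetical.

```python
def deadman(point_times, now, interval_s, threshold=0):
    """Return True when the points seen in the last interval are at or below threshold."""
    recent = [t for t in point_times if now - interval_s < t <= now]
    return len(recent) <= threshold


# Timestamps (in seconds) at which a host last reported.
reports = [100, 110, 120]
print(deadman(reports, now=125, interval_s=60))  # host is still reporting
print(deadman(reports, now=300, interval_s=60))  # host has gone silent
```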
Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
|query('''
SELECT last(used_percent) AS used_percent
FROM "telegraf"."autogen"."disk"
WHERE ("host" =~ /prod-.*/)
AND ("path" = '/var/lib/influxdb/data'
OR "path" = '/var/lib/influxdb/wal'
OR "path" = '/var/lib/influxdb/hh'
OR "path" = '/')
''')
.period(5m)
.every(10m)
.groupBy('host', 'role', 'environment', 'device')
Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
|alert()
.id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
.message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: %{{ index .Fields "used_percent" }}')
.warn(lambda: "used_percent" > warn_threshold)
.crit(lambda: "used_percent" > critical_threshold)
.slack()
.channel('#monitoring')
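The threshold logic of the disk usage alert can be restated in a few lines of Python. This is only a model of the comparison, using the same 85 and 95 percent thresholds; the function name and level labels are chosen for the example.

```python
WARN_THRESHOLD = 85
CRITICAL_THRESHOLD = 95


def disk_level(used_percent):
    """Map a used_percent value to an alert level using strict comparisons."""
    if used_percent > CRITICAL_THRESHOLD:
        return "CRITICAL"
    if used_percent > WARN_THRESHOLD:
        return "WARNING"
    return "OK"


for value in (80, 85, 90, 96):
    print(value, disk_level(value))
```

Note that the comparisons are strict, so a disk at exactly 85 percent is still reported as OK.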
Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
|query('''
SELECT max(queueBytes) as "max"
FROM "telegraf"."autogen"."influxdb_hh_processor"
WHERE ("host" =~ /prod-.*/)
''')
.groupBy('host', 'cluster_id')
.period(5m)
.every(10m)
|eval(lambda: "max" / 1048576.0)
.as('queue_size_mb')
Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
|alert()
.id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
.message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
.details('')
.warn(lambda: "queue_size_mb" > warn_threshold)
.crit(lambda: "queue_size_mb" > crit_threshold)
.stateChangesOnly()
.slack()
.pagerDuty()
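As a sketch of what the hinted handoff task computes, the Python below converts queue bytes to megabytes, applies the 3500 MB and 5000 MB thresholds, and keeps only the samples where the level changes, which is a simplified reading of stateChangesOnly. The function names and the sample values are illustrative.

```python
BYTES_PER_MB = 1048576.0
WARN_MB = 3500
CRIT_MB = 5000


def level(queue_bytes):
    """Convert queue bytes to MB and map the result to an alert level."""
    queue_size_mb = queue_bytes / BYTES_PER_MB
    if queue_size_mb > CRIT_MB:
        return "CRITICAL"
    if queue_size_mb > WARN_MB:
        return "WARNING"
    return "OK"


def state_changes_only(levels, initial="OK"):
    """Return (index, level) pairs only where the level differs from the previous one."""
    changes = []
    previous = initial
    for i, current in enumerate(levels):
        if current != previous:
            changes.append((i, current))
        previous = current
    return changes


# Hypothetical queue sizes in MB, converted to bytes as the raw field would be.
samples_mb = [1000, 4000, 4200, 6000, 6000, 100]
levels = [level(mb * BYTES_PER_MB) for mb in samples_mb]
print(state_changes_only(levels))
```

Six samples produce three notifications here, which is the point of suppressing repeats: the channel sees transitions rather than a message every ten minutes.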
https://docs.influxdata.com
Troubleshooting
Common Troubleshooting Scenarios
• OOM Loop
• Runaway Series Cardinality
Common Troubleshooting Scenarios
Workload Type
• Which type are we
looking at?
– Read heavy
– Write heavy
– Mixed?
– Establish baselines and
understand “normal”
using metrics and
visualization
– Baselines allow us to
understand change over
time and help determine
when it is time to scale up
Log Analysis
• Metrics First!
– Highlights where you
should look within the
log files
• Logs allow pinpointing
the root cause of an
issue observed in the
metrics
– Cache max memory size
– Hinted Handoff Queue
“Blocked”
IOPS & Disk Throughput
• Understand the
capabilities of the
hardware by plan size
– Develop and review
sizing guidelines
– Understand max read
and write limits based
on machine class and
drive types – these can
change as you scale!
What did we miss? So many things…
Head for the balcony!
– Shift from instance-based dashboards to “fleet management”
What’s the experience of the “customer”?
– Real user monitoring from the front-door
– Integration with subscription management system
SSL Cert expiration
E-commerce system monitoring
– Health and availability of supporting components
Recap
Gather Metrics...and Logs (for context)
Visualize, Monitor, and Alert… tune based on your environment
Iterate and address “new” scenarios to eliminate alert fatigue
https://community.influxdata.com https://docs.influxdata.com
https://www.influxdata.com/products/influxdb-cloud/
Thank You

Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | Tim Hall | InfluxData

  • 1. Lessons Learned: Running InfluxDB Cloud at Scale. Tim E. Hall (@thallinflux), VP, Products, InfluxData
  • 2. Discussion Topics: Brief History of InfluxDB Cloud; Gathering Metrics...and Logs; Visualization, Monitoring, and Alerting; Troubleshooting Scenarios; What did we miss? So many things…
  • 3. A Brief History of InfluxDB Cloud 1.0… May 2014: Open Source DBaaS, hosted on Digital Ocean. April 2016: Enterprise Edition DBaaS, Kapacitor add-on, hosted on AWS. August 2017: Enterprise Edition DBaaS, Chronograf and limited Kapacitor included, co-monitoring, pay-as-you-go storage.
  • 4. From development to production: • Establish monitoring baselines • Ensure visibility into the health of the system • Notifications for the most common issues, before they become outages
  • 5. From OSS to Enterprise [diagram; labels only]: InfluxDB OSS; InfluxDB Enterprise with meta nodes (Meta 1, Meta 2, Meta 3) and data nodes (Data Node 1, Data Node 2)
  • 6. InfluxDB Cloud 1: Deployment Diagram [diagram; labels only]
      AWS Account (separate accounts for Development/Acceptance and Production)
      Monitoring Cluster; Kubernetes cluster; ssh Bastion (ssh access :22)
      Subscriptions (single tenant): InfluxDB Enterprise Meta Nodes, InfluxDB Enterprise Data Nodes, Chronograf + Kapacitor
      Add-ons: Kapacitor, Grafana
      Cluster Manager (Cluster Manager API access, :443 TLS listeners); Chronograf UI access (:443 TLS listeners); Cluster Backup Service
      Quay.io software image repository; Papertrail (log archival)
      Running procs: Docker, ssh, etcd
  • 7. InfluxDB Cloud 1: Deployment Diagram [diagram; labels only]
      Node types: Meta Nodes (Meta Node Quorum), Data Nodes, Kapacitor Node (optional add-on), Kach Node
      Running procs on each node: Docker, ssh, etcd
      Containers and components: InfluxEnterprise Meta, InfluxEnterprise Data, Kapacitor (Chronograf access only), Chronograf, Automatron, LogSpout, SkyDNS, Telegraf
      External services: Papertrail (log archival), InfluxData Monitoring, InfluxData Provisioning
      Browser-based access: :8088 (Chronograf), :443 TLS listeners
      CLI and/or programmatic access: :8086 (Data Node), :9092 (Kapacitor Node), :443 TLS listeners
      ALB (shared across n clusters)
      Shared security group (open ports between nodes): :3000, :4001, :7001, :8083, :8086, :8088, :8089, :8091, :9092
      Other port access: :46939 (provisioning system); :22 (open to bastion host only, for ssh)
  • 8. Description of common processes and services within InfluxCloud
      Running processes: each node runs the following
      • Docker: container infrastructure within which ALL InfluxEnterprise components execute
      • ssh: secure shell for secure remote login
      • etcd: provides a common rendezvous point for InfluxDB Enterprise components in the event of changes in the underlying infrastructure
      Docker containers common across nodes
      • LogSpout gathers InfluxEnterprise-related log output and delivers it to Papertrail for storage, archival, and search.
      • Telegraf gathers metrics and events from the system services and InfluxEnterprise components to facilitate remote monitoring.
      • Automatron is a custom-built provisioning infrastructure that delivers software updates to any of the containers deployed across the nodes.
      [diagram labels: Papertrail (log archival), Automatron, LogSpout, Telegraf, SkyDNS, InfluxData Monitoring, InfluxData Provisioning; running procs: Docker, ssh, etcd]
  • 9. Deploy Telegraf on all nodes (meta and data)
      Enabling these plugins measures the KPIs routinely associated with infrastructure and database performance and serves as a good starting point for monitoring.
      Minimum recommendation:
      1. CPU: collects standard CPU metrics
      2. System: gathers general stats on system load, uptime, and number of users logged in
      3. Processes: gathers process counts, grouped by state
      4. DiskIO: gathers metrics about disk traffic and timing
      5. Disk: gathers metrics about disk usage
      6. Mem: collects system memory metrics
      7. NetStat: gathers network-related metrics
      8. http_response: sets up a local ping
      9. filestat: gathers stats about the listed files (meta node only)
      10. InfluxDB: gathers stats from the InfluxDB instance (data node only)
      Optional:
      1. Logs: requires syslog
      2. Swap: collects system swap metrics
      3. Internal: gathers Telegraf-related stats
      4. Docker: if deployed in containers
  • 10. Telegraf Configuration: Global
      # Environment variables used as string values must be quoted
      [global_tags]
        cluster_id = "$CLUSTER_ID"
        environment = "$ENVIRONMENT"
      [agent]
        interval = "10s"
        round_interval = true
        metric_buffer_limit = 10000
        metric_batch_size = 1000
        collection_jitter = "0s"
        flush_interval = "30s"
        flush_jitter = "30s"
        debug = false
        hostname = ""
      All plugins are controlled by the telegraf.conf file. Administrators can enable or disable plugins and options by activating or deactivating them in that file.
      Global tags can be specified in the [global_tags] section of the config file in key="value" format. Use a GUID that uniquely identifies each “cluster”, and ensure that the environment variable exists consistently on all hosts (meta and data). Optionally, add other tags if desired. Example: dev, prod for environment.
      Agent configuration: recommended settings for InfluxDB data collection. Adjust interval and flush_interval based on:
      ● the desired “speed of observability”
      ● the retention policy for the data
  • 11. Telegraf Configuration: Inputs (common)
      # INPUTS
      [[inputs.cpu]]
        percpu = false
        totalcpu = true
        fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"]
      [[inputs.mem]]
      [[inputs.netstat]]
      [[inputs.system]]
      [[inputs.diskio]]
      Input configuration: these plugins grab metrics from the various infrastructure, database, and system components in play. For the plugins other than cpu, the default config is sufficient.
  • 12. Telegraf Configuration: Inputs Data Nodes
      # INPUTS
      [[inputs.influxdb]]
        interval = "15s"
        urls = ["http://<localhost>:8086/debug/vars"]
        timeout = "15s"
      [[inputs.http_response]]
        # DATA
        address = "http://<localhost>:8086/ping"
      [[inputs.disk]]
        mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
      influxdb grabs all metrics from the exposed /debug/vars endpoint.
      http_response allows you to ping individual data nodes and track the response output. You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available cluster(s) through the load balancer.
      disk allows you to configure the various volumes/mount points on disk: the locations of data, wal, and hinted handoff, plus root. (default config options shown)
  • 13. Telegraf Configuration: Inputs Meta Nodes
      # INPUTS
      [[inputs.http_response]]
        # META
        address = "http://<localhost>:8091/ping"
      [[inputs.filestat]]
        files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
        md5 = false
      [[inputs.disk]]
        mount_points = ["/var/lib/influxdb/meta", "/"]
      http_response allows you to ping individual meta nodes and track the response output.
      filestat allows you to monitor metadata snapshots.
      disk allows you to configure the various volumes/mount points on disk: the location of the meta store, plus root. (default config options shown)
  • 14. Telegraf Configuration: Outputs
      # OUTPUTS
      [[outputs.influxdb]]
        urls = ["<target URL of DB>"]
        database = "telegraf"
        retention_policy = "autogen"
        timeout = "10s"
        username = "<uname>"
        password = "<pword>"
        content_encoding = "gzip"
      Output configuration tells Telegraf which output sink to send the data to. Multiple output sinks can be specified in the configuration file.
      ** NOTE: The URL should point to the load balancer if you are storing the metrics in a cluster.
  • 15. Telegraf Configuration: Gathering Logs
      # INPUT
      [[inputs.syslog]]
      # OUTPUTS
      [[outputs.influxdb]]
        urls = ["http://localhost:8086"]
        database = "telegraf"
        # Drop all measurements that start with "syslog"
        namedrop = ["syslog*"]
      [[outputs.influxdb]]
        urls = ["http://localhost:8086"]
        database = "telegraf"
        retention_policy = "14days"
        # Only accept syslog data:
        namepass = ["syslog*"]
      InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
      Input configuration: add the syslog input plugin and review its settings for your environment.
      Output configuration: use namepass/namedrop to direct metrics and logs to different db.rp targets.
      ** NOTE: The URLs should point to the load balancer if you are storing the metrics in a cluster.
  • 17. We’ve gathered a wide variety of metrics...so now what? Dashboards!
  • 18. Alerting: Common Metrics to Watch: Disk Usage; Hinted Handoff Queue; No metrics… aka Deadman
  • 19. Disk Usage Batch Task: TICKscript
      // Monitor disk usage for all hosts
      var data = batch
          |query('''
              SELECT last(used_percent) AS used_percent
              FROM "telegraf"."autogen"."disk"
              WHERE ("host" =~ /prod-.*/)
                AND ("path" = '/var/lib/influxdb/data'
                  OR "path" = '/var/lib/influxdb/wal'
                  OR "path" = '/var/lib/influxdb/hh'
                  OR "path" = '/')
          ''')
              .period(5m)
              .every(10m)
              .groupBy('host', 'role', 'environment', 'device')
      // The alias keeps the field name "used_percent", which the alert lambdas on the next slide reference.
  • 20. Disk Usage Alert: TICKscript
      var warn_threshold = 85
      var critical_threshold = 95
      data
          |alert()
              .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
              .message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: {{ index .Fields "used_percent" }}%')
              .warn(lambda: "used_percent" > warn_threshold)
              .crit(lambda: "used_percent" > critical_threshold)
              .slack()
              .channel('#monitoring')
  • 21. Hinted Handoff Queue Batch Task: TICKscript
      // This generates alerts for high hinted-handoff queues for InfluxEnterprise
      var queue_size = batch
          |query('''
              SELECT max(queueBytes) AS "max"
              FROM "telegraf"."autogen"."influxdb_hh_processor"
              WHERE ("host" =~ /prod-.*/)
          ''')
              .groupBy('host', 'cluster_id')
              .period(5m)
              .every(10m)
          // Convert bytes to megabytes
          |eval(lambda: "max" / 1048576.0)
              .as('queue_size_mb')
  • 22. Hinted Handoff Queue Alert: TICKscript
      // Thresholds are in MB
      var warn_threshold = 3500
      var crit_threshold = 5000
      queue_size
          |alert()
              .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
              .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
              .details('')
              .warn(lambda: "queue_size_mb" > warn_threshold)
              .crit(lambda: "queue_size_mb" > crit_threshold)
              .stateChangesOnly()
              .slack()
              .pagerDuty()
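      The alerting slide also names a deadman ("No metrics") check, but the transcript carries no script for it. A minimal sketch using Kapacitor's deadman helper might look like the following. The cpu measurement, the 5m window, the message wording, and the #monitoring channel (borrowed from the disk usage alert) are assumptions, not the speaker's configuration, and the sketch assumes a Kapacitor version in which deadman evaluates each group separately.

```tickscript
// Deadman sketch: alert when a host stops reporting metrics.
// Assumptions: Telegraf writes "cpu" to telegraf.autogen from every host,
// and production hosts are named prod-* as in the other tasks.
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        .where(lambda: "host" =~ /prod-.*/)
        .groupBy('host', 'environment')

data
    // deadman(threshold, interval): critical when the number of points
    // received in the interval is at or below the threshold.
    |deadman(0.0, 5m)
        .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
        .message('Alert: Deadman, Level: {{ .Level }}, {{ .ID }} has sent no cpu metrics for 5m')
        .stateChangesOnly()
        .slack()
        .channel('#monitoring')
```

      Unlike the two batch tasks above, this is a stream task, so it only sees points written to the Kapacitor instance it runs on. The window should be comfortably longer than the Telegraf flush_interval plus jitter to avoid false alarms.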
  • 25. Common Troubleshooting Scenarios • OOM Loop • Runaway Series Cardinality
  • 26. Common Troubleshooting Scenarios
      Workload Type
      • Which type are we looking at?
        – Read heavy
        – Write heavy
        – Mixed?
        – Establish baselines and understand “normal” using metrics and visualization
        – Baselines allow us to understand change over time and help determine when it is time to scale up
      Log Analysis
      • Metrics first!
        – Metrics highlight where you should look within the log files
      • Logs allow for pinpointing the root cause of an issue observed through metrics
        – Cache max memory size
        – Hinted Handoff Queue “Blocked”
      IOPS & Disk Throughput
      • Understand the capabilities of the hardware by plan size
        – Develop and review sizing guidelines
        – Understand max read and write limits based on machine class and drive types; these can change as you scale!
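      For the runaway series cardinality scenario, the same batch-alert pattern used for disk usage can watch the per-database series count that the Telegraf influxdb input reports. The sketch below is illustrative only: the thresholds are hypothetical placeholders to be replaced with values taken from your own baseline, and the #monitoring channel is borrowed from the disk usage alert.

```tickscript
// Series cardinality sketch.
// Assumption: the Telegraf influxdb input is enabled on data nodes, so each
// database reports numSeries in the influxdb_database measurement.
var series = batch
    |query('''
        SELECT max("numSeries") AS "series"
        FROM "telegraf"."autogen"."influxdb_database"
        WHERE ("host" =~ /prod-.*/)
    ''')
        .period(5m)
        .every(10m)
        .groupBy('host', 'database')

// Hypothetical thresholds; derive real values from your own baseline.
var warn_threshold = 800000
var crit_threshold = 1000000

series
    |alert()
        .id('{{ .TaskName }}/{{ index .Tags "host" }}/{{ index .Tags "database" }}')
        .message('Database {{ index .Tags "database" }} on {{ index .Tags "host" }} has {{ index .Fields "series" }} series')
        .warn(lambda: "series" > warn_threshold)
        .crit(lambda: "series" > crit_threshold)
        .stateChangesOnly()
        .slack()
        .channel('#monitoring')
```

      Grouping by database as well as host points the alert at the database whose series count is growing, which narrows down where to look in the logs.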
  • 27. What did we miss? So many things…
      Head for the balcony!
        – Shift from instance-based dashboards to “fleet management”
      What’s the experience of the “customer”?
        – Real user monitoring from the front door
        – Integration with subscription management system
      SSL cert expiration
      E-commerce system monitoring
        – Health and availability of supporting components
  • 28. Recap
      Gather Metrics...and Logs (for context)
      Visualize, Monitor, and Alert… tune based on your environment
      Iterate and address “new” scenarios to eliminate alert fatigue
      https://community.influxdata.com
      https://docs.influxdata.com