SlideShare une entreprise Scribd logo
1  sur  21
Argus Production
Monitoring At
Salesforce
Service Health & Observability at Scale
Tom Valine
Director, Infrastructure Engineering
tvaline@salesforce.com
in/tvaline
Bhinav Sura
Software Engineer, Infrastructure Engineering
bhinav.sura@salesforce.com
in/bhinavsura
What is Argus?
● Time Series Data & Events
● Inbuilt Service Protection
● Alerting
● Flexible Dashboarding
● Full REST API
● High Throughput
● Low Latency
● Horizontally Scalable
● In Use By
○ Capacity Planning
○ Search
○ Feature Teams
○ Site Reliability
○ Customer Success
But Why Another Monitoring System?
● Technology changes
frequently!
● Insulate our customers
● Performance
● Trust
● Programmatic access for
everything
● Multi-tenancy
● Correlation with non-
timeseries data
● Highly dimensional
I’ve seen this somewhere before...
Metrics
● Transforms
● Namespace
● Scope
● Name
● Tags
● Aggregator
● Downsampler
Events
● Namespace
● Scope
● Name
● Tags
● Type
● User
SCALE(-2d:-1d:dva:argus:freemem{host=*}:min:1d-min, $1e-6)
TRANSFORM
START
END
NAMESPACE
SCOPE
METRIC
TAGS
AGG
DS
PARAMS
-2d:-1d:dva:argus:release{host=*}:major:admin
START
END
NAMESPACE
SCOPE
NAME
TAGS
TYPE
USER
● First Class Data
● Decoupled from Time
Series
● Multiple Events Per
Timestamp
● Event Categories
● Identifiable per User
● Overlay on Any Time
Series
Events
Alerting
● CRON Format
● Alert on Missing Data
● Single Ended & Range
Comparisons
● Inertia
● Cooldown
● Multiple Triggers
● Multiple Notifications
○ Audit
○ Email
○ GOC++
○ Salesforce Chatter
○ PagerDuty
● Event Backannotation
Warden
● Policy Driven Suspension
Mechanism
● Per User
● Application & Subsystem
● Progressively Punitive
● Indefinite Suspension
Supported
● Customizeable
Dashboarding
● Maintaining dashboards is
a horrible business to be
in
● Empower the users, get
out of their way
● Markup based
● Custom tags for
visualization elements
● HTML for everything else
REST
● API First
● All functionality exposed
via services
● Decoupled UI
● Authenticated
○ Login
○ Do stuff
○ Logout
● Get out of User's Way!
○ Orchestra Client
○ ArgusPoke
○ Dashboard Creation
Tool
How does it work?
METRICS ANNOTATION USER ENTITY
ALERTS MAIL SCHEDULING MONITORING
WEB SERVICES
AUTH ORM MQ TSDB
WEB UI CUSTOM APPS OTHER CLIENTS
DASHBOARD MANAGEMENT WARDEN NAMESPACE
SCHEMA WILDCARDING CACHING INTERLOCK
Okay, but how does it REALLY work?
MESSAGE BUS
HBASE/TSDB/RDBMS/CACHING
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
C
L
CO
RE
W
S
Cool, how will it evolve going forward?
HBASE/TSDB/RDBMS/CACHE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
UI
W
S
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
CO
RE
W
S
HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE
ROUTE/FORK/JOIN+M/R
ROUTE/FORK/JOIN+M/R
MESSAGE BUS MESSAGE BUS MESSAGE BUS MESSAGE BUS MESSAGE BUS
ROUTE/FORK/JOIN+M/R
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
C
L
Alert Evaluation Data Flows
Message Queue:
1. Scheduling Service updates
alert schedule every 10 minutes.
2. Scheduler submits scheduled
jobs to queue
3. Minimum interval of 1 minute
Alert Client:
1. Dequeues from alert queue.
2. Query ranges adjusted for
scheduling latency
3. Triggers evaluated
4. Notifications sent
5. Cooldowns updated.
ALERT DATA STORE
SCHEUDLING
SERVICE
ALERT CACHE
ARGUS WS
ALERT 8713
...
ALERT 4141
ALERT 9810
Metric & Event Data Flows
Message Queue:
1. Writes are asynchronous with high
degree of parallelism.
2. Queue used as a shock absorber.
Tolerant to lower level
failures/downtime.
3. Kafka for scalability. One topic each
for metrics and annotations.
Number of partitions in the order of
100s.
ArgusMetricsQueue:
1. Consumed by 2 types of clients:
MetricCommit and SchemaCommit
2. MetricCommit client commits the
actual time series data to persistent
storage (using OTSDB or Phoenix).
3. SchemaCommit client only uses the
metric metadata to create metric
schema records and commits them
to HBase (using AsyncHBase).
TIMESERIES STORE
ARGUS WS
METRIC
...
METRIC
METRIC
METRIC SERVICE
SCHEMA STORE
TSDB Service Implementation - OpenTSDB
● Uses HBase underneath
● RowKey: <metric_uid><timestamp><tagk1><tagv1>[...<tagkn><tagvn>].
● Stores actual time series values on hourly boundaries (All values within an hour stored in the
same cell)
● Pros:
○ Extremely fast when you query using complete metric name.
○ 5M datapoints/min write throughput per write daemon.
● Cons:
○ Tag Cardinality - Total number of tags per metric is limited to 8
○ Tag Cardinality - As product of tag values across all tag keys increases, performance decreases
drastically
○ UID Exhaustion - 16M UIDs each for metric, tagk and tagv names by default. Once these are
exhausted, no new metrics, tagk or tagv can be created.
TSDB Service Implementation - Phoenix
● Uses HBase underneath
● RowKey: <metric_uid><timestamp><tagv1>[...<tagvn>].
● Metric modeled as Phoenix VIEW
○ Schema is introspectable and managed outside of data
○ Supports secondary indexes on value and/or tag(s)
● Parallelizes query and pushes computation to server
○ Server-side aggregation conserves network bandwidth
○ Allows SKIP_SCAN filter optimization for minimizing data scanned
○ Leverages ROW_TIMESTAMP optimization for filtering HFiles
● Performance on par or better than OpenTSDB
● Ad hoc SQL query capability
○ Join against other Phoenix tables
● Longer term leverage Drillix (Phoenix + Drill)
○ Cross cluster queries
○ Joins to other non HBase data sources
Schema Service Motivation
● Discover Metrics
○ What all metrics exist within a scope?
○ For a given <scope, metric> combination, what all tags exist?
○ Given a metric, what all scopes contain this metric?
○ What are all the tag values that exist for a given tag key?
● Support Wildcard Queries
○ Non-wildcard query
■ -1h:system.myDatacenter.myPod:Cpu.perc:avg:1m-avg
○ Wildcard query
■ -1h:system.myDatacenter.*:Cpu.perc:avg:1m-avg
■ -1h:system.myDatacenter.myPod:Cpu*:avg:1m-avg
■ -1h:system.myDatacenter.myPod:Cpu.perc{device=*app*}:avg:1m-avg
Schema Service Implementation
● AsyncHBase Schema Service:
○ Uses HBase underneath
○ SchemaRecord: namespace, scope, metricname, tagk, tagv. No data points.
○ Each record indexed in 2 ways in 2 different tables.
○ MetricIndexed schema table:
■ RowKey: <metricname><scope><namespace><tagk><tagv>
○ ScopeIndexed schema table:
■ RowKey: <scope><metricname><namespace><tagk><tagv>
○ Decide what table to use based on the type of query.
○ Pros:
■ Efficient retrieval for schema records for most types of queries
○ Cons:
■ Storage duplication
● DiscoveryService:
○ Uses SchemaService internally
○ Ability to filter records by type
■ For e.g. Filter all unique scopes that match *myScope*
○ Expand Wildcard query and return a collection of non-wildcard queries
Caching
● CachedTSDB Service:
○ Uses RedisCache service and the configured TSDBService implementation (OpenTSDB or
PhoenixTSDB)
○ Query Level Caching (caches synthetic data)
○ Caches data spanning a window of more than last 24 hours.
○ Data is cached by fracturing it on day boundary.
■ For e.g.: Query spanning 5 days is stored using 5 keys on the cache.
○ Support for partial hits
○ Cache expiry time of an hour (can be increased by running a separate Cache update process)
● CachedDiscovery Service:
○ Uses RedisCache service and the configured DiscoveryService implementation
○ Cache queries already expanded
○ Cache expiry time of a day
Developed By
● Anand Subramanian
● Bhinav Sura
● Tom Valine
● Jigna Bhatt
● Ruofan Zhang
● Dilip Devaraj
● Raj Sarkapally
● Kiran Gowdru
More Information
​https://github.com/SalesforceEng/Argus
thank y u

Contenu connexe

Tendances

Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
Vinoth Chandar
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
Michael Stack
 
Hadoop Networking at Datasift
Hadoop Networking at DatasiftHadoop Networking at Datasift
Hadoop Networking at Datasift
huguk
 

Tendances (20)

HBaseConAsia2018 Keynote1: Apache HBase Project Status
HBaseConAsia2018 Keynote1: Apache HBase Project StatusHBaseConAsia2018 Keynote1: Apache HBase Project Status
HBaseConAsia2018 Keynote1: Apache HBase Project Status
 
Tales from Taming the Long Tail
Tales from Taming the Long TailTales from Taming the Long Tail
Tales from Taming the Long Tail
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
Scaling HDFS at Xiaomi
Scaling HDFS at XiaomiScaling HDFS at Xiaomi
Scaling HDFS at Xiaomi
 
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and CloudHBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
HBaseConAsia2018 Keynote 2: Recent Development of HBase in Alibaba and Cloud
 
Cassandra Tuning - above and beyond
Cassandra Tuning - above and beyondCassandra Tuning - above and beyond
Cassandra Tuning - above and beyond
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
Google mesa
Google mesaGoogle mesa
Google mesa
 
HBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay SearchHBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay Search
 
Hadoop Networking at Datasift
Hadoop Networking at DatasiftHadoop Networking at Datasift
Hadoop Networking at Datasift
 
Amazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni VamvadelisAmazon RedShift - Ianni Vamvadelis
Amazon RedShift - Ianni Vamvadelis
 
Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
 

En vedette

En vedette (16)

Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
Date-tiered Compaction Policy for Time-series Data
Date-tiered Compaction Policy for Time-series DataDate-tiered Compaction Policy for Time-series Data
Date-tiered Compaction Policy for Time-series Data
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
HBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ FlipboardHBaseCon 2015: HBase @ Flipboard
HBaseCon 2015: HBase @ Flipboard
 
Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
 
Apache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at XiaomiApache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at Xiaomi
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
 
Keynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! ScaleKeynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! Scale
 
Apache HBase at Airbnb
Apache HBase at Airbnb Apache HBase at Airbnb
Apache HBase at Airbnb
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Improvements to Apache HBase and Its Applications in Alibaba Search
Improvements to Apache HBase and Its Applications in Alibaba Search Improvements to Apache HBase and Its Applications in Alibaba Search
Improvements to Apache HBase and Its Applications in Alibaba Search
 
Apache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New Features
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 

Similaire à Argus Production Monitoring at Salesforce

Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
DataStax
 
PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013
Andrew Dunstan
 

Similaire à Argus Production Monitoring at Salesforce (20)

MariaDB Paris Workshop 2023 - Performance Optimization
MariaDB Paris Workshop 2023 - Performance OptimizationMariaDB Paris Workshop 2023 - Performance Optimization
MariaDB Paris Workshop 2023 - Performance Optimization
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Aerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike Hybrid Memory Architecture
Aerospike Hybrid Memory Architecture
 
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]
 
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Fu...
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQL
 
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Les fonctionnalites mariadb
Les fonctionnalites mariadbLes fonctionnalites mariadb
Les fonctionnalites mariadb
 
When Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu MaWhen Apache Spark Meets TiDB with Xiaoyu Ma
When Apache Spark Meets TiDB with Xiaoyu Ma
 
PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013PostgreSQL and Redis - talk at pgcon 2013
PostgreSQL and Redis - talk at pgcon 2013
 
What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1What to expect from MariaDB Platform X5, part 1
What to expect from MariaDB Platform X5, part 1
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 

Plus de HBaseCon

Plus de HBaseCon (20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
 

Dernier

CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
 

Dernier (20)

Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 

Argus Production Monitoring at Salesforce

  • 1. Argus Production Monitoring At Salesforce Service Health & Observability at Scale Tom Valine Director, Infrastructure Engineering tvaline@salesforce.com in/tvaline Bhinav Sura Software Engineer, Infrastructure Engineering bhinav.sura@salesforce.com in/bhinavsura
  • 2. What is Argus? ● Time Series Data & Events ● Inbuilt Service Protection ● Alerting ● Flexible Dashboarding ● Full REST API ● High Throughput ● Low Latency ● Horizontally Scalable ● In Use By ○ Capacity Planning ○ Search ○ Feature Teams ○ Site Reliability ○ Customer Success
  • 3. But Why Another Monitoring System? ● Technology changes frequently! ● Insulate our customers ● Performance ● Trust ● Programmatic access for everything ● Multi-tenancy ● Correlation with non- timeseries data ● Highly dimensional
  • 4. I’ve seen this somewhere before... Metrics ● Transforms ● Namespace ● Scope ● Name ● Tags ● Aggregator ● Downsampler Events ● Namespace ● Scope ● Name ● Tags ● Type ● User SCALE(-2d:-1d:dva:argus:freemem{host=*}:min:1d-min, $1e-6) TRANSFORM START END NAMESPACE SCOPE METRIC TAGS AGG DS PARAMS -2d:-1d:dva:argus:release{host=*}:major:admin START END NAMESPACE SCOPE NAME TAGS TYPE USER
  • 5. ● First Class Data ● Decoupled from Time Series ● Multiple Events Per Timestamp ● Event Categories ● Identifiable per User ● Overlay on Any Time Series Events
  • 6. Alerting ● CRON Format ● Alert on Missing Data ● Single Ended & Range Comparisons ● Inertia ● Cooldown ● Multiple Triggers ● Multiple Notifications ○ Audit ○ Email ○ GOC++ ○ Salesforce Chatter ○ PagerDuty ● Event Backannotation
  • 7. Warden ● Policy Driven Suspension Mechanism ● Per User ● Application & Subsystem ● Progressively Punitive ● Indefinite Suspension Supported ● Customizeable
  • 8. Dashboarding ● Maintaining dashboards is a horrible business to be in ● Empower the users, get out of their way ● Markup based ● Custom tags for visualization elements ● HTML for everything else
  • 9. REST ● API First ● All functionality exposed via services ● Decoupled UI ● Authenticated ○ Login ○ Do stuff ○ Logout ● Get out of User's Way! ○ Orchestra Client ○ ArgusPoke ○ Dashboard Creation Tool
  • 10. How does it work? METRICS ANNOTATION USER ENTITY ALERTS MAIL SCHEDULING MONITORING WEB SERVICES AUTH ORM MQ TSDB WEB UI CUSTOM APPS OTHER CLIENTS DASHBOARD MANAGEMENT WARDEN NAMESPACE SCHEMA WILDCARDING CACHING INTERLOCK
  • 11. Okay, but how does it REALLY work? MESSAGE BUS HBASE/TSDB/RDBMS/CACHING UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE C L CO RE W S
  • 12. Cool, how will it evolve going forward? HBASE/TSDB/RDBMS/CACHE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE UI W S CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE CO RE W S HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE HBASE/TSDB/RDBMS/CACHE ROUTE/FORK/JOIN+M/R ROUTE/FORK/JOIN+M/R MESSAGE BUS MESSAGE BUS MESSAGE BUS MESSAGE BUS MESSAGE BUS ROUTE/FORK/JOIN+M/R C L C L C L C L C L C L C L C L C L C L C L C L C L C L
  • 13. Alert Evaluation Data Flows Message Queue: 1. Scheduling Service updates alert schedule every 10 minutes. 2. Scheduler submits scheduled jobs to queue 3. Minimum interval of 1 minute Alert Client: 1. Dequeues from alert queue. 2. Query ranges adjusted for scheduling latency 3. Triggers evaluated 4. Notifications sent 5. Cooldowns updated. ALERT DATA STORE SCHEUDLING SERVICE ALERT CACHE ARGUS WS ALERT 8713 ... ALERT 4141 ALERT 9810
  • 14. Metric & Event Data Flows Message Queue: 1. Writes are asynchronous with high degree of parallelism. 2. Queue used as a shock absorber. Tolerant to lower level failures/downtime. 3. Kafka for scalability. One topic each for metrics and annotations. Number of partitions in the order of 100s. ArgusMetricsQueue: 1. Consumed by 2 types of clients: MetricCommit and SchemaCommit 2. MetricCommit client commits the actual time series data to persistent storage (using OTSDB or Phoenix). 3. SchemaCommit client only uses the metric metadata to create metric schema records and commits them to HBase (using AsyncHBase). TIMESERIES STORE ARGUS WS METRIC ... METRIC METRIC METRIC SERVICE SCHEMA STORE
  • 15. TSDB Service Implementation - OpenTSDB ● Uses HBase underneath ● RowKey: <metric_uid><timestamp><tagk1><tagv1>[...<tagkn><tagvn>]. ● Stores actual time series values on hourly boundaries (All values within an hour stored in the same cell) ● Pros: ○ Extremely fast when you query using complete metric name. ○ 5M datapoints/min write throughput per write daemon. ● Cons: ○ Tag Cardinality - Total number of tags per metric is limited to 8 ○ Tag Cardinality - As product of tag values across all tag keys increases, performance decreases drastically ○ UID Exhaustion - 16M UIDs each for metric, tagk and tagv names by default. Once these are exhausted, no new metrics, tagk or tagv can be created.
  • 16. TSDB Service Implementation - Phoenix ● Uses HBase underneath ● RowKey: <metric_uid><timestamp><tagv1>[...<tagvn>]. ● Metric modeled as Phoenix VIEW ○ Schema is introspectable and managed outside of data ○ Supports secondary indexes on value and/or tag(s) ● Parallelizes query and pushes computation to server ○ Server-side aggregation conserves network bandwidth ○ Allows SKIP_SCAN filter optimization for minimizing data scanned ○ Leverages ROW_TIMESTAMP optimization for filtering HFiles ● Performance on par or better than OpenTSDB ● Ad hoc SQL query capability ○ Join against other Phoenix tables ● Longer term leverage Drillix (Phoenix + Drill) ○ Cross cluster queries ○ Joins to other non HBase data sources
  • 17. Schema Service Motivation ● Discover Metrics ○ What all metrics exist within a scope? ○ For a given <scope, metric> combination, what all tags exist? ○ Given a metric, what all scopes contain this metric? ○ What are all the tag values that exist for a given tag key? ● Support Wildcard Queries ○ Non-wildcard query ■ -1h:system.myDatacenter.myPod:Cpu.perc:avg:1m-avg ○ Wildcard query ■ -1h:system.myDatacenter.*:Cpu.perc:avg:1m-avg ■ -1h:system.myDatacenter.myPod:Cpu*:avg:1m-avg ■ -1h:system.myDatacenter.myPod:Cpu.perc{device=*app*}:avg:1m-avg
  • 18. Schema Service Implementation ● AsyncHBase Schema Service: ○ Uses HBase underneath ○ SchemaRecord: namespace, scope, metricname, tagk, tagv. No data points. ○ Each record indexed in 2 ways in 2 different tables. ○ MetricIndexed schema table: ■ RowKey: <metricname><scope><namespace><tagk><tagv> ○ ScopeIndexed schema table: ■ RowKey: <scope><metricname><namespace><tagk><tagv> ○ Decide what table to use based on the type of query. ○ Pros: ■ Efficient retrieval for schema records for most types of queries ○ Cons: ■ Storage duplication ● DiscoveryService: ○ Uses SchemaService internally ○ Ability to filter records by type ■ For e.g. Filter all unique scopes that match *myScope* ○ Expand Wildcard query and return a collection of non-wildcard queries
  • 19. Caching ● CachedTSDB Service: ○ Uses RedisCache service and the configured TSDBService implementation (OpenTSDB or PhoenixTSDB) ○ Query Level Caching (caches synthetic data) ○ Caches data spanning a window of more than last 24 hours. ○ Data is cached by fracturing it on day boundary. ■ For e.g.: Query spanning 5 days is stored using 5 keys on the cache. ○ Support for partial hits ○ Cache expiry time of an hour (can be increased by running a separate Cache update process) ● CachedDiscovery Service: ○ Uses RedisCache service and the configured DiscoveryService implementation ○ Cache queries already expanded ○ Cache expiry time of a day
  • 20. Developed By ● Anand Subramanian ● Bhinav Sura ● Tom Valine ● Jigna Bhatt ● Ruofan Zhang ● Dilip Devaraj ● Raj Sarkapally ● Kiran Gowdru More Information ​https://github.com/SalesforceEng/Argus