1. In-Flux Limiting for a Multi-Tenant Logging Service
Ambud Sharma & Suma Cherukuri
Cloud Platform Engineering @ Symantec
In-Flux Limiting for a Multi-Tenant Logging Service 1
2. Overview
• Who are we?
• Architecture
• Streaming Pipeline
• Influx Issue
• Influx Limiting Design & Solution
• Conclusion
• Q & A
3. Who are we?
• Symantec’s internal cloud team
• Host applications generating over $1B in revenue
• Team
– Logging as a Service (LaaS) – Elasticsearch/Kibana
– Metering as a Service (MaaS) – InfluxDB/Grafana
– Alerting as a Service (AaaS) – Hendrix
We are hiring!
Also check out Hendrix: https://github.com/Symantec/hendrix
4. Our Data
Logs
• Application and system log data from VMs and containers
• Used for troubleshooting
Metrics
• Application and system telemetry
• Used for Application Performance Monitoring
Log event:
{
  "message": "User logged in from 1.1.1.1",
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "value",
  "path": "/opt/logstash/sample.log",
  "tenant_id": "291167ebed3221a006eb",
  "apikey": "06be8a-28ef-4568-8cb8-612",
  "string_boolean": "true",
  "host_ip": "192.168.99.01"
}

Metric event:
{
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "host1.symantec.com",
  "tenant_id": "291167ebed3221a006ebf6",
  "apikey": "06be8a-28ef-4568-8cb8-618",
  "value": 0.65,
  "name": "cpu"
}
6. Streaming Pipeline
• Validate events to match schema to optimize indexing
• Authenticate events to route data to the correct index
• Have 1 index per day per tenant
Pipeline: Kafka → Validate → Auth → Index
7. Influx Issue
• You know your data store's performance limits (find EPS from benchmarks/capacity planning)
• Tenants send a lot of data, and the ingestion rate is never linear
• Ingestion spikes are bound to happen in a real-time streaming application
• Wouldn't it be great if you could normalize these spikes?
8. Influx Limiting
• Normalize the EPS curve using buffers
• Like a hydro dam, explicitly allocate the EPS resource to tenants
(Before and after EPS graphs)
9. Design - Options
Approach 1
• Route to a separate Kafka topic
• No back-pressure in the primary queue
• Secondary queue is drained at a slower pace
• Events may appear out of order

Approach 2
• Controlled back-pressure in the primary queue
• Selectively reduce ingestion rate for tenants
• Events will always appear in order
10. Customer Requirements
• Customers want threshold quotas defined for them
• Thresholds are defined as policies (window duration in seconds)
• Policies saved in a data store
Tenant A:
{
  "threshold": 100,
  "window": 90
}

Tenant B:
{
  "threshold": 700,
  "window": 10
}

Tenant C:
{
  "threshold": 900,
  "window": 1
}
11. Bolt Design
Pipeline: Kafka → Validate → Auth → Throttle → Index
1. Track the "event rate" for each tenant over the policy window
2. If the threshold is exceeded, throttle; otherwise allow the events
3. Reset the window when the time interval is complete (tumbling window)
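The three steps above can be sketched as follows. This is a minimal illustration, not the actual bolt code: the class and method names are ours, and the policy shape follows the `threshold`/`window` JSON examples on the previous slide.

```python
# Sketch of the throttle bolt's per-tenant tumbling-window logic.
from collections import defaultdict

class ThrottleState:
    def __init__(self, policies):
        # policies: tenant_id -> {"threshold": int, "window": int}
        self.policies = policies
        # counters: tenant_id -> events seen in the current window
        self.counters = defaultdict(int)

    def allow(self, tenant_id):
        """Step 1 + 2: count the event; pass it only while under the threshold."""
        self.counters[tenant_id] += 1
        return self.counters[tenant_id] <= self.policies[tenant_id]["threshold"]

    def reset(self, tenant_id):
        """Step 3: called when the tenant's tumbling window completes."""
        self.counters[tenant_id] = 0
```

In the real pipeline this state lives inside a Storm bolt, with `reset` driven by the clock mechanism described on the next slide.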
12. Scheduled-task design pattern
• Clock is maintained using the Storm tick tuple
• A tenant's counter is incremented when an event is received from it
• Counters are reset when the modulated value matches
(Diagram: the clock time is taken modulo each tenant's throttle duration; when Time % Throttle Duration = 0 for a slice, the counters for each tenant in that slice are reset, otherwise there is nothing to reset.)
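A minimal sketch of this scheduled-task pattern, assuming tenants have already been grouped by their policy window (function and variable names are ours, not the original code):

```python
# On every Storm tick tuple (one per second), reset the counters of every
# tenant whose policy window divides the current clock time. A single modulo
# check per window group replaces per-tenant timers or extra threads.
def on_tick(clock_time, windows, counters):
    # windows: window_seconds -> set of tenant ids sharing that policy window
    # counters: tenant_id -> event count in the current window
    for window, tenants in windows.items():
        if clock_time % window == 0:
            for tenant in tenants:
                counters[tenant] = 0
```

With the policy examples from slide 10, Tenant C (window 1) resets on every tick, Tenant B (window 10) every tenth tick, and Tenant A (window 90) every ninetieth.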
13. Results
• Reduced EPS to Elasticsearch
• We can normalize flow rate based on load
14. Conclusion
• Overview of real-time log and metric indexing
• Approaches to rate limiting in a real-time streaming application
• Design pattern to efficiently perform counting in Storm
That’s all folks!
Welcome to the talk "In-Flux Limiting for a Multi-Tenant Logging Service".
introduce yourself
We are from CPE at Symantec. Today we are here to talk about how we do event throttling and rate limiting for real-time streaming. We'll go over the architecture, the internal details of the streaming pipeline, the influx issue, and the different approaches to solving the problem, and then show you results. We also want to cover an efficient pattern for computing counts.
If there are any pressing questions, please feel free to stop us, but we would prefer to take questions at the end.
We are part of Symantec's internal cloud team, which hosts applications generating $1B in revenue.
Specifically, our team builds, owns, and runs three primary services: Logging as a Service, Metering as a Service, and Alerting as a Service.
Side note: we are hiring. If anyone is interested in joining the effort to build the biggest security data lake in the world, please stop by after the presentation.
Side note: we have open-sourced a project called Hendrix, which is our Alerting as a Service. Please feel free to check it out at github.com/Symantec/hendrix.
Before we jump into the actual design and architecture of our system, let's talk about the data that we get and the problem that we are solving.
Basically, we offer logging and APM as a service; APM stands for Application Performance Monitoring.
Our customers are Symantec product teams, and they send us application and system logs generated on VMs and containers. The teams use these for troubleshooting their applications. This is basically our own version of Splunk.
On the metrics side of the story, we get application and system telemetry, which the teams use for application performance monitoring.
Here are our sample events. We accept data in JSON format, and this is what it looks like: on the left is the log event and on the right is the metric event.
If you look at the log event, you'll notice two special fields: one is called the tenant ID and the other is the API key.
So what are a tenant and an API key?
Every customer, which is a P&S team at Symantec, has something called tenants; the concept of a tenant comes from our OpenStack cloud. A given P&S team can have more than one tenant. For example, their production App A can have one tenant and production App B can have another.
Basically, every tenant is a unit of isolation for us.
An API key is a token used to allow and revoke the flow of logs for a given tenant. Let's say you wanted to stop a given tenant from sending data: you can revoke the API key, which means we start discarding its events.
We call this process event authentication.
Now let's get into our architecture.
Customers run agents like Flume, Logstash, collectd, and StatsD, which send data to our Kafka cluster, exposed over load balancers.
We then run a set of Storm topologies which write data to the destination data stores: in the case of logs it's Elasticsearch, and in the case of metrics it's InfluxDB.
We use Kibana as a front end for Elasticsearch and Grafana as a front end for InfluxDB so that customers can graph and query the data.
Redis is where we store tenant IDs and API keys.
Here's what happens inside our streaming pipeline, that is, the Storm topology.
First, as we showed earlier, events arrive in Kafka; we use the Storm Kafka spout to read them, and then we validate these events against the format and schema specifications that we publish to our customers. For example, if an event is malformed JSON, we drop it.
Next, we check whether the tenant ID and API key are valid.
And lastly, we index the data into Elasticsearch or insert it into InfluxDB.
Each of these stages runs in a separate Storm bolt.
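The validate and auth stages might look like the following minimal sketch. The required field names follow the sample events shown earlier; the key lookup is stubbed with a dict, whereas the real pipeline keeps tenant IDs and API keys in Redis.

```python
# Sketch of the Validate and Auth bolts' core logic.
import json

# Fields every event must carry (from the sample events above).
REQUIRED_FIELDS = {"@timestamp", "tenant_id", "apikey"}

def validate(raw):
    """Drop malformed JSON or events missing required fields (returns None)."""
    try:
        event = json.loads(raw)
    except ValueError:
        return None
    if not REQUIRED_FIELDS <= event.keys():
        return None
    return event

def authenticate(event, api_keys):
    """api_keys: tenant_id -> active API key. Revoking a key (removing the
    entry) causes the tenant's events to be discarded from here on."""
    return api_keys.get(event["tenant_id"]) == event["apikey"]
```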
So now that you have a fair idea of our pipeline, let's understand the influx issue.
Influx means the arrival of a large quantity of something in a short time; in this case, events.
When you are writing data to a data store like HBase, Cassandra, or Elasticsearch, you provision capacity in the cluster: your cluster will have X nodes and can support, let's say, 10,000 inserts per second. You can gauge this capacity by running benchmarks. For us, these inserts per second can also be referred to as events per second, or EPS.
The EPS sent by our tenants is never linear; it fluctuates quite a lot, as you can see in the graph on the right, where each line represents the EPS from one tenant.
At times we get spikes, which is bound to happen in any real-time event processing system: when load increases on the applications, they generate a lot more logs and we get a spike.
When a spike happens, we don't have the provisioned capacity to index the additional influx of data that came in because of the spike.
So wouldn't it be great if you could normalize these spikes? What we mean is having an almost flat EPS curve for every tenant.
Let's understand how we can limit the influx and normalize these spikes.
Think of event streams as a river: if there's a cloudburst (no pun intended), the river will get temporarily flooded, so much so that the banks overflow. To fix this problem we can build a dam; the dam will buffer the additional influx of water, and we can control the rate at which it drains.
Since we are using Kafka, we already have a buffer; however, we have no control over it, because it is governed by the back-pressure our Elasticsearch cluster creates, since it can take only so many writes.
So the purpose of this work was to have controlled back-pressure into Kafka for our streaming pipelines, letting us quantitatively determine how many events we would like to let flow through the pipeline into Elasticsearch.
But we would like to do this on a tenant-by-tenant basis: as you can see in the diagrams, different tenants send different quantities of data, shown in different colors.
With a controlled system we can normalize and evenly divide the capacity among all of them, or knowingly make it uneven; that is, if one customer has more need, we allocate more capacity to them than to the others.
How can we do this?
Well, there are two approaches we thought of.
One is to write a substream: if a tenant exceeds their allocated throughput capacity, we divert the extra event traffic over the capacity to a separate queue, which we then drain at a slower pace. In technical terms, that means a separate Kafka topic and a separate Storm topology with a lower parallelism configuration.
The other way of solving this problem is to pause the processing of events in the existing streaming pipeline for the tenant that is sending more data.
Both approaches have pros and cons. If you go with the first one, you will see events out of order: some data appears right away because it flows through the main pipeline, and some data is delayed because it flows through the slower pipeline.
With the second approach you will always see events in order, but if the queue back-pressures too much you may lose data; that is true for either approach if you share the Kafka cluster, because for a given cluster your disk space is limited.
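The routing decision in the first approach can be sketched in a few lines. The topic names and function signature here are hypothetical, purely for illustration:

```python
# Sketch of Approach 1: events over a tenant's allocated rate are diverted to
# a slower secondary Kafka topic instead of back-pressuring the primary queue.
def route(tenant_id, counters, thresholds):
    """Return the Kafka topic an event should be published to. Diverted events
    are drained by a lower-parallelism topology, so they may arrive late and
    out of order relative to the primary flow."""
    counters[tenant_id] = counters.get(tenant_id, 0) + 1
    if counters[tenant_id] > thresholds[tenant_id]:
        return "logs-overflow"   # hypothetical secondary topic
    return "logs-primary"        # hypothetical primary topic
```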
What do we do inside the bolt, and how do we track the event rate for tenants?
To track event counts, we keep a hashtable of tenant ID to an integer counter, which we increment every time we see an event from that tenant.
But our customers wanted policies that define event rates differently for every tenant: one wants to be allowed to send 300 events in 2 minutes while another wants 5,000 in 10 minutes, which are not the same EPS. So we had to come up with a way to track this for every tenant, and we came up with an interesting way of solving this problem without using multiple threads.
What we built is logically a sort of merry-go-round where every tenant is allowed to go around once. Every tenant's influx-limiting policy has two parts: 1. the number of events they would like to send, and 2. the time duration for those events. We take the time duration and place tenants on this virtual merry-go-round based on their policy time duration.
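Placing tenants on the merry-go-round amounts to grouping them by their policy window, so that a single modulo check against the clock finds every tenant due for a reset. A minimal sketch, with names of our own choosing and policies in the `threshold`/`window` shape shown on slide 10:

```python
# Group tenants by policy window: each window value is one "slice" of the
# merry-go-round, visited whenever clock_time % window == 0.
from collections import defaultdict

def build_slices(policies):
    """policies: tenant_id -> {"threshold": int, "window": int}.
    Returns window_seconds -> list of tenants sharing that window."""
    slices = defaultdict(list)
    for tenant, policy in policies.items():
        slices[policy["window"]].append(tenant)
    return slices
```

Resetting is then one pass over the groups whose window divides the current tick time, rather than a timer per tenant.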