1. In-Flux Limiting for a Multi-Tenant Logging Service
Ambud Sharma & Suma Cherukuri
Cloud Platform Engineering @ Symantec
In-Flux Limiting for a Multi-Tenant Logging Service 1
2. Overview
• Who are we?
• Architecture
• Streaming Pipeline
• Influx Issue
• Influx Limiting Design & Solution
• Conclusion
• Q & A
3. Who are we?
• Symantec’s internal cloud team
• Host applications generating over $1B in revenue
• Team
– Logging as a Service (LaaS) – Elasticsearch/Kibana
– Metering as a Service (MaaS) – InfluxDB/Grafana
– Alerting as a Service (AaaS) – Hendrix
We are hiring!
Also check out Hendrix: https://github.com/Symantec/hendrix
4. Our Data
Logs
• Application and system log data from VMs and containers
• Used for troubleshooting
Metrics
• Application and system telemetry
• Used for Application Performance Monitoring
Log event:
{
  "message": "User logged in from 1.1.1.1",
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "value",
  "path": "/opt/logstash/sample.log",
  "tenant_id": "291167ebed3221a006eb",
  "apikey": "06be8a-28ef-4568-8cb8-612",
  "string_boolean": "true",
  "host_ip": "192.168.99.01"
}

Metric event:
{
  "@version": "1",
  "@timestamp": "2014-07-16T06:49:39.919Z",
  "host": "host1.symantec.com",
  "tenant_id": "291167ebed3221a006ebf6",
  "apikey": "06be8a-28ef-4568-8cb8-618",
  "value": 0.65,
  "name": "cpu"
}
6. Streaming Pipeline
• Validate events to match schema to optimize indexing
• Authenticate events to route data to the correct index
• Have 1 index per day per tenant
Pipeline: Kafka → Validate → Auth → Index
7. Influx Issue
• You know your data store's performance limits (find EPS from benchmarks/capacity planning)
• Tenants send a lot of data, and the ingestion rate is never linear
• Ingestion spikes are bound to happen in a real-time streaming application
• Wouldn't it be great if you could normalize these spikes?
8. Influx Limiting
• Normalize the EPS curve using buffers
• Like a hydro dam, explicitly allocate the EPS resource to tenants
(Before and after EPS graphs)
9. Design - Options
Approach 1
• Route to a separate Kafka topic
• No back-pressure in the primary queue
• Secondary queue is drained at a slower pace
• Events may appear out of order

Approach 2
• Controlled back-pressure in the primary queue
• Selectively reduce ingestion rate for tenants
• Events will always appear in order
10. Customer Requirements
• Customers want threshold quotas defined for them
• Thresholds are defined as policies (window duration in seconds)
• Policies saved in a data store
Tenant A:
{
  "threshold": 100,
  "window": 90
}

Tenant B:
{
  "threshold": 700,
  "window": 10
}

Tenant C:
{
  "threshold": 900,
  "window": 1
}
11. Bolt Design
Pipeline: Kafka → Validate → Auth → Throttle → Index
1. Track the "event rate" for each tenant over the policy window
2. If the threshold is exceeded, throttle; otherwise allow the events
3. Reset the window when the time interval is complete (tumbling window)
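The three steps above can be sketched as follows. This is a minimal illustration, not the actual bolt code: the class and method names are ours, and the policy shape follows the `threshold`/`window` JSON examples on the previous slide.

```python
# Sketch of the throttle bolt's per-tenant tumbling-window logic.
from collections import defaultdict

class ThrottleState:
    def __init__(self, policies):
        # policies: tenant_id -> {"threshold": int, "window": int}
        self.policies = policies
        # counters: tenant_id -> events seen in the current window
        self.counters = defaultdict(int)

    def allow(self, tenant_id):
        """Step 1 + 2: count the event; pass it only while under the threshold."""
        self.counters[tenant_id] += 1
        return self.counters[tenant_id] <= self.policies[tenant_id]["threshold"]

    def reset(self, tenant_id):
        """Step 3: called when the tenant's tumbling window completes."""
        self.counters[tenant_id] = 0
```

In the real pipeline this state lives inside a Storm bolt, with `reset` driven by the clock mechanism described on the next slide.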
12. Scheduled-task design pattern
• Clock is maintained using the Storm tick tuple
• A tenant's counter is incremented when an event is received from it
• Counters are reset when the modulated value matches
(Diagram: the clock time is taken modulo each tenant's throttle duration; when Time % Throttle Duration = 0 for a slice, the counters for each tenant in that slice are reset, otherwise there is nothing to reset.)
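A minimal sketch of this scheduled-task pattern, assuming tenants have already been grouped by their policy window (function and variable names are ours, not the original code):

```python
# On every Storm tick tuple (one per second), reset the counters of every
# tenant whose policy window divides the current clock time. A single modulo
# check per window group replaces per-tenant timers or extra threads.
def on_tick(clock_time, windows, counters):
    # windows: window_seconds -> set of tenant ids sharing that policy window
    # counters: tenant_id -> event count in the current window
    for window, tenants in windows.items():
        if clock_time % window == 0:
            for tenant in tenants:
                counters[tenant] = 0
```

With the policy examples from slide 10, Tenant C (window 1) resets on every tick, Tenant B (window 10) every tenth tick, and Tenant A (window 90) every ninetieth.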
13. Results
• Reduced EPS to Elasticsearch
• We can normalize flow rate based on load
14. Conclusion
• Overview of real-time log and metric indexing
• Approaches to rate limiting in a real-time streaming application
• Design pattern to efficiently perform counting in Storm
That’s all folks!
Welcome to the talk "In-Flux Limiting for a Multi-Tenant Logging Service".
introduce yourself
We are from CPE at Symantec. Today we are here to talk about how we do event throttling and rate limiting for real-time streaming. We'll go over the architecture, the internal details of the streaming pipeline, the influx issue, and the different approaches to solving the problem, and then show you results. We also want to cover an efficient pattern for computing counts.
If there are any pressing questions, please feel free to stop us, but we would prefer to take questions at the end.
We are part of Symantec's internal cloud team, which hosts applications generating $1B in revenue.
Specifically, our team builds, owns, and runs three primary services: Logging as a Service, Metering as a Service, and Alerting as a Service.
Side note: we are hiring. If anyone is interested in joining the effort to build the biggest security data lake in the world, please stop by after the presentation.
Side note: we have open-sourced a project called Hendrix, which is our Alerting as a Service. Please feel free to check it out at github.com/Symantec/hendrix.
Before we jump into the actual design and architecture of our system, let's talk about the data that we get and the problem that we are solving.
Basically, we offer logging and APM as a service; APM stands for Application Performance Monitoring.
Our customers are Symantec product teams, and they send us application and system logs generated on VMs and containers. The teams use these for troubleshooting their applications. This is basically our own version of Splunk.
On the metrics side of the story, we get application and system telemetry, which the teams use for application performance monitoring.
Here are our sample events. We accept data in JSON format, and this is what it looks like: on the left is the log event and on the right is the metric event.
If you look at the log event, you'll notice two special fields: one is called the tenant ID and the other is the API key.
So what are a tenant and an API key?
Every customer, which is a P&S team at Symantec, has something called tenants; the concept of a tenant comes from our OpenStack cloud. A given P&S team can have more than one tenant. For example, their production App A can have one tenant and production App B can have another.
Basically, every tenant is a unit of isolation for us.
An API key is a token used to allow and revoke the flow of logs for a given tenant. Let's say you wanted to stop a given tenant from sending data: you can revoke the API key, which means we start discarding its events.
We call this process event authentication.
Now let's get into our architecture.
Customers run agents like Flume, Logstash, collectd, and StatsD, which send data to our Kafka cluster, exposed over load balancers.
We then run a set of Storm topologies which write data to the destination data stores: in the case of logs it's Elasticsearch, and in the case of metrics it's InfluxDB.
We use Kibana as a front end for Elasticsearch and Grafana as a front end for InfluxDB so that customers can graph and query the data.
Redis is where we store tenant IDs and API keys.
Here's what happens inside our streaming pipeline, that is, the Storm topology.
First, as we showed earlier, events arrive in Kafka; we use the Storm Kafka spout to read them, and then we validate these events against the format and schema specifications that we publish to our customers. For example, if an event is malformed JSON, we drop it.
Next, we check whether the tenant ID and API key are valid.
And lastly, we index the data into Elasticsearch or insert it into InfluxDB.
Each of these stages runs in a separate Storm bolt.
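The validate and auth stages might look like the following minimal sketch. The required field names follow the sample events shown earlier; the key lookup is stubbed with a dict, whereas the real pipeline keeps tenant IDs and API keys in Redis.

```python
# Sketch of the Validate and Auth bolts' core logic.
import json

# Fields every event must carry (from the sample events above).
REQUIRED_FIELDS = {"@timestamp", "tenant_id", "apikey"}

def validate(raw):
    """Drop malformed JSON or events missing required fields (returns None)."""
    try:
        event = json.loads(raw)
    except ValueError:
        return None
    if not REQUIRED_FIELDS <= event.keys():
        return None
    return event

def authenticate(event, api_keys):
    """api_keys: tenant_id -> active API key. Revoking a key (removing the
    entry) causes the tenant's events to be discarded from here on."""
    return api_keys.get(event["tenant_id"]) == event["apikey"]
```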
So now that you have a fair idea of our pipeline, let's understand the influx issue.
Influx means the arrival of a large quantity of something in a short time; in this case, events.
When you are writing data to a data store like HBase, Cassandra, or Elasticsearch, you provision capacity in the cluster: your cluster will have X nodes and can support, let's say, 10,000 inserts per second. You can gauge this capacity by running benchmarks. For us, these inserts per second can also be referred to as events per second, or EPS.
The EPS sent by our tenants is never linear; it fluctuates quite a lot, as you can see in the graph on the right, where each line represents the EPS from one tenant.
At times we get spikes, which is bound to happen in any real-time event processing system: when load increases on the applications, they generate a lot more logs and we get a spike.
When a spike happens, we don't have the provisioned capacity to index the additional influx of data that came in because of the spike.
So wouldn't it be great if you could normalize these spikes? What we mean is having an almost flat EPS curve for every tenant.
Let's understand how we can limit the influx and normalize these spikes.
Think of event streams as a river: if there's a cloudburst (no pun intended), the river will get temporarily flooded, so much so that the banks overflow. To fix this problem we can build a dam; the dam will buffer the additional influx of water, and we can control the rate at which it drains.
Since we are using Kafka, we already have a buffer; however, we have no control over it, because it is governed by the back-pressure our Elasticsearch cluster creates, since it can take only so many writes.
So the purpose of this work was to have controlled back-pressure into Kafka for our streaming pipelines, letting us quantitatively determine how many events we would like to let flow through the pipeline into Elasticsearch.
But we would like to do this on a tenant-by-tenant basis: as you can see in the diagrams, different tenants send different quantities of data, shown in different colors.
With a controlled system we can normalize and evenly divide the capacity among all of them, or knowingly make it uneven; that is, if one customer has more need, we allocate more capacity to them than to the others.
How can we do this?
Well, there are two approaches we thought of.
One is to write a substream: if a tenant exceeds their allocated throughput capacity, we divert the extra event traffic over the capacity to a separate queue, which we then drain at a slower pace. In technical terms, that means a separate Kafka topic and a separate Storm topology with a lower parallelism configuration.
The other way of solving this problem is to pause the processing of events in the existing streaming pipeline for the tenant that is sending more data.
Both approaches have pros and cons. If you go with the first one, you will see events out of order: some data appears right away because it flows through the main pipeline, and some data is delayed because it flows through the slower pipeline.
With the second approach you will always see events in order, but if the queue back-pressures too much you may lose data; that is true for either approach if you share the Kafka cluster, because for a given cluster your disk space is limited.
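The routing decision in the first approach can be sketched in a few lines. The topic names and function signature here are hypothetical, purely for illustration:

```python
# Sketch of Approach 1: events over a tenant's allocated rate are diverted to
# a slower secondary Kafka topic instead of back-pressuring the primary queue.
def route(tenant_id, counters, thresholds):
    """Return the Kafka topic an event should be published to. Diverted events
    are drained by a lower-parallelism topology, so they may arrive late and
    out of order relative to the primary flow."""
    counters[tenant_id] = counters.get(tenant_id, 0) + 1
    if counters[tenant_id] > thresholds[tenant_id]:
        return "logs-overflow"   # hypothetical secondary topic
    return "logs-primary"        # hypothetical primary topic
```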
What do we do inside the bolt, and how do we track the event rate for tenants?
To track event counts, we keep a hashtable of tenant ID to an integer counter, which we increment every time we see an event from that tenant.
But our customers wanted policies that define event rates differently for every tenant: one wants to be allowed to send 300 events in 2 minutes while another wants 5,000 in 10 minutes, which are not the same EPS. So we had to come up with a way to track this for every tenant, and we came up with an interesting way of solving this problem without using multiple threads.
What we built is logically a sort of merry-go-round where every tenant is allowed to go around once. Every tenant's influx-limiting policy has two parts: 1. the number of events they would like to send, and 2. the time duration for those events. We take the time duration and place tenants on this virtual merry-go-round based on their policy time duration.
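Placing tenants on the merry-go-round amounts to grouping them by their policy window, so that a single modulo check against the clock finds every tenant due for a reset. A minimal sketch, with names of our own choosing and policies in the `threshold`/`window` shape shown on slide 10:

```python
# Group tenants by policy window: each window value is one "slice" of the
# merry-go-round, visited whenever clock_time % window == 0.
from collections import defaultdict

def build_slices(policies):
    """policies: tenant_id -> {"threshold": int, "window": int}.
    Returns window_seconds -> list of tenants sharing that window."""
    slices = defaultdict(list)
    for tenant, policy in policies.items():
        slices[policy["window"]].append(tenant)
    return slices
```

Resetting is then one pass over the groups whose window divides the current tick time, rather than a timer per tenant.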