If you didn't fail with microservices at least once, you didn't really try anything new! Even though microservices are an established architectural style in the industry, they still come with their own challenges.
This session from nginx.conf 2016 focuses on a topic that is usually overlooked in the early stages of building a microservices architecture: traffic management. It comes into the picture after we fail an SLA, whether the cause is a misbehaving client, a legitimate increase of traffic, or a DDoS attack. We then start asking questions like how to ensure a fair usage policy for clients across microservices, how to protect clients from an abusive peer that is generating a spike in traffic, and how to protect microservices themselves from abusive clients.
NGINX comes with options for rate limiting that usually work great for a single node. Extending NGINX's capabilities to distributed environments increases the complexity of the solution. Can rate limiting be applied transparently without visible impact on latency? Is it easy to scale? Is it reliable? In this session, Adobe's Dragos Dascalita Haut introduces an open source solution contributed by Adobe I/O and used with success in real-life scenarios. The solution is based on an asynchronous communication model that supports high-throughput scenarios with minimum impact on latency. If you've had similar problems in the past or if you're concerned about how clients interact with your microservices then this session is for you.
Slide 9: Some reasons for failures
1. A client that misbehaves
2. A spike in demand
3. A DDoS attack
4. A failure in one component generating a cascading effect
Slide 11

OPENRESTY
• NGINX Lua module
• NGINX Redis
• Headers More
• Set Misc
• LuaJIT
• …

API Gateway Modules
• Request Validation
• Throttling & Rate Limiting
• HTTP Logger

NGINX
• Upstream
• HTTP Proxy
• PCRE
• SSL
• …

API Gateway: "…take one of the most popular web servers and add API gateway capabilities to it…"
Slide 13: Limit the rate of requests

limit_req_zone $binary_remote_addr zone=gold:10m rate=300r/m;
limit_req_zone $binary_remote_addr zone=silver:10m rate=30r/m;

server {
    ...
    location /login.html {
        limit_req zone=silver burst=5;
        ...
    }
}

The silver zone limits each client IP to 30 requests per minute.
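The leaky-bucket behavior behind limit_req can be sketched in a few lines. The following is a simplified Python model of the semantics (one request at the rate plus up to `burst` queued above it), not nginx's actual implementation:

```python
class LeakyBucket:
    """Simplified model of nginx's limit_req semantics: requests above
    the configured rate accumulate as 'excess'; once the excess would
    exceed the burst, requests are rejected."""

    def __init__(self, rate_per_minute, burst):
        self.rate = rate_per_minute / 60.0  # slots drained per second
        self.burst = burst
        self.excess = 0.0
        self.last = None

    def allow(self, now):
        if self.last is not None:
            # Drain the excess at the configured rate since the last request.
            self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess + 1 > self.burst + 1:
            return False  # over the burst: nginx would reject here
        self.excess += 1
        return True

# rate=30r/m with burst=5, as in the silver zone above
bucket = LeakyBucket(rate_per_minute=30, burst=5)
burst_results = [bucket.allow(now=0.0) for _ in range(8)]
```

Eight simultaneous requests leave the first six allowed (one at the rate plus the five-request burst) and the rest rejected until the queue drains at 30 r/m.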
Slide 14: Limit the number of connections

limit_conn_zone $binary_remote_addr zone=conn_zone:10m;

server {
    ...
    location /store {
        limit_conn conn_zone 10;
        ...
    }
}

Limit each client IP address to a maximum of 10 concurrent connections.
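limit_conn counts live connections rather than request rate. A minimal Python sketch of that semantics (not the module's code) for the 10-connection cap above:

```python
from collections import defaultdict

class ConnLimiter:
    """Sketch of limit_conn semantics: cap the number of concurrent
    connections per key (here the client IP address)."""

    def __init__(self, max_conns):
        self.max_conns = max_conns
        self.active = defaultdict(int)

    def open(self, ip):
        if self.active[ip] >= self.max_conns:
            return False  # nginx would reject the connection here
        self.active[ip] += 1
        return True

    def close(self, ip):
        if self.active[ip] > 0:
            self.active[ip] -= 1

limiter = ConnLimiter(max_conns=10)
opened = [limiter.open("203.0.113.7") for _ in range(12)]
```

Unlike limit_req, slots free up as soon as a connection closes, so a close() immediately makes room for the next open().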
Slides 17-18: The problem

[Diagram: two NGINX nodes load-balancing across instances of Service A and Service B]

How do we limit Service A to 10 r/m across multiple NGINX nodes? And what happens when a new NGINX node comes up... or goes away?
Slide 19: ngx_http_limit_req_module

Pros:
• Easy to configure
• Easy to manage
• Works well for a single node

Cons:
• Can't define rules at a cluster level
• Can't apply dynamic rules per location (i.e. allow one app to send 1000 requests and another only 10)
Slide 21: Requirements
1. Work in a distributed environment.
2. Async. Don't add extra latency to the request when checking quotas.
3. High-performance. Sustain hundreds of thousands of requests/second.
4. Adaptive. NGINX instances may come up or may go away at any time.
5. Fail-safe. In the event the solution doesn't function, all traffic should be permitted until it recovers.
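The fail-safe requirement (5) is worth pinning down: the gateway should only ever consult local state, and any failure should mean "allow". A hypothetical sketch of that decision path (names are illustrative, not the actual Adobe I/O code):

```python
def decide(action_table, key):
    """Fail-open lookup: actions (e.g. BLOCK/DELAY) are pushed into a
    local table asynchronously; a missing entry or any lookup failure
    means the request is allowed, so an outage of the tracking
    pipeline never blocks traffic."""
    try:
        return action_table.get(key, "ALLOW")
    except Exception:
        return "ALLOW"  # requirement 5: fail open until recovery

class BrokenTable:
    """Stand-in for a tracking pipeline that is down."""
    def get(self, key, default=None):
        raise RuntimeError("tracking service unreachable")

decisions = (
    decide({"app-1": "BLOCK"}, "app-1"),  # explicit action is enforced
    decide({}, "app-2"),                  # unknown key: allowed
    decide(BrokenTable(), "app-1"),       # failure: allowed
)
```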
Slide 22: Assumptions
1. The intent is to allow rather than to block
• the focus is to ensure a fair usage policy
2. Favor performance over precision
• rather allow a small percentage over the limit than add latency to the request
Slide 24: Option #1
Maintain consistent counters across the cluster

[Diagram: NGINX nodes inform each other about their counters]

Challenges:
• Chatty: more nodes, more messages
• Maintaining consistent distributed counters is a complex problem
• Increases NGINX's complexity
Slide 25: Option #2

[Diagram: each NGINX node publishes its usage data to a brokered message queue; a Tracking Microservice consumes the usage data and tells each node what to BLOCK or SLOW DOWN / DELAY.]
Slide 26: Option #2 (continued)

Challenges:
• Maintain a brokered message queue. Is it needed?
• Maintain a new microservice to track the counters

Improvements:
• Less chatty
• Moved the distributed counters from NGINX into a microservice
Slide 27: Option #3

[Diagram: each NGINX node embeds a message queue; the Tracking Microservice pulls usage data from each node and pushes back what to BLOCK or SLOW DOWN / DELAY.]

Challenges:
• Embed a MQ within NGINX
• Maintain a new microservice to track the counters
• Auto-discovery of NGINX nodes

Improvements:
• Non-brokered message queue
• Moved the distributed counters from NGINX into a microservice
Slide 28: Selecting a Message Queue

Apache Kafka (Java, Scala)
• Pros: rated as highly performant, sustaining 2M messages; durable, messages being written to disk first
• Cons: ZooKeeper dependent; brokered; maintenance complexity

ActiveMQ (Java)
• Pros: popular; supports STOMP, AMQP, MQTT, XMPP; Spring integration
• Cons: brokered; maintenance complexity

RabbitMQ (Erlang)
• Pros: supports STOMP, AMQP, MQTT, XMPP; community support
• Cons: brokered; maintenance complexity; slower than ZeroMQ

nanomsg
• Pros: performant socket library; promises a cleaner API than ZeroMQ
• Cons: was in beta when we analyzed it; no XPUB/XSUB proxy

ZeroMQ
• Pros: around since 2007; brokerless, designed for high-throughput/low-latency scenarios; embeddable in NGINX with C/C++/Lua bindings; pure Java implementation through JeroMQ
• Cons: no auto-discoverability, so a proxy (XPUB/XSUB) is needed
Slide 29: Moving ahead with Option #3 and ZMQ

[Diagram: the Option #3 architecture with ZeroMQ as the embedded message queue: the Tracking Microservice pulls usage data from each NGINX node and pushes back what to BLOCK or SLOW DOWN / DELAY.]
Slide 30: Integrating ZeroMQ with NGINX

[Diagram: the NGINX master process spawns a ZeroMQ adaptor process. The NGINX workers publish usage data to the adaptor's XSUB socket (default: ipc:///tmp/ngx_queue_listen); the Tracking Microservice pulls usage data from the adaptor's XPUB socket (default: tcp://0.0.0.0:6001) and responds with what to BLOCK or DELAY.]
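Functionally, the adaptor is a fan-in/fan-out forwarder: many workers publish on the XSUB side, and the tracking service subscribes by topic prefix on the XPUB side. The pattern can be illustrated with a small in-memory stand-in (the real adaptor uses ZeroMQ sockets over ipc:// and tcp://, not Python lists):

```python
class PubSubForwarder:
    """In-memory stand-in for the ZeroMQ XSUB/XPUB adaptor: NGINX
    workers publish usage messages on the front (XSUB) side, and the
    tracking service subscribes by topic prefix on the back (XPUB)
    side. Illustrative only."""

    def __init__(self):
        self.subscribers = []   # (topic_prefix, inbox) pairs

    def subscribe(self, topic_prefix):
        inbox = []
        self.subscribers.append((topic_prefix, inbox))
        return inbox

    def publish(self, topic, payload):
        # ZeroMQ pub/sub filters on topic prefix, mirrored here.
        for prefix, inbox in self.subscribers:
            if topic.startswith(prefix):
                inbox.append((topic, payload))

adaptor = PubSubForwarder()
tracker_inbox = adaptor.subscribe("usage.")
adaptor.publish("usage.echo-service", {"api_key": "k1", "count": 1})
adaptor.publish("health.worker-3", "ok")   # not delivered: wrong prefix
```

The topic names and payload shape here are hypothetical; only the XSUB-to-XPUB forwarding pattern is taken from the slides.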
Slide 31: Integrating ZeroMQ with NGINX (detail)

[Diagram: a single NGINX worker publishing to the ZeroMQ adaptor process via the XSUB socket (default: ipc:///tmp/ngx_queue_listen).]
Slide 33: NGINX and the Tracking Service

Tracking Service:
• Persists policies
• Sends ACTIONS to the gateway based on the tracked information
• Concerned with the business rules managing throttling and rate limiting
• Allows only private access to its API

NGINX:
• Enforces policies
• Executes ACTIONS such as TRACK, BLOCK, DELAY
• Unaware of the business rules
• Serves public traffic
Slide 34: Request flow

[Diagram: six numbered steps: a CLIENT request flows through the API Gateway / NGINX to the microservice, while usage data flows through the ZeroMQ adaptor to the Gateway Tracking Service (GTS) and actions flow back. Asynchronous and non-blocking.]
Slide 37: Local setup

[Diagram: a TEST RUNNER sends traffic to the API Gateway / NGINX, which proxies to an ECHO microservice; usage data flows through the ZeroMQ adaptor to the Gateway Tracking Service, with reporting in Graphite and a Grafana UI.]
Slide 38: Adding a throttling policy

POST /api/policies/throttling (Gateway Tracking Service API)

[{
  "id": 10,
  "softLimit": 4,
  "maxDelayPeriod": 3,
  "hardLimit": 12,
  "timeUnit": "SECONDS",
  "span": 10,
  "lastModified": 1438019079000,
  "domain": {
    "$service_id": "echo-service"
  },
  "groupBy": ["$api_key"]
}]

softLimit is the low watermark specifying when to start DELAYing requests; hardLimit is the high watermark specifying when to start BLOCKing requests.
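The two watermarks imply a three-way decision per time window. One plausible mapping, sketched in Python (the linear delay scaling is an assumption, not the Gateway Tracking Service's actual rule):

```python
def action(count, soft_limit=4, hard_limit=12, max_delay_period=3):
    """Map a request count within one span to an action, mirroring the
    policy watermarks: below softLimit just TRACK, between the
    watermarks DELAY (up to maxDelayPeriod seconds), above hardLimit
    BLOCK. Defaults match the policy shown above."""
    if count > hard_limit:
        return ("BLOCK", None)
    if count > soft_limit:
        # Scale the delay linearly toward maxDelayPeriod as the count
        # approaches the hard limit (one plausible policy).
        frac = (count - soft_limit) / (hard_limit - soft_limit)
        return ("DELAY", round(max_delay_period * frac, 2))
    return ("TRACK", None)
```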
Slide 39: Adding a throttling policy

POST /api/policies/throttling (Gateway Tracking Service API)

[{
  "id": 10,
  "softLimit": 2,
  "maxDelayPeriod": 2,
  "hardLimit": 5,
  "timeUnit": "SECONDS",
  "span": 10,
  "lastModified": 1438019079000,
  "domain": {
    "$service_id": "echo-service"
  },
  "groupBy": ["$api_key"]
}]

span (together with timeUnit) defines at what time intervals to enforce this policy.
Slide 40: Adding a throttling policy

(Same policy payload as on slide 39.) The domain enforces the policy for all requests having service_id = "echo-service"; groupBy enforces the limits separately for each application ($api_key).
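Putting span, timeUnit, and groupBy together: counts are kept per groupBy key and reset every span. A simplified fixed-window sketch under those assumptions (the real service aggregates usage reported asynchronously by the gateways):

```python
from collections import defaultdict

class WindowedCounter:
    """Fixed-window counter sketch for span=10, timeUnit=SECONDS,
    groupBy=["$api_key"]: each api_key gets its own count, reset
    every 10-second window."""

    def __init__(self, span_seconds=10):
        self.span = span_seconds
        self.counts = defaultdict(int)   # (api_key, window) -> count

    def record(self, api_key, now):
        window = int(now // self.span)   # which span this request falls in
        self.counts[(api_key, window)] += 1
        return self.counts[(api_key, window)]

c = WindowedCounter(span_seconds=10)
first = [c.record("app-a", now=t) for t in (1, 2, 3)]
other = c.record("app-b", now=3)         # separate count per api_key
next_window = c.record("app-a", now=12)  # new 10 s window resets the count
```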
Slide 41: Gateway Tracking Service API

Deleting a throttling policy:
DELETE /api/policies/throttling/<policy_id>

Listing all policies:
GET /api/policies/throttling
Slide 42: Defining an application plan

POST /api/policies/throttling (Gateway Tracking Service API)

[{
  "id": 10,
  "softLimit": 2,
  "maxDelayPeriod": 2,
  "hardLimit": 5,
  "timeUnit": "SECONDS",
  "span": 10,
  "lastModified": 1438019079000,
  "domain": {
    "$service_id": "echo-service",
    "$app_plan": "silver"
  }
}]

In addition to the service, the domain adds an identifier for the application plan; the $api_key variable could be used as well.
Slide 43: Throttle by HTTP verb

POST /api/policies/throttling (Gateway Tracking Service API)

[{
  "id": 15,
  "hardLimit": 5,
  "timeUnit": "SECONDS",
  "span": 10,
  "lastModified": 1438019079000,
  "domain": {
    "$service_id": "echo-service",
    "$request_method": "POST"
  }
}]

request_method is a built-in variable in NGINX. This limits all POST requests for "echo-service" to 5 requests per 10 seconds.
Slide 44: DELAY vs BLOCK

[Chart: with a 3 s delay (softLimit=4, maxDelayPeriod=3, hardLimit=12), traffic over the soft limit is smoothed out by delaying requests; without delay, only hardLimit=12 applies and requests over it are blocked outright.]
Slide 45: Extending the Tracking Service to dynamically rewrite requests

Use the same mechanism, but instead of blocking or delaying requests, rewrite them. This is useful for beta testing without affecting the real traffic.
Slide 46: Enhancing the Tracking Service: adjust limits dynamically

• Measure by QoS (i.e. response time)
• Measure by the capacity of the service: if there aren't many consumers, allow the current ones to use the remaining capacity