4. 4
• Consumption alternatives: Push (deliver) vs Pull (get)
– Acknowledgements (=> impact of prefetch)
• Finding bottlenecks in RabbitMQ
– Publisher side: Flow Control
• Related topic: Alarms
– Consumer side: Consumer Utilization
• Impact of queue length
Internals - Overview
6. 6
• There are two ways for applications to consume messages from a queue:
– Have messages delivered to them ("push API"): i.e. ‘basic.deliver’
• It does not require a roundtrip to the broker => can achieve higher delivery rates
– Fetch messages as needed ("pull API"): e.g. ‘basic.get’
• In either case, a consumer can acknowledge messages
– Use of acknowledgments allows for stronger delivery guarantees (using ‘basic.ack’)
• It’s possible to ACK more than one message at once => allows stronger guarantees with higher performance
• If client fails to acknowledge, RabbitMQ re-enqueues the messages
– This can be turned off by setting the ‘auto-ack/no-ack’ option
• Will result in higher performance, but weaker guarantees
Message Consumption Alternatives in AMQP 0.9.1
https://www.rabbitmq.com/amqp-0-9-1-reference.html
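The semantics above (pull vs push delivery, batched acks via the 'multiple' flag, and re-enqueueing of unacknowledged messages) can be sketched with a small in-memory model. This is an illustration only, not the AMQP client API; names like `ToyQueue` and `deliver` are invented for the sketch.

```python
# Minimal in-memory sketch (not the real AMQP API) of the consumption modes
# and of re-enqueueing when acknowledgements are missing.
from collections import deque

class ToyQueue:
    def __init__(self):
        self.ready = deque()   # messages waiting to be consumed
        self.unacked = {}      # delivery_tag -> message
        self.next_tag = 1

    def publish(self, msg):
        self.ready.append(msg)

    def get(self):
        """Pull mode ('basic.get'): one broker round trip per message."""
        if not self.ready:
            return None, None
        tag, self.next_tag = self.next_tag, self.next_tag + 1
        msg = self.ready.popleft()
        self.unacked[tag] = msg
        return tag, msg

    def deliver(self, limit):
        """Push mode ('basic.deliver'): the broker sends up to 'limit'
        unacked messages without waiting for individual requests."""
        out = []
        while self.ready and len(self.unacked) < limit:
            out.append(self.get())
        return out

    def ack(self, tag, multiple=False):
        """'basic.ack'; multiple=True acknowledges all tags up to 'tag'."""
        tags = [t for t in self.unacked if t <= tag] if multiple else [tag]
        for t in tags:
            del self.unacked[t]

    def requeue_unacked(self):
        """What the broker does if the client dies without acking."""
        for t in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(t))

q = ToyQueue()
for m in ("a", "b", "c"):
    q.publish(m)
batch = q.deliver(limit=2)         # push mode sends two messages at once
q.ack(batch[-1][0], multiple=True)  # one ack covers both delivery tags
```

Acknowledging several messages with one `basic.ack` is what makes "stronger guarantees with higher performance" possible: the guarantee per message is unchanged, but the ack traffic is amortized.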
7. 7
• What’s Prefetch?
– while consuming messages, the client can request that messages be sent in advance so that when
the client finishes processing a message, the following message is already held locally, rather than
needing to be sent down the channel
– So, it is really a "windowed ack" implementation for push mode
• Prefetching gives performance improvement
• Prefetch is only applicable when consumer acknowledgements are enabled
– Prefetch limits are ignored if the no-ack option is set
Consumption Speed-up through Prefetch
https://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
https://www.rabbitmq.com/amqp-0-9-1-reference.html
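A rough way to see why prefetch helps: with a window of N unacknowledged messages in flight, the consumer pays a broker round trip only about once per N messages instead of once per message. The model below is a deliberate simplification (it ignores pipelining details and ack timing) meant only to show the scaling.

```python
# Simplified model of the "windowed ack" effect: how many broker round
# trips are needed to receive n_messages with a given prefetch window.
def round_trips(n_messages, prefetch):
    """With a window of 'prefetch' messages, a new batch is requested
    (acks granting more window) only every 'prefetch' messages.
    prefetch=1 degenerates to one round trip per message, like basic.get."""
    return -(-n_messages // prefetch)   # ceiling division

assert round_trips(1000, 1) == 1000    # pull-style: a round trip each time
assert round_trips(1000, 100) == 10    # windowed: far fewer network stalls
```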
8. 8
• Limits are set through prefetch-size or prefetch-count
– Of the basic.qos method
– Default value = 0
• meaning "no specific limit"
• The server may send less data in advance than allowed by the client's specified prefetch window, but it MUST NOT send more.
• The server MUST ignore this setting when the client is not processing any messages
• AMQP defines this per channel; optionally, RabbitMQ can apply it per individual consumer
Prefetch Parameters
https://www.rabbitmq.com/amqp-0-9-1-reference.html
9. 9
• Consumption alternatives: Push (deliver) vs Pull (get)
– Acknowledgements (=> impact of prefetch)
• Finding bottlenecks in RabbitMQ
– Publisher-side: Flow Control
• Related topic: Alarms
– Consumer side: Consumer Utilization
• Impact of queue length
Internals - Overview
10. 10
• RabbitMQ provides useful information to help users spot bottlenecks
• Two groups of bottlenecks: on the publisher side (more critical) or on the consumer side
• On the publisher side, RabbitMQ has a very effective (and somewhat aggressive) backpressure mechanism called “Flow Control” to mitigate bottlenecks and avoid crashes
• Resource (Memory/Disk) Alarms are also integrated into the backpressure mechanism
(serve as triggers)
• On the consumer side, Consumer Utilization can provide useful hints
Finding Bottlenecks in RabbitMQ
11. 11
The Publisher Side of RabbitMQ: Stages & Their Responsibilities
https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
• Side note: there is no one-to-one mapping between "processes" and "architectural components" of RabbitMQ: e.g. while queues are actual processes, exchanges are not. Hence routing is part of the channel process, but since most of the logic is in exchange/routing, this overview somewhat under-represents it.
13. 13
• In order to prevent any of those processes from overflowing the next one down the
chain, we have a credit flow mechanism in place.
• Each process initially grants a certain number of credits to the process that sends it messages. Once a process has handled N of those messages, it grants more credit to the sender
Flow Control in RabbitMQ through “Flow Credit”
https://www.rabbitmq.com/blog/2015/10/06/new-credit-flow-settings-on-rabbitmq-3-5-5/
reader -> channel -> queue process -> message store
reader <--[grant]-- channel <--[grant]-- queue process <--[grant]-- message store.
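The grant cycle above can be simulated with a toy two-process model. The credit numbers below are illustrative, not RabbitMQ's actual (configurable) defaults; `Stage` and `send` are names invented for this sketch.

```python
# Toy model of credit flow between two processes in the chain:
# a sender may only forward a message while it holds credit; the
# receiver grants more credit back after handling a batch of messages.
class Stage:
    def __init__(self, name, grant_after=50):
        self.name = name
        self.grant_after = grant_after
        self.handled_since_grant = 0

    def handle(self):
        """Process one message; periodically grant credit back upstream."""
        self.handled_since_grant += 1
        if self.handled_since_grant >= self.grant_after:
            self.handled_since_grant = 0
            return self.grant_after
        return 0

def send(sender_credit, receiver):
    """Sender forwards one message if it has credit; blocked otherwise."""
    if sender_credit <= 0:
        return sender_credit, True       # blocked -> the 'flow' state
    sender_credit -= 1
    sender_credit += receiver.handle()
    return sender_credit, False

# A receiver that keeps up grants credit back, so the sender never blocks:
fast = Stage("queue process", grant_after=50)
credit = 200
for _ in range(1000):
    credit, blocked = send(credit, fast)
assert credit == 200 and not blocked

# A stalled receiver exhausts the sender's credit -> flow control kicks in:
slow = Stage("queue process", grant_after=10**9)
credit = 200
for _ in range(300):
    credit, blocked = send(credit, slow)
assert credit == 0 and blocked
```

This is the whole point of the mechanism: a slow stage automatically throttles every stage upstream of it, all the way back to the socket reader.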
14. 14
The Flow Control Sign & Examples
That’s the Flow Control sign (under the ‘state’ column)
Example for Queue Flow Control
Examples for Connection Flow Control
15. 15
• If a connection is in flow control, but none of its channels are - This means that one or
more of the channels is the bottleneck; the server is CPU-bound on something the
channel does, probably routing logic. This is most likely to be seen when publishing
small transient messages.
• If a connection is in flow control, some of its channels are, but none of the queues it is
publishing to are - This means that one or more of the queues is the bottleneck; the
server is either CPU-bound on accepting messages into the queue or I/O-bound on
writing queue indexes to disc. This is most likely to be seen when publishing small
persistent messages.
• If a connection is in flow control, some of its channels are, and so are some of the
queues it is publishing to - This means that the message store is the bottleneck; the
server is I/O-bound on writing messages to disc. This is most likely to be seen when
publishing larger persistent messages.
VERY IMPORTANT: How to Decode Flow Control Signs
https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
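The three decoding rules above can be condensed into a small helper. This is just a restatement of the heuristics from the linked post, not a RabbitMQ API.

```python
# The flow-control decoding rules, written as a lookup helper.
def diagnose(conn_in_flow, chan_in_flow, queue_in_flow):
    """Given which layers show the 'flow' sign, name the likely bottleneck."""
    if not conn_in_flow:
        return "no publisher-side bottleneck detected"
    if not chan_in_flow:
        return "channel (CPU-bound, likely routing; small transient messages)"
    if not queue_in_flow:
        return "queue (CPU-bound enqueueing or I/O-bound on the queue index)"
    return "message store (I/O-bound writing messages to disk)"

assert "channel" in diagnose(True, False, False)
assert "message store" in diagnose(True, True, True)
```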
17. 17
• Related config variables
– vm_memory_high_watermark
• It sets the upper limit of how much of the 'installed' memory on the machine RabbitMQ can use
• Default: 0.4 (I changed it to 0.6)
– vm_memory_high_watermark_paging_ratio
• It sets a ratio on the above limit, to tell RabbitMQ when to start moving messages from memory to the disk
• Default: 0.5 (I changed it to 1.0)
• See also
– https://www.rabbitmq.com/production-checklist.html
Alarms: Memory/Disk Allocation & Flow Control Triggers
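For reference, the two variables can be set in the classic rabbitmq.config format; the values below mirror the adjusted values mentioned on this slide (0.6 and 1.0), not RabbitMQ's defaults.

```erlang
%% rabbitmq.config fragment with the values used in these slides
[
  {rabbit, [
    {vm_memory_high_watermark, 0.6},
    {vm_memory_high_watermark_paging_ratio, 1.0}
  ]}
].
```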
19. 19
• The flow control mechanism doesn't extend as far as
consumers, but we do have a new metric to help you
tell how hard your consumers are working.
• That metric is consumer utilization. The definition of
consumer utilization is the proportion of time that a
queue's consumers could take new messages. It's thus
a number from 0 to 1, or 0% to 100% (or N/A if the
queue has no consumers).
• So if a queue has a consumer utilization of 100% then
it never needs to wait for its consumers; it's always
able to push messages out to them as fast as it can.
Finding Bottlenecks: The Consumer Side
https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
20. 20
• If its utilization is less than 100% then this implies that its consumers are sometimes not
able to take messages. Network congestion can limit the utilization you can achieve, or
low utilization can be due to the use of too low a prefetch limit, leading to the queue
needing to wait while the consumer processes messages until it can send out more.
Consumer Utilization (cont.)
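The definition above translates directly into a tiny calculation. `time_consumers_ready` is a name made up for this sketch: the share of the observation window during which the queue's consumers could have taken a new message.

```python
# Sketch of the consumer utilization metric's definition.
def consumer_utilization(time_consumers_ready, total_time):
    """Return utilization in [0, 1], or None when undefined (no observation
    window / no consumers)."""
    if total_time <= 0:
        return None
    return min(1.0, time_consumers_ready / total_time)

assert consumer_utilization(10.0, 10.0) == 1.0   # queue never waits
assert consumer_utilization(2.5, 10.0) == 0.25   # consumers often busy/full
```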
21. 21
• Consumption alternatives: Push (deliver) vs Pull (get)
– Acknowledgements (=> impact of prefetch)
• Finding bottlenecks in RabbitMQ
– Publisher side: Flow Control
• Related topic: Alarms
– Consumer side: Consumer Utilization
• Impact of queue length
Internals - Overview
22. 22
• RabbitMQ's queues are fastest when they're empty.
– When a queue is empty, and it has consumers ready to receive messages, then as soon as a
message is received by the queue, it goes straight out to the consumer.
– The main point is that very little book-keeping needs to be done, very few data structures are
modified, and very little additional memory needs allocating. Consequently, the CPU load of a
message going through an empty queue is very small.
• If the queue is not empty then a bit more work has to be done:
– the messages have to actually be queued up. Initially, this too is fast and cheap as the underlying
functional data structures are very fast.
– Nevertheless, by holding on to messages, the overall memory usage of the queue will be higher,
– and we are doing more work than before per message (each message is being both enqueued and
dequeued now, whereas before each message was just going straight out to a consumer), so the
CPU cost per message is higher.
– Data structures are optimized to be fast when queues are empty or nearly empty
Queue Length: Benefits of Empty/Near-empty Queues
https://www.rabbitmq.com/blog/2011/09/24/sizing-your-rabbits/
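The enqueue-plus-dequeue argument above can be made concrete with a toy operation counter. The "operations" here are abstract bookkeeping units, not measured CPU costs, and the model ignores the disk-paging regime entirely.

```python
# Toy accounting for the cost claim: with a ready consumer and an empty
# queue a message passes straight through (one operation); with a backlog,
# each message is enqueued and later dequeued (two operations) and occupies
# memory in between.
from collections import deque

def run(publishes, consumer_ready):
    q, ops, peak = deque(), 0, 0
    for _ in range(publishes):
        if consumer_ready and not q:
            ops += 1                 # straight out to the consumer
        else:
            q.append("msg")
            ops += 1
            peak = max(peak, len(q))
    while q:                         # drain the backlog later
        q.popleft()
        ops += 1
    return ops, peak

assert run(1000, consumer_ready=True) == (1000, 0)      # 1 op/message
assert run(1000, consumer_ready=False) == (2000, 1000)  # 2 ops/message
```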
23. 23
• Additionally, if a queue receives a spike of publications, then the queue must spend time
dealing with those publications, which takes CPU time away from sending existing
messages out to consumers:
– a queue of a million messages will be able to be drained out to ready consumers at a much higher
rate if there are no publications arriving at the queue to distract it.
• Eventually, as a queue grows, it'll become so big that we have to start writing messages
out to disk and forgetting about them from RAM in order to free up RAM.
– At this point, the CPU cost per message is much higher than had the message been dealt with by
an empty queue.
– and more importantly, the latency per message grows drastically, regardless of CPU utilization, due to the slow path through the disk
Queue Length: Growing Overhead of Bursty/Long Queues
https://www.rabbitmq.com/blog/2011/09/24/sizing-your-rabbits/
These statements have been experimentally verified
(see Part II of this slide set)
25. 25
Basic Benchmarking - Overview
• Experimental Setup
• Throughput rates
– Publish Only
– Consume Only
– Simultaneous Publish and Consume
• Memory consumption scheme
• Impact of disk access
27. 27
• Machine:
– Kernel Version: 3.13.0-91-generic
– Operating System: Ubuntu 14.04.4 LTS
– CPUs: 8
– Total Memory: 7.715 GiB
• RabbitMQ version: 3.5.3
• Client: Java client's PerfTest (next slide)
• Server and client on the same machine
• Server running in a Docker container; client running on the host
Experimental Setup
31. 31
Basic Benchmarking - Overview
• Experimental setup
• Throughput rates
– Publish only
– Consume only
– Simultaneous publish and consume
• Memory consumption scheme
• Impact of disk access
32. 32
Throughput Rates - Publish Only
NOTE: back pressure kicks in for all these cases
No bound queue
1 bound queue
2 bound queues
Increase in routing overhead
Increase in level of back pressure
Increase in memory/CPU utilization
33. 33
Throughput Rates - Consume Only (Ack vs No-Ack)
Acknowledgment has a significant
impact on throughput
36. 36
Basic Benchmarking - Overview
• Experimental setup
• Throughput rates
– Publish only
– Consume only
– Simultaneous publish and consume
• Memory consumption scheme
• Impact of disk access
37. 37
Memory Consumption Scheme – A Long, Stable Queue
Notice the considerable memory
consumed by the queue process
itself (for message metadata,
indexes,…)
Injected 1M messages, each 1 kB => total ~1 GB
38. 38
Memory Consumption Scheme – An Active-but-almost-empty Queue
The queue process consumes
very little memory
Empty queue
39. 39
• In RabbitMQ, memory used by
message bodies is shared
among processes
– Under a group called “Binaries”
• This sharing also happens between queues
– if an exchange routes a message
to many queues, the message
body is only stored in memory
once.
Memory Consumption Scheme – Sharing Across Queues
1 queue (empty)
1 queue with 1M messages
2 identical queues, each with 1M messages
https://www.rabbitmq.com/blog/2014/10/30/understanding-memory-use-with-rabbitmq-3-4/
40. 40
Basic Benchmarking - Overview
• Experimental setup
• Throughput rates
– Publish only
– Consume only
– Simultaneous publish and consume
• Memory consumption scheme
• Impact of disk access
41. 41
Impact of Disk Access – Publish Phase
New load: 10^6 messages of 5 kB each
Total size ~5 GB (beyond the memory limit)
During swap-to-disk periods, back pressure is
the highest (publisher completely stopped)
Indicators of disk access
42. 42
Impact of Disk Access – Consume Phase
At the beginning, messages are served from
memory, at a reasonable rate (15k/s)
Once it starts to hit disk, the rates drop
drastically (to less than 500/s)
46. 46
Distribution Alternatives in RabbitMQ
https://www.rabbitmq.com/distributed.html
meaning it is hard to get through firewalls, which are typically open in one direction only
47. 47
• Scale-out
– Focus of this slide deck
• High availability/Fail-over (through mirrored queues)
– not discussed here (see https://www.rabbitmq.com/ha.html)
Clustering in RabbitMQ: Benefits
48. 48
• All data/state required for the operation of a RabbitMQ broker is replicated across all
nodes.
• An exception to this is message queues, which by default reside on one node, though they are visible and reachable from all nodes.
• To replicate queues across nodes in a cluster, see the documentation on high availability
(note that you will need a working cluster first).
Clustering in RabbitMQ: What is Replicated?
https://www.rabbitmq.com/clustering.html
49. 49
• Queues within a RabbitMQ cluster are located on a single node (by default, the
node on which they were first declared), called home node or queue master
– This is in contrast to exchanges and bindings, which can always be considered to be on all nodes.
• Queues can optionally be made mirrored across multiple nodes
– All queue operations go through the master first and then are replicated to mirrors.
• This is necessary to guarantee FIFO ordering of messages.
– Consumers are connected to the master regardless of which node they connect to
• Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating
nodes each do all the work).
Crucial Details about Queues in a RabbitMQ Cluster
https://www.rabbitmq.com/ha.html
50. 50
• A cluster can be formed in a number of ways:
– Manually with rabbitmqctl
– Declaratively by listing cluster nodes in config file
– Declaratively with plugins
• Two options:
– rabbitmq-autocluster
– rabbitmq-clusterer
• The composition of a cluster can be altered dynamically. All RabbitMQ brokers start out
as running on a single node. These nodes can be joined into clusters, and subsequently
turned back into individual brokers again.
Clustering in RabbitMQ: Cluster Formation Alternatives
https://github.com/harbur/docker-rabbitmq-cluster
52. 52
• A cluster of two nodes:
– Node 1 (N1):
• Operating System: Ubuntu 14.04.4 LTS, Kernel Version: 3.16.0-71-generic
• CPUs: 4
• Total Memory: 3.774 GiB
– Node 2 (N2):
• Operating System: Ubuntu 14.04.4 LTS , Kernel Version: 3.13.0-91-generic
• CPUs: 8
• Total Memory: 7.715 GiB
– Network connection: Ethernet cable (1GB/s)
• No high availability (i.e. mirrored queues)
• Default values inherited from Part II
– Addition: clients running on N2
Basic Benchmarking - Setup
53. 53
• Impact of network latency
• Impact of locality
– (Both Producer/Consumer connected directly to the queue node)
– Both Producer/Consumer connected indirectly to the queue node
– Producer directly to the queue node, consumer indirectly
– Producer indirectly to the queue node, consumer directly
Basic Benchmarking - Scenarios
54. 54
Impact of Network Latency
Q/P/C hosted-on/connected-to N2
P & C running-on N1
Q/P/C hosted-on/connected-to N2
P & C running-on N2
Remarks:
1) No backlog in either case
2) Comparable throughput
(indicating that in LAN
setup, network latency is
not a decisive factor)
55. 55
Impact of Locality: Indirect Producer & Indirect Consumer
Q/P/C hosted-on/connected-to N2 Q hosted-on N1
P/C connected-to N2
Remarks:
1) Producer and Consumer are connected to the queue through a proxy node
2) Both have lower
throughputs
3) Backlog is building up
57. 57
Impact of Locality: Indirect Producer & Direct Consumer
Q/C hosted-on /connected-to N2
P connected-to N1
Q/P/C hosted-on/connected-to N2
Remarks:
1) Moderate overall throughput
2) No backlog
58. 58
Inter-Node Data Transfer
Q/C hosted-on /connected-to N2
P connected-to N1
Q /P hosted-on/ connected-to N2
C connected-to N1
Q hosted-on N1
P/C connected-to N2
59. 59
• Queues in RabbitMQ have one “home” node and all related operations go through that node
• This highlights the importance of “locality” for performance (throughput + backlog)
– Q/P/C all co-located => highest throughput, empty queue
• Best case scenario
– Q/C co-located => moderate levels of throughput, empty queue
– Q/P co-located => relatively low throughput, increasing backlog (fastest growth)
– Neither Q/C nor Q/P co-located => lowest throughput, increasing backlog
Conclusions
61. 61
• “Many Queues” scenario
– Large number of (small) queues
– Problem: queues are by default created on the node the client is connected to, resulting in an imbalance in the long run (see the figure for an example)
• Mitigated in newer versions (see next slide)
– Focus of this slide set
• “Large Queues” scenario
– A few (large) queues
– Problem: how to share the load of a big ‘logical’ queue
among different brokers
– Not discussed here, for a proposal, see this post:
• https://insidethecpu.com/2014/11/17/load-balancing-a-rabbitmq-cluster/
Load Balancing in a RabbitMQ Cluster: Two Different Scenarios
62. 62
• Good news: newer versions of RabbitMQ (3.6.0 and later) provide control over where to create master queues
– Through “Queue Master Locator” strategies
• Proposed Solution: a service (part of WWS Deployment component) that
– a) creates queues, ensuring a balanced output
• With the help of the locator feature of RabbitMQ
– b) for each created queue, figures out its “home” node
• Using a REST call to the Management API of RabbitMQ
– c) points the producer(s) and consumer(s) of the queue to the right node
Load Balancing RabbitMQ Cluster : The “Many Queues” Scenario
63. 63
• Queue masters can be distributed between nodes using several
strategies. Which strategy is used is controlled in three ways:
– using the x-queue-master-locator queue declare argument
– setting the queue-master-locator policy key
– by defining the queue_master_locator key in the configuration file.
• Here are the possible strategies:
– min-masters: pick the node hosting the minimum number of masters
– client-local: pick the node the client that declares the queue is connected to
– random: pick a random node
Queue Master Location
https://www.rabbitmq.com/ha.html
https://www.erlang-solutions.com/blog/take-control-of-your-rabbitmq-queues.html
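The three strategies can be illustrated with a small chooser function. `masters_per_node` would in practice come from the Management API's node/queue statistics; here it is a hypothetical dict, and `locate_master` is a name invented for this sketch.

```python
# Illustrative implementation of the three queue-master location strategies.
import random

def locate_master(strategy, masters_per_node, client_node):
    """masters_per_node: dict node -> current number of queue masters."""
    nodes = sorted(masters_per_node)           # deterministic tie-break
    if strategy == "min-masters":
        return min(nodes, key=lambda n: masters_per_node[n])
    if strategy == "client-local":
        return client_node
    if strategy == "random":
        return random.choice(nodes)
    raise ValueError("unknown strategy: " + strategy)

counts = {"rabbit@broker1": 5, "rabbit@broker2": 2}
assert locate_master("min-masters", counts, "rabbit@broker1") == "rabbit@broker2"
assert locate_master("client-local", counts, "rabbit@broker1") == "rabbit@broker1"
```

This makes the trade-off visible: client-local is cheapest to decide but drifts toward imbalance (the "many queues" problem above), while min-masters spreads masters evenly at declaration time.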
65. 65
• A sample policy, set up by REST call to Management API
– (sample) endpoint:
• http://192.168.0.108:15672/api/policies/%2ftest/min-masters
– Verb:
• PUT
– Body content =>
• Result:
Queue Location Setting: Option 2 – Through Policy
{
  "pattern": "^min-masters",
  "definition": {
    "queue-master-locator": "min-masters"
  },
  "apply-to": "queues"
}
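A sketch of issuing that PUT with Python's standard library. The host, vhost (`%2ftest`), policy name, and guest/guest credentials are examples only; adapt them to a real deployment. `set_policy` needs a live broker, so only `policy_body` is meant to run standalone.

```python
# Sketch: set the queue-master-locator policy via the Management API.
import base64
import json
import urllib.request

def policy_body():
    """The JSON body from this slide (apply-to sits at the top level)."""
    return json.dumps({
        "pattern": "^min-masters",
        "definition": {"queue-master-locator": "min-masters"},
        "apply-to": "queues",
    })

def set_policy(host="localhost", vhost="%2ftest", name="min-masters",
               user="guest", password="guest"):
    """PUT /api/policies/<vhost>/<name>; requires a running broker."""
    url = f"http://{host}:15672/api/policies/{vhost}/{name}"
    req = urllib.request.Request(url, data=policy_body().encode(),
                                 method="PUT")
    req.add_header("Content-Type", "application/json")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return urllib.request.urlopen(req)   # 201/204 on success
```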
66. 66
Queue Master Locator Policy In Practice
Step 1: create 5 queues
with names that don’t
match the policy pattern
Result: all on the same
broker (that client is
connected to)
Step 2: create 9 additional
queues, with names that
match the policy pattern
Result: they are
distributed fairly across
the two brokers
67. 67
• The corresponding entry line =>
– Note, default is “client-local”
• In practice
– result after creating 8 queues =>
Queue Location Setting: Option 3 – Through Config File
{rabbit, [
  ...
  {queue_master_locator, <<"min-masters">>},
  ...
]},
NOTE: it may make sense to make
“min-masters” our default
68. 68
• A REST call to the Management API
– (sample) endpoint:
• http://localhost:15672/api/queues/%2ftest/min-masters.queue9
– Verb:
• GET
• Sample output =>
Retrieve Home Node of a Queue
{
"name":"min-masters.queue9",
"vhost":"/test",
"durable":false,
"auto_delete":false,
"exclusive":false,
"arguments":{},
"node":"rabbit@broker2",
...}
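The "node" field in that response is the queue's home (master) node. Parsing the sample output above:

```python
# Extract the home node from the queue object returned by the call above.
import json

sample = """{
  "name": "min-masters.queue9",
  "vhost": "/test",
  "durable": false,
  "auto_delete": false,
  "exclusive": false,
  "arguments": {},
  "node": "rabbit@broker2"
}"""

def home_node(queue_json):
    return json.loads(queue_json)["node"]

assert home_node(sample) == "rabbit@broker2"
```

This is step (b) of the proposed service: once the home node is known, producers and consumers can be pointed at it directly to preserve locality.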
70. 70
• RabbitMQ has a very consequential back-pressure mechanism (Flow Control)
• Keep your queues empty! (memory and CPU overhead grows quickly with queue length)
• Clustering is not fully transparent (loss of locality vs metadata store)
• Management API exposes a wealth of useful information (particularly, look out for the
node stats, “flow” signs, “disk read/write rates”)
A Few Lessons Learned
71. 71
• Use separate connections for producers and consumers
• Use more than one connection for high-load producers
• Use message batching, if possible
– Amortizes per-message overhead
– At the cost of increased latency
• Use distinct user credentials!
– Helps with troubleshooting
A Few Lessons Learned (cont.)