Queues, Pools and Caches -
Everything a DBA should know about scaling modern OLTP
Gwen (Chen) Shapira, Senior Consultant
The Pythian Group
cshapi@gmail.com

Scalability Problems in Highly Concurrent Systems
When we drive through a particularly painful traffic jam, we tend to assume that the jam has a cause:
that road maintenance or an accident blocked traffic and created the slowdown. However, we often
reach the end of the traffic jam without seeing any visible cause.

Traffic researcher Prof. Sugiyama and his team showed that with sufficient traffic density, traffic jams
will occur with no discernible root cause. Traffic jams will form even when cars drive at a constant speed
on a circular one-lane track [i].

 “When a large number of vehicles, beyond the road capacity, are successively injected into the
road, the density exceeds the critical value and the free flow state becomes unstable.” [ii]

OLTP systems are built to handle a large number of small transactions. In those systems the main
requirements are servicing a large number of concurrent requests with low and predictable latency. Good
scalability for an OLTP system can be defined as “Achieving maximum useful concurrency from a shared
system” [iii].

OLTP systems often behave exactly like the traffic in Prof. Sugiyama’s experiments – more and more
traffic is loaded into the database until, inevitably, a traffic jam occurs, and we may not be able to find any
visible root cause for it. In a wonderful video, Andrew Holdsworth of Oracle’s Real World
Performance group shows how increasing traffic on a database server can dramatically increase latency
without any improvement in throughput, and how reducing the number of connections to the
database can improve performance [iv].

In this presentation, I’ll discuss several design patterns and frameworks that are used to improve
scalability by controlling concurrency in modern OLTP systems and web-based architectures.

All the patterns and frameworks I’ll discuss are considered part of the software architecture. DBAs often
take little interest in the design and architecture of the applications that use the database. But
databases never operate in a vacuum; DBAs who understand application design can have a better dialog
with the software team when it comes to scalability, and progress beyond finger pointing and “The
database is slow” blaming. These frameworks require sizing, capacity planning and monitoring – tasks
that DBAs are often better qualified for than software developers. I’ll go into detail on how DBAs can
help size and monitor these systems with database performance in mind.
Connection Pools
The Problem:
Scaling application servers is a well understood problem. Through the use of horizontal scaling and
stateless interactions it is relatively easy to deploy enough application capacity to support even thousands
of simultaneous user requests. This scalability, however, does not extend to the database layer.

Opening and closing a database connection is a high-latency operation, due to the network round-trips of
the connection protocol between the application server and the database and the significant database
resources each new session requires. Web applications and OLTP systems can't afford this latency on
every user request.

The Solution:
Instead of opening a new connection for each application request, the application engine prepares a
certain number of open database connections and caches them in a connection pool.

In Java, the DataSource class is a factory for creating database connections and the preferred way of
getting a connection. Java defines a generic DataSource interface, and many vendors provide
their own DataSource implementations. Many, but not all, of the implementations also include connection
pooling [v].

Using the generic DataSource interface, developers call getConnection(), and the DataSource class
provides the connection. Since developers write the same code regardless of whether the
DataSource class they are using implements pooling or not, asking a developer whether he is using
connection pooling is not a reliable way to determine if connection pooling is used.
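
To make this concrete, here is a minimal sketch of that data-access pattern; the class, query and column
names are illustrative, not taken from any particular application:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class UserDao {
    private final DataSource ds;  // may or may not pool -- the code is identical either way

    public UserDao(DataSource ds) { this.ds = ds; }

    public String getUsername(int userId) throws Exception {
        Connection conn = ds.getConnection();  // borrowed from the pool, or newly opened
        try {
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT username FROM users WHERE userid = ?");
            stmt.setInt(1, userId);
            ResultSet rs = stmt.executeQuery();
            return rs.next() ? rs.getString(1) : null;
        } finally {
            conn.close();  // with a pooling DataSource this returns the connection, not closes it
        }
    }
}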

To make things more complicated, developers are often unaware of which DataSource class they are using.
The DataSource implementation will be registered with the Java Naming and Directory Interface (JNDI)
and can be deployed and managed separately from the application that is using it. Finding out which
DataSource is used and how the connection pool is configured can take some digging and creativity. Most
application servers contain a configuration file called "server.xml" or "context.xml" that holds the various
resource descriptions. Searching for a resource of type "javax.sql.DataSource" will find the configuration
of the DataSource class and the connection pool minimum and maximum sizes.
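
If you have access to the running application server, a small introspection sketch can settle the question;
the JNDI name "java:comp/env/jdbc/mydb" below is hypothetical and should be taken from your own
configuration files:

import javax.naming.InitialContext;
import javax.sql.DataSource;

public class DataSourceInspector {
    /* Run this inside the application server (e.g. from a servlet),
       since the java:comp/env namespace only exists in the container. */
    public static String describe() throws Exception {
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/mydb");  // placeholder name
        // e.g. "org.apache.tomcat.dbcp.dbcp.BasicDataSource" implies Tomcat's pooled DBCP
        return ds.getClass().getName();
    }
}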
The Architecture:

[Figure: the application stack – the Application Business Layer calls the Application Data Layer, which
obtains connections through the DataSource interface (located via JNDI); the DataSource implementation
sits on top of the JDBC driver and the connection pool.]


New problems:
   1. When connection pools are used, all users share the same schema and the same sessions, so tracing
      can be difficult. We advise developers to use DBMS_APPLICATION_INFO to set extra information
      such as the username (typically in the client_info field), module and action to assist in future
      troubleshooting (see the sketch after this list).
   2. Deciding on the size of a connection pool is the biggest challenge in using connection pools to
      increase scalability. As always, the thing that gets us into trouble is the thing we don’t know
      that we don’t know.
      Most developers are well aware that if the connection pool is too small, the database will sit idle
      while users are either waiting for connections or are being turned away. Since the scalability
      limitations of small connection pools are well known, developers tend to avoid them by creating
      large connection pools, and increasing their size at the first hint of performance problems.
      However, an oversized connection pool is a much greater risk to application scalability. Here is
      what the scalability of an OLTP system typically looks like [vi]:

      [Figure: throughput vs. number of concurrent users, per the Universal Scalability Law – throughput
      climbs, flattens, then declines.]

      Amdahl’s law says that the scalability of a system is constrained by its serial component, as users
      wait for shared resources such as IO and CPU (this is the contention delay). The Universal
      Scalability Law adds a second delay, the “coherency delay”: the cost of maintaining data
      consistency in the system, which models waits on latches and mutexes. After a certain point,
      adding more users to the system will decrease throughput.

      Even before throughput decreases, at the point where it stops growing linearly, requests start to
      queue and response times suffer proportionally:

      [Figure: response time vs. number of concurrent users – response time rises sharply past the
      saturation point.]

      If you check the wait events for a system that is past the point of saturation, you will see very
      high CPU utilization, high “log file sync” waits as a result of the CPU contention, and high waits
      on concurrency events such as “buffer busy waits” and “library cache latch”.
   3. Even when the negative effects of too many concurrent users on the system are made clear,
      developers still argue for oversized connection pools with the excuse that most of the
      connections will be idle most of the time. There are two significant problems with this approach:
          a. While we believe that most of the connections will be idle most of the time, we can’t be
             certain that this will be the case. In fact, the worst performance issues I’ve seen were
             caused by the application actually using the entire connection pool allocated to it.
             This often happens when response times at the database already suffer for some
             reason, and the application does not receive a response in a timely manner. At this point
             the application or users rerun the operation, using another connection to run the exact
             same query. Soon there are hundreds of connections to the database, all attempting to
             run the same queries and waiting for the same latches.
          b. Oversized connection pools have to be re-established during failover events or
             database restarts. The larger the connection pool, the longer the application will take
             to recover from a failover event, decreasing the availability of the application.
   4. Connection pools typically allow setting minimum and maximum sizes for the pool. When the
      application starts, it will open connections until the minimum number of connections is met.
      Whenever it runs out of connections, it will open new connections until it reaches the maximum
      level. If connections are idle for too long, they will be closed, but never below the minimum
      level. This sounds fairly reasonable, until you ask yourself: if we set the minimum to the
      number of connections usually needed, when will the pool run out of connections and have to grow?

      A connection pool can be seen as a queue. Users arrive and are serviced by the database while
      holding a connection. According to Little's Law, the average number of connections in use
      is (avg. DB response time) × (avg. user arrival rate). For example, at 200 requests per second and
      an average response time of 50ms, 200 × 0.05 = 10 connections will be in use on average. It is
      easy to see that you will run out of connections if the rate at which users use your site
      increases, or if database performance degrades and response times increase.

      If your connection pool can grow at these times, it means that it will open new connections – a
      resource-intensive operation, as we previously noted – to a database that is already abnormally
      busy. This will further slow things down, which can lead to a vicious cycle known as a "connection
      storm". It is much safer to configure the connection pool to a specific size: the
      maximum number of concurrent users that can run queries on the database with acceptable
      performance. We’ll discuss later how to determine this size. This will ensure that during peak
      times you will have enough connections to maximize throughput at acceptable latency, and no
      more.
   5. Unfortunately, even if you decide on a proper number of database connections, there is the
      problem of multiple application servers. In most web architectures there are multiple web
      servers, each with a separate connection pool, all connecting to the same database server. In
      this case, it seems appropriate to divide the number of connections the database will sustain by
      the number of servers and size the individual pools by that number. The problem with this
      approach is that load balancing is never perfect, so it is expected that some app servers will run
      out of connections while others still have spare connections. In some cases the number of
      application servers is so large that dividing the number of connections leaves less than one
      connection per server.
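
Here is the tagging sketch promised in item 1 – a minimal example, assuming a plain JDBC connection
borrowed from the pool (the class and variable names are illustrative):

import java.sql.CallableStatement;
import java.sql.Connection;

public class SessionTagger {
    /* Tag the borrowed session so V$SESSION shows who is really using it. */
    public static void tag(Connection conn, String endUser,
                           String module, String action) throws Exception {
        CallableStatement cs = conn.prepareCall(
            "begin " +
            "  dbms_application_info.set_client_info(?); " +
            "  dbms_application_info.set_module(?, ?); " +
            "end;");
        cs.setString(1, endUser);   // typically the real end-user name
        cs.setString(2, module);    // e.g. the application component
        cs.setString(3, action);    // e.g. the current operation
        cs.execute();
        cs.close();
    }
}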

Solutions to new problems:
As we discussed in the previous section, the key to scaling OLTP systems is limiting the number of
concurrent connections to a number that the database can reasonably support even when they are all
active. The challenge is in determining this number.

Keeping in mind that OLTP workloads are typically CPU-bound, the number of concurrent users the
system can support is limited by the number of cores on the database server. A database with 12 cores
can typically only run 12 concurrent CPU-bound sessions.

The best way to size the connection pool is by simulating the load generated by the application.
Running a load test on the database is a great way of figuring out the maximum number of concurrently
active sessions that the database can sustain. This should usually be done with assistance from the QA
department, as they have probably already determined the mix of transactions that simulates the normal
operating load.

It is important to test the number of concurrently active connections the database can support at its
peak. While testing, it is therefore critical to make sure that the database is indeed at full capacity and is
the bottleneck at the point where we decide the number of connections is maximal. This can be
reasonably validated by checking the CPU and IO queues on the database server and correlating them
with the response times of the virtual users.

In typical performance tests, you try to determine the maximum number of users the application can
support, so you run the test with an increasing number of virtual users until the response times
become unacceptable. When attempting to determine the maximum number of connections in the pool,
however, you should run the test with a fixed number of users and keep increasing the number of
connections in the connection pool until the database CPU utilization goes above 60%, the wait events
shift from “CPU” to concurrency events, and response times become unacceptable. Typically all three of
these symptoms will start occurring at approximately the same time.

If a QA department and load testing tools are not available, it is possible to use the methodology
described by James Morle in his paper "Brewing Benchmarks" and generate load testing scripts from
trace files, which can later be replayed by SwingBench.

When running a load test is impractical, you will need to estimate the number of connections based on
available data. The factors to consider are:

    1. How many cores are available on the database server?
    2. How many concurrent users or threads does the application need to support?
    3. When an application thread takes a connection from the pool, how much of the time is spent
       holding the connection without actually running database queries? The more time the
       application spends “just holding” the connection, the larger the pool will need to be to support
       the application workload.
    4. How much of the database workload is IO-bound? You can check IOWAIT on the database server
       to determine this. The more IO-bound your workload is, the more concurrent users you can run
       without running into concurrency contention (you will see a lot of IO contention, though).

“Number of cores” × 4 is a good starting point for the connection pool size: less if the connections are
heavily utilized by the application and there is little IO activity, and more if the opposite is true.
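
As an illustration of this heuristic, here is a minimal configuration sketch using Apache Commons DBCP 1.x;
the driver, URL and credentials are placeholders, and any pooling DataSource with equivalent settings will do:

import org.apache.commons.dbcp.BasicDataSource;

public class PoolFactory {
    public static BasicDataSource create() {
        int dbCores = 12;              // cores on the *database* server, not this machine
        int poolSize = dbCores * 4;    // starting point only; validate with a load test

        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("oracle.jdbc.OracleDriver");
        ds.setUrl("jdbc:oracle:thin:@dbhost:1521:ORCL");   // placeholder
        ds.setUsername("app");                             // placeholder
        ds.setPassword("secret");                          // placeholder
        /* Fixed-size pool: min == max, so the pool never has to grow
           (and risk a connection storm) while the database is already busy. */
        ds.setInitialSize(poolSize);
        ds.setMaxActive(poolSize);
        ds.setMinIdle(poolSize);
        return ds;
    }
}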

The remaining problem is what to do if the number of application servers is large and it is inefficient to
divide the connection pool limit among the application servers. Well-architected systems usually have a
separate data layer that can be deployed on a separate set of servers. This data layer should be the only
component of the application allowed to open connections to the database, and it provides data objects
to the various application server components. In this architecture, the connections are divided between
the data-layer servers, of which there are typically far fewer.
This design has three great advantages. First, the data layer usually grows much more slowly than the
application and rarely requires new servers to be added, which means that pools rarely require resizing.
Second, application requests can be balanced between the data servers based on the remaining pool
capacity. Third, if there is a need to add application-side caching to the system (such as Memcached),
only the data layer needs modification.
Application Message Queues
The Problem:
By limiting the number of connections from the application servers to the database, we are preventing a
large number of queries from queuing at the database server. If the total number of connections
allowed from application servers to the database is limited to 400, the run queue on the database will
not exceed 400 (at least not by much).

We discussed in the previous section why preventing excessive concurrency in the database layer is
critical for database scalability and latency. However, we still need to discuss how the application can
deal with the user requests that arrive when there is no free database connection to handle them.

Let’s assume that we limited the connection pool to 50 connections, and due to a slow-down in the
database, all 50 connections are currently busy servicing user requests. However, new user requests are
still arriving into the system at their usual rate. What shall we do with these requests?

    1. Throw away the database request and return an error or static content to the user.
       Some requests have to be serviced immediately. If the front page of your website can't load
       within a few seconds, it is not worth servicing at all. Hopefully, the database is not a critical
       component in displaying these pages (we'll discuss the options when we discuss caches). If the
       page does depend on the database and your connection pool is currently busy, you will want to
       display a static page and hope the customer will try again later.
    2. Place the request in queue for later processing.
       Some requests can be put aside for later processing, giving the user the impression of
       immediate return. For example, if your system allows the user to request reports by email, the
       request can certainly be acknowledged and queued for off-line processing. This option can be
       mixed with the first option – limit the size of the queue to N requests and display error
       messages for the rest.
    3. Give the request extra-high priority. The application can recognize that the request arrived from
       the CIO and make sure it gets to the database ahead of any other user, perhaps cancelling
       several user requests to get this done.
    4. Give the request extra-low priority. Some requests are so non-critical that there is no reason to
       even attempt serving them with low latency. If a user uses your application to send a message
       to another user, and there is no guarantee on how soon the message will arrive, it makes sense
       to tell the user the message was sent while in effect waiting until a connection in the pool is idle
       before attempting to send the message. Recurring events are almost always lower priority than
       one-time events: a user signing up for the service is a one-time event, and if lost, it will have
       immediate business impact. Auditing user activity, on the other hand, is a recurring event, and a
       delay will have lower business impact.
    5. Some requests are actually a mix of requests from different sources, such as a dashboard. In
       these cases it is best to display the different dashboard components as the data arrives, with
       some components taking longer than others to show up.
In all those cases, the application is able to prioritise requests and decide on a course of action based on
information that the database does not have. It makes sense to shift the queuing to the application when
the database is highly loaded, because the application is better able to deal with the excess load.

Databases are not the only constrained resources; application servers have their own limitations when
dealing with excess load. Typically, application servers have a limited number of threads. This is done for
the same reason we limit the number of connections to the database server: the server only has a limited
number of cores, and an excessive number of threads will overload the server without improving
throughput. Since database requests are usually the highest-latency action performed by an application
thread, when the database is slow to respond, all the application server threads can end up busy waiting
for the database. The CPU on the application server will be idle while the application cannot respond to
additional user requests.

All this leads to the conclusion that from both the database perspective and the application perspective,
it is preferable to decouple the application requests from the database requests. This allows the
application to prioritise requests, hide latency and keep the application server and database server busy
but not overloaded.

The Solution:
Message queues provide an asynchronous communications protocol, meaning that the sender and
receiver of the message do not need to interact with the message queue at the same time. They can be
used by web applications and OLTP systems as a way to hide latency or variance in latency.

Java defines a common messaging API, JMS. There are multiple implementations of this API, both open
source and commercial; Oracle Advanced Queues, for example, are bundled with the Oracle RDBMS, both
SE and EE, at no extra cost. The implementations differ in their feature set, supported operations,
reliability and stability. The API supports queues for point-to-point messaging, where each message is
delivered to a single consumer. It also supports topics for the publish-subscribe model, where multiple
consumers can subscribe to various topics and receive the messages broadcast on the topic.

Message queues are typically installed by system administrators as a separate server or component, just
like databases are installed and maintained. The message queue server is called a "broker", and it is
usually backed by a database to ensure that messages are persistent even when the broker fails. The
application server then connects to the broker by a URL, and can publish and consume from queues by
queue name.
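
As a concrete illustration, here is a minimal JMS sketch using ActiveMQ as the broker; the broker URL
and queue name are placeholders, and the same javax.jms calls work against any JMS provider:

import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class ReportQueueDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the broker by URL
        ActiveMQConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://broker-host:61616");  // placeholder URL
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("REPORT_REQUESTS");          // placeholder name

        // Producer: the web tier acknowledges the user immediately and enqueues the work
        MessageProducer producer = session.createProducer(queue);
        producer.send(session.createTextMessage("report-request:42"));

        // Consumer: a background worker drains the queue at its own pace
        MessageConsumer consumer = session.createConsumer(queue);
        TextMessage msg = (TextMessage) consumer.receive(1000);  // null if nothing within 1s
        if (msg != null) {
            System.out.println("Processing: " + msg.getText());
        }
        conn.close();
    }
}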
The Architecture:

[Figure: the same stack as before – Application Business Layer and Application Data Layer, with a
Message Queue attached alongside; the data layer still reaches the database through the DataSource
interface (via JNDI), the DataSource implementation, the JDBC driver and the connection pool.]

New Problems:
There are some common myths about queue management, which may make developers reluctant to use
queues when they are necessary [vii]:

   1. It is impossible to reliably monitor queues.
   2. Queues are not necessary if you do proper capacity planning.
   3. Message queues are unnecessarily complicated. There must be a simpler way to achieve the
      same goals.

Solutions to New Problems:
While queues are undeniably useful for improving throughput at both the database and application server
layers, they do complicate the architecture. Let’s tackle the myths one by one:

   1. If it were indeed impossible to monitor queues, you would not be able to monitor CPU, load
      average, average active sessions, blocking sessions, disk IO waits or latches – all of these are
      queues. All systems have many queues. The only questions are where each queue is managed
      and how easy it is to manage that specific queue.

       If you use Oracle Advanced Queues, V$AQ will show you the number of messages in the queue
       and the average wait for messages in the queue, which is usually all you need to determine the
       status of the queue. For the more paranoid, I'd recommend adding a heartbeat monitor: insert
       a monitoring message into the queue at regular intervals and check that your process can read
       it from the queue, and how long it took to arrive.

       The more interesting question is what to do with the monitoring information: at what point will
       you send an alert to the on-call SA, and what will you want her to do when she receives the
       alert?
       Any queuing system will have high variance in the service times and arrival rates of work; if the
       service times and arrival rates were constant, there would be no need for queues. The high
       variance is expected to lead to spikes in system utilization, which can cause false alarms - the
       system is behaving as it should, but messages are accumulating in the queue. Our goal is to give
       notice as early as possible when there is a genuine issue with the system that should be
       resolved, and not to send warnings when the system is behaving as expected.

       To this end, I recommend monitoring the following parameters (see the monitoring sketch at the
       end of this section):
          Service time - monitored at the consumer thread. The thread should track
          (i.e. instrument) and log at regular intervals the average time it took to process a
          message from the queue. If service time increases significantly (compared to a known
          baseline, taking into account the known variance in response times), it can indicate a
          slowdown in processing and should be investigated.
          Arrival rate - monitored at the processes that are writing to the queue. How
          many messages are inserted into the queue every second? This should be tracked for
          long-term capacity planning and to determine peak usage periods.
          Queue size - the number of messages in the queue. Using Little's Law, we can measure
          the amount of time a message spends in the queue (wait time) instead.
          If queue size or wait time increases significantly, this can indicate a "business issue", i.e.
          an impending breach of SLA. If the wait time frequently climbs to the point where SLAs
          are breached, it indicates that the system does not have enough capacity to serve the
          current workloads. In this case either service times should be reduced (i.e. tuning), or
          more processing servers should be added. Note that queue size can and should go up
          for short periods of time, and recovering from bursts can take a while (depending on the
          service utilization), so this is only an issue if the queue size is high and does not start
          declining within a few minutes, which indicates that the system is not recovering.
          Service utilization - the percentage of the time the consumer threads are busy. This can
          be calculated as (arrival rate × service time) / number of consumers.
          The more utilized the service is, the higher the probability that when a new message
          arrives, it will have other messages ahead of it in the queue, and since R = S + W, the
          response times will suffer. Since we already measure the queue size directly, the main
          use of service utilization is capacity planning, and in particular detection of
          over-provisioned systems. For a known utilization and fixed service times, if we know
          the arrival rates will grow by 50% tomorrow, you can calculate the expected effect on
          response times [viii]:

          [Figure: predicted response time as a function of service utilization.]

          Note that by replacing many small queues on the database server with one (or a few)
          centralized queues in the application, you are in a much better position to calculate
          utilization and predict the effect on response times.

   2. Queues are inevitable. Capacity planning or not, the fact that arrival rates and service times are
      random ensures that there will be times when requests are queued, unless you plan to turn
      away a large percentage of your business.

      I suspect that what is really meant by "capacity planning will eliminate the need for queues" is
      that it is possible to over-provision a system so that the queue servers (consumers) have very
      low utilization. In this case queuing will be exceedingly rare, so it may make sense to throw
      the queue away and have the application threads communicate with the consumers directly.
      The application will then have to throw away any request that arrives when the consumers are
      busy, but in such a system this will almost never happen. This is "capacity planning by
      overprovisioning" - I've worked on many databases that rarely exceeded 5% CPU. You'll still
      need to closely monitor the service utilization to make sure you increase your capacity to keep
      utilization low. I would not call this type of capacity planning "proper", though.

      On the other hand, the introduction of a few well-defined and well-understood queues will help
      capacity planning. If we assume fixed server utilization, the size of the queue is proportional to
      the number of servers, so on some systems it is possible to do the capacity planning just by
      examining the queue sizes.
   3. Message queues are indeed a complicated and not always stable beast. Queues are a simple
      concept - how did we get to a point where we need all those servers, protocols and applications
      simply to create a queue?
      Depending on your problem definition, it is possible that message queues are excessive
      overhead. Sometimes all you need is a memory structure and a few pointers. My colleague Marc
      Fielding created a high-performance queue system with a database table and two jobs. Some
      developers consider the database a worse overhead and prefer to implement their queues with
      a file, split and xargs. If this satisfies your requirements, then by all means, use those solutions.

      In other cases, I've attempted to implement a simple queuing solution, but the requirements
      kept piling up: What if we want to add more consumers? What if the consumer crashed and
      only processed some of the messages it retrieved? By the time I finished tweaking my system to
      address all the new requirements, it would have been far easier to use an existing solution. So I
      advise using home-grown solutions only if you are reasonably certain the requirements will
      remain simple. If you suspect that you'll have to start dealing with multiple subscribers, which
      may or may not need to retrieve the same message multiple times, may or may not want to
      acknowledge messages, and may or may not want to filter specific message types, then I
      recommend using an existing solution.

      ActiveMQ and RabbitMQ (acquired by SpringSource) are popular open source implementations,
      and Oracle Advanced Queues are free if you already have an Oracle RDBMS license. When
      choosing an off-the-shelf message queue, it is important to understand how the system can be
      monitored and to make sure that queue size, wait times and availability of the queue can be
      tracked by your favorite monitoring tool. If high availability is a requirement, this should also be
      taken into account when choosing a message queue provider, since different queue systems
      support different HA options.
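
Here is the monitoring sketch promised under myth 1: a minimal consumer-side wrapper that tracks the
service time, arrival rate and utilization discussed above. How and where you log the numbers is up to
you; the class and method names are illustrative:

import java.util.concurrent.atomic.AtomicLong;

public class QueueMetrics {
    private final AtomicLong messages = new AtomicLong();
    private final AtomicLong busyNanos = new AtomicLong();
    private final long startNanos = System.nanoTime();
    private final int consumers;

    public QueueMetrics(int consumers) { this.consumers = consumers; }

    /* Wrap the processing of each message the consumer handles. */
    public void record(Runnable handler) {
        long t0 = System.nanoTime();
        handler.run();
        busyNanos.addAndGet(System.nanoTime() - t0);
        messages.incrementAndGet();
    }

    public double avgServiceTimeMs() {
        long n = messages.get();
        return n == 0 ? 0 : busyNanos.get() / 1e6 / n;
    }

    public double arrivalRatePerSec() {
        double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
        return messages.get() / elapsedSec;
    }

    /* Utilization = (arrival rate x service time) / number of consumers. */
    public double utilization() {
        return arrivalRatePerSec() * (avgServiceTimeMs() / 1000.0) / consumers;
    }
}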
Application Caching:
The Problem:
The database is a sophisticated and well-optimized caching machine, but as we saw when we discussed
connection pools, it has its limitations when it comes to scaling. One of those limitations is that a single
database machine is limited in the amount of RAM it has, so if your data working set is larger than the
memory available, your application will have to access the disk occasionally. Disk access is roughly
10,000 times slower than memory access, so even a slight increase in the amount of disk access your
queries have to perform - the kind that happens naturally as your system grows - can have a devastating
impact on database performance.

With Oracle RAC, more cache memory is made available by pooling memory from multiple machines into
a global cache. However, the performance improvement from the additional servers is not proportional
to what you'd see if you added more memory to a single machine: Oracle has to maintain cache
consistency between the servers, and this introduces significant overhead. RAC can scale, but not in
every case, and it requires careful application design to make this happen.

The Solution:
Memcached is a distributed, memory-only, key-value store. It can be used by the application server to
cache the results of database queries that are used multiple times. The great benefit of Memcached is
that it is distributed and can use free memory on any server, allowing caching to be done outside of
Oracle’s scarce buffer cache. If you have 5 application servers and you allocate 1GB of RAM to Memcached
on each server, you have 5GB of additional cache.

The Memcached cache is LRU-managed, just like the buffer cache. If the application tries to store a new
key and there is no free memory, the oldest item in the cache is evicted and its memory is reused for the
new key.



According to the documentation, Memcached scales very well when adding additional servers, because
the servers do not communicate with each other at all. Each client has a list of available servers and a
hash function that tells it which server holds the value for which key, as sketched below. When the
application requests data from the cache, it connects to a single server and accesses exactly one key.
When a single cache node crashes, there will be more cache misses and therefore more database
requests, but the rest of the nodes will continue operating as usual.
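
To make the routing claim concrete, here is a naive sketch of the client-side key-to-server mapping; real
clients typically use consistent hashing rather than a plain modulo, so that adding a node remaps only a
fraction of the keys:

import java.util.List;

public class McRouter {
    private final List<String> servers;  // e.g. ["cache1:11211", "cache2:11211"]

    public McRouter(List<String> servers) { this.servers = servers; }

    /* Every client maps a key to exactly one server; the servers never talk to each other. */
    public String serverFor(String key) {
        int h = key.hashCode() & 0x7fffffff;  // force a non-negative hash
        return servers.get(h % servers.size());
    }
}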

I was unable to find any published benchmarks that confirm this claim, so I ran my own unofficial
benchmark using Amazon’s ElastiCache, a service that allows one to create a Memcached cluster and
add nodes to it.

A few comments regarding the use of Amazon’s ElastiCache and how I ran the tests:

    1. Amazon’s ElastiCache is only usable from servers on Amazon’s EC2 cloud. To run the test, I
       created an ElastiCache cluster with two small servers (1.3GB RAM, 1 virtual core each) and one
       EC2 micro node (613MB, up to two virtual cores for short bursts) running Amazon’s Linux
       distribution.
    2. I ran the test using Brutis [ix], a Memcached load test framework written in PHP. The test is
       fairly configurable, and I ran it as follows:
           A 7-gets-to-3-sets read/write mix; all reads and writes were random. Values were limited
           to 256 bits.
           The first test ran with a key space of 10K keys, which fits easily in the memory of one
           Memcached node. The node was pre-warmed with the keys.
           The second test ran with the same key space and two nodes, both pre-warmed.
           The third test was one node again, with 1M keys, which do not fit in the memory of one
           or two nodes, and no pre-warming of the cache.
           The fourth test used two nodes and 1M keys, with the second node added after the first
           was already active.
           The first 3 tests ran for 5 minutes each; the fourth ran for 15 minutes.
           The single-node tests ran with 2 threads, and the two-node tests ran with four.

    3. Amazon’s cloud monitoring framework was used to monitor Memcached’s statistics. It had two
       annoying properties: it did not automatically refresh, and the values it showed were always 5
       minutes old. In the future, it will be worth the time to install my own monitoring software on an
       EC2 node to track Memcached performance.

Here is a chart of the total number of gets we could run on each node:

[Figure: total gets per node]

Number of hits and misses per node:

[Figure: hits and misses per node]

A few conclusions from the tests I ran:

    1. In the tests I ran, get latency was 2ms on the AWS cluster and 0.0068ms on my desktop. It
       appears that the only latency you’ll experience with Memcached is the network latency.
    2. The ratio of hits and misses did not affect the total throughput of the cluster. The throughput is
       somewhat better with a larger key space, possibly due to fewer get collisions.
    3. Throughput dropped when I added the second server, and total throughput never exceeded 60K
       gets per minute. It is likely that in the configuration I ran, the client could not sustain more than
       60K gets per minute.
    4. 60K random reads per minute at 2ms latency is pretty impressive for two very small servers
       rented at 20 cents an hour. You would need a fairly high-end configuration to get the same
       performance from your database.


By using Memcached (or other application-side caching), the load on the database is reduced, since
there are fewer connections and fewer reads. Database slowdowns will have less impact on application
responsiveness: since most of the data on many pages arrives from the cache, the page can display
gradually without users feeling that they are waiting forever for results. Even better, if the database is
unavailable, you can still maintain partial availability of the application by displaying cached results - in
the best cases, only write operations will be unavailable when the database is down.

The Architecture:

[Figure: the same stack as before, now with Memcached attached to the Application Data Layer alongside
the Message Queue; the data layer still reaches the database through the DataSource interface (via JNDI),
the DataSource implementation, the JDBC driver and the connection pool.]

New Problems:
Unlike Oracle's buffer cache, which is used by queries automatically, use of the application cache does
not happen automatically and requires code changes to the application. In this sense it is somewhat
similar to Oracle's result cache: it stores results by request, rather than caching data blocks automatically.
The changes required to use Memcached are usually made in the data layer. The code that queries the
database is replaced by code that checks the cache first and queries the database only if the result was
not found there.

This places the burden of properly using the cache on the developers. It is said that the only difficult
problems in computer science are naming things and cache invalidation. The purpose of this paper is not
to solve the most difficult problem in computer science, but we will offer some advice on proper use of
Memcached.

In addition, Memcached presents the usual operational questions: how big should it be, and how can it
be monitored? We will discuss capacity planning and monitoring of Memcached as well.

Solutions to new problems:
The first step in integrating Memcached into your application is to rewrite the functions in your data
layer so that they look for data in the cache before querying the database.

For example, the following:

function get_username(int userid) {
    username = db_select("SELECT username FROM users WHERE userid = ?", userid);
    return username;
}



Will be replaced by:

function get_username(int userid) {
    /* first try the cache */
    username = memcached_fetch("username:" + userid);
    if (!username) {
        /* not found: query the database */
        username = db_select("SELECT username FROM users WHERE userid = ?", userid);
        /* then store in the cache until the next get */
        memcached_add("username:" + userid, username);
    }
    return username;
}
We will also need to change the code that updates the database so it will update the cache as well,
otherwise we risk serving stale data:



function update_username(int userid, string username) {
    /* first update the database */
    result = db_execute("UPDATE users SET username = ? WHERE userid = ?",
                        username, userid);
    if (result) {
        /* database update successful: update the cache */
        memcached_set("username:" + userid, username);
    }
}




Of course, not every function should be cached. The cache has a limited size, and there is an overhead to
attempting to use the cache for data that is not actually there. The main benefit comes from caching the
results of expensive or highly redundant queries.

To use the cache effectively without risking data corruption, keep the following in mind:

    1. Use ASH data to find the queries that consume the most database time. Queries that take a
       significant amount of time to execute, and short queries that execute very often, are good
       candidates for caching. Of course, many of these queries use bind variables and return different
       results for each user. As we showed in the example, the bind variables can be used as part of
       the cache key to store and retrieve results for each combination of binds separately. Due to the
       LRU nature of the cache, commonly used combinations will remain in the cache and get reused,
       while infrequently used combinations will be evicted.
    2. Memcached can take large amounts of memory (the more the merrier!), but there is evidence [x]
       that it does not scale well across a large number of cores. This makes Memcached a good
       candidate to share a server with an application that makes intensive use of the CPU and doesn't
       require as much memory. Another option is to create multiple virtual machines on a single
       multi-core server and install Memcached on all the virtual machines. However, this configuration
       means that you will lose most of your caching capacity with the crash of a single physical server.
    3. Memcached is not durable. If you can't afford to lose specific information, store it in the
       database before you store it in Memcached. This seems to imply that you can't use Memcached
       to scale a system that primarily does a large number of writes. In practice, it depends on the
       exact bottlenecks: if your top wait event is "log file sync", you can use Memcached to reduce
       the total amount of work the database does, reduce the CPU load, and therefore potentially
       reduce the "log file sync" waits.
    4. Some data should be stored eventually but can be lost without critical impact to the system.
       Instrumentation and logging information is definitely in this category. This information can be
       stored in Memcached and written to the database in infrequent batches.
5. Consider pre-populating the cache: If you rely on Memcached to keep your performance
      predictable, a crash of a Memcached server will send significant amounts of traffic to the
      database and the effects on performance will be noticeable. When the server comes back, it can
      take a while until all the data is loaded to the cache again, prolonging the period of reduced
      performance. To improve performance in the first minutes after a restart, consider a script that
      will pre-load data into the cache when the Memcached server starts.
    6. Consider very carefully what to do when the data is updated:
       Sometimes it is easy to update the cache simultaneously - if a user changes their address and
       the address is stored in the cache, update the cache immediately after updating the database.
       This is the best-case scenario, as the cache is kept useful through the update. The Memcached
       API contains functions that allow changing data atomically and avoiding race conditions (see the
       sketch after this list).
       When the data in the cache is aggregated data, it may not be possible to update it, but it will be
       possible to evict the current information as stale and reload it into the cache when it is next
       needed. This can make the cache useless if the data is updated and reloaded very frequently.
       Sometimes it isn't even possible to figure out which keys should be evicted from the cache when
       a specific field is updated, especially if the cache contains results of complex queries. This
       situation is best avoided, but it can be dealt with by setting an expiration time for the data and
       preparing to serve possibly-stale data for that period of time.
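
For item 6, here is a sketch of an atomic update using the spymemcached client's check-and-set
support; the host, key and value handling are placeholders, and other Memcached clients expose
equivalent cas operations:

import java.net.InetSocketAddress;
import net.spy.memcached.CASResponse;
import net.spy.memcached.CASValue;
import net.spy.memcached.MemcachedClient;

public class AtomicCacheUpdate {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("cache-host", 11211));  // placeholder

        // gets() returns the value together with a CAS token
        CASValue<Object> current = client.gets("counter:42");  // placeholder key
        if (current != null) {
            long updated = Long.parseLong((String) current.getValue()) + 1;
            // cas() only writes if nobody changed the key since our gets()
            CASResponse result = client.cas("counter:42",
                                            current.getCas(), String.valueOf(updated));
            if (result != CASResponse.OK) {
                // another writer won the race: re-read and retry, or give up
            }
        }
        client.shutdown();
    }
}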

How big should the cache be?

       It is better to have many servers with less memory than a few servers with a lot of memory.
       This minimises the impact of one crashed Memcached server. Remember that there is no
       performance penalty for a large number of nodes.
       Losing a Memcached instance will always send additional traffic to the database. You need to
       have enough Memcached servers to make sure the extra traffic will not cause unacceptable
       latency in the application.
       There are no downsides to a cache that is too large, so in general allocate to Memcached all the
       memory you can afford.
       If the average number of gets per item is very low, you can safely reduce the amount of memory
       allocated.
       There is no "cache size advisor" for Memcached, and it is impossible to predict the effect of
       growing or shrinking the cache based on the monitoring data available from Memcached.
       SimCache is a tool that, based on detailed hit/miss logs for an existing Memcached, can simulate
       an LRU cache and predict the hit/miss ratio at various cache sizes. In many environments
       keeping such a detailed log is impractical, but tracking a sample of the requests may be possible
       and can still be used to predict cache effects.
       Knowing the average latency of database reads under various loads, and the latency of
       Memcached reads, should allow you to predict changes in response time as the Memcached size
       and its hit ratio change. For example:
       You use SimCache to see that with a cache size of 10G you will have a hit ratio of 95% in
       Memcached. Memcached has a latency of 1ms in your system. With 5% of the queries hitting
       the database, you expect the database CPU utilization to be around 20%, with almost 100% of
       DB Time on CPU and almost no wait time on the queue between the business and the data
       layers (you tested this separately when sizing your connection pool). In this case the database
       latency will be 5ms, so we expect the average latency of the data layer to be
       0.95 × 1 + 0.05 × 5 = 1.2ms.




How do I monitor Memcached?

       Monitor the number of items, gets, sets and misses. An increase in the number of cache misses
       means that the database load is increasing at the same time, and can indicate that more
       memory is necessary. Make sure that the number of gets is higher than the number of sets. If
       you are setting more than getting, the cache is a waste of space. If the number of gets per item
       is very low, the cache may be oversized. There is no downside to an oversized cache, but you
       may want to use the memory for another purpose.
       Monitor the number of evictions. Data is evicted when the application attempts to store a new
       item but there is no memory left. An increase in the number of evictions can also indicate that
       more memory is needed. The eviction time shows the time between the last get of an item and
       its eviction. If this period is short, it is a good indication that memory shortage is making the
       cache less effective.
       It is important to note that a low hit rate and a high number of evictions do not immediately
       mean you should buy more memory. It is possible that your application is misusing the cache:
             o Maybe the application sets large numbers of keys, most of which are never used again.
                In this case you should reconsider the way you use the cache.
             o Maybe the TTL for the keys is too short. In this case you will see a low hit rate but not
                many evictions.
             o Maybe the application frequently attempts to get items that don't exist, perhaps due to
                data purging of some sort. Consider setting the key with a "null" value, to make sure
                the invalid searches do not hit the database over and over.
       Monitor for swapping. Memcached is intended to speed up performance by caching data in
       memory. If the data spills to disk, it is doing more harm than good.
       Monitor the average response time. You should see very few requests that take over 1-2ms;
       longer wait times can indicate that you are hitting the maximum connection limit for the server,
       or that CPU utilization on the server is too high.
       Monitor that the number of connections to the server does not come close to the maximum
       connections setting of Memcached (configurable).
       Do not query "stats sizes" for statistics about the size of items in the cache - it locks up the
       entire cache.
All the values I mentioned can be read from Memcached using the stats command in its protocol. You can
run this command and get the results directly by telnet to port 11211, or programmatically as sketched
below. Many monitoring systems, including Cacti and Ganglia, include monitoring templates for
Memcached.
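
A minimal sketch that speaks the Memcached text protocol directly; the host is a placeholder and 11211
is the default port:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class McStats {
    public static void main(String[] args) throws Exception {
        Socket sock = new Socket("cache-host", 11211);  // placeholder host, default port
        PrintWriter out = new PrintWriter(sock.getOutputStream(), true);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(sock.getInputStream()));

        out.print("stats\r\n");   // memcached text protocol command
        out.flush();

        String line;
        // responses look like "STAT get_hits 12345", terminated by a lone "END"
        while ((line = in.readLine()) != null && !line.equals("END")) {
            System.out.println(line);
        }
        sock.close();
    }
}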




[i]    Yuki Sugiyama, Minoru Fukui, Macoto Kikuchi, Katsuya Hasebe, Akihiro Nakayama, Katsuhiro
       Nishinari, Shin-ichi Tadaki and Satoshi Yukawa, "Traffic jam without bottleneck - experimental
       evidence for the physical mechanism of the formation of a jam", New Journal of Physics, Vol. 10
       (2008), 033001
[ii]   http://www.telegraph.co.uk/science/science-news/3334754/Too-many-cars-cause-traffic-jams.html
[iii]  James Morle, Scaling Oracle8i: Building Highly Scalable OLTP System Architectures
[iv]   http://www.youtube.com/watch?v=xNDnVOCdvQ0
[v]    http://docs.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/datasource.html
[vi]   http://www.perfdynamics.com/Manifesto/USLscalability.html
[vii]  http://teddziuba.com/2011/02/the-case-against-queues.html
[viii] http://www.cmg.org/measureit/issues/mit62/m_62_15.html
[ix]   http://code.google.com/p/brutis/
[x]    http://assets.en.oreilly.com/1/event/44/Hidden%20Scalability%20Gotchas%20in%20Memcached%20and%20Friends%20Presentation.pdf

Scalable Web Architecture and Distributed SystemsScalable Web Architecture and Distributed Systems
Scalable Web Architecture and Distributed Systems
 
What is active-active
What is active-activeWhat is active-active
What is active-active
 
S18 das
S18 dasS18 das
S18 das
 
Enhanced Dynamic Web Caching: For Scalability & Metadata Management
Enhanced Dynamic Web Caching: For Scalability & Metadata ManagementEnhanced Dynamic Web Caching: For Scalability & Metadata Management
Enhanced Dynamic Web Caching: For Scalability & Metadata Management
 
Scale from zero to millions of users.pdf
Scale from zero to millions of users.pdfScale from zero to millions of users.pdf
Scale from zero to millions of users.pdf
 
Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore Linking Programming models between Grids, Web 2.0 and Multicore
Linking Programming models between Grids, Web 2.0 and Multicore
 
Starting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for Ops
 
system-design-interview-an-insiders-guide-2nbsped-9798664653403.pdf
system-design-interview-an-insiders-guide-2nbsped-9798664653403.pdfsystem-design-interview-an-insiders-guide-2nbsped-9798664653403.pdf
system-design-interview-an-insiders-guide-2nbsped-9798664653403.pdf
 
System Design
System DesignSystem Design
System Design
 
Bluedog white paper - scaling for high availability, high utilization
Bluedog white paper - scaling for high availability, high utilizationBluedog white paper - scaling for high availability, high utilization
Bluedog white paper - scaling for high availability, high utilization
 
What is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseWhat is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of database
 
Differences Between Architectures
Differences Between ArchitecturesDifferences Between Architectures
Differences Between Architectures
 
A database management system
A database management systemA database management system
A database management system
 
Data stream processing and micro service architecture
Data stream processing and micro service architectureData stream processing and micro service architecture
Data stream processing and micro service architecture
 
saas
saassaas
saas
 
Top System Design Interview Questions
Top System Design Interview QuestionsTop System Design Interview Questions
Top System Design Interview Questions
 
Database System Architectures
Database System ArchitecturesDatabase System Architectures
Database System Architectures
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
 
Oracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridOracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagrid
 
DLTSR_A_Deep_Learning_Framework_for_Recommendations_of_Long-Tail_Web_Services...
DLTSR_A_Deep_Learning_Framework_for_Recommendations_of_Long-Tail_Web_Services...DLTSR_A_Deep_Learning_Framework_for_Recommendations_of_Long-Tail_Web_Services...
DLTSR_A_Deep_Learning_Framework_for_Recommendations_of_Long-Tail_Web_Services...
 

Plus de Gwen (Chen) Shapira

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep DiveGwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebookGwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupGwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersGwen (Chen) Shapira
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 

Plus de Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Dernier

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 

Dernier (20)

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 

a connection. Java defines a generic DataSource interface, and many vendors provide their own DataSource implementations. Many, but not all, of these implementations also include connection pooling.v

Using the generic DataSource interface, developers call getConnection(), and the DataSource class provides the connection. Since developers write the same code regardless of whether the DataSource class they are using implements pooling, asking a developer whether he is using connection pooling is not a reliable way to determine if connection pooling is used. To make things more complicated, the developer is often unaware of which DataSource class he is using. The DataSource implementation is registered with the Java Naming and Directory Interface (JNDI) and can be deployed and managed separately from the application that is using it.

Finding out which DataSource is used and how the connection pool is configured can take some digging and creativity. Most application servers contain a configuration file called "server.xml" or "context.xml" that holds various resource descriptions. Searching for a resource of type "javax.sql.DataSource" will find the configuration of the DataSource class and the connection pool minimum and maximum sizes.
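As a minimal sketch of what this looks like on the application side: the data layer resolves whatever pooled DataSource the administrator registered in JNDI, and borrows a connection per request. The JNDI name "jdbc/OrdersDB" and the table are made-up examples; with a pooling DataSource, closing the connection returns it to the pool rather than tearing it down:

    import javax.naming.InitialContext;
    import javax.sql.DataSource;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class UserDao {
        private final DataSource ds;

        public UserDao() throws Exception {
            // Look up whatever DataSource implementation was registered in JNDI.
            // The name "jdbc/OrdersDB" is a hypothetical example.
            InitialContext ctx = new InitialContext();
            ds = (DataSource) ctx.lookup("java:comp/env/jdbc/OrdersDB");
        }

        public String getUsername(int userId) throws Exception {
            // With a pooling DataSource, close() returns the connection
            // to the pool instead of closing the physical connection.
            try (Connection conn = ds.getConnection();
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT username FROM users WHERE userid = ?")) {
                stmt.setInt(1, userId);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }

The calling code is identical whether or not the DataSource pools connections, which is exactly why you cannot tell from the application code alone.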
The Architecture:

[Architecture diagram: Application Business Layer → Application Data Layer → JNDI → DataSource interface → DataSource connection pool → JDBC driver]

New problems:

1. When connection pools are used, all users share the same schema and the same sessions, so tracing can be difficult. We advise developers to use DBMS_APPLICATION_INFO to set extra information such as username (typically in the client_info field), module and action to assist in future troubleshooting.

2. Deciding on the size of a connection pool is the biggest challenge in using connection pools to increase scalability. As always, the thing that gets us into trouble is the thing we don't know that we don't know. Most developers are well aware that if the connection pool is too small, the database will sit idle while users are either waiting for connections or are being turned away. Since the scalability limitations of small connection pools are known, developers tend to avoid them by creating large connection pools and increasing their size at the first hint of performance problems.

However, a too-large connection pool is a much greater risk to application scalability. Here is what the scalability of an OLTP system typically looks likevi:

[Chart: throughput as a function of concurrency, rising, leveling off, then declining]

Amdahl's law says that the scalability of the system is constrained by its serial component, as users wait for shared resources such as IO and CPU (this is the contention delay). According to the Universal Scalability Law, there is a second delay, the "coherency delay" – the cost of maintaining data consistency in the system, which models waits on latches and mutexes (the formula appears below). After a certain point, adding more users to the system will decrease throughput. Even before throughput starts to decrease, at the point where it stops growing linearly, requests start to queue and response times suffer proportionally:

[Chart: response time as a function of concurrency, growing sharply past the saturation point]

If you check the wait events for a system that is past the point of saturation, you will see very high CPU utilization, high "log file sync" waits as a result of the CPU contention, and high waits on concurrency events such as "buffer busy waits" and "library cache latch".
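For reference, the Universal Scalability Law referred to above is usually written (in Gunther's formulation) with a contention coefficient σ and a coherency coefficient κ; the relative capacity C(N) at concurrency N is:

    C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}

With κ = 0 this reduces to Amdahl's law; the κN(N − 1) coherency term is what makes throughput eventually decline with added concurrency, rather than merely flatten.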
3. Even when the negative effects of too many concurrent users are made clear, developers still argue for oversized connection pools with the excuse that most of the connections will be idle most of the time. There are two significant problems with this approach:

a. While we believe that most of the connections will be idle most of the time, we can't be certain that this will be the case. In fact, the worst performance issues I've seen were caused by the application actually using the entire connection pool allocated to it. This often happens when response times at the database already suffer for some reason, and the application does not receive responses in a timely manner. At this point the application or users rerun the operation, using another connection to run the exact same query. Soon there are hundreds of connections to the database, all attempting to run the same queries and waiting for the same latches.

b. Oversized connection pools have to be re-established during failover events or database restarts. The larger the connection pool, the longer the application will take to recover from a failover event, decreasing the availability of the application.

4. Connection pools typically allow setting minimum and maximum sizes for the pool. When the application starts, it will open connections until the minimum number of connections is met. Whenever it runs out of connections, it will open new connections until it reaches the maximum level. If connections are idle for too long, they will be closed, but never below the minimum level. This sounds fairly reasonable, until you ask yourself: if we set the minimum to the number of connections usually needed, when will the pool run out of connections?

A connection pool can be seen as a queue. Users arrive and are serviced by the database while holding a connection. According to Little's Law, the average number of connections in use is (avg. DB response time) × (avg. user arrival rate). For example, at 200 requests per second and 25 ms of average database time per request, 200 × 0.025 = 5 connections are busy on average. It is easy to see that you will run out of connections if the rate at which users use your site increases, or if database performance degrades and response times increase. If your connection pool can grow at these times, it means that it will open new connections – a resource-intensive operation, as we previously noted – against a database that is already abnormally busy. This will further slow things down, which can lead to a vicious cycle known as a "connection storm". It is much safer to configure the connection pool to a specific size: the maximum number of concurrent users that can run queries on the database with acceptable performance. We'll discuss later how to determine this size. This ensures that during peak times you will have enough connections to maximize throughput at acceptable latency, and no more.

5. Unfortunately, even if you decide on a proper number of database connections, there is the problem of multiple application servers. In most web architectures there are multiple web servers, each with a separate connection pool, all connecting to the same database server. In this case, it seems appropriate to divide the number of connections the database will sustain by the number of servers and size the individual pools by that number. The problem with this approach is that load balancing is never perfect, so it is expected that some app servers will run out of connections while others still have spare connections. In some cases the number of application servers is so large that dividing the number of connections leaves less than one connection per server.

Solutions to new problems:

As we discussed in the previous section, the key to scaling OLTP systems is limiting the number of concurrent connections to a number that the database can reasonably support even when they are all active. The challenge is in determining this number. Keeping in mind that OLTP workloads are typically CPU-bound, the number of concurrent users the system can support is limited by the number of cores on the database server. A database with 12 cores can typically only run 12 concurrent CPU-bound sessions.

The best way to size the connection pool is by simulating the load generated by the application. Running a load test on the database is a great way of figuring out the maximum number of concurrent active sessions the database can sustain. This should usually be done with assistance from the QA department, as they have probably already determined the mix of transactions that simulates the normal operational load. It is important to test the number of concurrently active connections the database can support at its peak; therefore, while testing, it is critical to make sure that the database is indeed at full capacity and is the bottleneck at the point when we decide the number of connections is maximal. This can be reasonably validated by checking the CPU and IO queues on the database server and correlating them with the response times of the virtual users.

In usual performance tests, you try to decide on the maximum number of users the application can support, so you run the test with an increasing number of virtual users until response times become unacceptable. However, when attempting to determine the maximum number of connections in the pool, you should run the test with a fixed number of users and keep increasing the number of connections in the connection pool until the database CPU utilization goes above 60%, the wait events shift from "CPU" to concurrency events, and response times become unacceptable. Typically all three of these symptoms will start occurring at approximately the same time. If a QA department and load testing tools are not available, it is possible to use the methodology described by James Morle in his paper "Brewing Benchmarks" and generate load testing scripts from trace files, which can later be replayed by SwingBench.

When running a load test is impractical, you will need to estimate the number of connections based on available data. The factors to consider are:

1. How many cores are available on the database server?

2. How many concurrent users or threads does the application need to support?

3. When an application thread takes a connection from the pool, how much of the time is spent holding the connection without actually running database queries? The more time the application spends "just holding" the connection, the larger the pool will need to be to support the application workload.

4. How much of the database workload is IO-bound? You can check IOWAIT on the database server to determine this. The more IO-bound your workload is, the more concurrent users you can run without running into concurrency contention (you will see a lot of IO contention, though).

"Number of cores" × 4 is a good connection pool starting point – less if the connections are heavily utilized by the application and there is little IO activity, more if the opposite is true (see the sketch at the end of this section).

The remaining problem is what to do if the number of application servers is large and it is inefficient to divide the connection pool limit among the application servers. Well-architected systems usually have a separate data layer that can be deployed on a separate set of servers. This data layer should be the only component of the application allowed to open connections to the database, and it provides data objects to the various application server components. In this architecture, the connections are divided between the data-layer servers, of which there are typically far fewer. This design has three great advantages: First, the data layer usually grows much more slowly than the application and rarely requires new servers to be added, which means that pools rarely require resizing. Second, application requests can be balanced between the data servers based on the remaining pool capacity. Third, if there is a need to add application-side caching to the system (such as Memcached), only the data layer needs modification.
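To make the fixed-size recommendation concrete, here is a minimal sketch of a fixed-size pool built on a blocking queue, seeded with the cores-times-four starting point described above. It illustrates the sizing idea only and is not a production pool (which would also need connection validation, borrow timeouts and leak detection); the URL and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class FixedConnectionPool {
        private final BlockingQueue<Connection> pool;

        public FixedConnectionPool(String url, String user, String pass)
                throws SQLException {
            // "Number of cores" x 4 starting point. Note that the heuristic
            // refers to cores on the *database* server; the local core count
            // is only a stand-in for this sketch.
            int size = Runtime.getRuntime().availableProcessors() * 4;
            pool = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                pool.add(DriverManager.getConnection(url, user, pass));
            }
        }

        // Blocks when all connections are busy: excess requests queue in
        // the application instead of piling up on the database server.
        public Connection borrow() throws InterruptedException {
            return pool.take();
        }

        public void giveBack(Connection conn) {
            pool.add(conn);
        }
    }

Because borrow() blocks rather than opening new connections, the pool never grows during a slowdown, which is precisely what prevents the connection storm described above.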
Application Message Queues

The Problem:

By limiting the number of connections from the application servers to the database, we prevent a large number of queries from queuing at the database server. If the total number of connections allowed from application servers to the database is limited to 400, the run queue on the database will not exceed 400 (at least not by much). We discussed in the previous section why preventing excessive concurrency in the database layer is critical for database scalability and latency. However, we still need to discuss how the application can deal with the user requests that arrive when there is no free database connection to handle them.

Let's assume that we limited the connection pool to 50 connections and, due to a slowdown in the database, all 50 connections are currently busy servicing user requests. However, new user requests are still arriving into the system at their usual rate. What shall we do with these requests?

1. Throw away the database request and return an error or static content to the user. Some requests have to be serviced immediately. If the front page of your website can't load within a few seconds, it is not worth servicing at all. Hopefully, the database is not a critical component in displaying these pages (we'll discuss the options when we discuss caches). If it does depend on the database and your connection pool is currently busy, you will want to display a static page and hope the customer will try again later.

2. Place the request in a queue for later processing. Some requests can be put aside for later processing, giving the user the impression of an immediate response. For example, if your system allows the user to request reports by email, the request can certainly be acknowledged and queued for off-line processing. This option can be mixed with the first option: limit the size of the queue to N requests and display error messages for the rest.

3. Give the request extra-high priority. The application can recognize that the request arrived from the CIO and make sure it gets to the database ahead of any other user, perhaps cancelling several user requests to get this done.

4. Give the request extra-low priority. Some requests are so non-critical that there is no reason to even attempt serving them with low latency. If a user uses your application to send a message to another user, and there is no guarantee on how soon the message will arrive, it makes sense to tell the user the message was sent while in effect waiting until a connection in the pool is idle before attempting to deliver the message. Recurring events are almost always lower priority than one-time events: a user signing up for the service is a one-time event and, if lost, will have immediate business impact. Auditing user activity, on the other hand, is a recurring event, and in case of delay will have lower business impact.

5. Some requests are actually a mix of requests from different sources, such as a dashboard. In these cases it is best to display the different dashboard components as the data arrives, with some components taking longer than others to show up.
In all those cases, the application is able to prioritise requests and decide on a course of action based on information that the database did not have at the time. It makes sense to shift the queuing to the application when the database is highly loaded, because the application is better capable of dealing with the excess load.

Databases are not the only constrained resource; application servers have their own limitations when dealing with excess load. Typically, application servers have a limited number of threads. This is done for the same reason we limit the number of connections to the database servers: the server only has a limited number of cores, and an excessive number of threads will overload the server without improving throughput. Since database requests are usually the highest-latency action performed by an application thread, when the database is slow to respond, all the application server threads can end up busy waiting for the database. The CPU on the application server will be idle while the application cannot respond to additional user requests.

All this leads to the conclusion that, from both the database perspective and the application perspective, it is preferable to decouple the application requests from the database requests. This allows the application to prioritise requests, hide latency, and keep the application server and database server busy but not overloaded.

The Solution:

Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time. They can be used by web applications and OLTP systems as a way to hide latency, or variance in latency.

Java defines a common messaging API, JMS. There are multiple implementations of this API, both open source and commercial. Oracle Advanced Queues are bundled with Oracle RDBMS, both SE and EE, at no extra cost. These implementations differ in their feature set, supported operations, reliability and stability. The API supports queues for point-to-point messaging with a single publisher and a single consumer. It also supports topics for a publish-subscribe model, where multiple consumers can subscribe to various topics and receive the messages broadcast to the topic.

Message queues are typically installed by system administrators as a separate server or component, just as databases are installed and maintained. The message queue server is called a "broker", and is usually backed by a database to ensure that messages are persistent even when the broker fails. The application server then connects to the broker by a URL, and can publish and consume from queues by queue name.
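As a sketch of what this looks like with the JMS API: the producer below acknowledges a user request by enqueuing it for offline processing, and the consumer drains the queue at its own pace. The broker URL and queue name are hypothetical, and ActiveMQ is used here only as one possible JMS implementation:

    import javax.jms.Connection;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ReportRequestQueue {
        public static void main(String[] args) throws Exception {
            // Connect to the broker by URL, as described above.
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://broker-host:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("report.requests");

            // Producer side: acknowledge the user immediately, process later.
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("report-for-user-42"));

            // Consumer side: one of a fixed number of workers, each holding
            // a single pooled database connection, drains the queue.
            MessageConsumer consumer = session.createConsumer(queue);
            TextMessage request = (TextMessage) consumer.receive(5000);
            if (request != null) {
                System.out.println("Processing " + request.getText());
            }
            connection.close();
        }
    }

Because the number of consumers is fixed, the database sees a bounded, steady workload no matter how fast requests arrive; bursts accumulate in the broker instead.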
The Architecture:

[Architecture diagram: Application Business Layer → Message Queue → Application Data Layer → JNDI → DataSource interface → DataSource connection pool → JDBC driver]

New Problems:

There are some common myths related to queue management, which may make developers reluctant to use queues when they are necessaryvii:

1. It is impossible to reliably monitor queues.

2. Queues are not necessary if you do proper capacity planning.

3. Message queues are unnecessarily complicated. There must be a simpler way to achieve the same goals.

Solutions to New Problems:

While queues are undeniably useful for improving throughput at both the database and application server layers, they do complicate the architecture. Let's tackle the myths one by one:

1. If it were indeed impossible to monitor queues, you would not be able to monitor the CPU, load average, average active sessions, blocking sessions, disk IO waits or latches. All systems have many queues. The only question is where each queue is managed and how easy it will be to monitor. If you use Oracle Advanced Queues, V$AQ will show you the number of messages in the queue and the average wait for messages in the queue, which is usually all you need to determine the status of the queue. For the more paranoid, I'd recommend adding a heartbeat monitor: insert a monitoring message into the queue at regular intervals and check that your process can read it from the queue, along with the amount of time it took to arrive. The more interesting question is what you do with the monitoring information: at what point will you send an alert to the on-call SA, and what will you want her to do when she receives the alert?
Any queuing system will have high variance in service times and arrival rates of work; if the service times and arrival rates were constant, there would be no need for queues. The high variance is expected to lead to spikes in system utilization, which can cause false alarms – the system is behaving as it should, but messages are accumulating in the queue. Our goal is to give as early a warning as possible when there is a genuine issue that should be resolved, without sending warnings when the system is behaving as expected. To this end, I recommend monitoring the following parameters:

Service time – monitored at the consumer thread. The thread should track (i.e. instrument) and log at regular intervals the average time it took to process a message from the queue. If service time increases significantly (compared to a known baseline, taking into account the known variance in response times), it can indicate a slowdown in processing and should be investigated.

Arrival rate – monitored at the processes that are writing to the queue. How many messages are inserted into the queue every second? This should be tracked for long-term capacity planning and to determine peak usage periods.

Queue size – the number of messages in the queue. Using Little's Law, we can measure the amount of time a message spends in the queue (wait time) instead. If queue size or wait time increases significantly, this can indicate a "business issue", i.e. an impending breach of SLA. If the wait time frequently climbs to the point where SLAs are breached, it indicates that the system does not have enough capacity to serve the current workload. In this case either service times should be reduced (i.e. tuning), or more processing servers should be added. Note that queue size can and should go up for short periods of time, and recovering from bursts can take a while (depending on the service utilization), so this is only an issue if the queue size is high and does not start declining within a few minutes, which would indicate that the system is not recovering.
Service utilization – the percentage of time the consumer threads are busy. This can be calculated as (arrival rate × service time) / (number of consumers). The more utilized the service is, the higher the probability that a newly arriving message will have other messages ahead of it in the queue, and since R = S + W, response times will suffer. Since we already measure the queue size directly, the main use of service utilization is capacity planning, and in particular the detection of over-provisioned systems. For known utilization and fixed service times, if we know the arrival rate will grow by 50% tomorrow, we can calculate the expected effect on response timesviii (see the formulas below).

Note that by replacing many small queues on the database server with one (or a few) centralized queues in the application, you are in a much better position to calculate utilization and predict the effect on response times.

2. Queues are inevitable. Capacity planning or not, the fact that arrival rates and service times are random ensures that there will be times when requests are queued, unless you plan to turn away a large percentage of your business. I suspect that what is really meant by "capacity planning will eliminate the need for queues" is that it is possible to over-provision a system so that the queue servers (consumers) have very low utilization. In that case queuing will be exceedingly rare, so it may make sense to throw the queue away and have the application threads communicate with the consumers directly. The application will then have to throw away any request that arrives when the consumers are busy, but in such a system this will almost never happen. This is "capacity planning by overprovisioning". I've worked on many databases that rarely exceeded 5% CPU. You'll still need to closely monitor the service utilization to make sure you increase your capacity to keep utilization low. I would not call this type of capacity planning "proper", though. On the other hand, the introduction of a few well-defined and well-understood queues will help capacity planning. For fixed server utilization, the size of the queue is proportional to the number of servers, so on some systems it is possible to do capacity planning just by examining the queue sizes.
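As a back-of-the-envelope reference for the calculation mentioned above – a sketch under the usual simplifying assumption of a single consumer with exponentially distributed arrivals and service times (M/M/1):

    \rho = \frac{\lambda \, S}{m}, \qquad R = S + W, \qquad R_{M/M/1} = \frac{S}{1 - \rho}

where λ is the arrival rate, S the average service time, m the number of consumers, ρ the utilization, W the wait time and R the response time. For example, at ρ = 0.5 a message spends 2S in the system on average; if arrival rates grow 50% and push ρ to 0.75, the same message spends 4S – response time doubles with no change in service time.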
3. Message queues are indeed complicated, and not always stable, beasts. Queues are a simple concept – how did we get to a point where we need all those servers, protocols and applications simply to create a queue? Depending on your problem definition, it is possible that message queues are excessive overhead. Sometimes all you need is a memory structure and a few pointers. My colleague Marc Fielding created a high-performance queue system with a database table and two jobs (the sketch below gives the general idea). Some developers consider the database a worse overhead and prefer to implement their queues with a file, split and xargs. If this satisfies your requirements, then by all means use those solutions. In other cases, I've attempted to implement a simple queuing solution, but the requirements kept piling up: What if we want to add more consumers? What if the consumer crashed and only processed some of the messages it retrieved? By the time I finished tweaking my system to address all the new requirements, it was far easier to use an existing solution. So I advise using home-grown solutions only if you are reasonably certain the requirements will remain simple. If you suspect that you'll have to start dealing with multiple subscribers, which may or may not need to retrieve the same message multiple times, which may or may not want to ack messages, and which may or may not want to filter specific message types, then I recommend using an existing solution. ActiveMQ and RabbitMQ (acquired by SpringSource) are popular open source implementations, and Oracle Advanced Queues are free if you already have an Oracle RDBMS license. When choosing an off-the-shelf message queue, it is important to understand how the system can be monitored, and to make sure that queue size, wait times and availability of the queue can be tracked by your favorite monitoring tool. If high availability is a requirement, this should also be taken into account when choosing a message queue provider, since different queue systems support different HA options.
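For illustration only – this is not Marc Fielding's actual design – a minimal table-backed queue on Oracle can be an INSERT on the producer side and a SELECT ... FOR UPDATE SKIP LOCKED on the consumer side, so concurrent consumer jobs skip each other's rows instead of blocking. The table, sequence and column names are made up:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class TableQueue {
        // Producer: enqueue is just an insert.
        public void enqueue(Connection conn, String payload) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO simple_queue (id, payload) " +
                    "VALUES (queue_seq.NEXTVAL, ?)")) {
                stmt.setString(1, payload);
                stmt.executeUpdate();
            }
            conn.commit();
        }

        // Consumer job: SKIP LOCKED lets concurrent consumers pick
        // different rows instead of waiting on each other's locks.
        public String dequeue(Connection conn) throws Exception {
            conn.setAutoCommit(false);
            String payload = null;
            try (PreparedStatement stmt = conn.prepareStatement(
                     "SELECT id, payload FROM simple_queue " +
                     "ORDER BY id FOR UPDATE SKIP LOCKED");
                 ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {   // take the first unlocked message
                    payload = rs.getString("payload");
                    try (PreparedStatement del = conn.prepareStatement(
                            "DELETE FROM simple_queue WHERE id = ?")) {
                        del.setLong(1, rs.getLong("id"));
                        del.executeUpdate();
                    }
                }
            }
            conn.commit();         // removing the row marks it processed
            return payload;
        }
    }

This is exactly the kind of sketch that stops being simple the moment you add the requirements listed above, which is the point of the recommendation to use an existing solution.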
Application Caching:

The Problem:

The database is a sophisticated and well-optimized caching machine, but as we saw when we discussed connection pools, it has its limitations when it comes to scaling. One of those limitations is that a single database machine is limited in the amount of RAM it has, so if your data working set is larger than the amount of memory available, your application will have to access the disk occasionally. Disk access is 10,000 times slower than memory access. Even a slight increase in the amount of disk access your queries have to perform – the kind that happens naturally as your system grows – can have a devastating impact on database performance.

With Oracle RAC, more cache memory is made available by pooling memory from multiple machines into a global cache. However, the performance improvement from the additional servers is not proportional to what you'd see if you added more memory to the same machine. Oracle has to maintain cache consistency between the servers, and this introduces significant overhead. RAC can scale, but not in every case, and it requires careful application design to make this happen.

The Solution:

Memcached is a distributed, memory-only, key-value store. It can be used by the application server to cache results of database queries that will be used multiple times. The great benefit of Memcached is that it is distributed and can use free memory on any server, allowing caching to be done outside of Oracle's scarce buffer cache. If you have 5 application servers and you allocate 1G of RAM to Memcached on each server, you have 5G of additional caching.

The Memcached cache is an LRU, just like the buffer cache. If the application tries to store a new key and there is no free memory, the oldest item in the cache is evicted and its memory used for the new key.

According to the documentation, Memcached scales very well when adding additional servers, because the servers do not communicate with each other at all. Each client has a list of available servers and a hash function that tells it which server holds the value for which key. When the application requests data from the cache, it connects to a single server and accesses exactly one key. When a single cache node crashes, there will be more cache misses and therefore more database requests, but the rest of the nodes will continue operating as usual.

I was unable to find any published benchmarks that confirm this claim, so I ran my own unofficial benchmark, using Amazon's ElastiCache, a service which allows one to create a Memcached cluster and add nodes to it. A few comments regarding the use of Amazon's ElastiCache and how I ran the tests:

1. Amazon's ElastiCache is only usable from servers on Amazon's EC2 cloud. To run the test, I created an ElastiCache cluster with two small servers (1.3G RAM, 1 virtual core) and one EC2 micro node (613 MB, up to two virtual cores for short bursts) running Amazon's Linux distribution.

2. I ran the test using Brutisix, a Memcached load test framework written in PHP. The test is fairly configurable, and I ran it as follows: a 7-gets-to-3-sets read/write mix; all reads and writes were random; values were limited to 256 bits. The first test ran with a key space of 10K keys, which fits easily in the memory of one Memcached node. The node was pre-warmed with the keys. The second test ran with the same key space and two nodes, both pre-warmed. The third test was one node again, with 1M keys, which do not fit in the memory of one or two nodes, and no pre-warming of the cache. The fourth test used two nodes and 1M keys, with the second node added after the first node was already active. The first three tests ran for 5 minutes each; the fourth ran for 15 minutes. The single-node tests ran with 2 threads, and the two-node tests ran with four.

3. Amazon's cloud monitoring framework was used to monitor Memcached's statistics. It had two annoying properties: it did not automatically refresh, and the values it showed were always 5 minutes old. In the future, it will be worth the time to install my own monitoring software on an EC2 node to track Memcached performance.

Here is a chart of the total number of gets we could run on each node:

[Chart: total gets per node]
[Chart: total number of gets per node - image not reproduced in this text version.]
[Chart: number of hits and misses per node - image not reproduced in this text version.]

A few conclusions from the tests I ran:
1. Get latency was 2ms on the AWS cluster and 0.0068ms on my desktop. It appears that essentially the only latency you will experience with Memcached is network latency.
2. The ratio of hits to misses did not affect the total throughput of the cluster. Throughput was somewhat better with the larger key space, possibly due to fewer get collisions.
3. Throughput dropped when I added the second server, and total throughput never exceeded 60K gets per minute. It is likely that in the configuration I ran, the client could not sustain more than 60K gets per minute.
4. 60K random reads per minute at 2ms latency is quite impressive for two very small servers rented at 20 cents an hour. You would need a fairly high-end configuration to get the same performance from your database.

By using Memcached (or other application-side caching), load on the database is reduced, since there are fewer connections and fewer reads. Database slowdowns have less impact on application responsiveness: on many pages most of the data arrives from the cache, so the page can display gradually without users feeling that they are waiting forever for results. Even better, if the database is unavailable, you can still maintain partial availability of the application by displaying cached results; in the best case, only write operations are unavailable while the database is down.

The Architecture:
[Architecture diagram: the application business layer communicates with Memcached and a message queue; the application data layer accesses the database through a DataSource interface, JNDI DataSource, connection pool and JDBC driver.]
New Problems:
Unlike Oracle's buffer cache, which queries use automatically, the application cache does not get used automatically: it requires code changes in the application. In this sense it is somewhat similar to Oracle's result cache - it stores results on request, rather than caching data blocks automatically. The changes required to use Memcached are usually made in the data layer: the code that queries the database is replaced by code that queries the database only if the result is not found in the cache first. This places the burden of using the cache properly on the developers.

It is said that the only difficult problems in computer science are naming things and cache invalidation. The purpose of this paper is not to solve the most difficult problems in computer science, but we will offer some advice on proper use of Memcached. In addition, Memcached presents the usual operational questions - how big should it be, and how can it be monitored? We will discuss capacity planning and monitoring of Memcached as well.

Solutions to new problems:
The first step in integrating Memcached into your application is to rewrite the functions in your data layer so that they look for data in the cache before querying the database. For example, the following:

    function get_username(int userid) {
        username = db_select("SELECT username FROM users WHERE userid = ?", userid);
        return username;
    }

will be replaced by:

    function get_username(int userid) {
        /* first try the cache */
        username = memcached_fetch("username:" + userid);
        if (!username) {
            /* not found: query the database */
            username = db_select("SELECT username FROM users WHERE userid = ?", userid);
            /* then store in cache until the next get */
            memcached_add("username:" + userid, username);
        }
        return username;
    }
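The same read-through pattern as a runnable Python sketch; pymemcache is an assumed client library, and sqlite3 stands in for the real database:

    import sqlite3
    from pymemcache.client.base import Client  # assumed client library

    cache = Client(("localhost", 11211))
    db = sqlite3.connect("app.db")             # stand-in for the real database

    def get_username(userid):
        key = "username:%d" % userid
        username = cache.get(key)              # first try the cache
        if username is None:
            # Cache miss: query the database ...
            row = db.execute("SELECT username FROM users WHERE userid = ?",
                             (userid,)).fetchone()
            username = row[0] if row else None
            # ... and cache the result so the next get is served from memory.
            if username is not None:
                cache.set(key, username)
        return username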
We also need to change the code that updates the database so that it updates the cache as well; otherwise we risk serving stale data:

    function update_username(int userid, string username) {
        /* first update the database */
        result = db_execute("UPDATE users SET username = ? WHERE userid = ?", username, userid);
        if (result) {
            /* database update successful: update the cache */
            memcached_set("username:" + userid, username);
        }
    }

Of course, not every function should be cached. The cache has a limited size, and there is an overhead to checking the cache for data that is not actually there. The main benefit comes from caching the results of large or highly redundant queries. To use the cache effectively without risking data corruption, keep the following in mind:

1. Use ASH data to find the queries that consume the most database time. Queries that take a significant amount of time to execute, and short queries that execute very often, are good candidates for caching. Many of these queries use bind variables and return different results for each user. As the example shows, the bind values can be used as part of the cache key, storing and retrieving results for each combination of binds separately. Due to the LRU nature of the cache, commonly used bind combinations will remain in the cache and get reused, while infrequently used combinations will be evicted.
2. Memcached takes large amounts of memory (the more the merrier!), but there is evidence[x] that it does not scale well across a large number of cores. This makes Memcached a good candidate to share a server with an application that makes intensive use of the CPU and does not need as much memory. Another option is to create multiple virtual machines on a single multi-core server and install Memcached on all of them; however, with this configuration you will lose most of your caching capacity when a single physical server crashes.
3. Memcached is not durable. If you can't afford to lose specific information, store it in the database before you store it in Memcached. This seems to imply that you can't use Memcached to scale a system that primarily does a large number of writes. In practice, it depends on the exact bottleneck: if your top wait event is "log file sync", you can use Memcached to reduce the total amount of work the database does, reduce the CPU load, and thereby potentially reduce the "log file sync" waits.
4. Some data should be stored eventually, but can be lost without critical impact to the system. Instrumentation and logging information is definitely in this category. Such information can be stored in Memcached and written to the database infrequently, in batches.
5. Consider pre-populating the cache. If you rely on Memcached to keep your performance predictable, a crash of a Memcached server will send a significant amount of traffic to the database, and the effect on performance will be noticeable. When the server comes back, it can take a while until the data is loaded into the cache again, prolonging the period of reduced performance. To shorten the period of reduced performance after a restart, consider a script that pre-loads data into the cache when the Memcached server starts.
6. Consider very carefully what to do when the data is updated. Sometimes it is easy to update the cache at the same time: if a user changes his address and the address is stored in the cache, update the cache immediately after updating the database. This is the best-case scenario, as the cache stays useful through the update. The Memcached API contains operations that change data atomically and avoid race conditions (see the sketch below). When the data in the cache is aggregated, it may not be possible to update it in place, but it is possible to evict the current entry as stale and reload it into the cache when it is next needed; this can make the cache useless if the data is updated and reloaded very frequently. Sometimes it is not even possible to figure out which keys should be evicted when a specific field is updated, especially if the cache contains results of complex queries. This situation is best avoided, but it can be handled by setting an expiration time on the data and being prepared to serve possibly-stale data for that period.
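On the atomic operations mentioned in item 6: the memcached protocol includes incr/decr, which update a numeric value on the server in one step. A brief Python illustration (pymemcache assumed) contrasting it with a racy get-modify-set:

    from pymemcache.client.base import Client  # assumed client library

    cache = Client(("localhost", 11211))

    # Racy: two clients can read the same value and both write back count+1,
    # losing one of the increments.
    count = int(cache.get("page_views") or 0)
    cache.set("page_views", str(count + 1))

    # Atomic: the increment happens on the Memcached server itself.
    cache.set("page_views", "0")
    cache.incr("page_views", 1)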
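The calculation is just a weighted average, so it is easy to script while exploring cache sizes; the numbers below are the ones from the example:

    def expected_latency_ms(hit_ratio, cache_ms, db_ms):
        # Weighted average of cache hits and database reads.
        return hit_ratio * cache_ms + (1.0 - hit_ratio) * db_ms

    print(expected_latency_ms(0.95, 1.0, 5.0))  # 1.2ms, as in the example
    print(expected_latency_ms(0.90, 1.0, 5.0))  # 1.4ms if the hit ratio drops to 90%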
How big should the cache be?
It is better to have many servers with less memory each than a few servers with a lot of memory; this minimizes the impact of losing one Memcached server, and there is no performance penalty for a large number of nodes. Losing a Memcached instance will always send additional traffic to the database; you need enough Memcached servers to make sure the extra traffic does not cause unacceptable latency in the application.

There is no downside to a cache that is too large, so in general allocate to Memcached all the memory you can afford. If the average number of gets per item is very low, you can safely reduce the amount of memory allocated.

There is no "cache size advisor" for Memcached, and it is impossible to predict the effect of growing or shrinking the cache from the monitoring data Memcached itself provides. SimCache is a tool that, given detailed hit/miss logs from an existing Memcached, can simulate an LRU cache and predict the hit/miss ratio at various cache sizes. In many environments keeping such a detailed log is impractical, but tracking a sample of the requests may be possible, and the sample can still be used to predict cache effects. Knowing the average latency of database reads under various loads, and the latency of Memcached reads, allows you to predict how response time will change as the Memcached size and its hit ratio change.

For example: SimCache shows that with a cache size of 10G you will have a 95% hit ratio in Memcached. Memcached has a latency of 1ms in your system. With 5% of the queries hitting the database, you expect database CPU utilization to be around 20%, with almost 100% of the DB time on CPU and almost no wait time on the queue between the business and the data layers (you tested this separately when sizing your connection pool). In this case the database latency will be 5ms, so the expected average latency for the data layer is 0.95*1 + 0.05*5 = 1.2ms.

How do I monitor Memcached?
Monitor the number of items, gets, sets and misses. An increase in the number of cache misses almost certainly means the database load is increasing at the same time, and can indicate that more memory is needed.

Make sure the number of gets is higher than the number of sets. If you are setting more than you are getting, the cache is a waste of space. If the number of gets per item is very low, the cache may be oversized; there is no downside to an oversized cache, but you may want to use the memory for another purpose.

Monitor the number of evictions. Data is evicted when the application attempts to store a new item and there is no memory left, so an increase in evictions can also indicate that more memory is needed. The evicted-time statistic shows the time between the last get of an item and its eviction; if this period is short, it is a good indication that a memory shortage is making the cache less effective.

Note that a low hit rate and a high number of evictions do not immediately mean you should buy more memory. It is possible that your application is misusing the cache:
o Maybe the application sets large numbers of keys, most of which are never read again. In this case you should reconsider the way you use the cache.
o Maybe the TTL on the keys is too short. In this case you will see a low hit rate but not many evictions.
o Maybe the application frequently tries to get items that don't exist, perhaps due to data purging of some sort. Consider setting such keys with a "null" value, to make sure the invalid searches do not hit the database over and over.

Monitor for swapping. Memcached is meant to speed up performance by caching data in memory; if that memory is swapped to disk, it does more harm than good.

Monitor the average response time. You should see very few requests that take over 1-2ms; longer wait times can indicate that you are hitting the server's maximum connection limit, or that CPU utilization on the server is too high.

Monitor that the number of connections to the server does not come close to Memcached's (configurable) max connections setting.

Do not monitor "stats sizes" for statistics about the size of items in the cache - it locks up the entire cache.

All the values mentioned here can be read from Memcached with the stats command in its protocol; you can also run it directly by connecting with telnet to port 11211.
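A minimal Python sketch that pulls the relevant counters over the same plain-text protocol (host and port are assumptions for your environment):

    import socket

    def memcached_stats(host="localhost", port=11211):
        # The text protocol answers "stats\r\n" with "STAT <name> <value>"
        # lines, terminated by a line reading "END".
        sock = socket.create_connection((host, port))
        sock.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
        sock.close()
        stats = {}
        for line in data.decode().splitlines():
            if line.startswith("STAT "):
                _, name, value = line.split(" ", 2)
                stats[name] = value
        return stats

    s = memcached_stats()
    for name in ("curr_items", "cmd_get", "cmd_set", "get_misses", "evictions"):
        print(name, s.get(name))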
Many monitoring systems, including Cacti and Ganglia, include monitoring templates for Memcached.

References:
i. Yuki Sugiyama, Minoru Fukui, Macoto Kikuchi, Katsuya Hasebe, Akihiro Nakayama, Katsuhiro Nishinari, Shin-ichi Tadaki, Satoshi Yukawa: "Traffic jams without bottlenecks - experimental evidence for the physical mechanism of the formation of a jam", New Journal of Physics, Vol. 10 (2008), 033001
ii. http://www.telegraph.co.uk/science/science-news/3334754/Too-many-cars-cause-traffic-jams.html
iii. James Morle: Scaling Oracle8i: Building Highly Scalable OLTP System Architectures
iv. http://www.youtube.com/watch?v=xNDnVOCdvQ0
v. http://docs.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/datasource.html
vi. http://www.perfdynamics.com/Manifesto/USLscalability.html
vii. http://teddziuba.com/2011/02/the-case-against-queues.html
viii. http://www.cmg.org/measureit/issues/mit62/m_62_15.html
ix. http://code.google.com/p/brutis/
x. http://assets.en.oreilly.com/1/event/44/Hidden%20Scalability%20Gotchas%20in%20Memcached%20and%20Friends%20Presentation.pdf