Showdown: DB2 vs. Oracle Database for OLTP
Conor O’Mahony
Email: [email_address]
Twitter: conor_omahony
Blogs: db2news.wordpress.com and database-diary.com
Agenda
Technology for OLTP Performance
Efficient I/O
Large Memory and Efficient Memory Usage
User Scalability
Agenda
Transactional Performance
Longevity in TPC-C Performance (results as of April 21, 2008)
Apples-to-Apples Comparison: 16% faster. Results current as of Feb 24, 2008; check http://www.tpc.org for the latest results.
Longevity in SAP 3-Tier SD Performance (results as of Jan 8, 2008)
SAP SD 3-tier
2-tier SAP SD Benchmarks (results as of April 8, 2008)
Agenda
Oracle RAC - Single Instance Wants to Read a Page (diagram: Instance 1 and Instance 2 buffer caches communicating through GCS processes, steps 1 through 6, to locate page 501)
What Happens in DB2 pureScale to Read a Page (diagram: a member's db2agent and buffer pool communicating with the group buffer pool in the PowerHA pureScale CF, steps 1 through 4, to locate page 501)
The Advantage of DB2 Read and Register with RDMA: the member's db2agent performs a direct remote memory write of its request to the CF ("I want page 501. Put it into slot 42 of my buffer pool."), and the CF thread replies with a direct remote memory write of the response ("I don't have it, get it from disk"). Much more scalable, and does not require locality of data.
Transparent Application Scalability
Scalability for OLTP Applications:
2, 4 and 8 members: over 95% scalability
16 members: over 95% scalability
32 members: over 95% scalability
64 members: 95% scalability
88 members: 90% scalability
112 members: 89% scalability
128 members: 84% scalability
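As a rough arithmetic check, the scalability percentages above imply an effective throughput multiplier of roughly members × scalability (a simplification that treats "over 95%" as exactly 0.95):

```python
# Effective throughput multiplier implied by the quoted scalability
# figures: N members at scalability s deliver roughly N * s times the
# throughput of a single member.
results = {2: 0.95, 4: 0.95, 8: 0.95, 16: 0.95, 32: 0.95,
           64: 0.95, 88: 0.90, 112: 0.89, 128: 0.84}
for members, scalability in sorted(results.items()):
    effective = members * scalability
    print(f"{members:3d} members -> ~{effective:.0f}x single-member throughput")
```

So even at 128 members, where scalability has dropped to 84%, the cluster still delivers on the order of 108 times a single member's throughput in this simple model.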
Agenda
Online Recovery (diagram: DB2 members with their logs, duplexed CFs, and shared data; after a database member failure, only data with in-flight updates is locked during recovery, and the percentage of data available returns to 100% within ~seconds)
Steps Involved in DB2 pureScale Member Failure
Failure Detection for Failed Member
Member Failure Summary (diagram: clients connected through a single database view to DB2 members with their logs and shared data; primary and secondary CFs each track updated pages and global locks; one member is terminated with kill -9)
Steps Involved in a RAC Node Failure: unlike DB2 pureScale, Oracle RAC does not centralize its lock manager or data cache.
With RAC – Access to the GRD and Disks Is Frozen: when Instance 1 fails, I/O requests are frozen and no lock updates are made to the GRD; no more I/O can occur until the pages that need recovery are locked.
With RAC – Pages that Need Recovery Are Locked: the recovery instance reads the redo log of the failed node and locks the pages that need recovery; the log must be read and those pages locked before the freeze is lifted.
DB2 pureScale – No Freeze at All: there is no I/O freeze, because the CF, as central lock manager, always knows which changes are in flight, and therefore which rows on which pages had in-flight updates at the time of failure.
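The recovery contrast drawn on these slides can be sketched as follows. In the RAC-style model nothing is accessible until the failed node's redo log has been scanned, while in the pureScale-style model only the pages the CF already knows had in-flight updates are locked. This is a simplified illustration of the slides' argument, not either product's actual algorithm:

```python
# Sketch of the recovery difference.  100 pages; 3 had uncommitted
# (in-flight) updates on the failed member at the time of failure.
all_pages = set(range(100))
in_flight_at_failure = {3, 17, 42}

def rac_accessible(redo_log_scanned):
    # RAC-style: I/O is frozen until the recovery instance has read the
    # failed node's redo log and locked the pages needing recovery.
    if not redo_log_scanned:
        return set()                        # freeze: nothing accessible
    return all_pages - in_flight_at_failure

def purescale_accessible():
    # pureScale-style: the CF tracks in-flight updates continuously, so
    # surviving members keep accessing everything else immediately.
    return all_pages - in_flight_at_failure

print(len(rac_accessible(False)))    # frozen during the log scan
print(len(purescale_accessible()))   # available at once
```

The end state is the same in both models (97 of 100 pages accessible); the difference is the window during which the RAC-style cluster serves nothing at all.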
Agenda
Sample of Feedback…
Thank You!
Conor O’Mahony
Email: [email_address]
Twitter: conor_omahony
Blog: database-diary.com

Editor's Notes

  1. What are the most important features in an RDBMS for OLTP transactions? To deliver very high throughput for transactional systems, the RDBMS must be able to perform I/O operations efficiently without holding up the transaction, must use memory efficiently, and must handle large numbers of concurrent users effectively. These three critical areas enable a database server to deliver very high levels of performance, and we explore each of them in detail on the following three slides.
  2. In very high volume transactional systems, the logger can quickly become the bottleneck. DB2 (and other RDBMSs) can do most of the work a transaction requires completely in memory: updates occur in the buffer pool, and authorizations and access plans are all cached in memory. However, one thing cannot happen in memory. Whenever a transaction performs a COMMIT, the information that tells the RDBMS how to redo that transaction (i.e., the log information) must be flushed to disk. If the committed transaction were not recorded in the log files on disk, it would be possible to lose committed transactions. Therefore, all database servers write records to the log files whenever a user commits a transaction. To achieve very high concurrency and throughput, it is essential that the logger be as efficient as possible, since these I/Os can quickly become the bottleneck (disk access is significantly slower than memory access). There is a very strong proof point that DB2 has a much more efficient logger than its competitors. TPC-C is an industry-standard transaction processing benchmark in which all major database vendors participate; each vendor runs its own benchmarks to try to demonstrate it has the best RDBMS, and DB2 comes out on top more often than any other vendor (but more on that later). One interesting requirement of TPC-C is that you must also publish how much log space you consume during the benchmark run. By comparing the log space consumed (knowing that this standard benchmark requires every vendor to run the exact same transactions over and over again), we can compare the efficiency of the three database vendors' loggers. The most current TPC-C results (as of March 18, 2008) are shown on this chart. You can see that for each standard TPC-C transaction, DB2 produced 2.4 KB of log.
Oracle’s top result, with 10g R2 (there was no 11g top result as of 3/18/2008), consumes twice as much log space, meaning the DB2 logger is twice as efficient and can therefore deliver higher levels of throughput. Oracle also ran a TPC-C benchmark with RAC and consumed 20x more log space than DB2; ask Oracle why RAC consumes so much log space for the same transactions! Microsoft's SQL Server 2005 result was even worse than Oracle's, consuming more than 2.5x the log space of DB2. These benchmarks are the most highly tuned database systems available, tuned by each vendor's own benchmark experts. This reduction in logging is one of the reasons why DB2 delivers better OLTP performance than Oracle and Microsoft.
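The commit-time bottleneck described in this note can be sketched as a minimal write-ahead log. The names here (`WalLogger`, `log_update`, `commit`) are illustrative only, not DB2's actual interfaces:

```python
import os

class WalLogger:
    """Minimal write-ahead-log sketch: updates stay in memory, but
    COMMIT must force the log records to stable storage."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.pending = []          # log records not yet on disk
        self.bytes_logged = 0

    def log_update(self, txn_id, record):
        # In-memory only: buffer-pool changes require no I/O yet.
        self.pending.append(f"{txn_id}:{record}\n".encode())

    def commit(self, txn_id):
        # The one step that cannot stay in memory: write + fsync.
        # Fewer log bytes per transaction means fewer bytes through
        # this bottleneck, hence higher commit throughput.
        data = b"".join(self.pending) + f"{txn_id}:COMMIT\n".encode()
        os.write(self.fd, data)
        os.fsync(self.fd)          # durable only after this returns
        self.bytes_logged += len(data)
        self.pending.clear()

logger = WalLogger("/tmp/demo.wal")
logger.log_update(1, "UPDATE stock SET qty=9 WHERE id=501")
logger.commit(1)
print(logger.bytes_logged)   # total log bytes forced to disk
```

In this model, a logger that emits 2.4 KB per transaction pushes roughly half as many bytes through the `fsync` path as one emitting around 5 KB for the same work, which is the efficiency argument the note makes.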
  3. Efficient use of memory is also critical for high-volume transaction processing. Given a limited amount of physical memory, you want your database to utilize it to the fullest in order to improve the throughput of your system. DB2 has two unique advantages over Oracle in this area. The first is that DB2 allows for multiple buffer pools. In Oracle you can have only one buffer pool per page size (i.e., one 4 KB pool, one 8 KB pool, one 16 KB pool and one 32 KB pool). This can severely limit your ability to utilize the memory on the server to tune the system for optimal performance. DB2, however, allows as many buffer pools of any page size as you like. For example, you can have 4 buffer pools of 4 KB pages and another 5 buffer pools of 8 KB pages, choosing the buffer pool configuration that best suits your transaction processing needs. As an example, on a server with 2 TB of real memory in a TPC-C benchmark, DB2 allocated several buffer pools of different page sizes whose total size in aggregate was 1.9 TB. With the new threaded engine in DB2 9.5, there is even more advantage over Oracle. By using threads rather than processes for user connections, the amount of memory consumed per connection is significantly lower. This allows more user connections for a given amount of memory and leaves more memory available to other areas of DB2 (like the buffer pools). This better memory utilization again results in higher throughput and better performance. Later in this presentation we will talk about the Self-Tuning Memory Manager (STMM), which shows that DB2 not only exploits memory better for higher performance, but does so with less administration required.
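A sketch of the multiple-buffer-pool idea, assuming a simple LRU replacement policy; the class and pool names are hypothetical illustrations, not DB2 configuration syntax:

```python
from collections import OrderedDict

class BufferPool:
    """One pool of fixed-size pages with simple LRU eviction."""
    def __init__(self, page_size_kb, num_pages):
        self.page_size_kb = page_size_kb
        self.capacity = num_pages
        self.pages = OrderedDict()            # page_id -> page bytes

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # hit: refresh LRU position
            return self.pages[page_id]
        page = read_from_disk(page_id)        # miss: fetch and cache
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        self.pages[page_id] = page
        return page

# DB2-style configuration: several pools may share the same page size,
# so a hot table can be isolated from large scans that would otherwise
# flush its pages.  Oracle's one-pool-per-page-size rule disallows the
# second 4 KB pool below.
pools = {
    "orders_4k":  BufferPool(4, 1000),
    "history_4k": BufferPool(4, 100),    # second 4 KB pool
    "index_8k":   BufferPool(8, 500),
}
page = pools["orders_4k"].get(501, lambda pid: b"\x00" * 4096)
```

The design point is isolation: a scan hammering `history_4k` can only evict its own 100 pages, leaving the hot `orders_4k` pool untouched.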
  4. The final area that is critical to high transaction performance is the ability to support large numbers of concurrent users. Both DB2 and Oracle have the ability to do connection concentration to reduce memory requirements on the server. However, only DB2 has the threaded engine mentioned on the previous slide. This enables DB2 to scale higher than Oracle on the same server with the same amount of memory and therefore deliver higher throughput.
  5. There are several transaction processing benchmarks that demonstrate DB2's performance leadership over Oracle. The first is TPC-C, an industry-standard transaction processing benchmark. SAP Sales and Distribution (SD) is also a widely used performance benchmark, simulating real-world SAP transactions. The third transaction processing benchmark, SPECjAppServer, measures the performance of a web-based Java application on the database system. We will discuss each of these benchmarks on the following slides.
  6. Benchmarks are often a leapfrog game where, on any given day, one database vendor can be in front of the rest if it runs on some newly announced hardware or the latest software versions. This chart represents days of leadership for TPC-C from Jan 1, 2003 through April 21, 2008, measuring how long each vendor has held the top spot. Over this five-year period, DB2 has been in the leadership position almost 2x longer than Oracle, and in fact has led longer than all other database vendors combined.
  7. It is not very often that you get an apples-to-apples comparison where two database vendors run their benchmarks on exactly the same hardware. This result is slightly dated (DB2 V8 against Oracle 10g); however, it shows that on exactly the same hardware, DB2 delivered 16% better performance than Oracle. In fact, you would need 10 CPUs of Oracle to match the performance of 8 CPUs of DB2 on this class of server.
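The "10 CPUs of Oracle to match 8 CPUs of DB2" figure follows from the 16% gap, under the simplifying assumption that throughput scales linearly with CPU count (real systems only approximate this):

```python
import math

db2_cpus = 8
advantage = 1.16   # DB2 throughput / Oracle throughput on identical hardware

# Oracle CPUs needed for equal throughput, assuming linear CPU scaling:
# 8 * 1.16 = 9.28, and you cannot buy 0.28 of a CPU, so round up.
oracle_cpus = math.ceil(db2_cpus * advantage)
print(oracle_cpus)   # 10
```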
  8. In fact, Oracle has rarely been able to challenge DB2 on the SAP SD 3-tier benchmark over the past 5 years. This chart represents days of leadership for SAP SD 3-tier since Jan 1, 2003. As you can see, over the last 5 years DB2 has held the lead 8 times longer than Oracle (the only other competitor to lead in this timeframe).
  9. This result shows the top SAP SD 3-tier benchmark results as of March 18, 2008. SAP Sales and Distribution 3-tier represents a configuration where the database software runs on its own server hardware with several SAP application servers in the middle tier. This is the configuration on which most enterprise customers would run their SAP workloads, and DB2 has demonstrated clear performance leadership in this area.
  10. On the SAP SD 2-tier benchmark, DB2 leads Oracle by 18% while using half the number of processor cores. On April 8, 2008, DB2 9.5 running on a 64-core IBM Power 595 with AIX 6.1 delivered 35,400 SD users. Oracle's top result is 30,000 SD users with 10g running on a 128-core HP Integrity Superdome with HP-UX.
  11. A server process that wants to access a data page, for example page 501, will first check to see if that page is in its local buffer pool (step 1). If this page is not found, the server process will send an inter-process communication (IPC) request to a GCS process in order to ask the master node for that data page (step 2). This results in the server process yielding the CPU and the CPU performing a context switch to re-establish the GCS process on the CPU to process the interrupt; high levels of context switching can be very costly. The GCS process then sends an IP request to the master node for the data block being requested (step 3). Because IP calls are processed in the operating system kernel, the GCS process has to copy the requested information into kernel memory and then execute expensive IP stack calls to push the request to the remote node. Even if an InfiniBand network is being used, Oracle still uses IP over InfiniBand or, in some cases, Reliable Datagram Sockets (RDS). Use of a socket protocol, even over InfiniBand, is costly due to processor interrupts, IP stack traversal, and so on. Next, the remote master GCS process will receive an interrupt and will be scheduled on the CPU to process the request. It will check to see if any other member has the page in its buffer cache. In this example, no member has the page, so the GCS process will send an IP message back to the requester telling it to read the page from disk (step 4). The GCS process on the requesting node will be interrupted again to process the incoming IP request, and will in turn send an IPC interrupt (step 5) to the server process to inform it that no other node in the cluster has the page. The server process will then read the page from disk into its own buffer cache (step 6).
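The six steps above can be sketched as a toy event trace that counts the expensive transitions (context switches, kernel IP calls) on a buffer-cache miss. This is an illustration of the message flow as described in this note, not Oracle's actual code path:

```python
# Toy trace of the six-step RAC read on a buffer-cache miss.
trace = []

def rac_read(page_id, local_cache, master_has_page=False):
    if page_id in local_cache:                                # step 1: hit
        trace.append("local-hit")
        return local_cache[page_id]
    trace.append("IPC to local GCS (context switch)")         # step 2
    trace.append("IP request to master GCS (kernel copy)")    # step 3
    if not master_has_page:
        trace.append("IP reply: read from disk (kernel copy)")   # step 4
        trace.append("IPC interrupt to server process")          # step 5
        trace.append("disk read into local buffer cache")        # step 6
    page = b"page-%d" % page_id
    local_cache[page_id] = page
    return page

cache = {}
rac_read(501, cache)
print(len(trace))   # 5 costly hops on a miss
```

Even in this toy model, a single cold read crosses two IPC boundaries and two kernel IP calls before the disk read begins, which is the overhead the note attributes to the socket-based protocol.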
  12. This slide illustrates the advantage of DB2 pureScale for very efficient access to data. A comparison to Oracle RAC follows in the next section. The steps listed above show how DB2 pureScale communicates with the CF to declare its intent to access a data page. Steps 2 and 3 are the critical success factors for DB2 pureScale efficiency: when there is a need to communicate with the centralized CF, that communication uses RDMA. Essentially, the process on Member 1 writes its request directly into the memory of the CF. This is done without going through the IP socket stack, without context switching, and in many cases without having to yield the CPU (the round-trip communication time between the two servers can be as little as 15 microseconds).
  15. To dive deeper into the “secret sauce,” let’s look at exactly how the member communicates with the CF. If an agent on Member 1 wants to read a page, it writes its request directly into the memory of the CF, telling the CF exactly which page it wants and even which slot in Member 1’s buffer pool the page should go into. If the CF does not have the page, it writes a message directly into the memory of Member 1 to indicate that it doesn’t have the page. If the CF does have the page, it writes the data page directly into memory on Member 1 without any context switching or IP stack calls.
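A minimal sketch of that exchange, with plain Python objects standing in for pinned memory regions. Real RDMA uses hardware verbs rather than method calls, and all names here are invented for illustration:

```python
class Member:
    """Stand-in for a pureScale member with a fixed-size local buffer pool."""
    def __init__(self, n_slots):
        self.buffer_pool = [None] * n_slots

class CF:
    """Stand-in for the coupling facility with its group buffer pool."""
    def __init__(self, cached_pages):
        self.group_buffer_pool = dict(cached_pages)
        self.request_area = None  # memory the member writes into via RDMA

def request_page(member, cf, page_id, slot):
    # The member RDMA-writes its request (page id plus the destination
    # slot in its own buffer pool) straight into CF memory: no socket,
    # no kernel IP stack, no interrupt-driven context switch.
    cf.request_area = (page_id, slot)
    page = cf.group_buffer_pool.get(page_id)
    if page is None:
        # The CF RDMA-writes a "not found" reply into the member's memory.
        return "not in CF: member reads from disk"
    # The CF RDMA-writes the page directly into the member's chosen slot.
    member.buffer_pool[slot] = page
    return "page written into slot by CF"
```

The key design point the note makes is that the requester names the destination slot up front, so the CF can complete the transfer with a single direct memory write.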
  16. As previously mentioned, the critical success factor for scalability in an active-active cluster is ensuring that when a transaction requests a piece of data, it can get that data with the lowest possible latency. With DB2 pureScale, by centralizing data that is of interest to more than one member in the cluster, and by accessing that data using RDMA in an interrupt-free processing environment, you can see near-linear scalability even out to dozens of nodes. More importantly, you do not need to design your application to be cluster aware; there is no need to route transactions that access the same data pages to a single node. In practice, this is not the case with Oracle RAC. Many stories on the internet, and several published books on Oracle RAC, tell customers to avoid hot pages being passed between nodes by using one of the methods described in the last 4 bullets of the above slide. These methods require costly DBA and application developer intervention, as well as potential application rework as the size of the cluster changes.
  17. To demonstrate the scalability of DB2 pureScale, the lab set up a configuration comprising 128 members (note that for server consolidation environments it is possible to put multiple members on one SMP server). A workload was created with a read-to-write ratio of roughly 90:10. To prove the scalability of the architecture, the application has no cluster awareness: it updates or selects a random row, so every row in the database is touched by all members in the cluster (this was done to show that locality of data is not as essential for scaling as it is with other shared-disk architectures). The results of this 128-member test show near-linear scaling. Up to 64 members, scalability (relative to the 1-member result) remains above 95%, and at 128 members it was 84%. Note that this is a validation of the architecture and includes some capabilities under development that will not be in the December GA code.
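The workload and the efficiency metric described above can be sketched as follows. The 90:10 split and the efficiency formula match the figures quoted; the function names and parameters are invented for the sketch, not IBM’s actual test harness:

```python
import random

def run_workload(table_size, n_ops, seed=0):
    """Cluster-unaware workload: uniformly random rows, ~90% reads."""
    rng = random.Random(seed)
    reads = writes = 0
    for _ in range(n_ops):
        row = rng.randrange(table_size)  # no locality, no routing by key
        if rng.random() < 0.90:
            reads += 1   # SELECT of a random row
        else:
            writes += 1  # UPDATE of a random row
    return reads, writes

def scaling_efficiency(throughput_n, n_members, throughput_1):
    """Measured throughput relative to perfect linear scaling."""
    return throughput_n / (n_members * throughput_1)
```

Under this metric, the quoted 84% at 128 members means the cluster delivered about 107.5x the single-member throughput instead of the ideal 128x.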
  18. The second key feature of DB2 pureScale is the high availability it provides. Again, the secret to its success is the centralized locking and caching. When one member fails, all other members in the cluster can continue to process transactions. The only data that is unavailable is the set of pages that were being updated in flight when the member failed. And if those pages are hot, they will be in CF memory, which means the recovery of pages needed by other members will be very fast.
  19. At a high level, three things occur during an instance failure: (1) failure detection; (2) pulling pages that need to be fixed directly from CF memory; (3) fixing the pages. Each of these steps has been optimized in DB2 pureScale with the goal of getting these pages fixed and accessible again in under 20 seconds (all the while, the rest of the data in the database remains completely available).
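The three steps can be sketched as a toy recovery routine. The structure follows the list above; the names and the in-flight page set are invented for illustration:

```python
def member_recovery(inflight_pages, cf_cache):
    """Toy walkthrough of the three recovery steps for a failed member."""
    timeline = ["1: detect the failure"]
    recovered = {}
    for page in inflight_pages:
        # 2: hot pages come straight from CF memory; cold ones from disk.
        source = "CF memory" if page in cf_cache else "disk"
        recovered[page] = source
        timeline.append(f"2: fetch page {page} from {source}")
    # 3: redo/undo the in-flight changes; only these pages were blocked,
    # so the rest of the database stayed available throughout.
    timeline.append("3: redo/undo the in-flight changes")
    return recovered, timeline
```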
  20. Failure detection was a large part of the investment that went into DB2 pureScale. Software failure in a DB2 pureScale environment is architected to be caught in a fraction of a second, with recovery processing driven within that same second. Hardware failure is a more difficult challenge, but thanks to some innovative techniques, DB2 pureScale has built-in algorithms that can detect node failures in as little as 3 seconds without false failovers. When we talk about having the rows available again within 20 seconds of a failure, we mean from the time the failure occurred, not the time it was detected. Other vendors may exclude detection time to report better numbers, but to an end user this time is critical, so we include it.
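Heartbeat timeouts are the generic technique behind this kind of detection. The sketch below illustrates the idea only; DB2’s actual detection algorithms are not described in this deck, and the 3-second timeout is simply the figure quoted above:

```python
def failed_members(heartbeats, now, timeout=3.0):
    """Members whose last heartbeat is older than `timeout` seconds.

    `heartbeats` maps member name -> timestamp of its last heartbeat.
    A real implementation must also guard against false failovers
    (e.g. transient network blips), which this sketch ignores.
    """
    return [m for m, last in heartbeats.items() if now - last > timeout]
```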
  21. Here is a detailed walkthrough of what happens when a node fails (run this in slide show mode to see the steps). Note that we call this process “online failover” because transactions on other nodes are not impacted in any way (which is different from Oracle RAC, as you will see in later slides). The data that needs to be fixed will primarily be in memory on the CF, so recovery works at memory speeds. In the event of a hardware failure, we take the additional step of automatically fencing off storage access from the failed member to prevent split-brain issues.
  22. In Oracle, a similar set of steps recovers from a failed instance; however, the middle two are where things differ greatly from DB2 pureScale: (1) node failure detection; (2) global lock remastering; (3) locking pages that need recovery; (4) fixing those pages. The biggest difference is that DB2 pureScale uses centralized locking, so there is no need to remaster global locks. pureScale also does not need to find the pages to lock; it is already aware of the pages that need to be fixed.
  23. In Oracle RAC, each data page (called a data block in Oracle) is mastered by one of the instances in the cluster. Oracle employs a distributed locking mechanism, so each instance in the cluster is responsible for managing and granting lock requests for the pages it masters. In the event of a node failure, the pages mastered by the failed node become momentarily orphaned while RAC goes through a lock redistribution process to assign new ownership of these orphaned pages to the surviving nodes in the cluster. This is called Global Resource Directory (GRD) reconfiguration, and while it is occurring, any request to read a page, as well as any request to lock a page, is momentarily frozen. Applications can continue to process on the surviving nodes; however, during this time they cannot perform any I/O operations or request any new locks. As a result, many applications experience a freeze, as shown on this slide.
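The remastering step can be illustrated with a toy hash-based page-master assignment. Oracle’s actual GRD algorithm is not shown here, only the general idea that the failed instance’s pages are orphaned until every one has been reassigned to a survivor:

```python
def master_of(page_id, instances):
    # Toy master assignment: hash the page id across the live instances.
    return instances[page_id % len(instances)]

def remaster(pages, instances, failed):
    """Reassign the failed instance's pages to the surviving instances."""
    survivors = [i for i in instances if i != failed]
    # Pages mastered by the failed instance are orphaned; until each has
    # a new master, lock requests and I/O for those pages are frozen.
    orphaned = [p for p in pages if master_of(p, instances) == failed]
    return {p: master_of(p, survivors) for p in orphaned}
```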
  24. The second step in the Oracle RAC node recovery process is to lock all the data pages that need recovery. This must be done before the GRD freeze described earlier is released; if an instance were allowed to read a page from disk before the appropriate page locks were acquired, the update from the failed instance could be lost. The recovery instance performs a first-pass read of the redo log file from the failed instance and locks any pages that need recovery, as shown in Figure 2. This may require a significant number of random I/O operations, as the log file, and potentially the pages that need recovery, may not be in the memory of any surviving node. Only after all these I/O operations are performed by the recovery instance and the appropriate pages are locked is the GRD freeze lifted and the stalled applications allowed to continue. Depending on how much work the failed node was doing at the time of the failure, this process can take from tens of seconds up to as much as a minute. The GRD freeze, and the fact that no I/O operations can be performed or new lock requests granted during this period, is documented in several published books on Oracle RAC.
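The first-pass scan reduces to collecting the set of pages touched by the failed instance’s redo records. The record layout below is invented for the sketch:

```python
def first_pass_scan(redo_log):
    """Pages the failed instance modified, each of which must be locked
    before the GRD freeze can be lifted. Reading the log (and possibly
    the pages themselves) may mean many random I/O operations."""
    return {record["page"] for record in redo_log}
```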
  25. In comparison, DB2 pureScale environments require no global freeze in the cluster. The CF is aware at all times of which pages would need recovery should any member fail. If a member fails, all other members in the cluster can continue to run transactions and perform I/O operations. Only requests to access pages that need recovery are blocked while the recovery process cleans up after the failed member, as shown on this slide (and that process is likely to happen from memory).
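The contrast can be sketched as a simple request filter: only pages in the failed member’s in-flight set block, and everything else is served immediately. Names are illustrative:

```python
def serve_during_recovery(requests, pages_in_recovery):
    """Serve every page request except those touching pages under recovery.

    Because the CF already tracks the in-flight set, there is no
    cluster-wide freeze: the blocked list shrinks to empty as the
    recovery process fixes each page.
    """
    served, blocked = [], []
    for page in requests:
        (blocked if page in pages_in_recovery else served).append(page)
    return served, blocked
```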