2. Agenda
DB2 pureScale technology review
RDMA and low-latency interconnect
Monitoring and tuning bufferpools in pureScale
Architectural features for top performance
Performance metrics
3. DB2 pureScale: Technology Review
[Diagram: Single Database View – clients connect to any member; four members, each writing its own log, share access to one database on shared storage; cluster services (CS) run on every host; primary and secondary CFs sit on the cluster interconnect.]
The DB2 engine runs on several host computers
– The members cooperate with each other to provide coherent access to the
database from any member
Data sharing architecture
– Shared access to database
– Members write to their own logs
– Logs accessible from another host (used during recovery)
Cluster Caching Facility (CF) technology from STG
– Efficient global locking and buffer management
– Synchronous duplexing to secondary ensures availability
Low latency, high speed interconnect
– Special optimizations provide significant advantages on RDMA-capable
interconnects like InfiniBand
Clients connect anywhere … and see a single database
– Clients can connect into any member
– Automatic load balancing and client reroute may change the
underlying physical member to which a client is connected
Integrated cluster services
– Failure detection, recovery automation, cluster file system
– In partnership with STG (GPFS,RSCT) and Tivoli (SA MP)
Leverage IBM’s System z Sysplex Experience and Know-How
4. DB2 pureScale and low-latency interconnect
InfiniBand & uDAPL provide the
low-latency RDMA infrastructure
exploited by pureScale
pureScale currently uses
DDR and QDR IB adapters,
depending on the platform
– Peak throughput of about
2-4 M messages per second
– Provide message latencies
in the 10s of microseconds
or even lower
The InfiniBand development roadmap indicates continued increases in bit rates
[Figure: InfiniBand roadmap, from www.infinibandta.org]
5. Two-level page buffering – data consistency & improved performance
Each member's local bufferpool (LBP) caches both read-only
and updated pages for that member
The shared group bufferpool (GBP) in the CF contains references to every page
in all LBPs across the cluster
– These references ensure consistency across members – the CF knows which
member is interested in which pages, so it can invalidate their LBP copies
when those pages are updated
The GBP also contains copies of all updated pages
from the LBPs
– Sent from the LBP at transaction commit time
– Stored in the GBP & available to members on
demand
– A 30 µs page read request over InfiniBand from
the GBP can be more than 100x faster than
reading from disk
Statistics are kept for tuning
– Found in LBP vs. found in GBP vs. read
from disk
– Useful in tuning GBP / LBP sizes
[Diagram: members M1–M3 and the CF. A member's updated page is sent to the GBP at commit; other members can then read it from the GBP over InfiniBand in tens of microseconds instead of paying a ~5000 µs disk read. Callout: "Expensive disk reads from M1, M2 not required – get the modified page from the CF."]
6. pureScale bufferpool monitoring and tuning
Familiar DB2 hit ratio calculations are useful with pureScale
– HR = (logical reads – physical reads) / logical reads
e.g. (pool_data_l_reads – pool_data_p_reads)/pool_data_l_reads
– As usual, physical reads come from disk, logical reads from the bufferpool
(in pureScale, this means either the LBP or the GBP)
e.g., pool_data_l_reads = pool_data_lbp_pages_found +
pool_data_gbp_l_reads
New metrics in pureScale support breaking this down by LBP & GBP amounts
– pool_data_lbp_pages_found = logical data reads satisfied by the LBP
• i.e., we needed a page, and it was present & valid in the LBP
– pool_data_gbp_l_reads = logical data reads attempted at the GBP
• i.e., either not present or not valid in the LBP, so we needed to go
to the GBP
– pool_data_gbp_p_reads = physical data reads due to page not present in
either the LBP or GBP
• Essentially the same as non-pureScale pool_data_p_reads
– pool_data_gbp_invalid_pages = number of GBP data read attempts due
to an LBP page being present but marked invalid
• An indicator of the rate of GBP updates & their impact on the LBP
Of course, there are index equivalents of all these too (e.g., pool_index_l_reads) –
a sample query pulling these counters follows below
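A minimal sketch of pulling these counters per member, assuming the MON_GET_BUFFERPOOL table function is available in your release (NULL = all bufferpools, -2 = all members):

```sql
-- Sketch: per-member LBP/GBP data-page counters from MON_GET_BUFFERPOOL
SELECT member,
       VARCHAR(bp_name, 20)         AS bufferpool,
       pool_data_l_reads,                 -- total logical data reads
       pool_data_p_reads,                 -- physical (disk) data reads
       pool_data_lbp_pages_found,         -- satisfied in the local bufferpool
       pool_data_gbp_l_reads,             -- attempted at the group bufferpool
       pool_data_gbp_p_reads,             -- missed in both LBP and GBP
       pool_data_gbp_invalid_pages        -- LBP page present but invalidated
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t
ORDER BY member, bufferpool;
```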
7. pureScale bufferpool monitoring
Overall (and non-pureScale) hit ratio
– (pool_data_l_reads – pool_data_p_reads)/pool_data_l_reads
– Great values: 95% for index, 90% for data
– Good values: 80-90% for index, 75-85% for data
LBP hit ratio
– (pool_data_lbp_pages_found / pool_data_l_reads) * 100%
– Generally lower than the overall hit ratio, since it excludes GBP hits
– Factors which may affect it, other than LBP size
• Increases with greater portion of read activity in the system
– Decreasing probability that LBP copies of the page have been invalidated
• May decrease with cluster size
– Increasing probability that another member has invalidated the LBP page
GBP hit ratio
– (pool_data_gbp_l_reads – pool_data_gbp_p_reads) /
pool_data_gbp_l_reads
– A hit here is a read of a previously modified page, so GBP hit ratios are typically
lower than overall hit ratios
• An overall (LBP+GBP) H/R in the high 90s can correspond to a GBP H/R in the
low 80s
– Factors which may affect it, other than GBP size
• Decreases with greater portion of read activity
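A minimal sketch computing all three ratios in one pass, under the same MON_GET_BUFFERPOOL assumptions as the previous query (the CASE expressions guard against division by zero on idle bufferpools):

```sql
-- Sketch: overall, LBP and GBP data hit ratios per member (percent)
SELECT member,
       VARCHAR(bp_name, 20) AS bufferpool,
       CASE WHEN pool_data_l_reads > 0
            THEN DEC((pool_data_l_reads - pool_data_p_reads) * 100.0
                     / pool_data_l_reads, 5, 2) END AS overall_hr,
       CASE WHEN pool_data_l_reads > 0
            THEN DEC(pool_data_lbp_pages_found * 100.0
                     / pool_data_l_reads, 5, 2) END AS lbp_hr,
       CASE WHEN pool_data_gbp_l_reads > 0
            THEN DEC((pool_data_gbp_l_reads - pool_data_gbp_p_reads) * 100.0
                     / pool_data_gbp_l_reads, 5, 2) END AS gbp_hr
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t
ORDER BY member, bufferpool;
```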
8. pureScale bufferpool tuning
Step 1: typical rule-of-thumb for GBP size = 35-40% of Σ( all members’ LBP sizes )
e.g. 4 members, LBP size of 1M pages each -> GBP size of 1.4 to 1.6M pages
NB - don't forget, GBP page size is always 4kB, no matter what the LBP page size is.
– If your workload is very read-heavy (e.g. 90% read), the initial GBP allocation could be in the
20-30% range
– For 2-member clusters, you may want to start with 40-50% of total LBP, vs. 35-40%
Step 2: monitor the overall BP hit ratio as usual, with pool_data_l_reads,
pool_data_p_reads, etc.
– Meets your goals? If yes, then done!
Step 3: check LBP H/R with pool_data_lbp_pages_found/pool_data_l_reads
– Great values: 90% for index, 85% for data
– Good values: 70-80% for index, 65-80% for data
– Increasing LBP size can help increase LBP H/R
– NB – for each 16 extra LBP pages, the GBP needs 1 extra page for registrations
Step 4: check GBP H/R with pool_data_gbp_l_reads, pool_data_gbp_p_reads, etc.
– Great values: 90% for index, 80% for data
– Good values: 65-80% for index, 60-75% for data
– pool_data_l_reads > 10 x pool_data_gbp_l_reads means low GBP
dependence – may mean tuning GBP size in this case is less valuable
– pool_data_gbp_invalid_pages > 25% of pool_data_gbp_l_reads means
GBP is really helping out, and could benefit from extra pages
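A minimal sketch of the Step 4 checks, again assuming MON_GET_BUFFERPOOL is available; the 10x and 25% thresholds are simply the rules of thumb from this slide:

```sql
-- Sketch: flag low GBP dependence and high GBP invalidation rates per member
SELECT member,
       VARCHAR(bp_name, 20) AS bufferpool,
       CASE WHEN pool_data_l_reads > 10 * pool_data_gbp_l_reads
            THEN 'low GBP dependence - GBP size tuning less valuable'
            ELSE 'GBP-dependent workload' END AS gbp_dependence,
       -- invalid_pages * 4 > gbp_l_reads is the ">25%" rule of thumb
       CASE WHEN pool_data_gbp_invalid_pages * 4 > pool_data_gbp_l_reads
            THEN 'many invalidations - GBP could benefit from extra pages'
            ELSE 'invalidation rate modest' END AS gbp_invalidation
FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t
ORDER BY member, bufferpool;
```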
9. pureScale architectural features for optimum performance
Page lock negotiation – or: Psst! Hey buddy, can you pass me that page?
– pureScale page locks are physical locks, indicating which member currently ‘owns’
the page. Picture the following:
• Member A : acquires a page P and modifies a row on it, and continues with its
transaction. ‘A’ holds an exclusive page lock on page P until ‘A’ commits
• Member B : wants to modify a different row on the same page P. What now?
– ‘B’ doesn’t have to wait until ‘A’ commits & releases the page lock
• The CF will negotiate the page back from ‘A’ in the middle of ‘A’s transaction,
on ‘B’s behalf
• Provides far better concurrency & performance than needing to wait for a page
lock until the holder commits.
[Diagram: Member A and Member B, each with its own log and an LBP copy of page P; the CF's global lock manager (GLM) records A's exclusive hold on P and negotiates the page from A to B mid-transaction when B requests it.]
10. pureScale architectural features for optimum performance
Table append cache and index page cache
– What happens in the case of rapid inserts into a single table by
multiple members? Or rapid index updates?
Will it cause the insert page to ‘thrash’ back & forth between the
members, each time one has a new row?
– No - each member sets aside an extent for insertion into the table to
eliminate contention & page thrashing. Similarly for indexes with the
page cache
Lock avoidance
– pureScale exploits cursor stability (CS) locking semantics to
avoid taking locks in many common cases
– Reduces pathlength and saves trips to the CF
– Transparent & always on
11. Notes on storage configuration for performance
GPFS best practices
– Automatically configured by db2cluster command
• Blocksize >= 1 MB (vs. default 64k) provides
noticeably improved performance
• Direct (unbuffered) IO for both logs & tablespace
containers
• SCSI-3 P/R on AIX enables faster disk takeover on
member failure
– Separate paths for logs & tablespaces are recommended
Dominant storage performance factor for pureScale: fast
log writes
– Always important in OLTP
– Extra important in pureScale due to log flushes driven
by page reclaims
– Put logs on separate filesystems and devices from each
other & from tablespaces
– Ideally, log write latency is comfortably under 1 ms
– Possibly even SSDs to keep write latencies as
low as possible
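An indirect way to sanity-check the sub-millisecond goal is to look at agent log-write waits; the sketch below assumes the MON_GET_WORKLOAD table function and its log_disk_wait_time / log_disk_waits_total elements (milliseconds) are available at your level:

```sql
-- Sketch: average agent wait per log disk write, per member (milliseconds)
SELECT member,
       SUM(log_disk_wait_time)   AS total_log_wait_ms,
       SUM(log_disk_waits_total) AS log_disk_waits,
       CASE WHEN SUM(log_disk_waits_total) > 0
            THEN DEC(SUM(log_disk_wait_time) * 1.0
                     / SUM(log_disk_waits_total), 7, 3) END AS avg_wait_ms
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS t
GROUP BY member
ORDER BY member;
```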
12. 12-Member Scalability Example
Moderately heavy transaction processing
workload modeling warehouse & ordering
process
– Write transaction rate of 20%
– Typical read/write ratio of many OLTP
workloads
No cluster awareness in the app
– No affinity
– No partitioning
– No routing of transactions to members
Configuration
– Twelve 8-core p550 members, 64 GB, 5 GHz
– IBM 20Gb/s IB HCAs + 7874-024 IB Switch
– Duplexed PowerHA pureScale across 2 additional
8-core p550s, 64 GB, 5 GHz
– DS8300 storage with 576 15K disks, two 4 Gb FC switches
[Diagram: clients (2-way x345) connect over 1 Gb Ethernet; the p550 members and the p550 cluster caching facilities communicate over the 20 Gb IB pureScale interconnect via a 7874-024 switch; DS8300 storage is attached through two 4 Gb FC switches.]
13. 12-Member Scalability Example – Results
[Chart: throughput relative to 1 member vs. number of members]
– 1.98x @ 2 members
– 3.9x @ 4 members
– 7.6x @ 8 members
– 10.4x @ 12 members
14. DB2 pureScale Architecture Scalability
How far will it scale?
Take a web commerce type workload
– Read mostly but not read only – about 90/10
Don’t make the application cluster aware
– No routing of transactions to members
– Demonstrate transparent application scaling
Scale out to the 128 member limit and measure scalability
15. The 128-member result
[Chart: scalability relative to 1 member at increasing cluster sizes]
– 2, 4 and 8 members: over 95% scalability
– 16 members: over 95% scalability
– 32 members: over 95% scalability
– 64 members: 95% scalability
– 88 members: 90% scalability
– 112 members: 89% scalability
– 128 members: 84% scalability
16. Summary
Performance & scalability are two top goals of
pureScale
– many architectural features were designed
solely to drive the best possible performance
Monitoring and tuning for pureScale extends
existing DB2 interfaces and practices
– e.g., techniques for optimizing GBP/LBP
configuration build on steps already familiar
to DB2 DBAs
The pureScale architecture exploits leading-edge
low-latency interconnects and RDMA to achieve
excellent performance & scalability
– The initial 12- and 128-member proof points are
strong evidence of a successful first release,
with even better things to come!