Presentation on using Solid State Disk (SSD) with Oracle databases, including the 11gR2 DB flash cache and using flash in Exadata. Last given at Collaborate 2014 #clv14.
1. 1 Global Marketing
REMINDER
Check in on the
COLLABORATE mobile app
206: Using Flash SSD to Optimize Oracle Database Performance
Guy Harrison
Executive Director, R&D
Information Management Group
Dell Software
2. 2 Software Group
Agenda
• Brief History of Magnetic Disk
• Solid State Disk (SSD) technologies
• SSD internals
• Oracle DB flash cache architecture
• Performance comparisons
• Exadata flash
• Recommendations and Suggestions
3. 3 Software Group
Introductions
Web: guyharrison.net
Email: guy.harrison@software.dell.com
Twitter: @guyharrison
Google Plus:
https://www.google.com/+GuyHarrison1
16. 16 Software Group
Moore’s law
• Transistor density doubles every 18 months
• Exponential growth is observed in most
electronic components:
–CPU clock speeds
–RAM
–Hard Disk Drive storage density
• But not in mechanical components
–Service time (Seek latency) – limited by actuator arm
speed and disk circumference
–Throughput (rotational latency) – limited by speed of
rotation, circumference and data density
17. 17 Software Group
Disk trends 2001-2009
(Chart) Percentage change, 2001-2009: IO Rate 260; Disk Capacity 1,635; IO/Capacity -630; CPU 1,013; IO/CPU -390.
23. 23 Software Group
Flavours of Solid State Disk
• DDR RAM Drive
• SATA flash drive
• PCI flash drive
• SSD storage server
24. 24 Software Group
PCI SSD vs SATA SSD
• PCI vs SATA
– SATA was designed for traditional disk drives with high latencies
– PCI is designed for high-speed devices
– PCI SSD latency is roughly one third that of SATA SSD
25. 25 Software Group
Dell Express flash
• PCI flash performance can normally only be achieved by attaching a PCI card directly to the server motherboard
• Dell Express Flash exposes the PCIe bus interface through front-loading drive slots, allowing hot swap and installation of PCIe flash
27. Flash SSD internals
Storage hierarchy:
• Cell: one (SLC), two (MLC) or three (TLC) bits
• Page: typically 4-8K
• Block: typically 128K-1M
Writes:
• Reads and first writes require only a single page IO
• Overwriting a page requires an erase and rewrite of the whole block
Write endurance:
• ~100,000 erase cycles for SLC before failure
• 5,000-15,000 erase cycles for MLC
29. 29 Software Group
Flash Disk write degradation
• All blocks empty:
– Write time = 250 µs
• 25% full:
– Write time = (¾ × 250 µs + ¼ × 2,000 µs) = 687 µs
• 75% full:
– Write time = (¼ × 250 µs + ¾ × 2,000 µs) = 1,562 µs
(Diagram: empty vs. partially full flash blocks)
30. 30 Software Group
Data Insert
(Diagram: the SSD controller writing a data insert; legend: valid data page, empty data page, invalid data page, free block pool, used block pool.)
31. 31 Software Group
Data Update
(Diagram: the SSD controller handling a data update; legend: valid data page, empty data page, invalid data page, free block pool, used block pool.)
32. 32 Software Group
Garbage Collection
(Diagram: the SSD controller performing garbage collection; legend: valid data page, empty data page, invalid data page, free block pool, used block pool.)
35. 35 Software Group
Oracle DB flash cache
• Introduced in 11gR2 for OEL
and Solaris only
• Secondary cache maintained
by the DBWR, but only when
idle cycles permit
• Architecture is tolerant of
poor flash write performance
36. 36 Software Group
Buffer cache and free buffer waits
(Diagram: Oracle processes read from disk and from the buffer cache; the DBWR writes dirty blocks from the buffer cache to the database files.)
Free buffer waits often occur when reads are much faster than writes.
37. 37 Software Group
Flash Cache
(Diagram: as above, but the DBWR also writes clean blocks to the flash cache, time permitting, and Oracle processes can read from the flash cache as well as from disk and the buffer cache.)
The DB flash cache architecture is designed to accelerate buffered reads.
38. 38 Software Group
Configuration
• Create a filesystem on the flash device
• Set DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE
• Consider setting FILESYSTEMIO_OPTIONS=SETALL
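A minimal configuration sketch, assuming the flash device is formatted and mounted at /flash (the path and sizes below are hypothetical, not from the presentation):

-- Point the DB flash cache at a file on the flash-backed filesystem
-- (these parameters are static, so a restart is needed for them to take effect)
ALTER SYSTEM SET db_flash_cache_file = '/flash/oradata/flash_cache.dat' SCOPE=SPFILE;
ALTER SYSTEM SET db_flash_cache_size = 100G SCOPE=SPFILE;
-- Enable direct and asynchronous IO against filesystem files
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;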
39. 39 Software Group
Flash KEEP pool
• You can prioritise blocks for important objects using the
FLASH_CACHE clause:
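For example, a minimal sketch (the table and index names are hypothetical):

ALTER TABLE sales STORAGE (FLASH_CACHE KEEP);
ALTER INDEX sales_pk STORAGE (FLASH_CACHE KEEP);
-- And to exclude a less important segment from the flash cache:
ALTER TABLE sales_archive STORAGE (FLASH_CACHE NONE);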
40. 40 Software Group
Oracle DB flash cache statistics
http://guyharrison.squarespace.com/storage/flash_insert_stats.sql
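The script above reports flash cache insert statistics; a rough sketch of the idea, assuming the 11gR2 statistic names begin with "flash cache insert":

SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'flash cache insert%'
ORDER  BY name;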
41. 41 Software Group
Flash Cache Efficiency
http://guyharrison.squarespace.com/storage/flash_time_savings.sql
42. 42 Software Group
Flash cache Contents
http://guyharrison.squarespace.com/storage/flashContents.sql
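A sketch of the sort of query such a script runs, assuming flash-resident buffers appear in V$BH with a flash-related status (e.g. 'flashcur'):

SELECT o.owner, o.object_name, COUNT(*) AS flash_cached_blocks
FROM   v$bh b
       JOIN dba_objects o ON o.data_object_id = b.objd
WHERE  b.status LIKE 'flash%'
GROUP  BY o.owner, o.object_name
ORDER  BY flash_cached_blocks DESC;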
44. 44 Software Group
Test systems
• First system:
– Dell Optiplex, dual-core, 4GB RAM
– 2 x Seagate 7,200 RPM Barracuda SATA HDD
– Intel X-25E SLC SATA SSD
• Second system:
– Dell R510, 2 x quad-core, 32 GB RAM
– 4 x 300GB 15K RPM, 6Gbps Dell SAS HDD
– 1 x FusionIO ioDrive SLC PCI SSD
• Third system:
– Oracle Exadata X2-2 ¼ rack
– 36 x 600 GB 15K RPM SAS HDD
– 12 x 96GB Sun F20 SLC PCI flash cards
• Final system:
– Dell R720, 2 x 8-core 2.7GHz processors, 64 GB RAM
– 16 x 15K RPM HDD in RAID 10
– 1 x Dell Express Flash SLC PCIe SSD
45. 45 Software Group
Performance: indexed reads (X-25)
(Chart) Elapsed time (s): No flash 529.7; Flash cache 143.27; Flash tablespace 48.17. Breakdown: CPU, db file IO, flash cache IO, other.
46. 46 Software Group
Performance: Read/Write (X-25)
(Chart) Elapsed time (s): No flash 3,289; Flash cache 1,693; Flash tablespace 200. Breakdown: CPU, db file IO, write complete waits, free buffer waits, flash cache IO, other.
47. 47 Software Group
Random reads – FusionIO
(Chart) Elapsed time (s): SAS disk, no flash cache 2,211; SAS disk, flash cache 583; Table on SSD 121. Breakdown: CPU, other, db file IO, flash cache IO.
48. 48 Software Group
Updates – Fusion IO
(Chart) Elapsed time (s): SAS disk, no flash cache 6,219; SAS disk, flash cache 1,934; Table on SSD 529. Breakdown: DB CPU, db file IO, log file IO, flash cache IO, free buffer waits, other.
49. 49 Software Group
Buffer Cache bottlenecks
• The flash cache architecture avoids 'free buffer waits' caused by flash IO (the DBWR only writes to flash when time permits), but 'write complete waits' can still occur on hot blocks.
• Free buffer waits are still possible against the database files, because the flash cache accelerates reads but not writes.
50. 50 Software Group
Full table scans
(Chart) Elapsed time (s): SAS disk, no flash cache 418; SAS disk, flash cache 398; Table on SSD 72. Breakdown: CPU, other, db file IO, flash cache IO.
The flash cache doesn't accelerate full table scans because scans use direct path reads, and the flash cache only accelerates buffered reads.
51. 51 Software Group
Sorting – what we expect
(Chart: expected sort time vs. PGA memory available (MB), broken into table/index IO, CPU time and temp segment IO, across the multi-pass disk sort, single-pass disk sort and in-memory sort regions.)
52. 52 Software Group
Disk Sorts – temp tablespace SSD vs HDD
(Chart: elapsed time (s) vs. sort area size for a SAS-based vs. an SSD-based temp tablespace, spanning the multi-pass and single-pass disk sort regions.)
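To put temporary segments on SSD, as in the chart above, a minimal sketch (the mount point, tablespace name and sizes are hypothetical):

CREATE TEMPORARY TABLESPACE temp_ssd
   TEMPFILE '/ssd/oradata/temp_ssd01.dbf' SIZE 32G;
-- Direct the sort-heavy schema at the SSD-based temp tablespace
ALTER USER sales_app TEMPORARY TABLESPACE temp_ssd;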
54. 54 Software Group
Redo performance – Fusion IO
(Chart) Elapsed time (s): SAS-based redo log 292.39; Flash-based redo log 291.93. Breakdown: CPU, log IO.
55. 55 Software Group
Concurrent redo workload (x10)
(Chart) Elapsed time (s) for SAS-based vs. flash-based redo logs, broken into CPU, other and log file IO. Data labels: 55; 1,605; 1,637; 397; 331; 1,944; 1,681.
56. 57 Software Group
Redo logs - redo size
• Marcelle Kratochvil has reported significant improvements for SSD redo when applying LOB updates
• Performance for SSD writing small OLTP-style transactions may differ significantly from large LOB updates:
– Small transactions hit the same block repeatedly, resulting in block erase overheads for most writes.
– When the redo write size exceeds the SSD page size, this overhead is avoided and redo performance on SSD may exceed HDD.
– On the other hand, "in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues." (flashdba)
58. 59 Software Group
Conclusions for redo
• SSD is not a good match for redo
– Sustained sequential writes lead to heavy garbage collection overhead
– Magnetic disk is very good at sequential writes because seek time is minimized
• A very good SSD might provide (very roughly) a 20-30% reduction in redo log sync waits
– At least, that is the best I have seen
– It might provide no benefit at all on a busy system
– It might provide higher benefits on a lightly burdened system
• Very eager to compare data with anyone who has different results
60. 61 Software Group
Flash caching technologies
Dell FluidCache, FusionIO DirectCache, etc.
• Target read-intensive, potentially massive tablespaces; compared with placing specific objects on flash: temp tablespace, hot segments, hot partitions, or the DB flash cache (limited to the size of the SSD)
(Diagram: with a regular block device, the file system/raw devices/ASM sit on a device driver over the LUN; with a caching block device, the FluidCache driver sits between the file system/raw devices/ASM and the LUN.)
61. 62 Software Group
Fusion IO direct cache – Table scans
(Chart) Elapsed time (s): no cache, 1st scan 147; no cache, 2nd scan 147; direct cache on, 1st scan 147; direct cache on, 2nd scan 36. Breakdown: CPU, IO, other.
64. 65 Software Group
Exadata flash storage
• 4 x 96GB PCI flash cards on each storage server (a 4x capacity increase in X3)
• Flash can be configured as:
– Exadata Smart Flash Cache (ESFC)
– Solid state disk available to ASM disk groups
• ESFC is not the same as the DB flash cache:
– Maintained by cellsrv, not the DBWR
– Supports smart scans and full scans if CELL_FLASH_CACHE=KEEP
– Statistics are accessed via the cellcli program
• Considerations for cache vs. SSD are similar
66. 67 Software Group
CELL_FLASH_CACHE_KEEP
• The CELL_FLASH_CACHE setting applies at the segment (table, index, partition) level
• With the default setting only index lookups are cached; smart scans and full table scans are cached only when the KEEP option is applied (example below):

CELL_FLASH_CACHE | Index lookups | Smart scans | Full table scans (not smart)
NONE             | Not cached    | Not cached  | Not cached
DEFAULT          | Cached        | Not cached  | Not cached
KEEP             | Cached        | Cached      | Cached
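For example, a minimal sketch of applying the clause (the segment names are hypothetical):

ALTER TABLE sales STORAGE (CELL_FLASH_CACHE KEEP);
ALTER INDEX sales_pk STORAGE (CELL_FLASH_CACHE KEEP);
-- And to exclude a cold segment from the cell flash cache:
ALTER TABLE sales_archive STORAGE (CELL_FLASH_CACHE NONE);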
67. 68 Software Group
Using Exadata flash as grid disk
• Exadata uses all flash disks as flash cache
• You can modify this configuration and assign some flash disks as grid disks (a configuration sketch follows the diagram below)
(Diagram: in the default configuration all flash cell disks back the flash cache and grid disks are carved only from the SAS disks; in the modified configuration some flash disks are presented as grid disks and assigned to an additional ASM disk group.)
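A rough sketch of the reconfiguration, assuming the flash cache is first shrunk and flash grid disks are created with CellCLI on each cell; all names, sizes and the disk discovery string are hypothetical:

-- On each storage cell (CellCLI, not SQL), roughly:
--   drop flashcache
--   create flashcache all size=200G
--   create griddisk all flashdisk prefix=FLASH
-- Then, from the ASM instance, build a disk group on the flash grid disks:
CREATE DISKGROUP flash_dg NORMAL REDUNDANCY
   DISK 'o/*/FLASH*'
   ATTRIBUTE 'compatible.asm'          = '11.2.0.0.0',
             'compatible.rdbms'        = '11.2.0.0.0',
             'cell.smart_scan_capable' = 'TRUE';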
68. 69 Software Group
Index reads
(Chart) Time (s): SSD tablespace, no cache 9.39; HDD tablespace, default cache 21.64; HDD tablespace, no cache 31.74. Breakdown: CPU time, IO time.
69. 70 Software Group
Full Table scans
(Chart) Elapsed time (s) for 1st and 2nd full table scans against: SSD table, default cache; HDD table, keep cache; HDD table, default cache. Data labels: 2.94; 4.75; 11.27; 3.36; 33.14; 12.45.
Beware of CELL_FLASH_CACHE=KEEP
78. 79 Software Group
Exadata 12c Smart Flash Cache Write-back
• Database writes go to the flash cache
– LRU aging to HDD
– Reads are serviced by flash prior to age-out
– Similar restrictions to the flash cache (smart scans, etc.)
– Will be most effective when "buffer waits" exist
– Random IO writes are less problematic for flash than sequential writes
79. 80 Software Group
Performance tests
(Chart) Elapsed time (s) by flash cache mode: Write-back 1,917.34; Write-through 7,693.62. Breakdown: CPU time, other wait time, free buffer waits, buffer busy waits.
81. 82 Software Group
Recommendations
• Don’t wait for SSD to become as cheap as HDD
– Magnetic HDD will always be cheaper per GB, SSD cheaper per IO
• Consider a mixed or tiered storage strategy
– Using DB flash cache, selective SSD tablespaces or partitions
– Use SSD where your IO bottleneck is greatest and SSD advantage is
significant
• DB flash cache offers an easy way to leverage SSD for OLTP
workloads, but has few advantages for OLAP or Data Warehouse
82. 83 Software Group
How to use SSD
• Database flash cache
– If your bottleneck is single-block (indexed) reads and you are on OEL or Solaris with 11gR2
• Flash tablespace
– Optimize reads/writes against "hot" segments or partitions
• Flash temp tablespace
– If multi-pass disk sorts or hash joins are your bottleneck
• Device cache (Dell FluidCache, FusionIO direct cache)
– If you want to optimize both scans and indexed reads, OR you are not on OEL/Solaris 11gR2
• Exadata uses flash effectively for read AND write optimization
– Consider allocating some of the Exadata flash as an ASM disk group for tablespaces holding hot tables and segments
83. 84 Software Group
Visit the Dell Software Booth
Enter for a chance to
win a Dell Venue Pro 11
tablet
Draw is at
2:45pm
Thursday
84. Please complete the session
evaluation on the mobile app
We appreciate your feedback and insight
guy.harrison@software.dell.com
@guyharrison
Guyharrison.net
Speaker notes
Insanely popular – literally millions of users
Dell R720: 2 sockets of 8-core Genuine Intel(R) CPUs @ 2.70GHz (20,480 KB cache); 64 GB memory. Disks: DATA consists of 16 x 15K RPM HDD in a RAID 10 configuration (8 effective spindles). PCIe SSDs: 2 installed, but only one (/dev/rssdb1) is used for the PCIDATA tablespace.