2. Who am I?
Frits Hoogland
– Working with Oracle products since 1996
– Working with VX Company since 2009
Interests
– Databases, Operating Systems, Application Servers
– Web techniques, TCP/IP, network security
– Technical security, performance
Twitter: @fritshoogland
Blog: http://fritshoogland.wordpress.com
Email: fhoogland@vxcompany.com
Oracle ACE Director
OakTable member
3. What is Exadata?
– Engineered system built specifically for the Oracle database.
– Able to reach a high number of read IOPS and huge bandwidth.
– Has its own patch bundles.
– Validated versions and patch levels across database, clusterware, O/S, storage and firmware.
– Dedicated, private storage for the databases.
– ASM.
– Recent hardware & recent CPUs.
– No virtualisation.
4. Exadata versions
– Oracle database, 64-bit, version >= 11
– ASM, 64-bit, version >= 11
- Exadata communication is a layer in the skgxp code
– Linux OL5 x64
- No UEK kernel is used (except on the X2-8)
5. Exadata hardware
– Intel Xeon server hardware
– InfiniBand, 40 Gb/s
– Oracle cell (storage) server
- Flash to mimic a SAN cache
- High performance disks or high capacity disks
- 600GB, 15k RPM / ~5 ms latency (high performance)
- 2/3TB, 7.2k RPM / ~8 ms latency (high capacity)
6. Flash
– Flash cards are in every storage server
– Total of 384GB per storage server
– Do not confuse the Exadata STORAGE server flash cache with the Oracle database flash cache
– Flash can be configured either as cache (flash cache and flash log), as a diskgroup, or both
– When flash is used as a diskgroup, latency is ~1 ms
- Much faster than disk
- My guess was < 400µs
- 1µs InfiniBand
- 200µs flash IO time
- some time for the storage server
7. Flash
– Flash is restricted to 4 x 96GB = 384GB per storage server.
- Totals (worked out below):
- Q: 1152GB, H: 2688GB, F: 5376GB
- Net (ASM normal redundancy):
- Q: 576GB, H: 1344GB, F: 2688GB
– That is a very limited amount of storage.
– But with flash as a diskgroup there's no cache for PIOs!
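A worked version of these totals, assuming the usual cell counts of 3 (quarter rack), 7 (half rack) and 14 (full rack):
Quarter:  3 x 384GB = 1152GB raw, / 2 (normal redundancy) =  576GB net
Half:     7 x 384GB = 2688GB raw, / 2                     = 1344GB net
Full:    14 x 384GB = 5376GB raw, / 2                     = 2688GB net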
8. Exadata specific features
– The secret sauce of Exadata: the storage server
- Smart Scan
- storage indexes
- EHCC (Exadata Hybrid Columnar Compression) *
- IO Resource Manager (IORM)
9. OLTP
– What does OLTP look like (in general / simplistically)?
– Fetch small amounts of data
- Invoice numbers, client id, product id
- Select single values or small ranges via an index
– Create or update rows
- Sold items on an invoice, payments, order status
- Insert or update values (see the sketch below)
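A minimal SQL sketch of such OLTP statements; the table and column names (orders, order_lines, client_id, ...) are made up for illustration:

-- fetch a small amount of data via an index
SELECT order_id, status
FROM   orders
WHERE  client_id = :client_id;

-- create or update rows
INSERT INTO order_lines (order_id, product_id, quantity)
VALUES (:order_id, :product_id, :quantity);

UPDATE orders
SET    status = 'PAID'
WHERE  order_id = :order_id;

COMMIT;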
10. SLOB
– A great way to mimic or measure OLTP performance is SLOB
– Silly Little Oracle Benchmark
– Author: Kevin Closson
– http://oaktable.net/articles/slob-silly-little-oracle-benchmark
11. SLOB
– It can do reading:
-- core of the SLOB reader: 5000 random 256-row index range scans on cf1
FOR i IN 1..5000 LOOP
  v_r := dbms_random.value(257, 10000);
  SELECT COUNT(c2) INTO x
  FROM cf1 WHERE custid > v_r - 256 AND custid < v_r;
END LOOP;
12. SLOB
– And writing:
-- core of the SLOB writer: 500 random 256-row updates on cf1, committed per iteration
FOR i IN 1..500 LOOP
  v_r := dbms_random.value(257, 10000);
  UPDATE cf1 SET
    c2 = 'AAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBBAAAAAAAABBBBBBBB',
    ....up to column 20 (c20)....
  WHERE custid > v_r - 256 AND custid < v_r;
  COMMIT;
END LOOP;
19. 1 reader conclusion
- The time spent on PIO is 15% to 45%
- The majority of time is spent on LIO/CPU
- Because the main portion is CPU, the fastest CPU "wins"
- Actually: the fastest CPU, memory bus and memory.
20. LIO benchmark
– Let's do a pre-warmed cache run
- Pre-warmed means: no PIO, the data is already in the buffer cache (BC)
- This means ONLY LIO speed is measured (a way to verify this is sketched below)
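One way to verify that a run is really PIO-free (a sketch, not part of SLOB itself) is to check the 'physical reads' session statistic before and after the run and confirm it does not grow:

-- physical reads done by the current session
SELECT n.name, s.value
FROM   v$mystat s, v$statname n
WHERE  s.statistic# = n.statistic#
AND    n.name = 'physical reads';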
26. LIO benchmark
The difference in core count and the slower memory show when the number of readers exceeds the core count.
With the same memory speed: CPU speed matters less with more concurrency.
28. LIO benchmark
Fewer cores and slower memory make LIO processing increasingly slower with more concurrency.
For LIO processing, ODA (non-Exadata) versus Exadata does not matter.
29. – Conclusion:
- LIO performance is impacted by:
- CPU speed
- Number of sockets and cores
- L1/2/3 cache sizes
- Memory speed
- Exadata does not matter here!
- When comparing entirely different systems also consider:
- Oracle version
- O/S and version (scheduling)
- Hyper threading / CPU architecture
- NUMA (Exadata/ODA: no NUMA!)
30. – But how about physical IO?
- Lower the buffer cache to 4M
- sga_max_size to 1g
- cpu_count to 1
- db_cache_size to 1M (results in 4M)
- SLOB run with 1 reader (parameter sketch below)
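A sketch of the corresponding parameter changes, assuming an spfile and an instance restart afterwards (sga_max_size is not dynamic):

ALTER SYSTEM SET sga_max_size  = 1G SCOPE = SPFILE;
ALTER SYSTEM SET cpu_count     = 1  SCOPE = SPFILE;
ALTER SYSTEM SET db_cache_size = 1M SCOPE = SPFILE;
-- restart the instance, then start the SLOB run with a single reader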
31. The V2 is the slowest with 106 seconds.
The X2 is only a little slower than the fastest, with 76 seconds.
Surprise! The ODA is the fastest here with 73 seconds.
32.   Total time (s)  CPU time (s)  IO time (s)
ODA              73            17           60
X2               76            33           55
V2              106            52           52
– Average IO latency ODA: 60 s / 1,264,355 IOs = 0.047 ms
– Average IO latency X2:  55 s / 1,265,602 IOs = 0.043 ms
– Average IO latency V2:  52 s / 1,240,941 IOs = 0.042 ms
(the same figure can be derived from the wait interface, see the sketch below)
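A hedged sketch of deriving that latency from the wait interface (on Exadata the event is 'cell single block physical read', on non-Exadata hardware 'db file sequential read'):

-- average single block read latency in ms = total wait time / number of waits
SELECT event,
       total_waits,
       time_waited_micro / total_waits / 1000 AS avg_ms
FROM   v$system_event
WHERE  event IN ('cell single block physical read', 'db file sequential read');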
33. – This is not random disk IO!
- Average latency of random IO on a 15k rpm disk: ~5 ms
- Average latency of random IO on a 7.2k rpm disk: ~8 ms
– So this must come from a cache, or it is not random disk IO
- Exadata has the flash cache.
- On the ODA, the data is probably very nearby on disk.
34.   Total time (s)  CPU time (s)  IO time (s)
ODA              73            17           60
X2               76            33           55
V2              106            52           52
- Exadata IO takes (way) more CPU.
- Roughly the same time is spent on doing IOs.
36. Now the IO response time on the ODA is way higher than on Exadata (3008 s).
Both Exadatas perform alike: X2 581 s, V2 588 s.
37.   Total time (s)  CPU time (s)  IO time (s)
ODA            3008           600        29428
X2              581           848         5213
V2              588          1388         4866
– Average IO latency ODA: 29428 s / 13,879,603 IOs = 2.120 ms
– Average IO latency X2:   5213 s / 14,045,812 IOs = 0.371 ms
– Average IO latency V2:   4866 s / 14,170,303 IOs = 0.343 ms
40.   Total time (s)  CPU time (s)  IO time (s)
ODA            4503          1377        88756
X2              721          2069        13010
V2              747          3373        12405
– Average IO latency ODA: 88756 s / 28,246,604 IOs = 0.003142 s = 3.142 ms
– Average IO latency X2:  13010 s / 28,789,330 IOs = 0.000452 s = 0.452 ms
– Average IO latency V2:  12405 s / 28,766,804 IOs = 0.000431 s = 0.431 ms
42. The ODA's (20 x 15k rpm HDDs) disk capacity is saturated, so response time increases with more readers.
The flash cache is not saturated, so the IO response time with 10-20-30 readers increases very little.
44. The ODA response time increases more or less linearly.
The V2 response time (with more flash cache: 7 x 384GB!) starts increasing at 70 readers. A bottleneck is showing up!
The X2 flash cache (3 x 384GB) is not saturated, so there is little increase in response time.
45. IOPS view instead of response time
3 x 384GB of flash cache and InfiniBand can serve > 115,851 read IOPS!
This V2 has more flash cache, so the decline in read IOPS is probably due to something else!
The ODA maxed out at ~11,200 read IOPS.
46. - V2 top 5 timed events with 80 readers:
Event                           Waits        Time(s)  Avg wait(ms)  %DB time  Wait Class
------------------------------  -----------  -------  ------------  --------  ----------
cell single block physical rea  102,345,354   56,614             1      47.1  User I/O
latch: cache buffers lru chain   27,187,317   33,471             1      27.8  Other       44.1%
latch: cache buffers chains      14,736,819   19,594             1      16.3  Concurrenc
DB CPU                                         13,427                   11.2
wait list latch free                932,930       553             1        .5  Other
- X2 top 5 timed events with 80 readers:
Event                           Waits        Time(s)  Avg wait(ms)  %DB time  Wait Class
------------------------------  -----------  -------  ------------  --------  ----------
cell single block physical rea  102,899,953   68,209             1      87.9  User I/O
DB CPU                                          9,297                   12.0
latch: cache buffers lru chain   10,917,303    1,585             0       2.0  Other        2.9%
latch: cache buffers chains       2,048,395      698             0        .9  Concurrenc
cell list of blocks physical r      368,795      522             1        .7  User I/O
48.             1 LIO   80 LIO   1 PIO   80 PIO
ODA                 1       11      73     9795
X2 (HC disks)       2       11      76      976
V2 (HP disks)       3       22     106     1518
49.    1 LIO   80 LIO   1 PIO   80 PIO
ODA        1       11      73     9795
X2         2       11      76      976
V2         3       22     106     1518

       1 PIO w/o flashcache   80 PIO w/o flashcache
ODA                      73                    9795
X2                      167                       ?
V2                      118                    5098
50. - For scalability, OLTP needs buffered IO (LIO)
- The flash cache is EXTREMELY important for physical IO scalability
- Never, ever, let flash be used for something else
- Unless you can always keep all your small reads in cache
- Flash mimics a SAN/NAS cache
- So nothing groundbreaking here, it does what current, normal infrastructure should do too...
- The bandwidth needed to deliver the data to the database is provided by InfiniBand
- 1 Gb Ethernet = 120MB/s, 4 Gb Fibre Channel = 400MB/s (rough arithmetic below)
- InfiniBand is generally available.
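Rough, hedged arithmetic behind those bandwidth numbers:
1 Gb/s Ethernet          : 1 Gb/s / 8 = 125 MB/s, ~120 MB/s in practice
4 Gb/s Fibre Channel     : ~400 MB/s of payload after 8b/10b encoding
40 Gb/s InfiniBand (QDR) : ~32 Gb/s of data after 8b/10b encoding = ~4 GB/s per port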
51. – How many IOPS can a single cell do?
- According to https://blogs.oracle.com/mrbenchmark/entry/inside_the_sun_oracle_database
- A single cell can do 75,000 IOPS from flash (8kB)
- Personal calculation: 60,000 IOPS with 8kB
– Flash cache
- Caches mostly small reads & writes (8kB and less)
- Large multiblock reads are not cached, unless the segment property 'cell_flash_cache' is set to 'keep' (example below).
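A sketch of setting that segment property; the table name t is just an example:

-- keep this segment's blocks in the cell flash cache, including large reads
ALTER TABLE t STORAGE (CELL_FLASH_CACHE KEEP);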
52. – Is Exadata a good idea for OLTP?
- From a strictly technical point of view, there is no benefit.
– But...
– Exadata gives you IORM
– Exadata gives you reasonably up-to-date hardware
– Exadata gives a system engineered for performance
– Exadata gives you dedicated disks
– Exadata gives a validated combination of database,
clusterware, operating system, hardware, firmware.
53. – Exadata storage servers provide NO redundancy for data
- That's a function of ASM
– Exadata is configured with either
- Normal redundancy (mirroring) or
- High redundancy (triple mirroring)
– to provide data redundancy (sketch below).
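A minimal sketch of how such a diskgroup is created; the grid disk wildcard and attribute values are illustrative:

CREATE DISKGROUP data NORMAL REDUNDANCY
  DISK 'o/*/DATA*'
  ATTRIBUTE 'compatible.asm' = '11.2.0.0.0',
            'cell.smart_scan_capable' = 'TRUE';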
54. – Reading has no problem with normal/high redundancy.
– During writes, all two or three AU copies need to be written.
– This means when you calculate write throughput, you need to double all physical writes if you are using normal redundancy (worked example below).
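A hedged worked example: if the database issues 5,000 write IOPS, the cells have to absorb roughly 5,000 x 2 = 10,000 physical write IOPS under normal redundancy (and 5,000 x 3 = 15,000 under high redundancy).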
55. – But we've got flash! Right?
– Yes, you've got flash. But it probably doesn't do what you think it does:
58. – Please mind these are cumulative numbers!
– The half rack is a POC machine, with no heavy usage between POCs.
– The quarter rack has had some load, but definitely not heavy OLTP.
– I can imagine flash log can prevent long write times if disk IOs queue.
- A normally configured database on Exadata has its online redo in the DATA and in the RECO diskgroup
- Normal redundancy means every log write must be done 4 times (breakdown below)
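The breakdown behind that factor of 4: one redo log member in DATA plus one in RECO is 2 writes, and ASM normal redundancy mirrors each of them once, so 2 x 2 = 4 physical writes per log write.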
60. – Log file write response time on Exadata is not in the same range as reads.
– There's the flash log feature, but it does not work as the whitepaper explains.
– Be careful with heavy writing on Exadata.
- There's no Exadata-specific improvement for writes.