7. gluent.com 7
Spinning disk IO throughput
• B-Tree index-walking disk-based RDBMS
• 15000 rpm spinning disks
• ~200 random IOPS per disk
• ~8kB read per random IO
• 8 kB * 200 IOPS = 1.6 MB/sec per disk
• Full scanning based workloads
• Potentially much more data to access & filter
• Partition pruning, zonemaps, storage indexes help to skip data [1]
• Scan only required columns (formats with large chunk sizes)
• Sequential IO rate up to 200 MB/sec per disk
[1] http://www.dbms2.com/2013/05/27/data-skipping/
However, index scans can read only a subset of data
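The arithmetic on this slide can be sketched in a few lines. This is a rough illustration only: the function names, the 8000 MB table size and the break-even calculation are assumptions added here, and the model ignores caching, multi-block index reads and multiple disks.

```python
# Figures taken from the slide; one disk, no caching (simplifying assumption).
RANDOM_IOPS = 200        # random IOs per second per disk
IO_SIZE_KB = 8           # kB read per random IO
SEQ_MB_PER_SEC = 200.0   # sequential scan rate per disk


def random_mb_per_sec(iops=RANDOM_IOPS, io_kb=IO_SIZE_KB):
    """Throughput of random single-block reads in MB/sec (1 MB = 1000 kB)."""
    return iops * io_kb / 1000


def scan_seconds(total_mb, mb_per_sec):
    """Time needed to read total_mb at the given rate."""
    return total_mb / mb_per_sec


def break_even_fraction(random_rate, seq_rate):
    """Fraction of the data at which random reads cost as much as a full scan."""
    return random_rate / seq_rate


TABLE_MB = 8000  # hypothetical 8 GB table, used only for illustration

rand_rate = random_mb_per_sec()
print("random read rate, MB/sec:", rand_rate)
print("full scan, seconds:", scan_seconds(TABLE_MB, SEQ_MB_PER_SEC))
print("same data via random IO, seconds:", scan_seconds(TABLE_MB, rand_rate))
print("break-even fraction:", break_even_fraction(rand_rate, SEQ_MB_PER_SEC))
```

Under these assumptions, random single-block reads win only when they touch less than about 1% of the data, which is why data skipping matters so much for full scans.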
12. gluent.com 12
A simple Data Retrieval test!
• Retrieve 1% of the rows from an 8 GB table:
SELECT
COUNT(*)
, SUM(order_total)
FROM
orders
WHERE
warehouse_id BETWEEN 500 AND 510
The Warehouse IDs range between 1 and 999
Test data generated by SwingBench tool
13. gluent.com 13
Data Retrieval: Test Results
• Remember, this is a very simple scanning + filtering query:
TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0
Test 5 & Test 6 run entirely from memory
Source:
http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
But why a 50x difference in CPU usage?
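The 50x figure comes straight from the CPU_MS column of the results table. A minimal sketch of that comparison follows; the dictionary layout and the `ratio` helper are illustrative names introduced here, and the numbers are copied from the table.

```python
# Elapsed ms, CPU ms, logical IOs and blocks read, copied from the results table.
RESULTS = {
    "test1": {"ela_ms": 265203, "cpu_ms": 37438, "lios": 782858, "blk_read": 511231},
    "test2": {"ela_ms": 132075, "cpu_ms": 48944, "lios": 1013913, "blk_read": 849316},
    "test3": {"ela_ms": 15567, "cpu_ms": 11808, "lios": 1013873, "blk_read": 1013850},
    "test4": {"ela_ms": 2102, "cpu_ms": 729, "lios": 1013873, "blk_read": 1013850},
    "test5": {"ela_ms": 155, "cpu_ms": 155, "lios": 14, "blk_read": 0},
    "test6": {"ela_ms": 7850, "cpu_ms": 7831, "lios": 1014741, "blk_read": 0},
}


def ratio(metric, slow, fast):
    """How many times larger `metric` is for test `slow` than for test `fast`."""
    return RESULTS[slow][metric] / RESULTS[fast][metric]


# Both tests read no blocks from disk, so the gap is CPU work on in-memory data.
print("buffer cache vs in-memory CPU:", round(ratio("cpu_ms", "test6", "test5"), 1))
# Index range scan vs smart scan elapsed time.
print("index scan vs smart scan elapsed:", round(ratio("ela_ms", "test1", "test4"), 1))
```

Test 5 and Test 6 both show zero BLK_READ, so the difference between them cannot be explained by disk IO.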
18. gluent.com 18
CPU & cache friendly data structures are key!
[Diagram: layout of a row-format data block. Block headers and ITL entries come first, followed by a row directory holding the offset of each row (#0, #1, #2, #3, …). The rows themselves (#0 to #8, …) each start with a row header. Each row consists of a header byte, a lock byte and a column count (CC) byte, followed by repeated pairs of column length and column data.]
• OLTP: Block->Row->Column format
• 8kB blocks
• Great for writes, changes
• Field-length encoding
• Reading column #100 requires walking through all preceding columns
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
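To see why field-length encoding forces a walk through preceding columns, here is a minimal sketch of a length-prefixed row. It is a deliberately simplified, hypothetical format (one length byte per column, no row header), not Oracle's actual block layout; `encode_row` and `read_column` are names made up for this example.

```python
def encode_row(values):
    """Encode a row as repeated (1-byte length, data) pairs. Values must be < 256 bytes."""
    out = bytearray()
    for v in values:
        out.append(len(v))   # column length byte
        out += v             # column data
    return bytes(out)


def read_column(row, col_no):
    """Return (value, fields_skipped) for the 0-based column number col_no.

    There is no per-column offset table, so every preceding length byte
    has to be read to find out where the wanted column starts.
    """
    pos = 0
    skipped = 0
    for _ in range(col_no):
        pos += 1 + row[pos]  # jump over the length byte and its data
        skipped += 1
    length = row[pos]
    return row[pos + 1:pos + 1 + length], skipped


# A 100-column row; the 100th column has index 99.
row = encode_row([f"c{i}".encode() for i in range(100)])
value, skipped = read_column(row, 99)
print(value, skipped)
```

In this sketch, fetching the last of 100 columns touches all 99 preceding length bytes, and that cost is paid again for every row scanned.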
19. gluent.com 19
Scanning columnar data structures
[Diagram: scanning a column in a row-oriented data block vs. scanning a column in a column-oriented compression unit (col 1 to col 6)]
Read filter column(s) first. Access only projected columns if matches found.
Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.
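The "filter column first, projected columns only on a match" idea can be sketched with plain Python lists standing in for column vectors. The `columnar_scan` function, the column names and the sample data are assumptions for illustration; a real engine would work on compressed, SIMD-friendly arrays rather than Python lists.

```python
def columnar_scan(columns, filter_col, predicate, projected):
    """Scan one filter column, then touch projected columns only for matching rows.

    Returns (rows, touched), where touched lists the columns actually accessed.
    """
    touched = [filter_col]
    matches = [i for i, v in enumerate(columns[filter_col]) if predicate(v)]
    if not matches:
        return [], touched            # projected columns are never read
    touched.extend(c for c in projected if c not in touched)
    rows = [tuple(columns[c][i] for c in projected) for i in matches]
    return rows, touched


# Hypothetical column vectors for a tiny orders table.
orders = {
    "warehouse_id": [12, 505, 998, 510, 499],
    "order_total": [10.0, 20.0, 30.0, 40.0, 50.0],
    "customer_name": ["a", "b", "c", "d", "e"],
}

rows, touched = columnar_scan(
    orders, "warehouse_id", lambda w: 500 <= w <= 510, ["order_total"]
)
print(rows, touched)
```

The `customer_name` column is never accessed in this example, which is the reduced memory traffic the slide refers to.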
21. gluent.com 21
CPU Performance Counters on Linux
# perf stat -d -p PID sleep 30
Performance counter stats for process id '34783':
      27373.819908 task-clock               #    0.912 CPUs utilized
    86,428,653,040 cycles                   #    3.157 GHz
    32,115,412,877 instructions             #    0.37  insns per cycle
                                            #    2.39  stalled cycles per insn
     7,386,220,210 branches                 #  269.828 M/sec
        22,056,397 branch-misses            #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend  #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend   #   67.83% backend cycles idle
       256,440,384 cache-references         #    9.368 M/sec
       222,036,981 cache-misses             #   86.584 % of all cache refs
       234,361,189 LLC-loads                #    8.562 M/sec
       218,570,294 LLC-load-misses          #   93.26% of all LL-cache hits
        18,493,582 LLC-stores               #    0.676 M/sec
         3,233,231 LLC-store-misses         #    0.118 M/sec
     7,324,946,042 L1-dcache-loads          #  267.589 M/sec
       305,276,341 L1-dcache-load-misses    #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches     #    1.348 M/sec
      30.000601214 seconds time elapsed
Measure what’s going on inside a CPU!
Metrics explained in my blog entry: http://bit.ly/1PBIlde
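The derived figures in the right-hand column of the perf output are simple ratios of the raw counters. The sketch below recomputes a few of them from the numbers shown above; the dictionary and the `pct` helper are illustrative names, not part of perf.

```python
# Raw counter values copied from the perf stat output above.
COUNTERS = {
    "cycles": 86_428_653_040,
    "instructions": 32_115_412_877,
    "stalled-cycles-frontend": 76_697_049_420,
    "cache-references": 256_440_384,
    "cache-misses": 222_036_981,
    "LLC-loads": 234_361_189,
    "LLC-load-misses": 218_570_294,
    "L1-dcache-loads": 7_324_946_042,
    "L1-dcache-load-misses": 305_276_341,
}


def pct(part, whole):
    """Counter `part` as a percentage of counter `whole`."""
    return 100.0 * COUNTERS[part] / COUNTERS[whole]


ipc = COUNTERS["instructions"] / COUNTERS["cycles"]
stalled_per_insn = COUNTERS["stalled-cycles-frontend"] / COUNTERS["instructions"]

print("instructions per cycle:", round(ipc, 2))
print("stalled cycles per instruction:", round(stalled_per_insn, 2))
print("cache miss %:", round(pct("cache-misses", "cache-references"), 3))
print("LLC load miss %:", round(pct("LLC-load-misses", "LLC-loads"), 2))
print("L1 dcache load miss %:", round(pct("L1-dcache-load-misses", "L1-dcache-loads"), 2))
```

An instructions-per-cycle value this low, combined with most last-level cache loads missing, suggests the process spends much of its time waiting for memory rather than executing instructions.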
22. gluent.com 22
Testing data access path differences on Oracle 12c
SELECT COUNT(cust_valid)
FROM customers_nopart c
WHERE cust_id > 0
Run the same query on the same dataset stored in different formats/layouts.
Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/
Test result data: http://bit.ly/1RitNMr
27. gluent.com 27
Reducing memory writes within SQL execution
• Old approach:
1. Read compressed data chunk
2. Decompress data (write data to temporary memory location)
3. Filter out non-matching rows
4. Return data
• New approach:
1. Read and filter compressed columns
2. Decompress only required columns of matching rows
3. Return data
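The difference between the two approaches can be sketched with dictionary encoding as the compression scheme. That choice is an assumption made for illustration, since the slide does not name a specific encoding, and all function names below are hypothetical. Each function returns the result plus a count of values it had to materialize in temporary memory.

```python
def compress(values):
    """Dictionary-encode a column: returns (sorted dictionary, list of codes)."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def old_approach(filter_col, project_col, wanted):
    """Decompress everything into temporary lists, then filter."""
    f_dict, f_codes = filter_col
    p_dict, p_codes = project_col
    f_vals = [f_dict[c] for c in f_codes]    # temporary copy of the filter column
    p_vals = [p_dict[c] for c in p_codes]    # temporary copy of the projected column
    result = [p for f, p in zip(f_vals, p_vals) if f == wanted]
    return result, len(f_vals) + len(p_vals)


def new_approach(filter_col, project_col, wanted):
    """Filter on the codes, then decompress only projected values of matching rows."""
    f_dict, f_codes = filter_col
    p_dict, p_codes = project_col
    if wanted not in f_dict:
        return [], 0                          # no row can match, nothing to decode
    code = f_dict.index(wanted)               # translate the predicate once
    result = [p_dict[p_codes[i]] for i, c in enumerate(f_codes) if c == code]
    return result, len(result)


warehouse = compress([500, 7, 500, 42, 7, 999])
totals = compress([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
print(old_approach(warehouse, totals, 500))
print(new_approach(warehouse, totals, 500))
```

Both functions return the same rows, but in this sketch the second one writes only the matching values to memory instead of a full decompressed copy of both columns.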
30. gluent.com 30
Some commercial column store history
• Disk-optimized column stores
• Expressway 103 / Sybase IQ (early ’90s)
• MonetDB (early ’90s)
• Oracle Hybrid Columnar Compression (disk/OLTP optimized)
• …
• Memory-optimized column stores
• …
• SAP HANA (December 2010)
• IBM DB2 with BLU Acceleration (June 2013)
• Oracle Database 12c with In-Memory Option (July 2014)
• …
* Not addressing memory-optimized OLTP / row-stores here
31. gluent.com 31
Future-proof Open Data Formats!
• Disk-optimized columnar data structures
• Apache Parquet
• https://parquet.apache.org/
• Apache ORC
• https://orc.apache.org/
• Memory / CPU-cache optimized data structures
• Apache Arrow
• Not only storage format
• … also a cross-system/cross-platform IPC framework
• https://arrow.apache.org/
32. gluent.com 32
Future
1. RAM gets cheaper + bigger, not necessarily faster
2. CPU caches get larger
3. RAM blends with storage and becomes non-volatile
4. IO subsystems (flash) get even closer to CPUs
5. IO latencies shrink
6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!
7. CPU cache is king – new data structures needed!
33. gluent.com 33
References
• Slides & Video of this presentation:
• http://www.slideshare.net/tanelp
• https://vimeo.com/gluent
• Index range scans vs full scans:
• http://blog.tanelpoder.com/2014/09/17/about-index-range-scans-disk-re-reads-and-how-your-new-car-can-go-600-miles-per-hour/
• RAM is the new disk series:
• http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
• https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/