GNW01: In-Memory Processing for Databases

gluent.com 1
In-Memory Execution for Databases
Tanel Poder
a long-time computer performance geek

gluent.com 2
Intro: About me
• Tanel Põder
• Oracle Database Performance geek (18+ years)
• Exadata Performance geek
• Linux Performance geek
• Hadoop Performance geek
• CEO & co-founder: Gluent
• Expert Oracle Exadata book (2nd edition is out now!)
Instant promotion

gluent.com 3
Gluent as a data virtualization layer
[Diagram labels: App X, App Y, App Z; Gluent; Oracle, Teradata, MSSQL, NoSQL, Big Data Sources]
Open Data Formats!

gluent.com 4
Gluent Advisor
1. Analyzes DB storage use and access patterns for safe offloading
2. 500+ Databases analyzed
3. 10+ PB analyzed – 81% offloadable
4. 2-24x query speedup
Interested in analyzing your database?
http://gluent.com/whitepapers

gluent.com 5
Tape is dead, disk is tape, flash is disk, RAM locality is king
Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt

gluent.com 6
Seagate Cheetah 15k RPM disk specs
200 MB/sec!

gluent.com 7
Spinning disk IO throughput
• B-Tree index-walking disk-based RDBMS
  • 15000 rpm spinning disks
  • ~200 random IOPS per disk
  • ~8kB read per random IO
  • 8 kB * 200 IOPS = 1.6 MB/sec per disk
• Full scanning based workloads
  • Potentially much more data to access & filter
  • Partition pruning, zonemaps, storage indexes help to skip data [1]
  • Scan only required columns (formats with large chunk sizes)
  • Sequential IO rate up to 200MB/sec per disk
[1] http://www.dbms2.com/2013/05/27/data-skipping/
However, index scans can read only a subset of data
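The per-disk arithmetic above can be replayed as a small sketch. The constants are the slide's own figures; the function name and the 1 MB = 1000 kB convention are assumptions chosen to match the slide's 1.6 MB/sec result.

```python
# Back-of-the-envelope per-disk throughput, using the figures quoted on the slide.
RANDOM_IOPS = 200             # ~200 random IOPS per 15k rpm disk
RANDOM_IO_KB = 8              # ~8 kB read per random IO
SEQUENTIAL_MB_PER_SEC = 200   # sequential IO rate per disk


def random_io_mb_per_sec(iops: int, io_size_kb: int) -> float:
    """Throughput of random IOs in MB/sec (assuming 1 MB = 1000 kB)."""
    return iops * io_size_kb / 1000


random_mb = random_io_mb_per_sec(RANDOM_IOPS, RANDOM_IO_KB)
ratio = SEQUENTIAL_MB_PER_SEC / random_mb

print(f"random IO:     {random_mb} MB/sec per disk")
print(f"sequential IO: {SEQUENTIAL_MB_PER_SEC} MB/sec per disk")
print(f"sequential is about {ratio:.0f}x faster per disk")
```

The gap only describes raw bytes per second. As the slide notes, an index scan may need to read far fewer bytes in the first place.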

gluent.com 8
Scanning a bunch of spinning disks can keep
your CPUs really busy!
* Not even talking about flash or RAM here!

gluent.com 9
A simple query bottlenecked by CPU
9 GB scanned, processed in 7 seconds:
~1300 MB/s in PX, ~80 MB/s per slave

gluent.com 10
A complex query bottlenecked by CPU
Complex Query: Much more CPU spent on aggregations, joins.
9GB processed in 1.5 minutes
9 GB / 90 seconds = ~100MB/s PX, 6 MB/s per slave
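A rough sketch of the throughput arithmetic for both queries. The slides do not state the parallel degree; the value of 16 PX slaves below is an assumption inferred from the quoted per-slave rates (1300 / 80 and 100 / 6 are both roughly 16).

```python
# Scan throughput for the simple and the complex query (9 GB scanned in both cases).
SCANNED_GB = 9
PX_SLAVES = 16  # assumption: not stated on the slides, inferred from the per-slave rates


def scan_rate_mb_per_sec(scanned_gb: float, elapsed_sec: float) -> float:
    """Average scan rate in MB/sec (1 GB = 1024 MB)."""
    return scanned_gb * 1024 / elapsed_sec


simple_rate = scan_rate_mb_per_sec(SCANNED_GB, 7)     # simple query: 7 seconds
complex_rate = scan_rate_mb_per_sec(SCANNED_GB, 90)   # complex query: 1.5 minutes

for name, rate in (("simple", simple_rate), ("complex", complex_rate)):
    print(f"{name:8s} {rate:7.1f} MB/s total, {rate / PX_SLAVES:5.1f} MB/s per slave")
```

Both results land close to the rounded figures on the slides, which suggests the two runs used the same parallel degree.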

gluent.com 11
If disks and storage subsystems are getting so fast, why all the
buzz around in-memory database systems?
* Can’t we just cache the old database files in RAM?

gluent.com 12
A simple Data Retrieval test!
• Retrieve 1% of rows from an 8 GB table:
SELECT
    COUNT(*)
  , SUM(order_total)
FROM
    orders
WHERE
    warehouse_id BETWEEN 500 AND 510
The Warehouse IDs range between 1 and 999
Test data generated by SwingBench tool
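Where the "1%" comes from, as a minimal sketch. It assumes every warehouse ID between 1 and 999 is present and that orders are spread evenly across warehouses; the slides do not confirm the actual distribution.

```python
# Expected selectivity of: warehouse_id BETWEEN 500 AND 510
LOW, HIGH = 500, 510        # predicate bounds (BETWEEN is inclusive on both ends)
MIN_ID, MAX_ID = 1, 999     # range of warehouse IDs stated on the slide

matching_ids = HIGH - LOW + 1        # 11 warehouse IDs match
distinct_ids = MAX_ID - MIN_ID + 1   # 999 possible warehouse IDs
selectivity = matching_ids / distinct_ids

print(f"{matching_ids} of {distinct_ids} warehouse IDs match: {selectivity:.2%} of rows")
```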

gluent.com 13
Data Retrieval: Test Results
• Remember, this is a very simple scanning + filtering query:
TESTNAME                   PLAN_HASH   ELA_MS   CPU_MS      LIOS  BLK_READ
------------------------- ---------- -------- -------- --------- ---------
test1: index range scan *   16715356   265203    37438    782858    511231
test2: full buffered */ C  630573765   132075    48944   1013913    849316
test3: full direct path *  630573765    15567    11808   1013873   1013850
test4: full smart scan */  630573765     2102      729   1013873   1013850
test5: full inmemory scan  630573765      155      155        14         0
test6: full buffer cache   630573765     7850     7831   1014741         0
Test 5 & Test 6 run entirely from memory
Source: http://www.slideshare.net/tanelp/oracle-database-inmemory-option-in-action
But why 50x difference in CPU usage?
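The "50x" figure can be reproduced from the two in-memory rows of the table. This sketch only restates the table's numbers; the dictionary layout and key names are illustrative.

```python
# Elapsed and CPU milliseconds for the two tests that ran entirely from memory.
results = {
    "test5: full inmemory scan": {"ela_ms": 155, "cpu_ms": 155},
    "test6: full buffer cache": {"ela_ms": 7850, "cpu_ms": 7831},
}

inmemory = results["test5: full inmemory scan"]
buffer_cache = results["test6: full buffer cache"]

cpu_ratio = buffer_cache["cpu_ms"] / inmemory["cpu_ms"]
ela_ratio = buffer_cache["ela_ms"] / inmemory["ela_ms"]

print(f"CPU time:     buffer cache scan used {cpu_ratio:.1f}x more than the in-memory scan")
print(f"Elapsed time: buffer cache scan took {ela_ratio:.1f}x longer")
```

Neither test did any physical reads (BLK_READ = 0), so the difference is not IO. It comes from how the CPU has to walk the data in memory.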

gluent.com 14
Tape is dead, disk is tape, flash is disk, RAM locality is king
Jim Gray, 2006
http://research.microsoft.com/en-us/um/people/gray/talks/flash_is_good.ppt

gluent.com 15
Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
Source:
https://gist.github.com/jboner/2841832
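A few of the multipliers in the right-hand column can be recomputed from the table, as a quick sketch. The values are copied from the table; the helper name is illustrative.

```python
# Selected latencies from the table, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
}


def times_slower(slow: str, fast: str) -> float:
    """How many times slower the first operation is than the second."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(times_slower("L2 cache reference", "L1 cache reference"))       # L2 vs L1
print(times_slower("Main memory reference", "L1 cache reference"))    # RAM vs L1
print(times_slower("Read 1 MB sequentially from SSD",
                   "Read 1 MB sequentially from memory"))             # SSD vs RAM
print(times_slower("Read 1 MB sequentially from disk",
                   "Read 1 MB sequentially from memory"))             # disk vs RAM
```

The RAM vs L1 ratio is the one that matters for the rest of this talk: a main memory reference costs about 200 L1 cache references.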

gluent.com 16
CPU = fast
CPU L2 / L3 cache in between
RAM = slow

gluent.com 17
RAM access is the bottleneck of modern computers
Waits for RAM access show up as CPU usage in monitoring tools
Want to wait less? Do it less!

gluent.com 18
CPU & cache friendly data structures are key!
[Diagram: row-oriented data block layout. Block: Headers, ITL entries; Row Directory (#0 offset, #1 offset, #2 offset, #3 offset, …); rows #0, #1, … #8, …, each with its own row header. Row: Hdr byte, Lock byte, CC byte, then repeated Col. len + Column data pairs.]
• OLTP: Block->Row->Column format
• 8kB blocks
• Great for writes, changes
• Field-length encoding
• Reading column #100 requires walking through all preceding columns
• Columns (with similar values) not densely packed together
• Not CPU cache friendly for analytics!
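A minimal sketch of why field-length encoding forces a walk through the preceding columns. The row layout here (a 1-byte length followed by the column data, with no row header) is a deliberately simplified, hypothetical format, not Oracle's actual block format, and both function names are illustrative.

```python
def encode_row(columns: list[bytes]) -> bytes:
    """Encode a row as repeated (1-byte length, data) pairs. Simplified, hypothetical format."""
    out = bytearray()
    for col in columns:
        if len(col) > 255:
            raise ValueError("this sketch only supports columns up to 255 bytes")
        out.append(len(col))
        out += col
    return bytes(out)


def read_column(row: bytes, col_index: int) -> tuple[bytes, int]:
    """Return (column value, number of length bytes inspected to find it)."""
    offset = 0
    inspected = 0
    # There is no per-column offset table, so every preceding length byte
    # has to be read to learn where the requested column starts.
    for _ in range(col_index):
        offset += 1 + row[offset]
        inspected += 1
    length = row[offset]
    inspected += 1
    return row[offset + 1 : offset + 1 + length], inspected


row = encode_row([f"val{i}".encode() for i in range(100)])
value, inspected = read_column(row, 99)   # the 100th column
print(value, inspected)
```

Reading the 100th column inspects 100 length bytes, and an analytic scan repeats that walk for every row in every block.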

gluent.com 19
Scanning columnar data structures
[Diagram: scanning a column in a row-oriented data block (col 1 … col 6 values interleaved row by row) vs. scanning a column in a column-oriented compression unit (values of each column stored next to each other)]
Read filter column(s) first. Access only projected columns if matches found.
Reduced memory traffic. More sequential RAM access, SIMD on adjacent data.
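The "read filter column(s) first" idea, sketched on a toy column store. The column names, the values and the `filtered_sum` helper are all hypothetical; a real engine would do this with SIMD over compressed vectors rather than a Python loop.

```python
from array import array

# Hypothetical column-oriented unit: one contiguous array per column.
columns = {
    "warehouse_id": array("i", [498, 500, 505, 510, 512, 507]),
    "order_total": array("d", [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]),
    "customer_id": array("i", [1, 2, 3, 4, 5, 6]),  # never touched by the query below
}


def filtered_sum(cols, filter_col, low, high, project_col):
    """COUNT(*) and SUM(project_col) for rows where low <= filter_col <= high."""
    # Step 1: scan only the filter column and remember matching row positions.
    matches = [i for i, v in enumerate(cols[filter_col]) if low <= v <= high]
    # Step 2: visit the projected column only at the matching positions.
    projected = cols[project_col]
    return len(matches), sum(projected[i] for i in matches)


count, total = filtered_sum(columns, "warehouse_id", 500, 510, "order_total")
print(count, total)
```

Columns that the query neither filters on nor projects are never read at all, which is where the reduced memory traffic comes from.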

gluent.com 20
How to measure this stuff?

gluent.com 21
CPU Performance Counters on Linux
# perf stat -d -p PID sleep 30
 Performance counter stats for process id '34783':
      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn
     7,386,220,210 branches                  #  269.828 M/sec
        22,056,397 branch-misses             #    0.30% of all branches
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle
    58,627,393,395 stalled-cycles-backend    #   67.83% backend cycles idle
       256,440,384 cache-references          #    9.368 M/sec
       222,036,981 cache-misses              #   86.584 % of all cache refs
       234,361,189 LLC-loads                 #    8.562 M/sec
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits
        18,493,582 LLC-stores                #    0.676 M/sec
         3,233,231 LLC-store-misses          #    0.118 M/sec
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec
      30.000601214 seconds time elapsed
Measure what’s going on inside a CPU!
Metrics explained in my blog entry: http://bit.ly/1PBIlde
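The derived figures that perf prints after the `#` can be recomputed from the raw counters, which helps when reading them. This sketch uses the counter values from the output above; the variable names are illustrative.

```python
# Raw counters copied from the perf stat output above.
task_clock_ms = 27373.819908
cycles = 86_428_653_040
instructions = 32_115_412_877
stalled_frontend = 76_697_049_420
cache_references = 256_440_384
cache_misses = 222_036_981
l1_loads = 7_324_946_042
l1_load_misses = 305_276_341

ipc = instructions / cycles                        # instructions per cycle
stalls_per_insn = stalled_frontend / instructions  # stalled cycles per instruction
ghz = cycles / (task_clock_ms * 1e6)               # cycles per nanosecond of CPU time
cache_miss_pct = 100 * cache_misses / cache_references
l1_miss_pct = 100 * l1_load_misses / l1_loads

print(f"IPC:                 {ipc:.2f}")
print(f"stalled cycles/insn: {stalls_per_insn:.2f}")
print(f"clock:               {ghz:.3f} GHz")
print(f"cache misses:        {cache_miss_pct:.3f}% of all cache refs")
print(f"L1-dcache misses:    {l1_miss_pct:.2f}%")
```

An IPC of 0.37 combined with an 86% cache miss ratio is the signature of a process that shows up as busy on CPU while mostly waiting for RAM.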

gluent.com 22
Testing data access path differences on Oracle 12c
SELECT COUNT(cust_valid)
FROM customers_nopart c
WHERE cust_id > 0
Run the same query on the same dataset stored in different formats/layouts.
Full details: http://blog.tanelpoder.com/2015/11/30/ram-is-the-new-disk-and-how-to-measure-its-performance-part-3-cpu-instructions-cycles/
Test result data: http://bit.ly/1RitNMr

gluent.com 23
CPU instructions used for scanning/counting 69M rows

gluent.com 24
Average CPU instructions per row processed
• Knowing that the table has about 69M rows, I can calculate the average number of instructions issued per row processed
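The calculation itself is a single division, sketched below. The chart values did not survive in this transcript, so the instruction count used here is a made-up placeholder, not a measured result; only the row count comes from the slide.

```python
ROWS_SCANNED = 69_000_000  # "about 69M rows", from the slide


def instructions_per_row(total_instructions: int, rows: int) -> float:
    """Average number of CPU instructions issued per row processed."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    return total_instructions / rows


# Placeholder input, NOT a measured value: substitute the 'instructions'
# counter reported by perf stat for the scan being analyzed.
example_instructions = 69_000_000_000
per_row = instructions_per_row(example_instructions, ROWS_SCANNED)
print(f"{per_row:.0f} instructions per row")
```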

gluent.com 25
CPU cycles consumed (full scans only)

gluent.com 26
CPU efficiency (Instructions-per-Cycle)
Yes, modern superscalar CPUs can execute multiple instructions per cycle

gluent.com 27
Reducing memory writes within SQL execution
• Old approach:
1. Read compressed data chunk
2. Decompress data (write data to temporary memory location)
3. Filter out non-matching rows
4. Return data
• New approach:
1. Read and filter compressed columns
2. Decompress only required columns of matching rows
3. Return data
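The two approaches can be contrasted on a toy example. Dictionary encoding is used as the compression scheme purely for illustration (the slide does not name one), and the column names, values and both functions are hypothetical.

```python
# Two hypothetical dictionary-encoded columns: distinct values are stored once,
# each row holds a small integer code pointing into the dictionary.
status_dict = ["CANCELLED", "DELIVERED", "PENDING", "SHIPPED"]
status_codes = [3, 1, 1, 2, 0, 1, 3]
amount_dict = [10, 20, 30]
amount_codes = [0, 1, 2, 0, 1, 2, 0]


def old_approach(wanted_status):
    """Decompress everything into temporary lists, then filter."""
    statuses = [status_dict[c] for c in status_codes]   # temporary copy (memory writes)
    amounts = [amount_dict[c] for c in amount_codes]    # temporary copy (memory writes)
    result = [a for s, a in zip(statuses, amounts) if s == wanted_status]
    return result, len(statuses) + len(amounts)


def new_approach(wanted_status):
    """Filter on the compressed codes, decompress only matching rows."""
    if wanted_status not in status_dict:
        return [], 0
    wanted_code = status_dict.index(wanted_status)      # translate the predicate once
    matches = [i for i, c in enumerate(status_codes) if c == wanted_code]
    result = [amount_dict[amount_codes[i]] for i in matches]
    return result, len(result)


old_result, old_materialized = old_approach("DELIVERED")
new_result, new_materialized = new_approach("DELIVERED")
print(old_result, old_materialized)
print(new_result, new_materialized)
```

Both functions return the same rows. The second one materializes far fewer values, which is the reduction in memory writes the slide is pointing at.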

gluent.com 28
Memory reads & writes during internal processing
Unit = MB
Read only requested columns
Rows counted from chunk headers
Scan compressed data: few memory writes

gluent.com 30
Some commercial column store history
• Disk-optimized column stores
  • Expressway 103 / Sybase IQ (early '90s)
  • MonetDB (early '90s)
  • Oracle Hybrid Columnar Compression (disk/OLTP optimized)
  • …
• Memory-optimized column stores
  • …
  • SAP HANA (December 2010)
  • IBM DB2 with BLU Acceleration (June 2013)
  • Oracle Database 12c with In-Memory Option (July 2014)
  • …
* Not addressing memory-optimized OLTP / row-stores here

gluent.com 31
Future-proof Open Data Formats!
• Disk-optimized columnar data structures
  • Apache Parquet
    • https://parquet.apache.org/
  • Apache ORC
    • https://orc.apache.org/
• Memory / CPU-cache optimized data structures
  • Apache Arrow
    • Not only a storage format
    • … also a cross-system/cross-platform IPC framework
    • https://arrow.apache.org/

gluent.com 32
Future
1. RAM gets cheaper + bigger, not necessarily faster
2. CPU caches get larger
3. RAM blends with storage and becomes non-volatile
4. IO subsystems (flash) get even closer to CPUs
5. IO latencies shrink
6. The latency difference between non-volatile storage and volatile RAM shrinks - new database layouts!
7. CPU cache is king – new data structures needed!

gluent.com 33
References
• Slides & Video of this presentation:
  • http://www.slideshare.net/tanelp
  • https://vimeo.com/gluent
• Index range scans vs full scans:
  • http://blog.tanelpoder.com/2014/09/17/about-index-range-scans-disk-re-reads-and-how-your-new-car-can-go-600-miles-per-hour/
• RAM is the new disk series:
  • http://blog.tanelpoder.com/2015/08/09/ram-is-the-new-disk-and-how-to-measure-its-performance-part-1/
  • https://docs.google.com/spreadsheets/d/1ss0rBG8mePAVYP4hlpvjqAAlHnZqmuVmSFbHMLDsjaU/

gluent.com 34
Thanks!
http://gluent.com/whitepapers
We are hiring developers & data engineers!!!
http://blog.tanelpoder.com
tanel@tanelpoder.com
@tanelpoder
