5. The "Cache Out" Curve
Every time we drop out of a cache and use the next slower one down, we pay a big throughput penalty.
[Chart: throughput versus touched data size, dropping in steps as accesses fall out of the CPU cache, the TLB reach, NUMA-remote memory, and finally out to storage.]
6. CPU Caches
[Chart: sequential versus random page CPU cache throughput, in million pages/sec (0 to 1,000), against size of accessed memory (0 to 32 MB), with series for single page, sequential pages and random pages access.]
7. Moore's Law vs. Advancements In Disk Technology
"Transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented"
Spinning disk state of play:
Interfaces have evolved
Areal density has increased
Rotation speed has peaked at 15K RPM
Not much else . . .
Up until NAND flash, disk-based IO subsystems had not kept pace with CPU advancements. With next generation storage (resistive RAM etc.) CPUs and storage may follow the same curve.
8. Control Flow
How do rows travel between iterators?
[Diagram: control flow calls cascade down the plan and data flows back up, row by row, between each pair of iterators.]
9.
Query execution which leverages CPU caches.
Breakthrough levels of compression to bridge the performance gap between IO subsystems and modern processors.
Better query execution scalability as the degree of parallelism increases.
10. First introduced in SQL Server 2012, greatly enhanced in 2014
A batch is roughly 1,000 rows in size and is designed to fit into the L2/L3 cache of the CPU; remember the slide on latency.
Moving batches around is very efficient*:
One test showed that regular row-mode hash join consumed about 600 instructions per row, while the batch-mode hash join needed about 85 instructions per row, and in the best case (small, dense join domain) was as low as 16 instructions per row.
* From: Enhancements To SQL Server Column Stores, Microsoft Research
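The row-versus-batch difference can be illustrated with a toy model (illustrative Python, not the engine's actual code): row-at-a-time execution pays a call overhead on every row, while batch mode amortises that overhead over roughly a thousand rows per call.

```python
# Toy illustration (not SQL Server internals): row-at-a-time execution pays
# per-row call overhead, while batch mode amortises it over ~1,000-row batches.

def sum_row_mode(rows):
    total, calls = 0, 0
    for value in rows:                 # one next() call per row
        calls += 1
        total += value
    return total, calls

def sum_batch_mode(rows, batch_size=1000):
    total, calls = 0, 0
    for i in range(0, len(rows), batch_size):   # one call per batch
        calls += 1
        total += sum(rows[i:i + batch_size])
    return total, calls

values = list(range(10_000))
assert sum_row_mode(values)[0] == sum_batch_mode(values)[0]
print(sum_row_mode(values)[1])    # 10000 calls
print(sum_batch_mode(values)[1])  # 10 calls
```

The per-call bookkeeping is the stand-in here for the hundreds of instructions of iterator overhead the Microsoft Research figures above describe.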
xperf -on base -stackwalk profile
SELECT p.EnglishProductName
,SUM([OrderQuantity])
,SUM([UnitPrice])
,SUM([ExtendedAmount])
,SUM([UnitPriceDiscountPct])
,SUM([DiscountAmount])
,SUM([ProductStandardCost])
,SUM([TotalProductCost])
,SUM([SalesAmount])
,SUM([TaxAmt])
,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON
f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
xperfview stackwalk.etl
xperf -d stackwalk.etl
12.
13. Conceptual View . . .
[Diagram: segments are loaded into the blob cache; blobs are broken into batches and pipelined into the CPU cache.]
. . . and what's happening in the call stack
16.
Compressing data going down the column is far superior to compressing data going across the row; we also only retrieve the column data that is of interest. Run-length compression is used in order to achieve this.
SQL Server 2012 introduced column store compression; SQL Server 2014 adds more features to this.
Colour column: Red, Red, Blue, Blue, Green, Green, Green

Dictionary
Lookup ID   Label
1           Red
2           Blue
3           Green

Segment
Lookup ID   Run Length
1           2
2           2
3           3
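The dictionary and segment above can be reproduced with a short sketch (illustrative Python, not SQL Server's actual storage format):

```python
# Illustrative sketch of dictionary + run-length encoding, as used conceptually
# by column store segments (not SQL Server's real on-disk format).

def encode_column(values):
    dictionary = {}           # value -> lookup ID
    runs = []                 # (lookup ID, run length) pairs
    for value in values:
        lookup_id = dictionary.setdefault(value, len(dictionary) + 1)
        if runs and runs[-1][0] == lookup_id:
            runs[-1] = (lookup_id, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((lookup_id, 1))               # start a new run
    return dictionary, runs

colour = ["Red", "Red", "Blue", "Blue", "Green", "Green", "Green"]
dictionary, runs = encode_column(colour)
print(dictionary)   # {'Red': 1, 'Blue': 2, 'Green': 3}
print(runs)         # [(1, 2), (2, 2), (3, 3)]
```

Seven string values collapse to three dictionary entries and three (ID, length) pairs, which is where the compression comes from on low-cardinality columns.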
17. SQL Server 2014 Column Store Storage Internals
[Diagram: rows arrive in row groups (A, B, C) of fewer than 1,048,576 rows each; each row group's columns are encoded and compressed into segments, which are stored as blobs. Smaller inserts land in delta stores, which are later encoded and compressed in the same way.]
18.
[Diagram: inserts of 1,048,576 rows and over are compressed directly into column store segments, using the global and local dictionaries. Inserts of fewer than 1,048,576 rows, and updates, go to a delta store B-tree; an update = insert into the delta store + insert into the deletion bitmap. The tuple mover later converts full delta stores into column store segments.]
20. Query
SELECT a.number
INTO
OrderedSequence
FROM
master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY a.number
SELECT a.number
INTO
RandomSequence
FROM
master..spt_values AS a
CROSS JOIN master..spt_values AS b
CROSS JOIN master..spt_values AS c
WHERE c.number <= 57
ORDER BY NEWID()
Both tables: 1.5 billion rows, 39,233.86 MB uncompressed.

                   Size After Column Store Compression
OrderedSequence    17.85 MB
RandomSequence     18.48 MB
21.
Feature                                                     SQL Server 2012   SQL Server 2014
Column store indexes                                        Yes               Yes
Clustered column store indexes                              No                Yes
Updateable column store indexes                             No                Yes
Column store archive compression                            No                Yes
Columns in a column store index can be dropped              No                Yes
Support for GUID, binary, datetimeoffset
  precision > 2, numeric precision > 18                     No                Yes
Enhanced compression by storing short strings
  natively (instead of 32-bit IDs)                          No                Yes
Bookmark support (row_group_id:tuple_id)                    No                Yes
Mixed row / batch mode execution                            No                Yes
Optimized hash build and join in a single iterator          No                Yes
Hash memory spills handled without reverting
  to row mode execution                                     No                Yes
Iterators supported in batch mode: scan, filter,
  project, hash (inner) join and (local) hash aggregate     Yes               Yes
22. Disclaimer: your mileage may vary depending on your data, hardware and queries
23. Hardware
2 x 2.0 GHz 6 core Xeon CPUs
Hyper-threading enabled
22 GB memory
RAID 0: 6 x 250 GB SATA III HDD, 10K RPM
RAID 0: 3 x 80 GB Fusion IO
Software
Windows Server 2012
SQL Server 2014 CTP 2
AdventureWorksDW DimProduct table
Enlarged FactInternetSales table
24. Compression Type / Time (ms)
SELECT SUM([OrderQuantity])
,SUM([UnitPrice])
,SUM([ExtendedAmount])
,SUM([UnitPriceDiscountPct])
,SUM([DiscountAmount])
,SUM([ProductStandardCost])
,SUM([TotalProductCost])
,SUM([SalesAmount])
,SUM([TaxAmt])
,SUM([Freight])
FROM [dbo].[FactInternetSales]
[Chart: elapsed time in ms (0 to 300,000) for the query above by compression type. No compression: 2,050 MB/s at 85% CPU; row compression: 678 MB/s at 98% CPU; page compression: 256 MB/s at 98% CPU.]
28. We will look at the best we can do without column store indexes:
Partitioned heap fact table with page compression for spinning disk
Partitioned heap fact table without any compression for flash storage
Non-partitioned column store indexes on both types of storage, with and without archive compression.
SELECT p.EnglishProductName
,SUM([OrderQuantity])
,SUM([UnitPrice])
,SUM([ExtendedAmount])
,SUM([UnitPriceDiscountPct])
,SUM([DiscountAmount])
,SUM([ProductStandardCost])
,SUM([TotalProductCost])
,SUM([SalesAmount])
,SUM([TaxAmt])
,SUM([Freight])
FROM [dbo].[FactInternetSales] f
JOIN [dbo].[DimProduct] p
ON
f.ProductKey = p.ProductKey
GROUP BY p.EnglishProductName
30. Join Scalability DOP / Time (ms)
[Chart: elapsed time in ms (0 to 60,000) against degree of parallelism (2 to 24) for four configurations: hdd column store, hdd column store archive, flash column store, and flash column store archive.]
31.
32. A SQL Server workload should scale up to the limits of the hardware, such that:
All CPU capacity is exhausted
or
All storage IOPS bandwidth is exhausted
As concurrency increases, we need to watch out for "the usual suspects" that can throttle throughput back:
Latch Contention
Lock Contention
Spinlock Contention
36. What most people tend to have:
CPU used for IO consumption + CPU used for decompression < total CPU capacity
Compression works for you.
37.
CPU used for IO consumption + CPU used for decompression > total CPU capacity
Compression works against you.
CPU used for IO consumption + CPU used for decompression = total CPU capacity
Nothing to be gained or lost from using compression.
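These inequalities can be turned into a back-of-envelope model (an illustrative sketch with made-up numbers, not a benchmark): effective scan rate is the lesser of what the IO subsystem can deliver and what the spare CPU can decompress.

```python
# Back-of-envelope model (illustrative only): compression helps while the CPU
# cost of decompression fits inside spare CPU capacity, and hurts otherwise.

def effective_scan_rate(io_mb_s, compression_ratio, decomp_cpu_s_per_gb,
                        spare_cpu_cores):
    """Logical MB/s delivered to the query, given physical IO bandwidth,
    a compression ratio (logical/physical), the CPU seconds needed to
    decompress one GB of logical data, and the cores free to do it."""
    # Rate at which the IO subsystem can deliver logical (decompressed) data.
    io_bound = io_mb_s * compression_ratio
    # Rate at which the spare CPU can decompress logical data.
    cpu_bound = spare_cpu_cores * 1024 / decomp_cpu_s_per_gb
    return min(io_bound, cpu_bound)

# Slow spinning disk: IO-bound, so a 3x ratio triples effective throughput.
print(effective_scan_rate(256, 3.0, 2.0, 4))    # 768.0 (io_bound wins)
# Fast flash with little spare CPU: decompression becomes the bottleneck.
print(effective_scan_rate(2050, 3.0, 2.0, 1))   # 512.0 (cpu_bound wins)
```

In the second case the effective rate falls below the 2,050 MB/s the flash could deliver uncompressed, which is the "compression works against you" scenario above.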
38. No significant difference in terms of performance between column store compression and column store archive compression.
Pre-sorting the data makes little difference to compression ratios.
Batch mode:
Provides a tremendous performance boost with just two schedulers.
Does not provide linear scalability with the hardware available.
Does provide an order-of-magnitude improvement in join performance.
Performs marginally better with column store indexes which do not use archive compression.
39. Enhancements To Column Store Indexes (SQL Server 2014), Microsoft Research
SQL Server Clustered Columnstore Tuple Mover, Remus Rusanu
SQL Server Columnstore Indexes at TechEd 2013, Remus Rusanu
The Effect of CPU Caches and Memory Access Patterns, Thomas Kejser
Note the difference in latency between accessing the on-CPU cache and main memory: accessing main memory incurs a large penalty in terms of lost CPU cycles. This is important, and it is one of the drivers behind the new optimizer mode introduced in SQL Server 2012 to support column store indexes.
This slide follows on from the previous one and quantifies what we lose in terms of throughput as we drop out of the different caches and IO subsystems in the cache / memory / IO subsystem hierarchy.
Each operator has an open(), close() and next() method. An operator pulls data through the plan by calling the next() method on the next operator down; the root operator drives the control flow for the whole plan. Data is moved row by row throughout the entire plan, which is inefficient in terms of CPU instructions per row and prone to expensive CPU cache misses.
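The open()/next()/close() protocol described above can be sketched as follows (a minimal illustration of the row-at-a-time iterator model; Scan and Filter are hypothetical operators, not SQL Server's actual classes):

```python
# Minimal sketch of the row-at-a-time (Volcano-style) iterator model.
# Scan and Filter are hypothetical operators for illustration only.

class Scan:
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):                      # returns one row, or None at the end
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):                      # pulls rows from the child, one by one
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self):
        self.child.close()

# The "root" drives the whole plan by repeatedly calling next().
plan = Filter(Scan([1, 2, 3, 4, 5, 6]), lambda r: r % 2 == 0)
plan.open()
results = []
while (row := plan.next()) is not None:
    results.append(row)
plan.close()
print(results)   # [2, 4, 6]
```

Every row costs at least one virtual call per operator in the plan, which is exactly the per-row overhead that batch mode sets out to amortise.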
Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption up to any point in the stack, in milliseconds.
According to the Microsoft Research paper on the improvements made to column store indexes and batch mode in SQL Server 2014, the large object cache is new; it stores blob data contiguously in memory without any page breaks. The reason for this is that sequential page access through a CPU cache gives twice the throughput of single page access. Refer to slide 73 of "Super Scaling SQL Server: Diagnosing and Fixing Hard Problems" by Thomas Kejser.
The slides on cache, memory and disk latency and "the cache out curve" have been building up to this one particular slide. It is the leveraging of the on-die CPU cache that makes this speed-up possible. The one big takeaway of using column store indexes is on this slide: based on two logical CPUs alone, there are tremendous performance improvements to be had through batch mode.
This slide illustrates the efficiency of the batch mode hash aggregate versus its row mode counterpart; also consider that the time of 78,400 ms is split across two logical CPUs (schedulers).
Run-length compression is a generic term that pre-dates the column store functionality in SQL Server; it refers to the technique of compressing data by converting sequences of values into encoded "run lengths". The database engine scans down a column and stores each unique value it encounters in a dictionary; this can be local to a segment (the basic column store unit of storage, containing roughly one million rows) and / or global to the column store. Where sequences of values are found, they are stored as encoded run lengths. In the example above, the sequence of two red values is stored as (1, 2), and so on.
Delta stores are new to SQL Server 2014 and provide the means by which existing column stores can be inserted into. SQL Server 2014 also introduces column store archive compression. Writes to blobs, which row groups are stored as, are sequential in nature; for trickle inserts, the presence of a delta store (a B-tree) acts as a buffer that mitigates this. Updates take place by setting the deletion bitmap for the column store and inserting a new row into the column store via a delta store.
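The interplay of delta store, deletion bitmap and tuple mover can be sketched with a toy structure (purely illustrative, with a tiny made-up row group size; none of this is SQL Server's actual implementation):

```python
# Toy sketch (hypothetical, not SQL Server internals) of how a delta store,
# deletion bitmap and tuple mover cooperate to make a column store updateable.

ROW_GROUP_SIZE = 4  # tiny for illustration; the real threshold is 1,048,576

class ToyColumnStore:
    def __init__(self):
        self.segments = []        # compressed row groups (stand-in: lists)
        self.delta_store = []     # uncompressed rows awaiting compression
        self.deleted = set()      # (segment_id, row_id) pairs: deletion bitmap

    def insert(self, row):
        self.delta_store.append(row)

    def update(self, segment_id, row_id, new_row):
        # An update = mark deleted in the bitmap + insert into the delta store.
        self.deleted.add((segment_id, row_id))
        self.delta_store.append(new_row)

    def tuple_mover(self):
        # Convert full delta stores into "compressed" segments.
        while len(self.delta_store) >= ROW_GROUP_SIZE:
            group = self.delta_store[:ROW_GROUP_SIZE]
            self.delta_store = self.delta_store[ROW_GROUP_SIZE:]
            self.segments.append(group)

    def scan(self):
        # A scan reads segments (skipping deleted rows) plus the delta store.
        for seg_id, group in enumerate(self.segments):
            for row_id, row in enumerate(group):
                if (seg_id, row_id) not in self.deleted:
                    yield row
        yield from self.delta_store

cs = ToyColumnStore()
for i in range(5):
    cs.insert(f"r{i}")
cs.tuple_mover()              # rows r0-r3 become a segment; r4 stays in delta
cs.update(0, 0, "r0v2")       # delete r0 via the bitmap, re-insert via delta
print(list(cs.scan()))        # ['r1', 'r2', 'r3', 'r4', 'r0v2']
```

The point of the sketch is the division of labour: segments are immutable once compressed, so all change flows through the bitmap and the delta store until the tuple mover catches up.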
Not much difference. Why? Column stores do not store data in sorted order; however, the encoding and compression process can reorder data in order to help achieve better levels of compression.
Something on this slide, specifically around compression on flash, requires further investigation. As the level of compression goes up on flash, IO throughput goes down and CPU consumption goes up. Hypothesis: the flash throughput is so high that the CPU resources required to perform the decompression become a factor and throttle IO throughput back, the net effect being that elapsed execution time is longer.
This is subtly different from the scenario we had with row and page compression. The hypothesis for sustained CPU consumption going down when column store archive compression is used is that column store archive decompression does not scale well across multiple schedulers, and / or that decompressing individual segments is single-threaded in nature.
Something strange happens around a DOP of 10 which I have not yet had the chance to investigate.
With the flash clustered column store index (with / without archive compression) we get reasonable scalability up to a DOP of 10; with the partitioned table on the previous slide, we only got to a DOP of 8 before we started getting diminishing returns.
Elapsed time should decrease in a linear fashion as CPU consumption increases in a linear manner; the obvious explanation for it not doing so is that we are burning CPU cycles in a non-productive manner.
If your IO subsystem does not have enough throughput to keep your processors busy, there is value to be had in using compression. On the other hand, if your IO subsystem can keep up with your CPUs and then some, the use of compression can send performance backwards.