Arrays in database systems, the next frontier?

Martin Kersten
CWI

NLeSC 9 Nov 2011

“We can't solve
problems by using
the same kind of
thinking we used
when we created
them.”


Agenda

A crash course on column-stores

Column stores for science applications

The SciQL array query language


The world of column stores
Motivation

Relational DBMSs have dominated since the late 1970s/1980s
•  Transactional workloads (OLTP, row-wise access)
•  I/O-based processing
•  Ingres, PostgreSQL, MySQL, Oracle, SQL Server, DB2, …

Column stores have dominated product development since 2005
•  Data warehouses and business intelligence applications
•  Startups: Infobright, Aster Data, Greenplum, LucidDB, …
•  Commercial: Microsoft, IBM, SAP, …
MonetDB, the pioneer

Workload changes: Transactions (OLTP) vs ...

Workload changes: ... vs OLAP, BI, Data Mining, ...


Databases hit The Memory Wall

•  Detailed and exhaustive analysis of different workloads on 4 RDBMSs by
   Ailamaki, DeWitt, Hill, and Wood in VLDB 1999:
   “DBMSs On A Modern Processor: Where Does Time Go?”
•  CPU is 60%-90% idle, waiting for memory:
   •  L1 data stalls
   •  L1 instruction stalls
   •  L2 data stalls
   •  TLB stalls
   •  Branch mispredictions
   •  Resource stalls

Hardware Changes: The Memory Wall

Trip to memory = 1000s of instructions!


Storing Relations in MonetDB

[Figure: a relation stored as five columns, one BAT per column; each BAT has a void head whose OIDs start at 1000]

Virtual OID: seqbase=1000 (increment=1)
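
The storage layout above can be sketched in a few lines of Python. This is only an illustration, not MonetDB code: the class name `VoidBAT`, the helper `tuple_at`, and the sample columns are assumptions made for the example. The point it shows is that a void head stores nothing, because the OID of a value is simply `seqbase` plus its position in the tail array.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class VoidBAT:
    """One column: a dense tail array plus a virtual (void) head."""
    seqbase: int      # OID of the first value; OIDs increase by 1
    tail: List[Any]   # the column values, stored consecutively

    def find(self, oid: int) -> Any:
        # Positional lookup: no head column is stored or searched.
        return self.tail[oid - self.seqbase]


# A two-column relation decomposed into two BATs that share the same OIDs.
name = VoidBAT(1000, ["ann", "bob", "cid"])
age = VoidBAT(1000, [31, 27, 45])


def tuple_at(oid: int) -> tuple:
    # Tuple reconstruction aligns the columns by OID.
    return (name.find(oid), age.find(oid))
```

Calling `tuple_at(1001)` reassembles the second tuple from the two columns by position alone.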

BAT Data Structure

BAT: binary association table (Head, Tail)
BUN: binary unit

Head & Tail, BUN heap:
- consecutive memory block (array)
- memory-mapped file

Indexes: hash tables, T-trees, R-trees, ...

Tail Heap:
- best-effort duplicate elimination for strings
  (~ dictionary encoding)
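
A minimal Python sketch of the tail-heap idea for strings follows. The class `StringHeap` and the sample country codes are hypothetical. Note one simplification: this sketch always eliminates duplicates, whereas the slide describes the real mechanism as best-effort.

```python
class StringHeap:
    """Tail heap for strings with duplicate elimination (illustrative)."""

    def __init__(self):
        self.heap = []     # distinct strings, in order of first appearance
        self.index = {}    # string -> offset in the heap

    def put(self, s: str) -> int:
        # Reuse the existing offset if the string is already stored.
        if s not in self.index:
            self.index[s] = len(self.heap)
            self.heap.append(s)
        return self.index[s]


heap = StringHeap()
# The column itself stores small offsets instead of the strings.
column = [heap.put(s) for s in ["NL", "DE", "NL", "NL", "FR"]]
decoded = [heap.heap[off] for off in column]
```

The column now holds five small integers while the heap holds only three strings.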

MonetDB Front-end: SQL

•  SQL 2003
•  Parse SQL into logical n-ary relational algebra tree
•  Translate n-ary relational algebra into logical 2-ary relational algebra
•  Turn logical 2-ary plan into physical 2-ary plan (MAL program)

•  Front-end specific strategic optimization:
   •  Heuristic optimization during all three previous steps
•  Primary key and distinct constraints:
   •  Create and maintain hash indices
•  Foreign key constraints:
   •  Create and maintain foreign key join indices

MonetDB Front-end: SQL
EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;

function user.s2_1():void;
barrier _73 := language.dataflow();
    _2:bat[:oid,:int] := sql.bind("sys","t","c",0);
    _7:bat[:oid,:int] := sql.bind("sys","s","x",0);
    _10 := bat.reverse(_7);
    _11 := algebra.join(_2,_10);
    _13 := algebra.markT(_11,0@0);
    _14 := bat.reverse(_13);        # not on the slide; assumed, since _14 is used below
    _15:bat[:oid,:int] := sql.bind("sys","t","a",0);
    _17 := algebra.leftjoin(_14,_15);
    _18 := bat.reverse(_11);        # not on the slide; assumed, since _18 is used below
    _19 := algebra.markT(_18,0@0);
    _20 := bat.reverse(_19);        # not on the slide; assumed, since _20 is used below
    _21:bat[:oid,:int] := sql.bind("sys","s","z",0);
    _23 := algebra.leftjoin(_20,_21);
exit _73;
    _24 := sql.resultSet(2,1,_17);
    sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
    sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
    _33 := io.stdout();
    sql.exportResult(_33,_24);
end s2_1;
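
The shape of the plan above (join on the two key columns, renumber the matches, then fetch each projected column by OID) can be mimicked in Python. This is a rough sketch under stated assumptions: BATs are modelled as lists of (head, tail) pairs, the functions `reverse`, `join` and `mark_t` are simplified stand-ins for the MAL operators of similar name, and the data is made up.

```python
def reverse(bat):
    # Swap head and tail.
    return [(t, h) for h, t in bat]


def join(left, right):
    # Match left.tail with right.head; keep (left.head, right.tail).
    index = {}
    for h, t in right:
        index.setdefault(h, []).append(t)
    return [(lh, rt) for lh, lt in left for rt in index.get(lt, [])]


def mark_t(bat, base=0):
    # Replace the tail by a fresh dense OID sequence.
    return [(h, base + i) for i, (h, _) in enumerate(bat)]


# Columns as (oid, value) pairs; the contents are illustrative only.
t_c = [(0, 10), (1, 20), (2, 30)]
t_a = [(0, "a0"), (1, "a1"), (2, "a2")]
s_x = [(0, 20), (1, 30), (2, 40)]
s_z = [(0, "z0"), (1, "z1"), (2, "z2")]

matches = join(t_c, reverse(s_x))                       # (t oid, s oid)
a_col = join(reverse(mark_t(matches)), t_a)             # (result oid, t.a)
z_col = join(reverse(mark_t(reverse(matches))), s_z)    # (result oid, s.z)
result = [(a, z) for (_, a), (_, z) in zip(a_col, z_col)]
```

Every step consumes and produces whole two-column tables, which is why the plan never needs a tuple-at-a-time interpreter.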


MonetDB/5 Back-end: MAL
•  MAL: Monet Assembly Language
   •  Textual interface
   •  Interpreted language
•  Designed as system interface language
   •  Reduced, concise syntax
   •  Strict typing
   •  Meant for automatic generation and parsing/rewriting/processing
   •  Not meant to be typed by humans
•  Efficient parser
   •  Low overhead
•  Inherent support for tactical optimization: MAL -> MAL
   •  Support for optimizer plug-ins
   •  Support for runtime schedulers
•  Binary-algebra core
•  Flow control (MAL is computationally complete)

Processing Model (MonetDB Kernel)

•  Bulk processing:
   •  full materialization of all intermediate results

•  Binary (i.e., 2-column) algebra core:
   •  select, join, semijoin, outerjoin
   •  union, intersection, diff (BAT-wise & column-wise)
   •  group, count, max, min, sum, avg
   •  reverse, mirror, mark

•  Runtime operational optimization:
   •  Choosing optimal algorithm & implementation according to
      input properties and system status
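
As a small illustration of runtime operational optimization, the Python sketch below picks a selection algorithm from a property of its input. The function `range_select` and the `is_sorted` flag are assumptions for the example; a real kernel tracks such properties on the column itself.

```python
from bisect import bisect_left, bisect_right


def range_select(values, low, high, is_sorted=False):
    """Return all positions i with low <= values[i] <= high."""
    if is_sorted:
        # Binary search: two probes locate the whole qualifying range.
        return list(range(bisect_left(values, low), bisect_right(values, high)))
    # Fallback: scan the full column.
    return [i for i, v in enumerate(values) if low <= v <= high]
```

Both branches return the complete result at once, in line with the bulk-processing model of fully materialized intermediates.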


Processing Model (MonetDB Kernel)
•  Heavy use of code expansion to reduce cost
   1 algebra operator:      select()
   3 overloaded operators:  select(“=”,value), select(“between”,L,H), select(“fcn”,parm)
   10 operator algorithms:  scan, hash-lookup, bin-search, bin-tree, pos-lookup
   ~1500(!) routines (macro expansion):  scan_range_select_oid_int(), hash_equi_select_void_str(), …
•  ~1500 selection routines
•  149 unary operations
•  335 join/group operations
•  ...

The Software Stack

Front-ends:    XQuery, SQL 03, SciQL, RDF
Optimizers
Back-end(s):   MonetDB 4, MonetDB 5
Kernel:        MonetDB kernel

The Software Stack

Front-ends:    XQuery, SQL 03; strategic optimization; output is MAL
Optimizers:    tactical optimization, MAL -> MAL rewrites
Back-end(s):   MonetDB 4, MonetDB 5; input is MAL
Kernel:        MonetDB kernel; runtime operational optimization

MonetDB vs Traditional DBMS Architecture

•  Architecture-Conscious Query Processing
   - vs Magnetic disk I/O conscious processing
   •  Data layout, algorithms, cost models
•  RISC Relational Algebra (operator-at-a-time)
   - vs Tuple-at-a-time Iterator Model
   •  Faster through simplicity: no tuple expression interpreter
•  Multi-Model: ODMG, SQL, XML/XQuery, ..., RDF/SPARQL
   - vs Relational with Bolt-on Subsystems
   •  Columns as the building block for complex data structures
•  Decoupling of Transactions from Execution/Buffering
   - vs ARIES integrated into Execution/Buffering/Indexing
   •  ACID, but not ARIES. Pay-as-you-need transaction overhead.
•  Run-Time Indexing and Query Optimization
   - vs Static DBA/Workload-driven Optimization & Indexing
   •  Extensible Optimizer Framework
   •  Cracking, recycling, sampling-based runtime optimization

Evolution

It is not the strongest of the
species that survives, nor the
most intelligent, but the one
most responsive to change.

Charles Darwin (1809 - 1882)


SkyServer Schema

[Figure: SkyServer schema diagram. Two tables are annotated: one with 446 columns and >585 million rows, one with 6 columns and >20 billion rows]

Recycler: motivation & idea
“An architecture for recycling intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010

Motivation:
•  Scientific databases, data analytics
•  Terabytes of data (observational, transactional)
•  Prevailing read-only workload
•  Ad-hoc queries with commonalities

Background:
•  Operator-at-a-time execution paradigm
   - Automatic materialization of intermediates
•  Canonical column-store organization
   - Intermediates have reduced dimensionality and finer granularity
   - Simplified overlap analysis

Recycling idea:
•  Instead of garbage collecting, keep the intermediates and reuse them
•  Speed up query streams with commonalities
•  Low cost and self-organization
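
The recycling idea can be sketched as a pool of intermediates keyed by the instruction and its argument values. This Python fragment is a toy under stated assumptions: the class `Recycler`, the helper `select_range`, and the data are invented for the example, and admission and eviction policies are left out entirely.

```python
class Recycler:
    """Keeps intermediates keyed by (operator, arguments); illustrative only."""

    def __init__(self):
        self.pool = {}
        self.hits = 0

    def run(self, op, *args):
        key = (op.__name__, args)
        if key in self.pool:           # same instruction and argument values
            self.hits += 1
            return self.pool[key]
        result = op(*args)             # execute, then keep instead of discarding
        self.pool[key] = result
        return result


def select_range(column, low, high):
    return tuple(v for v in column if low <= v <= high)


recycler = Recycler()
col = (3, 25, 40, 60, 22)
first = recycler.run(select_range, col, 20, 45)
second = recycler.run(select_range, col, 20, 45)    # served from the pool
```

The second call does no work beyond a dictionary lookup, which is where a query stream with commonalities would gain.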

fit into MonetDB (ACM TODS 35(4), Dec. 2010)

[Figure: the SQL and XQuery front-ends emit a MAL plan; the Tactical Optimizer, with the Recycler Optimizer, passes a MAL plan to the Run-time Support layer on top of the MonetDB Kernel; Admission & Eviction manage the Recycle Pool inside the MonetDB Server. The same plan fragment is shown before and after the optimizer:]

function user.s1_2(A0:date, ...):void;
    X5 := sql.bind("sys","lineitem",...);
    X10 := algebra.select(X5,A0);
    X12 := sql.bindIdx("sys","lineitem",...);
    X15 := algebra.join(X10,X12);
    X25 := mtime.addmonths(A1,A2);
    ...

instruction matching (ACM TODS 35(4), Dec. 2010)

Run-time comparison of
•  instruction types
•  argument values

Exact matching:
Y3 := sql.bind("sys","orders","o_orderdate",0);
X1 := sql.bind("sys","orders","o_orderdate",0);
…

Name  Value     Data type         Size
X1    10        :bat[:oid,:date]
T1    “sys”     :str
T2    “orders”  :str
…

instruction subsumption (ACM TODS 35(4), Dec. 2010)

Y3 := algebra.select(X1,20,45);

X3 := …
…
X5 := …

[Figure label: X5]

Name  Value  Data type        Size
X1    10     :bat[:oid,:int]  2000
X3    130    :bat[:oid,:int]  700
X5    150    :bat[:oid,:int]  350
…
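
Subsumption for range selections can be illustrated as follows. The slide elides the arguments of the cached instructions, so every bound and value in this Python sketch is made up, and the rule of preferring the smallest covering intermediate is an assumption suggested by the Size column, not a statement of the published algorithm.

```python
def select_range(column, low, high):
    return [v for v in column if low <= v <= high]


# Pool of cached range selections on one column: (low, high) -> result.
# All bounds and values are hypothetical.
column = [5, 12, 22, 30, 44, 47, 70, 95]
pool = {
    (10, 80): select_range(column, 10, 80),
    (15, 50): select_range(column, 15, 50),
}


def select_with_subsumption(low, high):
    # Candidates are cached ranges that fully cover the requested one.
    covering = [k for k in pool if k[0] <= low and high <= k[1]]
    if not covering:
        return select_range(column, low, high), None
    best = min(covering, key=lambda k: len(pool[k]))    # smallest input wins
    return select_range(pool[best], low, high), best


result, source = select_with_subsumption(20, 45)
```

Here the request for 20..45 is answered by filtering the four-element intermediate for 15..50 rather than rescanning the base column.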


SkyServer evaluation (ACM TODS 35(4), Dec. 2010)

Sloan Digital Sky Survey / SkyServer
http://cas.sdss.org
•  100 GB subset of DR4
•  100-query batch from January 2008 log
•  1.5 GB intermediates, 99% reuse
•  Join intermediates are the major consumer of memory and the major contributor to savings

What is an array?
An array is a systematic arrangement of objects
addressed by dimension values.
Get(A, X, Y,…) => Value
Set(A, X, Y,…) <= Value

There are many species:
vector, bit array, dynamic array, parallel array,
sparse array, variable length array, jagged array
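
The Get/Set definition above maps directly onto a dense array stored in one flat block. The Python sketch below is one possible reading, with a hypothetical class `DenseArray`, row-major layout, and zero-based integer dimensions as assumptions.

```python
class DenseArray:
    """Fixed-shape array stored in one flat list (row-major); illustrative."""

    def __init__(self, shape, default=0):
        self.shape = tuple(shape)
        size = 1
        for n in self.shape:
            size *= n
        self.cells = [default] * size

    def _offset(self, index):
        if len(index) != len(self.shape):
            raise IndexError("wrong number of dimensions")
        off = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("dimension value out of range")
            off = off * n + i
        return off

    def get(self, *index):
        return self.cells[self._offset(index)]

    def set(self, *index, value):
        self.cells[self._offset(index)] = value


a = DenseArray((2, 3))
a.set(1, 2, value=7.5)
```

The dimension values are never stored; they are turned into an offset, much like a void head in a BAT.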


Who needs them anyway?
Seismology – partial time-series
Climate simulation – temporal ordered grid
Astronomy – temporal ordered images
Remote sensing – image processing
Social networks – graph algorithms
Genomics – ordered strings

Scientists ‘love them’: MSEED, NETCDF, FITS, CSV, …

Arrays in DBMS
Relational prototype built on arrays, Peterlee IS (1975)

Persistent programming languages, Astral (1980), Plain (1980)

Object-orientation and persistent languages were the make-believe
answer for handling them, O2 (1992)

PostgreSQL 8.3
Array declarations:
CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][]);
CREATE TABLE tictactoe ( squares integer[3][3] );

Array operations: denotation ([]), contains (@>), is contained in (<@),
append, concat (||), dimension, lower, upper, prepend, to-string,
from-string

Array constraints: none, no enforcement of dimensions.

MySQL
From the MySQL forum, May 2010:
“
> How to store multiple values in a single field? Is there any array
> data type concept in mysql?
>
As Jörg said, "Multiple values in a single field" would be an explicit
violation of the relational model...
”
Is there any experience beyond encoding it as blobs?

Rasdaman
Breaks large C++ arrays (rasters) into disjoint chunks

Maps chunks into large binary objects (blobs)

Provides a function interface to access them

RasQL, an SQL92 extension

Known to work up to 12 TB.

SciDB
Breaks large C++ arrays (rasters) into overlapping chunks

Built storage manager from scratch

Map-reduce processing model

Provides a function interface to access them

AQL, a crippled SQL92

What is the problem?
-  Appropriate array denotations?
-  Functionally complete operation set?
-  Scale?
-  Size limitations due to (blob) representations?
-  Community awareness?

MonetDB SciQL

SciQL (pronounced ‘cycle’ )
•  A backward compatible extension of SQL’03
•  Symbiosis of relational and array paradigm
•  Flexible structure-based grouping
•  Capitalizes on the MonetDB physical array storage
•  Recycling, an adaptive ‘materialized view’
•  Zero-cost attachment contract for cooperative clients
http://www.cwi.nl/~mk/SciQL.pdf


Table vs arrays

CREATE TABLE tmp
A collection of tuples

Indexed by a (primary) key

Default handling

Explicitly created using
INS/UPD/DEL


Table vs arrays

CREATE TABLE tmp                      | CREATE ARRAY tmp
A collection of tuples                | A collection of a priori defined tuples
Indexed by a (primary) key            | Indexed by dimension expressions
Default handling                      | Implicitly defined by default value
Explicitly created using INS/UPD/DEL  | To be updated with INS/DEL/UPD

SciQL examples
CREATE TABLE matrix (
    x integer,
    y integer,
    value float,
    PRIMARY KEY (x,y) );

INSERT INTO matrix VALUES
    (0,0,0),(0,1,0),(1,1,0),(1,0,0);

x y value
0 0 0
0 1 0
1 1 0
1 0 0

SciQL examples
Table:
CREATE TABLE matrix (
    x integer,
    y integer,
    value float,
    PRIMARY KEY (x,y) );

INSERT INTO matrix VALUES
    (0,0,0),(0,1,0),(1,1,0),(1,0,0);

x y value
0 0 0
0 1 0
1 1 0
1 0 0

Array:
CREATE ARRAY matrix (
    x integer DIMENSION[2],
    y integer DIMENSION[2],
    value float DEFAULT 0);

[Figure: a 2 x 2 grid with x and y running over 0 and 1; every cell holds the default 0, and positions outside the declared bounds are null]

SciQL examples

DELETE FROM matrix WHERE y = 1;

Table: the rows with y = 1 are removed
x y value
0 0 0
1 0 0

Array: a hole in the array
[Figure: the cells at y = 1 are null; the cells at y = 0 still hold 0]

SciQL examples

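INSERT INTO matrix VALUES (0,1,1), (1,1,2);

Table: two rows are added
x y value
0 0 0
1 0 0
0 1 1
1 1 2

Array: the cells at y = 1 are filled
[Figure: the cells at y = 1 now hold 1 (x = 0) and 2 (x = 1); the cells at y = 0 still hold 0]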
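
The array half of the examples above behaves differently from the table half: cells exist as soon as the array is created, DELETE leaves holes, and INSERT overwrites cells in place. The Python sketch below imitates that behaviour; the class `SciArray` and its methods are hypothetical and only model a bounded 2-D array whose dimensions start at 0.

```python
class SciArray:
    """2-D bounded array whose cells exist from creation (illustrative)."""

    def __init__(self, nx, ny, default=0.0):
        self.nx, self.ny = nx, ny
        # Every cell is present up front and holds the default value.
        self.cells = {(x, y): default for x in range(nx) for y in range(ny)}

    def delete_where(self, pred):
        # Deleting does not remove cells; it leaves holes (None stands for null).
        for xy in self.cells:
            if pred(*xy):
                self.cells[xy] = None

    def insert(self, rows):
        # Inserting overwrites the cell addressed by the dimension values.
        for x, y, value in rows:
            if (x, y) not in self.cells:
                raise IndexError("outside the declared dimensions")
            self.cells[(x, y)] = value


m = SciArray(2, 2)
m.delete_where(lambda x, y: y == 1)
holes = sorted(xy for xy, v in m.cells.items() if v is None)
m.insert([(0, 1, 1), (1, 1, 2)])
```

The number of cells stays at four throughout, whereas the table version shrinks to two rows and grows back to four.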

SciQL unbounded arrays
Table:  x integer,            y integer,
Array:  x integer DIMENSION,  y integer DIMENSION,

INSERT INTO matrix VALUES (0,2,1), (0,1,2);

Table rows added:
0 2 1
0 1 2

[Figure: the resulting array grid, drawn over x = 0..1 and y = 0..2]

SciQL Dimensions
Unbounded Dimensions
scalar-type DIMENSION

Bounded Dimensions
scalar-type DIMENSION[stop]
scalar-type DIMENSION[first: step: stop]
scalar-type DIMENSION[*: *: *]

timestamp DIMENSION [ timestamp ‘2010-01-19’ : *: timestamp ‘1’
minute]


SciQL table queries

CREATE ARRAY matrix (
x integer DIMENSION,
y integer DIMENSION,
value float DEFAULT 0 );

-- simple checkerboard aggregation
SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0;

SciQL array queries

CREATE ARRAY matrix (
x integer DIMENSION,
y integer DIMENSION,
value float DEFAULT 0 );

-- group based aggregation to construct an unbounded vector
SELECT [x], sum(value) FROM matrix
WHERE (x + y) % 2 = 0
GROUP BY x;
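
For readers more at home in a procedural language, the query above corresponds roughly to the Python below. The 3 x 3 sample contents and the function name `checkerboard_sum_by_x` are assumptions; the array is modelled as a dictionary from (x, y) to value.

```python
from collections import defaultdict

# matrix as {(x, y): value}; the contents are made up for illustration.
matrix = {(x, y): float(x * 3 + y) for x in range(3) for y in range(3)}


def checkerboard_sum_by_x(cells):
    sums = defaultdict(float)
    for (x, y), value in cells.items():
        if (x + y) % 2 == 0:          # WHERE (x + y) % 2 = 0
            sums[x] += value          # GROUP BY x with sum(value)
    # The result is indexed by x, i.e. a one-dimensional array.
    return dict(sorted(sums.items()))


vector = checkerboard_sum_by_x(matrix)
```

The `[x]` in the SELECT list is what turns the grouping column into a dimension of the result instead of an ordinary attribute.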


SciQL array views
CREATE ARRAY vmatrix (
x integer DIMENSION[-1:5],
y integer DIMENSION[-1:5],
value float DEFAULT -1 )
AS SELECT x, y, value FROM matrix;

-1 -1 -1 -1
-1 0 0 -1
-1 0 0 -1
-1 -1 -1 -1


SciQL tiling examples
V0,3 V1,3 V2,3 V3,3

V0,2 V1,2 V2,2 V3,2

V0,1 V1,1 V2,1 V3,1

Anchor
Point V0,0 V1,0 V2,0 V3,0

SELECT x, y, avg(value)
FROM matrix
GROUP BY matrix[x:1:x+2][y:1:y+2];


V0,3 V1,3 V2,3 V3,3

V0,2 V1,2 V2,2 V3,2

V0,1 V1,1 V2,1 V3,1

Anchor
Point V0,0 V1,0 V2,0 V3,0

SELECT x, y, avg(value)
FROM matrix
GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
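
Structural grouping by tiles can be sketched in Python as follows. Several things here are assumptions rather than SciQL definitions: the slice `x:1:x+2` is read as covering x and x+1 (a 2 x 2 tile), cells that fall outside the array are simply skipped, and DISTINCT is read as "tiles do not overlap". The function name `tile_avg` and the sample grid are invented.

```python
def tile_avg(grid, size=2, distinct=False):
    """Average over size x size tiles anchored at (x, y).

    grid[x][y] holds the cell value. Cells outside the array are skipped,
    which is one possible reading of how border tiles are treated.
    """
    nx, ny = len(grid), len(grid[0])
    step = size if distinct else 1    # DISTINCT: tiles do not overlap
    out = {}
    for x in range(0, nx, step):
        for y in range(0, ny, step):
            vals = [grid[i][j]
                    for i in range(x, min(x + size, nx))
                    for j in range(y, min(y + size, ny))]
            out[(x, y)] = sum(vals) / len(vals)
    return out


grid = [[x * 4 + y for y in range(4)] for x in range(4)]    # 4 x 4 sample data
overlapping = tile_avg(grid)
disjoint = tile_avg(grid, distinct=True)
```

Without DISTINCT every cell anchors its own tile, giving 16 groups on a 4 x 4 array; with DISTINCT only 4 groups remain.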


V0,3 V1,3 V2,3 V3,3

Anchor
Point V0,2 V1,2 V2,2 V3,2

V0,1 V1,1 V2,1 V3,1
null

V0,0 V1,0 V2,0 V3,0
null null

SELECT x, y, avg(value)
FROM matrix
GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];

V0,3 V1,3 V2,3 V3,3

Anchor
Point V0,2 V1,2 V2,2 V3,2

V0,1 V1,1 V2,1 V3,1

V0,0 V1,0 V2,0 V3,0

SELECT x, y, avg(value)
FROM matrix
GROUP BY matrix[x][y],
         matrix[x-1][y], matrix[x+1][y],
         matrix[x][y-1], matrix[x][y+1];
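
Grouping a cell with its four direct neighbours is a classic stencil. The Python sketch below assumes the aggregate is an average and that neighbours outside the array are ignored; both are assumptions, and `neighbour_avg` is a made-up name.

```python
def neighbour_avg(grid):
    """Average each cell with its four direct neighbours (where present)."""
    nx, ny = len(grid), len(grid[0])
    out = [[0.0] * ny for _ in range(nx)]
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    for x in range(nx):
        for y in range(ny):
            vals = [grid[x + dx][y + dy]
                    for dx, dy in offsets
                    if 0 <= x + dx < nx and 0 <= y + dy < ny]
            out[x][y] = sum(vals) / len(vals)
    return out


grid = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
smoothed = neighbour_avg(grid)
```

Listing the cells explicitly, as the GROUP BY clause does, allows groups of any shape rather than only rectangular tiles.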

SciQL, A Query Language for Science Applications

•  Seamless integration of array-, set-, and sequence-semantics.
•  Dimension constraints as a declarative means for indexed access to array cells.
•  Structural grouping to generalize the value-based grouping towards selective
   access to groups of cells based on positional relationships for aggregation.

Seismology use case
Rietbrock: Chile earthquake
… 2TB of wave fronts
… filter by STA/LTA
… remove false positives
… window-based 3 min cuts
… heuristic tests
… interactive response required …

How can a database system help?
Scanning 2TB on a modern PC takes >3 hours
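
The scan-time figure above implies a sustained read rate that is easy to work out. In the Python sketch below, the helper name and the one-minute target used for comparison are assumptions; decimal units (1 TB = 10^12 bytes) are assumed as well.

```python
def required_bandwidth_mb_s(total_bytes: float, seconds: float) -> float:
    """Sustained read rate (MB/s, decimal) needed to scan total_bytes in time."""
    return total_bytes / seconds / 1e6


TWO_TB = 2e12                                        # 2 TB in decimal units
three_hours = required_bandwidth_mb_s(TWO_TB, 3 * 3600)
one_minute = required_bandwidth_mb_s(TWO_TB, 60)     # hypothetical interactive target
```

A three-hour scan corresponds to about 185 MB/s, while finishing within a minute would need about 33 GB/s, so an interactive answer has to come from touching less data rather than from scanning faster.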


Use case, a SciQL dream
Rietbrock: Chile earthquake
create array mseed (
    tick timestamp dimension[timestamp ‘2010’:*],
    data decimal(8,6),
    station string );

Rietbrock: … filter by STA/LTA

-- average by window of 5 seconds
select A.tick, avg(A.data)
from mseed A
group by A[tick:1:tick + 5 seconds];

select A.tick
from mseed A, mseed B
where A.tick = B.tick
group by A[tick:tick + 5 seconds],
         B[tick:tick + 15 seconds]
having avg(A.data) / avg(B.data) > delta;

create view candidates(
    station string,
    tick timestamp,
    ratio float ) as
select A.station, A.tick, avg(A.data) / avg(B.data) as ratio
from mseed A, mseed B
where A.tick = B.tick
group by A[tick:tick + 5 seconds],
         B[tick:tick + 15 seconds]
having avg(A.data) / avg(B.data) > delta;
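
The short-window over long-window ratio used in the queries above can be tried out in plain Python. This is a sketch under assumptions: one sample per second, window lengths given in samples, both windows starting at the same tick as in the queries, and made-up data. The names `sta_lta` and `candidates` are illustrative.

```python
def sta_lta(samples, short=5, long=15):
    """Ratio of short-window to long-window average at each start position."""
    ratios = {}
    for i in range(len(samples) - long + 1):
        sta = sum(samples[i:i + short]) / short
        lta = sum(samples[i:i + long]) / long
        ratios[i] = sta / lta if lta else 0.0
    return ratios


def candidates(samples, delta):
    # Keep the start positions whose ratio exceeds the threshold.
    return [i for i, r in sta_lta(samples).items() if r > delta]


quiet = [1.0] * 30
burst = quiet[:10] + [9.0] * 5 + quiet[15:]     # a 5-second event at t = 10
hits = candidates(burst, delta=2.0)
```

Only windows that start on or just before the burst exceed the threshold; the quiet signal yields no candidates at all.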


Rietbrock: … remove false positives
-- remove isolated errors by direct environment
-- using wave propagation statistics

create table neighbors(
    head string,
    tail string,
    delay timestamp,
    weight float);

select A.tick, B.tick
from candidates A, candidates B, neighbors N
where A.station = N.head
and B.station = N.tail
and B.tick = A.tick + N.delay
and B.ratio * N.weight < A.ratio;


delete from candidates
where tick in (
    select A.tick
    from candidates A, candidates B, neighbors N
    where A.station = N.head
    and B.station = N.tail
    and B.tick = A.tick + N.delay
    and B.ratio * N.weight < A.ratio );

Rietbrock: … window-based 3 min cuts
… heuristic tests

select B.station, myfunction(B.data)
from candidates A, mseed B
group by distinct B[tick:tick + 3 minutes];

-- using a User Defined Function written in C.


Use case
Rietbrock: … interactive response required …

The query over 2TB of seismic data will be
handled before he finishes his coffee.


Status
•  The language definition is ‘finished’
•  The grammar is included in SQL parser
•  Semantic checks added to SQL parser
•  A test suite is being built
•  Runtime support features and software stack
•  …
•  Exposure to real life cases and external libraries


Science DBMS landscape
                  | MonetDB 5.23                       | SciDB 0.5            | Rasdaman
Architecture      | Server approach                    | Server approach      | Plugin (Oracle, DB2, Informix, MySQL, PostgreSQL)
Open source       | Mozilla License                    | GPL 3.0, Commercial  | GPL 3.0, Dual license
Downloads         | >12.000 /month                     | Tens up to now       | ??
SQL               | SQL 2003                           | ??                   | SQL92++
Interoperability  | {JO}DBC, C(++), Python, …          | C++ UDF              | C++, Java, OGC
Array language    | SciQL                              | AQL                  | RASQL
Array model       | Fixed+variable bounds              | Fixed arrays         | Fixed+variable bounds
Science           | Linked libraries                   | Linked libraries     | Linked libraries
Foreign files     | Vaults of csv, FITS, NETCDF, MSEED | ??                   | Tiff, png, jpg, .., csv, NETCDF, HDF4, …
Distribution      | 50-200 node cluster                | 4 node cluster       | 20-node
Distribution tech | Dynamic partial replication        | Static fragmentation | Static fragmentation
Executor          | Various schemes                    | Map-reduce           | Tile streaming
Largest demo      | Skyserver SDSS 6 3TB               | ---                  | 12TB, IGN –F (on PostgreSQL)
Storage tuning    | Query adaptive                     | Schema definitions   | Workload driven
Optimization      | Heuristics + cost based            | ??                   | Heuristics + cost based