This presentation was presented by Martin Kersten (CWI), well known in the Dutch eScience and scientific computing community, at the Netherlands eScience Center (NLeSC) on November 9, 2011 in Amsterdam, Netherlands.
Abstract of the presentation:
This presentation gives an introduction to NoSQL (Not only SQL) (pdf) databases with examples from MonetDB and discussed, applications and limitations.
1. Arrays in database systems,
the next frontier ?
Martin Kersten
CWI
NLeSC 9 Nov 2011
2. “We can't solve
problems by using
the same kind of
thinking we used
when we created
them.”
NLeSC 9 Nov 2011
3. Agenda
A crash course on column-stores
Column stores for science applications
The SciQL array query language
NLeSC 9 Nov 2011
4. The world of column stores
Motivation
Relational DBMSs dominate since the late 1970's / 1980's
l Transactional workloads (OLTP, row-wise access)
l I/O based processing
l Ingres, Postgresql, MySQL, Oracle, SQLserver, DB2, …
Column stores dominate product development since 2005
l Datawarehouses and business intelligence applications
l Startups: Infobright, Aster Data, Greenplum, LucidDB,..
l Commercial: Microsoft, IBM, SAP,…
MonetDB, the pioneer
NLeSC 9 Nov 2011
5. The world of column stores
Workload changes: Transactions (OLTP) vs ...
NLeSC 9 Nov 2011
6. The world of column stores
Workload changes: ... vs OLAP, BI, Data Mining, ...
NLeSC 9 Nov 2011
7. The world of column stores
Databases hit The Memory Wall
§ Detailed and exhaustive analysis for different workloads using
4 RDBMSs by Ailamaki, DeWitt, Hill,, Wood in VLDB 1999:
“DBMSs On A Modern Processor: Where Does Time Go?”
§ CPU is 60%-90% idle,
waiting for memory:
§ L1 data stalls
§ L1 instruction stalls
§ L2 data stalls
§ TLB stalls
§ Branch mispredictions
§ Resource stalls
NLeSC 9 Nov 2011
8. The world of column stores
Hardware Changes: The Memory Wall
Trip to memory = 1000s of instructions!
NLeSC 9 Nov 2011
10. BAT Data Structure
BAT:
binary association table
Head Tail
BUN:
binary unit
Hash tables, Head & Tail:
BUN heap:
T-trees, - consecutive memory
R-trees, blocks (arrays)
block (array)
... - memory-mapped file
files
Tail Heap:
- best-effort duplicate
elimination for strings
(~ dictionary encoding)
NLeSC 9 Nov 2011
11. MonetDB Front-end: SQL
l SQL 2003
l Parse SQL into logical n-ary relational algebra tree
l Translate n-ary relational algebra into logical 2-ary relational algebra
l Turn logical 2-ary plan into physical 2-ary plan (MAL program)
l Front-end specific strategic optimization:
l Heuristic optimization during all three previous steps
l Primary key and distinct constraints:
l Create and maintain hash indices
l Foreign key constraints
l Create and maintain foreign key join indices
NLeSC 9 Nov 2011
12. MonetDB Front-end: SQL
EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;
function user.s2_1():void;
barrier _73 := language.dataflow();
_2:bat[:oid,:int] := sql.bind("sys","t","c",0);
_7:bat[:oid,:int] := sql.bind("sys","s","x",0);
_10 := bat.reverse(_7);
_11 := algebra.join(_2,_10);
_13 := algebra.markT(_11,0@0);
_14 := bat.reverse(_13);
_15:bat[:oid,:int] := sql.bind("sys","t","a",0);
_17 := algebra.leftjoin(_14,_15);
_18 := bat.reverse(_11);
_19 := algebra.markT(_18,0@0);
_20 := bat.reverse(_19);
_21:bat[:oid,:int] := sql.bind("sys","s","z",0);
_23 := algebra.leftjoin(_20,_21);
exit _73;
_24 := sql.resultSet(2,1,_17);
sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
_33 := io.stdout();
sql.exportResult(_33,_24);
end s2_1;
NLeSC 9 Nov 2011
13. MonetDB/5 Back-end: MAL
l MAL: Monet Assembly Language
l textual interface
l Interpreted language
l Designed as system interface language
l Reduced, concise syntax
l Strict typing
l Meant for automatic generation and parsing/rewriting/processing
l Not meant to be typed by humans
l Efficient parser
l Low overhead
l Inherent support for tactical optimization: MAL -> MAL
l Support for optimizer plug-ins
l Support for runtime schedulers
l Binary-algebra core
l Flow control (MAL is computational complete)
NLeSC 9 Nov 2011
14. Processing Model (MonetDB Kernel)
l Bulk processing:
l full materialization of all intermediate results
l Binary (i.e., 2-column) algebra core:
l select, join, semijoin, outerjoin
l union, intersection, diff (BAT-wise & column-wise)
l group, count, max, min, sum, avg
l reverse, mirror, mark
l Runtime operational optimization:
l Choosing optimal algorithm & implementation according to
input properties and system status
NLeSC 9 Nov 2011
15. Processing Model (MonetDB Kernel)
l Heavy use of code expansion to reduce cost
1 algebra operator
select()
3 overloaded operators select(“=“,value) select(“between”,L,H)
select(“fcn”,parm)
10 operator algorithms scan hash-lookup bin-search bin-tree pos-lookup
scan_range_select_oid_int(),
~1500(!) routines hash_equi_select_void_str(), …
(macro expansion)
• ~1500 selection routines
• 149 unary operations
• 335 join/group operations
• ...
NLeSC 9 Nov 2011
17. The Software Stack
Strategic optimization
Front-ends XQuery SQL 03 MAL
Optimizers Tactical optimization:
MAL -> MAL rewrites
Back-end(s) MonetDB 4 MonetDB 5 MAL
Runtime
Kernel MonetDB kernel operational
optimization
NLeSC 9 Nov 2011
18. MonetDB vs Traditional DBMS Architecture
l Architecture-Conscious Query Processing
vs Magnetic disk I/O conscious processing
-
l Data layout, algorithms, cost models
l RISC Relational Algebra (operator-at-a-time)
- vs Tuple-at-a-time Iterator Model
l Faster through simplicity: no tuple expression interpreter
l Multi-Model: ODMG, SQL, XML/XQuery, ..., RDF/SPARQL
vs Relational with Bolt-on Subsystems
-
l Columns as the building block for complex data structures
l Decoupling of Transactions from Execution/Buffering
vs ARIES integrated into Execution/Buffering/Indexing
-
l ACID, but not ARIES.. Pay as you need transaction overhead.
l Run-Time Indexing and Query Optimization
- vs Static DBA/Workload-driven Optimization & Indexing
l Extensible Optimizer Framework;
l cracking, recycling, sampling-based runtime optimization
NLeSC 9 Nov 2011
19. Evolution
It is not the strongest of the
species that survives, nor the
most intelligent, but the one
most responsive to change.
Charles Darwin (1809 - 1882)
NLeSC 9 Nov 2011
20. Agenda
A crash course on column-stores
Column stores for science applications
The SciQL array query language
NLeSC 9 Nov 2011
22. SkyServer Schema
446
columns
>585
million
rows
6
columns
>
20
Billion
rows
NLeSC 9 Nov 2011
23. “An architecture for recycling
Recycler intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves.
motivation & idea ACM TODS 35(4), Dec. 2010
Motivation:
l scientific databases, data analytics
l Terabytes of data (observational , transactional)
l Prevailing read-only workload
l Ad-hoc queries with commonalities
Background:
l Operator-at-a-time execution paradigm
Ø Automatic materialization of intermediates
l Canonical column-store organization
Ø Intermediates have reduced dimensionality and finer granularity
Ø Simplified overlap analysis
Recycling idea:
l instead of garbage collecting,
l keep the intermediates and reuse them
l speed up query streams with commonalities
l low cost and self-organization
NLeSC 9 Nov 2011
24. “An architecture for recycling
Recycler intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves.
fit into MonetDB ACM TODS 35(4), Dec. 2010
SQL
XQuery
func6on
user.s1_2(A0:date,
...):void;
X5
:=
sql.bind("sys","lineitem",...);
X10
:=
algebra.select(X5,A0);
X12
:=
sql.bindIdx("sys","lineitem",...);
MAL
X15
:=
algebra.join(X10,X12);
X25
:=
m6me.addmonths(A1,A2);
...
Recycler
Tac6cal
Op6mizer
Op6mizer
func6on
user.s1_2(A0:date,
...):void;
X5
:=
sql.bind("sys","lineitem",...);
MAL
X10
:=
algebra.select(X5,A0);
X12
:=
sql.bindIdx("sys","lineitem",...);
Run-‐6me
Support
X15
:=
algebra.join(X10,X12);
MonetDB
Kernel
X25
:=
m6me.addmonths(A1,A2);
...
Admission
&
Evic6on
MonetDB
Recycle
Pool
Server
NLeSC 9 Nov 2011
25. “An architecture for recycling
Recycler intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves.
instruction matching ACM TODS 35(4), Dec. 2010
Run time comparison of
l instruction types
l argument values
Y3
:=
sql.bind("sys","orders","o_orderdate",0);
Exact
X1
:=
sql.bind("sys","orders","o_orderdate",0);
…
matching
Name Value Data type Size
X1 10 :bat[:oid,:date]
T1 “sys” :str
T2 “orders” :str
…
NLeSC 9 Nov 2011
26. “An architecture for recycling
Recycler intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves.
instruction subsumption ACM TODS 35(4), Dec. 2010
Y3
:=
algebra.select(X1,20,45);
X3
:=
algebra.select(X1,10,80);
…
X5
:=
algebra.select(X1,20,60);
X5
Name Value Data type Size
X1 10 :bat[:oid,:int] 2000
X3 130 :bat[:oid,:int] 700
X5 150 :bat[:oid,:int] 350
…
NLeSC 9 Nov 2011
27. “An architecture for recycling
Recycler intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves.
SkyServer evaluation ACM TODS 35(4), Dec. 2010
Sloan Digital Sky Survey / SkyServer
http://cas.sdss.org
l 100 GB subset of DR4
l 100-query batch from January
2008 log
l 1.5GB intermediates, 99% reuse
l Join intermediates major
consumer of memory and major
contributor to savings
NLeSC 9 Nov 2011
28. Agenda
A crash course on column-stores
Column stores for science applications
The SciQL array query language
NLeSC 9 Nov 2011
29. What is an array?
An array is a systematic arrangement of objects
addressed by dimension values.
Get(A, X, Y,…) => Value
Set(A, X, Y,…) <= Value
There are many species:
vector, bit array, dynamic array, parallel array,
sparse array, variable length array, jagged array
NLeSC 9 Nov 2011
31. Arrays in DBMS
Relational prototype built on arrays, Peterlee
IS(1975)
Persistent programming languages, Astral (1980),
Plain (1980)
Object-orientation and persistent languages were
the make belief to handle them, O2(1992)
NLeSC 9 Nov 2011
32. PostgreSQL 8.3
Array declarations:
CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][]);
CREATE TABLE tictactoe ( squares integer[3][3] );
Array operations: denotation ([]), contains (@>), is
contained in (<@), append, concat (||),
dimension, lower, upper, prepend, to-string, from-
string
Array constraints: none, no enforcement of
dimensions.
NLeSC 9 Nov 2011
33. Mysql
From the MySQL forum May 2010:
“
>How to store multiple values in a single field? Is there any array
data type concept in mysql?
>
As Jörg said "Multiple values in a single field" would be an explicit
violation of the relational model..."
“
Is there any experience beyond encoding it as blobs?
NLeSC 9 Nov 2011
34. Rasdaman
Breaks large C++ arrays (rasters) into disjoint
chunks
Maps chunks into large binary objects (blob)
Provide function interface to access them
RASCAL, a SQL92 extension
Known to work up to 12 TBs.
NLeSC 9 Nov 2011
35. SciDB
Breaks large C++ arrays (rasters) into overlapping
chunks
Built storage manager from scratch
Map-reduce processing model
Provide function interface to access them
AQL, a crippled SQL92
NLeSC 9 Nov 2011
36. What is the problem?
- Appropriate array denotations?
- Functional complete operation set ?
- Scale ?
- Size limitations due to (blob) representations ?
- Community awareness?
NLeSC 9 Nov 2011
37. MonetDB SciQL
SciQL (pronounced ‘cycle’ )
• A backward compatible extension of SQL’03
• Symbiosis of relational and array paradigm
• Flexible structure-based grouping
• Capitalizes the MonetDB physical array storage
• Recycling, an adaptive ‘materialized view’
• Zero-cost attachment contract for cooperative clients
http://www.cwi.nl/~mk/SciQL.pdf
NLeSC 9 Nov 2011
38. Table vs arrays
CREATE TABLE tmp
A collection of tuples
Indexed by a (primary) key
Default handling
Explicitly created using
INS/UPD/DEL
NLeSC 9 Nov 2011
39. Table vs arrays
CREATE TABLE tmp CREATE ARRAY tmp
A collection of tuples A collection of a priori defined tuples
Indexed by a (primary) key Indexed by dimension expressions
Default handling Implicitly defined by default value,
Explicitly created using To be updated with INS/DEL/UPD
INS/UPD/DEL
NLeSC 9 Nov 2011
40. SciQL examples
CREATE TABLE matrix (
x integer,
y integer,
value float
PRIMARY KEY (x,y) );
INSERT INTO matrix VALUES
(0,0,0),(0,1,0),(1,1,0)(1,0,0);
0 0 0
0 1 0
1 1 0
1 0 0
NLeSC 9 Nov 2011
41. SciQL examples
CREATE TABLE matrix ( CREATE ARRAY matrix (
x integer, x integer DIMENSION[2],
y integer, y integer DIMENSION[2],
value float value float DEFAULT 0);
PRIMARY KEY (x,y) );
INSERT INTO matrix VALUES
null … … …
(0,0,0),(0,1,0),(1,1,0)(1,0,0);
null null null …
0 0 0 0 0
0 null …
1
0 1 0 0 0 0
0 null null
1 1 0 0 1
1 0 0
NLeSC 9 Nov 2011
42. SciQL examples
CREATE TABLE matrix ( CREATE ARRAY matrix (
x integer, x integer DIMENSION[2],
y integer, y integer DIMENSION[2],
value float value float DEFAULT 0);
PRIMARY KEY (x,y) );
DELETE matrix WHERE y=1 DELETE matrix WHERE y=1
A hole in the array
0 0 0
null null
1
1 0 0
0 0 0
0 1
NLeSC 9 Nov 2011
43. SciQL examples
CREATE TABLE matrix ( CREATE ARRAY matrix (
x integer, x integer DIMENSION[2],
y integer, y integer DIMENSION[2],
value float value float DEFAULT 0);
PRIMARY KEY (x,y) );
INSERT INTO matrix VALUES INSERT INTO matrix VALUES
(0,1,1), (1,1,2) (0,1,1), (1,1,2)
0 0 0
1 2
1
1 0 0
0 0 0
0 1 1
0 1
1 1 2
NLeSC 9 Nov 2011
44. SciQL unbounded arrays
CREATE TABLE matrix ( CREATE ARRAY matrix (
x integer, x integer DIMENSION,
y integer, y integer DIMENSION,
value float value float DEFAULT 0);
PRIMARY KEY (x,y) );
INSERT INTO matrix VALUES INSERT INTO matrix VALUES
(0,2,1), (0,1,2) (0,2,1), (0,1,2)
0 2 1 2 1 0
0 1 2 1 0 0
0 0 2
0 1
NLeSC 9 Nov 2011
46. SciQL table queries
CREATE ARRAY matrix (
x integer DIMENSION,
y integer DIMENSION,
value float DEFAULT 0 );
-- simple checker boarding aggregation
SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
NLeSC 9 Nov 2011
47. SciQL array queries
CREATE ARRAY matrix (
x integer DIMENSION,
y integer DIMENSION,
value float DEFAULT 0 );
-- group based aggregation to construct an unbounded vector
SELECT [x], sum(value) FROM matrix
WHERE (x + y) % 2 = 0
GROUP BY x;
NLeSC 9 Nov 2011
48. SciQL array views
CREATE ARRAY vmatrix (
x integer DIMENSION[-1:5],
y integer DIMENSION[-1:5],
value float DEFAULT -1 )
AS SELECT x, y, value FROM matrix;
-1 -1 -1 -1
-1 0 0 -1
-1 0 0 -1
-1 -1 -1 -1
NLeSC 9 Nov 2011
49. SciQL tiling examples
V0,3 V1,3 V2,3 V3,3
V0,2 V1,2 V2,2 V3,2
V0,1 V1,1 V2,1 V3,1
Anchor
Point V0,0 V1,0 V2,0 V3,0
SELECT x, y, avg(value)
FROM matrix
GROUP BY matrix[x:1:x+2][y:1:y+2];
NLeSC 9 Nov 2011
50. SciQL tiling examples
V0,3 V1,3 V2,3 V3,3
V0,2 V1,2 V2,2 V3,2
V0,1 V1,1 V2,1 V3,1
Anchor
Point V0,0 V1,0 V2,0 V3,0
SELECT x, y, avg(value)
FROM matrix
GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
NLeSC 9 Nov 2011
51. SciQL tiling examples
V0,3 V1,3 V2,3 V3,3
Anchor
Point V0,2 V1,2 V2,2 V3,2
V0,1 V1,1 V2,1 V3,1
null
V0,0 V1,0 V2,0 V3,0
null null
SELECT x, y, avg(value)
FROM matrix
GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];
NLeSC 9 Nov 2011
52. SciQL tiling examples
V0,3 V1,3 V2,3 V3,3
Anchor
Point V0,2 V1,2 V2,2 V3,2
V0,1 V1,1 V2,1 V3,1
V0,0 V1,0 V2,0 V3,0
SELECT x, y, avg(value)
FROM matrix
GROUP BY matrix[x][y],
matrix[x-1][y], matrix[x+1][y],
matrix[x][y-1], matrix[x][y+1];
NLeSC 9 Nov 2011
53. SciQL, A Query Language for Science Applications
• Seamless integration of array-, set-, and sequence-
semantics.
• Dimension constraints as a declarative means for
indexed access to array cells.
• Structural grouping to generalize the value-based
grouping towards selective access to groups of cells
based on positional relationships for aggregation.
NLeSC 9 Nov 2011
54. Seismology use case
Rietbrock: Chili earthquake
… 2TB of wave fronts
… filter by sta/lta
… remove false positives
… window-based 3 min cuts
… heuristic tests
… interactive response required …
How can a database system help?
Scanning 2TB on modern pc takes >3 hours
NLeSC 9 Nov 2011
55. Use case, a SciQL dream
Rietbrock: Chili earthquake
create array mseed (
tick timestamp dimension[timestamp ‘2010’:*],
data decimal(8,6),
station string );
NLeSC 9 Nov 2011
56. Use case, a SciQL dream
Rietbrock: … filter by sta/lta
--- average by window of 5 seconds
select A.tick, avg(A.data)
from mseed A
group by A[tick:1:tick + 5 seconds]
NLeSC 9 Nov 2011
57. Use case, a SciQL dream
Rietbrock: … filter by sta/lta
select A.tick
from mseed A, mseed B
where A.tick = B.tick
and avg(A.data) / avg(B.data) > delta
group by A[tick:tick + 5 seconds],
B[tick:tick + 15 seconds]
NLeSC 9 Nov 2011
58. Use case, a SciQL dream
Rietbrock: … filter by sta/lta
create view candidates(
station string,
tick timestamp,
ratio float ) as
select A.station, A.tick, avg(A.data) / avg(B.data) as ratio
from mseed A, mseed B
where A.tick = B.tick
and avg(A.data) / avg(B.data) > delta
group by A[tick:tick + 5 seconds],
B[tick:tick + 15 seconds]
NLeSC 9 Nov 2011
59. Use case, a SciQL dream
Rietbrock: … remove false positives
-- remove isolated errors by direct environment
-- using wave propagation statics
create table neighbors(
head string,
tail string,
delay timestamp,
weight float)
NLeSC 9 Nov 2011
60. Use case, a SciQL dream
Rietbrock: … remove false positives
select A.tick, B.tick
from candidates A, candidates B, neighbors N
where A.station = N.head
and B.station = N.tail
and B.tick = A.tick + N.delay
and B.ratio * N.weight < A.ratio;
NLeSC 9 Nov 2011
61. Use case, a SciQL dream
Rietbrock: … remove false positives
delete from candidates
select A.tick
from candidates A, candidates B, neighbors N
where A.station = N.head
and B.station = N.tail
and B.tick = A.tick + N.delay
and B.ratio * N.weight < A.ratio;
NLeSC 9 Nov 2011
62. Use case, a SciQL dream
Rietbrock: … window-based 3 min cuts
… heuristic tests
select B.station, myfunction(B.data)
from candidates A, mseed B
where A.tick = B.tick
group by distinct B[tick:tick + 3 minutes];
-- using a User Defined Function written in C.
NLeSC 9 Nov 2011
63. Use case
Rietbrock: … interactive response required …
The query over 2TB of seismic data will be
handled before he finishes his coffee.
NLeSC 9 Nov 2011
64. Status
• The language definition is ‘finished’
• The grammar is included in SQL parser
• Semantic checks added to SQL parser
• A test suite is being built
• Runtime support features and software stack
• …
• Exposure to real life cases and external libraries
NLeSC 9 Nov 2011