Arrays in database systems,
     the next frontier?
            Martin Kersten
                  CWI




      NLeSC 9 Nov 2011
“We can't solve
problems by using
the same kind of
thinking we used
when we created
them.”




Agenda

A crash course on column-stores


Column stores for science applications


The SciQL array query language




The world of column stores
                             Motivation

Relational DBMSs have dominated since the late 1970s / 1980s
   •     Transactional workloads (OLTP, row-wise access)
   •     I/O-based processing
   •     Ingres, PostgreSQL, MySQL, Oracle, SQL Server, DB2, …

Column stores have dominated product development since 2005
    •    Data warehouses and business intelligence applications
    •    Startups: Infobright, Aster Data, Greenplum, LucidDB, …
    •    Commercial: Microsoft, IBM, SAP, …
                         MonetDB, the pioneer

The world of column stores
Workload changes: Transactions (OLTP) vs ...

The world of column stores
Workload changes: ... vs OLAP, BI, Data Mining, ...




The world of column stores

             Databases hit The Memory Wall

§  Detailed and exhaustive analysis of different workloads on
    4 RDBMSs by Ailamaki, DeWitt, Hill, and Wood in VLDB 1999:
    “DBMSs On A Modern Processor: Where Does Time Go?”
§  CPU is 60%-90% idle,
    waiting for memory:
     §  L1 data stalls
     §  L1 instruction stalls
     §  L2 data stalls
     §  TLB stalls
     §  Branch mispredictions
     §  Resource stalls




The world of column stores
Hardware Changes: The Memory Wall




Trip to memory = 1000s of instructions!




Storing Relations in MonetDB




[Figure: a relation stored as five columns, each with a Void head whose first value is 1000]

Virtual OID: seqbase=1000 (increment=1)

BAT Data Structure




BAT: binary association table (columns: Head, Tail)
BUN: binary unit

Head & Tail (BUN heap):
  - consecutive memory block (array)
  - memory-mapped file

Tail heap:
  - best-effort duplicate elimination for strings
    (~ dictionary encoding)

Hash tables, T-trees, R-trees, ...
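
As a rough illustration of the BAT layout on this slide, the sketch below models a string column in Python. The class `StringBAT` and its method names are hypothetical and are not MonetDB's API. It assumes only what the slide states: a void head whose OIDs are computed as seqbase + position, and a tail heap that stores each distinct string once.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StringBAT:
    """Toy BAT: a void (virtual) head plus a dictionary-encoded string tail."""
    seqbase: int = 1000                                    # first virtual OID
    tail: List[int] = field(default_factory=list)          # offsets into the heap
    heap: List[str] = field(default_factory=list)          # distinct strings
    _index: Dict[str, int] = field(default_factory=dict)   # string -> heap offset

    def append(self, value: str) -> int:
        """Append a value and return the OID it received."""
        if value not in self._index:       # duplicate elimination
            self._index[value] = len(self.heap)
            self.heap.append(value)
        self.tail.append(self._index[value])
        return self.seqbase + len(self.tail) - 1

    def fetch(self, oid: int) -> str:
        """Positional lookup: the OID is never stored, only computed."""
        pos = oid - self.seqbase
        if not 0 <= pos < len(self.tail):
            raise KeyError(oid)
        return self.heap[self.tail[pos]]


bat = StringBAT()
oids = [bat.append(s) for s in ["NL", "DE", "NL"]]
print(oids, bat.heap, bat.fetch(1002))
```

The head costs no storage at all in this model, which is the point of the virtual OID.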
MonetDB Front-end: SQL

l    SQL 2003
            l      Parse SQL into logical n-ary relational algebra tree
            l      Translate n-ary relational algebra into logical 2-ary relational algebra
            l      Turn logical 2-ary plan into physical 2-ary plan (MAL program)


l    Front-end specific strategic optimization:
      l    Heuristic optimization during all three previous steps
            l  Primary key and distinct constraints:
              l    Create and maintain hash indices
            l      Foreign key constraints
              l    Create and maintain foreign key join indices

                                 NLeSC 9 Nov 2011
MonetDB Front-end: SQL
EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;

function user.s2_1():void;
barrier _73 := language.dataflow();
  _2:bat[:oid,:int] := sql.bind("sys","t","c",0);
  _7:bat[:oid,:int] := sql.bind("sys","s","x",0);
  _10 := bat.reverse(_7);
  _11 := algebra.join(_2,_10);
  _13 := algebra.markT(_11,0@0);
  _14 := bat.reverse(_13);
  _15:bat[:oid,:int] := sql.bind("sys","t","a",0);
  _17 := algebra.leftjoin(_14,_15);
  _18 := bat.reverse(_11);
  _19 := algebra.markT(_18,0@0);
  _20 := bat.reverse(_19);
  _21:bat[:oid,:int] := sql.bind("sys","s","z",0);
  _23 := algebra.leftjoin(_20,_21);
exit _73;
  _24 := sql.resultSet(2,1,_17);
  sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
  sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
  _33 := io.stdout();
  sql.exportResult(_33,_24);
end s2_1;
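
To make the data flow of this plan easier to follow, here is a minimal Python emulation. It is a sketch under assumptions: a BAT is modelled as a list of (head, tail) pairs, `bind`, `reverse`, `join`, and `mark_tail` are hypothetical stand-ins for `sql.bind`, `bat.reverse`, `algebra.join`/`algebra.leftjoin`, and `algebra.markT`, and the toy table contents are invented for the illustration.

```python
from typing import List, Tuple

BAT = List[Tuple[int, int]]  # (head, tail) pairs


def bind(column: List[int]) -> BAT:
    """Stand-in for sql.bind: pair each value with its dense OID."""
    return list(enumerate(column))


def reverse(b: BAT) -> BAT:
    """Swap head and tail."""
    return [(t, h) for h, t in b]


def join(left: BAT, right: BAT) -> BAT:
    """Match left.tail with right.head; keep (left.head, right.tail).

    The nested loop preserves the order of `left`, so it also serves
    as the leftjoin used for the projections.
    """
    return [(lh, rt) for lh, lt in left for rh, rt in right if lt == rh]


def mark_tail(b: BAT, base: int = 0) -> BAT:
    """Replace the tail by a fresh dense sequence starting at base."""
    return [(h, base + i) for i, (h, _) in enumerate(b)]


# Toy tables t(a, c) and s(x, z); the values are made up.
t_a, t_c = [10, 20, 30], [1, 2, 3]
s_x, s_z = [2, 3, 3], [200, 300, 301]

matches = join(bind(t_c), reverse(bind(s_x)))    # _11: (t oid, s oid)
t_side = reverse(mark_tail(matches))             # _14: (result row, t oid)
col_a = join(t_side, bind(t_a))                  # _17: (result row, a)
s_side = reverse(mark_tail(reverse(matches)))    # _20: (result row, s oid)
col_z = join(s_side, bind(s_z))                  # _23: (result row, z)

result = [(a, z) for (_, a), (_, z) in zip(col_a, col_z)]
print(result)
```

Each step materializes a full intermediate list, mirroring the operator-at-a-time style of the plan.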

MonetDB/5 Back-end: MAL
l    MAL: Monet Assembly Language
      l  textual interface
      l    Interpreted language
l    Designed as system interface language
       l  Reduced, concise syntax
       l  Strict typing
       l  Meant for automatic generation and parsing/rewriting/processing
       l  Not meant to be typed by humans
l    Efficient parser
       l  Low overhead
       l  Inherent support for tactical optimization: MAL -> MAL
       l  Support for optimizer plug-ins
       l  Support for runtime schedulers
l    Binary-algebra core
l    Flow control (MAL is computational complete)‫‏‬
                         NLeSC 9 Nov 2011
Processing Model (MonetDB Kernel)

•    Bulk processing:
       •  Full materialization of all intermediate results

•    Binary (i.e., 2-column) algebra core:
       •  select, join, semijoin, outerjoin
       •  union, intersection, diff (BAT-wise & column-wise)
       •  group, count, max, min, sum, avg
       •  reverse, mirror, mark

•    Runtime operational optimization:
       •  Choosing the optimal algorithm & implementation according to
           input properties and system status

Processing Model (MonetDB Kernel)
  •  Heavy use of code expansion to reduce cost

1 algebra operator:          select()
3 overloaded operators:      select("=",value), select("between",L,H), select("fcn",parm)
10 operator algorithms:      scan, hash-lookup, bin-search, bin-tree, pos-lookup
~1500(!) routines            scan_range_select_oid_int(),
(macro expansion):           hash_equi_select_void_str(), …

  •  ~1500 selection routines
  •  149 unary operations
  •  335 join/group operations
  •  ...

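
The idea of picking an operator algorithm at run time from the input's properties can be sketched as follows. This is an illustration only: the `Column` class, its `sorted` flag, and `range_select` are hypothetical names, and only two of the algorithms named on the slide (scan and binary search) are modelled.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    values: List[int]
    sorted: bool = False   # property assumed to be tracked by the kernel


def range_select(col: Column, low: int, high: int) -> List[int]:
    """Return the positions of values in [low, high] (inclusive bounds assumed)."""
    if col.sorted:
        # bin-search: the qualifying positions form one contiguous slice
        first = bisect_left(col.values, low)
        last = bisect_right(col.values, high)
        return list(range(first, last))
    # fallback: sequential scan over all values
    return [i for i, v in enumerate(col.values) if low <= v <= high]


print(range_select(Column([1, 3, 5, 7, 9], sorted=True), 3, 7))
print(range_select(Column([9, 3, 7, 1, 5]), 3, 7))
```

Both branches answer the same logical select(); only the physical algorithm differs, which is why the choice can be deferred to run time.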
The Software Stack


Front-ends     XQuery                SQL 03      SciQL   RDF

                                    Optimizers


Back-end(s)   MonetDB 4             MonetDB 5


  Kernel                  MonetDB kernel




The Software Stack

Front-ends     XQuery, SQL 03            Strategic optimization              MAL
Optimizers                               Tactical optimization:
                                         MAL -> MAL rewrites
Back-end(s)    MonetDB 4, MonetDB 5                                          MAL
Kernel         MonetDB kernel            Runtime operational optimization

MonetDB vs Traditional DBMS Architecture

l    Architecture-Conscious Query Processing
                      vs Magnetic disk I/O conscious processing
                     - 
      l    Data layout, algorithms, cost models
l    RISC Relational Algebra (operator-at-a-time)
                     - vs Tuple-at-a-time Iterator Model
      l    Faster through simplicity: no tuple expression interpreter
l    Multi-Model: ODMG, SQL, XML/XQuery, ..., RDF/SPARQL
                     vs Relational with Bolt-on Subsystems
                     - 
      l    Columns as the building block for complex data structures

l    Decoupling of Transactions from Execution/Buffering
                      vs ARIES integrated into Execution/Buffering/Indexing
                     - 
      l    ACID, but not ARIES.. Pay as you need transaction overhead.

l    Run-Time Indexing and Query Optimization
                     - vs Static DBA/Workload-driven Optimization & Indexing
      l    Extensible Optimizer Framework;
      l    cracking, recycling, sampling-based runtime optimization
                                NLeSC 9 Nov 2011
Evolution

It is not the strongest of the
species that survives, nor the
most intelligent, but the one
most responsive to change. 

Charles Darwin (1809 - 1882)



Agenda

A crash course on column-stores


Column stores for science applications


The SciQL array query language




SkyServer Schema

[Figure: SkyServer schema diagram, with two tables annotated]
   446 columns, >585 million rows
   6 columns, >20 billion rows

Recycler: motivation & idea
“An architecture for recycling intermediates in a column-store”.
Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010

Motivation:
     •  Scientific databases, data analytics
     •  Terabytes of data (observational, transactional)
     •  Prevailing read-only workload
     •  Ad-hoc queries with commonalities

Background:
     •  Operator-at-a-time execution paradigm
          -  Automatic materialization of intermediates
     •  Canonical column-store organization
          -  Intermediates have reduced dimensionality and finer granularity
          -  Simplified overlap analysis

Recycling idea:
     •  Instead of garbage collecting,
     •  keep the intermediates and reuse them
          -  speed up query streams with commonalities
          -  low cost and self-organization

Recycler: fit into MonetDB

[Figure: the Recycler inside the MonetDB Server. Labels, top to bottom:
SQL and XQuery front-ends; MAL; Tactical Optimizer with the Recycler
Optimizer; MAL; Run-time Support with the Recycler (Admission & Eviction,
Recycle Pool); MonetDB Kernel. The same MAL plan is shown before and after
the optimizer:]

function user.s1_2(A0:date, ...):void;
    X5 := sql.bind("sys","lineitem",...);
    X10 := algebra.select(X5,A0);
    X12 := sql.bindIdx("sys","lineitem",...);
    X15 := algebra.join(X10,X12);
    X25 := mtime.addmonths(A1,A2);
    ...

Recycler: instruction matching

Run time comparison of
•    instruction types
•    argument values

Exact matching:
    Y3 := sql.bind("sys","orders","o_orderdate",0);
    X1 := sql.bind("sys","orders","o_orderdate",0);
    …

Name    Value          Data type              Size
X1      10             :bat[:oid,:date]
T1      “sys”          :str
T2      “orders”       :str
…

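
Exact matching can be pictured as a lookup keyed on the instruction type and its argument values. The following is a minimal sketch, not the Recycler's implementation: `RecyclePool`, `run`, and `fake_bind` are hypothetical names, and admission and eviction policies are left out entirely.

```python
from typing import Any, Callable, Dict, List, Tuple


class RecyclePool:
    """Keeps intermediates keyed by (instruction type, argument values)."""

    def __init__(self) -> None:
        self.pool: Dict[Tuple[str, Tuple[Any, ...]], Any] = {}
        self.hits = 0

    def run(self, op: str, args: Tuple[Any, ...],
            compute: Callable[..., Any]) -> Any:
        key = (op, args)
        if key in self.pool:          # exact match: reuse the intermediate
            self.hits += 1
            return self.pool[key]
        result = compute(*args)       # miss: execute and keep the result
        self.pool[key] = result
        return result


executed: List[str] = []


def fake_bind(schema: str, table: str, column: str, access: int) -> List[str]:
    """Placeholder for an expensive operator; records that it really ran."""
    executed.append(column)
    return [f"{schema}.{table}.{column}"]


pool = RecyclePool()
x1 = pool.run("sql.bind", ("sys", "orders", "o_orderdate", 0), fake_bind)
y3 = pool.run("sql.bind", ("sys", "orders", "o_orderdate", 0), fake_bind)
print(pool.hits, len(executed))
```

The second call returns the very object produced by the first, so the operator body runs only once.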
Recycler: instruction subsumption

New instruction:
    Y3 := algebra.select(X1,20,45);

Recycle pool:
    X3 := algebra.select(X1,10,80);
    …
    X5 := algebra.select(X1,20,60);

Intermediate highlighted in the figure: X5

 Name      Value     Data type               Size
 X1        10        :bat[:oid,:int]         2000
 X3        130       :bat[:oid,:int]          700
 X5        150       :bat[:oid,:int]          350
 …

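
Subsumption for range selections can be sketched as: find cached selections whose range covers the requested one, pick the smallest, and filter it instead of the base column. The function names below are hypothetical, and inclusive bounds on both ends are an assumption made for the illustration.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]
Pool = Dict[str, Tuple[Range, List[int]]]   # name -> ((low, high), values)


def pick_subsuming(pool: Pool, low: int, high: int) -> Optional[str]:
    """Name of the smallest cached range-select result covering [low, high]."""
    candidates = [
        (len(values), name)
        for name, ((lo, hi), values) in pool.items()
        if lo <= low and high <= hi
    ]
    return min(candidates)[1] if candidates else None


def select_with_recycling(pool: Pool, base: List[int],
                          low: int, high: int) -> List[int]:
    """Answer the select from a subsuming intermediate when one exists."""
    name = pick_subsuming(pool, low, high)
    source = pool[name][1] if name is not None else base
    return [v for v in source if low <= v <= high]


x1 = list(range(100))                                   # toy base column
pool: Pool = {
    "X3": ((10, 80), [v for v in x1 if 10 <= v <= 80]),
    "X5": ((20, 60), [v for v in x1 if 20 <= v <= 60]),
}
print(pick_subsuming(pool, 20, 45))
```

With the ranges from the slide, both X3 and X5 cover [20,45]; the smaller X5 means less data to scan.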
Recycler: SkyServer evaluation

Sloan Digital Sky Survey / SkyServer
http://cas.sdss.org
•    100 GB subset of DR4
•    100-query batch from the January 2008 log
•    1.5 GB of intermediates, 99% reuse
•    Join intermediates are the major consumer of memory and the major
      contributor to savings

Agenda

A crash course on column-stores


Column stores for science applications


The SciQL array query language




What is an array?
An array is a systematic arrangement of objects
 addressed by dimension values.
    Get(A, X, Y,…) => Value
    Set(A, X, Y,…) <= Value


There are many species:
 vector, bit array, dynamic array, parallel array,
 sparse array, variable length array, jagged array




Who needs them anyway?
Seismology           – partial time-series
Climate simulation   – temporally ordered grid
Astronomy            – temporally ordered images
Remote sensing       – image processing
Social networks      – graph algorithms
Genomics             – ordered strings


Scientists ‘love them’:
  MSEED, NETCDF, FITS, CSV, ..

Arrays in DBMS
Relational prototype built on arrays, Peterlee
 IS (1975)


Persistent programming languages, Astral (1980),
 Plain (1980)


Object-orientation and persistent languages were
 the make-believe to handle them, O2 (1992)

PostgreSQL 8.3
Array declarations:
CREATE TABLE sal_emp ( name text, pay_by_quarter integer[], schedule text[][]);
CREATE TABLE tictactoe ( squares integer[3][3] );



Array operations: denotation ([]), contains (@>), is
  contained in (<@), append, concat (||),
  dimension, lower, upper, prepend, to-string, from-string


Array constraints: none, no enforcement of
  dimensions.

MySQL
From the MySQL forum, May 2010:
“
> How to store multiple values in a single field? Is there any array
> data type concept in mysql?
>
As Jörg said "Multiple values in a single field" would be an explicit
violation of the relational model..."
”
Is there any experience beyond encoding it as blobs?

Rasdaman
Breaks large C++ arrays (rasters) into disjoint
  chunks


Maps chunks into large binary objects (blob)


Provides a function interface to access them


RasQL, a SQL92 extension


Known to work up to 12 TB.

SciDB
Breaks large C++ arrays (rasters) into overlapping
  chunks


Built storage manager from scratch


Map-reduce processing model


Provides a function interface to access them


AQL, a crippled SQL92

What is the problem?
-  Appropriate array denotations?
-  Functionally complete operation set?
-  Scale?
-  Size limitations due to (blob) representations?
-  Community awareness?

MonetDB SciQL

SciQL (pronounced ‘cycle’ )
•  A backward compatible extension of SQL’03
•  Symbiosis of relational and array paradigm
•  Flexible structure-based grouping
•  Capitalizes on the MonetDB physical array storage
  •  Recycling, an adaptive ‘materialized view’
  •  Zero-cost attachment contract for cooperative clients
                http://www.cwi.nl/~mk/SciQL.pdf



Table vs arrays

CREATE TABLE tmp
A collection of tuples


Indexed by a (primary) key


Default handling


Explicitly created using
  INS/UPD/DEL



Table vs arrays

CREATE TABLE tmp               CREATE ARRAY tmp
A collection of tuples         A collection of a priori defined tuples
Indexed by a (primary) key     Indexed by dimension expressions
Default handling               Implicitly defined by default value
Explicitly created using       To be updated with INS/DEL/UPD
  INS/UPD/DEL

SciQL examples
CREATE TABLE matrix (
  x integer,
  y integer,
  value float,
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES
(0,0,0),(0,1,0),(1,1,0),(1,0,0);

         0      0   0
         0      1   0
         1      1   0
         1      0   0

SciQL examples
CREATE TABLE matrix (                 CREATE ARRAY matrix (
  x integer,                               x integer DIMENSION[2],
  y integer,                               y integer DIMENSION[2],
  value float,                             value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES
(0,0,0),(0,1,0),(1,1,0),(1,0,0);

         0      0   0                              1     0   0
         0      1   0                              0     0   0
         1      1   0                                    0   1
         1      0   0
                                      [Figure: positions outside the 2 x 2 bounds are null]

SciQL examples
CREATE TABLE matrix (                 CREATE ARRAY matrix (
  x integer,                               x integer DIMENSION[2],
  y integer,                               y integer DIMENSION[2],
  value float,                             value float DEFAULT 0);
  PRIMARY KEY (x,y) );


DELETE FROM matrix WHERE y=1;         DELETE FROM matrix WHERE y=1;
                                      A hole in the array

        0       0   0                              1   null   null
        1       0   0                              0     0      0
                                                         0      1

SciQL examples
CREATE TABLE matrix (                 CREATE ARRAY matrix (
  x integer,                               x integer DIMENSION[2],
  y integer,                               y integer DIMENSION[2],
  value float,                             value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES             INSERT INTO matrix VALUES
(0,1,1), (1,1,2);                     (0,1,1), (1,1,2);

         0      0   0                              1     1   2
         1      0   0                              0     0   0
         0      1   1                                    0   1
         1      1   2

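
The array behaviour walked through on the last three slides (cells exist by default, DELETE leaves a hole, INSERT overwrites) can be mimicked with a small Python model. This is a sketch, not SciQL's implementation: `Array2D` and its methods are hypothetical, null is modelled as `None`, and rejecting inserts outside the declared bounds is an assumption.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple


class Array2D:
    """Toy bounded array: every cell within the bounds exists from the start."""

    def __init__(self, nx: int, ny: int, default: Optional[float] = 0.0) -> None:
        self.cells: Dict[Tuple[int, int], Optional[float]] = {
            (x, y): default for x, y in product(range(nx), range(ny))
        }

    def insert(self, x: int, y: int, value: float) -> None:
        """INSERT overwrites the addressed cell (bounds check is an assumption)."""
        if (x, y) not in self.cells:
            raise IndexError((x, y))
        self.cells[(x, y)] = value

    def delete_where(self, predicate: Callable[[int, int], bool]) -> None:
        """DELETE leaves a hole: the cell stays addressable but holds None."""
        for (x, y) in self.cells:
            if predicate(x, y):
                self.cells[(x, y)] = None


matrix = Array2D(2, 2, default=0.0)            # DIMENSION[2] x DIMENSION[2]
after_create = dict(matrix.cells)
matrix.delete_where(lambda x, y: y == 1)       # DELETE ... WHERE y=1
after_delete = dict(matrix.cells)
matrix.insert(0, 1, 1.0)                       # INSERT (0,1,1)
matrix.insert(1, 1, 2.0)                       # INSERT (1,1,2)
print(matrix.cells)
```

Unlike the table, the number of cells never changes; only their contents do.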
SciQL unbounded arrays
CREATE TABLE matrix (                 CREATE ARRAY matrix (
  x integer,                               x integer DIMENSION,
  y integer,                               y integer DIMENSION,
  value float,                             value float DEFAULT 0);
  PRIMARY KEY (x,y) );


INSERT INTO matrix VALUES             INSERT INTO matrix VALUES
(0,2,1), (0,1,2)                      (0,2,1), (0,1,2)

         0      2   1                              2     1   0

         0      1   2                              1     0   0
                                                   0     0   2
                                                         0   1
SciQL Dimensions
Unbounded Dimensions
  scalar-type DIMENSION


Bounded Dimensions
  scalar-type DIMENSION[stop]
  scalar-type DIMENSION[first: step: stop]
  scalar-type DIMENSION[*: *: *]

timestamp DIMENSION [ timestamp '2010-01-19' : *: timestamp '1' minute]

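
One way to read the bounded forms is as a generator of index values. The sketch below assumes that the stop bound is exclusive (the DIMENSION[2] examples show indices 0 and 1), that an omitted first defaults to 0 and an omitted step to 1, and that `*` for stop means unbounded. The function `dimension` is a hypothetical helper, restricted to integers.

```python
from itertools import count, islice
from typing import Iterator, Optional


def dimension(first: int = 0, step: int = 1,
              stop: Optional[int] = None) -> Iterator[int]:
    """Index values of DIMENSION[first:step:stop]; stop=None models '*'."""
    if stop is None:
        return count(first, step)          # unbounded: an endless sequence
    return iter(range(first, stop, step))  # bounded: stop is exclusive


print(list(dimension(stop=2)))             # DIMENSION[2]
print(list(dimension(0, 2, 7)))            # DIMENSION[0:2:7]
print(list(islice(dimension(5, 10), 3)))   # first three values of DIMENSION[5:10:*]
```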
SciQL table queries

CREATE ARRAY matrix (
  x integer DIMENSION,
  y integer DIMENSION,
  value float DEFAULT 0 );


-- simple checkerboard aggregation
SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0;

SciQL array queries

CREATE ARRAY matrix (
  x integer DIMENSION,
  y integer DIMENSION,
  value float DEFAULT 0 );


-- group based aggregation to construct an unbounded vector
SELECT [x], sum(value) FROM matrix
  WHERE (x + y) % 2 = 0
  GROUP BY x;
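
For readers more at home in a general-purpose language, the two checkerboard queries correspond roughly to the Python below. The array is modelled as a dict from (x, y) to value, the helper names are hypothetical, and the 3 x 3 sample data is invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cells = Dict[Tuple[int, int], float]


def checkerboard_sum(cells: Cells) -> float:
    """SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0"""
    return sum(v for (x, y), v in cells.items() if (x + y) % 2 == 0)


def checkerboard_sum_by_x(cells: Cells) -> List[float]:
    """SELECT [x], sum(value) ... GROUP BY x, returned as a vector ordered by x."""
    sums: Dict[int, float] = defaultdict(float)
    for (x, y), v in cells.items():
        if (x + y) % 2 == 0:
            sums[x] += v
    return [sums[x] for x in sorted(sums)]


cells = {(x, y): float(10 * x + y) for x in range(3) for y in range(3)}
print(checkerboard_sum(cells), checkerboard_sum_by_x(cells))
```

The first query collapses the array to a scalar, while the `[x]` in the second keeps x as a dimension of the result.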

SciQL array views
CREATE ARRAY vmatrix (
  x integer DIMENSION[-1:5],
  y integer DIMENSION[-1:5],
  value float DEFAULT -1 )
AS SELECT x, y, value FROM matrix;

               -1       -1      -1     -1
               -1        0      0      -1
               -1        0      0      -1
               -1       -1      -1     -1




SciQL tiling examples
              V0,3    V1,3     V2,3   V3,3


              V0,2    V1,2     V2,2   V3,2


              V0,1    V1,1     V2,1   V3,1

Anchor
Point         V0,0    V1,0     V2,0   V3,0




         SELECT x, y, avg(value)
         FROM matrix
         GROUP BY matrix[x:1:x+2][y:1:y+2];


SciQL tiling examples
                 V0,3    V1,3     V2,3   V3,3


                 V0,2    V1,2     V2,2   V3,2


                 V0,1    V1,1     V2,1   V3,1

Anchor
Point            V0,0    V1,0     V2,0   V3,0




         SELECT x, y, avg(value)
         FROM matrix
         GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
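
The two tiling queries so far can be approximated in Python as follows. Several assumptions are made: a tile `[x:1:x+2][y:1:y+2]` covers x, x+1 and y, y+1 (exclusive stop), the plain GROUP BY anchors one tile at every cell, GROUP BY DISTINCT keeps only non-overlapping tiles, and cells outside the array are ignored in the average. The helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Cells = Dict[Tuple[int, int], float]


def tile_avg(cells: Cells, x: int, y: int, w: int = 2, h: int = 2) -> Optional[float]:
    """Average of the cells in the w x h tile anchored at (x, y)."""
    vals = [cells[(i, j)]
            for i in range(x, x + w) for j in range(y, y + h)
            if (i, j) in cells]
    return sum(vals) / len(vals) if vals else None


def group_by_tiles(cells: Cells, nx: int, ny: int, distinct: bool = False,
                   w: int = 2, h: int = 2) -> Dict[Tuple[int, int], Optional[float]]:
    """Map each anchor point to the average of its tile."""
    xs = range(0, nx, w if distinct else 1)   # distinct: jump a whole tile
    ys = range(0, ny, h if distinct else 1)
    return {(x, y): tile_avg(cells, x, y, w, h) for x in xs for y in ys}


cells = {(x, y): float(x + 4 * y) for x in range(4) for y in range(4)}
overlapping = group_by_tiles(cells, 4, 4)
disjoint = group_by_tiles(cells, 4, 4, distinct=True)
print(len(overlapping), len(disjoint))
```

On a 4 x 4 array this yields 16 overlapping tiles but only 4 disjoint ones.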


SciQL tiling examples
          V0,3    V1,3     V2,3   V3,3

Anchor
Point     V0,2    V1,2     V2,2   V3,2


          V0,1    V1,1     V2,1   V3,1
   null

          V0,0    V1,0     V2,0   V3,0
   null                                  null



SELECT x, y, avg(value)
FROM matrix
GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];


SciQL tiling examples
               V0,3    V1,3     V2,3   V3,3

Anchor
Point          V0,2    V1,2     V2,2   V3,2


               V0,1    V1,1     V2,1   V3,1


               V0,0    V1,0     V2,0   V3,0




         SELECT x, y, avg(value)
         FROM matrix
         GROUP BY matrix[x][y],
          matrix[x-1][y], matrix[x+1][y],
          matrix[x][y-1], matrix[x][y+1];
SciQL, A Query Language for Science Applications


•  Seamless integration of array-, set-, and sequence-
   semantics.
•  Dimension constraints as a declarative means for
   indexed access to array cells.
•  Structural grouping to generalize the value-based
   grouping towards selective access to groups of cells
   based on positional relationships for aggregation.




Seismology use case
Rietbrock: Chile earthquake
  … 2TB of wave fronts
  … filter by sta/lta
  … remove false positives
  … window-based 3 min cuts
  … heuristic tests
  … interactive response required …


How can a database system help?
  Scanning 2TB on a modern PC takes >3 hours
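
As a quick sanity check on that figure, the throughput implied by scanning 2TB in 3 hours can be worked out directly. The sketch assumes 1 TB = 10^12 bytes and 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def implied_scan_rate_mb_per_s(total_bytes: float, seconds: float) -> float:
    """Sustained throughput, in MB/s, needed to scan total_bytes in the given time."""
    return total_bytes / seconds / 1e6


rate = implied_scan_rate_mb_per_s(2e12, 3 * 3600)
print(f"{rate:.0f} MB/s")
```

A scan that takes more than 3 hours therefore corresponds to a sustained rate below roughly 185 MB/s.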

Use case, a SciQL dream
Rietbrock: Chile earthquake
create array mseed (
 tick   timestamp dimension[timestamp '2010':*],
 data decimal(8,6),
 station string );

Use case, a SciQL dream
Rietbrock: … filter by sta/lta


-- average by window of 5 seconds
select A.tick, avg(A.data)
from mseed A
group by A[tick:1:tick + 5 seconds];

Use case, a SciQL dream
Rietbrock: … filter by sta/lta
select A.tick
from mseed A, mseed B
where A.tick = B.tick
group by A[tick:tick + 5 seconds],
  B[tick:tick + 15 seconds]
having avg(A.data) / avg(B.data) > delta;

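
The same short-term/long-term average test can be sketched outside the database. The Python below is an illustration under assumptions: one sample per second, both windows start at the current tick as in the query, the data are non-negative amplitudes, and `delta` is a free threshold. The function names and the synthetic trace are invented.

```python
from typing import List


def window_avg(data: List[float], start: int, length: int) -> float:
    """Average of `length` consecutive samples beginning at `start`."""
    window = data[start:start + length]
    return sum(window) / len(window)


def sta_lta_triggers(data: List[float], short: int = 5, long: int = 15,
                     delta: float = 2.0) -> List[int]:
    """Ticks whose short-window average exceeds delta times the long-window average."""
    triggers = []
    for tick in range(len(data) - long + 1):   # only ticks with a full long window
        lta = window_avg(data, tick, long)
        if lta > 0 and window_avg(data, tick, short) / lta > delta:
            triggers.append(tick)
    return triggers


# Quiet trace with a 5-second burst starting at tick 20.
trace = [1.0] * 20 + [10.0] * 5 + [1.0] * 20
triggers = sta_lta_triggers(trace)
print(triggers)
```

Only ticks near the onset of the burst pass the ratio test; the quiet stretches have a ratio of 1.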
Use case, a SciQL dream
Rietbrock: … filter by sta/lta
create view candidates(
  station string,
  tick timestamp,
  ratio float ) as
select A.station, A.tick, avg(A.data) / avg(B.data) as ratio
  from mseed A, mseed B
  where A.tick = B.tick
  group by A[tick:tick + 5 seconds],
   B[tick:tick + 15 seconds]
  having avg(A.data) / avg(B.data) > delta;

Use case, a SciQL dream
Rietbrock: … remove false positives
-- remove isolated errors by direct environment
-- using wave propagation statics


create table neighbors(
 head string,
 tail string,
 delay timestamp,
 weight float)

Use case, a SciQL dream
Rietbrock: … remove false positives
select A.tick, B.tick
  from candidates A, candidates B, neighbors N
 where A.station = N.head
 and B.station = N.tail
 and B.tick = A.tick + N.delay
 and B.ratio * N.weight < A.ratio;




Use case, a SciQL dream
Rietbrock: … remove false positives
delete from candidates
 where (station, tick) in (
  select A.station, A.tick
  from candidates A, candidates B, neighbors N
  where A.station = N.head
  and B.station = N.tail
  and B.tick = A.tick + N.delay
  and B.ratio * N.weight < A.ratio );

Use case, a SciQL dream
Rietbrock: … window-based 3 min cuts
  … heuristic tests


select B.station, myfunction(B.data)
  from candidates A, mseed B
 where A.tick = B.tick
 group by distinct B[tick:tick + 3 minutes];


-- using a User Defined Function written in C.


Use case
Rietbrock: … interactive response required …




The query over 2TB of seismic data will be
 handled before he finishes his coffee.

Status
•  The language definition is ‘finished’
•  The grammar is included in SQL parser
•  Semantic checks added to SQL parser
•  A test suite is being built
•  Runtime support features and software stack
•  …
•  Exposure to real life cases and external libraries




Science DBMS landscape
                    MonetDB 5.23                  SciDB 0.5              Rasdaman
Architecture        Server approach               Server approach        Plugin (Oracle, DB2, Informix,
                                                                         MySQL, PostgreSQL)
Open source         Mozilla License               GPL 3.0 Commercial     GPL 3.0 Dual license
Downloads           >12.000 /month                Tens up to now         ??
SQL                 SQL 2003                      ??                     SQL92++
Interoperability    {JO}DBC, C(++), Python, …     C++ UDF                C++, Java, OGC
Array language      SciQL                         AQL                    RASQL
Array model         Fixed+variable bounds         Fixed arrays           Fixed+variable bounds
Science             Linked libraries              Linked libraries       Linked libraries
Foreign files       Vaults of CSV, FITS,          ??                     TIFF, PNG, JPG, ..,
                    NETCDF, MSEED                                        CSV, NETCDF, HDF4, …
Distribution        50-200 node cluster           4 node cluster         20-node
Distribution tech   Dynamic partial replication   Static fragmentation   Static fragmentation
Executor            Various schemes               Map-reduce             Tile streaming
Largest demo        SkyServer SDSS 6 3TB          ---                    12TB, IGN-F (on PostgreSQL)
Storage tuning      Query adaptive                Schema definitions     Workload driven
Optimization        Heuristics + cost based       ??                     Heuristics + cost based

Contenu connexe

Tendances

Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formatsVigen Sahakyan
 
Oracle Database 11g Release 2 - XMLDB New Features
Oracle Database 11g Release 2 - XMLDB New FeaturesOracle Database 11g Release 2 - XMLDB New Features
Oracle Database 11g Release 2 - XMLDB New FeaturesMarco Gralike
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed DatasetsAlessandro Menabò
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsDataWorks Summit
 

Tendances (7)

Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Oracle Database 11g Release 2 - XMLDB New Features
Oracle Database 11g Release 2 - XMLDB New FeaturesOracle Database 11g Release 2 - XMLDB New Features
Oracle Database 11g Release 2 - XMLDB New Features
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 

Arrays in database systems, the next frontier?

  • 1. Arrays in database systems, the next frontier? Martin Kersten, CWI. NLeSC, 9 Nov 2011
  • 2. “We can't solve problems by using the same kind of thinking we used when we created them.”
  • 3. Agenda
      • A crash course on column-stores
      • Column stores for science applications
      • The SciQL array query language
  • 4. The world of column stores: Motivation
      Relational DBMSs have dominated since the late 1970s/1980s
        • Transactional workloads (OLTP, row-wise access)
        • I/O-based processing
        • Ingres, PostgreSQL, MySQL, Oracle, SQL Server, DB2, …
      Column stores have dominated product development since 2005
        • Data warehouses and business intelligence applications
        • Startups: Infobright, Aster Data, Greenplum, LucidDB, …
        • Commercial: Microsoft, IBM, SAP, …
      MonetDB, the pioneer
  • 5. The world of column stores. Workload changes: Transactions (OLTP) vs ...
  • 6. The world of column stores. Workload changes: ... vs OLAP, BI, Data Mining, ...
  • 7. The world of column stores: Databases hit The Memory Wall
      • Detailed and exhaustive analysis of different workloads on 4 RDBMSs by Ailamaki, DeWitt, Hill, Wood in VLDB 1999: “DBMSs On A Modern Processor: Where Does Time Go?”
      • CPU is 60%-90% idle, waiting for memory:
          • L1 data stalls
          • L1 instruction stalls
          • L2 data stalls
          • TLB stalls
          • Branch mispredictions
          • Resource stalls
  • 8. The world of column stores. Hardware Changes: The Memory Wall
      Trip to memory = 1000s of instructions!
  • 9. Storing Relations in MonetDB
      [Figure: five columns, each labelled Void with first value 1000]
      Virtual OID: seqbase=1000 (increment=1)
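The virtual OID idea on this slide can be sketched in a few lines of Python. This is an illustration only, not MonetDB code: the class name `VoidColumn` and the helper `reconstruct_tuple` are hypothetical. The assumption is that a void head is never stored, so an OID maps to an array position by subtracting `seqbase`.

```python
class VoidColumn:
    """Column whose head is a virtual, dense OID sequence (hypothetical sketch)."""

    def __init__(self, seqbase, tail):
        self.seqbase = seqbase   # OID of the first value, e.g. 1000
        self.tail = list(tail)   # only the tail values are materialized

    def oid_at(self, position):
        # With increment=1 the OID is just an offset from seqbase.
        return self.seqbase + position

    def fetch(self, oid):
        # Positional lookup: no search and no stored head column needed.
        position = oid - self.seqbase
        if not 0 <= position < len(self.tail):
            raise KeyError(oid)
        return self.tail[position]


def reconstruct_tuple(columns, oid):
    """Rebuild one relational tuple by fetching the same OID from every column."""
    return tuple(col.fetch(oid) for col in columns)


name = VoidColumn(1000, ["ann", "bob", "cat"])
age = VoidColumn(1000, [31, 45, 27])
print(reconstruct_tuple([name, age], 1001))
```

Because all columns of a relation share the same `seqbase`, tuple reconstruction is a set of array lookups at the same offset.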
  • 10. BAT Data Structure
      BAT: binary association table (Head, Tail)
      BUN: binary unit
      • Hash tables, T-trees, R-trees, ...
      • Head & Tail:
          • consecutive memory blocks (arrays)
          • memory-mapped files
      • BUN heap:
          • consecutive memory block (array)
          • memory-mapped file
      • Tail Heap:
          • best-effort duplicate elimination for strings (~ dictionary encoding)
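A rough Python sketch of the tail-heap idea follows. It is not MonetDB's actual layout: the class `StringHeap`, the NUL terminator, and the `max_tracked` limit are assumptions made for illustration. "Best-effort" is modelled here as a bounded lookup table, so once the table is full, later strings may be stored more than once.

```python
class StringHeap:
    """Variable-width string storage with best-effort duplicate elimination."""

    def __init__(self, max_tracked=1024):
        self.data = bytearray()          # one consecutive block of bytes
        self._offsets = {}               # string -> offset, bounded in size
        self._max_tracked = max_tracked  # assumed limit; makes dedup best-effort

    def put(self, s):
        """Store s (must not contain NUL) and return its byte offset."""
        known = self._offsets.get(s)
        if known is not None:
            return known                 # duplicate found: reuse the offset
        offset = len(self.data)
        self.data += s.encode("utf-8") + b"\x00"
        if len(self._offsets) < self._max_tracked:
            self._offsets[s] = offset
        return offset

    def get(self, offset):
        end = self.data.index(b"\x00", offset)
        return self.data[offset:end].decode("utf-8")


heap = StringHeap()
# The fixed-width tail column holds offsets into the heap, not the strings.
tail = [heap.put(s) for s in ["nl", "de", "nl"]]
print(tail, [heap.get(o) for o in tail])
```

The fixed-width offsets keep the tail an ordinary array, while repeated strings share one heap entry, which is why the slide compares it to dictionary encoding.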
  • 11. MonetDB Front-end: SQL
      • SQL 2003
      • Parse SQL into logical n-ary relational algebra tree
      • Translate n-ary relational algebra into logical 2-ary relational algebra
      • Turn logical 2-ary plan into physical 2-ary plan (MAL program)
      • Front-end specific strategic optimization:
          • Heuristic optimization during all three previous steps
      • Primary key and distinct constraints:
          • Create and maintain hash indices
      • Foreign key constraints:
          • Create and maintain foreign key join indices
  • 12. MonetDB Front-end: SQL
      EXPLAIN SELECT a, z FROM t, s WHERE t.c = s.x;

      function user.s2_1():void;
      barrier _73 := language.dataflow();
          _2:bat[:oid,:int] := sql.bind("sys","t","c",0);
          _7:bat[:oid,:int] := sql.bind("sys","s","x",0);
          _10 := bat.reverse(_7);
          _11 := algebra.join(_2,_10);
          _13 := algebra.markT(_11,0@0);
          _14 := bat.reverse(_13);
          _15:bat[:oid,:int] := sql.bind("sys","t","a",0);
          _17 := algebra.leftjoin(_14,_15);
          _18 := bat.reverse(_11);
          _19 := algebra.markT(_18,0@0);
          _20 := bat.reverse(_19);
          _21:bat[:oid,:int] := sql.bind("sys","s","z",0);
          _23 := algebra.leftjoin(_20,_21);
      exit _73;
          _24 := sql.resultSet(2,1,_17);
          sql.rsColumn(_24,"sys.t","a","int",32,0,_17);
          sql.rsColumn(_24,"sys.s","z","int",32,0,_23);
          _33 := io.stdout();
          sql.exportResult(_33,_24);
      end s2_1;
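To make the plan easier to follow, here is a small Python emulation of its data flow. It is a sketch under assumptions, not MonetDB's implementation: a BAT is modelled as a list of (head, tail) pairs, `join` is assumed to match the left tail against the right head, and `mark_tail` is assumed to replace the tail with a dense OID sequence. All Python names are hypothetical; the variable names mirror the MAL variables above.

```python
def bind(values, seqbase=0):
    """Model a stored column as [oid, value] pairs with dense OIDs."""
    return [(seqbase + i, v) for i, v in enumerate(values)]

def reverse(bat):
    """Swap head and tail."""
    return [(t, h) for h, t in bat]

def join(left, right):
    """Match left.tail with right.head; emit (left.head, right.tail)."""
    index = {}
    for h, t in right:
        index.setdefault(h, []).append(t)
    return [(lh, rt) for lh, lt in left for rt in index.get(lt, [])]

leftjoin = join  # in this sketch join already preserves the left order

def mark_tail(bat, base=0):
    """Replace the tail with fresh dense OIDs starting at base."""
    return [(h, base + i) for i, (h, _) in enumerate(bat)]

def run_plan(t_c, t_a, s_x, s_z):
    _2, _7 = bind(t_c), bind(s_x)
    _11 = join(_2, reverse(_7))               # (t oid, s oid) where t.c = s.x
    _14 = reverse(mark_tail(_11))             # (result oid, t oid)
    _17 = leftjoin(_14, bind(t_a))            # (result oid, t.a)
    _20 = reverse(mark_tail(reverse(_11)))    # (result oid, s oid)
    _23 = leftjoin(_20, bind(s_z))            # (result oid, s.z)
    return _17, _23

col_a, col_z = run_plan(t_c=[10, 20, 30], t_a=[1, 2, 3],
                        s_x=[20, 30, 40], s_z=[7, 8, 9])
print(col_a, col_z)
```

The join itself only produces pairs of matching OIDs; the projected columns `a` and `z` are fetched afterwards, one column at a time, against that OID list.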
  • 13. MonetDB/5 Back-end: MAL
      • MAL: Monet Assembly Language
      • textual interface
      • Interpreted language
      • Designed as system interface language
      • Reduced, concise syntax
      • Strict typing
      • Meant for automatic generation and parsing/rewriting/processing
      • Not meant to be typed by humans
      • Efficient parser
      • Low overhead
      • Inherent support for tactical optimization: MAL -> MAL
      • Support for optimizer plug-ins
      • Support for runtime schedulers
      • Binary-algebra core
      • Flow control (MAL is computationally complete)
  • 14. Processing Model (MonetDB Kernel)
      - Bulk processing:
          - Full materialization of all intermediate results
      - Binary (i.e., 2-column) algebra core:
          - select, join, semijoin, outerjoin
          - union, intersection, diff (BAT-wise & column-wise)
          - group, count, max, min, sum, avg
          - reverse, mirror, mark
      - Runtime operational optimization:
          - Choosing the optimal algorithm & implementation according to input properties and system status
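A small Python sketch of the last point, runtime operational optimization. It assumes each column carries a "sorted" property that the select operator can inspect. The `Column` class and the two algorithm functions are hypothetical names and only illustrate picking an implementation from input properties. Each call fully materializes its result list, in line with the bulk processing model.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Column:
    tail: list           # values; the oid is the position in the list
    is_sorted: bool      # property maintained alongside the data


def make_column(values):
    values = list(values)
    ordered = all(a <= b for a, b in zip(values, values[1:]))
    return Column(values, ordered)


def select_scan(col, low, high):
    """Full scan: works on any input."""
    return [oid for oid, v in enumerate(col.tail) if low <= v <= high]


def select_binsearch(col, low, high):
    """Binary search: only valid when the tail is sorted."""
    return list(range(bisect_left(col.tail, low), bisect_right(col.tail, high)))


def select(col, low, high):
    """Pick the algorithm at run time from the column's properties."""
    algorithm = select_binsearch if col.is_sorted else select_scan
    return algorithm.__name__, algorithm(col, low, high)


sorted_col = make_column([1, 3, 5, 7, 9, 11])
random_col = make_column([9, 1, 7, 3, 11, 5])
how_sorted, oids_sorted = select(sorted_col, 3, 9)
how_random, oids_random = select(random_col, 3, 9)
```

Both calls answer the same range predicate, but only the sorted column takes the binary-search path.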
  • 15. Processing Model (MonetDB Kernel)
      - Heavy use of code expansion to reduce cost:
          - 1 algebra operator: select()
          - 3 overloaded operators: select("=",value), select("between",L,H), select("fcn",parm)
          - 10 operator algorithms: scan, hash-lookup, bin-search, bin-tree, pos-lookup
          - ~1500(!) routines (macro expansion): scan_range_select_oid_int(), hash_equi_select_void_str(), …
      - ~1500 selection routines
      - 149 unary operations
      - 335 join/group operations
      - ...
  • 16. The Software Stack
      Front-ends: XQuery, SQL 03, SciQL, RDF
      Optimizers
      Back-end(s): MonetDB 4, MonetDB 5
      Kernel: MonetDB kernel
  • 17. The Software Stack
      Front-ends (strategic optimization): XQuery, SQL 03
      MAL
      Optimizers (tactical optimization: MAL -> MAL rewrites)
      Back-end(s): MonetDB 4, MonetDB 5
      MAL
      Kernel (runtime operational optimization): MonetDB kernel
  • 18. MonetDB vs Traditional DBMS Architecture
      - Architecture-conscious query processing vs magnetic disk I/O conscious processing
          - Data layout, algorithms, cost models
      - RISC relational algebra (operator-at-a-time) vs tuple-at-a-time iterator model
          - Faster through simplicity: no tuple expression interpreter
      - Multi-model (ODMG, SQL, XML/XQuery, ..., RDF/SPARQL) vs relational with bolt-on subsystems
          - Columns as the building block for complex data structures
      - Decoupling of transactions from execution/buffering vs ARIES integrated into execution/buffering/indexing
          - ACID, but not ARIES. Pay-as-you-need transaction overhead.
      - Run-time indexing and query optimization vs static DBA/workload-driven optimization & indexing
          - Extensible optimizer framework
          - Cracking, recycling, sampling-based runtime optimization
  • 19. Evolution It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change. Charles Darwin (1809 - 1882)
  • 20. Agenda A crash course on column-stores Column stores for science applications The SciQL array query language
  • 22. SkyServer Schema: 446 columns, >585 million rows; 6 columns, >20 billion rows
  • 23. Recycler: motivation & idea
      “An architecture for recycling intermediates in a column-store”. Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010
      Motivation:
          - Scientific databases, data analytics
          - Terabytes of data (observational, transactional)
          - Prevailing read-only workload
          - Ad-hoc queries with commonalities
      Background:
          - Operator-at-a-time execution paradigm
              - Automatic materialization of intermediates
          - Canonical column-store organization
              - Intermediates have reduced dimensionality and finer granularity
              - Simplified overlap analysis
      Recycling idea:
          - Instead of garbage collecting, keep the intermediates and reuse them
          - Speed up query streams with commonalities
          - Low cost and self-organization
  • 24. Recycler: fit into MonetDB
      “An architecture for recycling intermediates in a column-store”. Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010
      Stack shown on the slide: SQL / XQuery -> MAL -> Tactical Optimizer (with Recycler Optimizer) -> MAL -> Run-time Support (Recycler: Admission & Eviction, Recycle Pool) -> MonetDB Kernel, all inside the MonetDB Server.
      MAL plan shown at both MAL levels:
          function user.s1_2(A0:date, ...):void;
              X5 := sql.bind("sys","lineitem",...);
              X10 := algebra.select(X5,A0);
              X12 := sql.bindIdx("sys","lineitem",...);
              X15 := algebra.join(X10,X12);
              X25 := mtime.addmonths(A1,A2);
              ...
  • 25. Recycler: instruction matching
      “An architecture for recycling intermediates in a column-store”. Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010
      Run-time comparison of
          - instruction types
          - argument values
      Exact matching:
          Y3 := sql.bind("sys","orders","o_orderdate",0);
          X1 := sql.bind("sys","orders","o_orderdate",0);
          …
      Recycle pool:
          Name  Value     Data type          Size
          X1    10        :bat[:oid,:date]
          T1    "sys"     :str
          T2    "orders"  :str
          …
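Exact matching can be sketched in Python as a lookup keyed on the instruction name and its argument values. The `RecyclePool` class, the `bind` stand-in and the sample catalog below are assumptions for illustration only. The admission and eviction policies of the actual recycler are left out.

```python
class RecyclePool:
    """Keeps intermediates and reuses them on an exact instruction match."""

    def __init__(self):
        self.pool = {}    # (operator name, argument values) -> result
        self.hits = 0

    def run(self, op, *args):
        key = (op.__name__, args)        # compare instruction type and arguments
        if key in self.pool:
            self.hits += 1
            return self.pool[key]        # reuse instead of recomputing
        result = op(*args)
        self.pool[key] = result          # keep instead of garbage collecting
        return result


# Invented catalog with a single column.
CATALOG = {("sys", "orders", "o_orderdate"): ["2008-01-02", "2008-01-05"]}
calls = []


def bind(schema, table, column, access):
    """Hypothetical stand-in for sql.bind; records how often it really runs."""
    calls.append((schema, table, column, access))
    return CATALOG[(schema, table, column)]


pool = RecyclePool()
x1 = pool.run(bind, "sys", "orders", "o_orderdate", 0)
y3 = pool.run(bind, "sys", "orders", "o_orderdate", 0)   # same instruction again
```

The second call finds the instruction in the pool, so `bind` executes only once and both variables refer to the same intermediate.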
  • 26. Recycler: instruction subsumption
      “An architecture for recycling intermediates in a column-store”. Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010
      New instruction:
          Y3 := algebra.select(X1,20,45);
      Instructions in the recycle pool:
          X3 := algebra.select(X1,10,80);
          …
          X5 := algebra.select(X1,20,60);
      Marked on the slide: X5
      Recycle pool:
          Name  Value  Data type         Size
          X1    10     :bat[:oid,:int]   2000
          X3    130    :bat[:oid,:int]   700
          X5    150    :bat[:oid,:int]   350
          …
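A hedged Python sketch of subsumption for range selections: a new select can be answered from any pooled select on the same column whose range covers it, and the smallest such intermediate is the cheapest to filter. The function names and the toy column are assumptions. The sizes do not match those on the slide.

```python
def select(bat, low, high):
    """Range selection over a list of (oid, value) pairs."""
    return [(oid, v) for oid, v in bat if low <= v <= high]


def select_with_subsumption(pool, column, base, low, high):
    """Answer a range select from the smallest covering pooled intermediate."""
    covering = [
        (len(result), (lo, hi))
        for (col, lo, hi), result in pool.items()
        if col == column and lo <= low and high <= hi
    ]
    if covering:
        _, reused = min(covering)                  # smallest covering intermediate
        source = pool[(column, reused[0], reused[1])]
    else:
        reused, source = None, base                # fall back to the base column
    result = select(source, low, high)
    pool[(column, low, high)] = result             # admit the new intermediate
    return result, reused


x1 = list(enumerate(range(0, 100, 5)))             # toy column: 0, 5, ..., 95
pool = {
    ("X1", 10, 80): select(x1, 10, 80),            # plays the role of X3
    ("X1", 20, 60): select(x1, 20, 60),            # plays the role of X5
}
y3, reused = select_with_subsumption(pool, "X1", x1, 20, 45)
```

Both pooled ranges cover [20, 45], and the sketch filters the smaller one, [20, 60], instead of scanning the base column.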
  • 27. Recycler: SkyServer evaluation
      “An architecture for recycling intermediates in a column-store”. Ivanova, Kersten, Nes, Goncalves. ACM TODS 35(4), Dec. 2010
      Sloan Digital Sky Survey / SkyServer, http://cas.sdss.org
          - 100 GB subset of DR4
          - 100-query batch from the January 2008 log
          - 1.5 GB of intermediates, 99% reuse
          - Join intermediates are the major consumer of memory and the major contributor to savings
  • 28. Agenda A crash course on column-stores Column stores for science applications The SciQL array query language
  • 29. What is an array?
      An array is a systematic arrangement of objects addressed by dimension values.
          Get(A, X, Y,…) => Value
          Set(A, X, Y,…) <= Value
      There are many species: vector, bit array, dynamic array, parallel array, sparse array, variable length array, jagged array
  • 30. Who needs them anyway?
      - Seismology: partial time-series
      - Climate simulation: temporal ordered grid
      - Astronomy: temporal ordered images
      - Remote sensing: image processing
      - Social networks: graph algorithms
      - Genomics: ordered strings
      Scientists ‘love them’: MSEED, NETCDF, FITS, CSV, ...
  • 31. Arrays in DBMS
      - Relational prototype built on arrays: Peterlee IS (1975)
      - Persistent programming languages: Astral (1980), Plain (1980)
      - Object-orientation and persistent languages were the make-believe way to handle them: O2 (1992)
  • 32. PostgreSQL 8.3
      Array declarations:
          CREATE TABLE sal_emp (
              name text,
              pay_by_quarter integer[],
              schedule text[][]);
          CREATE TABLE tictactoe (
              squares integer[3][3] );
      Array operations: denotation ([]), contains (@>), is contained in (<@), append, concat (||), dimension, lower, upper, prepend, to-string, from-string
      Array constraints: none, no enforcement of dimensions.
  • 33. MySQL
      From the MySQL forum, May 2010:
          "> How to store multiple values in a single field? Is there any array data type concept in MySQL?
           As Jörg said, 'Multiple values in a single field' would be an explicit violation of the relational model..."
      Is there any experience beyond encoding it as blobs?
  • 34. Rasdaman
      - Breaks large C++ arrays (rasters) into disjoint chunks
      - Maps chunks into large binary objects (blobs)
      - Provides a function interface to access them
      - RASCAL, a SQL92 extension
      - Known to work up to 12 TB
  • 35. SciDB
      - Breaks large C++ arrays (rasters) into overlapping chunks
      - Built a storage manager from scratch
      - Map-reduce processing model
      - Provides a function interface to access them
      - AQL, a crippled SQL92
  • 36. What is the problem?
      - Appropriate array denotations?
      - Functionally complete operation set?
      - Scale?
      - Size limitations due to (blob) representations?
      - Community awareness?
  • 37. MonetDB SciQL
      SciQL (pronounced ‘cycle’)
      • A backward compatible extension of SQL’03
      • Symbiosis of the relational and array paradigms
      • Flexible structure-based grouping
      • Capitalizes on the MonetDB physical array storage
      • Recycling, an adaptive ‘materialized view’
      • Zero-cost attachment contract for cooperative clients
      http://www.cwi.nl/~mk/SciQL.pdf
  • 38. Table vs arrays
      CREATE TABLE tmp
          - A collection of tuples
          - Indexed by a (primary) key
          - Default handling
          - Explicitly created using INS/UPD/DEL
  • 39. Table vs arrays
      CREATE TABLE tmp
          - A collection of tuples
          - Indexed by a (primary) key
          - Default handling
          - Explicitly created using INS/UPD/DEL
      CREATE ARRAY tmp
          - A collection of a priori defined tuples
          - Indexed by dimension expressions
          - Implicitly defined by default value
          - To be updated with INS/DEL/UPD
  • 40. SciQL examples
      CREATE TABLE matrix (
          x integer,
          y integer,
          value float,
          PRIMARY KEY (x,y) );
      INSERT INTO matrix VALUES (0,0,0),(0,1,0),(1,1,0),(1,0,0);
      Rows: (0,0,0), (0,1,0), (1,1,0), (1,0,0)
  • 41. SciQL examples
      Table:
          CREATE TABLE matrix (
              x integer,
              y integer,
              value float,
              PRIMARY KEY (x,y) );
          INSERT INTO matrix VALUES (0,0,0),(0,1,0),(1,1,0),(1,0,0);
          Rows: (0,0,0), (0,1,0), (1,1,0), (1,0,0)
      Array:
          CREATE ARRAY matrix (
              x integer DIMENSION[2],
              y integer DIMENSION[2],
              value float DEFAULT 0);
          Cells: a 2 x 2 grid holding the default 0; positions outside the bounds are shown as null
  • 42. SciQL examples
      Table:
          CREATE TABLE matrix (
              x integer,
              y integer,
              value float,
              PRIMARY KEY (x,y) );
          DELETE FROM matrix WHERE y=1
          Rows left: (0,0,0), (1,0,0)
      Array:
          CREATE ARRAY matrix (
              x integer DIMENSION[2],
              y integer DIMENSION[2],
              value float DEFAULT 0);
          DELETE FROM matrix WHERE y=1
          A hole in the array: the cells with y=1 become null, the cells with y=0 keep the value 0
  • 43. SciQL examples
      Table:
          CREATE TABLE matrix (
              x integer,
              y integer,
              value float,
              PRIMARY KEY (x,y) );
          INSERT INTO matrix VALUES (0,1,1), (1,1,2)
          Rows: (0,0,0), (1,0,0), (0,1,1), (1,1,2)
      Array:
          CREATE ARRAY matrix (
              x integer DIMENSION[2],
              y integer DIMENSION[2],
              value float DEFAULT 0);
          INSERT INTO matrix VALUES (0,1,1), (1,1,2)
          Cells: row y=1 holds 1 and 2, row y=0 holds 0 and 0
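The three steps above (create, delete, insert) can be replayed in a short Python sketch. It assumes the semantics suggested by the slides: a table only holds the rows that were inserted, while a bounded array always holds every cell, so a delete leaves a null hole and an insert overwrites a cell. The `BoundedArray` class is a hypothetical illustration, not SciQL's implementation.

```python
class BoundedArray:
    """All cells exist up front and start at the default value."""

    def __init__(self, nx, ny, default=0):
        self.cells = {(x, y): default for x in range(nx) for y in range(ny)}

    def delete(self, predicate):
        """Deleting does not remove cells; it punches null holes."""
        for (x, y) in self.cells:
            if predicate(x, y):
                self.cells[(x, y)] = None

    def insert(self, x, y, value):
        """Inserting overwrites an existing cell."""
        if (x, y) not in self.cells:
            raise IndexError("cell outside the array bounds")
        self.cells[(x, y)] = value


# Table: rows keyed by the primary key (x, y).
table = {(0, 0): 0, (0, 1): 0, (1, 1): 0, (1, 0): 0}
array = BoundedArray(2, 2, default=0)

# DELETE ... WHERE y = 1
table = {(x, y): v for (x, y), v in table.items() if y != 1}
array.delete(lambda x, y: y == 1)
rows_after_delete = len(table)
holes_after_delete = sorted(k for k, v in array.cells.items() if v is None)

# INSERT ... VALUES (0,1,1), (1,1,2)
for x, y, v in [(0, 1, 1), (1, 1, 2)]:
    table[(x, y)] = v
    array.insert(x, y, v)
```

After the delete the table has shrunk to two rows while the array still has four cells, two of them null. After the insert both hold the same four values.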
  • 44. SciQL unbounded arrays
      Table:
          CREATE TABLE matrix (
              x integer,
              y integer,
              value float,
              PRIMARY KEY (x,y) );
          INSERT INTO matrix VALUES (0,2,1), (0,1,2)
          Rows: (0,2,1), (0,1,2)
      Array:
          CREATE ARRAY matrix (
              x integer DIMENSION,
              y integer DIMENSION,
              value float DEFAULT 0);
          INSERT INTO matrix VALUES (0,2,1), (0,1,2)
          Cells: (0,2) holds 1 and (0,1) holds 2; the other cells shown hold the default 0
  • 45. SciQL Dimensions
      Unbounded dimensions:
          scalar-type DIMENSION
      Bounded dimensions:
          scalar-type DIMENSION[stop]
          scalar-type DIMENSION[first: step: stop]
          scalar-type DIMENSION[*: *: *]
          timestamp DIMENSION[timestamp '2010-01-19' : * : timestamp '1' minute]
  • 46. SciQL table queries
      CREATE ARRAY matrix (
          x integer DIMENSION,
          y integer DIMENSION,
          value float DEFAULT 0 );
      -- simple checker boarding aggregation
      SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
  • 47. SciQL array queries
      CREATE ARRAY matrix (
          x integer DIMENSION,
          y integer DIMENSION,
          value float DEFAULT 0 );
      -- group based aggregation to construct an unbounded vector
      SELECT [x], sum(value) FROM matrix WHERE (x + y) % 2 = 0 GROUP BY x;
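As a cross-check of what these two queries compute, here is a plain Python sketch over an invented 4 x 4 matrix. The cell values (4x + y) are an assumption chosen for illustration.

```python
from collections import defaultdict

# Invented 4 x 4 array: cell (x, y) holds 4*x + y.
matrix = {(x, y): float(4 * x + y) for x in range(4) for y in range(4)}

# SELECT sum(value) FROM matrix WHERE (x + y) % 2 = 0
checker_sum = sum(v for (x, y), v in matrix.items() if (x + y) % 2 == 0)

# SELECT [x], sum(value) FROM matrix WHERE (x + y) % 2 = 0 GROUP BY x
per_x = defaultdict(float)
for (x, y), v in matrix.items():
    if (x + y) % 2 == 0:
        per_x[x] += v
vector = [per_x[x] for x in sorted(per_x)]   # one cell per x value
```

The first query collapses the checkerboard cells into one scalar. The second keeps `x` as a dimension and so yields a vector indexed by `x`.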
  • 48. SciQL array views
      CREATE ARRAY vmatrix (
          x integer DIMENSION[-1:5],
          y integer DIMENSION[-1:5],
          value float DEFAULT -1 )
      AS SELECT x, y, value FROM matrix;
      Resulting cells as drawn on the slide:
          -1  -1  -1  -1
          -1   0   0  -1
          -1   0   0  -1
          -1  -1  -1  -1
  • 49. SciQL tiling examples
      V0,3  V1,3  V2,3  V3,3
      V0,2  V1,2  V2,2  V3,2
      V0,1  V1,1  V2,1  V3,1
      V0,0  V1,0  V2,0  V3,0
      (an anchor point is marked on the grid)
      SELECT x, y, avg(value)
      FROM matrix
      GROUP BY matrix[x:1:x+2][y:1:y+2];
  • 50. SciQL tiling examples
      V0,3  V1,3  V2,3  V3,3
      V0,2  V1,2  V2,2  V3,2
      V0,1  V1,1  V2,1  V3,1
      V0,0  V1,0  V2,0  V3,0
      (an anchor point is marked on the grid)
      SELECT x, y, avg(value)
      FROM matrix
      GROUP BY DISTINCT matrix[x:1:x+2][y:1:y+2];
  • 51. SciQL tiling examples
      V0,3  V1,3  V2,3  V3,3
      V0,2  V1,2  V2,2  V3,2
      V0,1  V1,1  V2,1  V3,1
      V0,0  V1,0  V2,0  V3,0
      (an anchor point is marked on the grid; tile cells that fall outside the array are drawn as null)
      SELECT x, y, avg(value)
      FROM matrix
      GROUP BY DISTINCT matrix[x-1:1:x+1][y:1:y+2];
  • 52. SciQL tiling examples
      V0,3  V1,3  V2,3  V3,3
      V0,2  V1,2  V2,2  V3,2
      V0,1  V1,1  V2,1  V3,1
      V0,0  V1,0  V2,0  V3,0
      (an anchor point is marked on the grid)
      SELECT x, y, avg(value)
      FROM matrix
      GROUP BY matrix[x][y], matrix[x-1][y], matrix[x+1][y], matrix[x][y-1], matrix[x][y+1];
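The tiling queries can be approximated in Python by describing a tile as a list of offsets relative to its anchor point. This sketch makes three assumptions that the slides do not spell out: the range `x:1:x+2` is right-open (cells x and x+1), `DISTINCT` means non-overlapping tiles, and cells outside the array are skipped by `avg`. The function `tile_avg` and the sample values are hypothetical.

```python
def tile_avg(matrix, anchors, offsets):
    """Average the cells of one tile per anchor point."""
    out = {}
    for ax, ay in anchors:
        values = [
            matrix[(ax + dx, ay + dy)]
            for dx, dy in offsets
            if (ax + dx, ay + dy) in matrix        # skip out-of-bounds cells
        ]
        if values:
            out[(ax, ay)] = sum(values) / len(values)
    return out


# Invented 4 x 4 array: cell (x, y) holds 4*x + y.
matrix = {(x, y): float(4 * x + y) for x in range(4) for y in range(4)}

square = [(0, 0), (0, 1), (1, 0), (1, 1)]              # matrix[x:1:x+2][y:1:y+2]
cross = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]     # explicit neighbour list

every_cell = list(matrix)                              # overlapping tiles
every_second = [(x, y) for x in (0, 2) for y in (0, 2)]  # non-overlapping tiles

overlapping = tile_avg(matrix, every_cell, square)
distinct = tile_avg(matrix, every_second, square)
stencil = tile_avg(matrix, every_cell, cross)
```

Overlapping tiling gives one result per cell, distinct tiling gives one per 2 x 2 block, and the explicit neighbour list behaves like a five-point stencil.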
  • 53. SciQL, A Query Language for Science Applications
      • Seamless integration of array-, set-, and sequence-semantics.
      • Dimension constraints as a declarative means for indexed access to array cells.
      • Structural grouping to generalize the value-based grouping towards selective access to groups of cells based on positional relationships for aggregation.
  • 54. Seismology use case
      Rietbrock: Chile earthquake
          … 2 TB of wave fronts
          … filter by sta/lta
          … remove false positives
          … window-based 3 min cuts
          … heuristic tests
          … interactive response required …
      How can a database system help? Scanning 2 TB on a modern PC takes >3 hours
  • 55. Use case, a SciQL dream
      Rietbrock: Chile earthquake
      create array mseed (
          tick timestamp dimension[timestamp '2010':*],
          data decimal(8,6),
          station string );
  • 56. Use case, a SciQL dream
      Rietbrock: … filter by sta/lta
      -- average by window of 5 seconds
      select A.tick, avg(A.data)
      from mseed A
      group by A[tick:1:tick + 5 seconds]
  • 57. Use case, a SciQL dream
      Rietbrock: … filter by sta/lta
      select A.tick
      from mseed A, mseed B
      where A.tick = B.tick
        and avg(A.data) / avg(B.data) > delta
      group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
  • 58. Use case, a SciQL dream
      Rietbrock: … filter by sta/lta
      create view candidates(
          station string,
          tick timestamp,
          ratio float ) as
      select A.station, A.tick, avg(A.data) / avg(B.data) as ratio
      from mseed A, mseed B
      where A.tick = B.tick
        and avg(A.data) / avg(B.data) > delta
      group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
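The sta/lta filter in these queries compares a short-term average with a long-term average anchored at the same tick. Below is a minimal Python sketch, assuming one sample per second for a single station, forward-looking windows of 5 and 15 samples, and an invented signal with one burst. The function names and the threshold value are illustrative only.

```python
def window_avg(samples, start, width):
    """Average of samples[start : start + width], truncated at the end."""
    window = samples[start:start + width]
    return sum(window) / len(window)


def sta_lta_candidates(samples, short=5, long=15, delta=2.0):
    """Return (tick, ratio) where short-term / long-term average exceeds delta."""
    found = []
    for tick in range(len(samples)):
        ratio = window_avg(samples, tick, short) / window_avg(samples, tick, long)
        if ratio > delta:
            found.append((tick, ratio))
    return found


# Invented signal: background level 1.0 with a 5-second burst of 10.0.
samples = [1.0] * 20 + [10.0] * 5 + [1.0] * 20
candidates = sta_lta_candidates(samples)
ticks = [tick for tick, _ in candidates]
```

Only ticks whose short window is dominated by the burst pass the threshold, which mirrors the role of the `candidates` view.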
  • 59. Use case, a SciQL dream
      Rietbrock: … remove false positives
      -- remove isolated errors by direct environment
      -- using wave propagation statistics
      create table neighbors(
          head string,
          tail string,
          delay timestamp,
          weight float)
  • 60. Use case, a SciQL dream
      Rietbrock: … remove false positives
      select A.tick, B.tick
      from candidates A, candidates B, neighbors N
      where A.station = N.head
        and B.station = N.tail
        and B.tick = A.tick + N.delay
        and B.ratio * N.weight < A.ratio;
  • 61. Use case, a SciQL dream
      Rietbrock: … remove false positives
      delete from candidates
      select A.tick
      from candidates A, candidates B, neighbors N
      where A.station = N.head
        and B.station = N.tail
        and B.tick = A.tick + N.delay
        and B.ratio * N.weight < A.ratio;
  • 62. Use case, a SciQL dream
      Rietbrock: … window-based 3 min cuts … heuristic tests
      select B.station, myfunction(B.data)
      from candidates A, mseed B
      where A.tick = B.tick
      group by distinct B[tick:tick + 3 minutes];
      -- using a User Defined Function written in C.
  • 63. Use case Rietbrock: … interactive response required … The query over 2TB of seismic data will be handled before he finishes his coffee.
  • 64. Status
      • The language definition is ‘finished’
      • The grammar is included in the SQL parser
      • Semantic checks added to the SQL parser
      • A test suite is being built
      • Runtime support features and software stack
      • …
      • Exposure to real life cases and external libraries
  • 67. Science DBMS landscape
                        | MonetDB 5.23                       | SciDB 0.5           | Rasdaman
      Architecture      | Server approach                    | Server approach     | Plugin (Oracle, DB2, Informix, Mysql, Postgresql)
      Open source       | Mozilla License                    | GPL 3.0, Commercial | GPL 3.0, Dual license
      Downloads         | >12.000 /month                     | Tens up to now      | ??
      SQL               | SQL 2003                           | ??                  | SQL92++
      Interoperability  | {JO}DBC, C(++), Python, …          | C++ UDF             | C++, Java, OGC
      Array language    | SciQL                              | AQL                 | RASQL
      Array model       | Fixed+variable bounds              | Fixed arrays        | Fixed+variable bounds
      Science           | Linked libraries                   | Linked libraries    | Linked libraries
      Foreign files     | Vaults of csv, FITS, NETCDF, MSEED | ??                  | Tiff, png, jpg, .., csv, NETCDF, HDF4
      Distribution      | 50-200 node cluster                | 4 node cluster      | 20-node
      Distribution tech | Dynamic partial replication        | Static fragmentation | Static fragmentation
      Executor          | Various schemes                    | Map-reduce          | Tile streaming
      Largest demo      | Skyserver SDSS 6, 3TB              | ---                 | 12TB, IGN-F (on Postgresql)
      Storage tuning    | Query adaptive                     | Schema definitions  | Workload driven
      Optimization      | Heuristics + cost based            | ??                  | Heuristics + cost based