How Database Convergence Impacts the Coming Decades of Data Management

Nikita Shamgunov
CEO and co-founder of MemSQL

MemSQL at a Glance
Growth of digital business impacting data architectures
MISSION
We make every company a real-time enterprise
PRODUCT
Top Ranked Operational Data Warehouse
MemSQL provides you the ability to learn and react in real time
ABOUT
Founders are former Facebook, SQL Server database engineers
$85M in funding from top-tier investors; enterprise customers

Converge Transactions and Analytics in a Relational Database
● New breed of applications
○ Analytics as part of a transaction
○ Analytics when the data is born
○ In database AI/ML
● Scalable OLAP and OLTP in one system
○ Fewer systems to manage
○ Utility database consumption
○ Supports HTAP

Traditional + Future Architecture
[Diagram. Labels: IoT Data; Social Data; In-Memory Data Store (RAM); HTAP Apps; Transactions + Operational Analytics; Data Integration; DMSA; Traditional Reports and Analytics; Analytics, Historical Reporting and Data Discovery; Analytic Apps]

The New Data Architecture without DMSA
[Diagram. Labels: IoT Data; Social Data; In-Memory Data Store (RAM); HTAP Apps; Transactions + Operational Reports; Analytics, Historical Reporting and Data Discovery; Analytic Apps]

The Enterprise Requires Performance
● FAST Data Loading
○ Stream data
○ Real-time loading
○ Full data access
● LOW Query Latency
○ Vectorized queries
○ Real-time dashboards
○ Live data access
● HIGH Concurrency
○ Multi-threaded processing
○ Transactions and Analytics
○ Scalable performance

North Star: Build a New Category of Databases
● Focus on analytics and deliver a hybrid cloud data warehouse
○ Hybrid-cloud
○ Scalable with integration to data lakes
○ Real-time
○ Simplicity
● Converge transactions and analytics
○ Transaction support
○ Multi-cloud reliability
○ Application support

Real-time and Query Performance
Goals: Eliminate batching and deliver instant results to user or app
● Investments
○ Streaming ingest
■ Kafka
■ Kinesis
○ Transactional consistency
■ Ability to change data rapidly
■ Ability to scale analytics to millions of requests per second to enable self-service customer analytics
○ Query performance
■ Scale out
■ Vectorization
● Results
○ Dramatic query performance improvements for BI use cases
○ Adoption of PIPELINES is growing

Simplicity
● Goals
○ No knobs where you don’t need them
○ Data warehousing workloads work out of the box
○ No hints for queries
○ No scaling limits
● Investments
○ Query optimization and query execution
● Timelines
○ Several releases in 2017

Access Methods
▪ Rowstore
• Fully in memory
• Submillisecond point updates
• Multiple indexes
▪ Columnstore
• On disk with working set in memory
• Super fast scans
• Support analytical and data warehousing workloads
• One index
• Petabyte scale
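The contrast between the two access methods can be sketched in a few lines of Python. This is only an illustration of row-oriented versus column-oriented layout, not MemSQL's storage engine; the table, its column names, and the helper functions are all hypothetical, and the two layouts are independent copies of the same sample data.

```python
# Hypothetical three-row table (id, region, clicks); names are illustrative only.
rows = [
    (1, "Northeast", 10),
    (2, "Southwest", 7),
    (3, "Northeast", 5),
]

# Rowstore-style layout: each row is kept whole, reachable through an index.
rowstore = {row[0]: row for row in rows}  # primary key -> full row

# Columnstore-style layout: one array per column, read by sequential scans.
columnstore = {
    "id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "clicks": [r[2] for r in rows],
}


def point_update(pk, new_clicks):
    """Rowstore path: find one row through the index and rewrite it."""
    row_id, region, _ = rowstore[pk]
    rowstore[pk] = (row_id, region, new_clicks)


def total_clicks():
    """Columnstore path: scan a single column and ignore the others."""
    return sum(columnstore["clicks"])


point_update(2, 8)         # touches exactly one row in the rowstore copy
scan_total = total_clicks()  # reads only the "clicks" array of the columnstore copy
```

The point update touches one entry via the index, while the aggregate reads one contiguous array, which is why the slide pairs the rowstore with point updates and the columnstore with scans.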

Scale out and Transactional
▪ Supports multi-statement transactions
▪ Supports MVCC
▪ Scalable on commodity hardware
▪ Data hash partitioned and stored in two copies
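A minimal sketch of hash partitioning with two copies, under stated assumptions: the partition count, the leaf names, the use of CRC32 as the hash, and the "next node" placement rule are all hypothetical stand-ins, since the slides do not say how MemSQL hashes keys or places copies.

```python
import zlib

NUM_PARTITIONS = 8                                  # assumed partition count
NODES = ["leaf-1", "leaf-2", "leaf-3", "leaf-4"]    # hypothetical leaf names


def partition_for(shard_key):
    """Map a key to a partition with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_PARTITIONS


def placement(partition):
    """Return (first copy, second copy), always on two different nodes."""
    primary = NODES[partition % len(NODES)]
    replica = NODES[(partition + 1) % len(NODES)]
    return primary, replica


p = partition_for("customer-42")
primary_node, replica_node = placement(p)
```

Because the hash is deterministic, every row with the same key lands in the same partition, and keeping the second copy on a different node means losing one node does not lose the partition.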

Query Processing
MemSQL Confidential

Query Performance
▪ Group-By/Aggregate Performance
• Operations on encoded data
• Single-instruction multiple data (SIMD)
▪ Filter pushdown to column store
▪ Preference for dictionary compression
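"Operations on encoded data" can be illustrated with a group-by count that works on dictionary codes and decodes only the final result. This is a sketch of the general idea, not MemSQL's executor; the sample dictionary, the encoded column, and the function names are made up for the example.

```python
dictionary = ["Northeast", "Southeast", "Northwest"]  # dictionary_id -> value
encoded_column = [0, 2, 0, 1, 0, 2]                   # one small integer per row


def count_by_code(codes, dictionary_size):
    """Group-by count over integer codes; no string is touched per row."""
    counts = [0] * dictionary_size
    for code in codes:
        counts[code] += 1
    return counts


def decode_counts(counts, dictionary):
    """Translate codes back to values once per group, not once per row."""
    return {dictionary[i]: n for i, n in enumerate(counts) if n > 0}


counts = count_by_code(encoded_column, len(dictionary))
result = decode_counts(counts, dictionary)
```

The per-row work is an array increment on a small integer, which is also the kind of loop that lends itself to SIMD execution.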

SIMD overview
▪ Intel AVX2
▪ 256-bit registers
▪ Pack multiple values per register
▪ Special instructions for SIMD register operations
▪ Arithmetic, logic, load, store etc.
▪ Allows multiple operations in 1 instruction
▪ Example: [1 2 3 4] + [1 1 1 1] = [2 3 4 5]
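The lane-wise addition from the slide can be mimicked in plain Python. This only models the behavior (one "instruction" processes a full register of values); it does not use real AVX2 instructions, and the choice of 64-bit lanes, giving 4 values per 256-bit register, is an assumption made to match the four-element example.

```python
LANES = 4  # assumption: 64-bit values, so a 256-bit register holds 256 // 64 = 4


def simd_style_add(a, b, lanes=LANES):
    """Add two equal-length lists one register-sized chunk at a time."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    for start in range(0, len(a), lanes):
        # Each loop iteration stands in for a single SIMD add instruction.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return out


total = simd_style_add([1, 2, 3, 4], [1, 1, 1, 1])
```

With four lanes, the slide's example completes in one loop iteration, where a scalar loop would need four additions.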

Filter pushdown to dictionary
▪ Example:
• FactClick(id, region_id, …)
• select region_id, count(*) from FactClick
where region_id like '%east%' group by region_id
• region_id has only a few dozen values
• It is dictionary-encoded

Segment-level filter pushdown
▪ E.g. 6 regions
▪ 1M rows per segment
▪ WHERE region_id like '%east%'
▪ 6 string comparisons/segment
▪ Cache lookup table L:
[true, true, false, false, false, false]
▪ Output only rows where L[dictionary_id] = true
Dictionary
+---------------+---------------+
| dictionary_id | Region        |
+---------------+---------------+
|             0 | Northeast     |
|             1 | Southeast     |
|             2 | North-central |
|             3 | South-central |
|             4 | Northwest     |
|             5 | Southwest     |
+---------------+---------------+
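The segment-level pushdown above can be sketched as follows: evaluate the string predicate once per dictionary entry, cache the boolean lookup table L, then filter rows by an array lookup on dictionary_id. The dictionary is the one from the slide; the six-row sample segment and the function names are hypothetical, and a real segment would hold about 1M rows.

```python
dictionary = [
    "Northeast", "Southeast", "North-central",
    "South-central", "Northwest", "Southwest",
]


def build_lookup(dictionary, predicate):
    """Run the predicate once per dictionary value (6 string comparisons here)."""
    return [predicate(value) for value in dictionary]


def filter_segment(encoded_rows, lookup):
    """Keep the positions of rows whose dictionary_id maps to true."""
    return [pos for pos, dict_id in enumerate(encoded_rows) if lookup[dict_id]]


def contains_east(value):
    """Stand-in for LIKE '%east%' (compared case-insensitively)."""
    return "east" in value.lower()


lookup = build_lookup(dictionary, contains_east)
segment = [0, 4, 1, 5, 2, 0]            # hypothetical encoded region_id column
matches = filter_segment(segment, lookup)
```

The string comparison cost depends on the dictionary size rather than the row count, so a million-row segment still needs only six comparisons.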

Performance
▪ Improved Group-By/Aggregate performance, up to 80X
▪ Columnstore string filter pushdown
▪ Improved sort performance (up to 2-3X)
▪ Unenforced uniqueness constraints with RELY option
▪ Query optimizer improvements
▪ Columnstore update
▪ Columnstore JSON
Automatic Statistics
▪ Always-on cardinality statistics for every column
▪ For columnstore tables only
▪ On by default
▪ Results in better query plans, with less need for DBAs to run ANALYZE TABLE and tune queries

Columnstore update performance
▪ Ability to update rows identified via columnstore sort key
▪ Uses in-memory index on row store segment
[Diagram. Labels: Row Store Segment; Col Store Segment (repeated); Index on Sort Key; Seek]

New Query Features
▪ Cross-database queries (joins, insert-select)
▪ UPDATE/DELETE with joins
▪ UPDATE with subselect in SET clause
▪ reference_table LEFT JOIN …; (select …) LEFT JOIN… now supported
▪ Window functions with complex frames
• E.g. avg (a) over (order by b rows between 5 preceding and current row)
▪ New window functions
• first_value, last_value, nth_value, percentile_cont, percentile_disc
▪ Unenforced unique constraint + RELY
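To make the frame in the window-function example concrete, here is a small Python sketch of what avg(a) over (order by b rows between N preceding and current row) computes. It illustrates the semantics only; the function name and sample rows are hypothetical, NULL handling is left out, and ties in b are ordered arbitrarily.

```python
def moving_avg(rows, preceding=5):
    """rows is a list of (a, b) pairs; returns (b, avg of a over the frame)."""
    ordered = sorted(rows, key=lambda r: r[1])      # ORDER BY b
    out = []
    for i, (_, b) in enumerate(ordered):
        # Frame: up to `preceding` earlier rows plus the current row.
        frame = ordered[max(0, i - preceding): i + 1]
        out.append((b, sum(a for a, _ in frame) / len(frame)))
    return out


sample = [(30, 3), (10, 1), (20, 2)]
averages = moving_avg(sample, preceding=1)
```

Near the start of the ordering the frame is shorter than N + 1 rows, so the average there is taken over fewer values.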

Extensibility Features
▪ Major, release-defining feature set
▪ User-defined
• Stored procedures (SPs)
• Scalar-valued functions (UDFs)
• Table-valued functions (TVFs)
• Aggregate functions (UDAFs)
▪ Highlights
• SQL-developer friendly, clean syntax (no @, $ etc.)
• Compiled to machine code for speed
• Array and record support

Example UDF: normalize_string()
select normalize_string(" Abc XYZ ");
abc xyz

Implementation of normalize_string()
delimiter //
create or replace function normalize_string(str varchar(255)) returns varchar(255) as
declare
    r varchar(255) = "";
    i int;
    previousChar char;
    nextChar char;
    s varchar(255);
begin
    -- Lowercase the input and strip leading and trailing spaces.
    s = lower(trim(str));
    if length(s) = 0 then return s; end if;
    previousChar = substr(s, 1, 1);
    r = concat(r, previousChar);
    i = 2;
    -- Copy characters one by one, skipping a space that follows a space.
    while i <= length(s) loop
        nextChar = substr(s, i, 1);
        if not(previousChar = ' ' and nextChar = ' ') then
            r = concat(r, nextChar);
        end if;
        previousChar = nextChar;
        i += 1;
    end loop;
    return r;
end //
delimiter ;

Example SP: Move data more than 5 minutes old from t1 to t2
create table t1(a int, ts datetime);
create table t2(a int, ts datetime);
…
delimiter //
create or replace procedure myMove() as
declare
    -- Rows with a timestamp older than this boundary are moved.
    boundary datetime = date_add(now(), interval -5 minute);
begin
    insert into t2 select * from t1 where ts < boundary;
    delete from t1 where ts < boundary;
end //
delimiter ;

Example TVF
create table t (i int);
insert into t values (1),(2),(3),(4),(5);
create function basic(l int) returns table as
return select * from t limit l;
memsql> select * from basic(0);
Empty set (0.00 sec)
memsql> select * from basic(2);
+------+
| i    |
+------+
|    3 |
|    2 |
+------+
2 rows in set (0.01 sec)

User-Defined Aggregate Functions (UDAFs)
▪ Used like built-in aggregates like SUM()
▪ Based on 4 user-defined functions
• Initialize
• Iterate
• Merge
• Terminate
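The four-function pattern can be sketched in Python with an average as the aggregate. This is a generic model of initialize/iterate/merge/terminate, assuming each partition is aggregated independently and the partial states are merged afterwards; how MemSQL actually schedules these calls is not stated in the slides, and all function names here are hypothetical.

```python
def avg_init():
    """Initialize: the empty state is (running total, row count)."""
    return (0, 0)


def avg_iter(state, value):
    """Iterate: fold one input value into the state."""
    total, count = state
    return (total + value, count + 1)


def avg_merge(s1, s2):
    """Merge: combine two partial states, e.g. from different partitions."""
    return (s1[0] + s2[0], s1[1] + s2[1])


def avg_terminate(state):
    """Terminate: turn the final state into the result value."""
    total, count = state
    return total / count if count else None


def run_aggregate(partitions):
    """Aggregate each partition separately, then merge and terminate."""
    partials = []
    for values in partitions:
        state = avg_init()
        for v in values:
            state = avg_iter(state, v)
        partials.append(state)
    merged = avg_init()
    for partial in partials:
        merged = avg_merge(merged, partial)
    return avg_terminate(merged)


result = run_aggregate([[10, 12], [14]])
```

The merge step is what lets the aggregate run in parallel: the state must carry enough information (here, both total and count) for partial results to be combined correctly.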

Example UDAF
-- pick any arbitrary value from input
delimiter //
create function any_init() returns int as begin return -1; end;//
create function any_iter(s int, v int) returns int as begin return v; end;//
create function any_merge(s1 int, s2 int) returns int as
begin
if s1 = -1 then return s2; else return s1; end if;
end;//
create function any_terminate(s int) returns int as begin return s; end;//
delimiter ;
create aggregate any_val(int)
returns int
with state int
initialize with any_init
iterate with any_iter
merge with any_merge
terminate with any_terminate;

UDAF Output
create table t(g int, x int);
insert into t values (100, 10), (100, 12), (100, 14), (200, 21), (200, 27);
memsql> select g, any_val(x) from t group by g;
+------+------------+
| g    | any_val(x) |
+------+------------+
|  100 |         10 |
|  200 |         27 |
+------+------------+
2 rows in set (0.00 sec)

SCALAR (get a scalar query result)
create table t (i int);
insert into t values (1), (2), (3), (4), (5);
delimiter //
create or replace procedure scalar_basic() as
declare
    -- A query-typed variable, and the single value it evaluates to.
    v query(i int) = select max(i) from t;
    s int = scalar(v);
begin
    call tracelog(s);
end //
delimiter ;

COLLECT
delimiter //
create or replace procedure p_coll() as
declare
    c array(record(v varchar(80)));
    t query(v varchar(80)) = select v from r order by v;
begin
    delete from proc_log;
    -- Materialize the query result into an array of records, then loop over it.
    c = collect(t);
    for x in c loop
        call tracelog(x.v);
    end loop;
end //
delimiter ;

CALL and ECHO
▪ call sp_name(args)
• When no need to output rowset
▪ echo sp_name(args)
• Outputs rowset to client
▪ Exception handling supported

Performance
▪ Compiled to machine code using LLVM
▪ UDFs are inlined when appropriate

Distributed Execution
▪ SPs run on aggregator
▪ In queries issued from SPs, parameters and variables are substituted as strings on the aggregator before execution on the leaves
▪ UDFs can run on any node
• Aggregators or leaves
• Multiple invocations can run in parallel within a query

Summary
▪ New, user-defined
• Scalar functions
• Stored procedures
• Table-valued functions
• Aggregate functions
▪ Friendly to experienced SQL developers
▪ Array and record types supported
▪ High-performance through compilation to machine code
