How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.
Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.
1. How Database Convergence Impacts the Coming Decades of Data Management
Nikita Shamgunov
CEO and co-founder of MemSQL
2. MemSQL at a Glance
MISSION
Growth of digital business impacting data architectures
We make every company a real-time enterprise
PRODUCT
Top Ranked Operational Data Warehouse
MemSQL provides you the ability to learn and react in real time
ABOUT
Founders are former Facebook and SQL Server database engineers
$85M in funding from top-tier investors; Enterprise Customers:
3. Converge Transactions and Analytics in a Relational Database
● New breed of applications
○ Analytics as part of a transaction
○ Analytics when the data is born
○ In-database AI/ML
● Scalable OLAP and OLTP in one system
○ Fewer systems to manage
○ Utility database consumption
○ Supports HTAP
4. Traditional + Future Architecture
[Diagram. Labels: IoT Data; Social Data; In-Memory Data Store (RAM); Transactions + Operational Analytics; HTAP Apps; Data Integration; DMSA; Traditional Reports and Analytics; Analytics, Historical Reporting and Data Discovery; Analytic Apps]
5. The New Data Architecture without DMSA
[Diagram. Labels: IoT Data; Social Data; In-Memory Data Store (RAM); Transactions + Operational Reports; HTAP Apps; Analytics, Historical Reporting and Data Discovery; Analytic Apps]
6. The Enterprise Requires Performance
● FAST Data Loading
○ Stream data
○ Real-time loading
○ Full data access
● LOW Query Latency
○ Vectorized queries
○ Real-time dashboards
○ Live data access
● HIGH Concurrency
○ Multi-threaded processing
○ Transactions and Analytics
○ Scalable performance
7. North Star. Build a New Category of Databases
● Focus on analytics and deliver a hybrid cloud data warehouse
○ Hybrid-cloud
○ Scalable with integration to data lakes
○ Real-time
○ Simplicity
● Converge transactions and analytics
○ Transaction support
○ Multi-cloud reliability
○ Application support
8. Real-time and Query Performance
Goals: Eliminate batching and deliver instant results to user or app
● Investments
○ Streaming ingest
■ Kafka
■ Kinesis
○ Transactional consistency
■ Ability to change data rapidly
■ Ability to scale analytics to millions of requests per second to enable self-service customer analytics
○ Query performance
■ Scale out
■ Vectorization
● Results
○ Dramatic query performance improvements for BI use cases
○ PIPELINEs adoption is growing
9. Simplicity
● Goals
○ No knobs where you don’t need them
○ Data warehousing workloads work out of the box
○ No hints for queries
○ No scaling limits
● Investments
○ Query optimization and query execution
● Timelines
○ Several releases in 2017
10. Access Methods
▪ Columnstore
• On disk with working set in memory
• Super fast scans
• Supports analytical and data warehousing workloads
• One index
• Petabyte scale
▪ Rowstore
• Fully in memory
• Submillisecond point updates
• Multiple indexes
13. Query Performance
▪ Group-By/Aggregate Performance
• Operations on encoded data
• Single-instruction multiple data (SIMD)
▪ Filter pushdown to column store
▪ Preference for dictionary compression
MemSQL Confidential
14. SIMD overview
▪ Intel AVX2
▪ 256-bit registers
▪ Pack multiple values per register
▪ Special instructions for SIMD register operations
▪ Arithmetic, logic, load, store etc.
▪ Allows multiple operations in 1 instruction
▪ Example: [1 2 3 4] + [1 1 1 1] = [2 3 4 5]
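The lane-wise addition on this slide can be modeled in a few lines of Python. This is only a sketch of the idea, not MemSQL's implementation: the 32-bit lane width and the names `simd_add` and `vector_add` are assumptions chosen for illustration, and real AVX2 code would use hardware instructions rather than a Python loop.

```python
# Conceptual model of a 256-bit SIMD register holding 32-bit integer lanes.
# The lane width is an assumption for illustration; AVX2 supports several widths.
REGISTER_BITS = 256
LANE_BITS = 32
LANES = REGISTER_BITS // LANE_BITS   # 8 values packed per register
LANE_MASK = (1 << LANE_BITS) - 1


def simd_add(reg_a, reg_b):
    """Model one SIMD add: every lane is added in the same logical step.

    Each lane wraps around modulo 2**32, as packed integer adds do.
    """
    return [(x + y) & LANE_MASK for x, y in zip(reg_a, reg_b)]


def vector_add(xs, ys):
    """Add two equal-length columns one register-sized chunk at a time."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    out = []
    for start in range(0, len(xs), LANES):
        out.extend(simd_add(xs[start:start + LANES], ys[start:start + LANES]))
    return out


# The example from the slide.
print(vector_add([1, 2, 3, 4], [1, 1, 1, 1]))
```

With 8 lanes per register, a column of N values needs roughly N/8 add steps instead of N, which is where the scan speedup comes from.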
15. Filter pushdown to dictionary
▪ Example:
• FactClick(id, region_id, …)
• select region_id, count(*) from FactClick where region_id like '%east%' group by region_id
• region_id has only a few dozen values
• It is dictionary-encoded
16. Segment-level filter pushdown
▪ E.g. 6 regions
▪ 1M rows per segment
▪ WHERE region_id like '%east%'
▪ 6 string comparisons/segment
▪ Cache lookup table L: [true, true, false, false, false, false]
▪ Output only rows where L[dictionary_id] = true
Dictionary
dictionary_id  Region
0              Northeast
1              Southeast
2              North-central
3              South-central
4              Northwest
5              Southwest
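A minimal Python sketch of the segment-level pushdown described on this slide, using the six-region dictionary shown there. The function names, the sample segment contents, and the substring test standing in for LIKE '%east%' are assumptions for illustration; the actual engine works on compressed segments in native code.

```python
from collections import Counter

# Segment dictionary from the slide: dictionary_id -> region string.
DICTIONARY = [
    "Northeast", "Southeast", "North-central",
    "South-central", "Northwest", "Southwest",
]


def build_lookup(dictionary, needle):
    """Evaluate the string predicate once per dictionary entry.

    This costs len(dictionary) string comparisons per segment,
    not one comparison per row.
    """
    return [needle in value.lower() for value in dictionary]


def filter_and_count(encoded_rows, dictionary, needle):
    """Filter and group on integer dictionary ids, decoding only the output."""
    lookup = build_lookup(dictionary, needle)
    counts = Counter(d for d in encoded_rows if lookup[d])
    return {dictionary[d]: n for d, n in counts.items()}


# A tiny hypothetical segment of dictionary-encoded region_id values.
segment = [0, 4, 1, 1, 5, 0, 2, 3, 0]
print(build_lookup(DICTIONARY, "east"))
print(filter_and_count(segment, DICTIONARY, "east"))
```

For a segment of 1M rows, the per-row work is one array lookup on a small integer, and the grouping also runs on the encoded ids rather than on strings.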
17. Performance
▪ Improved Group-By/Aggregate, up to 80X
▪ Columnstore string filter pushdown
▪ Improved sort performance (can be 2-3X faster)
▪ Unenforced uniqueness constraints with RELY option
▪ Query optimizer improvements
▪ Columnstore update
▪ Columnstore JSON
18. Automatic Statistics
▪ Always-on cardinality statistics for every column
▪ For columnstore tables only
▪ On by default
▪ Results in better query plans, with less need for DBAs to run ANALYZE TABLE and tune queries
19. Columnstore update performance
▪ Ability to update rows identified via columnstore sort key
▪ Uses in-memory index on row store segment
[Diagram. Labels: Row Store Segment; Col Store Segment; Col Store Segment; …; Index on Sort Key; Seek]
20. New Query Features
▪ Cross-database queries (joins, insert-select)
▪ UPDATE/DELETE with joins
▪ UPDATE with subselect in SET clause
▪ reference_table LEFT JOIN …; (select …) LEFT JOIN… now supported
▪ Window functions with complex frames
• E.g. avg (a) over (order by b rows between 5 preceding and current row)
▪ New window functions
• first_value, last_value, nth_value, percentile_cont, percentile_disc
▪ Unenforced unique constraint + RELY
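To make the frame clause in the window function example concrete, here is a small Python sketch of what avg(a) over (order by b rows between 5 preceding and current row) computes. The function name `framed_avg` and the sample rows are hypothetical, NULL handling is ignored, and ties in b are ordered arbitrarily here just as they would be in SQL.

```python
def framed_avg(rows, preceding=5):
    """Average of a over a frame of the current row plus up to `preceding` earlier rows.

    rows: list of (a, b) tuples; the window is ordered by b.
    Returns a list of (b, average) in window order.
    """
    ordered = sorted(rows, key=lambda row: row[1])
    result = []
    for i, (_, b) in enumerate(ordered):
        # The frame shrinks near the start of the partition.
        frame = ordered[max(0, i - preceding):i + 1]
        result.append((b, sum(a for a, _ in frame) / len(frame)))
    return result


# Hypothetical input: a equals b for the values 1..7.
rows = [(v, v) for v in range(1, 8)]
for b, avg in framed_avg(rows):
    print(b, avg)
```

The first row averages only itself, while the seventh row averages the values 2 through 7, since the frame holds at most six rows.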
21. Extensibility Features
▪ Major, release-defining feature set
▪ User-defined
• Stored procedures (SPs)
• Scalar-valued functions (UDFs)
• Table-valued functions (TVFs)
• Aggregate functions (UDAFs)
▪ Highlights
• SQL-developer friendly, clean syntax (no @, $ etc.)
• Compiled to machine code for speed
• Array and record support
23. Implementation of normalize_string()
delimiter //
-- Lowercase and trim the input, then collapse each run of spaces into one space.
create or replace function normalize_string(str varchar(255)) returns varchar(255) as
declare
  r varchar(255) = "";
  i int;
  previousChar char;
  nextChar char;
  s varchar(255);
begin
  s = lower(trim(str));
  if length(s) = 0 then
    return s;
  end if;
  previousChar = substr(s, 1, 1);
  r = concat(r, previousChar);
  i = 2;
  while i <= length(s) loop
    nextChar = substr(s, i, 1);
    -- Skip a space that directly follows another space.
    if not(previousChar = ' ' and nextChar = ' ') then
      r = concat(r, nextChar);
    end if;
    previousChar = nextChar;
    i += 1;
  end loop;
  return r;
end //
delimiter ;
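To show what normalize_string() returns for a few inputs, the same logic can be ported to Python. This port is only an illustration of the behavior and is not part of the talk; it assumes that trim() strips only spaces and that lower() matches Python's `str.lower()` for the inputs used.

```python
def normalize_string(text):
    """Lowercase, trim spaces, and collapse runs of spaces into one space."""
    s = text.strip(" ").lower()
    if not s:
        return s
    out = [s[0]]
    previous = s[0]
    for ch in s[1:]:
        # Skip a space that directly follows another space.
        if not (previous == " " and ch == " "):
            out.append(ch)
        previous = ch
    return "".join(out)


print(repr(normalize_string("  Hello   World ")))
print(repr(normalize_string("A  B   C")))
```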
24. Example SP: Move data more than 5 minutes old from t1 to t2
create table t1(a int, ts datetime);
create table t2(a int, ts datetime);
…
delimiter //
create or replace procedure myMove() as
declare
  -- Compute the cutoff once so the insert and the delete use the same boundary.
  boundary datetime = date_add(now(), interval -5 minute);
begin
  insert into t2 select * from t1 where ts < boundary;
  delete from t1 where ts < boundary;
end //
delimiter ;
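The same move pattern can be tried out locally with Python's built-in sqlite3 module. This is a stand-in sketch, not MemSQL code: SQLite has no stored procedures, so the procedure body becomes a Python function, timestamps are stored as text, and the function name `my_move` and the sample rows are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta


def my_move(conn, now):
    """Move rows older than 5 minutes from t1 to t2 using one fixed boundary."""
    boundary = (now - timedelta(minutes=5)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:  # commit both statements together, or roll back on error
        conn.execute("insert into t2 select * from t1 where ts < ?", (boundary,))
        conn.execute("delete from t1 where ts < ?", (boundary,))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1(a int, ts text)")
    conn.execute("create table t2(a int, ts text)")
    conn.executemany(
        "insert into t1 values (?, ?)",
        [(1, "2017-10-01 11:50:00"), (2, "2017-10-01 11:58:00")],
    )
    # Fixed "now" so the example is repeatable; the cutoff is 11:55:00.
    my_move(conn, datetime(2017, 10, 1, 12, 0, 0))
    remaining = conn.execute("select a from t1 order by a").fetchall()
    moved = conn.execute("select a from t2 order by a").fetchall()
    return remaining, moved


print(demo())
```

Computing the boundary once matters: if each statement called now() separately, a row could age past the cutoff between the insert and the delete and be deleted without having been copied.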
25. Example TVF
create table t (i int);
insert into t values (1),(2),(3),(4),(5);
create function basic(l int) returns table as
  return select * from t limit l;
-- There is no ORDER BY, so which rows LIMIT returns is not guaranteed.
memsql> select * from basic(0);
Empty set (0.00 sec)
memsql> select * from basic(2);
+------+
| i    |
+------+
|    3 |
|    2 |
+------+
2 rows in set (0.01 sec)
26. User-Defined Aggregate Functions (UDAFs)
▪ Used like built-in aggregates such as SUM()
▪ Based on 4 user-defined functions
• Initialize
• Iterate
• Merge
• Terminate
27. Example UDAF
-- pick any arbitrary value from input
delimiter //
create function any_init() returns int as begin return -1; end;//
create function any_iter(s int, v int) returns int as begin return v; end;//
create function any_merge(s1 int, s2 int) returns int as
begin
  if s1 = -1 then return s2; else return s1; end if;
end;//
create function any_terminate(s int) returns int as begin return s; end;//
delimiter ;
create aggregate any_val(int)
  returns int
  with state int
  initialize with any_init
  iterate with any_iter
  merge with any_merge
  terminate with any_terminate;
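How the four functions cooperate can be sketched in Python. The driver `run_udaf` and the way rows are split into partitions are assumptions made for illustration; the slides do not describe how MemSQL schedules these calls, only that initialize, iterate, merge, and terminate functions are supplied.

```python
from functools import reduce


def any_init():
    return -1              # sentinel state meaning "no value seen yet"


def any_iter(state, value):
    return value           # keep the most recent value seen


def any_merge(s1, s2):
    return s2 if s1 == -1 else s1


def any_terminate(state):
    return state


def run_udaf(partitions):
    """Hypothetical driver: aggregate each partition, then merge the partial states."""
    partials = []
    for rows in partitions:
        state = any_init()
        for value in rows:
            state = any_iter(state, value)
        partials.append(state)
    merged = reduce(any_merge, partials, any_init())
    return any_terminate(merged)


# The same group of values split across partitions in two different ways.
print(run_udaf([[10], [12, 14]]))
print(run_udaf([[10, 12, 14]]))
```

The two calls return different members of the same group, which is acceptable for an aggregate whose contract is "any value". The -1 sentinel means this particular example cannot distinguish an empty input from a real value of -1.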
28. UDAF Output
create table t(g int, x int);
insert into t values (100, 10), (100, 12), (100, 14), (200, 21), (200, 27);
memsql> select g, any_val(x) from t group by g;
+------+------------+
| g    | any_val(x) |
+------+------------+
|  100 |         10 |
|  200 |         27 |
+------+------------+
2 rows in set (0.00 sec)
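29. SCALAR (get a scalar query result)
create table t (i int);
insert into t values (1), (2), (3), (4), (5);
delimiter //
create or replace procedure scalar_basic() as
declare
  -- A query-typed variable; scalar() evaluates it and returns its single value.
  v query(i int) = select max(i) from t;
  s int = scalar(v);
begin
  -- tracelog is not defined on these slides; presumably a user-written logging procedure.
  call tracelog(s);
end //
delimiter ;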
30. COLLECT
delimiter //
create or replace procedure p_coll() as
declare
  c array(record(v varchar(80)));
  t query(v varchar(80)) = select v from r order by v;
begin
  delete from proc_log;
  -- collect() runs the query and returns its rows as an array of records.
  c = collect(t);
  for x in c loop
    call tracelog(x.v);
  end loop;
end //
delimiter ;
31. CALL and ECHO
▪ call sp_name(args)
• Use when no rowset needs to be returned
▪ echo sp_name(args)
• Outputs rowset to client
▪ Exception handling supported
32. Performance
▪ Compiled to machine code using LLVM
▪ UDFs are inlined when appropriate
33. Distributed Execution
▪ SPs run on aggregator
▪ In queries issued from SPs, parameters and variables are substituted as strings on the aggregator before execution on the leaves
▪ UDFs can run on any node
• Aggregators or leaves
• Multiple invocations can run in parallel within a query
34. Summary
▪ New, user-defined
• Scalar functions
• Stored procedures
• Table-valued functions
• Aggregate functions
▪ Friendly to experienced SQL developers
▪ Array and record types supported
▪ High performance through compilation to machine code