Cassandra Materialized Views and User Defined Functions

1
Cassandra 2.2 and 3.0
new features
DuyHai DOAN
Apache Cassandra Technical Evangelist
#VoxxedBerlin @doanduyhai

Datastax
2
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarter in San Francisco Bay area
•  EU headquarter in London, ofﬁces in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Materialized Views (MV)
•  Why ?
•  Detailed Impl
•  Gotchas

Why Materialized Views ?
•  Relieve the pain of manual denormalization
CREATE TABLE user(
id int PRIMARY KEY,
country text,
…
);
CREATE TABLE user_by_country(
country text,
id int,
…,
PRIMARY KEY(country, id)
);
4

CREATE TABLE user_by_country (
country text,
id int,
firstname text,
lastname text,
PRIMARY KEY(country, id));
Materialzed View In Action
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id)
5

Materialzed View Syntax
CREATE MATERIALIZED VIEW [IF NOT EXISTS]
keyspace_name.view_name
AS SELECT column1, column2, ...
FROM keyspace_name.table_name
WHERE column1 IS NOT NULL AND column2 IS NOT NULL ...
PRIMARY KEY(column1, column2, ...)
Must select all primary key columns of base table
•  IS NOT NULL condition for now
•  more complex conditions in future
•  at least all primary key columns of base table
(ordering can be different)
•  maximum 1 column NOT pk from base table
6

Materialized View Impl
C*
C*
C*
C*
C* C*
C* C*
UPDATE user
SET country=‘FR’
WHERE id=1
①
•  send mutation to all replicas
•  waiting for ack(s) with CL
8

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
②
Acquire local lock on
base table partition
9

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
③
Local read to fetch current values
SELECT * FROM user WHERE id=1
10

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
④
Create BatchLog with
•  DELETE FROM user_by_country
WHERE country = ‘old_value’
•  INSERT INTO
user_by_country(country, id, …)
VALUES(‘FR’, 1, ...)
11

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑤
Execute async BatchLog
to paired view replica
with CL = ONE
12

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑥
Apply base table updade locally
SET COUNTRY=‘FR’
13

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑦
Release local lock
14

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑧
Return ack to
coordinator
15

C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑨
If CL ack(s)
received, ack client
16

MV Failure Cases: concurrent updates
Read base row (country=‘UK’)
•  DELETE FROM mv WHERE
country=‘UK’
•  INSERT INTO mv …(country)
VALUES(‘US’)
•  Send async BatchLog
•  Apply update country=‘US’
1) UPDATE … SET country=‘US’ 2) UPDATE … SET country=‘FR’
country=‘UK’
VALUES(‘FR’)
•  Apply update country=‘FR’
t0
t1
t2
Without local lock
17

country=‘UK’
VALUES(‘US’)
country=‘UK’
VALUES(‘FR’)
t0
t1
t2
Without local lock
18
INSERT INTO mv …(country) VALUES(‘US’)
INSERT INTO mv …(country) VALUES(‘FR’)

country=‘UK’
VALUES(‘US’)
Read base row (country=‘US’)
country=‘US’
VALUES(‘FR’)
With local lock
🔒
🔓 🔒
🔓19

MV Failure Cases: failed updates to MV
C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
⑤
Execute async BatchLog
to paired view replica
with CL = ONE
✘
MV replica down
20

MV Failure Cases: failed updates to MV
C*
C*
C*
C*
C* C*
C* C*
UPDATE user
WHERE id=1
BatchLog
replay
MV replica up
21

Materialized View Performance
•  Write performance
•  local lock
•  local read-before-write for MV à update contention on partition (most of perf hits)
•  local batchlog for MV
•  ☞ you only pay this price once whatever number of MV
•  for each base table update: mv_count x 2 (DELETE + INSERT) extra mutations
22

•  Write performance vs manual denormalization
•  MV better because no client-server network trafﬁc for read-before-write
•  MV better because less network trafﬁc for multiple views (client-side BATCH)
•  Makes developer life easier à priceless
23

•  Read performance vs secondary index
•  MV better because single node read (secondary index can hit many nodes)
•  MV better because single read path (secondary index = read index + read data)
24

Materialized Views Consistency
•  Consistency level
•  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table.
•  Exemple, write at QUORUM
•  guarantee that QUORUM replicas of base tables have received write
•  guarantee that QUORUM of MV replicas will eventually receive DELETE + INSERT
25

Materialized Views Gotchas
•  Beware of hot spots !!!
•  MV user_by_gender 😱
26

User Deﬁne Functions (UDF)
•  Why ?
•  Detailed Impl
•  UDAs
•  Gotchas

Rationale
•  Push computation server-side
•  save network bandwidth (1000 nodes!)
•  simplify client-side code
•  provide standard & useful function (sum, avg …)
•  accelerate analytics use-case (pre-aggregation for Spark)
29

How to create an UDF ?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALL ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURN returnType
LANGUAGE language
AS $$
// source code here
$$;
30

CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURN returnType
LANGUAGE language
AS $$
// source code here
$$;
An UDF is keyspace-wide
31

RETURN returnType
LANGUAGE language
AS $$
// source code here
$$;
Param name to refer to in the code
Type = CQL3 type
32

RETURN returnType
LANGUAGE language // j
AS $$
// source code here
$$;
Always called
Null-check mandatory in code
33

RETURN returnType
LANGUAGE language // jav
AS $$
// source code here
$$;
If any input is null, code block is
skipped and return null
34

RETURN returnType
LANGUAGE language
AS $$
// source code here
$$;
CQL types
•  primitives (boolean, int, …)
•  collections (list, set, map)
•  tuples
•  UDT
35

RETURN returnType
LANGUAGE language
AS $$
// source code here
$$; JVM supported languages
•  Java, Scala
•  Javascript (slow)
•  Groovy, Jython, JRuby
•  Clojure ( JSR 223 impl issue)
36

RETURN returnType
LANGUAGE language
AS $$
// source code here
$$;
37

UDA
•  Real use-case for UDF
•  Aggregation server-side à huge network bandwidth saving
•  Provide similar behavior for Group By, Sum, Avg etc …
39

How to create an UDA ?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
Only type, no param name
State type
Initial state type
40

STYPE stateType
INITCOND initCond;
Accumulator function. Signature:
accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType
41

STYPE stateType
INITCOND initCond;
Optional final function. Signature:
finalFunction(stateType)
42

STYPE stateType
INITCOND initCond;
UDA return type ?
If finalFunction
•  return type of finalFunction
Else
•  return stateType
43

Gotchas
C* C*
C*
C*
UDA
①
② & ③
⑤
② & ③
② & ③
45

Gotchas
C* C*
C*
C*
UDA
①
② & ③
⑤
② & ③
② & ③
46
Why do not apply UDF/UDA on replica node ?

Gotchas
C* C*
C*
C*
UDA
①
② & ③
④
•  apply accumulatorFunction
•  apply ﬁnalFunction
⑤
② & ③
② & ③
1.  Because of eventual
consistency
2.  UDF/UDA applied AFTER
last-write-win logic
47

Gotchas
48
•  UDA in Cassandra is not distributed !
•  Execute UDA on a large number of rows (106 for ex.)
•  single fat partition
•  multiple partitions
•  full table scan
•  à Increase client-side timeout
•  default Java driver timeout = 12 secs
•  JAVA-1033 JIRA for per-request timeout setting

Cassandra UDA or Apache Spark ?
49
Consistency
Level
Single/Multiple
Partition(s)
Recommended
Approach
ONE Single partition UDA with token-aware driver because node local
ONE Multiple partitions Apache Spark because distributed reads
> ONE Single partition UDA because data-locality lost with Spark
> ONE Multiple partitions Apache Spark deﬁnitely

Cassandra UDA or Apache Spark ?
50
Consistency
Level
Single/Multiple
Partition(s)
Recommended
Approach
ONE Single partition UDA with token-aware driver because node local
ONE Multiple partitions Apache Spark because distributed reads
> ONE Single partition UDA because data-locality lost with Spark
> ONE Multiple partitions Apache Spark deﬁnitely

New Storage Engine
•  Data structure
•  Disk space usage

Pre 3.0 data structure
Map<byte[ ], SortedMap<byte[ ], Cell>>
53
CREATE TABLE sensor_data(
sensor_id uuid,
date timestamp,
sensor_type text,
sensor_value double,
PRIMARY KEY(sensor_id, date)
);

Pre 3.0 on disk layout
54
RowKey: de305d54-75b4-431b-adb2-eb6b9e546014
=> (column=2015-04-27 10:00:00+0100:, value=, timestamp=1430128800)
=> (column=2015-04-27 10:00:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128800)
=> (column=2015-04-27 10:00:00+0100:sensor_value, value=23.48, timestamp=1430128800)
=> (column=2015-04-27 10:01:00+0100:, value=, timestamp=1430128860)
=> (column=2015-04-27 10:01:00+0100:sensor_type, value=‘Temperature’, timestamp=1430128860)
=> (column=2015-04-27 10:01:00+0100:sensor_value, value=24.08, timestamp=1430128860)
Clustering values are repeated
for each normal column
Full timestamp storage

Cassandra 3.0 data structure
Map<byte[ ], SortedMap<ClusteringColumn, Row>>
55
CREATE TABLE sensor_data(
sensor_id uuid,
date timestamp,
sensor_type text,
sensor_value double,
PRIMARY KEY(sensor_id, date)
);

Cassandra 3.0 on disk layout
56
PartitionKey: de305d54-75b4-431b-adb2-eb6b9e546014
=> clusteringColumn:2015-04-27 10:00:00+0100
=> row_timestamp=1430128800
=> (column_value=‘Temperature’, delta_encoded_timestamp=+0)
=> (column_value=23.48, delta_encoded_timestamp=+0)
=> clusteringColumn:2015-04-27 10:01:00+0100
=> row_timestamp=1430128860
=> (column_value=‘Temperature’, delta_encoded_timestamp=+0)
=> (column_value=24.08, delta_encoded_timestamp=+0)
Delta-encoded timestamp
vs row timestamp

Gains
57
•  No clustering value repetition
•  Column labels are stored only once in meta data
•  Delta encoding of timestamp, 8 bytes saved each time
•  Less disk space used

Benchmarks
58
CREATE TABLE events (
id uuid,
date timeuuid,
prop1 int,
prop2 text,
prop3 float,
PRIMARY KEY(id, date));
106 rows
Small string

Benchmarks
59
CREATE TABLE largetext(
key int,
prop1 int,
prop2 text,
PRIMARY KEY(id));
106 rows
Large string (1000)

Benchmarks
60
CREATE TABLE
largeclustering(
key int,
clust text,
prop1 int,
prop2 set<float>,
PRIMARY KEY(id, clust));
106 rowsMedium string (100)
50 items

Benchmarks
61
CREATE TABLE events (
id uuid,
date timeuuid,
prop1 int,
prop2 text,
prop3 float,
PRIMARY KEY(id, date))
WITH COMPACT STORAGE ;

@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
Thank You
63

Cassandra Materialized Views and User Defined Functions

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Cassandra Materialized Views and User Defined Functions

Similaire à Cassandra Materialized Views and User Defined Functions (20)

Dernier

Dernier (20)

Cassandra Materialized Views and User Defined Functions