Duyhai Doan gave a presentation on new features in Cassandra 3.0, including materialized views, user defined functions, user defined aggregates, and the new SASI full text search index. Materialized views allow pre-computing common queries to improve performance. User defined functions and aggregates enable pushing computation to the server. The new SASI index provides improved full text search capabilities in Cassandra.
2. Duyhai DOAN (@doanduyhai) Kraków, 11-13 May 2016
Who Am I?
Apache Cassandra Evangelist
• talks, meetups, confs
• open-source devs (Achilles, Apache Zeppelin)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
Agenda
• Materialized Views (MV)
• User Defined Functions (UDF) & User Defined Aggregates (UDA)
• JSON syntax
• New SASI full text search
Why Materialized Views?
• Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …,
PRIMARY KEY(country, id));
Materialized Views creation
-- Before: a second table maintained by hand
-- CREATE TABLE user_by_country (
--     country text, id int,
--     firstname text, lastname text,
--     PRIMARY KEY(country, id));

-- After: a view maintained by Cassandra
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);
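Once the view exists, it can be read like an ordinary table, while writes still go to the base table only. The statements below are a minimal sketch of that usage; the literal values are illustrative assumptions, and the `user` table is assumed to contain the `firstname` and `lastname` columns selected by the view.

```cql
-- Writes target the base table; Cassandra propagates them to the view.
INSERT INTO user (id, country, firstname, lastname)
VALUES (1, 'FR', 'Jane', 'Doe');

-- Reads by country go to the view, using its own partition key.
SELECT id, firstname, lastname
FROM user_by_country
WHERE country = 'FR';
```

Views are read-only: an INSERT or UPDATE issued directly against `user_by_country` is rejected.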
Materialized Views Performance
• Write performance
• slower than normal write
• local lock + read-before-write cost (but paid only once for all views)
• for each base table update, worst case: mv_count × 2 (DELETE + INSERT) extra mutations for the views
• Write performance vs manual denormalization
• MV better because no client-server network traffic for read-before-write
• MV better because less network traffic for multiple views (client-side BATCH)
• Makes developers' lives easier → priceless
• Read performance vs secondary index
• MV better because single node read (secondary index can hit many nodes)
• MV better because single read path (secondary index = read index + read data)
Materialized Views Consistency
• Consistency level
• CL honoured for base table, ONE for MV + local batchlog
• Weaker consistency guarantees for MV than for base table
Rationale (UDF & UDA)
• Push computation server-side
• save network bandwidth (1000 nodes!)
• simplify client-side code
• provide standard & useful functions (sum, avg …)
• accelerate analytics use-case (pre-aggregation for Spark)
How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;
• param1, param2, …: parameter names to refer to in the code
• type1, type2, …: Cassandra types
• CALLED ON NULL INPUT: the function is always called, so a null check is mandatory in the code
• RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
Cassandra types:
• primitives (boolean, int, …)
• collections (list, set, map)
• tuples
• UDT
JVM supported languages:
• Java, Scala
• Javascript (slow)
• Groovy, Jython, JRuby
• Clojure (JSR 223 implementation issue)
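To make the syntax above concrete, here is a minimal sketch of a Java UDF declared with CALLED ON NULL INPUT, which is why it checks its arguments for null before using them. The function name `maxof` and the `scores` table are hypothetical, chosen only for illustration. On Cassandra 3.x, UDFs also have to be enabled in `cassandra.yaml` (`enable_user_defined_functions: true`) before this statement is accepted.

```cql
-- Hypothetical function: returns the larger of two ints, tolerating nulls.
CREATE OR REPLACE FUNCTION maxof(current int, candidate int)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java
AS $$
    // With CALLED ON NULL INPUT the arguments arrive as boxed Integers
    // and may be null, so the null check is our responsibility.
    if (current == null) return candidate;
    if (candidate == null) return current;
    return Math.max(current, candidate);
$$;

-- Hypothetical table, used only to show the call site.
CREATE TABLE IF NOT EXISTS scores (
    player text, game int, home int, away int,
    PRIMARY KEY (player, game));

SELECT game, maxof(home, away) FROM scores WHERE player = 'alice';
```

Declaring the same function with RETURNS NULL ON NULL INPUT would let the two null checks be dropped, at the cost of getting null back whenever either column is unset.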
User Defined Aggregate (UDA)
• Real use-case for UDF
• Aggregation server-side → huge network bandwidth saving
• Provide similar behavior for Group By, Sum, Avg etc …
How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
• (type1, type2, …): only types, no parameter names
• STYPE: state type
• INITCOND: initial state value
Accumulator function signature:
accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType
Accumulator function ≈ foldLeft function
Optional final function signature:
finalFunction(stateType)
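The pieces above (state type, accumulator, optional final function, initial condition) can be sketched with an average aggregate. This is an illustrative example under stated assumptions, not the speaker's own code: the names `avgState`, `avgFinal`, `average` and the `scores` table with its `points` column are hypothetical. The state is a tuple holding a running count and a running sum, folded over every row and turned into a double at the end.

```cql
-- Accumulator: (state, value) -> state, the foldLeft step.
CREATE OR REPLACE FUNCTION avgState(state tuple<int, bigint>, val int)
CALLED ON NULL INPUT
RETURNS tuple<int, bigint>
LANGUAGE java
AS $$
    if (val != null) {
        state.setInt(0, state.getInt(0) + 1);                // row count
        state.setLong(1, state.getLong(1) + val.intValue()); // running sum
    }
    return state;
$$;

-- Final function: state -> result.
CREATE OR REPLACE FUNCTION avgFinal(state tuple<int, bigint>)
CALLED ON NULL INPUT
RETURNS double
LANGUAGE java
AS $$
    if (state.getInt(0) == 0) return null;                   // no rows seen
    double sum = state.getLong(1);
    return Double.valueOf(sum / state.getInt(0));
$$;

-- The aggregate lists argument types only; names live in the functions.
CREATE OR REPLACE AGGREGATE average(int)
SFUNC avgState
STYPE tuple<int, bigint>
FINALFUNC avgFinal
INITCOND (0, 0);

-- Hypothetical single-partition usage.
SELECT average(points) FROM scores WHERE player = 'alice';
```

Restricting the query to one partition keeps the aggregation on a single replica set, which matters because the coordinator computes the aggregate by itself.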
Gotchas
• UDA in Cassandra is not distributed!
• Do not execute UDA on a large number of rows (10^6 for ex.)
• single fat partition
• multiple partitions
• full table scan
• → Increase client-side timeout
• default Java driver timeout = 12 secs
Cassandra UDA or Apache Spark?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely
Why JSON?
• JSON is a very good exchange format
• But a terrible schema …
• How to have best of both worlds?
• use Cassandra schema
• convert rows to JSON format
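In CQL this translates to the `INSERT … JSON` and `SELECT JSON` forms: the table keeps its typed schema, and rows are converted to and from JSON at the edges. The sketch below reuses the `user` table from the Materialized Views slides; the `firstname` and `lastname` columns and all literal values are assumptions made for illustration.

```cql
-- Write a row from a JSON document; keys must match column names
-- and values are validated against the column types.
INSERT INTO user JSON
'{"id": 1, "country": "FR", "firstname": "Jane", "lastname": "Doe"}';

-- Read the row back as a single JSON document per row.
SELECT JSON id, country, firstname, lastname
FROM user
WHERE id = 1;
```

A key that does not correspond to a column, or a value of the wrong type, makes the INSERT fail, which is what "use Cassandra schema" buys over storing raw JSON text.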
SASI full text search index
Why SASI?
• Searching (and full text search) was always a pain point for Cassandra
• limited search predicates (=, <=, <, > and >= only)
• limited scope (only on primary key columns)
• Existing secondary index performance is poor
• inverted index
• uses Cassandra itself as index storage …
• limited predicate (=). Inequality predicate = full cluster scan 😱
How is it implemented?
• New index structure = suffix trees
• Extended predicates (=, inequalities, LIKE %)
• Full text search (tokenizers, stop-words, stemming …)
• Query Planner to optimize AND predicates
• NO, we don’t use Apache Lucene
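From the user's side, a SASI index is declared as a custom index and then queried with LIKE. The following is a minimal sketch, assuming the `user` table has a `lastname text` column; the index name and search strings are hypothetical. The `CONTAINS` mode is what allows a leading `%`, and the non-tokenizing analyzer is used here only to make matching case-insensitive.

```cql
CREATE CUSTOM INDEX user_lastname_sasi_idx ON user (lastname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

-- Prefix search.
SELECT * FROM user WHERE lastname LIKE 'do%';

-- Substring search, available because the index mode is CONTAINS.
SELECT * FROM user WHERE lastname LIKE '%oa%';
```

With the default `PREFIX` mode the index is smaller, but only the first of the two queries would be accepted.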
Who made it?
• Open source contribution by a team of engineers from …
When is it available?
• Right now with Cassandra ≥ 3.5
• available in Cassandra 3.4 but critical bugs
• Later improvement
• index on collections (List, Set & Map)
• OR clause (WHERE (xxx OR yyy) AND zzz)
• != operator
SASI vs Search Engine
SASI vs Solr/ElasticSearch/Datastax Enterprise Search?
• Cassandra is not a search engine! (database = durability)
• always slower because 2 passes (SASI index read + original Cassandra data)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) → Apache Spark for analytics