Apache Cassandra is a scalable database with high availability features. But they come with severe limitations in term of querying capabilities. Since the introduction of SASI in Cassandra 3.4, the limitations belong to the pass. Now you can create performant indices on your columns as well as benefit from full text search capabilities with the introduction of the new LIKE %term% syntax. To illustrate how SASI works, we'll use a database of 100 000 albums and artists.
Gen AI in Business - Global Trends Report 2024.pdf
SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Milan 2016
1. @doanduyhai
SASI, Cassandra on the full text search ride
DuyHai DOAN
Apache Cassandra™ Evangelist
Apache Zeppelin™ Commiter
MILAN 25-26 NOVEMBER 2016
2. @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 400+ employees
• Headquarter in San Francisco Bay area
• EU headquarter in London, offices in France and Germany
• Datastax Enterprise = better Apache Cassandra + extra features
2
3. @doanduyhai
1 5 minutes introduction to Apache Cassandra™
2 SASI introduction
3 SASI cluster-wide
4 SASI local read/write path
5 Query planner
6 Take away
3
13. @doanduyhai
What is SASI ?
13
• SSTable-Attached Secondary Index à new 2nd index impl that follows
SSTable life-cycle
• Objective: provide more performant & capable 2nd index
15. @doanduyhai
Why is it better than native 2nd index ?
15
• follow SSTable life-cycle (flush, compaction, rebuild …) à more optimized
• new data-strutures
• range query (<, ≤, >, ≥) possible
• full text search options
18. @doanduyhai
Distributed index
18
On cluster level, SASI works exactly like native 2nd index
H
A
E
D
B C
G F
UK user1 user102 … user493
US user54 user483 … user938
UK user87 user176 … user987
UK user17 user409 … user787
24. @doanduyhai
Caveat 1: non restrictive filters
24
H
A
E
D
B C
G F
coordinator
Hit all
nodes
eventually
L
25. @doanduyhai
Caveat 1 solution : always use LIMIT
25
H
A
E
D
B C
G F
coordinator
SELECT *
FROM …
WHERE ...
LIMIT 1000
26. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
26
H
A
E
D
B C
G F
coordinator
Not found WHERE user_email = ‘xxx'
27. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
27
H
A
E
D
B C
G F
coordinator
Still no result
WHERE user_email = ‘xxx'
28. @doanduyhai
Caveat 2: 1-to-1 index (user_email)
28
H
A
E
D
B C
G F
coordinator
At best 1 user found
At worst 0 user found
WHERE user_email = ‘xxx'
29. @doanduyhai
Caveat 2 solution: materialized views
29
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL and user_email IS NOT NULL
PRIMARY KEY (user_email, user_id)
But range queries ( <, >, ≤, ≥) not possible …
33. @doanduyhai
SASI Life-cycle: in-memory
33
Commit log1
. . .
1
Commit log2
Commit logn
Memory
. . .
MemTable
Table1
MemTable
Table2
MemTable
TableN
2
Index
MemTable1
Index
MemTable2
. . .
Index
MemTableN
3
ACK the client
34. @doanduyhai
Local write path data structures
34
Index mode, data type Data structure Usage
PREFIX, text Guava ConcurrentRadixTree name LIKE 'John%'
CONTAINS, text Guava ConcurrentSuffixTree
name LIKE ’%John%'
name LIKE ’%ny’
PREFIX, other JDK ConcurrentSkipListSet
age = 20
age >= 20 AND age <= 30
SPARSE, other JDK ConcurrentSkipListSet
age = 20
age >= 20 AND age <= 30
suitable for 1-to-N index with N ≤ 5
37. @doanduyhai
Local read path
37
• first, optimize query using Query Planer (see later)
• then load chunks (4k) of index files from disk into memory
• perform binary search to find the indexed value(s)
• retrieve the offset of the partition from the original SSTable
• read the original data
46. @doanduyhai
SASI vs search engines
SASI vs Solr/ElasticSearch ?
• Cassandra is not a search engine !!! (database = durability)
• always slower because 2 passes (SASI index read + original Cassandra data)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) à Apache Spark for analytics
If you don’t need the above features, SASI is for you!
46
47. @doanduyhai
SASI sweet spots
SASI is a relevant choice if
• you need multi criteria search and you don't need ordering/grouping/scoring
• you mostly need 100 to 10000 of rows for your search queries
• you always know the partition keys of the rows to be searched for (this one applies to
native secondary index too)
47
48. @doanduyhai
SASI blind spots
SASI is a poor choice if
• you have strong SLA on search latency, for example few millisecs requirement
• ordering of the search results is important for you
48