This document discusses advanced search and top-K queries in Cassandra. It proposes integrating Lucene indexes with each Cassandra node to enable more expressive queries like range queries, multi-variable searches, and top-K queries. The integration would allow each node to index its own data with Lucene while supporting distributed queries. The document also describes how Stratio's tools like Deep and Crossdata can help integrate Lucene indexes with Spark for large-scale querying and analytics across Cassandra and other data stores.
Advanced search and Top-k queries in Cassandra - Cassandra Summit Europe 2014
1. Advanced search and Top-K queries in Cassandra
Daniel Higuero
dhiguero@stratio.com
@dhiguero
Andrés de la Peña
andres@stratio.com
@a_de_la_pena
2. Who are we?
• Stratio is a Big Data Company
• Founded in 2013
• Commercially launched in 2014
• 70+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
#CassandraSummit 2014
3. CONTENTS
• Part 1: Cassandra query methods
• Part 2: Stratio Lucene based 2i implementation
• Part 3: Integrating Lucene 2i with Apache Spark
5. Primary key queries
• O(1) node lookup for partition key
• Range slices for clustering key
• Usually requires denormalization
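The two access paths above can be sketched in a few lines. This is only a toy model under stated assumptions: the three-node ring, the MD5-based `token` function, and the sample rows are hypothetical stand-ins, not Cassandra's actual partitioner or storage format.

```python
import bisect
import hashlib

# Hypothetical ring: (highest token owned, node name), sorted by token.
TOKEN_SPACE = 2**127
RING = [
    (TOKEN_SPACE // 3, "Node 1"),
    (2 * TOKEN_SPACE // 3, "Node 2"),
    (TOKEN_SPACE - 1, "Node 3"),
]
RING_TOKENS = [t for t, _ in RING]


def token(partition_key: str) -> int:
    """Toy hash standing in for the cluster's partitioner (assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def node_for(partition_key: str) -> str:
    """Locate the owning node; the cost depends on ring size, not data size."""
    index = bisect.bisect_left(RING_TOKENS, token(partition_key))
    return RING[index % len(RING)][1]


# One partition: rows kept sorted by clustering key (made-up sample data).
PARTITION = [
    ("2014-04-06", "row a"),
    ("2014-04-07", "row b"),
    ("2014-04-08", "row c"),
    ("2014-04-10", "row d"),
    ("2014-04-11", "row e"),
]


def range_slice(rows, start, end):
    """Return the rows whose clustering key lies in [start, end]."""
    keys = [k for k, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return rows[lo:hi]


print(node_for("apena"))
print(range_slice(PARTITION, "2014-04-07", "2014-04-10"))
```

Because rows inside a partition are sorted by clustering key, a range slice is a contiguous read, which is why queries on other columns usually need a denormalized table.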
[Diagram: CLIENT sends a partition key and a clustering key range to a ring of three nodes (Node 1, Node 2, Node 3). Example partitions aagea, apena, and dhiguero hold cells keyed by date, such as 2014-04-06:body through 2014-04-11:body.]
7. Secondary index queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Diagram: CLIENT and three C* nodes, each with its own 2i local column family.]
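A minimal sketch of the idea, assuming a toy `Node` class (hypothetical, not Cassandra code): each node keeps an inverted index over one column of its own rows, so a query on that column has to be sent to every node.

```python
from collections import defaultdict


class Node:
    """Toy node holding its own rows plus a local inverted index on one column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # primary key -> row
        self.index = defaultdict(set)  # column value -> primary keys

    def insert(self, key, row):
        # Simplification: overwriting an existing key is not handled here.
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def lookup(self, value):
        """Answer from the local index only; other nodes' data is not visible."""
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query_all_nodes(nodes, value):
    """The coordinator cannot know which nodes hold matches, so it asks all."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [Node("username") for _ in range(3)]
nodes[0].insert(1, {"id": 1, "username": "apena"})
nodes[1].insert(2, {"id": 2, "username": "dhiguero"})
nodes[2].insert(3, {"id": 3, "username": "apena"})
print(query_all_nodes(nodes, "apena"))
```

The index is keyed on a single column, which mirrors the single-column limitation listed above.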
9. Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Diagram: CLIENT, Spark master, and three C* nodes; query = function(all data).]
10. Combining 2i with MapReduce
• Expressiveness avoiding full scans
• Still limited by one indexed column per query
[Diagram: CLIENT, Spark master, and three C* nodes, each with a secondary index.]
11. What do we miss from 2i indexes?
MORE EXPRESSIVENESS
• Range queries
• Multivariable search
• Full text search
• Sorting by fields
• Top-k queries
12. What do we like from the existing 2i?
ITS ARCHITECTURE
• Each node indexes its own data
• The index implementations do not need to be distributed
• Natural extension point
• Can be created after design and ingestion
13. Thinking of a custom secondary index implementation…
WHY NOT USE LUCENE?
14. Why we like Lucene
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Can be fully embedded in application code
• Published under the Apache License
16. Create index
• Built in the background at any moment
• Real-time updates
• Mapping eases ETL
• Language aware
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id) );
ALTER TABLE tweets ADD lucene TEXT;
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{ default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message : {type : "text", analyzer : "EnglishAnalyzer"},
            userid : {type : "string"},
            username : {type : "string"}
        }} '};
17. Filtering query
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match",
              field : "text",
              value : "cassandra"}
}' LIMIT 10;
[Diagram: CLIENT and three C* nodes, each with a Lucene index. Labels: search 10; found 6; found 4; "We are done!"]
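One way to read the diagram labels (search 10, found 6, found 4, done) is that a filtering query can stop as soon as enough rows have been collected, without visiting the remaining nodes. The sketch below illustrates that reading only; `filtered_search` and the list-of-rows node model are hypothetical and not taken from the Stratio code.

```python
def filtered_search(nodes, predicate, limit):
    """Collect matching rows node by node, stopping once `limit` rows are found.

    `nodes` is a list of row lists, one per node (a toy model).
    Returns the rows found and the number of nodes actually visited.
    """
    results = []
    visited = 0
    for rows in nodes:
        remaining = limit - len(results)
        if remaining <= 0:
            break  # enough rows already: the other nodes are never asked
        visited += 1
        matches = [row for row in rows if predicate(row)]
        results.extend(matches[:remaining])
    return results, visited


# Made-up data: 6 matches on the first node, 4 on the second, 3 on the third.
node_a = [{"text": "cassandra tip %d" % i} for i in range(6)] + [{"text": "other"}]
node_b = [{"text": "cassandra note %d" % i} for i in range(4)]
node_c = [{"text": "cassandra extra %d" % i} for i in range(3)]

rows, visited = filtered_search(
    [node_a, node_b, node_c], lambda row: "cassandra" in row["text"], limit=10
)
print(len(rows), visited)
```

Since a filter does not rank results, any 10 matching rows satisfy the query, which is what makes the early stop valid.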
18. Top-k query
SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match",
             field : "text",
             value : "cassandra"}
}' LIMIT 5;
[Diagram: CLIENT sends "Search top-5" to three C* nodes, each with a Lucene index. Labels: Found 5; Found 4; Found 5; Merge 14 to best 5.]
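Unlike a filter, a relevance-ranked query cannot stop early: every node has to return its local best k, and the coordinator then merges the candidates. The following sketch shows that merge step with made-up scores; `top_k` is a hypothetical helper, not part of Cassandra or the Stratio index.

```python
import heapq


def top_k(nodes, score, k):
    """Ask every node for its local top-k, then merge to the global top-k.

    The global top-k is always contained in the union of the local top-k
    lists, so each node needs to return at most k rows.
    """
    candidates = []
    for rows in nodes:
        candidates.extend(heapq.nlargest(k, rows, key=score))
    return heapq.nlargest(k, candidates, key=score)


# Made-up relevance scores: the three nodes find 5, 4, and 5 matches.
nodes = [
    [0.9, 0.1, 0.5, 0.3, 0.7],
    [0.8, 0.2, 0.6, 0.4],
    [0.95, 0.05, 0.55, 0.35, 0.75],
]
print(top_k(nodes, score=lambda s: s, k=5))  # [0.95, 0.9, 0.8, 0.75, 0.7]
```

Here 14 candidates are reduced to the best 5, matching the numbers on the slide.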
19. Modifying Cassandra for generic top-k queries
Two new methods in SecondaryIndexSearcher:
boolean requiresFullScan(List<IndexExpression> clause);
List<Row> sort(List<IndexExpression> clause, List<Row> rows);
Two new methods in AbstractRangeCommand:
boolean requiresFullScan();
List<Row> combine(List<Row> rows);
And some changes in StorageProxy#getRangeSlice…
20. Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter :
    {
        type : "boolean", must :
        [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should :
                [
                    {type : "prefix", field : "user", value : "a"},
                    {type : "wildcard", field : "user", value : "*b*"},
                    {type : "match", field : "user", value : "fast"}
                ]
            }
        ]
    },
    sort :
    {
        fields : [ {field : "time", reverse : true},
                   {field : "user", reverse : false} ]
    }
}' LIMIT 10000;
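To make the nesting easier to follow, here is a rough in-memory reading of how such a filter composes. It is not how Lucene or the Stratio index evaluates queries. The sketch assumes that `must` means all conditions hold, that a `should` list needs at least one match, that the range bound is inclusive, and that `match` on a string field is plain equality; `matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches(condition, row):
    """Evaluate one condition from the filter tree against a row (a dict)."""
    kind = condition["type"]
    if kind == "boolean":
        must = condition.get("must", [])
        should = condition.get("should", [])
        if not all(matches(c, row) for c in must):
            return False
        # Assumption: a non-empty should list needs at least one match.
        return not should or any(matches(c, row) for c in should)
    value = row[condition["field"]]
    if kind == "range":
        # Assumption: inclusive lower bound; yyyy/MM/dd strings sort by date.
        return value >= condition["lower"]
    if kind == "prefix":
        return value.startswith(condition["value"])
    if kind == "wildcard":
        return fnmatchcase(value, condition["value"])
    if kind == "match":
        return value == condition["value"]
    raise ValueError("unsupported condition type: " + kind)


# The filter part of the query above, written as a Python dict.
FILTER = {
    "type": "boolean",
    "must": [
        {"type": "range", "field": "time", "lower": "2014/04/25"},
        {"type": "boolean", "should": [
            {"type": "prefix", "field": "user", "value": "a"},
            {"type": "wildcard", "field": "user", "value": "*b*"},
            {"type": "match", "field": "user", "value": "fast"},
        ]},
    ],
}

print(matches(FILTER, {"time": "2014/04/26", "user": "alice"}))
print(matches(FILTER, {"time": "2014/04/20", "user": "alice"}))
```

In words: a row passes if it is dated 2014/04/25 or later and its user starts with "a", contains "b", or equals "fast".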
21. Some implementation details
• A Lucene document per CQL row, and a Lucene field per indexed column
• SortingMergePolicy keeps index sorted in the same way that C* does
• Index commits synchronized with column family flushes
• Segment merges synchronized with column family compactions
NO MAINTENANCE REQUIRED
26. Integrating Lucene & Spark
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
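A sketch of how a MapReduce-style job might cut the token space into splits and attach one token restriction per split. The helper names are hypothetical, and the clause applies TOKEN to the partition key `userid` only, as the CQL token function expects; this is not the Stratio Deep implementation.

```python
def token_splits(min_token, max_token, num_splits):
    """Cut (min_token, max_token] into contiguous (start, end] ranges."""
    if num_splits < 1 or max_token <= min_token:
        raise ValueError("need num_splits >= 1 and max_token > min_token")
    width = max_token - min_token
    bounds = [min_token + width * i // num_splits for i in range(num_splits + 1)]
    # Drop empty ranges, which appear when num_splits exceeds the width.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def token_clause(start, end):
    """Build the token restriction appended to each split's query."""
    return "TOKEN(userid) > %d AND TOKEN(userid) <= %d" % (start, end)


for start, end in token_splits(0, 100, 4):
    print(token_clause(start, end))
```

Each split can then be processed by a separate task, and the half-open ranges guarantee that no row is read twice.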
27. Integrating Lucene & Spark
Paging friendly: it supports starting queries at a certain point
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
28. Integrating Lucene & Spark
• Compute large amounts of data
• Avoid systematic full scans
• Reduce the amount of data to be processed
• Filtering push-down
[Diagram: CLIENT, Spark master, and three C* nodes, each with a Lucene index.]
33. Stratio Deep
INTEGRATING SPARK WITH DIFFERENT DATASTORES
• Common Cell abstraction in the RDD
• Maintain compatibility with Spark operations
• Compatible with multiple datastore technologies
• DeepSparkContext
• DeepJobConfig
• Compatible with Lucene indexes
34. Stratio Crossdata
UNIFYING BATCH AND STREAMING QUERIES
• Single SQL-like language
• Compatible with multiple datastore technologies
• Connector-based architecture
• Ability to combine data from different datastores
• Complement non-native operations with Spark
• E.g., JOIN in Cassandra
• Custom support for Lucene-based secondary indexes
36. Conclusions
• Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
• Top-k query support
- Lucene scoring formula
- Sort by field values
• Compatible with MapReduce frameworks
• Preserves Cassandra’s functionality
37. It's open source
github.com/stratio/stratio-cassandra
• Published as a fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0