Riak Search is a distributed data indexing and search platform built on top of Riak. The talk will introduce Riak Search, covering overall goals, architecture, and core functionality, with specific focus on how Erlang is used to manage and execute an ever-changing population of ad hoc query processes.
19. ...shard to add more storage capacity...
Your
Application
Lucene Lucene Lucene Riak
19
20. ...replicate to add more throughput.
Lucene Lucene Lucene
Your
Application
Lucene Lucene Lucene Riak
Lucene Lucene Lucene
20
21. ...replicate to add more throughput.
Lucene Lucene Lucene
Your
Application
Lucene Lucene Lucene Riak
Lucene Lucene Lucene
Operations nightmare!
21
22. What do we really want?
Your Riak-ified
Riak
Application Lucene
22
23. What do we really want?
Your Riak
Riak
Application Search
23
24. Functionality? Be like Lucene (and more).
• Lucene Syntax
• Leverages Java Lucene Analyzers
• Solr Endpoints
• Integration via Riak Post-Commit Hook (Index)
• Integration via Riak Map/Reduce (Query)
• Near-Realtime
• Schema-less
24
25. Operations? Be like Riak.
• No special nodes
• Add nodes, get more compute and storage
• Automatically load balance
• Replicas for durability and performance
• Index and query in parallel
• Swappable storage backends
25
28. The Inverted Index
Document Inverted Index
day, 1
dog, 1
#1
Every dog has
his day.
every, 1
has, 1
his, 1
28
29. The Inverted Index
Documents Combined Inverted Index
and, 4
#1 Every dog has
his day.
bag, 3
bark, 2
bite, 2
The dog's bark
#2
cat, 3
is worse than
his bite. cat, 4
day, 1
dog, 1
#3
Let the cat out
of the bag. dog, 2
dog, 4
every, 1
#4 It's raining
cats and dogs. has, 1
...
29
46. Tradeoffs...
Document Partitioning Term Partitioning
+ Lower Latency Queries - Higher Latency Queries
- Lower Throughput + Higher Throughput
- Lots of Disk Seeks - Hotspots in Ring
(the "Obama" problem)
46
47. Riak Search: Term Partitioning
Term-partitioning is the most viable approach for our beta
clients’ needs: high throughput on Really Big Datasets.
Optimizations:
• Term splitting to reduce hot spots
• Bloom filters & caching to save query-time bandwidth
• Batching to save query-time & index-time bandwidth
Support for either approach eventually.
47
72. The Message Format
Message ::
{results, [Result]} |
{results, disconnect}
Result ::
{DocID, Properties}
DocID ::
term()
Properties ::
proplist()
72
73. The Message Format
{results, [
{375, []},
{961, [{color, "red"}]},
{155, [{pos, [1,2,5]}]}
]}
73
74. Yay for Erlang!
• Clean lines between load balancing and logic, single-
and multi-node look the same
• Easy to create new operators, rapid development of
experimental features
• Linked processes make cleanup a breeze
• Significant code reduction over early Java prototypes
74
80. Thanks! Questions?
Search Team:
John Muellerleile - @jrecursive
Rusty Klophaus - @rklophaus
Kevin Smith - @kevsmith
Currently working with a small set of Beta users.
Open-source release planned for Q3.
www.basho.com