SOLR SIDE-CAR INDEX
Andrzej Bialecki. LucidWorks
ab@lucidworks.com
About the speaker
• Started using Lucene in 2003 (1.2-dev…)
• Created Luke – the Lucene Index Toolbox
• Apache Nutch, Hadoop, Solr committer, Lucene PMC member
• LucidWorks engineer
Agenda
• Challenge: incremental document updates
• Existing solutions and workarounds
• Sidecar index strategy and components
• Scalability and performance
• Q&A
Challenge: incremental document updates
• Incremental update (field-level update): modification of a part of a document
• Sounds like fundamentally useful functionality!
• But Lucene / Solr doesn’t offer true field-level updates (yet!)
– “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document”
– “Atomic update” functionality in Solr is (useful) syntactic sugar
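
The "retrieve, update, add, delete" sequence above can be sketched with a toy append-only index. This is a minimal illustration in plain Python, not Lucene or Solr code: the `ToyIndex` class and its method names are hypothetical, and it assumes every field is stored so the old document can be read back in full.

```python
class ToyIndex:
    """Hypothetical append-only index: documents are never modified in place."""

    def __init__(self):
        self.docs = []        # internal docId -> stored fields
        self.deleted = set()  # internal docIds marked as deleted
        self.by_key = {}      # primary key -> live internal docId

    def add(self, doc):
        key = doc["id"]
        if key in self.by_key:
            # The previous version is only marked as deleted, not rewritten.
            self.deleted.add(self.by_key[key])
        self.docs.append(dict(doc))
        self.by_key[key] = len(self.docs) - 1

    def get(self, key):
        return dict(self.docs[self.by_key[key]])

    def update_fields(self, key, changes):
        doc = self.get(key)   # 1. retrieve old document
        doc.update(changes)   # 2. update fields
        self.add(doc)         # 3. add updated document, 4. delete old one
```

Even a one-field change writes a complete new document and leaves a deleted one behind, which is why such updates cost about as much as re-adding the document.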
Common use cases for field updates
• Documents composed logically of two parts with different update schedules
– E.g. mostly static documents with some quickly changing fields
• Two different classes of data in changing fields
– Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
– Text fields: e.g. reviews, tags, click-through feedback, user profiles
• Challenge: how to integrate these modifications with the main index content?
– Re-indexing whole documents isn’t always an option
True full-text (inverted fields) incremental updates
• Very complex issue, broad impact on many Lucene internals
– Inverted index structure is not optimized for partial document updates
– At least another 6-12 months away?
• LUCENE-4258 – work in progress
Handling updates via full re-index
• If the corpus is small, or incremental updates infrequent… just re-index everything!
• Pros:
– Relatively easy to implement – update source documents and re-index
– Allows adding all types of data, including e.g. labels as searchable text
• Cons:
– Infeasible for larger corpora or frequent updates, time-wise and cost-wise
– Requires keeping around the source documents
  • Sometimes inconvenient, when documents are assembled in a complex pipeline
Handling updates via Solr’s ExternalFileField
• Pros:
– Simple to implement
– Updates are easy – just file edits, no need to re-index
• Cons:
– Only docId => field : number
– Not suitable for full-text searchable field updates
  • E.g. can’t support user-generated labels attached to a doc
– Still useful if a simple “popularity”-type metric is sufficient
• Internally implemented as an in-memory ValueSource usable by function queries

Example file contents:
doc0=1.5
doc1=2.5
doc2=0.5
…
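
The key=value file shown on this slide can be read into an in-memory map and consulted at scoring time. The sketch below is plain Python rather than Solr's ExternalFileField code; `parse_external_file` and `boosted_score` are hypothetical names, and the multiplicative boost is only an assumed example of how a function query might use the value.

```python
def parse_external_file(text):
    """Parse 'key=number' lines into a dict, skipping anything malformed."""
    values = {}
    for line in text.splitlines():
        key, sep, raw = line.strip().partition("=")
        if not sep or not key:
            continue
        try:
            values[key] = float(raw)
        except ValueError:
            continue  # not a number, ignore the line
    return values


def boosted_score(base_score, doc_key, values, default=0.0):
    """Assumed scoring formula: scale the base score by (1 + external value)."""
    return base_score * (1.0 + values.get(doc_key, default))
```

Updating a value means editing the file and reloading the map. No document is re-indexed, but nothing in the map is searchable as text.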
Numeric DocValues updates
• Since Lucene/Solr 4.6 … to be released Really Soon
• Details can be found in LUCENE-5189
• As simple as:

indexWriter.updateNumericDocValue(term, field, value)

• Neatly solves the problem of numeric updates: popularity, in-stock, etc.
• Some limitations:
– Massive updates still somewhat costly until the next merge (like deletes)
– Can only update existing fields
• Obviously doesn’t address the full-text inverted field updates
Lucene ParallelReader overview
• Pretends that two or more IndexReader-s are slices of the same index
– Slices contain data for different fields
– Both stored and inverted parts are supported
– Data for matching docs is joined on the fly
• Structure of all indexes MUST match 1:1 !!!
– The same number of segments
– The same count of docs per segment
– Internal document ID-s must match 1:1
– List of deletes is taken from the first index
• Sounds cool … but in practice it’s rarely used:
– It’s very difficult to meet these requirements
– This is even more difficult in the presence of index updates and merges

[Figure: a main IR and a parallel IR, each with three segments of 4, 2, and 1 documents. The main IR holds fields f1, f2, … and the parallel IR holds f3, f4, …; ParallelReader presents them as documents 0 to 6 with fields f1, f2, f3, f4…]
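
The 1:1 structural requirement and the on-the-fly join can be sketched with lists of dicts standing in for segments. This is an illustrative model only, not the Lucene ParallelReader API; `check_parallel` and `parallel_document` are hypothetical helpers.

```python
def check_parallel(main_segments, parallel_segments):
    """Raise ValueError unless both indexes have the same segment structure."""
    if len(main_segments) != len(parallel_segments):
        raise ValueError("number of segments differs")
    for i, (main, par) in enumerate(zip(main_segments, parallel_segments)):
        if len(main) != len(par):
            raise ValueError(f"segment {i}: document counts differ")


def parallel_document(main_segments, parallel_segments, doc_id):
    """Join the fields of both slices for one global internal docId."""
    check_parallel(main_segments, parallel_segments)
    local_id = doc_id
    for main, par in zip(main_segments, parallel_segments):
        if local_id < len(main):
            joined = dict(main[local_id])   # fields f1, f2, ...
            joined.update(par[local_id])    # fields f3, f4, ...
            return joined
        local_id -= len(main)               # move on to the next segment
    raise IndexError(doc_id)
```

The join relies purely on position, so a single extra or missing document in one slice silently pairs every later document with the wrong fields. That is why the structure check has to be strict.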
Handling updates via ParallelReader
• Pros:
– All types of data (e.g. searchable full-text labels) can be added
• Cons:
– Must ensure that the other index always matches the structure of the main index
– Complicated and fragile (rebuild on every update?)
– No tools to manage this parallel index in Solr

[Figure: ParallelReader over a main IR with segments of 4, 2, and 1 documents (fields f1, f2, …) and a parallel IR with segments of 2, 2, and 1 documents (fields f3, f4, …)]
Sidecar Index Components for Solr
• Uses the ParallelReader strategy for field updates
– “Main” and “sidecar” data comes from two different Solr collections
– “Sidecar” collection is updated independently from the main collection
– “Sidecar” collection is used as a source of document fields for building and updating a parallel index
• Integrates the management of ParallelReader (“sidecar index”) into Solr
– Initial creation of ParallelReader, including synchronization of internal ID-s
– Tracking of updates and IndexReader.reopen(…) events
• Partly based on a version of Click Framework in LucidWorks Search
• Available under Apache License here: http://github.com/LucidWorks/sidecar_index
“Main”, “sidecar” collections and parallel index
• “Main” collection contains only the parts of documents with “main” fields
• “Sidecar” collection is a source of documents with “sidecar” fields
• SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index)
• “Main” collection uses SidecarIndexReader that acts as ParallelReader
• Main index is updated as usual, via the “main” collection’s IndexWriter

[Figure: Solr hosting Main_collection and Sidecar_collection; Main_collection uses a SidecarIndexReader over the main index and the sidecar index]
Implementation details
• SidecarIndexReaderFactory extends Solr’s IndexReaderFactory
– newReader(Directory, SolrCore) – initial open
– newReader(IndexWriter, SolrCore) – NRT open
• SidecarIndexReader acts like a ParallelReader
– Solr wants DirectoryReader, but ParallelReader is not a DirectoryReader
– Basically had to re-implement the logic from ParallelReader
• ParallelReader challenges:
– How to synchronize internal ID-s?
– How to create segments that are of the same size as those of the main index?
– How to handle deleted documents?
– How to handle updates to the main index?
– How to handle updates to the sidecar data?
ParallelReader challenges and solutions
• How to synchronize internal ID-s?
– “Main” collection is traversed sequentially by internal docId
– Primary key is retrieved for each document
– Matching document is found in the “sidecar” collection
– Matching document is added to the “sidecar” index
• Very costly phase!
– Random seek and retrieval from “sidecar” collection
– Primary key lookup is fast
– … but stored field retrieval and indexing isn’t

[Figure: the main IR holds documents D, B, A, F (segment 0), C, G (segment 1), and E (segment 2). For each one, a primary-key query such as q=id:D is run against a collection that stores the documents in a different order (G, B, C, E, A, F, D), and the matching fields f3, f4, … are appended to the sidecar IR]
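
The synchronization pass described above amounts to one ordered traversal with a keyed lookup per document. The following is a minimal sketch in plain Python, assuming the primary keys of the main index are available in internal docId order and the sidecar collection can be modelled as a dict. `build_sidecar_rows` is a hypothetical name, not part of the sidecar_index code.

```python
def build_sidecar_rows(main_keys_in_docid_order, sidecar_by_key):
    """Return sidecar documents re-ordered to match the main index.

    Position i of the result corresponds to internal docId i of the main
    index, so a ParallelReader-style positional join lines up correctly.
    """
    rows = []
    for key in main_keys_in_docid_order:      # sequential traversal by docId
        sidecar_doc = sidecar_by_key[key]     # random lookup by primary key
        rows.append(dict(sidecar_doc))        # "index" it at the same position
    return rows
```

In the toy model the lookup is a cheap dict access. In the real system each lookup also involves stored-field retrieval and re-indexing of the sidecar document, which is where the slide places the cost.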
ParallelReader challenges and solutions
• Optimization 1: don’t rebuild data for unmodified segments
• Optimization 2 (cheating): ignore NRT segments
• How to handle deleted docs?
– Insert dummy (empty) documents so that the number and the order of documents still match

[Figure: main IR and sidecar IR with matching segments of 4, 2, and 2 documents. One document in the last segment of the main IR is deleted (X), and the sidecar IR holds a dummy document at that position. An additional NRT segment in the main IR has no counterpart in the sidecar IR]
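
The dummy-document trick can be sketched as follows. This is a toy model in plain Python, assuming each main-index segment is given as a list of (primary key, is_deleted) pairs; `align_segment` is a hypothetical helper rather than code from the sidecar_index project.

```python
def align_segment(main_segment, sidecar_by_key):
    """Build the sidecar rows for one main-index segment.

    Deleted documents still occupy an internal docId in the main segment,
    so an empty placeholder is emitted to keep later positions aligned.
    """
    rows = []
    for key, is_deleted in main_segment:
        if is_deleted:
            rows.append({})                         # dummy (empty) document
        else:
            rows.append(dict(sidecar_by_key[key]))  # real sidecar fields
    return rows
```

Skipping deleted documents instead would shift every later row by one position and break the positional join.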
Implementation: SidecarMergePolicy
• How to create segments that are of the same size as the “main” index?
• Carefully manage the “sidecar” index creation:
– IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
– Force flush when reaching the next target count of documents
– Merges are enforced using SidecarMergePolicy that tracks the sizes of the “main” index segments

[Figure: main IR and sidecar IR, each with segments of 4, 2, and 1 documents. SidecarMergePolicy target sizes: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc]
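
The "force flush when reaching the next target count" step can be illustrated by cutting an ordered stream of sidecar rows at the segment sizes of the main index. The sketch is plain Python with a hypothetical `flush_at_targets` function. It models only the flush boundaries, not the behaviour of a Lucene merge policy or scheduler.

```python
def flush_at_targets(rows, target_sizes):
    """Split rows into segments whose sizes mirror the main index segments."""
    if sum(target_sizes) != len(rows):
        raise ValueError("rows do not cover the target segments exactly")
    segments = []
    start = 0
    for size in target_sizes:
        segments.append(rows[start:start + size])  # "flush" one segment
        start += size
    return segments
```

With the target sizes from the figure (4, 2, and 1 documents), seven rows end up in three segments of exactly those sizes.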
Implementation: SidecarIndexReader
• Re-implements the logic of ParallelReader
– ParallelReader != DirectoryReader
• Exposes Directory of the “main” index for replication
– Replicas need the “sidecar” collection replica to rebuild the sidecar index locally
– If document routing and shard placement are the same then we don’t have to use distributed search – all data will be local
• Reopen(…) avoids rebuilding unmodified segments
• Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
– When there’s a major merge in the “main” index
– When “sidecar” data is updated
• Ref-counting of IndexReaders at different levels is very tricky!
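
The reopen decision can be sketched as a comparison of segment listings before and after the reopen. This is a simplified model in plain Python, assuming segments are identified by name and document count and that an unchanged name and count means an unmodified segment. `plan_reopen` is a hypothetical function, not the project's actual logic.

```python
def plan_reopen(old_segments, new_segments, sidecar_changed=False):
    """Decide which sidecar segments can be reused after a reopen.

    old_segments / new_segments map segment name -> document count for the
    main index before and after the reopen.
    """
    if sidecar_changed:
        # Updated sidecar data invalidates everything that was built from it.
        return {"reuse": [], "rebuild": sorted(new_segments)}
    reuse = sorted(name for name, count in new_segments.items()
                   if old_segments.get(name) == count)
    rebuild = sorted(name for name in new_segments if name not in reuse)
    return {"reuse": reuse, "rebuild": rebuild}
```

A small flush adds one new segment to rebuild, while a major merge replaces all segment names and therefore degenerates into a full rebuild.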
Example configuration in solrconfig.xml
<indexReaderFactory name="IndexReaderFactory"
    class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
  <str name="docIdField">id</str>
  <str name="sourceCollection">source</str>
  <bool name="enabled">true</bool>
</indexReaderFactory>
Example use case: integration of click-through data
• Raw click-through data:
– Query, query_time, docId, click_time [, user]
• Aggregated click-through data:
– User-generated popularity score: F(number and timing of clicks per docId)
  • Numeric updates
– User-generated labels: F(top-N queries that led to clicks on docId)
  • Full-text searchable updates
– User profiles: F(top-N queries per user, top-N docId-s clicked, etc)
– …
• Queries can now be expanded to score based on TF/IDF in user-generated labels
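
The aggregation step can be sketched as a single pass over raw click records. The example below is plain Python with a hypothetical `aggregate_clicks` function. As a simplifying assumption it uses the raw click count as the popularity score and ignores click timing, whereas the slide's F(...) also takes timing into account.

```python
from collections import Counter, defaultdict


def aggregate_clicks(clicks, top_n=2):
    """Aggregate (query, query_time, doc_id, click_time) records per document.

    Returns doc_id -> {"popularity": click count,
                       "labels": top-N queries that led to clicks}.
    """
    popularity = Counter()
    queries = defaultdict(Counter)
    for query, _query_time, doc_id, _click_time in clicks:
        popularity[doc_id] += 1
        queries[doc_id][query] += 1
    return {
        doc_id: {
            "popularity": popularity[doc_id],
            "labels": [q for q, _ in queries[doc_id].most_common(top_n)],
        }
        for doc_id in popularity
    }
```

The "popularity" value is the kind of numeric update discussed earlier, while "labels" is text that only becomes searchable once it is indexed as a sidecar field.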
Scalability and performance
• Initial full rebuild is very costly
– ~0.6 ms / document
– 1 million docs = 600 sec = 10 min
– Not even close to “real time” …
• Cost related to new segments in “main” index depends on the size of segments
• Major merge events will trigger full rebuild
• BUT: search-time cost is negligible
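
The rebuild-time figure above is a simple linear estimate, which can be written down as a small helper for other corpus sizes. `full_rebuild_seconds` is a hypothetical name. The 0.6 ms per document default is the number quoted on the slide, and the assumption that cost scales linearly with document count is mine.

```python
def full_rebuild_seconds(num_docs, ms_per_doc=0.6):
    """Estimate full sidecar rebuild time, assuming a constant per-doc cost."""
    return num_docs * ms_per_doc / 1000.0


if __name__ == "__main__":
    seconds = full_rebuild_seconds(1_000_000)
    print(f"{seconds:.0f} s, about {seconds / 60:.0f} min")
```

Under the same assumption, a corpus ten times larger would need roughly 100 minutes per full rebuild, which is why the approach targets infrequent bulk updates.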
Caveats
• Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
– The sidecar code is still unstable and occasionally explodes
• Performance of full rebuild quickly becomes the bottleneck on frequent updates
– So the main use case is massive but infrequent updates of “sidecar” data
• Code: http://github.com/LucidWorks/sidecar_index
• Fixes and contributions are welcome – the code is Apache licensed
QA

Andrzej Bialecki
ab@lucidworks.com

More Related Content

What's hot

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Lucidworks
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Introduction Apache Solr & PHP
Introduction Apache Solr & PHPIntroduction Apache Solr & PHP
Introduction Apache Solr & PHPHiraq Citra M
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Erik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-WebinarEdureka!
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksErik Hatcher
 
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLucidworks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solrKnoldus Inc.
 

What's hot (20)

Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Introduction Apache Solr & PHP
Introduction Apache Solr & PHPIntroduction Apache Solr & PHP
Introduction Apache Solr & PHP
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
Solr Indexing and Analysis Tricks
Solr Indexing and Analysis TricksSolr Indexing and Analysis Tricks
Solr Indexing and Analysis Tricks
 
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, LucidworksLifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
Lifecycle of a Solr Search Request - Chris "Hoss" Hostetter, Lucidworks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engine
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
 
Introduction to Apache solr
Introduction to Apache solrIntroduction to Apache solr
Introduction to Apache solr
 

Similar to Andrzej bialecki lr-2013-dublin

Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmaplucenerevolution
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road maplucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookDatabricks
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4Gerke Max Preussner
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxsmile790243
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
 
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBayStoring eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBayMongoDB
 
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...MongoDB
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 

Similar to Andrzej bialecki lr-2013-dublin (20)

Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote   Yonik Seeley & Steve Rowe lucene solr roadmapKeynote   Yonik Seeley & Steve Rowe lucene solr roadmap
Keynote Yonik Seeley & Steve Rowe lucene solr roadmap
 
KEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road mapKEYNOTE: Lucene / Solr road map
KEYNOTE: Lucene / Solr road map
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4
West Coast DevCon 2014: Engine Overview - A Programmers Glimpse at UE4
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
wk 4 -- linking.ppt
wk 4 -- linking.pptwk 4 -- linking.ppt
wk 4 -- linking.ppt
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
XPages Performance Master Class - Survive in the fast lane on the Autobahn (E...
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBayStoring eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay
 
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB  present...
MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB present...
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Andrzej bialecki lr-2013-dublin

  • 1.
  • 2. SOLR SIDE-CAR INDEX Andrzej Bialecki. LucidWorks ab@lucidworks.com
  • 3. About the speaker
      • Started using Lucene in 2003 (1.2-dev…)
      • Created Luke – the Lucene Index Toolbox
      • Apache Nutch, Hadoop, Solr committer, Lucene PMC member
      • LucidWorks engineer
  • 4. Agenda
      • Challenge: incremental document updates
      • Existing solutions and workarounds
      • Sidecar index strategy and components
      • Scalability and performance
      • Q&A
  • 5. Challenge: incremental document updates
      • Incremental update (field-level update): modification of a part of a document
      • Sounds like fundamentally useful functionality!
      • But Lucene / Solr doesn’t offer true field-level updates (yet!)
          – “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document”
          – “Atomic update” functionality in Solr is (useful) syntactic sugar
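The four-step "update" sequence on this slide can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `ToyIndex` and its methods are hypothetical names, not Lucene or Solr classes, and a document is modelled as a plain field map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy store, not a Lucene class. The list position plays the
// role of the internal docId; a null slot marks a deleted document.
final class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Appends a document and returns its internal docId.
    int add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
        return docs.size() - 1;
    }

    // "Updates" one field by re-adding the whole document.
    int update(int docId, String field, String value) {
        Map<String, String> old = docs.get(docId);        // 1. retrieve old document
        if (old == null) {
            throw new IllegalArgumentException("document " + docId + " is deleted");
        }
        Map<String, String> updated = new HashMap<>(old);
        updated.put(field, value);                        // 2. update fields
        int newId = add(updated);                         // 3. add updated document
        docs.set(docId, null);                            // 4. delete old document
        return newId;
    }

    // Returns the document stored under docId, or null if it was deleted.
    Map<String, String> get(int docId) {
        return docs.get(docId);
    }
}
```

Note that the updated document ends up under a new internal ID, which is one reason why keeping a second index aligned with the main one is hard.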
  • 6. Common use cases for field updates
      • Documents composed logically of two parts with different update schedules
          – E.g. mostly static documents with some quickly changing fields
      • Two different classes of data in changing fields
          – Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
          – Text fields: e.g. reviews, tags, click-through feedback, user profiles
      • Challenge: how to integrate these modifications with the main index content?
          – Re-indexing whole documents isn’t always an option
  • 7. True full-text (inverted fields) incremental updates
      • Very complex issue, broad impact on many Lucene internals
          – Inverted index structure is not optimized for partial document updates
          – At least another 6-12 months away?
      • LUCENE-4258 – work in progress
  • 8. Handling updates via full re-index
      • If the corpus is small, or incremental updates infrequent… just re-index everything!
      • Pros:
          – Relatively easy to implement – update source documents and re-index
          – Allows adding all types of data, including e.g. labels as searchable text
      • Cons:
          – Infeasible for larger corpora or frequent updates, time-wise and cost-wise
          – Requires keeping around the source documents
              • Sometimes inconvenient, when documents are assembled in a complex pipeline
  • 9. Handling updates via Solr’s ExternalFileField
      • Pros:
          – Simple to implement
          – Updates are easy – just file edits, no need to re-index
      • Cons:
          – Only maps docId => field : number
          – Not suitable for full-text searchable field updates
              • E.g. can’t support user-generated labels attached to a doc
          – Still useful if a simple “popularity”-type metric is sufficient
      • Internally implemented as an in-memory ValueSource usable by function queries
      • Example file contents:
          doc0=1.5
          doc1=2.5
          doc2=0.5
          …
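As a rough illustration of the in-memory lookup idea, the sketch below parses `key=value` lines like the ones on the slide into a map and falls back to a default for missing keys. `ExternalScores` and its methods are hypothetical names chosen for this example; this is not Solr's ExternalFileField code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Solr's ExternalFileField implementation.
final class ExternalScores {
    // Parses lines of the form "key=value" into an in-memory lookup table.
    // Lines without '=' or without a key are skipped; a non-numeric value
    // makes Float.parseFloat throw NumberFormatException.
    static Map<String, Float> parse(List<String> lines) {
        Map<String, Float> scores = new HashMap<>();
        for (String line : lines) {
            int eq = line.lastIndexOf('=');
            if (eq <= 0) {
                continue;
            }
            String key = line.substring(0, eq).trim();
            float value = Float.parseFloat(line.substring(eq + 1).trim());
            scores.put(key, value);
        }
        return scores;
    }

    // Returns the external value for a key, or defVal when there is no entry.
    static float valueOf(Map<String, Float> scores, String key, float defVal) {
        return scores.getOrDefault(key, defVal);
    }
}
```

Updating a score then amounts to editing the file and re-parsing it, which matches the "just file edits" point above, and also shows why only one number per document fits this model.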
  • 10. Numeric DocValues updates
      • Since Lucene/Solr 4.6 … to be released Really Soon
      • Details can be found in LUCENE-5189
      • As simple as: indexWriter.updateNumericDocValue(term, field, value)
      • Neatly solves the problem of numeric updates: popularity, in-stock, etc.
      • Some limitations:
          – Massive updates still somewhat costly until the next merge (like deletes)
          – Can only update existing fields
      • Obviously doesn’t address the full-text inverted field updates
  • 11. Lucene ParallelReader overview
      • Pretends that two or more IndexReader-s are slices of the same index
          – Slices contain data for different fields
          – Both stored and inverted parts are supported
          – Data for matching docs is joined on the fly
      • Structure of all indexes MUST match 1:1 !!!
          – The same number of segments
          – The same count of docs per segment
          – Internal document ID-s must match 1:1
          – List of deletes is taken from the first index
      • Sounds cool … but in practice it’s rarely used:
          – It’s very difficult to meet these requirements
          – This is even more difficult in the presence of index updates and merges
      [Diagram: ParallelReader joins a main IR (fields f1, f2, …) and a parallel IR (fields f3, f4, …); both have three segments of 4, 2 and 1 documents, with global docIds 0-6 matching 1:1.]
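The 1:1 structural requirement and the on-the-fly join can be sketched with plain collections. This is a toy model only: `ParallelJoin` is a hypothetical name, an index is modelled as a list of segments, a segment as a list of documents and a document as a field map. No Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the ParallelReader idea; it does not use Lucene classes.
final class ParallelJoin {
    // Joins two indexes document by document. Both must have the same number
    // of segments and the same number of documents in each segment.
    static List<Map<String, String>> join(List<List<Map<String, String>>> main,
                                          List<List<Map<String, String>>> parallel) {
        if (main.size() != parallel.size()) {
            throw new IllegalArgumentException("segment counts differ");
        }
        List<Map<String, String>> joined = new ArrayList<>();
        for (int seg = 0; seg < main.size(); seg++) {
            List<Map<String, String>> a = main.get(seg);
            List<Map<String, String>> b = parallel.get(seg);
            if (a.size() != b.size()) {
                throw new IllegalArgumentException("doc counts differ in segment " + seg);
            }
            for (int doc = 0; doc < a.size(); doc++) {
                // Same position in both slices means same document: merge the fields.
                Map<String, String> merged = new HashMap<>(a.get(doc));
                merged.putAll(b.get(doc));
                joined.add(merged);
            }
        }
        return joined;
    }
}
```

The join itself is trivial; the hard part, as the slide says, is keeping the two structures identical while the main index changes.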
  • 12. Handling updates via ParallelReader
      • Pros:
          – All types of data (e.g. searchable full-text labels) can be added
      • Cons:
          – Must ensure that the other index always matches the structure of the main index
          – Complicated and fragile (rebuild on every update?)
          – No tools to manage this parallel index in Solr
      [Diagram: ParallelReader over a main IR (fields f1, f2, …) and a parallel IR (fields f3, f4, …).]
  • 13. Sidecar Index Components for Solr
      • Uses the ParallelReader strategy for field updates
          – “Main” and “sidecar” data comes from two different Solr collections
          – “Sidecar” collection is updated independently from the main collection
          – “Sidecar” collection is used as a source of document fields for building and updating a parallel index
      • Integrates the management of ParallelReader (“sidecar index”) into Solr
          – Initial creation of ParallelReader, including synchronization of internal ID-s
          – Tracking of updates and IndexReader.reopen(…) events
      • Partly based on a version of Click Framework in LucidWorks Search
      • Available under Apache License here: http://github.com/LucidWorks/sidecar_index
  • 14.
  • 15.
  • 16. “Main”, “sidecar” collections and parallel index
      • “Main” collection contains only the parts of documents with “main” fields
      • “Sidecar” collection is a source of documents with “sidecar” fields
      • SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index)
      • “Main” collection uses SidecarIndexReader that acts as ParallelReader
      • Main index is updated as usual, via the “main” collection’s IndexWriter
      [Diagram: Solr hosts Main_collection and Sidecar_collection; the SidecarIndexReader of Main_collection combines the main index and the sidecar index.]
  • 17. Implementation details
      • SidecarIndexReaderFactory extends Solr’s IndexReaderFactory
          – newReader(Directory, SolrCore) – initial open
          – newReader(IndexWriter, SolrCore) – NRT open
      • SidecarIndexReader acts like a ParallelReader
          – Solr wants DirectoryReader, but ParallelReader is not a DirectoryReader
          – Basically had to re-implement the logic from ParallelReader
      • ParallelReader challenges:
          – How to synchronize internal ID-s?
          – How to create segments that are of the same size as those of the main index?
          – How to handle deleted documents?
          – How to handle updates to the main index?
          – How to handle updates to the sidecar data?
  • 18. ParallelReader challenges and solutions
      • How to synchronize internal ID-s?
          – “Main” collection is traversed sequentially by internal docId
          – Primary key is retrieved for each document
          – Matching document is found in the “sidecar” collection
          – Matching document is added to the “sidecar” index
      • Very costly phase!
          – Random seek and retrieval from “sidecar” collection
          – Primary key lookup is fast
          – … but stored field retrieval and indexing isn’t
      [Diagram: the sidecar collection holds documents in the order G, B, C, E, A, F, D and is queried by primary key (e.g. q=id:D); the main IR holds D, B, A, F | C, G | E in three segments (global docIds 0-6), and the sidecar IR is built in that same order.]
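The ID synchronization pass can be sketched as a single loop over the main index in internal docId order. This is a simplified model, not the sidecar_index code: `SidecarSync` is a hypothetical name, the main index is reduced to its list of primary keys, and a map lookup stands in for the per-document query against the sidecar collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the synchronization pass; no Solr classes are used.
final class SidecarSync {
    // mainKeys: primary keys of the main index, listed in internal docId order.
    // sidecarByKey: sidecar collection addressed by primary key
    //               (stands in for a lookup such as q=id:KEY).
    static List<String> buildSidecar(List<String> mainKeys, Map<String, String> sidecarByKey) {
        List<String> sidecar = new ArrayList<>(mainKeys.size());
        for (String key : mainKeys) {
            // One random-access lookup per main document: the costly part.
            String fields = sidecarByKey.get(key);
            if (fields == null) {
                throw new IllegalStateException("no sidecar document for " + key);
            }
            // Appending in traversal order gives the sidecar doc the same position.
            sidecar.add(fields);
        }
        return sidecar;
    }
}
```

With the main order D, B, A, F, C, G, E from the diagram, the result lists the sidecar fields of D first and E last, regardless of the order in which the sidecar collection stores them.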
  • 19. ParallelReader challenges and solutions
      • Optimization 1: don’t rebuild data for unmodified segments
      • Optimization 2 (cheating): ignore NRT segments
      • How to handle deleted docs?
          – Insert dummy (empty) documents so that the number and the order of documents still match
      [Diagram: ParallelReader over a main IR and a sidecar IR; a deleted document (X) in the main IR is matched by a dummy document in the sidecar IR, and an NRT segment in the main IR has no sidecar counterpart.]
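A minimal sketch of the dummy-document idea, assuming a deleted slot in the main segment is represented by a `null` key. `DummyPadding` and the empty-string dummy are illustrative choices, not the representation used by the sidecar code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; an empty string stands in for an empty dummy document.
final class DummyPadding {
    static final String DUMMY = "";

    // mainKeys: primary keys of one main segment in docId order;
    //           a null entry marks a deleted document.
    static List<String> buildSegment(List<String> mainKeys, Map<String, String> sidecarByKey) {
        List<String> segment = new ArrayList<>(mainKeys.size());
        for (String key : mainKeys) {
            String fields = (key == null) ? null : sidecarByKey.get(key);
            // Deleted or unmatched slots still get a placeholder, so every
            // later document keeps the same position as in the main segment.
            segment.add(fields == null ? DUMMY : fields);
        }
        return segment;
    }
}
```

The placeholder costs a little space but keeps document counts and order aligned without renumbering anything.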
  • 20. Implementation: SidecarMergePolicy
      • How to create segments that are of the same size as the “main” index?
      • Carefully manage the “sidecar” index creation:
          – IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
          – Force flush when reaching the next target count of documents
          – Merges are enforced using SidecarMergePolicy that tracks the sizes of the “main” index segments
      [Diagram: main IR and sidecar IR with matching segments; SidecarMergePolicy target sizes: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc.]
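The "flush at the next target count" rule can be illustrated by cutting an ordered document stream at the main index's segment sizes. `TargetSizeFlusher` is a hypothetical name, and the sketch only models the resulting boundaries; it does not involve IndexWriter, flushing or merge policies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-matched segment creation.
final class TargetSizeFlusher {
    // Splits docs, already in main-index order, into segments whose sizes
    // equal the given target sizes (taken from the main index's segments).
    static List<List<String>> partition(List<String> docs, int[] targetSizes) {
        List<List<String>> segments = new ArrayList<>();
        int from = 0;
        for (int size : targetSizes) {
            if (from + size > docs.size()) {
                throw new IllegalArgumentException("not enough documents for the target sizes");
            }
            // Reaching the target count closes the current segment ("flush").
            segments.add(new ArrayList<>(docs.subList(from, from + size)));
            from += size;
        }
        if (from != docs.size()) {
            throw new IllegalArgumentException("documents left over after the last target");
        }
        return segments;
    }
}
```

For the slide's targets of 4, 2 and 1 documents, seven input documents yield three segments of exactly those sizes.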
  • 21. Implementation: SidecarIndexReader
      • Re-implements the logic of ParallelReader
          – ParallelReader != DirectoryReader
      • Exposes Directory of the “main” index for replication
          – Replicas need the “sidecar” collection replica to rebuild the sidecar index locally
          – If document routing and shard placement is the same then we don’t have to use distributed search – all data will be local
      • Reopen(…) avoids rebuilding unmodified segments
      • Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
          – When there’s a major merge in the “main” index
          – When “sidecar” data is updated
      • Ref-counting of IndexReaders at different levels is very tricky!
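One way to picture "reopen avoids rebuilding unmodified segments" is a cache keyed by segment name, since an unchanged main segment can reuse the sidecar data built for it earlier. The sketch below is an assumption about how such reuse could look, not the actual SidecarIndexReader logic; `SegmentCache` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical per-segment cache of sidecar data.
final class SegmentCache {
    private final Map<String, List<String>> built = new HashMap<>();
    int rebuilds = 0; // counts how often the expensive builder actually ran

    // Returns cached sidecar data for a segment, building it only on first use.
    List<String> sidecarFor(String segmentName, Supplier<List<String>> builder) {
        return built.computeIfAbsent(segmentName, name -> {
            rebuilds++;
            return builder.get();
        });
    }

    // After a reopen, forgets segments that no longer exist (e.g. merged away).
    void retainOnly(Set<String> liveSegments) {
        built.keySet().retainAll(liveSegments);
    }
}
```

A major merge replaces many segments at once, so almost nothing survives `retainOnly` and the rebuild cost approaches that of a full rebuild.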
  • 22. Example configuration in solrconfig.xml
      <indexReaderFactory name="IndexReaderFactory"
                          class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
        <str name="docIdField">id</str>
        <str name="sourceCollection">source</str>
        <bool name="enabled">true</bool>
      </indexReaderFactory>
  • 23. Example use case: integration of click-through data
      • Raw click-through data:
          – Query, query_time, docId, click_time [, user]
      • Aggregated click-through data:
          – User-generated popularity score: F(number and timing of clicks per docId)
              • Numeric updates
          – User-generated labels: F(top-N queries that led to clicks on docId)
              • Full-text searchable updates
          – User profiles: F(top-N queries per user, top-N docId-s clicked, etc)
          – …
      • Queries can now be expanded to score based on TF/IDF in user-generated labels
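The two aggregations named above can be sketched as follows. The slide does not define F, so the exponential time decay used for popularity is purely an assumption made for this example, as are the `Click` and `ClickAggregator` names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical raw click record: which query led to a click on which document, and when.
final class Click {
    final String query;
    final String docId;
    final long clickTimeMs;

    Click(String query, String docId, long clickTimeMs) {
        this.query = query;
        this.docId = docId;
        this.clickTimeMs = clickTimeMs;
    }
}

final class ClickAggregator {
    // Popularity per docId: each click contributes 0.5^(age / halfLife).
    // The decay form is an assumption; the talk only says F(number and timing).
    static Map<String, Double> popularity(List<Click> clicks, long nowMs, long halfLifeMs) {
        Map<String, Double> score = new HashMap<>();
        for (Click c : clicks) {
            double age = Math.max(0, nowMs - c.clickTimeMs);
            score.merge(c.docId, Math.pow(0.5, age / halfLifeMs), Double::sum);
        }
        return score;
    }

    // Labels for one document: the n most frequent queries that led to clicks on it.
    static List<String> topQueries(List<Click> clicks, String docId, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (Click c : clicks) {
            if (c.docId.equals(docId)) {
                counts.merge(c.query, 1, Integer::sum);
            }
        }
        List<String> queries = new ArrayList<>(counts.keySet());
        // Descending click count, then alphabetical so ties are deterministic.
        queries.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return queries.subList(0, Math.min(n, queries.size()));
    }
}
```

The popularity value is a single number per document, so it fits numeric updates, while the query labels are text and need a searchable field such as one supplied by the sidecar index.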
  • 25.
  • 26. Scalability and performance
      • Initial full rebuild is very costly
          – ~0.6 ms / document
          – 1 mln docs = 600 sec = 10 min
          – Not even close to “real time” …
      • Cost related to new segments in “main” index depends on the size of segments
      • Major merge events will trigger full rebuild
      • BUT: search-time cost is negligible
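The rebuild estimate above is a straight multiplication, which makes it easy to project other corpus sizes. `RebuildEstimate` is a hypothetical helper, and the linear cost model is an assumption that simply extends the slide's ~0.6 ms per document figure.

```java
// Hypothetical helper that scales the per-document rebuild cost linearly.
final class RebuildEstimate {
    // Estimated full-rebuild time in seconds for a given per-document cost.
    static double seconds(long docCount, double millisPerDoc) {
        return docCount * millisPerDoc / 1000.0;
    }

    // The same estimate expressed in minutes.
    static double minutes(long docCount, double millisPerDoc) {
        return seconds(docCount, millisPerDoc) / 60.0;
    }
}
```

For one million documents at 0.6 ms each this gives about 600 seconds, or roughly 10 minutes, matching the slide; under the same assumption ten million documents would take well over an hour and a half.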
  • 27. Caveats
      • Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
          – The sidecar code is still unstable and occasionally explodes
      • Performance of full rebuild quickly becomes the bottleneck on frequent updates
          – So the main use case is massive but infrequent updates of “sidecar” data
      • Code: http://github.com/LucidWorks/sidecar_index
      • Fixes and contributions are welcome – the code is Apache licensed
  • 28. Agenda
      • Challenge: incremental document updates
      • Existing solutions and workarounds
      • Sidecar index strategy and components
      • Scalability and performance
      • Q&A