2. Flipkart’s Index
1. Data is organized in multiple indexes (Solr cores), with a couple of million
documents.
2. SKUs are documents.
3. Extensive use of facets and filters.
4. Not every search allows faceting.
Lots of custom components
1. Custom collectors (for blending results for diversity /
personalization)
2. Custom query parsers (for heavily customized scoring)
3. Custom fields
3. Typical Ecommerce Document
● Catalogue data
○ Static
○ Largely textual
● Pricing related data
○ Dynamic
○ Faster moving
● Offers
○ Channel specific based on nature of event
● Availability
○ Dynamic
○ Faster moving
and more...
4. First Cut Integration
1. Catalogue Management System, aka CMS
a. Single source of truth for all systems
b. Merges data from multiple sources using joins and keeps the latest snapshot,
keyed by Product Id
c. Raises a notification whenever the data changes.
[Diagram: Catalogue Management System (static and dynamic) → Notification → Data Import Handler (Fetch, Transform, Dedup, Update) → SOLR. Also labelled: Sales Signals, Custom tags.]
5. But...
1. Limitations
a. Too much data (and more than 80% of it is of no interest to the search system).
b. CMS has to keep data forever (remember, it is the source of truth), but the search system
doesn’t need to index all documents (e.g. obsolete products). So lots of drops.
c. Merging becomes too much for CMS and introduces lag.
2. DIH limitations
a. Single-threaded. (Multithreading had bugs and was removed in 4.x, SOLR-3262.)
b. Too many notifications from CMS. Fetch, transform, compare, discard still costs, and
being single-threaded doesn’t help.
c. Some signals are of interest to the search system only (normalized revenue, tag pages), but
they are difficult to integrate proactively.
6. So CMS is re-factored
[Diagram labels: CMS (service); Dynamic Field 1 Service (service); Notification stream (twice); dynamic sorting fields (sparse, but a lot of them) (MySQL DB); Snapshot; SOLR Master; External Field, consumed through DIH; Solr Slaves.]
7. Why are partial updates a challenge in Lucene?
1. Update
a. Lucene doesn’t support partial updates. They are tough to do with an inverted
index, because all the terms for that document need to be updated. Lots of open
tickets:
b. LUCENE-4272 (term vector based), LUCENE-3837, LUCENE-4258
(overlay segment based), Incremental Field Updates through Stacked
Segments
c. Document @ t1 → Term vectors {T1, T2, T3, T4, T5}
d. Document @ t2 → Term vectors {T1, T4, T10}
e. The inverted index actually stores a posting list for each of its terms. These posting
lists are quite sparse and are compressed using delta encoding for efficiency
reasons.
f. T1 → {1, 5, 7}, etc.
g. T2 → {2, 5, 6}
h. To support a partial update, the document has to be removed from the posting
lists of all its previous terms. That is non-trivial, because it involves
remembering and storing all the terms for a given document.
i. So instead, Lucene and other inverted index systems mark the old document as
deleted in another data structure (live docs).
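The mechanics above can be sketched with a toy index. This is a minimal illustration in plain Java, not Lucene code: the class `ToyInvertedIndex` and all of its methods are hypothetical names. It assumes doc ids only ever grow, so each posting list can be stored as gaps (delta encoding), and it models live docs as a `BitSet`. An update never edits a posting list; it clears the old document's live bit and appends the new version under a fresh doc id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Toy append-only inverted index (hypothetical, not the Lucene API). */
final class ToyInvertedIndex {
    // term -> gaps between successive doc ids (the first gap is the doc id itself)
    private final Map<String, List<Integer>> deltaPostings = new HashMap<>();
    private final Map<String, Integer> lastDocForTerm = new HashMap<>();
    private final BitSet liveDocs = new BitSet();
    private int nextDocId = 0;

    /** Adds a document; doc ids only grow, so each posting list is appended in order. */
    int add(String... terms) {
        int docId = nextDocId++;
        for (String term : new LinkedHashSet<>(Arrays.asList(terms))) {
            int previous = lastDocForTerm.getOrDefault(term, 0);
            deltaPostings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId - previous);
            lastDocForTerm.put(term, docId);
        }
        liveDocs.set(docId);
        return docId;
    }

    /** Deleting never touches the posting lists; it only clears the live bit. */
    void delete(int docId) {
        liveDocs.clear(docId);
    }

    /** An "update" is delete + add under a new doc id, whichever attribute changed. */
    int update(int oldDocId, String... newTerms) {
        delete(oldDocId);
        return add(newTerms);
    }

    /** Decodes the gaps back into doc ids and skips documents marked as deleted. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        int docId = 0;
        for (int gap : rawGaps(term)) {
            docId += gap;
            if (liveDocs.get(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    /** Exposes the stored gaps so the encoding can be inspected. */
    List<Integer> rawGaps(String term) {
        return deltaPostings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        int doc = index.add("T1", "T2", "T3", "T4", "T5");   // document @ t1
        index.add("T1", "T9");                                // an unrelated document
        index.update(doc, "T1", "T4", "T10");                 // document @ t2
        System.out.println("T2 hits: " + index.search("T2"));
        System.out.println("T1 hits: " + index.search("T1") + ", gaps: " + index.rawGaps("T1"));
    }
}
```

Note that the old doc id stays in the posting lists for T2, T3 and T5 after the update; only the live-docs check hides it at search time.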
8. Why are partial updates a challenge in Lucene?
1. What it means is that an update is actually
a. Delete + add (regardless of which
attribute changed).
b. Deleted documents are compacted by a
background merge thread.
2. Updates become visible only after a commit
a. A soft commit creates a new segment in
memory.
b. A hard commit does an fsync to the directory.
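The visibility rule can be sketched as follows. This is a simplified model in plain Java with hypothetical names (`CommitVisibilitySketch`, `addDocument`, `commit`, `search`); it assumes searches read an immutable snapshot and that a commit is the only thing that publishes a new one. It does not model the difference between soft and hard commits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of commit visibility (hypothetical, not the Lucene or Solr API). */
final class CommitVisibilitySketch {
    private final List<String> buffered = new ArrayList<>();   // written, not yet searchable
    private List<String> visible = Collections.emptyList();    // snapshot that searches see
    private int segmentsCreated = 0;

    void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Publishes buffered writes as a new snapshot; each commit with changes adds a segment. */
    void commit() {
        if (buffered.isEmpty()) {
            return;
        }
        List<String> next = new ArrayList<>(visible);
        next.addAll(buffered);
        buffered.clear();
        visible = Collections.unmodifiableList(next);
        segmentsCreated++;
    }

    List<String> search() {
        return visible;
    }

    int segmentsCreated() {
        return segmentsCreated;
    }

    public static void main(String[] args) {
        CommitVisibilitySketch index = new CommitVisibilitySketch();
        index.addDocument("sku-1 price=100");
        System.out.println("before commit: " + index.search());
        index.commit();
        System.out.println("after commit:  " + index.search());
    }
}
```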
9. But do we need to re-index a document? Let’s evaluate
1. Lucene might hold 3 kinds of data
a. Data used for actual search (analyzed, converted into tokens)
b. Data used for plain filtering (not analyzed, e.g. price, discount)
c. Data used for ranking (e.g. relevancy signals, and there can be a
lot of them)
2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
a. The pipeline can be spam filtering → text cleaning → duplicate
detection → NLP → entity extraction, etc.
3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags,
usually with enumerated values
a. Can be dynamic
b. Can be governed by policies and business constraints.
10. But do we need to re-index a document? Let’s evaluate
1. Ranking signals ⇒ need to be row oriented (doc id → value, rather than term → doc ids).
a. Can be batch updates (e.g. category-specific ranks, ratings)
or real-time updates (e.g. availability).
b. Lucene actually un-inverts such fields using FieldCache.
c. Doc values were introduced to manage the cost of
FieldCache and to provide better updatability.
d. Updatable NumericDocValues (LUCENE-5189, since 4.6),
updatable binary doc values (LUCENE-5513, since 4.8)
e. Solr still doesn’t have updatable doc values. A Jira ticket is
open (SOLR-5944), but there are issues around update/write-ahead logs.
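Un-inverting can be illustrated with a small sketch. It is plain Java with hypothetical names (`UninvertSketch`, `uninvert`), not the FieldCache implementation. It assumes a single-valued numeric field whose inverted form maps each value to the doc ids that carry it, and it rebuilds a doc-indexed array from that form. Once the data is laid out per document, changing one document's signal is a single array write, which is what makes this layout friendlier to updates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy un-inversion of a single-valued numeric field (hypothetical, not FieldCache). */
final class UninvertSketch {

    /** Walks every value's posting list once and fills a doc-indexed array. */
    static long[] uninvert(Map<Long, int[]> postingsByValue, int maxDoc, long missing) {
        long[] perDoc = new long[maxDoc];
        Arrays.fill(perDoc, missing);
        for (Map.Entry<Long, int[]> entry : postingsByValue.entrySet()) {
            for (int docId : entry.getValue()) {
                perDoc[docId] = entry.getKey();
            }
        }
        return perDoc;
    }

    public static void main(String[] args) {
        // Inverted view of a "rating" field: value -> doc ids holding that value.
        Map<Long, int[]> ratingPostings = new LinkedHashMap<>();
        ratingPostings.put(3L, new int[] {0, 2});
        ratingPostings.put(5L, new int[] {1});

        long[] ratingByDoc = uninvert(ratingPostings, 4, -1L);
        System.out.println(Arrays.toString(ratingByDoc));

        ratingByDoc[1] = 4L;   // doc-indexed layout: an update is one array write
        System.out.println(Arrays.toString(ratingByDoc));
    }
}
```

The cost that FieldCache pays is the full walk in `uninvert`, repeated whenever a new reader is opened; a doc-values style layout stores the array form up front instead.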
11. First Approach: Leverage Updatable Numeric DocValues
1. Solr limitation: easily overcome in a master-slave model by
plugging in your own update chain and accessing IndexWriter
directly.
2. But:
a. You need a commit for docvalues updates to be reflected. (Not real time!)
b. Filtering on docvalues is inefficient, especially on numeric
fields.
c. Making it work in SolrCloud is non-trivial. For details please
see SOLR-5944.
d. Docvalues are dense. Updates are not stacked: every commit
dumps the full view of the modified field’s doc values
(optimizing for search performance).
(http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html)
e. But what if we had doc values for 500 fields across millions of docs?
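A back-of-the-envelope calculation shows why dense rewrites hurt at that scale. The numbers below are assumptions for illustration only: 5 million documents, 500 updated fields, and 8 bytes per value with no compression. Real doc values are packed more tightly, and only fields and segments that actually received updates are rewritten, so treat this as a worst-case upper bound rather than a measurement. The class and method names are hypothetical.

```java
/** Worst-case estimate of bytes rewritten per commit for dense doc values (illustrative only). */
final class DenseDocValuesCost {

    /** Every touched field is written out for every document, regardless of how few changed. */
    static long bytesRewrittenPerCommit(long maxDoc, int fieldsTouched, int bytesPerValue) {
        long values = Math.multiplyExact(maxDoc, (long) fieldsTouched);
        return Math.multiplyExact(values, (long) bytesPerValue);
    }

    public static void main(String[] args) {
        long maxDoc = 5_000_000L;   // assumption: "millions of docs"
        int fields = 500;           // from the slide: 500 doc-values fields
        int bytesPerValue = 8;      // assumption: uncompressed 64-bit values

        long bytes = bytesRewrittenPerCommit(maxDoc, fields, bytesPerValue);
        System.out.println(bytes + " bytes rewritten in the worst case");
    }
}
```

Under these assumptions the worst case is 20,000,000,000 bytes (20 GB) per commit, even if only a handful of documents changed.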
12. First Approach: Leverage Updatable Numeric DocValues
1. Commit caveats:
a. Soft commits are NOT FREE.
A soft commit in Solr = IndexWriter.getReader() in Lucene =
flush + open.
There is NRTCachingDirectory, which caches the small
segments produced and makes soft commits cheaper.
Details can be found in McCandless’s post.
b. Solr invalidates all caches on every commit, and they have to be
regenerated. Some caches, like filterCache,
have a huge impact on performance. Warming them up
can itself take 2-3 minutes at times.
c. Warmup puts memory pressure on the JVM and causes spikes
in allocations. Some caches, like documentCache, can’t
even be warmed up.
d. More commits ⇒ more segments ⇒ more merges
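The cache cost can be sketched with a toy model. This is plain Java with hypothetical names (`PerSearcherCacheSketch`, `filter`, `commit`), not Solr's cache implementation. It assumes the filter cache belongs to the current searcher, so a commit replaces it with an empty one, and it counts how often the expensive doc-set computation runs. Passing filters to `commit` stands in for warm-up.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a per-searcher filter cache (hypothetical, not Solr's filterCache). */
final class PerSearcherCacheSketch {
    private Map<String, BitSet> filterCache = new HashMap<>();   // owned by the current searcher
    private int filterComputations = 0;

    /** Placeholder for the expensive part: building the doc set for a filter query. */
    private BitSet computeDocSet(String filterQuery) {
        filterComputations++;
        BitSet docs = new BitSet();
        docs.set(Math.floorMod(filterQuery.hashCode(), 1000));   // dummy content
        return docs;
    }

    /** Returns the cached doc set, computing it only on a cache miss. */
    BitSet filter(String filterQuery) {
        return filterCache.computeIfAbsent(filterQuery, this::computeDocSet);
    }

    /** A commit opens a new searcher with an empty cache, then optionally warms it. */
    void commit(String... warmupFilters) {
        filterCache = new HashMap<>();
        for (String filterQuery : warmupFilters) {
            filter(filterQuery);
        }
    }

    int filterComputations() {
        return filterComputations;
    }

    public static void main(String[] args) {
        PerSearcherCacheSketch searcher = new PerSearcherCacheSketch();
        for (int i = 0; i < 3; i++) {
            searcher.filter("brand:acme");
        }
        System.out.println("computations before commit: " + searcher.filterComputations());
        searcher.commit();
        searcher.filter("brand:acme");
        System.out.println("computations after commit:  " + searcher.filterComputations());
    }
}
```

Warm-up moves the recomputation from the first query to commit time; it does not remove it, so frequent commits pay the cost either way.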
13. Second Approach: NRT Store and Value Sources
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
- abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext)
Gets the values for this reader and the context that was previously passed to createWeight()
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.htm
FunctionValues
- boolean exists(int doc) : Returns true if there is a value for this document
- double doubleVal(int doc)
Value sources allowed us to plug external data sources right inside Solr. This
external data need not be part of the index itself, but it must be cheap to retrieve,
because the lookup is called millions of times, right inside a loop.
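The idea can be sketched without any Lucene dependency. Everything below is hypothetical: the interface `PerDocValues` merely mirrors the two `FunctionValues` methods quoted above (`exists` and `doubleVal`), and `NrtSignalStore` is an assumed in-memory store with one slot per doc id. It is not the authors' implementation. The point it illustrates is that a write to the store is visible to the very next scoring pass, with no commit involved, and that each lookup is a plain array read.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch of scoring against an external per-document store (hypothetical names throughout). */
final class NrtValueSourceSketch {

    /** Mirrors the two FunctionValues methods quoted on the slide; not the Lucene class. */
    interface PerDocValues {
        boolean exists(int doc);
        double doubleVal(int doc);
    }

    /** One slot per doc id, updated outside the index. NaN is reserved as the "missing" marker. */
    static final class NrtSignalStore implements PerDocValues {
        private static final long MISSING = Double.doubleToRawLongBits(Double.NaN);
        private final AtomicLongArray slots;

        NrtSignalStore(int maxDoc) {
            slots = new AtomicLongArray(maxDoc);
            for (int i = 0; i < maxDoc; i++) {
                slots.set(i, MISSING);
            }
        }

        void put(int doc, double value) {
            slots.set(doc, Double.doubleToRawLongBits(value));
        }

        @Override
        public boolean exists(int doc) {
            return slots.get(doc) != MISSING;
        }

        @Override
        public double doubleVal(int doc) {
            long bits = slots.get(doc);
            return bits == MISSING ? 0.0 : Double.longBitsToDouble(bits);
        }
    }

    /** The hot loop: one lookup per matching document, so lookups must be array-cheap. */
    static double[] score(double[] textScores, PerDocValues signal) {
        double[] scores = new double[textScores.length];
        for (int doc = 0; doc < textScores.length; doc++) {
            double boost = signal.exists(doc) ? signal.doubleVal(doc) : 1.0;
            scores[doc] = textScores[doc] * boost;
        }
        return scores;
    }

    public static void main(String[] args) {
        NrtSignalStore availability = new NrtSignalStore(3);
        double[] textScores = {1.0, 3.0, 5.0};

        availability.put(1, 2.0);
        System.out.println(Arrays.toString(score(textScores, availability)));

        availability.put(1, 0.5);   // no commit: the next scoring pass sees the new value
        System.out.println(Arrays.toString(score(textScores, availability)));
    }
}
```

A real integration would also have to map index doc ids to the store's keys per segment, which this sketch leaves out.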
14. The Challenge
1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
2. Solution: get rid of the query cache altogether. But we still have filterCache.
3. So now matching and scoring had to be really fast.
a. Calls to the value source need to be extremely fast. We have optimized them so
that they are as fast as accessing doc values.
b. The cost of the ranking functions themselves. Some of the optimizations involved
measuring and reducing the cost of the Math functions themselves.
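The slides do not say which Math optimizations were used. As one common example of the kind of change that helps inside a scoring loop, the sketch below replaces repeated `Math.log1p` calls on small integer inputs with a precomputed table. The class name, the table size and the choice of `log1p` are all assumptions made for illustration.

```java
/** Lookup table for log1p on small non-negative integers (illustrative assumption, not the authors' code). */
final class LogTableSketch {
    private static final int TABLE_SIZE = 1 << 16;   // assumption: most counts fall below this
    private static final float[] LOG1P = new float[TABLE_SIZE];

    static {
        for (int i = 0; i < TABLE_SIZE; i++) {
            LOG1P[i] = (float) Math.log1p(i);
        }
    }

    /** Table lookup for small counts, falling back to Math.log1p for large ones. */
    static float fastLog1p(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative: " + count);
        }
        return count < TABLE_SIZE ? LOG1P[count] : (float) Math.log1p(count);
    }

    public static void main(String[] args) {
        int reviewCount = 99;   // hypothetical ranking signal
        System.out.println(fastLog1p(reviewCount) + " vs " + Math.log1p(reviewCount));
    }
}
```

Whether a table actually beats the intrinsic depends on the JVM and cache behaviour, so it is worth profiling before adopting it.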
15. So the learnings
1. Understand your data, its change rate and what you want to do with it.
2. Solr / Lucene have really good abstractions around both indexing and querying. Both
provide you with a lot of hooks and plugins. Think them through and take advantage of them.
3. Experiment, profile and benchmark. Delve into the APIs and internals.
4. The experts do help. The insights that docValues are dense and that soft commits are not free
came directly out of discussions with Shalin.
5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built
a Lucene codec (which built and updated the inverted index in Redis).