The document summarizes a presentation on search in Plone given at the Plone Conference in San Francisco on November 3, 2011. The presentation introduced information retrieval (IR) concepts, described the ZCatalog and Solr search engines, and discussed the conclusions from a conference discussion about integrating Solr with Plone. Key points included that Solr has advantages over ZCatalog for relevance and features, but ZCatalog cannot be completely replaced, and the current Solr add-ons do not provide the best foundation for future integration.
5. IR 101
• Transformations
• Terms
• Models
• Measures
Tuesday, November 29, 2011
6. IR 101
Transformations
• Turn binary, HTML, or other document formats into fields and strings
• Parse the strings into a set of terms
• Build indexes of the terms specific to the IR model used
• Queries are parsed into query operators and strings, which are parsed into terms
7. IR 101
String => Terms
• Tokenization - locate word boundaries
• Normalization - lowercase and strip diacritics
• Stopping - remove stop words (a, of, on, the...)
• Stemming - reduce to word stems (walks, walking => walk)
• Recognizers - concepts, parts of speech, names, locations...
• Processing must be identical for documents and queries
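The steps above can be sketched as a small analysis pipeline. This is a minimal illustration, not how ZCatalog or Solr implement it: the stop word list and the suffix-stripping stemmer are toy assumptions, and all function names are hypothetical.

```python
import re
import unicodedata

# Illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "of", "on", "the", "in", "and", "is"}

def tokenize(text):
    """Tokenization: locate word boundaries (runs of word characters)."""
    return re.findall(r"\w+", text)

def normalize(token):
    """Normalization: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(token):
    """Stemming: toy suffix stripper; a real system would use e.g. a Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Run the same chain for documents and for queries."""
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("Walks on the walking path"))  # ['walk', 'walk', 'path']
```

Because documents and queries both go through analyze(), a query for "walking" matches a document that contains "walks".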
8. IR 101
Terms
• Application specific
• Words or phrases
• IR models assign weights to terms in documents
9. IR 101
Term Weighting
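• Simplest: Yes/No Boolean value
• Better: Term Frequency - number of occurrences of the term in the document
• More meaningful: tf-idf
• Term Frequency * Inverse Document Frequency
• Inverse Document Frequency reflects how many documents contain the term
• Increases the weight of rare terms and decreases the weight of common ones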
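A minimal tf-idf sketch, assuming the common textbook form weight = tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. Real engines use smoothed or normalized variants; the function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights

docs = [
    ["plone", "search", "search"],
    ["plone", "solr"],
    ["plone", "catalog"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this example "plone" occurs in every document, so its weight is zero, while "search" occurs twice in a single document and gets the highest weight.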
10. IR 101
Boolean Model
• First and most widely adopted
• Based on Boolean logic + set theory
• Does a document contain query terms - Y/N
• Intuitive, easy to implement
• Drawbacks: no ranking, requires a special query language, returns too many or too few results
• Typical for library systems
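A minimal sketch of the Boolean model, assuming an inverted index that maps each term to the set of documents containing it. AND, OR, and NOT then become set intersection, union, and difference. All names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

docs = {1: ["plone", "search"], 2: ["plone", "solr"], 3: ["solr", "search"]}
index = build_index(docs)
print(boolean_and(index, "plone", "search"))  # {1}
print(boolean_not(index, docs, "plone"))      # {3}
```

The results are plain sets: a document either matches or it does not, which is why this model has no ranking.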
11. IR 101
Vector Space Models
• Represent documents and queries as vectors of terms
• Term values are weighted - by count or tf-idf
• Use vector operations to compare documents with queries
• Relevance score based on the cosine of the angle between document and query vectors
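A minimal sketch of cosine ranking, assuming documents and queries are stored as sparse {term: weight} dictionaries. The weights could be raw counts or tf-idf values; the function names are hypothetical.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return doc_ids with a non-zero score, best match first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

doc_vecs = {
    "d1": {"solr": 1, "plone": 1},
    "d2": {"solr": 1},
    "d3": {"plone": 1},
}
print(rank({"solr": 1}, doc_vecs))  # ['d2', 'd1']
```

Unlike the Boolean model, every matching document gets a score, so results can be ordered by relevance.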
12. IR 101
Probabilistic Models
• Compute probability that a document is relevant to a query
• Relevance ranking functions range from simple to complex
• Sophisticated ranking functions include
• Okapi BM25 (uses tf and idf)
• Machine learning formulas (use training data)
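A minimal sketch of Okapi BM25 scoring in its commonly cited form. The parameter defaults k1=1.2 and b=0.75 and the non-negative idf variant used here are assumptions; the exact formula in ZCTextIndex or any other engine may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs: {doc_id: list of terms}. Returns {doc_id: score}."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(t for terms in docs.values() for t in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(terms) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates: extra occurrences add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores

docs = {
    "a": ["solr", "solr", "search"],
    "b": ["plone", "search", "catalog"],
    "c": ["plone", "zope", "catalog"],
}
print(bm25_scores(["solr"], docs))
```

Only document "a" contains "solr", so it is the only one with a non-zero score for that query.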
13. IR 101
Extending the Models
• Many, many refinements possible
• Term interdependencies
• Fuzzy sets
• Semantic analysis, link analysis
• Combining models (Extended Boolean)
• The best search engines represent thousands of engineering hours
14. IR 101
Measures
• Search engine results are measured by:
• Precision - percent of returned results that are relevant
• Recall - percent of relevant documents that are returned
• F-Score - harmonic mean of precision and recall
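The three measures can be computed from the set of returned results and the set of documents known to be relevant. A minimal sketch, assuming the balanced F-score (F1); the function name is hypothetical.

```python
def precision_recall_f1(returned, relevant):
    """returned, relevant: iterables of doc_ids. Returns (precision, recall, f1)."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 4 results returned, 3 documents are relevant, 2 of them were found.
print(precision_recall_f1([1, 2, 3, 4], [2, 4, 6]))
```

Here precision is 2/4 and recall is 2/3, so the harmonic mean is 4/7, slightly closer to the lower of the two values.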
16. ZCatalog
• Zope/Plone search engine
• Full text and field searching
• Probabilistic model using Okapi BM25
• Out-of-the-box ZCTextIndex is very simple
• TextIndexNG adds multilingual support, better parsing components, binary transforms, synonyms
17. Solr
• Popular open source enterprise search platform
• Eliminating smaller commercial search companies
• Written in Java, based on the Lucene Java search library, with a sophisticated vector space model plus extensions
• RESTful APIs
• Large, active community
• Powers Twitter, Wikipedia, Netflix...
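Solr is queried over HTTP, so a search is just a URL. A minimal sketch that builds a select request with hit highlighting and one facet. The base URL and the field names are assumptions that depend on the local Solr installation and schema; the helper name is hypothetical. The parameters q, rows, wt, hl, hl.fl, facet, and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, text, facet_field, highlight_field, rows=10):
    """Build a Solr select URL with highlighting and one facet field."""
    params = [
        ("q", text),                    # the user's query
        ("rows", rows),                 # number of results to return
        ("wt", "json"),                 # response format
        ("hl", "true"),                 # enable hit highlighting
        ("hl.fl", highlight_field),     # field to build snippets from
        ("facet", "true"),              # enable faceted search
        ("facet.field", facet_field),   # field to count facet values on
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical local instance and Plone-like field names.
url = solr_select_url(
    "http://localhost:8983/solr", "plone search", "portal_type", "SearchableText"
)
print(url)
```

The response would then be fetched with any HTTP client and parsed as JSON.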
18. What Does Solr Have That ZCatalog Doesn’t?
• Better relevance ranking
• More search features: snippets, hit highlighting, spelling suggestions, synonyms, more like this, faceted search
• More configurable: stop words, field boosting, parsing components
• An army of engineers working on it
19. Plone + Solr
Today
• Two add-ons available
• collective.solr - intercepts catalog queries and dispatches them to Solr
• alm.solrindex - adds a new index type to the catalog, SolrIndex
• Plus a buildout recipe: collective.recipe.solrinstance
20. Conclusions from Conference Discussion
21. Why Does Plone Need Solr?
• Certain types of projects need it, for features or because ZCatalog can’t scale to very large sites
• We need it to keep up with the enterprise CMS pack
22. Points of Agreement
• It will be impossible to completely replace ZCatalog with Solr
• Solr indexing will never be transactional
• Removing ZCatalog from Zope would be very difficult
• Tackle small, focused ZCatalog improvements when possible, such as improving the indexing interface
23. Points of Agreement
• Navigation and search should be handled separately
• Navigation needs to be transactional, search does not
• Split out a catalog used for navigation from the general catalog
• Explore a non-catalog utility to support navigation, optimize for speed
24. Points of Agreement
• Treating Solr integration simply as ZCatalog replacement does not take best advantage of Solr features
• ZCatalog can’t represent the richness of Solr, focus on the Solr API
• Take advantage of spelling suggestions, facets, results snippets with hit highlighting, synonyms, more like this, etc.
• Provide Solr indexing, field weighting, etc. configuration choices in the control panel
25. Points of Agreement
• Neither of the current Solr add-ons provides the best foundation for the future
• But they’ve taught us how to do things better
• Non-Solr approaches to improved Plone search should be deprecated
• Andreas Jung is not planning improvements to TextIndexNG!
26. Points of Agreement
• Stop investing in ZCatalog as a search engine; Solr is the future
27. Plone + Solr
Roadmap
• Short term: Make Solr integration easy with an approved add-on (like LDAP)
• Build on what we’ve learned and create a better add-on to replace collective.solr and alm.solrindex
• Who wants to sponsor a sprint?
28. Plone + Solr
Roadmap
• Long term: Ship Solr integration with Plone, but don’t require Solr
• Solr has a lot of overhead and is not always needed
• But using it should be as easy as answering yes to a “Build with Solr?” installation option