Solr Black Belt
code4lib conference 2010 - Asheville, NC
    Erik Hatcher, Lucid Imagination
        Naomi Dushay, Stanford




What’s new in Solr 1.4?
•   Java-based replication
•   VelocityResponseWriter (Solritas)
•   Logging switched to SLF4J
•   Rollback, since last commit
•   StatsComponent
•   TermVectorComponent
•   Configurable Directory provider
•   CharFilter
•   TermsComponent
•   Rich document indexing, via Tika (Solr Cell)
•   Greatly improved faceting performance
•   Exact/near duplicate document handling
•   Support added for Lucene's omitTf
•   "trie" range query support



Performance Improvements
• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ
Lucene 2.9
•   IndexReader#reopen()

•   Faster filter performance, by 300% in some cases

•   Per-segment FieldCache

•   Reusable token streams

•   Faster numeric/date range queries, thanks to trie

•   and tons more, see Lucene 2.9's CHANGES.txt



Deployment Architectures




JVM
• -server
• -XmxNNNNm
• Java 1.6 (latest point release)
• garbage collector
• 64-bit?
• Tools: JVM GC logging, jconsole
Useful JVM switches
•   -Xloggc:gc.out: Writes GC information to a file named “gc.out”.

•   -XX:+PrintGC: Outputs basic information at every garbage
    collection.

•   -XX:+PrintGCDetails: Outputs more detailed information at every
    garbage collection.

•   -XX:+PrintGCTimeStamps: Outputs a time stamp at the start of
    each garbage collection event. Used with -XX:+PrintGC or
    -XX:+PrintGCDetails to show when each garbage collection begins.

•   -XX:+HeapDumpOnOutOfMemoryError: Dump heap to file when
    java.lang.OutOfMemoryError is thrown.
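As a rough sketch, the switches above can be assembled into a launch command like this. The helper name `solr_jvm_args` and the `start.jar` launcher are assumptions for illustration (the latter matches the Jetty example shipped with Solr); adjust both for your own container.

```python
def solr_jvm_args(heap_mb, gc_log="gc.out", jar="start.jar"):
    """Return a java command line (as a list) with GC logging enabled."""
    return [
        "java",
        "-server",
        f"-Xmx{heap_mb}m",                  # maximum heap size in megabytes
        f"-Xloggc:{gc_log}",                # write GC activity to this file
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCTimeStamps",
        "-XX:+HeapDumpOnOutOfMemoryError",  # "+" enables, "-" would disable
        "-jar", jar,                        # assumed launcher jar
    ]


print(" ".join(solr_jvm_args(1024)))
```

Note the `+` after `-XX:`; writing `-XX:-HeapDumpOnOutOfMemoryError` switches the heap dump off.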




Indexing Performance
• Tricks of the trade:
 • multithread/multiprocess
 • batch documents
 • separate Solr server and indexers
 • Indexing master + replicants
• StreamingUpdateSolrServer + javabin
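The "batch documents" trick can be sketched as below: many documents go into one `<add>` message instead of one request per document. This is a minimal illustration, not SolrJ's StreamingUpdateSolrServer; the helper names `batches` and `to_add_xml` are made up here, and multi-valued fields are not handled.

```python
from xml.sax.saxutils import escape, quoteattr


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def to_add_xml(batch):
    """Serialize a batch of dicts as one Solr <add> update message."""
    parts = ["<add>"]
    for doc in batch:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


docs = [{"id": str(n), "title": f"Record {n}"} for n in range(5)]
messages = [to_add_xml(batch) for batch in batches(docs, 2)]
print(len(messages), "update messages for", len(docs), "documents")
```

Each message would then be POSTed to the update handler, ideally from several worker threads or processes.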
MARC indexing strategies


 • SolrMarc
 • Future? DataImportHandler hooks


Index Settings
•   useCompoundFile: set to false

•   mergeFactor: 10 or lower, generally

•   ramBufferSizeMB: buffer used for added documents before flushing to
    directory; more predictable than maxBufferedDocs.
    Benchmarking shows <= 128 is best.

•   maxMergeDocs: maximum number of documents for a single segment

•   maxFieldLength: generally set to the maximum int value, 2147483647

•   maxWarmingSearchers: 1 is best




Searching Performance
•   javabin - binary protocol for Java clients
•   caches: filterCache most relevant here
    •   autowarm
    •   FastLRUCache
•   warming queries: firstSearcher, newSearcher
    •   sorting, faceting


debugQuery=true

• parsed queries
• scoring explanations
• search component timings


Query Parsing

• defType
 • applies to main query only
 • fq parsed as "lucene" unless individually
    overridden
• {!parser local=params}query string

Solr Query Parser (lucene)
•   http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions

•   Kitchen sink parser, includes advanced user-unfriendly syntax

•   Syntax errors throw parse exceptions back to
    client

•   Example: title:ipod* AND price:[0 TO 100]

•   http://wiki.apache.org/solr/SolrQuerySyntax


SolrQueryParser
•   Default query parser

•   schema.xml

    •   <defaultSearchField>text</defaultSearchField>

    •   <solrQueryParser defaultOperator="OR"/>

•   Adds _query_:"..." and _val_:"..." hooks

•   Supports leading wildcards with
    ReversedWildcardFilterFactory


Dismax Query Parser (dismax)
• Simplified syntax:
  loose text "quote phrases" -prohibited
  +required
• Spreads query terms across query fields
  (qf) with dynamic boosting per field, phrase
  construction (pf), and boosting query and
  function capabilities (bq and bf)


dismax: q and q.alt

• odd number of quotes is parsed as if there
  were no quotes
• wildcards, fuzzy, etc. not supported
• q.alt: alternate query, parsed by the "lucene" parser and
  used when q is omitted; useful as *:* to get
  collection-wide facet counts



dismax: qf and pf
• query fields / phrase fields
• syntax: field[^boost]...
 • example: title^2 body
• pf for boosting where terms in q are in
  close proximity; entire q string is used as
  phrase implicitly


dismax: qs and ps

• qs: query slop; used for explicit "phrase
  queries"
• ps: phrase slop; used for implicit phrase
  query added for pf fields




dismax: mm
  •    minimum match, for optional clauses

  •    default = 100% (pure AND)

  •    Examples:

      •    pure OR: mm=0 or mm=0%

      •    at least two should match: mm=2

      •    at least 75% should match: mm=75%

      •    1-3 clauses all must match, 4 or more 90% must match: mm=3<90%

      •    1-2 clauses all required, 3-9 clauses all but 25% must match, more than 9
           all but 3 are required: mm=2<-25% 9<-3

      •    1-2 clauses all must match, 3-5 clauses one less than the number of clauses
           must match, 6 or more clauses 80% must match, rounded down:
           mm=2<-1 5<80%
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
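The examples above can be checked with a small calculator. This is a sketch written from the rules described in the linked min-should-match documentation, not Solr's own code; the function name `min_should_match` is made up for this illustration.

```python
def min_should_match(clause_count, spec):
    """Return how many of `clause_count` optional clauses must match."""
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "N<value" applies only when there are more
        # than N clauses; the last condition that applies wins.
        result = clause_count
        for condition in spec.split():
            bound, value = condition.split("<", 1)
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        amount = abs(clause_count * percent) // 100   # always rounded down
        result = clause_count - amount if percent < 0 else amount
    else:
        number = int(spec)
        result = clause_count + number if number < 0 else number
    return max(0, min(clause_count, result))


for spec in ("0%", "2", "75%", "3<90%", "2<-25% 9<-3", "2<-1 5<80%"):
    print(spec, [min_should_match(n, spec) for n in range(1, 11)])
```

A negative value means "all but this many", which is why `-25%` on four clauses requires three of them.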

dismax: tie
•   tiebreaker

•   more than one field may match, each scored on its own (based on term
    frequency)

•   how much the final score of the query will be influenced by the
    scores of the lower scoring fields compared to the highest scoring
    field.

•   A value of "0.0" makes the query a pure "disjunction max query" --
    only the maximum scoring sub query contributes to the final score.
    A value of "1.0" makes the query a pure "disjunction sum query"
    where it doesn't matter what the maximum scoring sub query is,
    the final score is the sum of the sub scores. Typically a low value (e.g.
    0.1) is useful.



dismax: tie
•   The “tie” (tie breaker) parameter is very important, but not easy to understand. It may
    be useful to visualize it as a “slider” control between 0 and 1, with a value of 0 being a
    “pure disjunction max” query, and a value of “1” being a “pure disjunction sum” query.
    So the “max” score is added to the sum of all other scores multiplied by the tie
    breaker:

    •   If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is
        0:

        •   score = 2.12 + ((1.7 + 0.5) * 0 ) = 2.12

    •   If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is
        1:

        •   score = 2.12 + ((1.7 + 0.5) * 1) = 4.32
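The same arithmetic as a short function, for trying out other tie values. The name `dismax_score` is invented for this sketch; it only reproduces the formula on this slide, not Lucene's full scoring.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


scores = [2.12, 1.7, 0.5]
for tie in (0.0, 0.1, 1.0):
    print(tie, round(dismax_score(scores, tie), 2))
```

With the typical low value of 0.1, the two weaker fields add only 0.22 to the best field's 2.12.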




dismax: bq
•   boosting query

•   "lucene" query parsed, by default

•   combined (optionally) with user's query to boost
    matching documents

•   warning: a boolean query with boost of 1.0 has clauses
    added as-is, can be problematic by adding required/
    prohibited clauses; could be caused by multiple bq
    parameters

•   Example: bq=library:music^2



dismax: bf
•   boost function

•   same as using _val_:"function(...)" in bq
    parameter

•   example: bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)

•   but careful with adding versus multiplying
    scores, bf will be additive - see "boost" query
    parser
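To see what the recip example produces, here is the function worked out in plain Python. Solr documents recip(x,m,a,b) as a/(m*x+b); the constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year, so the boost halves after about one year. The constant name `MS_PER_YEAR` is just for this sketch.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function as documented for Solr: a / (m*x + b)."""
    return a / (m * x + b)


for years in (0, 1, 2, 10):
    age_ms = years * MS_PER_YEAR
    print(years, "years old ->", round(recip(age_ms, 3.16e-11, 1, 1), 3))
```

A brand-new document gets 1.0, a one-year-old document about 0.5, and older documents tail off toward zero.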


local params
•   {!parser p=param}expression

•   OR {!parser p=param v=expression}

•   Indirect parameter values with $ syntax:

    •   {!parser p=$p}expression&p=param

•   Real example:

    •   _query_:"{!dismax qf=$qf_author pf=$pf_author}[advanced
        author search box field value]", where qf_author and
        pf_author are defined in the request handler mapping, combined
        with other clauses or similar _query_'s for other groups
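A client-side helper for composing local params might look like the sketch below. The function name `local_params` is hypothetical, and the quoting rule (single quotes around values containing whitespace, backslash escapes inside them) is an assumption to verify against your Solr version.

```python
def local_params(parser, query="", **params):
    """Build a {!parser key=value ...}query string."""
    def fmt(value):
        value = str(value)
        if value.startswith("$"):
            return value  # $name dereferences another request parameter
        if any(ch.isspace() for ch in value) or "'" in value:
            escaped = value.replace("\\", "\\\\").replace("'", "\\'")
            return "'" + escaped + "'"
        return value

    parts = [parser] + [f"{key}={fmt(val)}" for key, val in params.items()]
    return "{!" + " ".join(parts) + "}" + query


print(local_params("dismax", "ipod", qf="$qf_author", pf="$pf_author"))
print(local_params("dismax", qf="title^2 body", v="$qq"))
```

Using `$` references keeps the field lists in the request handler configuration and out of the client code.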



Raw query parser
• {!raw f=field}Foo Bar
• exact TermQuery, no analysis or
  transformations
• ideal for typical fq usage
 • fq={!raw f=format}Musical Score
 • avoids query parsing escaping madness
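To make the "escaping madness" concrete, the sketch below builds the same filter two ways. The helper names are invented here, and the set of special characters is taken from the Lucene query syntax page, so treat it as an approximation.

```python
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\')


def escape_lucene(value):
    """Backslash-escape query syntax characters and whitespace."""
    return "".join(
        "\\" + ch if ch in LUCENE_SPECIAL or ch.isspace() else ch
        for ch in value
    )


def fq_lucene(field, value):
    """Filter for the default "lucene" parser: value must be escaped."""
    return f"{field}:{escape_lucene(value)}"


def fq_raw(field, value):
    """Filter for the raw parser: value is passed through untouched."""
    return "{!raw f=" + field + "}" + value


print(fq_lucene("format", "Musical Score"))
print(fq_raw("format", "Musical Score"))
```

Both strings still need ordinary URL encoding when placed in a request; the raw form only removes the query-syntax layer of escaping.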
request handler ninjitsu
http://localhost:8983/solr/document?id=...

    <requestHandler class="solr.SearchHandler" name="/document">
      <lst name="invariants">
        <str name="q">{!raw f=id v=$id}</str>
        <str name="rows">1</str>
        <str name="fl">*</str>
      </lst>
    </requestHandler>



Field query parser

• {!field f=field}Foo Bar
• generally equivalent to field:"Foo Bar"
• parses to term or phrase query, depending
  on analysis for field




Prefix query parser

• {!prefix f=field}foo
• no analysis or transformation performed
• generally equivalent to field:foo*


Function query parser


• {!func}log(foo)
• Used for _val_ expressions in "lucene"
  parser




Boost query parser
• {!boost b=log(popularity)}foo
• Multiplies the score, rather than adding to it
• Example:
  • ?q={!boost b=$dateboost v=$qq defType=dismax}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)&qf=text&pf=text&qq=ipod
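Building that request from code is less error-prone than hand-assembling the URL. A sketch using only the standard library follows; the function name `boosted_dismax_params` is hypothetical, and the parameter values are the ones from the example above.

```python
from urllib.parse import parse_qs, urlencode


def boosted_dismax_params(user_query, date_field="manufacturedate_dt"):
    """Parameters for a dismax query multiplied by a recency boost."""
    return {
        "q": "{!boost b=$dateboost v=$qq defType=dismax}",
        "dateboost": f"recip(ms(NOW,{date_field}),3.16e-11,1,1)",
        "qf": "text",
        "pf": "text",
        "qq": user_query,  # user input stays in its own parameter
    }


query_string = urlencode(boosted_dismax_params("ipod & more"))
print("/solr/select?" + query_string)
```

Because the user's text travels in `qq`, it never has to be spliced into the local params string.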


extended dismax (edismax)
•   Solr 1.5 (currently trunk)

•   Supports full lucene query syntax in the absence of syntax
    errors: AND/OR/NOT, wildcards, fuzzy...; lowercase and/or also work

•   When there are syntax errors, falls back to smart partial escaping of
    special characters; fielded queries, +/-, and phrases still supported

•   shingles phrases specified in pf2 and pf3 parameters

•   advanced stopword handling: stopwords are not required in the
    mandatory part of the query but are still used (if indexed) in
    the proximity boosting part. If a query consists of all stopwords
    (e.g. to be or not to be) then all will be required.


edismax: pf2 and pf3

• shingles into two and three term phrases
• prevents problem of needing 100% of the
  words in the document, as well as having all
  of the words in a single field, to get any
  boost
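What "shingling" the query means can be shown in a few lines. This sketch splits on whitespace only and ignores analysis, so it illustrates the idea rather than edismax's actual behavior; the function name `shingles` is made up.

```python
def shingles(query, size):
    """Return every run of `size` consecutive words in the query."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


query = "solr black belt workshop"
print("pf2 phrases:", shingles(query, 2))
print("pf3 phrases:", shingles(query, 3))
```

A document containing only "black belt" can now earn a phrase boost, where plain pf would need all four words together.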




edismax: boost


• wraps generated query with boost query
• like the dismax bf param, but multiplies the
  function query instead of adding it in




Nested queries
• Naomi's "A Better Advanced Search",
  Wednesday, 13:00
• http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
• Example:
 • _query_:"{!dismax qf=$qf1}query1"
    AND _query_:"{!dismax qf=$qf2}query2"


Useful request handlers


• dump, ping, luke, system, plugins, threads,
  properties, file




Dump
• http://localhost:8983/solr/debug/dump
• Echoes parameters, content streams, and
  Solr web context
• Careful with content stream enabled, client
  could retrieve contents of any file on
  server or accessible network! [Solution:
  disable dump request handler]


Ping

• http://localhost:8983/solr/admin/ping
• If healthcheck configured and file not
  available, error is reported
• Executes single configured request and
  reports failure or OK



Luke
•   http://localhost:8983/solr/admin/luke

•   Introspects Lucene index structure and schema
    relationships

•   See an individual document:

    •   ?id=<key> or ?docId=<lucene doc #>

•   Schema details: ?show=schema

•   Admin schema browser uses Luke request handler

•   See also: original Luke tool - http://www.getopt.org/luke/

System


• http://localhost:8983/solr/admin/system
• core info, Lucene version, JVM details,
  uptime, operating system info




Plugins


• http://localhost:8983/solr/admin/plugins
• Configuration details of Solr core, available
  query and update handlers, cache settings




Threads


• http://localhost:8983/solr/admin/threads
• JVM thread details


Properties


• http://localhost:8983/solr/admin/properties
• All JVM system properties, or single
  property value (?name=os.arch)




File

• http://localhost:8983/solr/admin/file?file=/
 • See fetchable directory tree
• http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain




Search components

• Standard: query, facet, mlt, highlight,
             stats, debug
• Others: elevation, clustering, term,
           term vector




Clustering

•   Dynamic grouping of documents into labeled sets

•   http://localhost:8983/solr/clustering?q=*:*&rows=10

•   http://wiki.apache.org/solr/ClusteringComponent

•   Requires additional steps to install (see
    documentation) with Apache Solr distro; baked fully
    into Lucid certified distro



Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi




Term Vectors

• Details term vector information: term
  frequency, document frequency, position
  and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true




stats.jsp
•   Not technically a “request handler”, outputs only
    XML

•   http://localhost:8983/solr/admin/stats.jsp

•   Index stats such as number of documents,
    searcher open time

•   Request handler details, number of requests and
    errors, average request time, average requests
    per second, number of pending docs, etc, etc


Analysis Tricks
   •   CharFilters: MappingCharFilterFactory, PatternReplaceCharFilterFactory,
       HTMLStripCharFilterFactory

   •   ReversedWildcardFilterFactory, see example schema.xml "text_rev" field type

       •   a query for *thing is run against the reversed tokens as gniht*

   •   PositionFilterFactory

       •   "can be used with a query Analyzer to prevent expensive Phrase and
           MultiPhraseQueries" or "all words and shingles to be placed at the same position,
           so that all shingles to be treated as synonyms of each other."

   •   CommonGramsFilterFactory - Makes shingles by combining common tokens and
       regular tokens

   •   CollationKeyFilterFactory (Solr 1.5) - locale based sorting


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
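The reversed-wildcard idea can be illustrated in isolation: index each token reversed, and turn a leading-wildcard pattern into a cheap trailing-wildcard one. This is a conceptual sketch only; the real ReversedWildcardFilterFactory does more than this, and both function names are invented.

```python
def reverse_token(token):
    """What gets indexed alongside the original token."""
    return token[::-1]


def rewrite_leading_wildcard(pattern):
    """Turn '*thing' into 'gniht*'; leave other patterns alone."""
    if pattern.startswith("*") and not any(c in "*?" for c in pattern[1:]):
        return reverse_token(pattern[1:]) + "*"
    return pattern


print(rewrite_leading_wildcard("*thing"))
print(reverse_token("something").startswith("gniht"))
```

A prefix match on the reversed term is what makes the leading wildcard affordable.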
Faceting


• multi-select
• hierarchical


Multi-select

• &facet.field=facet_field&fq=facet_field:(value1 OR value2)
• to exclude filters from facet counts:
 • &facet.field={!ex=group}facet_field&fq={!tag=group}facet_field:value2
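A sketch of generating the tagged filter and the excluding facet field together, so the two always use the same tag. The function name `multiselect_params` is hypothetical, and values containing double quotes would need escaping first.

```python
from urllib.parse import urlencode


def multiselect_params(field, selected, tag="group"):
    """Filter on the selected values but keep full counts for `field`."""
    values = " OR ".join(f'"{value}"' for value in selected)
    return [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", "{!ex=" + tag + "}" + field),          # ignore tagged fq
        ("fq", "{!tag=" + tag + "}" + field + ":(" + values + ")"),
    ]


params = multiselect_params("format", ["Book", "Musical Score"])
print("/solr/select?" + urlencode(params))
```

The facet counts for `format` are then computed as if that one filter were absent, which lets users tick further values.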



Hierarchical


•   http://wiki.apache.org/solr/HierarchicalFaceting




Facet paging


• Blacklight trick, requesting one more than
  page size
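The "one more than page size" trick can be sketched as follows: ask for an extra facet value and use its presence to decide whether to show a "next" link. `facet.offset` and `facet.limit` are standard Solr parameters; the two function names are made up for this sketch.

```python
def facet_page_params(field, page, page_size):
    """Request one value more than will be displayed (pages start at 0)."""
    return {
        "facet": "true",
        "facet.field": field,
        "facet.offset": page * page_size,
        "facet.limit": page_size + 1,
    }


def split_page(values, page_size):
    """Return the values to display and whether another page exists."""
    return values[:page_size], len(values) > page_size


returned = ["Book", "Map", "Score"]  # stand-in for a Solr response
shown, has_next = split_page(returned, page_size=2)
print(shown, has_next)
```

This avoids needing a total facet count, which Solr does not report.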




i18n
• CJK
 • SmartChineseAnalyzer
• German
 • DictionaryCompoundWordTokenFilterFactory
• To watch:
 • http://code.google.com/p/lucene-hunspell/
Testing

• Automate
• Relevancy
• Performance
• Solr log analysis: zero results queries, slow
  queries
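Log analysis for zero-result and slow queries can start as small as the sketch below. It assumes request log lines that contain `params={...} hits=N status=N QTime=N`, as Solr's default request logging produces; the sample lines are invented, and the pattern may need adjusting for your logging setup.

```python
import re

REQUEST = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} "
    r"hits=(?P<hits>\d+) status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)


def analyze(lines, slow_ms=1000):
    """Return (zero-hit queries, slow queries sorted slowest first)."""
    zero_hits, slow = [], []
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # not a search request line
        if int(match.group("hits")) == 0:
            zero_hits.append(match.group("params"))
        if int(match.group("qtime")) >= slow_ms:
            slow.append((int(match.group("qtime")), match.group("params")))
    return zero_hits, sorted(slow, reverse=True)


sample = [
    "INFO: [] webapp=/solr path=/select params={q=ipod&rows=10} hits=3 status=0 QTime=4",
    "INFO: [] webapp=/solr path=/select params={q=zzyzx} hits=0 status=0 QTime=2",
    "INFO: [] webapp=/solr path=/select params={q=*:*&facet=true} hits=120000 status=0 QTime=1850",
]
zero_hits, slow = analyze(sample)
print(zero_hits, slow)
```

Zero-hit queries are good candidates for relevancy tests; slow ones are good candidates for warming queries.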



Questions
•   One subject that's of some interest to me is paging through facets.  It drives me a little crazy that Solr
    lets you page through facets, yet it won't give you a total count of how many facets you are paging
    through, which makes presenting a fully functional paging mechanism rather problematic.  I've heard
    that Bobo-browse may be helpful here but haven't dug into it too deeply.  Maybe this is too narrow a
    topic to be worth spending much time on, but if anybody has any thoughts or solutions, I'd love to
    discuss them

•   What if we wanted to implement a traditional browse with Solr?  Like a call number browse to
    simulate shelf browsing? Is there a way to leverage Solr for something like that?  I'd think the trie
    structure would make this possible, but how it could be exposed in that manner is a mystery.

•   that inner query/nested query stuff that Naomi is using for advanced search would be one thing I'd
    add to the list.  Continues to confuse me every time I look at it.

•   Another idea, approaches for figuring out how much RAM solr needs, and how big to make the
    various Solr query caches. I know it depends on a lot and is different for every index, but I don't even
    know how to get started figuring out what it should be for my index.  Not sure if this makes sense as
    an issue or not, just an idea.




Questions
•   We're currently using 1.3, so the biggest changes/improvements in 1.4 would
    be good.

•   I'm also interested in fulltext indexing.  We have some documents
    (newspapers and dissertations) that are quite large (hundreds of MB of
    plaintext).  Is there a good rule-of-thumb for how much text we should
    index?  How large is too large?  Is uncorrected OCR'd text worth indexing?

•   The other topic I'm particularly interested in is update performance.  Most
    of our data is currently batch-loaded and batch-indexed, but we are moving
    to interactive editing for some of our data, with the expectation that the
    solr index be kept updated in realtime (or near-realtime).  Should we use a
    separate server (or core) to keep the updates from impacting read-only
    performance?  Do we need to optimize the index (this can take 20+
    seconds for our main index) frequently?




Questions

• One other thing: we're using the web
  service interface which seems fast and
  reliable.  Is the SolrJ interface significantly
  faster or better?
• DidYouMean/Spellcheck

Questions
• Does it make sense to use fixtures or fixture scenarios like Rails? Does it make
  sense to set up a separate 'testing' core that can be dynamically dumped and
  rebuilt through the APIs by your test suite?




Questions
1.  What methods and tools can be used to determine
whether configuration or physical resource changes
might improve performance.  E.g. increasing filter cache,
adding more memory, going to 64 bit architecture,
adding another disk drive to the array, etc.

2. Best procedures to make these configuration changes.
E.g. These two parameters work in conjunction with
each other, change this one then that one, this one
should be set to X percent of your physical memory,
don't touch this one unless you really know what you
are doing, etc.



Questions
- Scaling issues: millions of records, trying to keep data
reasonably current
- Distributed search
- Considerations for non-Roman data mixed in with Roman
data?  We have CJK data, Cyrillic, Hebrew, Arabic.  Is
there a sensible way to set up the analyzers?
- Any considerations for merging heterogeneous data (MARC,
OAI-DC, EAD, web spidering) that may be particular to Solr?
(I don't expect so, it's all going into one schema, but
maybe you've run into something.)




Questions
Indexing strategies:

* Performance tuning or configuring Solr for indexing (as opposed to a copy of Solr a search app
runs on). Which config options make a difference? What JVM options matter?
* Merging a 'build' copy of an index into a search app's copy. Is this the replication piece?
* Using multiple threads when writing to Solr. Using StreamingUpdateSolrServer effectively/safely.

Advanced features on retrieval side:

* Info about facets: can Solr retrieve the global count number for a facet in addition to the count
number within a filtered search result set? Only with 2 queries?
* Doing Google-like autosuggest against facet values for subject terms (not like facet.prefix method
in the Solr 1.4 book). Best to use a multicore setup and have an index or two dedicated to
autosuggestions?

Multiple index design:

As my colleague Eric put it: big generalized index + N extreme indexes = Righteous Discovery
Platform || High Folly?

This is a question we are dealing with. As librarians and researchers learn what we are doing on our
campus a lot of people are offering up data. Some of which is *highly* specialized. For example,
metadata based on a microscopy data standard. We expect that these researchers would like us to
create an expert search tool with advanced features tailored to their data model



Questions
Getting a better understanding of Solr memory use would be very helpful for us. (Or perhaps tools
and tips for understanding Solr memory use)
Right now we can watch the tomcat/Solr jvm with Jconsole and see heap use suddenly increasing and
decreasing, but we don't understand why, so our main technique is to wait until we get an
OutOfMemoryError and then increase the memory we give to the Solr/Tomcat JVM. (That and continuing
to buy more memory:)

The dismax/edismax and how folks are using them to tweak relevance ranking (based on MARC fields) is
also of great interest.

A couple of topics that may or may not be of interest to other folks and may or may not be
appropriate for the workshop.  The context of these is that we are trying to understand scalability
and performance issues with very large indexes (300GB x 10) and multiple shards (5 million full-text
docs and growing.)

1) I'd like to get a bit of a better understanding of how filter queries are implemented. (and how
that relates to faceting)

2) I'd like to get a better understanding of how distributed search is implemented.  In particular,
I'd like to understand the traffic that goes between the head shard and the shards it distributes
the query to.  For example in the tomcat logs we can see traffic with the isShard=true  and
ids="abc","def" parameters.




Questions

• Call number -> shelf key
• Reverse sorting fields
• termsComponent queries
• Terms -> documents
• Can we apply facets?

Books




e-book now available!
        print coming soon
http://www.manning.com/lucene

LucidWorks for Solr
•   Certified Distribution

•   Value-added integration

    •   KStemmer

    •   Carrot2 clustering

    •   LucidGaze for Solr

    •   installer

•   Reference Manual

•   Solr 1.4++ certified



LucidGaze for Solr

• Monitoring tool that captures, stores, and
  interactively displays Solr performance
  metrics
• requests/second
• time/request

LucidFind




http://search.lucidimagination.com/?q=code4lib


http://www.flickr.com/photos/mikeoliveri/2036797884/





Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrSolr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache Solr
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Solr Black Belt Pre-conference

  • 1. Solr Black Belt • code4lib conference 2010 - Asheville, NC • Erik Hatcher, Lucid Imagination • Naomi Dushay, Stanford
  • 2. What’s new in Solr 1.4? • Java-based replication • TermsComponent • VelocityResponseWriter (Solritas) • Rich document indexing, via Tika (Solr Cell) • Logging switched to SLF4J • Greatly improved faceting performance • Rollback, since last commit • StatsComponent • Exact/near duplicate document handling • TermVectorComponent • Support added for Lucene's omitTf • Configurable Directory provider • "trie" range query support • CharFilter
  • 3. Performance Improvements • Caching • Concurrent file access • Per-segment index updates • Faceting • DocSet generation, avoids scoring • Streaming updates for SolrJ
  • 4. Lucene 2.9 • IndexReader#reopen() • Faster filter performance, by 300% in some cases • Per-segment FieldCache • Reusable token streams • Faster numeric/date range queries, thanks to trie • and tons more, see Lucene 2.9's CHANGES.txt
  • 6. JVM • -server • -XmxNNNNm • Java 1.6 (latest point release) • garbage collector • 64-bit? • Tools: JVM GC logging, jconsole
  • 7. Useful JVM switches • -Xloggc:gc.out: writes GC information to a file named "gc.out". • -XX:+PrintGC: outputs basic information at every garbage collection. • -XX:+PrintGCDetails: outputs more detailed information at every garbage collection. • -XX:+PrintGCTimeStamps: outputs a time stamp at the start of each garbage collection event. Used with -XX:+PrintGC or -XX:+PrintGCDetails to show when each garbage collection begins. • -XX:+HeapDumpOnOutOfMemoryError: dumps the heap to a file when java.lang.OutOfMemoryError is thrown.
  • 8. Indexing Performance • Tricks of the trade: • multithread/multiprocess • batch documents • separate Solr server and indexers • Indexing master + replicants • StreamingUpdateSolrServer + javabin
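The "batch documents" trick can be sketched roughly as follows. This is a minimal Python illustration, not the presenters' code: the record fields (`id`, `title`) and the batch size are made up, and the sketch only builds the XML `<add>` messages; posting them to the update handler (ideally from several threads or processes) is left out.

```python
from itertools import islice
from xml.etree import ElementTree as ET


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def add_message(batch):
    """Build one Solr XML <add> message holding every document in the batch."""
    add = ET.Element("add")
    for doc in batch:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical records; real ones would come from MARC files or a database.
docs = [{"id": i, "title": f"Record {i}"} for i in range(5)]
messages = [add_message(b) for b in batches(docs, 2)]
```

Sending one message per batch instead of one per document cuts the number of HTTP round trips; the right batch size depends on document size and is best found by benchmarking.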
  • 9. MARC indexing strategies • SolrMarc • Future? DataImportHandler hooks
  • 10. Index Settings • useCompoundFile: set to false • mergeFactor: 10 or lower, generally • ramBufferSizeMB: buffer used for added documents before flushing to the directory; more predictable than maxBufferedDocs. Benchmarking shows <= 128 is best. • maxMergeDocs: maximum number of documents for a single segment • maxFieldLength: generally the maximum int value (2147483647) is desired • maxWarmingSearchers: 1 is best
  • 11. Searching Performance • javabin - binary protocol for Java clients • caches: filterCache most relevant here • autowarm • FastLRUCache • warming queries: firstSearcher, newSearcher • sorting, faceting
  • 12. debugQuery=true • parsed queries • scoring explanations • search component timings
  • 13. Query Parsing • defType • applies to main query only • fq parsed as "lucene" unless individually overridden • {!parser local=params}query string
  • 14. Solr Query Parser (lucene) • http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions • Kitchen sink parser, includes advanced user-unfriendly syntax • Syntax errors throw parse exceptions back to client • Example: title:ipod* AND price:[0 TO 100] • http://wiki.apache.org/solr/SolrQuerySyntax
  • 15. SolrQueryParser • Default query parser • schema.xml • <defaultSearchField>text</defaultSearchField> • <solrQueryParser defaultOperator="OR"/> • Adds _query_:"..." and _val_:"..." hooks • Supports leading wildcards with ReversedWildcardFilterFactory
  • 16. Dismax Query Parser (dismax) • Simplified syntax: loose text "quote phrases" -prohibited +required • Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)
  • 17. dismax: q and q.alt • an odd number of quotes is parsed as if there were no quotes • wildcards, fuzzy, etc. not supported • q.alt: alternate query; parsed as "lucene", used when q is omitted; useful as *:* to get collection-wide facet counts
  • 18. dismax: qf and pf • query fields / phrase fields • syntax: field[^boost]... • example: title^2 body • pf boosts documents where the terms in q are in close proximity; the entire q string is used as an implicit phrase
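As a rough illustration of how qf and pf travel on a request, here is a small Python sketch that assembles a dismax query string. The `/select` path on localhost and the `title`/`body` field names follow the examples in these slides; the helper name and the `rows` default are hypothetical.

```python
from urllib.parse import urlencode


def dismax_params(user_query, qf, pf=None, rows=10):
    """Return a URL-encoded query string for a dismax request."""
    params = {"defType": "dismax", "q": user_query, "qf": qf, "rows": rows}
    if pf:
        # pf reuses the whole q string as an implicit phrase for boosting
        params["pf"] = pf
    return urlencode(params)


query_string = dismax_params("solr black belt", qf="title^2 body", pf="title^2 body")
url = "http://localhost:8983/solr/select?" + query_string
```

Letting `urlencode` do the escaping keeps characters such as `^` and spaces in the field list from breaking the URL.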
  • 19. dismax: qs and ps • qs: query slop; used for explicit "phrase queries" • ps: phrase slop; used for implicit phrase query added for pf fields
  • 20. dismax: mm • minimum match, for optional clauses • default = 100% (pure AND) • Examples: • pure OR: mm=0 or mm=0% • at least two should match: mm=2 • at least 75% should match: mm=75% • 1-3 clauses: all must match; 4 or more: 90% must match: mm=3<90% • 1-2 clauses: all required; 3-9 clauses: all but 25% must match; more than 9: all but 3 are required: mm=2<-25% 9<-3 • 1-2 clauses: all must match; 3-5 clauses: one less than the number of clauses must match; 6 or more clauses: 80% must match, rounded down: mm=2<-1 5<80% • http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
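The mm examples above can be checked with a small calculator. The following Python sketch follows the rules as described in the min-should-match documentation; it is an illustration, not Solr's own code, and it assumes the spec has no spaces around `<`. The function name is made up.

```python
def min_should_match(clause_count, spec):
    """Return how many of `clause_count` optional clauses must match."""
    result = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional specs: "N<value" applies only above N clauses.
        for part in spec.split():
            upper, sub_spec = part.split("<", 1)
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, sub_spec)
        return result
    if spec.endswith("%"):
        calc = clause_count * int(spec[:-1]) / 100
        # Negative percentages mean "all but this share"; int() rounds toward zero.
        result = clause_count + int(calc) if calc < 0 else int(calc)
    else:
        calc = int(spec)
        # Negative integers mean "all but this many".
        result = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, result))


for n in (2, 4, 10):
    print(n, min_should_match(n, "2<-25% 9<-3"), min_should_match(n, "2<-1 5<80%"))
```

Running a handful of clause counts through a candidate spec like this is a quick way to sanity-check it before putting it in a request handler.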
  • 21. dismax: tie • tiebreaker • more than one field may match, each scored based on term frequency • controls how much the final score of the query is influenced by the scores of the lower scoring fields compared to the highest scoring field • A value of "0.0" makes the query a pure "disjunction max query": only the maximum scoring sub query contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query", where it doesn't matter what the maximum scoring sub query is; the final score is the sum of the sub scores. Typically a low value (e.g. 0.1) is useful.
  • 22. dismax: tie • The "tie" (tie breaker) parameter is very important, but not easy to understand. It may be useful to visualize it as a "slider" control between 0 and 1, with a value of 0 being a "pure disjunction max" query, and a value of 1 being a "pure disjunction sum" query. So the "max" score is added to the sum of all other scores multiplied by the tie breaker: • If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is 0: • score = 2.12 + ((1.7 + 0.5) * 0) = 2.12 • If the max score is 2.12 and the other scores are 1.7 and 0.5, and the tie breaker is 1: • score = 2.12 + ((1.7 + 0.5) * 1) = 4.32
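The tie arithmetic above fits in a few lines of Python. This is only a sketch of the formula as stated on the slide (maximum score plus tie times the sum of the other scores); the function name is hypothetical.

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores: best score plus `tie` times the rest."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


scores = [2.12, 1.7, 0.5]
for tie in (0.0, 0.1, 1.0):
    # round() only tidies floating-point noise for display
    print(tie, round(dismax_score(scores, tie), 2))
```

With the typical low value of 0.1, the lower scoring fields nudge the result (2.12 becomes about 2.34) without overwhelming the best field.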
  • 23. dismax: bq • boosting query • parsed as a "lucene" query, by default • combined (optionally) with the user's query to boost matching documents • warning: a boolean query with a boost of 1.0 has its clauses added as-is, which can be problematic because it adds required/prohibited clauses; this can be caused by multiple bq parameters • Example: bq=library:music^2
  • 24. dismax: bf • boost function • same as using _val_:"function(...)" in the bq parameter • example: bf=recip(ms(NOW,mydatefield),3.16e-11,1,1) • but be careful about adding versus multiplying scores: bf is additive; see the "boost" query parser
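To see what the recip example does to document age, here is a small Python sketch. It assumes the documented definition recip(x,m,a,b) = a / (m*x + b), with x being the document's age in milliseconds as returned by ms(NOW,mydatefield); the constant and helper names are made up for illustration.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000  # about 3.15e10, so 3.16e-11 is roughly 1/year


def recip(x, m, a, b):
    """Reciprocal function as documented for Solr: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """Boost for a document that is `age_ms` milliseconds old."""
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 5, 10):
    print(years, round(date_boost(years * MS_PER_YEAR), 3))
```

A brand-new document gets 1.0, a one-year-old document about half of that, and the value keeps decaying smoothly, which is why the same function works both as an additive bf and as a multiplicative boost.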
  • 25. local params • {!parser p=param}expression • OR {!parser p=param v=expression} • Indirect parameter values with $ syntax: • {!parser p=$p}expression&p=param • Real example: • _query_:"{!dismax qf=$qf_author pf=$pf_author}[advanced author search box field value]", where qf_author and pf_author are defined in the request handler mapping, combined with other clauses or similar _query_'s for other groups
  • 26. Raw query parser • {!raw f=field}Foo Bar • exact TermQuery, no analysis or transformations • ideal for typical fq usage • fq={!raw f=format}Musical Score • avoids query parsing escaping madness
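A client-side helper for the raw parser can be very small, since the value needs no query-syntax escaping. The following Python sketch is illustrative only: the helper name is hypothetical and the `format` field comes from the slide's example.

```python
from urllib.parse import urlencode


def raw_filter(field, value):
    """Build an fq value that matches `value` as one exact, unanalyzed term."""
    return "{!raw f=%s}%s" % (field, value)


# Only URL encoding is needed; the facet value itself is passed through untouched.
params = [("q", "*:*"), ("fq", raw_filter("format", "Musical Score"))]
query_string = urlencode(params)
```

This pairs naturally with facet values from untokenized string fields, where the value clicked by the user is exactly the indexed term.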
  • 27. request handler ninjitsu • http://localhost:8983/solr/document?id=... • <requestHandler class="solr.SearchHandler" name="/document"> <lst name="invariants"> <str name="q">{!raw f=id v=$id}</str> <str name="rows">1</str> <str name="fl">*</str> </lst> </requestHandler>
  • 28. Field query parser • {!field f=field}Foo Bar • generally equivalent to field:"Foo Bar" • parses to term or phrase query, depending on analysis for field
  • 29. Prefix query parser • {!prefix f=field}foo • no analysis or transformation performed • generally equivalent to field:foo*
  • 30. Function query parser • {!func}log(foo) • Used for _val_ expressions in "lucene" parser
  • 31. Boost query parser • {!boost b=log(popularity)}foo • Multiplies the score, rather than adding to it • Example: • ?q={!boost b=$dateboost v=$qq defType=dismax}&dateboost=recip(ms(NOW,manufacturedate_dt),3.16e-11,1,1)&qf=text&pf=text&qq=ipod
  • 32. extended dismax (edismax) • Solr 1.5 (currently trunk) • Supports full lucene query syntax in the absence of syntax errors: AND/OR/NOT, wildcards, fuzzy...; lowercase and/or also accepted • When there are syntax errors, special characters are partially escaped in a smart way; fielded queries, +/-, and phrases are still supported • shingles phrases specified in pf2 and pf3 parameters • advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • 33. edismax: pf2 and pf3 • shingles into two and three term phrases • prevents problem of needing 100% of the words in the document, as well as having all of the words in a single field, to get any boost
  • 34. edismax: boost • wraps generated query with boost query • like the dismax bf param, but multiplies the function query instead of adding it in
  • 35. Nested queries • Naomi's "A Better Advanced Search", Wednesday, 13:00 • http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ • Example: • _query_:"{!dismax qf=$qf1}query1" AND _query_:"{!dismax qf=$qf2}query2"
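An advanced search form can assemble such a nested query on the client side. The Python sketch below is one possible approach, not the presenters' implementation: the helper names are hypothetical, `qf_author` follows the earlier local params slide, `qf_title` is invented, and it assumes that backslashes and double quotes in the user's text must be backslash-escaped inside the quoted _query_ string.

```python
def nested_dismax(qf_param, user_text):
    """Wrap one form field's text in a nested dismax query."""
    # Escape backslashes first, then quotes, so the outer quoted string stays intact.
    escaped = user_text.replace("\\", "\\\\").replace('"', '\\"')
    return '_query_:"{!dismax qf=$%s}%s"' % (qf_param, escaped)


def advanced_search(clauses):
    """AND together one nested query per (qf parameter name, user text) pair."""
    return " AND ".join(nested_dismax(p, t) for p, t in clauses)


q = advanced_search([("qf_author", "twain"), ("qf_title", 'the "gilded" age')])
print(q)
```

The `$qf_author` style references are resolved by Solr from request parameters or request handler defaults, so the field lists and boosts stay in the server configuration rather than in client code.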
  • 36. Useful request handlers • dump, ping, luke, system, plugins, threads, properties, file
  • 37. Dump • http://localhost:8983/solr/debug/dump • Echoes parameters, content streams, and Solr web context • Be careful when content streams are enabled: a client could retrieve the contents of any file on the server or on an accessible network! [Solution: disable the dump request handler]
  • 38. Ping • http://localhost:8983/solr/admin/ping • If healthcheck configured and file not available, error is reported • Executes single configured request and reports failure or OK
  • 39. Luke • http://localhost:8983/solr/admin/luke • Introspects Lucene index structure and schema relationships • See an individual document: • ?doc=<key> or ?docId=<lucene doc #> • Schema details: ?show=schema • Admin schema browser uses Luke request handler • See also: original Luke tool - http://www.getopt.org/luke/
  • 40. System • http://localhost:8983/solr/admin/system • core info, Lucene version, JVM details, uptime, operating system info
  • 41. Plugins • http://localhost:8983/solr/admin/plugins • Configuration details of Solr core, available query and update handlers, cache settings
  • 43. Properties • http://localhost:8983/solr/admin/properties • All JVM system properties, or single property value (?name=os.arch)
  • 44. File • http://localhost:8983/solr/admin/file?file=/ • See fetchable directory tree • http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain
  • 45. Search components • Standard: query, facet, mlt, highlight, stats, debug • Others: elevation, clustering, term, term vector
  • 46. Clustering • Dynamic grouping of documents into labeled sets • http://localhost:8983/solr/clustering?q=*:*&rows=10 • http://wiki.apache.org/solr/ClusteringComponent • Requires additional steps to install (see documentation) with Apache Solr distro; baked fully into Lucid certified distro
  • 47. Terms • Enumerates terms from specified fields • http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi
  • 48. Term Vectors • Details term vector information: term frequency, document frequency, position and offset information • http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true
  • 49. stats.jsp • Not technically a "request handler", outputs only XML • http://localhost:8983/solr/admin/stats.jsp • Index stats such as number of documents, searcher open time • Request handler details, number of requests and errors, average request time, average requests per second, number of pending docs, etc.
  • 50. Analysis Tricks • CharFilters: MappingCharFilterFactory, PatternReplaceCharFilterFactory, HTMLStripCharFilterFactory • ReversedWildcardFilterFactory, see example schema.xml "text_rev" field type • a *thing query searches for gniht* • PositionFilterFactory • "can be used with a query Analyzer to prevent expensive Phrase and MultiPhraseQueries" or "all words and shingles to be placed at the same position, so that all shingles to be treated as synonyms of each other." • CommonGramsFilterFactory - Makes shingles by combining common tokens and regular tokens • CollationKeyFilterFactory (Solr 1.5) - locale based sorting • http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
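The idea behind the reversed wildcard trick can be shown with plain string reversal. This Python sketch only illustrates the principle (index each token reversed as well, then turn a leading wildcard into a cheap prefix query); the real filter does more bookkeeping than shown here, and both function names are hypothetical.

```python
def index_tokens(token):
    """Index-time idea: keep the original token and add its reversal."""
    return [token, token[::-1]]


def rewrite_leading_wildcard(query):
    """Query-time idea: '*thing' becomes the prefix query 'gniht*'."""
    if query.startswith("*") and not query.endswith("*"):
        return query[1:][::-1] + "*"
    return query


indexed = index_tokens("something")
rewritten = rewrite_leading_wildcard("*thing")
# The reversed indexed token now matches the rewritten prefix.
matches = any(t.startswith(rewritten.rstrip("*")) for t in indexed)
```

Prefix queries can seek directly into the sorted term dictionary, which is why this is so much cheaper than scanning every term for a leading wildcard.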
  • 52. Multi-select • &facet.field=facet_field&fq=facet_field:(value1 OR value2) • to exclude filters from facet counts: • &facet.field={!ex=group}facet_field&fq={!tag=group}facet_field:value2
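The tag/exclude pairing above can be generated from the user's current selections. This is a hedged Python sketch: the function name, the `format` field, and the values are made up, and values containing spaces or special characters would additionally need quoting or escaping.

```python
from urllib.parse import urlencode


def multiselect_params(field, selected, tag="group"):
    """Facet on `field` while ignoring the user's own filter on that field."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        # ex= tells the facet to ignore any filter carrying this tag
        ("facet.field", "{!ex=%s}%s" % (tag, field)),
    ]
    if selected:
        values = " OR ".join(selected)
        # tag= labels the filter so the facet above can exclude it
        params.append(("fq", "{!tag=%s}%s:(%s)" % (tag, field, values)))
    return params


params = multiselect_params("format", ["Book", "Video"])
query_string = urlencode(params)
```

Because the facet ignores its own filter, the counts for unselected values stay visible, so the user can keep adding values to the selection.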
  • 53. Hierarchical • http://wiki.apache.org/solr/HierarchicalFaceting
  • 54. Facet paging • Blacklight trick: request one more facet value than the page size (to detect whether there is a next page)
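The "one more than page size" trick might look like this in client code. The sketch below uses Solr's `facet.offset` and `facet.limit` parameters; the helper names and the `subject` field are hypothetical, and this is not Blacklight's actual implementation.

```python
def facet_page_params(field, page, page_size):
    """Request one value more than will be displayed."""
    return {
        "facet": "true",
        "facet.field": field,
        "facet.offset": page * page_size,
        "facet.limit": page_size + 1,
    }


def split_page(values, page_size):
    """Return the values to display and whether a next page exists."""
    return values[:page_size], len(values) > page_size


params = facet_page_params("subject", page=2, page_size=10)
# Pretend Solr returned 11 values: show 10 and offer a "next" link.
shown, has_next = split_page(list(range(11)), 10)
```

This gives a working next/previous pager without knowing the total number of facet values, which Solr does not report.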
  • 55. i18n • CJK • SmartChineseAnalyzer • German • DictionaryCompoundWordTokenFilterFactory • To watch: • http://code.google.com/p/lucene-hunspell/
  • 56. Testing • Automate • Relevancy • Performance • Solr log analysis: zero results queries, slow queries
  • 57. Questions • One subject that's of some interest to me is paging through facets. It drives me a little crazy that Solr lets you page through facets, yet it won't give you a total count of how many facets you are paging through, which makes presenting a fully functional paging mechanism rather problematic. I've heard that Bobo-browse may be helpful here but haven't dug into it too deeply. Maybe this is too narrow a topic to be worth spending much time on, but if anybody has any thoughts or solutions, I'd love to discuss them. • What if we wanted to implement a traditional browse with Solr, like a call number browse to simulate shelf browsing? Is there a way to leverage Solr for something like that? I'd think the trie structure would make this possible, but how it could be exposed in that manner is a mystery. • That inner query/nested query stuff that Naomi is using for advanced search would be one thing I'd add to the list. It continues to confuse me every time I look at it. • Another idea: approaches for figuring out how much RAM Solr needs, and how big to make the various Solr query caches. I know it depends on a lot and is different for every index, but I don't even know how to get started figuring out what it should be for my index. Not sure if this makes sense as an issue or not, just an idea.
  • 58. Questions • We're currently using 1.3, so the biggest changes/improvements in 1.4 would be good. • I'm also interested in fulltext indexing. We have some documents (newspapers and dissertations) that are quite large (hundreds of MB of plaintext). Is there a good rule of thumb for how much text we should index? How large is too large? Is uncorrected OCR'd text worth indexing? • The other topic I'm particularly interested in is update performance. Most of our data is currently batch-loaded and batch-indexed, but we are moving to interactive editing for some of our data, with the expectation that the Solr index be kept updated in realtime (or near-realtime). Should we use a separate server (or core) to keep the updates from impacting read-only performance? Do we need to optimize the index (this can take 20+ seconds for our main index) frequently?
  • 59. Questions • One other thing: we're using the web service interface, which seems fast and reliable. Is the SolrJ interface significantly faster or better? • DidYouMean/Spellcheck
  • 60. Questions • Does it make sense to use fixtures or fixture scenarios, as in Rails? • Does it make sense to set up a separate 'testing' core that can be dynamically dumped and rebuilt through the APIs by your test suite?
  • 61. Questions • What methods and tools can be used to determine whether configuration or physical resource changes might improve performance? E.g. increasing the filter cache, adding more memory, going to a 64-bit architecture, adding another disk drive to the array, etc. • Best procedures to make these configuration changes. E.g. these two parameters work in conjunction with each other; change this one, then that one; this one should be set to X percent of your physical memory; don't touch this one unless you really know what you are doing; etc.
  • 62. Questions • Scaling issues: millions of records, trying to keep data reasonably current • Distributed search • Considerations for non-Roman data mixed in with Roman data? We have CJK data, Cyrillic, Hebrew, Arabic. Is there a sensible way to set up the analyzers? • Any considerations for merging heterogeneous data (MARC, OAI-DC, EAD, web spidering) that may be particular to Solr? (I don't expect so, it's all going into one schema, but maybe you've run into something.)
  • 63. Questions • Indexing strategies: • Performance tuning or configuring Solr for indexing (as opposed to a copy of Solr a search app runs on). Which config options make a difference? What JVM options matter? • Merging a 'build' copy of an index into a search app's copy. Is this the replication piece? • Using multiple threads when writing to Solr. Using StreamingUpdateSolrServer effectively/safely. • Advanced features on retrieval side: • Info about facets: can Solr retrieve the global count number for a facet in addition to the count number within a filtered search result set? Only with 2 queries? • Doing Google-like autosuggest against facet values for subject terms (not like the facet.prefix method in the Solr 1.4 book). Best to use a multicore setup and have an index or two dedicated to autosuggestions? • Multiple index design: As my colleague Eric put it: big generalized index + N extreme indexes = Righteous Discovery Platform || High Folly? This is a question we are dealing with. As librarians and researchers learn what we are doing on our campus, a lot of people are offering up data, some of which is *highly* specialized. For example, metadata based on a microscopy data standard. We expect that these researchers would like us to create an expert search tool with advanced features tailored to their data model.
  • 64. Questions • Getting a better understanding of Solr memory use would be very helpful for us (or perhaps tools and tips for understanding Solr memory use). Right now we can watch the Tomcat/Solr JVM with jconsole and see heap use suddenly increasing and decreasing, but we don't understand why, so our main technique is to wait until we get an OutOfMemoryError and then increase the memory we give to the Solr/Tomcat JVM. (That and continuing to buy more memory :) • The dismax/edismax parsers and how folks are using them to tweak relevance ranking (based on MARC fields) are also of great interest. • A couple of topics that may or may not be of interest to other folks and may or may not be appropriate for the workshop. The context of these is that we are trying to understand scalability and performance issues with very large indexes (300GB x 10) and multiple shards (5 million full-text docs and growing). 1) I'd like to get a bit of a better understanding of how filter queries are implemented (and how that relates to faceting). 2) I'd like to get a better understanding of how distributed search is implemented. In particular, I'd like to understand the traffic that goes between the head shard and the shards it distributes the query to. For example, in the Tomcat logs we can see traffic with the isShard=true and ids="abc","def" parameters.
  • 65. Questions • Call number -> shelf key • Reverse sorting fields • termsComponent queries • Terms -> documents • Can we apply facets?
  • 66. Books
  • 67. e-book now available! print coming soon • http://www.manning.com/lucene
  • 68. LucidWorks for Solr • Certified Distribution • Value-added integration • KStemmer • Carrot2 clustering • LucidGaze for Solr • installer • Reference Manual • Solr 1.4++ certified
  • 69. LucidGaze for Solr • Monitoring tool, captures, stores, and interactively views Solr performance metrics • requests/second • time/request