Practical Search with Solr:
Beyond just Looking it Up
29 April 2010


     Bess Sadler, Stanford University Library
     Naomi Dushay, Stanford University Library
     Tom Burton-West, the Hathi Trust Project
Agenda
    Slides are posted at the end of this presentation; a full replay is
    available within ~48 hours of the live webcast.
    Introductions
    What, the data’s dirty? Bess Sadler
         Clean data is easy to search and browse. However, you probably
         don’t have clean data.
    Queries are not obvious: Naomi Dushay
         Browsing ordered lists; Dismax; When simple search is not enough
    Big, Bigger, Biggest: Tom Burton-West
         Large scale issues: Phrase queries and common words; OCR
    Q&A

                Lucid Imagination, Inc. – http://www.lucidimagination.com
About the Presenters
   Bess Sadler, Stanford University Library
         Senior Software Engineer at Stanford University Library, and co-founder
         of Blacklight (http://projectblacklight.org), formerly the Chief Architect
         for the Online Library Environment at the U-VA. www-sul.stanford.edu
   Naomi Dushay, Stanford University Library
         Senior Software Engineer at Stanford University Library, expert in digital
         library research; formerly a member of the core infrastructure team of
         the National Science Digital Library. www-sul.stanford.edu
   Tom Burton-West, the Hathi Trust Project
         Information Retrieval Programmer in the University of Michigan’s Digital
         Library Production Service; works on the Hathi Trust Large Scale Search
         project and blogs about it at www.hathitrust.org/blogs.


What, the data’s dirty?
  Clean data is easy to search and browse.
  However, you probably don’t have clean data.
  Bess Sadler, Stanford University Library




Before we begin, you should know

     Some basics around Solr that we won’t cover
       Gets and posts
       Search Index is not a DBMS
       XML
       Strong Data Typing
     We’ll refer to these in the talk; if you’re unfamiliar with them,
     see: bit.ly/practical-solr for some quick definitions of these
     terms



Mapping Library Data Types:
Your data is not as different as you think it is
  Library               Engineering             Health Care             Intellectual Property   Legal
  Books                 Specs                   Research papers         Patents                 Contracts
  Personal Name         Concept                 Disease Types           Mechanisms              Parties
  Publication           Formal Documentations   Journals                Filing and Disclosure   Rulings and court
                                                                        Docs                    documents
  Combined Facets:      Test results, analog    Test results, analog    Authors, titles, prior  Exhibits, photos,
  Book, Video,          data files, Media       data files, Media       art, assignees,         criminal evidence,
  Journal,              files, data sets, rich  files, patient records  claims, descriptions,   emails/e-discovery
  Newspaper,            documents, SKUs                                 figures
  Physical Artifacts,
  Digital artifacts



   Other domains: pharmaceutical, manufacturing, etc. are similar in the diversity
   of document types and data types within the documents
Data is weird
    Not Normal: The data is not always in the fields or places you
    expect, even when you have a detailed spec.
      Local practices differ
      Practices change over time
      Sometimes stuff is just wrong (but remember: it’s better to be
      consistent than right)
      Be prepared for cleanup – indexing your data is going to uncover a
      lot of problems you never knew about before
    Formats are not necessarily optimized for discovery
      For example: PDFs are optimized for presentation, not discovery;
      putting them into a discovery system presents its own challenges.

Using Solr Cell (aka Extracting Request Handler)/PDFBox

       Good news examples
         Please, God, just some metadata!
         When we got lucky, we had another source of the metadata
       Bad News examples
         Typography
         Text inset boxes
         It’s only a little easier than OCR …
       Advanced PDFBox options only work
       when there is a lot of consistency

Search vs. Browse

     Search:
     More focused -- the user is looking for a known item, or has a
     specific question to be answered. (e.g., a citation, a part
     number, a specific judicial ruling, “that book by Steinbeck”)
     Browse:
     The user has a generalized, nebulous information need that
     they will refine as they interact with a collection of resources.
     (e.g., finding a good book to read, shopping for accessories,
     keeping current in one’s field)




Search Challenges

     Relevancy – indexing the full text isn’t good enough
     Fielded search – context is meaningful (“Cook” example)
     Fielded search – will data be where you expect it to be?
     Users don’t speak your jargon:
       “indian cooking” is “Cookery Indic”
     Stemming -- Nature/Naturalism
     How do you know you have your relevancy rankings right?
       You ask!


Why Browsing Is Important


     Search is not enough
     What is a facet?
     Here is how it works in Solr
     Here is why your users will like you for doing it.
     More challenges related to browsing coming up…
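
A minimal sketch of how a faceted request might be assembled for Solr. The `facet`, `facet.field`, and `facet.mincount` parameters are standard Solr request parameters; the host, the `/select` path, and the field names `format` and `language` are assumptions made for illustration, not values from the talk.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that returns facet counts alongside results."""
    params = [
        ("q", query),
        ("rows", rows),
        ("facet", "true"),        # turn faceting on
        ("facet.mincount", 1),    # hide facet values with zero matches
    ]
    # One facet.field parameter per field to be offered for browsing.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical host and field names.
url = facet_query_url(
    "http://localhost:8983/solr/select", "steinbeck", ["format", "language"]
)
print(url)
```

Each facet value in the response carries a count, which is what lets a user narrow a vague query by clicking rather than typing.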




Queries are not obvious
  Browsing Ordered Lists
  A Little About Dismax
  and
  When Simple Search Is Not Enough
  Naomi Dushay, Stanford University Library




Candidates for Browsing
  One strategy for data that is not normalized is browsing ordered lists.
  Names (Employees, Customers, Students, Authors)
  Part Numbers
  When Spelling is Unclear
     uighur, uyghur, uyghar, uigher
  Strings of Both Letters and Digits, such as SKUs, Part
  Numbers, Invoice Numbers, Transaction Record Numbers
  Addresses in Sequence
  Titles (Books, …)



Some Values are Easily Ordered
     Numeric Values
     Dates (if normalized)
     Some Letter Tokens (e.g. categories)




Values Difficult to Sort Lexically
     Digits in non-numeric context
        lexical sort of numbers: 1, 111, 20, 222, 8 …
     “A715C74”
     “The Princess and the Pea”
     “Sir Isaac Newton”
     “Die Fledermaus”
     piña vs. pina
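
The lexical-sort problem is easy to reproduce. The sketch below uses plain Python string sorting (by code point) as a stand-in for an unnormalized string field; the word "pizza" is an extra value added here only to show where "piña" lands.

```python
values = ["8", "20", "111", "1", "222"]

# Strings compare character by character, so "111" sorts before "20".
lexical = sorted(values)

# Sorting as integers gives the order a person expects.
numeric = sorted(values, key=int)

# Left-padding with zeros makes the lexical order agree with the numeric one.
padded = sorted(values, key=lambda v: v.zfill(3))

# Accented letters sort after all unaccented ASCII letters by code point,
# so "piña" lands after "pizza" instead of next to "pina".
words = sorted(["pizza", "piña", "pina"])

print(lexical, numeric, padded, words)
```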




Call Numbers are Difficult to Sort Lexically
  (applies to SKUs, Part Numbers, Non-uniform serial numbers across domains, etc.)
  Letters combined with DIGITS
  Some digits are decimals, some are integers
  Inconsistent punctuation
  Suffixes to be ignored for sorting purposes
  Examples:
     A7 .L3 .V2
     A7 .L3 V2
     A7 .L3 V.2
     A7 .L3 1902 V.2
     M5 .L3 2000 .K2 1880
     M5 .L3 .K451 V.5
     M5 .L3 K2 D MAJ 1880
     M5 .L3 K2 OP.7:NO.6 1880
     A7 .L3 1902 V2 TANEYTOWN
     M5 .L3 K2 .Q2 MD:CRAP0*DMA 1981

Normalization for Sorting is a Process
    [Diagram: a cycle linking "Programmatically Normalize Data", "Assess Sorted
    Output" (by humans and by automated test), "Find Dirty Data", and "Clean up data"]
    It might not need to be perfect.



Basic Sorting Normalization Strategies
    Normalize Letter Case (e.g. all lowercase)
    Leading Spaces (can use zeros for digits; space works)
    Trailing Spaces
    Skip Ignored Characters (“The Fly”, “Ms. Jane Doe”)
    Numbers sorted as an Integer (leading spaces/zeros),
    vs. as a Decimal (trailing spaces/zeros)?
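
These strategies can be sketched as a single key function. This is an illustration under stated assumptions, not the presenters' code: the list of skipped leading words, the six-digit padding width, and the name `sort_key` are all choices made here. Which leading words to skip depends on language ("Die" is an article in German but a content word in English).

```python
import re
import unicodedata

# Hypothetical list of leading words to ignore when sorting.
LEADING_SKIP = re.compile(r"^(the|a|an|die|der|das|sir|ms\.?|mr\.?)\s+")


def sort_key(value, width=6):
    """Return a normalized string whose lexical order is usable for browsing."""
    # Fold accents: decompose, then drop the combining marks (piña -> pina).
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Normalize case and surrounding whitespace.
    text = text.strip().lower()
    # Skip ignored leading words ("The Fly" -> "fly").
    text = LEADING_SKIP.sub("", text)
    # Zero-pad every run of digits so integers sort numerically.
    return re.sub(r"\d+", lambda m: m.group().zfill(width), text)


print(sort_key("The Princess and the Pea"))
print(sorted(["A20", "A8", "A111"], key=sort_key))
```

Zero-padding treats every digit run as an integer; values that are really decimals would need trailing padding instead, which is the distinction the last bullet raises.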

    Normalization should accommodate dirty data whenever practical.



  CALL NUMBER    NORMALIZED SORT KEY
  A7 .B3         a+++0007.000000 b0.300000
  A7 B33         a+++0007.000000 b0.330000
  A17 .B4        a+++0017.000000 b0.400000
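
The mapping shown on this slide can be approximated with a short function. This is a simplified reconstruction inferred from the three examples only, not the presenters' implementation: it handles a class (letters plus number) followed by simple cutters, keeps `+` literally as the pad character as displayed (it may stand for a URL-encoded space), and returns `None` for values it cannot parse. The name `shelfkey` is borrowed from the field name used later in the deck.

```python
import re

CLASS_PATTERN = re.compile(r"\s*([A-Za-z]+)\s*(\d+)(?:\.(\d+))?(.*)")
CUTTER_PATTERN = re.compile(r"([A-Za-z])(\d+)")


def shelfkey(call_number):
    """Normalize a call number such as 'A7 .B3' into a lexically sortable key."""
    match = CLASS_PATTERN.match(call_number)
    if match is None:
        return None  # e.g. "Shelved by title": needs separate handling
    letters, whole, fraction, rest = match.groups()
    # Class letters padded on the right; class number treated as an integer
    # (leading zeros) with a fixed-width decimal part (trailing zeros).
    key = [
        letters.lower().ljust(4, "+")
        + whole.zfill(4)
        + "."
        + (fraction or "").ljust(6, "0")
    ]
    # Cutter digits are treated as decimals, so B33 sorts between B3 and B4.
    for letter, digits in CUTTER_PATTERN.findall(rest):
        key.append(letter.lower() + "0." + digits.ljust(6, "0"))
    return " ".join(key)


for call_number in ["A17 .B4", "A7 B33", "A7 .B3"]:
    print(call_number, "->", shelfkey(call_number))
```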
Weird Values Happen

     ZDVD 4971
     MFILM 24 REEL 5
     Shelved by title
     XX(123457)
     call # varies
     no call number




Solr Performance Issue: Query Time Sorting

     q=sortfield:["666" TO *]&rows=10
        Will sort ALL of the sortfield values at Query Time
        Response time abysmal for sortfields
        with huge numbers of values
     Try this: Terms Component




Solrconfig.xml




                              QUERY LOOKS LIKE:

                              http://host:port/solr/alphaTerms?
                                 terms.fl=
                                 terms.lower=
                                 per_page=
/solr/alphaTerms?
     terms.fl=shelfkey&
     terms.lower=lc+hc++0337.000000+f0.500000+f0.512000&
     per_page=10



TermsComponent queries the part of the index that is already lexically
sorted for each field.
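
A sketch of building this request from code. It assumes the `+` characters in the slide's URL are URL-encoded spaces, so the raw lower bound contains spaces. The handler path `/solr/alphaTerms` and the `per_page` parameter are taken from the slide as-is; the standard TermsComponent parameter for limiting the number of returned terms is `terms.limit`, so `per_page` presumably relies on this particular handler's configuration.

```python
from urllib.parse import urlencode


def terms_browse_url(handler_path, field, lower_bound, per_page=10):
    """Build a TermsComponent request that starts the list at lower_bound."""
    params = [
        ("terms.fl", field),           # field whose indexed terms are listed
        ("terms.lower", lower_bound),  # first term to return
        ("per_page", per_page),        # as on the slide; see note above
    ]
    return handler_path + "?" + urlencode(params)


url = terms_browse_url(
    "/solr/alphaTerms",
    "shelfkey",
    "lc hc  0337.000000 f0.500000 f0.512000",
)
print(url)
```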




Now that I Have Terms, How do I get the Documents?
/solr/select?
q=sortkey:("a++67+mn++4" OR "a++67+mp+85")
&qt=standard

            (URL encode if you need to)
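
One way to assemble that follow-up query from the terms returned by TermsComponent. The helper names are hypothetical, and the example again assumes the `+` characters on the slide are encoded spaces. Each term is sent as a quoted phrase, so embedded quotes and backslashes are escaped first.

```python
from urllib.parse import urlencode


def escape_phrase(value):
    """Escape backslashes and double quotes for use inside a quoted phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def documents_query(field, terms):
    """Build field:("term1" OR "term2" ...) for a list of exact terms."""
    clauses = " OR ".join('"%s"' % escape_phrase(term) for term in terms)
    return "%s:(%s)" % (field, clauses)


q = documents_query("sortkey", ["a  67 mn  4", "a  67 mp 85"])
query_string = urlencode({"q": q, "qt": "standard"})
print(q)
print("/solr/select?" + query_string)
```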




Sortfield Value : Document is NOT always 1:1

  1:Many One Sortfield Value – Multiple Documents
    One product, multiple generations of user manuals
    One court case, multiple briefing and disclosure documents


  Many:1 One Document – Multiple Sortfield Values
    Which value are you going to pick for browsing list?
        Allow user to select in UI, if possible




What About Browsing Before the Known Sort Value?




  [Image: a row of book spines, labeled "n Before" and "n After" around the known value]
  Image source: http://hayward-ca.gov/refreshyourlife/wp-content/uploads/2009/07/fiction-spines.jpg


Create Reverse Sortkey
 Use a simple character mapping to reverse the sort order
    IF   sortkey HAS                      reversekey GETS
    0                                     Z
    1                                     Y
    …                                     …
    9                                     Q
    A                                     P
    …                                     …
    Z                                     0
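
The mapping in this table is the 36-character alphabet `0-9A-Z` paired with its own reverse, which can be expressed as a translation table. A minimal sketch: uppercasing the key first and leaving other characters (spaces, punctuation) unchanged are assumptions made here, not details from the slide.

```python
import string

FORWARD = string.digits + string.ascii_uppercase        # "0123...XYZ"
REVERSE_TABLE = str.maketrans(FORWARD, FORWARD[::-1])   # 0->Z, 1->Y, ... Z->0


def reverse_key(sortkey):
    """Map each character to its mirror so lexical order is reversed."""
    return sortkey.upper().translate(REVERSE_TABLE)


print(reverse_key("019AZ"))
# Sorting by the reverse key walks the list backwards.
print(sorted(["A7", "B3", "C1"], key=reverse_key))
```

With a reverse key indexed in its own field, "the n values before X" becomes an ordinary forward scan starting at the reverse key of X.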
A Little About Dismax




Solr QueryParsing Strategies

FEATURE                                                                     LUCENE   DISMAX

Boolean                                                                       √
Each Text Box -> Groups of Index Fields                                       √        √
Each Text Box -> Complex Boosting Equation                                  yuck       √
Multiple Text Boxes                                                         yuck       √
Multiple Query Words Match Across Fields                                               √
Boosting Matches Simple                                                                √
   “Author” “Title” “Subject” Searches

Dismax (disjunction max) Query Parser:
 Some of My Favorite Things
     Assign boost values for field matching at query time BUT
     complex boosting formulae can reside in solrconfig.xml
       Index can be neutral; assign query time boosting to fields for
       different types of queries
     Easy to boost exact phrase matches higher than query terms
     scattered across document.
     Tune how many query words MUST match, and what the
     other matching thresholds/parameters might be
     http://wiki.apache.org/solr/DisMaxRequestHandler


Example Dismax Request Handler
<!-- author search request handler -->
<requestHandler name="search_author" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>

    <!-- require 4 or more terms to match … -->
    <str name="mm">4&lt;-1 4&lt;90%</str>

    <!-- boost formula -->
    <str name="qf">author_unstem^10 author native_script_author</str>

    <!-- boost phrase matches -->
    <str name="pf">author_unstem^100 author^10 native_script_author^10</str>
…



http://wiki.apache.org/solr/DisMaxRequestHandler

Sometimes,
Simple Search + Facets
   is Not Enough




WHEN isn’t it enough?

  Pay attention to user feedback
  Study Search Logs
         Queries without results




Our Users Also Asked for:
  Boolean
  Targeting a particular (group of) fields
    “… combined searching feature so that I can specify the author
    and title.”
          (author) Mozart (title) sonata 21 – not a book about Mozart’s
          sonatas
    “I often search publisher AND year, or publisher AND place of
    publication, and occasionally need all three terms in
    combination.”
         (publisher) “Little, Brown & Co” – not “The Little Brown Jug”
         Plaintiff, Defendant, Attorney – all?



Search Form has More Than One Text Box
    Want Features of Dismax
    Need Way to Boost Appropriately for Each Text Box
    Need Way to Combine Text Boxes




Local Params
   LocalParams allow additional, localized instructions to be sent as part
   of the query.
   Ways to Parse Query Terms
   Send in Non-Default Values for Variables
   Use Variables Declared in Request Handler That Don’t Map To
   QueryParser Arguments


   http://wiki.apache.org/solr/LocalParams



Solrconfig.xml
<requestHandler name="multi_box" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="q.op">AND</str>

    <!-- author box -->
    <str name="qf_author">author fields boost formula</str>
    <str name="pf_author">author phrase boosts</str>

    <!-- title box -->
    <str name="qf_title">title fields boost formula</str>
    <str name="pf_title">title phrase boosts</str>

…




Using LocalParams Variables
    Text boxes combined with AND
      _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" AND
      _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"


    Text boxes combined with OR
      _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" OR
      _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
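
A sketch of how a search form with several text boxes might generate these nested queries. The function names are hypothetical. It assumes each box name has matching `qf_<box>` and `pf_<box>` variables declared in the request handler, as in the configuration on the previous slide. Because the user's terms sit inside a quoted string, embedded quotes and backslashes are escaped, and empty boxes are skipped.

```python
def escape_for_nested_query(terms):
    """Escape backslashes and double quotes inside the quoted sub-query."""
    return terms.replace("\\", "\\\\").replace('"', '\\"')


def box_clause(box, terms):
    """One dismax sub-query using the handler's qf_<box>/pf_<box> variables."""
    return '_query_:"{!dismax qf=$qf_%s pf=$pf_%s}%s"' % (
        box,
        box,
        escape_for_nested_query(terms),
    )


def combine_boxes(boxes, operator="AND"):
    """Join the non-empty text boxes with AND or OR."""
    clauses = [box_clause(name, terms) for name, terms in boxes if terms.strip()]
    return (" %s " % operator).join(clauses)


q = combine_boxes([("title", "sonata 21"), ("author", "Mozart")])
print(q)
```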




Note: DISMAX doesn’t do Boolean within the text
boxes: there are workarounds …
     edismax (Solr 1.5)
     faking it:
       http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf




My Favorite Places To Find Information
 LucidImagination Search
   http://www.lucidimagination.com/search/
   (NOT a coerced statement!)
 Solr wikis
   http://wiki.apache.org/solr/FrontPage




Big, Bigger, Biggest
  Large scale issues:
  Phrase queries and common words
  OCR
  Tom Burton-West, Hathi Trust Project




Hathi Trust Large Scale Search Challenges
 Goal: Design a system for full-text search that
 will scale to 5 million to 20 million volumes (at a reasonable cost.)
 Challenges:
   Must scale to 20 million full-text volumes
   Very long documents compared to
   most large-scale search applications
   Multilingual collection
   OCR quality varies




Index Size, Caching, and Memory

  Our documents average about 300 pages
  which is about 700KB of OCR.
  Our 5 million document index is between 2 and 3 terabytes.
  About 300 GB per million documents
  Large index means disk I/O is bottleneck
  Tradeoff JVM vs OS memory
    Solr uses OS memory (disk I/O caching) for caching of postings
    Memory available for disk I/O caching has most impact on response
    time (assuming adequate cache warming)
  Fitting entire index in memory not feasible with terabyte size index
Response time varies with query

                                                                        Average:    673
                                                                        Median:      91
                                                                        90th:       328
                                                                        99th:      7,504




Slowest 5% of queries
  [Chart: Response Time 95th percentile (seconds), axis from 0 to 1,000,
  plotted against Query number, 940 to 1,000]
  The slowest 5% of queries took about 1 second or longer.
  The slowest 1% of queries took between 10 seconds and 2 minutes.
  Slowest 0.5% of queries took between 30 seconds and 2 minutes
  These queries affect response time of other queries
     Cache pollution
     Contention for resources
  Slowest queries are phrase queries containing common words



Query processing
  Phrase queries use position index (Boolean queries do not).
  Position index accounts for 85% of index size
  Position list for common words such as
  “the” can be many GB in size
  This causes lots of disk I/O.
  Solr depends on the operating system’s disk cache to reduce disk
  I/O requirements for words that occur in more than one query
  I/O from Phrase queries containing
  common words pollutes the cache



Slow Queries
  Slowest test query: “the lives and literature of the beat
  generation” took 2 minutes.
  4MB data read for Boolean query.
  9,000+ MB read for Phrase query.
                 NUMBER OF    POSTINGS LIST    TOTAL TERM OCCURRENCES    POSITION LIST
  WORD           DOCUMENTS      (SIZE MB)            (MILLIONS)            (SIZE MB)

  the              800,000           0.8                  4,351                4,351
  of               892,000          0.89                  2,795                2,795
  and              769,000          0.77                  1,870                1,870
  literature       435,000          0.44                      9                    9
  generation       414,000          0.41                      5                    5
  lives            432,000          0.43                      5                    5
  beat             278,000          0.28                      1                    1
  TOTAL                             4.02                                       9,036
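
The totals in this table account for the gap between the two query types. The short calculation below, using only the figures from the table, adds up what a Boolean query reads (postings lists) versus what a phrase query must additionally read (position lists).

```python
# word: (postings list MB, position list MB), copied from the table above
term_sizes = {
    "the": (0.8, 4351),
    "of": (0.89, 2795),
    "and": (0.77, 1870),
    "literature": (0.44, 9),
    "generation": (0.41, 5),
    "lives": (0.43, 5),
    "beat": (0.28, 1),
}

postings_mb = sum(postings for postings, _ in term_sizes.values())
positions_mb = sum(positions for _, positions in term_sizes.values())

# Share of the position data that comes from the three common words alone.
common_mb = sum(term_sizes[word][1] for word in ("the", "of", "and"))

print(round(postings_mb, 2), positions_mb, common_mb)
```

Nearly all of the phrase query's reads come from "the", "of", and "and", which is why the common words, not the query length, drive the cost.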

Why not use Stop Words?
  The word “the” occurs more than 4 billion times in our 1 million
  document index.
  Removing “stop” words (“the”, “of” etc.) not desirable for our use cases.
  Couldn’t search for many phrases
     “to be or not to be”
     “the who”
     “man in the moon” vs. “man on the moon”
  Stop words in one language are content words in another language
     German stop words “war” and “die” are content words in English
     English stop words “is” and “by” are content words (“ice” and “village”)
     in Swedish

“CommonGrams”

 Ported Nutch “CommonGrams” algorithm to Solr
 Create Bi-Grams selectively for any two word sequence containing
 common terms
 Slowest query: “The lives and literature of the beat generation”
   “the-lives”  “lives-and”  “and-literature”  “literature-of”
   “of-the”  “the-beat”  “generation”
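
The query-side behavior shown on this slide can be sketched in a few lines. This is an illustration of the idea, not the Solr filter itself: the function name, the three-word common list, and the `-` joiner (copied from the slide) are assumptions. A bigram is emitted whenever either word of an adjacent pair is common, and a single word is emitted only when it is not already covered by a bigram.

```python
def common_grams_query(tokens, common_words):
    """Rewrite phrase-query tokens into selective bigrams plus leftover words."""
    output = []
    covered_by_previous = False  # was this token the second half of a bigram?
    for index, token in enumerate(tokens):
        following = tokens[index + 1] if index + 1 < len(tokens) else None
        forms_gram = following is not None and (
            token in common_words or following in common_words
        )
        if forms_gram:
            output.append(token + "-" + following)
        elif not covered_by_previous:
            # Not part of any bigram, so keep the plain word.
            output.append(token)
        covered_by_previous = forms_gram
    return output


common = {"the", "of", "and"}  # hypothetical common-word list
phrase = "the lives and literature of the beat generation".split()
grams = common_grams_query(phrase, common)
print(grams)
```

The phrase query then matches against much rarer terms such as "the-lives" instead of scanning the full position list for "the".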




Standard index vs. CommonGrams
Standard Index
                    TOTAL OCCURRENCES        NUMBER OF DOCS
  WORD            IN CORPUS (MILLIONS)        (THOUSANDS)
  the                     2,013                   386
  of                      1,299                   440
  and                       855                   376
  literature                  4                   210
  lives                       2                   194
  generation                  2                   199
  beat                      0.6                   130
  TOTAL                   4,176

Common Grams
                    TOTAL OCCURRENCES        NUMBER OF DOCS
  TERM            IN CORPUS (MILLIONS)        (THOUSANDS)
  of-the                    446                   396
  generation               2.42                   262
  the-lives                0.36                   128
  literature-of            0.35                   103
  lives-and                0.25                   115
  and-literature           0.24                    77
  the-beat                 0.06                    26
  TOTAL                     450


Comparison of Response time (ms)
                    AVERAGE    MEDIAN    90th     99th    SLOWEST QUERY
  Standard Index        459        32     146    6,784          120,595
  Common Grams           68         3      71    2,226            7,800




Other issues

     Analyze your slowest queries
       We analyzed the slowest queries from our query logs and
       discovered additional “common words” to be added to our list.
       We used Solr Admin panel to run our slowest queries from our
       logs with the “debug” flag checked.
        We discovered that words such as “l’art” were being split into
       two token phrase queries.
       We used the Solr Admin Analysis tool and determined that the
       analyzer we were using was the culprit.




Other issues

     We broke Solr … temporarily
       Dirty OCR in combination with over 200 languages creates
       indexes with over 2.4 billion unique terms.
       The Solr/Lucene index size was limited to 2.1 billion unique terms.
       Patched: now the limit is 274 billion.
       Dirty OCR is difficult to remove without removing “good” words.
       Because the Solr/Lucene tii/tis index uses pointers into the
       frequency and position files, we suspect that the performance
       impact is minimal compared to disk I/O demands, but we will be
       testing soon.
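The two limits quoted on this slide are consistent with a signed 32-bit counter and a term index interval of 128. That reading is an assumption on our part, since the slide does not say where the numbers come from. A quick arithmetic check:

```python
# Assumption: the old limit is the largest signed 32-bit integer.
INT32_MAX = 2**31 - 1

# Assumption: only every 128th term is held in the term index, so a
# limit on indexed terms allows 128 times as many terms overall.
TERM_INDEX_INTERVAL = 128

old_limit = INT32_MAX
new_limit = INT32_MAX * TERM_INDEX_INTERVAL

print(f"{old_limit:,}")  # about 2.1 billion
print(f"{new_limit:,}")  # about 274 billion
```

Under these assumptions an index with 2.4 billion unique terms exceeds the old limit but sits far below the new one.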


Q&A
      Download these slides at http://bit.ly/practical-solr
      On-demand replay is available within 24-48 hours of the live webcast
Practical Search with Solr: Beyond just Looking it Up

  • 1. Practical Search with Solr: Beyond just Looking it Up 29 April 2010 Bess Sadler, Stanford University Library Naomi Dushay, Stanford University Library Tom Burton-West, the Hathi Trust Project
  • 2. Slides posted at the end of this Agenda presentation; full replay available within ~48 hours of live webcast Introductions What, the data’s dirty? Bess Sadler Clean data is easy to search and browse. However, you probably don’t have clean data. Queries are not obvious: Naomi Dushay Browsing ordered lists; Dismax; When simple search is not enough Big, Bigger, Biggest: Tom Burton West Large scale issues: Phrase queries and common words; OCR Q&A Lucid Imagination, Inc. – http://www.lucidimagination.com 2
  • 3. About the Presenters. Bess Sadler, Stanford University Library: Senior Software Engineer at Stanford University Library, and co-founder of Blacklight (http://projectblacklight.org), formerly the Chief Architect for the Online Library Environment at the U-VA. www-sul.stanford.edu Naomi Dushay, Stanford University Library: Senior Software Engineer at Stanford University Library, expert in digital library research; formerly a member of the core infrastructure team of the National Science Digital Library. www-sul.stanford.edu Tom Burton-West, the Hathi Trust Project: Information Retrieval Programmer in the University of Michigan’s Digital Library Production Service; works on the Hathi Trust Large Scale Search project and blogs about it at www.hathitrust.org/blogs
  • 4. What, the data’s dirty? Clean data is easy to search and browse. However, you probably don’t have clean data. Bess Sadler, Stanford University Library
  • 5. Before we begin, you should know: Some basics around Solr that we won’t cover: Gets and posts; Search Index is not a DBMS; XML; Strong Data Typing. We’ll refer to these in the talk; if you’re unfamiliar with them, see bit.ly/practical-solr for some quick definitions of these terms
  • 6. Mapping Library Data Types: Your data is not as different as you think it is
      Library: Books; Personal Name; Publication; Combined Facets: Book, Video, Journal, Newspaper, Physical Artifacts, Digital artifacts
      Engineering: Specs; Concept; Formal Documentations; Test results, analog data files, Media files, data sets, rich documents, SKUs
      Health Care: Research papers; Disease Types; Journals; Test results, analog data files, Media files, patient records
      Intellectual Property: Patents; Mechanisms; Filing and Disclosure Docs; Authors, titles, prior art, assignees, claims, descriptions, figures
      Legal: Contracts; Parties; Rulings and court documents; Exhibits, photos, criminal evidence, emails/e-discovery
      Other domains (pharmaceutical, manufacturing, etc.) are similar in the diversity of document types and data types within the documents
  • 7. Data is weird. Not Normal: The data is not always in the fields or places you expect, even when you have a detailed spec. Local practices differ. Practices change over time. Sometimes stuff is just wrong (but remember: it’s better to be consistent than right). Be prepared for cleanup: indexing your data is going to uncover a lot of problems you never knew about before. Formats are not necessarily optimized for discovery. For example: PDFs are optimized for presentation, not discovery; putting them into a discovery system presents its own challenges.
  • 8. Using Solr Cell (aka Extracting Request Handler)/PDFBox. Good news examples: Please, God, just some metadata! When we got lucky, we had another source of the metadata. Bad news examples: Typography; Text inset boxes; It’s only a little easier than OCR … Advanced PDFBox options only work when there is a lot of consistency.
  • 9. Search vs. Browse. Search: More focused -- the user is looking for a known item, or has a specific question to be answered (e.g., a citation, a part number, a specific judicial ruling, “that book by Steinbeck”). Browse: The user has a generalized, nebulous information need that they will refine as they interact with a collection of resources (e.g., finding a good book to read, shopping for accessories, keeping current in one’s field).
  • 10. Search Challenges. Relevancy: indexing the full text isn’t good enough. Fielded search: context is meaningful (“Cook” example). Fielded search: will data be where you expect it to be? Users don’t speak your jargon: “indian cooking” is “Cookery Indic”. Stemming: Nature/Naturalism. How do you know you have your relevancy rankings right? You ask!
  • 11. Why Browsing Is Important. Search is not enough. What is a facet? Here is how it works in Solr. Here is why your users will like you for doing it. More challenges related to browsing coming up…
  • 12. Queries are not obvious: Browsing Ordered Lists; A Little About Dismax and When Simple Search Is Not Enough. Naomi Dushay, Stanford University Library
  • 13. Candidates for Browsing. One Strategy for Data that is Not Normalized is Browsing Ordered Lists. Names (Employees, Customers, Students, Authors); Part Numbers; When Spelling is Unclear (uighur, uyghur, uyghar, uigher); Strings of Both Letters and Digits, such as SKUs, Part Numbers, Invoice Numbers, Transaction Record Numbers; Addresses in Sequence; Titles (Books, …)
  • 14. Some Values are Easily Ordered: Numeric Values; Dates (if normalized); Some Letter Tokens (e.g. categories)
  • 15. Values Difficult to Sort Lexically: Digits in non-numeric context (lexical sort of numbers: 1, 111, 20, 222, 8 …); “A715C74”; “The Princess and the Pea”; “Sir Isaac Newton”; “Die Fledermaus”; piña vs. pina
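The "1, 111, 20, 222, 8" ordering on slide 15 is easy to reproduce. The snippet below is only an illustration, not part of the talk: it sorts the same digit strings three ways, and the four-character padding width is an arbitrary assumption.

```python
values = ["8", "20", "1", "222", "111"]

# A plain string sort compares character by character, so "111" lands before "20".
lexical = sorted(values)

# Sorting by integer value gives the order a person expects.
numeric = sorted(values, key=int)

# Zero-padding to a fixed width (4 is an arbitrary choice here) makes the
# string order agree with the numeric order.
padded = sorted(v.zfill(4) for v in values)

print(lexical)
print(numeric)
print(padded)
```

Padding matters for an index because stored terms are compared as strings; a key that is already padded sorts correctly without any query-time work.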
  • 16. Call Numbers are Difficult to Sort Lexically (applies to SKUs, Part Numbers, Non-uniform serial numbers across domains, etc.). Letters combined with DIGITS; Some digits are decimals, some are integers; Inconsistent punctuation; Suffixes to be ignored for sorting purposes. Examples:
      A7 .L3 .V2
      A7 .L3 V2
      A7 .L3 V.2
      A7 .L3 1902 V.2
      A7 .L3 1902 V2 TANEYTOWN
      M5 .L3 2000 .K2 1880
      M5 .L3 .K451 V.5
      M5 .L3 K2 D MAJ 1880
      M5 .L3 K2 OP.7:NO.6 1880
      M5 .L3 K2 .Q2 MD:CRAP0*DMA 1981
  • 17. Normalization for Sorting is a Process (it might not need to be perfect): Programmatically Normalize Data; Clean up data; Assess Sorted Output; Humans Find Dirty Data; Automated test
  • 18. Basic Sorting Normalization Strategies: Normalize Letter Case (e.g. all lowercase); Leading Spaces (can use zeros for digits; space works); Trailing Spaces; Skip Ignored Characters (“The Fly”, “Ms. Jane Doe”); Numbers sorted as an Integer (leading spaces/zeros), vs. as a Decimal (trailing spaces/zeros)? Normalization should accommodate dirty data whenever practical.
  • 19.
      a+++0007.000000   A7 .B3    b0.300000
      a+++0007.000000   A7 B33    b0.330000
      a+++0017.000000   A17 .B4   b0.400000
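A minimal sketch of how keys like those on slide 19 could be produced. This is a reconstruction inferred from the three examples only, not the presenters' code: the function name `sort_key`, the regular expression, the `+` pad character and the field widths (4 letters, 4 integer digits, 6 decimals) are all assumptions. It handles only the simple "letters, number, one cutter" shape; the messier call numbers on slide 16 are deliberately rejected.

```python
import re

# Hypothetical pattern: class letters, class number, optional dot, cutter letter, cutter digits.
CALL_NUMBER = re.compile(r"^([A-Za-z]+)\s*(\d+(?:\.\d+)?)\s*\.?([A-Za-z])(\d+)$")


def sort_key(call_number, pad="+"):
    """Build a lexically sortable key from a simple call number such as 'A7 .B3'."""
    match = CALL_NUMBER.match(call_number.strip())
    if match is None:
        raise ValueError(f"unsupported call number shape: {call_number!r}")
    letters, number, cutter_letter, cutter_digits = match.groups()

    # Class letters: lowercase, right-padded so 'a' and 'ab' line up.
    class_part = letters.lower().ljust(4, pad)

    # Class number is an integer-like value: pad with LEADING zeros, fixed decimals.
    int_part, _, frac_part = number.partition(".")
    number_part = f"{int_part.zfill(4)}.{frac_part.ljust(6, '0')}"

    # Cutter digits are a decimal fraction: pad with TRAILING zeros.
    cutter_part = f"{cutter_letter.lower()}0.{cutter_digits.ljust(6, '0')}"

    return f"{class_part}{number_part} {cutter_part}"


call_numbers = ["A17 .B4", "A7 B33", "A7 .B3"]
for cn in sorted(call_numbers, key=sort_key):
    print(sort_key(cn), "<-", cn)
```

The integer part gets leading zeros and the cutter gets trailing zeros, which is the integer-versus-decimal distinction raised on slide 18. Dirty values that do not match raise `ValueError`, so they can be collected for the "humans find dirty data" step rather than being sorted wrongly.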
  • 20. Weird Values Happen: ZDVD 4971; MFILM 24 REEL 5; Shelved by title; XX(123457); call # varies; no call number
  • 21. Solr Performance Issue: Query Time Sorting. q=sortfield:["666" TO *]&rows=10 will sort ALL of the sortfield values at query time. Response time is abysmal for sortfields with huge numbers of values. Try this: Terms Component
  • 22. Solrconfig.xml. QUERY LOOKS LIKE: http://host:port/solr/alphaTerms?terms.fl=&terms.lower=&per_page=
  • 23. /solr/alphaTerms?terms.fl=shelfkey&terms.lower=lc+hc++0337.000000+f0.500000+f0.512000&per_page=10 TermsComponent queries the part of the index that is already lexically sorted for each field.
  • 24. Now that I Have Terms, How do I get the Documents? /solr/select?q=sortkey:("a++67+mn++4" OR "a++67+mp+85")&qt=standard (URL encode if you need to)
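The two-step pattern on slides 22 to 24 (ask the TermsComponent for the next terms, then fetch the documents that carry them) can be sketched as URL builders. This is an illustration under assumptions: the base URL is hypothetical, the `/alphaTerms` handler name and the `shelfkey`/`sortkey` field names are taken from the slides, and `terms.limit` (the standard TermsComponent parameter for capping results) is used where the slides show `per_page`. No request is sent.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # hypothetical host and port


def terms_url(base, field, lower, limit=10):
    """URL asking the terms handler for `limit` terms starting at `lower`."""
    params = {"terms.fl": field, "terms.lower": lower, "terms.limit": limit}
    return f"{base}/alphaTerms?{urlencode(params)}"


def documents_url(base, field, terms):
    """URL selecting every document whose `field` holds one of `terms`."""
    # Quote each term as a phrase and escape embedded double quotes.
    quoted = " OR ".join('"' + t.replace('"', '\\"') + '"' for t in terms)
    params = {"q": f"{field}:({quoted})", "qt": "standard"}
    return f"{base}/select?{urlencode(params)}"


print(terms_url(BASE, "shelfkey", "a 67 mn 4"))
print(documents_url(BASE, "sortkey", ["a 67 mn 4", "a 67 mp 85"]))
```

`urlencode` handles the "URL encode if you need to" caveat from slide 24: spaces become `+` and the quotes, colon and parentheses are percent-encoded.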
  • 25. Sortfield Value : Document NOT always 1:1. 1:Many (One Sortfield Value, Multiple Documents): One product, multiple generations of user manuals; One court case, multiple briefing and disclosure documents. Many:1 (One Document, Multiple Sortfield Values): Which value are you going to pick for browsing list? Allow user to select in UI, if possible
  • 26. What About Browsing Before the Known Sort Value? n Before, n After. Image: http://hayward-ca.gov/refreshyourlife/wp-content/uploads/2009/07/fiction-spines.jpg
  • 27. Create Reverse Sortkey. Use a simple character mapping to reverse the sort order. IF sortkey HAS, reversekey GETS: 0 -> Z; 1 -> Y; …; 9 -> Q; A -> P; …; Z -> 0
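The character mapping on slide 27 amounts to reversing a 36-character alphabet. A minimal sketch, assuming keys use only digits and letters: the names `ALPHABET` and `reverse_key` are made up here, upper-casing first is an assumption, and any other character (such as the `+` and `.` seen in the earlier keys) is passed through unchanged.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# 0 -> Z, 1 -> Y, ..., 9 -> Q, A -> P, ..., Z -> 0
REVERSE_TABLE = str.maketrans(ALPHABET, ALPHABET[::-1])


def reverse_key(sortkey):
    """Map each character to its mirror so ascending order becomes descending."""
    return sortkey.upper().translate(REVERSE_TABLE)


keys = ["A1", "B2", "C3"]
print([reverse_key(k) for k in keys])
# Sorting ascending on the reverse key walks the original keys backwards.
print(sorted(keys, key=reverse_key))
```

Indexing the reverse key in its own field lets the same "terms from a lower bound" lookup serve "n before": look up the reversed value and read forward.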
  • 29. A Little About Dismax
  • 30. Solr QueryParsing Strategies. FEATURE / LUCENE / DISMAX. Boolean: √. Each Text Box -> Groups of Index Fields: √ √. Each Text Box -> Complex Boosting Equation: yuck √. Multiple Text Boxes: yuck √. Multiple Query Words Match Across Fields: √. Boosting Matches Simple: √. “Author” “Title” “Subject” Searches
  • 31. Dismax (disjunction max) Query Parser: Some of My Favorite Things. Assign boost values for field matching at query time, BUT complex boosting formulae can reside in solrconfig.xml. Index can be neutral; assign query time boosting to fields for different types of queries. Easy to boost exact phrase matches higher than query terms scattered across document. Tune how many query words MUST match, and what the other matching thresholds/parameters might be. http://wiki.apache.org/solr/DisMaxRequestHandler
  • 32. Example Dismax Request Handler
      <!-- author search request handler -->
      <requestHandler name="search_author" class="solr.SearchHandler" >
        <lst name="defaults">
          <str name="defType">dismax</str>
          <!-- require 4 or more terms to match … -->
          <str name="mm">4&lt;-1 4&lt;90%</str>
          <!-- boost formula -->
          <str name="qf">author_unstem^10 author native_script_author</str>
          <!-- boost phrase matches -->
          <str name="pf">author_unstem^100 author^10 native_script_author^10</str>
          …
      http://wiki.apache.org/solr/DisMaxRequestHandler
  • 33. Sometimes, Simple Search + Facets is Not Enough
  • 34. WHEN isn’t it enough? Pay attention to user feedback. Study Search Logs. Queries without results
  • 35. Our Users Also Asked for: Boolean; Targeting a particular (group of) fields. “… combined searching feature so that I can specify the author and title.” (author) Mozart (title) sonata 21: not a book about Mozart’s sonatas. “I often search publisher AND year, or publisher AND place of publication, and occasionally need all three terms in combination.” (publisher) “Little, Brown & Co”: not “The Little Brown Jug”. Plaintiff, Defendant, Attorney: all?
  • 36. Search Form has More Than One Text Box. Want Features of Dismax. Need Way to Boost Appropriately for Each Text Box. Need Way to Combine Text Boxes
  • 37. Local Params. LocalParams allow additional, localized instructions to be sent as part of the query: Ways to Parse Query Terms; Send in Non-Default Values for Variables; Use Variables Declared in Request Handler That Don’t Map To QueryParser Arguments. http://wiki.apache.org/solr/LocalParams
  • 38. Solrconfig.xml
      <requestHandler name="multi_box" class="solr.SearchHandler" >
        <lst name="defaults">
          <str name="defType">lucene</str>
          <str name="q.op">AND</str>
          <!-- author box -->
          <str name="qf_author">author fields boost formula</str>
          <str name="pf_author">author phrase boosts</str>
          <!-- title box -->
          <str name="qf_title">title fields boost formula</str>
          <str name="pf_title">title phrase boosts</str>
          …
  • 39. Using LocalParams Variables
      Text boxes combined with AND:
      _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" AND _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
      Text boxes combined with OR:
      _query_:"{!dismax qf=$qf_title pf=$pf_title}title terms" OR _query_:"{!dismax qf=$qf_author pf=$pf_author}author terms"
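A search form with several text boxes could assemble the nested query from slide 39 along these lines. This is a sketch, not the Stanford implementation: the function names are hypothetical, and it assumes the request handler declares a `qf_<box>` and `pf_<box>` variable for each box name, as slide 38 does for `author` and `title`.

```python
def dismax_clause(box, terms):
    """One nested dismax query that reads its boosts from $qf_<box> / $pf_<box>."""
    # The clause sits inside double quotes, so escape backslashes and quotes in user input.
    escaped = terms.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{{!dismax qf=$qf_{box} pf=$pf_{box}}}{escaped}"'


def combine_boxes(boxes, operator="AND"):
    """Join the non-empty text boxes with AND or OR."""
    clauses = [dismax_clause(box, terms) for box, terms in boxes.items() if terms.strip()]
    return f" {operator} ".join(clauses)


form = {"author": "Mozart", "title": "sonata 21", "subject": ""}
print(combine_boxes(form))
print(combine_boxes(form, operator="OR"))
```

Empty boxes are skipped so that an unused field does not add an empty clause that matches nothing. The resulting string is the value of `q` and still needs URL encoding before it is sent.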
  • 40. Note: DISMAX doesn’t do Boolean within the text boxes; there are workarounds … edismax (Solr 1.5); faking it: http://www.stanford.edu/people/~ndushay/code4lib2010/advSearchSolrQueries.pdf
  • 41. My Favorite Places To Find Information: LucidImagination Search http://www.lucidimagination.com/search/ (NOT a coerced statement!); Solr wikis http://wiki.apache.org/solr/FrontPage
  • 42. Big, Bigger, Biggest. Large scale issues: Phrase queries and common words; OCR. Tom Burton-West, Hathi Trust Project
  • 43. Hathi Trust Large Scale Search Challenges. Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost). Challenges: Must scale to 20 million full-text volumes; Very long documents compared to most large-scale search applications; Multilingual collection; OCR quality varies
  • 44. Index Size, Caching, and Memory. Our documents average about 300 pages, which is about 700KB of OCR. Our 5 million document index is between 2 and 3 terabytes. About 300 GB per million documents. Large index means disk I/O is the bottleneck. Tradeoff JVM vs OS memory: Solr uses OS memory (disk I/O caching) for caching of postings. Memory available for disk I/O caching has most impact on response time (assuming adequate cache warming). Fitting the entire index in memory is not feasible with a terabyte size index
  • 45. Response time varies with query. Average: 673; Median: 91; 90th: 328; 99th: 7,504
  • 46. Slowest 5% of queries. The slowest 5% of queries took about 1 second or longer. The slowest 1% of queries took between 10 seconds and 2 minutes. Slowest 0.5% of queries took between 30 seconds and 2 minutes. These queries affect response time of other queries: Cache pollution; Contention for resources. Slowest queries are phrase queries containing common words. [Chart: Response Time 95th percentile (seconds); y-axis: Response Time (seconds); x-axis: Query number]
  • 47. Query processing. Phrase queries use position index (Boolean queries do not). Position index accounts for 85% of index size. Position list for common words such as “the” can be many GB in size. This causes lots of disk I/O. Solr depends on the operating system’s disk cache to reduce disk I/O requirements for words that occur in more than one query. I/O from phrase queries containing common words pollutes the cache
  • 48. Slow Queries. Slowest test query: “the lives and literature of the beat generation” took 2 minutes. 4MB data read for Boolean query. 9,000+ MB read for Phrase query.
      WORD        NUMBER OF DOCUMENTS  POSTINGS LIST (SIZE MB)  TOTAL TERM OCCURRENCES (MILLIONS)  POSITION LIST (SIZE MB)
      the         800,000              0.8                      4,351                              4,351
      of          892,000              0.89                     2,795                              2,795
      and         769,000              0.77                     1,870                              1,870
      literature  435,000              0.44                     9                                  9
      generation  414,000              0.41                     5                                  5
      lives       432,000              0.43                     5                                  5
      beat        278,000              0.28                     1                                  1
      TOTAL                            4.02                                                        9,036
  • 49. Why not use Stop Words? The word “the” occurs more than 4 billion times in our 1 million document index. Removing “stop” words (“the”, “of” etc.) not desirable for our use cases. Couldn’t search for many phrases: “to be or not to be”; “the who”; “man in the moon” vs. “man on the moon”. Stop words in one language are content words in another language: German stop words “war” and “die” are content words in English; English stop words “is” and “by” are content words (“ice” and “village”) in Swedish
  • 50. “CommonGrams”. Ported Nutch “CommonGrams” algorithm to Solr. Create Bi-Grams selectively for any two word sequence containing common terms. Slowest query: “The lives and literature of the beat generation” becomes “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”
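The query-side token stream shown on slide 50 can be reproduced with a few lines. This is a simplified illustration of the idea, not the Nutch or Solr filter code: the function name, the `-` separator (copied from the slide) and the three-word common list are assumptions. Adjacent pairs are joined whenever either word is common, and a single word is emitted only when no pair already covers it.

```python
def common_grams_query(tokens, common, sep="-"):
    """Return query-time tokens: bigrams around common words, unigrams elsewhere."""
    out = []
    covered = False  # True when the current token is already part of the previous bigram
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common or nxt in common):
            out.append(f"{tok}{sep}{nxt}")
            covered = True
        else:
            if not covered:
                out.append(tok)
            covered = False
    return out


COMMON = {"the", "and", "of"}  # assumed list; the real one came from corpus statistics
query = "the lives and literature of the beat generation".split()
print(common_grams_query(query, COMMON))
```

The benefit comes from the size of the lists being read: a phrase query over bigrams such as "of-the" touches the positions of the pair, which slide 51 shows to be far fewer than the positions of "the" or "of" alone.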
  • 51. Standard index vs. CommonGrams
      Standard Index
      WORD            TOTAL OCCURRENCES IN CORPUS (MILLIONS)  NUMBER OF DOCS (THOUSANDS)
      the             2,013                                   386
      of              1,299                                   440
      and             855                                     376
      literature      4                                       210
      lives           2                                       194
      generation      2                                       199
      beat            0.6                                     130
      TOTAL           4,176
      Common Grams
      TERM            TOTAL OCCURRENCES IN CORPUS (MILLIONS)  NUMBER OF DOCS (THOUSANDS)
      of-the          446                                     396
      generation      2.42                                    262
      the-lives       0.36                                    128
      literature-of   0.35                                    103
      lives-and       0.25                                    115
      and-literature  0.24                                    77
      the-beat        0.06                                    26
      TOTAL           450
  • 52. Comparison of Response time (ms)
                      AVERAGE  MEDIAN  90th  99th   SLOWEST QUERY
      Standard Index  459      32      146   6,784  120,595
      Common Grams    68       3       71    2,226  7,800
  • 53. Other issues: Analyze your slowest queries. We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list. We used the Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked. We discovered that words such as “l’art” were being split into two-token phrase queries. We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit
  • 54. Other issues: We broke Solr … temporarily. Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms. Solr/Lucene index size was limited to 2.1 billion unique terms. Patched: Now it’s 274 billion. Dirty OCR is difficult to remove without removing “good” words. Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon
  • 55. Q&A. Download these slides at http://bit.ly/practical-solr On demand replay is available within 24-48 hours of the live webcast