Mastering
Solr 1.4
Yonik Seeley
March 4, 2010
Agenda



     Faceting
     Trie Fields (numeric ranges)
     Distributed Search
     1.5 Preview
     Q&A
     See http://bit.ly/mastersolr for
         Slides for download after the presentation
         On-demand viewing of the webinar in ~48 hours



                                Lucid Imagination, Inc.
My background



         Creator of Solr, the Lucene Search Server
         Co-founder of Lucid Imagination
         Expertise: Distributed Search systems and performance
         Lucene/Solr committer, a member of the Lucene PMC,
         member of the Apache Software Foundation
         Work: CNET Networks, BEA, Telcordia, among others
         M.S. in Computer Science, Stanford




Getting the most from this talk



      Assuming you’ve completed the Solr tutorial
      Assuming you’re familiar with faceting and other
      high-level Solr/Lucene concepts
      I don’t expect you to have deployed Solr
      in production, but that experience will provide helpful context




Faceting Deep Dive
Existing single-valued faceting algorithm
  Request: q=Juggernaut&facet=true&facet.field=hero
  Documents matching the base query “Juggernaut”: 0, 2, 7
  Lucene FieldCache entry (StringIndex) for the “hero” field:
    order (for each doc, an index into the lookup array): 5, 3, 5, 1, 4, 5, 2, 1
    lookup (the string values): (null), batman, flash, spiderman, superman, wolverine
  For each matching doc, look up order[doc] and increment that accumulator slot
  accumulator: 0, 1, 0, 0, 0, 2


Existing single-valued faceting algorithm



      facet.method=fc
      for each doc in base set:
        ord = FieldCache.order[doc]
        accumulator[ord]++
      O(docs_matching_q)
      Not used for boolean fields
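
The counting loop above can be sketched in Python. This is only an illustration, not Solr's implementation: `order` and `lookup` stand in for the two arrays of the FieldCache StringIndex, the data is taken from the “hero” example, and the function name `facet_counts_fc` is hypothetical.

```python
def facet_counts_fc(base_docs, order, lookup):
    """Count facet values for a single-valued field (hypothetical helper)."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        # One array read and one increment per matching document.
        accumulator[order[doc]] += 1
    # Slot 0 is the "no value" entry, so it is skipped when reporting.
    return {lookup[i]: n for i, n in enumerate(accumulator) if i > 0 and n > 0}


lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]
counts = facet_counts_fc([0, 2, 7], order, lookup)
print(counts)
```

The work is proportional to the number of documents matching the query, which is where O(docs_matching_q) comes from.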




Multi-valued faceting: enum method (1 of 2)

facet.method=enum
For each term in field:
  Retrieve filter
  Calculate intersection size

  Lucene inverted index (on disk):
     batman         1 3 5 8
     flash          0 1 5
     spiderman      2 4 7
     superman       0 6 9
     wolverine      1 2 7 8
  Docs matching base query: 0, 1, 5, 9
  Solr filterCache (in memory):
     hero:batman    1 3 5 8
     hero:flash     0 1 5
  Intersection count goes into a priority queue, e.g. batman=2
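
A minimal Python sketch of the enum loop, using the posting lists from this slide. Plain sets stand in for Solr's filters here, and the name `facet_counts_enum` is hypothetical.

```python
def facet_counts_enum(base_docs, inverted_index):
    """For each term, count how many of its documents are in the base set."""
    base = set(base_docs)
    counts = {}
    for term, postings in inverted_index.items():
        # "Retrieve filter" then "calculate intersection size".
        counts[term] = len(base.intersection(postings))
    return counts


index = {
    "batman": [1, 3, 5, 8],
    "flash": [0, 1, 5],
    "spiderman": [2, 4, 7],
    "superman": [0, 6, 9],
    "wolverine": [1, 2, 7, 8],
}
enum_counts = facet_counts_enum([0, 1, 5, 9], index)
print(enum_counts)
```

Every term in the field is visited regardless of how many documents match the query.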



Multi-valued faceting: enum method (2 of 2)



      O(n_terms_in_field)
      Short circuits based on term.df
      filterCache entries int[ndocs] or BitSet(maxDoc)
      filterCache concurrency + efficiency upgrades in 1.4
      Size filterCache appropriately
      Prevent filterCache use for small terms with
      facet.enum.cache.minDf
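
The short circuit and the minDf cutoff can be sketched together. The sketch makes two assumptions that are not spelled out on the slide: a term is skipped when its document frequency cannot beat the smallest count already kept, and terms below the cutoff are counted by walking their postings without touching the cache. All names (`top_facets_enum`, `min_df`, `filter_cache`) are hypothetical.

```python
import heapq


def top_facets_enum(base_docs, inverted_index, limit, min_df=0, filter_cache=None):
    """Return the top `limit` (count, term) pairs, highest count first."""
    base = set(base_docs)
    cache = {} if filter_cache is None else filter_cache
    heap = []  # min-heap holding at most `limit` (count, term) pairs
    for term, postings in inverted_index.items():
        df = len(postings)
        # Short circuit: an intersection can never be larger than df.
        if len(heap) == limit and df <= heap[0][0]:
            continue
        if df >= min_df:
            docs = cache.setdefault(term, frozenset(postings))
        else:
            docs = postings  # small term: count directly, keep it out of the cache
        count = sum(1 for doc in docs if doc in base)
        if len(heap) < limit:
            heapq.heappush(heap, (count, term))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, term))
    return sorted(heap, key=lambda pair: (-pair[0], pair[1]))


index = {
    "batman": [1, 3, 5, 8],
    "flash": [0, 1, 5],
    "spiderman": [2, 4, 7],
    "superman": [0, 6, 9],
    "wolverine": [1, 2, 7, 8],
}
cache = {}
top = top_facets_enum([0, 1, 5, 9], index, limit=1, min_df=4, filter_cache=cache)
print(top, sorted(cache))
```

In this run spiderman and superman are skipped without any intersection, and only the two terms with df of at least 4 end up in the cache.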




Multi-valued faceting: new UnInvertedField
facet.method=fc
Like single-valued FieldCache method, but with multi-valued FieldCache
Good for many unique terms, relatively few values per doc
  Best case: 50x faster, 5x smaller (100K unique values, 1-5 per doc)
  O(n_docs), but optimization to count the inverse when n>maxDoc/2
Memory efficient
  Terms in a document are delta coded variable width term numbers
  Term number list for document packed in an int or in a shared byte[]
  Hybrid approach: “big terms” that match >5% of index use filterCache instead
  Only 1/128th of string values in memory
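
To illustrate why delta-coded, variable-width term numbers are compact, here is a generic sketch. The byte layout (7 payload bits per byte, low-order group first) is an assumption chosen for the example and is not necessarily the layout UnInvertedField uses.

```python
def encode_term_numbers(term_numbers):
    """Delta-code a sorted list of term numbers into variable-width bytes."""
    out = bytearray()
    previous = 0
    for number in term_numbers:
        delta = number - previous
        previous = number
        # 7 payload bits per byte; the high bit means "more bytes follow".
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_term_numbers(data):
    """Invert encode_term_numbers."""
    numbers, current, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += delta
            numbers.append(current)
            shift, delta = 0, 0
    return numbers


terms_in_doc = [3, 130, 131, 70000]
packed = encode_term_numbers(terms_in_doc)
print(len(packed), decode_term_numbers(packed))
```

Small gaps between consecutive term numbers fit in a single byte each, so a document with a handful of values needs only a few bytes.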


Faceting: fieldValueCache
Implicit cache with UnInvertedField entries
  Not autowarmed – use static warming request
  http://localhost:8983/solr/admin/stats.jsp (mem size, time to create, etc)







item_cat:{field=cat,memSize=5376,tindexSize=52,time=2,phase1=2,
nTerms=16,bigTerms=10,termInstances=6,uses=44}




Migrating from 1.3 to 1.4 faceting


   filterCache
     Lower size (need room for normal fq filters and “big terms”)
     Lower or eliminate autowarm count
   1.3 enum method can sometimes be better
     f.<fieldname>.facet.method=enum
     Field has many values per document and many unique values
     Huge index without faceting time constraints




Multi-select faceting
http://search.lucidimagination.com
                                      Very generic support
                                            Reuses localParams syntax {!name=val}
                                            Ability to tag filters
                                            Ability to exclude certain filters when
                                            faceting, by tag
                                     q=index replication&facet=true
                                     &fq={!tag=proj}project:(lucene OR solr)
                                     &facet.field={!ex=proj}project
                                     &facet.field={!ex=src}source
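
A client has to URL-encode the localParams braces when sending this request. A small sketch with Python's standard library follows. The host and the `/solr/select` path are assumptions for a default local install.

```python
from urllib.parse import parse_qsl, urlencode

# Repeated keys (facet.field) are kept by passing a list of pairs.
params = [
    ("q", "index replication"),
    ("facet", "true"),
    ("fq", "{!tag=proj}project:(lucene OR solr)"),  # tag the filter
    ("facet.field", "{!ex=proj}project"),           # exclude it for this facet
    ("facet.field", "{!ex=src}source"),
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string  # assumed endpoint
print(url)
```

Because the project facet excludes the `proj` filter, its counts are computed as if that filter were not applied, which is what lets a user pick several projects at once.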




New Trie Fields
(numeric ranges)
New Trie* fields

  Numeric and date fields are indexed at multiple precisions to speed up range queries
  Base10 Example: 175 is indexed as hundreds:1 tens:17 ones:175
  TrieRangeQuery:[154 TO 183] is executed as
  tens:[16 TO 17] OR ones:[154 TO 159] OR ones:[180 TO 183]
  Best case: 40x speedup of range queries
  Configurable precisionStep per field (expressed in bits)
     precisionStep=8 means index first 8, then first 16, first 24, and 32 bits
     precisionStep=0 means index normally
  Extra bonus: more memory efficient FieldCache entries
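
The base-10 example can be reproduced with a short sketch. It illustrates the splitting idea only and is not Lucene's actual range-splitting code. The function names and the fixed three levels are assumptions, and inputs are taken to be non-negative integers.

```python
NAMES = ("ones", "tens", "hundreds")


def index_terms(value, base=10):
    """Terms indexed for one value, one per precision level."""
    return {name: value // base ** shift for shift, name in enumerate(NAMES)}


def split_range(lo, hi, base=10):
    """Cover [lo, hi] with as few (level, low, high) clauses as possible."""
    clauses = []
    shift = 0
    while True:
        unit = base ** shift
        step = unit * base
        lo_aligned = -(-lo // step) * step    # round lo up to the coarser level
        hi_aligned = (hi + 1) // step * step  # exclusive bound, rounded down
        if shift == len(NAMES) - 1 or lo_aligned >= hi_aligned:
            clauses.append((NAMES[shift], lo // unit, hi // unit))
            return clauses
        if lo < lo_aligned:   # ragged lower edge stays at this precision
            clauses.append((NAMES[shift], lo // unit, (lo_aligned - 1) // unit))
        if hi_aligned <= hi:  # ragged upper edge stays at this precision
            clauses.append((NAMES[shift], hi_aligned // unit, hi // unit))
        lo, hi = lo_aligned, hi_aligned - 1
        shift += 1


print(index_terms(175))
print(split_range(154, 183))
```

The middle of the range is matched by two coarse terms instead of twenty fine-grained ones. The fewer terms a range query has to visit, the faster it runs.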


Trie* Index Overhead
100,000 documents, random 32 bit integers from 0-1000
Precision Step                    Index Size*                              Index Size Multiplier
0                                 223K                                     1
8                                 588K                                     2.6
6                                 838K                                     3.7
4                                 1095K                                    4.9

100,000 documents, random 32 bit integers
Precision Step                    Index Size*                              Index Size Multiplier
0                                 1.17M                                    1
8                                 3.03M                                    2.6
6                                 3.86M                                    3.3
4                                 5.47M                                    4.7

    *Index Size reflects only the portion of the index related to indexing the field

Schema migration

   Use int, float, long, double, date for normal numerics
     <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
     omitNorms="true" positionIncrementGap="0"/>
   Use tint, tfloat, tlong, tdouble, tdate for faster range queries
     <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
     omitNorms="true" positionIncrementGap="0"/>
     Date faceting also uses range queries
   date, tdate, NOW can be used in function queries
     Date boosting: recip(ms(NOW,mydatefield),3.16e-11,1,1)
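
To see what the date boost evaluates to, note that recip(x,m,a,b) computes a/(m*x+b) and that 3.16e-11 is roughly 1 divided by the number of milliseconds in a year. A quick numeric sketch, with `date_boost` as a hypothetical helper name:

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """Boost for a document whose date is age_ms milliseconds before NOW."""
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 2, 9):
    print(years, round(date_boost(years * MS_PER_YEAR), 3))
```

A brand-new document gets a boost of 1, a one-year-old document about 1/2, and a two-year-old document about 1/3, so the boost decays smoothly with age.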




Distributed Search
Distributed Search



      Split into multiple shards
        When single query latency too long
        • Super-linear speedup possible
        Optimal performance when free RAM > shard size
        • Minimum: RAM > (shard index size – stored_fields)
      Use replication for HA & increased capacity (queries/sec)




Distributed Search & Sharding
[Diagram: two deployment layouts. Labels: searchers, replication, master handles updates]
  2x4: http://...?shards=VIP1,VIP2
       VIP1, VIP2
  4x2: http://...?shards=VIP1,VIP2,VIP3,VIP4
       VIP1, VIP2, VIP3, VIP4

Distributed Search: Use Cases



      2 shards x 4 replicas
        Fewer masters to manage (and fewer total servers)
        Less increased load on other replicas when one goes down (33%
        vs 100%)
        Less network bandwidth
      4 shards x 2 replicas
        Greater indexing bandwidth
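
The 33% vs 100% figures follow from simple arithmetic, assuming queries are spread evenly over the replicas of a shard. The function name is hypothetical.

```python
def load_increase_after_failure(replicas_per_shard):
    """Fractional extra load on each surviving replica when one replica fails."""
    survivors = replicas_per_shard - 1
    if survivors < 1:
        raise ValueError("no replica left to absorb the load")
    # Each survivor goes from 1/replicas of the traffic to 1/survivors.
    return replicas_per_shard / survivors - 1


for replicas in (4, 2):
    print(replicas, f"{load_increase_after_failure(replicas):.0%}")
```

With 4 replicas the three survivors each take on a third more traffic. With 2 replicas the single survivor takes on double its previous load.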




Index Partitioning

  Partition by date
    Easy incremental scalability - add more servers over time as needed
    Easy to remove oldest data
    Enables increased replication factor for newest data
  Partition by document id
    Works well for updating
    Not easily resizable
  Partitioning to allow querying a subset of shards
    Increases system throughput, decreases network bandwidth
    Partition by userId for mailbox search
    Partition by region for geographic search
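
One common way to partition by document id is to hash the id and take it modulo the shard count. The slide does not name a scheme, so the CRC32-mod approach and the `shard_for` helper below are assumptions for illustration.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


ids = [f"doc{i}" for i in range(1000)]
# An update to a document always lands on the shard that holds the old copy.
assert all(shard_for(d, 4) == shard_for(d, 4) for d in ids)

# Going from 4 to 5 shards changes the target shard for many documents.
moved = sum(1 for d in ids if shard_for(d, 4) != shard_for(d, 5))
print(f"{moved} of {len(ids)} documents would move")
```

The stable mapping is why this scheme works well for updating. Changing the shard count remaps documents, which is why it is not easily resizable.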

1.5 Preview
1.5 Preview: SolrCloud
Baby steps toward simplifying cluster management
Integrates ZooKeeper
   Central configuration (solrconfig.xml, etc)
   Tracks live nodes
   Tracks shards of collections
Removes need for external load balancers
   shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

Can specify logical shard ids
   shards=NY_shard,NJ_shard

Clients don’t need to know shards:
http://localhost:8983/solr/collection1/select?distrib=true
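
In the shards syntax shown above, commas separate shards and `|` separates equivalent replicas of one shard. A sketch of how a client might read such a value follows. `parse_shards` is a hypothetical helper, not a Solr API.

```python
def parse_shards(value):
    """Split a shards parameter into a list of shards, each a list of replicas."""
    return [shard.split("|") for shard in value.split(",")]


layout = parse_shards(
    "localhost:8983/solr|localhost:8900/solr,"
    "localhost:7574/solr|localhost:7500/solr"
)
for number, replicas in enumerate(layout, start=1):
    print(f"shard {number}: any of {replicas}")
```

A request only needs one replica from each inner list, which is what makes an external load balancer unnecessary.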

1.5 Preview: Spatial Search

 PointType
   Compound values: 38.89,-77.03
   Range queries and exact matches supported
 Distance Functions
   haversine
 Sorting by function query
 Still needed
   Ability to return sort values
   Distance filtering
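
For reference, the haversine great-circle distance can be computed as below. This is the textbook formula, not Solr's code. The Earth radius constant and the function name are assumptions of the sketch, and the second coordinate pair is an arbitrary example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# The point from the PointType example against an arbitrary second point.
print(haversine_km(38.89, -77.03, 40.71, -74.01))
```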



Q&A
Resources

   Apache Solr web site
     http://lucene.apache.org/solr
   LucidWorks: free Certified Distribution of Solr + Reference Guide
     http://www.lucidimagination.com/Downloads
   Search all of Lucene/Solr (wiki, mailing lists, issues, ref man, etc)
     http://search.lucidimagination.com
   Download slides (in ~4 hours) & re-play this talk (~48 hours)
      http://bit.ly/mastersolr
   Thanks for coming!



 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Mastering Solr 1.4 with Yonik Seeley

  • 2. Agenda
      Faceting
      Trie Fields (numeric ranges)
      Distributed Search
      1.5 Preview
      Q&A
      See http://bit.ly/mastersolr for
          Slides for download after the presentation
          On-demand viewing of the webinar in ~48 hours
      Lucid Imagination, Inc.
  • 3. My background
      Creator of Solr, the Lucene Search Server
      Co-founder of Lucid Imagination
      Expertise: distributed search systems and performance
      Lucene/Solr committer, member of the Lucene PMC, member of the Apache Software Foundation
      Work: CNET Networks, BEA, Telcordia, among others
      M.S. in Computer Science, Stanford
  • 4. Getting the most from this talk
      Assuming you’ve completed the Solr tutorial
      Assuming you’re familiar with faceting and other high-level Solr/Lucene concepts
      I don’t expect you to have deployed Solr in production, but having done so provides helpful context
  • 6. Existing single-valued faceting algorithm (diagram)
      Request: q=Juggernaut&facet=true&facet.field=hero
      Documents matching the base query “Juggernaut”: 0, 2, 7
      Lucene FieldCache entry (StringIndex) for the “hero” field
          order: for each doc, an index into the lookup array: 5, 3, 5, 1, 4, 5, 2, 1
          lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
      Each matching doc is looked up in order, and the accumulator slot for that index is incremented
          accumulator after counting: 0, 1, 0, 0, 0, 2
  • 7. Existing single-valued faceting algorithm
      facet.method=fc
      for each doc in base set:
          ord = FieldCache.order[doc]
          accumulator[ord]++
      O(docs_matching_q)
      Not used for boolean fields
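The counting loop on slide 7 can be sketched in Python as follows. This is a minimal illustration, not Solr's code: the function name is hypothetical, and the sample arrays are an assumption reconstructed from the flattened slide 6 diagram.

```python
def facet_counts_fc(base_docs, order, lookup):
    """Count facet values for a single-valued field, FieldCache style.

    order[doc] is the index into lookup for that doc's value;
    lookup[0] is None and stands for "no value".
    """
    accumulator = [0] * len(lookup)
    for doc in base_docs:              # O(docs matching q)
        accumulator[order[doc]] += 1
    # Report only real values that were seen at least once.
    return {lookup[i]: n for i, n in enumerate(accumulator)
            if n > 0 and lookup[i] is not None}


# Sample data as read from the slide 6 diagram (an assumption).
order = [5, 3, 5, 1, 4, 5, 2, 1]
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
counts = facet_counts_fc([0, 2, 7], order, lookup)
print(counts)
```

The cost depends only on the number of matching documents, which is why this method suits fields with many distinct values.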
  • 8. Multi-valued faceting: enum method (1 of 2)
      facet.method=enum
      For each term in field:
          Retrieve filter
          Calculate intersection size
      Diagram: Solr filterCache (in memory) holds the filters hero:batman and hero:flash; each filter is intersected with the docs matching the base query (0, 1, 5), and the resulting count goes into a priority queue (batman=2)
      Lucene Inverted Index (on disk)
          batman      1 3 5 8
          flash       0 1 5
          spiderman   2 4 7 9
          superman    0 6 9
          wolverine   1 2 7 8
  • 9. Multi-valued faceting: enum method (2 of 2)
      O(n_terms_in_field)
      Short circuits based on term.df
      filterCache entries: int[ndocs] or BitSet(maxDoc)
      filterCache concurrency + efficiency upgrades in 1.4
      Size filterCache appropriately
      Prevent filterCache use for small terms with facet.enum.cache.minDf
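The enum method on slides 8 and 9 can be sketched roughly as below. This is an illustrative assumption, not Solr's implementation: plain Python sets stand in for filterCache entries, the function name and the `limit` and `min_count` parameters are hypothetical, and the posting lists and base doc set are read from the slide 8 diagram.

```python
import heapq


def facet_counts_enum(base_docs, postings, limit=2, min_count=1):
    """Count facet values by intersecting each term's doc set with the base set."""
    base = set(base_docs)
    counts = []
    for term, docs in postings.items():    # O(number of terms in the field)
        if len(docs) < min_count:           # short circuit on document frequency
            continue
        n = len(base.intersection(docs))
        if n >= min_count:
            counts.append((n, term))
    # Keep only the top `limit` terms, like the priority queue on the slide.
    return [(term, n) for n, term in heapq.nlargest(limit, counts)]


postings = {
    "batman": [1, 3, 5, 8],
    "flash": [0, 1, 5],
    "spiderman": [2, 4, 7, 9],
    "superman": [0, 6, 9],
    "wolverine": [1, 2, 7, 8],
}
top = facet_counts_enum([0, 1, 5], postings)
print(top)
```

Because the loop visits every term, the cost grows with the number of unique terms rather than with the number of matching documents.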
  • 10. Multi-valued faceting: new UnInvertedField
      facet.method=fc
      Like the single-valued FieldCache method, but with a multi-valued FieldCache
      Good for many unique terms, relatively few values per doc
          Best case: 50x faster, 5x smaller (100K unique values, 1-5 per doc)
      O(n_docs), but with an optimization to count the inverse when n>maxDoc/2
      Memory efficient
          Terms in a document are delta-coded, variable-width term numbers
          Term number list for a document is packed in an int or in a shared byte[]
          Hybrid approach: “big terms” that match >5% of the index use the filterCache instead
          Only 1/128th of string values in memory
  • 12. Faceting: fieldValueCache
      Implicit cache with UnInvertedField entries
      Not autowarmed: use a static warming request
      http://localhost:8983/solr/admin/stats.jsp (mem size, time to create, etc)
          item_cat:{field=cat,memSize=5376,tindexSize=52,time=2,phase1=2,nTerms=16,bigTerms=10,termInstances=6,uses=44}
  • 13. Migrating from 1.3 to 1.4 faceting
      filterCache
          Lower size (need room for normal fq filters and “big terms”)
          Lower or eliminate autowarm count
      1.3 enum method can sometimes be better
          f.<fieldname>.facet.method=enum
          Field has many values per document and many unique values
          Huge index without faceting time constraints
  • 14. Multi-select faceting
      http://search.lucidimagination.com
      Very generic support
          Reuses localParams syntax {!name=val}
          Ability to tag filters
          Ability to exclude certain filters when faceting, by tag
      q=index replication&facet=true
          &fq={!tag=proj}project:(lucene OR solr)
          &facet.field={!ex=proj}project
          &facet.field={!ex=src}source
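As a sketch of how a client might assemble the multi-select request from slide 14, the snippet below URL-encodes the same parameters with Python's standard library. The base URL (a local Solr with the default select handler) is an assumption and not part of the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: a local Solr instance with the default select handler.
BASE_URL = "http://localhost:8983/solr/select"

params = [
    ("q", "index replication"),
    ("facet", "true"),
    # Tag the project filter so facets can ignore it by name.
    ("fq", "{!tag=proj}project:(lucene OR solr)"),
    # Count project values as if the filter tagged "proj" were not applied.
    ("facet.field", "{!ex=proj}project"),
    ("facet.field", "{!ex=src}source"),
]

# A list of pairs (rather than a dict) keeps the repeated facet.field parameter.
url = BASE_URL + "?" + urlencode(params)
print(url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(url).query)
```

Excluding the tagged filter when counting the project facet is what lets a UI keep showing the other project values after one has been selected.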
  • 16. New Trie* fields
      Numeric and date fields are indexed at multiple precisions to speed up range queries
      Base10 example: 175 is indexed as hundreds:1 tens:17 ones:175
          TrieRangeQuery:[154 TO 183] is executed as tens:[16 TO 17] OR ones:[154 TO 159] OR ones:[180 TO 183]
      Best case: 40x speedup of range queries
      Configurable precisionStep per field (expressed in bits)
          precisionStep=8 means index the first 8, then the first 16, the first 24, and all 32 bits
          precisionStep=0 means index normally
      Extra bonus: more memory-efficient FieldCache entries
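The base-10 example on slide 16 can be reproduced with a short sketch. This is only an illustration of the idea under stated assumptions: the function name and level names are hypothetical, it handles non-negative integers with lo <= hi, and the real Trie fields split on bits (precisionStep) rather than decimal digits.

```python
def split_range(lo, hi, levels=("ones", "tens", "hundreds")):
    """Split [lo, hi] into sub-ranges over coarser and coarser decimal precisions."""
    ranges = []
    for i, level in enumerate(levels):
        has_low_edge = lo % 10 != 0     # lo does not start a block of ten
        has_high_edge = hi % 10 != 9    # hi does not end a block of ten
        next_lo = lo // 10 + (1 if has_low_edge else 0)
        next_hi = hi // 10 - (1 if has_high_edge else 0)
        if i == len(levels) - 1 or next_lo > next_hi:
            # Nothing left for a coarser level: cover the rest at this level.
            ranges.append((level, lo, hi))
            break
        if has_low_edge:
            ranges.append((level, lo, lo // 10 * 10 + 9))
        if has_high_edge:
            ranges.append((level, hi // 10 * 10, hi))
        lo, hi = next_lo, next_hi       # continue one precision level up
    return ranges


parts = split_range(154, 183)
print(parts)
```

For [154 TO 183] this yields the two ones-level edge ranges plus tens:[16 TO 17], the same three clauses listed on the slide, so a wide range is matched with a handful of terms instead of one term per value.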
  • 17. Trie* Index Overhead
      100,000 documents, random 32 bit integers from 0-1000
          Precision Step   Index Size*   Index Size Multiplier
          0                223K          1
          8                588K          2.6
          6                838K          3.7
          4                1095K         4.9
      100,000 documents, random 32 bit integers
          Precision Step   Index Size*   Index Size Multiplier
          0                1.17M         1
          8                3.03M         2.6
          6                3.86M         3.3
          4                5.47M         4.7
      *Index Size reflects only the portion of the index related to indexing the field
  • 18. Schema migration
      Use int, float, long, double, date for normal numerics
          <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
      Use tint, tfloat, tlong, tdouble, tdate for faster range queries
          <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
      Date faceting also uses range queries
      date, tdate, NOW can be used in function queries
          Date boosting: recip(ms(NOW,mydatefield),3.16e-11,1,1)
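To see what the date boost on slide 18 evaluates to, here is a small worked sketch. It assumes Solr's documented definition recip(x,m,a,b) = a/(m*x + b); the helper names and the sample document ages are illustrative. The constant 3.16e-11 is roughly one divided by the number of milliseconds in a year.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000   # about 3.16e10 milliseconds


def recip(x, m, a, b):
    """Reciprocal function: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(age_ms):
    """Boost for a document that is age_ms milliseconds old (ms(NOW,mydatefield))."""
    return recip(age_ms, 3.16e-11, 1, 1)


brand_new = date_boost(0)
one_year = date_boost(MS_PER_YEAR)
two_years = date_boost(2 * MS_PER_YEAR)
print(brand_new, round(one_year, 3), round(two_years, 3))
```

A brand-new document gets a boost of 1, a one-year-old document about one half, and a two-year-old document about one third, so the boost decays smoothly with age.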
  • 20. Distributed Search
      Split into multiple shards
          When single query latency is too long
              Super-linear speedup possible
          Optimal performance when free RAM > shard size
              Minimum: RAM > (shard index size - stored_fields)
      Use replication for HA & increased capacity (queries/sec)
  • 21. Distributed Search & Sharding (diagram)
      2x4: http://...?shards=VIP1,VIP2
          Each VIP fronts a group of searchers; replication keeps them current from a master, and the master handles updates
      4x2: http://...?shards=VIP1,VIP2,VIP3,VIP4
  • 22. Distributed Search: Use Cases
      2 shards x 4 replicas
          Fewer masters to manage (and fewer total servers)
          Less increased load on other replicas when one goes down (33% vs 100%)
          Less network bandwidth
      4 shards x 2 replicas
          Greater indexing bandwidth
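The "33% vs 100%" figure on slide 22 follows from simple arithmetic, sketched below. The sketch assumes queries for a shard are spread evenly across its replicas; the function name is hypothetical.

```python
def extra_load_when_one_replica_fails(replicas_per_shard):
    """Fractional load increase on each surviving replica of a shard."""
    if replicas_per_shard < 2:
        raise ValueError("need at least two replicas to survive a failure")
    # Each replica carried 1/r of the traffic; afterwards each carries 1/(r-1).
    before = 1 / replicas_per_shard
    after = 1 / (replicas_per_shard - 1)
    return after / before - 1


four_replicas = extra_load_when_one_replica_fails(4)   # 2 shards x 4 replicas
two_replicas = extra_load_when_one_replica_fails(2)    # 4 shards x 2 replicas
print(f"{four_replicas:.0%} vs {two_replicas:.0%}")
```

With four replicas the three survivors each absorb about a third more traffic, while with two replicas the single survivor takes double its previous load.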
  • 23. Index Partitioning
      Partition by date
          Easy incremental scalability: add more servers over time as needed
          Easy to remove oldest data
          Enables increased replication factor for newest data
      Partition by document id
          Works well for updating
          Not easily resizable
      Partitioning to allow querying a subset of shards
          Increases system throughput, decreases network bandwidth
          Partition by userId for mailbox search
          Partition by region for geographic search
  • 25. 1.5 Preview: SolrCloud
      Baby steps toward simplifying cluster management
      Integrates ZooKeeper
          Central configuration (solrconfig.xml, etc)
          Tracks live nodes
          Tracks shards of collections
      Removes need for external load balancers
          shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
      Can specify logical shard ids
          shards=NY_shard,NJ_shard
      Clients don’t need to know shards:
          http://localhost:8983/solr/collection1/select?distrib=true
  • 26. 1.5 Preview: Spatial Search
      PointType
          Compound values: 38.89,-77.03
          Range queries and exact matches supported
      Distance functions
          haversine
      Sorting by function query
      Still needed
          Ability to return sort values
          Distance filtering
  • 27. Q&A
  • 28. Resources
      Apache Solr web site
          http://lucene.apache.org/solr
      LucidWorks: free Certified Distribution of Solr + Reference Guide
          http://www.lucidimagination.com/Downloads
      Search all of Lucene/Solr (wiki, mailing lists, issues, ref man, etc)
          http://search.lucidimagination.com
      Download slides (in ~4 hours) & re-play this talk (~48 hours)
          http://bit.ly/mastersolr
      Thanks for coming!