Powerful Full-Text Search
       with Solr
            Yonik Seeley
          yonik@apache.org
         Web 2.0 Expo, Berlin
          8 November 2007


               download at
      http://www.apache.org/~yonik
What is Lucene
• High performance, scalable, full-text
  search library
• Focus: Indexing + Searching Documents
  – “Document” is just a list of name+value pairs
• No crawlers or document parsing
• Flexible Text Analysis (tokenizers + token
  filters)
• 100% Java, no dependencies, no config
  files
What is Solr
•   A full text search server based on Lucene
•   XML/HTTP, JSON Interfaces
•   Faceted Search (category counting)
•   Flexible data schema to define types and fields
•   Hit Highlighting
•   Configurable Advanced Caching
•   Index Replication
•   Extensible Open Architecture, Plugins
•   Web Administration Interface
•   Written in Java5, deployable as a WAR
Basic App
[Architecture diagram]
• Indexer posts a Document to http://solr/update
    super_name: Mr. Fantastic
    name: Reed Richards
    category: superhero
    powers: elasticity
• Webapp sends a Query (powers:agility) to http://solr/select, receives a
  Query Response (matching docs), and renders HTML
• Solr runs in a Servlet Container and exposes admin, update, and select
  – Update handlers: XML Update Handler, CSV Update Handler
  – Request handlers: Standard request handler, Custom request handler
  – Response writers: XML response writer, JSON response writer
• Solr sits on top of Lucene
Indexing Data
HTTP POST to http://localhost:8983/solr/update
<add><doc>
 <field name="id">05991</field>
 <field name="name">Peter Parker</field>
 <field name="supername">Spider-Man</field>
 <field name="category">superhero</field>
 <field name="powers">agility</field>
 <field name="powers">spider-sense</field>
</doc></add>
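
As a rough illustration, the add message above could be assembled with Python's standard library instead of by hand. The helper name build_add_xml is hypothetical; only the element names and field values come from the slide. Sending the result would be an HTTP POST to the update URL, which is left out here.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc> message from (name, value) pairs.

    Repeating a name, as with "powers" below, produces one <field>
    element per value, which is how the slide expresses multiple values.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml([
    ("id", "05991"),
    ("name", "Peter Parker"),
    ("supername", "Spider-Man"),
    ("category", "superhero"),
    ("powers", "agility"),
    ("powers", "spider-sense"),
])
print(xml_body)
```

Building the message through an XML library also takes care of escaping characters such as & and < in field values.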
Indexing CSV data
Iron Man, Tony Stark, superhero, powered armor | flight
Sandman, William Baker|Flint Marko, supervillain, sand transform
Wolverine,James Howlett|Logan, superhero, healing|adamantium
Magneto, Erik Lehnsherr, supervillain, magnetism|electricity




 http://localhost:8983/solr/update/csv?
         fieldnames=supername,name,category,powers
         &separator=,
         &f.name.split=true&f.name.separator=|
         &f.powers.split=true&f.powers.separator=|
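
To make the split parameters concrete, here is a small local simulation of what they ask for: each row becomes one document, and the name and powers columns are split on | into multiple values. This is only a sketch, not Solr's CSV loader; the function name csv_to_docs is made up, and stripping surrounding whitespace is an assumption made for readability.

```python
import csv
import io

FIELDNAMES = ["supername", "name", "category", "powers"]
# Mirrors f.name.split / f.name.separator and f.powers.split / f.powers.separator
SPLIT_FIELDS = {"name": "|", "powers": "|"}


def csv_to_docs(text):
    """Turn CSV rows into dicts, splitting the configured fields into lists."""
    docs = []
    for row in csv.reader(io.StringIO(text), delimiter=","):
        doc = {}
        for field, raw in zip(FIELDNAMES, row):
            separator = SPLIT_FIELDS.get(field)
            if separator:
                # Assumption: whitespace around each value is dropped
                doc[field] = [part.strip() for part in raw.split(separator)]
            else:
                doc[field] = raw.strip()
        docs.append(doc)
    return docs


sample = "Wolverine,James Howlett|Logan, superhero, healing|adamantium\n"
docs = csv_to_docs(sample)
print(docs[0])
```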
Data upload methods
URL=http://localhost:8983/solr/update/csv


• HTTP POST body (curl, HttpClient, etc.)
curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
• Multi-part file upload (browsers)
• Request parameter
?stream.body='Cyclops, Scott Summers,…'
• Streaming from URL (must enable)
?stream.url=file://data/info.csv
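
The first method, an HTTP POST body, might look like this from Python's standard library. The sketch only prepares the request so it can be inspected without a running server; the function name build_csv_upload is hypothetical, and the row is taken from the CSV slide.

```python
import urllib.request

URL = "http://localhost:8983/solr/update/csv"


def build_csv_upload(csv_text, url=URL):
    """Prepare (but do not send) a POST request carrying CSV in the body."""
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-type": "text/plain; charset=utf-8"},
        method="POST",
    )


request = build_csv_upload(
    "Magneto, Erik Lehnsherr, supervillain, magnetism|electricity\n"
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```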
Indexing with SolrJ
// Solr's Java Client API… remote or embedded/local!
SolrServer server = new
   CommonsHttpSolrServer("http://localhost:8983/solr");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("supername","Daredevil");
doc.addField("name","Matt Murdock");
doc.addField("category","superhero");

server.add(doc);
server.commit();
Deleting Documents
• Delete by Id, most efficient
<delete>
 <id>05591</id>
 <id>32552</id>
</delete>

• Delete by Query
<delete>
 <query>category:supervillain</query>
</delete>
Commit
• <commit/> makes changes visible
  – Triggers static cache warming in
    solrconfig.xml
  – Triggers autowarming from existing caches
• <optimize/> does the same as commit and also merges all
  index segments for faster searching
Lucene Index Segments (files on disk)
  segment _0: _0.fnm, _0.fdt, _0.fdx, _0.frq, _0.tis, _0.tii, _0.prx,
              _0.nrm, _0_1.del
  segment _1: _1.fnm, _1.fdt, _1.fdx, […]
Searching
http://localhost:8983/solr/select?q=powers:agility
       &start=0&rows=2&fl=supername,category
<response>
 <result numFound="427" start="0">
   <doc>
    <str name="supername">Spider-Man</str>
    <str name="category">superhero</str>
   </doc>
   <doc>
    <str name="supername">Mystique</str>
    <str name="category">supervillain</str>
   </doc>
 </result>
</response>
Response Format
• Add &wt=json for JSON formatted response

{"result": {"numFound":427, "start":0,
  "docs": [
     {"supername":"Spider-Man", "category":"superhero"},
     {"supername":"Mystique", "category":"supervillain"}
   ]
}}

• Also Python, Ruby, PHP, SerializedPHP, XSLT
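
A client can consume the JSON form with any JSON parser. The sketch below parses a payload shaped like the one on this slide; the function name summarize is hypothetical. The top-level key is left as a parameter because this slide prints it as "result" while the faceting and highlighting examples in this deck print "response".

```python
import json


def summarize(raw_json, top_key="result"):
    """Return (numFound, list of supername values) from a JSON search response."""
    result = json.loads(raw_json)[top_key]
    names = [doc["supername"] for doc in result["docs"]]
    return result["numFound"], names


raw = """
{"result": {"numFound": 427, "start": 0,
  "docs": [
    {"supername": "Spider-Man", "category": "superhero"},
    {"supername": "Mystique", "category": "supervillain"}
  ]
}}
"""
num_found, names = summarize(raw)
print(num_found, names)
```

Note that numFound is the total number of matches, while docs holds only the page selected by start and rows.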
Scoring
• Query results are sorted by score descending
• VSM – Vector Space Model
• tf – term frequency: number of times a matching term occurs in the field
• lengthNorm – number of tokens in field
• idf – inverse document frequency
• coord – coordination factor, number of matching
  terms
• document boost
• query clause boost

http://lucene.apache.org/java/docs/scoring.html
Explain
http://solr/select?q=super fast&indent=on&debugQuery=on

<lst name="debug">
 <lst name="explain">
   <str name="id=Flash,internal_docid=6">
0.16389132 = (MATCH) product of:
 0.32778263 = (MATCH) sum of:
   0.32778263 = (MATCH) weight(text:fast in 6), product of:
    0.5012072 = queryWeight(text:fast), product of:
      2.466337 = idf(docFreq=5)
      0.20321926 = queryNorm
    0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
      1.4142135 = tf(termFreq(text:fast)=2)
      2.466337 = idf(docFreq=5)
      0.1875 = fieldNorm(field=text, doc=6)
 0.5 = coord(1/2)
  </str>
  <str name="id=Superman,internal_docid=7">
0.1365761 = (MATCH) product of:
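
The numbers in the first explanation can be reproduced by hand. The sketch below assumes the default similarity's formulas, tf = sqrt(termFreq) and idf = 1 + ln(numDocs / (docFreq + 1)); the index size of 26 documents is not stated on the slide and is inferred here so that idf(docFreq=5) comes out at 2.466337. queryNorm, fieldNorm, and coord are copied from the output.

```python
import math


def tf(term_freq):
    """Term frequency factor (assumed default: square root of the raw count)."""
    return math.sqrt(term_freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency (assumed default formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 26          # assumption: inferred from the idf value in the output
QUERY_NORM = 0.20321926
FIELD_NORM = 0.1875
COORD = 1 / 2          # one of the two query terms matched

query_weight = idf(5, NUM_DOCS) * QUERY_NORM
field_weight = tf(2) * idf(5, NUM_DOCS) * FIELD_NORM
score = COORD * query_weight * field_weight
print(f"queryWeight={query_weight:.7f} fieldWeight={field_weight:.7f} score={score:.7f}")
```

Reading the tree bottom-up this way shows which factor is responsible when a document ranks lower than expected: here the coord factor halves the score because only "fast" matched.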
Lucene Query Syntax
1. justice league
   • Equiv: justice OR league
   • QueryParser default operator is "OR"/optional
2. +justice +league -name:aquaman
   • Equiv: justice AND league NOT name:aquaman
3. "justice league" -name:aquaman
4. title:spiderman^10 description:spiderman
5. description:"spiderman movie"~100
Lucene Query Examples 2
1. releaseDate:[2000 TO 2007]
2. Wildcard searches: sup?r, su*r, super*
3. spider~
  •   Fuzzy search: Levenshtein distance
  •   Optional minimum similarity: spider~0.7
4. *:*
5. (Superman AND "Lex Luthor") OR
   (+Batman +Joker)
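
Fuzzy search (example 3) is based on Levenshtein edit distance. The sketch below computes that distance and turns it into a similarity between 0 and 1. The normalization (one minus distance divided by the shorter term's length) and the strict greater-than comparison are assumptions about how the minimum similarity is applied, not a copy of Lucene's code; the function names are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the shorter term."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - levenshtein(a, b) / shortest


def fuzzy_match(query_term, indexed_term, min_similarity=0.5):
    return similarity(query_term, indexed_term) > min_similarity


print(levenshtein("spider", "spyder"), fuzzy_match("spider", "spyder", 0.7))
```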
DisMax Query Syntax
•   Good for handling raw user queries
    – Balanced quotes for phrase query
    – ‘+’ for required, ‘-’ for prohibited
    – Separates query terms from query structure
http://solr/select?qt=dismax
 &q=super man                       // the user query
 &qf=title^3 subject^2 body         // fields to query
 &pf=title^2,body                   // fields to do phrase queries
 &ps=100                            // slop for those phrase q’s
 &tie=.1                            // multi-field match reward
 &mm=2                              // # of terms that should match
 &bf=popularity                     // boost function
DisMax Query Form
• The expanded Lucene Query:

+( DisjunctionMaxQuery( title:super^3 |
  subject:super^2 | body:super)
  DisjunctionMaxQuery( title:man^3 |
  subject:man^2 | body:man)
)
DisjunctionMaxQuery(title:"super man"~100^2 |
  body:"super man"~100)
FunctionQuery(popularity)

• Tip: set up your own request handler with default parameters
  to avoid clients having to specify them
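
The role of the tie parameter is easier to see with numbers. A DisjunctionMaxQuery scores a document by its best-matching field, plus tie times the scores of the other matching fields. The per-field scores below are invented for illustration, and dismax_score is a hypothetical name.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie * (sum of the remaining field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for the term "super" in title, subject, and body
super_scores = [0.9, 0.3, 0.2]

print(dismax_score(super_scores, tie=0.0))  # only the best field counts
print(dismax_score(super_scores, tie=0.1))  # other fields add a small reward
print(dismax_score(super_scores, tie=1.0))  # plain sum across fields
```

With tie=0 a term found in three fields scores no higher than a term found in its best field alone; raising tie toward 1 moves the behavior toward an ordinary sum.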
Function Query
• Allows adding a function of a field value to the score
    – Boost recently added or popular documents
• Current parser only supports function notation
• Example: log(sum(popularity,1))
• Available functions: sum, product, div, log, sqrt, abs, pow
• scale(x, target_min, target_max)
    – calculates min & max of x across all docs
• map(x, min, max, target)
    – useful for dealing with defaults
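
The functions listed above can be mimicked in a few lines to see what values they produce. This is a plain-Python sketch of the described behavior, not Solr's implementation: the fq_ names are made up, log is assumed to be base 10, and map is assumed to replace values inside the inclusive range [min, max] with target.

```python
import math


def fq_log(x):
    return math.log10(x)  # assumption: base-10 logarithm


def fq_map(x, lo, hi, target):
    """Replace x with target when lo <= x <= hi; otherwise leave it alone."""
    return target if lo <= x <= hi else x


def fq_scale(values, target_min, target_max):
    """Rescale values linearly so their min and max hit the target range."""
    lo, hi = min(values), max(values)   # needs a pass over all docs' values
    if hi == lo:
        return [target_min for _ in values]
    factor = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * factor for v in values]


popularity = [0, 9, 99]
boosts = [fq_log(p + 1) for p in popularity]   # log(sum(popularity,1))
print(boosts)
print(fq_map(0, 0, 0, 1))            # a default of 0 mapped to 1
print(fq_scale([0, 5, 10], 1, 2))
```

Adding 1 before taking the log keeps a popularity of 0 from producing an undefined value, which is presumably why the slide's example is written that way.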
Boosted Query
• Score is multiplied instead of added
  – New local params <!...> syntax added
&q=<!boost b=sqrt(popularity)>super man

• Parameter dereferencing in local params
&q=<!boost b=$boost v=$userq>
&boost=sqrt(popularity)
&userq=super man
Analysis & Search Relevancy
 Document Indexing Analysis                                  Query Analysis

LexCorp BFG-9000                        Lex corp bfg9000

  WhitespaceTokenizer                        WhitespaceTokenizer

 LexCorp      BFG-9000                      Lex     corp    bfg9000

WordDelimiterFilter catenateWords=1     WordDelimiterFilter catenateWords=0

 Lex       Corp    BFG    9000              Lex     corp     bfg      9000
        LexCorp

        LowercaseFilter                           LowercaseFilter

 lex       corp     bfg   9000              lex     corp     bfg      9000
        lexcorp
                                 A Match!
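
The two analysis chains in the diagram can be imitated with a few lines of Python to check that the query terms really do land on indexed terms. This is a simplified stand-in for the filters named above, not their actual logic: the sub-word regular expression and the function names are assumptions, and token positions are ignored.

```python
import re

# Sub-word pattern: a capitalized or lowercase word, a run of capitals, or digits
PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def word_delimiter(token, catenate_words=False):
    """Split on case changes, letter/digit boundaries, and punctuation."""
    parts = PART.findall(token)
    out = list(parts)
    if catenate_words:
        run = []
        for part in parts + [""]:       # the empty string flushes the last run
            if part.isalpha():
                run.append(part)
            else:
                if len(run) > 1:        # also emit adjacent words joined together
                    out.append("".join(run))
                run = []
    return out


def analyze(text, catenate_words):
    tokens = []
    for token in text.split():                       # whitespace tokenizer
        tokens.extend(word_delimiter(token, catenate_words))
    return [t.lower() for t in tokens]               # lowercase filter


index_terms = analyze("LexCorp BFG-9000", catenate_words=True)
query_terms = analyze("Lex corp bfg9000", catenate_words=False)
print(index_terms)
print(query_terms)
print(set(query_terms) <= set(index_terms))
```

Catenating only at index time means a query for "lexcorp" also matches, while queries are not expanded into extra terms.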
Configuring Relevancy
<fieldType name="text" class="solr.TextField">
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt"/>
  <filter class="solr.StopFilterFactory"
          words="stopwords.txt"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
</analyzer>
</fieldType>
Field Definitions
• Field Attributes: name, type, indexed, stored,
  multiValued, omitNorms, termVectors
<field name="id"       type="string"    indexed="true" stored="true"/>
<field name="sku"      type="textTight" indexed="true" stored="true"/>
<field name="name"     type="text"      indexed="true" stored="true"/>
<field name="inStock"  type="boolean"   indexed="true" stored="false"/>
<field name="price"    type="sfloat"    indexed="true" stored="false"/>
<field name="category" type="text_ws"   indexed="true" stored="true"
   multiValued="true"/>

• Dynamic Fields
<dynamicField name="*_i" type="sint"   indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text"   indexed="true" stored="true"/>
copyField
• Copies one field to another at index time
• Use case #1: Analyze the same field in different ways
  – copy into a field with a different analyzer
  – boost exact-case, exact-punctuation matches
  – language translations, thesaurus, soundex

<field name="title" type="text"/>
<field name="title_exact" type="text_exact"
  stored="false"/>
<copyField source="title" dest="title_exact"/>

• Use case #2: Index multiple fields into a single
  searchable field
Facet Query
http://solr/select?q=foo&wt=json&indent=on
 &facet=true&facet.field=cat
 &facet.query=price:[0 TO 100]
 &facet.query=manu:IBM

{"response":{"numFound":26,"start":0,"docs":[…]},
 "facet_counts":{
   "facet_queries":{
      "price:[0 TO 100]":6,
      "manu:IBM":2},
   "facet_fields":{
      "cat":[ "electronics",14, "memory",3,
              "card",2, "connector",2]
   }}}
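
In the response above, each entry under facet_fields is a flat list that alternates value and count. A client usually wants pairs, which takes one line; facet_pairs is a hypothetical helper name and the list is copied from the slide.

```python
def facet_pairs(flat):
    """Turn ["electronics", 14, "memory", 3, ...] into [("electronics", 14), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


cat_counts = ["electronics", 14, "memory", 3, "card", 2, "connector", 2]
for value, count in facet_pairs(cat_counts):
    print(f"{value}: {count}")
```

Keeping the result as a list of pairs rather than a dict preserves the order in which the counts were returned.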
Filters
• Filters are restrictions in addition to the query
• Use in faceting to narrow the results
• Filters are cached separately for speed

1. User queries for memory, query sent to solr is
 &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
 &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
 &q=memory&fq=inStock:true&fq=size:1GB
           &fq=type:DDR2&…
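
The drill-down above amounts to keeping the user query fixed and appending one fq parameter per selection. A sketch of the URL construction, with a hypothetical helper name and the localhost base URL used elsewhere in this deck:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/select"


def select_url(query, filters, base=BASE):
    """Build a select URL with one fq parameter per active filter."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return base + "?" + urlencode(params)   # urlencode escapes ':' as %3A


filters = ["inStock:true"]          # step 1: user searches for memory
print(select_url("memory", filters))
filters.append("size:1GB")          # step 2: user selects 1GB
print(select_url("memory", filters))
filters.append("type:DDR2")         # step 3: user selects DDR2
url = select_url("memory", filters)
print(url)
```

Sending each restriction as its own fq, rather than folding them into q, lets each filter be cached and reused independently, as the slide notes.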
Highlighting
http://solr/select?q=lcd&wt=json&indent=on
 &hl=true&hl.fl=features

{"response":{"numFound":5,"start":0,"docs":[
    {"id":"3007WFP", "price":899.95}, …]},
"highlighting":{
  "3007WFP":{ "features":["30\" TFT active matrix
   <em>LCD</em>, 2560 x 1600"]},
  "VA902B":{ "features":["19\" TFT active matrix
   <em>LCD</em>, 8ms response time, 1280 x
   1024 native resolution"]}}}
MoreLikeThis
• Selects documents that are “similar” to the
  documents matching the main query.
&q=id:6H500F0
  &mlt=true&mlt.fl=name,cat,features
"moreLikeThis":{
  "6H500F0":{"numFound":5,"start":0,
   "docs": [
      {"name":"Apple 60 GB iPod with Video
         Playback Black", "price":399.0,
       "inStock":true, "popularity":10, […]
      }, […]
    ]
[…]
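High Availability
[Architecture diagram]
• Appservers (dynamic HTML generation) send HTTP search requests
  through a Load Balancer to the Solr Searchers
• The Solr Searchers receive the index from the Solr Master via
  Index Replication
• An Updater reads from the DB and sends updates to the Solr Master
• An admin terminal issues admin queries and updates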
Resources
• WWW
  – http://lucene.apache.org/solr
  – http://lucene.apache.org/solr/tutorial.html
  – http://wiki.apache.org/solr/
• Mailing Lists
  – solr-user-subscribe@lucene.apache.org
  – solr-dev-subscribe@lucene.apache.org

Add Powerful Full Text Search to Your Web App with Solr

  • 1. Powerful Full-Text Search with Solr Yonik Seeley yonik@apache.org Web 2.0 Expo, Berlin 8 November 2007 download at http://www.apache.org/~yonik
  • 2. What is Lucene • High performance, scalable, full-text search library • Focus: Indexing + Searching Documents – “Document” is just a list of name+value pairs • No crawlers or document parsing • Flexible Text Analysis (tokenizers + token filters) • 100% Java, no dependencies, no config files
  • 3. What is Solr • A full text search server based on Lucene • XML/HTTP, JSON Interfaces • Faceted Search (category counting) • Flexible data schema to define types and fields • Hit Highlighting • Configurable Advanced Caching • Index Replication • Extensible Open Architecture, Plugins • Web Administration Interface • Written in Java5, deployable as a WAR
  • 4. Basic App [architecture diagram] An Indexer posts a Document (super_name: Mr. Fantastic, name: Reed Richards, category: superhero, powers: elasticity) to http://solr/update. The Webapp sends a Query (powers:agility) to http://solr/select, receives the Query Response (matching docs), and produces HTML. Solr runs inside a Servlet Container on top of Lucene and exposes admin, update, and select endpoints, with XML and CSV update handlers, standard and custom request handlers, and XML and JSON response writers.
  • 5. Indexing Data
      HTTP POST to http://localhost:8983/solr/update
      <add><doc>
        <field name="id">05991</field>
        <field name="name">Peter Parker</field>
        <field name="supername">Spider-Man</field>
        <field name="category">superhero</field>
        <field name="powers">agility</field>
        <field name="powers">spider-sense</field>
      </doc></add>
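The update message on this slide can also be generated rather than written by hand. The following is a minimal sketch using only Python's standard library; the helper name `build_add_xml` and the dict layout are assumptions made for illustration, not part of Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build an <add><doc> update message from a dict.

    List values become repeated <field> elements, which is how the
    slide expresses the multi-valued "powers" field.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Same document as on the slide.
spiderman = {
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
}
add_xml = build_add_xml(spiderman)
print(add_xml)
```

The resulting string would be the body of the HTTP POST to the update URL shown on the slide; sending it is left out so the sketch runs without a server.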
  • 6. Indexing CSV data
      Iron Man, Tony Stark, superhero, powered armor | flight
      Sandman, William Baker|Flint Marko, supervillain, sand transform
      Wolverine,James Howlett|Logan, superhero, healing|adamantium
      Magneto, Erik Lehnsherr, supervillain, magnetism|electricity

      http://localhost:8983/solr/update/csv?
        fieldnames=supername,name,category,powers
        &separator=,
        &f.name.split=true&f.name.separator=|
        &f.powers.split=true&f.powers.separator=|
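The CSV parameters above contain characters (`,` and `|`) that need percent-encoding when the URL is built programmatically. A small sketch, assuming only the parameter names and the base URL shown on the slide:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "http://localhost:8983/solr/update/csv"  # URL from the slide

# Parameters in the same order as on the slide.
params = [
    ("fieldnames", "supername,name,category,powers"),
    ("separator", ","),
    ("f.name.split", "true"),
    ("f.name.separator", "|"),
    ("f.powers.split", "true"),
    ("f.powers.separator", "|"),
]

# urlencode percent-encodes "," and "|" so the URL is safe to send.
csv_update_url = BASE_URL + "?" + urlencode(params)
print(csv_update_url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(csv_update_url).query)
```

The per-field `f.<name>.split` and `f.<name>.separator` pairs are what turn `healing|adamantium` into two values of the `powers` field.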
  • 7. Data upload methods
      URL=http://localhost:8983/solr/update/csv
      • HTTP POST body (curl, HttpClient, etc)
          curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
      • Multi-part file upload (browsers)
      • Request parameter
          ?stream.body='Cyclops, Scott Summers,…'
      • Streaming from URL (must enable)
          ?stream.url=file://data/info.csv
  • 8. Indexing with SolrJ
      // Solr's Java Client API… remote or embedded/local!
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("supername", "Daredevil");
      doc.addField("name", "Matt Murdock");
      doc.addField("category", "superhero");
      server.add(doc);
      server.commit();
  • 9. Deleting Documents
      • Delete by Id, most efficient
          <delete>
            <id>05591</id>
            <id>32552</id>
          </delete>
      • Delete by Query
          <delete>
            <query>category:supervillain</query>
          </delete>
  • 10. Commit
      • <commit/> makes changes visible
        – Triggers static cache warming in solrconfig.xml
        – Triggers autowarming from existing caches
      • <optimize/> same as commit, merges all index segments for faster searching
      [Diagram: Lucene Index Segments] _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del _1.fnm _1.fdt _1.fdx […]
  • 11. Searching
      http://localhost:8983/solr/select?q=powers:agility
        &start=0&rows=2&fl=supername,category

      <response>
        <result numFound="427" start="0">
          <doc>
            <str name="supername">Spider-Man</str>
            <str name="category">superhero</str>
          </doc>
          <doc>
            <str name="supername">Mystique</str>
            <str name="category">supervillain</str>
          </doc>
        </result>
      </response>
  • 12. Response Format
      • Add &wt=json for JSON formatted response
          {"result": {"numFound":427, "start":0,
            "docs": [
              {"supername":"Spider-Man", "category":"superhero"},
              {"supername":"Mystique", "category":"supervillain"}
            ]
          }}
      • Also Python, Ruby, PHP, SerializedPHP, XSLT
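A client can consume the JSON form with any standard JSON parser. The sketch below parses a response shaped like the one on this slide; the top-level key `result` is copied from the slide and is an assumption about the wire format, so a real server's key may differ.

```python
import json

# Response text shaped like the slide's example (not fetched from a server).
raw = """{"result": {"numFound": 427, "start": 0, "docs": [
  {"supername": "Spider-Man", "category": "superhero"},
  {"supername": "Mystique", "category": "supervillain"}]}}"""

result = json.loads(raw)["result"]

# numFound is the total hit count; docs holds only the requested page.
total_hits = result["numFound"]
names = [doc["supername"] for doc in result["docs"]]
print(total_hits, names)
```

Note that `numFound` (427) is much larger than the number of returned documents, because `rows=2` limits the page size in the query on the previous slide.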
  • 13. Scoring • Query results are sorted by score descending • VSM – Vector Space Model • tf – term frequency: number of matching terms in field • lengthNorm – number of tokens in field • idf – inverse document frequency • coord – coordination factor, number of matching terms • document boost • query clause boost http://lucene.apache.org/java/docs/scoring.html
  • 14. Explain
      http://solr/select?q=super fast&indent=on&debugQuery=on

      <lst name="debug">
       <lst name="explain">
        <str name="id=Flash,internal_docid=6">
          0.16389132 = (MATCH) product of:
            0.32778263 = (MATCH) sum of:
              0.32778263 = (MATCH) weight(text:fast in 6), product of:
                0.5012072 = queryWeight(text:fast), product of:
                  2.466337 = idf(docFreq=5)
                  0.20321926 = queryNorm
                0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
                  1.4142135 = tf(termFreq(text:fast)=2)
                  2.466337 = idf(docFreq=5)
                  0.1875 = fieldNorm(field=fast, doc=6)
            0.5 = coord(1/2)
        </str>
        <str name="id=Superman,internal_docid=7">
          0.1365761 = (MATCH) product of:
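The numbers in the explain output for document "Flash" can be reproduced by hand, which helps when reading such trees. This sketch only multiplies the values printed above; the observation that tf is the square root of the term frequency is read off the output (1.4142135 for termFreq=2) rather than taken from any configuration.

```python
import math

# Values copied from the explain output for document "Flash".
idf = 2.466337
query_norm = 0.20321926
field_norm = 0.1875
term_freq = 2
coord = 1 / 2  # one of the two query terms ("fast") matched

tf = math.sqrt(term_freq)              # shown as 1.4142135
query_weight = idf * query_norm        # shown as 0.5012072
field_weight = tf * idf * field_norm   # shown as 0.65398633
term_score = query_weight * field_weight
score = coord * term_score             # shown as 0.16389132
print(round(score, 6))
```

Because idf appears in both the query weight and the field weight, rare terms influence the score quadratically in this model, while coord halves the score here because "super" did not match.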
  • 15. Lucene Query Syntax
      1. justice league
         • Equiv: justice OR league
         • QueryParser default operator is "OR"/optional
      2. +justice +league -name:aquaman
         • Equiv: justice AND league NOT name:aquaman
      3. "justice league" -name:aquaman
      4. title:spiderman^10 description:spiderman
      5. description:"spiderman movie"~100
  • 16. Lucene Query Examples 2
      1. releaseDate:[2000 TO 2007]
      2. Wildcard searches: sup?r, su*r, super*
      3. spider~
         • Fuzzy search: Levenshtein distance
         • Optional minimum similarity: spider~0.7
      4. *:*
      5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
  • 17. DisMax Query Syntax
      • Good for handling raw user queries
        – Balanced quotes for phrase query
        – '+' for required, '-' for prohibited
        – Separates query terms from query structure
      http://solr/select?qt=dismax
        &q=super man                  // the user query
        &qf=title^3 subject^2 body    // fields to query
        &pf=title^2,body              // fields to do phrase queries
        &ps=100                       // slop for those phrase queries
        &tie=.1                       // multi-field match reward
        &mm=2                         // # of terms that should match
        &bf=popularity                // boost function
  • 18. DisMax Query Form
      • The expanded Lucene Query:
          +( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
             DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man)
          )
          DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100)
          FunctionQuery(popularity)
      • Tip: set up your own request handler with default parameters to avoid clients having to specify them
  • 19. Function Query • Allows adding function of field value to score – Boost recently added or popular documents • Current parser only supports function notation • Example: log(sum(popularity,1)) • sum, product, div, log, sqrt, abs, pow • scale(x, target_min, target_max) – calculates min & max of x across all docs • map(x, min, max, target) – useful for dealing with defaults
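To make the listed functions concrete, here is a plain-Python sketch of what `log(sum(popularity,1))`, `map`, and `scale` compute. The helper names are hypothetical, the semantics are a reading of the one-line descriptions on the slide, and the base-10 logarithm is an assumption; none of this is Solr's own code.

```python
import math


def fq_map(x, lo, hi, target):
    """map(x, min, max, target): values inside [lo, hi] become target."""
    return target if lo <= x <= hi else x


def fq_scale(values, target_min, target_max):
    """scale(x, target_min, target_max) using min and max over all values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [target_min for _ in values]
    factor = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * factor for v in values]


def boost(popularity):
    """log(sum(popularity,1)); base 10 is an assumption of this sketch."""
    return math.log10(popularity + 1)


popularity = [0, 9, 99]  # made-up field values for three documents
boosts = [boost(p) for p in popularity]
scaled = fq_scale(popularity, 1, 2)
defaulted = [fq_map(p, 0, 0, 1) for p in popularity]
print(boosts, scaled, defaulted)
```

Adding 1 inside the logarithm keeps a document with popularity 0 from producing an undefined value, and `map(x,0,0,1)` shows the "dealing with defaults" use: a missing or zero value is replaced by a neutral one.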
  • 20. Boosted Query • Score is multiplied instead of added – New local params <!...> syntax added &q=<!boost b=sqrt(popularity)>super man • Parameter dereferencing in local params &q=<!boost b=$boost v=$userq> &boost=sqrt(popularity) &userq=super man
  • 21. Analysis & Search Relevancy [diagram: two parallel analysis chains]
      Document indexing analysis: "LexCorp BFG-9000" -> WhitespaceTokenizer -> LexCorp | BFG-9000 -> WordDelimiterFilter catenateWords=1 -> Lex | Corp | BFG | 9000 | LexCorp -> LowercaseFilter -> lex | corp | bfg | 9000 | lexcorp
      Query analysis: "Lex corp bfg9000" -> WhitespaceTokenizer -> Lex | corp | bfg9000 -> WordDelimiterFilter catenateWords=0 -> Lex | corp | bfg | 9000 -> LowercaseFilter -> lex | corp | bfg | 9000
      A Match!
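The two analysis chains in the diagram can be imitated with a few lines of Python to see why the query matches. This is a toy approximation written for illustration: the regular expression and the `analyze` helper are assumptions and cover only the cases in the diagram, not the full behaviour of the real filters.

```python
import re

# A word part is an acronym run, a capitalised or lowercase word, or a digit run.
PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def analyze(text, catenate_words):
    """Whitespace-tokenize, split on case/digit/punctuation, then lowercase."""
    tokens = []
    for raw in text.split():
        parts = PART.findall(raw)
        tokens.extend(parts)
        words = [p for p in parts if p.isalpha()]
        # catenateWords=1: also emit the joined word parts, e.g. "LexCorp".
        if catenate_words and len(words) > 1:
            tokens.append("".join(words))
    return [t.lower() for t in tokens]


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)

# The query matches when every query token was indexed.
is_match = set(query_tokens) <= set(index_tokens)
print(index_tokens, query_tokens, is_match)
```

Catenating at index time only means a user typing "lexcorp" as one word would also find the document, without the query side having to guess at word boundaries.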
  • 22. Configuring Relevancy
      <fieldType name="text" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        </analyzer>
      </fieldType>
  • 23. Field Definitions
      • Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
          <field name="id" type="string" indexed="true" stored="true"/>
          <field name="sku" type="textTight" indexed="true" stored="true"/>
          <field name="name" type="text" indexed="true" stored="true"/>
          <field name="inStock" type="boolean" indexed="true" stored="false"/>
          <field name="price" type="sfloat" indexed="true" stored="false"/>
          <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
      • Dynamic Fields
          <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
          <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
          <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
  • 24. copyField
      • Copies one field to another at index time
      • Usecase #1: Analyze same field different ways
        – copy into a field with a different analyzer
        – boost exact-case, exact-punctuation matches
        – language translations, thesaurus, soundex
          <field name="title" type="text"/>
          <field name="title_exact" type="text_exact" stored="false"/>
          <copyField source="title" dest="title_exact"/>
      • Usecase #2: Index multiple fields into single searchable field
  • 25.
  • 26.
  • 27.
  • 28. Facet Query
      http://solr/select?q=foo&wt=json&indent=on
        &facet=true&facet.field=cat
        &facet.query=price:[0 TO 100]
        &facet.query=manu:IBM

      {"response":{"numFound":26,"start":0,"docs":[…]},
       "facet_counts":{
        "facet_queries":{
          "price:[0 TO 100]":6,
          "manu:IBM":2},
        "facet_fields":{
          "cat":[ "electronics",14, "memory",3,
                  "card",2, "connector",2]
       }}}
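In the response above, each entry of `facet_fields` is a flat list that alternates value and count. A short sketch of turning that into pairs on the client side; the helper name `facet_pairs` is made up for this example.

```python
# The "cat" facet exactly as it appears in the slide's response.
facet_fields = {
    "cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]
}


def facet_pairs(flat):
    """Turn [value, count, value, count, ...] into (value, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


cat_counts = facet_pairs(facet_fields["cat"])
for value, count in cat_counts:
    print(f"{value} ({count})")
```

Keeping the pairs as a list rather than a dict preserves the order in which the server returned them, which is usually the order a navigation menu should display.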
  • 29. Filters
      • Filters are restrictions in addition to the query
      • Use in faceting to narrow the results
      • Filters are cached separately for speed
      1. User queries for memory, query sent to Solr is
           &q=memory&fq=inStock:true&facet=true&…
      2. User selects 1GB memory size
           &q=memory&fq=inStock:true&fq=size:1GB&…
      3. User selects DDR2 memory type
           &q=memory&fq=inStock:true&fq=size:1GB
           &fq=type:DDR2&…
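The drill-down steps above amount to appending one more `fq` parameter per user selection while `q` stays unchanged. A minimal sketch of building those query strings, assuming a hypothetical helper `search_query` and that `facet=true` is sent on every step:

```python
from urllib.parse import urlencode


def search_query(user_query, filters):
    """Build a query string with one fq parameter per active filter."""
    params = [("q", user_query)]
    params += [("fq", f) for f in filters]
    params.append(("facet", "true"))
    return urlencode(params)


# The three steps from the slide: each selection adds a filter.
selected = ["inStock:true"]
step1 = search_query("memory", selected)
selected.append("size:1GB")
step2 = search_query("memory", selected)
selected.append("type:DDR2")
step3 = search_query("memory", selected)
print(step1, step2, step3, sep="\n")
```

Sending each restriction as its own `fq`, rather than folding everything into `q`, is what lets each filter be cached separately, as the slide notes.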
  • 30. Highlighting
      http://solr/select?q=lcd&wt=json&indent=on
        &hl=true&hl.fl=features

      {"response":{"numFound":5,"start":0,"docs":[
          {"id":"3007WFP", "price":899.95}, …]},
       "highlighting":{
        "3007WFP":{
          "features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]},
        "VA902B":{
          "features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}
  • 31. MoreLikeThis
      • Selects documents that are "similar" to the documents matching the main query.
      &q=id:6H500F0
      &mlt=true&mlt.fl=name,cat,features

      "moreLikeThis":{
        "6H500F0":{"numFound":5,"start":0,
          "docs": [
            {"name":"Apple 60 GB iPod with Video Playback Black",
             "price":399.0,
             "inStock":true,
             "popularity":10,
             […]
            },
            […]
          ]
        […]
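  • 32. High Availability [deployment diagram] Appservers doing dynamic HTML generation send HTTP search requests through a Load Balancer to a set of Solr Searchers. A DB and an Updater send updates to the Solr Master, and Index Replication copies the master's index to the searchers. An admin terminal issues admin queries.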
  • 33. Resources • WWW – http://lucene.apache.org/solr – http://lucene.apache.org/solr/tutorial.html – http://wiki.apache.org/solr/ • Mailing Lists – solr-user-subscribe@lucene.apache.org – solr-dev-subscribe@lucene.apache.org