Search Engine-Building
with Lucene and Solr
Part 1
Kai Chan
SoCal Code Camp, November 2013
Overview
● why Lucene/Solr?
● what are Lucene and Solr?
● how to use Lucene and Solr
○ setup
○ indexing
○ searching

● resources
● demo
● questions/answers
How to Make Your Data Searchable
● pay someone to do it
● use some solution someone else has written
● write some solution yourself
How to Search - One Approach
for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the result (or not)
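The loop above can be sketched in Python as follows. This is a minimal illustration, assuming documents are plain strings held in a list; the function name linear_search and the sample documents are made up for the example.

```python
def linear_search(documents, query):
    """Return the IDs (list positions) of documents containing the query."""
    results = []
    for doc_id, content in enumerate(documents):
        # Substring test: every document is read in full for every search.
        if query in content:
            results.append(doc_id)
    return results


docs = ["it is what it is", "what is it", "it is a banana"]
print(linear_search(docs, "what"))    # [0, 1]
print(linear_search(docs, "banana"))  # [2]
```

Every call scans all documents, so the cost grows linearly with the size of the dataset.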
How to Search - Problems
● slow
○ reads the whole dataset for each search

● not scalable
○ if your dataset grows by 10x,
your search slows down by 10x

● how to show the most relevant documents
first?
○ list of results can be quite long
○ users have limited time and patience
Inverted Index - Introduction
● like the "index" at the end of books
● a map of one of the following types
○ term → document list
○ term → <document, position> list
documents:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

inverted index (without positions):
"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

inverted index (with positions):
"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
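Both forms of the index can be built in one pass over the documents. The sketch below assumes whitespace tokenization and no normalization (no lowercasing, stemming or stop words); build_inverted_index is a hypothetical helper name, not a Lucene API.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build term -> document set and term -> (document, position) list maps."""
    index = defaultdict(set)        # without positions
    positional = defaultdict(list)  # with positions
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.split()):
            index[term].add(doc_id)
            positional[term].append((doc_id, position))
    return dict(index), dict(positional)


documents = ["it is what it is", "what is it", "it is a banana"]
index, positional = build_inverted_index(documents)
print(sorted(index["what"]))  # [0, 1]
print(positional["is"])       # [(0, 1), (0, 4), (1, 1), (2, 1)]
```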
Inverted Index - Speed
● term list
○ typically very small
○ grows slowly

● term lookup
○ O(1) to O(log(number of terms))

● for a particular term
○ document lists: very small
○ document + position lists: still small

● few terms per query
Inverted Index - Relevance
● information in the index enables:
○ determination (scoring) of relevance of each
document to the query
○ comparison of relevance among documents
○ sorting by (decreasing) relevance
■ i.e. the most relevant document first
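As a rough illustration of scoring and sorting by relevance, the sketch below ranks documents with a simple term-frequency times inverse-document-frequency weight. The weighting formula is an assumption chosen for the example and is not Lucene's actual scoring; rank is a hypothetical function name.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return (doc_id, score) pairs, most relevant first."""
    tokenized = [doc.split() for doc in documents]
    n = len(tokenized)
    scored = []
    for doc_id, terms in enumerate(tokenized):
        counts = Counter(terms)
        score = 0.0
        for term in query.split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for t in tokenized if term in t)
            if df == 0 or counts[term] == 0:
                continue
            # Frequent-in-document and rare-in-collection terms weigh more.
            score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = ["it is what it is", "what is it", "it is a banana"]
for doc_id, score in rank(documents, "what it"):
    print(doc_id, round(score, 3))
```

The counts used here (term frequency per document, number of documents per term) are exactly the kind of information an inverted index makes available.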
Lucene vs. Solr - Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls
methods in the Lucene API
● provides building blocks for custom features
Lucene vs. Solr - Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
Lucene:
machine running Java VM
    your application
        Lucene code
        libraries
    index

Solr:
client → HTTP → machine running Java VM
    servlet container (e.g. Tomcat, Jetty)
        Solr
            Solr code
            Lucene code
            libraries
    index
Workflow
Setup → Indexing → Search
Workflow - Setup
● servlet configuration
○ e.g. port number, max POST size
○ you can usually use the default settings

● Solr configuration
○ e.g. data directory, deduplication, language
identification, highlighting
○ you can usually use the default settings

● schema definition
○ defines fields in your documents
○ you can use the default settings if you name your
fields in a certain way
How Data Are Organized
● a collection contains documents
● a document contains fields
● a field has
○ name (e.g. "title" or "price")
○ content (e.g. "please read" or 30)
○ type
○ options
● example: a collection of three documents with similar fields
○ subject, date, from, reply-to, text
● example: a collection of three documents with different fields
○ subject, date, from, text
○ title, SKU, price, description
○ first name, last name, phone, address
Solr Field Definition
● field
○ name (e.g. "subject")
○ type (e.g. "text_general")
○ options (e.g. indexed="true" stored="true")

● field type
○ text: "string", "text_general"
○ numeric: "int", "long", "float", "double"

● options
○ indexed: content can be searched
○ stored: content can be returned at search-time
○ multivalued: multiple values per field & document
Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued

name    type          indexed  stored  multiValued
*_i     int           true     true    false
*_l     long          true     true    false
*_f     float         true     true    false
*_d     double        true     true    false
*_s     string        true     true    false
*_ss    string        true     true    true
*_t     text_general  true     true    false
*_txt   text_general  true     true    true
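The naming convention can be illustrated with a small lookup that maps a field name to the settings in the table above. This is only a sketch of the idea: resolve_field and DYNAMIC_FIELDS are hypothetical names, and the matching here is plain glob matching, not Solr's own resolution logic.

```python
from fnmatch import fnmatchcase

# (pattern, type, multiValued), copied from the table above.
# All of these patterns are indexed and stored.
DYNAMIC_FIELDS = [
    ("*_i", "int", False),
    ("*_l", "long", False),
    ("*_f", "float", False),
    ("*_d", "double", False),
    ("*_s", "string", False),
    ("*_ss", "string", True),
    ("*_t", "text_general", False),
    ("*_txt", "text_general", True),
]


def resolve_field(name):
    """Return the settings implied by a field name's suffix, or None."""
    for pattern, field_type, multi_valued in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return {
                "type": field_type,
                "indexed": True,
                "stored": True,
                "multiValued": multi_valued,
            }
    return None


print(resolve_field("amount_i"))
print(resolve_field("tag_ss"))
print(resolve_field("title"))  # None: no dynamic field pattern matches
```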
Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
○ source: "title", "author", "description"
○ destination: "text"
○ searching the "text" field has the effect of searching
all the other three fields
Indexing - UpdateRequestHandler
● upload (POST) content or file to
http://host:port/solr/update
● formats: XML, JSON, CSV
XML:
<add>
  <doc>
    <field name="id">apple</field>
    <field name="compName">Apple</field>
    <field name="address">1 Infinite Way, Cupertino CA</field>
  </doc>
  <doc>
    <field name="id">asus</field>
    <field name="compName">ASUS Computer</field>
    <field name="address">800 Corporate Way Fremont, CA 94539</field>
  </doc>
</add>

JSON:
[
{"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
{"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
]

CSV:
id,compName_s,address_s
apple,Apple,"1 Infinite Way, Cupertino CA"
asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
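A JSON upload can be prepared from Python with the standard library alone. The sketch below only builds the POST request and does not send it; the base URL http://localhost:8983/solr is assumed from the deck's later examples, and build_update_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request carrying documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request(docs)
print(request.get_method(), request.full_url)
# To actually send it against a running Solr: urllib.request.urlopen(request)
```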
Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
○ RDBMS (JDBC)
○ e-mail (IMAP)
○ XML data locally (file) or remotely (HTTP)

● transformers
○ extract data (RegEx, XPath)
○ manipulate data (strip HTML tags)
Indexing - ExtractingRequestHandler
● allows indexing of different formats
○ e.g. PDF, MS Word, XML

● uses Apache Tika to extract text and
metadata
○ Tika: a framework for different file format parsers
(e.g. PDFBox for PDF, Apache POI for MS Word)

● maps extracted text to the “content” field
● maps metadata (e.g. MIME type) to different
fields
Searching - Basics
● send request to http://host:port/solr/select
● parameters
○ q - main query
○ fq - filter query
○ defType - query parser (e.g. lucene, edismax)
○ fl - fields to return
○ sort - sort criteria
○ wt - response writer (e.g. xml, json)
○ indent - set to true for pretty-printing
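e.g. http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
○ search handler's URL: http://localhost:8983/solr/select
○ main query: q=title:tablet
○ fields to return: fl=title,price,inStock
○ sort criteria: sort=price+asc
○ response writer: wt=json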
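A search URL of this shape can be assembled with Python's standard library, which also takes care of URL-encoding the parameter values. This is a sketch only: build_search_url is a hypothetical helper, and the sort value spells out a direction, following the fieldName (asc|desc) syntax.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(params, base_url="http://localhost:8983/solr/select"):
    """Append URL-encoded query parameters to the search handler's URL."""
    return base_url + "?" + urlencode(params)


params = {
    "q": "title:tablet",            # main query
    "fl": "title,price,inStock",    # fields to return
    "sort": "price asc",            # sort criteria
    "wt": "json",                   # response writer
}
url = build_search_url(params)
print(url)
```

Characters such as ":" and "," are percent-encoded in the printed URL; decoding the query string yields the original parameter values.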
Searching - Query Syntax - Field
● search a specific field
○ field_name:value

● if field omitted, Solr uses default field:
○ df parameter in URL
○ defaultSearchField setting in schema.xml
○ "text"
Searching - Query Syntax - Term
● a term by itself: matches documents that
contain that term
○ e.g. tablet
Searching - Query Syntax - Boolean
● “conventional” boolean operators supported
○ AND &&
○ OR ||
○ NOT !
● e.g. a AND b
○ all of a, b must occur
● e.g. a OR b
○ at least one of a, b must occur
● e.g. a AND NOT b
○ a must occur and b must not occur
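With an inverted index, these operators correspond to set operations on the document lists of the terms. The sketch below assumes the small three-document index from the inverted index example and a hypothetical postings helper; it shows matching only, not scoring.

```python
# term -> set of document IDs (the example index without positions)
index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}


def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())


both = postings("what") & postings("it")      # what AND it
either = postings("a") | postings("what")     # a OR what
only_it = postings("it") - postings("what")   # it AND NOT what

print(sorted(both))     # [0, 1]
print(sorted(either))   # [0, 1, 2]
print(sorted(only_it))  # [2]
```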
Searching - Query Syntax - Boolean
● Lucene/Solr's boolean operators are not true
boolean operators
● e.g. a OR b OR c does not behave like
(a OR b) OR c
○ instead, a OR b OR c means at least one of a, b, c
must occur

● parentheses are supported
Searching - Query Syntax - Boolean
● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "at least one must"
(by default)
○ e.g. a b c
■ at least one of a, b, c must occur

● operators can mix
○ e.g. +a b c d -e
■ a must occur
■ b, c, d are optional: when a required clause is
present, they only raise the score of documents that contain them
■ e must not occur
Searching - Query Syntax - Phrase
● phrases are enclosed by double-quotes
● e.g. +"the phrase"
○ the phrase must occur

● e.g. -"the phrase"
○ the phrase must not occur
Searching - Query Syntax - Boost
● manually assign different weights to clauses
● gives more weight to a field
○ e.g. title:a^10 body:a

● gives more weight to a word
○ e.g. title:a title:b^10

● gives phrases more weight than words
○ e.g. title:(+a +b) title:"a b"^10
Searching - Query Syntax - Range
● matches field values within a range
○ inclusive range - denoted by square brackets
○ exclusive range - denoted by curly brackets

● e.g. age:[10 TO 20]
○ matches the field "age" with the value in 10..20

● string or numeric comparison, depending on
the field's type
● open-ended range supported
● e.g. age:[10 TO *]
○ matches the field "age" with the value 10 or larger
Searching - Query Syntax - EDisMax
● suitable for user-generated queries
○ does not complain about the syntax
○ searches for individual words across several fields
("disjunction")
○ uses max score of a word in all fields for scoring
("max")

● configurable (in solrconfig.xml)
○ what fields to search the words in
○ boosting of these fields
Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
○ syntax: fieldName (asc|desc)
○ e.g. sort by ascending price (i.e. lowest price first):
price asc
○ e.g. sort by descending date (i.e. newest date first):
date desc
Sorting
● special field names
○ use score for score and _docid_ for document ID
○ e.g. sort by ascending score:
score asc
○ e.g. sort by descending document ID
_docid_ desc
Sorting
● multiple fields and orders: separate by
commas
○ e.g. sort by descending starRating and ascending
price:
starRating desc, price asc
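The effect of a multi-field sort can be reproduced on in-memory records: the first criterion decides the order, and the second only breaks ties. The product records below are hypothetical; the field names follow the example above.

```python
# Hypothetical product records for illustration.
products = [
    {"name": "A", "starRating": 4, "price": 30.0},
    {"name": "B", "starRating": 5, "price": 50.0},
    {"name": "C", "starRating": 5, "price": 20.0},
    {"name": "D", "starRating": 4, "price": 10.0},
]


def sort_products(docs):
    """Order by starRating descending, then by price ascending."""
    return sorted(docs, key=lambda d: (-d["starRating"], d["price"]))


print([p["name"] for p in sort_products(products)])  # ['C', 'B', 'D', 'A']
```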
Sorting
● cannot use multivalued fields
● overrides the default sorting behavior
Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
○ show possible values
○ let users narrow down their searches easily
[figure: a facet and its facet values (5 of them)]
Faceted Search
● set facet parameter to true - enables
faceting
● other parameters
○ facet.field - use the field's values as facets
■ return <value, count> pairs
○ facet.query - use the given queries as facets
■ return <query, count> pairs
○ facet.sort - set the ordering of the facets;
■ can be "count" or "index"
○ facet.offset and facet.limit - used for
pagination of facets
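What field faceting returns can be sketched as counting the distinct values of a field across the matching documents. The function below is an illustration, not Solr code: facet_field and its keyword arguments merely mirror the parameter names above, the documents are hypothetical, and breaking count ties by value is an assumption.

```python
from collections import Counter


def facet_field(docs, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for one field, sorted and paginated."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    if sort == "count":
        # Highest count first; ties broken by value (an assumption).
        pairs = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    else:
        # "index": order by the facet value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


docs = [
    {"brand": "zeta"},
    {"brand": "acme"},
    {"brand": "zeta"},
    {"brand": "beta"},
    {"title": "document without a brand"},
]
print(facet_field(docs, "brand"))                # [('zeta', 2), ('acme', 1), ('beta', 1)]
print(facet_field(docs, "brand", sort="index"))  # [('acme', 1), ('beta', 1), ('zeta', 2)]
```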
Resources - Books
● Lucene in Action
○ written by 3 committers and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/

● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/

● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4cookbook/book
Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/

● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/
Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/

● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/

● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide
Getting Started
● download Solr
○ requires Java 6 or newer to run

● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar

● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml

● use the Solr admin interface
○ http://localhost:8983/solr/

Contenu connexe

Tendances

MongoDB Advanced Topics
MongoDB Advanced TopicsMongoDB Advanced Topics
MongoDB Advanced TopicsCésar Rodas
 
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAlexandre Victoor
 
odoo 11.0 development (CRUD)
odoo 11.0 development (CRUD)odoo 11.0 development (CRUD)
odoo 11.0 development (CRUD)Mohamed Magdy
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchangeChristoph Santschi
 
Java Data Migration with Data Pipeline
Java Data Migration with Data PipelineJava Data Migration with Data Pipeline
Java Data Migration with Data PipelineNorth Concepts
 
file handling, dynamic memory allocation
file handling, dynamic memory allocationfile handling, dynamic memory allocation
file handling, dynamic memory allocationindra Kishor
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducerslucenerevolution
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma APIKyle Banerjee
 
Django with MongoDB using MongoEngine
Django with MongoDB using MongoEngineDjango with MongoDB using MongoEngine
Django with MongoDB using MongoEngineRakesh Kumar
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerdeZheng Shao
 

Tendances (20)

MongoDB Advanced Topics
MongoDB Advanced TopicsMongoDB Advanced Topics
MongoDB Advanced Topics
 
Avro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSONAvro, la puissance du binaire, la souplesse du JSON
Avro, la puissance du binaire, la souplesse du JSON
 
C++ files and streams
C++ files and streamsC++ files and streams
C++ files and streams
 
odoo 11.0 development (CRUD)
odoo 11.0 development (CRUD)odoo 11.0 development (CRUD)
odoo 11.0 development (CRUD)
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Make Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 MinutesMake Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 Minutes
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchange
 
MongoDB (Advanced)
MongoDB (Advanced)MongoDB (Advanced)
MongoDB (Advanced)
 
04 standard class library c#
04 standard class library c#04 standard class library c#
04 standard class library c#
 
Java Data Migration with Data Pipeline
Java Data Migration with Data PipelineJava Data Migration with Data Pipeline
Java Data Migration with Data Pipeline
 
file handling, dynamic memory allocation
file handling, dynamic memory allocationfile handling, dynamic memory allocation
file handling, dynamic memory allocation
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducers
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
Getting Started with the Alma API
Getting Started with the Alma APIGetting Started with the Alma API
Getting Started with the Alma API
 
Apache solr
Apache solrApache solr
Apache solr
 
5java Io
5java Io5java Io
5java Io
 
Django with MongoDB using MongoEngine
Django with MongoDB using MongoEngineDjango with MongoDB using MongoEngine
Django with MongoDB using MongoEngine
 
Hive - SerDe and LazySerde
Hive - SerDe and LazySerdeHive - SerDe and LazySerde
Hive - SerDe and LazySerde
 
Comp102 lec 11
Comp102   lec 11Comp102   lec 11
Comp102 lec 11
 

Similaire à Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)

Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksSearching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksAlexandre Rafalovitch
 
Full Text search in Django with Postgres
Full Text search in Django with PostgresFull Text search in Django with Postgres
Full Text search in Django with Postgressyerram
 
Meetup C++ A brief overview of c++17
Meetup C++  A brief overview of c++17Meetup C++  A brief overview of c++17
Meetup C++ A brief overview of c++17Daniel Eriksson
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Elasticsearch for Data Engineers
Elasticsearch for Data EngineersElasticsearch for Data Engineers
Elasticsearch for Data EngineersDuy Do
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Kai Chan
 
06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and AnalysisOpenThink Labs
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdfAbhi Jain
 
Odoo ORM Methods | Object Relational Mapping in Odoo15
Odoo ORM Methods | Object Relational Mapping in Odoo15 Odoo ORM Methods | Object Relational Mapping in Odoo15
Odoo ORM Methods | Object Relational Mapping in Odoo15 Celine George
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013Emanuel Calvo
 
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAYPostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAYEmanuel Calvo
 
Four Languages From Forty Years Ago
Four Languages From Forty Years AgoFour Languages From Forty Years Ago
Four Languages From Forty Years AgoScott Wlaschin
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
 
Python bible
Python biblePython bible
Python bibleadarsh j
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data AnalyticsFelipe
 
Data modeling for Elasticsearch
Data modeling for ElasticsearchData modeling for Elasticsearch
Data modeling for ElasticsearchFlorian Hopf
 

Similaire à Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013) (20)

Pgbr 2013 fts
Pgbr 2013 ftsPgbr 2013 fts
Pgbr 2013 fts
 
Kibana: Real-World Examples
Kibana: Real-World ExamplesKibana: Real-World Examples
Kibana: Real-World Examples
 
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksSearching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
 
Full Text search in Django with Postgres
Full Text search in Django with PostgresFull Text search in Django with Postgres
Full Text search in Django with Postgres
 
Meetup C++ A brief overview of c++17
Meetup C++  A brief overview of c++17Meetup C++  A brief overview of c++17
Meetup C++ A brief overview of c++17
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Elasticsearch for Data Engineers
Elasticsearch for Data EngineersElasticsearch for Data Engineers
Elasticsearch for Data Engineers
 
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
Search Engine Building with Lucene and Solr (So Code Camp San Diego 2014)
 
Solr workshop
Solr workshopSolr workshop
Solr workshop
 
06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
Odoo ORM Methods | Object Relational Mapping in Odoo15
Odoo ORM Methods | Object Relational Mapping in Odoo15 Odoo ORM Methods | Object Relational Mapping in Odoo15
Odoo ORM Methods | Object Relational Mapping in Odoo15
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
 
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAYPostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
 
Four Languages From Forty Years Ago
Four Languages From Forty Years AgoFour Languages From Forty Years Ago
Four Languages From Forty Years Ago
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
 
Python bible
Python biblePython bible
Python bible
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data Analytics
 
Mongo db
Mongo dbMongo db
Mongo db
 
Data modeling for Elasticsearch
Data modeling for ElasticsearchData modeling for Elasticsearch
Data modeling for Elasticsearch
 

Dernier

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingThe Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingSelcen Ozturkcan
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Dernier (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingThe Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners

Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)

  • 1. Search Engine-Building with Lucene and Solr Part 1 Kai Chan SoCal Code Camp, November 2013
  • 2. Overview ● why Lucene/Solr? ● what are Lucene and Solr? ● how to use Lucene and Solr ○ setup ○ indexing ○ searching ● resources ● demo ● questions/answers
  • 3. How to Make Your Data Searchable ● pay someone to do it ● use some solution someone else has written ● write some solution yourself
  • 4. How to Search - One Approach for each document d { if (query is a substring of d's content) { add d to the list of results } } sort the result (or not)
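The loop on this slide can be sketched in Python as follows. This is only an illustration of the brute-force approach; the function name `naive_search` and the sample documents (borrowed from the inverted-index slide) are assumptions, not part of the talk.

```python
def naive_search(documents, query):
    """Return the IDs of documents whose content contains the query as a substring."""
    results = []
    for doc_id, content in enumerate(documents):
        # Every search reads every document, which is why this approach is slow.
        if query in content:
            results.append(doc_id)
    return results


documents = ["it is what it is", "what is it", "it is a banana"]
print(naive_search(documents, "what"))  # prints [0, 1]
```

Each call scans the whole dataset, so the cost of one search grows linearly with the amount of text.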
  • 5. How to Search - Problems ● slow ○ reads the whole dataset for each search ● not scalable ○ if your dataset grows by 10x, your search slows down by 10x ● how to show the most relevant documents first? ○ list of results can be quite long ○ users have limited time and patience
  • 6. Inverted Index - Introduction ● like the "index" at the end of books ● a map of one of the following types ○ term → document list ○ term → <document, position> list
  • 7. documents: T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana" inverted index (without positions): "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} inverted index (with positions): "a": {(2, 2)} "banana": {(2, 3)} "is": {(0, 1), (0, 4), (1, 1), (2, 1)} "it": {(0, 0), (0, 3), (1, 2), (2, 0)} "what": {(0, 2), (1, 0)} Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
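A minimal sketch of how such an index could be built, assuming whitespace tokenization and no further text analysis (Lucene's real analyzers do considerably more). The function names `build_inverted_index` and `documents_for` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a list of (document ID, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


def documents_for(index, term):
    """Drop the positions to get the plain term -> document list form."""
    return sorted({doc_id for doc_id, _ in index.get(term, [])})


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)
print(index["is"])                   # prints [(0, 1), (0, 4), (1, 1), (2, 1)]
print(documents_for(index, "what"))  # prints [0, 1]
```

A lookup now touches only the entries for the queried term instead of every document.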
  • 8. Inverted Index - Speed ● term list ○ typically very small ○ grows slowly ● term lookup ○ O(1) to O(log(number of terms)) ● for a particular term ○ document lists: very small ○ document + position lists: still small ● few terms per query
  • 9. Inverted Index - Relevance ● information in the index enables: ○ determination (scoring) of relevance of each document to the query ○ comparison of relevance among documents ○ sorting by (decreasing) relevance ■ i.e. the most relevant document first
  • 10. Lucene vs. Solr - Lucene ● full-text search library ● creates, updates and reads from the index ● takes queries and produces search results ● your application creates objects and calls methods in the Lucene API ● provides building blocks for custom features
  • 11. Lucene vs. Solr - Solr ● full-text search platform ● uses Lucene for indexing and search ● REST-like API over HTTP ● different output formats (e.g. XML, JSON) ● provides some features not built into Lucene
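  • 12. [Diagram] Lucene: on a machine running a Java VM, your application calls the Lucene code libraries, which read and write the index. Solr: on a machine running a Java VM, a servlet container (e.g. Tomcat, Jetty) hosts Solr; a client talks to Solr over HTTP, and the Solr code uses the Lucene code libraries to read and write the index.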
  • 14. Workflow - Setup ● servlet configuration ○ e.g. port number, max POST size ○ you can usually use the default settings ● Solr configuration ○ e.g. data directory, deduplication, language identification, highlighting ○ you can usually use the default settings ● schema definition ○ defines fields in your documents ○ you can use the default settings if you name your fields in a certain way
  • 15. How Data Are Organized ● a collection contains documents ● each document contains fields
  • 16. field ● name (e.g. "title" or "price") ● content (e.g. "please read" or 30) ● type ● options
  • 19. Solr Field Definition ● field ○ name (e.g. "subject") ○ type (e.g. "text_general") ○ options (e.g. indexed="true" stored="true") ● field type ○ text: "string", "text_general" ○ numeric: "int", "long", "float", "double" ● options ○ indexed: content can be searched ○ stored: content can be returned at search-time ○ multivalued: multiple values per field & document
  • 20. Solr Dynamic Field ● define field by naming convention ● "amount_i": int, indexed, stored ● "tag_ss": string, indexed, stored, multivalued
    name    type          indexed  stored  multiValued
    *_i     int           true     true    false
    *_l     long          true     true    false
    *_f     float         true     true    false
    *_d     double        true     true    false
    *_s     string        true     true    false
    *_ss    string        true     true    true
    *_t     text_general  true     true    false
    *_txt   text_general  true     true    true
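To make the naming convention concrete, here is a small sketch that maps a field name to the settings listed on this slide. It is a simplified stand-in for Solr's own pattern matching, and the name `resolve_dynamic_field` is hypothetical.

```python
# Suffix -> (type, multiValued), copied from the slide's table.
# All of these patterns are indexed and stored.
DYNAMIC_FIELDS = {
    "_i": ("int", False),
    "_l": ("long", False),
    "_f": ("float", False),
    "_d": ("double", False),
    "_s": ("string", False),
    "_ss": ("string", True),
    "_t": ("text_general", False),
    "_txt": ("text_general", True),
}


def resolve_dynamic_field(name):
    """Return the settings implied by a field name's suffix, or None if no pattern applies."""
    for suffix, (field_type, multi_valued) in DYNAMIC_FIELDS.items():
        if name.endswith(suffix):
            return {
                "type": field_type,
                "indexed": True,
                "stored": True,
                "multiValued": multi_valued,
            }
    return None


print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
```

A name such as "title" matches none of the patterns, so it would need an explicit field definition in the schema.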
  • 21. Solr Copy Field ● copy one or more fields into another field ● can be used to define a catch-all field ○ source: "title", "author", "description" ○ destination: "text" ○ searching the "text" field has the effect of searching all the other three fields
  • 22. Indexing - UpdateRequestHandler ● upload (POST) content or file to http://host:port/solr/update ● formats: XML, JSON, CSV
  • 23. XML:
    <add>
      <doc>
        <field name="id">apple</field>
        <field name="compName">Apple</field>
        <field name="address">1 Infinite Way, Cupertino CA</field>
      </doc>
      <doc>
        <field name="id">asus</field>
        <field name="compName">ASUS Computer</field>
        <field name="address">800 Corporate Way Fremont, CA 94539</field>
      </doc>
    </add>
    JSON:
    [
      {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
      {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
    ]
    CSV:
    id,compName_s,address_s
    apple,Apple,"1 Infinite Way, Cupertino CA"
    asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
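As a rough sketch, the JSON form could be prepared for upload from Python like this. The base URL http://localhost:8983/solr is an assumption (it matches the Getting Started slide), the helper name `build_update_request` is hypothetical, and the request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_update_request(solr_url, documents):
    """Build (but do not send) a POST request that uploads documents as JSON."""
    body = json.dumps(documents).encode("utf-8")
    return Request(
        solr_url + "/update",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


documents = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request("http://localhost:8983/solr", documents)
print(request.get_method(), request.full_url)
# To send it to a running Solr instance (assumed, not shown in the talk):
#   from urllib.request import urlopen
#   urlopen(request)
```

Serializing with `json.dumps` avoids hand-written JSON mistakes such as a missing comma between documents.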
  • 24. Indexing - DataImportHandler ● has its own config file (data-config.xml) ● import data from various sources ○ RDBMS (JDBC) ○ e-mail (IMAP) ○ XML data locally (file) or remotely (HTTP) ● transformers ○ extract data (RegEx, XPath) ○ manipulate data (strip HTML tags)
  • 25. Indexing - ExtractingRequestHandler ● allows indexing of different formats ○ e.g. PDF, MS Word, XML ● uses Apache Tika to extract text and metadata ○ Tika: a framework for different file format parsers (e.g. PDFBox for PDF, Apache POI for MS Word) ● maps extracted text to the “content” field ● maps metadata (e.g. MIME type) to different fields
  • 26. Searching - Basics ● send request to http://host:port/solr/select ● parameters ○ q - main query ○ fq - filter query ○ defType - query parser (e.g. lucene, edismax) ○ fl - fields to return ○ sort - sort criteria ○ wt - response writer (e.g. xml, json) ○ indent - set to true for pretty-printing
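  • 27. http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json ● search handler's URL: http://localhost:8983/solr/select ● main query: q=title:tablet ● fields to return: fl=title,price,inStock ● sort criteria: sort=price+asc ● response writer: wt=json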
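A sketch of assembling such a request URL in Python, so that special characters are escaped automatically. The helper name `build_select_url` is hypothetical, and the sort value assumes an explicit direction ("price asc").

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(solr_url, query, fields, sort, writer="json"):
    """Assemble a search URL from the main query, returned fields, sort and writer."""
    params = {
        "q": query,              # main query
        "fl": ",".join(fields),  # fields to return
        "sort": sort,            # sort criteria
        "wt": writer,            # response writer
    }
    return solr_url + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr",
    "title:tablet",
    ["title", "price", "inStock"],
    "price asc",
)
print(url)
print(parse_qs(urlsplit(url).query))
```

`urlencode` percent-encodes the colon and commas and turns the space in the sort value into "+", which the server decodes back to the original values.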
  • 28. Searching - Query Syntax - Field ● search a specific field ○ field_name:value ● if field omitted, Solr uses default field: ○ df parameter in URL ○ defaultSearchField setting in schema.xml ○ "text"
  • 29. Searching - Query Syntax - Term ● a term by itself: matches documents that contain that term ○ e.g. tablet
  • 30. Searching - Query Syntax - Boolean ● “conventional” boolean operators supported ○ AND && ○ OR || ○ NOT ! ● e.g. a AND b ○ all of a, b must occur ● e.g. a OR b ○ at least one of a, b must occur ● e.g. a AND NOT b ○ a must occur and b must not occur
  • 31. Searching - Query Syntax - Boolean ● Lucene/Solr's boolean operators are not true boolean operators ● e.g. a OR b OR c does not behave like (a OR b) OR c ○ instead, a OR b OR c means at least one of a, b, c must occur ● parentheses are supported
  • 32. Searching - Query Syntax - Boolean ● "+" prefix means "must" ● "-" prefix means "must not" ● no prefix means "at least one must" (by default) ○ e.g. a b c ■ at least one of a, b, c must occur ● operators can mix ○ e.g. +a b c d -e ■ a must occur ■ b, c, d are optional once a "+" clause is present; documents that contain them score higher ■ e must not occur
  • 33. Searching - Query Syntax - Phrase ● phrases are enclosed by double-quotes ● e.g. +"the phrase" ○ the phrase must occur ● e.g. -"the phrase" ○ the phrase must not occur
  • 34. Searching - Query Syntax - Boost ● manually assign different weights to clauses ● gives more weight to a field ○ e.g. title:a^10 body:a ● gives more weight to a word ○ e.g. title:a title:b^10 ● gives phrases more weight than words ○ e.g. title:(+a +b) title:"a b"^10
  • 35. Searching - Query Syntax - Range ● matches field values within a range ○ inclusive range - denoted by square brackets ○ exclusive range - denoted by curly brackets ● e.g. age:[10 TO 20] ○ matches the field "age" with the value in 10..20 ● string or numeric comparison, depending on the field's type ● open-ended range supported ● e.g. age:[10 TO *] ○ matches the field "age" with the value 10 or larger
  • 36. Searching - Query Syntax - EDisMax ● suitable for user-generated queries ○ does not complain about the syntax ○ searches for individual words across several fields ("disjunction") ○ uses max score of a word in all fields for scoring ("max") ● configurable (in solrconfig.xml) ○ what fields to search the words in ○ boosting of these fields
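The "disjunction max" idea can be illustrated with a toy scorer. This is not Lucene's scoring formula: the per-field score here is just the field boost times the word count, and the function name, field names and boosts are assumptions made for the example.

```python
def dismax_score(document, query, field_boosts):
    """Sum, over the query words, the best per-field score of each word."""
    total = 0.0
    for word in query.lower().split():
        # Toy per-field score: boost times the number of occurrences in that field.
        per_field = [
            boost * document.get(field, "").lower().split().count(word)
            for field, boost in field_boosts.items()
        ]
        # "max": only the best-scoring field counts for this word.
        total += max(per_field, default=0.0)
    return total


document = {"title": "solr tablet guide", "body": "tablet tablet review"}
boosts = {"title": 10.0, "body": 1.0}
print(dismax_score(document, "tablet review", boosts))  # prints 11.0
```

Here "tablet" scores 10.0 from the title (beating 2.0 from the body) and "review" scores 1.0 from the body, so the total is 11.0.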
  • 37. Sorting ● default: sorting by decreasing score ● custom sorting rules: use the sort parameter ○ syntax: fieldName (asc|desc) ○ e.g. sort by ascending price (i.e. lowest price first): price asc ○ e.g. sort by descending date (i.e. newest date first): date desc
  • 38. Sorting ● special field names ○ use score for score and _docid_ for document ID ○ e.g. sort by ascending score: score asc ○ e.g. sort by descending document ID: _docid_ desc
  • 39. Sorting ● multiple fields and orders: separate by commas ○ e.g. sort by descending starRating and ascending price: ○ starRating desc, price asc
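What a multi-field sort specification means can be sketched on an in-memory list. This only mimics the ordering, not how Solr sorts internally; the function name `apply_sort` and the sample products are hypothetical.

```python
def apply_sort(documents, sort_spec):
    """Order documents by a spec such as 'starRating desc, price asc'."""
    clauses = [clause.split() for clause in sort_spec.split(",")]
    result = list(documents)
    # Apply the least significant clause first; Python's sort is stable,
    # so earlier clauses end up taking precedence.
    for field, order in reversed(clauses):
        result.sort(key=lambda doc: doc[field], reverse=(order == "desc"))
    return result


products = [
    {"name": "A", "starRating": 4, "price": 30},
    {"name": "B", "starRating": 5, "price": 50},
    {"name": "C", "starRating": 5, "price": 20},
    {"name": "D", "starRating": 4, "price": 10},
]
ordered = apply_sort(products, "starRating desc, price asc")
print([doc["name"] for doc in ordered])  # prints ['C', 'B', 'D', 'A']
```

The second clause only breaks ties left by the first: among the five-star products, the cheaper one comes first.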
  • 40. Sorting ● cannot use multivalued fields ● overrides the default sorting behavior
  • 41. Faceted Search ● facet values: (distinct) values or (generally non-overlapping) ranges of a field ● displaying facets ○ show possible values ○ let users narrow down their searches easily
  • 43. Faceted Search ● set facet parameter to true - enables faceting ● other parameters ○ facet.field - use the field's values as facets ■ return <value, count> pairs ○ facet.query - use the given queries as facets ■ return <query, count> pairs ○ facet.sort - set the ordering of the facets ■ can be "count" or "index" ○ facet.offset and facet.limit - used for pagination of facets
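The <value, count> pairs that field faceting returns can be imitated over a small in-memory result set. This is an illustrative sketch only: the function name, the "brand" field, the tie-breaking rule and the default limit of 100 are assumptions.

```python
from collections import Counter


def facet_field(documents, field, sort="count", offset=0, limit=100):
    """Return (value, count) pairs for one field, with ordering and paging."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    if sort == "count":
        # Highest count first; ties broken by value (an assumption).
        pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    else:
        # "index" ordering: sorted by the value itself.
        pairs = sorted(counts.items())
    return pairs[offset:offset + limit]


results = [{"brand": "zeta"}, {"brand": "zeta"}, {"brand": "acme"}, {"title": "no brand"}]
print(facet_field(results, "brand"))                # prints [('zeta', 2), ('acme', 1)]
print(facet_field(results, "brand", sort="index"))  # prints [('acme', 1), ('zeta', 2)]
```

The `offset` and `limit` arguments play the role of the pagination parameters: `facet_field(results, "brand", offset=1, limit=1)` returns only the second pair.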
  • 44. Resources - Books ● Lucene in Action ○ written by 3 committers and PMC members ○ somewhat outdated (2010; covers Lucene 3.0) ○ http://www.manning.com/hatcher3/ ● Solr in Action ○ early access; coming out later this year ○ http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook ○ common problems and useful tips ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  • 45. Resources - Books ● Introduction to Information Retrieval ○ not specific to Lucene/Solr, but about IR concepts ○ free e-book ○ http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes ○ indexing, compression and other topics ○ accompanied by MG4J - a full-text search software ○ http://mg4j.di.unimi.it/
  • 46. Resources - Web ● official websites ○ Lucene Core - http://lucene.apache.org/core/ ○ Solr - http://lucene.apache.org/solr/ ● mailing lists ● Wiki sites ○ Lucene Core - http://wiki.apache.org/lucene-java/ ○ Solr - http://wiki.apache.org/solr/ ● reference guides ○ API Documentation for Lucene and Solr ○ Apache Solr Reference Guide
  • 47. Getting Started ● download Solr ○ requires Java 6 or newer to run ● Solr comes bundled/configured with Jetty ○ <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents ○ <Solr directory>/example/exampledocs/post.jar ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml ● use the Solr admin interface ○ http://localhost:8983/solr/