5. 5
Why should you use Open Source?
• State of the Art Technologies
• Community Support
• Vast Documentation
• Code is accessible
• Customizable
• Mostly free licensing
6. 6
Why should you contribute to Open Source?
• Share Knowledge and Ideas
• Improve established Technologies
• Become part of a Community
• Not only code, all your skills are relevant
• Be useful to the World
8. 8
What is SOLR
• A Search Engine
• A REST-like API
• Built on Lucene
• Open Source
• Blazing-fast
• Scalable
• Fault tolerant
9. 9
Why SOLR
Scalable
Solr scales by distributing work (indexing and query processing) to multiple servers in a cluster.
Ready to deploy
Solr is open source, is easy to install and configure, and provides a preconfigured example to help you get started.
Optimized for search
Solr is fast and can execute complex queries in subsecond speed, often only tens of milliseconds.
Large volumes of documents
Solr is designed to deal with indexes containing many millions of documents.
Text-centric
Solr is optimized for searching natural-language text, like emails, web pages, resumes, PDF documents, and social messages such as tweets or blogs.
Results sorted by relevance
Solr returns documents in ranked order based on how relevant each document is to the user’s query.
11. 11
Features overview
• Pagination and sorting
• Faceting
• Autosuggest
• Spell-checking
• Highlighting
• Geospatial search
• More Like This
12. 12
Features overview
• Flexible query support
• Document clustering
• Import rich document formats (PDF, Office…)
• Import data from databases
• Multilingual support
DIH: Data Import Handler
16. 16
Lucene Document
• Documents are the unit of information for indexing and search
• A Document is a set of fields
• Each field has a name and a value
• All field types must be defined, and all field names (or dynamic field-naming patterns) should be specified in Solr's schema.xml
Schema Configuration
• Per collection/index
• XML file
• Defines how the inverted index will be built
• Fields/Field Types definition
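To make this concrete, a field definition in schema.xml might look like the following (a minimal sketch, assuming a text_general field type is defined elsewhere in the schema):
<field name="name" type="text_general" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*_str" type="string" indexed="true" stored="true" multiValued="true"/>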
17. 17
Lucene Document – Search problem
The Beginner’s Guide to Buying a House
How to Buy Your First House
Purchasing a Home
Becoming a New Home owner
Buying a New Home
Decorating Your Home
A Fun Guide to Cooking
How to Raise a Child
Buying a New Car
SELECT * FROM Books WHERE Name = 'buying a new home';
0 results
SELECT * FROM Books
WHERE Name LIKE '%buying%'
AND Name LIKE '%a%'
AND Name LIKE '%home%';
1 result
Buying a New Home
SELECT * FROM Books
WHERE Name LIKE '%buying%'
OR Name LIKE '%a%'
OR Name LIKE '%home%';
8 results
A Fun Guide to Cooking, Decorating Your Home, How to Raise a Child, Buying a New Car, Buying a New Home, The Beginner's Guide to Buying a House, Purchasing a Home, Becoming a New Home Owner
Problems the SQL approach handles poorly:
• Unimportant words
• Synonyms
• Linguistic variations
• Ordering (relevance)
18. 18
Lucene Document – Inverted Index
Documents:
Doc # | Content field
1 | A Fun Guide to Cooking
2 | Decorating Your Home
3 | How to Raise a Child
4 | Buying a New Car
5 | Buying a New Home
6 | The Beginner's Guide to Buying a House
7 | Purchasing a Home
8 | Becoming a New Home Owner
9 | How to Buy Your First House
Inverted index:
Term | Doc #
a | 1,3,4,5,6,7,8
becoming | 8
beginner's | 6
buy | 9
buying | 4,5,6
child | 3
cooking | 1
decorating | 2
home | 2,5,7,8
house | 6,9
how | 3,9
new | 4,5,8
purchasing | 7
your | 2,9
19. 19
Searching
Term | Docs
buying | 4,5,6,7,9
home | 2,5,6,7,8,9
Unimportant word "a" is skipped
Synonyms: purchasing ~ buying, house ~ home
Linguistic variations: buy ~ buying
Intersection (AND) = 5,6,7,9
Buying a New Home
The Beginner’s Guide to Buying a House
Purchasing a Home
How to Buy Your First House
20. 20
Searching operators
• Required terms: buying AND home
• Optional terms: buying OR home
• Negated terms: buying NOT home
• Phrases: "buying a home"
• Grouped expressions: (buying OR renting) AND home
• Fuzzy matching: administrator~
• Wildcard: offi* off*r off?r
• Range: yearsOld:[18-21]
• Proximity: "chief officer"~1
• Distance
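As a sketch of how these operators appear in a real request (assuming a core with a name field, such as the films core created later in this deck; curl URL-encodes the query for us):
$ curl http://localhost:8983/solr/films/select --data-urlencode 'q=name:(buying AND home)'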
21. 21
Relevancy till SOLR 4 (TF/IDF)
A relevancy score for each document is calculated and the search results are sorted from the highest score to the lowest.
Similarity
Term frequency
• A document is more relevant for a particular term if the term appears multiple times
Inverse document frequency
• A measure of how "rare" a search term is, calculated from the document frequency (how many documents in total the search term appears in)
Boosting
• Multiplier in query time to adjust the weight of a field
• title:solr^2.5 description:solr
Normalization factors for fields, queries and coord
22. 22
Relevancy from SOLR 6 (BM25)
BM25 improves upon TF/IDF
BM25 stands for “Best Match 25” (25th iteration on TF/IDF)
Includes different factors
• Frequency of a term in all Documents
• Term Frequency in a Document
• Document Length
BM25 limits the influence of term frequency:
• less influence of common words
With TF/IDF: short fields (title,...) are automatically scored higher
BM25: Scales field length with average
• field length treatment does not automatically boost short fields
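For reference, the standard BM25 scoring formula (not spelled out in the deck) is:
score(D, Q) = Σ over terms t in Q of IDF(t) · f(t, D) · (k1 + 1) / ( f(t, D) + k1 · (1 − b + b · |D| / avgdl) )
where f(t, D) is the frequency of term t in document D, |D| is the field length, avgdl is the average field length, and k1 and b are tuning parameters (Lucene defaults: k1 = 1.2, b = 0.75). The (1 − b + b · |D| / avgdl) factor is the scaled field-length treatment mentioned above.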
23. 23
Precision and Recall
Precision is a measure of how "good" each of the results of a query is. A query that returns a single correct document (and nothing else) is still considered perfectly precise, even if a million other correct documents were missed.
Recall is a measure of how many of the correct documents are returned. A query that returns a single correct document out of a million correct documents is considered to have very poor recall.
>> Balancing precision and recall will improve the quality of your search results.
20 correct documents
Search results containing 10 documents (8 correct and 2 incorrect)
Precision = 80% (8 / 10)
Recall = 40% (8 / 20)
What is the precision and recall for the previous "buying a home" sample?
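In formula form (standard definitions, stated here for reference): Precision = correct results returned / total results returned; Recall = correct results returned / total correct documents in the collection.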
24. 24
Searching at Scale
Scaling SOLR
Solr is able to scale to handle billions of documents and an infinite number of queries by adding servers.
Some limitations
• You can insert, delete, and update documents, but not single fields (easily)
• Solr is not optimized for processing very long queries (thousands of terms) or returning very large result sets to users.
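Beyond the scope of this deck (a sketch): the distributed setup referred to here is SolrCloud, and a single cloud-mode node with embedded ZooKeeper can be started locally for experimentation with:
$ bin/solr start -c -p 8983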
26. 26
Requirements
• Java Runtime Environment 1.8+
$ java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
• Supported Operating Systems
• Linux
• MacOS
• Windows
https://lucene.apache.org/solr/downloads.html
27. 27
Directory layout
bin/
• solr | solr.cmd : start SOLR
• post : posting content to SOLR
• solr.in.sh | solr.in.cmd : configuration
contrib/
• add-ons plugins
dist/
• SOLR Jar files
docs/
• JavaDocs
example/
• CSV, XML and JSON
• DIH for databases
• Word and PDF files
licenses/
• 3rd party libraries
server/
• SOLR Admin UI
• Jetty Libraries
• Log files
• Sample configsets
28. 28
Starting SOLR
• Use the command line interface tool called bin/solr (Linux) or bin\solr.cmd (Windows)
$ bin/solr start -p 8983
Waiting up to 180 seconds to see Solr running on port 8983 []
Started Solr server on port 8983 (pid=4521). Happy searching!
• Check if Solr is Running
$ bin/solr status
Found 1 Solr nodes:
Solr process 4521 running on port 8983
{
"solr_home":"/Users/aborroy/Downloads/solr-introduction-university/solr-8.4.1/server/solr",
"version":"8.4.1 832bf13dd9187095831caf69783179d41059d013 - ishan - 2020-01-10 13:40:28",
"startTime":"2020-03-08T08:13:49.969Z",
"uptime":"0 days, 0 hours, 17 minutes, 56 seconds",
"memory":"91.6 MB (%17.9) of 512 MB"}
30. 30
Creating a new Core
$ bin/solr create -c films
• -c indicates the collection name
Check the default fields added by SOLR to the Schema
Check JSON Data to be posted in example/films/films.json
{
"id": "/en/45_2006",
"directed_by": [
"Gary Lennon"
],
"initial_release_date": "2006-11-30",
"genre": [
"Black comedy",
"Thriller"
],
"name": ".45"
}
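As an alternative to the bin/post tool used on the next slide, the same file could be sent with curl against the update endpoint (a sketch):
$ curl 'http://localhost:8983/solr/films/update/json/docs?commit=true' -H 'Content-Type: application/json' --data-binary @example/films/films.json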
31. 31
Posting data
$ bin/post -c films example/films/films.json
Posting files to [base] url http://localhost:8983/solr/films/update...
POSTing file films.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/films/update/json/docs
SimplePostTool: WARNING: Response: {
"responseHeader":{
"status":400,
"QTime":120},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","java.lang.NumberFormatException"],
"msg":"ERROR: [doc=/en/quien_es_el_senor_lopez] Error adding field 'name'='¿Quién es el señor López?' msg=For input string:
"¿Quién es el señor López?"",
"code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/films/update/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.323
32. 32
How many results were posted?
http://127.0.0.1:8983/solr/films/select?indent=on&q=*:*&wt=json
• q: the main query
• fq: filter queries
• sort: sort field and direction (asc or desc)
• start, rows: offset and number of rows
• fl: list of fields to return
• wt: response format, XML or JSON
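Combining several of these parameters (a sketch; spaces shown unencoded for readability):
http://127.0.0.1:8983/solr/films/select?q=genre:Thriller&fq=initial_release_date:[2000-01-01T00:00:00Z TO *]&sort=initial_release_date desc&start=0&rows=5&fl=name,initial_release_date&wt=json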
33. 33
What was wrong?
Check carefully JSON Data to be posted in example/films/films.json
{
"id": "/en/quien_es_el_senor_lopez",
"directed_by": [
"Luis Mandoki"
],
"genre": [
"Documentary film"
],
"name": "u00bfQuiu00e9n es el seu00f1or Lu00f3pez?"
},
http://127.0.0.1:8983/solr/#/films/schema?field=name
34. 34
Auto-Generated SOLR Schema
http://127.0.0.1:8983/solr/#/films/files?file=managed-schema
Callouts from the screenshot (each corresponds to a schema field attribute):
• multiValued: a single document might contain multiple values for this field type
• indexed: the value of the field can be used in queries to retrieve matching documents (true by default)
• required: SOLR rejects any attempt to add a document which does not have a value for this field
• stored: the actual value of the field can be retrieved by queries
name can contain text!
35. 35
Re-Creating the Core
Deleting core “films”
$ bin/solr delete -c films
Deleting core 'films' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=films&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true
Creating core “films”
$ bin/solr create -c films
Created new core 'films’
Creating the field “name” for the core “films”
http://127.0.0.1:8983/solr/#/films/schema
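The same field can also be created without the Admin UI, through the Schema API (a sketch; the exact attributes depend on what you choose in the UI):
$ curl -X POST -H 'Content-Type: application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema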
36. 36
Posting Data 2
$ bin/post -c films example/films/films.json
Posting files to [base] url http://localhost:8983/solr/films/update...
POSTing file films.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.417
http://127.0.0.1:8983/solr/films/select?indent=on&q=*:*&wt=json
37. 37
Exploring SOLR Analyzers
• Solr analyzes both index content and query input before matching the results
• The live analysis can be observed by using “Analysis” option from Solr Admin UI
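As an illustration of what an analysis chain does (a hypothetical text_general-style configuration):
"Buying a New Home" → tokenize → [Buying] [a] [New] [Home] → lowercase → [buying] [a] [new] [home] → stop-word filter → [buying] [new] [home]
A compatible chain is applied to query input, which is why a query for "HOME" can match the indexed term "home".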
39. 39
Searching
q = genre:Fantasy directed_by:"Robert Zemeckis"
• This query matches films with genre Fantasy or directed by Robert Zemeckis (OR is the default operator)
40. 40
Filtering
q = genre:Fantasy
fq = initial_release_date:[NOW-12YEAR TO *]
• This query searches for films of genre Fantasy released in the last 12 years
41. 41
Sorting
q = *:*
sort = initial_release_date desc
• This query orders all the films by release date in descending order
44. 44
Faceting
Multiple fields for faceting
http://127.0.0.1:8983/solr/films/select?facet.field=directed_by_str&facet.field=genre&facet=on&indent=on&q=*:*&wt=json
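Two parameters that are often added to keep facet lists manageable (a sketch): facet.limit caps the number of values returned per field, and facet.mincount hides values below a count threshold, e.g.
http://127.0.0.1:8983/solr/films/select?q=*:*&facet=on&facet.field=genre&facet.limit=10&facet.mincount=5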