The Apache Solr search engine has become almost the default choice for adding superior search capabilities to a web application. In this talk we will go beyond the basics of Solr, looking at what it offers and how to set it up robustly for production use. We will plan and implement a document model in Solr, and look at how to index different document types with Solr Cell or index data from the web with the Nutch crawler. We will cover options for tuning queries and performance, and examine how best to use more advanced features like faceting, spelling correction and 'more like this'. Solr offers a language-agnostic web service, so client examples will be in PHP and Python, but the bulk of the content will be applicable to anyone looking to work well with Solr.
6. Solr's secret plan! [Slide artwork: scattered fragments of Solr's schema.xml and solrconfig.xml, not recoverable as text]
27.
from solr import *

# Connect to the Solr core that will hold the documents
s = SolrConnection('http://localhost:8080/solr/main')

# One document, expressed as field name -> value
doc = dict(
    permalink="http://fooweb.com/strategy/DCPO",
    category="strategy",
    title="DPCO: A Framework For Synergy",
    body=("DPCO, or Dynamic Performance Class "
          "Organisation is a ISO90210 quality oriented "
          "management process [...]"),
    author="Sean Alison",
    date="2011-03-01T00:00:00Z",
    source_site="fooweb.com",
)

s.add(doc)    # send the document to Solr
s.commit()    # make it visible to searches

simpleadd.py
28.
<add>
  <doc>
    <field name="body">DPCO, or Dynamic Performance Class [...]</field>
    <field name="category">strategy</field>
    <field name="permalink">http://fooweb.com/strategy/DCPO</field>
    <field name="source_site">fooweb.com</field>
    <field name="title">DPCO: A Framework For Synergy</field>
    <field name="date">2011-03-01T00:00:00Z</field>
    <field name="author">Sean Alison</field>
  </doc>
</add>
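The update message above is plain XML, so it can be produced from a dictionary with nothing but the standard library. The following is a minimal sketch, not what any particular client library does internally; the function name build_add_message and the sorted field order are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Serialise one document (field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    # Sort field names so the output is deterministic (an illustrative choice)
    for name in sorted(doc):
        values = doc[name]
        # A list or tuple becomes repeated <field> elements (multi-valued field)
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


example = {
    "category": "strategy",
    "title": "DPCO: A Framework For Synergy",
    "source_site": "fooweb.com",
}
xml_message = build_add_message(example)
print(xml_message)
```

Letting an XML library do the escaping avoids malformed messages when a field value contains characters such as & or <.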
40.
# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-z0-9-A-Z]*\.)*fooweb.com/
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# deny anything else
-.

regex-urlfilter.txt
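To see how a rule list like this behaves, here is a small Python sketch that evaluates the rules against sample URLs. It assumes the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected; the helper names parse_rules and accepts are hypothetical, and Python's regex dialect stands in for Java's.

```python
import re


def parse_rules(text):
    """Turn regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(r"""
# skip some protocols
-^(https|telnet|file|ftp|mailto):
-[?*!@=]
# allow urls in defined domain
+^http://([a-z0-9-A-Z]*\.)*fooweb.com/
# skip URLs with slash-delimited segment that repeats 3+ times
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# deny anything else
-.
""")

for candidate in ("http://www.fooweb.com/strategy/DCPO",
                  "http://www.fooweb.com/search?q=solr",
                  "http://example.org/"):
    print(candidate, accepts(RULES, candidate))
```

Under the first-match assumption, rule order matters: a rule placed after the + rule is never reached for URLs that the + rule has already accepted.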
46.
from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('idie manager')
# Print the title and body of every matching document
for hit in response.results:
    print(hit['title'])
    print(hit['body'])
$ python simplequery.py
Overview of the IDIE manager
To help with those implementing IDIE [...]
IDIE: The 801g Of Talent Management
Inspiration-Direction-Influence [...]
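Because Solr is queried over HTTP, the query above comes down to a URL with a handful of parameters. The sketch below builds such a URL without sending it; q, start, rows and wt are standard Solr parameters, while the function name build_select_url and the choice of the /select handler with JSON output are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_select_url(core_url, query, start=0, rows=10, **extra):
    """Build a URL for Solr's /select handler without sending the request."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", "json")]
    # Extra parameters (for example fl="title,body") are appended in sorted order
    params.extend(sorted(extra.items()))
    return core_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url("http://localhost:8080/solr/main", "idie manager", rows=5)
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw response while debugging a client.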
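54.
// Assumes solr-php-client is on the include path
require_once 'Apache/Solr/Service.php';

$solr = new Apache_Solr_Service('localhost', 8080, '/solr/main');
$query = "Losing my backpacking virginity";
// 'qt' selects the request handler, here one named "mlt" (more like this)
$p = array('qt' => "mlt");
// Offset 0, limit 3
$results = $solr->search($query, 0, 3, $p);
foreach ($results->response->docs as $doc) {
    echo $doc->title, PHP_EOL;
}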
$ php mltquery.php
Backpacking across USA social media way
Safe solo travel on New York holidays
Cracking The Big Apple's Big 10
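The PHP loop above walks response->docs, which mirrors the layout of Solr's JSON response. For comparison, here is a minimal Python sketch that pulls the titles out of such a response; the SAMPLE body is hypothetical data shaped like Solr's JSON output, and the titles helper is an illustrative name.

```python
import json

# Hypothetical response body, shaped like Solr's JSON response writer output
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"title": "Backpacking across USA social media way"},
      {"title": "Safe solo travel on New York holidays"}
    ]
  }
}
"""


def titles(raw_json):
    """Return the title of every document in a Solr JSON response."""
    data = json.loads(raw_json)
    # Documents without a title field get a placeholder instead of raising
    return [doc.get("title", "(untitled)") for doc in data["response"]["docs"]]


for title in titles(SAMPLE):
    print(title)
```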
55. Thanks!
Script: Ian Barber (phpir.com)
Art: the internet!
Editor: twitter.com/ianbarber
Lettering: ian.barber@gmail.com
http://joind.in/2899
63. http://code.google.com/p/solr-php-client
$ php sortquery.php
Zola Jesus album review - Stridulum II
Zero 7 album review - Record
Zebra and Giraffe
Young Knives video interview part 2
Young Knives - Road to V winners on tour
You Me At Six @ Wembley Arena, London
You Me At Six - Hold Me Down
Yet again... Good Shoes @ ULU, London
Yelle: North American tour review
Yelle: interview with a French pop artiste
73.
from solr import *

url = 'http://localhost:8080/solr/main'
s = SolrConnection(url)
response = s.query('ISO90210')
# numFound may arrive as text or as a number, so normalise before comparing
if int(response.results.numFound) == 0:
    print("No results found!")
$ python simplefail.py
DPCO: A Framework For Synergy
DPCO, or Dynamic Performance Class Organisation is a ISO90210 quality [...]