Multilingual Search with Solr

Multilingual Search and Text Analytics with Solr
Steve Kearns
Director of Product Management, Basis Technology
Open Source Search Conference 2012

Agenda

•  Why is Language Important?
•  Approaches for language-aware search
•  Solr Configuration Options


Language is Important


Why is language important?

•  Content is produced and consumed in the native language
•  Document collections often contain more than one language
•  Each language is unique, and presents different challenges to the search engine


Language is Complex

•  Tokenization
   •  Some languages do not use spaces
   •  Compound words combine two or more words
   •  Conjunctions
•  Inflection
   •  In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.


Language is Complex

http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png

Language is Complex!

•  The Spanish word “pasaportar” has more than 50 inflected forms:

pasaportando      pasaportareis     pasaportarán
pasaportes        pasaportaron      pasaporte
pasaportada       pasaportase       pasaportan
pasaportaba       pasaportemos      pasaporta
pasaportarían     pasaportaría      pasaportaste
pasaportarais     pasaportara       pasaportad
pasaportasen      pasaportasteis    pasaportéis
pasaportaren      pasaportáramos    pasaportadas
pasaportado       pasaportaban      pasaporté
pasaportaremos    pasaportásemos    pasaportados
pasaportábamos    pasaportamos      pasaportaré
pasaportases      pasaporten        pasaportare
pasaportaríais    pasaportaréis     pasaportará
pasaportaran      pasaportabas      pasaportó
pasaportarías     pasaportaríamos   pasaportabais
pasaportaras      pasaportáremos    pasaportaseis
pasaportarás      pasaporto         …

http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar

Language Examples

•  English:
   spoke (noun, wheel part) → spoke
   spoke (verb, past tense) → speak
•  French:
   été (summer) → été (summer)
   été (was) → être (to be)
•  German:
   Robbe (seal) → Robbe (seal)
   robbe (I crawl) → robben (to crawl)
   Samstagmorgen (Saturday morning) → Samstag, Morgen (compound)
•  Japanese:
   •  首脳会談後、オバマ大統領は記者団の質問に答える予定
      –  Where are the words??


Language-Aware Search Technology

•  Rosette Linguistic Platform
   •  Language Identification
   •  Tokenization
      »  Morphological
   •  Token processing
      »  Lemmatization
   •  Higher level analytics
      »  Entity Extraction
      »  Relationship Extraction
   •  Entity Translation and Entity Search


Language Identification

•  Find a single dominant language in a document
•  Find multiple languages in a single document
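The two identification tasks above can be illustrated with a minimal sketch. It assumes a stopword-counting heuristic and tiny hand-made word lists (both hypothetical, chosen only for illustration); a production identifier would rely on far richer statistical models.

```python
from collections import Counter

# Tiny, hypothetical stopword lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "in", "to"},
    "es": {"el", "la", "es", "y", "de", "los"},
    "de": {"der", "die", "das", "ist", "und", "nach", "ich"},
}

def score_languages(text):
    """Count how many tokens of `text` are stopwords of each language."""
    tokens = text.lower().split()
    scores = Counter()
    for lang, words in STOPWORDS.items():
        scores[lang] = sum(1 for tok in tokens if tok in words)
    return scores

def dominant_language(text):
    """Return the best-scoring language, or None if no stopword matched."""
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None

def languages_by_sentence(text):
    """Identify a language per sentence to surface multilingual documents."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [(s, dominant_language(s)) for s in sentences]
```

Calling `dominant_language` on a whole document addresses the first bullet; `languages_by_sentence` is one possible way to approach the second, at the cost of less evidence per decision.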


Tokenization

•  Morphological Analysis vs. N-gram
•  Search Term: 東京ルパン上映時間
•  N-gram:
•  Morphological Analysis:
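To make the contrast concrete, here is a small sketch applied to the search term above. The bigram function is ordinary character n-gramming. The second function is only a stand-in for morphological analysis: it does greedy longest-match segmentation against a hypothetical four-word lexicon, which is an assumption made for illustration and not how a real morphological analyzer works.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical lexicon; a real analyzer ships a full dictionary and grammar.
LEXICON = {"東京", "ルパン", "上映", "時間"}

def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation; unknown characters become tokens."""
    tokens = []
    max_len = max(len(word) for word in lexicon)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens

term = "東京ルパン上映時間"
bigrams = char_ngrams(term)               # 8 overlapping two-character tokens
words = longest_match_segment(term, LEXICON)  # 4 dictionary words
```

The bigram list contains fragments that cross word boundaries, including パン, which is itself the Japanese word for bread, so an n-gram index can match queries the user never intended. The dictionary-based output keeps only whole words.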


Token Processing

•  Stemming vs. Lemmatization
•  English: “I have spoken at several conferences”
•  Stemming:
•  Lemmatization:


Stemming vs. Lemmatization

•  Two words with the same spelling but different meanings create the same stem.

Stemming (INCORRECT)                 Lemmatization (CORRECT)
prensa (media) → prens               prensa (media) → prensa (media)
prensa (he/she presses) → prens      prensa (he/she presses) → prensar (to press)
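The difference can be sketched in a few lines. The stemmer below is a deliberately crude stand-in (it only strips a final vowel), and the lemma table and part-of-speech tags are hypothetical; the point is simply that a lemmatizer can use context such as part of speech, while a stemmer sees only the spelling.

```python
# Hypothetical lemma table keyed by (surface form, part of speech).
LEMMAS = {
    ("prensa", "NOUN"): "prensa",   # media
    ("prensa", "VERB"): "prensar",  # he/she presses -> to press
    ("spoke", "NOUN"): "spoke",     # wheel part
    ("spoke", "VERB"): "speak",     # past tense of speak
}

def toy_stem(word):
    """Strip a final vowel; a crude stand-in for a real stemming algorithm."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word[-1] in "aeiou" else word

def lemmatize(word, pos):
    """Look up the lemma for a word given its part of speech."""
    return LEMMAS.get((word.lower(), pos), word.lower())

# The stemmer cannot tell the two readings of "prensa" apart.
same_stem = toy_stem("prensa")
noun_lemma = lemmatize("prensa", "NOUN")
verb_lemma = lemmatize("prensa", "VERB")
```

Here `same_stem` is the single form "prens" for both meanings, while `noun_lemma` and `verb_lemma` differ, which is the distinction the table above draws.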


Stemming vs. Lemmatization

•  Two different words create the same stem.

Stemming (INCORRECT)                     Lemmatization (CORRECT)
publicaciones (publications) → public    publicaciones (publications) → publicación
publico (public) → public                publico (public) → public (public)
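The practical cost of such a collision shows up at query time. The sketch below builds a tiny inverted index twice, once normalizing with a toy suffix-stripping stemmer and once with a lemma lookup. The suffix list and the two-document collection are hypothetical; the lemma values are the ones shown on the slide.

```python
from collections import defaultdict

# Hypothetical suffix rules, tried in order (longest first).
SUFFIXES = ("aciones", "ación", "o", "a")

# Lemma values as given in the table above.
LEMMAS = {"publicaciones": "publicación", "publico": "public"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Return the dictionary lemma, or the word itself if unknown."""
    return LEMMAS.get(word, word)

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, term, normalize):
    """Normalize the query term the same way and look it up."""
    return sorted(index.get(normalize(term.lower()), set()))

DOCS = {1: "publicaciones", 2: "publico"}
stem_hits = search(build_index(DOCS, toy_stem), "publico", toy_stem)
lemma_hits = search(build_index(DOCS, lemmatize), "publico", lemmatize)
```

With stemming, a query for "publico" also returns the document about "publicaciones"; with lemmatization it returns only the matching document, so recall is kept without the loss of precision.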


Token Processing

•  German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
•  Stemming:
•  Lemmatization (and decompounding!):
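Decompounding can be sketched as a dictionary-driven split. This is a minimal illustration assuming a small hypothetical lexicon and ignoring linking elements (such as the German "s" between parts), which a real decompounder has to handle.

```python
# Hypothetical lexicon of known German words (lowercased).
LEXICON = {"samstag", "morgen", "boston"}

def decompound(word, lexicon, min_part=3):
    """Split `word` into known parts, or return it unchanged if no split fits."""
    lowered = word.lower()
    if lowered in lexicon:
        return [lowered]
    for i in range(min_part, len(lowered) - min_part + 1):
        head, tail = lowered[:i], lowered[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_part)
            # Accept the split only if every remaining part is a known word.
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [lowered]

parts = decompound("Samstagmorgen", LEXICON)
```

Indexing both components lets a query for "Samstag" or "Morgen" find a document that only contains the compound "Samstagmorgen".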


How to Configure Solr

•  Challenges
   •  Multiple languages in the data set
•  Goals:
   1.  Language Identification
   2.  Language-aware Search:
       •  Tokenization
       •  Token Processing


How to Configure Solr

•  What tools does Solr have to work with?
   •  UpdateRequestProcessor
   •  Analyzer/CharFilter/Tokenizer/TokenFilter
   •  Solr Cores
   •  Pre-process data before Solr?


Solr UpdateRequestProcessor

•  Runs Before Analyzers
•  Full Access to Document
•  Two options:
   •  Run the analysis directly in Solr
      •  Good for Lightweight Analysis
   •  Call out to external analysis services
      •  Web Services/UIMA. Increases Complexity
•  Limitations:
   •  Think through your indexing strategy
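The idea of a processing step that sees the whole document before any field-level analysis can be modeled in plain Python. This is a conceptual sketch, not Solr's Java UpdateRequestProcessor API; the field names `text` and `language` and the script-based detector are assumptions made for illustration.

```python
def identify_language(text):
    """Placeholder detector: any hiragana or katakana marks the text as Japanese.

    Returns "und" (undetermined) otherwise. A real detector would be plugged in here.
    """
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"
    return "und"

def language_id_processor(doc, source_field="text", target_field="language"):
    """Return a copy of the document with a language field added."""
    updated = dict(doc)
    updated[target_field] = identify_language(doc.get(source_field, ""))
    return updated

def run_chain(doc, processors):
    """Apply each processor in order, as happens before field analysis."""
    for processor in processors:
        doc = processor(doc)
    return doc

doc = {"id": "1", "text": "首脳会談後、オバマ大統領は記者団の質問に答える予定"}
processed = run_chain(doc, [language_id_processor])
```

Because the processor receives the full document, it can read one field and write another, which an analyzer attached to a single field cannot do.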


Solr Analyzer/Tokenizer

•  Good for:
   •  Segmentation of Asian Languages
   •  Linguistics - Lemmatization
•  Limitations:
   •  No access to document object
•  Schema.xml
   •  FieldType
      •  Analyzer
         –  CharFilter
         –  Tokenizer
         –  TokenFilter
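The order of the three analyzer stages can be modeled as a simple function pipeline. This is a conceptual sketch in Python rather than Solr configuration; the stage functions and the stopword set are hypothetical stand-ins for whatever components a field type declares.

```python
import re

def strip_markup(text):
    """CharFilter stage: rewrites raw characters before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer stage: turns the character stream into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """TokenFilter stage: transforms each token."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter stage: removes tokens found in a (hypothetical) stop list."""
    return [token for token in tokens if token not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then exactly one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>The</b> Agenda of Solr",
    [strip_markup],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
```

Note that every stage receives only the field's text or tokens, which mirrors the limitation listed above: nothing in the chain can look at the rest of the document.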


Goal 1: Language ID

•  UpdateRequestProcessor
   •  Runs before field-level analysis takes place
   •  Basic Language Identifier URP to be included in Solr
•  Outside Solr

What do you do with the language information??


Goal 2: Multi-Lingual Support in Solr

•  Three main approaches:
   1.  One Solr field for each language
   2.  One Solr Core per language
   3.  All Languages in a Single Field

Informed by Trey Grainger @ Careerbuilder:
http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf


Multiple Languages: Method 1

•  One field for each language
•  Pros:
   •  Simple approach and implementation
   •  Guarantees that queries are processed the same way as the index
•  Cons:
   •  Increased query-time complexity (mitigate with Dismax)
   •  Decreased query speed as additional fields are queried
   •  May require storing multiple copies of text
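A minimal sketch of Method 1 from the client side: route each document's text into a language-specific field at index time, and search all of those fields at query time with DisMax. The field names (`text_en`, `text_es`, `text_de`, `text_general`) are hypothetical and would have to exist in the schema; `defType` and `qf` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical mapping from language code to a language-specific field.
LANGUAGE_FIELDS = {"en": "text_en", "es": "text_es", "de": "text_de"}

def to_solr_document(doc_id, text, language):
    """Place the text in the field whose analyzer matches its language."""
    field = LANGUAGE_FIELDS.get(language, "text_general")
    return {"id": doc_id, "language": language, field: text}

def dismax_params(user_query):
    """Build request parameters that search every language field at once."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(LANGUAGE_FIELDS.values()),
    }

doc = to_solr_document("42", "La prensa publica hoy", "es")
query_string = urlencode(dismax_params("prensa"))
```

Each extra language adds one more field to `qf`, which is where the query-time cost listed above comes from.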


Multiple Languages: Method 2

•  One Solr core per language
   •  Each Core has the same field, with a language-specific Analyzer/Tokenizer
•  Pros:
   •  No query-time performance overhead
   •  Guarantees that queries are processed the same way as the index
•  Cons:
   •  Significant complexity in managing multiple cores
   •  Must implement custom sharding
   •  Does not support multilingual documents
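With one core per language, the routing logic moves into the client. The sketch below assumes hypothetical core names and a `language` field already set on each document; the `/solr/<core>/select` path follows Solr's usual URL layout.

```python
# Hypothetical mapping from language code to core name.
CORES = {"en": "docs_en", "es": "docs_es", "ja": "docs_ja"}

def select_url(base_url, language):
    """Return the query endpoint of the core that holds this language."""
    core = CORES[language]  # raises KeyError for a language without a core
    return f"{base_url.rstrip('/')}/{core}/select"

def route_documents(docs):
    """Group documents into per-core batches based on their language field."""
    batches = {}
    for doc in docs:
        batches.setdefault(CORES[doc["language"]], []).append(doc)
    return batches

url = select_url("http://localhost:8983/solr/", "es")
batches = route_documents([
    {"id": "1", "language": "en"},
    {"id": "2", "language": "es"},
    {"id": "3", "language": "en"},
])
```

Because every document is sent to exactly one core, a document that mixes languages has no natural home, which is the last limitation listed above.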


Multiple Languages: Method 3

•  All Languages in one field
•  Pros:
   •  Single field makes queries and indexing easy
   •  Same schema/core as more languages added
•  Cons:
   •  Requires complex custom Tokenizer/Analyzer
   •  Must pass in language information for queries and indexing
   •  Does not guarantee queries are processed the same as the index
   •  Potential TF/IDF confusion
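The single-field approach amounts to one analyzer that dispatches on a language passed in alongside the text. The sketch below assumes two toy per-language analyzers (whitespace splitting and character bigrams, both hypothetical choices) to show why a language mismatch between indexing and querying breaks matching.

```python
# Hypothetical per-language analyzers sharing one field.
ANALYZERS = {
    "en": lambda text: text.lower().split(),
    "ja": lambda text: [text[i:i + 2] for i in range(len(text) - 1)],
}

def analyze_single_field(text, language):
    """Dispatch to the analyzer for `language`; the caller must supply it."""
    try:
        analyzer = ANALYZERS[language]
    except KeyError:
        raise ValueError(f"no analyzer configured for language {language!r}")
    return analyzer(text)

# Indexed as Japanese, but queried with the wrong language attached.
indexed_tokens = set(analyze_single_field("東京ルパン", "ja"))
query_tokens = set(analyze_single_field("東京ルパン", "en"))
overlap = indexed_tokens & query_tokens
```

`overlap` is empty here: the same string produces different tokens depending on the language supplied, so a wrong or missing language at query time silently returns no match.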


Language is Important

•  Use language information at index and query time
•  Increase recall, maintain precision
•  Better search results for your users


My Contact Info

•  Steve Kearns
•  skearns@basistech.com
•  http://www.basistech.com

