This talk explores the challenges of multilingual search, including language-specific issues (N-gram segmentation vs. morphological analysis, stemming vs. lemmatization, and language identification) and the various approaches to configuring your Solr schema. We will also discuss integration strategies for common text analytics capabilities and the impact of multilingual content on application design.
Solr is a powerful search engine that has rapidly gained acceptance as an alternative to commercial search solutions for many applications. Organizations need many features to serve their diverse communities, and among these is the ability to deliver search excellence in foreign languages. Delivering quality multilingual search involves careful schema design and selection of the best linguistic approach for each supported language.
1. Multilingual Search and Text Analytics with Solr
Steve Kearns
Director of Product Management
Basis Technology
Basis Technology – Open Source Search Conference 2012 1
2. Agenda
• Why is Language Important?
• Approaches for language-aware search
• Solr Configuration Options
3. Language is Important
4. Why is language important?
• Content is produced and consumed in the native language
• Document collections often contain more than one language
• Each language is unique, and presents different challenges to the search engine
5. Language is Complex
• Tokenization
  • Some languages do not use spaces
  • Compound words combine two or more words
  • Conjunctions
• Inflection
  • In grammar, inflection is the modification of a word to express different grammatical categories such as tense, grammatical mood, grammatical voice, aspect, person, number, gender and case.
6. Language is Complex
http://en.wikipedia.org/wiki/File:Flexi%C3%B3nGato.png
7. Language is Complex!
• The Spanish word “pasaportar” has more than 50 inflected forms:
pasaportando
pasaportareis
pasaportarán
pasaportes
pasaportaron
pasaporte
pasaportada
pasaportase
pasaportan
pasaportaba
pasaportemos
pasaporta
pasaportarían
pasaportaría
pasaportaste
pasaportarais
pasaportara
pasaportad
pasaportasen
pasaportasteis
pasaportéis
pasaportaren
pasaportáramos
pasaportadas
pasaportado
pasaportaban
pasaporté
pasaportaremos
pasaportásemos
pasaportados
pasaportábamos
pasaportamos
pasaportaré
pasaportases
pasaporten
pasaportare
pasaportaríais
pasaportaréis
pasaportará
pasaportaran
pasaportabas
pasaportó
pasaportarías
pasaportaríamos
pasaportabais
pasaportaras
pasaportáremos
pasaportaseis
pasaportarás
pasaporto
…
http://education.yahoo.com/reference/dict_en_es/spanish/pasaportar
8. Language Examples
• English:
  spoke (Noun – wheel part) → spoke
  spoke (Verb, past tense) → speak
• French:
  été (summer) → été (summer)
  été (was) → être (to be)
• German:
  Robbe (seal) → Robbe (seal)
  robbe (I crawl) → robben (to crawl)
  Samstagmorgen (Saturday Morning) → Samstag, Morgen (compound)
• Japanese:
  • 首脳会談後、オバマ大統領は記者団の質問に答える予定
    – Where are the words??
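The Japanese example has no spaces to split on. One option named in the abstract is N-gram segmentation, which indexes overlapping character sequences instead of words. The following is a minimal Python sketch of character bigrams, not a description of any particular Solr component; the function name and the punctuation set are assumptions made for illustration.

```python
def char_bigrams(text):
    """Return overlapping two-character tokens, never spanning punctuation."""
    # Assumed, incomplete punctuation set; extend it for real content.
    punctuation = set("、。，．・「」 \t\n")
    tokens, run = [], ""
    for ch in text + "。":  # trailing sentinel flushes the final run
        if ch in punctuation:
            if len(run) == 1:
                tokens.append(run)  # keep a lone character as its own token
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
            run = ""
        else:
            run += ch
    return tokens


print(char_bigrams("首脳会談後、オバマ大統領"))
```

Bigrams need no dictionary, but they also produce tokens that are not real words, which is the trade-off against morphological analysis.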
10. Language Identification
• Find a single dominant language in a document
• Find multiple languages in a single document
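Both identification tasks can be illustrated with a deliberately simple Python sketch that counts stopword hits. The tiny word lists and function names are assumptions for illustration only; production identifiers rely on trained statistical models rather than hand-written lists.

```python
import re

# Tiny illustrative stopword lists (assumed); not sufficient for real use.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "es", "de", "los"},
    "de": {"der", "die", "und", "ist", "ich", "nach"},
}


def identify_language(text):
    """Return the language with the most stopword hits, or None if there are none."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def identify_per_paragraph(document):
    """Label each blank-line-separated paragraph to surface mixed-language documents."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return [(identify_language(p), p) for p in paragraphs]
```

Calling `identify_language` on the whole document approximates the single dominant language, while `identify_per_paragraph` shows how a finer granularity exposes several languages in one document.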
12. Token Processing
• Stemming vs. Lemmatization
  • English: “I have spoken at several conferences”
  • Stemming:
  • Lemmatization:
13. Stemming vs. Lemmatization
• Two words with the same spelling, but different meanings create the same stem.

Stemming (INCORRECT)                 Lemmatization (CORRECT)
prensa (media) → prens               prensa (media) → prensa (media)
prensa (he/she presses) → prens      prensa (he/she presses) → prensar (to press)
14. Stemming vs. Lemmatization
• Two different words create the same stem.

Stemming (INCORRECT)                     Lemmatization (CORRECT)
publicaciones (publications) → public    publicaciones (publications) → publicación
publico (public) → public                publico (public) → publico (public)
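The two failure modes above can be reproduced with a toy Python sketch. The suffix list and the lemma dictionary are assumptions chosen to mirror the slide examples; this is not a real Spanish stemmer or lemmatizer, and a real lemmatizer would infer the part of speech from context instead of taking it as an argument.

```python
# Toy suffix stripper (assumed rules): removes the longest matching suffix.
SUFFIXES = ["aciones", "ación", "ar", "a", "o"]


def stem(word):
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("prensa", "NOUN"): "prensa",
    ("prensa", "VERB"): "prensar",
    ("publicaciones", "NOUN"): "publicación",
}


def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word itself."""
    return LEMMAS.get((word.lower(), pos), word.lower())


# Stemming conflates both cases; lemmatization keeps them apart.
print(stem("publicaciones"), stem("publico"))
print(lemmatize("prensa", "NOUN"), lemmatize("prensa", "VERB"))
```

The stemmer sees only the characters of a word, so it cannot separate the two readings of "prensa"; the lemmatizer can because it is given grammatical information.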
15. Token Processing
German: “Am Samstagmorgen fliege ich zurueck nach Boston.”
• Stemming:
• Lemmatization (and decompounding!):
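Decompounding can be sketched in Python as a dictionary-driven split. The mini-lexicon and the function name are assumptions for illustration; real decompounders use full dictionaries and handle the linking elements that German inserts between parts.

```python
# Hypothetical mini-lexicon; far too small for real text.
LEXICON = {"samstag", "morgen", "flug", "hafen"}


def decompound(word, lexicon=LEXICON):
    """Split a compound into lexicon words, or return [word] if no full split exists."""
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first, backtracking if the remainder fails.
        for end in range(len(rest), 0, -1):
            if rest[:end] in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [rest[:end]] + tail
        return None

    parts = split(word)
    return parts if parts else [word]


print(decompound("Samstagmorgen"))
```

Indexing the parts alongside the compound lets a query for "Samstag" or "Morgen" match a document that only contains "Samstagmorgen".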
16. How to Configure Solr
• Challenges
  • Multiple languages in the data set
• Goals:
  1. Language Identification
  2. Language-aware Search:
     • Tokenization
     • Token Processing
17. How to Configure Solr
• What tools does Solr have to work with?
  • UpdateRequestProcessor
  • Analyzer/CharFilter/Tokenizer/TokenFilter
  • Solr Cores
• Pre-process data before Solr?
18. Solr UpdateRequestProcessor
• Runs Before Analyzers
• Full Access to Document
• Two options:
  • Run the analysis directly in Solr
    • Good for Lightweight Analysis
  • Call out to external analysis services
    • Web Services/UIMA. Increases Complexity
• Limitations:
  • Think through your indexing strategy
19. Solr Analyzer/Tokenizer
• Good for:
  • Segmentation of Asian Language
  • Linguistics - Lemmatization
• Limitations:
  • No access to document object
• Schema.xml
  • FieldType
  • Analyzer
    – CharFilter
    – Tokenizer
    – TokenFilter
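The CharFilter, Tokenizer, TokenFilter order can be modeled as a chain of plain functions. This Python sketch only mirrors the data flow of an analyzer; it is not Solr or Lucene code, and every function name and the stopword set are assumptions for illustration.

```python
import re
import unicodedata


def char_filter(text):
    """CharFilter stage: rewrite the raw character stream (here, Unicode normalization)."""
    return unicodedata.normalize("NFC", text)


def tokenizer(text):
    """Tokenizer stage: cut the character stream into tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """TokenFilter stage: transform each token."""
    return [t.lower() for t in tokens]


def stopword_filter(tokens, stopwords=frozenset({"the", "a", "at"})):
    """TokenFilter stage: drop unwanted tokens."""
    return [t for t in tokens if t not in stopwords]


def analyze(text):
    """Run one field value through the chain in analyzer order."""
    tokens = tokenizer(char_filter(text))
    for token_filter in (lowercase_filter, stopword_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("I have spoken at several conferences"))
```

Note that `analyze` receives only the text of one field, which reflects the limitation listed above: the chain has no access to the rest of the document, such as a detected language stored in another field.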
20. Goal 1: Language ID
• UpdateRequestProcessor
  • Runs before field-level analysis takes place
  • Basic Language Identifier URP to be included in Solr
• Outside Solr
What do you do with the language information??
21. Goal 2: Multi-Lingual Support in Solr
• Three main approaches:
  1. One Solr field for each language
  2. One Solr Core per language
  3. All Languages in a Single Field
Informed by Trey Grainger @ Careerbuilder:
http://www.lucidimagination.com/sites/default/files/Grainger%20Trey%20-%20Extending%20Solr,%20Building%20a%20Cloud-Like%20Knowledge%20Discovery%20Platform%20-%20rev.pdf
22. Multiple Languages: Method 1
• One field for each language
• Pro:
  • Simple approach and implementation
  • Guarantees that queries are processed the same way as index
• Con:
  • Increased query-time complexity (mitigate with Dismax)
  • Decreased query speed as additional fields are queried
  • May require storing multiple copies of text
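Method 1 can be sketched in Python as two small helpers: one that places text in a language-specific field at index time, and one that builds DisMax parameters covering every language field at query time. The field names (`text_en`, `text_general`, `language`) and the supported-language tuple are hypothetical; `defType` and `qf` are standard Solr query parameters.

```python
def to_solr_document(doc_id, text, language, supported=("en", "es", "de", "ja")):
    """Place the text in a per-language field; all field names are hypothetical."""
    field = f"text_{language}" if language in supported else "text_general"
    return {"id": doc_id, "language": language, field: text}


def dismax_params(user_query, languages=("en", "es", "de", "ja")):
    """Build request parameters that search every language field in one query."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": " ".join(f"text_{lang}" for lang in languages),
    }


print(to_solr_document("42", "El gato es de la casa.", "es"))
print(dismax_params("pasaporte"))
```

Each added language lengthens `qf`, which is where the query-time cost listed under the cons comes from.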
23. Multiple Languages: Method 2
• One Solr core per language
  Each Core has the same field, with a language-specific Analyzer/Tokenizer
• Pros:
  • No query-time performance overhead
  • Guarantees that queries are processed the same way as index
• Cons:
  • Significant complexity in managing multiple cores
  • Must implement custom sharding
  • Does not support multilingual documents
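With Method 2 the routing logic moves into the client. A minimal Python sketch, assuming a local Solr instance and hypothetical core names, shows the idea of choosing a core by language for both indexing and search:

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed host and port
CORES = {"en": "docs_en", "es": "docs_es", "de": "docs_de"}  # hypothetical core names


def update_url(language):
    """URL of the core that should index a document in this language."""
    return f"{SOLR_BASE}/{CORES[language]}/update"


def select_url(query, query_language):
    """Search only the core that matches the language of the query."""
    params = urlencode({"q": query})
    return f"{SOLR_BASE}/{CORES[query_language]}/select?{params}"


print(update_url("es"))
print(select_url("pasaporte", "es"))
```

A document goes to exactly one core in this sketch, which illustrates why the approach does not support multilingual documents. Unsupported languages raise `KeyError` here; a real client would need a fallback policy.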
24. Multiple Languages: Method 3
• All Languages in one field
• Pros:
  • Single field makes queries and indexing easy
  • Same schema/core as more languages added
• Cons:
  • Requires complex custom Tokenizer/Analyzer
  • Must pass in language information for queries and indexing
  • Does not guarantee queries are processed the same as the index
  • Potential TF/IDF confusion
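One way to pass language information into a single field is to prefix the text with a marker that a custom analyzer strips and uses to pick its processing. The Python sketch below assumes a `[lang]` prefix convention and stand-in analyzers, both invented for illustration; it is not the implementation referenced in the talk.

```python
import re

# Stand-in analyzers (assumed); real ones would segment, lemmatize, or decompound.
ANALYZERS = {
    "en": lambda text: re.findall(r"\w+", text.lower()),
    "ja": lambda text: [text[i:i + 2] for i in range(len(text) - 1)],
}

# Hypothetical marker convention: "[ja]会談後" means Japanese text follows.
MARKER = re.compile(r"^\[(\w+)\](.*)$", re.DOTALL)


def analyze_tagged(value, default_language="en"):
    """Analyze '[lang]text' with the analyzer for lang, falling back to the default."""
    match = MARKER.match(value)
    language, text = match.groups() if match else (default_language, value)
    analyzer = ANALYZERS.get(language, ANALYZERS[default_language])
    return analyzer(text)


print(analyze_tagged("[ja]会談後"))
print(analyze_tagged("[en]Hello World"))
```

The same marker has to be supplied at query time. If a query arrives with a different marker than the one used when the document was indexed, the two sides are analyzed differently, which is the consistency risk listed under the cons.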
25. Language is Important
• Use language information at index and query time
• Increase recall, maintain precision
• Better search results for your users
26. My Contact Info
• Steve Kearns
• skearns@basistech.com
• http://www.basistech.com