The Typed Index

Christoph Goller
christoph.goller@intrafind.de

Chief Scientist at IntraFind Software AG

Outline
• IntraFind Software AG
• Analyzers, Inverted File Index
• Different Types of Terms
• Why do we need them in one field?
• The Typed Index
• Multilingual Search / Mixed Language Documents

A few words about me and about IntraFind

IntraFind Software AG
• Specialist for Information Retrieval and Enterprise Search
• Founded: October 2000
• More than 850 customers, mainly in Germany, Austria, and Switzerland
• Employees: 30
• Lucene Committers: B. Messer, C. Goller
• Independent Software Vendor, entirely self-financed
• Products are a combination of Open Source Components and in-house Development
• Support (up to 7x24), Services, Training
• Focus on Quality / Text Analytics / SOA Architecture
– Linguistic Analyzers for most European Languages
– Semantic Search
– Named Entity Recognition
– Text Classification
– Clustering

Analyzers and the Inverted File Index

Analysis / Tokenization
Break stream of characters into tokens / terms
• Normalization (e.g. case)
• Stop Words
• Stemming
• Lemmatizer / Decomposer
• Part of Speech Tagger
• Information Extraction
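
The first steps of this chain can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the stop list, the regular-expression tokenizer, and the function names are made up for this example and are not Lucene's analyzers. It shows how tokens keep their positions while being normalized and how an inverted file index maps terms to documents and positions.

```python
import re

STOP_WORDS = {"the", "a", "of", "and"}  # illustrative stop list (assumption)

def tokenize(text):
    """Break a character stream into (position, token) pairs."""
    return list(enumerate(re.findall(r"\w+", text)))

def analyze(text):
    """Lowercase tokens and drop stop words, keeping the original positions."""
    terms = []
    for pos, token in tokenize(text):
        term = token.lower()
        if term not in STOP_WORDS:
            terms.append((pos, term))
    return terms

def build_index(docs):
    """Inverted file index: term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in analyze(text):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

docs = {1: "The book of the king", 2: "A king and a server"}
index = build_index(docs)
```

Stop words are removed but still consume a position, so phrase and near queries can keep working on the remaining terms.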

Different Term Normalizations
Different Types of Terms

Morphological Analyzer vs. Stemming
• Lemmatizer: maps words to their base forms

English                           German
going -> go (Verb)                lief -> laufen (Verb)
bought -> buy (Verb)              rannte -> rennen (Verb)
bags -> bag (Noun)                Bücher -> Buch (Noun)
bacteria -> bacterium (Noun)      Taschen -> Tasche (Noun)

• Decomposer: splits compound words into their components
– Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)
– Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)

• Stemmer: usually a simple algorithm (a huge collection of stemmers is available in the Lucene contributions)
– going -> go
– decoder, decoding, decodes -> decod
– Overstemming: Messer -> mess; king -> k; several, server -> server
– Understemming: spoke is not mapped to speak
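
The contrast can be made concrete with a toy example. The suffix list and the small lemma dictionary below are assumptions chosen only to reproduce the slide's examples; they are neither a real stemmer from the Lucene contributions nor the IntraFind lemmatizer.

```python
# Toy suffix-stripping stemmer: rules are checked in order (illustrative only).
SUFFIXES = ("ing", "er", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix; no dictionary, no part of speech."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)]
    return w

# Toy lemma lexicon (hypothetical entries): full form -> (base form, part of speech).
LEMMAS = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "spoke": ("speak", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
}

def lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged with no tag."""
    return LEMMAS.get(word.lower(), (word, None))

for w in ("going", "decoder", "Messer", "king", "spoke"):
    print(w, "->", naive_stem(w), "|", lemmatize(w))
```

The stemmer conflates the name Messer with mess and reduces king to k (overstemming), and it leaves spoke untouched (understemming), while the lexicon lookup maps spoke to speak and leaves unknown words alone.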

Bad Precision with Algorithmic Stemmer

High Recall and High Precision with Morphological Analyzers

Word Decomposition and Search

Federal Ministry for Family Affairs

Why do we need other Normalizations?
• Stemmers / Lemmatizers are language-specific
• MultiTermQueries: WildcardQuery, FuzzyQuery
– no stemming, no lemmatization
– should work on original terms generated from Tokenizer
– only very simple normalizations such as: Citroën -> Citroen
– in Solr: <analyzer type="multiterm">
• Case-Sensitive
– Stemmers / Lemmatizers map everything to lowercase
– sometimes case matters: MAN vs. man
• Phonetic Search (Double Metaphone):
– Mazlum -> MSLM; Muslim -> MSLM
– book -> PK; books -> PKS
– Kaouther Tabai -> K0R TP; Kouther Tapei -> K0R TP
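
A minimal sketch of the "very simple normalization" used for wildcard-style queries, assuming Unicode decomposition is an acceptable way to fold diacritics. The function names are hypothetical, and the wildcard matching uses Python's `fnmatch` as a stand-in for a WildcardQuery; case is deliberately preserved so that MAN and man stay distinct.

```python
import fnmatch
import unicodedata

def fold_diacritics(term):
    """Strip combining marks but keep case (Citroën -> Citroen)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def wildcard(pattern, terms):
    """Case-sensitive wildcard match against diacritics-folded original terms."""
    return [t for t in terms if fnmatch.fnmatchcase(fold_diacritics(t), pattern)]

terms = ["Citroën", "Citrone", "citrus", "MAN", "man"]
print(wildcard("Citro*", terms))  # terms whose folded form starts with "Citro"
print(wildcard("MAN", terms))     # only the upper-case term
```

No stemming or lemmatization is applied here, which is why the pattern is compared with the original term rather than with a base form.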

Named Entity Recognition (NER)
Automated extraction of information from unstructured data
• People names
• Company names
• Brands from product lists
• Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)
• Names of streets and locations
• Currency and accounting values
• Dates
• Phone numbers, email addresses, hyperlinks

Why do we need these different types of terms in one field?

Why do we need them in one field?
• Query: “MAN sagt” (PhraseQuery / NearQuery!)
  Matching Document: “MAN sagte”, not “man sagte”
• Query: “book of Kouther Tapei” (PhraseQuery / NearQuery!)
  Matching Document: books of Kaouther Tabai
– For book to match books we need a stemmer or a lemmatizer
– For the names to match we need phonetics
• Query: Mazlum
– It leads to matches for the very frequent word Muslim
– Users want: Give me phonetic matches for Mazlum but not Muslim
– Mazlum=P AND NOT Muslim=E doesn’t do the job!
  • No match for “Mazlum is a member of the Muslim society in Munich”
– spanNot(spanOr([body:V_mazlum, body:F_MSLM]), body:V_muslim)
– New Syntax: <Mazlum=P BUTNOT Muslim=E>
• Query: Persons near synonyms of founding and Microsoft
  “E_Person found Microsoft” (PhraseQuery / NearQuery)
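
Why AND NOT fails while a position-level BUTNOT works can be sketched as follows. This is an assumption-laden illustration, not Lucene's SpanNotQuery: the phonetic codes are copied from the slide rather than computed with Double Metaphone, the function names are invented, and the exact term is written as V_Muslim because the sketch keeps V_ terms case-sensitive.

```python
# Phonetic codes as given on the slide (not computed here).
PHONETIC_CODES = {"mazlum": "MSLM", "muslim": "MSLM"}

def typed_terms(token):
    """Typed terms that share one position: exact form plus phonetic form."""
    terms = {"V_" + token}
    code = PHONETIC_CODES.get(token.lower())
    if code:
        terms.add("F_" + code)
    return terms

def analyze(text):
    """One set of typed terms per token position."""
    return [typed_terms(token) for token in text.split()]

def doc_contains(doc, term):
    """Document-level test, as used by a boolean AND / AND NOT query."""
    return any(term in terms for terms in doc)

def but_not(doc, include, exclude):
    """Positions matching any include term whose position lacks the exclude term."""
    wanted = set(include)
    return [pos for pos, terms in enumerate(doc)
            if terms & wanted and exclude not in terms]

doc = analyze("Mazlum is a member of the Muslim society in Munich")

# Document-level AND NOT rejects the document because Muslim occurs somewhere.
boolean_hit = doc_contains(doc, "F_MSLM") and not doc_contains(doc, "V_Muslim")
# Position-level BUTNOT keeps the position of Mazlum and drops that of Muslim.
span_hits = but_not(doc, ["V_Mazlum", "F_MSLM"], "V_Muslim")
```

The exclusion only works because the exact and the phonetic term sit at the same position in the same field.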

Semantic Search
Question: Wer hat Microsoft gegründet? (Who founded Microsoft?)

Semantic Search
Question: Wo liegen Werke von Audi? (Where are Audi's plants located?)

The Typed Index
Multilingual Search
Mixed Language Documents

The Typed Index
• We need different types of terms in one field
• Types are term properties: payloads are not a good option
• Use prefixes to distinguish them:
– V_ for fullforms (case sensitive)
– N_ for diacritics normalizations
– F_ for phonetic normal forms
– E_ for entities
  • E_Person, E_Location, E_Organization
  • E_PersonName_Brown, E_Location_Munich
– B_ for baseforms: B_Noun_book, B_Verb_fly, …
• Multilingual Search is handled in the same way
  B_EN_NOUN_book, B_DE_NOUN_buch
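
A small sketch of the prefix idea, assuming a toy baseform lexicon and a whitespace tokenizer (both hypothetical). Every token emits several typed terms at the same position of a single field, so an ordinary phrase match can mix types, for example an exact form followed by a baseform.

```python
import unicodedata

# Toy baseform lexicon (hypothetical); a real lemmatizer would supply these.
BASEFORMS = {"sagt": "Verb_sagen", "sagte": "Verb_sagen",
             "book": "Noun_book", "books": "Noun_book"}

def fold_diacritics(token):
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def typed_terms(token):
    """All typed terms for one token; they share a single position."""
    terms = {"V_" + token, "N_" + fold_diacritics(token)}
    base = BASEFORMS.get(token.lower())
    if base:
        terms.add("B_" + base)
    return terms

def build_index(docs):
    """One field only: typed term -> {doc_id: set of positions}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.split()):
            for term in typed_terms(token):
                index.setdefault(term, {}).setdefault(doc_id, set()).add(pos)
    return index

def phrase_docs(index, phrase):
    """Documents in which the typed terms occur at consecutive positions."""
    hits = []
    for doc_id, starts in index.get(phrase[0], {}).items():
        if any(all(start + i in index.get(term, {}).get(doc_id, ())
                   for i, term in enumerate(phrase))
               for start in starts):
            hits.append(doc_id)
    return sorted(hits)

docs = {1: "MAN sagte", 2: "man sagte", 3: "books of Brown"}
index = build_index(docs)
```

With this index, the phrase `["V_MAN", "B_Verb_sagen"]` selects only the document containing the upper-case MAN, which is the behavior asked for in the “MAN sagt” example.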

Multilingual Search: Standard Approach
Generate a language-specific copy of every content-field:
– configure language-specific analyzers for the language-specific fields
– Indexing: Adapt indexing chain to determine document language, generate new language-specific fields
– Search: Use MultiFieldQueryParser to expand query to every language-specific field
– Highlighting: depending on document-language call Highlighter for language-specific fields with the respective analyzer
– no solution for mixed-language documents
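
The standard approach can be sketched roughly as follows. Field names such as `content_en`, the toy lemma tables, and the function names are assumptions for illustration; the document language is passed in explicitly, standing in for a language detector, and the query expansion only mimics what a MultiFieldQueryParser would do.

```python
# Toy per-language lemma tables (hypothetical entries).
LEMMAS = {
    "en": {"books": "book"},
    "de": {"bücher": "buch"},
}

def analyze(text, lang):
    """Language-specific analysis: lowercase, then look up the base form."""
    return [LEMMAS[lang].get(tok.lower(), tok.lower()) for tok in text.split()]

def index_document(index, doc_id, text, lang):
    """Write terms into a language-specific copy of the content field."""
    field = "content_" + lang
    for term in analyze(text, lang):
        index.setdefault((field, term), set()).add(doc_id)

def expand_query(word):
    """Expand one query word to every language-specific field."""
    return [("content_" + lang, term)
            for lang in LEMMAS
            for term in analyze(word, lang)]

def search(index, word):
    hits = set()
    for key in expand_query(word):
        hits |= index.get(key, set())
    return hits

index = {}
index_document(index, 1, "two books", "en")
index_document(index, 2, "zwei Bücher", "de")
```

Each document goes into exactly one language-specific field here, which is why a document mixing English and German text has no good place in this layout.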

Multilingual Search and the Typed Index
Choose analyzer depending on language but do not use different fields:
– Analyzers generate terms typed with language: B_EN_NOUN_book, B_DE_NOUN_buch
– Indexing: choose analyzer in indexing chain based on language
– Search: Use a special MultiAnalyzerQueryParser to expand query to every language
– Highlighting: choose analyzer based on language and apply it to content-field
– Advantage: you could implement a multi-language analyzer for handling mixed-language documents, which switches language even within paragraphs.
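
A sketch of the single-field variant under simplifying assumptions: the lexicons are toy dictionaries, the language of each token is decided by whichever lexicon knows it, and `expand_query` is a hypothetical stand-in for the idea behind the MultiAnalyzerQueryParser, not its implementation.

```python
# Toy lexicons (hypothetical): full form -> part of speech and base form.
LEMMAS = {
    "en": {"book": "NOUN_book", "books": "NOUN_book"},
    "de": {"buch": "NOUN_buch", "bücher": "NOUN_buch"},
}

def typed_baseforms(token):
    """Language-typed baseforms for every lexicon that knows the token."""
    low = token.lower()
    return ["B_%s_%s" % (lang.upper(), LEMMAS[lang][low])
            for lang in LEMMAS if low in LEMMAS[lang]]

def analyze(text):
    """Per position: the exact form plus any language-typed baseforms."""
    return [["V_" + tok] + typed_baseforms(tok) for tok in text.split()]

def build_index(docs):
    """Single content field: typed term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, terms in enumerate(analyze(text)):
            for term in terms:
                index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def expand_query(word):
    """Expand a query word to its typed terms in every language."""
    return typed_baseforms(word) or ["V_" + word]

def search(index, word):
    hits = set()
    for term in expand_query(word):
        hits.update(index.get(term, {}))
    return hits

docs = {1: "Bücher and books", 2: "old book"}
index = build_index(docs)
```

Document 1 mixes German and English in one sentence, yet both `search(index, "Buch")` and `search(index, "books")` find it, because the language is part of the term type rather than of the field.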

Summary: Advantages of the Typed Index over a Multi-Field Index
• Keep positions aligned in an easier way
• Only tokenize once: Performance!
• Reuse existing Queries like PhraseQueries, MultiPhraseQueries
• Treatment for Mixed-Language Documents: Use Lemmatizer Results to switch between languages

Thanks for listening
Questions?
By the way: Our Analyzers are available as Plugins for Lucene / Solr / ElasticSearch

Dr. Christoph Goller
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de
IntraFind Software AG
Landsberger Straße 368
80687 München
Germany
