THE TYPED INDEX
Christoph Goller
christoph.goller@intrafind.de

Chief Scientist at IntraFind Software AG
Outline
• IntraFind Software AG
• Analyzers, Inverted File Index
• Different Types of Terms
• Why do we need them in one field?
• The Typed Index
• Multilingual Search / Mixed Language Documents
A few words about me and about IntraFind
IntraFind Software AG
• Specialist for Information Retrieval and Enterprise Search
• Founding of the company: October 2000
• More than 850 customers mainly in Germany, Austria, and Switzerland
• Employees: 30
• Lucene Committers: B. Messer, C. Goller
• Independent Software Vendor, entirely self-financed
• Products are a combination of Open Source Components and in-house Development
• Support (up to 7x24), Services, Training
• Focus on Quality / Text Analytics / SOA Architecture
– Linguistic Analyzers for most European Languages
– Semantic Search
– Named Entity Recognition
– Text Classification
– Clustering
Selected Customers
Analyzers and the Inverted File Index
Analysis / Tokenization
Break stream of characters into tokens / terms
• Normalization (e.g. case)
• Stop Words
• Stemming
• Lemmatizer / Decomposer
• Part of Speech Tagger
• Information Extraction
Inverted File Index
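The analysis chain and the inverted file index can be illustrated with a small sketch. The tokenizer regex, the stop-word list, and all function names below are assumptions made for illustration; they do not reproduce Lucene's analyzers or its index format.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list


def analyze(text):
    """Yield (position, term) pairs: tokenize, lowercase, drop stop words."""
    for position, token in enumerate(re.findall(r"\w+", text)):
        term = token.lower()
        if term not in STOP_WORDS:
            # Positions of removed stop words are kept as gaps.
            yield position, term


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, i.e. a positional postings list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {1: "The book of the king", 2: "A king bought books"}
index = build_inverted_index(docs)
print(index["king"])  # {1: [4], 2: [1]}
```

Storing positions alongside document ids is what later makes phrase and near queries possible.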
Different Term Normalizations
Different Types of Terms
Morphological Analyzer vs. Stemming
• Lemmatizer: maps words to their base forms
English:
going -> go (Verb)
bought -> buy (Verb)
bags -> bag (Noun)
bacteria -> bacterium (Noun)
German:
lief -> laufen (Verb)
rannte -> rennen (Verb)
Bücher -> Buch (Noun)
Taschen -> Tasche (Noun)

• Decomposer: splits compound words into their components
Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)
Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)

Stemmer: usually a simple algorithm (a large collection of stemmers is available in the Lucene contributions)
going -> go
decoder, decoding, decodes -> decod
Overstemming: Messer -> mess? king -> k? several, server -> server?
Understemming: spoke is not reduced to speak
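The contrast between the two approaches can be sketched as follows. The suffix list is a deliberately naive, hypothetical rule set (it is not the Porter stemmer or any Lucene stemmer), and the tiny lexicon stands in for a real morphological dictionary.

```python
SUFFIXES = ("ing", "es", "er", "s")  # hypothetical rules, checked in this order

# Hypothetical lexicon: full form -> (base form, part of speech)
LEXICON = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "spoke": ("speak", "Verb"),
    "bags": ("bag", "Noun"),
}


def naive_stem(word):
    """Strip the first matching suffix without consulting any dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lexicon; unknown words pass through unchanged."""
    return LEXICON.get(word.lower(), (word, None))


for word in ["going", "decoder", "Messer", "king", "spoke"]:
    print(word, "->", naive_stem(word), "|", lemmatize(word))
```

The rule-based stemmer conflates unrelated words (Messer, king) and misses irregular forms (spoke), while the dictionary lookup only changes words it actually knows.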
Bad Precision with Algorithmic Stemmer
High Recall and High Precision with
Morphological Analyzers
Word Decomposition and Search

Federal Ministry for Family Affairs
Why do we need other Normalizations?
• Stemmers / Lemmatizers are language-specific
• MultiTermQueries: WildcardQuery, FuzzyQuery
– no stemming, no lemmatization
– should work on original terms generated from Tokenizer
– only very simple normalizations such as: Citroën -> Citroen
– in Solr: <analyzer type="multiterm">
• Case-Sensitive
– Stemmers / Lemmatizers map everything to lowercase
– sometimes case matters: MAN vs. man
• Phonetic Search (Double Metaphone):
– Mazlum -> MSLM; Muslim -> MSLM
– book -> PK; books -> PKS
– Kaother Tabai -> K0R TP, Kouther Tapei -> K0R TP
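A minimal sketch of the "very simple normalization" mentioned for multi-term queries: diacritics are folded, but case and word form are left untouched. The helper name `fold_diacritics` and the use of shell-style wildcards via `fnmatch` are assumptions for illustration, not how Lucene or Solr implement wildcard matching.

```python
import fnmatch
import unicodedata


def fold_diacritics(token):
    """Remove combining marks after canonical decomposition (Citroën -> Citroen)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Original tokenizer output: no stemming, no lowercasing.
terms = ["Citroën", "MAN", "man"]
multiterm_terms = [fold_diacritics(t) for t in terms]

# A case-sensitive wildcard pattern is matched against the folded terms.
matches = [t for t in multiterm_terms if fnmatch.fnmatchcase(t, "Citro*n")]

print(multiterm_terms)  # ['Citroen', 'MAN', 'man']
print(matches)          # ['Citroen']
```

Because case is preserved, MAN and man remain distinct terms and can be queried separately.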
Named Entity Recognition (NER)
Automated extraction of information from unstructured data
• People names
• Company names
• Brands from product lists
• Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)
• Names of streets and locations
• Currency and accounting values
• Dates
• Phone numbers, email addresses, hyperlinks
Why do we need these different types of terms in one field?
Why do we need them in one field?
• Query: “MAN sagt” (German: “MAN says”) PhraseQuery / NearQuery
Matching Document: “MAN sagte” (“MAN said”), not “man sagte” (“one said”)
• Query: “book of Kouther Tapei” PhraseQuery / NearQuery
Matching Document: books of Kaouther Tabai
– For book to match books we need a stemmer or a lemmatizer
– For the names to match we need phonetics
• Query: Mazlum
– A phonetic search also matches the very frequent word Muslim
– Users want: give me phonetic matches for Mazlum but not Muslim
– Mazlum=P AND NOT Muslim=E doesn’t do the job:
• No match for “Mazlum is a member of the Muslim society in Munich”
– spanNot(spanOr([body:V_mazzlim, body:F_MSLM]), body:V_muslim)
– New Syntax: <Mazlum=P BUTNOT Muslim=E>
• Query: Persons near synonyms of founding and Microsoft
“E_Person found Microsoft” PhraseQuery / NearQuery
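The Mazlum example can be sketched with a toy positional index in which every token position carries both a fullform term and a phonetic term. The phonetic codes are the ones quoted on the slides and are hard-coded in a hypothetical lookup table rather than computed by Double Metaphone; the function names are illustrative and only approximate what a span-based exclusion does.

```python
from collections import defaultdict

PHONETIC = {"mazlum": "MSLM", "muslim": "MSLM"}  # codes as quoted on the slides


def typed_terms(token):
    """Return the typed terms stored at one position: fullform plus phonetic form."""
    terms = ["V_" + token]
    code = PHONETIC.get(token.lower())
    if code:
        terms.append("F_" + code)
    return terms


def build_index(docs):
    """term -> doc_id -> set of positions; all typed terms share one field."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.split()):
            for term in typed_terms(token):
                index[term][doc_id].add(pos)
    return index


def doc_level_and_not(index, include, exclude):
    """Boolean AND NOT: drops every document containing the excluded term."""
    return set(index.get(include, {})) - set(index.get(exclude, {}))


def positions_but_not(index, include, exclude):
    """Position-level exclusion: drop only positions where the excluded term sits."""
    result = {}
    for doc_id, positions in index.get(include, {}).items():
        kept = positions - index.get(exclude, {}).get(doc_id, set())
        if kept:
            result[doc_id] = sorted(kept)
    return result


docs = {
    1: "Mazlum is a member of the Muslim society in Munich",
    2: "the Muslim society in Munich",
}
index = build_index(docs)
print(doc_level_and_not(index, "F_MSLM", "V_Muslim"))  # set()
print(positions_but_not(index, "F_MSLM", "V_Muslim"))  # {1: [0]}
```

The position-level variant only works because both term types live in the same field with aligned positions, which is the motivation for the typed index.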
Semantic Search
Question: Wer hat Microsoft gegründet? (Who founded Microsoft?)
Semantic Search
Question: Wo liegen Werke von Audi? (Where are Audi's plants located?)
The Typed Index
Multilingual Search
Mixed Language Documents
The Typed Index
• We need different types of terms in one field
• Types are term properties: payloads are not a good option
• Use prefixes to distinguish them:
– V_ for fullforms (case sensitive)
– N_ for diacritics normalizations
– F_ for phonetic normal forms
– E_ for entities
• E_Person, E_Location, E_Organization
• E_PersonName_Brown, E_Location_Munich
– B_ for baseforms: B_Noun_book, B_Verb_fly, …
• Multilingual Search is handled in the same way
B_EN_NOUN_book, B_DE_NOUN_buch
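As a rough sketch of the prefix scheme, the function below emits all typed terms for one token, so that they can be indexed at the same position in a single field. The lexicons are hypothetical stand-ins for a real lemmatizer, phonetic encoder, and entity recognizer, and the exact content of the N_ term (diacritics folded, case kept) is an assumption.

```python
import unicodedata

# Hypothetical lookup tables standing in for real analysis components.
LEMMAS = {"books": ("Noun", "book"), "flies": ("Verb", "fly")}
PHONETIC_CODES = {"book": "PK", "books": "PKS"}  # codes as quoted on the slides
ENTITIES = {"Munich": "Location"}


def fold_diacritics(token):
    """Strip combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def typed_terms(token):
    """Return every typed term for a token; all share one position in one field."""
    terms = ["V_" + token, "N_" + fold_diacritics(token)]
    code = PHONETIC_CODES.get(token.lower())
    if code:
        terms.append("F_" + code)
    if token in ENTITIES:
        kind = ENTITIES[token]
        terms += ["E_" + kind, "E_" + kind + "_" + token]
    if token.lower() in LEMMAS:
        pos_tag, lemma = LEMMAS[token.lower()]
        terms.append("B_" + pos_tag + "_" + lemma)
    return terms


def analyze(text):
    """Tokenize once and attach the full stack of typed terms to each position."""
    return [(pos, typed_terms(tok)) for pos, tok in enumerate(text.split())]


for position, terms in analyze("books about Munich"):
    print(position, terms)
```

A query can then pick the term type it needs per clause (for example a B_ term next to an E_ term inside one phrase query) without switching fields.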
Multilingual Search: Standard Approach
Generate a language-specific copy of every content-field:
– configure language-specific analyzers for the language-specific fields
– Indexing: Adapt indexing chain to determine document language, generate new language-specific fields
– Search: Use MultiFieldQueryParser to expand query to every language-specific field
– Highlighting: depending on document-language call Highlighter for language-specific fields with the respective analyzer
– no solution for mixed-language documents
Multilingual Search and the Typed Index
Choose analyzer depending on language but do not use different fields:
– Analyzers generate terms typed with language: B_EN_NOUN_book, B_DE_NOUN_buch
– Indexing: choose analyzer in indexing chain based on language
– Search: Use a special MultiAnalyzerQueryParser to expand query to every language
– Highlighting: choose analyzer based on language and apply it to content-field
– Advantage: you could implement a multi-language analyzer for handling mixed-language documents, which switches language even within paragraphs.
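The single-field multilingual approach might look roughly like the sketch below. The per-language lexicons and the function names are hypothetical; in particular, `expand_query` only mimics the idea of expanding a query over every language and is not the MultiAnalyzerQueryParser itself.

```python
# Hypothetical per-language lemma lexicons: full form -> (part of speech, base form)
LEXICONS = {
    "EN": {"books": ("NOUN", "book"), "book": ("NOUN", "book")},
    "DE": {"bücher": ("NOUN", "buch"), "buch": ("NOUN", "buch")},
}


def analyze(token, language):
    """Return the language-typed baseform terms for one token (empty if unknown)."""
    entry = LEXICONS[language].get(token.lower())
    if entry is None:
        return []
    pos_tag, lemma = entry
    return ["B_" + language + "_" + pos_tag + "_" + lemma]


def expand_query(token):
    """Run the query token through every language; the results would be OR-ed."""
    terms = []
    for language in sorted(LEXICONS):
        terms.extend(analyze(token, language))
    return terms


def analyze_mixed(tokens):
    """Mixed-language text: per token, use the first language whose lexicon knows it."""
    output = []
    for position, token in enumerate(tokens):
        for language in sorted(LEXICONS):
            terms = analyze(token, language)
            if terms:
                output.append((position, terms[0]))
                break
    return output


print(expand_query("books"))               # ['B_EN_NOUN_book']
print(expand_query("Buch"))                # ['B_DE_NOUN_buch']
print(analyze_mixed(["books", "Bücher"]))  # [(0, 'B_EN_NOUN_book'), (1, 'B_DE_NOUN_buch')]
```

All terms land in the same content field, so positions stay aligned across languages and a document can switch language from one token to the next.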
Summary: Advantages of the Typed Index over a Multi-Field Index
• Positions stay aligned more easily
• Tokenize only once: Performance!
• Reuse existing Queries like PhraseQueries, MultiPhraseQueries
• Treatment for Mixed-Language Documents: use lemmatizer results to switch between languages
Thanks for listening
Questions?
By the way: Our Analyzers are available as Plugins for Lucene / Solr / ElasticSearch
Dr. Christoph Goller
Phone: +49 89 3090446-0
Fax: +49 89 3090446-29
Email: christoph.goller@intrafind.de
Web: www.intrafind.de
IntraFind Software AG
Landsberger Straße 368
80687 München
Germany

Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

The Typed Index

  • 2. THE TYPED INDEX
      Christoph Goller (christoph.goller@intrafind.de), Chief Scientist at IntraFind Software AG
  • 3. Outline
      • IntraFind Software AG
      • Analyzers, Inverted File Index
      • Different Types of Terms
      • Why do we need them in one field?
      • The Typed Index
      • Multilingual Search / Mixed Language Documents
  • 4. A few words about me and about IntraFind
  • 5. IntraFind Software AG
      • Specialist for Information Retrieval and Enterprise Search
      • Founding of the company: October 2000
      • More than 850 customers mainly in Germany, Austria, and Switzerland
      • Employees: 30
      • Lucene Committers: B. Messer, C. Goller

      • Independent Software Vendor, entirely self-financed
      • Products are a combination of Open Source Components and in-house Development
      • Support (up to 7x24), Services, Training
      • Focus on Quality / Text Analytics / SOA Architecture
          – Linguistic Analyzers for most European Languages
          – Semantic Search
          – Named Entity Recognition
          – Text Classification
          – Clustering
  • 7. Analyzers and the Inverted File Index
  • 8. Analysis / Tokenization
      Break stream of characters into tokens / terms
      • Normalization (e.g. case)
      • Stop Words
      • Stemming
      • Lemmatizer / Decomposer
      • Part of Speech Tagger
      • Information Extraction
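
The analysis steps listed on slide 8 feed the inverted file index. The following is a minimal sketch of that pipeline, not Lucene's implementation: the stop-word list, the regular expression, and the function names `analyze` and `build_index` are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a real analyzer's list).
STOP_WORDS = {"the", "a", "of", "and"}


def analyze(text):
    """Yield (position, term) pairs: tokenize, lowercase, drop stop words.

    Positions are counted before stop-word removal, so a removed stop word
    leaves a gap instead of shifting the following terms.
    """
    for position, token in enumerate(re.findall(r"\w+", text)):
        term = token.lower()
        if term not in STOP_WORDS:
            yield position, term


def build_index(docs):
    """Build term -> {doc_id -> [positions]} from a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in analyze(text):
            index[term][doc_id].append(position)
    return index


index = build_index({1: "The book of Munich", 2: "Books and the book"})
print(dict(index["book"]))
```

Note that "book" and "Books" end up as different terms here because the sketch only lowercases; conflating them is the job of a stemmer or lemmatizer.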
  • 11. Morphological Analyzer vs. Stemming
      • Lemmatizer: maps words to their base forms
            English                            German
            going    -> go (Verb)              lief    -> laufen (Verb)
            bought   -> buy (Verb)             rannte  -> rennen (Verb)
            bags     -> bag (Noun)             Bücher  -> Buch (Noun)
            bacteria -> bacterium (Noun)       Taschen -> Tasche (Noun)
      • Decomposer: decomposes compound words into their components
            Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)
            Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)
      • Stemmer: usually a simple algorithm (a large collection of stemmers is available in the Lucene contributions)
            going -> go
            decoder, decoding, decodes -> decod
          Overstemming (unrelated words are conflated or words are cut too far):
            Messer -> mess (?)
            king -> k (?)
            several, server -> server (?)
          Understemming (related forms are not conflated):
            spoke -> speak
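
The contrast on slide 11 can be reproduced with a toy example. This is a sketch under stated assumptions: `naive_stem` is a deliberately crude suffix stripper written for this illustration (it is not the Porter algorithm or any stemmer shipped with Lucene), and `LEMMAS` is a tiny hand-made dictionary standing in for a real morphological lexicon.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "es", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, with no linguistic knowledge at all."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Hypothetical mini-lexicon: surface form -> (base form, part of speech).
LEMMAS = {
    "going": ("go", "Verb"),
    "spoke": ("speak", "Verb"),
    "bacteria": ("bacterium", "Noun"),
    "messer": ("Messer", "Noun"),
    "king": ("king", "Noun"),
}


def lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEMMAS.get(word.lower(), (word, None))


for word in ("decoder", "decoding", "decodes", "Messer", "king", "spoke"):
    print(word, "->", naive_stem(word), "|", lemmatize(word))
```

The suffix stripper conflates decoder, decoding, and decodes as intended, but it also cuts Messer to "mess" and king to "k" (overstemming) and leaves spoke untouched (understemming). The lexicon lookup avoids both problems at the cost of needing a dictionary per language.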
  • 12. Bad Precision with Algorithmic Stemmer
  • 13. High Recall and High Precision with Morphological Analyzers
  • 14. High Recall and High Precision with Morphological Analyzers
  • 15. Word Decomposition and Search (example: Federal Ministry for Family Affairs)
  • 16. Why do we need other Normalizations?
      • Stemmers / Lemmatizers are language-specific
      • MultiTermQueries: WildcardQuery, FuzzyQuery
          – no stemming, no lemmatization
          – should work on original terms generated from Tokenizer
          – only very simple normalizations such as: Citroën -> Citroen
          – in Solr: <analyzer type="multiterm">
      • Case-Sensitive
          – Stemmers / Lemmatizers map everything to lowercase
          – sometimes case matters: MAN vs. man
      • Phonetic Search (Double Metaphone):
          – Mazlum -> MSLM; Muslim -> MSLM
          – book -> PK; books -> PKS
          – Kaother Tabai -> K0R TP, Kouther Tapei -> K0R TP
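
The "very simple normalization" Citroën -> Citroen from slide 16 can be sketched with Unicode decomposition. This is an illustrative approach using only the Python standard library; the function name `fold_diacritics` is an assumption, and the sketch is not the filter used in the talk. Characters without a canonical decomposition (such as ß) pass through unchanged.

```python
import unicodedata


def fold_diacritics(term):
    """Remove combining marks after canonical decomposition; keep case as is."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("Citroën"))
print(fold_diacritics("MAN"), fold_diacritics("man"))
```

Because case is preserved, MAN and man remain distinct terms, which is what the case-sensitive full-form type on the later slides relies on.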
  • 17. Named Entity Recognition (NER)
      Automated extraction of information from unstructured data
      • People names
      • Company names
      • Brands from product lists
      • Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)
      • Names of streets and locations
      • Currency and accounting values
      • Dates
      • Phone numbers, email addresses, hyperlinks
  • 18. Why do we need these different types of terms in one field?
  • 19. Why do we need them in one field?
      • Query: "MAN sagt" (PhraseQuery / NearQuery!)
          Matching document: "MAN sagte", not "man sagte"
      • Query: "book of Kouther Tapei" (PhraseQuery / NearQuery!)
          Matching document: books of Kaouther Tabai
          – For book to match books we need a stemmer or a lemmatizer
          – For the names to match we need phonetics
      • Query: Mazlum
          – It leads to matches for the very frequent word Muslim
          – Users want: give me phonetic matches for Mazlum but not Muslim
          – Mazlum=P AND NOT Muslim=E doesn't do the job!
              • No match for "Mazlum is a member of the Muslim society in Munich"
          – spanNot(spanOr([body:V_mazzlim, body:F_MSLM]), body:V_muslim)
          – New syntax: <Mazlum=P BUTNOT Muslim=E>
      • Query: Persons near synonyms of founding and Microsoft
          "E_Person found Microsoft" (PhraseQuery / NearQuery)
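
The difference between AND NOT and BUTNOT on slide 19 comes down to documents versus positions. The sketch below assumes a single field in which every position carries a full-form term (prefix `V_`) and, where available, a phonetic term (prefix `F_`). The phonetic codes are copied from the slide rather than computed, and the helper names are hypothetical. A real implementation would use span queries; for single-term spans, the set difference on positions shown here is equivalent.

```python
# Phonetic codes as given on the slide (not computed here).
PHONETIC = {"Mazlum": "MSLM", "Muslim": "MSLM"}


def typed_terms(token):
    """Return all typed terms that share the position of this token."""
    terms = {"V_" + token}
    if token in PHONETIC:
        terms.add("F_" + PHONETIC[token])
    return terms


def analyze(text):
    """One set of typed terms per token position."""
    return [typed_terms(token) for token in text.split()]


def positions(doc, term):
    return {i for i, terms in enumerate(doc) if term in terms}


def boolean_and_not(doc, include, exclude):
    """Document-level semantics: any occurrence of `exclude` rejects the document."""
    return bool(positions(doc, include)) and not positions(doc, exclude)


def but_not(doc, include, exclude):
    """Position-level semantics: drop only the positions where `exclude` occurs."""
    return positions(doc, include) - positions(doc, exclude)


doc = analyze("Mazlum is a member of the Muslim society in Munich")
print(boolean_and_not(doc, "F_MSLM", "V_Muslim"))
print(but_not(doc, "F_MSLM", "V_Muslim"))
```

The boolean variant rejects the document because the word Muslim occurs somewhere in it, while the position-based variant still reports the phonetic hit on Mazlum at position 0. This only works because both term types live in the same field with aligned positions.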
  • 21. Semantic Search
      Question: Wo liegen Werke von Audi? (Where are Audi's plants located?)
  • 22. The Typed Index: Multilingual Search / Mixed Language Documents
  • 23. The Typed Index
      • We need different types of terms in one field
      • Types are term properties: payloads are not a good option
      • Use prefixes to distinguish them:
          – V_ for fullforms (case sensitive)
          – N_ for diacritics normalizations
          – F_ for phonetic normal forms
          – E_ for entities
              • E_Person, E_Location, E_Organization
              • E_PersonName_Brown, E_Location_Munich
          – B_ for baseforms: B_Noun_book, B_Verb_fly, …
      • Multilingual Search is handled in the same way: B_EN_NOUN_book, B_DE_NOUN_buch
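
The prefix scheme on slide 23 can be illustrated with a small analyzer that emits several typed terms per position, and a phrase matcher that mixes types within one query. This is a sketch, not IntraFind's analyzer: the `BASEFORMS` lexicon, the function names, and the matching logic are assumptions made for this example, and only the `V_` and `B_` types are shown.

```python
# Hypothetical mini-lexicon: lowercased surface form -> (part of speech, base form).
BASEFORMS = {
    "sagt": ("Verb", "sagen"),
    "sagte": ("Verb", "sagen"),
    "book": ("Noun", "book"),
    "books": ("Noun", "book"),
}


def typed_terms(token):
    """Emit a case-sensitive full form and, if known, a typed base form."""
    terms = {"V_" + token}
    entry = BASEFORMS.get(token.lower())
    if entry is not None:
        part_of_speech, lemma = entry
        terms.add(f"B_{part_of_speech}_{lemma}")
    return terms


def analyze(text):
    """All typed terms of a token share one position in the same field."""
    return [typed_terms(token) for token in text.split()]


def phrase_match(doc, phrase):
    """True if the typed terms in `phrase` occur at consecutive positions."""
    n = len(phrase)
    return any(
        all(phrase[k] in doc[start + k] for k in range(n))
        for start in range(len(doc) - n + 1)
    )


# "MAN sagt": exact, case-sensitive match on MAN, base-form match on the verb.
query = ["V_MAN", "B_Verb_sagen"]
print(phrase_match(analyze("MAN sagte"), query))
print(phrase_match(analyze("man sagte"), query))
```

The query matches "MAN sagte" but not "man sagte", which is the behavior asked for on slide 19. With separate fields per term type, the two query terms would live in different fields and an ordinary phrase query could not combine them.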
  • 24. Multilingual Search: Standard Approach
      Generate a language-specific copy of every content-field:
          – configure language-specific analyzers for the language-specific fields
          – Indexing: adapt indexing chain to determine document language, generate new language-specific fields
          – Search: use MultiFieldQueryParser to expand query to every language-specific field
          – Highlighting: depending on document language, call Highlighter for language-specific fields with the respective analyzer
          – no solution for mixed-language documents
  • 25. Multilingual Search and the Typed Index
      Choose analyzer depending on language but do not use different fields:
          – Analyzers generate terms typed with language: B_EN_NOUN_book, B_DE_NOUN_buch
          – Indexing: choose analyzer in indexing chain based on language
          – Search: use a special MultiAnalyzerQueryParser to expand query to every language
          – Highlighting: choose analyzer based on language and apply it to content-field
          – Advantage: you could implement a multi-language analyzer for handling mixed-language documents, which switches language even within paragraphs.
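
A rough sketch of language-typed terms and query expansion follows. It does not reproduce the MultiAnalyzerQueryParser mentioned on slide 25, whose interface is not shown in the slides; the per-language lexicons and the functions `language_typed_baseforms` and `analyze_mixed` are hypothetical. The mixed-language analyzer here simply asks every language's lexicon about every token, which is one simple way to let lemmatizer results decide the language.

```python
# Hypothetical per-language lexicons: lowercased form -> (part of speech, base form).
LEMMATIZERS = {
    "EN": {"book": ("NOUN", "book"), "books": ("NOUN", "book")},
    "DE": {"buch": ("NOUN", "buch"), "bücher": ("NOUN", "buch")},
}


def language_typed_baseforms(token):
    """Return B_<LANG>_<POS>_<lemma> for every language that knows the token."""
    terms = set()
    for language, lexicon in LEMMATIZERS.items():
        entry = lexicon.get(token.lower())
        if entry is not None:
            part_of_speech, lemma = entry
            terms.add(f"B_{language}_{part_of_speech}_{lemma}")
    return terms


def analyze_mixed(text):
    """Single content field: full form plus language-typed base forms per position."""
    return [{"V_" + token} | language_typed_baseforms(token) for token in text.split()]


# Index time: a mixed-language sentence goes into one field.
print(analyze_mixed("books und Bücher"))
# Query time: the same expansion is applied to the query term.
print(language_typed_baseforms("Bücher"))
```

Since the language is part of the term itself, no language-specific copy of the field is needed, and the positions of English and German terms stay aligned within the same document.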
  • 26. Summary: Advantages of the Typed Index over a Multi-Field Index
      • Keep positions aligned in an easier way
      • Only tokenize once: Performance!
      • Reuse existing Queries like PhraseQueries, MultiPhraseQueries
      • Treatment for Mixed-Language Documents: use Lemmatizer results to switch between languages
  • 27. Thanks for listening. Questions?
      By the way: our Analyzers are available as Plugins for Lucene / Solr / ElasticSearch
      Dr. Christoph Goller
      Phone: +49 89 3090446-0
      Fax: +49 89 3090446-29
      Email: christoph.goller@intrafind.de
      Web: www.intrafind.de
      IntraFind Software AG, Landsberger Straße 368, 80687 München, Germany