Language Search
ElasticSearch Boston Meetup - 3/27
       Bryan Warner - Traackr
About me
● Bryan Warner - Developer @Traackr
  ○ bwarner@traackr.com

● I've worked with ElasticSearch since early 2012 ...
  before that I had worked with Lucene & Solr

● Primary background is in Java back-end development

● Shifting focus to Scala development over the past year
About Traackr
● Influencer search engine

● We track content daily & in real-time for our database of
  influential people

● We leverage ElasticSearch parent/child (top-children)
  queries to search content (i.e. the children) to surface
  the influencers who've authored it (i.e. the parents)

● Some of our back-end stack includes: ElasticSearch,
  MongoDb, Java/Spring, Scala/Akka, etc.
Overview
● Indexing / Querying strategies to support language-
  targeted searches within ES

● ES Analyzers / TokenFilters for language analysis

● Custom Analyzers / TokenFilters for ES

● Look at some open-source (OS) projects that assist in language
  detection & analysis
Use Case
● We have a database of articles written in many
  languages

● We want our users to be able to search articles written
  in a particular language

● We want that search to handle the nuances for that
  particular language
Reference Schema
{
    "settings" : {
      "index": {
        "number_of_shards" : 6, "number_of_replicas" : 1
      },
      "analysis":{
        "analyzer": {}, "tokenizer": {}, "filter":{}
      }
    },
    "mappings": {
      "article": {
        "properties": {
          "text" : {"type" : "string", "analyzer":"standard", "store":true},
          "author": {"type" : "string", "analyzer":"simple", "store": true},
          "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
        }
      }
    }
}
Indexing Strategies



      Separate indices per language
                  - OR -
       Same index for all languages
Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
  ○ IDF = log(numDocs/(docFreq+1)) + 1

CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
   ○ Same problem for Solr Joins
■ Maintain schema per index
Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine

CONS
■ Schema complexity grows
■ IDF values might be skewed
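To make the IDF point concrete, here is a small sketch that applies the IDF formula quoted on the separate-indices slide to the same term in two settings. The document counts are invented purely for illustration; they are not measurements from any real index.

```java
// Sketch: how IDF for the same term shifts between a per-language index
// and a shared index. All counts below are made-up illustration values.
class IdfSkewDemo {

    // IDF = log(numDocs / (docFreq + 1)) + 1, as quoted on the slide.
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        long germanDocs = 10_000;  // hypothetical size of a German-only index
        long allDocs = 100_000;    // hypothetical size of a shared, multi-language index
        long docFreq = 2_000;      // hypothetical number of German docs containing the term

        System.out.printf("German-only index IDF: %.3f%n", idf(germanDocs, docFreq));
        System.out.printf("Shared index IDF:      %.3f%n", idf(allDocs, docFreq));
    }
}
```

Because numDocs grows while docFreq stays the same, a term that is common within German looks rarer in the shared index and receives a higher IDF there.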
Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
   a. At indexing time, we set the right mapping based on
      the article's language

2. Create different fields per language-analyzed field
   a. At indexing time, we populate the correct text field
      based on the article's language
"mappings": {
  "article_en": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"english", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  },
  "article_fr": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"french", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  },
  "article_de": {
    "properties": {
      "text" : {"type" : "string", "analyzer":"german", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  }
}
"mappings": {
  "article": {
    "properties": {
      "text_en" : {"type" : "string", "analyzer":"english", "store":true},
      "text_fr" : {"type" : "string", "analyzer":"french", "store":true},
      "text_de" : {"type" : "string", "analyzer":"german", "store":true},
      "author": {"type" : "string", "analyzer":"simple", "store": true},
      "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
    }
  }
}
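As a rough sketch of the indexing-time decision, the helper below builds a document for either layout: it picks the per-language type name (option 1) or populates only the matching text_<lang> field (option 2). The class name, method names, and the supported-language set are assumptions made for this illustration; only the "article_<lang>" and "text_<lang>" naming comes from the mappings shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of ElasticSearch.
class ArticleDocBuilder {

    // Languages that have a dedicated mapping type or text field (assumed set).
    static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList("en", "fr", "de"));

    // Option 1: choose the mapping type from the article's language.
    static String typeFor(String lang) {
        requireSupported(lang);
        return "article_" + lang;
    }

    // Option 2: one "article" type, but only text_<lang> is populated.
    static Map<String, Object> build(String lang, String text, String author, String date) {
        requireSupported(lang);
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("text_" + lang, text);
        doc.put("author", author);
        doc.put("date", date);
        return doc;
    }

    private static void requireSupported(String lang) {
        if (!SUPPORTED.contains(lang)) {
            throw new IllegalArgumentException("No mapping for language: " + lang);
        }
    }
}
```

The resulting map could then be handed to the client as the document source; how unsupported languages should be handled (reject, or fall back to a default field) is a design choice this sketch leaves open.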
Querying Strategies
How do we execute a language-targeted search?

... all based on our indexing strategy.
Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
       .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french" / "german", matching the target language

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2a) Same index for all languages - different mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
       .setTypes(targetMapping);

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french" / "german", matching the target language

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2b) Same index for all languages - different fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
     .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_en");    // or "text_fr" / "text_de", matching the target language
query.analyzer("english"); // or "french" / "german", matching the field chosen above

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
● Will these strategies support a multi-language search?
  ○ E.g. Search by french and german
  ○ E.g. Search against all languages

● Yes! *

● In the same SearchRequest:
   ○ We can search against multiple indices
   ○ We can search against multiple "mapping" types
   ○ We can search against multiple fields

* Need to give thought to which query analyzer to use
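The query slides call getIndexForLanguage and getMappingForLanguage without showing them. One possible shape for those helpers, plus a multi-language variant for fields, is sketched below. The "articles_<lang>" index naming is an assumption; the "article_<lang>" and "text_<lang>" names follow the mappings shown earlier, and getFieldForLanguage / getFieldsForLanguages are hypothetical additions.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helpers resolving a language code to a search target.
class LanguageTargets {

    // Strategy (1): one index per language (index naming is assumed).
    static String getIndexForLanguage(String lang) {
        return "articles_" + normalize(lang);
    }

    // Strategy (2a): one mapping type per language.
    static String getMappingForLanguage(String lang) {
        return "article_" + normalize(lang);
    }

    // Strategy (2b): one text field per language.
    static String getFieldForLanguage(String lang) {
        return "text_" + normalize(lang);
    }

    // Multi-language search under (2b): resolve several languages at once.
    static String[] getFieldsForLanguages(List<String> langs) {
        String[] fields = new String[langs.size()];
        for (int i = 0; i < langs.size(); i++) {
            fields[i] = getFieldForLanguage(langs.get(i));
        }
        return fields;
    }

    private static String normalize(String lang) {
        return lang.trim().toLowerCase(Locale.ROOT);
    }
}
```

For a French-and-German search, each resolved field would be added to the query in turn. Which analyzer to apply to the query text across several languages remains the open question flagged in the footnote.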
Language Analysis
● What do ElasticSearch and/or Lucene offer us for
  analyzing various languages?

● Is there a one-size-fits-all solution?
   ○ e.g. StandardAnalyzer

● Or do we need custom analyzers for each language?
Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
  95% of the way there

● Each language analyzer provides its own flavor to the
  StandardAnalyzer

● FrenchAnalyzer
  ○ Adds an ElisionFilter (l'avion -> avion)
  ○ Adds French StopWords filter
  ○ FrenchLightStemFilter
Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there

● German has a heavy use of compound words
     ■ das Vaterland => The fatherland
     ■ Rechtsanwaltskanzleien => Law Firms

● For best search results, these compound words should
  produce index terms for their individual parts

● GermanAnalyzer lacks a Word Compound Token Filter
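To illustrate what "index terms for their individual parts" means, here is a deliberately naive dictionary-based decompounder. It is a sketch of the general idea only, not the algorithm used by Lucene's compound word filters or by any plugin; the class name and the tiny word list are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Naive illustration: emit the original token plus every dictionary
// word that occurs inside it.
class NaiveDecompounder {
    private final Set<String> dictionary;
    private final int minPartLength;

    NaiveDecompounder(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = minPartLength;
    }

    List<String> decompose(String token) {
        String lower = token.toLowerCase(Locale.GERMAN);
        List<String> terms = new ArrayList<>();
        terms.add(lower); // keep the full compound searchable as well
        for (int start = 0; start + minPartLength <= lower.length(); start++) {
            for (int end = start + minPartLength; end <= lower.length(); end++) {
                String part = lower.substring(start, end);
                if (!part.equals(lower) && dictionary.contains(part)) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Tiny hand-picked word list, for illustration only.
        Set<String> words = new HashSet<>(Arrays.asList("vater", "land", "recht", "anwalt", "kanzlei"));
        NaiveDecompounder d = new NaiveDecompounder(words, 3);
        System.out.println(d.decompose("Vaterland")); // [vaterland, vater, land]
        System.out.println(d.decompose("Rechtsanwaltskanzleien"));
    }
}
```

A substring scan like this over-generates on real text (short dictionary words match inside unrelated words), which is one reason production decompounders use better-informed segmentation.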
Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
  get you far

● Using a Standard Tokenizer to extract tokens from
  Chinese text will not produce accurate terms
  ○ Some 3rd-party Chinese analyzers will extract
     bigrams from Chinese text and index those as if they
     were words

● Need to do your research
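The bigram approach mentioned above can be sketched in a few lines: every pair of adjacent characters becomes an index term. This is a simplified illustration that assumes the input is a run of Chinese characters with no whitespace, punctuation, or mixed scripts, all of which a real analyzer would have to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of character-bigram term extraction.
class CharBigrams {

    // Returns overlapping two-character terms; works on code points so
    // characters outside the Basic Multilingual Plane are not split.
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            terms.add(new String(codePoints, i, 2));
        }
        return terms;
    }
}
```

A four-character string yields three overlapping terms, so a query run through the same bigram step can match without any dictionary-based word segmentation.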
Language Analysis
You should also know about...
● ASCII Folding Token Filter
  ○ über => uber

● ICU Analysis Plugin
   ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
   ○ Allows for unicode normalization, collation and
     folding
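For intuition about what folding does, the sketch below approximates it with the JDK's java.text.Normalizer: decompose each character, then drop the combining marks. This is a stand-in written for illustration, not the ASCII Folding Token Filter or the ICU plugin, and it covers fewer cases than either (for example, it leaves 'ß' untouched).

```java
import java.text.Normalizer;

// JDK-only approximation of accent folding.
class SimpleFolding {

    static String fold(String input) {
        // NFD splits "ü" into "u" plus a combining diaeresis ...
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // ... and the combining marks are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00fcber")); // uber
    }
}
```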
Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
  text (e.g. remove stemming)

● How do we go about doing this?
   ○ One way is to leverage ElasticSearch's flexible
     schema definitions
Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"]   <-- plus stop words? German normalization?
       }
    }
    ....
  }
}
Custom Analyzer / Token Filter
1.   Declare a schema filter for German stop words
2.   We'll also need to create a custom TokenFilter factory class to wrap Lucene's
     org.apache.lucene.analysis.de.GermanNormalizationFilter
     a.   It does not come as a pre-defined ES TokenFilter
     b.   German text needs certain characters normalized, e.g.
          'ae' and 'oe' are replaced by 'a' and 'o', respectively.

3.   Declare a schema filter for the custom GermanNormalizationFilter
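To show the kind of rewriting German normalization performs, here is a simplified stand-alone sketch. It is not Lucene's GermanNormalizationFilter: it implements only the 'ae'/'oe' rule mentioned above plus umlaut and 'ß' replacement, assumes already-lowercased input, and omits Lucene's additional context-dependent rules.

```java
// Simplified illustration of German character normalization.
class SimpleGermanNormalizer {

    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length() + 2);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append('a'); break;   // ä -> a
                case '\u00f6': out.append('o'); break;   // ö -> o
                case '\u00fc': out.append('u'); break;   // ü -> u
                case '\u00df': out.append("ss"); break;  // ß -> ss
                case 'e': {
                    // Drop an 'e' that directly follows 'a' or 'o',
                    // so "ae" -> "a" and "oe" -> "o".
                    char prev = i > 0 ? term.charAt(i - 1) : '\0';
                    if (prev != 'a' && prev != 'o') {
                        out.append(c);
                    }
                    break;
                }
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, the spellings "mädchen" and "maedchen" both normalize to "madchen", so either form in a query matches either form in the index.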
package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
           @Assisted String name, @Assisted Settings settings) {
     super(index, indexSettings, name, settings);
  }
  @Override
  public TokenStream create(TokenStream tokenStream) {
     return new GermanNormalizationFilter(tokenStream);
  }
}
Custom Analyzer / Token Filter
Define new token filters in our schema:
"settings" : {
  "analysis":{
     ....
     "filter":{
       "german_normalization":{
          "type":"org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
       },
       "german_stop":{
          "type":"stop",
          "stopwords":["_german_"],
          "enable_position_increments":"true"
       }
     }
....
Custom Analyzer / Token Filter
Create a custom German analyzer:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "german_stop", "german_normalization"]
       }
    }
    ....
  }
}
OS Projects
Language Detection
●   https://code.google.com/p/language-detection/
     ○ Written in Java
     ○ Provides language profiles with unigram, bigram, and trigram
         character frequencies
     ○ Detector provides accuracy % for each language detected

PROS
 ■ Very fast (~4k pieces of text per second)
 ■ Very reliable for text greater than 30-40 characters

CONS
 ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e.
   short tweets
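The language profiles mentioned above are built from character n-gram frequencies. The sketch below shows only that counting step, plus a length guard based on the 30-character rule of thumb from this slide. It does not use the language-detection library's API; the class, its methods, and the exact threshold value are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of character n-gram counting (unigrams up to maxN-grams).
class NGramProfile {

    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Rule of thumb from the slide: detection on very short text is unreliable.
    static boolean longEnoughToTrust(String text) {
        return text.trim().length() >= 30;
    }
}
```

A caller might skip detection, or fall back to a default analyzer, whenever longEnoughToTrust returns false, for instance on short tweets.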
OS Projects
German Word Decompounder
●   https://github.com/jprante/elasticsearch-analysis-decompound

●   Lucene offers two compound word token filters, a dictionary- &
    hyphenation-based variant
     ○ Not bundled with Lucene due to licensing issues
     ○ Require loading a word list in memory before they are run

●   The decompounder uses prebuilt Compact Patricia Tries for efficient word
    segmentation provided by the ASV toolbox
     ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm

Contenu connexe

Similaire à Language Search

Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfInexture Solutions
 
06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and AnalysisOpenThink Labs
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache PinotSiddharth Teotia
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 
JLIFF: Where we are, and where we're going
JLIFF: Where we are, and where we're goingJLIFF: Where we are, and where we're going
JLIFF: Where we are, and where we're goingChase Tingley
 
Reducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisReducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisSebastiano Panichella
 
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksSearching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksAlexandre Rafalovitch
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesJamund Ferguson
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-stepsMatteo Moci
 
Static Analysis in Go
Static Analysis in GoStatic Analysis in Go
Static Analysis in GoTakuya Ueda
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overviewAmit Juneja
 
Relevance trilogy may dream be with you! (dec17)
Relevance trilogy  may dream be with you! (dec17)Relevance trilogy  may dream be with you! (dec17)
Relevance trilogy may dream be with you! (dec17)Woonsan Ko
 
Ts archiving
Ts   archivingTs   archiving
Ts archivingConfiz
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research ReportAlex Sumner
 

Similaire à Language Search (20)

Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdf
 
06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis06. ElasticSearch : Mapping and Analysis
06. ElasticSearch : Mapping and Analysis
 
ElasticSearch
ElasticSearchElasticSearch
ElasticSearch
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache Pinot
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Plc part 2
Plc  part 2Plc  part 2
Plc part 2
 
Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
JLIFF: Where we are, and where we're going
JLIFF: Where we are, and where we're goingJLIFF: Where we are, and where we're going
JLIFF: Where we are, and where we're going
 
Reducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code AnalysisReducing Redundancies in Multi-Revision Code Analysis
Reducing Redundancies in Multi-Revision Code Analysis
 
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasksSearching for AI - Leveraging Solr for classic Artificial Intelligence tasks
Searching for AI - Leveraging Solr for classic Artificial Intelligence tasks
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax Trees
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-steps
 
Static Analysis in Go
Static Analysis in GoStatic Analysis in Go
Static Analysis in Go
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overview
 
Relevance trilogy may dream be with you! (dec17)
Relevance trilogy  may dream be with you! (dec17)Relevance trilogy  may dream be with you! (dec17)
Relevance trilogy may dream be with you! (dec17)
 
Ts archiving
Ts   archivingTs   archiving
Ts archiving
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research Report
 

Dernier

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Dernier (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Language Search

  • 1. Language Search ElasticSearch Boston Meetup - 3/27 Bryan Warner - Traackr
  • 2. About me ● Bryan Warner - Developer @Traackr ○ bwarner@traackr.com ● I've worked with ElasticSearch since early 2012 ... before that I had worked with Lucene & Solr ● Primary background is in Java back-end development ● Shifting focus into Scala development past year
  • 3. About Traackr ● Influencer search engine ● We track content daily & in real-time for our database of influential people ● We leverage ElasticSearch parent/child (top-children) queries to search content (i.e. the children) to surface the influencers who've authored it (i.e. the parents) ● Some of our back-end stack includes: ElasticSearch, MongoDb, Java/Spring, Scala/Akka, etc.
  • 4. Overview ● Indexing / Querying strategies to support language- targeted searches within ES ● ES Analyzers / TokenFilters for language analysis ● Custom Analyzers / TokenFilters for ES ● Look at some OS projects that assist in language detection & analysis
  • 5. Use Case ● We have a database of articles written in many languages ● We want our users to be able to search articles written in a particular language ● We want that search to handle the nuances for that particular language
  • 6. Reference Schema { "settings" : { "index": { "number_of_shards" : 6, "number_of_replicas" : 1 }, "analysis":{ "analyzer": {}, "tokenizer": {}, "filter":{} } }, "mappings": { "article": { "text" : {"type" : "string", "analyzer":"standard", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true}, "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } } }
  • 7. Indexing Strategies Separate indices per language - OR - Same index for all languages
  • 8. Indexing Strategies Separate Indices per language PROS ■ Clean separation ■ Truer IDF values ○ IDF = log(numDocs/(docFreq+1)) + 1 CONS ■ Increased Overhead ■ Parent/Child queries -> parent document duplication ○ Same problem for Solr Joins ■ Maintain schema per index
  • 9. Indexing Strategies Same index for all languages PROS ■ One index to maintain (and one schema) ■ Parent/Child queries are fine CONS ■ Schema complexity grows ■ IDF values might be skewed
  • 10. Indexing Strategies Same index for all languages ... how? 1. Create different "mapping" types per language a. At indexing time, we set the right mapping based on the article's language 2. Create different fields per language-analyzed field a. At indexing time, we populate the correct text field based on the article's language
  • 11. "mappings": { "article_en": { "text" : {"type" : "string", "analyzer":"english", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} }, "article_fr": { "text" : {"type" : "string", "analyzer":"french", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} }, "article_de": { "text" : {"type" : "string", "analyzer":"german", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } }
  • 12. "mappings": { "article": { "text_en" : {"type" : "string", "analyzer":"english", "store":true}, "text_fr" : {"type" : "string", "analyzer":"french", "store":true}, "text_de" : {"type" : "string", "analyzer":"german", "store":true}, "author:" {"type" : "string", "analyzer":"simple", "store": true} "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true} } }
  • 13. Querying Strategies How do we execute a language-targeted search? ... all based on our indexing strategy.
  • 14. Querying Strategies (1) Separate Indices per language ... String targetIndex = getIndexForLanguage(languageParam); SearchRequestBuilder request = client.prepareSearch(targetIndex) .setTypes("article"); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field("text"); query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 15. Querying Strategies (2a) Same index for language - Diff. mappings ... String targetMapping = getMappingForLanguage(languageParam); SearchRequestBuilder request = client.prepareSearch("your_index") .setTypes(targetMapping); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field("text"); query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 16. Querying Strategies (2b) Same index for language - Diff. fields ... SearchRequestBuilder request = client.prepareSearch("your_index") .setTypes("article"); QueryStringQueryBuilder query = QueryBuilders.queryString( "boston elasticsearch"); query.field(text_en|text_fr|text_de); // pick one query.analyzer(english|french|german); // pick one request.setQuery(query); SearchResponse searchResponse = request.execute().actionGet(); ...
  • 17. Querying Strategies ● Will these strategies support a multi-language search? ○ E.g. Search by french and german ○ E.g. Search against all languages ● Yes! * ● In the same SearchRequest: ○ We can search against multiple indices ○ We can search against multiple "mapping" types ○ We can search against multiple fields * Need to give thought which query analyzer to use
  • 18. Language Analysis ● What does ElasticSearch and/or Lucene offer us for analyzing various languages? ● Is there a one-size-fits-all solution? ○ e.g. StandardAnalyzer ● Or do we need custom analyzers for each language?
  • 19. Language Analysis StandardAnalyzer - The Good ● For many languages (french, spanish), it will get you 95% of the way there ● Each language analyzer provides its own flavor to the StandardAnalyzer ● FrenchAnalyzer ○ Adds an ElisionFilter (l'avion -> avion) ○ Adds French StopWords filter ○ FrenchLightStemFilter
  • 20. Language Analysis StandardAnalyzer - The Bad ● For some languages, it will get you 2/3 of the way there ● German has a heavy use of compound words ■ das Vaterland => The fatherland ■ Rechtsanwaltskanzleien => Law Firms ● For best search results, these compound words should produce index terms for their individual parts ● GermanAnalyzer lacks a Word Compound Token Filter
  • 21. Language Analysis StandardAnalyzer - The Ugly ● For other languages (e.g. Asian languages), it will not get you far ● Using a Standard Tokenizer to extract tokens from Chinese text will not produce accurate terms ○ Some 3rd-party Chinese analyzers will extract bigrams from Chinese text and index those as if they were words ● Need to do your research
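The bigram approach mentioned above can be sketched as follows. This is a simplified illustration, not the code of any particular analyzer: the class name is hypothetical, and it assumes the input is a run of Chinese characters with whitespace and punctuation already removed.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration; a real analyzer also handles punctuation, mixed scripts, etc.
final class CjkBigrams {
    // Emits every overlapping pair of adjacent characters (by Unicode code point).
    static List<String> bigrams(String text) {
        int[] codePoints = text.codePoints().toArray();
        List<String> terms = new ArrayList<>();
        for (int i = 0; i + 1 < codePoints.length; i++) {
            terms.add(new String(codePoints, i, 2));
        }
        return terms;
    }
}
```

A four-character input produces three overlapping bigrams, each of which is indexed as if it were a word.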
  • 22. Language Analysis You should also know about... ● ASCII Folding Token Filter ○ über => uber ● ICU Analysis Plugin ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html ○ Allows for unicode normalization, collation and folding
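The effect of folding can be approximated with the JDK alone. The sketch below is not the ASCII Folding Token Filter or the ICU plugin; it is a rough stand-in (with a hypothetical class name) that decomposes characters and strips combining marks, so it only covers accented letters. Characters without a decomposition, such as ß, pass through unchanged.

```java
import java.text.Normalizer;

// Rough approximation using only the JDK; not the Lucene/ES filter itself.
final class AccentFolding {
    // Decomposes accented characters (NFD) and removes the combining marks.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Here fold("\u00fcber") (i.e. über) returns "uber".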
  • 23. Custom Analyzer / Token Filter ● Let's create a custom analyzer definition for German text (e.g. remove stemming) ● How do we go about doing this? ○ One way is to leverage ElasticSearch's flexible schema definitions
  • 24. Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
  • 25. Custom Analyzer / Token Filter Create a custom German analyzer in our schema: "settings" : { .... "analysis":{ "analyzer":{ "custom_text_german":{ "type": "custom", "tokenizer": "standard", "filter": ["standard", "lowercase"], stop words, german normalization? } } .... } }
  • 26. Custom Analyzer / Token Filter 1. Declare schema filter for German stop words 2. We'll also need to create a custom TokenFilter class to wrap Lucene's org.apache.lucene.analysis.de.GermanNormalizationFilter a. It does not come as a pre-defined ES TokenFilter b. German text needs certain characters normalized, e.g. 'ae' and 'oe' are replaced by 'a' and 'o', respectively. 3. Declare schema filter for custom GermanNormalizationFilter
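To make the normalization rules concrete, here is a simplified stand-alone sketch. It is not the Lucene filter: the class name is hypothetical, it expects already-lowercased input, and it deliberately omits the context-dependent handling of 'ue' that the real GermanNormalizationFilter performs.

```java
// Simplified illustration of German normalization; not the Lucene implementation.
final class SimpleGermanNormalizer {
    // Expects lowercased input. Digraphs are handled first so that a converted
    // umlaut followed by 'e' (e.g. "sa" + "en" from "s\u00e4en") is not collapsed.
    static String normalize(String lowercased) {
        return lowercased
                .replace("ae", "a")          // 'ae' -> 'a'
                .replace("oe", "o")          // 'oe' -> 'o'
                .replace("\u00df", "ss")     // sharp s -> 'ss'
                .replace('\u00e4', 'a')      // a-umlaut -> 'a'
                .replace('\u00f6', 'o')      // o-umlaut -> 'o'
                .replace('\u00fc', 'u');     // u-umlaut -> 'u'
    }
}
```

With these rules, "maedchen" and "m\u00e4dchen" both normalize to "madchen", so either spelling matches the same index term.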
  • 27. package org.elasticsearch.index.analysis; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.analysis.de.GermanNormalizationFilter; import org.elasticsearch.common.inject.Inject; import org.elasticsearch.common.inject.assistedinject.Assisted; import org.elasticsearch.common.settings.Settings; import org.elasticsearch.index.Index; import org.elasticsearch.index.settings.IndexSettings; public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory { @Inject public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings, @Assisted String name, @Assisted Settings settings) { super(index, indexSettings, name, settings); } @Override public TokenStream create(TokenStream tokenStream) { return new GermanNormalizationFilter(tokenStream); } }
  • 28. Custom Analyzer / Token Filter Define new token filters in our schema: "settings" : { "analysis":{ .... "filter":{ "german_normalization":{ "type":"org.elasticsearch.index.analysis.GermanNormalizationFilterFactory" }, "german_stop":{ "type":"stop", "stopwords":["_german_"], "enable_position_increments":"true" } } ....
  • 29. Custom Analyzer / Token Filter Create a custom German analyzer: "settings" : { .... "analysis":{ "analyzer":{ "custom_text_german":{ "type":"custom", "tokenizer": "standard", "filter": ["standard", "lowercase", "german_stop", "german_normalization"] } } .... } }
  • 30. OS Projects Language Detection ● https://code.google.com/p/language-detection/ ○ Written in Java ○ Provides language profiles with unigram, bigram, and trigram character frequencies ○ Detector provides accuracy % for each language detected PROS ■ Very fast (~4k pieces of text per second) ■ Very reliable for text greater than 30-40 characters CONS ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e. short tweets
  • 31. OS Projects German Word Decompounder ● https://github.com/jprante/elasticsearch-analysis-decompound ● Lucene offers two compound word token filters: a dictionary-based and a hyphenation-based variant ○ Not bundled with Lucene due to licensing issues ○ Require loading a word list in memory before they are run ● The decompounder uses prebuilt Compact Patricia Tries for efficient word segmentation provided by the ASV toolbox ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm