Text Tokenization 
ElasticSearch Boston Meetup - 10/14/14 
Bryan Warner 
bwarner@traackr.com
Intro 
● Traackr is an advanced Influencer discovery and monitoring tool 
● We offer users the ability to identify influential authors who are relevant 
against a particular set of keywords 
● Keywords can be a mix of exact and non-exact phrases (up to 50)
Background 
● Over the past year, we’ve done a lot of tuning around our text tokenization 
strategies 
● Strategies that work very well for non-exact phrase matching have adverse 
side-effects for exact phrase matching (and vice-versa) 
● Despite our best efforts, a single analyzed field for content could not serve 
both use cases
Tokenization Primer 
● Whitespace Tokenizer 
● Invisible characters aren’t handled 
(e.g. Left-to-Right markers) 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great.[4], (#tech)[5] 
    @Override 
    protected boolean isTokenChar(int c) { 
        return !Character.isWhitespace(c); 
    } 
● Basic - Used in conjunction with a word-delimiter filter
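
To make the invisible-character caveat concrete, here is a minimal plain-Java sketch of whitespace tokenization. It reuses the isTokenChar predicate from the slide; the class name WhitespaceTokenizerSketch and the tokenize helper are hypothetical and are not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenizerSketch {
    // Same predicate as on the slide: every non-whitespace code point is a token character.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
    }

    // Hypothetical helper: collects maximal runs of token characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
        // A left-to-right mark (U+200E) is not whitespace, so it stays glued to "tech":
        // the first token has 5 chars instead of 4.
        System.out.println(tokenize("\u200Etech news").get(0).length());
    }
}
```

The second call shows why a later filter (or a char_filter) has to deal with such markers: the token looks like "tech" but will not match a query for it.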
Tokenization Primer 
● Pattern Tokenizer 
● Separate text into terms via regular 
expressions (default is \W+) 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great.[4], #tech[5] 
    "tokenizer": { 
      "custom_pattern_tokenizer": { 
        "type": "pattern", 
        "pattern": "[()\\s]+", 
        "group": -1 
    }} 
● Regex can often become unwieldy, especially with I18n concerns 
● Contextual token separators not really feasible 
● Used in conjunction with a word-delimiter filter
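
The separator pattern can be tried outside Elasticsearch with java.util.regex. This is a sketch that assumes the pattern is [()\s]+ used as a separator (group -1); PatternTokenizerSketch and tokenize are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PatternTokenizerSketch {
    // Separator pattern from the slide's config: runs of parentheses or whitespace.
    static final Pattern SEPARATORS = Pattern.compile("[()\\s]+");

    // Hypothetical helper: split on separators and drop empty strings.
    static List<String> tokenize(String text) {
        return Arrays.stream(SEPARATORS.split(text))
                .filter(t -> !t.isEmpty()) // a leading separator yields one empty string
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [@foobar, node.js, is, great., #tech]
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
    }
}
```

Note that "great." keeps its trailing period: a plain separator regex has no notion of context, which is the limitation the slide points out.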
Tokenization Primer 
Other tokenizers / filters to keep in mind... 
● nGram & Edge nGram Tokenizers 
○ Favored in autocomplete use-cases 
○ Produce a lot of tokens 
● Path Hierarchy / Email-URL Tokenizers 
○ Specialty use-cases 
● Stemmers (Porter, Snowball) 
○ These are filters applied post-tokenization
Word Delimiter 
● Word Delimiter Filter 
● Produces sub-word tokens based on 
numerous rules (non-alphanumeric 
chars, case transitions, etc.) 
● Great for relaxed queries, but causes a 
headache with exact phrases 
● @foobar node.js is great. (#tech) => 
@foobar[1], foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6], 
tech[6] 
    "filter":{ 
      "custom_word_delimiter":{ 
        "type":"word_delimiter", 
        "generate_word_parts":"1", 
        "generate_number_parts":"0", 
        "catenate_words":"1", 
        "catenate_numbers":"0", 
        "catenate_all":"0", 
        "split_on_case_change":"1", 
        "preserve_original":"1" 
    }}
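
The token explosion is easier to see in code. The following is a rough plain-Java approximation of what the filter does to a single token under the settings above, not the Lucene filter itself: it treats digits like letters, ignores token positions, and the names WordDelimiterSketch and expand are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class WordDelimiterSketch {
    // Expands one token into: the original, its sub-word parts, and the catenated parts.
    static Set<String> expand(String token) {
        Set<String> out = new LinkedHashSet<>();
        out.add(token); // preserve_original
        List<String> parts = new ArrayList<>();
        // Split on runs of non-alphanumeric characters.
        for (String chunk : token.split("[^\\p{L}\\p{N}]+")) {
            // split_on_case_change: break between a lowercase and an uppercase letter.
            for (String part : chunk.split("(?<=\\p{Ll})(?=\\p{Lu})")) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
        }
        out.addAll(parts); // generate_word_parts
        if (parts.size() > 1) {
            out.add(String.join("", parts)); // catenate_words
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("node.js")); // prints [node.js, node, js, nodejs]
        System.out.println(expand("@foobar")); // prints [@foobar, foobar]
    }
}
```

One input token becomes up to four indexed terms, which is what makes relaxed matching work and exact phrases awkward.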
Word Delimiter 
● In the prev. example, a query_string 
search on ‘@foobar’ will match text containing 
foobar (with or w/o the ‘@’) 
● To solve this, the word delimiter can be 
configured with a type table 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6] 
● However, many edge-cases still arise and the increased # of tokens has an 
effect on relevance 
"filter":{ 
"custom_word_delimiter":{ 
… 
"type_table":[ 
"# => ALPHA", 
"@ => ALPHA" 
] 
}}
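
Why the type table fixes the ‘@foobar’ problem can be sketched as a term-overlap check. This is a simplified model under stated assumptions: a query token matches a document token when their expansions share a term, and extraAlpha stands in for the characters mapped to ALPHA. TypeTableSketch, expand and matches are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class TypeTableSketch {
    // Original token plus its sub-word parts; characters listed in extraAlpha
    // are treated as letters (the effect of "x => ALPHA" in a type table).
    static Set<String> expand(String token, String extraAlpha) {
        Set<String> out = new LinkedHashSet<>();
        out.add(token);
        StringBuilder part = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isLetterOrDigit(c) || extraAlpha.indexOf(c) >= 0) {
                part.append(c);
            } else if (part.length() > 0) {
                out.add(part.toString());
                part.setLength(0);
            }
        }
        if (part.length() > 0) {
            out.add(part.toString());
        }
        return out;
    }

    // Simplified matching model: the two expansions share at least one term.
    static boolean matches(String queryToken, String docToken, String extraAlpha) {
        Set<String> shared = new LinkedHashSet<>(expand(queryToken, extraAlpha));
        shared.retainAll(expand(docToken, extraAlpha));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(matches("@foobar", "foobar", ""));   // prints true
        System.out.println(matches("@foobar", "foobar", "@#")); // prints false
    }
}
```

Without the type table, ‘@foobar’ also emits ‘foobar’ and therefore matches plain mentions of foobar; with ‘@’ treated as ALPHA the sub-word is never generated.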
Standard Tokenizer 
● Implements the Word Break rules from the Unicode Text Segmentation 
algorithm, as specified in Unicode Standard Annex #29 
● It does so by assigning character classes to underlying Unicode characters 
○ ALetter, Numeric, Extended_NumLetter 
○ Single_Quote, Double_Quote 
○ Mid_Number, Mid_Letter, Mid_NumLetter 
● These character classes determine how word boundaries are detected in a 
piece of text - smart in that it’s all contextual
Standard Tokenizer 
● For instance, let’s examine the Mid_NumLetter character class 
○ If assigned, it states that a Unicode character will be treated as a word 
boundary break unless surrounded by alpha-numerics 
○ By default, the ‘.’ is categorized as a Mid_NumLetter 
● Most punctuation symbols (e.g. parentheses, brackets, ‘@’, ‘#’, etc.) are 
treated as hard word boundary breaks 
● @foobar node.js is great. (#tech) => 
foobar[1], node.js[2], is[3], great[4], tech[5]
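
The contextual Mid_NumLetter rule can be illustrated with a few lines of plain Java. This sketch models only that one rule (a ‘.’ survives only between alphanumerics) and treats every other non-alphanumeric character as a hard break; it is not the full UAX #29 algorithm, and MidNumLetSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class MidNumLetSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean nextIsAlnum = i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (c == '.' && current.length() > 0 && nextIsAlnum) {
                // Mid_NumLetter: kept only when surrounded by alphanumerics.
                current.append(c);
            } else if (current.length() > 0) {
                // Hard word boundary: emit the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [foobar, node.js, is, great, tech]
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
    }
}
```

The same ‘.’ is kept in "node.js" and dropped after "great", which a separator regex cannot express.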
Standard Tokenizer 
● Unfortunately, there isn’t an easy way to customize the default word 
boundary rules as implemented by the S.T. algorithm. 
● One can either: 
○ Use a mapping char_filter to map punctuation symbols like ‘@’, ‘#’, 
etc. to other characters that are treated like alpha-numerics (e.g. the 
‘_’ is treated as an extended num-letter) 
○ Copy the S.T. source and make modifications … which is what the 
Javadoc actually says to do :)
Standard Token. Extension 
● Now, there’s an ES plugin to do what we need: 
https://github.com/bbguitar77/elasticsearch-analysis-standardext 
● We can override the default character class 
for any Unicode character that we desire 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great[4], #tech[5] 
"tokenizer": { 
"my_standard_ext": { 
"type": "standard_ext", 
"mappings": [ 
"@=>EXNL", 
"#=>EXNL" 
] 
} 
}
Standard Token. Extension 
The supported word-boundary property types are: 
● L -> A Letter 
● N -> Numeric 
● EXNL -> Extended Number Letter (preserved at start, middle, end of 
alpha-numeric characters - e.g. '_') 
● MNL -> Mid Number Letter (preserved between alpha-numeric characters 
- e.g. '.') 
● MN -> Mid Number (preserved between numerics - e.g. ',') 
● ML -> Mid Letter (preserved between letters - e.g. ':') 
● SQ -> Single-quote 
● DQ -> Double-quote
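
The idea of overriding a character's word-break class can be sketched in plain Java. This is an illustration of the concept only, not the plugin's code: it models just L, N, EXNL and MNL (MN, ML, SQ and DQ are left out), and StandardExtSketch, WordClass, classify and tokenize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StandardExtSketch {
    enum WordClass { L, N, EXNL, MNL, BREAK }

    // Overrides win; otherwise fall back to a simplified default classification.
    static WordClass classify(char c, Map<Character, WordClass> overrides) {
        WordClass override = overrides.get(c);
        if (override != null) return override;
        if (Character.isLetter(c)) return WordClass.L;
        if (Character.isDigit(c)) return WordClass.N;
        if (c == '_') return WordClass.EXNL;
        if (c == '.') return WordClass.MNL;
        return WordClass.BREAK;
    }

    static boolean isAlnum(WordClass wc) {
        return wc == WordClass.L || wc == WordClass.N;
    }

    static List<String> tokenize(String text, Map<Character, WordClass> overrides) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        WordClass prev = WordClass.BREAK;
        for (int i = 0; i < text.length(); i++) {
            WordClass wc = classify(text.charAt(i), overrides);
            WordClass next = i + 1 < text.length()
                    ? classify(text.charAt(i + 1), overrides) : WordClass.BREAK;
            // EXNL is kept anywhere in a word; MNL only between alphanumerics.
            boolean keep = isAlnum(wc) || wc == WordClass.EXNL
                    || (wc == WordClass.MNL && isAlnum(prev) && isAlnum(next));
            if (keep) {
                current.append(text.charAt(i));
                prev = wc;
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                prev = WordClass.BREAK;
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "@foobar node.js is great. (#tech)";
        System.out.println(tokenize(text, Map.of())); // prints [foobar, node.js, is, great, tech]
        // prints [@foobar, node.js, is, great, #tech]
        System.out.println(tokenize(text, Map.of('@', WordClass.EXNL, '#', WordClass.EXNL)));
    }
}
```

Mapping ‘@’ and ‘#’ to EXNL is the only change between the two calls, mirroring the "@=>EXNL", "#=>EXNL" mappings in the tokenizer config.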
Standard Token. Extension 
"index":{ 
  "analysis":{ 
    "analyzer" : { 
      "my_analyzer" : { 
        "type":"custom", 
        "char_filter":[], 
        "tokenizer": "my_standard_ext", 
        "filter":["lowercase", "stop"] 
      } 
    }, 
    "tokenizer": { 
      "my_standard_ext": { 
        "type": "standard_ext", 
        "mappings": [ 
          "@=>EXNL", 
          "#=>EXNL" 
        ] 
      } 
    } 
  } 
}
Standard Token. Extension 
● Advantages 
○ Reap all the benefits of the Standard Tokenizer (context-based 
segmentation rules, special char. handling, etc.) while retaining some 
flexibility in how certain Unicode characters are handled 
○ Works very well for analyzed fields where we don’t want to incur the 
overhead / side-effects of the word-delimiter filter 
○ Simpler configuration - tokenization rules are not spread between the 
tokenizer & token-filters
Outcome 
Best strategy for us was having two separate analyzed fields for content 
Strictly analyzed field 
- exact phrase matching - 
"analysis" : { 
... 
"strict_text" : { 
"type":"custom", 
"char_filter" : ["html_strip"], 
"tokenizer" : "my_standard_ext", 
"filter" : ["lowercase", "icu_folding"] 
} 
… 
Broadly analyzed field 
- relaxed phrase matching - 
"analysis" : { 
... 
"broad_text" : { 
"type":"custom", 
"char_filter" : ["html_strip"], 
"tokenizer" : "my_standard_ext", 
"filter" : ["custom_word_delim", 
"lowercase", "stop", "icu_folding"] 
} 
…
Outcome 
When a mix of exact & non-exact phrases is present, the user query is 
transformed into a simple Bool query composed of two underlying query_string 
queries 
BoolQueryBuilder query = new BoolQueryBuilder(); 
if (!relaxedKeywords.isEmpty()) 
query.should(buildQueryStringQuery(relaxedKeywords)); 
if (!exactKeywords.isEmpty()) 
query.should(buildQueryStringQuery(exactKeywords)); 
query.minimumNumberShouldMatch(1); 
query.disableCoord(true);
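
For readers without the Java client at hand, the shape of the resulting request can be sketched by building the JSON by hand. Everything specific here is an assumption made for illustration: the field names content.broad and content.strict, joining keywords with OR, and wrapping exact phrases in double quotes. The disable_coord flag mirrors disableCoord(true) above and applies to Elasticsearch versions of that era.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class BoolQuerySketch {
    // Hypothetical field names for the broadly and strictly analyzed fields.
    static final String BROAD_FIELD = "content.broad";
    static final String STRICT_FIELD = "content.strict";

    // Minimal JSON string escaping (backslash and double quote only).
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String queryString(String field, String query) {
        return "{\"query_string\":{\"default_field\":\"" + field
                + "\",\"query\":\"" + jsonEscape(query) + "\"}}";
    }

    static String boolQuery(List<String> relaxedKeywords, List<String> exactKeywords) {
        List<String> should = new ArrayList<>();
        if (!relaxedKeywords.isEmpty()) {
            should.add(queryString(BROAD_FIELD, String.join(" OR ", relaxedKeywords)));
        }
        if (!exactKeywords.isEmpty()) {
            // Assumption: exact phrases are sent as quoted phrases to the strict field.
            String phrases = exactKeywords.stream()
                    .map(k -> "\"" + k + "\"")
                    .collect(Collectors.joining(" OR "));
            should.add(queryString(STRICT_FIELD, phrases));
        }
        return "{\"bool\":{\"should\":[" + String.join(",", should)
                + "],\"minimum_should_match\":1,\"disable_coord\":true}}";
    }

    public static void main(String[] args) {
        System.out.println(boolQuery(List.of("tech"), List.of("node.js is great")));
    }
}
```

Each should clause targets the field whose analyzer suits that kind of keyword, and minimum_should_match of 1 means a document qualifies if either clause matches.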
Questions 
● Questions? 
● If you’re interested in learning more about Traackr, 
please see us after the presentation or email us - we’re 
hiring! 
○ jobs@traackr.com 
○ 5k referral program 
● Tech Blog - http://traackr-people.tumblr.com

Contenu connexe

Tendances

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingCloudxLab
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 wordsananth
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2Yuriy Guts
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyoutsider2
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introductionananth
 
Natural Language Processing glossary for Coders
Natural Language Processing glossary for CodersNatural Language Processing glossary for Coders
Natural Language Processing glossary for CodersAravind Mohanoor
 

Tendances (20)

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Nltk
NltkNltk
Nltk
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
NLTK
NLTKNLTK
NLTK
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2UCU NLP Summer Workshops 2017 - Part 2
UCU NLP Summer Workshops 2017 - Part 2
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Text summarization
Text summarizationText summarization
Text summarization
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Python NLTK
Python NLTKPython NLTK
Python NLTK
 
Natural Language Processing glossary for Coders
Natural Language Processing glossary for CodersNatural Language Processing glossary for Coders
Natural Language Processing glossary for Coders
 

En vedette

Boston Startup School - OO Design
Boston Startup School - OO DesignBoston Startup School - OO Design
Boston Startup School - OO DesignBryan Warner
 
Real-time Data Processing
Real-time Data ProcessingReal-time Data Processing
Real-time Data ProcessingBryan Warner
 
Chapter 01 Planning Computer Program (re-upload)
Chapter 01 Planning Computer Program (re-upload)Chapter 01 Planning Computer Program (re-upload)
Chapter 01 Planning Computer Program (re-upload)bluejayjunior
 
Encryption and Tokenization: Friend or Foe?
Encryption and Tokenization: Friend or Foe?Encryption and Tokenization: Friend or Foe?
Encryption and Tokenization: Friend or Foe?Zach Gardner
 
Introduction to Tokenization
Introduction to TokenizationIntroduction to Tokenization
Introduction to TokenizationNabeel Yoosuf
 
What is Payment Tokenization?
What is Payment Tokenization?What is Payment Tokenization?
What is Payment Tokenization?Rambus Inc
 

En vedette (7)

Boston Startup School - OO Design
Boston Startup School - OO DesignBoston Startup School - OO Design
Boston Startup School - OO Design
 
Real-time Data Processing
Real-time Data ProcessingReal-time Data Processing
Real-time Data Processing
 
Tokenization
TokenizationTokenization
Tokenization
 
Chapter 01 Planning Computer Program (re-upload)
Chapter 01 Planning Computer Program (re-upload)Chapter 01 Planning Computer Program (re-upload)
Chapter 01 Planning Computer Program (re-upload)
 
Encryption and Tokenization: Friend or Foe?
Encryption and Tokenization: Friend or Foe?Encryption and Tokenization: Friend or Foe?
Encryption and Tokenization: Friend or Foe?
 
Introduction to Tokenization
Introduction to TokenizationIntroduction to Tokenization
Introduction to Tokenization
 
What is Payment Tokenization?
What is Payment Tokenization?What is Payment Tokenization?
What is Payment Tokenization?
 

Similaire à Text Tokenization

Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfInexture Solutions
 
Autocomplete in elasticsearch
Autocomplete in elasticsearchAutocomplete in elasticsearch
Autocomplete in elasticsearchTaimur Qureshi
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
gdscWorkShopJavascriptintroductions.pptx
gdscWorkShopJavascriptintroductions.pptxgdscWorkShopJavascriptintroductions.pptx
gdscWorkShopJavascriptintroductions.pptxsandeshshahapur
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayMichael Yarichuk
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеSergey Platonov
 
Elastic search custom chinese analyzer
Elastic search custom chinese analyzerElastic search custom chinese analyzer
Elastic search custom chinese analyzerLearningTech
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysismengistu23
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Tokenization and how to use it from scratch
Tokenization and how to use it from scratchTokenization and how to use it from scratch
Tokenization and how to use it from scratchMahmoud Yasser
 
Preparing Java 7 Certifications
Preparing Java 7 CertificationsPreparing Java 7 Certifications
Preparing Java 7 CertificationsGiacomo Veneri
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxSreeLaya9
 
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XML
M.FLORENCE DAYANA WEB DESIGN -Unit 5   XMLM.FLORENCE DAYANA WEB DESIGN -Unit 5   XML
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XMLDr.Florence Dayana
 
Introduction to pygments
Introduction to pygmentsIntroduction to pygments
Introduction to pygmentsroskakori
 
Find Anything In Your APEX App - Fuzzy Search with Oracle Text
Find Anything In Your APEX App - Fuzzy Search with Oracle TextFind Anything In Your APEX App - Fuzzy Search with Oracle Text
Find Anything In Your APEX App - Fuzzy Search with Oracle TextCarsten Czarski
 

Similaire à Text Tokenization (20)

Elasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdfElasticsearch Analyzers Field-Level Optimization.pdf
Elasticsearch Analyzers Field-Level Optimization.pdf
 
Autocomplete in elasticsearch
Autocomplete in elasticsearchAutocomplete in elasticsearch
Autocomplete in elasticsearch
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
gdscWorkShopJavascriptintroductions.pptx
gdscWorkShopJavascriptintroductions.pptxgdscWorkShopJavascriptintroductions.pptx
gdscWorkShopJavascriptintroductions.pptx
 
ANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy WayANTLR - Writing Parsers the Easy Way
ANTLR - Writing Parsers the Easy Way
 
Дмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI векеДмитрий Нестерук, Паттерны проектирования в XXI веке
Дмитрий Нестерук, Паттерны проектирования в XXI веке
 
Elastic search custom chinese analyzer
Elastic search custom chinese analyzerElastic search custom chinese analyzer
Elastic search custom chinese analyzer
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Tokenization and how to use it from scratch
Tokenization and how to use it from scratchTokenization and how to use it from scratch
Tokenization and how to use it from scratch
 
Ch2 neworder
Ch2 neworderCh2 neworder
Ch2 neworder
 
ElasticSearch
ElasticSearchElasticSearch
ElasticSearch
 
C Programming - Refresher - Part IV
C Programming - Refresher - Part IVC Programming - Refresher - Part IV
C Programming - Refresher - Part IV
 
XML Prep
XML PrepXML Prep
XML Prep
 
Preparing Java 7 Certifications
Preparing Java 7 CertificationsPreparing Java 7 Certifications
Preparing Java 7 Certifications
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptx
 
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XML
M.FLORENCE DAYANA WEB DESIGN -Unit 5   XMLM.FLORENCE DAYANA WEB DESIGN -Unit 5   XML
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XML
 
Introduction to pygments
Introduction to pygmentsIntroduction to pygments
Introduction to pygments
 
Find Anything In Your APEX App - Fuzzy Search with Oracle Text
Find Anything In Your APEX App - Fuzzy Search with Oracle TextFind Anything In Your APEX App - Fuzzy Search with Oracle Text
Find Anything In Your APEX App - Fuzzy Search with Oracle Text
 
Hacking XPATH 2.0
Hacking XPATH 2.0Hacking XPATH 2.0
Hacking XPATH 2.0
 

Dernier

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Dernier (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Text Tokenization

  • 1. Text Tokenization ElasticSearch Boston Meetup - 10/14/14 Bryan Warner bwarner@traackr.com
  • 2. Intro ● Traackr is an advanced Influencer discovery and monitoring tool ● We offer users the ability to identify influential authors who are relevant against a particular set of keywords ● Keywords can be a mix of exact and non-exact phrases (up to 50)
  • 3. Background ● Over the past year, we’ve done a lot of tuning around our text tokenization strategies ● Strategies that work very well for non-exact phrase matching have adverse side-effects for exact phrase matching (and vice-versa) ● Despite our best efforts, a single analyzed field for content could not suffice both use cases
  • 4. Tokenization Primer ● Whitespace Tokenizer ● Invisible characters aren’t handled (e.g. Left-to-Right markers) ● @foobar node.js is great. (#tech) => @Override protected boolean isTokenChar(int c) { } @foobar[1], node.js[2], is[3], great.[4], (#tech)[5] return !Character.isWhitespace(c); ● Basic - Used in conjunction with a word-delimiter filter
  • 5. Tokenization Primer ● Pattern Tokenizer ● Separate text into terms via regular expressions (default is W+) ● @foobar node.js is great. (#tech) => "tokenizer": { "custom_pattern_tokenizer”: { }} @foobar[1], node.js[2], is[3], great.[4], #tech[5] "type": "pattern", "pattern": "[()s]+", "group": -1 ● Regex can often become unwieldy, especially with I18n concerns ● Contextual token separators not really feasible ● Used in conjunction with a word-delimiter filter
  • 6. Tokenization Primer Other tokenizers / filters to keep in mind... ● nGram & Edge nGram Tokenizers ○ Favored in autocomplete use-cases ○ Produce a lot of tokens ● Path Hierarchy / Email-URL Tokenizers ○ Specialty use-cases ● Stemmers (Porter, Snowball) ○ These are filters applied post-tokenization
  • 7. Word Delimiter ● Word Delimiter Filter ● Produces sub-word tokens based on numerous rules (non-alphanumeric chars, case transitions, etc.) ● Great for relaxed queries, but causes a headache with exact phrases ● @foobar node.js is great. (#tech) => "filter":{ "custom_word_delimiter":{ "type":"word_delimiter", "generate_word_parts":"1", "generate_number_parts":"0", "catenate_words":"1", "catenate_numbers":"0", "catenate_all":"0", "split_on_case_change":"1", "preserve_original":"1" }} @foobar[1], foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6], tech[6]
  • 8. Word Delimiter ● In the prev. example, a query_string search on ‘@foobar’ will match text containing foobar (with or w/o the ‘@’) ● To solve this, the word delimiter can be configured with a type table ● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6] ● However, many edge-cases still arise and the increased # of tokens has an effect on relevance "filter":{ "custom_word_delimiter":{ … "type_table":[ "# => ALPHA", "@ => ALPHA", ] }}
  • 9. Standard Tokenizer
    ● Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29
    ● It does so by assigning character classes to underlying Unicode characters
      ○ ALetter, Numeric, Extended_NumLetter
      ○ Single_Quote, Double_Quote
      ○ Mid_Number, Mid_Letter, Mid_NumLetter
    ● These character classes determine how word boundaries are detected in a piece of text - smart in that it’s all contextual
  • 10. Standard Tokenizer
    ● For instance, let’s examine the Mid_NumLetter character class
      ○ If assigned, a Unicode character is treated as a word boundary break unless it is surrounded by alpha-numerics
      ○ By default, the ‘.’ is categorized as a Mid_NumLetter
    ● Most punctuation symbols (e.g. parentheses, brackets, ‘@’, ‘#’, etc.) are treated as hard word boundary breaks
    ● @foobar node.js is great. (#tech) => foobar[1], node.js[2], is[3], great[4], tech[5]
  • 11. Standard Tokenizer
    ● Unfortunately, there isn’t an easy way to customize the default word boundary rules as implemented by the Standard Tokenizer algorithm
    ● One can either:
      ○ Use a mapping char_filter to map punctuation symbols like ‘@’, ‘#’, etc. to other characters that are treated like alpha-numerics (e.g. the ‘_’ is treated as an extended num-letter)
      ○ Copy the Standard Tokenizer source and make modifications … which is what the Javadoc actually says to do :)
  • 12. Standard Token. Extension
    ● Now, there’s an ES plugin to do what we need: https://github.com/bbguitar77/elasticsearch-analysis-standardext
    ● We can override the default character class for any Unicode character that we desire
    ● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], is[3], great[4], #tech[5]

        "tokenizer": {
          "my_standard_ext": {
            "type": "standard_ext",
            "mappings": [
              "@=>EXNL",
              "#=>EXNL"
            ]
          }
        }
  • 13. Standard Token. Extension
    The supported word-boundary property types are:
    ● L -> A Letter
    ● N -> Numeric
    ● EXNL -> Extended Number Letter (preserved at start, middle, end of alpha-numeric characters - e.g. '_')
    ● MNL -> Mid Number Letter (preserved between alpha-numeric characters - e.g. '.')
    ● MN -> Mid Number (preserved between numerics - e.g. ',')
    ● ML -> Mid Letter (preserved between letters - e.g. ':')
    ● SQ -> Single-quote
    ● DQ -> Double-quote
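The effect of overriding a character class can be illustrated with a toy tokenizer. This is a rough sketch under strong simplifying assumptions: it looks only at the immediately neighbouring characters, covers just the L, N, EXNL, MNL, MN and ML types, and is neither the UAX #29 algorithm nor the plugin's code. The class `WordBreakSketch` and its default table are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy word-break tokenizer with per-character class overrides (hypothetical helper). */
final class WordBreakSketch {

    enum CharClass { L, N, EXNL, MNL, MN, ML, BREAK }

    private final Map<Character, CharClass> overrides;

    WordBreakSketch(Map<Character, CharClass> overrides) {
        this.overrides = new HashMap<>(overrides);
    }

    // Overrides win; otherwise fall back to a tiny default table.
    CharClass classOf(char c) {
        CharClass override = overrides.get(c);
        if (override != null) return override;
        if (Character.isLetter(c)) return CharClass.L;
        if (Character.isDigit(c)) return CharClass.N;
        switch (c) {
            case '_': return CharClass.EXNL;
            case '.': return CharClass.MNL;
            case ',': return CharClass.MN;
            case ':': return CharClass.ML;
            default:  return CharClass.BREAK;
        }
    }

    // Decides whether the character at index i is kept inside a token.
    private boolean keeps(String text, int i) {
        CharClass prev = i > 0 ? classOf(text.charAt(i - 1)) : CharClass.BREAK;
        CharClass next = i + 1 < text.length() ? classOf(text.charAt(i + 1)) : CharClass.BREAK;
        boolean prevL = prev == CharClass.L, prevN = prev == CharClass.N;
        boolean nextL = next == CharClass.L, nextN = next == CharClass.N;
        switch (classOf(text.charAt(i))) {
            case L: case N: case EXNL: return true;                  // always part of a token
            case MNL: return (prevL || prevN) && (nextL || nextN);   // between alpha-numerics
            case MN:  return prevN && nextN;                         // between numerics
            case ML:  return prevL && nextL;                         // between letters
            default:  return false;                                  // hard break
        }
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (keeps(text, i)) {
                current.append(text.charAt(i));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

With no overrides, the deck's sample sentence tokenizes to `foobar`, `node.js`, `is`, `great`, `tech`, matching the Standard Tokenizer slide. Mapping `'@'` and `'#'` to `EXNL` gives `@foobar`, `node.js`, `is`, `great`, `#tech`, matching the `standard_ext` example.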
  • 14. Standard Token. Extension

        "index": {
          "analysis": {
            "analyzer": {
              "my_analyzer": {
                "type": "custom",
                "char_filter": [],
                "tokenizer": "my_standard_ext",
                "filter": ["lowercase", "stop"]
              }
            },
            "tokenizer": {
              "my_standard_ext": {
                "type": "standard_ext",
                "mappings": [
                  "@=>EXNL",
                  "#=>EXNL"
                ]
              }
            }
          }
        }
  • 15. Standard Token. Extension
    ● Advantages
      ○ Reap all the benefits of the Standard Tokenizer (context-based segmentation rules, special char. handling, etc.) while retaining some flexibility in how certain Unicode characters are handled
      ○ Works very well for analyzed fields where we don’t want to incur the overhead / side-effects of the word-delimiter filter
      ○ Simpler configuration - tokenization rules are not spread between the tokenizer & token-filters
  • 16. Outcome
    The best strategy for us was having two separate analyzed fields for content.

    Strictly analyzed field - exact phrase matching:

        "analysis" : { ...
          "strict_text" : {
            "type" : "custom",
            "char_filter" : ["html_strip"],
            "tokenizer" : "my_standard_ext",
            "filter" : ["lowercase", "icu_folding"]
          }
        …

    Broadly analyzed field - relaxed phrase matching:

        "analysis" : { ...
          "broad_text" : {
            "type" : "custom",
            "char_filter" : ["html_strip"],
            "tokenizer" : "my_standard_ext",
            "filter" : ["custom_word_delim", "lowercase", "stop", "icu_folding"]
          }
        …
  • 17. Outcome
    When a mix of exact & non-exact phrases is present, the user query is transformed into a simple Bool query composed of two underlying query_string queries:

        BoolQueryBuilder query = new BoolQueryBuilder();
        // One optional clause per keyword group
        if (!relaxedKeywords.isEmpty())
          query.should(buildQueryStringQuery(relaxedKeywords));
        if (!exactKeywords.isEmpty())
          query.should(buildQueryStringQuery(exactKeywords));
        // At least one of the clauses has to match
        query.minimumNumberShouldMatch(1);
        query.disableCoord(true);
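The deck does not show how keywords are divided into the two groups or how each query_string is built, so the following is only one possible sketch. It assumes that an exact keyword is one the user wrapped in double quotes, and that each group is rendered as an OR of phrase clauses against its own field. The class `KeywordQuerySketch` and the field names `content.strict` and `content.broad` are hypothetical, and escaping of query_string special characters is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

/** Sketch of splitting keywords and rendering query_string text (hypothetical helper). */
final class KeywordQuerySketch {

    // Assumption: an exact keyword is wrapped in double quotes by the user.
    static boolean isExact(String keyword) {
        return keyword.length() >= 2 && keyword.startsWith("\"") && keyword.endsWith("\"");
    }

    // Returns the keywords whose exactness equals the requested flag, in input order.
    static List<String> select(List<String> keywords, boolean exact) {
        List<String> out = new ArrayList<>();
        for (String keyword : keywords) {
            if (isExact(keyword) == exact) {
                out.add(keyword);
            }
        }
        return out;
    }

    // Renders keywords as field:"phrase" clauses joined with OR.
    static String queryString(String field, List<String> keywords) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (String keyword : keywords) {
            String phrase = isExact(keyword)
                    ? keyword.substring(1, keyword.length() - 1)   // strip the user's quotes
                    : keyword;
            clauses.add(field + ":\"" + phrase + "\"");
        }
        return clauses.toString();
    }
}
```

Under these assumptions, the exact group would be rendered against the strictly analyzed field and the relaxed group against the broadly analyzed one, and each resulting string would feed one of the two `should` clauses.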
  • 18. Questions
    ● Questions?
    ● If you’re interested in learning more about Traackr, please see us after the presentation or email us - we’re hiring!
      ○ jobs@traackr.com
      ○ 5k referral program
    ● Tech Blog - http://traackr-people.tumblr.com