Text Tokenization 
ElasticSearch Boston Meetup - 10/14/14 
Bryan Warner 
bwarner@traackr.com
Intro 
● Traackr is an advanced Influencer discovery and monitoring tool 
● We offer users the ability to identify influential authors who are relevant 
against a particular set of keywords 
● Keywords can be a mix of exact and non-exact phrases (up to 50)
Background 
● Over the past year, we’ve done a lot of tuning around our text tokenization 
strategies 
● Strategies that work very well for non-exact phrase matching have adverse 
side-effects for exact phrase matching (and vice-versa) 
● Despite our best efforts, a single analyzed field for content could not serve 
both use cases
Tokenization Primer 
● Whitespace Tokenizer 
● Invisible characters aren’t handled 
(e.g. Left-to-Right markers) 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great.[4], (#tech)[5] 
@Override 
protected boolean isTokenChar(int c) { 
    return !Character.isWhitespace(c); 
} 
● Basic - Used in conjunction with a word-delimiter filter
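To make the whitespace rule concrete, here is a minimal stdlib-only sketch that applies the same isTokenChar predicate outside of Lucene. The class name WhitespaceSketch and the tokenize helper are hypothetical; this is not the Lucene tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSketch {
    // Same predicate as the slide: any non-whitespace code point is a token char.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c);
    }

    // Hypothetical helper: collects runs of token chars into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [@foobar, node.js, is, great., (#tech)]
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
        // The left-to-right mark U+200E is not whitespace, so it stays glued
        // to the first token, which ends up 4 chars long instead of 3.
        System.out.println(tokenize("\u200Efoo bar").get(0).length());
    }
}
```

The second call illustrates the "invisible characters" caveat: the marker becomes part of the indexed term, so a search for the plain word would not match it.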
Tokenization Primer 
● Pattern Tokenizer 
● Separate text into terms via regular 
expressions (default is \W+) 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great.[4], #tech[5] 
"tokenizer": { 
  "custom_pattern_tokenizer": { 
    "type": "pattern", 
    "pattern": "[()\\s]+", 
    "group": -1 
  } 
} 
● Regex can often become unwieldy, especially with I18n concerns 
● Contextual token separators not really feasible 
● Used in conjunction with a word-delimiter filter
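The separator pattern can be tried with plain java.util.regex. This is a rough sketch, assuming the intended pattern is "runs of parentheses or whitespace"; PatternSketch and tokenize are hypothetical names, and the real pattern tokenizer runs inside the analysis chain rather than through String splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PatternSketch {
    // Assumed separator pattern: runs of parentheses or whitespace.
    private static final Pattern SEPARATORS = Pattern.compile("[()\\s]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : SEPARATORS.split(text)) {
            if (!piece.isEmpty()) { // split() can yield a leading empty string
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [@foobar, node.js, is, great., #tech]
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
    }
}
```

Note that the trailing '.' on "great." survives, because the pattern has no notion of context. That is the gap the word-delimiter filter is usually asked to fill.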
Tokenization Primer 
Other tokenizers / filters to keep in mind... 
● nGram & Edge nGram Tokenizers 
○ Favored in autocomplete use-cases 
○ Produce a lot of tokens 
● Path Hierarchy / Email-URL Tokenizers 
○ Specialty use-cases 
● Stemmers (Porter, Snowball) 
○ These are filters applied post-tokenization
Word Delimiter 
● Word Delimiter Filter 
● Produces sub-word tokens based on 
numerous rules (non-alphanumeric 
chars, case transitions, etc.) 
● Great for relaxed queries, but causes a 
headache with exact phrases 
● @foobar node.js is great. (#tech) => 
"filter":{ 
"custom_word_delimiter":{ 
"type":"word_delimiter", 
"generate_word_parts":"1", 
"generate_number_parts":"0", 
"catenate_words":"1", 
"catenate_numbers":"0", 
"catenate_all":"0", 
"split_on_case_change":"1", 
"preserve_original":"1" 
}} 
@foobar[1], foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6], 
tech[6]
Word Delimiter 
● In the prev. example, a query_string 
search on ‘@foobar’ will match text containing 
foobar (with or w/o the ‘@’) 
● To solve this, the word delimiter can be 
configured with a type table 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6] 
● However, many edge-cases still arise and the increased # of tokens has an 
effect on relevance 
"filter":{ 
  "custom_word_delimiter":{ 
    … 
    "type_table":[ 
      "# => ALPHA", 
      "@ => ALPHA" 
    ] 
  } 
} 
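The effect of a type table can be illustrated with a heavily simplified sketch. It only covers splitting on non-alphanumeric characters and on lower-to-upper case changes; preserve_original, catenation, and number handling are left out. WordDelimiterSketch, parts, and alphaOverrides are hypothetical names, not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class WordDelimiterSketch {
    // Splits one token into sub-word parts on non-alphanumeric characters and
    // on lower-to-upper case changes. Characters in alphaOverrides are treated
    // as letters, mimicking a type_table entry such as "@ => ALPHA".
    static List<String> parts(String token, Set<Character> alphaOverrides) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean wordChar = Character.isLetterOrDigit(c) || alphaOverrides.contains(c);
            boolean caseChange = wordChar && current.length() > 0
                    && Character.isLowerCase(current.charAt(current.length() - 1))
                    && Character.isUpperCase(c);
            if ((!wordChar || caseChange) && current.length() > 0) {
                parts.add(current.toString());
                current.setLength(0);
            }
            if (wordChar) {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parts("@foobar", Set.of()));    // [foobar]
        System.out.println(parts("@foobar", Set.of('@'))); // [@foobar]
        System.out.println(parts("node.js", Set.of('@'))); // [node, js]
    }
}
```

With the override in place, '@foobar' no longer produces a bare 'foobar' part, which is why a search on '@foobar' stops matching plain 'foobar'.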
Standard Tokenizer 
● Implements the Word Break rules from the Unicode Text Segmentation 
algorithm, as specified in Unicode Standard Annex #29 
● It does so by assigning character classes to underlying Unicode characters 
○ ALetter, Numeric, Extended_NumLetter 
○ Single_Quote, Double_Quote 
○ Mid_Number, Mid_Letter, Mid_NumLetter 
● These character classes determine how word boundaries are detected in a 
piece of text - smart in that it’s all contextual
Standard Tokenizer 
● For instance, let’s examine the Mid_NumLetter character class 
○ If assigned, it states that a Unicode character will be treated as a word 
boundary break unless surrounded by alpha-numerics 
○ By default, the ‘.’ is categorized as a Mid_NumLetter 
● Most punctuation symbols (e.g. parentheses, brackets, ‘@’, ‘#’, etc.) are 
treated as hard word boundary breaks 
● @foobar node.js is great. (#tech) => 
foobar[1], node.js[2], is[3], great[4], tech[5]
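The Mid_NumLetter behaviour can be approximated in a few lines. This sketch hard-codes only two rules (alpha-numerics plus '_' are always kept, and '.' is kept only between two of them) and ignores the rest of UAX #29; MidNumLetterSketch and its helpers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class MidNumLetterSketch {
    // Letters, digits, and '_' always belong to a token in this sketch.
    static boolean isAlnum(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // A '.' stays inside a token only when it sits between two alpha-numerics.
    static boolean isTokenChar(String text, int i) {
        char c = text.charAt(i);
        if (isAlnum(c)) {
            return true;
        }
        return c == '.' && i > 0 && i + 1 < text.length()
                && isAlnum(text.charAt(i - 1)) && isAlnum(text.charAt(i + 1));
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (isTokenChar(text, i)) {
                current.append(text.charAt(i));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [foobar, node.js, is, great, tech]
        System.out.println(tokenize("@foobar node.js is great. (#tech)"));
    }
}
```

The same '.' is kept in "node.js" and dropped after "great", which is the contextual behaviour a single regex separator cannot express easily.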
Standard Tokenizer 
● Unfortunately, there isn’t an easy way to customize the default word 
boundary rules as implemented by the S.T. algorithm. 
● One can either: 
○ Use a mapping char_filter to map punctuation symbols like ‘@’, ‘#’, 
etc. to other characters that are treated like alpha-numerics (e.g. the 
‘_’ is treated as an extended num-letter) 
○ Copy the S.T. source and make modifications … which is what the 
Javadoc actually says to do :)
Standard Token. Extension 
● Now, there’s an ES plugin to do what we need: 
https://github.com/bbguitar77/elasticsearch-analysis-standardext 
● We can override the default character class 
for any Unicode character that we desire 
● @foobar node.js is great. (#tech) => 
@foobar[1], node.js[2], is[3], great[4], #tech[5] 
"tokenizer": { 
"my_standard_ext": { 
"type": "standard_ext", 
"mappings": [ 
"@=>EXNL", 
"#=>EXNL" 
] 
} 
}
Standard Token. Extension 
The supported word-boundary property types are: 
● L -> A Letter 
● N -> Numeric 
● EXNL -> Extended Number Letter (preserved at start, middle, end of 
alpha-numeric characters - e.g. '_') 
● MNL -> Mid Number Letter (preserved between alpha-numeric characters 
- e.g. '.') 
● MN -> Mid Number (preserved between numerics - e.g. ',') 
● ML -> Mid Letter (preserved between letters - e.g. ':') 
● SQ -> Single-quote 
● DQ -> Double-quote
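To show how overriding a character's class changes the output, here is a simplified sketch covering only two of the property types (EXNL and MNL). It is not the plugin's code; StandardExtSketch, WordClass, and the override map are hypothetical names, and the defaults for '_' and '.' follow the examples in the list above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StandardExtSketch {
    enum WordClass { EXNL, MNL }

    private final Map<Character, WordClass> classes = new HashMap<>();

    StandardExtSketch(Map<Character, WordClass> overrides) {
        classes.put('_', WordClass.EXNL); // defaults taken from the list above
        classes.put('.', WordClass.MNL);
        classes.putAll(overrides);        // caller-supplied mappings win
    }

    // Letters, digits, and EXNL characters are kept anywhere in a token.
    private boolean isCore(char c) {
        return Character.isLetterOrDigit(c) || classes.get(c) == WordClass.EXNL;
    }

    // MNL characters are kept only between two core characters.
    private boolean keeps(String text, int i) {
        char c = text.charAt(i);
        if (isCore(c)) {
            return true;
        }
        return classes.get(c) == WordClass.MNL && i > 0 && i + 1 < text.length()
                && isCore(text.charAt(i - 1)) && isCore(text.charAt(i + 1));
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (keeps(text, i)) {
                current.append(text.charAt(i));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "@foobar node.js is great. (#tech)";
        // prints [foobar, node.js, is, great, tech]
        System.out.println(new StandardExtSketch(Map.of()).tokenize(text));
        // prints [@foobar, node.js, is, great, #tech]
        System.out.println(new StandardExtSketch(
                Map.of('@', WordClass.EXNL, '#', WordClass.EXNL)).tokenize(text));
    }
}
```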
Standard Token. Extension 
"index":{ 
  "analysis":{ 
    "analyzer" : { 
      "my_analyzer" : { 
        "type":"custom", 
        "char_filter":[], 
        "tokenizer": "my_standard_ext", 
        "filter":["lowercase", "stop"] 
      } 
    }, 
    "tokenizer": { 
      "my_standard_ext": { 
        "type": "standard_ext", 
        "mappings": [ 
          "@=>EXNL", 
          "#=>EXNL" 
        ] 
      } 
    } 
  } 
}
Standard Token. Extension 
● Advantages 
○ Reap all the benefits of the Standard Tokenizer (context-based 
segmentation rules, special char. handling, etc.) while retaining some 
flexibility in how certain Unicode characters are handled 
○ Works very well for analyzed fields where we don’t want to incur the 
overhead / side-effects of the word-delimiter filter 
○ Simpler configuration - tokenization rules are not spread between the 
tokenizer & token-filters
Outcome 
Best strategy for us was having two separate analyzed fields for content 
Strictly analyzed field 
- exact phrase matching - 
"analysis" : { 
... 
"strict_text" : { 
"type":"custom", 
"char_filter" : ["html_strip"], 
"tokenizer" : "my_standard_ext", 
"filter" : ["lowercase", "icu_folding"] 
} 
… 
Broadly analyzed field 
- relaxed phrase matching - 
"analysis" : { 
... 
"broad_text" : { 
"type":"custom", 
"char_filter" : ["html_strip"], 
"tokenizer" : "my_standard_ext", 
"filter" : ["custom_word_delim", 
"lowercase", "stop", "icu_folding"] 
} 
…
Outcome 
When a mix of exact & non-exact phrases is present, the user query is 
transformed into a simple Bool query composed of two underlying query_string 
queries 
BoolQueryBuilder query = new BoolQueryBuilder(); 
if (!relaxedKeywords.isEmpty()) 
query.should(buildQueryStringQuery(relaxedKeywords)); 
if (!exactKeywords.isEmpty()) 
query.should(buildQueryStringQuery(exactKeywords)); 
query.minimumNumberShouldMatch(1); 
query.disableCoord(true);
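The slide does not show how keywords are split or how buildQueryStringQuery works, so the following is only an illustrative sketch. It assumes exact phrases are the keywords wrapped in double quotes, and the field names strict_content and broad_content are hypothetical; each resulting string would feed one of the two query_string queries.

```java
import java.util.ArrayList;
import java.util.List;

final class KeywordSplitSketch {
    // Assumption: an exact phrase is a keyword wrapped in double quotes.
    static boolean isExact(String keyword) {
        return keyword.length() >= 2 && keyword.startsWith("\"") && keyword.endsWith("\"");
    }

    // Returns the keywords whose exactness matches the wantExact flag.
    static List<String> select(List<String> keywords, boolean wantExact) {
        List<String> out = new ArrayList<>();
        for (String keyword : keywords) {
            if (isExact(keyword) == wantExact) {
                out.add(keyword);
            }
        }
        return out;
    }

    // Joins keywords into a query_string expression scoped to one field.
    static String queryString(String field, List<String> keywords) {
        return field + ":(" + String.join(" OR ", keywords) + ")";
    }

    public static void main(String[] args) {
        List<String> keywords = List.of("\"node.js\"", "influencer marketing", "\"@foobar\"");
        // prints strict_content:("node.js" OR "@foobar")
        System.out.println(queryString("strict_content", select(keywords, true)));
        // prints broad_content:(influencer marketing)
        System.out.println(queryString("broad_content", select(keywords, false)));
    }
}
```

Routing each group to its own field keeps the word-delimiter side effects away from the exact phrases while still letting relaxed keywords benefit from them.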
Questions 
● Questions? 
● If you’re interested in learning more about Traackr, 
please see us after the presentation or email us - we’re 
hiring! 
○ jobs@traackr.com 
○ 5k referral program 
● Tech Blog - http://traackr-people.tumblr.com
