2. Intro
● Traackr is an advanced influencer discovery and monitoring tool
● We offer users the ability to identify influential authors who are relevant to a particular set of keywords
● Keywords can be a mix of exact and non-exact phrases (up to 50)
3. Background
● Over the past year, we’ve done a lot of tuning around our text tokenization strategies
● Strategies that work very well for non-exact phrase matching have adverse side-effects for exact phrase matching (and vice-versa)
● Despite our best efforts, a single analyzed field for content could not satisfy both use cases
4. Tokenization Primer
● Whitespace Tokenizer
● Basic - used in conjunction with a word-delimiter filter
● Invisible characters aren’t handled (e.g. Left-to-Right markers)
● @foobar node.js is great. (#tech) =>
@foobar[1], node.js[2], is[3], great.[4], (#tech)[5]

@Override
protected boolean isTokenChar(int c) {
    return !Character.isWhitespace(c);
}
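Output like the above is easy to verify with the Analyze API; a minimal sketch (on ES 5+ the endpoint accepts a JSON body - earlier versions take query-string parameters instead):

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "@foobar node.js is great. (#tech)"
}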
5. Tokenization Primer
● Pattern Tokenizer
● Separates text into terms via regular expressions (default is \W+)
● @foobar node.js is great. (#tech) =>
@foobar[1], node.js[2], is[3], great.[4], #tech[5]

"tokenizer": {
  "custom_pattern_tokenizer": {
    "type": "pattern",
    "pattern": "[()\\s]+",
    "group": -1
  }
}

● Regex can often become unwieldy, especially with I18n concerns
● Contextual token separators not really feasible
● Used in conjunction with a word-delimiter filter
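For completeness, a sketch of where such a tokenizer fragment lives in the index settings (the index and analyzer names here are illustrative, and note that the backslash must be escaped inside JSON):

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "[()\\s]+",
          "group": -1
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "custom",
          "tokenizer": "custom_pattern_tokenizer"
        }
      }
    }
  }
}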
6. Tokenization Primer
Other tokenizers / filters to keep in mind...
● nGram & Edge nGram Tokenizers
○ Favored in autocomplete use-cases (see the sketch at the end of this slide)
○ Produce a lot of tokens
● Path Hierarchy / Email-URL Tokenizers
○ Specialty use-cases
● Stemmers (Porter, Snowball)
○ These are filters applied post-tokenization
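A minimal edge n-gram tokenizer sketch for the autocomplete case mentioned above (the name and gram sizes are illustrative):

"tokenizer": {
  "autocomplete_edge": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": ["letter", "digit"]
  }
}

With settings like these, ‘great’ would yield gr, gre, grea, great - which is exactly why these tokenizers produce a lot of tokens.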
7. Word Delimiter
● Word Delimiter Filter
● Produces sub-word tokens based on numerous rules (non-alphanumeric chars, case transitions, etc.)
● Great for relaxed queries, but causes a headache with exact phrases
● @foobar node.js is great. (#tech) =>
@foobar[1], foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6], tech[6]

"filter": {
  "custom_word_delimiter": {
    "type": "word_delimiter",
    "generate_word_parts": "1",
    "generate_number_parts": "0",
    "catenate_words": "1",
    "catenate_numbers": "0",
    "catenate_all": "0",
    "split_on_case_change": "1",
    "preserve_original": "1"
  }
}
8. Word Delimiter
● In the previous example, a query_string search on ‘@foobar’ will match text containing foobar (with or without the ‘@’)
● To solve this, the word delimiter can be configured with a type table:

"filter": {
  "custom_word_delimiter": {
    …
    "type_table": [
      "# => ALPHA",
      "@ => ALPHA"
    ]
  }
}

● @foobar node.js is great. (#tech) =>
@foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6]
● However, many edge-cases still arise and the increased # of tokens has an effect on relevance
9. Standard Tokenizer
● Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29
● It does so by assigning character classes to underlying Unicode characters
○ ALetter, Numeric, Extended_NumLetter
○ Single_Quote, Double_Quote
○ Mid_Number, Mid_Letter, Mid_NumLetter
● These character classes determine how word boundaries are detected in a piece of text - smart in that it’s all contextual
10. Standard Tokenizer
● For instance, let’s examine the Mid_NumLetter character class
○ If assigned, it states that a Unicode character will be treated as a word boundary break unless surrounded by alpha-numerics
○ By default, the ‘.’ is categorized as a Mid_NumLetter
● Most punctuation symbols (e.g. parentheses, brackets, ‘@’, ‘#’, etc.) are treated as hard word boundary breaks
● @foobar node.js is great. (#tech) =>
foobar[1], node.js[2], is[3], great[4], tech[5]
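The same contextual logic applies to numbers: ‘,’ is a Mid_Number by default, so it is preserved between digits but breaks elsewhere. A quick check with the stock standard tokenizer (Analyze API sketch, ES 5+ body syntax):

POST /_analyze
{
  "tokenizer": "standard",
  "text": "1,024 a,b"
}

Expected tokens: 1,024 | a | b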
11. Standard Tokenizer
● Unfortunately, there isn’t an easy way to customize the default word boundary rules as implemented by the Standard Tokenizer
● One can either...
○ Use a mapping char_filter to map punctuation symbols like ‘@’, ‘#’, etc. to other characters that are treated like alpha-numerics (e.g. the ‘_’ is treated as an extended num-letter) - see the sketch below
○ Copy the Standard Tokenizer source and make modifications … which is what the Javadoc actually says to do :)
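A sketch of the first option - a mapping char_filter (the replacement strings are illustrative; bear in mind that indexed tokens will then contain the substitutions, so queries must pass through the same analyzer):

"char_filter": {
  "symbol_mapper": {
    "type": "mapping",
    "mappings": [
      "@ => _at_",
      "# => _hash_"
    ]
  }
}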
12. Standard Token. Extension
● Now, there’s an ES plugin to do what we need:
https://github.com/bbguitar77/elasticsearch-analysis-standardext
● We can override the default character class for any Unicode character that we desire
● @foobar node.js is great. (#tech) =>
@foobar[1], node.js[2], is[3], great[4], #tech[5]
"tokenizer": {
"my_standard_ext": {
"type": "standard_ext",
"mappings": [
"@=>EXNL",
"#=>EXNL"
]
}
}
13. Standard Token. Extension
The supported word-boundary property types are (see the example below):
● L -> A Letter
● N -> Numeric
● EXNL -> Extended Number Letter (preserved at start, middle, end of alpha-numeric characters - e.g. '_')
● MNL -> Mid Number Letter (preserved between alpha-numeric characters - e.g. '.')
● MN -> Mid Number (preserved between numerics - e.g. ',')
● ML -> Mid Letter (preserved between letters - e.g. ':')
● SQ -> Single-quote
● DQ -> Double-quote
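For example, using the mapping syntax from slide 12, one could (illustratively) promote the hyphen to a letter so that hyphenated terms survive as single tokens:

"mappings": [
  "-=>L"
]

With that override, ‘t-mobile’ would be emitted as one token instead of being split.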
15. Standard Token. Extension
● Advantages
○ Reap all the benefits of the Standard Tokenizer (context-based segmentation rules, special char. handling, etc.) while retaining some flexibility in how certain Unicode characters are handled
○ Works very well for analyzed fields where we don’t want to incur the overhead / side-effects of the word-delimiter filter
○ Simpler configuration - tokenization rules are not spread between the tokenizer & token-filters
16. Outcome
Best strategy for us was having two separate analyzed fields for content (see the mapping sketch at the end of this slide)

Strictly analyzed field - exact phrase matching -

"analysis" : {
  ...
  "strict_text" : {
    "type" : "custom",
    "char_filter" : ["html_strip"],
    "tokenizer" : "my_standard_ext",
    "filter" : ["lowercase", "icu_folding"]
  }
  …

Broadly analyzed field - relaxed phrase matching -

"analysis" : {
  ...
  "broad_text" : {
    "type" : "custom",
    "char_filter" : ["html_strip"],
    "tokenizer" : "my_standard_ext",
    "filter" : ["custom_word_delim", "lowercase", "stop", "icu_folding"]
  }
  …
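To expose both analyzers on the same content, a multi-field mapping along these lines can be used (the field names are illustrative; ES 1.x-era ‘string’ type shown):

"properties": {
  "content": {
    "type": "string",
    "analyzer": "broad_text",
    "fields": {
      "strict": {
        "type": "string",
        "analyzer": "strict_text"
      }
    }
  }
}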
17. Outcome
When a mix of exact & non-exact phrases is present, the user query is transformed into a simple Bool query composed of two underlying query_string queries (a JSON equivalent follows the snippet):

// buildQueryStringQuery(...) is our helper that wraps a set of keywords in a query_string query
BoolQueryBuilder query = new BoolQueryBuilder();
if (!relaxedKeywords.isEmpty()) {
    query.should(buildQueryStringQuery(relaxedKeywords));
}
if (!exactKeywords.isEmpty()) {
    query.should(buildQueryStringQuery(exactKeywords));
}
query.minimumNumberShouldMatch(1);
query.disableCoord(true);
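For reference, the JSON such a builder roughly produces, targeting the two field variants (the field names follow the illustrative mapping on the previous slide, and the keywords are just examples):

{
  "bool": {
    "should": [
      { "query_string": { "query": "\"node.js\"", "fields": ["content.strict"] } },
      { "query_string": { "query": "influencer marketing", "fields": ["content"] } }
    ],
    "minimum_should_match": 1,
    "disable_coord": true
  }
}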
18. Questions
● Questions?
● If you’re interested in learning more about Traackr, please see us after the presentation or email us - we’re hiring!
○ jobs@traackr.com
○ 5k referral program
● Tech Blog - http://traackr-people.tumblr.com