SlideShare une entreprise Scribd logo
1  sur  36
Télécharger pour lire hors ligne
Text Mining 3
String Processing
!
Madrid Summer School 2014
Advanced Statistics and Data Mining
!
Florian Leitner
florian.leitner@upm.es
License:
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Incentive and Applications
• Today’s goals:
‣ converting strings/text into “processable units” (features for Machine Learning)
‣ detecting patterns and comparing strings (string similarity and matching)
!
• Feature generation for machine learning tasks
‣ Detecting a token’s grammatical use-case (noun, verb, adjective, …)
‣ Normalizing/regularizing tokens (“[he] was” and “[I] am” are forms of the verb “be”)
‣ Recognizing entities from a collection of strings (a “gazetteer” or dictionary)
• Near[est] Neighbor-based text/string classification tasks
• String searching (find, UNIX’ “grep”)
• Pattern matching (locating e-mail addresses, telephone numbers, …)
59
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 60
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 61
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Text Extraction from
Documents
• Mostly a commercial business
• “an engineering problem, not a scientific challenge”
• PDF extraction tools
‣ xpdf - “the” PDF extraction library
• http://www.foolabs.com/xpdf/

(available via pdfto*!on POSIX systems)
‣ PDFMiner - a Python 2.x (!) library
• https://euske.github.io/pdfminer/
‣ LA-PDF-Text - layout aware PDF
extraction from scientific articles (Java)
• https://github.com/BMKEG/lapdftextProject
‣ Apache PDFBox - the Java toolkit for
working with PDF documents
• http://pdfbox.apache.org/
• Optical Character Recognition
(OCR) libraries
‣ Tesseract - Google’s OCR engine (C++)
• https://code.google.com/p/tesseract-ocr/
• https://code.google.com/p/pytesser/

(a Python 2.x (!) wrapper for the Tesseract engine)
‣ OCRopus - another Open Source OCR
engine (Python)
• https://code.google.com/p/ocropus/
• Generic text extraction (Office
documents, HTML, Mobi, ePub,
…)
‣ Apache Tika “content analysis
toolkit” (Language Detection!)
• http://tika.apache.org/
62
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 63
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Lexers and Tokenization
• Lexer: converts (“tokenizes”) strings into token sequences
‣ a.k.a.: Tokenizer, Scanner
• Tokenization strategies
‣ whitespace (“s+”) [“PhD.”, “U.S.A.”]
‣ word (letter vs. non-letter) boundary (“b”) [“PhD”, “.”, “U”, “.”, “S”, “.”, “A”, “.”]
‣ Unicode category-based ([“C”, “3”, “P”, “0”], [“Ph”, “D”, “.”, “U”, “.”, “S”, “.” …])
‣ linguistic ([“that”, “ʼs”]; [“do”, “nʼt”]; [“adjective-like”]; [“PhD”, “.”, “U.S.A”, “.”])
‣ whitespace/newline-preserving ([“This”, “ ”, “is”, “ ”, …]) or not
• Task-oriented choice: document classification, entity detection,
linguistic analysis, …
64
This is a sentence .
This is a sentence.
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Regular Expressions
‣ /^[^@]{1,64}@w(?:[w.-]{0,254}w)?.w{2,}$/
• (provided w has Unicode support)
• http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
!
‣ Wild card: . (a dot → matches any character)
‣ Groupings: [A-Z0-9]; Negations: [^a-z]; Captures (save_this)
‣ Markers: ^ “start of string/line”; $ “end of string/line”; [b “word boundary”]
‣ Pattern repetitions: * “zero or more”; *? “zero or more, non-greedy”;

+ “one or more”; +? “one or more, non-greedy”; ? “zero or one”
‣ Example: b([a-z]+)b.+$ captures lower-case words except at the EOS
65
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Finite State Automata 1/2
• Matching single words or patterns:
‣ Regular Expressions are “translated” to finite-state automata
‣ Automata can be deterministic (DFA, O(n)) or non-deterministic (NFA, O(nm))
• where n is the length of the string being scanned and m is the length of the string being matched
‣ Programming languages’ default implementation are (slower) NFAs because
certain special expressions (e.g., “look-ahead” and “look-behind”) cannot be
implemented with a DFA
‣ DFA implementations “in the wild”
• RE2 by Google: https://code.google.com/p/re2/

[C++, and the default RE engine of Golang,

with wrappers for a few other languages]
• dk.brics.automaton: http://www.brics.dk/automaton/

[Java, but very memory “hungry”]
66
1 2 3
a b
b
(ab)+
1 2
a
b
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Finite State Automata 2/2
Dictionaries:
Exact string matching of

word collections
Compressed Prefix Trees (Tries):

PATRICIA Trie
Prefixes → autocompletion functionality
Minimal (compressed) DFA

(a.k.a. MADFA, DAWG or DAFSA)
a compressed dictionary
Python: pip install DAWG

https://github.com/kmike/DAWG

[“kmike” being a NLTK dev.]
Daciuk et al. 2000
67
top
tops
tap
taps
Trie MADFA
PATRICIA Trie
Images Source: WikiMedia Commons, Saffles & Chkno
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 68
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Sentence Segmentation
• Sentences are the fundamental linguistic unit
‣ Sentence: boundary or “constraint” for linguistic phenomena
‣ collocations [“United Kingdom”, “vice president”], idioms [“drop me a line”],
phrases [PP: “of great fame”] and clauses, statements, …
• Rule/pattern-based segmentation
‣ Segment sentences if the marker is followed by an upper-case letter
• /[.!?]s+[A-Z]/
‣ Special cases: abbreviations, digits, lower-case proper nouns (genes,
“amnesty international”, …), hyphens, quotation marks, …
• Supervised sentence boundary detection
‣ Use some Markov model or a conditional random field to identify possible
sentence segmentation tokens
‣ Requires labeled examples (segmented sentences)
69
turns out letter case is a poor manʼs rule!
maintaining a proper rule set
gets messy fast
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Punkt Sentence Tokenizer
(PST) 1/2
• Unsupervised Multilingual Sentence Boundary Detection
• P(●|w-1) > ccpc
• Determines if a marker ● is used as an abbreviation marker by comparing the conditional probability that the
word w-1 before ● is followed by the marker against some (high) cutoff probability.
‣ P(●|w-1) = P(w-1, ●) ÷ P(w-1)
‣ K&S set c = 0.99
• P(w+1|w-1) > P(w+1)
• Evaluates the likelihood that w-1 and w+1 surrounding the marker ● are more commonly collocated than would be
expected by chance: ● is assumed an abbreviation marker (“not independent”) if the LHS is greater than the RHS.
• Flength(w)×Fmarkers(w)×Fpenalty(w) ≥ cabbr
• Evaluates if any of w’s morphology (length of w w/o marker characters, number of periods inside w (e.g., [“U.S.A”]),
penalized when w is not followed by a ●) makes it more likely that w is an abbreviation against some (low) cutoff.
• Fortho(w); Psstarter(w+1|●); …
• Orthography: lower-, upper-case or capitalized word after a probable ● or not
• Sentence Starter: Probability that w is found after a ●
70
Dr.
Mrs. Watson
U.S.A.
. Therefore
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Punkt Sentence Tokenizer
(PST) 2/2
• Unsupervised Multilingual
Sentence Boundary Detection
‣ Kiss & Strunk, MIT Press 2006.
‣ Available from NLTK:
nltk.tokenize.punkt (http://
www.nltk.org/api/nltk.tokenize.html)
• PST is language agnostic
‣ Requires that the language uses the
sentence segmentation marker as an
abbreviation marker
‣ Otherwise, the problem PST solves is not
present
!
!
• PST factors in word length
‣ Abbreviations are relatively shorter than
regular words
• PST takes “internal” markers into
account
‣ E.g., “U.S.A”
• Main weakness: long lists of
abbreviations
‣ E.g., author lists in citations
‣ Can be fixed with a pattern-based post-
processing strategy
• NB: a marker must be present
‣ E.g., chats or fora
71
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 72
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• The Parts of Speech:
‣ noun: NN, verb: VB, adjective: JJ, adverb: RB, preposition: IN,

personal pronoun: PRP, …
‣ e.g. the full Penn Treebank PoS tagset contains 48 tags:
‣ 34 grammatical tags (i.e., “real” parts-of-speech) for words
‣ one for cardinal numbers (“CD”; i.e., a series of digits)
‣ 13 for [mathematical] “SYM” and currency “$” symbols, various types of
punctuation, as well as for opening/closing parenthesis and quotes
73
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
“PoS tags”
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• Automatic PoS Tagging ➡ Supervised Machine Learning
‣ Maximum Entropy Markov Models
‣ Conditional Random Fields
‣ Ensemble Methods
74
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
will be explained in the Text Mining #5!
(last lesson)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 75
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Linguistic Morphology
• [Verbal] Inflections
‣ conjugation (Indo-European
languages)
‣ tense (“availability” and use varies
across languages)
‣ modality (subjunctive, conditional,
imperative)
‣ voice (active/passive)
‣ …
!
• Contractions
‣ don’t say you’re in a man’s world…

• Declensions
• on nouns, pronouns, adjectives, determiners
‣ case (nominative, accusative,
dative, ablative, genitive, …)
‣ gender (female, male, neuter)
‣ number forms (singular, plural,
dual)
‣ possessive pronouns (I➞my,
you➞your, she➞her, it➞its, … car)
‣ reflexive pronouns (for myself,
yourself, …)
‣ …
76
not a contraction:
possessive s
➡ token normalization
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Stemming vs Lemmatization
• Stemming
‣ produced by “stemmers”
‣ produces a word’s “stem”
!
‣ am ➞ am
‣ the going ➞ the go
‣ having ➞ hav
!
‣ fast and simple (pattern-based)
‣ Snowball; Lovins; Porter
• nltk.stem.*

• Lemmatization
‣ produced by “lemmatizers”
‣ produces a word’s “lemma”
!
‣ am ➞ be
‣ the going ➞ the going
‣ having ➞ have
!
‣ requires: a dictionary and PoS
‣ LemmaGen; morpha
• nltk.stem.wordnet
77
➡ token normalization
a.k.a. token “regularization”
(although that is technically the wrong wording)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 78
Text/Document Extraction
Tokenization: Lexer/Scanner
Sentence Segmentation
Part-of-Speech Tagging
Stemming/Lemmatization
String Metrics & Matching
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
• Finding similarly spelled or misspelled words
• Finding “similar” texts/documents
‣ N.B.: no semantics (yet…)
• Detecting entries in a gazetteer
‣ a list of domain-specific words
• Detecting entities in a dictionary
‣ a list of words, each mapped to some URI
‣ N.B.: “entities”, not “concepts” (no semantics…)
String Matching
79
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Jaccard Similarity
80
“words” “woody”
w
o
r
s
d y
J(A, B) =
|A  B|
|A [ B|
=
3
7
= 0.43
vs
o
6 o.5
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Similarity
• Edit Distance Measures
‣ Hamming Distance [1950]
‣ Levenshtein-Damerau Distance [1964/5]
‣ Needelman-Wunsch [1970] and

Smith-Waterman [1981] Distance
• also align the strings (“sequence alignment”)
• use dynamic programming
• Modern approaches: BLAST [1990], BWT [1994]
• Other Distance Metrics
‣ Jaro-Winkler Distance [1989/90]
• coefficient of matching characters within a dynamic
window minus transpositions
‣ Soundex (➞ homophones; spelling!)
‣ …
• Basic Operations: “indels”
‣ Insertions
• ac ➞ abc
‣ Deletions
• abc ➞ ac
!
• “Advanced” Operations
‣ Require two “indels”
‣ Substitutions
• abc ➞ aBc
‣ Transpositions
• ab ➞ ba
81
Wikipedia is gold here!
(for once…)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Distance Measures
• Hamming Distance
‣ only counts substitutions
‣ requires equal string lengths
‣ karolin ⬌ kathrin = 3
‣ karlo ⬌ carol = 3
‣ karlos ⬌ carol = undef
• Levenshtein Distance
‣ counts all but transpositions
‣ karlo ⬌ carol = 3
‣ karlos ⬌ carol = 4
• Damerau-Levenshtein D.
‣ also allows transpositions
‣ has quadratic complexity
‣ karlo ⬌ carol = 2
‣ karlos ⬌ carol = 3
• Jaro Distance (a, b)
‣ calculates (m/|a| + m/|b| + (m-t)/m) / 3
‣ 1. m, the # of matching characters
‣ 2. t, the # of transpositions
‣ window: !max(|a|, |b|) / 2" - 1 chars
‣ karlos ⬌ carol = 0.74
‣ (4 ÷ 6 + 4 ÷ 5 + 3 ÷ 4) ÷ 3
• Jaro-Winkler Distance
‣ adds a bonus for matching prefixes
82
[0,1] range
typos!
nltk.metrics.distance
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Gazetteer/Dictionary
Matching
• Finding all tokens that match the entries
‣ hash table lookups: constant complexity - O(1)
• Exact, single token matches
‣ regular hash table lookup (e.g., MURMUR3)
• Exact, multiple tokens
‣ prefix tree of hash tables pointing to child tables or leafs (=matches)
• Approximate, single tokens
‣ use some string metric - but do not compare all n-to-m cases…
• Approximate, multiple tokens
‣ a prefix tree of whatever we will use for single tokens…
83
L L
L L L
LSH (next)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 1/2
• A hashing approach to group near neighbors.
• Map similar items into the same [hash] buckets.
• LSH “maximizes” (instead of minimizing) hash collisions.
• It is another dimensionality reduction technique.
• For documents or words, minhashing can be used.
‣ Approach from Rajaraman & Ullman, Mining of Massive Datasets, 2010
• http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
84
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 2/2
85
M Vogiatzis. micvog.com 2013.
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 2/2
86
M Vogiatzis. micvog.com 2013.
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Minhash Signatures (1/2)
87
shingleorn-gramID
Create n-gram x token or
k-shingle x document matrix
(likely very sparse!)
Lemma: Two texts will have the same first “true” shingle/ngram when looking
from top to bottom with a probability equal to their Jaccard (Set) Similarity.
!
Idea: Create sufficient permutations of the row (shingle/n-gram) ordering so
that the Jaccard Similarity can be approximated by comparing the number of
coinciding vs. differing rows.
T T T T h h
0 T F F T 1 1
1 F F T F 2 4
2 F T F T 3 2
3 T F T T 4 0
4 F F T F 0 3
a family of hash functions
h1(x) = (x+1)%n
!
h2(x) = (3x+1)%n
!
n=5
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 2/2
88
M Vogiatzis. micvog.com 2013.
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Minhash Signatures (2/2)
89
shingleorn-gramID
S T T T T h h
0 T F F T 1 1
1 F F T F 2 4
2 F T F T 3 2
3 T F T T 4 0
4 F F T F 0 3
h1(x) = (x+1)%n
!
h2(x) = (3x+1)%n
!
n=5
M T T T T
h ∞ ∞ ∞ ∞
h ∞ ∞ ∞ ∞
init matrix M = ∞
for each shingle c,
hash h, and
text t:
if S[t,c] and
M[t,h] > h(c):
M[t,h] = h(c)
perfectly map-reduce-able and embarrassingly parallel!
c=0 T T T T
h ∞ ∞ ∞ ∞
h ∞ ∞ ∞ ∞
c=1 T T T T
h 1 ∞ ∞ 1
h 1 ∞ ∞ 1
c=2 T T T T
h 1 ∞ 2 1
h 1 ∞ 4 1
c=3 T T T T
h 1 3 2 1
h 1 2 4 1
c=4 T T T T
h 1 3 2 1
h 0 2 0 0
M T T T T
h 1 3 0 1
h 0 2 0 0
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 2/2
90
M Vogiatzis. micvog.com 2013.
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Banded Locality Sensitive
Minhashing
91
M T T T T
h ∞ ∞ ∞ ∞
h ∞ ∞ ∞ ∞
c=0 T T T T
h ∞ ∞ ∞ ∞
h ∞ ∞ ∞ ∞
c=1 T T T T
h 1 ∞ ∞ 1
h 1 ∞ ∞ 1
T T T T T T T T …
h 1 0 2 1 7 1 4 5
h 1 2 4 1 6 5 5 6
h 0 5 6 0 6 4 7 9
h 4 0 8 8 7 6 5 7
h 7 7 0 8 3 8 7 3
h 8 9 0 7 2 4 8 2
h 8 5 4 0 9 8 4 7
h 9 4 3 9 0 8 3 9
h 8 5 8 0 0 6 8 0
…
Bands
1
2
3
…
Bands b ∝ pagreement
Hashes/Band r ∝ 1/pagreement
pagreement = 1 - (1 - sr
)b
s = Jaccard(A,B)
(in at least one band)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Practical: Faster Spelling
Correction with LSH
92
• Using the spelling correction tutorial from yesterday and the provided
banded, minhashing-based LSH implementation, develop a spelling
corrector that is nearly as good as Peter Norvig’s, but at least twice
as fast (3x is possible!).
• Tip 1: if you are using NLTK 2.0.x (you probably are!), you might
want to copy-paste the 3.0.x sources for the
nltk.metrics.distance.edit_distance function
• The idea is to play around with the LSH parameters (threshold, size),
the parameter K for K-shingling, and the post-processing of the
matched set to pick the best solution from this “<< n subspace” (see
LSH 2/2 slide).
• Tip 2/remark: without any post-processing of matched sets, you can
achieve about 30% accuracy at a 100x speed up!
John Corvino, 2013
“There is a fine line between reading a message
from the text and reading one into the text.”

Contenu connexe

Tendances

IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationSeonghyun Kim
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language ProcessingVsevolod Dyomkin
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
From NLP to text mining
From NLP to text mining From NLP to text mining
From NLP to text mining Yi-Shin Chen
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法台灣資料科學年會
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
 
Named entity recognition (ner) with nltk
Named entity recognition (ner) with nltkNamed entity recognition (ner) with nltk
Named entity recognition (ner) with nltkJanu Jahnavi
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingIla Group
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing Rajnish Raj
 

Tendances (20)

IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
From NLP to text mining
From NLP to text mining From NLP to text mining
From NLP to text mining
 
[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法[系列活動] 文字探勘者的入門心法
[系列活動] 文字探勘者的入門心法
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...
 
Named entity recognition (ner) with nltk
Named entity recognition (ner) with nltkNamed entity recognition (ner) with nltk
Named entity recognition (ner) with nltk
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing
 

En vedette

Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewYONG ZHENG
 
Python + NoSQL in Animations
Python + NoSQL in AnimationsPython + NoSQL in Animations
Python + NoSQL in AnimationsShuen-Huei Guan
 
Tokenization: Life beyond the Information Age
Tokenization: Life beyond the Information AgeTokenization: Life beyond the Information Age
Tokenization: Life beyond the Information AgeShannon Code
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extractionguest0edcaf
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong KongSammy Fung
 
Token classification using Bengali Tokenizer
Token classification using Bengali TokenizerToken classification using Bengali Tokenizer
Token classification using Bengali TokenizerJeet Das
 
Text Similarities - PG Pushpin
Text Similarities - PG PushpinText Similarities - PG Pushpin
Text Similarities - PG Pushpinjsurve
 
Disrupting loyalty solutions
Disrupting loyalty solutionsDisrupting loyalty solutions
Disrupting loyalty solutionsNeeraj Sanghvi
 
Introduction to Tokenization
Introduction to TokenizationIntroduction to Tokenization
Introduction to TokenizationNabeel Yoosuf
 
Active directory гэж юу вэ?
Active directory  гэж юу вэ?Active directory  гэж юу вэ?
Active directory гэж юу вэ?Ochiroo Dorj
 
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAM
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAMGREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAM
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAMKrishna Chaitanya
 
What is Payment Tokenization?
What is Payment Tokenization?What is Payment Tokenization?
What is Payment Tokenization?Rambus Inc
 
Fun Facts about Big Data
Fun Facts about Big DataFun Facts about Big Data
Fun Facts about Big DataCrayon Data
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsYONG ZHENG
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar itemsViet-Trung TRAN
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 

En vedette (20)

Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick View
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
Tokenization
TokenizationTokenization
Tokenization
 
10-1 Vocab of Terms
10-1 Vocab of Terms10-1 Vocab of Terms
10-1 Vocab of Terms
 
Python + NoSQL in Animations
Python + NoSQL in AnimationsPython + NoSQL in Animations
Python + NoSQL in Animations
 
NLP
NLPNLP
NLP
 
Tokenization: Life beyond the Information Age
Tokenization: Life beyond the Information AgeTokenization: Life beyond the Information Age
Tokenization: Life beyond the Information Age
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong Kong
 
Token classification using Bengali Tokenizer
Token classification using Bengali TokenizerToken classification using Bengali Tokenizer
Token classification using Bengali Tokenizer
 
Text Similarities - PG Pushpin
Text Similarities - PG PushpinText Similarities - PG Pushpin
Text Similarities - PG Pushpin
 
Disrupting loyalty solutions
Disrupting loyalty solutionsDisrupting loyalty solutions
Disrupting loyalty solutions
 
Introduction to Tokenization
Introduction to TokenizationIntroduction to Tokenization
Introduction to Tokenization
 
Active directory гэж юу вэ?
Active directory  гэж юу вэ?Active directory  гэж юу вэ?
Active directory гэж юу вэ?
 
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAM
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAMGREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAM
GREG JAMES @ SUN MICROSYSTEMS, MANAGING A GLOBAL TEAM
 
What is Payment Tokenization?
What is Payment Tokenization?What is Payment Tokenization?
What is Payment Tokenization?
 
Fun Facts about Big Data
Fun Facts about Big DataFun Facts about Big Data
Fun Facts about Big Data
 
Tutorial: Context In Recommender Systems
Tutorial: Context In Recommender SystemsTutorial: Context In Recommender Systems
Tutorial: Context In Recommender Systems
 
3 - Finding similar items
3 - Finding similar items3 - Finding similar items
3 - Finding similar items
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 

Similaire à OUTDATED Text Mining 3/5: String Processing

hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With RJahnab Kumar Deka
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentKemal Can Kara
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3Nick Grattan
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler DesignKuppusamy P
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdfbeshahashenafe20
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Bradley Allen
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big DataSameer Wadkar
 
MEBI 591C/598 – Data and Text Mining in Biomedical Informatics
MEBI 591C/598 – Data and Text Mining in Biomedical InformaticsMEBI 591C/598 – Data and Text Mining in Biomedical Informatics
MEBI 591C/598 – Data and Text Mining in Biomedical Informaticsbutest
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptSamuelKetema1
 
Linguistic Passphrase Cracking
Linguistic Passphrase CrackingLinguistic Passphrase Cracking
Linguistic Passphrase CrackingPriyanka Aash
 
Content Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningContent Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningFindwise
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchNoemi Derzsy
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingSean Golliher
 

Similaire à OUTDATED Text Mining 3/5: String Processing (20)

Ir 03
Ir   03Ir   03
Ir 03
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitment
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler Design
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Text features
Text featuresText features
Text features
 
Search pitb
Search pitbSearch pitb
Search pitb
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Automated Abstracts and Big Data
Automated Abstracts and Big DataAutomated Abstracts and Big Data
Automated Abstracts and Big Data
 
MEBI 591C/598 – Data and Text Mining in Biomedical Informatics
MEBI 591C/598 – Data and Text Mining in Biomedical InformaticsMEBI 591C/598 – Data and Text Mining in Biomedical Informatics
MEBI 591C/598 – Data and Text Mining in Biomedical Informatics
 
Information retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.pptInformation retrieval chapter 2-Text Operations.ppt
Information retrieval chapter 2-Text Operations.ppt
 
Linguistic Passphrase Cracking
Linguistic Passphrase CrackingLinguistic Passphrase Cracking
Linguistic Passphrase Cracking
 
Content Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text MiningContent Processing Architecture and Applications - Introduction to Text Mining
Content Processing Architecture and Applications - Introduction to Text Mining
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document Parsing
 

Dernier

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 

Dernier (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

OUTDATED Text Mining 3/5: String Processing

  • 1. Text Mining 3 String Processing ! Madrid Summer School 2014 Advanced Statistics and Data Mining ! Florian Leitner florian.leitner@upm.es License:
  • 2. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Incentive and Applications • Today’s goals: ‣ converting strings/text into “processable units” (features for Machine Learning) ‣ detecting patterns and comparing strings (string similarity and matching) ! • Feature generation for machine learning tasks ‣ Detecting a token’s grammatical use-case (noun, verb, adjective, …) ‣ Normalizing/regularizing tokens (“[he] was” and “[I] am” are forms of the verb “be”) ‣ Recognizing entities from a collection of strings (a “gazetteer” or dictionary) • Near[est] Neighbor-based text/string classification tasks • String searching (find, UNIX’ “grep”) • Pattern matching (locating e-mail addresses, telephone numbers, …) 59
  • 3. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 60 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 4. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 61 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 5. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Text Extraction from Documents • Mostly a commercial business • “an engineering problem, not a scientific challenge” • PDF extraction tools ‣ xpdf - “the” PDF extraction library • http://www.foolabs.com/xpdf/
 (available via pdfto*!on POSIX systems) ‣ PDFMiner - a Python 2.x (!) library • https://euske.github.io/pdfminer/ ‣ LA-PDF-Text - layout aware PDF extraction from scientific articles (Java) • https://github.com/BMKEG/lapdftextProject ‣ Apache PDFBox - the Java toolkit for working with PDF documents • http://pdfbox.apache.org/ • Optical Character Recognition (OCR) libraries ‣ Tesseract - Google’s OCR engine (C++) • https://code.google.com/p/tesseract-ocr/ • https://code.google.com/p/pytesser/
 (a Python 2.x (!) wrapper for the Tesseract engine) ‣ OCRopus - another Open Source OCR engine (Python) • https://code.google.com/p/ocropus/ • Generic text extraction (Office documents, HTML, Mobi, ePub, …) ‣ Apache Tika “content analysis toolkit” (Language Detection!) • http://tika.apache.org/ 62
  • 6. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 63 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 7. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Lexers and Tokenization • Lexer: converts (“tokenizes”) strings into token sequences ‣ a.k.a.: Tokenizer, Scanner • Tokenization strategies ‣ whitespace (“s+”) [“PhD.”, “U.S.A.”] ‣ word (letter vs. non-letter) boundary (“b”) [“PhD”, “.”, “U”, “.”, “S”, “.”, “A”, “.”] ‣ Unicode category-based ([“C”, “3”, “P”, “0”], [“Ph”, “D”, “.”, “U”, “.”, “S”, “.” …]) ‣ linguistic ([“that”, “ʼs”]; [“do”, “nʼt”]; [“adjective-like”]; [“PhD”, “.”, “U.S.A”, “.”]) ‣ whitespace/newline-preserving ([“This”, “ ”, “is”, “ ”, …]) or not • Task-oriented choice: document classification, entity detection, linguistic analysis, … 64 This is a sentence . This is a sentence.
  • 8. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining String Matching with Regular Expressions ‣ /^[^@]{1,64}@w(?:[w.-]{0,254}w)?.w{2,}$/ • (provided w has Unicode support) • http://ex-parrot.com/~pdw/Mail-RFC822-Address.html ! ‣ Wild card: . (a dot → matches any character) ‣ Groupings: [A-Z0-9]; Negations: [^a-z]; Captures (save_this) ‣ Markers: ^ “start of string/line”; $ “end of string/line”; [b “word boundary”] ‣ Pattern repetitions: * “zero or more”; *? “zero or more, non-greedy”;
 + “one or more”; +? “one or more, non-greedy”; ? “zero or one” ‣ Example: b([a-z]+)b.+$ captures lower-case words except at the EOS 65
  • 9. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining String Matching with Finite State Automata 1/2 • Matching single words or patterns: ‣ Regular Expressions are “translated” to finite-state automata ‣ Automata can be deterministic (DFA, O(n)) or non-deterministic (NFA, O(nm)) • where n is the length of the string being scanned and m is the length of the string being matched ‣ Programming languages’ default implementation are (slower) NFAs because certain special expressions (e.g., “look-ahead” and “look-behind”) cannot be implemented with a DFA ‣ DFA implementations “in the wild” • RE2 by Google: https://code.google.com/p/re2/
 [C++, and the default RE engine of Golang,
 with wrappers for a few other languages] • dk.brics.automaton: http://www.brics.dk/automaton/
 [Java, but very memory “hungry”] 66 1 2 3 a b b (ab)+ 1 2 a b
  • 10. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining String Matching with Finite State Automata 2/2 Dictionaries: Exact string matching of
 word collections Compressed Prefix Trees (Tries):
 PATRICIA Trie Prefixes → autocompletion functionality Minimal (compressed) DFA
 (a.k.a. MADFA, DAWG or DAFSA) a compressed dictionary Python: pip install DAWG
 https://github.com/kmike/DAWG
 [“kmike” being a NLTK dev.] Daciuk et al. 2000 67 top tops tap taps Trie MADFA PATRICIA Trie Images Source: WikiMedia Commons, Saffles & Chkno
  • 11. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 68 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 12. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Sentence Segmentation • Sentences are the fundamental linguistic unit ‣ Sentence: boundary or “constraint” for linguistic phenomena ‣ collocations [“United Kingdom”, “vice president”], idioms [“drop me a line”], phrases [PP: “of great fame”] and clauses, statements, … • Rule/pattern-based segmentation ‣ Segment sentences if the marker is followed by an upper-case letter • /[.!?]s+[A-Z]/ ‣ Special cases: abbreviations, digits, lower-case proper nouns (genes, “amnesty international”, …), hyphens, quotation marks, … • Supervised sentence boundary detection ‣ Use some Markov model or a conditional random field to identify possible sentence segmentation tokens ‣ Requires labeled examples (segmented sentences) 69 turns out letter case is a poor manʼs rule! maintaining a proper rule set gets messy fast
  • 13. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Punkt Sentence Tokenizer (PST) 1/2 • Unsupervised Multilingual Sentence Boundary Detection • P(●|w-1) > ccpc • Determines if a marker ● is used as an abbreviation marker by comparing the conditional probability that the word w-1 before ● is followed by the marker against some (high) cutoff probability. ‣ P(●|w-1) = P(w-1, ●) ÷ P(w-1) ‣ K&S set c = 0.99 • P(w+1|w-1) > P(w+1) • Evaluates the likelihood that w-1 and w+1 surrounding the marker ● are more commonly collocated than would be expected by chance: ● is assumed an abbreviation marker (“not independent”) if the LHS is greater than the RHS. • Flength(w)×Fmarkers(w)×Fpenalty(w) ≥ cabbr • Evaluates if any of w’s morphology (length of w w/o marker characters, number of periods inside w (e.g., [“U.S.A”]), penalized when w is not followed by a ●) makes it more likely that w is an abbreviation against some (low) cutoff. • Fortho(w); Psstarter(w+1|●); … • Orthography: lower-, upper-case or capitalized word after a probable ● or not • Sentence Starter: Probability that w is found after a ● 70 Dr. Mrs. Watson U.S.A. . Therefore
  • 14. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Punkt Sentence Tokenizer (PST) 2/2 • Unsupervised Multilingual Sentence Boundary Detection ‣ Kiss & Strunk, MIT Press 2006. ‣ Available from NLTK: nltk.tokenize.punkt (http:// www.nltk.org/api/nltk.tokenize.html) • PST is language agnostic ‣ Requires that the language uses the sentence segmentation marker as an abbreviation marker ‣ Otherwise, the problem PST solves is not present ! ! • PST factors in word length ‣ Abbreviations are relatively shorter than regular words • PST takes “internal” markers into account ‣ E.g., “U.S.A” • Main weakness: long lists of abbreviations ‣ E.g., author lists in citations ‣ Can be fixed with a pattern-based post- processing strategy • NB: a marker must be present ‣ E.g., chats or fora 71
  • 15. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 72 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 16. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining The Parts of Speech • The Parts of Speech: ‣ noun: NN, verb: VB, adjective: JJ, adverb: RB, preposition: IN,
 personal pronoun: PRP, … ‣ e.g. the full Penn Treebank PoS tagset contains 48 tags: ‣ 34 grammatical tags (i.e., “real” parts-of-speech) for words ‣ one for cardinal numbers (“CD”; i.e., a series of digits) ‣ 13 for [mathematical] “SYM” and currency “$” symbols, various types of punctuation, as well as for opening/closing parenthesis and quotes 73 I ate the pizza with green peppers . PRP VB DT NN IN JJ NN . “PoS tags”
  • 17. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining The Parts of Speech • Automatic PoS Tagging ➡ Supervised Machine Learning ‣ Maximum Entropy Markov Models ‣ Conditional Random Fields ‣ Ensemble Methods 74 I ate the pizza with green peppers . PRP VB DT NN IN JJ NN . will be explained in the Text Mining #5! (last lesson)
  • 18. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 75 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 19. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Linguistic Morphology • [Verbal] Inflections ‣ conjugation (Indo-European languages) ‣ tense (“availability” and use varies across languages) ‣ modality (subjunctive, conditional, imperative) ‣ voice (active/passive) ‣ … ! • Contractions ‣ don’t say you’re in a man’s world…
 • Declensions • on nouns, pronouns, adjectives, determiners ‣ case (nominative, accusative, dative, ablative, genitive, …) ‣ gender (female, male, neuter) ‣ number forms (singular, plural, dual) ‣ possessive pronouns (I➞my, you➞your, she➞her, it➞its, … car) ‣ reflexive pronouns (for myself, yourself, …) ‣ … 76 not a contraction: possessive s ➡ token normalization
  • 20. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Stemming vs Lemmatization • Stemming ‣ produced by “stemmers” ‣ produces a word’s “stem” ! ‣ am ➞ am ‣ the going ➞ the go ‣ having ➞ hav ! ‣ fast and simple (pattern-based) ‣ Snowball; Lovins; Porter • nltk.stem.*
 • Lemmatization ‣ produced by “lemmatizers” ‣ produces a word’s “lemma” ! ‣ am ➞ be ‣ the going ➞ the going ‣ having ➞ have ! ‣ requires: a dictionary and PoS ‣ LemmaGen; morpha • nltk.stem.wordnet 77 ➡ token normalization a.k.a. token “regularization” (although that is technically the wrong wording)
  • 21. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining 78 Text/Document Extraction Tokenization: Lexer/Scanner Sentence Segmentation Part-of-Speech Tagging Stemming/Lemmatization String Metrics & Matching
  • 22. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining • Finding similarly spelled or misspelled words • Finding “similar” texts/documents ‣ N.B.: no semantics (yet…) • Detecting entries in a gazetteer ‣ a list of domain-specific words • Detecting entities in a dictionary ‣ a list of words, each mapped to some URI ‣ N.B.: “entities”, not “concepts” (no semantics…) String Matching 79
  • 23. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Jaccard Similarity 80 “words” “woody” w o r s d y J(A, B) = |A B| |A [ B| = 3 7 = 0.43 vs o 6 o.5
  • 24. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining String Similarity • Edit Distance Measures ‣ Hamming Distance [1950] ‣ Levenshtein-Damerau Distance [1964/5] ‣ Needelman-Wunsch [1970] and
 Smith-Waterman [1981] Distance • also align the strings (“sequence alignment”) • use dynamic programming • Modern approaches: BLAST [1990], BWT [1994] • Other Distance Metrics ‣ Jaro-Winkler Distance [1989/90] • coefficient of matching characters within a dynamic window minus transpositions ‣ Soundex (➞ homophones; spelling!) ‣ … • Basic Operations: “indels” ‣ Insertions • ac ➞ abc ‣ Deletions • abc ➞ ac ! • “Advanced” Operations ‣ Require two “indels” ‣ Substitutions • abc ➞ aBc ‣ Transpositions • ab ➞ ba 81 Wikipedia is gold here! (for once…)
  • 25. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Distance Measures • Hamming Distance ‣ only counts substitutions ‣ requires equal string lengths ‣ karolin ⬌ kathrin = 3 ‣ karlo ⬌ carol = 3 ‣ karlos ⬌ carol = undef • Levenshtein Distance ‣ counts all but transpositions ‣ karlo ⬌ carol = 3 ‣ karlos ⬌ carol = 4 • Damerau-Levenshtein D. ‣ also allows transpositions ‣ has quadratic complexity ‣ karlo ⬌ carol = 2 ‣ karlos ⬌ carol = 3 • Jaro Distance (a, b) ‣ calculates (m/|a| + m/|b| + (m-t)/m) / 3 ‣ 1. m, the # of matching characters ‣ 2. t, the # of transpositions ‣ window: !max(|a|, |b|) / 2" - 1 chars ‣ karlos ⬌ carol = 0.74 ‣ (4 ÷ 6 + 4 ÷ 5 + 3 ÷ 4) ÷ 3 • Jaro-Winkler Distance ‣ adds a bonus for matching prefixes 82 [0,1] range typos! nltk.metrics.distance
  • 26. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Gazetteer/Dictionary Matching • Finding all tokens that match the entries ‣ hash table lookups: constant complexity - O(1) • Exact, single token matches ‣ regular hash table lookup (e.g., MURMUR3) • Exact, multiple tokens ‣ prefix tree of hash tables pointing to child tables or leafs (=matches) • Approximate, single tokens ‣ use some string metric - but do not compare all n-to-m cases… • Approximate, multiple tokens ‣ a prefix tree of whatever we will use for single tokens… 83 L L L L L LSH (next)
  • 27. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Locality Sensitive Hashing (LSH) 1/2 • A hashing approach to group near neighbors. • Map similar items into the same [hash] buckets. • LSH “maximizes” (instead of minimizing) hash collisions. • It is another dimensionality reduction technique. • For documents or words, minhashing can be used. ‣ Approach from Rajaraman & Ullman, Mining of Massive Datasets, 2010 • http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf 84
  • 28. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Locality Sensitive Hashing (LSH) 2/2 85 M Vogiatzis. micvog.com 2013.
  • 29. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Locality Sensitive Hashing (LSH) 2/2 86 M Vogiatzis. micvog.com 2013.
  • 30. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Minhash Signatures (1/2) 87 shingleorn-gramID Create n-gram x token or k-shingle x document matrix (likely very sparse!) Lemma: Two texts will have the same first “true” shingle/ngram when looking from top to bottom with a probability equal to their Jaccard (Set) Similarity. ! Idea: Create sufficient permutations of the row (shingle/n-gram) ordering so that the Jaccard Similarity can be approximated by comparing the number of coinciding vs. differing rows. T T T T h h 0 T F F T 1 1 1 F F T F 2 4 2 F T F T 3 2 3 T F T T 4 0 4 F F T F 0 3 a family of hash functions h1(x) = (x+1)%n ! h2(x) = (3x+1)%n ! n=5
  • 31. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Locality Sensitive Hashing (LSH) 2/2 88 M Vogiatzis. micvog.com 2013.
  • 32. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Minhash Signatures (2/2) 89 shingleorn-gramID S T T T T h h 0 T F F T 1 1 1 F F T F 2 4 2 F T F T 3 2 3 T F T T 4 0 4 F F T F 0 3 h1(x) = (x+1)%n ! h2(x) = (3x+1)%n ! n=5 M T T T T h ∞ ∞ ∞ ∞ h ∞ ∞ ∞ ∞ init matrix M = ∞ for each shingle c, hash h, and text t: if S[t,c] and M[t,h] > h(c): M[t,h] = h(c) perfectly map-reduce-able and embarrassingly parallel! c=0 T T T T h ∞ ∞ ∞ ∞ h ∞ ∞ ∞ ∞ c=1 T T T T h 1 ∞ ∞ 1 h 1 ∞ ∞ 1 c=2 T T T T h 1 ∞ 2 1 h 1 ∞ 4 1 c=3 T T T T h 1 3 2 1 h 1 2 4 1 c=4 T T T T h 1 3 2 1 h 0 2 0 0 M T T T T h 1 3 0 1 h 0 2 0 0
  • 33. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Locality Sensitive Hashing (LSH) 2/2 90 M Vogiatzis. micvog.com 2013.
  • 34. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Banded Locality Sensitive Minhashing 91 M T T T T h ∞ ∞ ∞ ∞ h ∞ ∞ ∞ ∞ c=0 T T T T h ∞ ∞ ∞ ∞ h ∞ ∞ ∞ ∞ c=1 T T T T h 1 ∞ ∞ 1 h 1 ∞ ∞ 1 T T T T T T T T … h 1 0 2 1 7 1 4 5 h 1 2 4 1 6 5 5 6 h 0 5 6 0 6 4 7 9 h 4 0 8 8 7 6 5 7 h 7 7 0 8 3 8 7 3 h 8 9 0 7 2 4 8 2 h 8 5 4 0 9 8 4 7 h 9 4 3 9 0 8 3 9 h 8 5 8 0 0 6 8 0 … Bands 1 2 3 … Bands b ∝ pagreement Hashes/Band r ∝ 1/pagreement pagreement = 1 - (1 - sr )b s = Jaccard(A,B) (in at least one band)
  • 35. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining Practical: Faster Spelling Correction with LSH 92 • Using the spelling correction tutorial from yesterday and the provided banded, minhashing-based LSH implementation, develop a spelling corrector that is nearly as good as Peter Norvig’s, but at least twice as fast (3x is possible!). • Tip 1: if you are using NLTK 2.0.x (you probably are!), you might want to copy-paste the 3.0.x sources for the nltk.metrics.distance.edit_distance function • The idea is to play around with the LSH parameters (threshold, size), the parameter K for K-shingling, and the post-processing of the matched set to pick the best solution from this “<< n subspace” (see LSH 2/2 slide). • Tip 2/remark: without any post-processing of matched sets, you can achieve about 30% accuracy at a 100x speed up!
  • 36. John Corvino, 2013 “There is a fine line between reading a message from the text and reading one into the text.”