Overview of text mining and NLP (+software)
1. Text mining and natural language processing
Florian Leitner
Technical University of Madrid (UPM), Spain
Tyba
Madrid, ES, 12th of June, 2015
License:
2. Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”
Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”
IBM’s bet on the future: data streams, mainframes & AI
CRUSH, “Criminal Reduction Utilizing Statistical History” (IBM, reality): “predict crimes before they happen”
vs. Precogs (Minority Report, movie): if? when?
Cognitive computing: “processing information more like a human than a machine”
3. Examples of text mining and natural language processing applications
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
(The applications above span a spectrum from text mining to language processing.)
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
5. Document and text classification/clustering
Supervised (“learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“exploratory grouping”, e.g., topic modeling)
[Figure: documents plotted against the 1st and 2nd principal components, showing document distances, clusters, and cluster centroids]
LIBSVM
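A minimal sketch (my own, not from the slides) contrasting the two modes on toy texts: supervised classification learns from labeled examples, while unsupervised clustering groups texts without labels. It assumes scikit-learn is available; its `SVC` is backed by the LIBSVM library mentioned above.

```python
# Supervised vs. unsupervised on four tiny documents (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.cluster import KMeans

docs = ["win cash now", "cheap pills online", "meeting at noon", "lunch tomorrow ?"]
labels = [1, 1, 0, 0]                        # 1 = spam, 0 = ham (given in advance)

X = CountVectorizer().fit_transform(docs)    # texts -> sparse count vectors

clf = SVC(kernel="linear").fit(X, labels)    # supervised: learn to classify from examples
km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: exploratory grouping
print(clf.predict(X), km.labels_)
```

The same vectorized matrix `X` feeds both algorithms; only the presence of `labels` distinguishes the two settings.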
7. Words, Tokens, and N-Grams/Shingles
“Tokenization”: This is a sentence. → This | is | a | sentence | .
Splitting can be character-based, use regular expressions, be probabilistic, …
Token n-grams:
unigrams (shingles): This | is | a | sentence | .
bigrams (2-shingles): This is | is a | a sentence | sentence .
trigrams (3-shingles): This is a | is a sentence | a sentence .
Character n-grams (“k-shingling”), e.g. all trigrams of the word “sentence”:
[sen, ent, nte, ten, enc, nce]
NB: the terms “shingle”, “token”, and “n-gram” are not used consistently…
but “n-gram” and “token” are far more common!
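Both token n-grams and character k-shingles fall out of one tiny helper (my own illustration, not from the slides), since a string is just a sequence of characters:

```python
# N-grams over any sequence: tokens give token n-grams, a string gives
# character k-shingles.
def ngrams(seq, n):
    """All contiguous subsequences of length n ("n-grams"/"n-shingles")."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

tokens = "This is a sentence .".split()
print(ngrams(tokens, 2))  # bigrams: (This, is), (is, a), (a, sentence), (sentence, .)

# all character trigrams of the word "sentence":
print(["".join(g) for g in ngrams("sentence", 3)])  # → ['sen', 'ent', 'nte', 'ten', 'enc', 'nce']
```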
8. Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)
Linguistic annotations of tokens (used to train automated classifiers).

Token        Lemma        PoS  NER
Constitutive constitutive JJ   O
binding      binding      NN   O
to           to           TO   O
the          the          DT   O
peri-kappa   peri-kappa   NN   B-DNA
B            B            NN   I-DNA
site         site         NN   I-DNA
is           be           VBZ  O
seen         see          VBN  O
in           in           IN   O
monocytes    monocyte     NNS  B-cell
.            .            .    O

Penn Treebank: the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
B-I-O (Begin-Inside-Outside) chunk encoding: B marks the token that begins a chunk of relevant tokens, I the tokens inside it, O everything outside.
Common alternatives: I-O, I-E-O, B-I-E-W-O (E = end token, W = single-word chunk).
Stanford CoreNLP, FACTORIE, FreeLing, and many more…
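A short sketch (my own, not from the slides) of what the B-I-O encoding buys: the tag sequence from the table above can be decoded back into typed chunks with a few lines:

```python
# Decode B-I-O tags into (chunk_type, tokens) pairs.
def bio_chunks(tokens, tags):
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # "O" sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O":
            if ctype is not None:               # close the currently open chunk
                chunks.append((ctype, tokens[start:i]))
                ctype = None
            if tag.startswith("B-"):            # open a new chunk
                start, ctype = i, tag[2:]
        # an "I-" tag simply continues the open chunk
    return chunks

tokens = "Constitutive binding to the peri-kappa B site is seen in monocytes .".split()
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell", "O"]
print(bio_chunks(tokens, tags))
# → [('DNA', ['peri-kappa', 'B', 'site']), ('cell', ['monocytes'])]
```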
9. Word vectors and inverted indices
Text vectorization: the inverted index. Each token/word is a dimension!
Comparing text vectors, e.g., cosine similarity: Similarity(T1, T2) := cos(T1, T2)
[Figure: Text1 and Text2 as count vectors over Word1, Word2, Word3; the angles α, β, γ between the vectors determine their similarity]

Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

“Search engine basics” — INDRI
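The slide's idea fits in a few lines of plain Python (a sketch of mine; tokens lower-cased and "wills" lemmatized to "will", as in the table above):

```python
# Count-vectorize the two example texts and compute their cosine similarity.
from collections import Counter
from math import sqrt

t1 = "he that not will to the end neither will to the means".split()
t2 = "if the mountain will not go to moses then moses must go to the mountain".split()

v1, v2 = Counter(t1), Counter(t2)          # each token/word is a dimension
dot = sum(v1[w] * v2[w] for w in set(v1) | set(v2))
norm = lambda v: sqrt(sum(c * c for c in v.values()))
cos_sim = dot / (norm(v1) * norm(v2))
print(round(cos_sim, 3))                   # → 0.519
```

Identical texts give 1.0, texts sharing no tokens give 0.0; these two overlap only on "not", "to", "the", and "will".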
11. Inverted indices and the central dogma of machine learning
y = h_θ(X)
X: the inverted index (transposed), with n “texts” as rows (instances, observations) and p n-grams as columns (variables, features).
θ: the model parameters, one per feature; “nonparametric” models have parameters per instance.
y: the prediction: a rank, class, expectation, probability, descriptor*, …
(Hyperparameters are settings that control the learning algorithm.)
12. The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, outside) than to other instances.
⟹ Remedy: (feature) dimensionality reduction
The “blessing of non-uniformity.”
• Feature extraction (compression): PCA/LSA (projection), factor analysis (regression), auto-encoders & deep learning (compression & embedding), …
• Feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
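One of the feature-extraction remedies listed above can be sketched in a few lines (assuming scikit-learn, which is my choice here, not the slides'): truncated SVD projects a sparse, high-dimensional count matrix onto a handful of latent dimensions, i.e. latent semantic analysis (LSA).

```python
# Compress p token-count features down to 2 latent dimensions via truncated SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats purr", "dogs bark", "cats and dogs",
        "stock markets fell", "markets rallied"]
X = CountVectorizer().fit_transform(docs)               # n=5 texts x p=9 features, sparse
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X.shape, "->", Z.shape)                           # (5, 9) -> (5, 2)
```

Unlike PCA, truncated SVD works directly on the sparse matrix without densifying it, which matters precisely because p ≫ n.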
14. Google’s review summaries: opinion mining (“sentiment” analysis)
Don’t do it, please… ;-) (If you must: see the document and text classification software.)
15. Polarity of sentiment keywords in IMDB
[Figure: polarity of sentiment keywords, e.g. “not good”]
Christopher Potts. On the negativity of negation. 2011
16. Language understanding: parsing and semantic analysis
Disambiguation at every step: coreference (anaphora) resolution, named entity recognition, and entity grounding.
L. Tesnière, N. Chomsky
Apple Siri
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
17. Automatic text summarization… is very difficult because:
• Variance/human agreement: when is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: the Dow Jones index rose 10 points → the economy is thriving.
• Anaphora (coreference) resolution: very hard, but crucial.
Image source: www.lexalytics.com
Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…)
18. Machine translation: deep learning with auto-encoders
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das).
‣ have different verb placements (es ↔ de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en ↔ de).
‣ have different word orders (Latin, Arabic, CJK).
DL4J
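The slide's core idea, that auto-encoders learn a compressed representation by reconstructing their input, can be shown with the classic 8-3-8 encoder problem in plain NumPy. This is my own toy, nothing like a real translation model: eight one-hot "words" must be squeezed through a 3-dimensional code.

```python
# A tiny 8-3-8 auto-encoder trained with plain batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.eye(8)                              # eight one-hot "words"
W1 = rng.normal(0.0, 0.5, (8, 3))          # encoder weights
W2 = rng.normal(0.0, 0.5, (3, 8))          # decoder weights
b1, b2 = np.zeros(3), np.zeros(8)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    H = sig(X @ W1 + b1)                   # the 3-dimensional code
    Y = sig(H @ W2 + b2)                   # reconstruction of the input
    dY = (Y - X) * Y * (1 - Y)             # backprop of the squared error
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= 0.5 * H.T @ dY; b2 -= 0.5 * dY.sum(0)
    W1 -= 0.5 * X.T @ dH; b1 -= 0.5 * dH.sum(0)

print(round(((Y - X) ** 2).mean(), 4))     # reconstruction error after training
```

The hidden activations `H` are the learned embedding; deep translation models stack and pair such encoders across languages.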
19. Question answering: the champions league of TM & NLP (cognitive computing)
Biggest issue: statistical inference.
All men are mortal. Socrates probably is a man… …Therefore, Socrates might be mortal.
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
IBM Watson, WolframAlpha
20. Information extraction: knowledge mining for molecular biology
[Figure: a text mining pipeline. Articles on biological models are selected by article classification; named entity recognition finds mentions; entity mapping (grounding) links them to identifiers, e.g. Cdk5 → rat (TaxID 10116), UniProt Q03114; relationship extraction yields binary interactions and experimental methods; the resulting entity associations and relationship annotations feed biological repositories; ontologies & thesauri, the WWW, and short factoid question answering support the pipeline.]
MITIE, OpenDMAP, ClearTK
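A toy sketch, entirely my own and far simpler than the pipeline above, of two of its steps: dictionary-based NER with grounding, then relationship extraction by pairing co-occurring entities around an interaction verb. The Cdk5 identifier comes from the slide; the p35 entry and the verb list are illustrative placeholders.

```python
# Dictionary NER + verb-triggered binary relationship extraction (toy).
import re

GROUNDING = {"Cdk5": "UniProt:Q03114 (rat, TaxID 10116)",  # from the slide
             "p35": "UniProt:??? (placeholder)"}           # illustrative only

def extract(sentence):
    # NER: match known entity names as whole words, in dictionary order.
    found = [e for e in GROUNDING if re.search(rf"\b{re.escape(e)}\b", sentence)]
    # Relationship extraction: pair entities only if an interaction verb occurs.
    verb = re.search(r"\b(binds?|activates?|phosphorylates?)\b", sentence)
    return [(found[i], found[j]) for i in range(len(found))
            for j in range(i + 1, len(found))] if verb else []

print(extract("Cdk5 binds p35 in rat neurons."))   # → [('Cdk5', 'p35')]
print(extract("Cdk5 and p35 were measured."))      # → [] (no interaction verb)
```

Real systems replace both steps with trained classifiers, but the division of labor (recognize, ground, relate) is the same.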
21. Text mining and language processing is all about resolving ambiguities.
Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech tagging: The robot wheels out the iron.
Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
Entity recognition & grounding: Is Princeton really good for you?