Text mining and natural language processing
Florian Leitner

Technical University of Madrid (UPM), Spain

Tyba

Madrid, ES, 12th of June, 2015
License:

Is language understanding & generation
key to artificial intelligence?
• “Her” (Samantha) Movie, 2013
• “The Singularity: ~2030”
Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”
IBM’s bet on the future: Datastreams, Mainframes & AI
“predict crimes before they happen”
Criminal Reduction Utilizing Statistical History (IBM, reality)
Precogs (Minority Report, movie)
if? when?
cognitive computing:
“processing information more like a human than a machine”

Examples of text mining and
natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
[Axis labels: Text Mining, Language Processing]
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)

Concepts & Terminology

Document and text classification/clustering
[Figure: documents plotted along the 1st and 2nd principal components; labels: document, distance, centroid, cluster]
Supervised (“Learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“Exploratory grouping”, e.g., topic modeling)
LIBSVM
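
The supervised side of this contrast can be sketched with a nearest-centroid classifier, assuming documents have already been turned into count vectors. The names `centroid` and `nearest_centroid` and the toy vectors are illustrative assumptions, not taken from the slides, and LIBSVM itself is not used here.

```python
import math

def centroid(vectors):
    """Mean vector of a group of equally long document vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_centroid(vector, centroids):
    """Label of the centroid with the smallest Euclidean distance to `vector`."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))

# Toy training data (hypothetical): counts of the words ("free", "meeting").
labeled = {
    "spam": [[3, 0], [5, 1]],
    "ham": [[0, 2], [1, 4]],
}

# Supervised: the labels tell us which vectors belong together.
centroids = {label: centroid(vectors) for label, vectors in labeled.items()}
prediction = nearest_centroid([4, 0], centroids)
```

An unsupervised method such as k-means would have to discover the groups and their centroids without the labels.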

Words, Tokens, and N-Grams/Shingles
Text: This is a sentence.
Token N-Grams (each unit is a token or shingle):
shingles (unigrams): [This] [is] [a] [sentence] [.]
2-shingles (bigrams): [This is] [is a] [a sentence] [sentence .]
3-shingles (trigrams): [This is a] [is a sentence] [a sentence .]
Character N-Grams (“k-shingling”), e.g. all trigrams of the word “sentence”:
[sen, ent, nte, ten, enc, nce]
NB: “tokenization”
Splitting: Character-based, Regular Expressions, Probabilistic, …
Snag: the terms “shingle”, “token” and “n-gram” are not used consistently…
but “n-gram” and “token” are far more common!
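
A minimal sketch of regular-expression tokenization followed by token and character n-grams. The helper names `tokenize` and `ngrams` and the particular regular expression are assumptions chosen for illustration; real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Regex tokenizer: runs of word characters, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(items, n):
    """All contiguous n-grams (n-shingles) of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = tokenize("This is a sentence.")
bigrams = ngrams(tokens, 2)    # token 2-shingles
trigrams = ngrams(tokens, 3)   # token 3-shingles

# Character n-grams: the same function applied to the characters of one word.
char_trigrams = ["".join(gram) for gram in ngrams("sentence", 3)]
```

The same `ngrams` function serves both cases because a string is just a sequence of characters.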

Lemmatization, Part-of-Speech (PoS) tagging, and
Named Entity Recognition (NER)

Token         Lemma         PoS   NER
Constitutive  constitutive  JJ    O
binding       binding       NN    O
to            to            TO    O
the           the           DT    O
peri-κ        peri-kappa    NN    B-DNA
B             B             NN    I-DNA
site          site          NN    I-DNA
is            be            VBZ   O
seen          see           VBN   O
in            in            IN    O
monocytes     monocyte      NNS   B-cell
.             .             .     O

Linguistic annotations of tokens (used to train automated classifiers).
PoS: de facto standard PoS tagset {NN, JJ, DT, VBZ, …}: Penn Treebank
NER: B-I-O (Begin-Inside-Outside) chunk encoding; a chunk is a run of (relevant) tokens
Common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token, W: (unigram) Word)
Stanford CoreNLP, FACTORIE, FreeLing, and many more…
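
To illustrate how B-I-O tags encode chunks, here is a small sketch that turns the NER column of the example sentence back into labeled chunks. The function name `bio_chunks` and its lenient handling of stray I- tags are assumptions made for this example, and the token is written as "peri-kappa" for simplicity.

```python
def bio_chunks(tokens, tags):
    """Collect (label, tokens) chunks from B-I-O tags; stray I- tags are ignored."""
    chunks = []
    current = None  # chunk being extended, or None while outside any chunk
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            current = (label, [token])
            chunks.append(current)
        elif prefix == "I" and current is not None and current[0] == label:
            current[1].append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            current = None
    return chunks

tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell", "O"]
chunks = bio_chunks(tokens, tags)
```

The B- prefix is what lets two adjacent chunks of the same type be told apart, which the simpler I-O encoding cannot do.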

Word vectors and inverted indices
[Figure: Text1 and Text2 as vectors over the axes count(Word1), count(Word2), count(Word3), with angles α, β, γ]
Comparing text vectors:
E.g., cosine similarity
Similarity(T1, T2) := cos(T1, T2)
Text vectorization:
Inverted index
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
neither   1       0
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

Each token/word is a dimension!
“Search engine basics”
INDRI
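
A sketch of the vectorization and comparison steps for the two example texts, with the count vectors built directly from the sentences. The tiny `LEMMAS` table is a hypothetical stand-in for a real lemmatizer (it only maps "wills" to "will"), tokens are lower-cased, and the function names are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a lemmatizer: just enough for the example texts.
LEMMAS = {"wills": "will"}

def vectorize(text):
    """Lower-cased, lemmatized token counts (a sparse count vector)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(LEMMAS.get(token, token) for token in tokens)

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

text1 = vectorize("He that not wills to the end neither wills to the means.")
text2 = vectorize("If the mountain will not go to Moses, then Moses must go to the mountain.")
similarity = cosine(text1, text2)

# Inverted index: token -> {text name: count}, i.e. one row per token.
index = {}
for name, vector in (("Text 1", text1), ("Text 2", text2)):
    for token, count in vector.items():
        index.setdefault(token, {})[name] = count
```

Only the four shared tokens (not, the, to, will) contribute to the dot product, so the similarity comes out at roughly 0.52.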

Inverted indices and
the central dogma of machine learning
y = h_θ(X)
y = Xᵀ × θ
y: Rank, Class, Expectation, Probability, Descriptor*, …
Xᵀ: Inverted index (transposed); rows are “texts” (n; instances, observations), columns are n-grams (p; variables, features)
θ: Parameters, per feature (“Nonparametric”: per instance)
(Hyperparameters are settings that control the learning algorithm.)
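
As a minimal sketch of y = Xᵀ × θ, assume the simplest possible hypothesis, a linear one: each text's output is the weighted sum of its n-gram counts. The feature names, the counts, and the hand-picked weights below are hypothetical; in practice θ is learned from data.

```python
def h(theta, X_t):
    """Linear hypothesis: one weighted sum of feature counts per text (row of X_t)."""
    return [sum(weight * count for weight, count in zip(theta, row)) for row in X_t]

# Rows: texts (n = 2); columns: n-gram counts (p = 3) for ("free", "meeting", "the").
X_t = [
    [3, 0, 2],
    [0, 2, 2],
]
theta = [1.0, -1.0, 0.0]  # hypothetical weights: one parameter per feature
y = h(theta, X_t)         # one output per text, e.g. a score to rank or threshold
```

With p parameters and n outputs, the shapes make the slide's point visible: the parameter vector grows with the vocabulary, not with the number of texts.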

The curse of dimensionality
(R.E. Bellman, 1961) [inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep
popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to
the face of the cube (“nothing”, outside) than to other instances.
✓ Remedy: (feature) dimensionality reduction
The “blessing of non-uniformity.”
• feature extraction (compression): PCA/LSA (projection), factor analysis (regression),
compression, auto-encoders & deep learning (compression & embedding), …
• feature selection (elimination): LASSO (regularization), SVM (support vectors),
Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
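
The hypercube remark can be illustrated with a small calculation, under the simplifying assumption that instances are spread uniformly over a unit hypercube (real text data is not uniform, which is the "blessing" mentioned above). A point is farther than `margin` from every face only if each of its p coordinates lies in [margin, 1 - margin], so the share of such points is (1 - 2·margin)^p. The function name and the margin of 0.05 are assumptions for this sketch.

```python
def interior_fraction(p, margin=0.05):
    """Share of a p-dimensional unit hypercube farther than `margin` from every face."""
    return (1 - 2 * margin) ** p

# The interior empties out quickly as the number of dimensions grows.
fractions = {p: interior_fraction(p) for p in (1, 10, 100, 1000)}
```

In one dimension 90% of the points are in the interior; with a vocabulary-sized p, practically every point sits next to a face.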
Applications

Google’s review summaries:
Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)

Polarity of sentiment keywords in IMDB.
Christopher Potts. On the negativity of negation. 2011
“not good”
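
The “not good” example shows why keyword polarity alone is not enough. One common heuristic, sketched here as an assumption and not as the method of the cited paper, is to mark every token that follows a negation word, up to the next punctuation mark, so that "good" and "good_NEG" become different features. The negation list and the `_NEG` suffix are illustrative choices.

```python
import re

NEGATIONS = {"not", "no", "never", "n't", "cannot"}  # illustrative, incomplete list
CLAUSE_END = re.compile(r"^[.,:;!?]$")

def mark_negation(tokens):
    """Append _NEG to tokens between a negation word and the next punctuation mark."""
    marked, negated = [], False
    for token in tokens:
        if CLAUSE_END.match(token):
            negated = False          # punctuation closes the negated span
            marked.append(token)
        elif token.lower() in NEGATIONS:
            negated = True           # start marking after the negation word
            marked.append(token)
        else:
            marked.append(token + "_NEG" if negated else token)
    return marked

marked = mark_negation("the plot is not good , but fun".split())
```

A classifier trained on such features can learn separate weights for "good" and "good_NEG".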

Language understanding:
Parsing and semantic analysis.
[Figure labels: L. Tesnière, N. Chomsky]
Named Entity Recognition: disambiguation!
Entity Grounding: disambiguation!
Coreference (Anaphora) Resolution: disambiguation!
Apple Siri
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…

Automatic text summarization:
…is very difficult because…
• Variance/human agreement: When is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)
• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Lex[Page]Rank (JUNG), sumy, TextTeaser
the author got hired by Google…

Machine translation:
Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders
(es vs. de: el/die; la/der; …/das)
‣ have different verb placements (es ↔ de).
‣ have different concepts of verbs (Latin, Arabic, CJK).
‣ use different tenses (en ↔ de).
‣ have different word orders (Latin, Arabic, CJK).
DL4J

Question answering:
The champions league of TM & NLP.
Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs
Biggest issue: statistical inference
All men are mortal.
Socrates probably is a man…
…Therefore, Socrates might be mortal.
(cognitive computing)
IBM Watson, WolframAlpha

Information extraction:
Knowledge mining for molecular biology.
[Figure: text-mining workflow diagram. Labels: WWW; Articles; Article Classification; Named Entity Recognition; Entity Mapping (Grounding), e.g. Cdk5 (Rat): TaxID 10116, UniProt Q03114; Entity Associations; Relationship Extraction; Relationship Annotations; Binary Interactions; Experimental Methods; Biological Model; Short Factoid Question Answering; Ontologies & Thesauri; Biological Repositories]
MITIE, OpenDMAP, ClearTK

Text mining and language processing
is all about resolving ambiguities.
Anaphora resolution
Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech tagging
The robot wheels out the iron.
Paraphrasing
Unemployment is on the rise. vs. The economy is slumping.
Entity recognition & grounding
Is Princeton really good for you?

Overview of text mining and NLP (+software)

  • 1. Text mining and natural language processing Florian Leitner Technical University of Madrid (UPM), Spain ! Tyba Madrid, ES, 12th of June, 2015 License:
  • 2. Is language understanding & generation key to artificial intelligence? • “Her” (Samantha), movie, 2013 • “The Singularity: ~2030”, Ray Kurzweil, Google’s director of engineering • “Watson” & “CRUSH”, IBM’s bet on the future: Datastreams, Mainframes & AI. “Predict crimes before they happen”: Criminal Reduction Utilizing Statistical History (IBM, reality); Precogs (Minority Report, movie). If? When? Cognitive computing: “processing information more like a human than a machine”.
  • 3. Examples of text mining and natural language processing applications. • Spam filtering • Document classification • Social media/brand monitoring • Opinion mining (& text classification) • Search engines • Information retrieval • Plagiarism detection • Content-based recommendation systems • Watson (Jeopardy!, IBM) • Question answering • Spelling correction • Language modeling • Website translation (Google) • Machine translation • Digital assistants (MS’ Clippy) • Dialog systems (“Turing test”) • Siri (Apple) and Google Now • Speech recognition & language understanding • Event detection (in e-mails) • Information extraction. Labels: Text Mining; Language Processing. Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
  • 5. Document and text classification/clustering. [Figure: documents plotted on the 1st and 2nd principal components, with labels for document distance, centroid, and cluster.] Supervised (“Learning to classify from examples”, e.g., spam filtering) vs. unsupervised (“Exploratory grouping”, e.g., topic modeling). Libraries: LIBSVM.
  • 6. Words, Tokens, and N-Grams/Shingles. Token or shingle: This | is | a | sentence | . ; This is | is a | a sentence | sentence . ; This is a | is a sentence | a sentence . ; This is a sentence. NB: “tokenization”. Splitting: character-based, regular expressions, probabilistic, …
  • 7. Words, Tokens, and N-Grams/Shingles. Token n-grams (“k-shingling”) of “This is a sentence.”: shingles (unigrams): This | is | a | sentence | . ; 2-shingles (bigrams): This is | is a | a sentence | sentence . ; 3-shingles (trigrams): This is a | is a sentence | a sentence . Character n-grams, e.g. all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]. NB: “tokenization”. Splitting: character-based, regular expressions, probabilistic, … Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
  • 8. Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER). Linguistic annotations of tokens (used to train automated classifiers), as (Token, Lemma, PoS, NER): (Constitutive, constitutive, JJ, O); (binding, binding, NN, O); (to, to, TO, O); (the, the, DT, O); (peri-κ, peri-kappa, NN, B-DNA); (B, B, NN, I-DNA); (site, site, NN, I-DNA); (is, be, VBZ, O); (seen, see, VBN, O); (in, in, IN, O); (monocytes, monocyte, NNS, B-cell); (., ., ., O). The tokens “peri-κ B site” form one chunk. De facto standard PoS tagset {NN, JJ, DT, VBZ, …}: Penn Treebank. B-I-O chunk encoding: Begin-Inside-Outside (relevant) token; common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token, W: (unigram) Word). Libraries: Stanford CoreNLP, FACTORIE, FreeLing, and many more…
  • 9. Word vectors and inverted indices (“search engine basics”). Text vectorization: inverted index. Text 1: He that not wills to the end neither wills to the means. Text 2: If the mountain will not go to Moses, then Moses must go to the mountain. Token counts as token (Text 1, Text 2): end (1, 0); go (0, 2); he (1, 0); if (0, 1); means (1, 0); Moses (0, 2); mountain (0, 2); must (0, 1); not (1, 1); that (1, 0); the (2, 2); then (0, 1); to (2, 2); will (2, 1). Each token/word is a dimension! Comparing text vectors: e.g., cosine similarity, Similarity(T1, T2) := cos(T1, T2). [Figure: Text1 and Text2 drawn as vectors in a space with axes count(Word1), count(Word2), count(Word3) and angles α, β, γ.] Libraries: INDRI.
  • 10. Inverted indices and the central dogma of machine learning: y = h_θ(X), drawn as Xᵀ × θ = y. y: Rank, Class, Expectation, Probability, Descriptor*, … Xᵀ: inverted index (transposed), with “texts” (n) as instances/observations and n-grams (p) as variables/features. θ: parameters, per feature. (Hyperparameters are settings that control the learning algorithm.)
  • 11. Inverted indices and the central dogma of machine learning: y = h_θ(X), drawn as Xᵀ × θ = y. y: Rank, Class, Expectation, Probability, Descriptor*, … Xᵀ: inverted index (transposed), with “texts” (n) as instances/observations and n-grams (p) as variables/features. θ: parameters, per feature; “nonparametric”: per instance. (Hyperparameters are settings that control the learning algorithm.)
  • 12. The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming] • p ≫ n (far more tokens/features than texts/instances) • Inverted indices (X) are (discrete) sparse matrices. • Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production. ‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, outside) than to other instances. ✓ Remedy: (feature) dimensionality reduction. The “blessing of non-uniformity.” • Feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), … • Feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
  • 14. Google’s review summaries: Opinion mining (“sentiment” analysis). Don’t do it, please… ;-) (If you must: see document and text classification software.)
  • 15. Polarity of sentiment keywords in IMDB. Example: “not good”. Source: Christopher Potts. On the negativity of negation. 2011.
  • 16. Language understanding: Parsing and semantic analysis. Example: Apple Siri. Labels on the slide: Coreference (Anaphora) Resolution; Named Entity Recognition; Entity Grounding; “disambiguation!” (three times). N. Chomsky; L. Tesnière. Libraries: Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
  • 17. Automatic text summarization… is very difficult because… • Variance/human agreement: When is a summary “correct”? • Coherence: providing discourse structure (text flow) to the summary. • Paraphrasing: important sentences are repeated, but with different wordings. • Implied messages: (the Dow Jones index rose 10 points → the economy is thriving) • Anaphora (coreference) resolution: very hard, but crucial. Image source: www.lexalytics.com. Libraries: Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…).
  • 18. Machine translation: Deep learning with auto-encoders. Different languages… ‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das). ‣ have different verb placements (es ↔ de). ‣ have different concepts of verbs (Latin, Arabic, CJK). ‣ use different tenses (en ↔ de). ‣ have different word orders (Latin, Arabic, CJK). Libraries: DL4J.
  • 19. Question answering: The champions league of TM & NLP. Examples: IBM Watson, WolframAlpha. Category: Oscar Winning Movies. Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”. Answer: The Silence of the Lambs. Biggest issue: statistical inference (cognitive computing): All men are mortal. Socrates probably is a man… …Therefore, Socrates might be mortal.
  • 20. Information extraction: Knowledge mining for molecular biology. Diagram labels: WWW; Articles; Article Classification; Named Entity Recognition; Entity Mapping (Grounding), e.g. Cdk5, Rat, TaxID 10116, UniProt Q03114; Entity Associations; Relationship Extraction; Relationship Annotations; Binary Interactions; Experimental Methods; Biological Model; Biological Repositories; Ontologies & Thesauri; Short Factoid Question Answering. Libraries: MITIE, OpenDMAP, ClearTK.
  • 21. Text mining and language processing is all about resolving ambiguities. • Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him. • Part-of-Speech tagging: The robot wheels out the iron. • Paraphrasing: Unemployment is on the rise. vs. The economy is slumping. • Entity recognition & grounding: Is Princeton really good for you?
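The token and character n-grams (shingles) from slide 7 can be sketched in a few lines of Python. This is only an illustration: the helper name `ngrams` is made up here, and the token list is typed in by hand rather than produced by a real tokenizer.

```python
def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hand-tokenized example sentence from the slide (no real tokenizer involved).
tokens = ["This", "is", "a", "sentence", "."]

token_bigrams = ngrams(tokens, 2)    # 2-shingles over tokens
token_trigrams = ngrams(tokens, 3)   # 3-shingles over tokens

# Character trigrams of a single word: slide a window over its characters.
char_trigrams = ["".join(gram) for gram in ngrams("sentence", 3)]

print(token_bigrams)
print(token_trigrams)
print(char_trigrams)
```

The same function serves both cases because a string is just a sequence of characters; only the unit being windowed changes.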
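To make the B-I-O chunk encoding from slide 8 concrete, here is a minimal sketch that groups tagged tokens back into chunks. The function name `bio_chunks` is hypothetical, the token "peri-kappa" is spelled out in ASCII, and the tags are copied from the slide's NER column; real toolkits ship their own decoders.

```python
def bio_chunks(tokens, tags):
    """Group tokens into (text, label) chunks from B-/I-/O tags."""
    chunks = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open chunk, then begin a new one.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the chunk that is currently open.
            current.append(token)
        else:
            # "O": outside of any chunk, so close what is open.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA",
        "O", "O", "O", "B-cell", "O"]

chunks = bio_chunks(tokens, tags)
print(chunks)
```

Treating an `I-` tag with a new label as a chunk start is a design choice made for this sketch; it keeps the decoder tolerant of slightly inconsistent tag sequences.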
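The text vectorization and cosine similarity from slide 9 might look roughly like this with only the standard library. The regular-expression tokenizer and the names `vectorize` and `cosine` are assumptions of this sketch. Unlike the table on the slide, it lowercases everything and does no lemmatization, so "wills" and "will" remain separate dimensions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Count lowercase alphabetic tokens; each distinct token is a dimension."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


text1 = "He that not wills to the end neither wills to the means."
text2 = ("If the mountain will not go to Moses, "
         "then Moses must go to the mountain.")

v1, v2 = vectorize(text1), vectorize(text2)
similarity = cosine(v1, v2)
print(similarity)
```

Using `Counter` keeps the vectors sparse: only tokens that actually occur are stored, which matters once the vocabulary grows far beyond the length of any single text.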
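The relation y = h_θ(X) from slides 10 and 11 can be illustrated with the simplest possible hypothesis, a linear one with one parameter per feature. Everything concrete below is invented for the example: the three features, the counts, and the parameter values are not from the slides, and a real model would learn θ from data.

```python
def h(theta, X):
    """Linear hypothesis: one score per text (row of X), one weight per feature."""
    return [sum(w * x for w, x in zip(theta, row)) for row in X]


# Hypothetical feature order (columns of X): counts of three tokens.
features = ["moses", "mountain", "wills"]

# Rows are texts (instances), columns are n-gram counts (features).
X = [
    [0, 0, 2],  # made-up counts for a first text
    [2, 2, 0],  # made-up counts for a second text
]

# Made-up parameters, one per feature.
theta = [0.5, 0.5, -1.0]

y = h(theta, X)
print(y)
```

Depending on the task, each entry of `y` would then be read as a rank, a class score, an expectation, or a probability after a suitable transformation.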
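A quick numerical look at the curse of dimensionality from slide 12. The sketch below does not test the slide's exact statement about distances between instances; it shows a related, simpler fact: in a unit hypercube, the share of uniformly drawn points lying within a small margin of some face approaches 1 as the dimension grows. The margin of 0.05 and both function names are arbitrary choices for this illustration.

```python
import random


def boundary_fraction(dim, margin=0.05):
    """Exact share of the unit hypercube's volume within `margin` of a face."""
    return 1.0 - (1.0 - 2.0 * margin) ** dim


def sampled_boundary_fraction(dim, margin=0.05, samples=20000, seed=0):
    """Monte Carlo estimate of the same share, using uniform random points."""
    rng = random.Random(seed)
    near = 0
    for _ in range(samples):
        point = [rng.random() for _ in range(dim)]
        if any(x < margin or x > 1.0 - margin for x in point):
            near += 1
    return near / samples


for dim in (1, 10, 100):
    print(dim, boundary_fraction(dim), sampled_boundary_fraction(dim))
```

With thousands or millions of token dimensions, practically every point sits near the boundary, which is one way to see why dimensionality reduction is listed as the remedy.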