2. * Introduction to Natural Language Processing and Text Mining
* Linguistic and Statistical Approaches
* Critiquing Classifier Results
* A New Dawn: Deep Learning
* What’s Next
3. * Enterprise Architect, Big Data and Analytics
* Former Research Scientist, bioinformatics institute
* Completing PhD in Computational Biology with a focus on text mining
* Author
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
7. Manual procedures are time consuming and costly
The volume of literature continues to grow
Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
Some success with popular tools, but with limitations
8. * Linguistic (from 1960s)
* Focus on syntax
* Transformational Grammar
* Sentence parsing
* Statistical (from 1990s)
* Focus on words, n-grams, etc.
* Statistics and Probability
* Related work in Information Retrieval
* Topic Modeling and Classification
* Deep Learning (from ~2006)
* Focus on multi-layered neural nets computing non-linear functions
* Light on theory, heavy on engineering
* Multiple NLP tasks
15. * Technique for identifying the dominant themes in a document
* Does not require training
* Multiple algorithms (a usage sketch follows below)
* Probabilistic Latent Semantic Indexing (PLSI)
* Latent Dirichlet Allocation (LDA)
* Assumptions
* Documents are about a mixture of topics
* Words used in a document are attributable to a topic
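A minimal sketch of LDA topic modeling, assuming the gensim library (listed on the tools slide later in this deck); the toy corpus and topic count are illustrative assumptions:

from gensim import corpora, models

# Toy corpus; real corpora would be full documents, tokenized and cleaned.
docs = [["cholesterol", "arteries", "heart", "arteries"],
        ["cpu", "memory", "io", "memory"],
        ["politics", "presidential", "election", "politics"]]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

# No labeled training data required; LDA infers topics unsupervised.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary)

for bow in corpus:
    print(lda.get_document_topics(bow))  # each document as a mixture of topics
print(lda.print_topics())                # each topic as a distribution over words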
17. * Topics are represented by words; documents are about a set of topics
* Doc 1: 50% politics, 50% presidential
* Doc 2: 25% CPU, 30% memory, 45% I/O
* Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics (a sampling sketch follows below)
* Assign each word to a topic
* For each word and topic, compute
* Probability of the topic given a document, P(topic|doc)
* Probability of the word given a topic, P(word|topic)
* Reassign the word to a new topic with probability P(topic|doc) * P(word|topic)
* Reassignment is based on the probability that topic T generated the use of word W
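A minimal sketch of this reassignment loop (collapsed Gibbs sampling); the smoothing constant 0.1 and the toy corpus are illustrative assumptions, not from the talk:

import random
from collections import defaultdict

docs = [["politics", "election", "vote"],
        ["cpu", "memory", "cpu"],
        ["heart", "arteries", "cholesterol"]]
K = 3                                        # number of topics
V = len({w for doc in docs for w in doc})    # vocabulary size

# Step 1: assign each word occurrence to a random topic, keeping counts.
doc_topic = defaultdict(lambda: defaultdict(int))   # doc -> topic -> count
topic_word = defaultdict(lambda: defaultdict(int))  # topic -> word -> count
topic_total = defaultdict(int)
assignments = []
for d, doc in enumerate(docs):
    for w in doc:
        t = random.randrange(K)
        assignments.append((d, w, t))
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Step 2: repeatedly reassign each word to a topic with probability
# proportional to P(topic|doc) * P(word|topic).
for _ in range(100):
    for n, (d, w, t) in enumerate(assignments):
        # Remove the current assignment before recomputing.
        doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
        weights = [(doc_topic[d][k] + 0.1) *                              # ~ P(topic|doc)
                   (topic_word[k][w] + 0.1) / (topic_total[k] + 0.1 * V)  # ~ P(word|topic)
                   for k in range(K)]
        t = random.choices(range(K), weights=weights)[0]
        assignments[n] = (d, w, t)
        doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1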
18. Image Source: David Blei, “Probabilistic Topic Models”. http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
19. * 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – examples from a representative corpus
* Negative examples – randomly selected from the same publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of vectors as the measure of similarity
* Algorithms (a comparison sketch follows below)
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
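A minimal sketch of this pipeline, assuming scikit-learn; the tiny labeled corpus below stands in for the talk's positive and negative sentence examples:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier, Perceptron, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

sentences = ["EspB is translocated into the host cell",
             "the toxin disrupts intestinal colonization",
             "the adhesin mediates attachment to epithelial cells",
             "the effector suppresses the host immune response",
             "the protease degrades host junction proteins",
             "the pilus promotes invasion of the mucosa",
             "data were log-transformed before analysis",
             "samples were incubated overnight at 37 degrees",
             "the plasmid was cloned into the PstI site",
             "statistical significance was assessed by t-test",
             "cultures were grown in LB medium",
             "sequences were aligned with standard software"]
labels = [1] * 6 + [0] * 6   # 1 = virulence factor sentence, 0 = other

X = TfidfVectorizer().fit_transform(sentences)   # vector space representation

for clf in [LinearSVC(), RidgeClassifier(), Perceptron(), SGDClassifier(),
            KNeighborsClassifier(n_neighbors=3), MultinomialNB(),
            RandomForestClassifier(), AdaBoostClassifier()]:
    scores = cross_val_score(clf, X, labels, cv=3)   # 3-fold cross-validation
    print(type(clf).__name__, round(scores.mean(), 2))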
20. Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
21. * Term Frequency (TF)
tf(t, d) = number of occurrences of term t in document d
* Inverse Document Frequency (IDF)
idf(t, D) = log(N / |{d in D : t in d}|)
where D is the set of documents and N is the number of documents
* TF-IDF(t, d, D) = tf(t, d) * idf(t, D) (a worked example follows below)
* TF-IDF is
* large when the term is frequent in the document but appears in few documents overall
* small when the term appears in many documents
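A direct transcription of these formulas into Python; the three-document corpus is an illustrative assumption:

import math

D = [["the", "cat", "sat"],
     ["the", "dog", "barked"],
     ["the", "cat", "purred"]]
N = len(D)   # number of documents

def tf(t, d):
    return d.count(t)                 # occurrences of term t in document d

def idf(t, D):
    df = sum(1 for d in D if t in d)  # |{d in D : t in d}|
    return math.log(N / df)

def tf_idf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tf_idf("the", D[0], D))   # 0.0 -- "the" appears in every document
print(tf_idf("cat", D[0], D))   # ~0.41 -- frequent here, absent elsewhere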
23. * Bag of words model
* Ignores the structure (syntax) and meaning (semantics) of sentences
* Representation vector length is the size of the set of unique words in the corpus
* Stemming is used to remove morphological differences
* Each word is assigned an index in the representation vector, V
* The value V[i] is non-zero if the word appears in the sentence represented by the vector
* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus (a sketch follows below)
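A minimal sketch of this representation, assuming NLTK for stemming and scikit-learn for vectorization; the two sentences are illustrative:

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Stemming removes morphological differences, e.g. mapping
    # "colonizes" and "colonization" toward a common stem.
    return [stemmer.stem(tok) for tok in text.split()]

sentences = ["The strain colonizes the intestine",
             "Intestinal colonization was observed"]

vec = TfidfVectorizer(tokenizer=stem_tokens)
V = vec.fit_transform(sentences)   # one row per sentence, one column per stem

print(vec.get_feature_names_out())   # index -> stem mapping
print(V.toarray())   # V[i][j] is non-zero iff stem j occurs in sentence i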
24. Support Vector Machine (SVM) is a large margin classifier (see the margin sketch below)
Commonly used in text classification
Initial results based on a life sciences sentence classifier
Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
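A small illustration of the large-margin idea, assuming scikit-learn; the 2D toy points are made up to mirror the figure:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],    # class 0
              [4, 4], [5, 4], [4, 5]])   # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print("support vectors:", clf.support_vectors_)   # points on the margin
print("margin width:", 2 / np.linalg.norm(w))     # the gap the SVM maximizes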
26. Non-VF, Predicted VF:
“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
“Data were log-transformed to correct for heterogeneity of the variances where necessary.”
“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
“Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
“The DsbLI system also comprises a functional redox pair”
27. Adding additional examples is not likely to substantially improve results, as seen in the error curve (a sketch of this diagnostic follows below)
[Learning curve: training error and validation error (y-axis, 0–0.5) vs. number of training examples (x-axis, 0–10,000)]
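A minimal sketch of the diagnostic behind this plot, assuming scikit-learn; synthetic data stands in for the TF-IDF features and sentence labels from the talk:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10000, n_features=200, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LinearSVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# When training and validation error converge to a similar plateau,
# adding more labeled examples is unlikely to help much.
for n, tr, va in zip(sizes, 1 - train_scores.mean(axis=1),
                     1 - val_scores.mean(axis=1)):
    print(f"{n:5d} examples: train error {tr:.3f}, validation error {va:.3f}")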
29. * Increase the quantity of data (not always helpful; see the error curves)
* Improve the quality of data
* Utilize multiple supervised algorithms, ensemble and non-ensemble
* Use unlabeled data and semi-supervised techniques
* Feature Selection
* Parameter Tuning (a tuning sketch follows below)
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: change the representation?
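A minimal sketch combining two of these options, feature selection and parameter tuning, assuming scikit-learn; the grid values are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("select", SelectKBest(chi2)),   # feature selection
                 ("svm", LinearSVC())])

grid = GridSearchCV(pipe,
                    {"select__k": [100, 1000, 5000],   # features to keep
                     "svm__C": [0.1, 1, 10]},          # regularization strength
                    cv=5)

# grid.fit(sentences, labels) would search all nine combinations and keep
# the best cross-validated model in grid.best_estimator_.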
30. * TF-IDF
* Loss of syntactic and semantic information
* No relation between term index and meaning
* No support for disambiguation
* Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
* Ideal Representation
◦ Captures the semantic similarity of words
◦ Does not require feature engineering
◦ Minimal pre-processing, e.g. no mapping to ontologies
◦ Improves precision and recall
32. * Dense vector representation (n = 50 … 300 or more)
* Captures semantics – similar words are close by the cosine measure (a small illustration follows below)
* Captures language features
* Syntactic relations
* Semantic relations
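A tiny illustration of closeness by cosine; these 4-dimensional vectors are made up, while real embeddings have 50-300+ dimensions learned from data:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

heart = np.array([0.9, 0.1, 0.3, 0.0])
artery = np.array([0.8, 0.2, 0.4, 0.1])
cpu = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine(heart, artery))   # ~0.98: similar words lie close together
print(cosine(heart, cpu))      # ~0.10: dissimilar words lie far apart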
34. * Large volume of data
* Billions of words in context
* Multiple passes over the data
* Algorithms (a training sketch follows below)
* Word2Vec
* CBOW
* Skip-gram
* GloVe
* Linguistic terms with similar distributions have similar meanings
T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
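A minimal sketch of training these vectors with gensim's word2vec module (listed on the tools slide), assuming gensim 4's API; the repeated toy corpus is far too small for real use, where billions of words are needed, as noted above:

from gensim.models import Word2Vec

sentences = [["the", "patient", "heart", "arteries"],
             ["cpu", "memory", "io", "throughput"]] * 100

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv.most_similar("heart"))
# With a large corpus, analogy arithmetic also works, e.g.:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])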
55. * Word2Vec – command line tool
* Gensim – Python topic modeling library with a word2vec module
* GloVe (Global Vectors for Word Representation) – command line tool
56. * Theano: Python CPU/GPU symbolic expression compiler
* Torch: Scientific computing framework for LuaJIT
* PyLearn2: Python deep learning platform
* Lasagne: lightweight framework on top of Theano
* Keras: Python library for working with Theano
* DeepDist: Deep Learning on Spark
* Deeplearning4J: Java and Scala, integrated with Hadoop and Spark
57. * Deep Learning Bibliography – http://memkite.com/deep-learning-bibliography/
* Deep Learning Reading List – http://deeplearning.net/reading-list/
* Kim, Yoon. “Convolutional Neural Networks for Sentence Classification.” arXiv preprint arXiv:1408.5882 (2014).
* Goldberg, Yoav. “A Primer on Neural Network Models for Natural Language Processing.” http://u.cs.biu.ac.il/~yogo/nnlp.pdf
Linguistic and statistical approaches work at the symbol level; deep learning works at the subsymbolic or representation level (representation theory)
Symbolic – a well-formed, unambiguous definition is associated with each symbol
Sub-symbolic – more like Wittgenstein arguing that words do not need precise definitions to be meaningful
Manually crafted rules: an early deterministic parser had 50-80 rules?
Manual rules must be:
Comprehensive – cover all parts of the domain
Accurate – reflect the relationships
Unambiguous
Notes on the misclassified sentences above:
1. Describes a process used in VF
2. No idea why this was labeled as a 1
3. Probably from a Methods section; refers to a resistance cassette
4.
Alanine, isoleucine, and valine are all hydrophobic; arginine is charged, as is aspartic acid
Proteobacteria is a phylum. The taxonomic ranks are:
Superkingdom
Kingdom
Phylum
Class
Order
Family
Genus
Species
ReLU works better than tanh, which works better than sigmoid
Manifold Hypothesis
Distributional semantics exist, based on linear algebra. What new operations can be defined?