2. * Introduction to Natural Language Processing and Text Mining
* Linguistic and Statistical Approaches
* Critiquing Classifier Results
* A New Dawn: Deep Learning
* What’s Next
3. * Enterprise Architect, Big Data and Analytics
* Former Research Scientist, bioinformatics institute
* Completing PhD in Computational Biology with a focus on text mining
* Author
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
7. Manual procedures are time consuming and costly
The volume of literature continues to grow
Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
Some success with popular tools, but with limitations
8. * Linguistic (from 1960s)
* Focus on syntax
* Transformational Grammar
* Sentence parsing
* Statistical (from 1990s)
* Focus on words, n-grams, etc.
* Statistics and Probability
* Related work in Information Retrieval
* Topic Modeling and Classification
* Deep Learning (from ~2006)
* Focus on multi-layered neural nets computing non-linear functions
* Light on theory, heavy on engineering
* Multiple NLP tasks
15. * Technique for identifying the dominant themes in a document
* Does not require training
* Multiple algorithms (a usage sketch follows below)
* Probabilistic Latent Semantic Indexing (PLSI)
* Latent Dirichlet Allocation (LDA)
* Assumptions
* Documents are about a mixture of topics
* Words used in a document are attributable to a topic
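A minimal sketch of LDA topic modeling, assuming the gensim library (listed on the tools slide later in this deck); the toy corpus and topic count are illustrative assumptions:

from gensim import corpora, models

# Toy corpus; real corpora would be full documents, tokenized and cleaned.
docs = [["cholesterol", "arteries", "heart", "arteries"],
        ["cpu", "memory", "io", "memory"],
        ["politics", "presidential", "election", "politics"]]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words counts

# No labeled training data required; LDA infers topics unsupervised.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary)

for bow in corpus:
    print(lda.get_document_topics(bow))  # each document as a mixture of topics
print(lda.print_topics())                # each topic as a distribution over words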
17. * Topics are represented by words; documents are about a set of topics
* Doc 1: 50% politics, 50% presidential
* Doc 2: 25% CPU, 30% memory, 45% I/O
* Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics (a sampling sketch follows below)
* Assign each word to a topic
* For each word and topic, compute
* Probability of the topic given a document, P(topic|doc)
* Probability of the word given a topic, P(word|topic)
* Reassign the word to a new topic with probability P(topic|doc) * P(word|topic)
* Reassignment is based on the probability that topic T generated the use of word W
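A minimal sketch of this reassignment loop (collapsed Gibbs sampling); the smoothing constant 0.1 and the toy corpus are illustrative assumptions, not from the talk:

import random
from collections import defaultdict

docs = [["politics", "election", "vote"],
        ["cpu", "memory", "cpu"],
        ["heart", "arteries", "cholesterol"]]
K = 3                                        # number of topics
V = len({w for doc in docs for w in doc})    # vocabulary size

# Step 1: assign each word occurrence to a random topic, keeping counts.
doc_topic = defaultdict(lambda: defaultdict(int))   # doc -> topic -> count
topic_word = defaultdict(lambda: defaultdict(int))  # topic -> word -> count
topic_total = defaultdict(int)
assignments = []
for d, doc in enumerate(docs):
    for w in doc:
        t = random.randrange(K)
        assignments.append((d, w, t))
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Step 2: repeatedly reassign each word to a topic with probability
# proportional to P(topic|doc) * P(word|topic).
for _ in range(100):
    for n, (d, w, t) in enumerate(assignments):
        # Remove the current assignment before recomputing.
        doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
        weights = [(doc_topic[d][k] + 0.1) *                              # ~ P(topic|doc)
                   (topic_word[k][w] + 0.1) / (topic_total[k] + 0.1 * V)  # ~ P(word|topic)
                   for k in range(K)]
        t = random.choices(range(K), weights=weights)[0]
        assignments[n] = (d, w, t)
        doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1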
18. Image Source: David Blei, “Probabilistic Topic Models”. http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
19. * 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – examples from a representative corpus
* Negative examples – randomly selected from the same publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of vectors as the measure of similarity
* Algorithms (a comparison sketch follows below)
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
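A minimal sketch of this pipeline, assuming scikit-learn; the tiny labeled corpus below stands in for the talk's positive and negative sentence examples:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier, Perceptron, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

sentences = ["EspB is translocated into the host cell",
             "the toxin disrupts intestinal colonization",
             "the adhesin mediates attachment to epithelial cells",
             "the effector suppresses the host immune response",
             "the protease degrades host junction proteins",
             "the pilus promotes invasion of the mucosa",
             "data were log-transformed before analysis",
             "samples were incubated overnight at 37 degrees",
             "the plasmid was cloned into the PstI site",
             "statistical significance was assessed by t-test",
             "cultures were grown in LB medium",
             "sequences were aligned with standard software"]
labels = [1] * 6 + [0] * 6   # 1 = virulence factor sentence, 0 = other

X = TfidfVectorizer().fit_transform(sentences)   # vector space representation

for clf in [LinearSVC(), RidgeClassifier(), Perceptron(), SGDClassifier(),
            KNeighborsClassifier(n_neighbors=3), MultinomialNB(),
            RandomForestClassifier(), AdaBoostClassifier()]:
    scores = cross_val_score(clf, X, labels, cv=3)   # 3-fold cross-validation
    print(type(clf).__name__, round(scores.mean(), 2))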
20. Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
21. * Term Frequency (TF)
tf(t, d) = number of occurrences of term t in document d
* Inverse Document Frequency (IDF)
idf(t, D) = log(N / |{d in D : t in d}|)
where D is the set of documents and N is the number of documents
* TF-IDF(t, d, D) = tf(t, d) * idf(t, D) (a worked example follows below)
* TF-IDF is
* large when the term is frequent in the document but appears in few documents overall
* small when the term appears in many documents
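A direct transcription of these formulas into Python; the three-document corpus is an illustrative assumption:

import math

D = [["the", "cat", "sat"],
     ["the", "dog", "barked"],
     ["the", "cat", "purred"]]
N = len(D)   # number of documents

def tf(t, d):
    return d.count(t)                 # occurrences of term t in document d

def idf(t, D):
    df = sum(1 for d in D if t in d)  # |{d in D : t in d}|
    return math.log(N / df)

def tf_idf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tf_idf("the", D[0], D))   # 0.0 -- "the" appears in every document
print(tf_idf("cat", D[0], D))   # ~0.41 -- frequent here, absent elsewhere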
23. * Bag of words model
* Ignores the structure (syntax) and meaning (semantics) of sentences
* Representation vector length is the size of the set of unique words in the corpus
* Stemming is used to remove morphological differences
* Each word is assigned an index in the representation vector, V
* The value V[i] is non-zero if the word appears in the sentence represented by the vector
* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus (a sketch follows below)
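A minimal sketch of this representation, assuming NLTK for stemming and scikit-learn for vectorization; the two sentences are illustrative:

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Stemming removes morphological differences, e.g. mapping
    # "colonizes" and "colonization" toward a common stem.
    return [stemmer.stem(tok) for tok in text.split()]

sentences = ["The strain colonizes the intestine",
             "Intestinal colonization was observed"]

vec = TfidfVectorizer(tokenizer=stem_tokens)
V = vec.fit_transform(sentences)   # one row per sentence, one column per stem

print(vec.get_feature_names_out())   # index -> stem mapping
print(V.toarray())   # V[i][j] is non-zero iff stem j occurs in sentence i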
24. Support Vector Machine (SVM) is a large margin classifier (see the margin sketch below)
Commonly used in text classification
Initial results based on a life sciences sentence classifier
Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
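A small illustration of the large-margin idea, assuming scikit-learn; the 2D toy points are made up to mirror the figure:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],    # class 0
              [4, 4], [5, 4], [4, 5]])   # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print("support vectors:", clf.support_vectors_)   # points on the margin
print("margin width:", 2 / np.linalg.norm(w))     # the gap the SVM maximizes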
26. Non-VF, Predicted VF:
“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
“Data were log-transformed to correct for heterogeneity of the variances where necessary.”
“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
“Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
“The DsbLI system also comprises a functional redox pair”
27. Adding additional examples is not likely to substantially improve results, as seen in the error curve (a sketch of this diagnostic follows below)
[Learning curve: training error and validation error (y-axis, 0–0.5) vs. number of training examples (x-axis, 0–10,000)]
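A minimal sketch of the diagnostic behind this plot, assuming scikit-learn; synthetic data stands in for the TF-IDF features and sentence labels from the talk:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10000, n_features=200, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LinearSVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# When training and validation error converge to a similar plateau,
# adding more labeled examples is unlikely to help much.
for n, tr, va in zip(sizes, 1 - train_scores.mean(axis=1),
                     1 - val_scores.mean(axis=1)):
    print(f"{n:5d} examples: train error {tr:.3f}, validation error {va:.3f}")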
29. * Increase the quantity of data (not always helpful; see the error curves)
* Improve the quality of data
* Utilize multiple supervised algorithms, ensemble and non-ensemble
* Use unlabeled data and semi-supervised techniques
* Feature Selection
* Parameter Tuning (a tuning sketch follows below)
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: change the representation?
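A minimal sketch combining two of these options, feature selection and parameter tuning, assuming scikit-learn; the grid values are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("select", SelectKBest(chi2)),   # feature selection
                 ("svm", LinearSVC())])

grid = GridSearchCV(pipe,
                    {"select__k": [100, 1000, 5000],   # features to keep
                     "svm__C": [0.1, 1, 10]},          # regularization strength
                    cv=5)

# grid.fit(sentences, labels) would search all nine combinations and keep
# the best cross-validated model in grid.best_estimator_.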
30. * TF-IDF
* Loss of syntactic and semantic information
* No relation between term index and meaning
* No support for disambiguation
* Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
* Ideal Representation
◦ Captures the semantic similarity of words
◦ Does not require feature engineering
◦ Minimal pre-processing, e.g. no mapping to ontologies
◦ Improves precision and recall
32. * Dense vector representation (n = 50 … 300 or more)
* Captures semantics – similar words are close by the cosine measure (a small illustration follows below)
* Captures language features
* Syntactic relations
* Semantic relations
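A tiny illustration of closeness by cosine; these 4-dimensional vectors are made up, while real embeddings have 50-300+ dimensions learned from data:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

heart = np.array([0.9, 0.1, 0.3, 0.0])
artery = np.array([0.8, 0.2, 0.4, 0.1])
cpu = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine(heart, artery))   # ~0.98: similar words lie close together
print(cosine(heart, cpu))      # ~0.10: dissimilar words lie far apart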
34. * Large volume of data
* Billions of words in context
* Multiple passes over the data
* Algorithms (a training sketch follows below)
* Word2Vec
* CBOW
* Skip-gram
* GloVe
* Linguistic terms with similar distributions have similar meanings
T. Mikolov, et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
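A minimal sketch of training these vectors with gensim's word2vec module (listed on the tools slide), assuming gensim 4's API; the repeated toy corpus is far too small for real use, where billions of words are needed, as noted above:

from gensim.models import Word2Vec

sentences = [["the", "patient", "heart", "arteries"],
             ["cpu", "memory", "io", "throughput"]] * 100

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv.most_similar("heart"))
# With a large corpus, analogy arithmetic also works, e.g.:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])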
55. * Word2Vec – command line tool
* Gensim – Python topic modeling library with a word2vec module
* GloVe (Global Vectors for Word Representation) – command line tool
56. * Theano: Python CPU/GPU symbolic expression compiler
* Torch: Scientific computing framework for LuaJIT
* PyLearn2: Python deep learning platform
* Lasagne: lightweight framework on top of Theano
* Keras: Python library for working with Theano
* DeepDist: Deep Learning on Spark
* Deeplearning4J: Java and Scala, integrated with Hadoop and Spark
57. * Deep Learning Bibliography – http://memkite.com/deep-learning-bibliography/
* Deep Learning Reading List – http://deeplearning.net/reading-list/
* Kim, Yoon. “Convolutional Neural Networks for Sentence Classification.” arXiv preprint arXiv:1408.5882 (2014).
* Goldberg, Yoav. “A Primer on Neural Network Models for Natural Language Processing.” http://u.cs.biu.ac.il/~yogo/nnlp.pdf
Linguistic and statistical approaches work at the symbol level; deep learning works at the subsymbolic or representation level (representation theory)
Symbolic – a well-formed, unambiguous definition is associated with each symbol
Sub-symbolic – more like Wittgenstein arguing that words do not need precise definitions to be meaningful
Manually crafted rules: an early deterministic parser had 50-80 rules?
Manual rules must be:
Comprehensive – cover all parts of the domain
Accurate – reflect the relationships
Unambiguous
Notes on the misclassified sentences above:
1. Describes a process used in VF
2. No idea why this was labeled as a 1
3. Probably from a Methods section; refers to a resistance cassette
4.
Alanine, isoleucine, and valine are all hydrophobic; arginine is charged, as is aspartic acid
Proteobacteria is a phylum. The taxonomic ranks are:
Superkingdom
Kingdom
Phylum
Class
Order
Family
Genus
Species
ReLU works better than tanh, which works better than sigmoid
Manifold Hypothesis
Distributional semantics exist, based on linear algebra. What new operations can be defined?