Dan Sullivan
October 21, 2015
Portland, OR
*
* Introduction to Natural Language
Processing and Text Mining
* Linguistic and Statistical Approaches
*Critiquing Classifier Results
* A New Dawn: Deep Learning
* What’s Next
*
* Enterprise Architect, Big Data and
Analytics
* Former Research Scientist,
bioinformatics institute
* Completing PhD in Computational
Biology with focus on text mining
*Author
*Contact
*dan@dsapptech.com
*@dsapptech
*Linkedin.com/in/dansullivanpdx
*
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Linguistic (from 1960s)
* Focus on syntax
* Transformational Grammar
* Sentence parsing
*Statistical (from 1990s)
* Focus on words, ngrams, etc.
* Statistics and Probability
* Related work in Information
Retrieval
* Topic Modeling and Classification
* Deep Learning (from ~2006)
* Focus on multi-layered neural net
computing non-linear functions
* Light on theory, heavy on
engineering
* Multiple NLP tasks
*
VS.
*
http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets
*
Image: http://www.nltk.org/book_1ed/ch08.html
*
Stephen H. Chen et al. Physiol. Genomics 2005;22:257-267
*
* Technique for identifying dominant themes
in a document
* Does not require training
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Documents are about a mixture of topics
*Words used in a document are attributable to
a topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
Debt, Law,
Graduation
Debt, EU,
Greece, Euro
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
EU, Greece,
Negotiations,
Varoufakis
*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
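The reassignment loop above can be sketched in a few lines of Python. This is a toy, Gibbs-style sampler rather than any particular library's LDA. The function name `gibbs_topics`, the smoothing constants `alpha` and `beta`, and the sample documents are assumptions added for illustration only.

```python
import random
from collections import defaultdict


def gibbs_topics(docs, num_topics=2, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy topic learner; docs is a list of token lists.

    alpha and beta are smoothing constants (assumed values) that keep
    every probability above zero.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Assign each word occurrence to a random initial topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables behind P(topic|doc) and P(word|topic).
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the word out of its current topic before rescoring.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                weights = []
                for t in range(num_topics):
                    p_topic_doc = (doc_topic[d][t] + alpha) / (
                        len(doc) - 1 + num_topics * alpha)
                    p_word_topic = (topic_word[t][word] + beta) / (
                        topic_total[t] + vocab_size * beta)
                    weights.append(p_topic_doc * p_word_topic)

                # Reassign with probability proportional to
                # P(topic|doc) * P(word|topic).
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return assignments, doc_topic


# Hypothetical mini-corpus, loosely based on the example documents above.
docs = [
    ["cpu", "memory", "cpu", "io"],
    ["heart", "arteries", "cholesterol", "heart"],
    ["cpu", "io", "memory"],
]
assignments, doc_topic = gibbs_topics(docs)
print(doc_topic)  # per-document topic counts
```

The per-document counts in `doc_topic`, divided by document length, give the topic mixture of each document.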
TOPICS
Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of the angle between vectors as the measure of similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
*Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term is frequent in the document and
appears in few documents of the corpus
*small when the term appears in many documents
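As a minimal sketch, the definitions above translate directly into Python. Documents are assumed to be lists of tokens. Returning 0.0 for a term that appears in no document is an added assumption to avoid division by zero, and the sample corpus is hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Number of occurrences of term in doc (a list of tokens)."""
    return Counter(doc)[term]


def idf(term, docs):
    """log(N / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# Hypothetical three-document corpus.
corpus = [
    ["the", "gene", "gene"],
    ["the", "host"],
    ["the", "cell", "host"],
]
print(tf_idf("gene", corpus[0], corpus))  # frequent here, rare elsewhere
print(tf_idf("the", corpus[0], corpus))   # appears in every document
```

A term that appears in every document has idf = log(1) = 0, so its TF-IDF weight vanishes however often it occurs.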
*
One-Hot Representation
The        1 0 0 0 0 0 0
Esp8       0 1 0 0 0 0 0
gene       0 0 1 0 0 0 0
is         0 0 0 1 0 0 0
a          0 0 0 0 1 0 0
known      0 0 0 0 0 1 0
virulence  0 0 0 0 0 0 1

TF-IDF Representation
            translocates  reduced  levels  of      Esp8   host    cell
Sentence 1  0.193         0.2828   0.078   0.0001  0.389  0.0144  0.011
Sentence 2  0             0.0091   0.0621  0       0      0       0
Sentence 3  0             0        0       0       0.028  0.0113  0
Sentence 4  0.021         0        0       0       0      0       0
*
* Bag of words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
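A small sketch of the bag-of-words vector and the cosine similarity measure follows. For brevity it uses raw counts as the non-zero values instead of TF-IDF weights and skips stemming. The helper names and sample sentences are illustrative assumptions.

```python
import math


def build_index(sentences):
    """Assign each unique word in the corpus an index in the vector."""
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    return {word: i for i, word in enumerate(vocabulary)}


def to_vector(sentence, index):
    """V[i] is non-zero only if word i appears in the sentence."""
    vector = [0.0] * len(index)
    for word in sentence:
        if word in index:
            vector[index[word]] += 1.0  # raw count; TF-IDF would go here
    return vector


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical tokenized sentences.
sentences = [
    ["esp8", "gene", "is", "known"],
    ["gene", "is", "known", "esp8"],  # same words, different order
    ["host", "cell"],
]
index = build_index(sentences)
vectors = [to_vector(s, index) for s in sentences]
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

The first two sentences get identical vectors even though their word order differs, which is the loss of syntax noted above.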
*
Support Vector Machine (SVM) is a large-
margin classifier
Commonly used in text classification
Initial results based on life sciences
sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of
EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where
necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the
PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption
in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease
EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing
E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and
intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”
* Adding more examples is not likely to substantially
improve results, as the error curve shows
[Figure: error curves for "All": training error and validation error (y-axis 0 to 0.5) plotted over 0 to 10,000 on the x-axis]
8 Alternative Algorithms
Select 10,000 most important features using chi-square
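One way to rank features by chi-square is sketched below. It uses the standard 2x2 contingency statistic (term present or absent against positive or negative class). The slides do not say which implementation was used, so the function names and toy data here are assumptions.

```python
def chi_square(term, docs, labels):
    """Chi-square statistic of term presence against a binary label."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label:
            a += 1          # positive docs containing the term
        elif present:
            b += 1          # negative docs containing the term
        elif label:
            c += 1          # positive docs without the term
        else:
            d += 1          # negative docs without the term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score."""
    vocabulary = {word for doc in docs for word in doc}
    ranked = sorted(vocabulary,
                    key=lambda w: (-chi_square(w, docs, labels), w))
    return ranked[:k]


# Hypothetical labeled documents (1 = positive, 0 = negative).
toy_docs = [{"toxin", "gene"}, {"toxin", "host"},
            {"data", "gene"}, {"data", "host"}]
toy_labels = [1, 1, 0, 0]
print(select_features(toy_docs, toy_labels, 2))
```

Terms spread evenly across both classes score 0 and drop out first. The slide's setting corresponds to k = 10,000.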
* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
*
*TF-IDF
*Loss of syntactic and
semantic information
*No relation between
term index and meaning
*No support for
disambiguation
*Feature engineering
extends the vector
representation or
substitutes more general
terms for specific ones – a
crude way to capture
semantic properties
*
* Ideal
Representation
* Capture semantic
similarity of words
* Does not require
feature engineering
* Minimal preprocessing,
e.g. no mapping to
ontologies
* Improves precision
and recall
*
*Dense vector
representation (n = 50 …
300 or more)
*Capture semantics –
similar words close by
cosine measure
*Captures language
features
*Syntactic relations
*Semantic relations
*
[0.160610 -0.547976 -0.444522 -0.037896 0.044305 0.245423 -0.261498 0.000294 -0.275621 -0.021201 -0.432955
0.388905 0.106494 0.405797 -0.159357 -0.073897 0.177182 0.043535 0.600987 0.064762 -0.348964 0.189289 0.650318 0.112554
0.374456 -0.227780 0.208623 0.065362 0.235401 -0.118003 0.032858 -0.309767 0.024085 -0.055148 0.158807 0.171749 -0.153825
0.090301 0.033275 0.089936 0.187864 -0.044472 0.421533 0.209217 -0.142092 0.153070 -0.168291 -0.052823 -0.090984 0.018695
-0.265503 -0.055572 -0.212252 -0.326411 -0.083590 -0.009575 -0.125065 0.376738 0.059734 -0.005585 -0.085654 0.111499
-0.099688 0.147020 -0.419087 -0.042069 -0.241274 0.154339 -0.008625 -0.298928 0.060612 0.216670 -0.080013 -0.218985
-0.805539 0.298797 0.089364 0.071044 0.390878 0.167600 -0.101478 -0.017312 -0.260500 0.392749 0.184021 -0.258466 -0.222133
0.357018 -0.244508 0.221385 -0.012634 -0.073752 -0.409362 0.113296 0.048397 0.000424 0.146018 -0.060891 -0.139045 -0.180432
0.014984 0.023384 -0.032300 -0.161608 -0.188434 0.018036 0.023236 0.060335 -0.173066 0.053327 0.523037 -0.330135 -0.014888
-0.124564 0.046332 -0.124301 0.029865 0.144504 0.163142 -0.018653 -0.140519 0.060562 0.098858 -0.128970 0.762193 -0.230067
-0.226374 0.100086 0.367147 0.160035 0.148644 -0.087583 0.248333 -0.033163 -0.312134 0.162414 0.047267 0.383573 -0.271765
-0.019852 -0.033213 0.340789 0.151498 -0.195642 -0.105429 -0.172337 0.115681 0.033890 -0.026444 -0.048083 -0.039565 -0.159685
-0.211830 0.191293 0.049531 -0.008248 0.119094 0.091608 -0.077601 -0.050206 0.147080 -0.217278 -0.039298 -0.303386 0.543094
-0.198962 -0.122825 -0.135449 0.190148 0.262060 0.146498 -0.236863 0.140620 0.128250 -0.157921 -0.119241 0.059280 -0.003679
0.091986 0.105117 0.117597 -0.187521 -0.388895 0.166485 0.149918 0.066284 0.210502 0.484910 0.396106 -0.118060 -0.076609
-0.326138 -0.305618 -0.297695 -0.078404 -0.210814 0.423335 -0.377239 -0.323599 0.282586]
immune_system
*Large volume of data
*Billions of words in context
*Multiple passes over data
*Algorithms
*Word2Vec
*CBOW
*Skip-gram
*GloVe
*Linguistic terms with similar
distributions have similar meaning
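To make "words in context" concrete, the sketch below generates the training examples that the two Word2Vec variants learn from. Skip-gram predicts each context word from the center word, and CBOW predicts the center word from its context. It is a simplification: it assumes a fixed window and omits details of the actual tool such as subsampling. The function names are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: predict each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context words, target) examples: predict the center from context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


sentence = ["the", "gene", "is", "known"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Words that occur in similar contexts produce similar training pairs, which is how similar distributions end up as nearby vectors.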
*
T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
*
Image:
https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
*
Image:
https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc
*
*
Heart : Cardiovascular as Kidney:
*
Salmonella : Proteobacteria
Staphylococcus
*
Salmonella : Enterobacteriaceae as
Staphylococcus : Staphylococcaceae
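Analogies like these are usually answered with vector arithmetic: take vec(b) - vec(a) + vec(c) and return the nearest remaining word by cosine. The sketch below shows the mechanics only. The 3-dimensional vectors are made up for illustration and are not output from a trained model.

```python
import math

# Hypothetical toy vectors; real embeddings have 50 to 300+ dimensions.
TOY_VECTORS = {
    "salmonella":         [1.0, 0.0, 0.0],
    "enterobacteriaceae": [1.0, 0.0, 1.0],
    "staphylococcus":     [0.0, 1.0, 0.0],
    "staphylococcaceae":  [0.0, 1.0, 1.0],
    "proteobacteria":     [1.0, 0.0, 2.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a : b as c : ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("salmonella", "enterobacteriaceae", "staphylococcus",
              TOY_VECTORS))
```

The three query words are excluded from the candidates, since the nearest vector to the target is otherwise often one of the inputs.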
*
Image: http://u.cs.biu.ac.il/~yogo/nnlp.pdf
*
https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
*
* Non-linear Activation Function
*Sigmoid
*Hyperbolic tangent (tanh)
*Rectifier (ReLU)
* Word embeddings
* Window size
* Loss function
*Binary
*Multiclass
*Cross-entropy
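For reference, the three activation functions and a cross-entropy loss can be written with the standard library alone. This is a plain-Python sketch for single values and short lists, not how a framework implements them. Using softmax to turn scores into probabilities is an added assumption for the multiclass case.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1); written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Hyperbolic tangent; squashes x into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectifier: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_index])


scores = [2.0, 0.5, -1.0]          # hypothetical network outputs
probs = softmax(scores)
print(probs, cross_entropy(probs, 0))
```

The loss shrinks toward 0 as the probability given to the correct class approaches 1.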
*
Images: http://u.cs.biu.ac.il/~yogo/nnlp.pdf; http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/
*
Image: https://aclweb.org/anthology/P/P14/P14-2105.xhtml
*
Image: http://greg.org/archive/2010/07/05/the_planck_all-sky_survey.html
*
http://riotwire.com/column/immigrants-socialists-and-semantics-oh-my/
*
* Word2Vec – command line tool
* Gensim – Python topic modeling tool with
word2vec module
* GloVe (Global Vectors for Word Representation)
– command line tool
*
* Theano: Python CPU/GPU symbolic expression compiler
* Torch: Scientific framework for LuaJIT
* PyLearn2: Python deep learning platform
* Lasagne: lightweight framework on Theano
* Keras: Python library for working with Theano
* DeepDist: Deep Learning on Spark
* Deeplearning4J: Java and Scala, integrated with Hadoop and
Spark
*
*Deep Learning Bibliography - http://memkite.com/deep-learning-bibliography/
* Deep Learning Reading List –
http://deeplearning.net/reading-list/
*Kim, Yoon. "Convolutional neural networks for sentence
classification." arXiv preprint arXiv:1408.5882 (2014).
* Goldberg, Yoav. “A Primer on Neural Network Models for
Natural Language Processing”
http://u.cs.biu.ac.il/~yogo/nnlp.pdf
*

Text mining meets neural nets

  • 1. Dan Sullivan October 21, 2015 Portland, OR
  • 2. * * Introduction to Natural Language Processing and Text Mining * Linguistic and Statistical Approaches *Critiquing Classifier Results * A New Dawn: Deep Learning * What’s Next
  • 3. * * Enterprise Architect, Big Data and Analytics * Former Research Scientist, bioinformatics institute * Completing PhD in Computational Biology with focus on text mining *Author *Contact *dan@dsapptech.com *@dsapptech *Linkedin.com/in/dansullivanpdx
  • 7. Manual procedures are time-consuming and costly. Volume of literature continues to grow. Commonly used search techniques, such as keyword, similarity searching, metadata filtering, etc., can still yield volumes of literature that are difficult to analyze manually. Some success with popular tools, but with limitations.
  • 8. * * Linguistic (from 1960s) * Focus on syntax * Transformational Grammar * Sentence parsing *Statistical (from 1990s) * Focus on words, ngrams, etc. * Statistics and Probability * Related work in Information Retrieval * Topic Modeling and Classification * Deep Learning (from ~2006) * Focus on multi-layered neural net computing non-linear functions * Light on theory, heavy on engineering * Multiple NLP tasks
  • 13. * Stephen H. Chen et al. Physiol. Genomics 2005;22:257-267
  • 15. * Technique for identifying dominant themes in documents * Does not require training * Multiple Algorithms * Probabilistic Latent Semantic Indexing (PLSI) * Latent Dirichlet Allocation (LDA) * Assumptions * Documents are about a mixture of topics * Words used in a document are attributable to a topic Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
  • 16. Example topics: (Debt, Law, Graduation); (Debt, EU, Greece, Euro); (EU, Greece, Negotiations, Varoufakis). Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
  • 17. * Topics are represented by words; documents are about a set of topics * Doc 1: 50% politics, 50% presidential * Doc 2: 25% CPU, 30% memory, 45% I/O * Doc 3: 30% cholesterol, 40% arteries, 30% heart * Learning Topics * Assign each word to a topic * For each word and topic, compute: * Probability of topic given a document, P(topic|doc) * Probability of word given a topic, P(word|topic) * Reassign word to a new topic with probability P(topic|doc) × P(word|topic) * Reassignment is based on the probability that topic T generated the use of word W
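
The learning loop on slide 17 can be sketched as a toy Gibbs-style sampler. This is a minimal sketch under stated assumptions, not the presenter's code: the function name `learn_topics`, the smoothing constants `alpha` and `beta`, and the sample documents are all made up for illustration.

```python
import random
from collections import defaultdict


def learn_topics(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy topic learner: repeatedly reassign each word to a topic.

    docs is a list of token lists. alpha and beta are smoothing
    constants (assumed values, not taken from the slides).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Assign each word occurrence to a random topic to start.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                 # counts per (doc, topic)
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # counts per (topic, word)
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment before re-estimating.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by a value proportional to
                # P(topic|doc) * P(word|topic).
                weights = []
                for k in range(num_topics):
                    p_topic_doc = doc_topic[d][k] + alpha
                    p_word_topic = (topic_word[k][w] + beta) / (topic_total[k] + beta * vocab_size)
                    weights.append(p_topic_doc * p_word_topic)
                # Reassign the word by sampling a topic from those weights.
                t = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word


docs = [
    ["cpu", "memory", "cpu", "io"],
    ["heart", "arteries", "cholesterol", "heart"],
    ["memory", "io", "cpu", "heart"],
]
assignments, doc_topic, topic_word = learn_topics(docs, num_topics=2)
```

Each row of `doc_topic` gives the per-document topic counts, which can be normalised into mixtures such as those shown for Doc 1 to Doc 3.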
  • 18. Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
  • 19. * 3 Key Components * Data * Representation scheme * Algorithms * Data * Positive examples – Examples from representative corpus * Negative examples – Randomly selected from same publications * Representation * TF-IDF * Vector space representation * Cosine of vectors measure of similarity * Algorithms * Supervised learning * SVMs * Ridge Classifier * Perceptrons * kNN * SGD Classifier * Naïve Bayes * Random Forest * AdaBoost *
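
As an illustration of the representation and algorithm components above, the following sketch pairs cosine similarity with kNN, one of the listed supervised algorithms. The term weights, labels, and function names are hypothetical, and the vectors are stored as sparse dicts purely for brevity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query, labeled_vectors, k=3):
    """Label a query by majority vote among its k most similar neighbours."""
    ranked = sorted(labeled_vectors, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up term weights standing in for TF-IDF vectors.
training = [
    ({"toxin": 1.0, "secretion": 0.8}, "positive"),
    ({"toxin": 0.6, "colonization": 0.9}, "positive"),
    ({"variance": 1.0, "transformed": 0.7}, "negative"),
    ({"plasmid": 0.9, "cloned": 0.8}, "negative"),
]
query = {"toxin": 0.5, "colonization": 0.4}
prediction = knn_predict(query, training, k=3)
```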
  • 20. * Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
  • 21. * Term Frequency (TF): tf(t,d) = number of occurrences of t in d, where t is a term and d is a document * Inverse Document Frequency (IDF): idf(t,D) = log(N / |{d in D : t in d}|), where D is the set of documents and N is the number of documents * TF-IDF = tf(t,d) × idf(t,D) * TF-IDF is * large when the term frequency in the document is high and the term appears in few documents * small when the term appears in many documents
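
The TF-IDF definitions on slide 21 translate almost directly into code. A minimal sketch, assuming the natural logarithm (the slide does not state a base), tokenised documents as word lists, and a weight of 0 for terms that appear in no document; the sample documents are invented.

```python
import math


def tf(term, doc):
    """Number of occurrences of term in doc (doc is a list of tokens)."""
    return doc.count(term)


def idf(term, docs):
    """log(N / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get weight 0
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


docs = [
    ["espb", "is", "a", "virulence", "factor"],
    ["the", "gene", "is", "a", "known", "virulence", "gene"],
    ["data", "is", "log", "transformed"],
]
gene_weight = tf_idf("gene", docs[1], docs)  # frequent in one document, rare overall
is_weight = tf_idf("is", docs[0], docs)      # appears in every document
```

Here `gene_weight` is large because "gene" occurs twice in one document and nowhere else, while `is_weight` is zero because "is" occurs in every document.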
  • 22. One Hot Representation:
        The         1 0 0 0 0 0 0
        EspB        0 1 0 0 0 0 0
        gene        0 0 1 0 0 0 0
        is          0 0 0 1 0 0 0
        a           0 0 0 0 1 0 0
        known       0 0 0 0 0 1 0
        virulence   0 0 0 0 0 0 1
      TF-IDF Representation:
                    translocates  reduced  levels  of      EspB   host    cell
        Sentence 1  0.193         0.2828   0.078   0.0001  0.389  0.0144  0.011
        Sentence 2  0             0.0091   0.0621  0       0      0       0
        Sentence 3  0             0        0       0       0.028  0.0113  0
        Sentence 4  0.021         0        0       0       0      0       0
  • 23. * Bag of words model * Ignores structure (syntax) and meaning (semantics) of sentences * Representation vector length is the size of set of unique words in corpus * Stemming used to remove morphological differences * Each word is assigned an index in the representation vector, V * The value V[i] is non-zero if word appears in sentence represented by vector * The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus *
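
The indexing scheme on slide 23 can be sketched as follows. This is a simplified illustration: it uses raw counts as the non-zero values and skips stemming, the example sentences are invented, and the function names are hypothetical.

```python
def build_index(sentences):
    """Assign each unique word an index in the representation vector V."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            index.setdefault(word, len(index))
    return index


def one_hot(word, index):
    """Vector with a single 1 at the word's index."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec


def bag_of_words(sentence, index):
    """Count vector: V[i] is non-zero only if word i occurs in the sentence."""
    vec = [0] * len(index)
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] += 1
    return vec


sentences = ["The EspB gene is a known virulence gene", "EspB is a protein"]
index = build_index(sentences)
gene_vector = one_hot("gene", index)
sentence_vector = bag_of_words(sentences[0], index)
```

The vector length equals the number of unique words in the corpus, and word order is discarded, which is the loss of syntax the slide refers to.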
  • 24. Support Vector Machine (SVM) is a large margin classifier. Commonly used in text classification. Initial results based on life sciences sentence classifier. Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
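
To make the large margin idea concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the regularised hinge loss. It is not the classifier used for the results in this deck; the hyperparameters, function names, and 2-D toy points are assumptions chosen for illustration.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Subgradient descent on reg/2 * ||w||^2 + hinge loss.

    labels must be +1 or -1. Hyperparameter values are illustrative.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: hinge term is active.
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Only the regulariser contributes; it shrinks w slightly.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (made up).
points = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.5), (-2.0, -2.0), (-3.0, -1.5), (-2.5, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

In text classification the input points would be high-dimensional TF-IDF vectors rather than 2-D tuples.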
  • 26. Non-VF, Predicted VF: * “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.” * “Data were log-transformed to correct for heterogeneity of the variances where necessary.” * “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.” VF, Predicted Non-VF: * “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.” * “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.” * “The DsbLI system also comprises a functional redox pair”
  • 27. Adding more examples is not likely to substantially improve results, as seen in the error curves. [Chart: training error and validation error curves; y-axis from 0 to 0.5, x-axis from 0 to 10,000]
  • 28. 8 alternative algorithms. Select the 10,000 most important features using chi-square.
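
Chi-square feature selection scores each term by how strongly its presence is associated with the class label, then keeps the top-scoring terms. A minimal sketch using the standard 2x2 contingency formula; the tiny corpus, the function names, and `k=2` are illustrative stand-ins for the 10,000 features mentioned above.

```python
def chi_square(term, docs, labels):
    """Chi-square statistic for the 2x2 table of term presence vs. class."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label:
            a += 1      # positive documents containing the term
        elif present:
            b += 1      # negative documents containing the term
        elif label:
            c += 1      # positive documents without the term
        else:
            d += 1      # negative documents without the term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocabulary = sorted({term for doc in docs for term in doc})
    ranked = sorted(vocabulary, key=lambda t: chi_square(t, docs, labels), reverse=True)
    return ranked[:k]


# Each document is a set of terms; labels mark the positive class (made-up data).
docs = [{"toxin", "gene"}, {"toxin", "host"}, {"data", "gene"}, {"data", "plasmid"}]
labels = [True, True, False, False]
top_terms = select_features(docs, labels, k=2)
```

Terms that occur only in one class ("toxin", "data") score highest, while a term split evenly across classes ("gene") scores zero.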
  • 29. * Increase quantity of data (not always helpful; see error curves) * Improve quality of data * Utilize multiple supervised algorithms, ensemble and non-ensemble * Use unlabeled data and semi-supervised techniques * Feature Selection * Parameter Tuning * Feature Engineering * Given: * High quality data in sufficient quantity * State of the art machine learning algorithms * How to improve results: Change Representation? *
  • 30. * TF-IDF * Loss of syntactic and semantic information * No relation between term index and meaning * No support for disambiguation * Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties * Ideal Representation * Captures semantic similarity of words * Does not require feature engineering * Minimal pre-processing, e.g. no mapping to ontologies * Improves precision and recall
  • 32. * *Dense vector representation (n = 50 … 300 or more) *Capture semantics – similar words close by cosine measure *Captures language features *Syntactic relations *Semantic relations
  • 33. * [0.160610 -0.547976 -0.444522 -0.037896 0.044305 0.245423 -0.261498 0.000294 -0.275621 -0.021201 -0.432955 0.388905 0.106494 0.405797 -0.159357 -0.073897 0.177182 0.043535 0.600987 0.064762 -0.348964 0.189289 0.650318 0.112554 0.374456 -0.227780 0.208623 0.065362 0.235401 -0.118003 0.032858 -0.309767 0.024085 -0.055148 0.158807 0.171749 -0.153825 0.090301 0.033275 0.089936 0.187864 -0.044472 0.421533 0.209217 -0.142092 0.153070 -0.168291 -0.052823 -0.090984 0.018695 -0.265503 -0.055572 -0.212252 -0.326411 -0.083590 -0.009575 -0.125065 0.376738 0.059734 -0.005585 -0.085654 0.111499 -0.099688 0.147020 -0.419087 -0.042069 -0.241274 0.154339 -0.008625 -0.298928 0.060612 0.216670 -0.080013 -0.218985 -0.805539 0.298797 0.089364 0.071044 0.390878 0.167600 -0.101478 -0.017312 -0.260500 0.392749 0.184021 -0.258466 -0.222133 0.357018 -0.244508 0.221385 -0.012634 -0.073752 -0.409362 0.113296 0.048397 0.000424 0.146018 -0.060891 -0.139045 -0.180432 0.014984 0.023384 -0.032300 -0.161608 -0.188434 0.018036 0.023236 0.060335 -0.173066 0.053327 0.523037 -0.330135 -0.014888 -0.124564 0.046332 -0.124301 0.029865 0.144504 0.163142 -0.018653 -0.140519 0.060562 0.098858 -0.128970 0.762193 -0.230067 -0.226374 0.100086 0.367147 0.160035 0.148644 -0.087583 0.248333 -0.033163 -0.312134 0.162414 0.047267 0.383573 -0.271765 -0.019852 -0.033213 0.340789 0.151498 -0.195642 -0.105429 -0.172337 0.115681 0.033890 -0.026444 -0.048083 -0.039565 -0.159685 -0.211830 0.191293 0.049531 -0.008248 0.119094 0.091608 -0.077601 -0.050206 0.147080 -0.217278 -0.039298 -0.303386 0.543094 -0.198962 -0.122825 -0.135449 0.190148 0.262060 0.146498 -0.236863 0.140620 0.128250 -0.157921 -0.119241 0.059280 -0.003679 0.091986 0.105117 0.117597 -0.187521 -0.388895 0.166485 0.149918 0.066284 0.210502 0.484910 0.396106 -0.118060 -0.076609 -0.326138 -0.305618 -0.297695 -0.078404 -0.210814 0.423335 -0.377239 -0.323599 0.282586] immune_system
  • 34. * Large volume of data * Billions of words in context * Multiple passes over data * Algorithms * Word2Vec * CBOW * Skip-gram * GloVe * Linguistic terms with similar distributions have similar meaning * T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
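
Both Word2Vec variants learn from words in context. The sketch below shows only how training examples are drawn from a window around each word, not the neural network that consumes them; the function names, window size, and sample sentence are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each target word is paired with every word in its window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words are used to predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


tokens = "the immune system responds to infection".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
```

Words that occur in similar windows end up with similar training signals, which is how the distributional assumption on this slide enters the model.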
  • 43. * Salmonella : Enterobacteriaceae as Staphylococcus : Staphylococcaceae
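
Analogies like this one are typically answered with vector arithmetic: take vec(b) - vec(a) + vec(c) and find the nearest remaining word by cosine similarity. A minimal sketch, assuming hand-made 3-dimensional vectors that merely stand in for trained embeddings (real ones have 50 to 300 or more dimensions); the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


# Toy vectors invented for this example; not output from a trained model.
vectors = {
    "Salmonella": [1.0, 0.0, 0.2],
    "Enterobacteriaceae": [1.0, 1.0, 0.2],
    "Staphylococcus": [0.0, 0.0, 1.0],
    "Staphylococcaceae": [0.0, 1.0, 1.0],
    "immune_system": [0.5, -0.5, 0.1],
}
answer = analogy("Salmonella", "Enterobacteriaceae", "Staphylococcus", vectors)
```

In these toy vectors the second dimension plays the role of a "genus to family" direction, so the same offset maps Staphylococcus to Staphylococcaceae.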
  • 47. * Non-linear Activation Function * Sigmoid * Hyperbolic tangent (tanh) * Rectifier (ReLU) * Word embeddings * Window size * Loss function * Binary * Multiclass * Cross-entropy
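
The activation functions and the cross-entropy loss named above have short closed forms. The sketch below writes them out for scalars and small lists; the softmax helper is an addition needed to turn scores into probabilities for the multiclass case, and all function names are illustrative.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1); written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Hyperbolic tangent: squashes x into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectifier: passes positive values through and zeroes out the rest."""
    return max(0.0, x)


def softmax(scores):
    """Turns a list of scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Multiclass cross-entropy loss for a single example."""
    return -math.log(probabilities[true_index])


probabilities = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probabilities, true_index=0)
```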
  • 55. * Word2Vec – command line tool * Gensim – Python topic modeling tool with word2vec module * GloVe (Global Vectors for Word Representation) – command line tool
  • 56. * Theano: Python CPU/GPU symbolic expression compiler * Torch: Scientific framework for LuaJIT * PyLearn2: Python deep learning platform * Lasagne: lightweight framework on Theano * Keras: Python library for working with Theano * DeepDist: Deep Learning on Spark * Deeplearning4J: Java and Scala, integrated with Hadoop and Spark
  • 57. * Deep Learning Bibliography – http://memkite.com/deep-learning-bibliography/ * Deep Learning Reading List – http://deeplearning.net/reading-list/ * Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014). * Goldberg, Yoav. “A Primer on Neural Network Models for Natural Language Processing” http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Editor's notes

  1. Linguistic and statistical approaches work at the symbol level. Deep learning works at the subsymbolic or representation level (representation theory).
  2. Symbolic: a well-formed, unambiguous definition is associated with each symbol. Sub-symbolic: more like Wittgenstein arguing that words do not need precise definitions to be meaningful.
  3. Manually crafted rules. Early deterministic parser had 50-80 rules?
  4. Manual. Comprehensive: cover all parts of the domain. Accurate: reflect relationships. Unambiguous.
  5. (1) Process used in VF. (2) No idea why this was labeled as a 1. (3) Probably from a Methods section; refers to resistance cassette. (4)
  6. Alanine, isoleucine and valine are all hydrophobic. Arginine is charged, as is aspartic acid.
  7. Proteobacteria is a phylum. Taxonomic ranks: Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus, Species.
  8. ReLU better than TanH better than Sigmoid
  9. Manifold Hypothesis
  10. Distributional Semantics exist, based on linear algebra. What new operations can be defined?