This presentation covers Natural Language Processing using Java. At Museaic, a music intelligence platform, we spent time figuring out how to extract central themes from song lyrics. In this talk, I will cover some of the tasks involved in natural language processing, such as named entity recognition, word sense disambiguation, and concept/theme extraction. I will also cover libraries available in Java such as stanford-nlp and dbpedia-spotlight, and graph approaches using WordNet and semantic databases. This talk will help people understand text processing beyond simple keyword approaches and provide them with some of the best techniques/libraries for it in the Java world.
2. Agenda
• Text Retrieval and Search
• Implementing Search
• Evaluating Search Results
• NLP - Document Level Analysis
• Parsing and Part of Speech Tagging
• Entity Extraction
• Word Sense Disambiguation
• Concept Extraction
• Concept Polarity
• NLP - Sentence Level Analysis
• Document Summarization
• Dependency Analysis and Coreference
• Example Question Parsing System
• Sentiment Analysis
• Final Thoughts/Questions
3. Text Retrieval and Search
• A collection of text documents exists in a system. This is called the corpus.
• The documents are preprocessed and indexed before query time.
• User performs a query - the query defines one or more concepts that the user is interested in, e.g. “Thai restaurant in Atlanta”
• The search engine is expected to retrieve the most relevant documents based on a ranking function
• The search engine can also apply some heuristics based on user
feedback (such as always ignoring a specific document) to further
prune the results.
4. Search - Vector Space Model
• Term: A word or set of words (n-grams)
• Each term defines one dimension
• Query Vector: q = (X1,…,Xn)
• Document Vector: d = (Y1,…,Ym)
• relevance (q,d) ~ similarity(q,d)
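The relevance(q,d) ~ similarity(q,d) idea can be sketched with cosine similarity over simple term-count vectors. This is a simplification of what real engines do (they weight terms, e.g. with tf-idf), but the shape is the same:

```java
// Toy vector space model: score documents by cosine similarity to a query.
// Vectors are raw term-count maps over whitespace-tokenized, lower-cased text.
import java.util.*;

public class VectorSpaceModel {
    static Map<String, Integer> toVector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            v.merge(term, 1, Integer::sum);
        }
        return v;
    }

    // similarity(q, d) = (q . d) / (|q| * |d|)
    static double cosine(Map<String, Integer> q, Map<String, Integer> d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0);
            qNorm += e.getValue() * e.getValue();
        }
        for (int y : d.values()) dNorm += y * y;
        if (qNorm == 0 || dNorm == 0) return 0;
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = toVector("thai restaurant atlanta");
        double s1 = cosine(query, toVector("best thai restaurant in atlanta"));
        double s2 = cosine(query, toVector("used car dealership"));
        System.out.println(s1 > s2); // the Thai restaurant document ranks higher
    }
}
```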
5. Preparing Text for Search
• Tokenization: For each document, we split it into paragraphs, split paragraphs into
sentences and sentences into words.
• Word Normalization:
• Index and query terms should have the same form, e.g. match U.S.A. and USA
• Usually lower cased
• Stop word Removal: An optional step where a predefined list of stop words is removed. More important for small corpora
• Stemming - Reduce terms to their stems
• Language dependent - in English, a word can be split into two parts, the stem and the affix
• automate(s), automatic, automation => automat; plural forms like cats => cat
• The “stem” may not be an actual word, e.g. consolidating => consolid
http://snowball.tartarus.org/algorithms/english/stemmer.html
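To make affix stripping concrete, here is a toy stemmer sketch. The suffix list and the minimum-stem-length threshold are illustrative assumptions of mine, not a real algorithm; production systems should use a proper implementation such as the Snowball English stemmer linked above:

```java
// Toy suffix-stripping stemmer. The resulting "stem" need not be a real word.
public class ToyStemmer {
    static String stem(String word) {
        String w = word.toLowerCase();
        // Try longer suffixes first; keep at least 3 characters of stem.
        String[] suffixes = {"ational", "ation", "ing", "ion", "es", "s", "ic", "e"};
        for (String suffix : suffixes) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("cats"));      // cat
        System.out.println(stem("automates")); // automat
    }
}
```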
6. The inverted index part of the image is taken from http://butchiso.com/assets/posts/mysql-full-text-search-p3/inverted_index.png
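The inverted index pictured on this slide can be sketched as a map from term to posting list (document ids). This is a minimal in-memory version; real indexes also store positions and weights:

```java
// Minimal inverted index: term -> sorted set of document ids.
import java.util.*;

public class InvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Documents containing every query term (intersection of posting lists).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "The rose is red");
        index.add(2, "Red shoe");
        System.out.println(index.search("red"));      // [1, 2]
        System.out.println(index.search("red rose")); // [1]
    }
}
```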
7. Search Example
• For any given term in the query:
• Term Frequency (TF) - The number of times a term occurs in a document. Normalize this by the
total number of terms in a document.
• Document Frequency (DF) - The number of documents that the term occurs in
• Inverse Document Frequency (IDF) - Inverse of above. So, it will be high for less frequent terms
and low for more frequent terms.
• Simple ranking of documents for a query
• For all the terms in the query, sum up the product of TF and IDF. This can be used to rank
the results with the documents with the highest tf-idf on top.
• Example:
• Document 1 = “The rose is red”
• Document 2 = “Red shoe”
• Query 1 = “Red” => Document 1 and Document 2 rank equally because both documents have the same number of terms after removing stop words
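The simple tf-idf ranking described above can be sketched as follows (the toy corpus is mine; note how idf, here log(N/df), boosts rarer terms):

```java
// Rank documents for a query by summing tf * idf over the query terms.
import java.util.*;

public class TfIdfRanker {
    // tf(term, doc) = count(term in doc) / |doc|  (normalized by document length)
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // idf(term) = log(N / df); high for rare terms, low for common ones.
    static double idf(String term, List<List<String>> docs) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0 : Math.log((double) docs.size() / df);
    }

    // score(query, doc) = sum over query terms of tf * idf
    static double score(List<String> query, List<String> doc, List<List<String>> docs) {
        double s = 0;
        for (String term : query) s += tf(term, doc) * idf(term, docs);
        return s;
    }

    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            tokens("the rose is red"),
            tokens("red shoe"),
            tokens("blue shoe"));
        List<String> query = tokens("red shoe");
        for (List<String> d : docs) {
            System.out.println(d + " -> " + score(query, d, docs));
        }
    }
}
```

"red shoe" scores highest against the second document, since it matches both query terms and is the shortest document.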
8. Evaluating Search Results
• Search results can be evaluated by two metrics that encourage two kinds of algorithm behavior:
• High Precision - Very few false positives. Critical for systems that cannot make a
wrong recommendation.
• High Recall - Very few misses. Critical for systems where every missed opportunity
needs to be minimized but there is a low cost associated with a false positive.
• F-Measure - The harmonic mean of precision and recall. It tries to balance the explorative nature of search with the preciseness of the results.
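A minimal sketch of these three metrics. The worked example reuses row 2 of the DBpedia Spotlight results later in this deck (expected {Kodachrome, Nikon}, actual {Nikon}):

```java
// Precision, recall, and F-measure over retrieved vs. relevant item sets.
import java.util.*;

public class SearchEval {
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / retrieved.size();
    }

    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0;
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / relevant.size();
    }

    // Harmonic mean of precision and recall.
    static double fMeasure(double p, double r) {
        return (p + r) == 0 ? 0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<>(Arrays.asList("Kodachrome", "Nikon"));
        Set<String> retrieved = new HashSet<>(Collections.singletonList("Nikon"));
        double p = precision(retrieved, relevant); // 1.0 - no false positives
        double r = recall(retrieved, relevant);    // 0.5 - one miss
        System.out.printf("P=%.2f R=%.2f F=%.2f%n", p, r, fMeasure(p, r));
    }
}
```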
10. Section Summary
• In this section, we applied NLP techniques across an entire
corpus. This is where frameworks like map reduce play an
important role.
• The NLP techniques by themselves were shallow but were able to
implicitly handle compound words and stop words.
• Introduced a simple formula for ranking and retrieving search results. Real-world systems involve more complex probabilistic models like BM25 that follow the same principles.
• Reviewed some techniques for evaluating search algorithms.
These simple approaches can also be used for other NLP and
machine learning problems.
http://en.wikipedia.org/wiki/Okapi_BM25
12. Extracting Concepts From Text
• We apply various NLP techniques to analyze the contents of a document. Some examples are:
• Mentions of people, places, locations etc.
• Central Themes or concepts in the document
• This is different from search
• Search follows a pull model where the users take initiative in querying the system for
relevant documents.
• In concept extraction, we can infer abstract concepts from text and push them to interested users. We may also be able to infer the concepts a user is interested in based on the content they consume.
14. Sentence Segmentation
• Periods are ambiguous - Abbreviations, decimals etc.
• !, ? - Less ambiguous
• Classifier - rules (using case, punctuation rules etc.), ML etc.
• StanfordNLP sentence detection and tokenizer
• Trained on the Penn Treebank dataset and hence suited to more formal English.
• OpenNLP has a sentence detection and tokenizer as well.
• Both these libraries perform pretty well for English and there is not
much to choose between them. They can also be retrained.
http://nlp.stanford.edu/software/tokenizer.shtml
https://opennlp.apache.org
https://github.com/dpdearing/nlp
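To show why periods are ambiguous, here is a toy rule-based splitter. The abbreviation list is an illustrative assumption of mine; in practice, use the StanfordNLP or OpenNLP detectors linked above:

```java
// Toy rule-based sentence splitter: break on !, ?, or a period that does not
// end a known abbreviation. Real detectors use trained classifiers instead.
import java.util.*;

public class ToySentenceSplitter {
    private static final Set<String> ABBREVIATIONS =
        new HashSet<>(Arrays.asList("dr.", "mr.", "mrs.", "e.g.", "u.s.a."));

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (current.length() > 0) current.append(' ');
            current.append(word);
            boolean endsSentence =
                word.endsWith("!") || word.endsWith("?")
                || (word.endsWith(".") && !ABBREVIATIONS.contains(word.toLowerCase()));
            if (endsSentence) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        List<String> result = split("Dr. Smith lives in the U.S.A. now. Really? Yes!");
        System.out.println(result); // 3 sentences, despite 7 periods
    }
}
```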
15. Part of Speech Tagging using
StanfordNLP
• StanfordNLP is quite accurate (~90%) and has
been trained using a maximum entropy tagger.
TAG - POS
DT - Determiner
JJ, JJR, JJS - Adjective
NN, NNS - Noun
NNP, NNPS - Proper Noun
PRP - Pronoun
VB - Verb
IN - Preposition
CC - Conjunction
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
16. Named Entity Recognition
• Named Entity Recognition is the NLP task of recognizing proper nouns in a
document.
• Named Entity Recognition consists of three steps:
• Spotting: Statistical model pre-trained on well known corpus data help us
“spot” entities in the text.
• Disambiguation: Once spots are found, we may need to disambiguate them (e.g. there are multiple entities with the same name and the correct URL needs to be retrieved)
• Filtering: Remove named entities whose types we are not interested in or
entities that have very few links pointing to them.
• At the end of NER, we get back a set of URLs of resources that were referenced in the text.
16
17. Spotting is the process of identifying and assigning classes to named
entities.
STANFORDNLP vs. OPENNLP

StanfordNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION>, which is located in <LOCATION>California</LOCATION>.
OpenNLP: I go to school at <ORGANIZATION>Stanford University</ORGANIZATION> which is located in <LOCATION>California</LOCATION>

StanfordNLP: Schooled in the <LOCATION>Philippines</LOCATION>
OpenNLP: Schooled in the <LOCATION>Philippines</LOCATION>

StanfordNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?
OpenNLP: Where does <ORGANIZATION>Toyota</ORGANIZATION> have its factories?

StanfordNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?
OpenNLP: What does <ORGANIZATION>GM</ORGANIZATION> produce?

StanfordNLP: Is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.
OpenNLP: is <ORGANIZATION>GM</ORGANIZATION> moving its jobs to <LOCATION>Atlanta</LOCATION>.

StanfordNLP: I work at <ORGANIZATION>Chevy</ORGANIZATION>.
OpenNLP: I work at Chevy.

StanfordNLP: I work at <ORGANIZATION>chevy</ORGANIZATION>.
OpenNLP: I work at chevy.

StanfordNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car
OpenNLP: I am fixing a <ORGANIZATION>General Motors</ORGANIZATION> car

StanfordNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
OpenNLP: You told me I was like the <LOCATION>Dead Sea</LOCATION>
18. Dbpedia Spotlight
• Dbpedia Spotlight is an API that can be used to perform all 3 steps of NER
• Spots - It identifies spots using a statistical backed model.
• Spots are disambiguated based on other references in the document
• URIs are retrieved for each of the identified named entities. These are usually dbpedia URLs with references to freebase and other ontologies.
• Provides API to perform the steps of NER separately as well
• Spotting - Identifies only the spots
• Disambiguate - Performs disambiguation based on different options provided
• Annotate - Performs all 3 steps of NER and provides results
• Candidates - Provides a ranked list of candidates for each spot
https://github.com/dbpedia-spotlight/dbpedia-spotlight
19. Dbpedia Spotlight Results
1. Here We Stand (Talking Heads)
Expected: http://dbpedia.org/resource/Pizza_Hut, http://dbpedia.org/resource/7-Eleven, http://dbpedia.org/resource/Dairy_Queen
Actual: http://dbpedia.org/resource/7-Eleven
Precision: 1.0, Recall: 0.33, F-Measure: 0.5

2. Kodachrome (Paul Simon)
Expected: http://dbpedia.org/resource/Kodachrome, http://dbpedia.org/resource/Nikon
Actual: http://dbpedia.org/resource/Nikon
Precision: 1.0, Recall: 0.5, F-Measure: 0.66

3. Brand New Cadillac (The Clash)
Expected: http://dbpedia.org/resource/Cadillac
Actual: http://dbpedia.org/resource/Cadillac
Precision: 1.0, Recall: 1.0, F-Measure: 1.0

4. A Certain Romance (Arctic Monkeys)
Expected: http://dbpedia.org/resource/Reebok, http://dbpedia.org/resource/Converse_(shoe_company)
Actual: http://dbpedia.org/resource/Reebok, http://dbpedia.org/resource/Converse_(shoe_company)
Precision: 1.0, Recall: 1.0, F-Measure: 1.0

5. My Humps (Black Eyed Peas)
Expected: http://dbpedia.org/resource/Prada, http://dbpedia.org/resource/Gucci, http://dbpedia.org/resource/Fendi, http://dbpedia.org/resource/Dolce_&_Gabbana, http://dbpedia.org/resource/True_Religion
Actual: http://dbpedia.org/resource/Prada, http://dbpedia.org/resource/Gucci
Precision: 1.0, Recall: 0.33, F-Measure: 0.5

Mean: Precision 1.0, Recall 0.63, F-Measure 0.73
20. Querying the Semantic Web
• SPARQL is a query language to interact with the semantic web.
• SPARQL is the equivalent of SQL for RDF stores.
• Ontologies provide knowledge about different entities usually
in the form of a subject-predicate-object triple.
• The English version of dbpedia contains 4.58 million things with 584 million facts.
SELECT ?industry WHERE { <http://dbpedia.org/resource/Fendi> dbprop:industry ?industry }
http://dbpedia.org/sparql
http://wiki.dbpedia.org/Datasets#h434-9
22. Extracting Concepts using Word Senses
http://www.picgifs.com/clip-art/activities/sweating/clip-art-sweating-328953.jpg
23. Word Sense Disambiguation
• For many words, multiple senses of the word exist based on the context, e.g. there are multiple senses for the word “bank” (even within the same part of speech).
• Extremely difficult for computers. A combination of context and common sense information makes this quite easy for humans.
• Word Sense Disambiguation can be useful for
• Machine translation between languages (surface form loses value during
translation because the only thing that matters is the sense of the word)
• Information Retrieval - Correct interpretation of the query. However this can
be overcome by providing enough terms to only retrieve relevant documents.
• Automatic annotation of text
• Measuring semantic relatedness between documents.
http://babelnet.org/
https://code.google.com/p/dkpro-wsd/wiki/LSRs
24. • Solving the Word Sense Disambiguation Problem
• Need an inventory of knowledge that can be used to disambiguate words. Usually a graph
structure. Some examples are:
• WordNet
• Wikipedia
• Yago
• Freebase
• ConceptNet
• Algorithms to traverse the inventory to retrieve the most likely disambiguation of a word. These are usually graph algorithms that work on a measure of centrality like degree centrality.
• Assumptions:
• The document has enough context to disambiguate the word correctly. If not, we would default to the most frequent sense of the word.
• Single sense per discourse
25. WordNet
• WordNet is a hierarchically organized lexical database widely used in NLP applications. Started at
Princeton in 1985.
• Contains nouns, verbs, adjectives and adverbs
• Words are separated into senses and are represented as synsets.
• The noun “bank” can have multiple senses based on the context (e.g. bank of a river, financial institution etc.)
• Synsets are connected by well defined semantic relationships
• The majority of WordNet relations connect words from the same part of speech.
• Can be accessed in Java using the extJWNL library
PART OF SPEECH - UNIQUE STRINGS
Noun - 117,798
Verb - 11,529
Adjective - 22,479
Adverb - 4,481
http://extjwnl.sourceforge.net/
26. WordNet Synsets
Synset format => baseform#pos#index
bank#n#1 -> river bank
bank#n#2 -> Financial institution
bank#v#3 -> bank with a financial institution
http://wordnetweb.princeton.edu/perl/webwn
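The baseform#pos#index key format above is easy to work with programmatically. A minimal sketch (the SynsetKey class name is mine, not part of any WordNet library):

```java
// Parse a synset key of the form baseform#pos#index, e.g. "bank#n#2".
public class SynsetKey {
    final String baseForm;
    final char pos;    // n (noun), v (verb), a (adjective), r (adverb)
    final int index;   // 1-based sense number

    SynsetKey(String key) {
        String[] parts = key.split("#");
        if (parts.length != 3) throw new IllegalArgumentException("bad key: " + key);
        baseForm = parts[0];
        pos = parts[1].charAt(0);
        index = Integer.parseInt(parts[2]);
    }

    public static void main(String[] args) {
        SynsetKey k = new SynsetKey("bank#n#2");
        // bank / n / 2 -> the "financial institution" sense
        System.out.println(k.baseForm + " / " + k.pos + " / " + k.index);
    }
}
```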
27. WordNet Relationships
• Hypernym - Defines a superordinate relationship.
• Motor vehicle is a hypernym of car
• Hyponym - Subordinate relationship
• Mango is a hyponym of fruit
• The root node of nouns is “entity”
• Other relationships: InstanceOf, Synonyms/Antonyms, Meronym (PartOf) etc.
29. Accessing WordNet using extJWNL
• Download WordNet 3.0 dataset
• Use the properties file to point to the location of WordNet
• on the file system or database
• Lemmatization - Needed to get the base form of a word
(different from stemming) using the WordNet dictionary.
• cat and cats have same lemma
val dictionary = Dictionary.getInstance(new FileInputStream("data/file_properties.xml"))
def getBaseForm(pos: POS, word: String): String = {
dictionary.getMorphologicalProcessor.lookupBaseForm(pos, word.toLowerCase)
}
http://extjwnl.sourceforge.net/
30. WSD using WordNet
• Example 1 - “I am going to the bank”
• “bank” by itself usually just defaults to bank#n#1
• Example 2 - “What is the difference between a bank
and a credit union?”
• Credit Union only has one sense - credit_union#n#1
• Because credit union is present, “bank” is
disambiguated to “bank#n#2”
https://code.google.com/p/dkpro-wsd/wiki/LSRs
31. Concept Graph
• WordNet does not capture any common sense information, e.g. bank (financial institution) and money do not have a close relationship in WordNet.
• It is possible to use other resources like ConceptNet that map common sense knowledge to WordNet (and ontologies like dbpedia), e.g. we can download mappings for concepts like Money, Love, Sports, Family etc.
• Another option is to deploy a custom concept graph:
• Deploy WordNet onto a Graph database. That forms the base graph.
• Deploy custom concept mapping to the WordNet synsets.
• Add mappings for relevant wikipedia (dbpedia) categories
http://conceptnet5.media.mit.edu/data/5.3/c/en/family?limit=1000
34. Concept Polarity
• SentiWordNet is a lexical resource for opinion mining and
sentiment analysis
• SentiWordNet provides sentiment values for the different WordNet synsets. For each synset in WordNet, SentiWordNet assigns it scores on 3 dimensions - positivity, negativity and objectivity.
• Once the central concepts are found, we can extract the polarity of
the concepts.
• Example:
• “They are really happy to be here” => happy#a#1 has a very
positive polarity.
http://sentiwordnet.isti.cnr.it/
35. Section Summary
• Went beyond surface forms and analyzed the concepts
contained in documents.
• The approach was still mostly bag-of-words, meaning that the structure of the individual sentences did not matter.
• The approaches in tandem with common sense
knowledge sources help in extracting concepts from
documents.
• It also allows documents to be compared based on
semantic similarity measures.
37. Document Summarization
• Objective - Reduce the document in order to create a summary that retains the most
important points of the original document.
• Two Approaches:
• Extractive: Extract the sentences that are most representative of the content of the
document.
• Generative: Generate a summary of the text using words that may not be part of the
original text. This is a difficult task and is often not attempted.
• Evaluating summarization techniques:
• Somewhat subjective because humans sometimes cannot agree on the best summary
• Extractive Approaches
• Based on term frequency
• Based on sentence similarity
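The term-frequency flavor of extractive summarization can be sketched as follows. Scoring by average (rather than summed) term frequency is one of several reasonable normalization choices; it keeps long sentences from winning automatically:

```java
// Frequency-based extractive summarization: score each sentence by the
// average document-wide frequency of its terms, then keep the top-k sentences.
import java.util.*;

public class FrequencySummarizer {
    static List<String> summarize(List<String> sentences, int k) {
        // Count term frequencies over the whole document.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences) {
            for (String t : terms(s)) freq.merge(t, 1, Integer::sum);
        }
        // Score each sentence by its average term frequency.
        Map<String, Double> scores = new HashMap<>();
        for (String s : sentences) {
            List<String> ts = terms(s);
            double sum = 0;
            for (String t : ts) sum += freq.get(t);
            scores.put(s, sum / ts.size());
        }
        List<String> ranked = new ArrayList<>(sentences);
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

    static List<String> terms(String sentence) {
        List<String> out = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "Cassandra offers robust replication across datacenters.",
            "Cassandra also places a high value on performance.",
            "The weather was nice that day.");
        // Sentences repeating the document's dominant terms rank highest.
        System.out.println(summarize(doc, 1));
    }
}
```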
38. http://en.wikipedia.org/wiki/Apache_Cassandra
ID - SENTENCE - EXPECTED SCORE

1. Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. (High)
2. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. (High)
3. Cassandra also places a high value on performance. (Low)
4. In 2012, University of Toronto researchers studying NoSQL systems concluded that "In terms of scalability, there is a clear winner throughout our experiments. (Low)
5. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments" although "this comes at the price of high write and read latencies." (High)
6. Cassandra's data model is a partitioned row store with tunable consistency. (Medium)
7. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. (Medium)
8. Other columns may be indexed separately from the primary key. (Low)
9. Tables may be created, dropped, and altered at runtime without blocking updates and queries. (Low)
10. Cassandra does not support joins or subqueries, except for batch analysis via Hadoop. (Medium)
11. Rather, Cassandra emphasizes denormalization through features like collections. (Medium)
39. TextRank
• A graph approach where each vertex is a sentence and each edge has a weight
corresponding to the similarity between the two sentences. Every vertex is
connected to every other vertex.
• For every sentence:
• Calculate its similarity to every other sentence. The similarity measure can be simple, e.g. a normalized value of the number of common terms between the two sentences
• Sum the similarity of the sentence to every other sentence (sum up each row of the similarity matrix). That sum is the score of the sentence.
• Sort the vertices based on the sum of the weights of their edges and return the
top k sentences.
http://lit.csci.unt.edu/index.php/Graph-based_NLP
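The steps above can be sketched as follows. For simplicity this uses shared terms normalized by average sentence length as the similarity measure, rather than the log-normalized overlap of the original TextRank paper, and scores by summed edge weight instead of iterating PageRank:

```java
// Degree-style sentence ranking: score each sentence by the sum of its
// similarity to every other sentence, then return the top-k sentences.
import java.util.*;

public class SentenceRank {
    // similarity = 2 * |shared terms| / (|terms(a)| + |terms(b)|)
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }

    static Set<String> terms(String sentence) {
        return new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\W+")));
    }

    static List<String> topK(List<String> sentences, int k) {
        double[] score = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            for (int j = 0; j < sentences.size(); j++) {
                if (i != j) {
                    score[i] += similarity(terms(sentences.get(i)), terms(sentences.get(j)));
                }
            }
        }
        // Sort sentence indices by score, descending, and keep the top k.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) order.add(i);
        order.sort((x, y) -> Double.compare(score[y], score[x]));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            result.add(sentences.get(order.get(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "Cassandra offers masterless replication.",
            "Replication in Cassandra is masterless.",
            "The weather was nice.");
        // The off-topic weather sentence shares no terms, so it scores 0.
        System.out.println(topK(doc, 1));
    }
}
```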
40.

TOP SENTENCES - SCORE
Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients. - 1.6
Cassandra also places a high value on performance. - 1.125
Other columns may be indexed separately from the primary key. - 0.999
• Can the similarity metric be improved?
41. Dependency Analysis in Sentences
• StanfordNLP can be used to analyze the grammatical
structure of sentences and provide a dependency graph
between the different elements of the sentence.
• LexicalizedParser can provide a graph where the vertices
are the words and the edges are the grammatical
relationships in a sentence.
http://nlp.stanford.edu/software/lex-parser.shtml
42.

TAG - MEANING
advmod - Adverbial Modifier
neg - Negation Modifier
nsubj - Nominal Subject
nsubjpass - Passive Nominal Subject
dobj - Direct Object (she, gave)
iobj - Indirect Object (gave, me)
amod - Adjective Modifier
prep - Preposition
44. Dependency Analysis
• Works well for short sentences. It loses accuracy when
the scope is increased to a document.
• May aid in text simplification by using the relationships
between the entities.
• By analyzing the subject and the object, we can clearly establish a point of view (e.g. direct address vs. first person vs. second person etc.).
• Could potentially help in story extrapolation but does
not generalize well. So this is a topic of research.
45. Sentiment Analysis
• StanfordNLP has a deep learning model for sentiment
analysis.
• Takes a deep parsing approach to sentiment analysis - the
structure of the sentence is constructed prior to the analysis.
• Was trained on movie reviews data and obtained an accuracy about 5% higher than the closest model.
• Uses an annotated dataset called the Stanford Sentiment
Treebank. Users are encouraged to add labels to improve
the model further.
48. StanfordNLP Sentiment Analysis
• Provides relatively good results for short sentences.
• Sentences that are similar to the training data (movie
reviews) perform much better than other sentences.
• No good way to aggregate sentiments across a
document. A future work would probably involve
document level dependency parsing and sentiment
analysis.
• Only provides overall sentiment. Does not provide an
indication of the object of the sentiment.
49. Final Thoughts
• Shallow NLP is employed in text retrieval and search and provides good results for general search use cases.
• Deeper NLP involves semantic parsing, common sense
interpolation (both local and global knowledge bases) and
tends to be harder.
• Deeper NLP is more practical after picking a specific
domain for e.g. medical records, legal documents etc.
• 2 cents on Intelligence - Memory based systems
• http://watson-um-demo.mybluemix.net/
http://en.wikipedia.org/wiki/On_Intelligence