2. MONK Project
MONK provides:
• 1,400 works of literature in English from the 16th to 19th
centuries (108 million words), POS-tagged, TEI-tagged,
stored in a MySQL database
• Several different open-source interfaces for
working with this data
• A public API to the datastore
• SEASR under the hood, for analytics
3. MONK Project
Executes flows for each analysis requested:
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVMs)
4. Dunning Loglikelihood TagCloud
• Words that are under-represented in writings by Victorian
women as compared to Victorian men. —Sara Steger
5. Feature Lens
“The discussion of the children introduces each of the short
internal narratives. This champions the view that her method
of repetition was patterned: controlled, intended, and a
measured means to an end. It would have been impossible to
discern through traditional reading.”
6. Semantic Analysis: Information Extraction
• Definition: Information extraction is the
identification of specific semantic elements within
a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore
non-relevant information (important!)
• Link related information and output it in a
predetermined format
7. Information Extraction
Information type and state of the art (accuracy):
• Entities (90–98%): an object of interest, such as a
person or organization.
• Attributes (80%): a property of an entity, such as its
name, alias, descriptor, or type.
• Facts (60–70%): a relationship held between two or
more entities, such as the position of a person in a
company.
• Events (50–60%): an activity involving several entities,
such as a terrorist act, airline crash, management
change, or new product introduction.
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
8. Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is
stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily
recognized by their internal format (e.g., DD/MM/YY or chemical
formulas)
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine learning or
statistical algorithms
– Rules capture local patterns that characterize entities from
instances of annotated training data
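The three approaches above can be sketched side by side. All names, lists, and patterns below are illustrative assumptions, not part of any production extractor:

```python
import re

# 1. Terminology (name) list: exact matching against a known gazetteer.
ORGANIZATIONS = {"Boynton Laboratory", "Bar-Ilan University"}

# 2. Tokenization/morphology: dates in DD/MM/YY form have a fixed
#    internal format that a pattern recognizes directly.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")

# 3. Characteristic pattern: a capitalized-word sequence after a title
#    cue often marks a novel person entity (hand-written rule).
PERSON_RE = re.compile(r"\b(?:Mayor|Dr\.|Prof\.)\s+((?:[A-Z][a-z]+\s?)+)")

def extract(text):
    return {
        "orgs": [o for o in ORGANIZATIONS if o in text],
        "dates": DATE_RE.findall(text),
        "persons": [m.strip() for m in PERSON_RE.findall(text)],
    }

print(extract("Mayor Rex Luthor visited Boynton Laboratory on 05/03/09."))
```

The list-based approach fails silently on unseen names, while the pattern-based rule generalizes to any titled name, which is the trade-off the bullets above describe.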
9. Semantic Analytics
Named Entity (NE) Tagging
NE:Person NE:Time
Mayor Rex Luthor announced today the establishment
NE:Location
of a new research facility in Alderwood. It will be
NE:Organization
known as Boynton Laboratory.
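A minimal gazetteer-based tagger reproduces the annotations on this slide. The entity-to-label dictionary is an assumption for illustration, not a trained NE model:

```python
# Toy gazetteer mapping surface strings to NE labels (illustrative only).
GAZETTEER = {
    "Rex Luthor": "NE:Person",
    "today": "NE:Time",
    "Alderwood": "NE:Location",
    "Boynton Laboratory": "NE:Organization",
}

def tag_entities(text):
    # Find each known surface form and report (surface, label) pairs
    # in document order.
    spans = []
    for surface, label in GAZETTEER.items():
        start = text.find(surface)
        if start != -1:
            spans.append((start, surface, label))
    return [(s, l) for _, s, l in sorted(spans)]

sentence = ("Mayor Rex Luthor announced today the establishment of a new "
            "research facility in Alderwood. It will be known as "
            "Boynton Laboratory.")
print(tag_entities(sentence))
```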
10. Semantic Analysis
Co-reference Resolution for entities and unnamed
entities
Mayor Rex Luthor announced today the establishment
UNE:Organization
of a new research facility in Alderwood. It will be
known as Boynton Laboratory.
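A common baseline for this step is to link a pronoun to the most recent earlier mention of the expected type. The helper below is a naive sketch of that heuristic, not the MONK/SEASR implementation:

```python
def resolve_pronoun(text, mentions, pronoun="It", wanted_label="Organization"):
    """mentions: list of (surface, label) pairs; returns the closest
    preceding mention with the wanted label, or None."""
    it_pos = text.find(pronoun + " ")
    best = None
    for surface, label in mentions:
        pos = text.rfind(surface, 0, it_pos)  # last occurrence before "It"
        if pos != -1 and label == wanted_label:
            if best is None or pos > best[0]:
                best = (pos, surface)
    return best[1] if best else None

sentence = ("Mayor Rex Luthor announced today the establishment of a new "
            "research facility in Alderwood. It will be known as "
            "Boynton Laboratory.")
mentions = [("Rex Luthor", "Person"),
            ("a new research facility", "Organization"),
            ("Alderwood", "Location")]
print(resolve_pronoun(sentence, mentions))
```

Here "It" resolves to the unnamed entity "a new research facility" (the slide's UNE:Organization), which can then be linked to the name "Boynton Laboratory".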
11. Semantic Analysis
Semantic Role Analysis
ACTOR ACTION WHEN OBJECT
Mayor Rex Luthor announced today the establishment
WHERE OBJECT
of a new research facility in Alderwood. It will be
ACTION COMPL
known as Boynton Laboratory.
12. Semantic Analysis
Concept-Relation Extraction
[Concept-relation graph: "Rex Luthor" (person) is the actor (who) of the
action "announce"; "today" (time) fills the when role; the object (what)
is the "establ." event, which has location (where) "Alderwood" and
organization "Boynton Lab".]
15. UIMA Structured data
• Two SEASR examples using UIMA POS data
– Frequent patterns (rule associations) on nouns
(fpgrowth)
– Sentiment analysis on adjectives
17. UIMA + P.O.S. tagging
Four Analysis Engines analyze a document to record
Part Of Speech information:
OpenNLP SentenceDetector → OpenNLP Tokenizer →
OpenNLP PosTagger → POSWriter
(serialization of the UIMA CAS)
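The four-engine chain can be sketched schematically as functions that each enrich a shared analysis record, the way UIMA annotators enrich a CAS. These are plain-Python stand-ins, not the real UIMA or OpenNLP APIs, and the tagger rule is a deliberately toy assumption:

```python
def sentence_detector(cas):
    # Split text into sentences (naive period split for illustration).
    cas["sentences"] = [s.strip() for s in cas["text"].split(".") if s.strip()]
    return cas

def tokenizer(cas):
    # Whitespace tokenization over the detected sentences.
    cas["tokens"] = [t for s in cas["sentences"] for t in s.split()]
    return cas

def pos_tagger(cas):
    # Toy tagger: capitalized words as NNP, everything else as NN.
    cas["pos"] = [(t, "NNP" if t[0].isupper() else "NN")
                  for t in cas["tokens"]]
    return cas

def pos_writer(cas):
    # Serialize the analysis results, like POSWriter writing out the CAS.
    return [f"{t}/{p}" for t, p in cas["pos"]]

cas = {"text": "Boynton Laboratory opened. Alderwood celebrated."}
print(pos_writer(pos_tagger(tokenizer(sentence_detector(cas)))))
```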
19. SEASR + UIMA: Frequent Patterns
Frequent Pattern Analysis on nouns
• Goal:
– Discover a cast of characters within the text
– Discover nouns that frequently occur together
• character relationships
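The idea can be sketched with plain pair counting: treat each sentence's nouns as a transaction and keep the pairs above a minimum support. The SEASR flow uses an FP-growth implementation; the data here is an invented toy set, not MONK output:

```python
from itertools import combinations
from collections import Counter

# Toy transactions: the nouns found in each sentence (illustrative data).
sentences_nouns = [
    ["Elizabeth", "Darcy", "letter"],
    ["Elizabeth", "Darcy", "garden"],
    ["Elizabeth", "Jane", "letter"],
    ["Darcy", "Elizabeth", "ball"],
]

pair_counts = Counter()
for nouns in sentences_nouns:
    # Sort so (a, b) and (b, a) count as the same pair.
    for a, b in combinations(sorted(set(nouns)), 2):
        pair_counts[(a, b)] += 1

# Pairs meeting a minimum support suggest character relationships.
min_support = 3
frequent = [p for p, c in pair_counts.items() if c >= min_support]
print(frequent)  # [('Darcy', 'Elizabeth')]
```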
22. UIMA + SEASR: Sentiment Analysis
• Classifying text based on its sentiment
– Determining the attitude of a speaker or a writer
– Determining whether a review is positive/negative
• Ask: What emotion is being conveyed within a body of text?
– Look at only adjectives (UIMA POS)
• Lots of issues, challenges, and "but"s
• Need to Answer:
– What emotions to track?
– How to measure/classify an adjective to one of the selected
emotions?
– How to visualize the results?
25. UIMA + SEASR: Sentiment Analysis
• How to classify adjectives:
– Lots of metrics we could use …
• Lists of adjectives already classified
– http://www.derose.net/steve/resources/emotionwords/ewords.html
– Need a “nearness” metric for missing adjectives
– How about the thesaurus game?
• Using only a thesaurus, find a path between two words
– no antonyms
– no colloquialisms or slang
26. UIMA + SEASR: Sentiment Analysis
• How to get from delightful to rainy?
['delightful', 'fair', 'balmy', 'moist', 'rainy']
• sexy to joyless?
['sexy', 'provocative', 'blue', 'joyless']
• bitter to lovable?
['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']
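The thesaurus game is a shortest-path search over synonym links. The sketch below runs breadth-first search over a tiny hand-made graph that reproduces the delightful→rainy chain; a real run would use a full thesaurus:

```python
from collections import deque

# Toy synonym graph (illustrative; edges are treated as undirected links).
SYNONYMS = {
    "delightful": ["fair"], "fair": ["balmy"], "balmy": ["moist"],
    "moist": ["rainy"], "rainy": [],
}

def thesaurus_path(start, goal, synonyms):
    """Breadth-first search: returns a shortest synonym chain, or None."""
    graph = {w: set(ns) for w, ns in synonyms.items()}
    for w, ns in synonyms.items():      # make every link symmetric
        for n in ns:
            graph.setdefault(n, set()).add(w)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(thesaurus_path("delightful", "rainy", SYNONYMS))
```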
27. UIMA + SEASR: Sentiment Analysis
• Use this game as a metric for measuring the distance
from a given adjective to each of the six emotions.
• Assume the longer the path, the "farther away"
the two words are.
• This addresses some of the issues above.
29. UIMA + SEASR: Sentiment Analysis
• SynNet Metrics
• Common nodes
• Path length
• Symmetric: a->b->c c->b->a
• Link strength:
• tangy->sweet
• sweet->lovable
• Use of slang or informal usage
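The symmetry and link-strength metrics can be checked directly on the synonym listings. A minimal sketch over an invented mini-thesaurus (the graph and helper are assumptions, not SynNet code):

```python
# Toy directed synonym listings (illustrative only).
SYNONYMS = {
    "tangy": ["sweet"],
    "sweet": ["lovable", "tangy"],
    "lovable": [],
}

def is_symmetric(a, b, synonyms):
    # A link a->b is stronger evidence of synonymy when the thesaurus
    # also lists b->a (the slide's symmetric-path criterion).
    return b in synonyms.get(a, []) and a in synonyms.get(b, [])

print(is_symmetric("tangy", "sweet", SYNONYMS))    # symmetric link
print(is_symmetric("sweet", "lovable", SYNONYMS))  # one-way link
```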
30. UIMA + SEASR: Sentiment Analysis
• Common Nodes
• Depth of common nodes
31. UIMA + SEASR: Sentiment Analysis
• Symmetry of path in common nodes
32. UIMA + SEASR: Sentiment Analysis
• Find the shortest path between
adjective and each emotion:
• ['delightful', 'beatific', 'joyful']
• ['delightful', 'ineffable', 'unspeakable',
'fearful']
• Pick the emotion with the shortest path length
– Tie-breaking procedures handle equal lengths
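The classification rule itself is just an argmin over path lengths. The path data below is copied from this slide's example; the helper name is illustrative:

```python
# Shortest paths from "delightful" to each candidate emotion
# (taken from the slide's example).
paths = {
    "joyful": ["delightful", "beatific", "joyful"],
    "fearful": ["delightful", "ineffable", "unspeakable", "fearful"],
}

def classify(paths):
    # Shorter thesaurus path = semantically closer emotion.
    # (Ties would fall through to the tie-breaking procedures.)
    return min(paths, key=lambda e: len(paths[e]))

print(classify(paths))  # joyful
```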
33. UIMA + SEASR: Sentiment Analysis
• Not a perfect solution
– still need context to get quality
• Vain
– ['vain', 'insignificant', 'contemptible', 'hateful']
– ['vain', 'misleading', 'puzzling', 'surprising']
• Animal
– ['animal', 'sensual', 'pleasing', 'joyful']
– ['animal', 'bestial', 'vile', 'hateful']
– ['animal', 'gross', 'shocking', 'fearful']
– ['animal', 'gross', 'grievous', 'sorrowful']
• Negation
– “My mother was not a hateful person.”
34. UIMA + SEASR: Sentiment Analysis
• Process Overview
• Extract the adjectives (UIMA POS analysis)
• Read in adjectives (SEASR library)
• Label each adjective (SynNet)
• Summarize windows of adjectives
• lots of experimentation here
• Visualize the windows
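The windowing step can be sketched as follows: slide a fixed-size window over the labeled adjectives and report each window's dominant emotion. The label sequence and window size are illustrative choices, which is where the slide's "lots of experimentation" happens:

```python
from collections import Counter

# Toy sequence of per-adjective emotion labels, in document order.
labels = ["joyful", "joyful", "fearful", "sorrowful", "fearful", "fearful"]

def window_summary(labels, size=3):
    # Summarize each non-overlapping window by its most common label.
    out = []
    for i in range(0, len(labels), size):
        window = labels[i:i + size]
        out.append(Counter(window).most_common(1)[0][0])
    return out

print(window_summary(labels))  # ['joyful', 'fearful']
```

The resulting per-window labels are what the visualization component renders.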
35. UIMA + SEASR: Sentiment Analysis
• Visualization
• New SEASR visualization component
• Based on the Flare ActionScript library
• http://flare.prefuse.org/
• Still in development
• http://demo.seasr.org:1714/public/resources/data/emotions/ev/EmotionViewer.html