1. Data Sources
How to retrieve the data?
Data preprocessing
Some key concepts
Facebook
R package
Twitter & R
Sentiment analysis (based on Chris Potts' tutorial)
Working with linguistic data
Ekaterina Vylomova
April 14, 2014
Ekaterina Vylomova Working with linguistic data
Possible data sources
Dictionaries and corpora
Linked Data
Social media
Thesauri & Corpora
WordNets
Roget's Thesaurus
Moby Project
Moby Project
Moby Hyphenator - 185,000 entries fully hyphenated
Moby Language - word lists in five of the world's great languages
Moby Part-of-Speech - 230,000 entries fully described by
part(s) of speech
Moby Pronunciator - 175,000 entries fully International
Phonetic Alphabet coded
Moby Thesaurus - 30,000 root words, 2.5 million synonyms
and related words
Moby Words - 610,000+ words and phrases
Linked Structured Data
Using RDF format.
DBpedia is a project that aims to extract structured content from the information created as part of the Wikipedia project.
Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members.
BabelNet is a multilingual lexicalized semantic network and ontology, created automatically using Wikipedia.
YAGO is a knowledge base developed at the Max Planck Institute, also built automatically.
Spoken corpus
TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language.
CHILDES (part of TalkBank): Child Language Data Exchange System
Sentiment data
SentiWordNet
Dictionary by Warriner et al.
Dictionary by Hu and Liu
Social media
Rating systems: IMDB, Amazon, TripAdvisor, OpenTable
Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage
Facebook (OpenGraph)
Twitter
Blogs (LiveJournal, Blogger, etc.)
Possible ways to get the data
Corpora: just download them!
Facebook, Twitter and other social media: use the API
Blogs, Internet data: parse HTML or XML (download the webpage using wget/curl)
Linked data: parse RDF
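The "parse HTML" route can be sketched as follows. This is a minimal illustration using only Python's standard library; the class name TextExtractor, the helper html_to_text and the sample page are assumptions made for this example, and in practice the HTML string would come from a page downloaded with wget/curl.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the plain text of an HTML string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content, standing in for a downloaded blog post
page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>My blog</h1><p>Hello, <b>world</b>!</p></body></html>")
print(html_to_text(page))
```

The extracted text can then go through the preprocessing steps (tokenization, stop-word removal and so on).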
Don't forget this step!
Tokenization
Remove punctuation, maybe numbers and stop words; lower-case everything
Lemmatization or stemming (Porter, Snowball)
For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF weighting for normalization)
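The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, stemming is left out, and the function names preprocess and tfidf_matrix are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to"}


def preprocess(text):
    """Lower-case, drop punctuation and digits, tokenize, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (terms, rows): one row of TF-IDF weights per term, one column per document."""
    counts = [Counter(preprocess(d)) for d in docs]
    terms = sorted(set().union(*counts))
    n_docs = len(docs)
    rows = []
    for term in terms:
        df = sum(1 for c in counts if term in c)  # document frequency of the term
        idf = math.log(n_docs / df)
        rows.append([c[term] * idf for c in counts])
    return terms, rows


docs = ["The cat sat in the hat.", "A cat is a cat!", "Dogs and cats, 2 of them."]
terms, rows = tfidf_matrix(docs)
for term, row in zip(terms, rows):
    print(term, [round(w, 3) for w in row])
```

Without stemming, 'cat' and 'cats' stay separate terms, which is exactly what lemmatization or stemming would merge.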
A few key terms from data mining
Compute set similarity: Jaccard, Dice, F-scores
Transform words to vectors: LSA, MDS
Get topics of documents: LDA
For classification you may use: SVM, kNN, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost
For clustering you may use: k-means, SOM
For regression you may use: SVM, neural networks, GLM, NLS
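As a small illustration of the set-similarity measures, here is a sketch of the Jaccard and Dice coefficients over token sets. The function names and the two example token lists are chosen for this example only, and treating two empty sets as fully similar is an assumed convention.

```python
def jaccard(a, b):
    """|A & B| / |A | B| for two collections treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a, b):
    """2 * |A & B| / (|A| + |B|) for two collections treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


doc1 = "i really hate being shy".split()
doc2 = "i really like being here".split()
print(jaccard(doc1, doc2), dice(doc1, doc2))
```

Here the two token sets share 3 of 7 distinct words, so Jaccard is 3/7 and Dice is 6/10.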
Connect to Facebook OpenGraph
Get access token
Go to https://developers.facebook.com/tools/access_token/
Check it works: https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname
Example query: me?fields=id,name,gender
Use tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/
Facebook & Python
Download the package: https://github.com/pythonforfacebook/facebook-sdk
Install it: python setup.py install
Get the names and genders of your friends. Possible project: predicting gender from names
import facebook
token = 'your_token'
graph = facebook.GraphAPI(token)
# 'me' refers to the owner of the access token
profile = graph.get_object('me')
friends = graph.get_connections('me', 'friends')
friend_list = [friend['id'] for friend in friends['data']]
for friend_id in friend_list:
    data = graph.get_object(friend_id)
    # not every profile exposes a gender field
    if 'gender' in data:
        print(data['name'], data['gender'])
Using R
Packages you may need
tm - text mining + tm.plugin.webmining for webcorpora, html
parsers, plain text extraction
topicmodels - topicality
wordcloud - create a cloud of words
qdap - sentiment analysis
RCurl - curl (get the contents of a webpage)
twitteR - to use data from twitter
wordnet - WordNet interface (a local WordNet dictionary is needed)
e1071 - machine learning (clustering, SVM, naive Bayes)
Packages usage
Installation: install.packages('name')
Usage: library(name)
Twitter with R
Load packages:
library(twitteR)
library(tm)
library(RCurl)
library(qdap)
library(wordcloud)
Get Token:
library(ROAuth)  # provides OAuthFactory
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
# replace with the key and secret of your own Twitter application
consumerKey <- 'key'
consumerSecret <- 'secret'
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
# handshake() prints a link where you get a PIN code; enter the code when prompted
twitCred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl'))
registerTwitterOAuth(twitCred)
Get the data and convert to corpus:
# search by hashtag (you may also search by plain words); get n=1000 entries
gglTweets <- searchTwitter('#sochi2014', n=1000)
n <- length(gglTweets)
# show first 3 entries
gglTweets[1:3]
# put it in a data frame
df <- do.call(rbind, lapply(gglTweets, as.data.frame))
# get dimensionality
dim(df)
# create a corpus
myCorpus <- Corpus(VectorSource(df$text))
Do normalization:
# lower-case everything (with tm >= 0.6 use content_transformer(tolower))
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (very frequent words, e.g. articles, prepositions)
myStopwords <- c(stopwords('english'), 'sochi', 'amp', 'get')
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
Stem the documents:
# keep an unstemmed copy to use as the dictionary for stem completion
dictCorpus <- myCorpus
# apply stemming for normalization; you may use lemmatization instead
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[1:3])
# complete the stems back to full words found in the dictionary
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
inspect(myCorpus[1:3])
Create TDM:
# create term-document matrix; you may use TF or TF-IDF weighting
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1, weighting = weightTfIdf))
inspect(myDtm[66:70, 11:20])
# frequent terms and associations
findFreqTerms(myDtm, lowfreq=10)
Create a wordcloud:
# convert TDM to plain matrix
m <- as.matrix(myDtm)
# sort by decreasing frequencies
v <- sort(rowSums(m), decreasing=TRUE)
# show first 14 entries
head(v, 14)
# get the terms (the names of the sorted vector)
words <- names(v)
# create data frame of words with frequencies
dat <- data.frame(word=words, freq=v)
# create wordcloud from words which appeared at least 5 times
wordcloud(dat$word, dat$freq, min.freq=5)
Experience project
Experience Project is a free social networking website consisting of various online communities. Users submit experiences: personal stories, confessions, blogs, groups, photos, and videos.
The users assign categories to the stories.
Example: I really hate being shy ... I just want to be able to talk to
someone about anything and everything and be myself ... That's all
I've ever wanted.
Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0;
Author age: 21
Author gender: female
Text group: friends
Data
Let's load the data:
# read the .csv file with the data
ep = read.csv('ep3-context.csv')
Here: Count is the number of Category reactions received by
confessions containing Word in Group with an author of Gender
and Age.
Total is the number of Category reactions used by confessions
containing any Word in Group with an author of Gender and Age.
Look at different parameters:
# show examples of words
levels(ep$Word)
Words and categories
Word-Category Correlation
Check whether there is any correlation between words and categories
# load the helper functions
source('ep.R')
# create a subset for the word 'funny'
funny = epCollapsedFrame(ep, 'funny')
# plot the counts of the word for each category
plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
'Funny' corresponds to the 'understand' category. This doesn't look realistic.
We need normalization!
# apply normalization (divide each count by the total for its category)
funny$Count / funny$Total
# get a subset for 'funny', taking frequencies into account
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
# create a plot
plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
Much better!
Probability theory
Get category from word
Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'.
Let's apply Bayes' rule and compute P(category|word), i.e. the probability of 'category' given that a speaker used 'word'. Assuming a uniform prior over categories, this is Freq divided by its sum:
funny$Freq / sum(funny$Freq)
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')
Question: are there other words specific to a category?
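The normalization and the Bayes step can be checked by hand on a toy example. The sketch below uses hypothetical counts (not taken from the Experience Project data) and a made-up function name, category_given_word; it assumes a uniform prior over categories.

```python
def category_given_word(counts, totals):
    """Turn per-category counts into P(category | word), assuming a uniform prior.

    counts[c]: reactions of category c on texts containing the word
    totals[c]: reactions of category c on all texts
    """
    freqs = {c: counts[c] / totals[c] for c in counts}  # estimates P(word | category)
    norm = sum(freqs.values())
    return {c: f / norm for c, f in freqs.items()}


# Hypothetical numbers, for illustration only
counts = {"hugs": 10, "rock": 30, "teehee": 60, "understand": 80, "wow": 20}
totals = {"hugs": 1000, "rock": 1000, "teehee": 500, "understand": 4000, "wow": 1000}

probs = category_given_word(counts, totals)
for category, p in probs.items():
    print(category, round(p, 3))
```

With these made-up numbers 'understand' has the largest raw count, but after normalization 'teehee' becomes the most probable category.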
Compare with expected value
Estimate expected value
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
Expected value: $E[X] = \sum_{i=1}^{N} x_i \, p(x_i)$, where $p(x_i)$ is the probability of $x_i$.
# share of all reactions that belongs to each category
category.probs = (funny$Total/sum(funny$Total))
funny.count = sum(funny$Count)
# counts we would expect if 'funny' were spread evenly over the categories
funny.expected = funny.count * category.probs
funny.expected
Compare expected and observed values:
(funny$observed / funny.expected) - 1
A value less than 0 means that the word is underrepresented in the category.
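The observed/expected comparison can be sketched in Python as follows. The counts are hypothetical and the function name observed_over_expected is chosen for this example; it mirrors the R steps but is not the tutorial's own code.

```python
def observed_over_expected(counts, totals):
    """Return observed/expected - 1 for each category.

    The expected count assumes the word is spread over the categories
    in proportion to each category's share of all reactions.
    """
    grand_total = sum(totals.values())
    word_count = sum(counts.values())
    result = {}
    for category in counts:
        expected = word_count * totals[category] / grand_total
        result[category] = counts[category] / expected - 1
    return result


# Hypothetical numbers, for illustration only
counts = {"hugs": 10, "rock": 30, "teehee": 60, "understand": 80, "wow": 20}
totals = {"hugs": 1000, "rock": 1000, "teehee": 500, "understand": 4000, "wow": 1000}

oe = observed_over_expected(counts, totals)
for category, value in oe.items():
    print(category, round(value, 3))
```

Positive values mark categories where the word is overrepresented, negative values where it is underrepresented.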
Adding context: 'awesome' by gender
Usage of 'awesome' by male/female/unknown
eptok = read.csv('ep3-context-tokencounts.csv')
# three plots side by side
par(mfrow=c(1,3))
epPlot(ep, eptok, 'awesome', genders='male', probs=T)
epPlot(ep, eptok, 'awesome', genders='female', probs=T)
epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)
Adding context: 'awesome' by age
Usage of 'awesome' by people of different ages
par(mfrow=c(2,3))
# one plot per age group
for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=T) }
Adding context: 'awesome' comparing gender with the
category
'Awesome': gender+category
Changing the parameter for each category separately:
epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')
Adding context: 'drunk' comparing age with the category
'Drunk': age+category
Stories with 'drunk' depend on age:
epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')
Creating a logistic regression model
Regression modelling
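Let's create a regression model: predict the frequency of 'drunk' using age and category
drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
# treat age group as a numeric predictor
drunk$Age = as.numeric(drunk$Age)
# logistic regression on (successes, failures); '- 1' drops the intercept so each category gets its own coefficient
fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)
summary(fit.glm)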
Define a function that predicts the probability of a word given the category and the age of the person
FittedGlmFunc = function(fit, category, age) {
  coefs = fit$coef
  # coefficient names look like 'Categorywow'
  cat.coef = coefs[[paste('Category', category, sep='')]]
  # inverse logit of the linear predictor
  prediction = plogis(cat.coef + coefs[['Age']]*age)
  return(prediction)
}
Calling the function:
FittedGlmFunc(fit.glm, 'wow', 1)
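To see what the prediction function computes, here is a sketch of the same calculation in Python. The coefficient values are hypothetical (not the output of the fitted model), and plogis is re-implemented here as the standard inverse-logit function.

```python
import math


def plogis(x):
    """Inverse logit (logistic function): maps a linear predictor to a probability."""
    return 1 / (1 + math.exp(-x))


def fitted_probability(coefs, category, age):
    """Predicted probability of the word for a given category and age group."""
    return plogis(coefs["Category" + category] + coefs["Age"] * age)


# Hypothetical coefficients, for illustration only
coefs = {"Categorywow": -7.0, "Categoryhugs": -8.0, "Age": 0.1}

for age in range(1, 6):
    print(age, fitted_probability(coefs, "wow", age))
```

With a positive Age coefficient, the predicted probability grows with the age group for every category.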
Visualize the data and compare empirical (black) values with fitted (red) values.
par(mfrow=c(2,3))
cats = levels(ep$Category)
for (i in 1:5) {
  # empirical values for age group i
  epPlot(ep, eptok, 'drunk', age=i)
  for (j in 1:5) {
    # fitted value for category j and age group i
    val = FittedGlmFunc(fit.glm, cats[j], i)
    points(j, val, col='red', pch=19)
  }
}
IMDB data
Analysis of ADV-ADJ collocations
Data from rating systems
Data
We will use data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:
d = read.csv('ratings-advadj.csv')
head(d)
Extract subsets
'Horrid' categories
horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
nrow(horrid)
head(horrid)
'Absolutely'+'Horrid'
With modifier:
horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
nrow(horrid)
head(horrid)
Tonality evaluation for adjectives
Probabilities of categories for 'horrid'
horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
horrid
Tonality
Probabilities vs frequencies
par(mfrow=c(1,2))
ratingPlot(d, 'horrid', probs=FALSE)
ratingPlot(d, 'horrid', probs=TRUE)
Evaluating expectation
Predict category using adjective
Predict a category based on an adjective.
Expectation:
sum(horrid$Category * horrid$Pr)
The ExpectedCategory function does the same:
ExpectedCategory(horrid)
Adding the value to the plot:
ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
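The expectation is a probability-weighted average of the rating categories. A minimal sketch with hypothetical probabilities (not computed from the ratings data); the function name expected_category is chosen for this example:

```python
def expected_category(categories, probs):
    """Expected rating: sum of each category value times its probability."""
    return sum(c * p for c, p in zip(categories, probs))


ratings = [1, 2, 3, 4, 5]
# Hypothetical P(rating | word) for a negative adjective; the values sum to 1
probs = [0.5, 0.2, 0.1, 0.1, 0.1]

print(expected_category(ratings, probs))
```

An expected category near the low end of the scale suggests a negative word, and one near the high end a positive word.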
Regression model
A model for predicting
Let's create a model to predict the probability that a word will be in a particular category
fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category, family=quasibinomial, data=horrid)
fit.horrid
Improve the model by adding a quadratic term
GlmWordQuadratic <- function(pf) {
  # add the squared rating as an extra predictor
  pf$Category2 = pf$Category^2
  fit = glm(cbind(Count, Total-Count) ~ Category + Category2, family=quasibinomial, data=pf)
  return(fit)
}
par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
Vector space models
How to transform words to vectors:
LSA (latent semantic analysis)
MDS (multidimensional scaling)
Basics about vectors
Euclidean distance:
$$\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Vector length:
$$\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$$
Vector normalization: each component is divided by the vector's length.
Cosine distance between vectors:
$$\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i \, y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$$
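These definitions translate directly into code. The sketch below uses plain Python lists; the function names are chosen for this example and are not the ones defined in vsm.R.

```python
import math


def euclidean_dist(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def vector_length(x):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(xi ** 2 for xi in x))


def length_norm(x):
    """Divide each component by the vector's length."""
    n = vector_length(x)
    return [xi / n for xi in x]


def cosine_dist(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return 1 - dot / (vector_length(x) * vector_length(y))


a = [1000, 2000, 3000]
b = [1, 2, 3]
print(euclidean_dist(a, b))  # large: the magnitudes differ
print(cosine_dist(a, b))     # close to 0: the directions are the same
```

Cosine distance ignores vector length, so two vectors with the same proportions are at (numerically almost) zero distance, while their Euclidean distance can be large.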
Data from IMDB
Initial data: a term x term matrix, where element x_ij is the frequency of co-occurrence of term_i and term_j in a context (document, sentence, etc.)
source('vsm.R')
# co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
imdb = Csv2Matrix('imdb-wordword.csv')
imdb[100:110, 100:110]
Semantically related words
Extract semantically related words
df = Neighbors(imdb, 'happy')
head(df)
Problem
a = c(1000, 2000, 3000)
b = c(1, 2, 3)
a/sum(a)
# 0.1666667 0.3333333 0.5000000
b/sum(b)
# 0.1666667 0.3333333 0.5000000
LengthNorm(a)
# 0.2672612 0.5345225 0.8017837
LengthNorm(b)
# 0.2672612 0.5345225 0.8017837
After normalization a and b are identical, although a is based on 1000 times more observations than b.
PMI - Pointwise mutual information
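How to deal with it? - PMI!
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)}$$
PMI normalization:
$$\mathrm{NPMI}(i, j) = \mathrm{PMI}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j), \sum_{k=1}^{n} p(i, k)\right) + 1}$$
where p(i, j) = M[i, j] / sum(M) and M is the term x term matrix.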
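The plain PMI formula can be sketched for a small count matrix as follows. The function name pmi_matrix and the toy matrix are assumptions for this example; the sketch covers only basic and positive PMI (negative values clipped to 0), not the discounting factors, and sets PMI to 0 for pairs that never co-occur, which is one common convention.

```python
import math


def pmi_matrix(m, positive=True):
    """Pointwise mutual information for every cell of a count matrix (list of rows)."""
    total = sum(sum(row) for row in m)
    row_p = [sum(row) / total for row in m]        # p(i): row marginals
    col_p = [sum(col) / total for col in zip(*m)]  # p(j): column marginals
    out = []
    for i, row in enumerate(m):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                value = 0.0  # convention: log(0) is undefined, treat unseen pairs as 0
            else:
                value = math.log((count / total) / (row_p[i] * col_p[j]))
            out_row.append(max(value, 0.0) if positive else value)
        out.append(out_row)
    return out


# Toy co-occurrence counts, for illustration only
m = [[1, 3], [3, 1]]
print(pmi_matrix(m, positive=False))
print(pmi_matrix(m, positive=True))
```

Because PMI compares the observed joint probability with what the marginals alone would predict, frequent words no longer dominate simply by being frequent.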
PMI
# positive PMI with contextual discounting
imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
head(df)
Semantic orientation method
Semantic orientation
Define two sets of seed words S1 and S2 (as vector representations)
Choose the distance measure
For a word w: calculate the sum of distances to the vectors of S1 and the sum of distances to the vectors of S2
The tonality is the difference between the two sums
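The four steps can be sketched as follows. The 2-dimensional word vectors are hypothetical, the function names are chosen for this example, and the sign convention (distance to S1 minus distance to S2) is an assumption: with S1 negative and S2 positive, a positive score then indicates a positive word.

```python
import math


def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)


def semantic_orientation(vectors, word, seeds1, seeds2, dist=cosine_distance):
    """Sum of distances from word to seeds1 minus sum of distances to seeds2."""
    w = vectors[word]
    d1 = sum(dist(w, vectors[s]) for s in seeds1 if s in vectors)
    d2 = sum(dist(w, vectors[s]) for s in seeds2 if s in vectors)
    return d1 - d2


# Hypothetical 2-dimensional word vectors, for illustration only
vectors = {
    "good": [0.9, 0.1], "nice": [0.8, 0.2],
    "bad": [0.1, 0.9], "poor": [0.2, 0.8],
    "great": [1.0, 0.1], "horrid": [0.1, 1.0],
}
neg = ["bad", "poor"]
pos = ["good", "nice"]

print(semantic_orientation(vectors, "great", neg, pos))
print(semantic_orientation(vectors, "horrid", neg, pos))
```

In this toy space 'great' is far from the negative seeds and close to the positive ones, so its score is positive, and 'horrid' gets the mirror-image negative score.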
Example of semantic orientation method
# seed words for the negative and positive poles
neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# 0.8923544
SemanticOrientation(imdb.ppcd, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# -0.04741898
More information
Data examples
For more detailed examples and tutorials about sentiment analysis, see Chris Potts' tutorials:
http://nasslli2012.christopherpotts.net
http://sentiment.christopherpotts.net
Email me if you need any help!