Data Sources
How to retrieve the data?
Data preprocessing
Some key concepts
Facebook
R package
Twitter & R
Sentiment analysis (Based on Chris Potts tutorial )
Working with linguistic data
Ekaterina Vylomova
April 14, 2014
Possible data sources
Dictionaries and corpora
Linked Data
Social media
Thesauri & Corpora
WordNets
Roget's Thesaurus
Moby Project
Moby Project
Moby Hyphenator - 185,000 entries fully hyphenated
Moby Language - Word lists in five of the world's great languages
Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech
Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded
Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words
Moby Words - 610,000+ words and phrases
Linked & Structured Data
These sources use the RDF format.
DBpedia is a project that extracts structured content from the information created as part of the Wikipedia project.
Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members.
BabelNet is a multilingual lexicalized semantic network and ontology, created automatically using Wikipedia.
YAGO is a knowledge base developed at the Max Planck Institute. It is also built automatically.
Spoken corpus
TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language.
CHILDES (part of TalkBank): Child Language Data Exchange System
Sentiment data
SentiWordNet
Dictionary by Warriner et al.
Dictionary by Hu and Liu
Social media
Rating systems: IMDB, Amazon, TripAdvisor, OpenTable
Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage
Facebook (OpenGraph)
Twitter
Blogs (LiveJournal, Blogger, etc.)
Possible ways to get the data
Corpora: just download them!
Facebook, Twitter and other social media: use the API
Blogs, Internet data: parse HTML or XML (download the webpage using wget/curl)
Linked data: parse RDF
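The "parse HTML" route can be sketched with the Python standard library alone. This is a minimal illustration, not a robust scraper: the `TextExtractor` class, the `html_to_text` helper and the sample page are hypothetical names and data introduced here, and the page content is assumed to have been downloaded already (e.g. with wget/curl).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content standing in for a downloaded blog post.
sample_html = """
<html><head><title>My blog</title><style>p {color: red;}</style></head>
<body><h1>First post</h1><p>I really <b>love</b> linguistics.</p></body></html>
"""
print(html_to_text(sample_html))
```

The extracted text can then go through the preprocessing steps described on the next slide.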
Don't forget this step!
Tokenization
Remove punctuation, possibly numbers and stop words; lower-case everything
Lemmatization or stemming (Porter, Snowball)
For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF for weighting)
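The preprocessing steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the stop word list is a tiny illustrative one, stemming is left out because the standard library has no stemmer, and `preprocess` and `term_document_matrix` are hypothetical helper names.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}


def preprocess(text):
    """Lower-case, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only ASCII letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (sorted vocabulary, one row of raw counts per term)."""
    counts = [Counter(preprocess(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


docs = ["The cat sat in the hat.", "A cat and a dog, 2 pets!"]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(term, row)
```

Each row holds the raw term frequency (TF) of one term across the documents; a weighting such as TF-IDF would be applied on top of these counts.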
A few key terms from data mining
Compute set similarity: Jaccard, Dice, F-scores
Transform words to vectors: LSA, MDS
Get topics of documents: LDA
For classification you may use: SVM, kNN, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost
For clustering you may use: k-means, SOM
For regression you may use: SVM, neural networks, GLM, NLS
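As an illustration of the set similarity measures named above, here is a minimal sketch of the Jaccard and Dice coefficients on word sets. The function names and the example sets are made up for illustration; returning 1.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


x = {"cat", "dog", "fish"}
y = {"cat", "dog", "bird", "horse"}
print(jaccard(x, y))  # 2 shared words out of 5 distinct words
print(dice(x, y))     # 2 * 2 shared words over 3 + 4 words
```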
Connect to Facebook OpenGraph
Get access token
Go to https://developers.facebook.com/tools/access_token/
Check it works:
https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname
Example query: me?fields=id,name,gender
Use tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/
Facebook & Python
Download the package: https://github.com/pythonforfacebook/facebook-sdk
Install it: python setup.py install
Get the names and gender of your friends. Possible project: predicting gender from names.
import facebook

token = 'your_token'
graph = facebook.GraphAPI(token)
# 'me' refers to the owner of the access token
profile = graph.get_object('me')
friends = graph.get_connections('me', 'friends')
friend_list = [friend['id'] for friend in friends['data']]
for friend_id in friend_list:
    data = graph.get_object(friend_id)
    # gender is only present if the friend has made it visible
    if 'gender' in data:
        print(data['name'], data['gender'])
Using R
Packages you may need
tm - text mining, plus tm.plugin.webmining for web corpora, HTML parsers, plain text extraction
topicmodels - topic modelling
wordcloud - create a cloud of words
qdap - sentiment analysis
RCurl - curl (get the contents of a webpage)
twitteR - to use data from Twitter
wordnet - WordNet usage (dictionary needed)
e1071 - machine learning (clustering, SVM, naive Bayes)
Packages usage
Installation: install.packages("name")
Usage: library(name)
Twitter with R
Load packages:
library(twitteR)
library(tm)
library(RCurl)
library(qdap)
library(wordcloud)
Get Token:
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "key"
consumerSecret <- "secret"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
# The method will return a link to get a PIN code; you should enter the code
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
registerTwitterOAuth(twitCred)
Get the data and convert to corpus:
# search by hashtag; you may also search by plain words. Get n=1000 entries
gglTweets <- searchTwitter('#sochi2014', n=1000)
n <- length(gglTweets)
# show first 3 entries
gglTweets[1:3]
# put it in a data frame
df <- do.call(rbind, lapply(gglTweets, as.data.frame))
# get dimensionality
dim(df)
# create a corpus
myCorpus <- Corpus(VectorSource(df$text))
Do normalization:
# lower-case everything (tm >= 0.6 requires content_transformer() around base functions)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stopwords (very frequent words, e.g. articles, prepositions)
myStopwords <- c(stopwords('english'), 'sochi', 'amp', 'get')
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
Stem the documents:
# keep a copy of the unstemmed corpus as a dictionary for stem completion
dictCorpus <- myCorpus
# apply stemming for normalization; you may use lemmatization instead
myCorpus <- tm_map(myCorpus, stemDocument)
inspect(myCorpus[1:3])
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
inspect(myCorpus[1:3])
Create TDM:
# create term-document matrix; you may use the TF or TF-IDF metric
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1, weighting = weightTfIdf))
inspect(myDtm[66:70, 11:20])
# frequent terms and associations
findFreqTerms(myDtm, lowfreq=10)
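To make the TF-IDF weighting concrete, here is a minimal Python sketch of one common variant: term frequency normalized by document length, multiplied by log2 of the inverse document frequency. It is an illustration only and is not claimed to reproduce the output of weightTfIdf; the `tf_idf` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log2(number of documents / number of documents containing the term)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy, already tokenized documents (made up for illustration).
docs = [["sochi", "hockey", "gold"], ["sochi", "skating"], ["sochi", "hockey"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document (here 'sochi') gets weight 0, which is why such words are often removed as stop words beforehand.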
Create a wordcloud:
# convert TDM to plain matrix
m <- as.matrix(myDtm)
# sort by decreasing frequencies
v <- sort(rowSums(m), decreasing=TRUE)
# show first 14 entries
head(v, 14)
# get the words (names of the sorted vector)
words <- names(v)
# create data frame for words with frequencies
dat <- data.frame(word=words, freq=v)
# create wordcloud from words which appeared at least 5 times
wordcloud(dat$word, dat$freq, min.freq=5)
Experience project
Experience Project is a free social networking website consisting of various online communities. Users submit experiences: personal stories, confessions, blogs, groups, photos, and videos.
The users assign categories to the stories.
Example: "I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted."
Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0
Author age: 21
Author gender: female
Text group: friends
Data
Let's load the data:
# read the .csv file with the data
ep = read.csv('ep3-context.csv')
Here, Count is the number of Category reactions received by confessions containing Word in Group with an author of Gender and Age.
Total is the number of Category reactions received by confessions containing any Word in Group with an author of Gender and Age.
Look at the different parameters:
# show examples of words
levels(ep$Word)
Words and categories
Word-Category Correlation
Check if there is any correlation between words and categories
# include source file
source('ep.R')
# create a subset for the word 'funny'
funny = epCollapsedFrame(ep, 'funny')
# plot the frequencies of the word for each category
plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
'Funny' corresponds to the 'understand' category. This doesn't look realistic.
We need normalization!
# apply normalization (divide Count by Total)
funny$Count / funny$Total
# get a subset for 'funny', take frequencies into account
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
# create a plot
plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
Much better!
Probability theory
Get category from word
Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'.
Let's apply Bayes' rule and compute P(category|word), i.e. the probability of a category given that a speaker used 'word'. Dividing each Freq by the sum of all Freq values assumes a uniform prior over categories.
funny$Freq / sum(funny$Freq)
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')
Question: are there any other words specific to a category?
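The two normalization steps (Count/Total, then division by the sum) can be traced on a small example. This is a minimal sketch with made-up numbers, not values from the Experience Project data; `category_given_word` is a hypothetical helper, and the result is P(category|word) only under the assumption of a uniform prior over categories.

```python
def category_given_word(counts, totals):
    """Estimate P(category | word) assuming a uniform prior over categories.

    counts[c]: reactions of category c on texts containing the word
    totals[c]: all reactions of category c
    """
    freqs = {c: counts[c] / totals[c] for c in counts}  # ~ P(word | category)
    norm = sum(freqs.values())
    return {c: f / norm for c, f in freqs.items()}


# Made-up numbers for illustration only.
counts = {"hugs": 10, "rock": 30, "teehee": 60, "understand": 100, "wow": 20}
totals = {"hugs": 1000, "rock": 1000, "teehee": 500, "understand": 10000, "wow": 1000}

probs = category_given_word(counts, totals)
for category, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(category, round(p, 3))
```

In this toy data 'understand' has the largest raw count, but only because that reaction is used far more often overall; after normalization 'teehee' comes out on top.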
Compare with estimated value
Estimate expected value
funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
Expected value: $\mathrm{Exp} = \sum_{i=1}^{N} x_i \, p(x_i)$, where $p(x_i)$ is the probability of $x_i$.
category.probs = (funny$Total/sum(funny$Total))
funny.count = sum(funny$Count)
funny.expected = funny.count * category.probs
funny.expected
Compare expected and observed values:
(funny$observed / funny.expected) - 1
A value less than 0 means that the word is underrepresented in a category.
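The observed/expected comparison can be worked through by hand on a small example. This is a minimal sketch with made-up counts; `observed_expected` is a hypothetical helper that follows the same arithmetic as the R lines above: expected count = total word count times the category's share of all reactions.

```python
def observed_expected(counts, totals):
    """Return observed/expected - 1 for each category."""
    total_count = sum(counts.values())
    grand_total = sum(totals.values())
    result = {}
    for category in counts:
        # Expected count if the word were spread in proportion to category size.
        expected = total_count * totals[category] / grand_total
        result[category] = counts[category] / expected - 1
    return result


# Made-up numbers for illustration only.
counts = {"hugs": 10, "teehee": 60, "understand": 30}
totals = {"hugs": 200, "teehee": 300, "understand": 500}

for category, value in observed_expected(counts, totals).items():
    print(category, round(value, 2))
```

Here the expected counts are 20, 30 and 50, so the word is overrepresented in 'teehee' (positive value) and underrepresented in the other two categories (negative values).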
Adding context: 'awesome' by gender
Usage of 'awesome' by male/female/unknown
eptok = read.csv('ep3-context-tokencounts.csv')
par(mfrow=c(1,3))
epPlot(ep, eptok, 'awesome', genders='male', probs=T)
epPlot(ep, eptok, 'awesome', genders='female', probs=T)
epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)
Adding context: 'awesome' by age
Usage of 'awesome' by people of different ages
par(mfrow=c(2,3))
for (i in 1:5) { epPlot(ep, eptok, 'awesome', ages=i, probs=T) }
Adding context: 'awesome' comparing gender with the category
'Awesome': gender+category
Changing the parameter for each category separately:
epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')
Adding context: 'drunk' comparing age with the category
'Drunk': age+category
Stories with 'drunk' depend on the age:
epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')
Creating a logistic regression model
Regression modelling
Let's create a regression model: predict the frequency of 'drunk' using age and category
drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
drunk$Age = as.numeric(drunk$Age)
# binomial (logistic) model: one coefficient per category (no intercept) plus a linear age effect
fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)
summary(fit.glm)
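Define a function that predicts the probability of the word given the category and the age of the author
FittedGlmFunc = function(fit, category, age) {
  coefs = fit$coef
  # coefficient names look like 'Categorywow': factor name followed by the level
  cat.coef = coefs[[paste('Category', category, sep='')]]
  # plogis is the inverse logit: it maps the linear predictor to a probability
  prediction = plogis(cat.coef + coefs[['Age']]*age)
  return(prediction)
}
Calling the function:
FittedGlmFunc(fit.glm, 'wow', 1)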
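What the prediction function computes can be shown in a few lines of Python. This is a minimal sketch: the coefficient values below are hypothetical and do not come from a fitted model, and `plogis` here is a hand-written inverse logit named after the R function.

```python
import math


def plogis(x):
    """Inverse logit: maps a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_probability(coefs, category, age):
    """Predicted probability for a category and an age group."""
    return plogis(coefs["Category" + category] + coefs["Age"] * age)


# Hypothetical coefficients, for illustration only.
coefs = {"Categorywow": -7.0, "Categoryhugs": -8.0, "Age": -0.1}

for age in range(1, 6):
    print(age, fitted_probability(coefs, "wow", age))
```

With a negative Age coefficient, as assumed here, the predicted probability decreases with each age group.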
Visualize the data and compare empirical (black) values with fitted (red) values.
par(mfrow=c(2,3))
cats = levels(ep$Category)
for (i in 1:5) {
  epPlot(ep, eptok, 'drunk', ages=i)
  # overlay the fitted value for each of the five categories
  for (j in 1:5) {
    val = FittedGlmFunc(fit.glm, cats[j], i)
    points(j, val, col='red', pch=19)
  }
}
IMDB data
Analysis of ADV-ADJ collocations
Data from rating systems
Data
We will use data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:
d = read.csv('ratings-advadj.csv')
head(d)
Extract subsets
'Horrid' categories
horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
nrow(horrid)
head(horrid)
'Absolutely'+'Horrid'
With modifier:
horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
nrow(horrid)
head(horrid)
Tonality evaluation for adjectives
Probabilities of categories for 'horrid'
horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
horrid
Tonality
Probabilities vs frequencies
par(mfrow=c(1,2))
ratingPlot(d, 'horrid', probs=FALSE)
ratingPlot(d, 'horrid', probs=TRUE)
Evaluating expectation
Predict category using adjective
Predict a category based on an adjective.
Expectation:
sum(horrid$Category * horrid$Pr)
The ExpectedCategory function does the same:
ExpectedCategory(horrid)
Adding the value to the plot:
ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
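The expectation is a probability-weighted average of the rating categories. A minimal sketch with a made-up distribution (the numbers are not from the ratings data, and `expected_category` is a hypothetical helper):

```python
def expected_category(probs):
    """Expected rating: sum of rating * P(rating | word)."""
    return sum(rating * p for rating, p in probs.items())


# Made-up distribution over a 1-5 star scale for a negative adjective.
probs = {1: 0.5, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.1}
print(expected_category(probs))
```

A word concentrated in low ratings gets an expected category well below the midpoint of the scale (here about 2.1 on a 1-5 scale).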
Regression model
A model for predicting
Let's create a model to predict the probability that a word will be in a particular category
fit.horrid = glm(cbind(horrid$Count, horrid$Total-horrid$Count) ~ Category, family=quasibinomial, data=horrid)
fit.horrid
Improve the model by adding a quadratic term
GlmWordQuadratic <- function(pf) {
  # squared rating lets the curve bend (e.g. U-shaped or peaked distributions)
  pf$Category2 = pf$Category^2
  fit = glm(cbind(Count, Total-Count) ~ Category + Category2, family=quasibinomial, data=pf)
  return(fit)
}
par(mfrow=c(2,2))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
Vector space models
How to transform words to vectors:
LSA (latent semantic analysis)
MDS (multidimensional scaling)
Basics about vectors
Euclidean distance:
$\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Vector length:
$\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$
Vector normalization: each component is divided by the vector's length.
Cosine distance between vectors:
$\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i \, y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$
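The vector operations on this slide can be sketched in plain Python. The function names are chosen here to mirror the slide and are not taken from vsm.R; cosine distance is computed as 1 minus the dot product divided by the product of the two vector lengths.

```python
import math


def euclidean_dist(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def vector_length(x):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(xi ** 2 for xi in x))


def length_norm(x):
    """Divide each component by the vector's length."""
    length = vector_length(x)
    return [xi / length for xi in x]


def cosine_dist(x, y):
    """1 - cosine of the angle between x and y."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return 1 - dot / (vector_length(x) * vector_length(y))


a = [1000, 2000, 3000]
b = [1, 2, 3]
print(euclidean_dist(a, b))  # large: the raw frequencies differ a lot
print(cosine_dist(a, b))     # close to 0: the vectors point the same way
print(length_norm(b))
```

Euclidean distance is sensitive to overall frequency, while cosine distance only compares direction, which is why a and b end up far apart under one measure and essentially identical under the other.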
Vector space models
Data from IMDB
Initial data: a term x term matrix, where element x_ij is the frequency of co-occurrence of term_i and term_j in a context (document, sentence, etc.)
source('vsm.R')
# co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
imdb = Csv2Matrix('imdb-wordword.csv')
imdb[100:110, 100:110]
Semantically related words
Extract semantically related words
df = Neighbors(imdb , 'happy')
head(df)
Problem: a and b differ by a factor of 1000 in raw frequency, but both normalizations make them identical
a = c(1000, 2000, 3000)
b = c(1, 2, 3)
a/sum(a)
# 0.1666667 0.3333333 0.5000000
b/sum(b)
# 0.1666667 0.3333333 0.5000000
LengthNorm(a)
# 0.2672612 0.5345225 0.8017837
LengthNorm(b)
# 0.2672612 0.5345225 0.8017837
PMI - Pointwise mutual information
How to deal with it? - PMI!
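$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x) \, p(y)}$
PMI normalization:
$\mathrm{NPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right) + 1}$
where p(i, j) = M_ij / sum(M), and M is the term x term matrix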
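A minimal sketch of plain (and positive) PMI on a small co-occurrence matrix, using nested lists instead of a matrix library. The discounting factor is left out, cells with zero counts are set to 0 by convention, and the natural logarithm is an arbitrary choice made here; `pmi_matrix` and the toy counts are hypothetical and not taken from vsm.R.

```python
import math


def pmi_matrix(counts, positive=True):
    """Pointwise mutual information for every cell of a count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, value in enumerate(row):
            if value == 0:
                out_row.append(0.0)  # convention: avoid log(0)
                continue
            p_xy = value / total
            p_x = row_sums[i] / total
            p_y = col_sums[j] / total
            pmi = math.log(p_xy / (p_x * p_y))
            # Positive PMI clips negative associations to 0.
            out_row.append(max(pmi, 0.0) if positive else pmi)
        result.append(out_row)
    return result


# Toy symmetric term x term counts (made up for illustration).
counts = [
    [10, 0, 2],
    [0, 8, 2],
    [2, 2, 4],
]
for row in pmi_matrix(counts):
    print([round(v, 3) for v in row])
```

Unlike length or probability normalization, PMI compares each observed co-occurrence with what the row and column totals would predict under independence.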
PMI
imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
head(df)
Semantic orientation method
Semantic orientation
Define two sets of seed words S1 and S2 (vector representations)
Choose the distance measure
For a word w: calculate the sum of distances to the vectors of S1 and to the vectors of S2
The tonality is the difference between the two sums
Example of semantic orientation method
neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# 0.8923544
SemanticOrientation(imdb.ppci, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
# -0.04741898
More information
Data & examples
For more detailed examples and tutorials on sentiment analysis,
see Chris Potts's tutorials:
http://nasslli2012.christopherpotts.net
http://sentiment.christopherpotts.net
Email me if you need any help!
Smarteg dropshipping via API with DroFx.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Working with text data

  • 1. Data Sources How to retrieve the data? Data preprocessing Some key concepts Facebook R package Twitter & R Sentiment analysis (Based on Chris Potts tutorial ) Working with linguistic data Ekaterina Vylomova April 14, 2014 Ekaterina Vylomova Working with linguistic data
  • 2. Possible data sources: dictionaries and corpora; Linked Data; social media.
  • 3. Possible data sources: thesauri and corpora. WordNets; Roget's Thesaurus; Moby Project.
  • 4. Moby Project. Moby Hyphenator - 185,000 entries fully hyphenated. Moby Language - word lists in five of the world's great languages. Moby Part-of-Speech - 230,000 entries fully described by part(s) of speech. Moby Pronunciator - 175,000 entries fully International Phonetic Alphabet coded. Moby Thesaurus - 30,000 root words, 2.5 million synonyms and related words. Moby Words - 610,000+ words and phrases.
  • 5. Possible data sources: Linked & Structured Data (using the RDF format). DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project. Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. BabelNet is a multilingual lexicalized semantic network and ontology, created automatically from Wikipedia. YAGO is a knowledge base developed at the Max Planck Institute, also built automatically.
  • 6. Possible data sources: spoken corpora. TalkBank (multilingual): first language acquisition, second language acquisition, conversation analysis, classroom discourse, and aphasic language. CHILDES (part of TalkBank): Child Language Data Exchange System.
  • 7. Possible data sources: sentiment data. SentiWordNet; dictionary by Warriner et al.; dictionary by Hu and Liu.
  • 8. Possible data sources: social media. Rating systems: IMDB, Amazon, TripAdvisor, OpenTable. Sentiment: ExperienceProject, FMyLife, MyLifeIsAverage. Facebook (OpenGraph). Twitter. Blogs (LiveJournal, Blogger, etc.).
  • 9. Possible ways to get the data. Corpora: just download them! Facebook, Twitter and other social media: use the API. Blogs, Internet data: parse HTML or XML (download the web page using wget/curl). Linked data: parse RDF.
  • 10. Data preprocessing: don't forget this step! Tokenization. Remove punctuation, possibly numbers and stop words, and lower-case everything. Lemmatization or stemming (Porter, Snowball). For a bag-of-words model you may create a term x document or term x term matrix (using TF, TF-IDF, or RIDF for normalization).
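The preprocessing steps listed on this slide can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the deck's own code: the stop-word list is a tiny made-up one, stemming is left out, and the weighting is the plain tf * log(N / df) variant of TF-IDF (other variants exist).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

def preprocess(text):
    """Lower-case, replace punctuation and digits with spaces, split, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["The cat sat in the hat.", "A cat and a dog!", "Dogs bark, 2 cats purr."]
matrix = tf_idf(docs)
```

Each entry of `matrix` is one row of a document x term matrix stored sparsely. Without stemming, "cat" and "cats" stay separate terms, which is exactly the problem the stemming step is meant to address.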
  • 11. Some key concepts: a few key words from data mining. Compute set similarity: Jaccard, Dice, F-scores. Transform words to vectors: LSA, MDS. Get topics of documents: LDA. For classification you may use: SVM, neural networks, discriminant analysis, Bayesian networks, decision trees, random forest, AdaBoost. For clustering you may use: k-means, knn, SOM, SVM. For regression you may use: SVM, neural networks, GLM, NLS.
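As a small illustration of the set-similarity measures named on this slide, the sketch below computes the Jaccard and Dice coefficients over the word sets of two short texts. The function names and the example sentences are made up for illustration; treating two empty sets as identical is a convention chosen here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()

print(jaccard(doc1, doc2))  # 3 shared words out of 7 distinct words
print(dice(doc1, doc2))     # 2 * 3 shared words over 5 + 5 words
```

Dice gives shared items more weight than Jaccard, so for the same pair of sets it is never smaller.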
  • 12. Facebook: connect to Facebook OpenGraph. Get an access token: go to https://developers.facebook.com/tools/access_token/ Check that it works: https://developers.facebook.com/tools/explorer?method=GET&path=me%3Ffields%3Did%2Cname (query: me?fields=id,name,gender). Use the tutorial: https://developers.facebook.com/docs/graph-api/common-scenarios/
  • 13. Facebook & Python. Download the package: https://github.com/pythonforfacebook/facebook-sdk Install it: python setup.py install
  • 14. Facebook & Python. Get the names and gender of your friends. Possible project: predicting gender from names.
      import facebook

      token = 'your_token'
      graph = facebook.GraphAPI(token)
      profile = graph.get_object("me")
      friends = graph.get_connections("me", "friends")
      friend_list = [friend['id'] for friend in friends['data']]
      for friend_id in friend_list:
          data = graph.get_object(friend_id)
          # not every profile exposes a gender field
          if 'gender' in data:
              print(data['name'], data['gender'])
  • 15. Using R: packages you may need. tm - text mining, plus tm.plugin.webmining for web corpora, HTML parsers, plain text extraction. topicmodels - topic modelling. wordcloud - create a cloud of words. qdap - sentiment analysis. RCurl - curl (get the contents of a web page). twitteR - use data from Twitter. wordnet - WordNet usage (dictionary needed). e1071 - machine learning (clustering, SVM, naive Bayes, LSA).
  • 16. Using R: package usage. Installation: install.packages("name"). Usage: library(name).
  • 17. Twitter with R. Load packages:
      library(twitteR)
      library(tm)
      library(RCurl)
      library(qdap)
      library(wordcloud)
  • 18. Twitter with R. Get a token:
      reqURL <- "https://api.twitter.com/oauth/request_token"
      accessURL <- "https://api.twitter.com/oauth/access_token"
      authURL <- "https://api.twitter.com/oauth/authorize"
      consumerKey <- "key"
      consumerSecret <- "secret"
      twitCred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret,
                                   requestURL=reqURL, accessURL=accessURL, authURL=authURL)
      # The method returns a link where you get a PIN code; enter the code when prompted
      twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
      registerTwitterOAuth(twitCred)
  • 19. Twitter with R. Get the data and convert it to a corpus:
      # search by hashtag (you may also search by plain words); get n=1000 entries
      gglTweets <- searchTwitter('#sochi2014', n=1000)
      n <- length(gglTweets)
      # show the first 3 entries
      gglTweets[1:3]
      # put the tweets in a data frame
      df <- do.call(rbind, lapply(gglTweets, as.data.frame))
      # get the dimensions
      dim(df)
      # create a corpus
      myCorpus <- Corpus(VectorSource(df$text))
  • 20. Twitter with R. Do normalization:
      myCorpus <- tm_map(myCorpus, tolower)
      # remove punctuation
      myCorpus <- tm_map(myCorpus, removePunctuation)
      # remove numbers
      myCorpus <- tm_map(myCorpus, removeNumbers)
      # remove stop words (very frequent words, e.g. articles, prepositions)
      myStopwords <- c(stopwords('english'), "sochi", "amp", "get")
      myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
  • 21. Twitter with R. Stem the documents:
      dictCorpus <- myCorpus
      # apply stemming for normalization; you may use lemmatization instead
      myCorpus <- tm_map(myCorpus, stemDocument)
      inspect(myCorpus[1:3])
      myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
      inspect(myCorpus[1:3])
  • 22. Twitter with R. Create a TDM:
      # create a term-document matrix; you may use the TF or TF-IDF metric
      myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1, weighting = weightTfIdf))
      inspect(myDtm[66:70, 11:20])
      # frequent terms and associations
      findFreqTerms(myDtm, lowfreq=10)
  • 23. Twitter with R. Create a word cloud:
      # convert the TDM to a plain matrix
      m <- as.matrix(myDtm)
      # sort by decreasing frequency
      v <- sort(rowSums(m), decreasing=TRUE)
      # show the first 14 entries
      head(v, 14)
      # get the terms (names of v)
      words <- names(v)
      # create a data frame of words with frequencies
      dat <- data.frame(word=words, freq=v)
      # create a word cloud from words that appeared at least 5 times
      wordcloud(dat$word, dat$freq, min.freq=5)
  • 24. Sentiment analysis: Experience Project. Experience Project is a free social networking website consisting of various online communities. Users/members submit experiences: personal stories, confessions, blogs, groups, photos, and videos. The users assign categories to the stories. Example: "I really hate being shy ... I just want to be able to talk to someone about anything and everything and be myself ... That's all I've ever wanted." Reactions: hugs: 1; rock: 1; teehee: 2; understand: 10; wow: 0. Author age: 21. Author gender: female. Text group: friends.
  • 25. Experience Project: data. Let's load the data:
      # read the .csv file with the data
      ep = read.csv('ep3-context.csv')
    Here, Count is the number of Category reactions received by confessions containing Word in Group with an author of Gender and Age. Total is the number of Category reactions used by confessions containing any Word in Group with an author of Gender and Age.
  • 26. Experience Project: data. Look at different parameters:
      # show examples of words
      levels(ep$Word)
  • 27. Words and categories: word-category correlation. Check whether there is any correlation between words and categories:
      # include the source file
      source('ep.R')
      # create a subset for the word 'funny'
      funny = epCollapsedFrame(ep, 'funny')
      # plot the frequencies of the word for each category
      plot(funny$Category, funny$Count, xlab='Category', ylab='Count', main='funny')
  • 28. Words and categories: word-category correlation. [Figure: raw counts of 'funny' per category.] "Funny" corresponds to the "understand" category. This doesn't look realistic.
  • 29. Words and categories: word-category correlation. We need normalization!
      # apply normalization (divide by the total number of words)
      funny$Count / funny$Total
      # get a subset for 'funny', taking frequencies into account
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE)
      # create a plot
      plot(funny$Category, funny$Freq, xlab='Category', ylab='Count/Total', main='funny')
  • 30. Words and categories: word-category correlation. [Figure: Count/Total of 'funny' per category.] Much better!
  • 31. Probability theory: get the category from the word. Freq corresponds to the conditional probability P(word|category), i.e. the probability that a speaker used 'word' in a given 'category'. Let's apply Bayes' rule and compute P(category|word), i.e. the probability of the category given that a speaker used 'word'.
      funny$Freq / sum(funny$Freq)
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE)
      plot(funny$Category, funny$Pr, xlab='Category', ylab='(Count/Total)/sum(Count/Total)', main='funny')
    Question: are any other words specific to a category?
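The normalization on this slide, Freq / sum(Freq), equals P(category|word) from Bayes' rule only when every category is given the same prior probability. The Python sketch below spells out that computation. The counts and totals are made-up numbers, not values from the Experience Project data, and `category_posteriors` is a hypothetical helper name.

```python
def category_posteriors(counts, totals):
    """P(category | word), assuming a uniform prior over categories.

    counts[c]: reactions of category c on texts containing the word
    totals[c]: all reactions of category c
    """
    # P(word | category), estimated as Count / Total
    freqs = {c: counts[c] / totals[c] for c in counts}
    norm = sum(freqs.values())
    # With equal priors, Bayes' rule reduces to dividing by the sum of the likelihoods
    return {c: f / norm for c, f in freqs.items()}

# Made-up numbers for illustration only
counts = {"hugs": 20, "rock": 30, "teehee": 50}
totals = {"hugs": 2000, "rock": 1000, "teehee": 1000}

posteriors = category_posteriors(counts, totals)
for category, p in posteriors.items():
    print(category, round(p, 3))
```

Because "hugs" reactions are twice as common overall in this toy data, its share of the posterior is smaller than its raw count alone would suggest.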
  • 32. Compare with the expected value: estimate the expected value.
      funny = epCollapsedFrame(ep, 'funny', freqs=TRUE, probs=TRUE, oe=TRUE)
    Expected value: \( \mathrm{Exp} = \sum_{i=1}^{N} x_i \, p(x_i) \), where \( p(x_i) \) is the probability of \( x_i \).
      category.probs = (funny$Total/sum(funny$Total))
      funny.count = sum(funny$Count)
      funny.expected = funny.count * category.probs
      funny.expected
    Compare expected and observed values:
      (funny$observed / funny.expected) - 1
    A value less than 0 means that the word is underrepresented in the category.
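The observed/expected comparison can be sketched in Python as follows. This assumes the observed value is the word's count in each category, and it uses made-up numbers rather than the Experience Project data; `observed_expected` is a hypothetical helper name.

```python
def observed_expected(counts, totals):
    """Return observed / expected - 1 for each category.

    The expected count spreads the word's overall count across categories
    in proportion to each category's share of all reactions.
    """
    grand_total = sum(totals.values())
    word_count = sum(counts.values())
    result = {}
    for category in counts:
        expected = word_count * totals[category] / grand_total
        result[category] = counts[category] / expected - 1
    return result

# Made-up numbers for illustration only
counts = {"hugs": 10, "teehee": 30}
totals = {"hugs": 1000, "teehee": 1000}

oe = observed_expected(counts, totals)
print(oe)  # negative: underrepresented, positive: overrepresented
```

Here both categories are equally large, so each is expected to receive half of the 40 occurrences. The word is therefore underrepresented in "hugs" and overrepresented in "teehee".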
  • 33. Adding context: 'awesome' by gender. Usage of 'awesome' by male/female/unknown:
      eptok = read.csv('ep3-context-tokencounts.csv')
      par(mfrow=c(1,3))
      epPlot(ep, eptok, 'awesome', genders='male', probs=T)
      epPlot(ep, eptok, 'awesome', genders='female', probs=T)
      epPlot(ep, eptok, 'awesome', genders='unknown', probs=T)
  • 34. [Figure: usage of 'awesome' by gender (male/female/unknown).]
  • 35. Adding context: 'awesome' by age. Usage of 'awesome' by people of different ages:
      par(mfrow=c(2,3))
      for (i in 1:5) {
        epPlot(ep, eptok, 'awesome', ages=i, probs=T)
      }
  • 36. [Figure: usage of 'awesome' by people of different ages.]
  • 37. Adding context: 'awesome', comparing gender with the category. Changing the parameter for each category separately:
      epCategoryByFactorPlot(ep, eptok, 'awesome', 'Gender', probs=T, type='b')
  • 38. [Figure: 'awesome', gender + category.]
  • 39. Adding context: 'drunk', comparing age with the category. Stories with "drunk" depend on the age:
      epCategoryByFactorPlot(ep, eptok, 'drunk', 'Age', probs=T, type='b')
  • 40. [Figure: 'drunk', age + category.]
  • 41. Creating a logistic regression model: regression modelling. Let's create a regression model that predicts the frequency of 'drunk' using age and category:
      drunk = epFullFrame(ep, 'drunk', age=c(1,2,3,4,5))
      drunk$Age = as.numeric(drunk$Age)
      fit.glm = glm(cbind(Count, Total-Count) ~ Category - 1 + Age, data=drunk, family=binomial)
      summary(fit.glm)
  • 42. Creating a logistic regression model: regression modelling. Define a function that predicts the probability of the word from the category and the age of the author:
      FittedGlmFunc = function(fit, category, age) {
        coefs = fit$coef
        # coefficient names look like 'Categorywow'
        cat.coef = coefs[[paste('Category', category, sep='')]]
        # inverse logit of the linear predictor
        prediction = plogis(cat.coef + coefs[['Age']]*age)
        return(prediction)
      }
    Calling the function:
      FittedGlmFunc(fit.glm, 'wow', 1)
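What the fitted function computes can be shown without R: the category coefficient plus the age coefficient times age gives a log-odds value, and the logistic function (R's `plogis`) turns it into a probability. The Python sketch below uses made-up coefficient values, not the output of the model fitted in these slides. The `plogis` and `fitted_probability` functions are defined here only for illustration.

```python
import math

def plogis(x):
    """Logistic (inverse logit) function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def fitted_probability(coefs, category, age):
    """Predicted probability for one category and age group.

    coefs maps coefficient names to values, following R's naming scheme
    for the formula 'Category - 1 + Age' ('Categorywow', 'Age', ...).
    """
    log_odds = coefs["Category" + category] + coefs["Age"] * age
    return plogis(log_odds)

# Made-up coefficients for illustration only
coefs = {"Categorywow": -7.0, "Categoryhugs": -8.0, "Age": -0.1}

p_wow = fitted_probability(coefs, "wow", 1)
p_hugs = fitted_probability(coefs, "hugs", 1)
print(p_wow, p_hugs)
```

Because the formula drops the intercept (`- 1`), every category gets its own baseline coefficient while the age slope is shared across categories.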
  • 43. Creating a logistic regression model: regression modelling. Visualize the data and compare empirical (black) values with fitted (red) values:
      par(mfrow=c(2,3))
      cats = levels(ep$Category)
      for (i in 1:5) {
        epPlot(ep, eptok, 'drunk', age=i)
        for (j in 1:5) {
          val = FittedGlmFunc(fit.glm, cats[j], i)
          points(j, val, col='red', pch=19)
        }
      }
  • 44. [Figure: empirical (black) versus fitted (red) values for 'drunk', one panel per age group.]
  • 45. IMDB data. Analysis of ADV-ADJ collocations.
  • 46. Data from rating systems. We will use data from rating systems (Amazon.com, OpenTable.com, Goodreads.com, IMDB.com). Load them:
      d = read.csv('ratings-advadj.csv')
      head(d)
  • 47. Extract subsets: 'horrid' categories.
      horrid = ratingFullFrame(d, 'horrid', types=NULL, modifiers=NULL, modifier.types=NULL, ratingmax=0)
      nrow(horrid)
      head(horrid)
  • 48. Extract subsets: 'absolutely' + 'horrid'. With a modifier:
      horrid = ratingFullFrame(d, 'horrid', modifiers='absolutely')
      nrow(horrid)
      head(horrid)
  • 49. Tonality evaluation for adjectives. Probabilities of categories for 'horrid':
      horrid = ratingCollapsedFrame(d, 'horrid', freqs=TRUE, probs=TRUE)
      horrid
  • 50. Tonality: probabilities vs frequencies.
      par(mfrow=c(1,2))
      ratingPlot(d, 'horrid', probs=FALSE)
      ratingPlot(d, 'horrid', probs=TRUE)
  • 51. Evaluating expectation: predict a category based on an adjective. Expectation:
      sum(horrid$Category * horrid$Pr)
    The ExpectedCategory function does the same:
      ExpectedCategory(horrid)
    Adding the value to the plot:
      ratingPlot(d, 'horrid', probs=TRUE, ec=TRUE)
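The expectation used here is the probability-weighted average of the rating categories. A minimal Python sketch, with a made-up distribution for an adjective over a 1-5 rating scale (these are not values from the ratings data, and `expected_category` is a hypothetical helper name):

```python
def expected_category(probs):
    """Expected rating: sum of rating * P(rating | word) over all ratings."""
    return sum(rating * p for rating, p in probs.items())

# Made-up distribution P(rating | 'horrid') for illustration only
horrid_probs = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625, 5: 0.0625}

ec = expected_category(horrid_probs)
print(ec)
```

A value close to the bottom of the scale indicates a word used mostly in negative reviews, a value close to the top a word used mostly in positive ones.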
  • 52. Data Sources How to retrieve the data? Data preprocessing Some key concepts Facebook R package Twitter R Sentiment analysis (Based on Chris Potts tutorial ) Experience project Experience project IMDB: Vector space models Evaluating expectation Predict category using adjective Ekaterina Vylomova Working with linguistic data
  • 53. Data Sources How to retrieve the data? Data preprocessing Some key concepts Facebook R package Twitter R Sentiment analysis (Based on Chris Potts tutorial ) Experience project Experience project IMDB: Vector space models Regression model A model for predicting Let's create a model to predict probability that a word will be in particular category fit.horrid = glm(cbind(horrid$Count , horrid$Total -horrid$ Count) ~ Category , family=quasibinomial , data=horrid) fit.horrid Ekaterina Vylomova Working with linguistic data
  • 54. Regression model: a model for predicting
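To make the glm call concrete without the IMDB file, here is a minimal sketch on made-up counts. The data frame and its numbers are hypothetical; only the model formula and the quasibinomial family come from the slides.

```r
# Hypothetical data: counts of a negative word fall as the rating rises.
toy <- data.frame(
  Category = 1:10,
  Count    = c(90, 70, 55, 40, 30, 22, 15, 10, 7, 5),
  Total    = rep(10000, 10)
)

# Logistic regression on (successes, failures) pairs. quasibinomial gives the
# same coefficient estimates as binomial but also estimates a dispersion
# parameter, which is useful when the counts are overdispersed.
fit <- glm(cbind(Count, Total - Count) ~ Category,
           family = quasibinomial, data = toy)

# Fitted probability of the word in each rating category.
toy$Fitted <- predict(fit, type = "response")
print(coef(fit))
```

A negative Category coefficient means the log-odds of seeing the word drop as the rating increases.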
  • 55. Regression model: a model for predicting
    Improve the model by adding a quadratic term:
        GlmWordQuadratic <- function(pf) {
          pf$Category2 = pf$Category^2
          fit = glm(cbind(Count, Total - Count) ~ Category + Category2, family=quasibinomial, data=pf)
          return(fit)
        }
        par(mfrow=c(2,2))
        ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
        ratingPlot(d, 'good', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
        ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=5, ylim=c(0, 0.5))
        ratingPlot(d, 'disappointing', probs=TRUE, models=c(GlmWordQuadratic), ratingmax=10, ylim=c(0, 0.3))
  • 56. Regression model: a model for predicting
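One way to see what the quadratic term buys is to fit both models to a word whose usage peaks in the middle of the scale. The sketch below does this on made-up, hump-shaped counts (hypothetical data, not IMDB) and compares residual deviances; lower deviance means a closer fit.

```r
# Hypothetical counts for a word that is most common in mid-range ratings.
toy <- data.frame(
  Category = 1:10,
  Count    = c(10, 18, 30, 45, 55, 56, 44, 31, 17, 9),
  Total    = rep(10000, 10)
)
toy$Category2 <- toy$Category^2

# Linear log-odds: can only rise or fall monotonically with the rating.
fit_linear <- glm(cbind(Count, Total - Count) ~ Category,
                  family = quasibinomial, data = toy)

# Quadratic log-odds: the extra term lets the curve bend.
fit_quadratic <- glm(cbind(Count, Total - Count) ~ Category + Category2,
                     family = quasibinomial, data = toy)

print(c(linear = deviance(fit_linear), quadratic = deviance(fit_quadratic)))
```

For a hump like this one the Category2 coefficient comes out negative, so the fitted curve rises and then falls.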
  • 57. Vector space models
    How to transform words into vectors: LSA (latent semantic analysis), MDS (multidimensional scaling)
  • 58. Basics about vectors
    Euclidean distance: $\mathrm{EuclideanDist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
    Vector length: $\mathrm{VectorLength}(x) = \sqrt{\sum_{i=1}^{n} x_i^2}$
    Vector normalization: each component is divided by the vector's length.
    Cosine distance between vectors: $\mathrm{CosineDist}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\mathrm{VectorLength}(x) \cdot \mathrm{VectorLength}(y)}$
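These definitions translate almost directly into base R. The function names below are illustrative and deliberately differ from the ones in vsm.R; this is a sketch of the standard formulas, not a copy of the tutorial's code.

```r
# Euclidean distance between two numeric vectors of equal length.
euclidean_dist <- function(x, y) sqrt(sum((x - y)^2))

# Length (L2 norm) of a vector.
vector_length <- function(x) sqrt(sum(x^2))

# Length normalization: divide each component by the vector's length.
length_norm <- function(x) x / vector_length(x)

# Cosine distance: 1 minus the cosine of the angle between x and y.
cosine_dist <- function(x, y) {
  1 - sum(x * y) / (vector_length(x) * vector_length(y))
}

x <- c(3, 4)
y <- c(4, 3)
print(euclidean_dist(x, y))   # sqrt(2)
print(vector_length(x))       # 5
print(cosine_dist(x, y))      # 1 - 24/25 = 0.04
print(cosine_dist(x, 10 * x)) # 0 up to rounding: scaling keeps the direction
```

Cosine distance ignores vector length, while Euclidean distance does not, which is why the two can rank neighbors differently on raw counts.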
  • 59. Vector space models: data from IMDB
    Initial data: a term × term matrix whose element x_ij is the frequency with which term_i and term_j co-occur in the same context (document, sentence, etc.).
        source('vsm.R')
        # Co-occurrence matrix (words appearing in the same context: phrase, sentence, paragraph)
        imdb = Csv2Matrix('imdb-wordword.csv')
        imdb[100:110, 100:110]
  • 60. Semantically related words: extract semantically related words
        df = Neighbors(imdb, 'happy')
        head(df)
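The idea behind a neighbor lookup can be sketched on a tiny co-occurrence matrix. Everything below is hypothetical: the counts are made up, neighbors_sketch is an illustrative helper rather than the Neighbors function from vsm.R, and cosine distance is assumed as the ranking measure.

```r
# Toy term x term co-occurrence counts (made-up, symmetric).
terms <- c("happy", "glad", "sad", "gloomy")
m <- matrix(c(10,  8, 1,  0,
               8, 12, 2,  1,
               1,  2, 9,  7,
               0,  1, 7, 11),
            nrow = 4, byrow = TRUE, dimnames = list(terms, terms))

cosine_dist <- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Rank every row of `mat` by its cosine distance to the row for `word`.
neighbors_sketch <- function(mat, word) {
  d <- apply(mat, 1, cosine_dist, y = mat[word, ])
  out <- data.frame(Neighbor = names(d), Distance = unname(d),
                    stringsAsFactors = FALSE)
  out[order(out$Distance), ]
}

df <- neighbors_sketch(m, "happy")
print(df)
```

The word itself comes first with distance 0, followed by the rows whose co-occurrence profile points in the most similar direction.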
  • 61. Semantically related words: problem
    Both normalizations map a and b to the same vector, although the raw counts differ by a factor of 1000:
        a = c(1000, 2000, 3000)
        b = c(1, 2, 3)
        a/sum(a)       # 0.1666667 0.3333333 0.5000000
        b/sum(b)       # 0.1666667 0.3333333 0.5000000
        LengthNorm(a)  # 0.2672612 0.5345225 0.8017837
        LengthNorm(b)  # 0.2672612 0.5345225 0.8017837
  • 62. PMI - Pointwise mutual information
    How to deal with it? PMI!
    $\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$
    PMI normalization:
    $\mathrm{NPMI}(i, j) = \mathrm{pmi}(i, j) \cdot \frac{p(i, j)}{p(i, j) + 1} \cdot \frac{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right)}{\min\left(\sum_{k=1}^{m} p(k, j),\ \sum_{k=1}^{n} p(i, k)\right) + 1}$
    where $p(i, j) = M_{ij} / \mathrm{sum}(M)$ and $M$ is the term × term matrix.
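A minimal base-R sketch of PMI reweighting on a toy matrix is shown below. It is not the PMI function from vsm.R: pmi_sketch and its argument names are illustrative, the counts are made up, the discount is applied to probabilities as in the slide formula (with row and column sums as the two marginals), and mapping zero counts to a PMI of 0 is an assumed convention.

```r
# Toy term x term co-occurrence counts (made-up, symmetric).
terms <- c("a", "b", "c")
m <- matrix(c(10, 0, 2,
               0, 6, 3,
               2, 3, 4),
            nrow = 3, byrow = TRUE, dimnames = list(terms, terms))

pmi_sketch <- function(mat, positive = TRUE, discounting = TRUE) {
  p    <- mat / sum(mat)              # joint probabilities p(i, j)
  prow <- rowSums(p)                  # row marginals
  pcol <- colSums(p)                  # column marginals
  pmi  <- log(p / outer(prow, pcol))  # log(p(i,j) / (p(i) * p(j)))
  pmi[mat == 0] <- 0                  # assumed convention: 0 instead of -Inf
  if (discounting) {
    # Shrink cells supported by little probability mass.
    mins <- outer(prow, pcol, pmin)
    pmi  <- pmi * (p / (p + 1)) * (mins / (mins + 1))
  }
  if (positive) pmi[pmi < 0] <- 0     # positive PMI: clip negative values
  pmi
}

raw  <- pmi_sketch(m, positive = FALSE, discounting = FALSE)
ppcd <- pmi_sketch(m)
print(round(raw, 3))
print(round(ppcd, 3))
```

Unlike length normalization, PMI compares each cell with what the row and column totals would predict by chance, so very frequent terms no longer dominate every comparison.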
  • 63. PMI - Pointwise mutual information
        imdb.ppcd = PMI(imdb, positive=TRUE, discounting=TRUE)
        df = Neighbors(imdb.ppcd, 'happy', byrow=TRUE, distfunc=CosineDistance)
        head(df)
  • 64. Semantic orientation method
    Define two sets of seed words, S1 and S2 (as vector representations).
    Choose a distance measure.
    For a word w, calculate the sum of its distances to the vectors of S1 and to the vectors of S2.
    The tonality is the difference between the two sums.
  • 65. Semantic orientation method: example
        neg = c('bad', 'nasty', 'poor', 'negative', 'unfortunate', 'wrong', 'inferior')
        pos = c('good', 'nice', 'excellent', 'positive', 'fortunate', 'correct', 'superior')
        SemanticOrientation(imdb.ppcd, word='great', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
        # 0.8923544
        SemanticOrientation(imdb.ppci, word='horrid', seeds1=neg, seeds2=pos, distfunc=CosineDistance)
        # -0.04741898
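The semantic orientation steps can be sketched in a few lines of base R. The two-dimensional word vectors below are made up for illustration, and semantic_orientation_sketch is a hypothetical helper, not the SemanticOrientation function from vsm.R. The score is taken as (sum of distances to seeds1) minus (sum of distances to seeds2), which matches the sign of the 'great' result on the slide when seeds1 is the negative set.

```r
cosine_dist <- function(x, y) 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Toy word vectors (made-up): the first axis stands for co-occurrence with
# negative contexts, the second for co-occurrence with positive contexts.
vecs <- rbind(
  bad   = c(9, 1),
  poor  = c(8, 2),
  good  = c(1, 9),
  nice  = c(2, 8),
  great = c(1, 7),
  awful = c(7, 1)
)

semantic_orientation_sketch <- function(mat, word, seeds1, seeds2) {
  # Sum of cosine distances from `word` to every seed in a set.
  dist_to <- function(seeds) {
    sum(sapply(seeds, function(s) cosine_dist(mat[word, ], mat[s, ])))
  }
  dist_to(seeds1) - dist_to(seeds2)
}

neg <- c("bad", "poor")
pos <- c("good", "nice")
so_great <- semantic_orientation_sketch(vecs, "great", neg, pos)
so_awful <- semantic_orientation_sketch(vecs, "awful", neg, pos)
print(c(great = so_great, awful = so_awful))
```

A positive score means the word lies farther from the negative seeds than from the positive ones; a negative score means the opposite.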
  • 66. More information: data examples
    For more detailed examples and tutorials on sentiment analysis, see Chris Potts' tutorials:
    http://nasslli2012.christopherpotts.net
    http://sentiment.christopherpotts.net
    Email me if you need any help!