OPENEDITION LAB
TEXT MINING
PROJECTS
Patrice Bellot
Aix-Marseille Université - CNRS (LSIS UMR 7296 ; OpenEdition)
patrice.bellot@univ-amu.fr
LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab: http://lab.hypotheses.org
Hypotheses: 600+ blogs
Revues.org: 300+ journals
Calenda: 20 000+ events
OpenEdition Books: 1000+ books
A European Web platform for the Humanities and Social Sciences
A digital infrastructure for open access
A lab for experimenting with new Text Mining and new IR systems
OpenEdition - a Facility of Excellence
2012-2020: €7 million
Objectives:
15 000+ books
2000+ blogs
Freemium
Multilingual
OpenEdition Lab — Our Team
Directors : Patrice Bellot (Professor in Comp. Sc. / NLP / IR) - Marin Dacos (Head of OpenEdition)
Engineers : Elodie Faath - Arnaud Cordier
PhD Students : Hussam Hamdan, Chahinez Benkoussas, Anaïs Ollagnier
Post-docs : Young-Min Kim (2012-13), Shereen Albitar (2014)
http://lab.hypotheses.org
Our partners
• 220 learned societies and centers (France)
• 30 university presses (France, UK, Belgium, Switzerland, Canada, Mexico, Hungary/USA)
• CCSD - France - Lyon (HAL / DataCenter)
• CHNM - USA - Washington
• OAPEN - NL - The Hague
• UNED - Spain - Universidad Nacional de Educación a Distancia
• Fundação Calouste Gulbenkian - Portugal
• Max Weber Stiftung - Germany
• Google - USA (Google Grants for DH)
• DARIAH - Europe
And you?
OpenEdition Lab (Text Mining Projects for DL)
Aims to:
— Link papers / books / blogs automatically (reference analysis, named entities…)
— Detect hot topics, hot books, hot papers: content-oriented analysis (not only by using logs)
   - sentiment analysis
   - finding and analyzing book reviews
— Book searching with complex and long queries
— Reading recommendation
Project 1: BILBO
EN SVM CRF
Natural Language Processing / Text Mining / Information Retrieval / Machine Learning
Project 2: ECHO
Automatic detection of book reviews (LREC 2014)
[Diagram: linking (BILBO), Web search, sentiment analysis (NAACL-SemEval 2013), measuring the echo (logs, metrics…)]
Project 3: COOKER
[Diagram: BILBO and ECHO feed a content graph (then a hypergraph); automatic classification and metadata (topics, languages, authors…); recommendation (COOKER)]
Semantic Annotation of Bib. References
A: references in a specific section
B: references in notes
C: references in the body
BILBO: A software for annotating bibliographical references in Digital Humanities
Google Digital Humanities Research Awards (2011, 2012)
State of the art:
— CiteSeer system (Giles et al., 1998) in computer science: 80% precision for authors and 40% for pages. Conditional Random Fields (CRFs) (Peng et al., 2006; Lafferty et al., 2001) for scientific articles: 95% average precision (99% for authors, 95% for titles, 85% for editors).
— These systems run on the cover page (title) and/or on the reference section at the end of papers: not on the footnotes, not on the text body
— Not very robust in the real world (no stylesheets, or poorly respected ones)
References - a three-level architecture

[Architecture diagram: XHTML sources following the TEI guidelines (learning data) and plain text (new data) are tokenized and features are extracted; manually annotated learning data are passed to external machine learning modules (Conditional Random Fields with Mallet, SVMlight) to train three models - Level 1 (references in the bibliography), Level 2 (references in notes), Level 3 (implicit references); the automatic annotator calls a model on new data and produces estimated XML files; a Web service accepts plain-text input; comparison with other online tools on new reference data.]

• Revues.org online journals
  - 340 journals
  - various reference formats
  - 20 different languages (90% in French)
• Unstructured and scattered reference data
• Prototype development, Web service
  → source code will be distributed (GPL)
• Google Digital Humanities Research Awards ('10, '11)
• Part of the Equipex future investment award: DILOH ('12)
Kim & Bellot, 2012
Conditional Random Fields for IE
— A discriminative model that is specified over a graph that encodes the conditional dependencies
(relationships between observations)
— Can be employed for sequential labeling (linear chain CRF)
— Takes context into account
— The probability of a label sequence y given an observation sequence x is:

  p(y | x, λ) = (1 / Z(x)) · exp( Σ_j λ_j F_j(y, x) )

  with F_j the (rich) feature functions (transition and state functions) and Z(x) a normalization factor
— Parameters must be estimated using an iterative technique such as iterative scaling or gradient-based methods
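To make the linear-chain CRF concrete, here is a minimal, hedged sketch of reference-field labeling with the sklearn-crfsuite package; the slides mention Mallet and SVMlight as the actual machine learning modules, so the library choice, the toy features and the toy reference below are illustrative assumptions only.

# Minimal linear-chain CRF sketch for labeling reference tokens (surname, forename, date, ...).
# Illustration with sklearn-crfsuite, not the BILBO implementation.
import sklearn_crfsuite

def token_features(tokens, i):
    """Feature dict for token i: the state/transition functions use these observations."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_all_caps": tok.isupper(),
        "first_cap": tok[:1].isupper(),
        "is_number": tok.isdigit(),
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

# Toy training data: one annotated reference (token sequence + label sequence).
train_tokens = [["Kim", ",", "Y.-M.", ",", "2013", "."]]
train_labels = [["surname", "punc", "forename", "punc", "date", "punc"]]

X_train = [[token_features(seq, i) for i in range(len(seq))] for seq in train_tokens]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, train_labels)

print(crf.predict(X_train))  # predicted label sequence for each reference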
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML 2001).

[Slide excerpts from Lafferty, McCallum & Pereira (ICML 2001) and from Sutton & McCallum's introduction to CRFs:
- Figure: graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences; an open circle indicates that the variable is not generated by the model.
- Feature functions f_j(y_{i-1}, y_i, x, i) are either state functions s(y_i, x, i) or transition functions t(y_{i-1}, y_i, x, i), built from real-valued observation features b(x, i) (e.g. b(x, i) = 1 if the observation at position i is the word "September", 0 otherwise). Writing F_j(y, x) = Σ_{i=1..n} f_j(y_{i-1}, y_i, x, i), the probability of a label sequence y given an observation sequence x is p(y | x, λ) = (1/Z(x)) exp(Σ_j λ_j F_j(y, x)), where Z(x) is a normalization factor and λ_j are parameters estimated from training data.
- Parameter estimation uses improved iterative scaling (IIS) updates for the edge and vertex feature weights; the loss function is convex, as for general maximum entropy models.
- Figure: relationships between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative directed models and general CRFs (generative vs. conditional; single class vs. sequence vs. general graphs); naive Bayes and logistic regression define the same family of distributions, the difference being generative vs. discriminative training.
- Figures: graphical model of an HMM-like linear-chain CRF, and of a linear-chain CRF in which the transition score depends on the current observation.]
Table V: Verified input, local and global features. The selected ones in BILBO are written in black, and the non-selected ones are in gray (in the original table).

Input features
  Raw input token (I1): the tokenized word itself in the input string and the lowercased word
  Preceding or following tokens (I2): three preceding and three following tokens of the current token
  N-gram (I3): attachment of preceding or following N-gram tokens
  Prefix/suffix at character level (I4): 8 different prefixes/suffixes as in [Councill et al. 2008]

Local features
  Number (F1): ALLNUMBERS (all characters are numbers, e.g. 1984), NUMBERS (one or more characters are numbers, e.g. in-4), DASH (one or more dashes included in numbers, e.g. 665-680); (F1digit) 1DIGIT, 2DIGIT, ... (if number, number of digits in it, e.g. 5, 78, ...)
  Capitalization (F2): ALLCAPS (all characters are capital letters, e.g. RAYMOND), FIRSTCAP (first character is a capital letter, e.g. Paris), ALLSMALL (all characters are lower-cased, e.g. pouvoirs), NONIMPCAP (capital letters are mixed, e.g. dell'Ateneo)
  Regular form (F3): INITIAL (initialized expression, e.g. Ch.-R.), WEBLINK (regular expression for web pages, e.g. apcss.org)
  Emphasis (F4): ITALIC (italic characters, e.g. Regional)
  Location (F5): BIBL_START (position is in the first one-third of the reference), BIBL_IN (between one-third and two-thirds), BIBL_END (between two-thirds and the end)
  Lexicon (F6): POSSEDITOR (possible abbreviation of editor, e.g. ed.), POSSPAGE (possible abbreviation of page, e.g. pp.), POSSMONTH (possible month, e.g. September), POSSVOLUME (possible abbreviation of volume, e.g. vol.)
  External list (F7): SURNAMELIST (found in an external surname list, e.g. RAYMOND), FORENAMELIST (external forename list, e.g. Simone), PLACELIST (external place list, e.g. New York), JOURNALLIST (external journal list, e.g. African Affaire)
  Punctuation (F8): COMMA, POINT, LINK, PUNC, LEADINGQUOTES, ENDINGQUOTES, PAIREDBRACES - the punctuation mark itself (comma, point) or the punctuation type; defined especially for the case of non-separated punctuation (e.g. 46-55, 1993. S.; [en "The design". (1))

Global features
  Local feature existence (G1): [local feature name] - the corresponding local feature is found in the input string (F3, F4 and F6 features are finally selected)
  Feature distribution (G2): NOPUNC, 1PUNC, 2PUNC, MOREPUNC (there are no, 1, 2, or more PUNC features in the input string), NONUMBER (there is no number in the input string), STARTINITIAL (the input string starts with an initial expression), ENDQUOTECOMMA (an ending quote is followed by a comma), FIRSTCAPCOMMA (a token having the FIRSTCAP feature is followed by a comma)
Kim & Bellot, 2013
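As an illustration of how a few of the local features of Table V could be computed for one token, here is a hypothetical helper (not BILBO's implementation); only some F1, F2, F3 and F5 features are sketched, and the lexicon (F6) and external-list (F7) lookups are omitted.

import re

# Hypothetical sketch of a few Table V local features (F1, F2, F3, F5) for one token.
def local_features(token, position, n_tokens):
    feats = []
    # F1: number features
    if token.isdigit():
        feats.append("ALLNUMBERS")
    elif any(c.isdigit() for c in token):
        feats.append("NUMBERS")
    if "-" in token and any(c.isdigit() for c in token):
        feats.append("DASH")
    # F2: capitalization features
    if token.isupper():
        feats.append("ALLCAPS")
    elif token[:1].isupper():
        feats.append("FIRSTCAP")
    elif token.islower():
        feats.append("ALLSMALL")
    # F3: regular forms (initialized expressions, web links)
    if re.fullmatch(r"([A-Z][a-z]?\.-?)+", token):
        feats.append("INITIAL")
    if re.search(r"\w+\.(org|com|fr|net)", token):
        feats.append("WEBLINK")
    # F5: position in the reference string (thirds)
    third = position / max(n_tokens, 1)
    feats.append("BIBL_START" if third < 1/3 else "BIBL_IN" if third < 2/3 else "BIBL_END")
    return feats

print(local_features("RAYMOND", 0, 9))   # ['ALLCAPS', 'BIBL_START']
print(local_features("665-680", 7, 9))   # ['NUMBERS', 'DASH', 'BIBL_END']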
[Fig. 3: Basic tokenization effect; each point is the averaged value of 10 different cross-validated experiments.]

[Fig. 4: Cumulative local feature effect from F1 to F8, on (a) corpus level 1 and (b) the Cora dataset: micro-averaged F-measure as a function of the training set size (50% to 90%), adding feature categories F0 (base) to F8 (punctuation) cumulatively.]

We repeat cross validations by cumulatively adding the features of each category from F1 to F8. Too detailed features, such as those of category F1-sub, are excluded here.
Kim & Bellot, 2013
[Fig. 5: Cumulative external list feature effect (F6, F7a, F7b, F7c, F7), on (a) corpus level 1 and (b) the Cora dataset: micro-averaged F-measure as a function of the training set size (50% to 90%).]

Detailed analysis of the effect of external lists and lexicon: one of the interesting discoveries from the above analysis is that lexical features are not always effective for reference parsing. Lexicon features defined with strict, non-overlapping rules actually have no significant impact, whereas external lists such as surname, forename, place and journal lists do.
Kim & Bellot, 2013
Table VII: Micro-averaged precision and recall per field for C1 and Cora with the finally chosen strategy

(a) C1 - detailed labels
  Fields          #true  #annot.  #exist.  prec.(%)  recall(%)
  surname          1080    1164     1203     92.78     89.78
  forename         1128    1220     1244     92.46     90.68
  title(m)         3277    4132     3690     79.31     88.81
  title(a)         2782    3253     3069     85.52     90.65
  title(j)          440     564      681     78.01     64.61
  title(u)          511     660      652     77.42     78.37
  title(s)           18      24      118     75.00     15.25
  publisher        1021    1367     1171     74.69     87.19
  date              793     838      855     94.63     92.75
  biblscope(pp)     210     223      219     94.17     95.89
  biblscope(i)      152     191      189     79.58     80.42
  biblscope(v)       75      87      102     86.21     73.53
  extent             66      69       70     95.65     94.29
  place             433     524      539     82.63     80.33
  abbr              417     468      502     89.10     83.07
  nolabel           231     306      488     75.49     47.34
  edition            46     178      211     25.84     21.80
  orgname            74      87      118     85.06     62.71
  bookindicator      47      49       65     95.92     72.31
  OTHERS             95     177      395     53.67     24.05
  Average         12896   15581    15581     82.77     82.77

(b) Cora dataset
  Fields        #true  #annot.  #exist.  prec.(%)  recall(%)
  author         2797    2855     2830     97.97     98.83
  title          3508    3613     3560     97.10     98.54
  booktitle      1750    1882     1865     92.99     93.83
  journal         546     615      617     88.78     88.49
  date            636     641      642     99.22     99.07
  institution     268     299      306     89.63     87.58
  publisher       165     188      203     87.77     81.28
  location        247     279      289     88.53     85.47
  editor          232     261      295     88.89     78.64
  pages           422     429      438     98.37     96.35
  volume          306     327      320     93.58     95.63
  tech            130     155      178     83.87     73.03
  note             75     122      123     61.48     60.98
  Average       11082   11666    11666     95.00     95.00

Hypothesis 2 is confirmed with these observations. For our system BILBO, the input and local features written in black in Table V are finally selected. BILBO provides two different labeling levels.

Learning on FR data - Testing on US data (715 references)

Kim & Bellot, 2013
Test: http://bilbo.openeditionlab.org
Sources: http://github.com/OpenEdition/bilbo
EQUIPEX OpenEdition: BILBO
IR and Digital Libraries
Sentiment Analysis
Searching for book reviews
• Applying and testing classical supervised approaches for filtering reviews = a new kind of genre
classification.
• Developing a corpus of reviews of books from the OpenEdition.org platforms and from the Web.
• Collecting two kinds of reviews:

— Long reviews of scientific books written by expert reviewers in scientific journals

— Short reviews such as reader comments on social web sites
• Linking reviews to their corresponding books using BILBO
Review ≠ Abstract
Searching for book reviews
• A supervised classification approach
• Feature selection: decision trees, Z-score
• Features: localization of named entities,
We can see that many of these features relate to the class where they predominate.

Table 3: Distribution of the 30 highest normalized Z scores across the corpus.
   1 abandonne      30.14     16 winter            9.23
   2 seront         30.00     17 cleo              8.88
   3 biographie     21.84     18 visible           8.75
   4 entranent      21.20     19 fondamentale      8.67
   5 prise          21.20     20 david             8.54
   6 sacre          21.20     21 pratiques         8.52
   7 toute          20.70     22 signification     8.47
   8 quitte         19.55     23 01                8.38
   9 dimension      15.65     24 institutionnels   8.38
  10 les            14.43     25 1930              8.16
  11 commandement   11.01     26 attaques          8.14
  12 lie            10.61     27 courrier          8.08
  13 construisent   10.16     28 moyennes          7.99
  14 lieux          10.14     29 petite            7.85
  15 garde           9.75     30 adapted           7.84
In our training corpus, we have 106,911 words obtained from the Bag-of-Words approach. We selected all tokens (features) that appear more than 5 times in each class. The goal is therefore to design a method capable of selecting terms that clearly belong to one genre of documents.

This section of a document contains authors' names, locations, dates, etc.; however, in the Review class this section is quite often absent. Based on this analysis, we tagged all documents of each class using the Named Entity Recognition tool TagEN (Poibeau, 2003). We aim to explore the distribution of 3 named entities ("authors' names", "locations" and "dates") in the text after removing all XML-HTML tags. After that, we divided texts into 10 parts (the size of each part = total number of words / 10). The distribution ratio of each named entity in each part is used as a feature to build the new document representation, and we obtained a set of 30 features.

[Figures 3-5: distribution of the "Person", "Location" and "Date" named entities across the document parts.]

6 Experiments
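A rough sketch of the positional named-entity features described above, under the assumption that tokens already carry entity tags; the helper name, the toy tags and the equal-parts split are illustrative (the actual tagging relies on TagEN).

# Sketch of the positional named-entity features: split the text into 10 equal parts and,
# for each part, compute the share of "person", "location" and "date" entity tokens (30 features).
def ne_distribution_features(tokens, entity_tags, n_parts=10):
    """tokens: list of words; entity_tags: parallel list with 'person', 'location', 'date' or None."""
    part_size = max(len(tokens) // n_parts, 1)
    features = []
    for ent in ("person", "location", "date"):
        total = sum(1 for tag in entity_tags if tag == ent)
        for p in range(n_parts):
            part = entity_tags[p * part_size:(p + 1) * part_size]
            count = sum(1 for tag in part if tag == ent)
            features.append(count / total if total else 0.0)
    return features  # 3 entity types x 10 parts = 30 ratios

toy_tokens = ["In", "1930", "David", "wrote", "in", "Paris", "about", "New", "York", "."]
toy_tags = [None, "date", "person", None, None, "location", None, "location", "location", None]
print(len(ne_distribution_features(toy_tokens, toy_tags)))  # 30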
6.2 Support Vector Machines (SVM)
SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle (Vapnik, 1995) from computational learning theory. In their basic form, SVMs learn a linear threshold function. Nevertheless, by a simple plug-in of an appropriate kernel function, they can be used to learn linear classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification (Aggarwal and Zhai, 2012). Having the vectors from the different representations presented below, we used the Weka toolkit to learn the models, with linear and RBF kernels.

For the Naive Bayes model, |w| indicates the number of words included in the current document and w_j denotes a word appearing in the document. The predicted class is

  argmax_{h_i}  P(h_i) · Π_{j=1..|w|} P(w_j | h_i)        (5)

where P(w_j | h_i) = tf_{j,h_i} / n_{h_i}. We estimate the probabilities with Equation (5), relating the lexical frequency of the word w_j over the whole collection T_{h_i} (denoted tf_{j,h_i}) to the size of the corresponding corpus.
Table 4: Performance of the classification models using different indexing schemes on the test set. The best values for the Review class are noted in bold and those for the complementary (non-Review) class are underlined in the original table.

  #  Model                   Review                  non-Review
                          R      P      F-M       R      P      F-M
  1  NB                 65.5%  81.5%  72.6%    81.6%  65.7%  72.8%
     SVM (Linear)       99.6%  98.3%  98.9%    97.9%  99.5%  98.7%
     SVM (RBF)          89.8%  97.2%  93.4%    96.8%  88.5%  92.5%
       (C = 5.0, gamma = 0.00185)
  2  NB                 90.6%  64.2%  75.1%    37.4%  76.3%  50.2%
     SVM (Linear)       87.2%  81.3%  84.2%    75.3%  82.7%  78.8%
     SVM (RBF)          87.2%  86.5%  86.8%    83.1%  84.0%  83.6%
       (C = 32.0, gamma = 0.00781)
  3  NB                 80.0%  68.4%  73.7%    54.2%  68.7%  60.6%
     SVM (Linear)       77.0%  81.9%  79.4%    78.9%  73.5%  76.1%
     SVM (RBF)          81.2%  48.6%  79.9%    72.6%  75.8%  74.1%
       (C = 8.0, gamma = 0.03125)
Benkoussas & Bellot, LREC 2014
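For illustration, a minimal sketch of the review vs. non-review genre classification with bag-of-words features, Multinomial Naive Bayes and a linear SVM in scikit-learn; the paper itself used the Weka toolkit, and the toy texts and the C value reused from Table 4 are only for demonstration.

# Toy sketch of the supervised review / non-review genre classification with bag-of-words
# features and two of the models compared in Table 4. Not the experimental code of the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train_texts = [
    "Cet ouvrage propose une biographie fondamentale de la dimension institutionnelle",  # review
    "Colloque international, appel a communications avant le mois de novembre",          # not a review
]
train_labels = ["review", "not_review"]

vectorizer = CountVectorizer(lowercase=True)
X_train = vectorizer.fit_transform(train_texts)

nb = MultinomialNB().fit(X_train, train_labels)
svm = LinearSVC(C=5.0).fit(X_train, train_labels)

X_test = vectorizer.transform(["Ce compte rendu de lecture discute la biographie proposee"])
print(nb.predict(X_test), svm.predict(X_test))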
Sentiment Analysis in Twitter
Example tweets and sentences from the SemEval-2013 dataset (subjective phrases were highlighted in the original):
- Authorities are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the frontiers.
- Taiwan-made products stood a good chance of becoming even more competitive thanks to wider access to overseas markets and lower costs for material imports, he said.
- "March appears to be a more reasonable estimate while earlier admission cannot be entirely ruled out," according to Chen, also Taiwan's chief WTO negotiator.
- friday evening plans were great, but saturday's plans didnt go as expected - i went dancing & it was an ok club, but terribly crowded :-(
- WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE
- AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while T-Mobile makes that a normal everyday thing
- obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor
- My graduation speech: "I'd like to thanks Google, Wikipedia and my computer! :D #iThingteens

Table 5: List of example sentences with annotations that were provided to the annotators. All subjective phrases are italicized; positive phrases are in green, negative phrases are in red, and neutral phrases are in blue (in the original).

Table 6: Example of a sentence ("I would love to watch Vampire Diaries :) and some Heroes! Great combination") annotated for subjectivity by five Mechanical Turk workers; the words and phrases marked as subjective are intersected across workers, and the accuracy of each annotation is measured against this intersection.
[Excerpt from the SemEval-2013 "Sentiment Analysis in Twitter" task description:]

2 Task Description
We had two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both subtasks. Below we provide short descriptions of the objectives of these two subtasks.

Subtask A: Contextual Polarity Disambiguation
Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The boundaries for the marked instance were provided: this was a classification task, not an entity recognition task.

3 Dataset Creation
In the following sections we describe the collection and annotation of the Twitter and SMS datasets.

3.1 Data Collection
Twitter is the most common micro-blogging site on the Web, and we used it to gather tweets that express sentiment about popular topics. We first extracted named entities using a Twitter-tuned NER system (Ritter et al., 2011) from millions of tweets, which we collected over a one-year period spanning from January 2012 to January 2013; we used the public streaming Twitter API to download tweets.
Subtask B: Message Polarity Classification
Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both a positive and a negative sentiment, whichever is the stronger one was to be chosen.

Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion. Note that constrained/unconstrained refers to the data used to train a classifier: for example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained.

[Table: statistics for Subtask A (total phrase counts - positive, negative, neutral - and vocabulary size per dataset).]

Table 3: Statistics for Subtask B.
  Corpus               Positive   Negative   Objective/Neutral
  Twitter - Training      3,662      1,466               4,600
  Twitter - Dev             575        340                 739
  Twitter - Test          1,573        601               1,640
  SMS - Test                492        394               1,208
Aspect Based Sentiment Analysis
— Subtask 1: Aspect term extraction
Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms
present in the sentence and return a list containing all the distinct aspect terms. An aspect term
names a particular aspect of the target entity.
For example, "I liked the service and the staff, but not the food”, “The food was nothing much, but I
loved the staff”. Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g.,
in “The hard disk is very noisy” the only aspect term is “hard disk”).



— Subtask 2: Aspect term polarity
For a given set of aspect terms within a sentence, determine whether the polarity of each aspect
term is positive, negative, neutral or conflict (i.e., both positive and negative).
For example:
“I loved their fajitas” → {fajitas: positive}

“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}

“The fajitas are their first plate” → {fajitas: neutral}

“The fajitas were great to taste, but not to see” → {fajitas: conflict}
http://alt.qcri.org/semeval2014/task4/
Aspect Based Sentiment Analysis
— Subtask 3: Aspect category detection
Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories
discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of
Subtask 1, and they do not necessarily occur as terms in the given sentence.
For example, given the set of aspect categories {food, service, price, ambience, anecdotes/
miscellaneous}:
“The restaurant was too expensive” → {price}

“The restaurant was expensive, but the menu was great” → {price, food}



— Subtask 4: Aspect category polarity
Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity
(positive, negative, neutral or conflict) of each aspect category.
For example:
“The restaurant was too expensive” → {price: negative}

“The restaurant was expensive, but the menu was great” → {price: negative, food: positive}
http://alt.qcri.org/semeval2014/task4/
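A toy illustration of the Subtask 2 input/output format above, assigning a polarity to each marked aspect term from a tiny hand-made lexicon and a word window; this only shows the expected output shape, not any participant's system.

# Toy illustration of aspect term polarity (Subtask 2): lexicon lookup in a word window.
POSITIVE = {"loved", "great", "liked"}
NEGATIVE = {"hated", "bad", "terrible"}

def aspect_term_polarity(sentence, aspect_terms, window=4):
    tokens = sentence.lower().replace(",", " ").split()
    result = {}
    for term in aspect_terms:
        idx = tokens.index(term.lower())
        context = tokens[max(0, idx - window):idx + window + 1]
        pos = sum(w in POSITIVE for w in context)
        neg = sum(w in NEGATIVE for w in context)
        if pos and neg:
            result[term] = "conflict"
        elif pos:
            result[term] = "positive"
        elif neg:
            result[term] = "negative"
        else:
            result[term] = "neutral"
    return result

print(aspect_term_polarity("I hated their fajitas, but their salads were great",
                           ["fajitas", "salads"]))
# {'fajitas': 'negative', 'salads': 'positive'}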
Experiments with DBpedia, WordNet and SentiWordNet as resources for sentiment analysis in micro-blogging
Hussam Hamdan, Frédéric Béchet, Patrice Bellot (Aix-Marseille University, Marseille, France)
hussam.hamdan@lsis.org - frederic.bechet@lif.univ-mrs.fr - patrice.bellot@lsis.org

Highlights
Twitter is a real-time, highly social microblogging service that allows us to post short messages. Sentiment analysis of Twitter is useful for many domains (marketing, finance, social studies, etc.). Many approaches have been proposed for this task; we applied several machine learning approaches in order to classify tweets using the SemEval 2013 dataset. Several resources were used for feature extraction: WordNet (similar adjectives and verb groups), DBpedia (hidden concepts), SentiWordNet (polarity and subjectivity), and other Twitter-specific features.

System architecture
[Diagram: preprocessing (Twitter dictionary, e.g. expanding emoticons such as ":)" into "very happy") → feature extraction (DBpedia concepts such as "Settlement", WordNet similar adjectives and verb groups, SentiWordNet polarity/subjectivity senti-features, Twitter-specific features) → classification model trained on the SemEval training set and tuned on the development set → positive / negative / objective. Example tweet: "Gas by my house hit …, I'm going to Chapel Hill on Sat. :)"]

Results
- Naive Bayes model: the average F-measure of the negative and positive classes is improved with respect to the unigram model.
- SVM model (linear kernel): the average F-measure of the negative and positive classes is improved with respect to the unigram model.
(P: precision, R: recall, F: F-measure)

Conclusion
- Using the similar adjectives from WordNet has a significant effect with Naive Bayes but little effect with SVM.
- Using the hidden concepts (DBpedia) is not very significant on this dataset; it is more significant for the objective class with SVM.
- Senti-features, Twitter-specific features and verb groups were useful with SVM.

SemEval 2013
Sentiment Analysis on Twitter : Using Z-Score
• Z-Score helps to discriminate words for Document Classification, Authorship Attribution (J. Savoy,
ACM TOIS 2013)
The Z_score of a term t_i in class C_j, noted Z_score(t_ij), is computed from its term relative frequency tfr_ij in class C_j, the mean mean_i (the term probability over the whole corpus multiplied by n_j, the number of terms in class C_j), and the standard deviation sd_i of term t_i over the underlying corpus:

  Z_score(t_ij) = (tfr_ij - mean_i) / sd_i                                        Eq. (1)

  Z_score(t_ij) = (tfr_ij - n_j · P(t_i)) / sqrt(n_j · P(t_i) · (1 - P(t_i)))     Eq. (2)

A term whose frequency in one class is salient in comparison to the other classes will have a salient Z_score. Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (Z > 2) to select the terms whose Z_score exceeds it, then used a logistic regression to combine these scores. We use Z_scores as added features for classification because tweets are short: many tweets do not contain any word with a salient Z_score. Figures 1, 2 and 3 show the distribution of Z_score over each class: the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (>2.5) or very rare (<-1.5). A negative value indicates that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms with the highest Z_score in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.

Table 1. The first ten terms having the highest Z_score in each class
  positive: Love 14.31, Good 14.01, Happy 12.30, Great 11.10, Excite 10.35, Best 9.24, Thank 9.21, Hope 8.24, Cant 8.10, Wait 8.05
  negative: Not 13.99, Fuck 12.97, Don't 10.97, Shit 8.99, Bad 8.40, Hate 8.29, Sad 8.28, Sorry 8.11, Cancel 7.53, stupid 6.83
  neutral: Httpbit 6.44, Httpfb 4.56, Httpbnd 3.78, Intern 3.58, Nov 3.45, Httpdlvr 3.40, Open 3.30, Live 3.28, Cloud 3.28, begin 3.17

- Sentiment lexicons: Bing Liu's Opinion Lexicon, created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, whereas the Subjectivity lexicon contains negative, positive and neutral ones.

- Part Of Speech (POS): we annotate each word in the tweet with its POS tag, and then compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.

4 Evaluation

4.1 Data collection
We used the data set provided in SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014; Wilson, Kozareva et al. 2013). The participants were provided with training tweets annotated as positive, negative or neutral. We downloaded these tweets using a given script. Among 9,646 tweets, we could only download 8,498 of them because of protected profiles and deleted tweets. Then, we used the development set containing 1,654 tweets for evaluating our methods. We combined the development set with the training set and built a new model which predicted the labels of the 2013 and 2014 test sets.

4.2 Experiments

Official results: the results of our system submitted for the SemEval evaluation gave 46.38% and 52.02% for the 2013 and 2014 test sets respectively. It should be mentioned that these results are not correct because of a software bug discovered after the submission deadline; the correct results are reported as non-official results. The official figures are the output of our classifier trained with all the features of Section 3, but, because of an index shifting error, the test set was represented by all the features except the terms.

Non-official results: we ran various experiments using the features presented in Section 3 with a Multinomial Naive Bayes model. Z_score features improve the performance by 6.5% and 10.9%, and pre-polarity features also improve the F-measure by 4% and 6%, but extending with POS tags decreases the F-measure. We also tested all feature combinations (Table 2): POS tags are not useful in any of the experiments, and the best result is obtained by combining Z_score and pre-polarity features. Z_score features improve the F-measure significantly and are better than pre-polarity features.

[Figures 1-3: Z_score distribution in the positive, neutral and negative classes.]

Table 2. Average F-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets.
  Features            F-measure 2013   F-measure 2014
  Terms                    49.42            46.31
  Terms+Z                  55.90            57.28
  Terms+POS                43.45            41.14
  Terms+POL                53.53            52.73
  Terms+Z+POS              52.59            54.43
  Terms+Z+POL              58.34            59.38
  Terms+POS+POL            48.42            50.03
  Terms+Z+POS+POL          55.35            58.58

We repeated all previous experiments using a Twitter dictionary, where we extend the tweet with the expressions related to each emoticon or abbreviation it contains. The results in Table 3 show that using this dictionary improves the F-measure in all the experiments; the best results are again obtained by combining Z_scores and pre-polarity features.

Table 3. Average F-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets after using a Twitter dictionary.
  Features            F-measure 2013   F-measure 2014
  Terms                    50.15            48.56
  Terms+Z                  57.17            58.37
  Terms+POS                44.07            42.64
  Terms+POL                54.72            54.53
  Terms+Z+POS              53.20            56.47
  Terms+Z+POL              59.66            61.07
  Terms+POS+POL            48.97            51.90
  Terms+Z+POS+POL          55.83            60.22

5 Conclusion
In this paper we tested the impact of using a Twitter dictionary, sentiment lexicons, Z_score features and POS tags for the sentiment classification of tweets. We extended the feature vector of tweets with all these features; we proposed a new type of feature, Z_score, and demonstrated that it can improve the performance. We think that Z_score can be used in different ways to improve sentiment analysis; we are going to test it on another type of corpus and use other methods to combine these features.
[Hamdan, Béchet & Bellot, SemEval 2014]
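A small sketch of the Z_score of Eq. (2), assuming the count-based binomial reading of the formula; the toy token lists and the helper are illustrative, not the authors' code.

import math
from collections import Counter

# Sketch of Z_score (Eq. 2): compare the observed frequency of term t in class C_j with its
# expected frequency under the corpus-wide probability P(t). Toy data only.
class_tokens = {
    "positive": ["love", "good", "happy", "good", "wait"],
    "negative": ["not", "bad", "hate", "not", "sad"],
    "neutral":  ["open", "live", "nov", "cloud", "begin"],
}

corpus_counts = Counter(tok for toks in class_tokens.values() for tok in toks)
corpus_size = sum(corpus_counts.values())

def z_score(term, class_name):
    n_j = len(class_tokens[class_name])                  # number of terms in class C_j
    tf_ij = class_tokens[class_name].count(term)         # frequency of the term in C_j
    p_t = corpus_counts[term] / corpus_size               # term probability over the corpus
    mean = n_j * p_t                                       # expected frequency in C_j
    sd = math.sqrt(n_j * p_t * (1.0 - p_t))                # binomial standard deviation
    return (tf_ij - mean) / sd if sd > 0 else 0.0

print(round(z_score("good", "positive"), 2))   # salient in the positive class
print(round(z_score("good", "negative"), 2))   # negative: rarer than expected in this class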
Best official 2013 results

  Run           Constrained   Unconstrained   Use Neut.?   Supervised?
  NRC-Canada       69.02            -             yes          yes
  GU-MLT-LT        65.27            -             yes          yes
  teragram         64.86         64.86 (1)        yes          yes
  BOUNCE           63.53            -             yes          yes
  KLUE             63.06            -             yes          yes
  AMI&ERIC         62.55         61.17 (3)        yes          yes/semi
  FBM              61.17            -             yes          yes
  AVAYA            60.84         64.06 (2)        yes          yes/semi
  SAIL             60.14         61.03 (4)        yes          yes
  UT-DB            59.87            -             yes          yes
  FBK-irst         59.76            -             yes          yes

  Run           Constrained   Unconstrained
  NRC-Canada       68.46            -
  GU-MLT-LT        62.15            -
  KLUE             62.03            -
  AVAYA            60.00         59.47 (1)
  teragram            -          59.10 (2)
  NTNU             57.97         54.55 (6)
  CodeX            56.70            -
  FBK-irst         54.87            -
  AMI&ERIC         53.63         52.62 (7)
  ECNUCS           53.21         54.77 (5)
  UT-DB            52.46            -
[Hamdan, Bellot & Béchet, SemEval 2014]
Subjectivity lexicon : MPQA
- The MPQA (Multi-Perspective Question Answering) Subjectivity Lexicon
http://mpqa.cs.pitt.edu
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
http://wordnetweb.princeton.edu/perl/webwn http://www.cs.rochester.edu/research/cisd/wordnet
http://sentiwordnet.isti.cnr.it
Aspect Based Sentiment Analysis
— Dataset: 3K English sentences from restaurant reviews + 3K English sentences extracted from customer reviews of laptops, tagged by experienced human annotators
— We proposed:

1. Aspect term extraction: CRF model
2. Aspect term polarity detection: Multinomial Naive Bayes classifier with features such as Z-score, POS and prior polarity extracted from the Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon
3. Category detection & category polarity detection: Z-score model

The Z_score of a term t_i in class C_j (t_ij) is computed from its term relative frequency tfr_ij in class C_j, the mean mean_i (the term probability over the whole corpus multiplied by n_j, the number of terms in class C_j), and the standard deviation sd_i of term t_i over the underlying corpus:

  Z_score(t_ij) = (tfr_ij - mean_i) / sd_i                                        Eq. (1)

  Z_score(t_ij) = (tfr_ij - n_j · P(t_i)) / sqrt(n_j · P(t_i) · (1 - P(t_i)))     Eq. (2)

Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (Z > 2) to select the terms whose Z_score exceeds it, then used a logistic regression to combine these scores. We use Z_score as added features for the multinomial Naive Bayes classifier.

3.4 Subtask 4: Category Polarity Detection
We used Multinomial Naive Bayes as in step (2) of subtask 2, with the same features, except that we also add the name of the category as a feature. Thus, for each sentence having n categories we add n examples to the training set; the only difference between them is the category feature.

4 Experiments and Evaluations
We tested our system using the training and testing data provided by the SemEval 2014 ABSA task. Two data sets were provided: the first contains 3K sentences of restaurant reviews annotated with the aspect terms, their polarities, their categories, and the polarities of each category; the second contains 3K sentences of laptop reviews annotated only with the aspect terms and their polarities.

The evaluation process was done in two steps. The first step concerns subtasks 1 and 3. Our system is 24% and 21% above the baseline for aspect term extraction in restaurant and laptop reviews respectively, and 3% above for category detection in restaurant reviews.

Table 1. Results of subtasks 1 and 3 for restaurant reviews, subtask 1 for laptop reviews
  Data  Subtask              P      R      F
  Res   1       Baseline   0.52   0.42   0.47
                System     0.81   0.63   0.71
        3       Baseline   0.73   0.59   0.65
                System     0.77   0.60   0.68
  Lap   1       Baseline   0.44   0.29   0.35
                System     0.76   0.45   0.56

The second step involves the evaluation of subtasks 2 and 4: we were provided with (1) restaurant review sentences annotated with their aspect terms and categories, for which we had to determine the polarity of each aspect term and category; and (2) laptop review sentences annotated with aspect terms, for which we had to determine the aspect term polarity. Table 2 shows the results of our system and the baseline (A: accuracy, R: number of true retrieved examples, All: number of all true examples).

Table 2. Results of subtasks 2 and 4 for restaurant reviews, subtask 2 for laptop reviews
  Data  Subtask              R     All     A
  Res   2       Baseline   673    1134   0.64
                System     818    1134   0.72
        4       Baseline   673    1025   0.65
                System     739    1025   0.72
  Lap   2       Baseline   336     654   0.51
                System     424     654   0.64

Our system is 8% and 13% above the baseline for aspect term polarity detection in restaurant and laptop reviews respectively, and 7% above for category polarity detection in restaurant reviews.
IR and Digital Libraries
Social Book Search
INEX topics
INEX 2014 Social Book Search Track
— In 2014, the Social Book Search Track consists of two tasks:
• Suggestion task: a system-oriented batch retrieval/recommendation task
• Interactive task: a user-oriented interactive task where we want to gather user data on
searching for different search tasks and different search interfaces.
— 2.8 million book descriptions with metadata from Amazon and LibraryThing
— 14 million reviews (1.5 million books have no review)
— Amazon: formal metadata like booktitle, author, publisher, publication year, library classification
codes, Amazon categories and similar product information, as well as user-generated content in the
form of user ratings and reviews
— LibraryThing: user tags and user-provided metadata on awards, book characters and locations, and blurbs
https://inex.mmci.uni-saarland.de/tracks/books/
Table 1. Some facts about the Amazon collection.
  Number of pages (i.e. books)                      2,781,400
  Number of reviews                                15,785,133
  Number of pages that contain at least one review  1,915,336
3 Retrieval model
3.1 Sequential Dependence Model
Like the previous year, we used a language modeling approach to retrieval [4].
We use Metzler and Croft’s Markov Random Field (MRF) model [5] to integrate
multiword phrases in the query. Specifically, we use the Sequential Dependance
Run nDCG@10 P@10 MRR MAP
p4-inex2011SB.xml social.fb.10.50 0.3101 0.2071 0.4811 0.2283
p54-run4.all-topic-fields.reviews-split.combSUM 0.2991 0.1991 0.4731 0.1945
p4-inex2011SB.xml social 0.2913 0.1910 0.4661 0.2115
p4-inex2011SB.xml full.fb.10.50 0.2853 0.1858 0.4453 0.2051
p54-run2.all-topic-fields.all-doc-fields 0.2843 0.1910 0.4567 0.2035
p62.recommendation 0.2710 0.1900 0.4250 0.1770
p54-run3.title.reviews-split.combSUM 0.2643 0.1858 0.4195 0.1661
p62.sdm-reviews-combine 0.2618 0.1749 0.4361 0.1755
p62.baseline-sdm 0.2536 0.1697 0.3962 0.1815
p62.baseline-tags-browsenode 0.2534 0.1687 0.3877 0.1884
p4-inex2011SB.xml full 0.2523 0.1649 0.4062 0.1825
wiki-web-nyt-gw 0.2502 0.1673 0.4001 0.1857
p4-inex2011SB.xml amazon 0.2411 0.1536 0.3939 0.1722
p62.sdm-wiki 0.1953 0.1332 0.3017 0.1404
p62.sdm-wiki-anchors 0.1724 0.1199 0.2720 0.1253
p4-inex2011SB.xml lt 0.1592 0.1052 0.2695 0.1199
p18.UPF QE group BTT02 0.1531 0.0995 0.2478 0.1223
p18.UPF QE genregroup BTT02 0.1327 0.0934 0.2283 0.1001
p18.UPF QEGr BTT02 RM 0.1291 0.0872 0.2183 0.0973
p18.UPF base BTT02 0.1281 0.0863 0.2135 0.1018
p18.UPF QE genre BTT02 0.1214 0.0844 0.2089 0.0910
p18.UPF base BT02 0.1202 0.0796 0.2039 0.1048
p54-run1.title.all-doc-fields 0.1129 0.0801 0.1982 0.0868
Table 2. O cial results of the Best Books for Social Search task of the INEX 2011
Book track, using judgements derived from the LibraryThing discussion groups. Our
runs are identified by the p62 prefix and are in boldface.
Run nDCG@10 P@10 MRR MAP
p62.baseline-sdm 0.6092 0.5875 0.7794 0.3896
p4-inex2011SB.xml amazon 0.6055 0.5792 0.7940 0.3500
p62.baseline-tags-browsenode 0.6012 0.5708 0.7779 0.3996
p4-inex2011SB.xml full 0.6011 0.5708 0.7798 0.3818
p4-inex2011SB.xml full.fb.10.50 0.5929 0.5500 0.8075 0.3898
p62.sdm-reviews-combine 0.5654 0.5208 0.7584 0.2781
p4-inex2011SB.xml social 0.5464 0.5167 0.7031 0.3486
p4-inex2011SB.xml social.fb.10.50 0.5425 0.5042 0.7210 0.3261
p54-run2.all-topic-fields.all-doc-fields 0.5415 0.4625 0.8535 0.3223
Table 3. Top runs of the Best Books for Social Search task of the INEX 2011 Book
track, using judgements obtained by crowdsourcing (Amazon Mechanical Turk). Our
runs are identified by the p62 prefix and are in boldface.
Model (SDM), which is a special case of the MRF. In this model three features
are considered: single term features (standard unigram language model features,
fT ), exact phrase features (words appearing in sequence, fO) and unordered
window features (require words to be close together, but not necessarily in an
exact sequence order, fU ).
Documents are thus ranked according to the following scoring function:
scoreSDM (Q, D) = T
X
q2Q
fT (q, D)
+ O
|Q| 1
X
i=1
fO(qi, qi+1, D)
+ U
|Q| 1
X
i=1
fU (qi, qi+1, D)
where the features weights are set according to the author’s recommendation
( T = 0.85, O = 0.1, U = 0.05). fT , fO and fU are the log maximum likelihood
estimates of query terms in document D, computed over the target collection
with a Dirichlet smoothing.
3.2 External resources combination
As previously done last year, we exploited external resources in a Pseudo-Relevance
Feedback (PRF) fashion to expand the query with informative terms. Given a resource R,
we form a subset R_Q of informative documents for the initial query Q using
pseudo-relevance feedback. To this end we first rank the documents of R using the SDM
ranking function. An entropy measure H_{R_Q}(t) is then computed for each term t over
R_Q in order to weight the terms according to their relative informativeness:

H_{R_Q}(t) = - \sum_{w \in t} p(w|R_Q) \cdot \log p(w|R_Q)
These external weighted terms are finally used to expand the original query.
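A minimal sketch of this expansion step, assuming single-word terms and a relative-frequency estimate of p(w|R_Q); the function name and the top_k cut-off are illustrative and not part of the original system.

```python
import math
from collections import Counter

def entropy_expansion_terms(feedback_docs, top_k=20):
    """Weight candidate expansion terms over a feedback set R_Q.

    feedback_docs: list of token lists, i.e. the top documents retrieved
    from an external resource with the SDM ranking function.
    Returns the top_k (term, weight) pairs, to be appended to the query.
    """
    counts = Counter(t for doc in feedback_docs for t in doc)
    total = sum(counts.values())
    weights = {}
    for term, tf in counts.items():
        p = tf / total                     # p(w | R_Q), relative frequency estimate
        weights[term] = -p * math.log(p)   # entropy-style informativeness weight
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```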
Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004)
We use our SDM baseline defined in section 3.1 and incorporate the above
recommendation estimate:

score_{recomm}(Q, D) = \lambda_D \, score_{SDM}(Q, D) + (1 - \lambda_D) \, t_D

where the \lambda_D parameter was set based on observations over the test topics
made available to participants for training purposes. Indeed, we observed on
these topics that t_D had no influence on the ranking of documents after the
hundredth result (average estimation). Hence we fix the smoothing parameter to:

\lambda_D = \frac{\arg\max_D score_{SDM}(Q, D) - score_{SDM}(Q, D)_{100}}{N_{Results}}

In practice, this approach is a re-ranking of the results of the SDM retrieval
model based on the popularity and the likability of the different books.
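Read literally, the combination above amounts to the following re-ranking sketch; it is an illustration under the stated formulas, with illustrative variable names, not the authors' code.

```python
def rerank_with_likability(sdm_results, t_scores):
    """Re-rank SDM results with the likability/popularity estimate t_D.

    sdm_results: list of (doc_id, sdm_score) pairs, sorted by decreasing score.
    t_scores: dict mapping doc_id to its t_D estimate (e.g. from Welch's t-test).
    """
    n_results = len(sdm_results)
    top_score = sdm_results[0][1]
    score_100 = sdm_results[min(99, n_results - 1)][1]   # score of the 100th result
    lambda_d = (top_score - score_100) / n_results        # smoothing parameter

    reranked = [(doc, lambda_d * s + (1.0 - lambda_d) * t_scores.get(doc, 0.0))
                for doc, s in sdm_results]
    return sorted(reranked, key=lambda kv: kv[1], reverse=True)
```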
4 Runs
together. Child node pages (or sub-articles) are weighted half that of their
parents in order to minimize a potential topic drift. We avoid loops in the graph
(i.e. a child node cannot be linked to one of its elders) because that brings
no additional information and could change the weights between linked articles.
Informative words are then extracted from the sub-articles and incorporated into
our retrieval model like any other external resource.
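A small sketch of the weighting scheme just described, assuming the resource is available as a parent-to-children mapping (an assumed data structure, not the original implementation).

```python
def weight_subarticles(root, children_of):
    """Assign weights to sub-articles: each child gets half its parent's weight.

    root: id of the article matching the query (weight 1.0).
    children_of: dict mapping an article id to the ids of its sub-articles.
    Nodes already weighted are skipped, so loops back to an elder are ignored.
    """
    weights = {root: 1.0}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children_of.get(node, []):
            if child in weights:          # avoid loops in the graph
                continue
            weights[child] = weights[node] / 2.0
            frontier.append(child)
    return weights
```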
3.4 Social opinion for book search
The test collection used this year for the Book Track contains Amazon pages
of books. These pages are composed, among other things, of editorial information
(such as the number of pages or the blurb), user ratings and user reviews. However,
contrary to previous years, the actual content of the books is not available.
Hence, the task is to rank books according to the sparse informative content and
the opinion of readers expressed in the reviews, considering that the user ratings
are integers between 1 and 5.
Here, we wanted to model two social popularity assumptions: a product that
has a lot of reviews must be relevant (or at least popular), and a highly rated
product must be relevant. Then, a product having a large number of good reviews
really must be relevant. However, in the collection there is often only a small
number of ratings for a given book. The challenge was to determine whether each
user rating is significant or not. To do so, we first define X^D_R, a random set
of "bad" ratings (1, 2 or 3 out of 5 points) for book D. Then, we evaluate the
statistically significant differences between X^D_R and X^D_R ∪ X^D_U using
Welch's t-test, where X^D_U is the actual set of user ratings for book D. The
statistical test is computed by:
t_D = \frac{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}}{s_{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}}}

where

s_{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}} = \sqrt{\frac{s^2_{RU}}{n_{RU}} + \frac{s^2_U}{n_U}}

where s^2 is the unbiased estimator of the variance of the two sets and n_X is the
number of ratings for set X.
The underlying assumption is that significant differences occur in two
different situations. First, when there is a small number of user ratings (X^D_U)
but they all are very good; for example, this is the case of good but little-known
books. Second, when there is a very large number of user ratings but they are
average. Hence this statistical test gives us a single estimate of both likability
and popularity.
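For illustration, the t_D estimate can be computed with SciPy's Welch's t-test; in the sketch below the "bad" rating set X_R is drawn at random, and its size n_random is an assumption not specified in the slides.

```python
import random
from scipy.stats import ttest_ind

def likability_estimate(user_ratings, n_random=30, seed=0):
    """Welch's t-test between random 'bad' ratings and bad + observed ratings.

    user_ratings: observed integer ratings (1..5) for one book.
    Returns the t statistic t_D used as a likability/popularity estimate.
    """
    rng = random.Random(seed)
    x_r = [rng.choice([1, 2, 3]) for _ in range(n_random)]   # random 'bad' ratings X_R
    x_ru = x_r + list(user_ratings)                          # X_R union X_U
    # equal_var=False selects Welch's t-test (unequal variances)
    t_stat, _p_value = ttest_ind(x_ru, list(user_ratings), equal_var=False)
    return t_stat
```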
Statistical test between the observed ratings and random ratings
Is a given rating significant?
ANR CAAS project
P. Bellot (AMU-CNRS, LSIS-OpenEdition)
Query Expansion with Concepts from DBPedia
40
P. Bellot (AMU-CNRS, LSIS-OpenEdition)
Terms only vs. Extended Features
— We modeled book likability on the following idea: the more reviews a book has,
the more interesting it is (it may not be a good or popular book, but it is a book with a high
impact)
— The InL2 information retrieval model alone (a DFR model, Divergence From Randomness) seems
to perform better than SDM (language modeling) with extended features
41
Benkoussas, Hamdan, Albitar, Ollagnier & Bellot, 2014

OpenEdition Lab projects in Text Mining

  • 1. OPENEDITION LAB TEXT MINING PROJECTS Patrice  Bellot
 Aix-­‐Marseille  Université  -­‐  CNRS  (LSIS  UMR  7296  ;  OpenEdition)   ! patrice.bellot@univ-­‐amu.fr LSIS  -­‐  DIMAG  team  http://www.lsis.org/spip.php?id_rubrique=291   OpenEdition  Lab  :  http://lab.hypotheses.org
  • 2. Hypotheses! 600+ blogs Revues.org! 300+ journals Calenda! 20 000+ events OpenEdition Books! 1000+ books A  European  Web   platform  for  Human   and  Social  Sciences A  digital  infrastructure   for  open  access A  lab  for  experimenting   new  Text  Mining  
 and  new  IR  systems
  • 3. Open Edition - a Facility of Excellence 3 2012-2020 7 millions € Objectives:! 15 000 + books! 2000 + blogs! Freemium! Multilingual
  • 4. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) OpenEdition Lab — Our Team Directors : Patrice Bellot (Professor in Comp. Sc. / NLP / IR) - Marin Dacos (Head of OpenEdition) Engineers : Elodie Faath - Arnaud Cordier PhD Students : Hussam Hamdan, Chahinez Benkoussas, Anaïs Ollagnier Post-docs : Young-Min Kim (2012-13), Shereen Albitar (2014) 4 http://lab.hypotheses.org
  • 5. • 220  learned  societes  and  centers  (France)   • 30  university  presses  (France,  UK,  Belgium,  Switzerland,  Canada,  Mexico,  Hungary/USA)   • CCSD  –  France  -­‐  Lyon  (HAL  /  DataCenter),   • CHNM  –  USA  –  Washington,   • OAPEN  –  NL  -­‐  The  Hague,   • UNED  –  Spain  -­‐  Universidad  Nacional  de  Educación  a  Distancia,   • Fundação  Calouste  Gulbenkian  –  Portugal,   • Max  Weber  Stinftung  –  Germany,   • Google  –  USA  (Google  Grants  for  DH),   • DARIAH  –  Europe. Our partners And  you?
  • 6. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) OpenEdition Lab (Text Mining Projects for DL) Aims to : — Link papers / books / blogs automatically (reference analysis, Named Entities…) — Detect hot topics, hot books, hot papers : content oriented analysis (not only by using logs)
 - sentiment analysis
 - review of books analysis (and finding) — Book searching with complex and long queries — Reading recommandation 6
  • 7. Project 1: BILBO EN SVM CRF Natural  Language  Processing  /  Text  Mining  /  Information  Retrieval  /  Machine  Learning
  • 8. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 8 Projet N°2 : ECHO! Détec'on!automa'que!de!compte0rendus!de!lecture! LREC,&2014! Mise!en!rela'on!! (BILBO)! Recherche!Web! Analyse!de! sen'ments! Mesure&de&l’écho! NAACL-SEMEVAL,&2013! logs,!métriques…!
  • 9. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 9 Projet N°3 : COOKER! BILBO! ECHO! Graphe!des!contenus! (puis!hypergraphe)! Recommanda)on! COOKER! Classifica?on! automa?que! et!métaCdonnées! (thèmes,!langues,! auteurs…)!
  • 10. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Semantic Annotation of Bib. References 10
  • 11. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 11 A  –  references  in  a  specific  section
  • 12. B  :  references  in  notes C  :  references  in  the  body
  • 13. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) BILBO: A software for Annotating Bibliographical Reference in Digital Humanities Google Digital Humanities Research Awards (2011, 2012) State of the art : 
 — CiteSeer system (Giles et al., 1998) in computer science, 80% of precision for author and 40% for pages. Conditional Random Fields (CRFs) (Peng et al., 2006, Lafferty et al., 2001) for scientific articles, 95% of average precision (99% for author, 95% for title, 85% for editor). — Run on the cover page (title) and/or on the Reference section at the end of papers : not in the footnotes, not in the text body — Not very robust (in the real world : no stylesheets - poorly respected) ! 13
  • 14. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 14 References – three levels Architecture Source XHTML (TEI guidelines) LEARNING AUTOMATIC ANNOTATION TXT Estimated XML files • Revues.org online journals - 340 journals - Various reference formats - 20 different languages (90% in French) • Unstructured and scattered reference data • Prototype development, Web service ! source code will be distributed (GPL) • Google Digital Humanities Research Awards (’10, ’11) Part of Equipex future investment award: DILOH (’12) Web Service Plain text input Future platform Level 1 Level 2 Learning data Tokenizer, Extractor New data Tokenizer, Extractor Manual annotation External machine learning modules Level 1 model Level 2 model Level 3 model Machine learning modules Mallet, SVMlight Conditional Random Fields Automatic annotator Call a model Level 1 Level 2 Bibliography Notes Implicit References Level 3 Comparison with other online tools New Data : Reference data of library of Kim  &  Bellot,  2012
  • 15. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Conditional Random Fields for IE — A discriminative model that is specified over a graph that encodes the conditional dependencies (relationships between observations) — Can be employed for sequential labeling (linear chain CRF) — Take context into account — The probability of a label sequence y given an observation
 sequence x is : 
 
 
 
 
 with F the (rich) feature functions (transition and state functions)
 
 
 
 Parameters must be estimated using an iterative technique such as iterative scaling or gradient- based methods 15 Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
 Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001) Yi 1 Yi Yi+1 ? s s - ? s s - ? s s Xi 1 Xi Xi+1 Yi 1 Yi Yi+1 c 6 s - c 6 s - c 6 s Xi 1 Xi Xi+1 Yi 1 Yi Yi+1 c s c s c s Xi 1 Xi Xi+1 Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model. sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; in- deed, CRFs share all of the convexity properties of general maximum entropy models. For the remainder of the paper we assume that the depen- dencies of Y, conditioned on X, form a chain. To sim- plify some expressions, we add special start and stop states Y0 = start and Yn+1 = stop. Thus, we will be using the graphical structure shown in Figure 2. For a chain struc- ture, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference al- gorithms in Section 4. Suppose that p✓(Y | X) is a CRF given by (1). For each position i in the observation se- quence x, we define the |Y| ⇥ |Y| matrix random variable Mi(x) = [Mi(y0 , y | x)] by Mi(y0 , y | x) = exp (⇤i(y0 , y | x)) ⇤i(y0 , y | x) = P k k fk(ei, Y|ei = (y0 , y), x) + P k µk gk(vi, Y|vi = y, x) , of the training data. Both algorithms are based on the im- proved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs. Iterative scaling algorithms update the weights as k k + k and µk µk + µk for appropriately chosen k and µk. In particular, the IIS update k for an edge feature fk is the solution of eE[fk] def = X x,y ep(x, y) n+1X i=1 fk(ei, y|ei , x) = X x,y ep(x) p(y | x) n+1X i=1 fk(ei, y|ei , x) e kT (x,y) . where T(x, y) is the total feature count T(x, y) def = X i,k fk(ei, y|ei , x) + X i,k gk(vi, y|vi , x) . The equations for vertex feature updates µk have similar tj(yi−1, yi, x, i) = b(x, i) if yi−1 = IN and yi = NNP 0 otherwise. In the remainder of this report, notation is simplified by writing s(yi, x, i) = s(yi−1, yi, x, i) and Fj(y, x) = n i=1 fj(yi−1, yi, x, i), where each fj(yi−1, yi, x, i) is either a state function s(yi−1, yi, x, i) or a transi- tion function t(yi−1, yi, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as p(y|x, λ) = 1 Z(x) exp ( j λjFj(y, x)). (3) Z(x) is a normalization factor. 4 exp ( j λjtj(yi−1, yi, x, i) + k µksk(yi, x, i)), (2) where tj(yi−1, yi, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i−1 in the label sequence; sk(yi, x, i) is a state feature function of the label at position i and the observation sequence; and λj and µk are parameters to be estimated from training data. When defining feature functions, we construct a set of real-valued features b(x, i) of the observation to expresses some characteristic of the empirical dis- tribution of the training data that should also hold of the model distribution. An example of such a feature is b(x, i) = 1 if the observation at position i is the word “September” 0 otherwise. 
Each feature function takes on the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular val- ues. All feature functions are therefore real-valued. For example, consider the following transition function: tj(yi−1, yi, x, i) = b(x, i) if yi−1 = IN and yi = NNP 0 otherwise. In the remainder of this report, notation is simplified by writing s(yi, x, i) = s(yi−1, yi, x, i) and Fj(y, x) = n i=1 fj(yi−1, yi, x, i), where each fj(yi−1, yi, x, i) is either a state function s(yi−1, yi, x, i) or a transi- tion function t(yi−1, yi, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as p(y|x, λ) = 1 Z(x) exp ( j λjFj(y, x)). (3) Z(x) is a normalization factor. 4 1.2 Graphical Models 7 Logistic Regression HMMs Linear-chain CRFs Naive Bayes SEQUENCE SEQUENCE CONDITIONAL CONDITIONAL Generative directed models General CRFs CONDITIONAL General GRAPHS General GRAPHS Figure 1.2 Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs. Furthermore, even when naive Bayes has good classification accuracy, its prob- ability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x1, x1, x2, x2, . . . , xK, xK). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data. Assumptions like naive Bayes can be especially problematic when we generalize to sequence models, because inference essentially combines evidence from di↵erent parts of the model. If probability estimates at a local level are overconfident, it might be di cult to combine them sensibly. Actually, the di↵erence in performance between naive Bayes and logistic regression is due only to the fact that the first is generative and the second discriminative; the two classifiers are, for discrete input, identical in all other respects. Naive Bayes and logistic regression consider the same hypothesis space, in the sense that any logistic regression classifier can be converted into a naive Bayes classifier with the same decision boundary, and vice versa. Another way of saying this is that the naive Bayes model (1.5) defines the same family of distributions as the logistic regression model (1.7), if we interpret it generatively as p(y, x) = exp { P k kfk(y, x)} P ˜y,˜x exp { P k kfk(˜y, ˜x)} . (1.9) This means that if the naive Bayes model (1.5) is trained to maximize the con- ditional likelihood, we recover the same classifier as from logistic regression. Con- versely, if the logistic regression model is interpreted generatively, as in (1.9), and is 1.3 Linear-Chain Conditional Random Fields 9 . . . . . . y x Figure 1.3 Graphical model of an HMM-like linear-chain CRF. . . . . . . y x Figure 1.4 Graphical model of a linear-chain CRF in which the transition score depends on the current observation. 1.3 Linear-Chain Conditional Random Fields
  • 16. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 16 Table V: Verified input, local and global features. The selected ones in BILBO are written in black, and the non-selected ones are in gray. Input features Feature category Description Raw input token (I1) Tokenized word itself in the input string and the lowercased word Preceding or following tokens (I2) Three preceding and three following tokens of current token N-gram (I3) Attachment of preceding or following N-gram tokens Prefix/suffix in character level (I4) 8 different prefix/suffix as in [Councill et al. 2008] Local features Feature cate- gory Feature name Description Example Number ALLNUMBERS All characters are numbers 1984 (F1) NUMBERS One or more characters are numbers in-4 DASH One or more dashes are included in numbers 665-680 (F1digit) 1DIGIT, 2DIGIT ... If number, number of digits in it 5, 78, ... Capitalization ALLCAPS All characters are capital letters RAYMOND (F2) FIRSTCAP First character is capital letter Paris ALLSAMLL All characters are lower cased pouvoirs NONIMPCAP Capital letters are mixed dell’Ateneo Regular form INITIAL Initialized expression Ch.-R. (F3) WEBLINK Regular expression for web pages apcss.org Emphasis (F4) ITALIC Italic characters Regional Location BIBL START Position is in the first one-third of reference - (F5) BIBL IN Position is between the one-third and two-third - BIBL END Position is between the two-third and the end - Lexicon POSSEDITOR Possible for the abbreviation of editor ed. (F6) POSSPAGE Possible for the abbreviation of page pp. POSSMONTH Possible for month September POSSVOLUME Possible for the abbreviation of volume vol. External list SURNAMELIST Found in an external surname list RAYMOND (F7) FORENAMELIST Found in an external forename list Simone PLACELIST Found in an external place list New York JOURNALLIST Found in an external journal list African Affaire Punctuation (F8) COMMA, POINT, LINK, PUNC, LEAD- INGQUOTES, END- INGQUOTES, PAIREDBRACES Punctuation mark itself (comma, point) or punc- tuation type. These features are defined espe- cially for the case of non-separated punctuation. 46-55, 1993. S.; [en “The design”. (1) Global features Feature category Feature name Description Local feature existence [local feature name] Corresponding local feature is found in the input string (G1) (F3, F4, and F6 features are finally selected) Feature distribution (G2) NOPUNC, 1PUNC, 2PUNC, MORE- PUNC There are no, 1, 2, or more PUNC features in the in- put string NONUMBER There is no number in the input string STARTINITIAL The input string starts with an initial expression ENDQUOTECOMMA An ending quote is followed by a comma FIRSTCAPCOMMA A token having FIRSTCAP feature is followed by a comma Kim  &  Bellot,  2013
  • 17. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 17 Fig. 3: Basic tokenization effect. Each point is the averaged value of 10 different cross- validated experiments. ● ● ● ● ● Cumulative local feature effect Training sets Micro−averagedF−measure 50% 60% 70% 80% 90% 757779818385 ● F0(Base) F1(Num.) F2(Cap.) F3(Reg.) F4(Emp.) F5(Loc.) F6(Lex.) F7(Ext.) F8(Pun.) (a) Corpus level 1 ● ● ● ● ● Cumulative local feature effect Training sets Micro−averagedF−measure 50% 60% 70% 80% 90% 868890929496 ● F0(Base) F1(Num.) F2(Cap.) F3(Reg.) F4(Emp.) F5(Loc.) F6(Lex.) F7(Ext.) F8(Pun.) (b) Cora dataset Fig. 4: Cumulative local feature effect from F1 to F8 with C1 and Cora uation. We repeat cross validations by cumulatively adding features of each category from F1 to F8. Too detailed features such as that of category F1-sub are excluded here because by testing the detailed ones at the end, we want to eliminate them if they Kim  &  Bellot,  2013
  • 18. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 18 Reference Parsing in Digital Humanities 39:25 ● ● ● ● ● Cumulative List feature effect Training sets Micro−averagedF−measure 50% 60% 70% 80% 90% 8283848586 ● F6 F7a F7b F7c F7 (a) Corpus level 1 ● ● ● ● ● Cumulative List feature effect Training sets Micro−averagedF−measure 50% 60% 70% 80% 90% 9192939495 ● F6 F7a F7b F7c F7 (b) Cora dataset Fig. 5: Cumulative external list feature effect Detailed analysis of the effect of external lists and lexicon. One of interesting discoveries from the above analysis is lexical features are not always effective for reference pars- ing. Lexicon features defined with strict rules without overlapping have actually no significant impact, whereas external lists such as surname, forename, place, and jour- Kim  &  Bellot,  2013
  • 19. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 19 39:30 Y.-M. Kim and P. Bellot Table VII: Micro averaged precision and recall per field for C1 and Cora with finally chosen strategy (a) C1 - detailed labels Fields #true #annot. #exist. prec.(%) recall(%) surname 1080 1164 1203 92.78 89.78 forename 1128 1220 1244 92.46 90.68 title(m) 3277 4132 3690 79.31 88.81 title(a) 2782 3253 3069 85.52 90.65 title(j) 440 564 681 78.01 64.61 title(u) 511 660 652 77.42 78.37 title(s) 18 24 118 75.00 15.25 publisher 1021 1367 1171 74.69 87.19 date 793 838 855 94.63 92.75 biblscope(pp) 210 223 219 94.17 95.89 biblscope(i) 152 191 189 79.58 80.42 biblscope(v) 75 87 102 86.21 73.53 extent 66 69 70 95.65 94.29 place 433 524 539 82.63 80.33 abbr 417 468 502 89.10 83.07 nolabel 231 306 488 75.49 47.34 edition 46 178 211 25.84 21.80 orgname 74 87 118 85.06 62.71 bookindicator 47 49 65 95.92 72.31 OTHERS 95 177 395 53.67 24.05 Average 12896 15581 15581 82.77 82.77 (b) Cora dataset Fields #true #annot. #exist. prec.(%) recall(%) author 2797 2855 2830 97.97 98.83 title 3508 3613 3560 97.10 98.54 booktitle 1750 1882 1865 92.99 93.83 journal 546 615 617 88.78 88.49 date 636 641 642 99.22 99.07 institution 268 299 306 89.63 87.58 publisher 165 188 203 87.77 81.28 location 247 279 289 88.53 85.47 editor 232 261 295 88.89 78.64 pages 422 429 438 98.37 96.35 volume 306 327 320 93.58 95.63 tech 130 155 178 83.87 73.03 note 75 122 123 61.48 60.98 Average 11082 11666 11666 95.00 95.00 punctuation is attached, but the latter is significantly negative when reference fields are much detailed. Hypothesis 2 is confirmed with these observations. For our system BILBO, input and local features written in black in Table V are fi- nally selected. BILBO provides two different labeling levels, simple model using only Learning  on  FR  data   Testing  on  US  data (715  references) Kim  &  Bellot,  2013
  • 20. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 20 Test  :  http://bilbo.openeditionlab.org   Sources  :  http://github.com/OpenEdition/bilbo
  • 21. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) EQUIPEX OpenEdition: BILBO 21
  • 22. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) IR and Digital Libraries ! Sentiment Analysis 22
  • 23. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Searching for book reviews • Applying and testing classical supervised approaches for filtering reviews = a new kind of genre classification. • Developing a corpus of reviews of books from the OpenEdition.org platforms and from the Web. • Collecting two kinds of reviews:
 — Long reviews of scientific books written by expert reviewers in scientific journals
 — Short reviews such reader comments on social web sites • Linking reviews to their corresponding books using BILBO 23 Review   ≠   Abstract
  • 24. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Searching for book reviews • A supervised classification approach • Feature selection : decision trees, Z-score • Features : localisation of named entities, 24 was performed. We can see that a lot of this fea- tures relate to the classe where they predominate. Table 3: Distribution of the 30 highest normalized Z scores across the corpus. # Feature Z score # Feature Z score 1 abandonne 30.14 16 winter 9.23 2 seront 30.00 17 cleo 8.88 3 biographie 21.84 18 visible 8.75 4 entranent 21.20 19 fondamentale 8.67 5 prise 21.20 20 david 8.54 6 sacre 21.20 21 pratiques 8.52 7 toute 20.70 22 signification 8.47 8 quitte 19.55 23 01 8.38 9 dimension 15.65 24 institutionnels 8.38 10 les 14.43 25 1930 8.16 11 commandement 11.01 26 attaques 8.14 12 lie 10.61 27 courrier 8.08 13 construisent 10.16 28 moyennes 7.99 14 lieux 10.14 29 petite 7.85 15 garde 9.75 30 adapted 7.84 In our training corpus, we have 106 911 words obtained from the Bag-of-Words approach. We se- lected all tokens (features) that appear more than 5 times in each classes. The goal is therefore to design a method capable of selecting terms that clearly belong to one genre of documents. We ob- know, this section contains authors’ names, loca- tions, dates, etc... However, in the Review class this section is quite often absent. Based on this analysis, we tagged all documents of each class using the Named Entity Recognition tool TagEN (Poibeau, 2003). We aim to explore the distribu- tion of 3 named entities (”authors’ names”, ”loca- tions” and ”dates”) in the text after removing all XML-HTML tags. After that, we divided texts into 10 parts (the size of each part = total num- ber of words / 10). The distribution ratio of each named entity in each part is used as feature to build the new document representation and we obtained a set of 30 features. Figure 3: ”Person” named entity distribution 6 Experiments Figure 4: ”Location” named entity distribution Figure 5: ”Date” named entity distribution 6.2 Support Vector Machines (SVM) SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problem (Vapnik, 1995). The SVM method is based on the Structural Risk Mini- mization principle (Vapnik, 1995) from computa- tional learning theory. In their basic form, SVMs learn linear threshold function. Nevertheless, by a simple plug-in of an appropriate kernel func- tion, they can be used to learn linear classifiers, radial basic function (RBF) networks, and three- layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the opti- mal boundaries between the different classes and use them for the purposes of classification (Ag- garwal and Zhai, 2012). Having the vectors form the different representations presented below. we used the Weka toolkit to learning model. This model with the use of the linear kernel and Radial |w| indicates the number of words included in the current document and wj is the number of words that appear in the document. arg max hi P(hi). |w| Y j=1 P(wj|hi) (5) where P(wj|hi) = tfj,hi nhi We estimate the probabilities with the Equation (5) and get the relation between the lexical fre- quency of the word wj in the whole size of the collection Thi (denoted tfj,hi ) and the size of the corresponding corpus. Table 4: Results showing the performances of the classification models using different indexing schemes on the test set. 
The best values for the Review class are noted in bold and those for Review class are are underlined Review Review # Model R P F-M R P F-M 1 NB 65.5% 81.5% 72.6% 81.6% 65.7% 72.8% SVM (Linear) 99.6% 98.3% 98.9% 97.9% 99.5% 98.7% SVM (RBF) 89.8% 97.2% 93.4% 96.8% 88.5% 92.5% * C = 5.0 * = 0.00185 2 NB 90.6% 64.2% 75.1% 37.4% 76.3% 50.2% SVM (Linear) 87.2% 81.3% 84.2% 75.3% 82.7% 78.8% SVM (RBF) 87.2% 86.5% 86.8% 83.1% 84.0% 83.6% * C = 32.0 * = 0.00781 3 NB 80.0% 68.4% 73.7% 54.2% 68.7% 60.6% SVM (Linear) 77.0% 81.9% 79.4% 78.9% 73.5% 76.1% SVM (RBF) 81.2% 48.6% 79.9% 72.6% 75.8% 74.1% * C = 8.0 * = 0.03125 Benkoussas  &  Bellot,  LREC  2014
  • 25. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 25
  • 26. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Sentiment Analysis in Twitter 26 Authorities are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the frontiers. Taiwan-made products stood a good chance of becoming even more competitive thanks to wider access to overseas markets and lower costs for material imports, he said. ”March appears to be a more reasonable estimate while earlier admission cannot be entirely ruled out,” according to Chen, also Taiwan’s chief WTO negotiator. friday evening plans were great, but saturday’s plans didnt go as expected – i went dancing & it was an ok club, but terribly crowded :-( WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while T-Mobile makes that a normal everyday thin obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor My graduation speech: ”I’d like to thanks Google, Wikipedia and my computer! :D #iThingteens Table 5: List of example sentences with annotations that were provided to the annotators. All subjective phrases are italicized. Positive phrases are in green, negative phrases are in red, and neutral phrases are in blue. Worker 1 I would love to watch Vampire Diaries :) and some Heroes! Great combination 9/13 Worker 2 I would love to watch Vampire Diaries :) and some Heroes! Great combination 11/13 Worker 3 I would love to watch Vampire Diaries :) and some Heroes! Great combination 10/13 Worker 4 I would love to watch Vampire Diaries :) and some Heroes! Great combination 13/13 Worker 5 I would love to watch Vampire Diaries :) and some Heroes! Great combination 11/13 Intersection I would love to watch Vampire Diaries :) and some Heroes! Great combination Table 6: Example of a sentence annotated for subjectivity on Mechanical Turk. Words and phrases that were marked as subjective are italicized and highlighted in bold. The first five rows are annotations provided by Turkers, and the final row shows their intersection. The final column shows the accuracy for each annotation compared to the intersection. Note that ignoring Fneutral does not reduce the task to predicting positive vs. negative labels only (even though some participants have chosen to do so) since the gold standard still contains neutral For both subtasks, there were teams that only sub- mitted results for the Twitter test set. Some teams submitted both a constrained and an unconstrained version (e.g., AVAYA and teragram). As one would ation methodology. We then summarize the charac- teristics of the approaches taken by the participating systems and we discuss their scores. 2 Task Description We had two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both subtasks. Be- low we provide short descriptions of the objectives of these two subtasks. Subtask A: Contextual Polarity Disambiguation Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The boundaries for the marked in- stance were provided: this was a classification task, not an entity recognition task. 
2 http://www.daedalus.es/TASS/corpus.php this lexicon was used to automatically label addi- tional Tweet/SMS messages and then used with the original data to train the classifier, then such a sys- tem would be unconstrained. 3 Dataset Creation In the following sections we describe the collection and annotation of the Twitter and SMS datasets. 3.1 Data Collection Twitter is the most common micro-blogging site on the Web, and we used it to gather tweets that express sentiment about popular topics. We first extracted named entities using a Twitter-tuned NER system (Ritter et al., 2011) from millions of tweets, which we collected over a one-year period spanning from January 2012 to January 2013; we used the public streaming Twitter API to download tweets. 313 e RT “Until tonight I never realised how fucked up I was” - So, wat interview did you go to? How did it go? om each corpus that contain subjective phrases. al- pro- mpis pan- al., rom d on MS ions task to a yed oal, con- MS res- olar- We lua- Subtask B: Message Polarity Classification Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both a positive and a negative sentiment, whichever is the stronger one was to be chosen. Each participating team was allowed to submit re- sults for two different systems per subtask: one con- strained, and one unconstrained. A constrained sys- tem could only use the provided data for training, but it could also use other resources such as lexi- cons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi- supervised, or unsupervised fashion. Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be con- strained. However, if other data (excluding the test vey an opinion. Given a sentence, identify whether it is objective, bjective word or phrase in the context of the sentence and mark elow. The number above each word indicates its position. The box so that you can confirm that you chose the correct range. ing one of the radio buttons: positive, negative, or neutral. If a indicating that ”There are no subjective words/phrases”. Please nning if this is your first time answering this hit. kers on Mechanical Turk followed by a screenshot. Total Phrase Count Vocabulary ers Positive Negative Neutral Size 0.0 5,895 3,131 471 20,012 0.0 648 430 57 4,426 1.2 2,734 1,541 160 11,736 5.6 1,071 1,104 159 3,562 tatistics for Subtask A. med tion this ered oned ffer- ds. to- us- We ent- one that Corpus Positive Negative Objective / Neutral Twitter - Training 3,662 1,466 4,600 Twitter - Dev 575 340 739 Twitter - Test 1,573 601 1,640 SMS - Test 492 394 1,208 Table 3: Statistics for Subtask B. We annotated the same Twitter messages with an- notations for subtask A and subtask B. However,
  • 27. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Aspect Based Sentiment Analysis — Subtask 1: Aspect term extraction Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity. For example, "I liked the service and the staff, but not the food”, “The food was nothing much, but I loved the staff”. Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g., in “The hard disk is very noisy” the only aspect term is “hard disk”).
 
 — Subtask 2: Aspect term polarity For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative). For example: “I loved their fajitas” → {fajitas: positive}
 “I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}
 “The fajitas are their first plate” → {fajitas: neutral}
 “The fajitas were great to taste, but not to see” → {fajitas: conflict} 27 http://alt.qcri.org/semeval2014/task4/
  • 28. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) Aspect Based Sentiment Analysis — Subtask 3: Aspect category detection Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence. For example, given the set of aspect categories {food, service, price, ambience, anecdotes/ miscellaneous}: “The restaurant was too expensive” → {price}
 “The restaurant was expensive, but the menu was great” → {price, food}
 
 — Subtask 4: Aspect category polarity Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity (positive, negative, neutral or conflict) of each aspect category. For example: “The restaurant was too expensive” → {price: negative}
 “The restaurant was expensive, but the menu was great” → {price: negative, food: positive} 28 http://alt.qcri.org/semeval2014/task4/
  • 29. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 29 Hussam Hamdan1 Frédéric Léchet1 Patrice Lellot Hussam:hamdan_lsis:org1 Frederic:bechet_lif:univ-mrs:fr1 Patrice:bellot_lsis:org Vix-Marseille University1 Marseille France Twitter is a real-time1 highly social microblogging service that allows us to post short messages1The Sentiment Vnalysis of Twitter is useful for many domains )Marketing1Finance1 Social1 etc:::E1 Many approaches were proposed for this task1 we have applied several machine learning approaches in order to classify the Tweets using the dataset of SemEval DNj!: Many resources were used for feature extractionG WordNet )similar adjectives and verb groupsE1 RLpedia )the hidden conceptsE1 SentiWordNet )the polarity and subjectivityE1 and other Twitter specific features such as number of y1w1_1 etc: Highlights Results Naive Bayes Model Average F-measure of negative and positive classes has been improved by 45 wrt uni-gram model SVM Model Average F-measure of negative and positive classes has been improved by 1(55 wrt uni-gram model System Architecture Preprocessing Feature Extraction Classification Model Training Set: 6#56 Tweets Development SetG j48# Tweets Gas by my house hit §!:!5yyyy1 Ikm going to *hapel Hill on Sat: GE DBpedia WordNet Senti-Features xSentiWordNetC Twitter Specific Pos Neg Objective Conclusion Classification Twitter Dictionary Gas by my house hit §!:!5yyyy1 Ikm going to *hapel Hill on Sat: very happy Settlement connected1 blessed move1 displace1 sit sit_down w_1 wy1 ww polarity1 subjectivity wpos wneg Preprocessing Feature Extraction *lassification model - Using the similar adjectives from WordNet has a significant effect with Naive Layes but a little effect with SVM: - Using the hidden concepts is not so significant in this data set1 more significant for the objective class with SVM - Using Senti-features and Twitter specific features and verb groups were useful with SVM Experiments with DBpediaD WordNet and SentiWordNet as resources for Sentiment analysis in micro-blogging )Linear kernelE 2 PG Precision1 RG Recall1 FG F-measure 2 SemEval  2013
 • 30. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Sentiment Analysis on Twitter : Using Z-Score • Z-Score helps to discriminate words for Document Classification, Authorship Attribution (J. Savoy, ACM TOIS 2013) 30
The Z_score of a term ti in a class Cj (noted tij) is computed from its term relative frequency tfrij in the class Cj, from the mean meani (the term probability over the whole corpus multiplied by nj, the number of terms in the class Cj) and from the standard deviation sdi of ti over the underlying corpus:
Z_score(tij) = (tfrij - meani) / sdi    Eq. (1)
Z_score(tij) = (tfrij - nj * P(ti)) / sqrt(nj * P(ti) * (1 - P(ti)))    Eq. (2)
A term whose frequency is salient in a class in comparison to the others has a salient Z_score. Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (>2) to select the terms whose Z_score exceeds it, then combined these scores with a logistic regression. Here, Z_scores are used as additional features for classification because tweets are short, so many tweets contain no word with a salient Z_score.
Figures 1, 2 and 3 (Z_score distribution in the positive, neutral and negative classes) show that most terms have a Z_score between -1.5 and 2.5 in each class; the rest are either very frequent in the class (>2.5) or very rare (<-1.5), a negative value meaning that the term is less frequent in this class than in the other classes. Several threshold values were tested; the best results were obtained with a threshold of 3.
Table 1. The ten terms with the highest Z_score in each class:
positive: Love 14.31, Good 14.01, Happy 12.30, Great 11.10, Excite 10.35, Best 9.24, Thank 9.21, Hope 8.24, Cant 8.10, Wait 8.05
negative: Not 13.99, Fuck 12.97, Don't 10.97, Shit 8.99, Bad 8.40, Hate 8.29, Sad 8.28, Sorry 8.11, Cancel 7.53, stupid 6.83
neutral: Httpbit 6.44, Httpfb 4.56, Httpbnd 3.78, Intern 3.58, Nov 3.45, Httpdlvr 3.40, Open 3.30, Live 3.28, Cloud 3.28, begin 3.17
Other features: pre-polarity (POL) features count the positive, negative and neutral words of each tweet according to Bing Liu's Opinion Lexicon, created by (Hu and Liu 2004) and augmented in later work (positive and negative annotations only), and to the Subjectivity Lexicon (which also contains neutral annotations). Part-of-speech (POS) features annotate each word with its POS tag and count the adjectives, verbs, nouns, adverbs and connectors in each tweet.
4 Evaluation. 4.1 Data collection: the data sets of SemEval 2013 and 2014, subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014; Wilson, Kozareva et al. 2013). Among the 9,646 training tweets annotated as positive, negative or neutral, only 8,498 could be downloaded because of protected profiles and deleted tweets. The development set (1,654 tweets) was used to evaluate the methods, then combined with the training set to build the model applied to the 2013 and 2014 test sets.
4.2 Experiments. Official results: the submitted system obtained 46.38% and 52.02% on the 2013 and 2014 test sets respectively; these figures are not correct because of a software bug discovered after the submission deadline (an index-shifting error meant the test sets were represented by all the features except the terms), so the corrected figures are reported as non-official results. Non-official results (Multinomial Naive Bayes): starting from the term (unigram) features, adding Z_score features improves the f-measure by 6.5% and 10.9%, adding pre-polarity features improves it by 4% and 6%, while adding POS tags decreases it. All feature combinations were also tested (Table 2): POS tags are not useful in any experiment, the best result is obtained by combining Z_score and pre-polarity features, and Z_score features improve the f-measure more than pre-polarity features do.
Table 2. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets:
Terms 49.42 / 46.31 ; Terms+Z 55.90 / 57.28 ; Terms+POS 43.45 / 41.14 ; Terms+POL 53.53 / 52.73 ; Terms+Z+POS 52.59 / 54.43 ; Terms+Z+POL 58.34 / 59.38 ; Terms+POS+POL 48.42 / 50.03 ; Terms+Z+POS+POL 55.35 / 58.58
All experiments were repeated after applying a Twitter dictionary that expands each tweet with the expressions corresponding to its emoticons and abbreviations. Table 3 shows that this dictionary improves the f-measure in every experiment; the best results are again obtained by combining Z_scores and pre-polarity features.
Table 3. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets, after using the Twitter dictionary:
Terms 50.15 / 48.56 ; Terms+Z 57.17 / 58.37 ; Terms+POS 44.07 / 42.64 ; Terms+POL 54.72 / 54.53 ; Terms+Z+POS 53.20 / 56.47 ; Terms+Z+POL 59.66 / 61.07 ; Terms+POS+POL 48.97 / 51.90 ; Terms+Z+POS+POL 55.83 / 60.22
5 Conclusion. The paper tested the impact of the Twitter dictionary, sentiment lexicons, Z_score features and POS tags on the sentiment classification of tweets, extended the feature vector of tweets with all these features, proposed Z_score as a new type of feature and showed that it can improve performance. Future work: using Z_score in other ways, on other types of corpora and with other combination methods.
Best official SemEval 2013 results, for comparison (constrained / unconstrained average f-measures): NRC-Canada 69.02, GU-MLT-LT 65.27, teragram 64.86 / 64.86, BOUNCE 63.53, KLUE 63.06, AMI&ERIC 62.55 / 61.17, FBM 61.17, AVAYA 60.84 / 64.06, SAIL 60.14 / 61.03, UT-DB 59.87, FBK-irst 59.76; second results table: NRC-Canada 68.46, GU-MLT-LT 62.15, KLUE 62.03, AVAYA 60.00 / 59.47, teragram 59.10, NTNU 57.97 / 54.55, CodeX 56.70, FBK-irst 54.87, AMI&ERIC 53.63 / 52.62, ECNUCS 53.21 / 54.77, UT-DB 52.46.
[Hamdan, Béchet & Bellot, SemEval 2014]
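A minimal sketch of the Z_score computation of Eq. (2), written for illustration only (the variable names and the toy data are mine): each term's count in a class is compared to its expected count under the corpus-wide term probability.

```python
# Minimal sketch following Eq. (2) above: Z-score of each term in each class,
# computed from raw term counts per class (toy data, not the SemEval corpus).
import math
from collections import Counter
from typing import Dict, List

def zscores(class_tokens: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    per_class = {c: Counter(toks) for c, toks in class_tokens.items()}
    corpus = Counter()
    for counts in per_class.values():
        corpus.update(counts)
    total = sum(corpus.values())
    scores: Dict[str, Dict[str, float]] = {}
    for c, counts in per_class.items():
        n_j = sum(counts.values())            # number of terms in class C_j
        scores[c] = {}
        for term, tf in counts.items():       # tf: occurrences of the term in C_j
            p = corpus[term] / total          # P(t_i): term probability over the corpus
            mean = n_j * p
            sd = math.sqrt(n_j * p * (1 - p))
            scores[c][term] = (tf - mean) / sd if sd > 0 else 0.0
    return scores

print(zscores({"positive": "love love great sad".split(),
               "negative": "hate sad bad love".split()})["positive"]["love"])
```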
 • 31. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Subjectivity lexicon : MPQA - The MPQA (Multi-Perspective Question Answering) Subjectivity Lexicon 31 http://mpqa.cs.pitt.edu Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
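As an illustration, the MPQA subjectivity clues are usually distributed as a text file of key=value pairs; a small loader could look like the following (the file name and field names assume the standard distribution format, which is not spelled out on the slide):

```python
# Sketch of reading the MPQA subjectivity clues file, assuming the usual
# "key=value" line format of the distributed .tff file.
def load_mpqa(path: str) -> dict:
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
            if "word1" in fields:
                lexicon[fields["word1"]] = {
                    "type": fields.get("type"),             # strongsubj / weaksubj
                    "pos": fields.get("pos1"),
                    "prior_polarity": fields.get("priorpolarity"),
                }
    return lexicon

# lex = load_mpqa("subjclueslen1-HLTEMNLP05.tff")   # hypothetical file name
# print(lex["great"]["prior_polarity"])             # expected: "positive"
```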
 • 32. P. Bellot (AMU-CNRS, LSIS-OpenEdition) 32 http://wordnetweb.princeton.edu/perl/webwn http://www.cs.rochester.edu/research/cisd/wordnet (Screenshot: G. A. Miller, WordNet: A Lexical Database for English, Communications of the ACM, November 1995, Vol. 38, No. 11, p. 39)
  • 33. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 33 http://sentiwordnet.isti.cnr.it
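One convenient way to query SentiWordNet scores is through NLTK's corpus reader; this is an assumption about tooling on my part (the slide only points to sentiwordnet.isti.cnr.it), but the calls below are standard NLTK:

```python
# Querying SentiWordNet positivity/negativity/objectivity scores via NLTK.
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)
from nltk.corpus import sentiwordnet as swn

for ss in swn.senti_synsets("great", "a"):
    # Each adjective sense of "great" carries three scores summing to 1.
    print(ss, ss.pos_score(), ss.neg_score(), ss.obj_score())
```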
 • 34. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Aspect Based Sentiment Analysis — Dataset : 3K English sentences from restaurant reviews + 3K English sentences extracted from customer reviews of laptops, tagged by experienced human annotators — We proposed :
1. Aspect term extraction: a CRF model
2. Aspect term polarity detection: a Multinomial Naive Bayes classifier with features such as Z-score, POS tags and the prior polarity extracted from the Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon (a sketch of this step follows below)
3. Category detection & category polarity detection: a Z-score model
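A rough sketch of step 2 of this list, under my own simplifying assumptions (hand-built lexicons and count features standing in for the full prior-polarity, Z-score and POS feature set; not the submitted system):

```python
# Aspect-term polarity with a Multinomial Naive Bayes classifier over simple
# count features; Z-score and POS counts would be appended the same way.
from sklearn.naive_bayes import MultinomialNB

POS_LEX = {"great", "good", "love", "loved"}   # placeholder for Bing Liu / MPQA entries
NEG_LEX = {"bad", "hate", "expensive"}

def features(sentence: str) -> list:
    toks = sentence.lower().split()
    return [sum(t in POS_LEX for t in toks),   # number of positive lexicon words
            sum(t in NEG_LEX for t in toks)]   # number of negative lexicon words

X = [features(s) for s in ["I loved their fajitas",
                           "The fajitas were bad",
                           "The fajitas are their first plate"]]
y = ["positive", "negative", "neutral"]

clf = MultinomialNB().fit(X, y)
print(clf.predict([features("their salads were great")]))
```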
Z_score features are used here as well: Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010), who chose a threshold (Z>2) to select the terms with a salient Z_score and combined the scores with a logistic regression; we instead use Z_score as additional features for the Multinomial Naive Bayes classifier (Eq. 1 and 2 above).
Subtask 4 (category polarity detection) uses the same Multinomial Naive Bayes classifier and features as subtask 2, plus the name of the category as an additional feature: for each sentence having n categories, n training examples are added that differ only in this category feature.
Experiments and evaluation: the system was tested on the training and testing data provided by the SemEval 2014 ABSA task. Two data sets were provided: the first contains 3K sentences of restaurant reviews annotated with the aspect terms, their polarities, their categories and the polarity of each category; the second contains 3K sentences of laptop reviews annotated only with the aspect terms and their polarities. The evaluation was done in two steps. The first step concerns subtasks 1 and 3: our system is 24% and 21% above the baseline for aspect term extraction in restaurant and laptop reviews respectively, and 3% above for category detection in restaurant reviews.
Table 1. Results of subtasks 1 and 3 for restaurant reviews and of subtask 1 for laptop reviews (P: precision, R: recall, F: f-measure):
Restaurant, subtask 1: Baseline P 0.52, R 0.42, F 0.47 / System P 0.81, R 0.63, F 0.71
Restaurant, subtask 3: Baseline P 0.73, R 0.59, F 0.65 / System P 0.77, R 0.60, F 0.68
Laptop, subtask 1: Baseline P 0.44, R 0.29, F 0.35 / System P 0.76, R 0.45, F 0.56
The second step involves subtasks 2 and 4. We were provided with (1) restaurant review sentences annotated with their aspect terms and categories, for which the polarity of each aspect term and category had to be determined, and (2) laptop review sentences annotated with aspect terms, for which the aspect term polarity had to be determined. Table 2 gives the results of our system and of the baseline (R: number of correctly retrieved examples, All: number of all true examples, A: accuracy).
Table 2. Results of subtasks 2 and 4 for restaurant reviews and of subtask 2 for laptop reviews:
Restaurant, subtask 2: Baseline R 673, All 1134, A 0.64 / System R 818, All 1134, A 0.72
Restaurant, subtask 4: Baseline R 673, All 1025, A 0.65 / System R 739, All 1025, A 0.72
Laptop, subtask 2: Baseline R 336, All 654, A 0.51 / System R 424, All 654, A 0.64
Our system is 8% and 13% above the baseline for aspect term polarity detection in restaurant and laptop reviews respectively, and 7% above for category polarity detection in restaurant reviews.
  • 35. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) IR and Digital Libraries ! Social Book Search 35
  • 36. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 36 INEX topics
 • 37. P. Bellot (AMU-CNRS, LSIS-OpenEdition) INEX 2014 Social Book Search Track — In 2014, the Social Book Search Track consists of two tasks: • Suggestion task: a system-oriented batch retrieval/recommendation task • Interactive task: a user-oriented interactive task where we want to gather user data on searching for different search tasks and different search interfaces. — 2.8 million book descriptions with metadata from Amazon and LibraryThing — 14 million reviews (1.5 million books have no review) — Amazon: formal metadata such as book title, author, publisher, publication year, library classification codes, Amazon categories and similar product information, as well as user-generated content in the form of user ratings and reviews — LibraryThing: user tags and user-provided metadata on awards, book characters, locations and blurbs 37 https://inex.mmci.uni-saarland.de/tracks/books/
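Purely as an illustration of the fields listed above, a book record from the collection could be held in a structure like this (field names are mine, not the collection's actual XML element names):

```python
# Illustrative container for the Amazon/LibraryThing metadata fields of a book.
from dataclasses import dataclass, field
from typing import List

@dataclass
class BookRecord:
    isbn: str
    title: str
    authors: List[str]
    publisher: str = ""
    year: int = 0
    classification_codes: List[str] = field(default_factory=list)  # library classification
    categories: List[str] = field(default_factory=list)            # Amazon categories
    similar_products: List[str] = field(default_factory=list)
    ratings: List[int] = field(default_factory=list)               # user ratings (1 to 5)
    reviews: List[str] = field(default_factory=list)               # Amazon user reviews
    tags: List[str] = field(default_factory=list)                  # LibraryThing tags

book = BookRecord(isbn="0000000000", title="Example", authors=["A. Author"])
print(book.title, len(book.reviews))
```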
  • 38. P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 38
 • 39. P. Bellot (AMU-CNRS, LSIS-OpenEdition) 39
Table 1. Some facts about the Amazon collection: number of pages (i.e. books) 2,781,400 ; number of reviews 15,785,133 ; number of pages that contain at least one review 1,915,336.
3 Retrieval model. 3.1 Sequential Dependence Model. Like the previous year, we used a language modeling approach to retrieval [4] and Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence Model (SDM), a special case of the MRF, with three kinds of features: single term features (standard unigram language model features, fT), exact phrase features (words appearing in sequence, fO) and unordered window features (words required to be close together, but not necessarily in an exact sequence order, fU). Documents are ranked according to the following scoring function:
score_SDM(Q, D) = lambda_T * sum_{q in Q} fT(q, D) + lambda_O * sum_{i=1..|Q|-1} fO(q_i, q_{i+1}, D) + lambda_U * sum_{i=1..|Q|-1} fU(q_i, q_{i+1}, D)
where the feature weights are set according to the authors' recommendation (lambda_T = 0.85, lambda_O = 0.1, lambda_U = 0.05). fT, fO and fU are the log maximum likelihood estimates of the query terms in document D, computed over the target collection with Dirichlet smoothing.
Table 2. Official results of the Best Books for Social Search task of the INEX 2011 Book track, using judgements derived from the LibraryThing discussion groups (our runs are prefixed by p62). Columns: nDCG@10, P@10, MRR, MAP.
p4-inex2011SB.xml social.fb.10.50  0.3101  0.2071  0.4811  0.2283
p54-run4.all-topic-fields.reviews-split.combSUM  0.2991  0.1991  0.4731  0.1945
p4-inex2011SB.xml social  0.2913  0.1910  0.4661  0.2115
p4-inex2011SB.xml full.fb.10.50  0.2853  0.1858  0.4453  0.2051
p54-run2.all-topic-fields.all-doc-fields  0.2843  0.1910  0.4567  0.2035
p62.recommendation  0.2710  0.1900  0.4250  0.1770
p54-run3.title.reviews-split.combSUM  0.2643  0.1858  0.4195  0.1661
p62.sdm-reviews-combine  0.2618  0.1749  0.4361  0.1755
p62.baseline-sdm  0.2536  0.1697  0.3962  0.1815
p62.baseline-tags-browsenode  0.2534  0.1687  0.3877  0.1884
p4-inex2011SB.xml full  0.2523  0.1649  0.4062  0.1825
wiki-web-nyt-gw  0.2502  0.1673  0.4001  0.1857
p4-inex2011SB.xml amazon  0.2411  0.1536  0.3939  0.1722
p62.sdm-wiki  0.1953  0.1332  0.3017  0.1404
p62.sdm-wiki-anchors  0.1724  0.1199  0.2720  0.1253
p4-inex2011SB.xml lt  0.1592  0.1052  0.2695  0.1199
p18.UPF QE group BTT02  0.1531  0.0995  0.2478  0.1223
p18.UPF QE genregroup BTT02  0.1327  0.0934  0.2283  0.1001
p18.UPF QEGr BTT02 RM  0.1291  0.0872  0.2183  0.0973
p18.UPF base BTT02  0.1281  0.0863  0.2135  0.1018
p18.UPF QE genre BTT02  0.1214  0.0844  0.2089  0.0910
p18.UPF base BT02  0.1202  0.0796  0.2039  0.1048
p54-run1.title.all-doc-fields  0.1129  0.0801  0.1982  0.0868
Table 3. Top runs of the Best Books for Social Search task of the INEX 2011 Book track, using judgements obtained by crowdsourcing (Amazon Mechanical Turk); our runs are prefixed by p62. Columns: nDCG@10, P@10, MRR, MAP.
p62.baseline-sdm  0.6092  0.5875  0.7794  0.3896
p4-inex2011SB.xml amazon  0.6055  0.5792  0.7940  0.3500
p62.baseline-tags-browsenode  0.6012  0.5708  0.7779  0.3996
p4-inex2011SB.xml full  0.6011  0.5708  0.7798  0.3818
p4-inex2011SB.xml full.fb.10.50  0.5929  0.5500  0.8075  0.3898
p62.sdm-reviews-combine  0.5654  0.5208  0.7584  0.2781
p4-inex2011SB.xml social  0.5464  0.5167  0.7031  0.3486
p4-inex2011SB.xml social.fb.10.50  0.5425  0.5042  0.7210  0.3261
p54-run2.all-topic-fields.all-doc-fields  0.5415  0.4625  0.8535  0.3223
Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2005)
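A toy sketch of the SDM scoring function above, under my own simplifications (tiny in-memory index, crude Dirichlet-smoothed estimates, approximate collection statistics for phrases); the actual runs were produced with a full retrieval system, not this code:

```python
# Toy SDM scoring over an in-memory collection (illustration only).
import math
from collections import Counter

LAMBDA_T, LAMBDA_O, LAMBDA_U = 0.85, 0.10, 0.05   # weights recommended by Metzler & Croft
MU = 2500                                          # Dirichlet smoothing parameter
WINDOW = 8                                         # unordered window size

docs = {"d1": "the social book search track of inex".split(),
        "d2": "book reviews and social tags from librarything".split()}
collection = [t for toks in docs.values() for t in toks]
coll_tf, coll_len = Counter(collection), len(collection)

def dirichlet(count: int, doc_len: int, term_cf: float) -> float:
    # log of a Dirichlet-smoothed maximum likelihood estimate
    return math.log((count + MU * (term_cf + 0.5) / coll_len) / (doc_len + MU))

def count_ordered(toks, a, b):
    return sum(1 for i in range(len(toks) - 1) if toks[i] == a and toks[i + 1] == b)

def count_window(toks, a, b, w=WINDOW):
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return sum(1 for i in pos_a for j in pos_b if i != j and abs(i - j) < w)

def score_sdm(query: str, doc_id: str) -> float:
    q, toks = query.split(), docs[doc_id]
    s = LAMBDA_T * sum(dirichlet(toks.count(t), len(toks), coll_tf[t]) for t in q)
    for a, b in zip(q, q[1:]):
        # 1.0 is a crude stand-in for the phrase's collection frequency
        s += LAMBDA_O * dirichlet(count_ordered(toks, a, b), len(toks), 1.0)
        s += LAMBDA_U * dirichlet(count_window(toks, a, b), len(toks), 1.0)
    return s

print(sorted(docs, key=lambda d: score_sdm("social book search", d), reverse=True))
```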
3.2 External resources combination. As previously done last year, we exploited external resources in a Pseudo-Relevance Feedback (PRF) fashion to expand the query with informative terms. Given a resource R, we form a subset RQ of informative documents for the initial query Q using pseudo-relevance feedback: we first rank the documents of R with the SDM ranking function, then compute an entropy measure H_RQ(t) for each term t over RQ in order to weigh the terms according to their relative informativeness:
H_RQ(t) = - sum_{w in t} p(w|RQ) * log p(w|RQ)
These external weighted terms are finally used to expand the original query.
We use our SDM baseline defined in section 3.1 and incorporate the recommendation estimate described below:
score_recomm(Q, D) = lambda_D * score_SDM(Q, D) + (1 - lambda_D) * t_D
where the lambda_D parameter was set based on observations over the test topics made available to participants for training: t_D had no influence on the ranking of documents after the hundredth result (average estimation), so the smoothing parameter is fixed from the SDM scores themselves such that t_D only affects the documents ranked above the hundredth result. In practice, this approach re-ranks the results of the SDM retrieval model according to the popularity and the likability of the different books.
Wikipedia sub-articles are also exploited: children node pages (or sub-articles) are weighted half that of their parents in order to minimize a potential topic drift, and loops in the graph are avoided (a child node cannot be linked to one of its elders), because they bring no additional information and could change the weights between linked articles. Informative words are then extracted from the sub-articles and incorporated into our retrieval model like another external resource.
3.4 Social opinion for book search. The test collection used this year for the Book Track contains Amazon pages of books. These pages are composed, among others, of editorial information (such as the number of pages or the blurb), user ratings and user reviews. However, contrary to previous years, the actual content of the books is not available. Hence, the task is to rank books according to this sparse informative content and to the opinion of readers expressed in the reviews, the user ratings being integers between 1 and 5. We wanted to model two social popularity assumptions: a product that has a lot of reviews must be relevant (or at least popular), and a highly rated product must be relevant; therefore a product having a large number of good reviews really must be relevant. However, the collection often contains only a small number of ratings for a given book, and the challenge is to determine whether each user rating is significant. To do so, we first define X_R^D, a random set of "bad" ratings (1, 2 or 3 out of 5) for book D. Then we evaluate the statistically significant differences between X_R^D and X_R^D ∪ X_U^D using Welch's t-test, where X_U^D is the actual set of user ratings for book D:
t_D = (mean(X_R^D ∪ X_U^D) - mean(X_U^D)) / s
with s = sqrt(s_RU^2 / n_RU + s_U^2 / n_U)
where s^2 is the unbiased estimator of the variance of the two sets and n_X is the number of ratings in set X. The underlying assumption is that significant differences occur in two situations. First, when there is a small number of user ratings but they are all very good (for example, good but little-known books).
Second, when there is a very large number of user ratings but they are average. Hence this statistical test gives us a single estimate of both likability and popularity. (Slide note: a statistical test between the observed ratings and random ratings. Is a rating significant? ANR CAAS project.)
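A hedged sketch of the t_D estimate described above (the sampling of random "bad" ratings and the sample size are my own choices for illustration):

```python
# Welch's t-test estimate t_D: compare random "bad" ratings (1 to 3) against
# the union of those ratings and the observed user ratings of a book.
import math
import random

def t_d(user_ratings: list, n_random: int = 30, seed: int = 0) -> float:
    random.seed(seed)
    x_r = [random.choice((1, 2, 3)) for _ in range(n_random)]   # X_R^D: random bad ratings
    x_ru = x_r + user_ratings                                   # X_R^D union X_U^D
    def mean(xs): return sum(xs) / len(xs)
    def var(xs):                                                # unbiased variance estimator
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    s = math.sqrt(var(x_ru) / len(x_ru) + var(user_ratings) / len(user_ratings))
    return (mean(x_ru) - mean(user_ratings)) / s if s else 0.0

# A few very good ratings vs. many average ratings: both yield a salient |t_D|,
# which is the single popularity/likability estimate used for re-ranking.
print(t_d([5, 5, 5]), t_d([3, 4, 3, 4, 3] * 20))
```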
 • 40. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Query Expansion : with Concepts from DBpedia 40
 • 41. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Terms only vs. Extended Features — We modeled book likeliness based on the following idea: the more reviews a book has, the more interesting it is (it may not be a good or popular book, but it is a book with a high impact) — The InL2 information retrieval model alone (a DFR-based model, Divergence From Randomness) seems to perform better than SDM (language modeling) with extended features 41 Benkoussas, Hamdan, Albitar, Ollagnier & Bellot, 2014
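A minimal sketch of the likeliness idea stated above, assuming a simple linear combination of a retrieval score with a damped review count (the combination function and the weight alpha are mine, for illustration only):

```python
# Re-rank books by mixing a retrieval score with a review-count "likeliness" signal.
import math

def rerank(results: dict, review_counts: dict, alpha: float = 0.8) -> list:
    # results: {book_id: retrieval score}, review_counts: {book_id: number of reviews}
    def combined(book_id: str) -> float:
        likeliness = math.log1p(review_counts.get(book_id, 0))   # damped review count
        return alpha * results[book_id] + (1 - alpha) * likeliness
    return sorted(results, key=combined, reverse=True)

scores = {"book_a": 2.1, "book_b": 2.0, "book_c": 1.2}
reviews = {"book_a": 3, "book_b": 250, "book_c": 40}
print(rerank(scores, reviews))
```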