SCHOLARLY BOOK RECOMMENDATION: A TASK COMBINING SENTIMENT ANALYSIS, INFORMATION EXTRACTION AND IR

Patrice Bellot (M. Dacos, C. Benkoussas, A. Ollagnier, M. Orban, H. Hamdan, A. Htait, E. Faath, Y.M. Kim, F. Béchet)
Aix-Marseille Université - CNRS (LSIS UMR 7296 ; OpenEdition)
patrice.bellot@univ-amu.fr
LSIS - DIMAG team http://www.lsis.org/dimag
OpenEdition Lab : http://lab.hypotheses.org
Objectives
— To present the OpenEdition platforms (open access)
— To present some results we obtained

1- Social Book Search on Amazon

2- Scholarly Book Search on OpenEdition 

and still open questions (towards a future collaborative project ?)
— OpenEdition Lab : a DH Research Program that aims to :

— Detect hot topics, hot books, hot papers

— Develop approaches for recommending and searching for books

Our partners: libraries and institutions all over the world
> 4 million unique visitors / month
Some Open Questions
— How to combine Information Retrieval and Recommendation 

(Learning to Rank ?)
— Is combining official meta-data and Reader-generated content (reviews)
useful ?
— How to identify and exploit bibliographic references (cross-linking) ?
— Is the content of the books useful ? (vs. reviews / abstracts only) ?
— How to deal with complex queries ?
— How to go beyond topical relevance ? (genre, reading level, recentness…)
Classical solution (the Vector Space Model for IR): a weight is associated with every word in every document, so that each document is a vector; the query is represented as a vector in the same space, and a degree of similarity between each document and the query can then be estimated:
\vec{d} = \begin{pmatrix} w_{m_1,d} \\ w_{m_2,d} \\ \vdots \\ w_{m_n,d} \end{pmatrix} \qquad \vec{q} = \begin{pmatrix} w_{m_1,q} \\ w_{m_2,q} \\ \vdots \\ w_{m_n,q} \end{pmatrix}

where w_{m_i,d} is the weight of word m_i in document d.

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q} \quad (1)

w_{i,d} \leftarrow \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \quad (2)

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \cdot \frac{w_{i,q}}{\sqrt{\sum_{j=1}^{n} w_{j,q}^2}} = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\| \cdot \|\vec{q}\|} = \cos(\vec{d}, \vec{q}) \quad (3)
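As a minimal illustration of Eq. (1)-(3), assuming raw term frequencies as weights (a real system would rather use tf-idf or language-model weights):

import math
from collections import Counter

def cosine_similarity(doc_tokens, query_tokens):
    """Cosine between the tf vectors of a document and a query (Eq. 3)."""
    d = Counter(doc_tokens)    # w_{m_i,d}: here, simple term frequencies
    q = Counter(query_tokens)  # w_{m_i,q}
    dot = sum(d[t] * q[t] for t in q)                   # Eq. (1), shared terms only
    norm_d = math.sqrt(sum(w * w for w in d.values()))  # Eq. (2) denominator
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine_similarity("the book is about new york".split(),
                        "new york mystery book".split()))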
Topicality	only	?
OpenEdition	home	page
Searching for Books ?
Very diverse needs :
— Topicality
— With a precise context, e.g. arts in China during the 20th century
— With named entities : locations (the book is about a specific location OR the action
takes place at this location), proper names…
— Style / Expertise / Language
— fiction, novel, essay, proceedings, position papers…
— for experts / for dummies / for children …
— in English, in French, in old French, in (very) local languages …
— looking for citations / references
— in what book appears a given citation
— what are the books that refer to a given one
— Authority :
— What are the most important books about … (and what does "most important" mean ?)
— What are the most popular books about …
5
For Digital Humanities (and for Research Purposes) : Researchers may not be interested
in the best-sellers but in the impact of a book = in the debate it provokes
Searching for books on Amazon
Searching for books : a complex (?) query
I	am	looking	for	the	best	mystery	books	set	in	New	York	City
I am looking for the best mystery books set in New York City — not « Mystery »
Category : Mystery — Sort by Avg. Cust. Review — Query : New York

Among the results: a book set in Buffalo NY, near Toronto; a book that reached first position on The New York Times Best Seller list in 2013, in which « New York city » occurs only once; and a book in which « New York city » occurs only once, but which was published in NYC and reviewed by the NY Times.
Sort by Relevance ?
http://social-book-search.humanities.uva.nl/#/overview
2 The Amazon collection
The document collection used for this year's Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN number, title, number of pages, etc. However, in this collection the most important content resides in social data. Indeed, Amazon is social-oriented: users can comment on and rate products they purchased or own. Reviews are identified by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. Users can also assign tags of their own creation to a product. These tags are useful for refining the search of other users in that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels, which are contained in the <browseNode> fields.

Table 1. Some facts about the Amazon collection.
Number of pages (i.e. books): 2,781,400
Number of reviews: 15,785,133
Number of pages that contain at least one review: 1,915,336

3 Retrieval model
3.1 Sequential Dependence Model
As in the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence
Organizers
Marijn Koolen (University of Amsterdam)

Toine Bogers (Aalborg University Copenhagen)

Antal van den Bosch (Radboud University Nijmegen)

Antoine Doucet (University of Caen)

Maria Gaede (Humboldt University Berlin)

Preben Hansen (Stockholm University)

Mark Hall (Edge Hill University)

Iris Hendrickx (Radboud University Nijmegen)

Hugo Huurdeman (University of Amsterdam)

Jaap Kamps (University of Amsterdam)

Vivien Petras (Humboldt University Berlin)

Michael Preminger (Oslo and Akershus University College of Applied Sciences)

Mette Skov (Aalborg University Copenhagen)

Suzan Verberne (Radboud University Nijmegen)

David Walsh (Edge Hill University)
Searching for Books : Looking at Reviews
Social Tagging
User Profiles (catalog, reviews, ratings)
Amazon : Organization of Items (Nodes & Items)
Product Advertising API https://aws.amazon.com/
cf.	http://www.codediesel.com/libraries/amazon-advertising-api-browsenodes/
Amazon Navigation

Graph : YASIV
http://www.yasiv.com/#/Search?q=orwell&category=Books&lang=US
SBS Collection : Meta-data and Social Data
2.8	Million	«	Documents	»	(book	descriptions)
http://social-book-search.humanities.uva.nl
— Using NLP : to what extent ? (Named Entity Recognition, Syntactic Analysis…)
http://social-book-search.humanities.uva.nl
SBS Collection : Real Topics from Library Thing
Our proposal for CLEF Social Book Search (Amazon)
IR : Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004) and/or Divergence From Randomness (InL2) model, plus Query Expansion with dependency analysis (see the SDM sketch below)
Ratings : the more reviews a book has, and the better its ratings, the more relevant it is considered.
Graph : expanding the retrieved books with Similar Books, then reranking with PageRank
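As an illustration, a minimal sketch that builds an SDM query in the Indri query language with the commonly used 0.85/0.10/0.05 weights; the exact weights, window size and engine used in our runs are not given on this slide:

def sdm_query(terms, w_t=0.85, w_o=0.10, w_u=0.05, window=8):
    """Sequential Dependence Model query: unigrams, exact ordered
    bigrams (#1) and unordered windows (#uwN) of adjacent term pairs."""
    unigrams = " ".join(terms)
    if len(terms) < 2:
        return f"#combine({unigrams})"
    bigrams = " ".join(f"#1({a} {b})" for a, b in zip(terms, terms[1:]))
    windows = " ".join(f"#uw{window}({a} {b})" for a, b in zip(terms, terms[1:]))
    return (f"#weight( {w_t} #combine({unigrams}) "
            f"{w_o} #combine({bigrams}) "
            f"{w_u} #combine({windows}) )")

print(sdm_query(["mystery", "new", "york"]))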
● We tested many reranking methods, combining the retrieval model scores with other scores based on social information.
● For each document we compute:
– PageRank: an algorithm that exploits the link structure to score the importance of nodes in the graph.
– Likeliness: computed from information generated by users (reviews and ratings); the more reviews and good ratings a book has, the more interesting it is (see the sketch below).
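A sketch of such a combination; the interpolation weights and the exact likeliness formula are illustrative assumptions, not the ones used in our runs:

import math

def likeliness(n_reviews, avg_rating, max_reviews):
    # The more reviews and the better the ratings, the higher the score.
    return (math.log(1 + n_reviews) / math.log(1 + max_reviews)) * (avg_rating / 5.0)

def rerank(results, pagerank, books, alpha=0.6, beta=0.2, gamma=0.2):
    """results: {book_id: retrieval score}; pagerank: {book_id: PageRank};
    books: {book_id: (n_reviews, avg_rating)}. Returns book ids, best first."""
    max_reviews = max(n for n, _ in books.values())
    def score(b):
        n, r = books[b]
        return (alpha * results[b] + beta * pagerank.get(b, 0.0)
                + gamma * likeliness(n, r, max_reviews))
    return sorted(results, key=score, reverse=True)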
Graph Modeling – Reranking Schemes
[Flowchart: for a topic t_i, retrieve documents from the collection; take the retrieved documents as starting nodes and expand them with their neighbors (similar products) in the graph; delete duplications; rerank to produce the final document set D_final]
Graph Modeling - Recommendation

Page Rank + Similar Products
- Best results in 2011 (judgements obtained by crowdsourcing) (IR and ratings): P@10 ≈ 0.58
- Good results in 2014 (IR, ratings, expansion): P@10 ≈ 0.23 ; MAP ≈ 0.44
- In 2015: rank 25/47 (IR + graph, but the graph improved IR): P@10 ≈ 0.2 (best system: 0.39, and it included the price of books)
Our proposal for Scholarly Book Search (OpenEdition)

Pipeline (over the OpenEdition journals, e.g. questions de communication, vertigo, edc, echogeo, quaderni):
— BILBO: information extraction and reference analysis (SVM - Z-score - CRF); finding reviews in journals and linking them to the reviewed books; automatic classification and metadata.
— ÉCHO: sentiment analysis on the reviews (ratings, polarity).
— Recommendation: social IR over a graph of contents (graph scoring).

Examples of references BILBO must handle (level 1: bibliography entry; level 2: footnote; level 3: reference embedded in running text):

Level 1: Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30. DOI : 10.3406/rfp.1986.1499

Level 2: 18 Voir Permanent Mandates Commission, Minutes of the Fifteenth Session (Geneva: League of Nations, 1929), pp. 100-1. Pour plus de détails, voir Paul Ghali, Les nationalités détachées de l'Empire ottoman à la suite de la guerre (Paris: Les Éditions Domat-Montchrestien, 1934), pp. 221-6.

Level 3: ils ont déjà édité trois recueils et auxquelles ils ont consacré de nombreux travaux critiques. Leur nouvel ouvrage, intitulé Le Roman véritable. Stratégies préfacielles au XVIIIe siècle et rédigé à six mains par Jan Herman, Mladen Kozul et Nathalie Kremer – chaque auteur se chargeant de certains chapitres au sein

Annotation produced by BILBO (TEI):
<bibl><author><surname>Langouet</surname>, <forename>G.</forename>,</author> (<date>1986</date>), <title level="a">« Innovations pédagogiques et technologies éducatives »</title>, <title level="j">Revue française de pédagogie</title>, <abbr>n°</abbr> <biblScope type="issue">76</biblScope>, <abbr>pp.</abbr> <biblScope type="page">25-30</biblScope>. <idno type="DOI">DOI : 10.3406/rfp.1986.1499</idno></bibl>
Searching for Book Reviews in Scientific Journals
• Supervised approaches for filtering = a new kind of genre classification
• Feature selection : Threshold (Z-score) + random forest
• Features : unigrams (words), localisation of named entities / dates
6 Experiments
In this section we describe results from experiments using a collection of documents from Revues.org and the Web. We use supervised learning methods to build our classifiers and evaluate the resulting models on new test cases. The focus of our work has been on comparing the effectiveness of different inductive learning algorithms (Naive Bayes, Support Vector Machines with RBF and linear kernels) in terms of classification accuracy. We also explored alternative document representations (bag-of-words, feature selection using z-score, Named Entity repartition in the text).
6.1 Naive Bayes (NB)
In order to evaluate different classification models, we adopted the naive Bayes approach as a baseline (Zubaryeva and Savoy, 2010). The classification system has to choose between two possible hypotheses, h_0 = "It is a Review" and h_1 = "It is not a Review", selecting the class that maximizes Equation (5), where |w| indicates the number of words included in the current document and w_j is the j-th of those words:

\arg\max_{h_i} P(h_i) \cdot \prod_{j=1}^{|w|} P(w_j \mid h_i) \quad (5)

where P(w_j \mid h_i) = \frac{tf_{j,h_i}}{n_{h_i}}

We estimate the probabilities of Equation (5) as the ratio between the lexical frequency of the word w_j in the whole collection T_{h_i} (denoted tf_{j,h_i}) and the size n_{h_i} of the corresponding corpus.
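For concreteness, a compact sketch of this baseline in Python; the add-one smoothing and log-space scoring are illustrative assumptions, the excerpt above does not specify them:

import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {'review': [[tokens], ...], 'not_review': [...]}"""
    priors, tf, sizes, vocab = {}, {}, {}, set()
    n_docs = sum(len(d) for d in docs_by_class.values())
    for h, docs in docs_by_class.items():
        priors[h] = len(docs) / n_docs                    # P(h_i)
        tf[h] = Counter(t for doc in docs for t in doc)   # tf_{j,h_i}
        sizes[h] = sum(tf[h].values())                    # n_{h_i}
        vocab |= set(tf[h])
    return priors, tf, sizes, len(vocab)

def classify(doc, priors, tf, sizes, V):
    # log P(h) + sum_j log P(w_j|h), i.e. Eq. (5) in log space,
    # with add-one smoothing so unseen words do not zero the product.
    def log_post(h):
        return math.log(priors[h]) + sum(
            math.log((tf[h][w] + 1) / (sizes[h] + V)) for w in doc)
    return max(priors, key=log_post)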
6.2 Support Vector Machines (SVM)
SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle (Vapnik, 1995) from computational learning theory. In their basic form, SVMs learn linear threshold functions. Nevertheless, by a simple plug-in of an appropriate kernel function, they can be used to learn linear classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and to use them for the purposes of classification (Aggarwal and Zhai, 2012). With the vectors from the different representations presented below, we used the Weka toolkit to learn the model. With the linear kernel and the Radial Basis Function (RBF) kernel, this model sometimes reaches a good level of performance, at the cost of fast growth of the processing time during the learning stage (Kummer, 2012).
6.3 Results
We used different strategies to represent each textual unit. First, the unigram model (bag-of-words), where all words are considered as features. Second, feature selection based on the normalized z-score, keeping the first 1,000 words according to this score (after removing all words that appear fewer than 5 times). As the third approach, we suggested that the common features of the Review collection can be located in the Named Entity distribution in the text.

Table 4: Performance of the classification models using the three indexing schemes on the test set (R = recall, P = precision, F-M = F-measure), for the Review and not-Review classes. For the RBF runs, the kernel parameters are C and (presumably) gamma.

Scheme 1 (bag-of-words):
  NB            Review: R 65.5%, P 81.5%, F-M 72.6%   not-Review: R 81.6%, P 65.7%, F-M 72.8%
  SVM (Linear)  Review: R 99.6%, P 98.3%, F-M 98.9%   not-Review: R 97.9%, P 99.5%, F-M 98.7%
  SVM (RBF)     Review: R 89.8%, P 97.2%, F-M 93.4%   not-Review: R 96.8%, P 88.5%, F-M 92.5%   (C = 5.0, gamma = 0.00185)
Scheme 2 (z-score selection):
  NB            Review: R 90.6%, P 64.2%, F-M 75.1%   not-Review: R 37.4%, P 76.3%, F-M 50.2%
  SVM (Linear)  Review: R 87.2%, P 81.3%, F-M 84.2%   not-Review: R 75.3%, P 82.7%, F-M 78.8%
  SVM (RBF)     Review: R 87.2%, P 86.5%, F-M 86.8%   not-Review: R 83.1%, P 84.0%, F-M 83.6%   (C = 32.0, gamma = 0.00781)
Scheme 3 (Named Entity distribution):
  NB            Review: R 80.0%, P 68.4%, F-M 73.7%   not-Review: R 54.2%, P 68.7%, F-M 60.6%
  SVM (Linear)  Review: R 77.0%, P 81.9%, F-M 79.4%   not-Review: R 78.9%, P 73.5%, F-M 76.1%
  SVM (RBF)     Review: R 81.2%, P 48.6%, F-M 79.9%   not-Review: R 72.6%, P 75.8%, F-M 74.1%   (C = 8.0, gamma = 0.03125)
The 30 features with the highest Z scores across the corpus:

#  Feature        Z score | #  Feature          Z score
1  abandonne      30.14   | 16 winter           9.23
2  seront         30.00   | 17 cleo             8.88
3  biographie     21.84   | 18 visible          8.75
4  entranent      21.20   | 19 fondamentale     8.67
5  prise          21.20   | 20 david            8.54
6  sacre          21.20   | 21 pratiques        8.52
7  toute          20.70   | 22 signification    8.47
8  quitte         19.55   | 23 01               8.38
9  dimension      15.65   | 24 institutionnels  8.38
10 les            14.43   | 25 1930             8.16
11 commandement   11.01   | 26 attaques         8.14
12 lie            10.61   | 27 courrier         8.08
13 construisent   10.16   | 28 moyennes         7.99
14 lieux          10.14   | 29 petite           7.85
15 garde           9.75   | 30 adapted          7.84
In our training corpus, we have 106,911 words obtained from the bag-of-words approach. We selected all tokens (features) that appear more than 5 times in each class. The goal is to design a method capable of selecting terms that clearly belong to one genre of documents. We obtained a vector space that contains 5,957 words (features). After calculating the normalized z-score of all features, we selected the first 1,000 features according to this score.
We aim to explore the distribution of 3 named entities ("authors' names", "locations" and "dates") in the text, after removing all XML-HTML tags (Poibeau, 2003). We then divided texts into 10 parts (the size of each part = total number of words / 10). The distribution ratio of each named entity in each part is used as a feature to build the new document representation, giving a set of 30 features.
Figure 3: ”Person” named entity distribution
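A sketch of how such a 30-feature representation can be computed, assuming entity mentions have already been extracted with their token positions (the NER step itself is not shown; function and argument names are illustrative):

def entity_distribution_features(n_tokens, mentions, n_parts=10):
    """mentions: {'person': [token positions], 'location': [...], 'date': [...]}.
    For each entity type, return the ratio of its mentions falling in each of
    the 10 equal-size parts of the text -> 3 x 10 = 30 features."""
    part_size = max(1, n_tokens // n_parts)
    features = []
    for ent_type in ("person", "location", "date"):
        counts = [0] * n_parts
        positions = mentions.get(ent_type, [])
        for pos in positions:
            counts[min(pos // part_size, n_parts - 1)] += 1
        total = len(positions)
        features.extend(c / total if total else 0.0 for c in counts)
    return features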
Figure 4: ”Location” named entity distribution
Figure 5: ”Date” named entity distribution
http://reviewofbooks.openeditionlab.org
Linking Contents by Analyzing the References
In books: no common stylesheet (or many stylesheets, poorly respected…)
Our proposal :
1) Searching for references in the document / footnotes (Support Vector Machines)
2) Annotating the references (Conditional Random Fields)
BILBO : Our (open-source) software for Reference Analysis
Google Digital Humanities Research Awards (2012)
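A minimal sketch of the annotation step (2) using sklearn-crfsuite; the features and label set shown here are simplified illustrations, not BILBO's actual ones (see the BILBO sources linked below):

# pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_capitalized": t[:1].isupper(),
        "is_digit": t.isdigit(),
        "is_punct": all(not c.isalnum() for c in t),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One toy training reference, tokenized, with TEI-like labels
# ("c" marks punctuation/other tokens):
tokens = ["Langouet", ",", "G.", ",", "(", "1986", ")", ","]
labels = ["surname", "c", "forename", "c", "c", "date", "c", "c"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])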
Annotation
DOI	search	
(Crossref)
OpenEdition	Journals	:	more	than	1.5	million	references	analyzed
Test	:	http://bilbo.openeditionlab.org	
Sources	:	http://github.com/OpenEdition/bilbo
Sentiment Analysis on Reviews
• Statistical metrics (PMI, Z-score, odds ratio…)
• Combined with linguistic resources
We compute a Z_score for each term t_i in a class C_j, denoted Z_score(t_ij), from its term relative frequency tfr_ij in the class C_j, the mean (mean_i), which is the term's probability over the whole corpus multiplied by n_j, the number of terms in the class C_j, and the standard deviation (sd_i) of term t_i over the underlying corpus (see Eq. (1, 2)):

Z\_score(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i} \quad (1)

Z\_score(t_{ij}) = \frac{tfr_{ij} - n_j \cdot P(t_i)}{\sqrt{n_j \cdot P(t_i) \cdot (1 - P(t_i))}} \quad (2)

A term that has a salient frequency in a class in comparison to the others will have a salient Z_score. The Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (> 2) to select the terms whose Z_score exceeds it, then used a logistic regression to combine these scores. We use Z_scores as added features for classification because tweets are too short: many tweets do not have any word with a salient Z_score. Figures 1, 2 and 3 show the distribution of Z_score over each class; we remark that the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (> 2.5) or very rare (< -1.5). A negative value indicates that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms having the highest Z_score in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.

- Sentiment Lexicons
We use Bing Liu's Opinion Lexicon, which was created by (Hu and Liu 2004) and augmented in many later works, and the Subjectivity lexicon. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, but Subjectivity contains negative, positive and neutral ones.

- Part Of Speech (POS)
We annotate each word in the tweet with its POS tag, and then we compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.
Table 1. The first ten terms having the highest Z_score in each class:

positive  Z_score | negative  Z_score | neutral   Z_score
Love      14.31   | Not       13.99   | Httpbit   6.44
Good      14.01   | Fuck      12.97   | Httpfb    4.56
Happy     12.30   | Don't     10.97   | Httpbnd   3.78
Great     11.10   | Shit       8.99   | Intern    3.58
Excite    10.35   | Bad        8.40   | Nov       3.45
Best       9.24   | Hate       8.29   | Httpdlvr  3.40
Thank      9.21   | Sad        8.28   | Open      3.30
Hope       8.24   | Sorry      8.11   | Live      3.28
Cant       8.10   | Cancel     7.53   | Cloud     3.28
Wait       8.05   | stupid     6.83   | begin     3.17
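A direct transcription of Eq. (2), as a sketch (tokenized classes are assumed; terms whose corpus probability is 0 or 1 are skipped to avoid a zero denominator):

import math
from collections import Counter

def z_scores(class_tokens, all_tokens):
    """Z_score of every term of a class C_j against the whole corpus, Eq. (2):
    (tfr_ij - n_j * P(t_i)) / sqrt(n_j * P(t_i) * (1 - P(t_i)))."""
    corpus_tf = Counter(all_tokens)
    n_corpus = len(all_tokens)
    class_tf = Counter(class_tokens)
    n_j = len(class_tokens)  # number of terms in class C_j
    scores = {}
    for term, tf_ij in class_tf.items():
        p = corpus_tf[term] / n_corpus  # term probability over the corpus
        if 0 < p < 1:
            scores[term] = (tf_ij - n_j * p) / math.sqrt(n_j * p * (1 - p))
    return scores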
4 Evaluation
4.1 Data collection
We used the data set provided in SemEval 2013
and 2014 for subtask B of sentiment analysis in
Twitter(Rosenthal, Ritter et al. 2014) (Wilson,
Kozareva et al. 2013). The participants were
provided with training tweets annotated as posi-
tive, negative or neutral. We downloaded these
tweets using a given script. Among 9646 tweets,
we could only download 8498 of them because
of protected profiles and deleted tweets. Then,
we used the development set containing 1654
tweets for evaluating our methods. We combined
the development set with training set and built a
new model which predicted the labels of the test
set 2013 and 2014.
4.2 Experiments
Official Results
The system we submitted for the SemEval evaluation gave 46.38% and 52.02% for the 2013 and 2014 test sets respectively. It should be mentioned that these results are not correct, because of a software bug discovered after the submission deadline; the corrected results are therefore reported as non-official results. In fact, the previous results are the output of our classifier trained with all the features of Section 3, but, because of an index-shifting error, the test set was represented by all the features except the terms.

Non-official Results
We ran various experiments using the features presented in Section 3 with a Multinomial Naive Bayes model. We first constructed a feature vector of tweet terms, which gave 49% and 46% (f-measure, 2013 and 2014); extending it with Z_score features improves the performance by 6.5% and 10.9%, and pre-polarity features also improve the f-measure, by 4% and 6%, but extending with POS tags decreases the f-measure. We also tested all combinations of these features; Table 2 shows the results of each combination. We remark that POS tags are not useful in any of the experiments, and the best result is obtained by combining Z_score and pre-polarity features. We find that Z_score features improve the f-measure significantly and are better than pre-polarity features.
Figure 1: Z_score distribution in the positive class
Figure 2: Z_score distribution in the neutral class
Features            F-measure 2013   F-measure 2014
Terms               49.42            46.31
Terms+Z             55.90            57.28
Terms+POS           43.45            41.14
Terms+POL           53.53            52.73
Terms+Z+POS         52.59            54.43
Terms+Z+POL         58.34            59.38
Terms+POS+POL       48.42            50.03
Terms+Z+POS+POL     55.35            58.58
Table 2. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets.
We repeated all the previous experiments after using a Twitter dictionary with which we extend each tweet by the expressions related to its emoticons and abbreviations (see the sketch below). The results in Table 3 demonstrate that using this dictionary improves the f-measure over all the experiments; the best results are again obtained by combining Z_score and pre-polarity features.
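A sketch of that dictionary step; the entries shown are invented examples, the actual resource we used is not listed in this excerpt:

# Hypothetical entries: emoticons/abbreviations -> expressions.
TWITTER_DICT = {
    ":)": "happy",
    ":(": "sad",
    "lol": "laughing out loud",
    "omg": "oh my god",
}

def expand_tweet(tweet):
    """Extend the tweet with the expressions related to its emoticons
    and abbreviations, before feature extraction."""
    extra = [TWITTER_DICT[tok] for tok in tweet.split() if tok in TWITTER_DICT]
    return tweet + " " + " ".join(extra) if extra else tweet

print(expand_tweet("omg this book is great :)"))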
Features            F-measure 2013   F-measure 2014
Terms               50.15            48.56
Terms+Z             57.17            58.37
Terms+POS           44.07            42.64
Terms+POL           54.72            54.53
Terms+Z+POS         53.20            56.47
Terms+Z+POL         59.66            61.07
Terms+POS+POL       48.97            51.90
Terms+Z+POS+POL     55.83            60.22
Table 3. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets after using a Twitter dictionary.
5 Conclusion
In this paper we tested the impact of using a Twitter dictionary, sentiment lexicons, Z_score features and POS tags on the sentiment classification of tweets. We extended the feature vector of tweets with all these features; we proposed the Z_score as a new type of feature and demonstrated that it can improve the performance.
[Hamdan,	Béchet	&	Bellot,	SemEval	2014]
http://sentiwordnet.isti.cnr.it
http://sentiment-analyser.openeditionlab.org/aboutsemeval
Building a Graph of Books
— Nodes = books + properties (metadata, #reviews and ranking, page ranks, ratings…)
— Edges = links between books
— Book A refers to Book B according to:

— Bibliographic references and citations (in the book / in the reviews)

— Amazon recommendation (People who bought A bought B, People who liked A liked B…)
— A is similar to B

— They share bibliographic references

— Full-text similarity + similarity between the metadata
The graph allows us to estimate (see the sketch below):
— « Book Ranks » (cf. Google's PageRank)
— Neighborhoods
— Shortest paths
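A toy sketch with networkx of the three estimates above, using the edge types described on this slide; the book identifiers and the choice of damping factor are illustrative:

import networkx as nx

G = nx.DiGraph()
# Edges: "Book A refers to / is recommended with Book B".
G.add_edge("book_A", "book_B", kind="cites")
G.add_edge("book_C", "book_B", kind="people_who_bought")
G.add_edge("book_B", "book_D", kind="shared_references")

book_ranks = nx.pagerank(G, alpha=0.85)                      # « Book Ranks »
neighbors = list(G.successors("book_B"))                     # neighborhood
path = nx.shortest_path(G.to_undirected(), "book_A", "book_D")  # shortest path
print(book_ranks, neighbors, path)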
…Evaluation	in	progress…
Conclusion : (Still) (Half)Open Questions
— How to combine Information Retrieval and Recommendation ?

- Learning to Rank: the best way at present,
but what makes a good recommendation is still not clear (crowdsourcing ?)

- Using Reader Profiles (cf. CLEF Social Book Search track since 2015)
— Is combining « official » meta-data and Reader-generated content (reviews) useful ?

YES
— How to go beyond topical relevance ? (quality, authority, genre, expertise…)

Combining Graph analysis / Sentiment analysis

Natural Language Processing
— How to deal with complex queries ?

NLP (dependencies, sub-queries, aspects…) — cf. Question Answering

Translating Natural Language Query into Structured Queries
— How to identify and exploit references (cross-linking) ?

Supervised probabilistic approaches (robust, effective and efficient)
— Is the content of the books useful ? (vs. reviews / abstracts only) ?

Not sure for retrieval…

For linking : YES (at least when the book is not a best-seller - not many reviews)
http://lab.hypotheses.org

Related content

Similar to Scholarly Book Recommendation:
- The Value of Purchasing E-books From A Large Publisher (Aaron K. Shrimplin)
- The Value of Purchasing E-book Collections from a Large Publisher (Oxford Uni... (Jennifer Bazeley)
- The Art of Collection Budgets (hebertm3308)
- Explore open access books - Springer Nature & Digital Science event in Boston... (Springer Nature)
- Academic Social Networks and Researcher Ranking (Amanyalsayed)
- The Google Scholar Revolution: a big data bibliometric tool
- Oliver and Gourlay (Moira Wright)
- The Architecture of Understanding (Peter Morville)
- Publish or Perish - Realising Google Scholar's potential to democratise citat... (Anne-Wil Harzing)
- Opportunities: Improve Interoperability ... from a library viewpoint. (TIB Hannover)
- Research Skills 2
- Exploring Google Books Ngram Viewer for Big Data Text Corpus Visualizations (Shalin Hai-Jew)
- Maine directors
- U ottawa jan.2013
- Ebooks presentationppt palacrd2013 (Beth Transue)
- Library support on writing 111122 (Anke Versteeg)
- ANT1CAG: support session (library help)
- Mich la april 2011
- "Is one Stop Shopping all we Dreamed it Would be? Usability and the Single Se... (Gillian Byrne)

More from Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I):
- Analyse de sentiment et classification par approche neuronale en Python et Weka
- Introduction à la fouille de textes et positionnement de l'offre logicielle
- A combination of reduction and expansion approaches to handle with long natur...
- Introduction générale sur les enjeux du Text and Data Mining TDM
- Recommandation sociale : filtrage collaboratif et par le contenu
- Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
- Huma-Num une Infrastructure pour les SHS
- Some Information Retrieval Models and Our Experiments for TREC KBA
- OpenEdition Lab projects in Text Mining


Scholarly Book Recommendation

  • 1. SCHOLARLY BOOK RECOMMENDATION : A TASK COMBINING 
 SENTIMENT ANALYSIS, 
 INFORMATION EXTRACTION AND IR Patrice Bellot (M. Dacos, C. Benkoussas, A. Ollagnier, M. Orban, H. Hamdan, A. Htait, E. Faath, Y.M. Kim, F. Béchet)
 Aix-Marseille Université - CNRS (LSIS UMR 7296 ; OpenEdition) patrice.bellot@univ-amu.fr LSIS - DIMAG team http://www.lsis.org/dimag OpenEdition Lab : http://lab.hypotheses.org
  • 2. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Objectives — To present the OpenEdition platforms (open access) — To present some results we obtained
 1- Social Book Search on Amazon
 2- Scholarly Book Search on OpenEdition 
 and still open questions (towards a future collaborative project ?) — OpenEdition Lab : a DH Research Program that aims to :
 — Detect hot topics, hot books, hot papers
 — Develop approaches for recommending and searching for books
 2 Our partners: libraries an institutions all over the world > 4 million unique visitors / month
  • 3. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Some Open Questions — How to combine Information Retrieval and Recommendation 
 (Learning to Rank ?) — Is combining official meta-data and Reader-generated content (reviews) useful ? — How to identify and exploit bibliographic references (cross-linking) ? — Is the content of the books useful ? (vs. reviews / abstracts only) ? — How to deal with complex queries ? — How to go beyond topical relevance ? (genre, reading level, recentness…) Classical solution (the Vector Space Model for IR): a weight is associated to every word in every document (a vector), the query is represented as a vector in the document space, a degree of similarity between the documents and the query can be estimated 3 ~d ~q ~d = 0 B B B @ wm1,d wm2,d ... wmn,d 1 C C C A ~q = 0 B B B @ wm1,q wm2,q ... wmn,q 1 C C C A wmi,d mi s(~d, ~q) = i=nX i=1 wmi,d · wmi,q (1) wi,d = wi,d qPn j=1 w2 j,d (2) s(~d, ~q) = i=nX wi,d qPn 2 · wi,q qPn 2 = ~d · ~q kdk · kqk = cos(~d, ~q) (3) Topicality only ?
  • 5. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Searching for Books ? Very diverse needs : — Topicality — With a precise context eg. arts in China during the XXth century — With named entities : locations (the book is about a specific location OR the action takes place at this location), proper names… — Style / Expertise / Language — fiction, novel, essay, proceedings, position papers… — for experts / for dummies / for children … — in English, in French, in old French, in (very) local languages … — looking for citations / references — in what book appears a given citation — what are the books that refer to a given one — Authority : — What are the most important books about … (what most important means ?) — What are the most popular books about … 5 For Digital Humanities (and for Research Purposes) : Researchers may not be interested in the best-sellers but in the impact of a book = in the debate it provokes
  • 6. Searching for books on Amazon 6
  • 7. P. Bellot (AMU-CNRS, LSIS-OpenEdition) 7 Searching for books : a complex (?) query I am looking for the best mystery books set in New York City
  • 8. Searching for books : a complex (?) query 8 I am looking for the best mystery books set in New York City not « M ystery »
  • 11. P. Bellot (AMU-CNRS, LSIS-OpenEdition) 11 http://social-book-search.humanities.uva.nl/#/overview 2 The Amazon collection The document used for this year’s Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN num- ber, title, number of pages etc... However, in this collection the most important content resides in social data. Indeed Amazon is social-oriented, and user can comment and rate products they purchased or they own. Reviews are identi- fied by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. They can also assign tags of their creation to a product. These tags are useful for refining the search of other users in the way that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels that are contained in the <browseNode> fields. Table 1. Some facts about the Amazon collection. Number of pages (i.e. books) 2, 781, 400 Number of reviews 15, 785, 133 Number of pages that contain a least a review 1, 915, 336 3 Retrieval model 3.1 Sequential Dependence Model Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft’s Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependance Organizers Marijn Koolen (University of Amsterdam) Toine Bogers (Aalborg University Copenhagen) Antal van den Bosch (Radboud University Nijmegen) Antoine Doucet (University of Caen) Maria Gaede (Humboldt University Berlin) Preben Hansen (Stockholm University) Mark Hall (Edge Hill University) Iris Hendrickx (Radboud University Nijmegen) Hugo Huurdeman (University of Amsterdam) Jaap Kamps (University of Amsterdam) Vivien Petras (Humboldt University Berlin) Michael Preminger (Oslo and Akershus University College of Applied Sciences) Mette Skov (Aalborg University Copenhagen) Suzan Verberne (Radboud University Nijmegen) David Walsh (Edge Hill University)
  • 12. Searching for Books : Looking at Reviews 12
  • 15. Amazon : Organization of Items (Nodes & Items) 15 Product Advertising API https://aws.amazon.com/ cf. http://www.codediesel.com/libraries/amazon-advertising-api-browsenodes/
  • 16. Amazon Navigation
 Graph : YASIV 16 http://www.yasiv.com/#/Search?q=orwell&category=Books&lang=US
  • 17. SBS Collection : Meta-data and Social Data 17 2.8 Million « Documents » (book descriptions) http://social-book-search.humanities.uva.nl
  • 18. P. Bellot (AMU-CNRS, LSIS-OpenEdition) 18 — Using NLP : to what extent ? (Named Entity Recognition, Syntactic Analysis…) http://social-book-search.humanities.uva.nl SBS Collection : Real Topics from Library Thing
  • 19. Our proposal for CLEF Social Book Search (Amazon) 19
IR : Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004) and/or Divergence From Randomness (InL2) model + query expansion with dependence analysis
Ratings : the more reviews a book has and the better its ratings, the more relevant it is assumed to be.
Graph : expanding the retrieved books with Similar Books, then reranking with PageRank.
● We tested many reranking methods, combining the retrieval model scores with other scores based on social information. For each document we compute:
– PageRank: an algorithm that exploits the link structure to score the importance of the nodes in the graph;
– Likeliness: computed from user-generated information (reviews and ratings); the more reviews and good ratings a book has, the more interesting it is.
[Figure: graph modeling / reranking pipeline - retrieval over the collection, graph expansion from the starting nodes to their neighbors and shortest-path nodes, deletion of duplicates, reranking]
Graph modeling - recommendation : PageRank + Similar Products (a sketch of the score combination follows the results below).
- Best results in 2011 (judgements obtained by crowdsourcing) (IR and ratings) : P@10 ≈ 0.58
- Good results in 2014 (IR, ratings, expansion) : P@10 ≈ 0.23 ; MAP ≈ 0.44
- In 2015 : rank 25/47 (IR + graph, though the graph did improve over IR alone) : P@10 ≈ 0.2 (best system : 0.39 ; it included the price of books)
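To make the reranking scheme concrete, here is a minimal sketch (not the exact CLEF SBS system) that interpolates a retrieval score with PageRank over the "similar products" graph and a rating-based likeliness score; networkx is assumed, and the weights and likeliness formula are illustrative.

import math
import networkx as nx

def likeliness(n_reviews: int, avg_rating: float) -> float:
    # More reviews and better ratings -> more interesting (log-damped count).
    # This exact formula is an assumption for illustration.
    return math.log1p(n_reviews) * (avg_rating / 5.0)

def rerank(retrieval_scores, similar_edges, reviews):
    # similar_edges: (book_a, book_b) pairs from "similar products" links
    g = nx.DiGraph(similar_edges)
    pr = nx.pagerank(g, alpha=0.85)
    combined = {}
    for book, s_ir in retrieval_scores.items():
        n, r = reviews.get(book, (0, 0.0))
        # Interpolation weights are illustrative, not the tuned run values
        combined[book] = 0.6 * s_ir + 0.2 * pr.get(book, 0.0) + 0.2 * likeliness(n, r)
    return sorted(combined, key=combined.get, reverse=True)

ranking = rerank({"b1": 0.9, "b2": 0.7},
                 [("b1", "b2"), ("b2", "b3")],
                 {"b1": (120, 4.5), "b2": (3, 2.0)})
print(ranking)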
  • 20. 20 Our proposal for Scholarly Book Search (OpenEdition)
[Figure: the OpenEdition Lab pipeline. BILBO : information extraction and reference analysis (automatic classification and metadata ; SVM - Z-score - CRF ; three levels of references, NIVEAU 1 to 3). ÉCHO : finding reviews, matching the reviews with the books they discuss, sentiment analysis (ratings, polarity). Recommendation : a content graph (graph scoring) across journals such as Questions de communication, VertigO, EDC, EchoGéo and Quaderni ; social IR over OpenEdition ; graph-based book recommendation.]
Example (BILBO, reference enrichment) : the reference « Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30. » is enriched with its DOI (10.3406/rfp.1986.1499) and annotated in TEI :
<bibl><author><surname>Langouet</surname>, <forename>G.</forename>,</author> (<date>1986</date>), <title level="a">« Innovations pédagogiques et technologies éducatives »</title>, <title level="j">Revue française de pédagogie</title>, <abbr>n°</abbr> <biblScope type="issue">76</biblScope>, <abbr>pp.</abbr> <biblScope type="page">25-30</biblScope>. <idno type="DOI">DOI : 10.3406/rfp.1986.1499</idno></bibl>
Other inputs handled : footnote references mixing languages, e.g. « 18 Voir Permanent Mandates Commission, Minutes of the Fifteenth Session (Geneva: League of Nations, 1929), pp. 100-1. Pour plus de détails, voir Paul Ghali, Les nationalités détachées de l'Empire ottoman à la suite de la guerre (Paris: Les Éditions Domat-Montchrestien, 1934), pp. 221-6. » ; and running review text for ÉCHO, e.g. « ils ont déjà édité trois recueils et auxquelles ils ont consacré de nombreux travaux critiques. Leur nouvel ouvrage, intitulé Le Roman véritable. Stratégies préfacielles au XVIIIe siècle et rédigé à six mains par Jan Herman, Mladen Kozul et Nathalie Kremer - chaque auteur se chargeant de certains chapitres […] »
  • 21. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Searching for Book Reviews in Scientific Journals
• Supervised approaches for filtering = a new kind of genre classification
• Feature selection : threshold (Z-score) + random forest
• Features : unigrams (words), localisation of named entities / dates 21
We compare different inductive learning algorithms (Naive Bayes, Support Vector Machines with linear and RBF kernels) in terms of classification accuracy, and explore alternative document representations: bag-of-words, feature selection using the z-score, and named entity repartition in the text.
Naive Bayes (NB). As a baseline we adopted the naive Bayes approach (Zubaryeva and Savoy, 2010). The classification system has to choose between two hypotheses, h_0 = "it is a review" and h_1 = "it is not a review", and selects the class with the maximum value according to
  \arg\max_{h_i} P(h_i) \prod_{j=1}^{|w|} P(w_j \mid h_i)   (5)   with   P(w_j \mid h_i) = tf_{j,h_i} / n_{h_i}
where |w| is the number of words in the current document, tf_{j,h_i} is the frequency of word w_j in the corpus of class h_i, and n_{h_i} is the size of that corpus.
Document representations. First, the unigram model (bag-of-words), where all words are considered as features: our training corpus contains 106,911 words, and keeping only the tokens that appear more than 5 times in each class leaves a vector space of 5,957 features. Second, feature selection based on the normalized z-score: after computing the z-score of all remaining features, we keep the first 1,000 words according to this score; the goal is to select terms that clearly belong to one genre of documents. Third, we hypothesized that the common traits of the review collection lie in the distribution of named entities in the text (Poibeau, 2003): we track 3 entity types ("authors' names", "locations" and "dates") after removing all XML/HTML tags, divide each text into 10 parts of equal size (total number of words / 10), and use the distribution ratio of each entity type in each part as a feature, which yields a set of 30 features (a sketch of this representation follows below).
[Figures 3-5: distribution of the "Person", "Location" and "Date" named entities across the 10 parts of the documents.]
Table. The 30 features with the highest normalized z-scores across the corpus: abandonne 30.14, seront 30.00, biographie 21.84, entraînent 21.20, prise 21.20, sacre 21.20, toute 20.70, quitte 19.55, dimension 15.65, les 14.43, commandement 11.01, lie 10.61, construisent 10.16, lieux 10.14, garde 9.75, winter 9.23, cleo 8.88, visible 8.75, fondamentale 8.67, david 8.54, pratiques 8.52, signification 8.47, 01 8.38, institutionnels 8.38, 1930 8.16, attaques 8.14, courrier 8.08, moyennes 7.99, petite 7.85, adapted 7.84.
Support Vector Machines (SVM). SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995), based on the Structural Risk Minimization principle from computational learning theory. In their basic form, SVMs learn linear threshold functions; nevertheless, by a simple plug-in of an appropriate kernel function, they can learn non-linear classifiers such as radial basis function (RBF) networks or three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and to use them for classification (Aggarwal and Zhai, 2012). Given the vectors from the representations above, we used the Weka toolkit to learn the models; the linear and Radial Basis Function (RBF) kernels sometimes reach a good level of performance at the cost of a fast growth of the processing time during the learning stage (Kummer, 2012).
Experiments. We use a collection of documents from Revues.org and the Web, build our classifiers by supervised learning, and evaluate the resulting models on new test cases.
Table 4. Performance of the classification models with the three indexing schemes on the test set (values in % ; R = recall, P = precision, F = F-measure ; left : Review class, right : not-Review class).
1. Bag-of-words
   NB : R 65.5, P 81.5, F 72.6 | R 81.6, P 65.7, F 72.8
   SVM (linear) : R 99.6, P 98.3, F 98.9 | R 97.9, P 99.5, F 98.7
   SVM (RBF, C = 5.0, gamma = 0.00185) : R 89.8, P 97.2, F 93.4 | R 96.8, P 88.5, F 92.5
2. Z-score feature selection
   NB : R 90.6, P 64.2, F 75.1 | R 37.4, P 76.3, F 50.2
   SVM (linear) : R 87.2, P 81.3, F 84.2 | R 75.3, P 82.7, F 78.8
   SVM (RBF, C = 32.0, gamma = 0.00781) : R 87.2, P 86.5, F 86.8 | R 83.1, P 84.0, F 83.6
3. Named entity repartition
   NB : R 80.0, P 68.4, F 73.7 | R 54.2, P 68.7, F 60.6
   SVM (linear) : R 77.0, P 81.9, F 79.4 | R 78.9, P 73.5, F 76.1
   SVM (RBF, C = 8.0, gamma = 0.03125) : R 81.2, P 48.6, F 79.9 | R 72.6, P 75.8, F 74.1
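As an illustration of the named-entity representation described above, here is a minimal sketch that turns a list of (token, entity-tag) pairs into the 30-dimensional distribution vector; the tagging itself (e.g. by an upstream NER tool) is assumed to have been done beforehand, and the tag names are hypothetical.

# Minimal sketch of the named-entity repartition features: the document is
# split into 10 equal parts and, for each of the 3 entity types, the ratio
# of entity tokens falling in each part is a feature (3 x 10 = 30 features).
# The entity tags are assumed to come from an upstream NER step.
ENTITY_TYPES = ("PERSON", "LOCATION", "DATE")

def ne_distribution(tagged_tokens, n_parts=10):
    """tagged_tokens: list of (token, tag) pairs, tag in ENTITY_TYPES or 'O'."""
    n = max(len(tagged_tokens), 1)
    part_size = max(n // n_parts, 1)
    counts = {t: [0] * n_parts for t in ENTITY_TYPES}
    totals = {t: 0 for t in ENTITY_TYPES}
    for i, (_, tag) in enumerate(tagged_tokens):
        if tag in counts:
            part = min(i // part_size, n_parts - 1)
            counts[tag][part] += 1
            totals[tag] += 1
    features = []
    for t in ENTITY_TYPES:
        denom = totals[t] or 1  # avoid division by zero for absent types
        features.extend(c / denom for c in counts[t])
    return features  # 30-dimensional vector

vec = ne_distribution([("Proust", "PERSON"), ("wrote", "O"), ("in", "O"),
                       ("Paris", "LOCATION"), ("in", "O"), ("1913", "DATE")])
print(len(vec), vec[:10])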
  • 23. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Linking Contents by Analyzing the References In books : no common stylesheet (or many stylesheets, poorly respected…) Our proposal : 1) detecting the references in the document / footnotes (Support Vector Machines) 2) annotating the fields of each reference (Conditional Random Fields) BILBO : our open-source software for reference analysis 23 Google Digital Humanities Research Awards (2012) Annotation DOI search (Crossref) OpenEdition Journals : more than 1.5 million references analyzed Test : http://bilbo.openeditionlab.org Sources : http://github.com/OpenEdition/bilbo
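To illustrate step 2, here is a minimal sketch of CRF-based field labeling for a bibliographic reference, using the sklearn-crfsuite package (an assumed library choice for this sketch; BILBO ships its own CRF tooling). The features, labels and training example are simplified and hypothetical.

# Minimal sketch of CRF-based reference annotation (step 2 of the BILBO-style
# pipeline). sklearn-crfsuite is an assumption for illustration; labels
# loosely follow TEI-like fields (surname, forename, date, title, ...).
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "has_period": "." in tok,
        "is_year": tok.strip("()").isdigit() and len(tok.strip("()")) == 4,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One tiny, hypothetical training example: reference tokens + field labels
tokens = ["Langouet", "G.", "(1986)", "Innovations", "pédagogiques",
          "Revue", "française", "de", "pédagogie", "76", "25-30"]
labels = ["surname", "forename", "date", "title", "title",
          "journal", "journal", "journal", "journal", "issue", "pages"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])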
  • 25. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Sentiment Analysis on Reviews
• Statistical metrics (PMI, Z-score, odds ratio…)
• Combined with linguistic resources 25
We compute a Z_score for each term t_i in a class C_j from its term frequency tfr_ij in that class, the mean mean_i (the term probability over the whole corpus multiplied by n_j, the number of terms in class C_j) and the standard deviation sd_i of term t_i in the underlying corpus:
  Z_score(t_{ij}) = (tfr_{ij} - mean_i) / sd_i   (1)
  Z_score(t_{ij}) = (tfr_{ij} - n_j P(t_i)) / \sqrt{n_j P(t_i) (1 - P(t_i))}   (2)
A term whose frequency in one class is salient in comparison to the others will have a salient Z_score (a sketch of this computation follows this slide's notes). The Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (> 2) to select the terms whose Z_score exceeds it, then combined these scores with a logistic regression. We instead use Z_scores as added features for classification, because tweets are too short and many of them contain no word with a salient Z_score. The distribution of the Z_score over each class shows that the majority of terms have a Z_score between -1.5 and 2.5 in every class, the rest being either very frequent (> 2.5) or very rare (< -1.5); a negative value means that the term is less frequent in this class than in the other classes. We tested different values for the selection threshold; the best results were obtained with a threshold of 3.
[Figures 1-3 : distribution of the Z_score in the positive, neutral and negative classes.]
Table 1. The first ten terms with the highest Z_score in each class:
  positive : love 14.31, good 14.01, happy 12.30, great 11.10, excite 10.35, best 9.24, thank 9.21, hope 8.24, cant 8.10, wait 8.05
  negative : not 13.99, fuck 12.97, don't 10.97, shit 8.99, bad 8.40, hate 8.29, sad 8.28, sorry 8.11, cancel 7.53, stupid 6.83
  neutral : httpbit 6.44, httpfb 4.56, httpbnd 3.78, intern 3.58, nov 3.45, httpdlvr 3.40, open 3.30, live 3.28, cloud 3.28, begin 3.17
Sentiment lexicons : Bing Liu's Opinion Lexicon, created by (Hu and Liu 2004) and augmented in many later works; we count the positive, negative and neutral words of each tweet according to these lexicons (Bing Liu's lexicon only contains positive and negative annotations, while the Subjectivity lexicon also contains neutral ones).
Part of speech (POS) : we annotate each word of the tweet with its POS tag, then count the adjectives, verbs, nouns, adverbs and connectors in each tweet.
Data collection. We used the data sets provided at SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014) (Wilson, Kozareva et al. 2013): training tweets annotated as positive, negative or neutral. Among the 9,646 tweets, we could only download 8,498 because of protected profiles and deleted tweets. We used the 1,654-tweet development set to evaluate our methods, then combined it with the training set to build the model that predicts the labels of the 2013 and 2014 test sets.
Official results : 46.38% and 52.02% on the 2013 and 2014 test sets respectively. These figures are not correct because of a software bug discovered after the submission deadline (an index-shifting error: the test set was represented by all the features except the terms); the corrected figures are reported as non-official results.
Non-official results. With a Multinomial Naive Bayes model, a feature vector of tweet terms alone gives 49% and 46% (2013, 2014). Z_score features improve the performance by 6.5% and 10.9%, and pre-polarity features improve the F-measure by a further 4% and 6%, whereas adding POS tags decreases it. Testing all combinations of these features shows that POS tags are not useful in any experiment; the best result is obtained by combining Z_score and pre-polarity features, and the Z_score features improve the F-measure significantly and are better than the pre-polarity features.
Table 2. Average F-measures for the positive and negative classes of the SemEval 2013 / 2014 test sets : Terms 49.42 / 46.31 ; Terms+Z 55.90 / 57.28 ; Terms+POS 43.45 / 41.14 ; Terms+POL 53.53 / 52.73 ; Terms+Z+POS 52.59 / 54.43 ; Terms+Z+POL 58.34 / 59.38 ; Terms+POS+POL 48.42 / 50.03 ; Terms+Z+POS+POL 55.35 / 58.58.
We repeated all the experiments after applying a Twitter dictionary that expands each tweet with the expressions corresponding to its emoticons and abbreviations. Table 3 shows that this dictionary improves the F-measure in every experiment, the best results again being obtained by combining Z_scores and pre-polarity features.
Table 3. Average F-measures (2013 / 2014) after using the Twitter dictionary : Terms 50.15 / 48.56 ; Terms+Z 57.17 / 58.37 ; Terms+POS 44.07 / 42.64 ; Terms+POL 54.72 / 54.53 ; Terms+Z+POS 53.20 / 56.47 ; Terms+Z+POL 59.66 / 61.07 ; Terms+POS+POL 48.97 / 51.90 ; Terms+Z+POS+POL 55.83 / 60.22.
Conclusion : we tested the impact of the Twitter dictionary, sentiment lexicons, Z_score features and POS tags on the sentiment classification of tweets; we extended the feature vector of tweets with all these features, proposed the Z_score as a new type of feature, and showed that it can improve the performance.
[Hamdan, Béchet & Bellot, SemEval 2014] http://sentiwordnet.isti.cnr.it
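Here is a minimal sketch of the Z_score of Eq. (2), assuming simple whitespace-tokenized class corpora; the toy data and tokenization are illustrative.

# Minimal sketch of the term Z_score of Eq. (2): for term t and class C_j,
# compare the observed frequency of t in C_j with the frequency expected
# under the corpus-wide term probability P(t), using a binomial model.
from collections import Counter
from math import sqrt

def z_scores(class_tokens):
    """class_tokens: dict class_name -> list of tokens of that class corpus."""
    all_tokens = [t for toks in class_tokens.values() for t in toks]
    total = len(all_tokens)
    p = {t: c / total for t, c in Counter(all_tokens).items()}  # P(t_i)
    scores = {}
    for cls, toks in class_tokens.items():
        n_j = len(toks)                    # number of terms in class C_j
        tf = Counter(toks)                 # term frequencies in C_j
        scores[cls] = {
            t: (tf[t] - n_j * p[t]) / sqrt(n_j * p[t] * (1 - p[t]))
            for t in tf if 0 < p[t] < 1    # guard against division by zero
        }
    return scores

s = z_scores({"pos": "love love great good bad".split(),
              "neg": "bad bad hate sad love".split()})
print(sorted(s["pos"].items(), key=lambda kv: -kv[1])[:3])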
  • 27. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Building a Graph of Books — Nodes = books + properties (metadata, #reviews and ranking, page ranks, ratings…) — Edges = links between books — Book A refers to Book B according to:
 — Bibliographic references and citations (in the book / in the reviews)
 — Amazon recommendation (People who bought A bought B, People who liked A liked B…) — A is similar to B
 — They share bibliographic references
— Full-text similarity + similarity between the metadata 27 The graph allows us to estimate — « Book Ranks » (cf. Google's PageRank) — Neighborhood — Shortest paths …Evaluation in progress…
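As an illustration, here is a minimal networkx sketch of such a book graph, with typed edges for citations, shared references and Amazon-style recommendations; the edge types, attribute names and weights are assumptions, not the project's exact schema.

# Minimal sketch of the book graph: nodes carry metadata/ratings, typed edges
# encode citations, recommendations and similarity (illustrative schema).
import networkx as nx

g = nx.DiGraph()
g.add_node("bookA", n_reviews=120, avg_rating=4.5)
g.add_node("bookB", n_reviews=3, avg_rating=3.0)
g.add_node("bookC", n_reviews=40, avg_rating=4.0)

g.add_edge("bookA", "bookB", kind="cites")              # bibliographic reference
g.add_edge("bookA", "bookC", kind="also_bought")        # Amazon-style recommendation
g.add_edge("bookB", "bookC", kind="shared_refs", w=0.7) # shared bibliographic references

book_rank = nx.pagerank(g, alpha=0.85)                  # « Book Ranks »
neighbors = list(g.successors("bookA"))                 # neighborhood
path = nx.shortest_path(g.to_undirected(), "bookA", "bookC")  # shortest paths
print(book_rank, neighbors, path)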
  • 28. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Conclusion : (Still) (Half)Open Questions — How to combine Information Retrieval and Recommendation ?
- Learning to Rank : the best way for now (see the sketch after this list)
but… what makes a good recommendation is still not clear (crowdsourcing ?)
 - Using Reader Profiles (cf. CLEF Social Book Search track since 2015) — Is combining « official » meta-data and Reader-generated content (reviews) useful ?
 YES — How to go beyond topical relevance ? (quality, authority, genre, expertise…)
 Combining Graph analysis / Sentiment analysis
 Natural Language Processing — How to deal with complex queries ?
 NLP (dependencies, sub-queries, aspects…) — cf. Question Answering
Translating natural language queries into structured queries
 Supervised probabilistic approaches (robust, effective and efficient) — Is the content of the books useful ? (vs. reviews / abstracts only) ?
 Not sure for retrieval…
For linking : YES (at least when the book is not a best-seller and thus has few reviews) 28
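To make the learning-to-rank direction mentioned above concrete, here is a minimal sketch using LightGBM's LambdaRank (an assumed library choice, not the one used in these experiments), with hypothetical features combining retrieval, graph and review evidence.

# Minimal learning-to-rank sketch (LambdaRank via LightGBM -- an assumed
# library choice): each row is a (query, book) pair described by IR, graph
# and review features; labels are relevance grades; groups delimit queries.
import numpy as np
import lightgbm as lgb

# Hypothetical features: [SDM score, PageRank, log(#reviews), avg rating]
X = np.array([[2.3, 0.012, 4.8, 4.5],
              [1.9, 0.004, 1.1, 3.0],
              [2.1, 0.009, 3.2, 4.0],
              [1.2, 0.002, 0.7, 2.5]])
y = np.array([2, 0, 1, 0])       # graded relevance judgments (toy values)
group = [2, 2]                   # two queries with two candidate books each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50,
                        min_child_samples=1)  # relaxed for this tiny toy set
ranker.fit(X, y, group=group)
print(ranker.predict(X))         # scores used to rerank candidates per query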