SDOW (ISWC2011)
1. DIGITAL
Institute for Information and Communication Technologies
Pragmatic metadata matters:
How data about the usage of data affects
semantic user models
Claudia Wagner, Markus Strohmaier, Yulan He
Sunday, October 23, 2011
2. Example
Semantic Metadata
[Figure: RDF graph. Classes: sioc:Post, sioc:UserAccount, foaf:Person. Properties: sioc:content, sioc:name, sioc:has_creator, sioc:account_of, rdf:type.]
3. Example
Pragmatic Metadata
8. Aim
Can pragmatic metadata support the generation of semantic
metadata and, if so, how?
[Figure: RDF graph with classes sioc:Post, sioc:UserAccount and foaf:Person and properties sioc:name, sioc:content, sioc:has_creator, sioc:account_of, rdf:type. The values of sioc:topic and foaf:interest are marked "?".]
9. Experimental Setup
§ Methodology
§ Topic modeling algorithms to learn topics (probability
distributions over words) and annotate users and posts
with topics
§ Incorporated different types of pragmatic metadata
into the topic models
§ Compared the different models via their predictive
performance
§ Dataset
§ Boards.ie
§ Forums, posts and users
§ Users' authoring and replying behavior
§ Training dataset: first and last week of February 2006
§ Test dataset: 3 future posts of each user
10. Evaluation
§ Compare different models by testing their predictive
performance on held-out posts.
[Formula annotations: the log likelihood of each word of a user's future post given the learned model, summed over all words in the user's future post.]
§ Assumption: a better user topic model is less perplexed
by future posts authored by a user and needs fewer
training samples.
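The evaluation idea can be sketched in a few lines of Python. The slide only names the summed log likelihood of the words in a future post; turning that sum into perplexity with the common definition exp(-average log likelihood) is an assumption here, and the probabilities are made-up illustrative numbers, not results from the talk.

```python
import math

def log_likelihood(word_probs):
    """Sum of log p(word | model) over all words of a held-out post."""
    return sum(math.log(p) for p in word_probs)

def perplexity(word_probs):
    """exp of the negative average per-word log likelihood (lower is better)."""
    return math.exp(-log_likelihood(word_probs) / len(word_probs))

# Hypothetical per-word probabilities that two models assign to the same
# four-word future post (illustrative values only).
model_a = [0.10, 0.05, 0.20, 0.10]
model_b = [0.02, 0.01, 0.05, 0.02]

print("model A perplexity:", perplexity(model_a))
print("model B perplexity:", perplexity(model_b))
```

Under this definition, the model that assigns higher probability to the words a user actually writes later ends up with the lower perplexity.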
11. Methodology
LDA
§ How to learn topics and annotate users with topics?
§ Latent Dirichlet Allocation (LDA) (Blei et al., 2003)
[Figure: a text annotated with topics T1, T2 and T3. Example topic T1: mac: 0.3, iMac: 0.13, PC: 0.03, computer: 0.04, ...]
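To make "topics as probability distributions over words" concrete, here is a minimal Python sketch of how a word's probability in a document follows from the document's topic mixture. Topic T1 uses the weights shown on the slide; topic T2, the topic proportions and the `floor` smoothing value are hypothetical and only serve the illustration.

```python
# Topic-word distributions. T1 uses the weights from the slide;
# T2 and its weights are hypothetical.
topics = {
    "T1": {"mac": 0.3, "iMac": 0.13, "PC": 0.03, "computer": 0.04},
    "T2": {"mac": 0.01, "PC": 0.2, "computer": 0.1},
}

def word_probability(word, doc_topics, topics, floor=1e-6):
    """p(word | doc) = sum over topics t of p(t | doc) * p(word | t).

    `floor` is an assumed stand-in for the smoothing a real model applies
    to words a topic has not seen.
    """
    return sum(
        weight * topics[t].get(word, floor)
        for t, weight in doc_topics.items()
    )

# Hypothetical topic proportions for one user document.
user_topics = {"T1": 0.8, "T2": 0.2}
p_mac = word_probability("mac", user_topics, topics)
print("p(mac | user) =", p_mac)
```

Annotating a user with topics then amounts to reading off the proportions in `user_topics`.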
12. Methodology
DMR
§ How to incorporate metadata into topic models?
§ Dirichlet-multinomial Regression (DMR) topic models
(Mimno and McCallum, 2008)
§ Observe a feature vector x_d per document d
§ Draw a "fresh" alpha for each document, which depends
on the observed features x_d and the feature distribution per
topic λ_t:
$\alpha_{dt} = \exp(x_d^{\top} \lambda_t)$
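The per-document prior can be sketched as follows: each topic's alpha is the exponential of the dot product between the document's features and that topic's parameters. The feature layout (binary author indicators), the parameter values and the function name `dmr_alpha` are hypothetical; the intercept feature that DMR implementations often add is left out for brevity.

```python
import math

def dmr_alpha(features, lambdas):
    """Return alpha_dt = exp(x_d . lambda_t) for every topic t."""
    return [
        math.exp(sum(x * lam for x, lam in zip(features, lambda_t)))
        for lambda_t in lambdas
    ]

# Hypothetical setup: 3 binary author features, 2 topics.
x_d = [1.0, 0.0, 0.0]        # document d was written by author 0
lambdas = [
    [1.0, -1.0, 0.0],        # topic 0: favored by author 0
    [-1.0, 1.0, 0.0],        # topic 1: favored by author 1
]

alpha_d = dmr_alpha(x_d, lambdas)
print(alpha_d)
```

A larger alpha for a topic makes that topic more likely a priori in the document, which is how observed metadata shifts the topic proportions.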
13. Methodology
ID   Alg  Doc   Metadata
M1   LDA  Post  -
M2   LDA  User  -
M3   DMR  Post  author
M4   DMR  User  author
M5   DMR  Post  reply-user
M6   DMR  User  reply-user
M7   DMR  Post  related-user
M8   DMR  User  related-user
[Figure: example with User 1 and User 2, past Posts 1 to 6 connected by "authored" and "replies to" edges, and a future Post 7.]
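The difference between the "Post" and "User" document types in the table can be sketched in Python. The toy posts, the field names and the helper functions are hypothetical; "related-user" is omitted because the slide does not define it.

```python
from collections import defaultdict

# Hypothetical toy data: who authored each post and who replied to it.
posts = [
    {"id": "post1", "author": "user1", "text": "mac imac", "replied_by": ["user2"]},
    {"id": "post2", "author": "user1", "text": "pc computer", "replied_by": []},
    {"id": "post4", "author": "user2", "text": "mac computer", "replied_by": ["user1"]},
]

def features(post, metadata):
    """Metadata features observed for one post."""
    if metadata == "author":
        return {post["author"]}
    if metadata == "reply-user":
        return set(post["replied_by"])
    return set()  # "-" in the table: plain LDA without metadata

def post_documents(posts, metadata):
    """Doc = Post: one training document per post."""
    return [(p["text"].split(), features(p, metadata)) for p in posts]

def user_documents(posts, metadata):
    """Doc = User: all posts of one author merged into a single document."""
    words, feats = defaultdict(list), defaultdict(set)
    for p in posts:
        words[p["author"]].extend(p["text"].split())
        feats[p["author"]] |= features(p, metadata)
    return {user: (words[user], feats[user]) for user in words}

m5_docs = post_documents(posts, "reply-user")   # like M5: post level
m6_docs = user_documents(posts, "reply-user")   # like M6: user level
```

In the user-level scheme each document carries more text and more observed features, whereas a single post often has few or no repliers.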
14. Post training scheme (M3, M5 and M7)
§ Different user activities performed on content
[Figure: model comparison chart. Labels: "Baseline LDA (M1 and M2)"; "Models which take user replies into account (M6 and M8)".]
15. Results
[Repeats the model table (M1 to M8) and the example figure from slide 13.]
21. Results
§ The topics of users who reply to a user are also likely for
this user
§ Therefore, if 2 users get replies from the same users,
then they are more likely to talk about the same topics
§ Topic models which incorporate pragmatic metadata per
user can indeed improve the models' predictive performance
§ Topic models which incorporate pragmatic metadata per
post often over-fit the data
§ Model assumptions are too strict!
§ Idea: incorporate behavioral user similarities
§ Intuition: users who are similar are more likely to talk
about the same topics
§ How to measure behavioral similarity?
§ forum usage
§ communication behavior
22. Methodology
ID   Alg  Doc   Metadata
M9   DMR  Post  top 10 forums
M10  DMR  User  top 10 forums
M11  DMR  Post  top 10 communication partner
M12  DMR  User  top 10 communication partner
[Figure: User 1 and User 2 with their authored past Posts 1 to 6 and a future Post 7, plus two columns of top 10 forums: f1 to f10, and f15, f20, f31, f12, f5, f6, f17, f18, f19, f10.]
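The "top 10 forums" metadata can be sketched as binary feature vectors in Python. The two forum lists are the columns shown in the figure; assigning the first column to User 1 and the second to User 2 is an assumption, and the Jaccard overlap is only an illustrative way to quantify shared behavior, not a measure named in the talk.

```python
# Forum columns from the figure; which column belongs to which user
# is an assumption.
user1_forums = [f"f{i}" for i in range(1, 11)]
user2_forums = ["f15", "f20", "f31", "f12", "f5",
                "f6", "f17", "f18", "f19", "f10"]

def forum_features(forums, vocabulary):
    """Binary feature vector: 1 if the forum is among the user's top forums."""
    top = set(forums)
    return [1 if forum in top else 0 for forum in vocabulary]

vocabulary = sorted(set(user1_forums) | set(user2_forums))
x_user1 = forum_features(user1_forums, vocabulary)
x_user2 = forum_features(user2_forums, vocabulary)

# Illustrative overlap measure (Jaccard), chosen here for the example.
shared = set(user1_forums) & set(user2_forums)
jaccard = len(shared) / len(vocabulary)
print(sorted(shared), jaccard)
```

Vectors such as `x_user1` could serve as the observed features of a DMR document, so that users with overlapping forums receive similar topic priors.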
23. Post training scheme (M3, M9 and M11)
[Figure: model comparison chart. Labels: "Baseline LDA (M1 and M2)"; "User training scheme (M4, M10 and M12)"; "Models"; "M12 incorporates user similarities based on their communication behavior".]
24. Results
§ Topic models seem to benefit from taking behavioral
user similarities into account
§ Users who behave similarly (regarding their forum usage
and communication behavior) are likely to talk about the
same topics
§ Common communication partners seem to be more
predictive of common topics than common forums
25. Conclusions
§ Pragmatic metadata may help to learn better semantic
user models
§ But pragmatic metadata observed on a post level often
over-fits the data
§ Pragmatic metadata on a user level seems to improve
the predictive performance of topic models
§ If posts of 2 users are “used” in a similar way, then
the users are more likely to talk about the same topics
§ If 2 users behave similarly (tend to post to the same forums
or tend to talk to the same users), they are more likely to
talk about the same topics.
§ Common communication partners seem to be more
predictive of common topics than common forums
26. Limitations and Future Work
§ Perplexity and semantic interpretability of topics do not
necessarily correlate (Chang et al., 2009)
§ Separate evaluation of semantic coherence of topics
§ Analyze different types of behavior- and usage-related
metadata and explore to what extent they may reveal
information about the semantics of data
§ behavior on social streams such as Twitter
§ tagging behavior
§ navigation behavior
27. References
§ Blei, D.M., Ng, A. and Jordan, M. Latent Dirichlet Allocation. Journal of
Machine Learning Research 3 (2003), pp. 993-1022
§ Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C. and Blei, D. Reading Tea
Leaves: How Humans Interpret Topic Models. In Proceedings of NIPS (2009)
§ Mimno, D.M. and McCallum, A. Topic Models Conditioned on Arbitrary Features
with Dirichlet-multinomial Regression. In Proceedings of UAI (2008), pp. 411-418