Social media sites (referred to by some as Web 2.0) allow their users to interact with each other, for example by collecting and sharing so-called user-generated content: this may be just bookmarks, but also blogs, images, and videos. Social media support co-creation: processes in which customers (or users, if you prefer) do not just consume but play an active role in defining and shaping the end product. Famous examples include Six Degrees, LiveJournal, Digg, Epinions, Myspace, Flickr, YouTube, LinkedIn, and Pinterest. Of course, today's internet giants Facebook and Twitter are key new developments. Finally, Wikipedia should not be overlooked: it is a major resource in many language technologies, including information retrieval!
The second part of the lecture looks into the opportunities for information retrieval research. Social media platforms tend to provide access to user profiles, connections between users, the content these users publish or share, and how they react to each other's content through commenting and rating. Also, the large majority of social media platforms allow their users to categorize content by means of tags (or, in direct communication, through hashtags), resulting in collaborative forms of information organization known as folksonomies. However, social media also pose a challenge for information retrieval research: the many platforms vary in functionality, and we have very little understanding of clearly desirable features like combining tag usage and ratings in content recommendation! A unifying approach based on random walks will be discussed to illustrate how we can answer some of these questions [1], but clearly the area offers ample opportunity to make your own mark.
In the final part of the lecture I will briefly touch upon an even wider range of opportunities, where data derived from social media form a key component in enabling new research and insights. I will review a few important results from research centered on Wikipedia, Facebook and Twitter data, as well as a diverse range of new information sources, including the geographic and temporal information derived from images and tweets, product reviews and comments on YouTube videos, and how URL shorteners may give a view on what is popular on the web.
[1] Maarten Clements, Arjen P. de Vries, and Marcel J. T. Reinders. 2010. The task-dependent effect of tags and ratings on social media access. ACM Trans. Inf. Syst. 28, 4, Article 21 (November 2010), 42 pages. http://doi.acm.org/10.1145/1852102.1852107
1. 9th European Summer School in Information Retrieval September 4th, 2013
http://bit.ly/ESSIR13IRSocMedia
IR and Social Media
Arjen P. de Vries
arjen@acm.org
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
3. Social Media
Noun
social media (plural only)
Interactive forms of media that allow users
to interact with and publish to each other,
generally by means of the Internet.
The early 21st century saw a huge increase in social
media thanks to the widespread availability of the
Internet.
5. Social Media
“Social bookmarking” sites
“User generated content”
Images (Flickr) and videos (YouTube, Vimeo), but also
blogs
Social network services
Twitter, Facebook
11. “Rock group” in
author’s metadata...
Organisation in
groups may help
disambiguate
query!
More implicit
metadata...
12. Information Science
“Search for the fundamental knowledge
which will allow us to postulate and utilize
the most efficient combination of [human
and machine] resources”
M.E. Senko. Information systems: records, relations, sets, entities,
and things. Information systems, 1(1):3–13, 1975.
13. Core Questions
How to represent information?
The information need and search requests
The objects to be shown in response to an
information request
How to match information
representations?
14. IR and Social Media
Richer information representations!
15. Richer representations
User profiles
User name, full name, description, image,
homepage url, etc.
Connections between users
Networks of friends, followers, etc
Comments/reactions
Endorsing and sharing
17. (C) 2008, The New York Times Company
Anchor text:
“continue reading”
18. Not a lot of info
to represent
the page…
A fan’s Hyves page:
Kyteman's HipHop Orchestra: www.kyteman.com
Ticket sales, Luxor theater:
May 22 - Kyteman's hiphop Orkest - www.kyteman.com
Kluun.nl:
Kyteman's site
Blog Rockin’ Beats:
The 21-year-old Kyteman
(trumpeter, composer and
producer Colin Benders)
worked for three years on
his debut:
the Hermit Sessions.
Jazzenzo:
...a performance by the popular
Kyteman’s Hiphop Orkest
20. ‘Co-creation’
Social Media:
Consumer becomes a co-creator
‘Data consumption’ traces
In essence: many new sources to play the
role of anchor text
Tags and/or ratings
Tweets
Comments, reviews
21. Potential Benefits for IR
Expand content representation
Reduce the vocabulary gap(s) between
creators of content, indexers, and users
More diverse views on the same content
22. Potential Benefits for IR
Relevance depends on user context
User task
User knowledge
23. Potential Benefits for IR
Social media provide an opportunity to
make much better assumptions about
user context
A specific user’s context
The variety of user contexts that may exist
24. Maarten Clements, Arjen P. de Vries and Marcel J.T. Reinders.
The task-dependent effect of tags and ratings on social media access.
TOIS 28, 4, article 21 (November 2010), 42 pages.
33. Search with Random Walk
Present nodes according to estimated
probability that a random walk that starts
from (task dependent) starting nodes,
would end at this node
E.g., tag suggestion starts in a tag node;
personalized search in tag and user nodes
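The ranking idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, not the model from the TOIS paper: the toy graph, the node names, the uniform transition probabilities, the walk length `steps` and the self-loop weight `stay` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy graph: tags and users are linked to the items they annotate.
EDGES = [
    ("tag:java", "item:java_book"),
    ("tag:java", "item:java_travel"),
    ("tag:programming", "item:python_book"),
    ("tag:programming", "item:java_book"),
    ("tag:travel", "item:bali_guide"),
    ("tag:travel", "item:java_travel"),
    ("user:alice", "item:python_book"),
    ("user:bob", "item:bali_guide"),
]


def random_walk_scores(edges, start_nodes, steps=3, stay=0.0):
    """Probability of being at each node after `steps` steps of a random walk.

    edges:       iterable of (node_a, node_b) pairs, treated as undirected.
    start_nodes: task-dependent starting nodes; the initial probability
                 mass is spread uniformly over them.
    stay:        assumed probability of remaining at the current node.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    prob = {node: 1.0 / len(start_nodes) for node in start_nodes}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in prob.items():
            nbrs = neighbours.get(node, ())
            if not nbrs:
                # Isolated node: the mass cannot move anywhere.
                nxt[node] += p
                continue
            nxt[node] += stay * p
            share = (1.0 - stay) * p / len(nbrs)
            for other in nbrs:
                nxt[other] += share
        prob = dict(nxt)
    return prob


def rank_items(scores):
    """Return item nodes sorted by decreasing walk probability."""
    items = [(n, p) for n, p in scores.items() if n.startswith("item:")]
    return sorted(items, key=lambda pair: pair[1], reverse=True)


# Personalized search: start in both the query tag and the target user.
alice_scores = random_walk_scores(EDGES, ["tag:java", "user:alice"], steps=3)
bob_scores = random_walk_scores(EDGES, ["tag:java", "user:bob"], steps=3)
```

In this toy graph, a walk that starts in the tag "java" and in Alice's node gives the programming book more probability mass than the travel guide, while starting in Bob's node reverses that order. Tag suggestion would instead start the walk in a tag node only.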
40. A soft clustering effect smoothly relates
similar concepts before converging to the
background probability
41. Homographs like “Java” are
disambiguated because the walk starts in
both the query tag and the target user
So, content that matches the user’s
preference is more likely to be found first
43. Analysis results
Allowing all users to tag all available
content improves retrieval tasks
Combining tags and ratings may improve
both search and recommendation tasks
44. Ternary relation lost!
The UIT matrix represents a ternary
relation, that is lost when creating the
three UI, IT and UT matrices
45. Ternary relation lost!
Potentially a problem if tags express an opinion
about an item; e.g.,
“poetry” can still describe the user,
independent of the item
“awful” requires knowing which item the term
belongs to
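A small sketch can make the loss concrete. Below, two different sets of (user, item, tag) triples produce exactly the same three pairwise relations, so the UI, IT and UT projections alone cannot tell which user called which item "awful". The users, items and tags are invented for this illustration, and sets of pairs stand in for the nonzero entries of binary matrices.

```python
def project(triples):
    """Split (user, item, tag) triples into the UI, IT and UT relations."""
    ui = {(u, i) for u, i, t in triples}
    it = {(i, t) for u, i, t in triples}
    ut = {(u, t) for u, i, t in triples}
    return ui, it, ut


# Hypothetical tagging data: two worlds with opposite opinions per user.
POSTS_A = {
    ("alice", "book1", "poetry"),
    ("alice", "book2", "awful"),
    ("bob", "book1", "awful"),
    ("bob", "book2", "poetry"),
}
POSTS_B = {
    ("alice", "book1", "awful"),
    ("alice", "book2", "poetry"),
    ("bob", "book1", "poetry"),
    ("bob", "book2", "awful"),
}

# The ternary relations differ, yet every pairwise projection is identical.
same_projections = project(POSTS_A) == project(POSTS_B)
different_triples = POSTS_A != POSTS_B
```

For a tag like "poetry" this ambiguity matters little, since the UT relation still says that both users care about poetry. For "awful" the information about which item was meant is gone.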
47. Tags vs. rating
Most tags do not deviate far from the
mean rating
Only a few tags are strongly correlated with
opinion
Note: “poetry” is associated with higher ratings than “chicklit”
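One simple way to look at this relation is to compare the mean rating of the posts that carry a tag with the overall mean rating. The sketch below does that; the (tag, rating) pairs are made up for the example and are not the data behind this slide.

```python
from collections import defaultdict
from statistics import mean


def tag_rating_deviation(posts):
    """Return {tag: mean rating of posts with that tag - overall mean rating}.

    posts: iterable of (tag, rating) pairs.
    """
    posts = list(posts)
    overall = mean(rating for _, rating in posts)
    by_tag = defaultdict(list)
    for tag, rating in posts:
        by_tag[tag].append(rating)
    return {tag: mean(ratings) - overall for tag, ratings in by_tag.items()}


# Hypothetical ratings on a 1-5 scale.
TOY_POSTS = [
    ("poetry", 5), ("poetry", 4),
    ("chicklit", 3), ("chicklit", 2),
    ("fiction", 3),
    ("awful", 1),
]

deviation = tag_rating_deviation(TOY_POSTS)
# Tags sorted from most negative to most positive deviation.
ordered = sorted(deviation, key=deviation.get)
```

Tags whose deviation is close to zero say little about opinion; only tags at the extremes of `ordered` would be candidates for opinion-bearing tags.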
48. Metadata
Scientific articles have many types of
metadata associated:
Abstract
Author
Booktitle
Description
Journal
Tags
Are all these types of metadata useful for
item recommendation?
49. Metadata
According to Toine Bogers’ PhD thesis:
Concatenate all metadata fields of the items in a
single user’s profile into one large text field,
and use an off-the-shelf IR model to match
the profile against the metadata of the items.
“Profile-centric Matching”
Or, construct item profiles from the metadata
that all users assigned to that item, and apply an
item-based collaborative filtering approach
“Item-based Hybrid Filtering”
Author, description, tags, title, url, journal
and booktitle all contribute
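Profile-centric matching can be sketched as follows. A plain TF-IDF cosine similarity stands in for the "off-the-shelf IR model"; that choice, the whitespace tokenizer, and the toy profile and candidate items are assumptions for this illustration, not the setup used in the thesis.

```python
import math
from collections import Counter


def tokens(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def profile_centric_ranking(profile_items, candidate_items):
    """Rank candidate items against one user's concatenated profile.

    Both arguments map item_id -> {field_name: text}.
    """
    # Step 1: concatenate every metadata field of every profile item.
    profile_text = " ".join(
        text for fields in profile_items.values() for text in fields.values()
    )
    profile_tf = Counter(tokens(profile_text))

    # Step 2: one bag of words per candidate item, over all its fields.
    cand_tf = {
        item: Counter(tokens(" ".join(fields.values())))
        for item, fields in candidate_items.items()
    }

    # Step 3: smoothed idf, estimated on the candidate collection.
    n = len(cand_tf)
    df = Counter(term for tf in cand_tf.values() for term in tf)

    def weigh(tf):
        return {t: c * (math.log((n + 1) / (df[t] + 1)) + 1.0)
                for t, c in tf.items()}

    profile_vec = weigh(profile_tf)
    scored = [(item, cosine(profile_vec, weigh(tf)))
              for item, tf in cand_tf.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical user profile and candidate articles.
PROFILE = {
    "p1": {"title": "Random walks for tag recommendation",
           "tags": "folksonomy tagging"},
    "p2": {"title": "Collaborative filtering with ratings",
           "tags": "recommendation"},
}
CANDIDATES = {
    "c1": {"title": "Tag recommendation in folksonomy systems"},
    "c2": {"title": "Protein folding dynamics", "tags": "biology"},
}

ranking = profile_centric_ranking(PROFILE, CANDIDATES)
```

Dropping a field from the dictionaries before matching is an easy way to test whether that type of metadata contributes to the recommendation quality.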
59. The Black Keys
The web responds, while the service-based
popularity index is static
60. Implications
An “artist popularity” index depends on
the platform and its user population
Web-based popularity (estimated via a URL
shortener’s API) “reacts” to real-world
events
Suitable as a search-log replacement
for academics?
61. Implications
Q: What is the most useful popularity –
one that changes dynamically or one that
lasts?
65. Tweets about blip.tv
“Twanchor text”
E.g.: http://blip.tv/file/2168377
Amazing
Watching “World’s most realistic 3D city
models?”
Google Earth/Maps killer
Ludvig Emgard shows how maps/satellite pics
on web is done (learn Google and MS!)
and ~120 more Tweets
66. Wikipedia
Wikipedia contains semantically very rich
annotations:
Wikipedia Categories
Wikipedia Lists
Times (1930, 1931, 1932, etc. etc.)
Names
Disambiguation pages
Etc.
Note: DBpedia is just Wikipedia
68. Geotags / POIs
Many social media items carry explicit geo
information
Geotags are low-level “coordinates”
POIs are high-level “point-of-interest” labels
Applications
Recommend geo-locations to people
Predict POI tags from (tweet) text
Predict where a user will go next
69. Map text to locations
Build a language model from all tags
assigned to Flickr images that belong to a
predefined grid cell
Neighbouring cells used for smoothing
(like the hierarchical language models used
previously for video / scene / shot)
User frequency of a term in a location
(instead of term frequency)
Neil O’Hare and Vanessa Murdock
Modeling Locations with Social Media
Information Retrieval, February 2013, Volume 16, Issue 1, pp 30-62
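The ingredients on this slide (grid cells, smoothing with neighbouring cells, user frequency instead of term frequency) can be combined in a rough sketch like the one below. It is not the model of O’Hare and Murdock: the one-degree cell size, the linear interpolation weights `lam_cell` and `lam_nb`, and the toy photos are all assumptions made for this example.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size=1.0):
    """Map coordinates to a grid cell id; `size` is the cell width in degrees."""
    return (math.floor(lat / size), math.floor(lon / size))


def user_frequencies(photos, size=1.0):
    """Count, per cell and term, the number of distinct users using the term.

    photos: iterable of (user, lat, lon, tags).
    """
    users = defaultdict(lambda: defaultdict(set))
    for user, lat, lon, tags in photos:
        for term in tags:
            users[cell_of(lat, lon, size)][term].add(user)
    return {cell: {term: len(who) for term, who in terms.items()}
            for cell, terms in users.items()}


def neighbours(cell):
    """The eight grid cells surrounding `cell`."""
    row, col = cell
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]


def term_prob(term, cell, uf, background, lam_cell=0.6, lam_nb=0.3):
    """Interpolate the cell, neighbourhood and background estimates."""
    def estimate(counts):
        total = sum(counts.values())
        return counts.get(term, 0) / total if total else 0.0

    nb_counts = defaultdict(int)
    for nb in neighbours(cell):
        for t, n in uf.get(nb, {}).items():
            nb_counts[t] += n
    lam_bg = 1.0 - lam_cell - lam_nb
    return (lam_cell * estimate(uf.get(cell, {}))
            + lam_nb * estimate(nb_counts)
            + lam_bg * estimate(background))


def best_cell(text_terms, uf):
    """Return the cell whose smoothed model gives the text the highest likelihood."""
    background = defaultdict(int)
    for counts in uf.values():
        for t, n in counts.items():
            background[t] += n

    def loglik(cell):
        # A tiny constant avoids log(0) for terms never seen anywhere.
        return sum(math.log(term_prob(t, cell, uf, background) + 1e-12)
                   for t in text_terms)

    return max(uf, key=loglik)


# Hypothetical geotagged photos: (user, latitude, longitude, tags).
PHOTOS = [
    ("u1", 52.37, 4.89, ["canal", "bike"]),
    ("u1", 52.36, 4.90, ["canal"]),      # same user again: counted once
    ("u2", 52.38, 4.88, ["canal"]),
    ("u3", 48.85, 2.35, ["eiffel", "tower"]),
    ("u4", 48.86, 2.29, ["eiffel"]),
]

uf = user_frequencies(PHOTOS)
```

Counting users rather than occurrences keeps one prolific uploader from dominating the model of a cell, which is the motivation the slide gives for user frequency.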
72. Searching the Social Graph
Search entities, and the relationships
between them, in the (Facebook) social
graph
Clearly IR problems, but who has the data
to work with?
Michael Curtiss et al.
Unicorn: A System for Searching the Social Graph
PVLDB, Vol. 6, No. 11
73. Crawling
How to get “the” data?
Rate limited APIs
Terms of Service (ToS)
HEADACHES!
74. Fred Morstatter, Jürgen Pfeffer, Huan Liu and Kathleen M. Carley
Is the Sample Good Enough? Comparing Data from Twitter’s Streaming
API with Twitter’s Firehose
ICWSM 2013
75. Not IR yet, but…
Interesting stuff nevertheless!
de Volkskrant, March 13, 2013
Michal Kosinski, David Stillwell, and Thore Graepel
Private traits and attributes are predictable from digital records of
human behavior
PNAS 2013; published ahead of print March 11, 2013,
doi:10.1073/pnas.1218772110
77. Take home message(s)
Social media give us IR researchers
access to a rich resource of context
Including time & location!
78. Take home message(s)
Gather the right data for your problem
domain, and it may be a good alternative
for not having the click data we all want
so badly
79. Take home message(s)
Various recommendation and retrieval
tasks exist in social media – can one
theory address all of these?