Rhea: Adaptively Sampling Authoritative Content from Social Activity Streams

Panagiotis Liakos - Alexandros Ntoulas - Alex Delis
University of Athens, Greece
IEEE BigData 2017
December 11th-14th, 2017 - Boston, MA

UoA Panagiotis Liakos Rhea-• Motivation 2/26

500 million tweets
sent each day!

Motivation
Mining social activity in real-time is valuable for numerous
applications:
opinion mining
content recommendation
emerging news detection
Processing the full activity stream of a social network is prohibitive:
storage
computational cost

Motivation
Not all content is useful:
90% of tweets are conversational or spam!
Workaround: take a sample of the social
activity and feed it into applications!
Our approach:
Sample the content published by authorities

Related Work
Social Activity Stream Sampling:
White-lists of users [GSB+12, WLP+12, GZB+13, ZBG+16].
Focus is mainly on Twitter.
Our approach is adaptive and does not rely on static white-lists.
Authoritative users in Online Social Networks:
Network attributes [ZAA07, JA07, ACD+08, PC11, BBC+13].
We focus on streams, not networks.

Contribution
We propose Rhea:
A sampling algorithm for authoritative content that forms a
network of authorities as it processes a social activity stream,
and samples only the activity of the top-K authoritative users.
We build on:
Network-based measures and their adaptation in a streaming setting
Our findings on the disadvantages of white-list approaches
We outperform contemporary approaches with regard to
precision, recall, and ranking accuracy!

Network-based measures

Network of Authorities from Social Activity

Ranking the Authorities
z-score: Zhang, Ackerman and Adamic, WWW 2007
Builds on positive and negative predictors of expertise:
$z(u) = \frac{a(u) - q(u)}{\sqrt{a(u) + q(u)}}$
where a(u) is the number of questions u has answered
and q(u) is the number of questions u has asked.

Ranking the Authorities
We propose auth-value:
A measure for a wide range of social networking sites:
$auth(u) = \frac{in(u) - out(u)}{\sqrt{in(u) + out(u)}}$
where in(u) is the weighted in-degree of u in the network of authorities
and out(u) is her respective weighted out-degree.
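
The auth-value above can be computed directly from the two weighted degrees. The following is a minimal sketch; the function name and the choice to return 0 for a user with no edges (where the formula is undefined) are assumptions, not part of the slides.

```python
import math


def auth_value(in_degree, out_degree):
    """auth(u) = (in(u) - out(u)) / sqrt(in(u) + out(u))."""
    total = in_degree + out_degree
    if total == 0:
        # Assumption: a user with no edges gets a neutral score of 0.
        return 0.0
    return (in_degree - out_degree) / math.sqrt(total)


print(auth_value(9, 0))  # only incoming weight
print(auth_value(0, 4))  # only outgoing weight
print(auth_value(8, 8))  # balanced
```

A user who only receives attention gets a positive score that grows with the square root of the volume, a user who only gives attention gets a negative one, and a balanced user sits at zero.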

Our findings on
White-List approaches

Limitations of Static Lists of Authorities
Rank  October 2009             November 2009            December 2009
      user u          auth(u)  user u          auth(u)  user u          auth(u)
1     justinbieber    393.885  justinbieber    448.815  justinbieber    433.185
2     donniewahlberg  358.286  donniewahlberg  249.988  nickjonas       249.558
3     tweetmeme       263.103  revrunwisdom    242.807  revrunwisdom    222.571
4     revrunwisdom    237.964  tweetmeme       195.379  donniewahlberg  202.996
5     mashable        229.650  addthis         186.282  tweetmeme       183.603
6     addthis         212.325  ddlovato        181.720  jonasbrothers   182.882
7     ddlovato        204.910  luansantanaevc  167.514  addthis         181.403
8     jordanknight    191.045  jordanknight    167.197  omgfacts        154.136
9     jonasbrothers   175.054  jonasbrothers   165.520  mashable        153.616
10    lilduval        174.616  mashable        164.496  johncmayer      147.241
User rankings vary across different months
White-lists can be unstable and
quickly become out-of-date
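
The drift in the top-10 lists can be quantified with a Precision@K computation. The sketch below assumes Precision@K is the fraction of a later month's top-K users that also appear in an earlier month's top-K (the slides do not spell out the definition), and it treats the October list as the static white-list purely for illustration.

```python
def precision_at_k(white_list, current, k):
    """Fraction of the current top-k users that the static white-list also holds."""
    listed = set(white_list[:k])
    return sum(1 for user in current[:k] if user in listed) / k


october = ["justinbieber", "donniewahlberg", "tweetmeme", "revrunwisdom",
           "mashable", "addthis", "ddlovato", "jordanknight",
           "jonasbrothers", "lilduval"]
november = ["justinbieber", "donniewahlberg", "revrunwisdom", "tweetmeme",
            "addthis", "ddlovato", "luansantanaevc", "jordanknight",
            "jonasbrothers", "mashable"]
december = ["justinbieber", "nickjonas", "revrunwisdom", "donniewahlberg",
            "tweetmeme", "jonasbrothers", "addthis", "omgfacts",
            "mashable", "johncmayer"]

print(precision_at_k(october, november, 10))  # 0.9
print(precision_at_k(october, december, 10))  # 0.7
```

Even for these ten users, a list frozen in October misses one authority a month later and three authorities two months later.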

[Figure: Precision@K (y-axis, 0.4 to 1) against K (authorities) (x-axis, 0 to 1000), with one curve per pair of months: Sept. 2009 & Oct. 2009, Sept. 2009 & Nov. 2009, Sept. 2009 & Dec. 2009.]
We need an adaptive algorithm!

Rhea: “She who flows”
Museum of Fine Arts,
Boston

Rhea: Three Challenges
1. Maintaining user information
   may be costly in terms of both memory and CPU
2. Ranking users
   may require taking multiple measures into account
3. Many of the elements we opt to include may be irrelevant

Maintaining User Information
Count-Min sketch:
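[Figure: Count-Min sketch. An item i_t is hashed by d hash functions h1, h2, ..., hd into an array of counters with d rows and w columns; one counter per row is incremented by c_t.]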
Reducing the processing overhead through sampling:
We apply a Bernoulli sampling scheme [PJC+15].
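
To make the two ideas concrete, here is a minimal Count-Min sketch with a Bernoulli-sampled update. It is an illustrative sketch, not the authors' implementation: the class and function names, the default dimensions, and the use of salted BLAKE2b digests as the hash functions h1 ... hd are all assumptions.

```python
import hashlib
import random


class CountMinSketch:
    """Array of counters with `depth` rows and `width` columns."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One salted BLAKE2b digest per row plays the role of h1 ... hd.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                     salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so take the minimum.
        return min(self.table[row][col] for row, col in self._columns(item))


def sampled_add(sketch, item, p, rng):
    """Bernoulli sampling: apply the update with probability p."""
    if rng.random() < p:
        sketch.add(item)
        return True
    return False


cms = CountMinSketch(width=64, depth=3)
for _ in range(5):
    cms.add("alice")
cms.add("bob", 2)
print(cms.estimate("alice"), cms.estimate("bob"))
```

The sketch never underestimates a count, and its memory footprint is fixed by its width and depth rather than by the number of users seen. With p below 1 the counters hold roughly a p fraction of the true counts.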

Ranking Authorities
We need to know at any time the top-K users by auth(u):
Algorithm 1: put(Top-K-Heap, key, value)
input : A Top-K-Heap structure and a key associated with a value to be
        inserted in the Top-K-Heap.
output: The updated Top-K-Heap.
begin
    if Top-K-Heap.size() < K then
        if Top-K-Heap.contains(key) then
            Top-K-Heap.replace(key, value);
        else
            Top-K-Heap.push(key, value);
    else
        if Top-K-Heap.contains(key) then
            Top-K-Heap.replace(key, value);
        else if value > Top-K-Heap.peek().value() then
            Top-K-Heap.pop();
            Top-K-Heap.push(key, value);
    return Top-K-Heap;
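
The behaviour of put can be sketched in Python as follows. For brevity this version keeps the entries in a plain dictionary and finds the minimum by scanning, which costs O(K) per lookup. The class name and this storage choice are assumptions made for illustration; an indexed min-heap would be the natural structure at scale.

```python
class TopKHeap:
    """Keeps at most k (key, value) pairs, evicting the smallest value."""

    def __init__(self, k):
        self.k = k
        self.values = {}

    def __contains__(self, key):
        return key in self.values

    def peek(self):
        # Entry with the smallest value (the eviction candidate).
        key = min(self.values, key=self.values.get)
        return key, self.values[key]

    def put(self, key, value):
        if key in self.values or len(self.values) < self.k:
            # Known key: replace its value. Free slot: push the new key.
            self.values[key] = value
        else:
            lowest_key, lowest_value = self.peek()
            if value > lowest_value:
                del self.values[lowest_key]  # pop the current minimum
                self.values[key] = value     # push the newcomer


heap = TopKHeap(2)
heap.put("a", 1.0)
heap.put("b", 2.0)
heap.put("c", 0.5)  # rejected: not larger than the current minimum
heap.put("d", 3.0)  # evicts "a"
heap.put("b", 0.1)  # existing key: value replaced in place
print(sorted(heap.values.items()))
```

Note that replacing the value of an existing key can lower it below that of users outside the structure, which is one reason a later filtering pass is useful.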

Filtering-out Non-relevant Activity
While processing the stream, we may deem as an authority
a user that only temporarily appears to be one.
We lose precision!
Post-processing step:
The sample is much smaller than the stream: |Ŝ| ≪ |S|
We re-examine the elements of the sample and
filter out the activity of users not in the Top-K-Heap
Rhea
Algorithm 2: Rhea(S, K, p)
input : A stream S, a parameter K > 0 and a probability p ∈ (0, 1].
output: A set Ŝ ⊂ S containing elements whose respective users are likely to
        be among the top-K w.r.t. the auth-value.
begin
    Top-K-heap ← ∅;
    CMS_in ← ∅;
    CMS_out ← ∅;
    Ŝ ← ∅;
    foreach s ∈ S do
        // Forming the network of authorities
        if random(0, 1] < p then
            (in, out) ← extractIndicators(s.message);
            CMS_in[in] += 1;
            CMS_out[out] += 1;
        // Sampling the stream
        auth_user ← (CMS_in[s.user] − CMS_out[s.user]) / √(CMS_in[s.user] + CMS_out[s.user]);
        if auth_user > Top-K-heap.low() then
            Top-K-heap.put(s.user, auth_user);
            Ŝ.put(s);
    // Removing irrelevant content
    foreach s ∈ Ŝ do
        if s.user ∉ Top-K-heap then
            Ŝ.remove(s);
    return Ŝ;
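
The three phases of Algorithm 2 can be sketched end to end in Python. Several parts are assumptions made for illustration only: stream elements are (user, message) pairs, `extract_indicators` treats every @mention as an edge from the author to the mentioned user, exact `Counter` objects stand in for the two Count-Min sketches, a dictionary stands in for the Top-K heap, and low() is taken to be minus infinity while the heap is not yet full. The sketch also assumes that only the counter updates are guarded by the Bernoulli test, and that the auth-value uses the square-root denominator from its definition.

```python
import math
import random
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def extract_indicators(user, message):
    # Hypothetical: each @mention adds in-weight to the mentioned user
    # and out-weight to the author.
    return [(mentioned, user) for mentioned in MENTION.findall(message)]


def auth_value(in_deg, out_deg):
    total = in_deg + out_deg
    return (in_deg - out_deg) / math.sqrt(total) if total else 0.0


def put(top_k, k, key, value):
    # Same logic as Algorithm 1, on a plain dict.
    if key in top_k or len(top_k) < k:
        top_k[key] = value
    else:
        lowest = min(top_k, key=top_k.get)
        if value > top_k[lowest]:
            del top_k[lowest]
            top_k[key] = value


def rhea(stream, k, p, rng=None):
    rng = rng or random.Random(0)
    cms_in, cms_out = Counter(), Counter()  # stand-ins for the sketches
    top_k, sample = {}, []
    for user, message in stream:
        # Phase 1: form the network of authorities (Bernoulli-sampled).
        if rng.random() < p:
            for gains_in, gains_out in extract_indicators(user, message):
                cms_in[gains_in] += 1
                cms_out[gains_out] += 1
        # Phase 2: sample the element if its author currently ranks high.
        score = auth_value(cms_in[user], cms_out[user])
        low = min(top_k.values()) if len(top_k) >= k else -math.inf
        if score > low:
            put(top_k, k, user, score)
            sample.append((user, message))
    # Phase 3: drop elements whose authors did not stay in the top-k.
    return [(u, m) for u, m in sample if u in top_k]


stream = [("bob", "hi @alice"), ("carol", "nice @alice @dave"),
          ("alice", "new post"), ("dave", "thanks"),
          ("bob", "lunch"), ("alice", "another post")]
result = rhea(stream, k=2, p=1.0)
print(result)
```

In this toy run the early posts by bob and carol are sampled while the heap is still filling up, and the final pass discards them once alice and dave have taken over the top two positions.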

Experimental Evaluation
Datasets:
1. 467 million tweets from 20 million users of Twitter
2. 263,540 answers to 83,423 questions posted by 26,752 users of
   StackOverflow
Questions:
1. How does Rhea compare against white-list based sampling in
   terms of F1-score?
2. Is Rhea able to assess the ranking relevance of the sampled
   documents?
3. What is the impact of the parameters involved in the execution
   of Rhea?

F1-score
[Figure: two panels plotting F1-score (y-axis, 0 to 1) against K (authorities) (x-axis, 0 to 1000). One panel compares Rhea (T) with WhiteList (T), the other Rhea (SO) with WhiteList (SO); T and SO denote the Twitter and StackOverflow datasets.]
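
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below assumes the comparison is made between the set of sampled elements and the set of elements truly published by the top-K authorities; the slides do not state how the ground truth is built, and the element ids are made up.

```python
def f1_score(sampled, relevant):
    """Harmonic mean of precision and recall over two sets of element ids."""
    sampled, relevant = set(sampled), set(relevant)
    hits = len(sampled & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(sampled)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Four sampled elements, six relevant ones, two in common.
print(f1_score({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}))
```

Here precision is 2/4 and recall is 2/6, which gives an F1-score of 0.4.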

Normalized Discounted Cumulative Gain
[Figure: two panels plotting NDCG (y-axis, 0.4 to 1) against K (authorities) (x-axis, 0 to 1000). One panel compares Rhea (T) with WhiteList (T), the other Rhea (SO) with WhiteList (SO); T and SO denote the Twitter and StackOverflow datasets.]
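
As a reminder of what NDCG measures, here is the standard formulation with a logarithmic position discount. How relevance gains are assigned to sampled documents in the evaluation is not stated on the slides, so the gain values below are placeholders.

```python
import math


def dcg(gains):
    # The gain at rank i (1-based) is discounted by log2(i + 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # already ideally ordered
print(ndcg([1, 2, 3]))  # worst ordering of the same gains
```

A ranking that places the highest gains first scores 1, and any other ordering of the same items scores lower.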

Impact of Parameters
Varying the Value of Probability p:
Using a sample of 20% of S we achieve performance almost as
good as that of using S.
Using p = 0.2 instead of p = 1 greatly reduces processing time.
Removing the filtering step:
The loss is over 25 p.p. for K = 1,000 and is never less than 10 p.p. for
any K examined.

Conclusion
Rhea is the first adaptive algorithm for sampling
authoritative content from social activity streams.
We exposed the dynamic nature of the task.
We introduced a measure to identify authoritative users.
Rhea employs several techniques to achieve significantly
improved performance with regard to recall, precision, and
ranking accuracy.

References
[ACD+08] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne.
Finding high-quality content in social media.
In Proc. of the Int. Conf. on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California, USA, February 11-12, 2008, pages 183–194, 2008.
[BBC+13] Alessandro Bozzon, Marco Brambilla, Stefano Ceri, Matteo Silvestri, and Giuliano Vesci.
Choosing the right crowd: expert finding in social networks.
In Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 637–648, 2013.
[GSB+12] Saptarshi Ghosh, Naveen Kumar Sharma, Fabrício Benevenuto, Niloy Ganguly, and P. Krishna Gummadi.
Cognos: crowdsourcing search for topic experts in microblogs.
In The 35th Int. ACM SIGIR Conf. on research and development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, pages 575–590, 2012.
[GZB+13] Saptarshi Ghosh, Muhammad Bilal Zafar, Parantapa Bhattacharya, Naveen Kumar Sharma, Niloy Ganguly, and P. Krishna Gummadi.
On sampling the wisdom of crowds: random vs. expert sampling of the twitter stream.
In 22nd ACM Int. Conf. on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pages 1739–1744, 2013.
[JA07] Pawel Jurczyk and Eugene Agichtein.
Discovering authorities in question answer communities by using link analysis.
In Proc. of the 16th ACM Conf. on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pages 919–922, 2007.
[PC11] Aditya Pal and Scott Counts.
Identifying topical authorities in microblogs.
In Proc. of the 4th International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011, pages 45–54, 2011.
[PJC+15] Deepan Subrahmanian Palguna, Vikas Joshi, Venkatesan T. Chakaravarthy, Ravi Kothari, and L. Venkata Subramaniam.
Analysis of sampling algorithms for twitter.
In Proc. of the 24th Int. Joint Conf. on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 967–973, 2015.
[WLP+12] Claudia Wagner, Vera Liao, Peter Pirolli, Les Nelson, and Markus Strohmaier.
It’s not in their tweets: Modeling topical expertise of twitter users.
In 2012 Int. Conf. on Privacy, Security, Risk and Trust, PASSAT 2012, and 2012 Int. Conf. on Social Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012, pages 91–100, 2012.
[ZAA07] Jun Zhang, Mark S. Ackerman, and Lada A. Adamic.
Expertise networks in online communities: structure and algorithms.
In Proc. of the 16th Int. Conf. on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pages 221–230, 2007.
[ZBG+16] Muhammad Bilal Zafar, Parantapa Bhattacharya, Niloy Ganguly, Saptarshi Ghosh, and Krishna P. Gummadi.
On the wisdom of experts vs. crowds: Discovering trustworthy topical news in microblogs.
In Proc. of the 19th ACM Conf. on Computer-Supported Cooperative Work & Social Computing, CSCW 2016, San Francisco, CA, USA, February 27 - March 2, 2016, pages 437–450, 2016.

thank you!
for further details email me at:
p.liakos@di.uoa.gr
