This document discusses using user-generated content from social media and other online sources for collective and personalized inference tasks. It describes using bilinear regression models and regularization techniques like lasso and elastic net to infer things like flu rates from Twitter data and predict voting intentions from social media posts. Key applications discussed include flu surveillance, political preference modeling, and inferring user attributes. The document outlines several case studies and models for tasks like nowcasting influenza rates, modeling voting polls, and predicting occupation from social media data.
11. Flu rates from Twitter: Performance
Nowcasting Events from the Social Web with Statistical Learning
Fig. 8. Feature Class H – Inference for Flu case study (Round 1 of 5-fold cross validation).
[Bar chart: Root Mean Squared Error for the flu case study across 1-gram, 2-gram, and hybrid feature sets, comparing Soft-Bolasso against a baseline using correlation-based feature selection; RMSE values range from roughly 10.6 to 13.8.]
(Lampos & Cristianini, 2012)
19. Bilinear Group ℓ2,1 (BGL)

$$\operatorname*{arg\,min}_{U, W, \beta} \; \sum_{t=1}^{\tau} \sum_{i=1}^{n} \left( u_t^{\top} Q_i w_t + \beta_t - y_{ti} \right)^2 + \lambda_u \sum_{k=1}^{p} \| U_k \|_2 + \lambda_w \sum_{j=1}^{m} \| W_j \|_2$$

Weights and bias: $u_k, w_j, \beta \in \mathbb{R}^{\tau}$ with $k \in \{1, ..., p\}$ and $j \in \{1, ..., m\}$; $U$, $W$, $\beta$ collect them across tasks, and each prediction has the bilinear form $u_t^{\top} Q_i w_t$.

+ a feature (user or word) is usually selected (activated) for all tasks, not just one, but possibly with different weights
+ especially useful in the domain of political preference inference (e.g. a user pro party A)
BGL can be broken into two convex tasks: first learn {W, β}, then {U, β}, and iterate through this process.
v.lampos@ucl.ac.uk — Slides: http://bit.ly/1GrxI8j
(Argyriou et al., 2008)
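The group (ℓ2,1) penalties above zero out entire rows of U or W, i.e. deselect a user or word for all tasks at once. A minimal numpy sketch of the group soft-thresholding proximal operator and an ISTA loop for one convex half of the BGL alternation (the function names and synthetic setup are mine, not from the slides):

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 for a single coefficient group:
    shrinks the group towards zero, and zeroes it out entirely when its
    norm falls below lam (this is what yields group-level sparsity)."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

def group_lasso_ista(X, y, groups, lam, n_iter=500):
    """ISTA for 0.5*||y - X w||^2 + lam * sum_g ||w_g||_2.
    `groups` is a list of index arrays that partition the columns of X."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))   # gradient step on the squared loss
        for g in groups:                     # proximal step, group by group
            w[g] = group_soft_threshold(z[g], step * lam)
    return w
```

With W (or U) held fixed, each BGL half-step reduces to exactly this kind of group-lasso regression on the remaining factor.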
24. Inferring voting intention from Twitter: A qualitative outcome
| Party | Leaning | Tweet | Score | User type |
|---|---|---|---|---|
| SPÖ | centre left | Inflation rate in Austria slightly down in July from 2.2 to 2.1%. Accommodation, Water, Energy more expensive. | 0.745 | Journalist |
| ÖVP | centre right | Can really recommend the book “Res Publica” by Johannes #Voggenhuber! Food for thought and so on #Europe #Democracy | -2.323 | User |
| FPÖ | far right | Campaign of the Viennese SPÖ on “Living together” plays right into the hands of right-wing populists | -3.44 | Human rights |
| GRÜ | centre left | Protest songs against the closing-down of the bachelor course of International Development: <link> #ID_remains #UniBurns #UniRage | 1.45 | Student Union |
27. Gaussian Processes (GPs)
Clusters of words can be interpreted via their most frequent or representative words; the latter are identified using the metric

$$\frac{\sum_{x \in c} \mathrm{NPMI}(w, x)}{|c| - 1}, \quad (2)$$

where c is the cluster and w the target word. Word embeddings (W2V-E): there has been growing interest in neural embeddings in NLP research, where words are projected into a dense vector space (Polajnar et al., 2011, among others).

Formally, GP methods aim to learn a function f: ℝ^d → ℝ drawn from a GP prior given the inputs x ∈ ℝ^d:

$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \quad (3)$$

where m(·) is the mean function (here 0) and k(·, ·) is the covariance kernel. Usually, the Squared Exponential (SE) kernel (a.k.a. RBF or Gaussian) is used to encourage smooth functions. For a multi-dimensional pair of inputs (x, x'), its ARD variant is

$$k_{\mathrm{ard}}(x, x') = \sigma^2 \exp\left( -\sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{2 \ell_i^2} \right).$$

+ based on d-dimensional input data, we want to learn a function
+ formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution
+ mean function: drawn on inputs
+ covariance function (or kernel): drawn on pairs of inputs
(Rasmussen & Williams, 2006)
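A GP prior is easy to probe numerically: build the Gram matrix of a kernel on a grid and draw correlated Gaussian samples through its Cholesky factor. A minimal sketch for the SE kernel (function names are mine):

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, sigma_f=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]            # pairwise differences
    return sigma_f**2 * np.exp(-d**2 / (2.0 * lengthscale**2))

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-6, seed=0):
    """Draw sample functions f ~ GP(0, k) evaluated at the inputs x."""
    K = kernel(x, x) + jitter * np.eye(len(x))   # jitter for numerical stability
    L = np.linalg.cholesky(K)
    rng = np.random.default_rng(seed)
    return L @ rng.normal(size=(len(x), n_samples))  # columns are sample paths
```

Plotting the columns of the returned matrix reproduces the kind of smooth sample functions shown for the SE prior on the next slide.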
28. Common covariance functions (kernels)
We briefly examine the priors on functions encoded by some commonly used kernels: the squared-exponential (SE), periodic (Per), and linear (Lin) kernels, defined as follows.

| Kernel name | $k(x, x')$ | Type of structure |
|---|---|---|
| Squared-exp (SE) | $\sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$ | local variation |
| Periodic (Per) | $\sigma_f^2 \exp\left(-\frac{2}{\ell^2} \sin^2\left(\frac{\pi (x - x')}{p}\right)\right)$ | repeating structure |
| Linear (Lin) | $\sigma_f^2 (x - c)(x' - c)$ | linear functions |

Figure 1.1: Examples of structures expressible by some basic kernels (plots of k(x, x') and sample functions f(x) drawn from the corresponding GP priors).
(Duvenaud, 2014)
29. Combining kernels in a GP
| Product | Type of structure |
|---|---|
| Lin × Lin | quadratic functions |
| SE × Per | locally periodic |
| Lin × SE | increasing variation |
| Lin × Per | growing amplitude |

Figure 1.2: Examples of one-dimensional structures expressible by multiplying kernels.

It is possible to add or multiply kernels (among other operations).
(Duvenaud, 2014)
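Because sums and elementwise (Schur) products of valid kernel matrices are again valid kernels, the composite structures in Figure 1.2 can be built by simply multiplying Gram matrices. A short sketch (hyperparameter values are arbitrary choices of mine):

```python
import numpy as np

def se(x1, x2, ell=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-d**2 / (2 * ell**2))

def lin(x1, x2, c=0.0):
    """Linear kernel (dot product around an offset c)."""
    return (x1[:, None] - c) * (x2[None, :] - c)

def per(x1, x2, ell=1.0, p=1.0):
    """Periodic kernel with period p."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-2 * np.sin(np.pi * d / p)**2 / ell**2)

x = np.linspace(-1, 1, 25)
# Elementwise products of Gram matrices realise the combined structures:
K_quadratic = lin(x, x) * lin(x, x)        # Lin x Lin -> quadratic functions
K_locally_periodic = se(x, x) * per(x, x)  # SE x Per  -> locally periodic
K_growing = lin(x, x) * per(x, x)          # Lin x Per -> growing amplitude
```

Sampling from a zero-mean Gaussian with any of these matrices as covariance produces functions with the corresponding structure.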
34. Google Flu Trends: Failure
[Time-series plot, 2009–2013: Google Flu Trends estimates vs. lagged CDC data, Google Flu + CDC, and CDC alone (%ILI above baseline). Annotations: Google estimates more than double the CDC estimates; Google starts estimating high in 100 out of 108 weeks.]

The estimates of the online Google Flu Trends tool were approx. two times larger than the ones from the CDC in 2012/13.
(Lazer et al., 2014)
The original model related the log-odds of an ILI physician visit to the log-odds of an ILI-related search query:

logit(P) = β0 + β1 × logit(Q) + ε,

where P is the percentage of ILI physician visits, Q is the ILI-related query fraction, β0 is the intercept, β1 the coefficient, and ε the error term.
(Ginsberg et al., 2009)
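Since the model above is a single-regressor linear fit in logit space, it can be estimated in closed form. A minimal stdlib-only sketch with simulated data (the helper names are mine, not from the paper):

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))

def fit_logit_model(P, Q):
    """Ordinary least squares for logit(P) = b0 + b1 * logit(Q),
    solved in closed form for the single-regressor case."""
    ys = [logit(p) for p in P]
    xs = [logit(q) for q in Q]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1
```

On noiseless data generated from known coefficients, the fit recovers them exactly, which is a convenient sanity check before applying it to real query fractions.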
37. Google Flu Trends revised: Data (2)
Corresponding ILI rates from the CDC.

Table S3. Cumulative performance (2008–2013) of the GP model with various numbers of clusters.

Table S4. Performance comparison of the optimal GP model (10 clusters) under different covariance functions:

| Covariance function | r | MAE×10² | MAPE (%) |
|---|---|---|---|
| SE | .95 | .221 | 10.8 |
| Matérn | .95 | .228 | 11 |

Figure S1. CDC ILI rates for the US covering 2004 to 2013, i.e., the time span of the data used in the experimental process; flu seasons are distinguished by colour (different colouring per flu season).
40. Google Flu Trends revised: Methods (3)
We apply a similarity metric and then a composite GP kernel on clusters of queries. The search queries are partitioned as x = {c_1, …, c_C}, where c_i denotes the subset of queries clustered together. The GP covariance function is defined as

$$k(x, x') = \sum_{i=1}^{C} k_{\mathrm{SE}}(c_i, c'_i) + \sigma_n^2 \cdot \delta(x, x'),$$

where C denotes the number of clusters, each k_SE has a different set of hyperparameters, and the last term models noise (δ being a Kronecker delta function). The segmentation of queries is obtained by applying the k-means++ algorithm (see SI). The distance metric of k-means uses the cosine similarity between time series, given the different magnitudes of the query frequencies in the data:

$$\frac{x_{q_i}^{\top} x_{q_j}}{\| x_{q_i} \|_2 \cdot \| x_{q_j} \|_2},$$

where x_q ∈ ℝ^T, q ∈ {i, j}, denotes a column of the input matrix. By operating on sets of queries, the method can protect an inferred model from radical changes in the frequency of single queries that are not representative of an entire cluster: news about a disease may trigger queries expressing a general concern, whereas infected individuals are expected to use a small subset of specific key-phrases related to flu infection.
+ protect a model from radical changes in the frequency of
single queries that are not representative of a cluster
+ model the contribution of various thematic concepts
(captured by different clusters) to the final prediction
+ learning a sum of lower-dimensional functions: significantly
smaller input space, much easier learning task, fewer
samples required, more statistical traction obtained
- imposes the assumption that the relationship between
queries in separate clusters provides no information about
ILI (reasonable trade-off)
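The composite covariance above is straightforward to assemble once cluster memberships are fixed: one SE component per cluster (each restricted to its own query dimensions, with its own hyperparameters) plus diagonal noise. A numpy sketch, with hyperparameter optimisation deliberately omitted and all names mine:

```python
import numpy as np

def se_on_subset(X1, X2, idx, ell, sf):
    """SE kernel evaluated only on the query dimensions of one cluster."""
    A, B = X1[:, idx], X2[:, idx]
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-sq / (2 * ell**2))

def composite_kernel(X1, X2, clusters, ells, sfs, sigma_n=0.1):
    """k(x, x') = sum_i k_SE(c_i, c_i') + sigma_n^2 * delta(x, x'),
    with one SE component (own hyperparameters) per query cluster."""
    K = sum(se_on_subset(X1, X2, idx, ell, sf)
            for idx, ell, sf in zip(clusters, ells, sfs))
    if X1 is X2:  # Kronecker delta: noise only on the training Gram's diagonal
        K = K + sigma_n**2 * np.eye(len(X1))
    return K
```

Summing lower-dimensional SE components, rather than applying one SE kernel over all query dimensions, is exactly what yields the smaller effective input space the last bullet refers to.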
44. Google Flu Trends revised: Methods (4)
ARMAX combines an auto-regressive component (order p), a moving average component (order q), and a regression element. Given the sequential observations y_1, …, y_T and a D-dimensional exogenous input h_t, it specifies the relationship

$$y_t = \sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \sum_{i=1}^{D} w_i h_{t,i} + \epsilon_t,$$

where the φ_i, θ_i, and w_i are coefficients to be learned and ε_t is mean-zero noise with unknown variance. For fixed values of p and q, this model is trained using maximum likelihood. We extend this model with a seasonal component that incorporates yearly lags (a seasonal ARMAX model) and determine the orders p and q as well as the seasonal orders automatically via a stepwise procedure. Instead of using all available query fractions as the exogenous input, we use the single prediction result (D = 1) from a query model, ŷ_t. Essentially, this allows us to distill all of the information that search data have to offer about the ILI rate and use this meta-information in the ARMAX procedure. Predictive intervals are estimated for the autoregressive nowcast through the maximum likelihood variance of the model.

We evaluate the methodology on held-out ILI rates and normalized query fractions for consecutive periods matching the influenza seasons from 2008 to 2013. For each test period (flu season i), we train a seasonal ARMAX model. The seasonality component incorporates further information into the model; the length of a season is fixed to 52 weeks (1 year). The full model description becomes

$$y_t = \underbrace{\sum_{i=1}^{p} \varphi_i y_{t-i} + \sum_{i=1}^{J} \omega_i y_{t-52i}}_{\text{AR and seasonal AR}} + \underbrace{\sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \sum_{i=1}^{K} \nu_i \epsilon_{t-52i}}_{\text{MA and seasonal MA}} + \underbrace{\sum_{i=1}^{D} w_i h_{t,i}}_{\text{regression}} + \epsilon_t.$$

Seasonal ARMAX: an auto-regressive moving average model with exogenous inputs (ARMAX), extended with seasonal AR and MA terms.
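The seasonal ARMAX recursion can be sketched directly. The helper below (my naming, not from the paper) computes one step of the right-hand side from past observations, past residuals, and the exogenous input; actual fitting would estimate the coefficients by maximum likelihood, which is omitted here:

```python
import numpy as np

def sarmax_step(y_hist, eps_hist, h_t, phi, omega, theta, nu, w, s=52):
    """One step of the seasonal ARMAX recursion from the slide:
    y_t = sum_i phi_i y_{t-i} + sum_i omega_i y_{t-s*i}     (AR, seasonal AR)
        + sum_i theta_i e_{t-i} + sum_i nu_i e_{t-s*i}      (MA, seasonal MA)
        + sum_i w_i h_{t,i}                                 (exogenous regression)
    `y_hist` and `eps_hist` hold past values, most recent last."""
    ar = sum(p * y_hist[-i] for i, p in enumerate(phi, start=1))
    sar = sum(o * y_hist[-s * i] for i, o in enumerate(omega, start=1))
    ma = sum(t * eps_hist[-i] for i, t in enumerate(theta, start=1))
    sma = sum(n * eps_hist[-s * i] for i, n in enumerate(nu, start=1))
    reg = float(np.dot(w, h_t))
    return ar + sar + ma + sma + reg
```

In the paper's setting h_t is one-dimensional (D = 1): the GP query model's prediction ŷ_t, fed in as the sole exogenous regressor.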
53. Occupational class inference: Topic CDFs (1)
Feature Analysis – Cumulative Density Functions

[CDF plot for topic "Higher Education" (#21): topic proportion (x-axis, 0.001–0.05, log scale) vs. user probability (y-axis, 0–1), one line per occupational class C1–C9.]

A topic is more prevalent in a class (C1–C9) if its CDF line leans closer to the bottom-right corner of the plot.
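The per-class curves in these plots are plain empirical CDFs of topic proportions over the users in each class; a class whose curve rises later has more of its mass at larger proportions, pushing the line towards the bottom-right. A minimal sketch (function name is mine):

```python
import numpy as np

def empirical_cdf(values):
    """Return (x, F) where F[i] is the fraction of observations <= x[i].
    Plotting F against x (log-scaled here) gives one per-class CDF line."""
    x = np.sort(np.asarray(values, dtype=float))
    F = np.arange(1, len(x) + 1) / len(x)
    return x, F
```

Computing this once per occupational class, over that class's per-user topic proportions, reproduces the nine lines of each plot.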
54. Occupational class inference: Topic CDFs (2)
Feature Analysis – Cumulative Density Functions

[CDF plot for topic "Arts" (#116): topic proportion (x-axis, 0.001–0.05, log scale) vs. user probability (y-axis, 0–1), one line per occupational class C1–C9.]

A topic is more prevalent in a class (C1–C9) if its CDF line leans closer to the bottom-right corner of the plot.
55. Occupational class inference: Topic CDFs (3)
Feature Analysis – Cumulative Density Functions

[CDF plot for topic "Elongated Words" (#164): topic proportion (x-axis, 0.001–0.05, log scale) vs. user probability (y-axis, 0–1), one line per occupational class C1–C9.]

A topic is more prevalent in a class (C1–C9) if its CDF line leans closer to the bottom-right corner of the plot.
56. Occupational class inference: Topic similarity
A similar pattern holds in both topics ('Football' and topic #153), whereby users with lower-skilled jobs tweet more often.

Figure 3: Jensen-Shannon divergence in the topic distributions between the different occupational classes (C1–9).

Topic distribution distance (Jensen-Shannon divergence) for the different occupational classes.
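The pairwise distances in Figure 3 use the Jensen-Shannon divergence, a symmetric, bounded measure between two topic distributions. A self-contained numpy sketch (naming is mine):

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions, using natural log
    (so the value is bounded by ln 2). Inputs are normalised first."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # the mixture distribution

    def kl(a, b):
        mask = a > 0   # 0 * log 0 is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Evaluating this for every pair of class-level topic distributions yields the 9×9 symmetric matrix visualised in Figure 3.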
59. Income inference: Data
Income prediction

[Histogram: yearly income (£, from 10k to 100k, log scale) vs. number of users.]

+ 5,191 Twitter users (same as in the previous study) mapped to their occupations, then mapped to an average income in GBP (£) using the SOC taxonomy
+ approx. 11 million tweets
+ download the data set

(Preoțiuc-Pietro, Volkova, Lampos, Bachrach & Aletras, 2015)
66. Inferring the socioeconomic status: Results
Confusion matrices for the 2- and 3-way classification (rows: model output O, columns: true class T; P: per-row precision, R: per-column recall; the bottom-right cell is overall accuracy):

|     | T1  | T2  | P     |
|-----|-----|-----|-------|
| O1  | 584 | 115 | 83.5% |
| O2  | 126 | 517 | 80.4% |
| R   | 82.3% | 81.8% | 82.0% |

|     | T1  | T2  | T3  | P     |
|-----|-----|-----|-----|-------|
| O1  | 606 | 84  | 53  | 81.6% |
| O2  | 49  | 186 | 45  | 66.4% |
| O3  | 55  | 48  | 216 | 67.7% |
| R   | 85.4% | 58.5% | 68.8% | 75.1% |

Classification performance (using a GP classifier); standard deviations in parentheses:

| Classification | Accuracy (%) | Precision (%) | Recall (%) | F1 |
|---|---|---|---|---|
| 2-way | 82.05 (2.4) | 82.2 (2.4) | 81.97 (2.6) | .821 (.03) |
| 3-way | 75.09 (3.3) | 72.04 (4.4) | 70.76 (5.7) | .714 (.05) |
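The precision and recall margins of such a confusion matrix follow directly from its row and column sums. A short sketch that reproduces them (function name is mine):

```python
import numpy as np

def precision_recall(conf):
    """Per-class precision/recall from a confusion matrix whose rows are
    predicted classes (O) and columns are true classes (T)."""
    conf = np.asarray(conf, dtype=float)
    precision = conf.diagonal() / conf.sum(axis=1)  # normalise over each row
    recall = conf.diagonal() / conf.sum(axis=0)     # normalise over each column
    accuracy = conf.diagonal().sum() / conf.sum()
    return precision, recall, accuracy
```

Applied to the 2-way matrix above, it recovers the P column, R row, and overall accuracy shown on the slide.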
67. Characterising user impact: Task & Data
User impact — a simplified definition:

$$S(\phi_{in}, \phi_{out}, \phi_{\lambda}) = \ln\left( \frac{(\phi_{\lambda} + \theta)\,(\phi_{in} + \theta)^2}{\phi_{out} + \theta} \right)$$

+ φ_in: number of followers; φ_out: number of followees; φ_λ: number of times the account has been listed
+ θ = 1, so the logarithm is applied on a positive number
+ note that φ_in²/φ_out = (φ_in − φ_out) × (φ_in/φ_out) + φ_in
+ 40K Twitter accounts (UK) considered

[Histogram of the user impact scores in the data set; μ(S) = 6.776. Highlighted accounts: @guardian, @David_Cameron, @PaulMasonNews, @lampos (Vasileios Lampos), @nikaletras (Nikolaos Aletras), and a probable spam account.]
(Lampos et al., 2014)
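The impact score is a one-line formula and easy to compute directly; a stdlib sketch (the function and argument names are mine):

```python
import math

def impact_score(n_followers, n_followees, n_listed, theta=1.0):
    """S(phi_in, phi_out, phi_lambda) =
    ln((phi_lambda + theta) * (phi_in + theta)^2 / (phi_out + theta));
    theta = 1 keeps the argument of the logarithm positive."""
    return math.log((n_listed + theta) * (n_followers + theta) ** 2
                    / (n_followees + theta))
```

The squared follower term dominates, so accounts with a large follower surplus over their followee count score highly even with modest list membership.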
68. Characterising user impact: Topic entropy
Figure 4: User impact distribution for accounts with high (blue) and low (dark grey) topic entropy, measured over participation in the 10 most relevant topics. Dot-dashed lines denote the respective mean impact scores; the mean of the entire sample is 6.776.
On average, the higher the user impact score,
the higher the topic entropy
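The slide's definition of H(u_i, τ) is cut off in this extraction, so the sketch below assumes the plain Shannon entropy of a user's distribution over topics, which matches the qualitative reading (higher entropy = participation spread across more topics):

```python
import math

def topic_entropy(topic_proportions):
    """Shannon entropy (in nats) of a user's distribution over topics.
    NOTE: assumed form; the exact normalisation in the slide's
    H(u_i, tau) is not recoverable from this text."""
    total = sum(topic_proportions)
    probs = [p / total for p in topic_proportions if p > 0]
    return -sum(p * math.log(p) for p in probs)
```

A user tweeting evenly across k topics scores ln k, while a single-topic user scores 0, so thresholding this value separates the "high" and "low" entropy groups of Figure 4.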
69. Characterising user impact: Use case scenarios
Impact distribution under user behaviour scenarios

[Panels: impact distribution (x-axis: impact points, y-axis: number of user accounts) for Twitter accounts selected on subsets of the most relevant attributes and topics — IA: Interactive, IAC: Clique-interactive, L: using many links, NL: very few links, TO: Topic-Overall, TF: Topic-Focused, LT: 'Light' topics, ST: 'Serious' topics. Bars contrast each scenario with its negation; lines denote the respective mean impact scores.]
Interactive (IA) vs.
clique interactive
(IAC)
Links (L) vs.
very few links (NL)
Light topics (LT)
vs. more ‘serious’
topics (ST)
70. Concluding remarks
+ User-generated content is a valuable asset
> improve health surveillance tasks
> mine collective knowledge
> infer user characteristics
> numerous other tasks
+ Nonlinear models tend to perform better given the
multimodality of the feature space
+ Deep representations of text tend to improve
performance (better representations)
+ Qualitative analysis is important
> Evaluation
> Interesting insights