User-generated content: collective and personalised inference tasks

User-generated content: collective
and personalised inference tasks
Vasileios Lampos
Department of Computer Science
University College London
(March, 2016; @ DIKU)

Structure of the talk
1. Introductory remarks
2. Collective inference tasks from user-generated content 
— Nowcasting ﬂu rates from Twitter / Google 
— Modelling voting intention (bilinear text regression)
3. Personalised inference tasks using social media  
— Occupation, income, socioeconomic status & impact
4. Concluding remarks

Context and motivation
+ the Internet, the World Wide Web and connectivity
+ numerous successful web products feeding from
user activity
+ lots of user-generated content & activity logs, e.g.
social media and search engine query logs
+ large volumes of digitised data (‘Big Data’), birth of
Data Science (nothing new in principal)
How can we use online data to improve our society,
interpret human behaviour, and
enhance our understanding about our world?

User-generated content: Ongoing applications
+ Health
> disease surveillance, intervention impact
+ Finance & Commerce
> ﬁnancial indices
> consumer satisfaction, market share
+ Politics
> estimation of voting intentions
> public opinion barometers
+ Social and behavioural sciences
> complement questionnaire based studies
> approach answers to unresolved questions

Added value of user-generated content for health
+ Online content can potentially access a larger and more
representative part of the population 
Note: Traditional health surveillance schemes are based
on the subset of people that actively seek medical
attention
+ More timely information (almost instant) about a
disease outbreak in a population
+ Geographical regions with less established health
monitoring systems can greatly beneﬁt
+ Small cost when data access and expertise are in place

Collective inference tasks  
from user-generated content
Lampos & Cristianini, 2012;
Lampos, Preotiuc-Pietro & Cohn, 2013;
Lampos, Miller, Crossan & Stefansen, 2015

Flu rates from Twitter: The task
Flu surveillance
disease rates from
a health agency
f :
X ∈ ℝM x N
y ∈ ℝM
n-gram frequency
time series
2012 2013 2014
0
0.01
0.02
0.03
0.04
ILIrateper100people
ILI rates (PHE)
Bing
(Lampos & Cristianini, 2012)

Flu rates from Twitter: Lasso for feature selection
gression basics — Lasso
• observations xxxi œ Rm, i œ {1, ..., n} — XXX
• responses yi œ R, i œ {1, ..., n} — yyy
• weights, bias wj, — œ R, j œ {1, ..., m} — wwwú = [www; —
¸1¸1¸1–norm regularisation or lasso (Tibshirani, 1996)
argmin
www,—
Y
_]
_[
nÿ
i=1
Q
ayi ≠ — ≠
mÿ
j=1
xijwj
R
b
2
+ ⁄
mÿ
j=1
|wj|
Z
_^
_
or argmin
wwwú
Ó
ÎXXXúwwwú ≠ yyyÎ2
¸2
+ ⁄ÎwwwÎ¸1
Ô
Regression basics — Ordinary Least Squares (1/2)
• observations xxxi œ Rm, i œ {1, ..., n} — XXX
• weights, bias wj, — œ R, j œ {1, ..., m} — wwwú = [www; —]
also known as lasso or L1-norm regularisation
(Tibshirani, 1996)

Flu rates from Twitter: Bootstrap lasso
Lasso may not always select the true model 
due to collinearities in the feature space
Bootstrapping lasso (‘bolasso’) for feature selection
+ For a number (N) of bootstraps, i.e. iterations
> Sample the feature space with replacement (Xi)
> Learn a new model (wi) by applying lasso on Xi and y
> Remember the n-grams with nonzero weights
+ Select the n-grams with nonzero weights in p% of the N
bootstraps
+ p can be optimised; if p<100%, then ‘soft bolasso’
(Zhao & Yu, 2006)
(Bach, 2008)

Flu rates from Twitter: Performance
wcasting Events from the Social Web with Statistical Learning 72
Fig. 8. Feature Class H – Inference for Flu case study (Round 1 of 5-fold cross validation).
Root Mean
Squared Error
9
10
12
14
1-grams 2-grams Hybrid
11.62
13.82
12.44
10.57
12.64
11.14
Soft-Bolasso Baseline (correlation based feature selection)
(Lampos & Cristianini, 2012)

Flu rates from Twitter: Selected features
Word cloud with selected n-grams. Font size is
proportional to the regression’s weight; n-grams
that are upside-down have a negative weight.

Rainfall rates from Twitter: GeneralisationFig. 3. Feature Class B – Inference for Rainfall case study (Round 5 of 6-fold cross validation).

Rainfall rates from Twitter: Selected features
Word cloud with selected n-grams. Font size is
proportional to the regression’s weight; n-grams
that are upside-down have a negative weight.

Bilinear regression
ilinear Text Regression — The general idea (2/2)
• users p œ Z+
• observations QQQi œ Rp◊m, i œ {1, ..., n} — XXX
• weights, bias uk, wj, — œ R, k œ {1, ..., p} — uuu, www, —
j œ {1, ..., m}
f (QQQi) = uuuTQQQiwww + —
◊ ◊ + —
uuuT QQQi
www
Bilinear Text Regression — The general idea (2/2)
• users p œ Z+
j œ {1, ..., m}
◊ ◊ + —
uuuT QQQi
www
16

Bilinear regularised regressionBilinear Text Regression — The general idea (2/2)
• users p œ Z+
j œ {1, ..., m}
◊ ◊ + —
uuuT QQQi
www
16
Bilinear Text Regression — Regularisation
• users p œ Z+
j œ {1, ..., m}
argmin
uuu,www,—
I nÿ
i=1
1
uuuT
QQQiwww + — ≠ yi
22
+ Â(uuu, ◊u) + Â(www, ◊w)
J
Â(·): regularisation function with a set of hyper-parameters (◊)
• if Â (vvv, ⁄) = ⁄ÎvvvÎ¸1 Bilinear Lasso
• if Â (vvv, ⁄1, ⁄2) = ⁄1ÎvvvÎ2
¸2
+ ⁄2ÎvvvÎ¸1 Bilinear Elastic Net (BEN)
(Lampos et al., 2013)
(Lampos, Preotiuc-Pietro & Cohn, 2013)

Bilinear elastic net (BEN): training a model
tic Net (BEN)
1
uuuT
22
BEN’s objective function
1 ÎuuuÎ2
¸2
+ ⁄u2 ÎuuuÎ¸1
1 ÎwwwÎ2
¸2
+ ⁄w2 ÎwwwÎ¸1
J
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
0
0.4
0.8
1.2
1.6
2
2.4
Step
Global Objective
RMSE
xity: fix uuu, learn www and vv
hrough convex
on tasks: convergence
Global objective function
during training (red)
Corresponding prediction
error on held out data (blue)
Biconvex problem
+ fix u, learn w and vice versa
+ iterate through convex 
optimisation tasks
Bilinear Elastic Net (BEN)
argmin
uuu,www,—
I nÿ
i=1
1
uuuT
22
+ ⁄u1 ÎuuuÎ2
¸2
+ ⁄u2 ÎuuuÎ¸1
+ ⁄w1 ÎwwwÎ2
¸2
+ ⁄w2 ÎwwwÎ¸1
J
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
0
0.4
0.8
1.2
1.6
2
2.4
Step
Global Objective
RMSE
Figure 2 : Objective function
value and RMSE (on hold-out
data) through the model’s
iterations
• Bi-convexity: fix uuu, learn www and vv
• Iterating through convex
optimisation tasks: convergence
(Al-Khayyal & Falk, 1983; Horst & Tuy, 1996)
• FISTA (Beck & Teboulle, 2009)
in SPAMS (Mairal et al., 2010):
Large-scale optimisation solver,
quick convergence
V. Lampos v.lampos@ucl.ac.uk Bilinear Text Regression and Applications 18/45
18
/45
(Mairal et al., 2010)
Large-scale solvers in SPAMS

Bilinear multi-task learningBilinear Multi-Task Learning
• tasks · œ Z+
• users p œ Z+
• responses yyyi œ R· , i œ {1, ..., n} — YYY
• weights, bias uuuk,wwwj,——— œ R· , k œ {1, ..., p} — UUU, WWW, ———
j œ {1, ..., m}
f (QQQi) = tr
1
UUUTQQQiWWW
2
+ ———
◊ ◊
UUUT QQQi WWW
v.lampos@ucl.ac.uk Slides: http://bit.ly/1GrxI8j 23/45
23
/45
Bilinear Multi-Task Learning
• tasks · œ Z+
• users p œ Z+
• responses yyyi œ R· , i œ {1, ..., n} — YYY
• weights, bias uuuk,wwwj,——— œ R· , k œ {1, ..., p} — UUU, WWW, ———
j œ {1, ..., m}
f (QQQi) = tr
1
UUUTQQQiWWW
2
+ ———
◊ ◊
UUUT QQQi WWW

Bilinear Group l2,1 (BGL)argmin
UUU,WWW,———
I ·ÿ
t=1
nÿ
i=1
1
uuuT
t QQQiwwwt + —t ≠ yti
22
+ ⁄u
pÿ
k=1
ÎUUUkÎ2 + ⁄w
mÿ
j=1
ÎWWWjÎ2
J
◊ ◊
UUUT QQQi WWW
a feature (user/word) is selected for all tasks (not just one), but
possibly with di erent weights
especially useful in the domain of politics (e.g. user pro party A
+ a feature (user or word) is usually selected (activated) for
all tasks, but with diﬀerent weights
+ useful in the domain of political preference inference
eights, bias uuuk,wwwj,——— œ R· , k œ {1, ..., p} — UUU, WWW, ———
j œ {1, ..., m}
argmin
UUU,WWW,———
I ·ÿ
t=1
nÿ
i=1
1
uuuT
t QQQiwwwt + —t ≠ yti
22
+ ⁄u
pÿ
k=1
ÎUUUkÎ2 + ⁄w
mÿ
j=1
ÎWWWjÎ2
J
GL can be broken into 2 convex tasks: ﬁrst learn {WWW,———}, then
UUU,———} and vv + iterate through this process
@ucl.ac.uk Slides: http://bit.ly/1GrxI8j 24/45
24
/45
(Argyriou et al., 2008)

Inferring voting intention from Twitter: Data
United Kingdom
+ 3 parties (Conservatives, Labour, Lib Dem)
+ 42,000 Twitter users distributed proportionally to
UK’s regional population ﬁgures
+ 60 million tweets & 80,976 1-grams extracted
+ 240 polls from 30 Apr. 2010 to 13 Feb. 2012
Austria
+ 4 parties (SPO, OVP, FPO, GRU)
+ 1,100 politically active Twitter users selected by political
scientists
+ 800,000 tweets & 22,917 1-grams extracted
+ 98 polls from 25 Jan. to 25 Dec. 2012

Inferring voting intention from Twitter: PerformanceRoot Mean  
Squared Error
0
1
2
2
3
UK Austria
1.4391.478
1.699
1.573
1.442
3.067
1.47
1.723
1.851
1.69
Mean poll Last poll Elastic Net (words)
BEN BGL
(Lampos, Preotiuc-Pietro & Cohn, 2013)

Inferring voting intention from Twitter: UK
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
35
40
VotingIntention%
Time
CON
LAB
LIB
BEN
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
35
40
VotingIntention%
Time
CON
LAB
LIB
BGL
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
35
40
VotingIntention%
Time
CON
LAB
LIBYouGov

Inferring voting intention from Twitter: Austria
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
VotingIntention%
Time
SPÖ
ÖVP
FPÖ
GRÜ
Polls
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
VotingIntention%
Time
SPÖ
ÖVP
FPÖ
GRÜ
BEN
5 10 15 20 25 30 35 40 45
0
5
10
15
20
25
30
VotingIntention%
Time
SPÖ
ÖVP
FPÖ
GRÜ
BGL

Inferring voting intention from Twitter: A qualitative outcome
Party Tweet Score User type
SPÖ
centre
Inﬂation rate in Austria slightly down in July
from 2.2 to 2.1%. Accommodation, Water,
Energy more expensive.
0.745 Journalist
ÖVP
centre
right
Can really recommend the book “Res
Publica” by Johannes #Voggenhuber! Food
for thought and so on #Europe #Democracy
-2.323 User
FPÖ 
far right
Campaign of the Viennese SPO on “Living
together” plays right into the hands of
right-wing populists
-3.44
Human
rights
GRÜ
centre left
Protest songs against the closing-down of
the bachelor course of International
Development: <link> #ID_remains
#UniBurns #UniRage
1.45
Student
Union

Nonlinearities in the data (1)
frequency of search query
‘dry cough’ (Google)

Nonlinearities in the data (2)
frequency of search query
‘dry cough’ (Google)
linear
nonlinear

Gaussian Processes (GPs)
analysis step. Embeddings are
cause each dimension is an ab-
the clusters can be interpreted
of the most frequent or repre-
e latter are identified using the
metric:
P
x2c NPMI(w, x)
|c| 1
, (2)
cluster and w the target word.
beddings (W2V-E)
been a growing interest in neu-
research (Cohn and Sp
and Cohn, 2013) with
limited to (Polajnar et
Formally, GP meth
f : Rd ! R drawn
inputs xxx 2 Rd:
f(xxx) ⇠ GP(
where m(·) is the mean
is the covariance kerne
ponential (SE) kernel
used to encourage smo
dimensional pair of inp
usters can be interpreted
most frequent or repre-
are identified using the
c:
PMI(w, x)
| 1
, (2)
r and w the target word.
gs (W2V-E)
growing interest in neu-
e the words are projected
dense vector space via a
limited to (Polajnar et al., 201
Formally, GP methods aim
f : Rd ! R drawn from a
inputs xxx 2 Rd:
f(xxx) ⇠ GP(m(xxx),
where m(·) is the mean functi
is the covariance kernel. Usu
ponential (SE) kernel (a.k.a.
used to encourage smooth fun
dimensional pair of inputs (xxx
kard(xxx,xxx0
) = 2
exp
" dX
limited to (Polajnar et al., 2011).
Formally, GP methods aim to learn a function
f : Rd ! R drawn from a GP prior given the
inputs xxx 2 Rd:
f(xxx) ⇠ GP(m(xxx), k(xxx,xxx0
)) , (3)
where m(·) is the mean function (here 0) and k(·, ·
is the covariance kernel. Usually, the Squared Ex
ponential (SE) kernel (a.k.a. RBF or Gaussian) is
used to encourage smooth functions. For the multi
dimensional pair of inputs (xxx,xxx0), this is:
Based on d-dimensional input data
we want to learn a function
Formally: Sets of random variables any finite number
of which have a multivariate Gaussian distribution
mean function
drawn on inputs
covariance function (or kernel)
drawn on pairs of inputs
(Rasmussen & Williams, 2006)

Common covariance functions (kernels)
briefly examining the priors on functions encoded by some commonly used kernels: the
squared-exponential (SE), periodic (Per), and linear (Lin) kernels. These kernels are
defined in figure 1.1.
Kernel name: Squared-exp (SE) Periodic (Per) Linear (Lin)
k(x, xÕ
) = ‡2
f exp
1
≠(x≠xÕ)2
2¸2
2
‡2
f exp
1
≠ 2
¸2 sin2
1
fix≠xÕ
p
22
‡2
f (x ≠ c)(xÕ
≠ c)
Plot of k(x, xÕ
):
0 0
0
x ≠ xÕ
x ≠ xÕ
x (with xÕ
= 1)
¿ ¿ ¿
Functions f(x)
sampled from
GP prior:
x x x
Type of structure: local variation repeating structure linear functions
Figure 1.1: Examples of structures expressible by some basic kernels.
(Duvenaud, 2014)

Combining kernels in a GP
4 Expressing Structure with Kernels
Lin ◊ Lin SE ◊ Per Lin ◊ SE Lin ◊ Per
0 0
0
0
x (with xÕ
= 1) x ≠ xÕ
x (with xÕ
= 1) x (with xÕ
= 1)
¿ ¿ ¿ ¿
quadratic functions locally periodic increasing variation growing amplitude
Figure 1.2: Examples of one-dimensional structures expressible by multiplying kernels.
it is possible to add or multiply kernels
(among other operations)
(Duvenaud, 2014)

GPs for regression: A toy example (1)
take some (x,y) pairs with some obvious
nonlinear underlying structure
x (predictor variable)
0 10 20 30 40 50 60
y(targetvariable)
2
4
6
8
10
12
14
16
18
20
22 x,y pairs

x (predictor variable)
0 10 20 30 40 50 60
y(targetvariable)
2
4
6
8
10
12
14
16
18
20
22 x,y pairs OLS fit GP fit
GPs for regression: A toy example (2)
Addition of 2 GP kernels:
periodic + squared exponential + noise
testing
(solid line)training
(dashed line)

More information about GPs
+ Book — “Gaussian Processes for Machine Learning” 
http://www.gaussianprocess.org/gpml/
+ Tutorial — “Gaussian Processes for Natural Language
Processing” 
http://people.eng.unimelb.edu.au/tcohn/tutorial.html
+ Video-lecture — “Gaussian Process Basics” 
http://videolectures.net/gpip06_mackay_gpb/
+ Software I — GPML for Octave or MATLAB 
http://www.gaussianprocess.org/gpml/code
+ Software II — GPy for Python 
http://sheﬃeldml.github.io/GPy/

Google Flu Trends: The idea
Can we turn search query information (statistics) to
estimates about the rate of inﬂuenza-like illness
in the real-world population?

Google Flu Trends: Failure
0
2
4
6
8
10
07/01/09 07/01/10 07/01/11 07/01/12 07/01/13
Google Flu Lagged CDC
Google Flu + CDC CDC
50
100
150
Google Flu Lagged CDC
Google Flu + CDC
Google estimates more
than double CDC estimates
Google starts estimating
high 100 out of 108 weeks
%ILI%baseline)
The estimates of the online Google Flu Trends tool were
approx. two times larger than the ones from the CDC in 2012/13
(Lazer et al., 2014)
odel using the log-odds of an ILI physician visit and
ds of an ILI-related search query:
logit(P) = β0 + β1 × logit(Q) + ε
the percentage of ILI physician visits, Q is
ated query fraction, β0 is the intercept,
Fig
ILI-
poin
que
whi
0.8
(Ginsberg et al., 2009)

Google Flu Trends: Hypotheses for failure
+ ‘Big Data’ are not always good enough; may not always
capture the target signal properly
+ The estimates were based on a rather simplistic model
+ The model was OK, but some spurious search queries
invalidated the ILI inferences, e.g. ‘flu symptoms’
+ Media hype about the topic of ‘flu’ significantly increased
the search query volume from people that were just
seeking information (non patients)
+ Side note: CDC’s estimates are not necessarily the ground
truth; they can also go wrong sometimes, although we
generally assume that they are a good representation of
the real signal

Google Flu Trends revised: Data (1)
Google search query logs
> geo-located in US regions
> from 4 Jan. 2004 to 28 Dec. 2013 (521 weeks, ~decade)
> ﬁltered by a very relaxed health-topic classiﬁer
> intersection among frequently occurring search
queries in all US regions
> weekly frequencies of 49,708 queries (# of features)
> all data have been anonymised and aggregated
plus corresponding ILI rates from the CDC

Google Flu Trends revised: Data (2)
Corresponding ILI rates from the CDC
Table S3. Cumulative performance (2008-2013) of GP model with various numbers of clusters.
Covariance function r MAE⇥102
MAPE (%)
SE .95 .221 10.8
Matérn .95 .228 11
e S4. Performance comparison of the optimal GP model (10 clusters) when a different covariance function (Matérn
ure S1. CDC ILI rates for the US covering 2004 to 2013, i.e., the time span of the data used in our experimental pro
eriods are distinguished by color.
different colouring per flu season

Google Flu Trends revised: Methods (1)
r>a Elastic Net
Google search query
frequencies (Q)
Historical CDC
ILI data
k-means
k1
k2
k3
kN
…
+ GP(μ,k)
Q’≤Q Q’’≤Q’
ILI inference
(Lampos, Miller, Crossan & Stefansen, 2015)

1. Keep search queries with r ≥ 0.5 (reduces the amount
of irrelevant queries)
2. Apply the previous model (GFT) to get a baseline
performance estimate
3. Apply elastic net to select a subset of search queries
and compute another baseline
4. Group the selected queries into N = 10 clusters using  
k-means to account for their diﬀerent semantics
5. Use a diﬀerent GP covariance function on top of each
query cluster to explore non-linearities

milarity metric and then apply a composite GP kernel on clusters of qu
search queries = , …,x c c{ }C1 , where ci denotes the subset of queries cl
GP covariance function to be
∑ σ δ′ ′( , ) =
⎛
⎝
⎜⎜⎜
( , ′)
⎞
⎠
⎟⎟⎟⎟
+ ⋅ ( , ),
=
k kx x c c x x
i
C
i i
1
SE n
2
otes the number of clusters, kSE has a different set of hyperparameters (σ
rm of the equation models noise (δ being a Kronecker delta function).
ntation of queries by applying the k-means++ algorithm32,33
(see SI, Gau
The distance metric of k-means uses the cosine similarity between time s
the different magnitudes of the query frequencies in our data34
.
)/
⎛
⎝
⎜⎜⎜
⋅
⎞
⎠
⎟⎟⎟
x xq q
2 2
i j
, where ∈
,
xq
T
i j{ }
denotes a column of the input mat
ng on sets of queries, the proposed method can protect an inferred m
e frequency of single queries that are not representative of an entire clu
bout a disease may trigger queries expressing a general concern rather th
s are expected to utilize a small subset of specific key-phrases, but no
d to flu infection. In addition, assuming that query clusters may convey
+ protect a model from radical changes in the frequency of
single queries that are not representative of a cluster
+ model the contribution of various thematic concepts
(captured by different clusters) to the final prediction
+ learning a sum of lower-dimensional functions: significantly
smaller input space, much easier learning task, fewer
samples required, more statistical traction obtained
- imposes the assumption that the relationship between
queries in separate clusters provides no information about
ILI (reasonable trade-off)

Google Flu Trends revised: Results (1)e.com/scientificreports/

Google Flu Trends revised: Results (2)MAPE (%)
0
5
10
16
21
26
Mean absolute percentage (%) of error (MAPE) in ﬂu
rate estimates during a 5-year period (2008-2013)
Test data Test data; peaking moments
11%10.8%
15.8%
11.9%
24.8%
20.4%
Google Flu Trends old model Elastic Net
Gaussian Process

Google Flu Trends revised: Results (3)
‘rsv’ — 25%
‘ﬂu symptoms’ — 18%
‘benzonatate’ — 6%
‘symptoms of pneumonia’ — 6%
‘upper respiratory infection’ — 4%
impact of automatically selected queries in
a ﬂu estimate during the over-predictions
previous GFT model

component (p), the moving average component (q), and a regression elemen
the sequential observations , …,y yT1
, and a D-dimensional exogenous inpu
specifies the relationship
∑ ∑ ∑φ θ ε ε= + + + ,
=
−
=
−
=
,y y w ht
i
p
i t i
i
q
i t i
i
D
i t i t
1 1 1
where the φi, θi, and wi are coefficients to be learned and εt is mean zer
unknown variance. For fixed values of p and q, this model is trained usin
extend this model with a seasonal component that incorporates yearly lag
model) and determine orders p and q as well as seasonal orders automatic
procedure39
. Instead of using all available query fractions as the exogenous i
the single prediction result (D= 1) from a query model, ˆyt
. Essentially, this
to distill all of the information that search data have to offer about the IL
this meta-information in the ARMAX procedure. Predictive intervals are es
sive nowcast through the maximum likelihood variance of the model.
Results
We evaluate our methodology on held out ILI rates and normalized query f
utive periods matching the influenza seasons from 2008 to 2013, as define
and Supplementary Fig. S1). For each test period (flu season i), we train a m
sonal ARMAX model
asonality component in the ARMAX function incorporates further information into t
ngth of the season is ﬁxed to 52 weeks (1-year long). The full model description, wh
mes
yt =
pX
i=1
iyt i +
JX
i=1
!iyt 52 i
| {z }
AR and seasonal AR
+
qX
i=1
✓i✏t i +
KX
i=1
⌫i✏t 52 i
| {z }
MA and seasonal MA
+
DX
i=1
wiht,i
| {z }
regression
+ ✏t ,
Seasonal ARMAX
Auto-regressive
moving average
with exogenous
inputs (ARMAX)
AR component
Moving average
component
Exogenous input

Google Flu Trends revised: Results (4)e.com/scientificreports/
Figure 2. Comparison of nowcasts between an autoregressive baseline model which is based only on

Google Flu Trends revised: Results (5)
0
3
6
9
12
15
MAPE (%) in ﬂu rate autoregressive (AR) estimates during
a 4-year period (2009-2013)
Test data Test data; peaking moments
14.3
7.5%7.3%
8.3%7.7%
13%
10.2%
Google Flu Trends old model (AR) Elastic Net (AR)
Gaussian Process (AR) CDC (AR)

Personalised inference tasks
using social media content
Lampos, Aletras, Preotiuc-Pietro & Cohn, 2014;
Preotiuc-Pietro, Lampos & Aletras, 2015;
Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras, 2015;
Lampos, Aletras, Geyti, Zou & Cox, 2015

Occupational class inference: Motivation
+ Validate this hypothesis on a broader, larger data set
using social media (Twitter)
+ Downstream applications
> research (social science & other domains)
> commercial
+ Proxy for additional user attributes, e.g. income and
socioeconomic status
(Bernstein, 1960; Labov, 1972/2006)
(Preotiuc-Pietro, Lampos & Aletras, 2015)
“Socioeconomic variables are inﬂuencing language use.”

Occupational class inference: SOC 2010
C1 — Managers, Directors & Senior Officials
e.g. chief executive, bank manager
C2 — Professional Occupations (e.g. mechanical engineer, pediatrist)
C3 — Associate Professional & Technical
e.g. system administrator, dispensing optician
C4 — Administrative & Secretarial (e.g. legal clerk, secretary)
C5 — Skilled Trades (e.g. electrical fitter, tailor)
C6 — Caring, Leisure, Other Service
e.g. nursery assistant, hairdresser
C7 — Sales & Customer Service (e.g. sales assistant, telephonist)
C8 — Process, Plant and Machine Operatives
e.g. factory worker, van driver
C9 — Elementary (e.g. shelf stacker, bartender)
Standard Occupational Classification (SOC)

Occupational class inference: Data
+ 5,191 Twitter users mapped to their occupations, then
mapped to one of the 9 SOC categories
+ 10 million tweets
+ Download the data set
% of users per SOC category
0
10
20
30
40
C1 C2 C3 C4 C5 C6 C7 C8 C9

Occupational class inference: Features
User attributes (18)
+ number of followers, friends, listings, follower/friend
ratio, favourites, tweets, retweets, hashtags, @-mentions,
@-replies, links and so on
Topics — Word clusters (200)
+ SVD on the graph laplacian of the word x word similarity
matrix using normalised PMI, i.e. a form of spectral
clustering
+ Skip-gram model with negative sampling to learn word
embeddings (Word2Vec); pairwise cosine similarity on the
embeddings to derive a word x word similarity matrix;
then spectral clustering on the similarity matrix
(Bouma, 2009; von Luxburg, 2007)
(Mikolov et al., 2013)

Occupational class inference: Performance
Accuracy (%)
25
31
37
43
49
55
Feature type
User Attributes SVD-200-clusters Word2Vec-200-clusters
52.7
48.2
34.2
51.7
47.9
31.5
46.9
44.2
34
Logistic Regression SVM (RBF) Gaussian Process (SE-ARD)
most frequent
class baseline

Occupational class inference: Topic CDFs (1)
Feature Analysis - Cumulative Density Functions
0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Higher Education (#21)
C1
C2
C3
C4
C5
C6
C7
C8
C9
Topic more prevalent ! CDF line closer to bottom-right cornerTopic more prevalent in a class (C1-C9), if the line leans
closer to the bottom-right corner of the plot

0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Arts (#116)
C1
C2
C3
C4
C5
C6
C7
C8
C9

0.001 0.01 0.05
0
0.2
0.4
0.6
0.8
1
Topic proportion
Userprobability
Elongated Words (#164)
C1
C2
C3
C4
C5
C6
C7
C8
C9

Occupational class inference: Topic similarity
l as topics
‘Football’
re’ (#153).
cales only
iscrimina-
hich topic
formance.
ge across
cs are cov-
shows the
Fs) across
r these six
ers having
tweets. A
CDF line
the plot.
evalent in
similar pattern in both topics by which users with
lower skilled jobs tweet more often.
Figure 3: Jensen-Shannon divergence in the topic
distributions between the different occupational
classes (C 1–9).
Occupational class
Occupational class
Topic distribution distance (Jensen-Shannon divergence)
for the diﬀerent occupational classes

Income inference: Data
Income prediction
10k 30k 50k 100k
0
200
400
600
800
1000
Yearly income (£)
No.Users
+ 5,191 Twitter users (same as in the previous study)
mapped to their occupations, then mapped to an
average income in GBP (£) using the SOC taxonomy
+ approx. 11 million tweets
(Preotiuc-Pietro, Volkova,
Lampos, Bachrach &
Aletras, 2015)

Income inference: Features
+ Proﬁle (8) 
e.g. #followers, #followees, times listed etc.
+ Shallow textual features (10) 
e.g. proportion of hashtags, @-replies, @-mentions etc.
+ Inferred (perceived) psycho-demographic features (15) 
e.g. gender, age, education level, religion, life
satisfaction, excitement, anxiety etc.
+ Emotions (9) 
e.g. positive / negative sentiment, joy, anger, fear,
disgust, sadness, surprise etc.
+ Word clusters — Topics of discussion (200) 
based on word embeddings and by applying spectral
clustering

Income inference: Performance
MAE
£8500
£9275
£10050
£10825
£11600
Income inference error (Mean Absolute Error) using
GP regression or a linear ensemble for all features
Feature Categories
£9,535£9,621
£11,456
£10,980
£10,110
£11,291
Proﬁle Demo Emotion Shallow Topics All features

Income inference: Qualitative analysis (1)
e1: positive (l=46.27) e2: neutral (l=57.64) e3: negative(l=76.34)
e4: joy (l=36.37) e5: sadness (l=67.05) e6: disgust (l=116.66)
e7: anger (l=95.50) e8: surprise (l=83.61) e9: fear (l=31.74)
28000
35000
42000
28000
35000
42000
28000
35000
42000
0.1 0.2 0.3 0.4 0.5 0.4 0.5 0.6 0.7 0.8 0.9 0.05 0.10 0.15 0.20
0.5 0.6 0.7 0.8 0.05 0.10 0.010 0.015 0.020 0.025 0.030
0.01 0.02 0.03 0.04 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15
Feature value
Income
Relating income and emotion
Linear vs GP ﬁt

Income inference: Qualitative analysis (2)
Topic 107 (Justice) Topic 124 (Corporate 1) Topic 139 (Politics)
Topic 163 (NGOs) Topic 196 (Web analytics/Surveys) Topic 99 (Swearing)
30000
40000
50000
30000
40000
50000
0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.000 0.025 0.050 0.075
0.000 0.025 0.050 0.075 0.100 0.00 0.01 0.02 0.03 0.04 0.00 0.03 0.06 0.09 0.12
Feature value
Income
Relating income and topics of discussion
Linear vs GP ﬁt

Inferring the socioeconomic status: Task
Profile description
on Twitter
Occupation SOC category1 NS-SEC2
1. Standard Occupational Classification: 369 job groupings
2. National Statistics Socio-Economic Classification: Map from
the job groupings in SOC to a socioeconomic status, i.e.
{upper, middle or lower}
(Lampos, Aletras, Geyti, Zou & Cox, 2016)

Inferring the socioeconomic status: Data & Features
+ 1,342 Twitter user proﬁles 
distinct data set from the previous works
+ 2 million tweets
+ Date interval: Feb. 1, 2014 to March 21, 2015
+ Each user has a socioeconomic status (SES) label: 
{upper, middle, lower}
1,291 features representing
user behaviour (4), biographical / proﬁle information
(523), text in the tweets (560), topics of discussion (200),
and impact on the platform (4)

Inferring the socioeconomic status: Results
T1 T2 P
O1 584 115 83.5%
O2 126 517 80.4%
R 82.3% 81.8% 82.0%
T1 T2 T3 P
O1 606 84 53 81.6%
O2 49 186 45 66.4%
O3 55 48 216 67.7%
R 854% 58.5% 68.8% 75.1%
Classification Accuracy (%) Precision (%) Recall (%) F1
2-way 82.05 (2.4) 82.2 (2.4) 81.97 (2.6) .821 (.03)
3-way 75.09 (3.3) 72.04 (4.4) 70.76 (5.7) .714 (.05)
Confusion matrices for the 3- and 2-way classification
Classification performance (using a GP classifier)

Characterising user impact: Task & Data
ct — a simplified definition
S(„in, „out, „⁄) = ln
A
(„⁄ + ◊) („in + ◊)2
„out + ◊
B
mber of followers, „out: number of followees
mber of times the account has been listed
ogarithm is applied on a positive number
ut
"
= („in ≠ „out) ◊ („in/„out) + „in
of the user impact
n our data set
S) = 6.776
−5 0 5 10 15 20 25 30
0
0.05
0.1
0.15
Impact Score (S)
ProbabilityDensity
@guardian
@David_Cameron
@PaulMasonNews
@lampos
@nikaletras
@spam?
c.uk Slides: http://bit.ly/1GrxI8j 40/52
40
/52
„out
• „in: number of followers, „out: number of follow
• „⁄: number of times the account has been liste
• ◊ = 1, logarithm is applied on a positive numbe
•
!
„2
in/„out
"
= („in ≠ „out) ◊ („in/„out) + „in
Histogram of the user impact
scores in our data set
µ(S) = 6.776
−5 0 5
0
0.05
0.1
0.15
ProbabilityDensity
@spam?
v.lampos@ucl.ac.uk Slides:
ser impact — a simplified definition
S(„in, „out, „⁄) = ln
A
(„⁄ + ◊) („in + ◊)2
„out + ◊
B
• „in: number of followers, „out: number of followees
• „⁄: number of times the account has been listed
• ◊ = 1, logarithm is applied on a positive number
•
!
„2
in/„out
"
= („in ≠ „out) ◊ („in/„out) + „in
µ(S) = 6.776 0.05
0.1
0.15
ProbabilityDensity
@guardian
@David_Cameron
@PaulMasonNews
@lampos
@nikaletras
@spam?
User impact — a simplified definition
S(„in, „out, „⁄) = ln
A
(„⁄ + ◊) („in
„out + ◊
• „in: number of followers, „out: number of followee
•
!
„2
in/„out
"
= („in ≠ „out) ◊ („in/„out) + „in
µ(S) = 6.776
−5 0 5 10
0
0.05
0.1
0.15
Impa
ProbabilityDensity
@
@@spam?
User impact — a simplified definition
S(„in, „out, „⁄) = ln
A
(„⁄ + ◊) („in + ◊
„out + ◊
• „in: number of followers, „out: number of followees
•
!
„2
in/„out
"
= („in ≠ „out) ◊ („in/„out) + „in
µ(S) = 6.776
−5 0 5 10 1
0
0.05
0.1
0.15
Impact Sc
ProbabilityDensity
@Paul
@lamp
@nika
@spam?
v.lampos@ucl.ac.uk Slides: http:/
mplified definition
t, „⁄) = ln
A
(„⁄ + ◊) („in + ◊)2
„out + ◊
B
ers, „out: number of followees
the account has been listed
plied on a positive number
out) ◊ („in/„out) + „in
pact
t
0.05
0.1
0.15
ProbabilityDensity
@guardian
@David_Cameron
@PaulMasonNews
@lampos
@nikaletras
@spam?
—> number of followers
—> number of followees
—> number of times listed
—> logarithm is applied on a positive number
β Vasileios Lampos ~ @lampos
ν Nikolaos Aletras ~ @nikaletras
40K Twitter accounts (UK) considered
(Lampos et al., 2014)

Characterising user impact: Topic entropy
high participation in the 10 most relevant topics. Dot-dashed lines d
is the mean of the entire sample (= 6.776).
NumberofAccounts
Impact Score (S)
0 10 20 30
0
50
100
All
Low Entropy
High Entropy
Figure 4: User impact distribution for accounts with
high (blue) and low (dark grey) topic entropy. Lines
denote the respective mean impact scores.
H(ui, ⌧) =
where ui is a
This is a mea
meaning tha
ered as mor
quality of the
pact score di
the lowest an
tropies are se
clearly below
latter above.
a connection
may exist, at
tion in the en
Use case sce
On average, the higher the user impact score,
the higher the topic entropy

Characterising user impact: Use case scenarios
Impact distribution under user behaviour scenarios
0 10 20 30 0 10 20 30
0
100
200
300
400
500
L
NL
C
mpact distribution (x-axis: impact points, y-axis: # of user account
ed on subsets of the most relevant attributes and topics – IA: Interac
ng many links, TO: Topic-Overall, TF: Topic-Focused, LT: ‘Light’ to
s negation and lines the respective mean impact scores.
0 0 10 20 30
0
100
200
300
400
IA
IAC
B
pact distribution (x-axis: impact points, y-axis: # of user accounts)
d on subsets of the most relevant attributes and topics – IA: Interact
g many links, TO: Topic-Overall, TF: Topic-Focused, LT: ‘Light’ top
negation and lines the respective mean impact scores.
0 10 20 30 0 10 20 30
0
50
100
150
200
LT
ST
E
x-axis: impact points, y-axis: # of user accounts) for ﬁve Twitter
e most relevant attributes and topics – IA: Interactive, IAC: Clique
Topic-Overall, TF: Topic-Focused, LT: ‘Light’ topics, ST: ‘Serious’
s the respective mean impact scores.
Interactive (IA) vs.
clique interactive
(IAC)
Links (L) vs.
very few links (NL)
Light topics (LT)
vs. more ‘serious’
topics (ST)

Concluding remarks
+ User-generated content is a valuable asset
> improve health surveillance tasks
> mine collective knowledge
> infer user characteristics
> numerous other tasks
+ Nonlinear models tend to perform better given the
multimodality of the feature space
+ Deep representations of text tend to improve
performance (better representations)
+ Qualitative analysis is important
> Evaluation
> Interesting insights

Future research challenges
+ Interdisciplinary research tasks require to work closer
with domain experts
+ Understand better the biases in the online media
(demographics, information propagation, external
influence etc.)
+ Attack more interesting (usually more complex)
questions, attempt to generalise findings, identify and
define limitations
+ Conduct more rigorous evaluation
+ Improve on existing methods 
(‘deeper’ understandings & interpretations)
+ Ethical concerns

Acknowledgements
Currently funded by
All collaborators (alphabetical order)
in research mentioned today
Nikolaos Aletras (Amazon), Yoram Bachrach (Microsoft
Research), Trevor Cohn (Univ. of Melbourne), Ingemar J. Cox
(UCL & Univ. of Copenhagen), Nello Cristianini (Univ. of Bristol),
Steve Crossan (Google), Jens K. Geyti (UCL), Andrew C. Miller
(Harvard Univ.), Daniel Preotiuc-Pietro (Penn), Christian
Stefansen (Google), Sviltana Volkova (PNNL), Bin Zou (UCL)

Thank you.
Any questions?
Slides can be downloaded from
lampos.net/talks-posters

References
Argyriou, Evgeniou & Pontil. Convex Multi-Task Feature Learning (Machine Learning, 2008)
Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap (ICML, 2008)
Bernstein. Language and social class (Br J Sociol, 1960)
Bouma. Normalized (pointwise) mutual information in collocation extraction (GSCL, 2009)
David Duvenaud. Automatic Model Construction with Gaussian Processes (Ph.D. Thesis, Univ of Cambridge, 2014)
Ginsberg et al. Detecting influenza epidemics using search engine query data (Nature, 2009)
Hastie, Tibshirani & Friedman. The Elements of Statistical Learning (Springer, 2009)
Labov. The Social Stratification of English in New York City (Cambridge Univ Press, 1972; 2006, 2nd ed.)
Lampos & Cristianini. Nowcasting Events from the Social Web with Statistical Learning (ACM TIST, 2012)
Lampos, Aletras, Geyti, Zou & Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour
and Language (ECIR, 2016)
Lampos, Miller, Crossan & Stefansen. Advances in nowcasting influenza-like illness rates using search query logs
(Nature Sci Rep, 2015)
Lampos, Preotiuc-Pietro, Aletras & Cohn. Predicting and Characterising User Impact on Twitter (EACL, 2014)
Lampos, Preotiuc-Pietro & Cohn. A user-centric of voting intention from Social Media (ACL, 2013)
Lazer, Kennedy, King and Vespignani. The Parable of Google Flu: Traps in Big Data Analysis (Science, 2014)
Mairal, Jenatton, Obozinski & Bach. Network Flow Algorithms for Structured Sparsity (NIPS, 2010)
Mikolov, Chen, Corrado & Dean. Efficient estimation of word representations in vector space (ICLR, 2013)
Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content (ACL, 2015)
Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras. Studying User Income through Language, Behaviour and
Affect in Social Media (PLoS ONE, 2015)
Rasmussen & Williams. Gaussian Processes for Machine Learning (MIT Press, 2006)
Tibshirani. Regression shrinkage and selection via the lasso (J R Stat Soc Series B Stat Methodol, 1996)
von Luxburg. A tutorial on spectral clustering (Stat Comput, 2007)
Zhao & Yu. On model selection consistency of lasso (JMLR, 2006)
Zou & Hastie. Regularization and variable selection via the elastic net (J R Stat Soc Series B Stat Methodol, 2005)

User-generated content: collective and personalised inference tasks

Recommandé

Recommandé

Contenu connexe

Dernier

Dernier (20)

En vedette

En vedette (20)

User-generated content: collective and personalised inference tasks