User-generated	content:	collective	
and	personalised	inference	tasks
Vasileios	Lampos	
Department	of	Computer	Science	
University	College	London	
(March,	2016;	@	DIKU)
Structure	of	the	talk
1. Introductory	remarks	
2. Collective	inference	tasks	from	user-generated	content

—	Nowcasting	flu	rates	from	Twitter	/	Google

—	Modelling	voting	intention	(bilinear	text	regression)	
3. Personalised	inference	tasks	using	social	media	

—	Occupation,	income,	socioeconomic	status	&	impact	
4. Concluding	remarks
Context	and	motivation
+ the	Internet,	the	World	Wide	Web	and	connectivity	
+ numerous	successful	web	products	feeding	from	
user	activity	
+ lots	of	user-generated	content	&	activity	logs,	e.g.	
social	media	and	search	engine	query	logs	
+ large volumes of digitised data (‘Big Data’), birth of
Data Science (nothing new in principle)
How	can	we	use	online	data	to	improve	our	society,		
interpret	human	behaviour,	and		
enhance	our	understanding	about	our	world?
User-generated	content:	Ongoing	applications
+ Health	
> disease	surveillance,	intervention	impact	
+ Finance	&	Commerce	
> financial	indices	
> consumer	satisfaction,	market	share	
+ Politics	
> estimation	of	voting	intentions	
> public	opinion	barometers	
+ Social	and	behavioural	sciences	
> complement	questionnaire	based	studies	
> approach	answers	to	unresolved	questions
Added	value	of	user-generated	content	for	health
+ Online	content	can	potentially	access	a	larger	and	more	
representative	part	of	the	population

Note:	Traditional	health	surveillance	schemes	are	based	
on	the	subset	of	people	that	actively	seek	medical	
attention	
+ More	timely	information	(almost	instant)	about	a	
disease	outbreak	in	a	population	
+ Geographical	regions	with	less	established	health	
monitoring	systems	can	greatly	benefit	
+ Small	cost	when	data	access	and	expertise	are	in	place
Collective	inference	tasks	

from	user-generated	content
Lampos	&	Cristianini,	2012;	
Lampos,	Preotiuc-Pietro	&	Cohn,	2013;	
Lampos,	Miller,	Crossan	&	Stefansen,	2015
Flu	rates	from	Twitter:	The	task
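Learn a function $f : \mathbf{X} \in \mathbb{R}^{M \times N} \rightarrow \mathbf{y} \in \mathbb{R}^{M}$
+ $\mathbf{X}$: n-gram frequency time series
+ $\mathbf{y}$: flu surveillance disease rates from a health agency
[Figure: ILI rate per 100 people, 2012 to 2014; legend: ILI rates (PHE), Bing]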
(Lampos	&	Cristianini,	2012)
Flu	rates	from	Twitter:	Lasso	for	feature	selection
Regression basics: lasso
• observations $\mathbf{x}_i \in \mathbb{R}^m$, $i \in \{1, ..., n\}$, forming $\mathbf{X}$
• responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$, forming $\mathbf{y}$
• weights, bias $w_j, \beta \in \mathbb{R}$, $j \in \{1, ..., m\}$, forming $\mathbf{w}_* = [\mathbf{w}; \beta]$
$\ell_1$-norm regularisation, also known as lasso (Tibshirani, 1996):
$$\operatorname*{argmin}_{\mathbf{w},\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta - \sum_{j=1}^{m} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{m} |w_j| \right\}$$
or
$$\operatorname*{argmin}_{\mathbf{w}_*} \left\{ \|\mathbf{X}_* \mathbf{w}_* - \mathbf{y}\|_{\ell_2}^2 + \lambda \|\mathbf{w}\|_{\ell_1} \right\}$$
Flu	rates	from	Twitter:	Bootstrap	lasso
			Lasso	may	not	always	select	the	true	model

			due	to	collinearities	in	the	feature	space	
Bootstrapping	lasso	(‘bolasso’)	for	feature	selection	
+ For	a	number	(N)	of	bootstraps,	i.e.	iterations	
> Sample	the	feature	space	with	replacement	(Xi)	
> Learn	a	new	model	(wi)	by	applying	lasso	on	Xi	and	y	
> Remember	the	n-grams	with	nonzero	weights	
+ Select	the	n-grams	with	nonzero	weights	in	p%	of	the	N	
bootstraps	
+ p	can	be	optimised;	if	p<100%,	then	‘soft	bolasso’
(Zhao	&	Yu,	2006)
(Bach,	2008)
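The bootstrap procedure above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the cited work: the lasso solver is a plain coordinate-descent routine written for the example, the bootstrap is assumed to resample observations (rows of X together with the matching entries of y), and the function names, the value of the regularisation weight and the synthetic data are all hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent: min ||y - b - Xw||^2 + lam * ||w||_1."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean           # centring handles the unpenalised bias
    col_sq = (Xc ** 2).sum(axis=0)
    w = np.zeros(X.shape[1])
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            resid += Xc[:, j] * w[j]          # take feature j out of the fit
            rho = Xc[:, j] @ resid
            # soft-thresholding update for coordinate j
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid -= Xc[:, j] * w[j]
    return w, y_mean - x_mean @ w

def soft_bolasso(X, y, lam, n_boot=50, p=0.9, seed=0):
    """Indices of features with nonzero lasso weight in at least p of the bootstraps."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # assumption: resample rows with replacement
        w, _ = lasso_cd(X[idx], y[idx], lam)
        counts += (w != 0)
    return np.flatnonzero(counts >= p * n_boot)

# Synthetic check: only features 0 and 3 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=120)
selected = soft_bolasso(X, y, lam=20.0)
print(selected)
```

Setting p=1.0 corresponds to the strict variant; lower values give the 'soft' variant described on the slide.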
Flu	rates	from	Twitter:	Performance
Fig. 8. Feature Class H – Inference for Flu case study (Round 1 of 5-fold cross validation).
[Figure: Root Mean Squared Error for 1-grams, 2-grams and Hybrid features; Soft-Bolasso vs. Baseline (correlation based feature selection). Bar values as extracted: 11.62, 13.82, 12.44, 10.57, 12.64, 11.14]
(Lampos	&	Cristianini,	2012)
Flu	rates	from	Twitter:	Selected	features
Word	cloud	with	selected	n-grams.	Font	size	is	
proportional	to	the	regression’s	weight;	n-grams	
that	are	upside-down	have	a	negative	weight.
Rainfall rates from Twitter: Generalisation
Fig. 3. Feature Class B – Inference for Rainfall case study (Round 5 of 6-fold cross validation).
Rainfall	rates	from	Twitter:	Selected	features
Word	cloud	with	selected	n-grams.	Font	size	is	
proportional	to	the	regression’s	weight;	n-grams	
that	are	upside-down	have	a	negative	weight.
Bilinear regression
• users $p \in \mathbb{Z}^+$
• observations $\mathbf{Q}_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$, forming $\mathbf{X}$
• responses $y_i \in \mathbb{R}$, $i \in \{1, ..., n\}$, forming $\mathbf{y}$
• weights, bias $u_k, w_j, \beta \in \mathbb{R}$, $k \in \{1, ..., p\}$, $j \in \{1, ..., m\}$, forming $\mathbf{u}, \mathbf{w}, \beta$
$$f(\mathbf{Q}_i) = \mathbf{u}^\top \mathbf{Q}_i \mathbf{w} + \beta$$
Bilinear regularised regression
$$\operatorname*{argmin}_{\mathbf{u},\mathbf{w},\beta} \left\{ \sum_{i=1}^{n} \left( \mathbf{u}^\top \mathbf{Q}_i \mathbf{w} + \beta - y_i \right)^2 + \psi(\mathbf{u}, \theta_u) + \psi(\mathbf{w}, \theta_w) \right\}$$
$\psi(\cdot)$: regularisation function with a set of hyper-parameters ($\theta$)
• if $\psi(\mathbf{v}, \lambda) = \lambda \|\mathbf{v}\|_{\ell_1}$: Bilinear Lasso
• if $\psi(\mathbf{v}, \lambda_1, \lambda_2) = \lambda_1 \|\mathbf{v}\|_{\ell_2}^2 + \lambda_2 \|\mathbf{v}\|_{\ell_1}$: Bilinear Elastic Net (BEN)
(Lampos, Preotiuc-Pietro & Cohn, 2013)
Bilinear elastic net (BEN): training a model
BEN’s objective function
$$\operatorname*{argmin}_{\mathbf{u},\mathbf{w},\beta} \left\{ \sum_{i=1}^{n} \left( \mathbf{u}^\top \mathbf{Q}_i \mathbf{w} + \beta - y_i \right)^2 + \lambda_{u_1} \|\mathbf{u}\|_{\ell_2}^2 + \lambda_{u_2} \|\mathbf{u}\|_{\ell_1} + \lambda_{w_1} \|\mathbf{w}\|_{\ell_2}^2 + \lambda_{w_2} \|\mathbf{w}\|_{\ell_1} \right\}$$
[Figure 2: Objective function value and RMSE (on hold-out data) through the model’s iterations (steps 2 to 30); global objective function during training in red, corresponding prediction error on held out data in blue]
Biconvex problem
+ fix u, learn w and vice versa
+ iterate through convex optimisation tasks: convergence (Al-Khayyal & Falk, 1983; Horst & Tuy, 1996)
+ FISTA (Beck & Teboulle, 2009) in SPAMS (Mairal et al., 2010): large-scale optimisation solver, quick convergence
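The alternating scheme (fix u, learn w, and vice versa) can be illustrated with a simplified sketch. To keep each sub-problem in closed form, the example swaps the elastic net penalty for a plain ridge (squared L2) penalty and does not use FISTA or SPAMS, so it shows the biconvex iteration only, not BEN itself. All function names, hyper-parameter values and the synthetic data are assumptions made for the illustration.

```python
import numpy as np

def ridge_with_bias(Z, y, lam):
    """Solve min ||Z v + b - y||^2 + lam * ||v||^2 (the bias b is not penalised)."""
    n, d = Z.shape
    A = np.hstack([Z, np.ones((n, 1))])
    P = lam * np.eye(d + 1)
    P[-1, -1] = 0.0
    sol = np.linalg.solve(A.T @ A + P, A.T @ y)
    return sol[:-1], sol[-1]

def objective(Q, y, u, w, b, lam_u, lam_w):
    pred = np.einsum("k,ikj,j->i", u, Q, w) + b   # u^T Q_i w + b for every i
    return ((pred - y) ** 2).sum() + lam_u * (u @ u) + lam_w * (w @ w)

def fit_bilinear_ridge(Q, y, lam_u=0.1, lam_w=0.1, n_steps=15):
    n, p, m = Q.shape
    u, b = np.full(p, 1.0 / p), 0.0
    history = []
    for _ in range(n_steps):
        Zw = np.einsum("k,ikj->ij", u, Q)         # fix u: rows are u^T Q_i
        w, b = ridge_with_bias(Zw, y, lam_w)
        history.append(objective(Q, y, u, w, b, lam_u, lam_w))
        Zu = np.einsum("ikj,j->ik", Q, w)         # fix w: rows are Q_i w
        u, b = ridge_with_bias(Zu, y, lam_u)
        history.append(objective(Q, y, u, w, b, lam_u, lam_w))
    return u, w, b, history

# Synthetic data: 60 observations, 4 users, 5 words.
rng = np.random.default_rng(0)
Q = rng.normal(size=(60, 4, 5))
u_true, w_true = rng.normal(size=4), rng.normal(size=5)
y = np.einsum("k,ikj,j->i", u_true, Q, w_true) + 0.5 + 0.05 * rng.normal(size=60)
u, w, b, history = fit_bilinear_ridge(Q, y)
print(history[0], history[-1])
```

Because every step minimises the objective exactly over one block of variables, the recorded objective values cannot increase from one step to the next, which mirrors the decreasing curve on the slide.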
Bilinear multi-task learning
• tasks $\tau \in \mathbb{Z}^+$
• users $p \in \mathbb{Z}^+$
• observations $\mathbf{Q}_i \in \mathbb{R}^{p \times m}$, $i \in \{1, ..., n\}$, forming $\mathbf{X}$
• responses $\mathbf{y}_i \in \mathbb{R}^{\tau}$, $i \in \{1, ..., n\}$, forming $\mathbf{Y}$
• weights, bias $\mathbf{u}_k, \mathbf{w}_j, \boldsymbol{\beta} \in \mathbb{R}^{\tau}$, $k \in \{1, ..., p\}$, $j \in \{1, ..., m\}$, forming $\mathbf{U}, \mathbf{W}, \boldsymbol{\beta}$
$$f(\mathbf{Q}_i) = \operatorname{tr}\left( \mathbf{U}^\top \mathbf{Q}_i \mathbf{W} \right) + \boldsymbol{\beta}$$
Bilinear Group l2,1 (BGL)
$$\operatorname*{argmin}_{\mathbf{U},\mathbf{W},\boldsymbol{\beta}} \left\{ \sum_{t=1}^{\tau} \sum_{i=1}^{n} \left( \mathbf{u}_t^\top \mathbf{Q}_i \mathbf{w}_t + \beta_t - y_{ti} \right)^2 + \lambda_u \sum_{k=1}^{p} \|\mathbf{U}_k\|_2 + \lambda_w \sum_{j=1}^{m} \|\mathbf{W}_j\|_2 \right\}$$
+ a feature (user or word) is usually selected (activated) for all tasks, but with different weights
+ useful in the domain of political preference inference
+ BGL can be broken into 2 convex tasks: first learn {W, β}, then {U, β} and vice versa; iterate through this process
(Argyriou et al., 2008)
Inferring	voting	intention	from	Twitter:	Data
United	Kingdom	
+ 3	parties	(Conservatives,	Labour,	Lib	Dem)	
+ 42,000	Twitter	users	distributed	proportionally	to	
UK’s	regional	population	figures	
+ 60	million	tweets	&	80,976	1-grams	extracted	
+ 240	polls	from	30	Apr.	2010	to	13	Feb.	2012
Austria	
+ 4 parties (SPÖ, ÖVP, FPÖ, GRÜ)
+ 1,100	politically	active	Twitter	users	selected	by	political	
scientists		
+ 800,000	tweets	&	22,917	1-grams	extracted	
+ 98	polls	from	25	Jan.	to	25	Dec.	2012
Inferring voting intention from Twitter: Performance
[Figure: Root Mean Squared Error for the UK and Austria; methods: Mean poll, Last poll, Elastic Net (words), BEN, BGL. Bar values as extracted: 1.439, 1.478, 1.699, 1.573, 1.442, 3.067, 1.47, 1.723, 1.851, 1.69]
(Lampos, Preotiuc-Pietro & Cohn, 2013)
Inferring	voting	intention	from	Twitter:	UK
[Figure: voting intention (%) over time for CON, LAB and LIB; panels: BEN, BGL, YouGov]
Inferring	voting	intention	from	Twitter:	Austria
[Figure: voting intention (%) over time for SPÖ, ÖVP, FPÖ and GRÜ; panels: Polls, BEN, BGL]
Inferring	voting	intention	from	Twitter:	A	qualitative	outcome
Party | Tweet | Score | User type
SPÖ (centre) | Inflation rate in Austria slightly down in July from 2.2 to 2.1%. Accommodation, Water, Energy more expensive. | 0.745 | Journalist
ÖVP (centre right) | Can really recommend the book “Res Publica” by Johannes #Voggenhuber! Food for thought and so on #Europe #Democracy | -2.323 | User
FPÖ (far right) | Campaign of the Viennese SPÖ on “Living together” plays right into the hands of right-wing populists | -3.44 | Human rights
GRÜ (centre left) | Protest songs against the closing-down of the bachelor course of International Development: <link> #ID_remains #UniBurns #UniRage | 1.45 | Student Union
Nonlinearities	in	the	data	(1)
frequency	of	search	query	
‘dry	cough’	(Google)
Nonlinearities	in	the	data	(2)
frequency	of	search	query	
‘dry	cough’	(Google)
linear
nonlinear
Gaussian	Processes	(GPs)
Based on d-dimensional input data $\mathbf{x} \in \mathbb{R}^d$, we want to learn a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$ drawn from a GP prior:
$$f(\mathbf{x}) \sim \mathcal{GP}\left( m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}') \right)$$
+ $m(\cdot)$: mean function, drawn on inputs (here 0)
+ $k(\cdot, \cdot)$: covariance function (or kernel), drawn on pairs of inputs; usually the Squared Exponential (SE) kernel (a.k.a. RBF or Gaussian) is used to encourage smooth functions
Formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution
(Rasmussen	&	Williams,	2006)
Common	covariance	functions	(kernels)
Kernel name | $k(x, x')$ | Type of structure
Squared-exp (SE) | $\sigma_f^2 \exp\left( -\frac{(x - x')^2}{2\ell^2} \right)$ | local variation
Periodic (Per) | $\sigma_f^2 \exp\left( -\frac{2}{\ell^2} \sin^2\left( \pi \frac{x - x'}{p} \right) \right)$ | repeating structure
Linear (Lin) | $\sigma_f^2 (x - c)(x' - c)$ | linear functions
[Figure 1.1: Examples of structures expressible by some basic kernels; for each kernel, a plot of k(x, x') and functions f(x) sampled from the GP prior]
(Duvenaud,	2014)
Combining	kernels	in	a	GP
Product of kernels | Type of structure
Lin × Lin | quadratic functions
SE × Per | locally periodic
Lin × SE | increasing variation
Lin × Per | growing amplitude
[Figure 1.2: Examples of one-dimensional structures expressible by multiplying kernels]
it	is	possible	to	add	or	multiply	kernels	
(among	other	operations)
(Duvenaud,	2014)
GPs	for	regression:	A	toy	example	(1)
take	some	(x,y)	pairs	with	some	obvious	
nonlinear	underlying	structure
[Figure: y (target variable) against x (predictor variable); x,y pairs, then the same pairs with an OLS fit and a GP fit]
GPs	for	regression:	A	toy	example	(2)
Addition	of	2	GP	kernels:		
periodic	+	squared	exponential	+	noise
training (dashed line), testing (solid line)
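A toy example of this kind can be reproduced with a short GP regression sketch. The code below adds a periodic and a squared exponential kernel, plus a noise term on the diagonal, and computes the posterior mean with the standard Cholesky-based equations. It is only an illustration: the hyper-parameters are fixed by hand rather than learned, and the data, kernel settings and function names are assumptions, not those behind the slide.

```python
import numpy as np

def k_se(a, b, sigma_f=3.0, ell=10.0):
    d = a[:, None] - b[None, :]
    return sigma_f ** 2 * np.exp(-d ** 2 / (2.0 * ell ** 2))

def k_per(a, b, sigma_f=3.0, ell=1.0, period=10.0):
    d = a[:, None] - b[None, :]
    return sigma_f ** 2 * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def kernel(a, b):
    return k_per(a, b) + k_se(a, b)               # sum of two kernels

def gp_predict(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and variance of a GP with a constant mean (hand-set hyper-parameters)."""
    y_mean = y_train.mean()
    K = kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    K_star = kernel(x_test, x_train)
    mean = K_star @ alpha + y_mean
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(kernel(x_test, x_test)) - (v ** 2).sum(axis=0)
    return mean, var

def true_f(x):
    return 10.0 + 0.1 * x + 3.0 * np.sin(2.0 * np.pi * x / 10.0)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 40.0, 41)
y_train = true_f(x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(0.5, 39.5, 40)
mean, var = gp_predict(x_train, y_train, x_test)
print(np.max(np.abs(mean - true_f(x_test))))
```

In practice the kernel hyper-parameters would be learned, for instance with the GPML or GPy packages listed on the next slide.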
More	information	about	GPs
+ Book	—	“Gaussian	Processes	for	Machine	Learning”

http://www.gaussianprocess.org/gpml/	
+ Tutorial	—	“Gaussian	Processes	for	Natural	Language	
Processing”

http://people.eng.unimelb.edu.au/tcohn/tutorial.html	
+ Video-lecture	—	“Gaussian	Process	Basics”

http://videolectures.net/gpip06_mackay_gpb/	
+ Software	I	—	GPML	for	Octave	or	MATLAB

http://www.gaussianprocess.org/gpml/code	
+ Software	II	—	GPy	for	Python

http://sheffieldml.github.io/GPy/
Google	Flu	Trends:	The	idea
Can we turn search query information (statistics) into
estimates of the rate of influenza-like illness
in the real-world population?
Google	Flu	Trends:	Failure
[Figure: % ILI from 07/01/09 to 07/01/13 for Google Flu, Lagged CDC, Google Flu + CDC and CDC; second panel (% baseline) for Google Flu, Lagged CDC and Google Flu + CDC. Annotations: Google estimates more than double CDC estimates; Google starts estimating high 100 out of 108 weeks]
The	estimates	of	the	online	Google	Flu	Trends	tool	were	
approx.	two	times	larger	than	the	ones	from	the	CDC	in	2012/13
(Lazer	et	al.,	2014)
Model using the log-odds of an ILI physician visit and the log-odds of an ILI-related search query:
logit(P) = β0 + β1 × logit(Q) + ε
where P is the percentage of ILI physician visits, Q is the ILI-related query fraction and β0 is the intercept
(Ginsberg	et	al.,	2009)
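The single-variable model logit(P) = β0 + β1 × logit(Q) + ε is an ordinary least squares fit in log-odds space. The sketch below shows that arithmetic under two assumptions: P and Q are given as fractions strictly between 0 and 1, and the numbers used are synthetic. The function names are hypothetical and nothing here reproduces the original Google Flu Trends pipeline.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_model(query_fracs, ili_fracs):
    """OLS fit of logit(P) = b0 + b1 * logit(Q); inputs are fractions in (0, 1)."""
    xs = [logit(q) for q in query_fracs]
    ys = [logit(p) for p in ili_fracs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def predict_ili(query_frac, b0, b1):
    return inv_logit(b0 + b1 * logit(query_frac))

# Synthetic, noise-free example generated with b0 = -1.0 and b1 = 0.8.
queries = [0.001, 0.002, 0.004, 0.008, 0.016]
ili = [inv_logit(-1.0 + 0.8 * logit(q)) for q in queries]
b0, b1 = fit_logit_model(queries, ili)
print(b0, b1, predict_ili(0.005, b0, b1))
```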
Google	Flu	Trends:	Hypotheses	for	failure
+ ‘Big	Data’	are	not	always	good	enough;	may	not	always	
capture	the	target	signal	properly	
+ The	estimates	were	based	on	a	rather	simplistic	model	
+ The	model	was	OK,	but	some	spurious	search	queries	
invalidated	the	ILI	inferences,	e.g.	‘flu	symptoms’	
+ Media	hype	about	the	topic	of	‘flu’	significantly	increased	
the	search	query	volume	from	people	that	were	just	
seeking	information	(non	patients)	
+ Side	note:	CDC’s	estimates	are	not	necessarily	the	ground	
truth;	they	can	also	go	wrong	sometimes,	although	we	
generally	assume	that	they	are	a	good	representation	of	
the	real	signal
Google	Flu	Trends	revised:	Data	(1)
Google	search	query	logs	
> geo-located	in	US	regions	
> from	4	Jan.	2004	to	28	Dec.	2013	(521	weeks,	~decade)	
> filtered	by	a	very	relaxed	health-topic	classifier	
> intersection	among	frequently	occurring	search	
queries	in	all	US	regions	
> weekly	frequencies	of	49,708	queries	(#	of	features)	
> all	data	have	been	anonymised	and	aggregated	
plus	corresponding	ILI	rates	from	the	CDC
Google	Flu	Trends	revised:	Data	(2)
Corresponding	ILI	rates	from	the	CDC
[Figure S1: CDC ILI rates for the US covering 2004 to 2013, i.e. the time span of the data used in the experiments; different colouring per flu season]
Supplementary table (caption truncated in the source): performance of the optimal GP model (10 clusters) with a different covariance function
Covariance function | r | MAE × 10² | MAPE (%)
SE | .95 | .221 | 10.8
Matérn | .95 | .228 | 11
Google	Flu	Trends	revised:	Methods	(1)
Google search query frequencies (Q) + historical CDC ILI data
> correlation filter (r > a): Q' ≤ Q
> Elastic Net: Q'' ≤ Q'
> k-means: clusters k1, k2, k3, …, kN
> GP(μ, k): ILI inference
(Lampos,	Miller,	Crossan	&	Stefansen,	2015)
Google	Flu	Trends	revised:	Methods	(2)
1. Keep	search	queries	with	r	≥	0.5	(reduces	the	amount	
of	irrelevant	queries)	
2. Apply	the	previous	model	(GFT)	to	get	a	baseline	
performance	estimate	
3. Apply	elastic	net	to	select	a	subset	of	search	queries	
and	compute	another	baseline	
4. Group	the	selected	queries	into	N	=	10	clusters	using	

k-means	to	account	for	their	different	semantics	
5. Use	a	different	GP	covariance	function	on	top	of	each	
query	cluster	to	explore	non-linearities
Google	Flu	Trends	revised:	Methods	(3)
A composite GP kernel is applied on clusters of queries. Given the search queries $\mathbf{x} = \{\mathbf{c}_1, \dots, \mathbf{c}_C\}$, where $\mathbf{c}_i$ denotes the subset of queries clustered in group i, the GP covariance function is
$$k(\mathbf{x}, \mathbf{x}') = \left( \sum_{i=1}^{C} k_{\text{SE}}(\mathbf{c}_i, \mathbf{c}_i') \right) + \sigma_n^2 \cdot \delta(\mathbf{x}, \mathbf{x}')$$
where C denotes the number of clusters, $k_{\text{SE}}$ has a different set of hyperparameters for each cluster, and the second term of the equation models noise ($\delta$ being a Kronecker delta function).
The queries are clustered with the k-means++ algorithm. The distance metric of k-means uses the cosine similarity between time series of queries, to account for the different magnitudes of the query frequencies in the data:
$$\cos\left( \mathbf{x}_{q_i}, \mathbf{x}_{q_j} \right) = \frac{\mathbf{x}_{q_i} \cdot \mathbf{x}_{q_j}}{\|\mathbf{x}_{q_i}\|_2 \, \|\mathbf{x}_{q_j}\|_2}$$
where $\mathbf{x}_{q_{\{i,j\}}}$ denotes a column of the input matrix.
+ protect	a	model	from	radical	changes	in	the	frequency	of	
single	queries	that	are	not	representative	of	a	cluster	
+ model	the	contribution	of	various	thematic	concepts	
(captured	by	different	clusters)	to	the	final	prediction	
+ learning	a	sum	of	lower-dimensional	functions:	significantly	
smaller	input	space,	much	easier	learning	task,	fewer	
samples	required,	more	statistical	traction	obtained	
- imposes	the	assumption	that	the	relationship	between	
queries	in	separate	clusters	provides	no	information	about	
ILI	(reasonable	trade-off)
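The composite covariance (a sum of per-cluster SE kernels plus a noise term) can be sketched as follows. The cluster assignments, hyper-parameter values and function names are assumptions for the example; in the described method the clusters come from k-means and the hyper-parameters are learned, neither of which is shown here. The Kronecker delta is implemented as the identity matrix, which assumes the rows of X are distinct inputs.

```python
import numpy as np

def se_kernel(A, B, sigma, ell):
    """Squared exponential kernel between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return sigma ** 2 * np.exp(-sq_dist / (2.0 * ell ** 2))

def composite_kernel(X, clusters, sigmas, ells, sigma_n):
    """Sum of SE kernels, one per query cluster, plus noise on the diagonal."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for cols, sigma, ell in zip(clusters, sigmas, ells):
        K += se_kernel(X[:, cols], X[:, cols], sigma, ell)
    return K + sigma_n ** 2 * np.eye(n)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 10 weeks of frequencies for 6 queries, grouped into 3 hypothetical clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))
clusters = [[0, 1, 2], [3, 4], [5]]
K = composite_kernel(X, clusters, sigmas=[1.0, 0.5, 2.0], ells=[1.0, 2.0, 0.5], sigma_n=0.1)
print(K.shape, cosine_similarity(X[:, 0], 10.0 * X[:, 0]))
```

Each SE kernel only sees the columns of its own cluster, which is what makes the learned function a sum of lower-dimensional functions.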
Google Flu Trends revised: Results (1)
Google Flu Trends revised: Results (2)
Mean absolute percentage (%) of error (MAPE) in flu rate estimates during a 5-year period (2008-2013)
Model | Test data | Test data; peaking moments
Google Flu Trends old model | 20.4% | 24.8%
Elastic Net | 11.9% | 15.8%
Gaussian Process | 10.8% | 11%
Google	Flu	Trends	revised:	Results	(3)
‘rsv’	—	25%	
‘flu	symptoms’	—	18%	
‘benzonatate’	—		6%	
‘symptoms	of	pneumonia’	—		6%	
‘upper	respiratory	infection’	—		4%
impact	of	automatically	selected	queries	in	
a	flu	estimate	during	the	over-predictions
previous	GFT	model
Google	Flu	Trends	revised:	Methods	(4)
Auto-regressive moving average with exogenous inputs (ARMAX)
ARMAX combines an autoregressive component (p), a moving average component (q), and a regression element. Given the sequential observations $y_1, \dots, y_T$ and a D-dimensional exogenous input $\mathbf{h}_t$, it specifies the relationship
$$y_t = \sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \sum_{i=1}^{D} w_i h_{t,i} + \varepsilon_t$$
where the $\phi_i$, $\theta_i$ and $w_i$ are coefficients to be learned and $\varepsilon_t$ is zero-mean noise with unknown variance.
+ AR component: $\sum_{i=1}^{p} \phi_i y_{t-i}$
+ Moving average component: $\sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$
+ Exogenous input: $\sum_{i=1}^{D} w_i h_{t,i}$; instead of using all available query fractions, the single prediction result (D = 1) from a query model, $\hat{y}_t$, is used
Seasonal ARMAX
The seasonality component incorporates yearly lags; the length of the season is fixed to 52 weeks (1 year). The orders p and q, as well as the seasonal orders, are determined automatically.
$$y_t = \underbrace{\sum_{i=1}^{p} \phi_i y_{t-i} + \sum_{i=1}^{J} \omega_i y_{t-52-i}}_{\text{AR and seasonal AR}} + \underbrace{\sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \sum_{i=1}^{K} \nu_i \varepsilon_{t-52-i}}_{\text{MA and seasonal MA}} + \underbrace{\sum_{i=1}^{D} w_i h_{t,i}}_{\text{regression}} + \varepsilon_t$$
Predictive intervals are estimated through the maximum likelihood variance of the model.
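The autoregressive and exogenous parts of this model can be illustrated with a reduced sketch. The code below fits an ARX model, i.e. ARMAX without the moving average and seasonal terms, by ordinary least squares with a single exogenous input (D = 1). Dropping those terms is a simplification chosen to keep the example short; the function names and the synthetic series are assumptions.

```python
import numpy as np

def fit_arx(y, h, p):
    """Least-squares fit of y_t = sum_{i=1..p} phi_i * y_{t-i} + w * h_t."""
    rows = [np.concatenate([y[t - p:t][::-1], [h[t]]]) for t in range(p, len(y))]
    coef, *_ = np.linalg.lstsq(np.array(rows), y[p:], rcond=None)
    return coef[:p], coef[p]

def nowcast(y_past, h_now, phi, w):
    """One-step estimate from the last len(phi) observations and the current input."""
    lags = y_past[-1:-len(phi) - 1:-1]            # y_{t-1}, ..., y_{t-p}
    return float(phi @ lags + w * h_now)

# Noise-free synthetic series with phi = (0.5, 0.2) and w = 0.8.
rng = np.random.default_rng(0)
h = rng.normal(size=200)                          # stands in for the query-model estimate
y = np.zeros(200)
for t in range(2, 200):
    y[t] = 0.5 * y[t - 1] + 0.2 * y[t - 2] + 0.8 * h[t]
phi, w = fit_arx(y, h, p=2)
print(phi, w, nowcast(y[:199], h[199], phi, w))
```

A full seasonal ARMAX fit would also estimate the moving average and 52-week lag coefficients, typically by maximum likelihood rather than least squares.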
Google Flu Trends revised: Results (4)
Figure 2. Comparison of nowcasts between an autoregressive baseline model which is based only on
Google	Flu	Trends	revised:	Results	(5)
MAPE (%) in flu rate autoregressive (AR) estimates during a 4-year period (2009-2013)
[Figure: MAPE (%) on test data and on test data at peaking moments; models: Google Flu Trends old model (AR), Elastic Net (AR), Gaussian Process (AR), CDC (AR). Bar values as extracted: 14.3, 7.5%, 7.3%, 8.3%, 7.7%, 13%, 10.2%]
Personalised	inference	tasks	
using	social	media	content
Lampos,	Aletras,	Preotiuc-Pietro	&	Cohn,	2014;	
Preotiuc-Pietro,	Lampos	&	Aletras,	2015;	
Preotiuc-Pietro,	Volkova,	Lampos,	Bachrach	&	Aletras,	2015;	
Lampos, Aletras, Geyti, Zou & Cox, 2016
Occupational	class	inference:	Motivation
“Socioeconomic variables are influencing language use.”
(Bernstein, 1960; Labov, 1972/2006)
+ Validate this hypothesis on a broader, larger data set using social media (Twitter)
+ Downstream applications
> research (social science & other domains)
> commercial
+ Proxy for additional user attributes, e.g. income and socioeconomic status
(Preotiuc-Pietro, Lampos & Aletras, 2015)
Occupational	class	inference:	SOC	2010
C1	—	Managers,	Directors	&	Senior	Officials	
e.g.	chief	executive,	bank	manager	
C2	—	Professional	Occupations	(e.g.	mechanical	engineer,	pediatrist)	
C3	—	Associate	Professional	&	Technical	
e.g.	system	administrator,	dispensing	optician	
C4	—	Administrative	&	Secretarial	(e.g.	legal	clerk,	secretary)	
C5	—	Skilled	Trades	(e.g.	electrical	fitter,	tailor)	
C6	—	Caring,	Leisure,	Other	Service	
e.g.	nursery	assistant,	hairdresser	
C7	—	Sales	&	Customer	Service	(e.g.	sales	assistant,	telephonist)	
C8	—	Process,	Plant	and	Machine	Operatives	
e.g.	factory	worker,	van	driver	
C9	—	Elementary	(e.g.	shelf	stacker,	bartender)
Standard	Occupational	Classification	(SOC)
Occupational	class	inference:	Data
+ 5,191	Twitter	users	mapped	to	their	occupations,	then	
mapped	to	one	of	the	9	SOC	categories	
+ 10	million	tweets	
+ Download	the	data	set
[Figure: % of users per SOC category, C1 to C9]
Occupational	class	inference:	Features
User	attributes	(18)	
+ number	of	followers,	friends,	listings,	follower/friend	
ratio,	favourites,	tweets,	retweets,	hashtags,	@-mentions,	
@-replies,	links	and	so	on	
Topics	—	Word	clusters	(200)	
+ SVD on the graph Laplacian of the word x word similarity
matrix	using	normalised	PMI,	i.e.	a	form	of	spectral	
clustering	
+ Skip-gram	model	with	negative	sampling	to	learn	word	
embeddings	(Word2Vec);	pairwise	cosine	similarity	on	the	
embeddings	to	derive	a	word	x	word	similarity	matrix;	
then	spectral	clustering	on	the	similarity	matrix
(Bouma,	2009;	von	Luxburg,	2007)
(Mikolov	et	al.,	2013)
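The normalised PMI (NPMI) similarity mentioned above has a compact definition: the pointwise mutual information of two words divided by the negative log of their joint probability, which bounds the score between -1 and 1. The sketch below computes it from document-level co-occurrence; treating each tweet as one document, the function name and the tiny corpus are assumptions, and the spectral clustering step that follows is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_scores(docs):
    """NPMI for every word pair that co-occurs in at least one document."""
    n = len(docs)
    word_count, pair_count = Counter(), Counter()
    for doc in docs:
        words = sorted(set(doc))                  # presence or absence per document
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    scores = {}
    for (a, b), c in pair_count.items():
        p_ab = c / n
        if p_ab == 1.0:
            continue                              # NPMI is undefined when -log p(a, b) = 0
        pmi = math.log(p_ab / ((word_count[a] / n) * (word_count[b] / n)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores

docs = [["flu", "fever"], ["flu", "fever"], ["rain"], ["sun"]]
scores = npmi_scores(docs)
print(scores)
```

Pairs that never co-occur are left out of the result; in a similarity matrix they would usually be assigned the minimum score.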
Occupational	class	inference:	Performance
Accuracy (%) per feature type and classifier (the plot also marks a most frequent class baseline)
Feature type | Logistic Regression | SVM (RBF) | Gaussian Process (SE-ARD)
User Attributes | 34 | 31.5 | 34.2
SVD-200-clusters | 44.2 | 47.9 | 48.2
Word2Vec-200-clusters | 46.9 | 51.7 | 52.7
Occupational	class	inference:	Topic	CDFs	(1)
Feature Analysis - Cumulative Density Functions
[Figure: user probability against topic proportion (0.001 to 0.05) for classes C1 to C9; topic: Higher Education (#21)]
Topic more prevalent in a class (C1-C9), if the line leans closer to the bottom-right corner of the plot
Occupational	class	inference:	Topic	CDFs	(2)
Feature Analysis - Cumulative Density Functions
[Figure: user probability against topic proportion (0.001 to 0.05) for classes C1 to C9; topic: Arts (#116)]
Topic more prevalent in a class (C1-C9), if the line leans closer to the bottom-right corner of the plot
Occupational	class	inference:	Topic	CDFs	(3)
Feature Analysis - Cumulative Density Functions
[Figure: user probability against topic proportion (0.001 to 0.05) for classes C1 to C9; topic: Elongated Words (#164)]
Topic more prevalent in a class (C1-C9), if the line leans closer to the bottom-right corner of the plot
Occupational	class	inference:	Topic	similarity
[Figure 3: Jensen-Shannon divergence in the topic distributions between the different occupational classes (C1 to C9); both axes: occupational class]
Topic distribution distance (Jensen-Shannon divergence)
for the different occupational classes
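For reference, the Jensen-Shannon divergence between two topic distributions is the average Kullback-Leibler divergence of each distribution from their midpoint. The sketch below uses base-2 logarithms, so the value lies between 0 and 1; the choice of base and the example distributions are assumptions, since the slide does not state them.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions for two occupational classes.
class_a = [0.5, 0.3, 0.2]
class_b = [0.2, 0.3, 0.5]
print(js_divergence(class_a, class_b))
```

Unlike the Kullback-Leibler divergence, this measure is symmetric, which suits a class-by-class distance matrix.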
Income	inference:	Data
[Figure: histogram of yearly income (£), 10k to 100k, against number of users]
+ 5,191	Twitter	users	(same	as	in	the	previous	study)	
mapped	to	their	occupations,	then	mapped	to	an	
average	income	in	GBP	(£)	using	the	SOC	taxonomy	
+ approx.	11	million	tweets	
+ Download	the	data	set
(Preotiuc-Pietro,	Volkova,	
Lampos,	Bachrach	&	
Aletras,	2015)
Income	inference:	Features
+ Profile	(8)

e.g.	#followers,	#followees,	times	listed	etc.	
+ Shallow	textual	features	(10)

e.g.	proportion	of	hashtags,	@-replies,	@-mentions	etc.	
+ Inferred	(perceived)	psycho-demographic	features	(15)

e.g.	gender,	age,	education	level,	religion,	life	
satisfaction,	excitement,	anxiety	etc.	
+ Emotions	(9)

e.g.	positive	/	negative	sentiment,	joy,	anger,	fear,	
disgust,	sadness,	surprise	etc.	
+ Word	clusters	—	Topics	of	discussion	(200)

based	on	word	embeddings	and	by	applying	spectral	
clustering
Income	inference:	Performance
Income inference error (Mean Absolute Error) using GP regression or a linear ensemble for all features
Feature category | MAE
Profile | £11,291
Demo | £10,110
Emotion | £10,980
Shallow | £11,456
Topics | £9,621
All features | £9,535
Income	inference:	Qualitative	analysis	(1)
[Figure: income (28,000 to 42,000) against feature value, one panel per emotion: e1 positive (l=46.27), e2 neutral (l=57.64), e3 negative (l=76.34), e4 joy (l=36.37), e5 sadness (l=67.05), e6 disgust (l=116.66), e7 anger (l=95.50), e8 surprise (l=83.61), e9 fear (l=31.74)]
Relating	income	and	emotion
Linear	vs	GP	fit
Income	inference:	Qualitative	analysis	(2)
[Figure: income (30,000 to 50,000) against feature value, one panel per topic: Topic 107 (Justice), Topic 124 (Corporate 1), Topic 139 (Politics), Topic 163 (NGOs), Topic 196 (Web analytics/Surveys), Topic 99 (Swearing)]
Relating	income	and	topics	of	discussion
Linear	vs	GP	fit
Inferring	the	socioeconomic	status:	Task
Profile description on Twitter → Occupation → SOC category [1] → NS-SEC [2]
1. Standard	Occupational	Classification:	369	job	groupings	
2. National	Statistics	Socio-Economic	Classification:	Map	from	
the	job	groupings	in	SOC	to	a	socioeconomic	status,	i.e.	
{upper,	middle	or	lower}
(Lampos,	Aletras,	Geyti,	Zou	&	Cox,	2016)
Inferring	the	socioeconomic	status:	Data	&	Features
+ 1,342	Twitter	user	profiles

distinct	data	set	from	the	previous	works	
+ 2	million	tweets	
+ Date	interval:	Feb.	1,	2014	to	March	21,	2015	
+ Each	user	has	a	socioeconomic	status	(SES)	label:

{upper,	middle,	lower}	
+ Download	the	data	set	
1,291	features	representing		
user	behaviour	(4),	biographical	/	profile	information	
(523),	text	in	the	tweets	(560),	topics	of	discussion	(200),	
and	impact	on	the	platform	(4)
Inferring	the	socioeconomic	status:	Results
Confusion matrices for the 2- and 3-way classification (P: precision, R: recall)
2-way | T1 | T2 | P
O1 | 584 | 115 | 83.5%
O2 | 126 | 517 | 80.4%
R | 82.3% | 81.8% | 82.0%
3-way | T1 | T2 | T3 | P
O1 | 606 | 84 | 53 | 81.6%
O2 | 49 | 186 | 45 | 66.4%
O3 | 55 | 48 | 216 | 67.7%
R | 85.4% | 58.5% | 68.8% | 75.1%
Classification performance (using a GP classifier)
Classification | Accuracy (%) | Precision (%) | Recall (%) | F1
2-way | 82.05 (2.4) | 82.2 (2.4) | 81.97 (2.6) | .821 (.03)
3-way | 75.09 (3.3) | 72.04 (4.4) | 70.76 (5.7) | .714 (.05)
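The percentages in the margins of the confusion matrices follow directly from the counts. The sketch below recomputes them under the assumption that rows (O) hold the classifier outputs and columns (T) the target classes, a reading chosen because it reproduces the reported figures; the function name is hypothetical.

```python
def precision_recall_accuracy(cm):
    """cm[i][j]: users assigned to class i (row) whose target class is j (column)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    precision = [cm[i][i] / sum(cm[i]) for i in range(k)]
    recall = [cm[j][j] / sum(cm[i][j] for i in range(k)) for j in range(k)]
    accuracy = sum(cm[i][i] for i in range(k)) / total
    return precision, recall, accuracy

two_way = [[584, 115],
           [126, 517]]
three_way = [[606, 84, 53],
             [49, 186, 45],
             [55, 48, 216]]

for name, cm in (("2-way", two_way), ("3-way", three_way)):
    precision, recall, accuracy = precision_recall_accuracy(cm)
    print(name,
          [round(100 * v, 1) for v in precision],
          [round(100 * v, 1) for v in recall],
          round(100 * accuracy, 1))
```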
Characterising	user	impact:	Task	&	Data
User impact: a simplified definition
$$S(\phi_{in}, \phi_{out}, \phi_{\lambda}) = \ln\left(\frac{(\phi_{\lambda} + \theta)\,(\phi_{in} + \theta)^2}{\phi_{out} + \theta}\right)$$
+ $\phi_{in}$: number of followers, $\phi_{out}$: number of followees
+ $\phi_{\lambda}$: number of times the account has been listed
+ $\theta = 1$, so that the logarithm is applied on a positive number
+ $\phi_{in}^2 / \phi_{out} = (\phi_{in} - \phi_{out}) \times (\phi_{in} / \phi_{out}) + \phi_{in}$
[Figure: Histogram of the user impact scores in our data set; x-axis: Impact Score (S), y-axis: Probability Density; µ(S) = 6.776. Accounts marked on the histogram: @guardian, @David_Cameron, @PaulMasonNews, @lampos (Vasileios Lampos), @nikaletras (Nikolaos Aletras), @spam?]
40K Twitter accounts (UK) considered
(Lampos et al., 2014)
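The impact score defined on this slide (the log of listings times squared followers over followees, each shifted by θ = 1) is straightforward to compute. A minimal sketch follows; the function and argument names are illustrative choices, not the authors' code.

```python
import math


def impact_score(followers, followees, listed, theta=1.0):
    """Simplified user impact score S from the slide.

    S = ln((listed + theta) * (followers + theta)^2 / (followees + theta))
    theta = 1 keeps the argument of the logarithm positive even when
    all three counts are zero.
    """
    numerator = (listed + theta) * (followers + theta) ** 2
    return math.log(numerator / (followees + theta))


if __name__ == "__main__":
    # Hypothetical accounts, for illustration only.
    print(impact_score(followers=0, followees=0, listed=0))
    print(impact_score(followers=99, followees=9, listed=0))
    print(impact_score(followers=50_000, followees=200, listed=800))
```

An account with no followers, followees or listings scores exactly 0, and the score grows with followers and listings while shrinking with followees, which is consistent with very low scores being flagged as possible spam on the histogram.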
Characterising	user	impact:	Topic	entropy
[Figure 4: User impact distribution for accounts with high (blue) and low (dark grey) topic entropy, H(u_i, τ). x-axis: Impact Score (S); y-axis: Number of Accounts; series: All, Low Entropy, High Entropy. Lines denote the respective mean impact scores; the mean of the entire sample is 6.776.]
On	average,	the	higher	the	user	impact	score,	
the	higher	the	topic	entropy
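The slide does not spell out how topic entropy is computed. As a rough sketch, assuming it is the Shannon entropy (in bits) of a user's distribution over topics, it could be calculated as below; the function name and the count-based input are hypothetical.

```python
import math


def topic_entropy(topic_counts):
    """Shannon entropy (bits) of a user's topic distribution.

    topic_counts: non-negative counts (or weights) of how much the user
    talks about each topic. This is an assumed definition, not one
    confirmed by the slide.
    """
    total = sum(topic_counts)
    if total == 0:
        return 0.0
    probabilities = [c / total for c in topic_counts if c > 0]
    return abs(sum(p * math.log2(p) for p in probabilities))


if __name__ == "__main__":
    focused = [40, 0, 0, 0]     # talks about one topic only
    broad = [10, 10, 10, 10]    # spreads evenly over four topics
    print(topic_entropy(focused), topic_entropy(broad))
```

Under this assumption a user who sticks to one topic has entropy 0, and a user who spreads evenly over k topics has entropy log2(k), so higher values mean a broader range of topics.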
Characterising	user	impact:	Use	case	scenarios
Impact distribution under user behaviour scenarios
[Figure: Impact distribution (x-axis: impact points, y-axis: # of user accounts) for Twitter use case scenarios based on subsets of the most relevant attributes and topics; lines denote the respective mean impact scores. Panels shown:]
+ Interactive (IA) vs. clique interactive (IAC)
+ Links (L) vs. very few links (NL)
+ Light topics (LT) vs. more 'serious' topics (ST)
Concluding	remarks
+ User-generated	content	is	a	valuable	asset	
> improve	health	surveillance	tasks	
> mine	collective	knowledge	
> infer	user	characteristics	
> numerous	other	tasks	
+ Nonlinear	models	tend	to	perform	better	given	the	
multimodality	of	the	feature	space	
+ Deep	representations	of	text	tend	to	improve	
performance	(better	representations)	
+ Qualitative	analysis	is	important	
> Evaluation	
> Interesting	insights
Future	research	challenges
+ Interdisciplinary research tasks require closer work with domain experts
+ Better understand the biases in online media (demographics, information propagation, external influence, etc.)
+ Attack	more	interesting	(usually	more	complex)	
questions,	attempt	to	generalise	findings,	identify	and	
define	limitations	
+ Conduct	more	rigorous	evaluation	
+ Improve on existing methods (‘deeper’ understandings & interpretations)
+ Ethical	concerns
Acknowledgements
Currently	funded	by
All	collaborators	(alphabetical	order)		
in	research	mentioned	today	
Nikolaos	Aletras	(Amazon),	Yoram	Bachrach	(Microsoft	
Research),	Trevor	Cohn	(Univ.	of	Melbourne),	Ingemar	J.	Cox	
(UCL	&	Univ.	of	Copenhagen),	Nello	Cristianini	(Univ.	of	Bristol),	
Steve	Crossan	(Google),	Jens	K.	Geyti	(UCL),	Andrew	C.	Miller	
(Harvard	Univ.),	Daniel	Preotiuc-Pietro	(Penn),	Christian	
Stefansen (Google), Svitlana Volkova (PNNL), Bin Zou (UCL)
Thank	you.	
Any	questions?
Slides	can	be	downloaded	from	
lampos.net/talks-posters
References
Argyriou,	Evgeniou	&	Pontil.	Convex	Multi-Task	Feature	Learning	(Machine	Learning,	2008)	
Bach.	Bolasso:	Model	Consistent	Lasso	Estimation	through	the	Bootstrap	(ICML,	2008)	
Bernstein.	Language	and	social	class	(Br	J	Sociol,	1960)	
Bouma.	Normalized	(pointwise)	mutual	information	in	collocation	extraction	(GSCL,	2009)	
Duvenaud. Automatic Model Construction with Gaussian Processes (Ph.D. Thesis, Univ of Cambridge, 2014)
Ginsberg	et	al.	Detecting	influenza	epidemics	using	search	engine	query	data	(Nature,	2009)	
Hastie,	Tibshirani	&	Friedman.	The	Elements	of	Statistical	Learning	(Springer,	2009)	
Labov.	The	Social	Stratification	of	English	in	New	York	City	(Cambridge	Univ	Press,	1972;	2006,	2nd	ed.)	
Lampos	&	Cristianini.	Nowcasting	Events	from	the	Social	Web	with	Statistical	Learning	(ACM	TIST,	2012)	
Lampos,	Aletras,	Geyti,	Zou	&	Cox.	Inferring	the	Socioeconomic	Status	of	Social	Media	Users	based	on	Behaviour	
and	Language	(ECIR,	2016)	
Lampos,	Miller,	Crossan	&	Stefansen.	Advances	in	nowcasting	influenza-like	illness	rates	using	search	query	logs	
(Nature	Sci	Rep,	2015)	
Lampos,	Preotiuc-Pietro,	Aletras	&	Cohn.	Predicting	and	Characterising	User	Impact	on	Twitter	(EACL,	2014)	
Lampos, Preotiuc-Pietro & Cohn. A user-centric model of voting intention from Social Media (ACL, 2013)
Lazer, Kennedy, King & Vespignani. The Parable of Google Flu: Traps in Big Data Analysis (Science, 2014)
Mairal,	Jenatton,	Obozinski	&	Bach.	Network	Flow	Algorithms	for	Structured	Sparsity	(NIPS,	2010)	
Mikolov,	Chen,	Corrado	&	Dean.	Efficient	estimation	of	word	representations	in	vector	space	(ICLR,	2013)	
Preotiuc-Pietro,	Lampos	&	Aletras.	An	analysis	of	the	user	occupational	class	through	Twitter	content	(ACL,	2015)	
Preotiuc-Pietro,	Volkova,	Lampos,	Bachrach	&	Aletras.	Studying	User	Income	through	Language,	Behaviour	and	
Affect	in	Social	Media	(PLoS	ONE,	2015)	
Rasmussen	&	Williams.	Gaussian	Processes	for	Machine	Learning	(MIT	Press,	2006)	
Tibshirani.	Regression	shrinkage	and	selection	via	the	lasso	(J	R	Stat	Soc	Series	B	Stat	Methodol,	1996)	
von	Luxburg.	A	tutorial	on	spectral	clustering	(Stat	Comput,	2007)	
Zhao	&	Yu.	On	model	selection	consistency	of	lasso	(JMLR,	2006)	
Zou	&	Hastie.	Regularization	and	variable	selection	via	the	elastic	net	(J	R	Stat	Soc	Series	B	Stat	Methodol,	2005)
