Twitterology - The Science of Twitter

Bruno Gonçalves
www.bgoncalves.com

www.bgoncalves.com @bgoncalves
The Internet In Real Time


Social Media

Twitter

Anatomy of a Tweet

[u'contributors',
u'truncated',
u'text',
u'in_reply_to_status_id',
u'id',
u'favorite_count',
u'source',
u'retweeted',
u'coordinates',
u'entities',
u'in_reply_to_screen_name',
u'in_reply_to_user_id',
u'retweet_count',
u'id_str',
u'favorited',
u'user',
u'geo',
u'in_reply_to_user_id_str',
u'possibly_sensitive',
u'lang',
u'created_at',
u'in_reply_to_status_id_str',
u'place',
u'metadata']
[u'follow_request_sent',
u'profile_use_background_image',
u'default_profile_image',
u'id',
u'profile_background_image_url_https',
u'verified',
u'profile_text_color',
u'profile_image_url_https',
u'profile_sidebar_fill_color',
u'entities',
u'followers_count',
u'profile_sidebar_border_color',
u'id_str',
u'profile_background_color',
u'listed_count',
u'is_translation_enabled',
u'utc_offset',
u'statuses_count',
u'description',
u'friends_count',
u'location',
u'profile_link_color',
u'profile_image_url',
u'following',
u'geo_enabled',
u'profile_banner_url',
u'profile_background_image_url',
u'screen_name',
u'lang',
u'profile_background_tile',
u'favourites_count',
u'name',
u'notifications',
u'url',
u'created_at',
u'contributors_enabled',
u'time_zone',
u'protected',
u'default_profile',
u'is_translator']
http://www.bgoncalves.com/teaching/data-mining.html
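Key lists like the two above can be obtained by parsing a tweet's JSON and listing the keys of the tweet and of its embedded user object. The following is a minimal sketch under that assumption; the payload is invented for illustration and carries only a handful of the fields listed above.

```python
import json

# Invented example payload: a real tweet has many more fields
# (see the key lists above); all values here are made up.
raw = '''{"id": 1, "id_str": "1", "text": "hello", "lang": "en",
 "user": {"id": 7, "screen_name": "example_user", "followers_count": 10}}'''

tweet = json.loads(raw)

# Top-level fields of the tweet and fields of the nested user object.
tweet_keys = sorted(tweet.keys())
user_keys = sorted(tweet["user"].keys())

print(tweet_keys)
print(user_keys)
```

Sorting the keys is only for readability; the order in which an API returns fields should not be relied on.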

Anatomy of a Tweet
coordinates:
[u'type',
u'coordinates']
entities:
[u'symbols',
u'user_mentions',
u'hashtags',
u'urls']
source:
u'<a href="http://foursquare.com" rel="nofollow">foursquare</a>'
text:
u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"
urls entry:
{u'display_url': u'4sq.com/1k5MeYF',
u'expanded_url': u'http://4sq.com/1k5MeYF',
u'indices': [70, 92],
u'url': u'http://t.co/WirvdHwYMq'}
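The `indices` pair in the URL entry marks where the shortened link sits inside the tweet text. As a small sketch using the values shown on this slide (the variable names are illustrative), the offsets can be used to pull out the short URL and swap in the expanded one:

```python
# Values copied from the slide; \xe1 is the character "á".
text = (u"I'm at Terminal Rodovi\xe1rio de Feira de Santana "
        u"(Feira de Santana, BA) http://t.co/WirvdHwYMq")
url_entity = {
    "display_url": "4sq.com/1k5MeYF",
    "expanded_url": "http://4sq.com/1k5MeYF",
    "indices": [70, 92],
    "url": "http://t.co/WirvdHwYMq",
}

# indices are [start, end) character offsets into the text.
start, end = url_entity["indices"]
short_url = text[start:end]

# Replace the shortened link with its expanded form.
expanded_text = text[:start] + url_entity["expanded_url"] + text[end:]

print(short_url)
print(expanded_text)
```

When a tweet carries several entities, replacing from the last one to the first keeps the earlier offsets valid.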

Market Penetration
PLoS One 8, e61981 (2013)

World Coverage

Age Distribution
PLoS One 10, e0115545 (2015)

Demographics
users for whom we could infer a gender, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: fully 71.8% of the users for whom we find a name match had a male name.
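The matching step described above can be sketched as follows. This is an illustration only: the lookup table `GENDER_BY_FIRST_NAME` and the sample names are hypothetical stand-ins, and the original study used a much larger name list.

```python
# Hypothetical, tiny name-to-gender table used only for illustration.
GENDER_BY_FIRST_NAME = {"maria": "F", "john": "M", "ana": "F", "peter": "M"}


def infer_gender(display_name):
    """Look up the first word of a self-reported name; None if no match."""
    words = display_name.strip().split()
    if not words:
        return None
    return GENDER_BY_FIRST_NAME.get(words[0].lower())


# Made-up self-reported names, including ones that cannot be matched.
names = ["John Smith", "maria", "xX_gamer_Xx", "Peter P.", ""]
inferred = [infer_gender(n) for n in names]

matched = [g for g in inferred if g is not None]
match_rate = len(matched) / len(names)            # share of users with a match
male_share = matched.count("M") / len(matched)    # share of matches that are male

print(inferred, match_rate, male_share)
```

The two ratios correspond to the two percentages quoted in the text: the fraction of users with a name match, and the fraction of matched users with a male name.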
Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time. Axes: fraction of joining users who are male (0 to 1) versus date (2007-01 to 2009-07).
each last name with over 100 individuals in the U.S.
ing the 2000 Census, the Census releases the distributio
race/ethnicity for that last name. For example, the last n
“Myers” was observed to correspond to Caucasians 86%
the time, African-Americans 9.7%, Asians 0.4%, and
panics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users
whom we could infer the race/ethnicity by comparing
last word of their self-reported name to the U.S. Ce
last name list. We observed that we found a match
71.8% of the users. We then determined the distributio
race/ethnicity in each county by taking the race/ethn
distribution in the Census list, weighted by the freque
of each name occurring in Twitter users in that coun
Due to the large amount of ambiguity in the last name
race/ethnicity list (in particular, the last name list is m
than 95% predictive for only 18.5% of the users), we are
able to directly compare the Twitter race/ethnicity distr
1
This is effectively the census.model approach discuss
prior work (Chang et al. 2010).
(a) Normal representation
Figure 2: Per-county over- and underrepresentation of U.S. po
tation rate of 0.324%, presented in both (a) a normal layout an
Blue colors indicate underrepresentation, while red colors repre
to the log of the over- or underrepresentation rate. Clear trend
overrepresentation of populous counties.
less than 95% predictive (e.g., the name Avery was observed
to correspond to male babies only 56.8% of the time; it was
Undersampling
Oversampling
(a) Caucasian (non-hispanic) (b) African-American (c) Asian or Pacific Islander (d) Hispanic
Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and
Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity
are shown. Blue regions correspond to undersampling; red regions to oversampling.
ICWSM’11, 375 (2011)

Network Structure

Twitter Network
TIME). The top 20 are listed in Figure 7. Some of them follow the
followers, but most of them do not (the median number of follow
ings of the top 40 users is 114, three orders of magnitude small
than the number of followers). We revisit the issue of reciprocity
Section 3.3.
3.2 Followers vs. Tweets
Figure 2: The number of followers and that of tweets per use
In order to gauge the correlation between the number of follow
ers and that of written tweets, we plot the number of tweets (y
against the number of followers a user has (x) in Figure 2. We b
the number of followers in logscale and plot the median per bin
the dashed line. The majority of users who have fewer than 10 fo
lowers never tweeted or did just once and thus the median stays at
The average number of tweets against the number of followers p
ompared against each other. Before we delve into the eccen-
es and peculiarities of Twitter, we run a batch of well-known
sis and present the summary.
Basic Analysis
Figure 1: Number of followings and followers
construct a directed network based on the following and fol-
d and analyze its basic characteristics. Figure 1 displays the
bution of the number of followings as the solid line and that of
wers as the dotted line. The y-axis represents complementary
lative distribution function (CCDF). We ﬁrst explain the dis-
re are outliers
of followers.
n x = 100 to
sure, but only state the correlation between the numbers of tweets
and followers.
3.3 Reciprocity
In Section 3.1 we briefly mention that top users by the number of followers in Twitter are mostly celebrities and mass media and most of them do not follow their followers back. In fact Twitter shows a low level of reciprocity; 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have a reciprocal relationship between them. We call those r-friends of a user as they reciprocate a user’s following. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [4] and 84% on Yahoo! 360 [18].
Moreover, 67.6% of users are not followed by any of their followings in Twitter. We conjecture that for these users Twitter is rather a source of information than a social networking site. Further validation is out of the scope of this paper and we leave it for future work.
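The reciprocity figure quoted above is a ratio over connected user pairs. A minimal sketch of that computation on a toy follower list (the edge list and function name are illustrative, not taken from the paper):

```python
def reciprocity(edges):
    """Fraction of connected user pairs whose link goes both ways.

    `edges` is an iterable of (follower, followed) pairs.
    """
    directed = {(a, b) for a, b in edges if a != b}
    # Each unordered pair with at least one link between its members.
    pairs = {frozenset(e) for e in directed}
    # A mutual pair contributes two directed edges, hence the division by 2.
    mutual = sum(1 for a, b in directed if (b, a) in directed) // 2
    return mutual / len(pairs) if pairs else 0.0


# Toy network: only a and b follow each other.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a"), ("c", "d")]
toy_reciprocity = reciprocity(toy_edges)
print(toy_reciprocity)
```

In the toy network one of the four connected pairs is mutual; the paper's 22.1% is the same ratio measured on the full follower graph.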
3.4 Degree of Separation
WWW'10, 591 (2010)

Retweet Trees
ce Size of Retweet
age and median numbers of additional recipi-
via retweeting
be to mass media in various forms: radio, TV, and
y are immediate recipients and consumers of the
hed media produce. On Twitter people acquire
lways directly from those they follow, but often
suming a tweet posted by a user is viewed and
of the user’s followers, we count the number of
nts who are not immediate followers of the orig-
Figure 14 displays its average and median per
number of followers of the original tweet user.
almost always below the average, indicating that
a very large number of additional recipients. Up
llowers, the average number of additional recipi-
d by the number of followers of the tweet source.
WWW'10, 591 (2010)

Retweets Trees
Figure 15: Retweet trees of ‘air france flight’ tweets
Figure 16: Height and participating users in retweet trees
etweeting the same tweet, and cross-retweet is retweeting each
ther.
In Figure 16 we plot the CCDFs of the retweet tree heights and
he number of users in a retweet tree. The height of 1 is the most
6. IMPACT OF RETWEET
We have seen how trending topics rise in popularity and ev
ally die in Section 5. Then how exactly does information spre
Twitter? Retweet is an effective means to relay the informatio
yond adjacent neighbors. We dig into the retweet trees constr
per trending topic and examine key factors that impact the eve
spread of information.
6.1 Audience Size of Retweet
WWW'10, 591 (2010)

Link Function ICWSM’11, 89 (2011)

Link Function
Agreement Discussion
ICWSM’11, 89 (2011)

Friends Talk to Each Other PLoS One 6, E22656 (2011)

Online Conversations
[Figure. Panel A: average weight per connection $\omega^{out}$ as a function of the out-degree $k_{out}$. Panel B: number of reciprocated connections $\rho$ as a function of the in-degree $k_{in}$.]
$$\omega_i^{out} = \frac{\sum_j \omega_{ij}}{k_i^{out}}$$
1.7 Million users
370 Million messages
Saturation of the number of reciprocated connections
Number of connections for which interaction strength is highest
PLoS One 6, E22656 (2011)
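The average weight per connection shown on this slide divides a user's total outgoing weight by the number of distinct people they contact. A minimal sketch, assuming (this is not stated on the slide) that the weight of a connection from i to j is the number of messages i sent to j:

```python
from collections import defaultdict


def average_out_weight(messages):
    """Average weight per outgoing connection for every sender.

    `messages` is an iterable of (sender, receiver) pairs, one per message.
    Assumption: the weight w_ij is the number of messages from i to j.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for sender, receiver in messages:
        weights[sender][receiver] += 1
    # omega_i = (sum_j w_ij) / k_i_out, with k_i_out the number of contacts.
    return {i: sum(nbrs.values()) / len(nbrs) for i, nbrs in weights.items()}


# Toy data: a writes 3 times to b and once to c; b writes twice to a.
toy_messages = [("a", "b")] * 3 + [("a", "c")] + [("b", "a")] * 2
omega_out = average_out_weight(toy_messages)
print(omega_out)
```

Plotting this quantity against the out-degree gives a curve of the kind shown in panel A.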

wo possible cases in networks with
: (a) positively correlated nets and (b)
width of the line of the links represents
PHYSICAL REVIEW E 76, 066106 (2007)
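[Diagram: three nodes A, B and C, each annotated with its in-degree kin, out-degree kout, in-strength sin and out-strength sout. The three annotations read: kin = 1, kout = 2, sin = 1, sout = 2; kin = 2, kout = 1, sin = 3, sout = 1; kin = 1, kout = 1, sin = 1, sout = 2.]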
Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree. Observing a retweet at node B provides implicit confirmation that information from A appeared in B’s Twitter feed, while a mention of B originating at node A explicitly confirms that A’s message appeared in B’s Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges.
Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a ‘retweet’ of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet.
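The node statistics in the diffusion-network figure (degree k and strength s, in and out) can be computed directly from a weighted edge list. The sketch below uses hypothetical node names `u1`, `u2`, `u3` and edge weights chosen so that the results reproduce the three annotations in the figure; which annotation belongs to A, B or C is not recoverable from the slide text.

```python
from collections import defaultdict


def node_stats(weighted_edges):
    """In/out degree (k) and in/out strength (s) for each node.

    `weighted_edges` holds (source, target, weight) triples, with each
    (source, target) pair listed once.
    """
    stats = defaultdict(lambda: {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
    for src, dst, w in weighted_edges:
        stats[src]["k_out"] += 1   # one more outgoing connection
        stats[src]["s_out"] += w   # strength adds the edge weight
        stats[dst]["k_in"] += 1
        stats[dst]["s_in"] += w
    return dict(stats)


# Hypothetical edges, chosen to match the values annotated in the figure.
edges = [("u1", "u2", 1), ("u1", "u3", 1), ("u3", "u2", 2), ("u2", "u1", 1)]
stats = node_stats(edges)
for node in sorted(stats):
    print(node, stats[node])
```

Strength differs from degree only where an edge carries a weight above one, here the weight-2 edge from `u3` to `u2`.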
The Strength of Ties (1973)

Weak
• Interviews to find out how individuals found out about job opportunities.
• Mostly from acquaintances or friends of friends
• “It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another”

The Strength of Ties (1973)
for a time sufficient to its
ale communication network
nd the calls among them links.
indicates a particular egocentric network evolution. In order to
quantify it, we measure the probability, p(n), that the next
communication event of an agent having n social ties will occur via
the establishment of a new (n + 1)th link. We calculate these
probabilities in the MPC dataset averaging them for users with the
same degree k at the end of the observation time. We therefore
. Panels (a), and (b) show calls within 3 hours between people in the same town in two different time windows.
al network structure, which was recorded by aggregating interactions during 6 months. Node size and colors
idth and color represent weight.

Network Structure The Strength of Intermediary Ties in Social Media
“People whose networks bridge the structural holes
between groups have an advantage in detecting and
developing rewarding opportunities. Information
arbitrage is their advantage. They are able to see
early, see more broadly, and translate information
across groups.”
AJS Volume 110 Number 2 (September 2004): 349–99
© 2004 by The University of Chicago. All rights reserved.
0002-9602/2004/11002-0004$10.00
Structural Holes and Good Ideas1
Ronald S. Burt
University of Chicago
This article outlines the mechanism by which brokerage prov
social capital. Opinion and behavior are more homogeneous w
than between groups, so people connected across groups are m
familiar with alternative ways of thinking and behaving. Broke
across the structural holes between groups provides a vision o
tions otherwise unseen, which is the mechanism by which broke
becomes social capital. I review evidence consistent with the
pothesis, then look at the networks around managers in a
American electronics company. The organization is rife with s
tural holes, and brokerage has its expected correlates. Compensa
positive performance evaluations, promotions, and good idea
disproportionately in the hands of people whose networks
structural holes. The between-group brokers are more likely t
press ideas, less likely to have ideas dismissed, and more like
have ideas evaluated as valuable. I close with implications for
ativity and structural change.
The hypothesis in this article is that people who stand near the hol
a social structure are at higher risk of having good ideas. The argum
is that opinion and behavior are more homogeneous within than betw
groups, so people connected across groups are more familiar with a
1
Portions of this material were presented as the 2003 Coleman Lecture at the Univ
of Chicago, at the Harvard-MIT workshop on economic sociology, in worksho
the University of California at Berkeley, the University of Chicago, the Univers
Kentucky, the Russell Sage Foundation, the Stanford Graduate School of Bus
the University of Texas at Dallas, Universiteit Utrecht, and the “Social Aspe
Rationality” conference at the 2003 meetings of the American Sociological Associ
I am grateful to Christina Hardy for her assistance on the manuscript and to se
colleagues for comments affecting the ﬁnal text: William Barnett, James Baron
athan Bendor, Jack Birner, Matthew Bothner, Frank Dobbin, Chip Heath, R
Kranton, Rakesh Khurana, Jeffrey Pfeffer, Joel Podolny, Holly Raider, James R
Don Ronchi, Ezra Zuckerman, and two AJS reviewers. I am especially grate
Peter Marsden for his comments as discussant at the Coleman Lecture. Direc
respondence to Ron Burt, Graduate School of Business, University of Chicago
cago, Illinois 60637. E-mail: ron.burt@gsb.uchicago.edu
PLoS One 7, e29358 (2012)

ation that the stronger the tie is the higher
acts of both parties it has and the higher the
belong to the same group.
groups
to consider is the characteristics of links
ese links occur mainly between groups
200 users (Figure 4A–C). However, their
he quality of the links (if they bear mentions
ks with mentions are less abundant than the
retweets are slightly more abundant.
ngth of weak ties theory [12,14–16], weak
between which they take place should be small according to Granovetter’s theory. The results show that the links most likely to attract retweets are those connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6, S7, S8, S9, S10, S11, S12, S13, S14 and Table S1 in the Supplementary Information).
ink statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned.
f different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular
in respect to detected groups.
.0029358.g002
Network Structure The Strength of Intermediary Ties in Social Media
to Granovetter expectation that the stronger the
number of mutual contacts of both parties it has a
Figure 2. Group and link statistics. (A) Size distri
(C) Percentage of links of different types, e.g. followe
topological localizations in respect to detected grou
doi:10.1371/journal.pone.0029358.g002
The
PLoS One 7, e29358 (2012)

Groups
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the
groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group
The Strength of Intermediary Ties in So
PLoS One 7, e29358 (2012)
2.4 Links between groups
The next question to consider is the characteristic
between groups. These links occur mainly betwee
containing less than 200 users (Figure 4A–C). Howe
frequency depends on the quality of the links (if they bear
or retweets). While links with mentions are less abundan
baseline, those with retweets are slightly more
According to the strength of weak ties theory [12,14–
links are typically connections between persons no
neighbors, being important to keep the network conn
for information diffusion. We investigate whether
between groups play a similar role in the online n
information transmitters. The actions more related to in
diffusion are retweets [24] that show a slight prefe
occurring on between-group links (Figures 4B and
preference is enhanced when the similarity between
groups is taken into account. We define the similarity be
groups, A and B, in terms of the Jaccard index
connections:
$$\mathrm{similarity}(A,B) = \frac{|\,\text{links of } A \cap \text{links of } B\,|}{|\,\text{links of } A \cup \text{links of } B\,|}.$$
The similarity is the overlap between the groups’ connec
it estimates network proximity of the groups. The gener
is that links with mentions more likely occur between clo
and retweets occur between groups with medium
(Figure 4D). Mentions as personal messages are
exchanged between users with similar environments
predicted by the strength of weak ties theory. Links with
are related to information transfer and the similarity of t
PLoS ONE | www.plosone.org
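The group similarity used on this slide is a Jaccard index over the groups' connections. A minimal sketch, assuming each group's connections are represented as a set of user identifiers (a choice made here for illustration; the identifiers are made up):

```python
def jaccard(links_a, links_b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical connection sets of two groups A and B.
links_of_a = {"u1", "u2", "u3"}
links_of_b = {"u2", "u3", "u4", "u5"}
similarity_ab = jaccard(links_of_a, links_of_b)
print(similarity_ab)
```

Here two of the five distinct connections are shared, giving 0.4. Values near 1 indicate groups that are close in the network, values near 0 groups that are far apart.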

Twitter follower distance
Y. Takhteyev et al. / Social Networks 34 (2012) 73–81
f physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New
ed towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on
Social Networks 34, 73 (2012)

Locality
Table 5
Top countries.
Columns: (1) share of egos (%) [a]; (2) share of egos (%) for egos in dyads [b]; (3) share of alters (%) [c]; (4) percentage of domestic ties [d]; (5) percentage of domestic ties among non-local ties [d]; (6) following foreign alters / being followed from abroad; (7) country named explicitly (% of egos).

Country        (1)    (2)    (3)    (4)    (5)    (6)    (7)
USA           48.5   45.7   54.5   91.6   89.3    0.3    8.1
Brazil        10.6   12.1   10.5   83.5   72.5    4.9   55.4
UK             7.6    8.3    7.6   50.6   33.3    1.2   45.3
Japan          5.5    6.5    6.3   92.1   86.0    1.4   25.0
Canada         3.7    3.8    2.9   33.3   23.1    1.6   58.5
Australia      2.7    2.7    1.9   50.0   32.0    2.2   69.7
Indonesia      2.6    1.8    1.2   60.0   25.0    7.0   83.3
Germany        2.1    1.8    1.3   62.9   58.8    3.2   58.6
Netherlands    1.4    1.4    1.2   66.7   22.2    1.5   54.3
Mexico         1.2    1.3    0.7   44.0    8.3    7.0   56.7

[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.
between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield’s (2004) ranking of world cities based on multinational corporations’ branch headquarters. (Of Alderson and Beckfield’s top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is
Table 6
The most common languages. Based on 2852 egos.

Language      % of egos
English       72.5
Portuguese    10.1
Japanese       5.4
Spanish        3.1
Indonesian     1.8
German         1.7
Dutch          1.0
Chinese        0.9
Korean         0.4
Swedish        0.4
Social Networks 34, 73 (2012)
accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
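The aggregation step can be illustrated with a simple greedy merge. This is a sketch under stated assumptions, not the authors' procedure: the 50 km merge radius, the single-pass greedy order and the sample coordinates are all hypothetical, and the plain averaging of longitudes ignores wrap-around at the antimeridian.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def merge_nearby(locations, radius_km=50.0):
    """Greedily merge (name, lat, lon, user_count) records into clusters.

    radius_km is an assumed threshold; the source only says "nearby".
    """
    clusters = []
    # Visit the most populated locations first so they seed the clusters.
    for name, lat, lon, count in sorted(locations, key=lambda r: -r[3]):
        for c in clusters:
            if haversine_km((lat, lon), (c["lat"], c["lon"])) <= radius_km:
                total = c["weight"] + count
                # Weighted average of the merged coordinates.
                c["lat"] = (c["lat"] * c["weight"] + lat * count) / total
                c["lon"] = (c["lon"] * c["weight"] + lon * count) / total
                c["weight"] = total
                c["names"][name] = c["names"].get(name, 0) + count
                break
        else:
            clusters.append({"lat": lat, "lon": lon, "weight": count,
                             "names": {name: count}})
    for c in clusters:
        # Label each cluster with its most common location name.
        c["label"] = max(c["names"], key=c["names"].get)
    return clusters


# Made-up sample: two names for the same area plus one distant city.
sample = [("Palo Alto", 37.44, -122.14, 3),
          ("Silicon Valley", 37.39, -122.06, 5),
          ("New York", 40.71, -74.01, 10)]
clusters = merge_nearby(sample)
print([(c["label"], c["weight"]) for c in clusters])
```

A production version would also restrict merging to locations within the same country, as the text specifies.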
5. Analysis
In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002).
For correlation and regression analysis we used networks with
nodes representing the 25 largest regional clusters of users (see
Table 3
Top clusters.
Columns: (1) share of egos (%) [b]; (2) share of egos (%) for egos in dyads [c]; (3) share of alters (%) [d]; (4) locality [e].

Rank  Cluster [a]               (1)    (2)    (3)    (4)
1     “New York”                8.5    8.3   10.2   54.3
2     “Los Angeles, CA”         5.1    5.6   10.4   53.3
3     “ ” (Tokyo)               4.1    4.8    5.0   62.9
4     “London”                  3.6    3.3    4.9   48.8
5     “São Paulo”               3.5    3.0    3.6   78.4
6     “San Francisco”           2.8    2.7    4.1   41.2
7     “New Jersey” [f]          2.5    2.8    2.1   20.0
8     “Chicago”                 2.2    2.0    1.7   32.0
9     “Washington, DC”          2.1    2.8    2.6   34.3
10    “Manchester, UK”          1.9    2.0    1.1   30.8
11    “Atlanta”                 1.7    2.1    2.1   46.2
12    “San Diego”               1.5    1.5    1.1   26.3
13    “Toronto, Canada”         1.3    1.1    1.5   42.9
14    “Seattle”                 1.3    1.4    1.2   58.8
15    “Houston”                 1.2    1.2    1.0   40.0
16    “Dallas, Texas”           1.2    1.0    1.4   61.5
17    “Rio de Janeiro”          1.2    1.0    1.1   30.8
18    “Boston, MA”              1.2    1.2    1.1   20.0
19    “Amsterdam”               1.1    1.1    0.9   50.0
20    “Jakarta, Indonesia”      1.1    0.6    0.3   42.9
21    “Austin, TX”              1.0    1.0    1.3   50.0
22    “Sydney”                  0.9    1.0    0.8   38.5
23    “Orlando, Forida”         0.9    1.0    0.6   16.7
24    “Phoenix, AZ”             0.8    0.7    0.6   11.1
25    “ ” (Hyōgo) [g]           0.8    1.0    1.0   25.0

[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km².
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.
over half of the egos are in other countries, as are 4 of the 10
largest clusters: Tokyo, São Paulo, and two clusters in the United

Mobility and Social Networks
Coupling Mobility and Interactions in Social Media
Follower
Geography and Social Networks
Geography
Follower
Reply
ReTweet
Geography
PLoS One 9, E92196 (2014)
and for their dependence on the distance. The error Err of this null model is between 0.66 and 0.76 for the three countries, around twice the error of the TF model (see Figure 6).
The linking model (L model) is a simplified version of the TF model, without random mobility and with the box size d → 0. Agents move to visit their contacts with probability pv, whereas with probability 1 − pv they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide, visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not
(e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6).
Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on the human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.
Sensitivity of the TF Model to the Parameters and its
Modifications
The results presented so far have been obtained at the optimal
Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different
colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users
throughout entire simulation. Ego network shows the social connections at the end of the simulation.

Geo-Social Properties PLoS One 9, E92196 (2014)
that has also an edge between i and k, forming a triangle. Note
a triangle consists of 3 triads centered on different nodes.
effect of the distance on the clustering coefficient can
incorporated by measuring the distances from each central n
j to two neighbors i and k forming a triad, $d = d_{ij} + d_{jk}$,
calculating the network clustering restricted to triads with dist
d. This new function C(d) is the probability of closing a tria
given the distance d in a triad
$$C(d) = \frac{\Delta(d)}{\Lambda(d)},$$
where $\Lambda(d)$ and $\Delta(d)$ are the numbers of triads and closed tr
for the distance d, respectively. The value of the global cluste
coefficient C can be recovered by averaging C(d) over d. In
datasets, we observe a drop in C(d) followed by a plateau, whi
best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if the
diversity in the edge lengths. The triangle can be equilateral
the edges have the same length, isosceles if two have the s
length and the other is smaller, etc. We estimate the domi
shapes of the triangles in the network by measuring the dispari
defined as:
$$D = 6\left[\frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3}\right],$$
where d1, d2 and d3 are the geographical distances between
locations of the users forming the triangle. The disparity t
values between 0 and 1 as the shape of the triangle passes f
equilateral to isosceles, where one edge is much smaller than
other two. D shows a distribution with two maxima in the on
social networks (Figure 2F), for low and high values. The two m
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in Pl(d), but yields near-zero values for R(d), Jf(d) and C(d). The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of Pl(d), R(d), Jf(d) and C(d).
Panels highlighted on the slide: Triangle Disparity, Reciprocity, Prob. of a Link, Clustering.
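The triangle disparity D defined on this slide depends only on the three edge lengths, so it is easy to check its two limiting cases numerically. A minimal sketch (the function name is illustrative):

```python
def triangle_disparity(d1, d2, d3):
    """Disparity D = 6 * [(d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3]."""
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("edge lengths must sum to a positive value")
    return 6.0 * ((d1 * d1 + d2 * d2 + d3 * d3) / total ** 2 - 1.0 / 3.0)


# Equilateral triangle: all edges equal, disparity at its minimum.
d_equilateral = triangle_disparity(5.0, 5.0, 5.0)
# Degenerate isosceles triangle: one edge vanishingly small, disparity at its maximum.
d_isosceles = triangle_disparity(10.0, 10.0, 0.0)
print(d_equilateral, d_isosceles)
```

The two cases give values at (or within floating-point error of) 0 and 1, matching the range described in the text.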

Geo-Social Model
[Diagram labels: New position of u; Detect all encounters e in the box of u; Visit a random neighbour; Jump to a new location; Starting position of user u; Created new social links.]
PLoS One 9, E92196 (2014)

Model Fitting
0.39 for Germany. For simplicity, we focus on the Twitter
networks only, although similar results are obtained for the other
datasets.
Results
Simulations for the Optimal Parameters
An example with the displacements between consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far apart. In this range of parameters and simulation times, the main mechanism for generating long distance
second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks.
The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power-law of the distance between them (suggested in [41]). The exponent of the power-law is fixed at -0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match Pl(d), other properties such as P(k), R(d), Jf(d), C(d) or P(D) are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks
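A minimal sketch of a spatial null model of this kind is given below: every pair of users is linked with a probability proportional to d^(-0.7). The great-circle distance, the cutoff `d_min` (to avoid a divergence at zero distance) and the normalisation `scale` are assumptions of the example and are not taken from the paper.

```python
import math
import random

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def spatial_model(positions, exponent=-0.7, d_min=1.0, scale=1.0, seed=0):
    """Link each pair of users with probability scale * d**exponent.

    positions: dict user -> (lat, lon). d_min and scale are hypothetical
    knobs of this sketch; the exponent -0.7 is the value quoted in the text.
    """
    rng = random.Random(seed)
    users = sorted(positions)
    edges = set()
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            d = max(haversine_km(positions[u], positions[v]), d_min)
            p = min(1.0, scale * d ** exponent)
            if rng.random() < p:
                edges.add((u, v))
    return edges
```

Because links are drawn independently for each pair, nothing in this construction favours closed triangles or reciprocated links, which is consistent with the near-zero clustering and reciprocity reported for the S model.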
Figure 3. Fitting the TF model. Values of the error Err when pv and pc are changed. The minimum error for each of the plots is marked with a red
rectangle.
Axes: Prob. to Make a New Friend, Prob. to Visit an Old Friend
PLoS One 9, E92196 (2014)

Model Results
Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4.
Panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity
that has also an edge between i and k, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, d = d_ij + d_jk, and calculating the network clustering restricted to triads with distance d. This new function C(d) is the probability of closing a triad given the distance d in a triad
$$C(d) = \frac{\Delta(d)}{\Lambda(d)},$$
where \Lambda(d) and \Delta(d) are the numbers of triads and closed triads for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging C(d) over d. In the datasets, we observe a drop in C(d) followed by a plateau, which is best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if all the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity D, defined as:
$$D = 6 \left[ \frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3} \right],$$
where d_1, d_2 and d_3 are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values.
PLoS One 9, E92196 (2014)
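The triangle disparity shown on this slide, D = 6 [ (d1² + d2² + d3²) / (d1 + d2 + d3)² - 1/3 ], is easy to check numerically. The helper below is a small sketch; the function name is chosen for the example only.

```python
def triangle_disparity(d1, d2, d3):
    """Disparity D of a triangle with edge lengths d1, d2, d3.

    D = 6 * ((d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3)
    D is 0 for an equilateral triangle and approaches 1 when one edge
    is much shorter than the other two.
    """
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("edge lengths must have a positive sum")
    return 6.0 * ((d1 * d1 + d2 * d2 + d3 * d3) / total ** 2 - 1.0 / 3.0)
```

For three equal edges the ratio of squares is exactly 1/3, giving D = 0; for two equal edges and a vanishing third the ratio is 1/2, giving D = 6 (1/2 - 1/3) = 1.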

Human Diffusion
a) Starting from Paris
b) Starting from New York
J. R. Soc. Interface 12, 20150473 (2015)

Human Diffusion
b) Starting from New York
J. R. Soc. Interface 12, 20150473 (2015)

Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)

Residents and Tourists
[Figure, panels a-d. (a) Coverage and R~ for Local and Non-Local users. (b) Coverage versus Proportion of Non-Local Users. (c) Coverage by city: New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow. (d) Coverage by city: Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.]
J. R. Soc. Interface 12, 20150473 (2015)

City Communities
[Figure: cities ranked by Weighted Betweenness (x 10^2) and Weighted degree: Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York.]
J. R. Soc. Interface 12, 20150473 (2015)

#tags
• Metadata added to a Tweet for topic marking
• Originally proposed by Chris Messina in 2007
• Quickly adopted informally by the Twitter
community
• Native support added by Twitter after it became
popular
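Since hashtags are exposed as metadata, they can be read directly from the `entities` field listed in the tweet anatomy earlier in the deck, without parsing the text. The sketch below assumes tweets already decoded into Python dicts in the classic REST format, where each entry of `entities["hashtags"]` carries a `text` key; the function names and the sample tweets are made up for the example.

```python
from collections import Counter

def extract_hashtags(tweet):
    """Return the lowercased hashtags of one tweet dict."""
    entities = tweet.get("entities") or {}
    return [tag["text"].lower()
            for tag in entities.get("hashtags", [])
            if "text" in tag]

def count_hashtags(tweets):
    """Count in how many tweets each hashtag appears."""
    counts = Counter()
    for tweet in tweets:
        # set() so that a tag repeated inside one tweet is counted once
        counts.update(set(extract_hashtags(tweet)))
    return counts
```

Lowercasing is a design choice of this sketch so that `#Watchmen` and `#watchmen` are treated as the same topic marker.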

Hashtag Statistics
[Figure: log-log plots of the number of users per tag and the number of tweets per tag (axes from 10^1 to 10^5), with a threshold marked at 500 users. Labelled tags: swsx, swineflu, gfail, peace, watchmen, nsotu, winnenden, masters.]
WWW’12, 251 (2012)

Activity Peak Detection
• Peak: activity relative to the baseline must be 10 times larger
• Minimal level of activity expected
• Selection of isolated popularity bursts (no other peaks one week before/after)
• We detected 115 peaks
Example activity patterns: continuous (#video), periodic (#ff), peak (#w2e)
WWW’12, 251 (2012)
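The selection criteria listed on this slide can be sketched on a list of daily tweet counts for one hashtag. This is an illustration under stated assumptions, not the procedure of the paper: the baseline is taken here to be the median daily count, and `min_count` is a hypothetical value for the "minimal level of activity".

```python
from statistics import median

def detect_isolated_peaks(daily_counts, ratio=10.0, min_count=50, window=7):
    """Return the indices of isolated activity peaks.

    A day is a candidate peak if its count is at least `ratio` times the
    baseline (assumed: median of the series) and at least `min_count`.
    A candidate is kept only if no other candidate lies within `window`
    days before or after it.
    """
    baseline = max(median(daily_counts), 1.0)  # avoid a zero baseline
    candidates = [day for day, count in enumerate(daily_counts)
                  if count >= min_count and count >= ratio * baseline]
    return [day for day in candidates
            if not any(other != day and abs(other - day) <= window
                       for other in candidates)]
```

A periodic tag such as a weekly meme produces candidates less than a week apart, so the isolation rule discards all of them, while a single burst survives.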

Peak Characterization
[Figure: number of Tweets per day from 30 days before to 30 days after the peak, compared with the baseline; the activity is split into Before Peak, Peak and After Peak. Examples: #winnenden (school shooting, Mar 3, 2009) and #watchmen (movie release date, Mar 6, 2009).]
WWW’12, 251 (2012)

Some Examples
[Figure: number of tweets and user ID versus days after peak (from -6 to 6) for four hashtags, annotated with the percentage of activity in each period.]
#masters: master cup finale, Apr 9, 2009
#winnenden: school shooting, Mar 3, 2009
#watchmen: movie release date, Mar 6, 2009
#nsotu: Obama's first state of the union, Feb 25, 2009
Classes: Anticipation, Reaction, “Instantaneous”, “Anticipation + Reaction”
WWW’12, 251 (2012)

Classes of Peaks
• Anticipatory behavior
• Increasing amount of tweets until the event
• Sharp drop of attention after the event
• Unexpected events
• Driven by exogenous sources
• Activity concentrated on the peak day
• Events that are only discussed while they are happening
• Collective attention is built up to a peak intensity, then attention shifts away
[Figure: triangular plot with axes running from 0% to 100% for the fraction of activity before (f_b), at the peak (f_p) and after (f_a). Hashtags shown: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, nsotu, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools.]
WWW’12, 251 (2012)

Barycentric Coordinates
[Figure: the four example time series (#masters, #winnenden, #watchmen, #nsotu) next to a 2D-simplex.]
2D-Simplex reference points:
Vertices: (1,0,0), (0,1,0), (0,0,1)
Edge midpoints: (1/2,1/2,0), (1/2,0,1/2), (0,1/2,1/2)
Center: (1/3,1/3,1/3)
Interior points: (1/2,1/4,1/4), (1/4,1/2,1/4), (1/4,1/4,1/2)
WWW’12, 251 (2012)
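A peak can be placed on the 2D-simplex by computing the fractions of activity before, during and after the peak day (they sum to 1) and using them as barycentric weights of the three corners of a triangle. In the sketch below the window of 7 days, the corner layout in `VERTICES` and the assignment of before/peak/after to particular corners are assumptions made for the example; the slide does not fix them.

```python
import math

# Corners of an equilateral triangle, assumed order: before, peak, after.
VERTICES = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))

def peak_fractions(daily_counts, peak_index, window=7):
    """Return (f_b, f_p, f_a): share of tweets before, at and after the peak."""
    lo = max(0, peak_index - window)
    hi = min(len(daily_counts), peak_index + window + 1)
    before = sum(daily_counts[lo:peak_index])
    peak = daily_counts[peak_index]
    after = sum(daily_counts[peak_index + 1:hi])
    total = before + peak + after
    if total == 0:
        raise ValueError("no activity in the window")
    return before / total, peak / total, after / total

def to_simplex_xy(f_b, f_p, f_a):
    """Map barycentric weights (summing to 1) to 2D plot coordinates."""
    weights = (f_b, f_p, f_a)
    x = sum(w * vx for w, (vx, _) in zip(weights, VERTICES))
    y = sum(w * vy for w, (_, vy) in zip(weights, VERTICES))
    return x, y
```

With this mapping a point such as (1/3,1/3,1/3) lands at the center of the triangle, and a peak with all its activity in one period lands on the corresponding corner.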

[Figure: the four example time series (#masters, #winnenden, #watchmen, #nsotu) next to the 2D-simplex of hashtags, with axes running from 0% to 100% before, peak and after.]
WWW’12, 251 (2012)

[Figure: the 2D-simplex of hashtags (axes from 0% to 100% before, peak and after) with the example time series for #masters, #winnenden, #watchmen and #nsotu, followed by a close-up of #watchmen (movie release date, Mar 6, 2009) and #nsotu (Obama's first state of the union, Feb 25, 2009).]
WWW’12, 251 (2012)

Language Matters

Signal By Language
[Figure: share of the signal by language (Italian, English, Spanish, Portuguese, Other); values highlighted on successive slides: 76%, 16%, 2%.]

Spanish PLoS One 9, E112074 (2014)

Local Variations PLoS One 9, E112074 (2014)

Superdialects
[Figure, panels A-C: map of Spanish-speaking cities (Mexico City, Guatemala, San Salvador, Caracas, San Jose, Panama, Bogota, Quito, Lima, Asuncion, Cordoba, Santiago, Buenos Aires, Santiago De Compostela, Palma De Mallorca, Santander, Oviedo, Bilbao, Zaragoza, Valladolid, Barcelona, Madrid, Seville, San Diego, Miami, New York, San Juan, Santo Domingo); f(K) and silhouette as a function of the number of clusters; population (x 10^5) by cluster for the clusters α and β, with N = 179 and N = 956.]
PLoS One 9, E112074 (2014)

Regional Dialects PLoS One 9, E112074 (2014)

Bilingualism

Global Language Network
[Figure: Global Language Networks built from Twitter, Wikipedia and book translations; nodes are languages. Legend: Language Family (Afro-Asiatic, Altaic, Amerindian, Austronesian, Caucasian, Creoles pidgins, Dravidian, Indo-European, Niger-Congo, Other, Sino-Tibetan, Tai, Uralic); Population (1 million, 10 million, 100 million, 1 billion); Link Weight and Color (t-statistic, from 2.59; co-occurrences of users, editors, translations). Co-occurrence ranges are listed for twitter, wikipedia and book translations: min 6, 6, 6; max 994,682, 49,637, 183,329.]
PNAS 111, E5616 (2014)

Wikipedia Twitter
[Figure: Wikipedia and Twitter Global Language Networks, with the same legend (Language Family, Population, Link Weight and Color by t-statistic).]
PNAS 111, E5616 (2014)

Book Translations
[Figure: Global Language Network built from book translations; nodes are languages, ranging from modern ones (e.g. English, French, Russian, Chinese) to historical ones (e.g. Old English, Ancient Greek, Sumerian).]
PNAS 111, E5616 (2014)

numbers are 41% and 63%. In contrast, the correlation between the representation of languages in Twitter and Book Translations is 0.63 (R² = 40%), and the correlation between the strength of links is only 0.48 (R² = 23%). Finally, we note that, with respect to the book translation dataset, the two digital datasets (Twitter and Wikipedia) are overexpressed in languages associated with developing countries, like Malay, Filipino and Swahili. This indicates that these digital media are more inclusive of the populations of developing countries than written books.
PNAS 111, E5616 (2014)
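The R² values quoted in this excerpt are consistent with squaring the reported correlations, as expected for a simple linear fit. A two-line check (the helper name is hypothetical):

```python
def r_squared_percent(correlation):
    """Share of variance explained by a simple linear fit, in percent."""
    return round(100 * correlation ** 2)

# Correlations quoted in the excerpt: 0.63 and 0.48.
explained = [r_squared_percent(r) for r in (0.63, 0.48)]
```

Here `explained` evaluates to `[40, 23]`, matching the percentages in the text.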

Language and Fame
[Figure: six scatter plots (A-F) of languages, labelled with three-letter language codes (e.g. eng, spa, fra, deu, rus, zho). Color: GDP per Capita ($0k to $50k). Size: Number of speakers (400 M, 800 M, 1200 M). Vertical axes: log10(Wikipedia 26+ famous people) in the top row and log10(HA famous people) in the bottom row. Horizontal axes: log10(Twitter Eigenvector Centrality), log10(Wikipedia Eigenvector Centrality), log10(Book Trans. Eigenvector Cent.). Fits: A, R² = 0.447; B, R² = 0.755; C, R² = 0.693; D, R² = 0.399; E, R² = 0.758; F, R² = 0.858; p-value 0.001 in every panel.]
Fig. 3. The position of a language in the GLN and the global impact of its speakers. Top row shows the number of people per language (born 1800–1950) with articles in at least 26 Wikipedia language editions as a function of their language’s eigenvector centrality in the (A) Twitter GLN, (B) Wikipedia GLN, and (C) book translation GLN. The bottom row shows the number of people per language (born 1800–1950) listed in Human Accomplishment as a function of their language’s eigenvector centrality in (D) Twitter GLN, (E) Wikipedia GLN, and (F) book translation GLN. Size represents the number of speakers for each
PNAS 111, E5616 (2014)
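Eigenvector centrality, the horizontal axis of these plots, scores a language highly when it is strongly linked to other high-scoring languages. The sketch below computes it by power iteration on an undirected weighted network. It is a generic illustration, not the pipeline of the paper: the edge weights are made up, and the iteration uses A + I (each node keeps its previous score) so that it also converges on bipartite networks.

```python
def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Power iteration on an undirected weighted graph.

    weights: dict mapping (u, v) pairs to positive link weights.
    Returns a dict node -> score, scaled so that the top node has score 1.
    """
    nodes = sorted({node for edge in weights for node in edge})
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Multiply by (A + I): keep the old score, add weighted neighbours.
        new = dict(score)
        for (u, v), w in weights.items():
            new[u] += w * score[v]
            new[v] += w * score[u]
        norm = max(new.values())
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - score[node]) for node in nodes) < tol:
            return new
        score = new
    return score

# Toy network with hypothetical co-occurrence weights.
toy = {("eng", "spa"): 5.0, ("eng", "fra"): 4.0, ("spa", "por"): 3.0}
scores = eigenvector_centrality(toy)
```

In the toy network `eng` obtains the highest score, followed by `spa`, `fra` and `por`: being connected to a central hub matters more than the number of links alone.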

Collective Attention
“Prediction is very difficult,
especially about the future.”
(Niels Bohr)

Even more so in Political Elections
http://truthy.indiana.edu
[Figure, panels A-H: example memes, including #ampat, @PeaceKaren_25, gopleader.gov and “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”.]

Table 1: Features used in truthy classification.
nodes              Number of nodes
edges              Number of edges
mean k             Mean degree
mean s             Mean strength
mean w             Mean edge weight in largest connected component
max k(i,o)         Maximum (in,out)-degree
max k(i,o) user    User with max. (in,out)-degree
max s(i,o)         Maximum (in,out)-strength
max s(i,o) user    User with max. (in,out)-strength
std k(i,o)         Std. dev. of (in,out)-degree
std s(i,o)         Std. dev. of (in,out)-strength
skew k(i,o)        Skew of (in,out)-degree distribution
skew s(i,o)        Skew of (in,out)-strength distribution
mean cc            Mean size of connected components
max cc             Size of largest connected component
entry nodes        Number of unique injections
num truthy         Number of times ‘truthy’ button was clicked
sentiment scores   Six GPOMS sentiment dimensions
graph. These include the number of nodes and edges in the graph, the mean degree and strength of nodes in the graph, mean edge weight, mean clustering coefficient across nodes in the largest connected component, and the standard deviation and skew of each network’s in-degree, out-degree and strength distributions (see Fig. 2). Additionally we track the out-degree and out-strength of the most prolific broadcaster, as well as the in-degree and in-strength of the most focused-upon user. We also monitor the number of unique injection points of the meme, reasoning that organic memes (such as those relating to news events) will be associated with larger number of originating users.
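A few of the features in Table 1 can be sketched from a weighted, directed edge list of a meme's diffusion network. The input format `(source, target, weight)`, the function name and the convention that the mean degree counts in- plus out-links are assumptions of this example; the paper's exact definitions may differ.

```python
from collections import defaultdict
from statistics import pstdev

def meme_network_features(edges):
    """Compute simple network features from (source, target, weight) edges."""
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(float), defaultdict(float)
    nodes, n_edges = set(), 0
    for source, target, weight in edges:
        nodes.update((source, target))
        n_edges += 1
        k_out[source] += 1
        k_in[target] += 1
        s_out[source] += weight
        s_in[target] += weight
    if not nodes:
        raise ValueError("empty network")
    ordered = sorted(nodes)  # deterministic tie-breaking by name
    top_in = max(ordered, key=lambda n: k_in[n])
    top_out = max(ordered, key=lambda n: k_out[n])
    return {
        "nodes": len(nodes),
        "edges": n_edges,
        "mean_k": sum(k_in[n] + k_out[n] for n in ordered) / len(nodes),
        "mean_s": sum(s_in[n] + s_out[n] for n in ordered) / len(nodes),
        "max_k_in": k_in[top_in],
        "max_k_in_user": top_in,
        "max_k_out": k_out[top_out],
        "max_k_out_user": top_out,
        "std_k_in": pstdev(k_in[n] for n in ordered),
        "std_k_out": pstdev(k_out[n] for n in ordered),
    }
```

The user with the largest out-degree plays the role of the most prolific broadcaster, and the user with the largest in-degree that of the most focused-upon user.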
4.4 Sentiment Analysis
We also utilize a modified version of the Google-based Profile of Mood States (GPOMS) sentiment analysis method (Bollen, Mao, and Pepe 2010) in the analysis of meme-specific sentiment on Twitter. The GPOMS tool as-
Table 2: Performance of two classifiers with and without resampling training data to equalize class sizes. All results are averaged based on 10-fold cross-validation.
Classifier   Resampling?   Accuracy   AUC
AdaBoost     No            92.6%      0.91
AdaBoost     Yes           96.4%      0.99
SVM          No            88.3%      0.77
SVM          Yes           95.6%      0.95
Table 3: Confusion matrices for a boosted decision stump classifier with and without resampling. The labels on the rows refer to true class assignments; the labels on the columns are those predicted.
       No resampling              With resampling
       Truthy      Legitimate     Truthy      Legitimate
T      45 (12%)    16 (4%)        165 (45%)   6 (1%)
L      11 (3%)     294 (80%)      7 (2%)      188 (51%)
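The accuracies of the boosted classifier can be recomputed from the confusion matrices as the fraction of memes on the diagonal. The short sketch below uses the counts of Table 3 (rows are true classes, columns are predictions); the function names are chosen for the example.

```python
def accuracy(confusion):
    """Fraction of correctly classified items in a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def recall(confusion, label):
    """Fraction of items of true class `label` that were predicted as such."""
    return confusion[label][label] / sum(confusion[label])

# Rows: true Truthy, true Legitimate. Columns: predicted Truthy, Legitimate.
no_resampling = [[45, 16], [11, 294]]
with_resampling = [[165, 6], [7, 188]]

acc_no = round(100 * accuracy(no_resampling), 1)      # 339 / 366
acc_yes = round(100 * accuracy(with_resampling), 1)   # 353 / 366
```

The results, 92.6 and 96.4, agree with the AdaBoost rows of Table 2. Without resampling only 45 of the 61 truthy memes are recovered (`recall(no_resampling, 0)` is about 0.74), which shows why accuracy alone is a weak summary when the classes are imbalanced.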
additional volunteers), and asking them to place each meme in one of the three categories. A meme was to be classified as ‘truthy’ if a significant portion of the users involved in that meme appeared to be spreading it in misleading ways, e.g., if a number of the accounts tweeting about the meme appeared to be robots or sock puppets, the accounts appeared to follow only other propagators of the meme (clique behavior), or the users engaged in repeated reply/retweet exclusively with other users who had tweeted the meme. ‘Legitimate’ memes were described as memes representing normal use of Twitter: several non-automated users conversing about a topic. The final category, ‘remove,’ was used for memes in a non-English language or otherwise unrelated to U.S. politics (#youth, for example). These memes were not used in the training or evaluation of classifiers.
Upon gathering 252 annotated memes, we observed an imbalance in our labeled data (231 legitimate and only 21 truthy). Rather than simply resampling from the smaller class, as is common practice in the case of class imbal-

A B C)C D
E)E F G)G H
fashion shows”
Table 1: Features used in truthy classification.
nodes Number of nodes
edges Number of edges
mean k Mean degree
mean s Mean strength
mean w Mean edge weight in largest con-
nected component
max k(i,o) Maximum (in,out)-degree
max k(i,o) user User with max. (in,out)-degree
max s(i,o) Maximum (in,out)-strength
max s(i,o) user User with max. (in,out)-strength
std k(i,o) Std. dev. of (in,out)-degree
std s(i,o) Std. dev. of (in,out)-strength
skew k(i,o) Skew of (in,out)-degree distribution
skew s(i,o) Skew of (in,out)-strength distribution
mean cc Mean size of connected components
max cc Size of largest connected component
entry nodes Number of unique injections
num truthy Number of times ‘truthy’ button was
clicked
sentiment scores Six GPOMS sentiment dimensions
graph. These include the number of nodes and edges in the
graph, the mean degree and strength of nodes in the graph,
mean edge weight, mean clustering coefficient across nodes
in the largest connected component, and the standard devi-
ation and skew of each network’s in-degree, out-degree and
strength distributions (see Fig. 2). Additionally we track the
out-degree and out-strength of the most prolific broadcaster,
as well as the in-degree and in-strength of the most focused-
upon user. We also monitor the number of unique injection
points of the meme, reasoning that organic memes (such as
those relating to news events) will be associated with larger
number of originating users.
4.4 Sentiment Analysis
We also utilize a modified version of the Google-based
Profile of Mood States (GPOMS) sentiment analysis
method (Bollen, Mao, and Pepe 2010) in the analysis of
meme-specific sentiment on Twitter. The GPOMS tool as-
SVM No 88.3% 0.77
SVM Yes 95.6% 0.95
T 45 (12%) 16 (4%) 165 (45%) 6 (1%)
L 11 (3%) 294 (80%) 7 (2%) 188 (51%)
[...]ior), or the users engaged in repeated reply/retweet exclusively
with other users who had tweeted the meme. ‘Legitimate’
memes were described as memes representing normal
use of Twitter: several non-automated users conversing
about a topic. The final category, ‘remove,’ was used for
memes in a non-English language or otherwise unrelated to
U.S. politics (#youth, for example). These memes were
not used in the training or evaluation of classifiers.
Upon gathering 252 annotated memes, we observed an
imbalance in our labeled data (231 legitimate and only 21
truthy). Rather than simply resampling from the smaller
class, as is common practice in the case of class imbalance [...]
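
For reference, the "common practice" the excerpt mentions, resampling from the smaller class, can be sketched as below. This is explicitly not the authors' approach (the text says they chose something else, and the description is cut off); it only shows what plain random oversampling looks like for the 231 vs. 21 split quoted above. The function name and the placeholder meme identifiers are assumptions.

```python
import random


def oversample_minority(items, labels, seed=0):
    """Randomly duplicate items of smaller classes until all classes
    have as many examples as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)

    target = max(len(group) for group in by_label.values())
    out_items, out_labels = [], []
    for label, group in by_label.items():
        # Draw with replacement to fill the gap to the largest class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for item in group + extra:
            out_items.append(item)
            out_labels.append(label)
    return out_items, out_labels


# Class sizes from the text; the meme identifiers are placeholders.
labels = ["legitimate"] * 231 + ["truthy"] * 21
memes = [f"meme_{i}" for i in range(len(labels))]
balanced_items, balanced_labels = oversample_minority(memes, labels)
print(len(balanced_items))
```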
Why not start with
something a bit simpler?

American Idol
• Popularity contest
• Well defined audience, across the entire US
• Similar demographics voting and tweeting
• Weekly “votes”, involving the same population
• Immediate results
• (Almost) No incentives for organized campaigns

Figure: Calibration, Top 5. % of Tweets per contestant (Jessica, Phillip, Joshua, Hollie, Skylar).
EPJ Data Science 1, 8 (2012)

Figure: Calibration, Top 4. % of Tweets per contestant. Labels: Skylar, Jessica, Phillip, Joshua, Hollie.

Figure: Calibration, Top 3. % of Tweets per contestant. Labels: Skylar, Jessica, Phillip, Joshua.
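
The "% of Tweets" quantity in the calibration figures can be sketched as a simple mention count per contestant. The example below is a minimal illustration only: the sample tweets are invented, matching by first name is a simplification, and reading the least-mentioned contestant as the likeliest elimination is an assumption of this sketch, not a statement of the published method.

```python
from collections import Counter

# Contestant names as they appear on the slides.
CONTESTANTS = ["Jessica", "Phillip", "Joshua", "Hollie", "Skylar"]


def tweet_share(tweets, names=CONTESTANTS):
    """Return the percentage of contestant mentions going to each name."""
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for name in names:
            if name.lower() in lowered:
                counts[name] += 1
    total = sum(counts.values())
    return {n: (100.0 * counts[n] / total if total else 0.0) for n in names}


# Invented tweets, for illustration only.
sample = [
    "Voting for Phillip tonight",
    "Phillip was great",
    "Jessica nailed it",
    "Jessica and Phillip both amazing",
    "Joshua!!",
    "go Hollie",
    "Hollie and Joshua were good",
    "Skylar rocks",
]
shares = tweet_share(sample)
least_mentioned = min(shares, key=shares.get)
print(shares, least_mentioned)
```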

Geographic Locations
Figure: geographic panels (A) to (C) for Top 5, Top 4, and Top 3. Legend: Jessica, Phillip, Joshua, Hollie, Skylar, CC.

An actual prediction
EPJ Data Science 1, 8 (2012)

And the winner is...
Figure: % of Tweets for Jessica and Phillip, World and U.S.

And the winner is...
Figure: % of Tweets for Jessica and Phillip, U.S.
EPJ Data Science 1, 8 (2012)

Stock Market
Figure: Methodology, data sets and timeline. Twitter feed, text analysis, daily mood indicators from (1) OpinionFinder and (2) G-POMS (6 dim.); daily stock market values from (3) DJIA; normalization; Granger causality (lag, F-statistic, p-value); SOFNN (predicted value, MAPE, Direction %). Timeline: Feb 28, 2008 to Dec 20, 2008.
Fig. 1. Diagram outlining 3 phases of methodology and corresponding data
sets: (1) creation and validation of OpinionFinder and GPOMS public mood [...]

POMS
• Simple questionnaire that classifies a person’s mood along 6 dimensions:
• tension-anxiety
• depression-dejection
• anger-hostility
• fatigue-inertia
• vigor-activity
• confusion-bewilderment
• How to administer it to Twitter users?
• Expand vocabulary using Google n-grams
• Search Twitter for matching words
Profile of Mood States
Subject's Initials
Birth date
Date
Subject Code No.
Directions: Describe HOW YOU FEEL RIGHT NOW
by circling the most appropriate number after each of the words listed below:
FEELING: Not at all (1), A little (2), Moderate (3), Quite a bit (4), Extremely (5)
1. Friendly 1 2 3 4 5
2. Tense 1 2 3 4 5
3. Angry 1 2 3 4 5
4. Worn Out 1 2 3 4 5
5. Unhappy 1 2 3 4 5
6. Clear-headed 1 2 3 4 5
7. Lively 1 2 3 4 5
8. Confused 1 2 3 4 5
9. Sorry for things done 1 2 3 4 5
10. Shaky 1 2 3 4 5
11. Listless 1 2 3 4 5
12. Peeved 1 2 3 4 5
13. Considerate 1 2 3 4 5
14. Sad 1 2 3 4 5
15. Active 1 2 3 4 5
16. On edge 1 2 3 4 5
17. Grouchy 1 2 3 4 5
18. Blue 1 2 3 4 5
19. Energetic 1 2 3 4 5
20. Panicky 1 2 3 4 5
21. Hopeless 1 2 3 4 5
22. Relaxed 1 2 3 4 5
23. Unworthy 1 2 3 4 5
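
The idea on the POMS slide (start from questionnaire words, expand the vocabulary, then search tweets for matching words) can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the assignment of questionnaire adjectives to dimensions, the `RELATED` table standing in for a Google n-grams expansion, and the sample tweets. It is not the GPOMS implementation.

```python
import re
from collections import Counter

# Hypothetical seed terms per dimension, taken from the adjectives on the
# questionnaire; the word-to-dimension mapping is assumed for this sketch.
SEED_TERMS = {
    "tension-anxiety": {"tense", "shaky", "panicky"},
    "depression-dejection": {"unhappy", "sad", "hopeless"},
    "anger-hostility": {"angry", "peeved", "grouchy"},
    "fatigue-inertia": {"listless"},
    "vigor-activity": {"lively", "active", "energetic"},
    "confusion-bewilderment": {"confused"},
}

# Stand-in for an n-gram based expansion: seed word -> related words.
RELATED = {"sad": {"miserable"}, "angry": {"furious"}}


def expand(lexicon, related):
    """Add related words to every dimension that contains their seed word."""
    return {
        dim: terms | {r for t in terms for r in related.get(t, ())}
        for dim, terms in lexicon.items()
    }


def mood_counts(tweets, lexicon):
    """Count, per dimension, the tweets containing at least one term."""
    counts = Counter({dim: 0 for dim in lexicon})
    for text in tweets:
        words = set(re.findall(r"[a-z']+", text.lower()))
        for dim, terms in lexicon.items():
            if words & terms:
                counts[dim] += 1
    return counts


# Invented tweets, for illustration only.
tweets = [
    "Feeling tense and shaky before the exam",
    "so miserable today",
    "Energetic morning run!",
]
lexicon = expand(SEED_TERMS, RELATED)
counts = mood_counts(tweets, lexicon)
print(dict(counts))
```

Without the expansion step the second tweet would not match any dimension, which is the motivation for enlarging the vocabulary before searching.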

Timelines along each mood dimension
[...] counterpart to the differentiated response to the Presidential
election. On Thanksgiving day we find a spike in Happy
values, indicating high levels of public happiness. However,
no other mood dimensions are elevated on November 27.
Furthermore, the spike in Happy values is limited to the one
day, i.e. we find no significant mood response the day before
or after Thanksgiving.
Figure: z-score timelines, Oct 22 to Nov 26, for OpinionFinder and the six GPOMS dimensions (CALM, ALERT, SURE, VITAL, KIND, HAPPY). Annotations: pre-election anxiety, election results, pre-election energy, day after election, Thanksgiving, Thanksgiving happiness.
Fig. 2. Tracking public mood states from tweets posted between October
2008 to December 2008 shows public responses to presidential election and
Thanksgiving.
[...] partially overlap with the mood values provided by
[...] OpinionFinder, but not necessarily all mood dimensions that
[...] important in describing the various components of
[...] e.g. the varied mood response to the Presidential
[...] GPOMS thus provides a unique perspective on
[...] states not captured by uni-dimensional tools such
[...] OpinionFinder.
Granger Causality Analysis of Mood vs. DJIA
[...] establishing that our mood time series responds to
[...] socio-cultural events such as the Presidential election
and Thanksgiving, we are concerned with the question
[...] variations of the public’s mood state correlate
[...] in the stock market, in particular DJIA closing
[...] answer this question, we apply the econometric
[...] Granger causality analysis to the daily time
[...] by GPOMS and OpinionFinder vs. the DJIA.
[...] causality analysis rests on the assumption that if a
[...] causes Y then changes in X will systematically
[...] changes in Y. We will thus find that the lagged
[...] will exhibit a statistically significant correlation
[...] Correlation however does not prove causation. We
[...] Granger causality analysis in a similar fashion
[...] are not testing actual causation but whether one
[...] has predictive information about the other or not.
[...] time series, denoted Dt, is defined to reflect daily
[...] stock market value, i.e. its values are the delta [...]
high level of confidence. However, this result only applies to
1 GPOMS mood dimension. We observe that X1 (i.e. Calm)
has the highest Granger causality relation with DJIA for lags
ranging from 2 to 6 days (p-values < 0.05). The other four
mood dimensions of GPOMS do not have significant causal
relations with changes in the stock market, and neither does
the OpinionFinder time series.
To visualize the correlation between X1 and the DJIA in
more detail, we plot both time series in Fig. 3. To maintain
the same scale, we convert the DJIA delta values Dt and mood
index value Xt to z-scores as shown in Eq. 1.
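
To make the quantities in this passage concrete, the sketch below computes daily DJIA deltas, converts a series to z-scores, and correlates a mood series with the deltas a few days later. Two caveats: Eq. 1 is not part of this excerpt, so the z-score here uses the mean and standard deviation of the whole series, which is an assumption; and the lagged Pearson correlation is a much simpler stand-in for a Granger causality test, not a replacement for it. The numbers are synthetic and all function names are hypothetical.

```python
from statistics import mean, pstdev


def daily_delta(closing):
    """D_t: difference between consecutive closing values."""
    return [b - a for a, b in zip(closing, closing[1:])]


def z_scores(series):
    """Standardise using the whole-series mean and standard deviation."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(mood, delta, lag):
    """Correlate mood on day t - lag with the delta on day t."""
    if lag == 0:
        return pearson(mood, delta)
    return pearson(mood[:-lag], delta[lag:])


# Synthetic data: the delta on day t is 10x the mood two days earlier.
mood = [0.1, 0.5, -0.2, 0.3, -0.4, 0.2, 0.0]
closing = [100, 100, 100, 101, 106, 104, 107, 103]
delta = daily_delta(closing)

z_mood, z_delta = z_scores(mood), z_scores(delta)
for lag in range(4):
    print(lag, round(lagged_correlation(z_mood, z_delta, lag), 3))
```

Because the toy series is constructed with a two-day delay, the correlation peaks at lag 2; real data would be noisier and would need a proper significance test.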
Figure: z-score time series of the DJIA and of Calm, Aug 09 to Oct 28. Annotation: bank bail-out.
Fig. 3. A panel of three graphs. The top graph shows the overlap of the
day-to-day difference of DJIA values (blue: ZDt) with the GPOMS’ Calm [...]
Look for correlations between
dimensions and DJIA
Twitter mood predicts the stock market.
Johan Bollen1,*, Huina Mao1,*, Xiao-Jun Zeng2.
*: authors made equal contributions.
Abstract—Behavioral economics tells us that emotions can
profoundly affect individual behavior and decision-making. Does
this also apply to societies at large, i.e. can societies experience
mood states that affect their collective decision making? By
extension is the public mood correlated or even predictive of
economic indicators? Here we investigate whether measurements
of collective mood states derived from large-scale Twitter feeds
are correlated to the value of the Dow Jones Industrial Average
(DJIA) over time. We analyze the text content of daily Twitter
feeds by two mood tracking tools, namely OpinionFinder that
measures positive vs. negative mood and Google-Profile of Mood
States (GPOMS) that measures mood in terms of 6 dimensions
(Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate
the resulting mood time series by comparing their ability to
detect the public’s response to the presidential election and
Thanksgiving day in 2008. A Granger causality analysis and
a Self-Organizing Fuzzy Neural Network are then used to
investigate the hypothesis that public mood states, as measured by
the OpinionFinder and GPOMS mood time series, are predictive
of changes in DJIA closing values. Our results indicate that the
accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others.
We find an accuracy of 87.6% in predicting the daily up and
down changes in the closing values of the DJIA and a reduction
of the Mean Average Percentage Error by more than 6%.
Index Terms—stock market prediction — twitter — mood
analysis.
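
The two evaluation figures quoted in the abstract, directional accuracy and MAPE, can be illustrated with the sketch below. The exact definitions used in the paper are not part of this excerpt, so the code assumes the usual ones: MAPE as the mean absolute percentage error, and direction accuracy as the share of days on which the predicted move relative to the previous close has the same sign as the actual move. The values are made up.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumed definition)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def direction_accuracy(actual, predicted):
    """Percentage of days where predicted and actual moves agree in sign.

    Both moves are measured against the previous day's actual value.
    """
    hits = 0
    total = 0
    for prev, a, p in zip(actual, actual[1:], predicted[1:]):
        total += 1
        if (a - prev > 0) == (p - prev > 0):
            hits += 1
    return 100.0 * hits / total


# Made-up closing values and predictions, for illustration only.
actual = [100, 102, 101, 105]
predicted = [100, 101, 103, 104]
print(round(mape(actual, predicted), 3))
print(round(direction_accuracy(actual, predicted), 1))
```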
I. INTRODUCTION
STOCK market prediction has attracted much attention
from academia as well as business. But can the stock
market really be predicted? Early research on stock market
prediction [1], [2], [3] was based on random walk theory
and the Efficient Market Hypothesis (EMH) [4]. According
to the EMH stock market prices are largely driven by new
information, i.e. news, rather than present and past prices.
Since news is unpredictable, stock market prices will follow a
random walk pattern and cannot be predicted with more than
50 percent accuracy [5].
There are two problems with EMH. First, numerous studies
show that stock market prices do not follow a random walk
and can indeed to some degree be predicted [5], [6], [7], [8]
thereby calling into question EMH’s basic assumptions. Second,
recent research suggests that news may be unpredictable
but that very early indicators can be extracted from online
social media (blogs, Twitter feeds, etc) to predict changes
in various economic and commercial indicators. This may
conceivably also be the case for the stock market. For example,
[11] shows how online chat activity predicts book sales. [12]
uses assessments of blog sentiment to predict movie sales. [...]
sentiment from blogs. In addition, Google search queries have
been shown to provide early indicators of disease infection
rates and consumer spending [14]. [9] investigates the relations
between breaking financial news and stock price changes.
Most recently [13] provide a ground-breaking demonstration
of how public sentiment related to movies, as expressed on
Twitter, can actually predict box office receipts.
Although news most certainly influences stock market
prices, public mood states or sentiment may play an equally
important role. We know from psychological research that
emotions, in addition to information, play a significant role
in human decision-making [16], [18], [39]. Behavioral finance
has provided further proof that financial decisions are significantly
driven by emotion and mood [19]. It is therefore
reasonable to assume that the public mood and sentiment can
drive stock market values as much as news. This is supported
by recent research by [10] who extract an indicator of public
anxiety from LiveJournal posts and investigate whether its
variations can predict SP500 values.
However, if it is our goal to study how public mood
influences the stock markets, we need reliable, scalable and
early assessments of the public mood at a time-scale and
resolution appropriate for practical stock market prediction.
Large surveys of public mood over representative samples of
the population are generally expensive and time-consuming
to conduct, cf. Gallup’s opinion polls and various consumer
and well-being indices. Some have therefore proposed indirect
assessment of public mood or sentiment from the results of
soccer games [20] and from weather conditions [21]. The
accuracy of these methods is however limited by the low
degree to which the chosen indicators are expected to be
correlated with public mood.
Over the past 5 years significant progress has been made
in sentiment tracking techniques that extract indicators of
public mood directly from social media content such as blog
content [10], [12], [15], [17] and in particular large-scale
Twitter feeds [22]. Although each so-called tweet, i.e. an
individual user post, is limited to only 140 characters, the
aggregate of millions of tweets submitted to Twitter at any
given time may provide an accurate representation of public
mood and sentiment. This has led to the development of real-time
sentiment-tracking indicators such as [17] and “Pulse of Nation”.
In this paper we investigate whether public sentiment, as
expressed in large-scale collections of daily Twitter posts, can
be used to predict the stock market. We use two tools to
measure variations in the public mood from tweets submitted
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010

And it works!

And it works! (Maybe!)

Coming Soon! CompleNet 2016
Dijon, France — March 23-25
