Bruno Gonçalves
www.bgoncalves.com
Twitterology: The Science of Twitter
www.bgoncalves.com @bgoncalves
The Internet In Real Time
Social Media
Twitter
Data
Anatomy of a Tweet
[u'contributors',
u'truncated',
u'text',
u'in_reply_to_status_id',
u'id',
u'favorite_count',
u'source',
u'retweeted',
u'coordinates',
u'entities',
u'in_reply_to_screen_name',
u'in_reply_to_user_id',
u'retweet_count',
u'id_str',
u'favorited',
u'user',
u'geo',
u'in_reply_to_user_id_str',
u'possibly_sensitive',
u'lang',
u'created_at',
u'in_reply_to_status_id_str',
u'place',
u'metadata']
[u'follow_request_sent',
u'profile_use_background_image',
u'default_profile_image',
u'id',
u'profile_background_image_url_https',
u'verified',
u'profile_text_color',
u'profile_image_url_https',
u'profile_sidebar_fill_color',
u'entities',
u'followers_count',
u'profile_sidebar_border_color',
u'id_str',
u'profile_background_color',
u'listed_count',
u'is_translation_enabled',
u'utc_offset',
u'statuses_count',
u'description',
u'friends_count',
u'location',
u'profile_link_color',
u'profile_image_url',
u'following',
u'geo_enabled',
u'profile_banner_url',
u'profile_background_image_url',
u'screen_name',
u'lang',
u'profile_background_tile',
u'favourites_count',
u'name',
u'notifications',
u'url',
u'created_at',
u'contributors_enabled',
u'time_zone',
u'protected',
u'default_profile',
u'is_translator']
http://www.bgoncalves.com/teaching/data-mining.html
[u'type',
u'coordinates']
[u'symbols',
u'user_mentions',
u'hashtags',
u'urls']
u'<a href="http://foursquare.com" rel="nofollow">
foursquare</a>'
u"I'm at Terminal Rodovi\xe1rio de Feira de Santana
(Feira de Santana, BA) http://t.co/WirvdHwYMq"
{u'display_url': u'4sq.com/1k5MeYF',
u'expanded_url': u'http://4sq.com/1k5MeYF',
u'indices': [70, 92],
u'url': u'http://t.co/WirvdHwYMq'}
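The fields listed above can be read straight off the tweet dictionary. The sketch below is a minimal illustration, not the code used in the talk: the `tweet` dictionary is a hypothetical, cut-down object that holds only the values quoted on the slide, and the helper names `client_name` and `expand_urls` are made up for this example.

```python
import re

# Hypothetical tweet reduced to the fields quoted on the slide; a real
# tweet object carries all of the keys listed above.
tweet = {
    "text": "I'm at Terminal Rodoviário de Feira de Santana "
            "(Feira de Santana, BA) http://t.co/WirvdHwYMq",
    "source": '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    "entities": {
        "symbols": [],
        "user_mentions": [],
        "hashtags": [],
        "urls": [{
            "display_url": "4sq.com/1k5MeYF",
            "expanded_url": "http://4sq.com/1k5MeYF",
            "indices": [70, 92],
            "url": "http://t.co/WirvdHwYMq",
        }],
    },
}


def client_name(source_html):
    """Strip the anchor tag from the 'source' field, leaving the client name."""
    return re.sub(r"<[^>]+>", "", source_html).strip()


def expand_urls(tweet):
    """Replace each shortened link in the text with its expanded form."""
    text = tweet["text"]
    urls = sorted(tweet["entities"]["urls"],
                  key=lambda u: u["indices"][0], reverse=True)
    # Work right to left so earlier indices stay valid after each substitution.
    for url in urls:
        start, end = url["indices"]
        text = text[:start] + url["expanded_url"] + text[end:]
    return text


print(client_name(tweet["source"]))
print(expand_urls(tweet))
```

The `indices` pair marks where the shortened URL sits inside `text`, which is why the replacement can be done by slicing rather than by searching.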
Demographics
Market Penetration PLoS One 8, E61981 (2013)
World Coverage
Age Distribution
PLoS One 10, e0115545 (2015)
Demographics
[...] users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: fully 71.8% of the users who we find a name match for had a male name.
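The name-matching step described in this excerpt can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `GENDER_BY_FIRST_NAME` is a tiny hypothetical stand-in for the gender list used in the study, and the function names are invented for this example.

```python
# Hypothetical name-to-gender list; the study relied on a much larger one.
GENDER_BY_FIRST_NAME = {
    "maria": "female",
    "ana": "female",
    "john": "male",
    "bruno": "male",
}


def infer_gender(self_reported_name):
    """Look up the first word of the self-reported name in the gender list."""
    words = self_reported_name.strip().split()
    if not words:
        return None
    return GENDER_BY_FIRST_NAME.get(words[0].lower())


def match_stats(names):
    """Return (share of names with a match, share of matches that are male)."""
    genders = [infer_gender(name) for name in names]
    matched = [g for g in genders if g is not None]
    match_rate = len(matched) / len(names) if names else 0.0
    male_share = matched.count("male") / len(matched) if matched else 0.0
    return match_rate, male_share


names = ["John Smith", "maria silva", "xX_dragon_Xx", "Bruno G", ""]
print(match_stats(names))
```

The two returned numbers correspond to the two percentages quoted in the excerpt: the match rate over all users and the male share among matched users.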
[Figure 3: Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time. x-axis: Date (2007-01 to 2009-07); y-axis: fraction of joining users who are male (0 to 1).]
[...] each last name with over 100 individuals in the U.S. during the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name “Myers” was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%.
Race/ethnicity distribution of Twitter users
We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We observed that we found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county. Due to the large amount of ambiguity in the last name to race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are [...]able to directly compare the Twitter race/ethnicity distribution [...]
1 This is effectively the census.model approach discussed in prior work (Chang et al. 2010).
[Figure 2, panel (a) Normal representation: Per-county over- and underrepresentation of U.S. [...] representation rate of 0.324%, presented in both (a) a normal layout and [...]. Blue colors indicate underrepresentation, while red colors represent [...] to the log of the over- or underrepresentation rate. Clear trends [...] overrepresentation of populous counties.]
[...] less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was [...]
[Figure 4: Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling. Panels: (a) Caucasian (non-Hispanic), (b) African-American, (c) Asian or Pacific Islander, (d) Hispanic.]
ICWSM’11, 375 (2011)
Network Structure
Twitter Network
[...] compared against each other. Before we delve into the eccentricities and peculiarities of Twitter, we run a batch of well-known analysis and present the summary.
Basic Analysis
Figure 1: Number of followings and followers
[...] construct a directed network based on the following and followed and analyze its basic characteristics. Figure 1 displays the distribution of the number of followings as the solid line and that of followers as the dotted line. The y-axis represents complementary cumulative distribution function (CCDF). We first explain the dis[...]
[...] TIME). The top 20 are listed in Figure 7. Some of them follow their followers, but most of them do not (the median number of followings of the top 40 users is 114, three orders of magnitude smaller than the number of followers). We revisit the issue of reciprocity in Section 3.3.
3.2 Followers vs. Tweets
Figure 2: The number of followers and that of tweets per user
In order to gauge the correlation between the number of followers and that of written tweets, we plot the number of tweets (y) against the number of followers a user has (x) in Figure 2. We bin the number of followers in logscale and plot the median per bin in the dashed line. The majority of users who have fewer than 10 followers never tweeted or did just once and thus the median stays at 1. The average number of tweets against the number of followers per [...] there are outliers [...] of followers. [...]n x = 100 to [...] sure, but only state the correlation between the numbers of tweets and followers.
3.3 Reciprocity
In Section 3.1 we briefly mention that top users by the number
of followers in Twitter are mostly celebrities and mass media and
most of them do not follow their followers back. In fact Twitter
shows a low level of reciprocity; 77.9% of user pairs with any link
between them are connected one-way, and only 22.1% have recip-
rocal relationship between them. We call those r-friends of a user as
they reciprocate a user’s following. Previous studies have reported
much higher reciprocity on other social networking services: 68%
on Flickr [4] and 84% on Yahoo! 360 [18].
Moreover, 67.6% of users are not followed by any of their fol-
lowings in Twitter. We conjecture that for these users Twitter is
rather a source of information than a social networking site. Fur-
ther validation is out of the scope of this paper and we leave it for
future work.
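The reciprocity figure quoted in Section 3.3 is a ratio over connected user pairs. A minimal sketch of that computation follows; the function name and the toy edge list are assumptions made for this example, and an edge `(a, b)` is taken to mean "a follows b".

```python
def reciprocity(follow_edges):
    """Fraction of connected user pairs whose follow link goes both ways."""
    # Deduplicate edges and ignore self-follows.
    edges = {(a, b) for a, b in follow_edges if a != b}
    # Every unordered pair with at least one link between its members.
    pairs = {frozenset(edge) for edge in edges}
    # Each reciprocal pair shows up twice in the directed edge set.
    mutual = sum(1 for a, b in edges if (b, a) in edges) // 2
    return mutual / len(pairs) if pairs else 0.0


# Toy network: a and b follow each other; the other two links are one-way.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
print(reciprocity(toy_edges))
```

On the real follower graph the same ratio would be computed over all user pairs with any link between them, which is the denominator used in the excerpt.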
3.4 Degree of Separation
WWW'10, 591 (2010)
Retweet Trees
6.1 Audience Size of Retweet
Figure 14: Average and median numbers of additional recipients [...] via retweeting
[...] be to mass media in various forms: radio, TV, and [...] they are immediate recipients and consumers of the [...]hed media produce. On Twitter people acquire [...] always directly from those they follow, but often [...] Assuming a tweet posted by a user is viewed and [...] of the user’s followers, we count the number of [...] recipients who are not immediate followers of the orig[...] Figure 14 displays its average and median per [...] number of followers of the original tweet user. [...] almost always below the average, indicating that [...] a very large number of additional recipients. Up [...] followers, the average number of additional recipi[...]d by the number of followers of the tweet source.
WWW'10, 591 (2010)
Retweet Trees
Figure 15: Retweet trees of ‘air france flight’ tweets
Figure 16: Height and participating users in retweet trees
[...] retweeting the same tweet, and cross-retweet is retweeting each other.
In Figure 16 we plot the CCDFs of the retweet tree heights and the number of users in a retweet tree. The height of 1 is the most [...]
6. IMPACT OF RETWEET
We have seen how trending topics rise in popularity and eventually die in Section 5. Then how exactly does information spread [...] Twitter? Retweet is an effective means to relay the information beyond adjacent neighbors. We dig into the retweet trees constructed per trending topic and examine key factors that impact the eventual spread of information.
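The two quantities plotted in Figure 16, tree height and number of participating users, can be obtained with a breadth-first walk over the retweet edges. The sketch below assumes each retweet is recorded as a `(retweeter, retweeted_from)` pair; that input format, the function name, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict, deque


def retweet_tree_stats(root, retweets):
    """Return (height, number of users) of the retweet tree rooted at `root`."""
    children = defaultdict(list)
    for retweeter, source in retweets:
        children[source].append(retweeter)

    height = 0
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        user, depth = queue.popleft()
        height = max(height, depth)
        for child in children[user]:
            # Count each user once, even if they retweeted more than once.
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return height, len(seen)


# Toy tree: b and c retweet a; d retweets b.
print(retweet_tree_stats("a", [("b", "a"), ("c", "a"), ("d", "b")]))
```

With this convention a tree of height 1 is an original tweet plus its direct retweets only.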
6.1 Audience Size of Retweet
WWW'10, 591 (2010)
Link Function
Agreement Discussion
ICWSM’11, 89 (2011)
Social Interactions
Friends Talk to Each Other PLoS One 6, E22656 (2011)
Online Conversations
[Figure. Panel A: average weight per connection, ω_out, as a function of out-degree k_out (0 to 600). Panel B: number of reciprocated connections, ρ, as a function of in-degree k_in (0 to 600).]
$$\omega_i^{out} = \frac{\sum_j \omega_{ij}}{k_i^{out}}$$
1.7 Million users
370 Million messages
Saturation of the number of reciprocated connections
Number of connections for which interaction strength is highest
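The average weight per connection shown on this slide is the total number of messages a user sends divided by the number of distinct users they send them to. A minimal sketch, assuming messages are available as `(sender, receiver)` pairs (an input format chosen for this example, as is the function name):

```python
from collections import Counter, defaultdict


def average_out_weight(messages):
    """Average weight per outgoing connection for every sender."""
    # weights[(i, j)] is the number of messages sent from i to j.
    weights = Counter(messages)
    total_weight = defaultdict(int)
    k_out = defaultdict(int)
    for (sender, _receiver), weight in weights.items():
        total_weight[sender] += weight
        k_out[sender] += 1
    return {user: total_weight[user] / k_out[user] for user in total_weight}


# Toy data: a writes twice to b and once to c; b writes once to a.
toy_messages = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")]
print(average_out_weight(toy_messages))
```

Grouping the resulting values by `k_out` and averaging within each group would give a curve of the kind shown in panel A.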
PLoS One 6, E22656 (2011)
[...] two possible cases in networks with [...]: (a) positively correlated nets and (b) [...] width of the line of the links represents [...]
PHYSICAL REVIEW E 76, 066106 (2007)
[Diagram: nodes A, B, and C, each labeled with its k_in, k_out, s_in, and s_out values.]
Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree.
Observing a retweet at node B provides implicit confirma-
tion that information from A appeared in B’s Twitter feed,
while a mention of B originating at node A explicitly con-
firms that A’s message appeared in B’s Twitter feed. This
may or may not be noticed by B, therefore mention edges
are less reliable indicators of information flow compared to
retweet edges.
Retweet and reply/mention information parsed from the
text can be ambiguous, as in the case when a tweet is marked
as being a ‘retweet’ of multiple people. Rather, we rely
on Twitter metadata, which designates users replied to or
retweeted by each message. Thus, while the text of a tweet
may contain several mentions, we only draw an edge to the
user explicitly designated as the mentioned user by the meta-
data. In so doing, we may miss retweets that do not use the
explicit retweet feature and thus are not captured in the meta-
data. Note that this is separate from our use of mentions as
memes (§ 3.1), which we parse from the text of the tweet.
4 System Architecture [...]
The Strength of Weak Ties (1973)
• Interviews to learn how individuals found out about job opportunities.
• Mostly from acquaintances or friends of friends.
• “It is argued that the degree of overlap of two individuals’ social networks varies directly with the strength of their tie to one another”
[...] for a time sufficient to its [...]ale communication network [...]nd the calls among them links.
[...] indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore [...]
[Figure caption, partly cropped: Panels (a), and (b) show calls within 3 hours between people in the same town in two different time windows. [...]al network structure, which was recorded by aggregating interactions during 6 months. Node size and colors [...]idth and color represent weight.]
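The probability p(n) described in this excerpt can be estimated by replaying the communication events in time order. The sketch below is a simplified illustration, not the procedure of the paper: it tracks ties from the caller's side only and pools all users together, whereas the excerpt averages over users with the same final degree k. The function name and input format are assumptions.

```python
from collections import defaultdict


def new_tie_probability(events):
    """Estimate p(n) from time-ordered (caller, callee) events.

    p(n) is the share of events made by a caller who already has n ties
    that go to a contact the caller has not called before.
    """
    ties = defaultdict(set)
    new_events = defaultdict(int)
    all_events = defaultdict(int)
    for caller, callee in events:
        n = len(ties[caller])
        all_events[n] += 1
        if callee not in ties[caller]:
            new_events[n] += 1
            ties[caller].add(callee)
    return {n: new_events[n] / all_events[n] for n in sorted(all_events)}


toy_events = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b"), ("a", "c")]
print(new_tie_probability(toy_events))
```

The first event of any user always creates a tie, so p(0) is 1 by construction; the interesting part is how quickly p(n) decays for larger n.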
Network Structure The Strength of Intermediary Ties in Social Media
“People whose networks bridge the structural holes
between groups have an advantage in detecting and
developing rewarding opportunities. Information
arbitrage is their advantage. They are able to see
early, see more broadly, and translate information
across groups.”
Structural Holes and Good Ideas
Ronald S. Burt
University of Chicago
AJS Volume 110 Number 2 (September 2004): 349–99
This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a [...] American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas [...] disproportionately in the hands of people whose networks [...] structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.
The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with a[...]
PLoS One 7, e29358 (2012)
[...] to Granovetter expectation that the stronger the tie is the higher number of mutual contacts of both parties it has and the higher the [...] belong to the same group.
Figure 2. Group and link statistics. (A) Size distribution of the group. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations in respect to detected groups. doi:10.1371/journal.pone.0029358.g002
Groups
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the [...] groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group [...]
2.4 Links between groups
The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons no[...] neighbors, being important to keep the network connected [...] for information diffusion. We investigate whether [...] between groups play a similar role in the online network [...] information transmitters. The actions more related to information diffusion are retweets [24] that show a slight preference [...] occurring on between-group links (Figures 4B and [...]). [...] preference is enhanced when the similarity between [...] groups is taken into account. We define the similarity between [...] groups, A and B, in terms of the Jaccard index [...] connections:
$$\mathrm{similarity}(A,B) = \frac{|\mathrm{links}(A) \cap \mathrm{links}(B)|}{|\mathrm{links}(A) \cup \mathrm{links}(B)|}$$
The similarity is the overlap between the groups’ connections [...] it estimates network proximity of the groups. The general [...] is that links with mentions more likely occur between close [...] and retweets occur between groups with medium [...] (Figure 4D). Mentions as personal messages are [...] exchanged between users with similar environments [...] predicted by the strength of weak ties theory. Links with [...] are related to information transfer and the similarity of the [...] between which they take place should be small according to Granovetter’s theory. The results show that the most likely to attract retweets are the links connecting groups that are neither too close nor too far. This can be explained with Aral’s theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6, S7, S8, S9, S10, S11, S12, S13, S14 and Table S1 in the Supplementary Information).
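The group similarity used in this excerpt is a Jaccard index over the two groups' connections. A minimal sketch follows; representing each group's connections as a set of neighbouring user ids is an assumption made for this example, as is the function name.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two groups' connection sets (0 = disjoint, 1 = identical)."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy groups sharing two of four distinct connections.
print(group_similarity({1, 2, 3}, {2, 3, 4}))
```

Values near 1 mean the two groups sit in almost the same network neighbourhood; values near 0 mean they are far apart, which is the axis along which the excerpt compares mentions and retweets.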
Geolocation
Twitter follower distance
[Figure caption, partly cropped: [...] of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New [...]ed towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on [...]]
Social Networks 34, 73 (2012)
Locality
Table 5. Top countries.
Columns: (1) share of egos (%) [a]; (2) share of egos (%) for egos in dyads [b]; (3) share of alters (%) [c]; (4) percentage of domestic ties [d]; (5) percentage of domestic ties among non-local ties [d]; (6) following foreign alters / being followed from abroad; (7) country named explicitly (% of egos).

Country       (1)    (2)    (3)    (4)    (5)    (6)    (7)
USA           48.5   45.7   54.5   91.6   89.3   0.3    8.1
Brazil        10.6   12.1   10.5   83.5   72.5   4.9    55.4
UK            7.6    8.3    7.6    50.6   33.3   1.2    45.3
Japan         5.5    6.5    6.3    92.1   86.0   1.4    25.0
Canada        3.7    3.8    2.9    33.3   23.1   1.6    58.5
Australia     2.7    2.7    1.9    50.0   32.0   2.2    69.7
Indonesia     2.6    1.8    1.2    60.0   25.0   7.0    83.3
Germany       2.1    1.8    1.3    62.9   58.8   3.2    58.6
Netherlands   1.4    1.4    1.2    66.7   22.2   1.5    54.3
Mexico        1.2    1.3    0.7    44.0   8.3    7.0    56.7

[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.
between those two interpretations. We also note that top Twitter
clusters intersect only to an extent with Alderson and Beckfield’s
(2004) ranking of world cities based on multinational corporations’
branch headquarters. (Of Alderson and Beckfield’s top 25 cities by
in-degree or “prestige,” 13 appear in the top 25 Twitter clusters
ranked by in-degree centrality, with another 6 appearing in top
100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent con-
nect users in the same country. This prevalence of domestic ties is
Table 6. The most common languages. Based on 2852 egos.

Language      % of egos
English       72.5
Portuguese    10.1
Japanese      5.4
Spanish       3.1
Indonesian    1.8
German        1.7
Dutch         1.0
Chinese       0.9
Korean        0.4
Swedish       0.4
Social Networks 34, 73 (2012)
accounts, by randomly drawing an account from among those “fol-
lowed” by each of those egos. We then coded the locations of the
alters using the same procedure as we did for the egos, removing
those pairs where the alter could not be assigned to a country. In
the end, we obtained a sample of 1953 ego-alter pairs with both
the ego and the alter assigned to a country, including 1259 pairs
with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since
users can often choose between a range of specific names for the
same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we
aggregated nearby locations within each country, by assigning a
set of coordinates (obtained from Google Maps) to each location
smaller than 25,000 km2 and then merging nearby locations within
each country by replacing their coordinates with a weighted aver-
age of the coordinates of the merged locations. This reduced our
location descriptions to a set of 386 regional clusters, which are
comparable in size to metropolitan areas. We labeled each clus-
ter with the most common name associated with it in our sample.
For example, the cluster centered on Manhattan is referred to as
“New York.”
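The aggregation step in Section 4.4 can be sketched as a greedy merge of nearby locations inside one country. The excerpt does not say how "nearby" was decided, so the 100 km radius below is a hypothetical parameter, as are the function names and the toy coordinates; the weighted average of coordinates and the "most common name" label follow the description in the text.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def merge_nearby(locations, radius_km=100.0):
    """Merge (name, lat, lon, user_count) locations of ONE country into clusters.

    radius_km is an assumed threshold, not a value taken from the paper.
    Longitudes are averaged directly, so the sketch ignores the antimeridian.
    """
    clusters = []
    # Start from the most populated locations so they seed the clusters.
    for name, lat, lon, count in sorted(locations, key=lambda loc: -loc[3]):
        for c in clusters:
            if haversine_km((lat, lon), (c["lat"], c["lon"])) <= radius_km:
                weight = c["weight"] + count
                c["lat"] = (c["lat"] * c["weight"] + lat * count) / weight
                c["lon"] = (c["lon"] * c["weight"] + lon * count) / weight
                c["weight"] = weight
                c["names"][name] = c["names"].get(name, 0) + count
                break
        else:
            clusters.append({"lat": lat, "lon": lon, "weight": count,
                             "names": {name: count}})
    # Label each cluster with its most common location name.
    return [(max(c["names"], key=c["names"].get), c["lat"], c["lon"], c["weight"])
            for c in clusters]


toy_locations = [
    ("Palo Alto", 37.44, -122.14, 10),
    ("SF Bay", 37.77, -122.42, 30),
    ("New York", 40.71, -74.01, 50),
]
for cluster in merge_nearby(toy_locations):
    print(cluster)
```

A greedy pass like this depends on the processing order; a proper clustering method could replace it without changing the weighted-average and labeling logic.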
5. Analysis
In this section we analyze the factors affecting the formation of
Twitter ties. We first look at the effect of each variable identified
earlier based on theoretical considerations: the actual physical dis-
tance, the frequency of air travel, national boundaries, and language
differences. In addition to presenting the descriptive statistics
demonstrating the effects of each variable and investigating the
nature of such effects, we correlated the effects using the Quadratic
Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the
last subsection we also examined the relationship between the
variables using QAP regression (Double Dekker Semi-partialling
MRQAP). All statistical calculations were done using UCINet 6.277
(Borgatti et al., 2002).
For correlation and regression analysis we used networks with
nodes representing the 25 largest regional clusters of users (see
Table 3. Top clusters.
Columns: (1) share of egos (%) [b]; (2) share of egos (%) for egos in dyads [c]; (3) share of alters (%) [d]; (4) locality [e].

Rank  Cluster [a]               (1)   (2)   (3)    (4)
1     “New York”                8.5   8.3   10.2   54.3
2     “Los Angeles, CA”         5.1   5.6   10.4   53.3
3     “[...]” (Tokyo)           4.1   4.8   5.0    62.9
4     “London”                  3.6   3.3   4.9    48.8
5     “São Paulo”               3.5   3.0   3.6    78.4
6     “San Francisco”           2.8   2.7   4.1    41.2
7     “New Jersey” [f]          2.5   2.8   2.1    20.0
8     “Chicago”                 2.2   2.0   1.7    32.0
9     “Washington, DC”          2.1   2.8   2.6    34.3
10    “Manchester, UK”          1.9   2.0   1.1    30.8
11    “Atlanta”                 1.7   2.1   2.1    46.2
12    “San Diego”               1.5   1.5   1.1    26.3
13    “Toronto, Canada”         1.3   1.1   1.5    42.9
14    “Seattle”                 1.3   1.4   1.2    58.8
15    “Houston”                 1.2   1.2   1.0    40.0
16    “Dallas, Texas”           1.2   1.0   1.4    61.5
17    “Rio de Janeiro”          1.2   1.0   1.1    30.8
18    “Boston, MA”              1.2   1.2   1.1    20.0
19    “Amsterdam”               1.1   1.1   0.9    50.0
20    “Jakarta, Indonesia”      1.1   0.6   0.3    42.9
21    “Austin, TX”              1.0   1.0   1.3    50.0
22    “Sydney”                  0.9   1.0   0.8    38.5
23    “Orlando, Florida”        0.9   1.0   0.6    16.7
24    “Phoenix, AZ”             0.8   0.7   0.6    11.1
25    “[...]” (Hyōgo) [g]       0.8   1.0   1.0    25.0

[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km².
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.
over half of the egos are in other countries, as are 4 of the 10
largest clusters: Tokyo, São Paulo, and two clusters in the United
Mobility and Social Networks
Coupling Mobility and Interactions in Social Media
Follower
Geography and Social Networks
[Figure labels: Geography, Follower, Reply, ReTweet]
PLoS One 9, E92196 (2014)
[...] and for their dependence on the distance. The error Err of this null model is between 0.66 and 0.76 for the three countries, around twice the error of the TF model (see Figure 6).
The linking model (L model) is a simplified version of the TF model, without random mobility and with the box size d → 0. Agents move to visit their contacts with probability p_v, whereas with probability 1 − p_v they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide, visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not [...]
[...] (e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6).
Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.
Sensitivity of the TF Model to the Parameters and its Modifications
The results presented so far have been obtained at the optimal [...]
Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different
colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). Mobility network shows mobility patterns of individual users
throughout entire simulation. Ego network shows the social connections at the end of the simulation.
doi:10.1371/journal.pone.0092196.g004
Geo-Social Properties PLoS One 9, E92196 (2014)
that has also an edge between i and k, forming a triangle. Note that
a triangle consists of 3 triads centered on different nodes. The
effect of the distance on the clustering coefficient can be
incorporated by measuring the distances from each central node
j to two neighbors i and k forming a triad, d = d_ij + d_jk, and
calculating the network clustering restricted to triads with distance
d. This new function C(d) is the probability of closing a triangle
given the distance d in a triad,
\[ C(d) = \frac{\Delta(d)}{\Lambda(d)}, \]
where Λ(d) and Δ(d) are the numbers of triads and closed triangles
for the distance d, respectively. The value of the global clustering
coefficient C can be recovered by averaging C(d) over d. In the
datasets, we observe a drop in C(d) followed by a plateau, which is
best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if there is
diversity in the edge lengths. The triangle can be equilateral if all
the edges have the same length, isosceles if two have the same
length and the other is smaller, etc. We estimate the dominant
shapes of the triangles in the network by measuring the disparity D,
defined as:
\[ D = 6 \left[ \frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3} \right], \]
where d1, d2 and d3 are the geographical distances between the
locations of the users forming the triangle. The disparity takes
values between 0 and 1 as the shape of the triangle passes from
equilateral to isosceles, where one edge is much smaller than the
other two. D shows a distribution with two maxima in the online
social networks (Figure 2F), for low and high values.
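The two quantities defined in this excerpt are simple enough to check numerically. The following is a minimal sketch, not the authors' code: the function names, the input format (a list of `(distance, closed)` pairs per triad) and the fixed-width distance bins are assumptions made only for illustration.

```python
from collections import defaultdict


def triangle_disparity(d1, d2, d3):
    """Disparity D = 6 * [(d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3]."""
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("at least one edge length must be positive")
    return 6.0 * ((d1 ** 2 + d2 ** 2 + d3 ** 2) / total ** 2 - 1.0 / 3.0)


def clustering_by_distance(triads, bin_width):
    """Ratio of closed triangles to triads, per distance bin.

    `triads` is an iterable of (d, closed) pairs, where d = d_ij + d_jk is the
    triad distance and `closed` says whether the edge i-k exists.
    The binning by `bin_width` is an assumption of this sketch.
    """
    n_triads = defaultdict(int)
    n_closed = defaultdict(int)
    for d, closed in triads:
        b = int(d // bin_width)
        n_triads[b] += 1
        if closed:
            n_closed[b] += 1
    return {b: n_closed[b] / n_triads[b] for b in n_triads}


# Equilateral triangle: D is 0. One edge of zero length: D is 1.
d_equilateral = triangle_disparity(10.0, 10.0, 10.0)
d_degenerate = triangle_disparity(10.0, 10.0, 0.0)
c_of_d = clustering_by_distance([(5, True), (7, False), (15, False)], bin_width=10)
```

The two limiting cases reproduce the range quoted in the text: an equilateral triangle gives 0, and an isosceles triangle with one vanishing edge gives 1.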
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares),
Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2).
The spatial model (magenta), based on geography, matches well the data in Pl(d), but yields near-zero values for R(d), Jf(d) and C(d). The linking
model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of Pl(d), R(d), Jf(d) and
C(d).
doi:10.1371/journal.pone.0092196.g002
Panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity
Geo-Social Model
[Diagram of one model step for a user u, labels in flow order:]
Starting position of user u
Visit a random neighbour, or Jump to a new location
New position of u
Detect all encounters e in the box of u
Created new social links
PLoS One 9, E92196 (2014)
Model Fitting
0.39 for Germany. For simplicity, we focus on the Twitter
networks only, although similar results are obtained for the other
datasets.
Results
Simulations for the Optimal Parameters
An example with the displacements between the consecutive
locations and the ego networks for a sample of individuals, as
generated by the TF model, are displayed in Figure 4. The
parameters of the model are set to the ones that correspond to the
minimum of the error Err. As shown, the agents tend to stay close
to their original positions. Occasional long jumps occur due to
visits to friends who live far apart. In this range of parameters and
simulation times, the main mechanism for generating long distance
second null model, the linking model (L model), in contrast, is
based only on random linking and triadic closure, and it is
equivalent to the TF model without the mobility. We consider the
two uncoupled null models and compare their results with those of
the TF model. In this way, we demonstrate the importance of the
coupling through a realistic mobility mechanism to reproduce the
empirical networks.
The spatial model (S model) consists of randomly connecting
pairs of users with a probability that decays as a power law of the
distance between them (suggested in [41]). The exponent of the
power law is fixed at −0.7 following Figure 2A. The results of
the S model are shown in the panels of Figure 2. While it is set to
match Pl(d), other properties such as P(k), R(d), Jf(d), C(d) or
P(D) are not well reproduced. The S model fails to account for the
high level of clustering and reciprocity in the empirical networks
Figure 3. Fitting the TF model. Values of the error Err when pv and pc are changed. The minimum error for each of the plots is marked with a red
rectangle.
doi:10.1371/journal.pone.0092196.g003
PLOS ONE | www.plosone.org 5 March 2014 | Volume 9 | Issue 3 | e92196
Prob. to Make a New Friend
Prob. to Visit an Old Friend
PLoS One 9, E92196 (2014)
Model Results
Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data
(red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3
and S4.
doi:10.1371/journal.pone.0092196.g005
Panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity
PLoS One 9, E92196 (2014)
Human Diffusion
[Figure: (a) Starting from Paris; (b) Starting from New York]
J. R. Soc. Interface 12, 20150473 (2015)
Residents and Tourists
[Figure, four panels. a: R̃ and Coverage for Local and Non-Local users. b: Coverage and Proportion of Non-Local Users. c: Coverage of New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing and Moscow. d: Coverage of Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris and Lisbon.]
J. R. Soc. Interface 12, 20150473 (2015)
City Communities
[Figure: Weighted Betweenness (×10²) and Weighted degree of Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London and New York.]
J. R. Soc. Interface 12, 20150473 (2015)
Collective Attention
#tags
• Metadata added to a Tweet for topic marking
• Originally proposed by Chris Messina in 2007
• Quickly adopted informally by the Twitter community
• Native support added by Twitter after it became popular
Hashtag Statistics
[Figure: two log-log panels, number of users per tag and number of tweets per tag, with axes running from 10¹ to 10⁵; an annotation marks 500 users. Labelled tags: swsx, swineflu, gfail, peace, watchmen, nsotu, winnenden, masters.]
WWW’12, 251 (2012)
Activity Peak Detection
• Peak: activity relative to the baseline has to be 10 times larger
• Minimal level of activity expected
• Selection of isolated popularity bursts (no other peaks one week before/after)
• We detected 115 peaks
Activity patterns illustrated: continuous (#video), periodic (#ff), peak (#w2e)
WWW’12, 251 (2012)
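The three selection criteria on this slide can be sketched in a few lines. This is an illustrative sketch only: the slide does not say how the baseline is computed, so the median of the daily counts is used here as an assumption, and the `min_count` threshold and the function name are hypothetical.

```python
from statistics import median


def detect_isolated_peaks(daily_counts, ratio=10.0, min_count=20, window=7):
    """Return indices of days that satisfy the three criteria on the slide.

    Assumptions of this sketch (not stated in the source):
    - the baseline is the median daily count of the hashtag;
    - `min_count` is an arbitrary minimal activity level.
    """
    baseline = max(median(daily_counts), 1.0)  # guard against a zero baseline
    # Criteria 1 and 2: ten times the baseline and a minimal level of activity.
    candidates = [
        day for day, count in enumerate(daily_counts)
        if count >= min_count and count >= ratio * baseline
    ]
    # Criterion 3: no other candidate peak within one week before or after.
    return [
        day for day in candidates
        if all(abs(day - other) > window for other in candidates if other != day)
    ]


quiet = [5] * 30
single_burst = quiet[:15] + [100] + quiet[16:]
double_burst = single_burst[:18] + [80] + single_burst[19:]

peaks_single = detect_isolated_peaks(single_burst)
peaks_double = detect_isolated_peaks(double_burst)
```

In the second series the two bursts are three days apart, so neither counts as an isolated peak under the one-week rule.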
Peak Characterization
[Figure: Tweets per day around the peak, over Days −30 to 30, split into Before Peak, Peak and After Peak and compared with the baseline. Examples: #winnenden (school shooting, Mar 3, 2009; 53% / 47%) and #watchmen (movie release date, Mar 6, 2009; 34% / 27% / 39%).]
WWW’12, 251 (2012)
Some Examples
[Figure: number of tweets and user ID against days after peak (−6 to 6) for four hashtags:
#masters: master cup finale, Apr 9, 2009 (51% / 48%)
#winnenden: school shooting, Mar 3, 2009 (53% / 47%)
#watchmen: movie release date, Mar 6, 2009 (34% / 27% / 39%)
#nsotu: Obama's first state of the union, Feb 25, 2009 (98% / 2%)
Classes shown: Anticipation, Reaction, “Anticipation + Reaction”, “Instantaneous”]
WWW’12, 251 (2012)
Classes of Peaks
• Anticipatory behavior
• Increasing amount of tweets until the event
• Sharp drop of attention after the event
• Unexpected events
• Driven by exogenous sources
• Activity concentrated on the peak day
• Events that are only discussed while they are happening
• Collective attention is built up to a peak intensity, then attention shifts away
[Figure: triangular diagram of hashtags. Sides: 0% peak (fp = 0), 0% before (fb = 0), 0% after (fa = 0). Corners: 100% peak, 100% before, 100% after. Hashtags plotted: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, nsotu, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools.]
WWW’12, 251 (2012)
Barycentric Coordinates
[Figure: example hashtag time series shown next to a 2D-Simplex.]
Points marked on the 2D-Simplex:
Corners: (1,0,0), (0,1,0), (0,0,1)
Edge midpoints: (1/2,1/2,0), (1/2,0,1/2), (0,1/2,1/2)
Centre: (1/3,1/3,1/3)
Interior points: (1/2,1/4,1/4), (1/4,1/2,1/4), (1/4,1/4,1/2)
WWW’12, 251 (2012)
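To make the mapping concrete, the sketch below turns a daily time series into the three fractions (before, peak, after) and places that triple inside an equilateral triangle. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the choice of which corner represents which fraction is arbitrary here, since the slide does not say which coordinate is which.

```python
import math


def peak_fractions(daily_counts, peak_index):
    """Fractions of activity before, on, and after the peak day."""
    total = sum(daily_counts)
    if total == 0:
        raise ValueError("no activity in the window")
    before = sum(daily_counts[:peak_index]) / total
    peak = daily_counts[peak_index] / total
    after = sum(daily_counts[peak_index + 1:]) / total
    return before, peak, after


def simplex_to_xy(f_before, f_peak, f_after):
    """Map barycentric coordinates to a point in an equilateral triangle.

    Corner placement is an assumption of this sketch:
    100% before -> (0, 0), 100% after -> (1, 0), 100% peak -> (0.5, sqrt(3)/2).
    """
    total = f_before + f_peak + f_after
    p = f_peak / total
    a = f_after / total
    x = a + 0.5 * p
    y = p * math.sqrt(3) / 2
    return x, y


fractions = peak_fractions([1, 2, 10, 3, 4], peak_index=2)
centre = simplex_to_xy(1 / 3, 1 / 3, 1 / 3)
peak_corner = simplex_to_xy(0, 1, 0)
```

A hashtag with all of its activity on the peak day lands on the top corner, while a triple of equal fractions lands at the centre of the triangle.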
Barycentric Coordinates
[Figure: hashtags placed in the 2D-Simplex according to their share of activity before, during and after the peak. Sides: 0% peak, 0% before, 0% after. Corners: 100% peak, 100% before, 100% after. Time series for #masters, #winnenden, #watchmen and #nsotu are shown alongside. Hashtags plotted: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, grammys, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools.]
WWW’12, 251 (2012)
Language Matters
Languages
Signal By Language
[Figure: signal by language, with legend Italian, English, Spanish, Portuguese, Other. Successive builds of the slide highlight the values 76%, 16% and 2%.]
Spanish PLoS One 9, E112074 (2014)
Local Variations PLoS One 9, E112074 (2014)
Superdialects
[Figure, panels A to C. Silhouette and f(K) against the number of Clusters; Population (×10⁵) per Cluster, with labels α and β and annotations N = 179 and N = 956. Cities labelled: Mexico City, Guatemala, San Salvador, Caracas, San Jose, Panama, Bogota, Quito, Lima, Asuncion, Cordoba, Santiago, Buenos Aires, Santiago De Compostela, Palma De Mallorca, Santander, Oviedo, Bilbao, Zaragoza, Valladolid, Barcelona, Madrid, Seville, San Diego, Miami, New York, San Juan, Santo Domingo.]
PLoS One 9, E112074 (2014)
Regional Dialects PLoS One 9, E112074 (2014)
Bilingualism
Global Language Network
[Figure: language networks for Twitter and Wikipedia, with nodes labelled by language name. Legend: Language Family (Afro-Asiatic, Altaic, Amerindian, Austronesian, Caucasian, Creoles & pidgins, Dravidian, Indo-European, Niger-Congo, Other, Sino-Tibetan, Tai, Uralic); Population (1 million, 10 million, 100 million, 1 billion); Link Weight and Color (t-statistic; co-occurrences of users, editors, translations for Twitter, Wikipedia and book translations).]
PNAS 111, E5616 (2014)
Global Language Network
[Figure: Wikipedia and Twitter language networks, with nodes labelled by language name. Legend: Language Family, Population, Link Weight and Color (t-statistic).]
PNAS 111, E5616 (2014)
Global Language Network
Book Translations
[Figure: book translation language network, with nodes labelled by language name.]
PNAS 111, E5616 (2014)
Global Language Network
numbers are 41% and 63%. In contrast, the correlation between the representation of
languages in Twitter and Book Translations is 0.63 (R² = 40%), and the correlation between
the strength of links is only 0.48 (R² = 23%). Finally, we note that, with respect to the book
translation dataset, the two digital datasets (Twitter and Wikipedia) are overexpressed in
languages associated with developing countries, like Malay, Filipino and Swahili. This
indicates that these digital media are more inclusive of the populations of developing
countries than written books.
PNAS 111, E5616 (2014)
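The R² values quoted in parentheses are the squares of the correlation coefficients, which is the usual relation for a simple linear fit. A short check, with a hypothetical helper name, reproduces the two percentages given in the excerpt:

```python
def r_squared_percent(r):
    """Square a correlation coefficient and express it as a rounded percentage."""
    return round(100 * r * r)


representation = r_squared_percent(0.63)  # 0.63^2 = 0.3969
link_strength = r_squared_percent(0.48)   # 0.48^2 = 0.2304
```

Only the two coefficients stated in the excerpt are used; the earlier pair of percentages (41% and 63%) belongs to a sentence that is cut off in the slide, so it is left out.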
Language and Fame
[Figure: six scatter plots of languages, labelled by three-letter codes. x-axes: log10(Twitter Eigenvector Centrality), log10(Wikipedia Eigenvector Centrality), log10(Book Trans. Eigenvector Cent.). y-axes: log10(Wikipedia 26+ famous people) for panels A to C, log10(HA famous people) for panels D to F. Legend: GDP per Capita ($0k to $50k); Number of speakers (400 M, 800 M, 1200 M). Fits: A, R² = 0.447; B, R² = 0.755; C, R² = 0.693; D, R² = 0.399; E, R² = 0.758; F, R² = 0.858; p-value < 0.001 in every panel.]
Fig. 3. The position of a language in the GLN and the global impact of its speakers. Top row shows the number of people per language (born 1800–1950)
with articles in at least 26 Wikipedia language editions as a function of their language’s eigenvector centrality in the (A) Twitter GLN, (B) Wikipedia GLN, and
(C) book translation GLN. The bottom row shows the number of people per language (born 1800–1950) listed in Human Accomplishment as a function of
their language’s eigenvector centrality in (D) Twitter GLN, (E) Wikipedia GLN, and (F) book translation GLN. Size represents the number of speakers for each
PNAS 111, E5616 (2014)
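Eigenvector centrality, the quantity on the horizontal axes of Fig. 3, can be approximated by power iteration. The sketch below is a generic illustration on a toy undirected graph and makes no claim about how the paper computed it: the function name, the dict-of-dicts input format and the node names are hypothetical. The iteration uses x + Ax rather than Ax so that it also converges on bipartite graphs such as the toy path used here.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-12):
    """Power iteration on (A + I) for a symmetric weighted adjacency dict.

    `adj[u][v]` is the weight of the link u-v; every node must appear as a key.
    Returns centralities normalised to unit Euclidean length.
    """
    nodes = list(adj)
    x = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: x[n] + sum(w * x[m] for m, w in adj[n].items()) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    return x


# Toy path graph: "hub" is linked to both "a" and "b", which are not linked.
toy = {
    "hub": {"a": 1.0, "b": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
}
centrality = eigenvector_centrality(toy)
```

In this toy graph the middle node scores higher than the two end nodes, by a factor of √2, which is the kind of hub position that the figure relates to the number of famous people per language.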
Predictions
Collective Attention
“Prediction is very difficult,
especially about the future.”
(Niels Bohr)
Even more so in Political Elections
http://truthy.indiana.edu
[Figure: eight panels, A to H. Labels: #ampat, @PeaceKaren_25, gopleader.gov, “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”]
Table 1: Features used in truthy classification.
Feature            Description
nodes              Number of nodes
edges              Number of edges
mean k             Mean degree
mean s             Mean strength
mean w             Mean edge weight in largest connected component
max k(i,o)         Maximum (in,out)-degree
max k(i,o) user    User with max. (in,out)-degree
max s(i,o)         Maximum (in,out)-strength
max s(i,o) user    User with max. (in,out)-strength
std k(i,o)         Std. dev. of (in,out)-degree
std s(i,o)         Std. dev. of (in,out)-strength
skew k(i,o)        Skew of (in,out)-degree distribution
skew s(i,o)        Skew of (in,out)-strength distribution
mean cc            Mean size of connected components
max cc             Size of largest connected component
entry nodes        Number of unique injections
num truthy         Number of times ‘truthy’ button was clicked
sentiment scores   Six GPOMS sentiment dimensions
graph. These include the number of nodes and edges in the
graph, the mean degree and strength of nodes in the graph,
mean edge weight, mean clustering coefficient across nodes
in the largest connected component, and the standard deviation
and skew of each network’s in-degree, out-degree and
strength distributions (see Fig. 2). Additionally we track the
out-degree and out-strength of the most prolific broadcaster,
as well as the in-degree and in-strength of the most focused-upon
user. We also monitor the number of unique injection
points of the meme, reasoning that organic memes (such as
those relating to news events) will be associated with larger
number of originating users.
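To make a few of the Table 1 features concrete, here is a minimal sketch in plain Python. It is not the Truthy implementation: the edge-list format `(source_user, target_user, weight)`, the function names, and the choice of "mean degree" as edges per node are all assumptions made for illustration, and only a subset of the listed features is computed.

```python
from collections import defaultdict
from statistics import mean, pstdev


def skew(values):
    """Third standardized moment; 0.0 when all values are equal."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return 0.0
    return mean([((v - m) / s) ** 3 for v in values])


def meme_network_features(edges):
    """edges: non-empty list of (source_user, target_user, weight) tuples (hypothetical format)."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    in_str, out_str = defaultdict(float), defaultdict(float)
    nodes = set()
    for src, dst, w in edges:
        nodes.update((src, dst))
        out_deg[src] += 1
        in_deg[dst] += 1
        out_str[src] += w
        in_str[dst] += w
    n = len(nodes)
    k_in = [in_deg.get(u, 0) for u in nodes]
    k_out = [out_deg.get(u, 0) for u in nodes]
    return {
        "nodes": n,
        "edges": len(edges),
        "mean_k": len(edges) / n,                    # assumption: edges per node
        "mean_s": sum(w for _, _, w in edges) / n,   # assumption: total weight per node
        "max_k_in": max(k_in),
        "max_k_in_user": max(nodes, key=lambda u: in_deg.get(u, 0)),
        "max_k_out": max(k_out),
        "max_s_in": max(in_str.values()),
        "std_k_in": pstdev(k_in),
        "std_k_out": pstdev(k_out),
        "skew_k_in": skew(k_in),
        "skew_k_out": skew(k_out),
    }


# Toy retweet/mention network with made-up users
edges = [("a", "b", 2.0), ("c", "b", 1.0), ("b", "d", 1.0)]
features = meme_network_features(edges)
```

In this toy network user `b` is the most focused-upon account (two incoming edges), which is the kind of signal the "max k(i,o) user" feature is meant to capture.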
4.4 Sentiment Analysis
We also utilize a modified version of the Google-based
Profile of Mood States (GPOMS) sentiment analysis
method (Bollen, Mao, and Pepe 2010) in the analysis of
meme-specific sentiment on Twitter. The GPOMS tool as-
Table 2: Performance of two classifiers with and without resampling training data to equalize class sizes. All results are averaged based on 10-fold cross-validation.
Classifier   Resampling?   Accuracy   AUC
AdaBoost     No            92.6%      0.91
AdaBoost     Yes           96.4%      0.99
SVM          No            88.3%      0.77
SVM          Yes           95.6%      0.95
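For readers unfamiliar with equalizing class sizes, the sketch below shows plain random oversampling of the minority class, the common practice the excerpt mentions. The authors' own scheme is cut off in the slide text, so this is not their method; the function name and the placeholder examples are hypothetical. In a cross-validation setting, resampling would normally be applied to the training folds only.

```python
import random
from collections import Counter


def oversample_to_balance(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((example, label) for example in items + extra)
    rng.shuffle(balanced)
    return balanced


# Class sizes quoted in the excerpt: 231 legitimate and 21 truthy memes.
labels = ["legitimate"] * 231 + ["truthy"] * 21
examples = list(range(len(labels)))  # placeholder feature vectors
balanced = oversample_to_balance(examples, labels)
class_counts = Counter(label for _, label in balanced)
```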
Table 3: Confusion matrices for a boosted decision stump classifier with and without resampling. The labels on the rows refer to true class assignments; the labels on the columns are those predicted.
                 No resampling             With resampling
True class       Truthy     Legitimate     Truthy      Legitimate
Truthy (T)       45 (12%)   16 (4%)        165 (45%)   6 (1%)
Legitimate (L)   11 (3%)    294 (80%)      7 (2%)      188 (51%)
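As a worked check, the confusion matrices in Table 3 can be turned into summary metrics. The sketch below treats "truthy" as the positive class (an assumption about which class matters most); the function name is illustrative. The resulting accuracies agree with the AdaBoost rows of Table 2.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall for the positive ('truthy') class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Rows are true classes, columns are predictions (Table 3).
no_resampling = metrics(tp=45, fn=16, fp=11, tn=294)
with_resampling = metrics(tp=165, fn=6, fp=7, tn=188)

print(round(100 * no_resampling["accuracy"], 1))    # 92.6
print(round(100 * with_resampling["accuracy"], 1))  # 96.4
```

Without resampling, recall on the truthy class is 45/61 (about 74%), so more than a quarter of the truthy memes are missed even though overall accuracy is above 92%.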
additional volunteers), and asking them to place each meme
in one of the three categories. A meme was to be classified as
‘truthy’ if a significant portion of the users involved in that
meme appeared to be spreading it in misleading ways,
e.g., if a number of the accounts tweeting about the meme
appeared to be robots or sock puppets, the accounts appeared
to follow only other propagators of the meme (clique behavior),
or the users engaged in repeated reply/retweet exclusively
with other users who had tweeted the meme. ‘Legitimate’
memes were described as memes representing normal
use of Twitter: several non-automated users conversing
about a topic. The final category, ‘remove,’ was used for
memes in a non-English language or otherwise unrelated to
U.S. politics (#youth, for example). These memes were
not used in the training or evaluation of classifiers.
Upon gathering 252 annotated memes, we observed an
imbalance in our labeled data (231 legitimate and only 21
truthy). Rather than simply resampling from the smaller
class, as is common practice in the case of class imbal-
Why not start with
something a bit simpler?
American Idol
• Popularity contest
• Well-defined audience, across the entire US
• Similar demographics voting and tweeting
• Weekly “votes”, involving the same population
• Immediate results
• (Almost) No incentives for organized campaigns
Calibration
[Figure: bar charts of % of Tweets (axis 0 to 70) per contestant for three calibration rounds. Top 5 labels: Jessica, Phillip, Joshua, Hollie, Skylar. Top 4 labels: Skylar, Jessica, Phillip, Joshua, Hollie. Top 3 labels: Skylar, Jessica, Phillip, Joshua.]
EPJ Data Science 1, 8 (2012)
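The calibration plots report each contestant's share of tweets. A minimal sketch of how such shares could be computed is given below; the name-matching rule (case-insensitive substring match) and the sample tweets are assumptions for illustration, not the procedure used in the cited paper.

```python
from collections import Counter


def tweet_shares(tweets, contestants):
    """Percentage of contestant mentions going to each contestant."""
    counts = Counter()
    for text in tweets:
        lowered = text.lower()
        for name in contestants:
            if name.lower() in lowered:
                counts[name] += 1
    total = sum(counts.values())
    return {name: (100.0 * counts[name] / total if total else 0.0)
            for name in contestants}


# Made-up tweets, for illustration only
tweets = ["Go Phillip!", "phillip was great tonight", "Jessica nailed it", "no names here"]
shares = tweet_shares(tweets, ["Jessica", "Phillip"])
ranking = sorted(shares, key=shares.get, reverse=True)
```

Ranking contestants by share gives a simple "vote" proxy that can be compared against the actual elimination each week.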
Geographic Locations
[Figure: maps with panels (A), (B), (C) for the Top 5, Top 4 and Top 3 rounds; legend: Jessica, Phillip, Joshua, Hollie, Skylar, CC.]
EPJ Data Science 1, 8 (2012)
An actual prediction EPJ Data Science 1, 8 (2012)
And the winner is...
[Figure: bar charts of % of Tweets (axis 0 to 80) for Jessica and Phillip, shown separately for World and U.S. tweets.]
EPJ Data Science 1, 8 (2012)
Stock Market
[Figure: methodology diagram. The Twitter feed goes through text analysis to produce daily mood indicators from (1) OpinionFinder and (2) G-POMS (6 dim.); daily stock market values come from (3) DJIA; after normalization, a Granger causality test over lags (F-statistic, p-value) and a SOFNN compare mood at t-1, t-2, t-3 with the DJIA value at t=0 (predicted value, MAPE, Direction %). Data sets and timeline: Feb 28, 2008 to Dec 20, 2008, covering (1) OF and GPOMS, (2) Granger causality analysis, (3) SOFNN training and test.]
Fig. 1. Diagram outlining 3 phases of methodology and corresponding data
sets: (1) creation and validation of OpinionFinder and GPOMS public mood
POMS
• Simple questionnaire that classifies a person’s mood along 6 dimensions:
• tension-anxiety
• depression-dejection
• anger-hostility
• fatigue-inertia
• vigor-activity
• confusion-bewilderment
• How to administer it to Twitter users?
• Expand vocabulary using Google n-grams
• Search Twitter for matching words
Profile of Mood States
Subject's Initials
Birth date
Date
Subject Code No.
Directions: Describe HOW YOU FEEL RIGHT NOW
by circling the most appropriate number after each of the words listed below:
FEELING | Not at all | A little | Moderate | Quite a bit | Extremely
1. Friendly 1 2 3 4 5
2. Tense 1 2 3 4 5
3. Angry 1 2 3 4 5
4. Worn Out 1 2 3 4 5
5. Unhappy 1 2 3 4 5
6. Clear-headed 1 2 3 4 5
7. Lively 1 2 3 4 5
8. Confused 1 2 3 4 5
9. Sorry for things done 1 2 3 4 5
10. Shaky 1 2 3 4 5
11. Listless 1 2 3 4 5
12. Peeved 1 2 3 4 5
13. Considerate 1 2 3 4 5
14. Sad 1 2 3 4 5
15. Active 1 2 3 4 5
16. On edge 1 2 3 4 5
17. Grouchy 1 2 3 4 5
18. Blue 1 2 3 4 5
19. Energetic 1 2 3 4 5
20. Panicky 1 2 3 4 5
21. Hopeless 1 2 3 4 5
22. Relaxed 1 2 3 4 5
23. Unworthy 1 2 3 4 5
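The "administer it to Twitter users" idea from the bullets above can be sketched as a word-matching pass over tweets. Everything below is an illustrative assumption: the tiny lexicon, the mapping of words to the six dimensions, the "expanded" terms standing in for vocabulary obtained from Google n-grams, and the function name. It is not the GPOMS tool.

```python
import re
from collections import Counter

# Hypothetical seed lexicon: questionnaire-style words mapped to a mood dimension.
LEXICON = {
    "tense": "tension-anxiety",
    "panicky": "tension-anxiety",
    "unhappy": "depression-dejection",
    "hopeless": "depression-dejection",
    "angry": "anger-hostility",
    "grouchy": "anger-hostility",
    "listless": "fatigue-inertia",
    "lively": "vigor-activity",
    "energetic": "vigor-activity",
    "confused": "confusion-bewilderment",
}

# Hypothetical expanded vocabulary (stand-in for terms found via n-gram co-occurrence).
EXPANDED = {"exhausted": "fatigue-inertia", "furious": "anger-hostility"}


def mood_counts(tweets, lexicon):
    """Count how many word matches fall on each mood dimension."""
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in lexicon:
                counts[lexicon[token]] += 1
    return counts


tweets = ["I feel tense and angry", "so lively today", "Angry again, and exhausted"]
counts = mood_counts(tweets, {**LEXICON, **EXPANDED})
```

Aggregating these counts per day would give one time series per mood dimension, which is the kind of input the later slides compare against the DJIA.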
Timelines along each mood dimension
counterpart to the differentiated response to the Presidential
election. On Thanksgiving day we find a spike in Happy
values, indicating high levels of public happiness. However,
no other mood dimensions are elevated on November 27.
Furthermore, the spike in Happy values is limited to the one
day, i.e. we find no significant mood response the day before
or after Thanksgiving.
[Figure: daily z-score time series, Oct 22 to Nov 26, for OpinionFinder and the six GPOMS dimensions (CALM, ALERT, SURE, VITAL, KIND, HAPPY); annotated events: pre-election anxiety, pre-election energy, election results, day after election, Thanksgiving happiness.]
Fig. 2. Tracking public mood states from tweets posted between October
2008 to December 2008 shows public responses to presidential election and
Thanksgiving.
[...] partially overlap with the mood values provided by
OpinionFinder, but not necessarily all mood dimensions that
[...] important in describing the various components of
[...] e.g. the varied mood response to the Presidential
[...] GPOMS thus provides a unique perspective on
[...] states not captured by uni-dimensional tools such
as OpinionFinder.
Granger Causality Analysis of Mood vs. DJIA
[...] establishing that our mood time series responds to
[...] socio-cultural events such as the Presidential election
[...] Thanksgiving, we are concerned with the question
[...]r variations of the public’s mood state correlate
[...] in the stock market, in particular DJIA closing
[...] answer this question, we apply the econometric
[...] Granger causality analysis to the daily time
[...]ed by GPOMS and OpinionFinder vs. the DJIA.
[...] causality analysis rests on the assumption that if a
[...] causes Y then changes in X will systematically
[...] changes in Y. We will thus find that the lagged
[...] will exhibit a statistically significant correlation
[...] Correlation however does not prove causation. We
[...] Granger causality analysis in a similar fashion
[...]re not testing actual causation but whether one
[...] has predictive information about the other or not⁷.
[...] time series, denoted Dt, is defined to reflect daily
[...] stock market value, i.e. its values are the delta
high level of confidence. However, this result only applies to
1 GPOMS mood dimension. We observe that X1 (i.e. Calm)
has the highest Granger causality relation with DJIA for lags
ranging from 2 to 6 days (p-values < 0.05). The other four
mood dimensions of GPOMS do not have significant causal
relations with changes in the stock market, and neither does
the OpinionFinder time series.
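The Granger test described above compares two regressions: one predicting a series from its own lags, and one that also includes lags of the candidate predictor. The sketch below computes the corresponding F-statistic in plain Python on synthetic data. It is a simplified illustration (ordinary least squares via normal equations, no p-value, made-up series and function names), not the analysis pipeline of the paper.

```python
import random


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def _rss(rows, targets):
    """Residual sum of squares of an OLS fit (rows already contain the intercept column)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(x, y, lag=1):
    """F-statistic for 'lagged x improves the prediction of y beyond y's own lags'."""
    times = range(lag, len(y))
    targets = [y[t] for t in times]
    restricted = [[1.0] + [y[t - i] for i in range(1, lag + 1)] for t in times]
    unrestricted = [row + [x[t - i] for i in range(1, lag + 1)]
                    for row, t in zip(restricted, times)]
    rss_r, rss_u = _rss(restricted, targets), _rss(unrestricted, targets)
    n, k = len(targets), 1 + 2 * lag
    return ((rss_r - rss_u) / lag) / (rss_u / (n - k))


# Synthetic example: y follows x with a one-day delay, plus noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.5 * rng.gauss(0, 1) for t in range(1, 200)]
f_xy = granger_f(x, y)  # x -> y: expected to be large
f_yx = granger_f(y, x)  # y -> x: expected to be small
```

A large F in one direction and a small one in the other is the pattern reported for the Calm dimension versus the DJIA; turning F into a p-value requires the F distribution, which is omitted here.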
To visualize the correlation between X1 and the DJIA in
more detail, we plot both time series in Fig. 3. To maintain
the same scale, we convert the DJIA delta values Dt and mood
index value Xt to z-scores as shown in Eq. 1.
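Eq. 1 is not reproduced in the slides, so the sketch below uses the textbook z-score with the mean and standard deviation of the whole series; whether the paper uses global or windowed statistics is left as an assumption. The closing values are made up.

```python
from statistics import mean, pstdev


def deltas(closing):
    """Day-to-day differences D_t = close_t - close_(t-1)."""
    return [b - a for a, b in zip(closing, closing[1:])]


def zscores(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s for v in series]


closing = [100.0, 102.0, 101.0, 105.0]  # hypothetical closing values
d = deltas(closing)
z = zscores(d)
```

Putting both the DJIA deltas and the mood index on this common scale is what allows the two curves to be overlaid in a single plot.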
[Figure: DJIA z-score and Calm z-score time series, Aug 09 to Oct 28, on axes from -2 to 2; annotated event: bank bail-out.]
Fig. 3. A panel of three graphs. The top graph shows the overlap of the
day-to-day difference of DJIA values (blue: ZDt ) with the GPOMS’ Calm
Look for correlations between
dimensions and DJIA
Twitter mood predicts the stock market.
Johan Bollen¹*, Huina Mao¹*, Xiao-Jun Zeng².
*: authors made equal contributions.
Abstract—Behavioral economics tells us that emotions can
profoundly affect individual behavior and decision-making. Does
this also apply to societies at large, i.e. can societies experience
mood states that affect their collective decision making? By
extension is the public mood correlated or even predictive of
economic indicators? Here we investigate whether measurements
of collective mood states derived from large-scale Twitter feeds
are correlated to the value of the Dow Jones Industrial Average
(DJIA) over time. We analyze the text content of daily Twitter
feeds by two mood tracking tools, namely OpinionFinder that
measures positive vs. negative mood and Google-Profile of Mood
States (GPOMS) that measures mood in terms of 6 dimensions
(Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate
the resulting mood time series by comparing their ability to
detect the public’s response to the presidential election and
Thanksgiving day in 2008. A Granger causality analysis and
a Self-Organizing Fuzzy Neural Network are then used to
investigate the hypothesis that public mood states, as measured by
the OpinionFinder and GPOMS mood time series, are predictive
of changes in DJIA closing values. Our results indicate that the
accuracy of DJIA predictions can be significantly improved by
the inclusion of specific public mood dimensions but not others.
We find an accuracy of 87.6% in predicting the daily up and
down changes in the closing values of the DJIA and a reduction
of the Mean Average Percentage Error by more than 6%.
Index Terms—stock market prediction — twitter — mood
analysis.
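The abstract reports two evaluation numbers: the accuracy of predicted up/down moves and a percentage error. A minimal sketch of both is given below, using the usual mean absolute percentage error for MAPE and one plausible definition of direction accuracy (comparing each prediction with the previous actual close); both definitions and the sample values are assumptions, since the excerpt does not spell them out.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def direction_accuracy(actual, predicted):
    """Percentage of days where the predicted up/down move matches the actual one."""
    hits = 0
    for t in range(1, len(actual)):
        actual_up = actual[t] > actual[t - 1]
        predicted_up = predicted[t] > actual[t - 1]
        hits += actual_up == predicted_up
    return 100.0 * hits / (len(actual) - 1)


actual = [100.0, 102.0, 101.0, 104.0]     # hypothetical closing values
predicted = [100.0, 101.0, 103.0, 103.0]  # hypothetical model output
error = mape(actual, predicted)
direction = direction_accuracy(actual, predicted)
```

The two metrics can disagree: a model can get the direction right on most days while still being off on the magnitude, which is why both are reported.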
I. INTRODUCTION
STOCK market prediction has attracted much attention
from academia as well as business. But can the stock
market really be predicted? Early research on stock market
prediction [1], [2], [3] was based on random walk theory
and the Efficient Market Hypothesis (EMH) [4]. According
to the EMH stock market prices are largely driven by new
information, i.e. news, rather than present and past prices.
Since news is unpredictable, stock market prices will follow a
random walk pattern and cannot be predicted with more than
50 percent accuracy [5].
There are two problems with EMH. First, numerous studies
show that stock market prices do not follow a random walk
and can indeed to some degree be predicted [5], [6], [7], [8]
thereby calling into question EMH’s basic assumptions. Second,
recent research suggests that news may be unpredictable
but that very early indicators can be extracted from online
social media (blogs, Twitter feeds, etc.) to predict changes
in various economic and commercial indicators. This may
conceivably also be the case for the stock market. For example,
[11] shows how online chat activity predicts book sales. [12]
uses assessments of blog sentiment to predict movie sales.
[...] sentiment from blogs. In addition, Google search queries have
been shown to provide early indicators of disease infection
rates and consumer spending [14]. [9] investigates the relations
between breaking financial news and stock price changes.
Most recently [13] provide a ground-breaking demonstration
of how public sentiment related to movies, as expressed on
Twitter, can actually predict box office receipts.
Although news most certainly influences stock market
prices, public mood states or sentiment may play an equally
important role. We know from psychological research that
emotions, in addition to information, play a significant role
in human decision-making [16], [18], [39]. Behavioral finance
has provided further proof that financial decisions are significantly
driven by emotion and mood [19]. It is therefore
reasonable to assume that the public mood and sentiment can
drive stock market values as much as news. This is supported
by recent research by [10] who extract an indicator of public
anxiety from LiveJournal posts and investigate whether its
variations can predict S&P 500 values.
However, if it is our goal to study how public mood
influences the stock markets, we need reliable, scalable and
early assessments of the public mood at a time-scale and
resolution appropriate for practical stock market prediction.
Large surveys of public mood over representative samples of
the population are generally expensive and time-consuming
to conduct, cf. Gallup’s opinion polls and various consumer
and well-being indices. Some have therefore proposed indirect
assessment of public mood or sentiment from the results of
soccer games [20] and from weather conditions [21]. The
accuracy of these methods is however limited by the low
degree to which the chosen indicators are expected to be
correlated with public mood.
Over the past 5 years significant progress has been made
in sentiment tracking techniques that extract indicators of
public mood directly from social media content such as blog
content [10], [12], [15], [17] and in particular large-scale
Twitter feeds [22]. Although each so-called tweet, i.e. an
individual user post, is limited to only 140 characters, the
aggregate of millions of tweets submitted to Twitter at any
given time may provide an accurate representation of public
mood and sentiment. This has led to the development of real-time
sentiment-tracking indicators such as [17] and “Pulse of
Nation”¹.
In this paper we investigate whether public sentiment, as
expressed in large-scale collections of daily Twitter posts, can
be used to predict the stock market. We use two tools to
measure variations in the public mood from tweets submitted
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010
And it works!
And it works! (Maybe!)
Coming Soon! CompleNet 2016
Dijon, France — March 23-25

Contenu connexe

Tendances

Machine Classification and Analysis of Suicide-Related Communication on Twitter
Machine Classification and Analysis of Suicide-Related Communication on Twitter Machine Classification and Analysis of Suicide-Related Communication on Twitter
Machine Classification and Analysis of Suicide-Related Communication on Twitter Pete Burnap
 
Strategize your World Cup Marketing Campaign using Tweets
Strategize your World Cup Marketing Campaign using TweetsStrategize your World Cup Marketing Campaign using Tweets
Strategize your World Cup Marketing Campaign using TweetsKeRoxiLi
 
Feat. Gerbaudo Class (Data and General Election in the UK)
Feat. Gerbaudo Class (Data and General Election in the UK)Feat. Gerbaudo Class (Data and General Election in the UK)
Feat. Gerbaudo Class (Data and General Election in the UK)fabiomalini
 
Does Google still need links? - SearchLove San Diego 2017
Does Google still need links? - SearchLove San Diego 2017Does Google still need links? - SearchLove San Diego 2017
Does Google still need links? - SearchLove San Diego 2017Tom Capper
 
Apache spark workshop
Apache spark workshopApache spark workshop
Apache spark workshopPawel Szulc
 
Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]Apache Spark 101 [in 50 min]
Apache Spark 101 [in 50 min]Pawel Szulc
 
Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2The Night's Watch
 


Twitterology - The Science of Twitter

  • 9. Anatomy of a Tweet
    Keys of a tweet object: [u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']
    Keys of the embedded user object: [u'follow_request_sent', u'profile_use_background_image', u'default_profile_image', u'id', u'profile_background_image_url_https', u'verified', u'profile_text_color', u'profile_image_url_https', u'profile_sidebar_fill_color', u'entities', u'followers_count', u'profile_sidebar_border_color', u'id_str', u'profile_background_color', u'listed_count', u'is_translation_enabled', u'utc_offset', u'statuses_count', u'description', u'friends_count', u'location', u'profile_link_color', u'profile_image_url', u'following', u'geo_enabled', u'profile_banner_url', u'profile_background_image_url', u'screen_name', u'lang', u'profile_background_tile', u'favourites_count', u'name', u'notifications', u'url', u'created_at', u'contributors_enabled', u'time_zone', u'protected', u'default_profile', u'is_translator']
    http://www.bgoncalves.com/teaching/data-mining.html
  • 10. Anatomy of a Tweet
    Keys of a tweet object: [u'contributors', u'truncated', u'text', u'in_reply_to_status_id', u'id', u'favorite_count', u'source', u'retweeted', u'coordinates', u'entities', u'in_reply_to_screen_name', u'in_reply_to_user_id', u'retweet_count', u'id_str', u'favorited', u'user', u'geo', u'in_reply_to_user_id_str', u'possibly_sensitive', u'lang', u'created_at', u'in_reply_to_status_id_str', u'place', u'metadata']
    Keys of the coordinates (or geo) field: [u'type', u'coordinates']
    Keys of the entities field: [u'symbols', u'user_mentions', u'hashtags', u'urls']
    Value of source: u'<a href="http://foursquare.com" rel="nofollow">foursquare</a>'
    Value of text: u"I'm at Terminal Rodovi\xe1rio de Feira de Santana (Feira de Santana, BA) http://t.co/WirvdHwYMq"
    Entry in entities[u'urls']: {u'display_url': u'4sq.com/1k5MeYF', u'expanded_url': u'http://4sq.com/1k5MeYF', u'indices': [70, 92], u'url': u'http://t.co/WirvdHwYMq'}
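To make the nesting of these fields concrete, here is a minimal Python 3 sketch. It is only an illustration: the dictionary is hand-built from the values shown on the slide, and the helper names `expanded_urls` and `url_spans` are assumptions made for this example, not part of any Twitter library.

```python
# Hand-built tweet containing only the fields whose values appear on the slide.
# Everything else a real tweet carries is omitted here.
tweet = {
    "text": "I'm at Terminal Rodovi\u00e1rio de Feira de Santana "
            "(Feira de Santana, BA) http://t.co/WirvdHwYMq",
    "source": '<a href="http://foursquare.com" rel="nofollow">foursquare</a>',
    "entities": {
        "symbols": [],
        "user_mentions": [],
        "hashtags": [],
        "urls": [{
            "display_url": "4sq.com/1k5MeYF",
            "expanded_url": "http://4sq.com/1k5MeYF",
            "indices": [70, 92],
            "url": "http://t.co/WirvdHwYMq",
        }],
    },
}


def expanded_urls(tweet):
    """Return the expanded form of every URL listed in the tweet's entities."""
    return [u["expanded_url"] for u in tweet.get("entities", {}).get("urls", [])]


def url_spans(tweet):
    """Return the slices of the text that each URL entity's indices point at."""
    text = tweet["text"]
    return [text[u["indices"][0]:u["indices"][1]]
            for u in tweet.get("entities", {}).get("urls", [])]


print(expanded_urls(tweet))
print(url_spans(tweet))
```

The `indices` pair marks where the shortened link sits inside `text`, so slicing the text with it should give back the `url` value.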
  • 15. Demographics
    Gender: "… users who we could infer a gender for, based on their name and the list previously described. We do so by comparing the first word of their self-reported name to the gender list. We observe that there exists a match for 64.2% of the users. Moreover, we find a strong bias towards male users: fully 71.8% of the users who we find a name match for had a male name."
    Figure 3 (y-axis: fraction of joining users who are male; x-axis: date, 2007-01 to 2009-07): Gender of joining users over time, binned into groups of 10,000 joining users (note that the join rate increases substantially). The bias towards male users is observed to be decreasing over time.
    "… less than 95% predictive (e.g., the name Avery was observed to correspond to male babies only 56.8% of the time; it was …"
    Race/ethnicity: "… each last name with over 100 individuals in the U.S. during the 2000 Census, the Census releases the distribution of race/ethnicity for that last name. For example, the last name "Myers" was observed to correspond to Caucasians 86% of the time, African-Americans 9.7%, Asians 0.4%, and Hispanics 1.4%."
    Race/ethnicity distribution of Twitter users: "We first determined the number of U.S.-based users for whom we could infer the race/ethnicity by comparing the last word of their self-reported name to the U.S. Census last name list. We observed that we found a match for 71.8% of the users. We then determined the distribution of race/ethnicity in each county by taking the race/ethnicity distribution in the Census list, weighted by the frequency of each name occurring in Twitter users in that county. Due to the large amount of ambiguity in the last name-race/ethnicity list (in particular, the last name list is more than 95% predictive for only 18.5% of the users), we are unable to directly compare the Twitter race/ethnicity distribution …"
    Footnote 1: This is effectively the census.model approach discussed in prior work (Chang et al. 2010).
    Figure 2 (panel (a): normal representation): Per-county over- and underrepresentation of U.S. po… …tation rate of 0.324%, presented in both (a) a normal layout an… Blue colors indicate underrepresentation, while red colors repre… to the log of the over- or underrepresentation rate. Clear trend… overrepresentation of populous counties.
    Figure 4 (color scale from undersampling to oversampling; panels: (a) Caucasian (non-hispanic), (b) African-American, (c) Asian or Pacific Islander, (d) Hispanic): Per-county area cartograms of Twitter over- and undersampling rates of Caucasian, African-American, Asian, and Hispanic users, relative to the 2000 U.S. Census. Only counties with more than 500 Twitter users with inferred race/ethnicity are shown. Blue regions correspond to undersampling; red regions to oversampling.
    ICWSM'11, 375 (2011)
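The two name-matching steps quoted on this slide can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors' code: the tiny `GENDER_BY_FIRST_NAME` list, the sample names, and all function names are made up for the example. Only the "Myers" shares come from the excerpt.

```python
# Hypothetical first-name -> gender list; the study used a far larger one.
GENDER_BY_FIRST_NAME = {"maria": "F", "john": "M", "bruno": "M", "anna": "F"}

# Last-name -> race/ethnicity shares. The "Myers" row uses the numbers quoted
# in the excerpt; a real Census list would hold many more names.
CENSUS_BY_LAST_NAME = {
    "myers": {"caucasian": 0.86, "african_american": 0.097,
              "asian": 0.004, "hispanic": 0.014},
}


def infer_gender(self_reported_name, gender_list=GENDER_BY_FIRST_NAME):
    """Match the first word of a self-reported name against the gender list."""
    words = self_reported_name.strip().split()
    if not words:
        return None
    return gender_list.get(words[0].lower())


def gender_summary(names):
    """Return (share of names with a match, share of matches that are male)."""
    matched = [g for g in (infer_gender(n) for n in names) if g is not None]
    match_rate = len(matched) / len(names) if names else 0.0
    male_share = matched.count("M") / len(matched) if matched else 0.0
    return match_rate, male_share


def county_distribution(last_names, census=CENSUS_BY_LAST_NAME):
    """Average the Census shares over a county's users.

    Each occurrence of a name adds its shares once, so names are weighted
    by how often they occur among the county's users.
    """
    totals, matched = {}, 0
    for name in last_names:
        shares = census.get(name.lower())
        if shares is None:
            continue  # no Census match for this user
        matched += 1
        for group, share in shares.items():
            totals[group] = totals.get(group, 0.0) + share
    return {g: t / matched for g, t in totals.items()} if matched else {}


names = ["John Smith", "maria", "xX_gamer_Xx", "Bruno G", ""]
match_rate, male_share = gender_summary(names)
county = county_distribution(["Myers", "Unknown", "myers"])
```

Unmatched users are simply skipped, which is one reason the resulting distributions describe only the matched fraction of users.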
  • 17. Twitter Network
    "… compared against each other. Before we delve into the eccentricities and peculiarities of Twitter, we run a batch of well-known analysis and present the summary."
    Basic Analysis (Figure 1: Number of followings and followers): "… construct a directed network based on the following and followed and analyze its basic characteristics. Figure 1 displays the distribution of the number of followings as the solid line and that of followers as the dotted line. The y-axis represents complementary cumulative distribution function (CCDF). We first explain the dis…"
    "… TIME). The top 20 are listed in Figure 7. Some of them follow their followers, but most of them do not (the median number of followings of the top 40 users is 114, three orders of magnitude smaller than the number of followers). We revisit the issue of reciprocity in Section 3.3."
    3.2 Followers vs. Tweets (Figure 2: The number of followers and that of tweets per user): "In order to gauge the correlation between the number of followers and that of written tweets, we plot the number of tweets (y) against the number of followers a user has (x) in Figure 2. We bin the number of followers in logscale and plot the median per bin in the dashed line. The majority of users who have fewer than 10 followers never tweeted or did just once and thus the median stays at 1. The average number of tweets against the number of followers per … there are outliers … of followers. … x = 100 to … but only state the correlation between the numbers of tweets and followers."
    3.3 Reciprocity: "In Section 3.1 we briefly mention that top users by the number of followers in Twitter are mostly celebrities and mass media and most of them do not follow their followers back. In fact Twitter shows a low level of reciprocity; 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have reciprocal relationship between them. We call those r-friends of a user as they reciprocate a user's following. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [4] and 84% on Yahoo! 360 [18]. Moreover, 67.6% of users are not followed by any of their followings in Twitter. We conjecture that for these users Twitter is rather a source of information than a social networking site. Further validation is out of the scope of this paper and we leave it for future work."
    3.4 Degree of Separation
    WWW'10, 591 (2010)
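The reciprocity figures quoted above are shares of connected user pairs. A minimal sketch of that calculation, assuming the follow graph is given as a list of hypothetical `(follower, followed)` pairs, might look like this (the function name and the toy edges are illustrative only):

```python
def reciprocity(edges):
    """Return (one-way share, reciprocal share) over connected user pairs.

    edges: iterable of (follower, followed) pairs. Self-follows are ignored.
    """
    directed = {(a, b) for a, b in edges if a != b}
    pairs = {frozenset(edge) for edge in directed}  # unordered connected pairs
    if not pairs:
        return 0.0, 0.0
    reciprocal = 0
    for pair in pairs:
        a, b = tuple(pair)
        if (a, b) in directed and (b, a) in directed:
            reciprocal += 1
    one_way = len(pairs) - reciprocal
    return one_way / len(pairs), reciprocal / len(pairs)


# Toy graph: a and b follow each other, a follows c, d follows a.
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
one_way_share, reciprocal_share = reciprocity(toy_edges)
```

On the toy graph two of the three connected pairs are one-way, so the shares are 2/3 and 1/3.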
  • 18. Retweet Trees
    …ce Size of Retweet
    Figure caption fragment: …age and median numbers of additional recipients via retweeting
    "… to mass media in various forms: radio, TV, and … are immediate recipients and consumers of the …hed media produce. On Twitter people acquire … always directly from those they follow, but often … Assuming a tweet posted by a user is viewed and … of the user's followers, we count the number of … recipients who are not immediate followers of the original … Figure 14 displays its average and median per … number of followers of the original tweet user. … almost always below the average, indicating that … a very large number of additional recipients. Up … followers, the average number of additional recipients … by the number of followers of the tweet source."
    WWW'10, 591 (2010); WWW 2010, April 26-30, Raleigh, NC, USA
  • 19. Retweet Trees
    Figure 15: Retweet trees of 'air france flight' tweets
    Figure 16: Height and participating users in retweet trees
    "… retweeting the same tweet, and cross-retweet is retweeting each other. In Figure 16 we plot the CCDFs of the retweet tree heights and the number of users in a retweet tree. The height of 1 is the most …"
    6. IMPACT OF RETWEET: "We have seen how trending topics rise in popularity and eventually die in Section 5. Then how exactly does information spread on Twitter? Retweet is an effective means to relay the information beyond adjacent neighbors. We dig into the retweet trees constructed per trending topic and examine key factors that impact the eventual spread of information."
    6.1 Audience Size of Retweet
    WWW'10, 591 (2010)
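The quantities discussed on the two retweet slides (tree height, users per tree, and recipients beyond the original author's followers) can be sketched as follows. The data layout is an assumption for illustration: a retweet tree is a hypothetical `parent_of` mapping from each retweeter to the user they retweeted, and `followers` maps a user to a set of follower names. The sketch assumes the mapping really is a tree rooted at the original author.

```python
def tree_height(parent_of, root):
    """Longest retweet chain below the root (a single retweet gives height 1)."""
    def depth(user):
        steps = 0
        while user != root:
            user = parent_of[user]  # assumes every chain ends at the root
            steps += 1
        return steps
    return max((depth(user) for user in parent_of), default=0)


def users_in_tree(parent_of):
    """Number of participating users: all retweeters plus the original author."""
    return len(parent_of) + 1


def additional_recipients(source, retweeters, followers):
    """Users reached through retweets who do not follow the source directly."""
    direct = followers.get(source, set())
    reached = set()
    for user in retweeters:
        reached |= followers.get(user, set())
    return reached - direct - {source}


# Toy tree: b and c retweet a; d retweets b's retweet.
parent_of = {"b": "a", "c": "a", "d": "b"}
followers = {"a": {"b", "c"}, "b": {"d", "e"}, "d": {"f"}}

height = tree_height(parent_of, "a")
size = users_in_tree(parent_of)
extra = additional_recipients("a", parent_of.keys(), followers)
```

As in the excerpt, this treats every follower of a retweeter as a recipient, which is an upper bound on who actually saw the tweet.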
  • 23. Friends Talk to Each Other PLoS One 6, E22656 (2011)
  • 24. Friends Talk to Each Other PLoS One 6, E22656 (2011)
  • 25. Online Conversations
    1.7 Million users, 370 Million messages
    Panel A) Average weight per connection, $\omega_i^{\mathrm{out}} = \frac{\sum_j \omega_{ij}}{k_i^{\mathrm{out}}}$ (axes: $\omega^{\mathrm{out}}$ against $k^{\mathrm{out}}$): Number of connections for which interaction strength is highest
    Panel B) Reciprocated connections (axes: $\rho$ against $k^{\mathrm{in}}$): Saturation of the number of reciprocated connections
    PLoS One 6, E22656 (2011)
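The "Average Weight per Connection" formula on this slide, read here as the total weight a user sends out divided by their number of outgoing connections, can be computed from a weighted edge list. The sketch below assumes hypothetical `(i, j, w_ij)` triples, where `w_ij` would be something like the number of messages from `i` to `j`; the function name is made up for the example.

```python
from collections import defaultdict


def average_out_weight(weighted_edges):
    """Return {i: sum_j w_ij / k_out_i} for every user with outgoing edges.

    weighted_edges: iterable of (i, j, w_ij) triples. Repeated (i, j) pairs
    are summed, and k_out_i counts distinct out-neighbours of i.
    """
    out_strength = defaultdict(float)
    out_neighbours = defaultdict(set)
    for i, j, w in weighted_edges:
        out_strength[i] += w
        out_neighbours[i].add(j)
    return {i: out_strength[i] / len(out_neighbours[i]) for i in out_strength}


# Toy data: u sent 3 messages to v and 1 to w; v sent 2 messages to u.
toy_messages = [("u", "v", 3), ("u", "w", 1), ("v", "u", 2)]
omega_out = average_out_weight(toy_messages)
```

Plotting the mean of these values for users grouped by out-degree would give a curve of the kind panel A describes.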
  • 26. The Strength of Ties
    Figure caption fragment (PHYSICAL REVIEW E 76, 066106 (2007)): "… two possible cases in networks with …: (a) positively correlated nets and (b) … width of the line of the links represents …"
    Node statistics shown for nodes A, C, B: (k_in = 1, k_out = 2, s_in = 1, s_out = 2); (k_in = 2, k_out = 1, s_in = 3, s_out = 1); (k_in = 1, k_out = 1, s_in = 1, s_out = 2)
    "Figure 2: Example of a meme diffusion network involving three users mentioning and retweeting each other. The values of various node statistics are shown next to each node. The strength s refers to weighted degree, k stands for degree. Observing a retweet at node B provides implicit confirmation that information from A appeared in B's Twitter feed, while a mention of B originating at node A explicitly confirms that A's message appeared in B's Twitter feed. This may or may not be noticed by B, therefore mention edges are less reliable indicators of information flow compared to retweet edges."
    "Retweet and reply/mention information parsed from the text can be ambiguous, as in the case when a tweet is marked as being a 'retweet' of multiple people. Rather, we rely on Twitter metadata, which designates users replied to or retweeted by each message. Thus, while the text of a tweet may contain several mentions, we only draw an edge to the user explicitly designated as the mentioned user by the metadata. In so doing, we may miss retweets that do not use the explicit retweet feature and thus are not captured in the metadata. Note that this is separate from our use of mentions as memes (§ 3.1), which we parse from the text of the tweet."
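The node statistics in the Figure 2 excerpt (degree k and strength s, in and out) follow directly from a weighted, directed edge list. In the sketch below the edge list is hypothetical: it was chosen so that the output reproduces the three sets of values listed on the slide, assuming they belong to nodes A, C and B in that order. The actual edges of the figure are not recoverable from the transcript.

```python
from collections import defaultdict


def node_statistics(weighted_edges):
    """Compute in/out degree (k) and in/out strength (s) for each node.

    weighted_edges: iterable of (source, target, weight), with each
    (source, target) pair listed once.
    """
    stats = defaultdict(lambda: {"k_in": 0, "k_out": 0, "s_in": 0, "s_out": 0})
    for source, target, weight in weighted_edges:
        stats[source]["k_out"] += 1       # one more outgoing connection
        stats[source]["s_out"] += weight  # weighted out-degree
        stats[target]["k_in"] += 1        # one more incoming connection
        stats[target]["s_in"] += weight   # weighted in-degree
    return dict(stats)


# Hypothetical edges, picked only to match the values on the slide.
figure_edges = [("A", "C", 1), ("A", "B", 1), ("B", "C", 2), ("C", "A", 1)]
figure_stats = node_statistics(figure_edges)
```

Strength differs from degree only where an edge carries a weight above 1, here the edge from B to C.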
  • 27. The Strength of Weak Ties (1973)
    • Interviews to find out how individuals found out about job opportunities.
    • Mostly from acquaintances or friends of friends.
    • "It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another."
  • 28. The Strength of Weak Ties (1973)
    • Interviews to find out how individuals found out about job opportunities.
    • Mostly from acquaintances or friends of friends.
    • "It is argued that the degree of overlap of two individuals' social networks varies directly with the strength of their tie to one another."
    Excerpt: "… indicates a particular egocentric network evolution. In order to quantify it, we measure the probability, p(n), that the next communication event of an agent having n social ties will occur via the establishment of a new (n + 1)th link. We calculate these probabilities in the MPC dataset averaging them for users with the same degree k at the end of the observation time. We therefore …"
    Figure caption fragments: "… communication network … the calls among them links. … Panels (a) and (b) show calls within 3 hours between people in the same town in two different time windows. … network structure, which was recorded by aggregating interactions during 6 months. Node size and colors … width and color represent weight."
  • 29. www.bgoncalves.com@bgoncalves Network Structure The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
“People whose networks bridge the structural holes between groups have an advantage in detecting and developing rewarding opportunities. Information arbitrage is their advantage. They are able to see early, see more broadly, and translate information across groups.”
[Background paper page] Structural Holes and Good Ideas. Ronald S. Burt, University of Chicago. AJS Volume 110 Number 2 (September 2004): 349–99.
This article outlines the mechanism by which brokerage provides social capital. Opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with alternative ways of thinking and behaving. Brokerage across the structural holes between groups provides a vision of options otherwise unseen, which is the mechanism by which brokerage becomes social capital. I review evidence consistent with the hypothesis, then look at the networks around managers in a large American electronics company. The organization is rife with structural holes, and brokerage has its expected correlates. Compensation, positive performance evaluations, promotions, and good ideas are disproportionately in the hands of people whose networks span structural holes. The between-group brokers are more likely to express ideas, less likely to have ideas dismissed, and more likely to have ideas evaluated as valuable. I close with implications for creativity and structural change.
The hypothesis in this article is that people who stand near the holes in a social structure are at higher risk of having good ideas. The argument is that opinion and behavior are more homogeneous within than between groups, so people connected across groups are more familiar with [...]
  • 30. www.bgoncalves.com@bgoncalves Network Structure The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
[Background paper excerpt] [...] to Granovetter's expectation that the stronger the tie is, the higher the number of mutual contacts of both parties it has and the higher the [...] belong to the same group.
[...] between which they take place should be small according to Granovetter's theory. The results show that the links most likely to attract retweets are those connecting groups that are neither too close nor too far. This can be explained with Aral's theory about the trade-off between diversity and bandwidth: if the two groups are too close there is not enough diversity in the information, while if the groups are too far the communication is poor. These trends are not dependent on the size of the considered groups (see Figs. S6–S14 and Table S1 in the Supplementary Information).
Figure 2. Group and link statistics. (A) Size distribution of the groups. (B) Distribution of the number of groups to which each user is assigned. (C) Percentage of links of different types, e.g. follower links (black bars), links with mentions (red bars) or retweets (green bars), staying in particular topological localizations with respect to detected groups. doi:10.1371/journal.pone.0029358.g002
  • 31. www.bgoncalves.com@bgoncalves Groups The Strength of Intermediary Ties in Social Media, PLoS One 7, e29358 (2012)
Figure 4. Group-group activity. (A) Distribution of the number of links in the follower network between groups as a function of the [...] groups. (B) Fractions f of links of the different types (follower, with mentions and with retweets) as a function of the size of the group [...]
[Background paper excerpt] 2.4 Links between groups
The next question to consider is the characteristics of links between groups. These links occur mainly between groups containing less than 200 users (Figure 4A–C). However, their frequency depends on the quality of the links (if they bear mentions or retweets). While links with mentions are less abundant than the baseline, those with retweets are slightly more abundant. According to the strength of weak ties theory [12,14–16], weak links are typically connections between persons no[t ...] neighbors, being important to keep the network conn[ected and] for information diffusion. We investigate whether [...] between groups play a similar role in the online n[etwork ...] information transmitters. The actions more related to information diffusion are retweets [24], which show a slight preference for occurring on between-group links (Figures 4B and [...]). This preference is enhanced when the similarity between [...] groups is taken into account. We define the similarity between two groups, A and B, in terms of the Jaccard index [...] connections:
$$\mathrm{similarity}(A, B) = \frac{|\,\text{links of } A \cap \text{links of } B\,|}{|\,\text{links of } A \cup \text{links of } B\,|}$$
The similarity is the overlap between the groups' connections [...] it estimates network proximity of the groups. The general trend is that links with mentions more likely occur between close groups and retweets occur between groups with medium [...] (Figure 4D). Mentions as personal messages are [...] exchanged between users with similar environments [...] predicted by the strength of weak ties theory. Links with [...] are related to information transfer and the similarity of t[...]
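The group similarity on this slide is a Jaccard index: the size of the intersection of the two groups' connections divided by the size of their union. A minimal sketch follows, assuming each group's "links" can be represented as a set of hashable identifiers (for example, the IDs of the users the group connects to). That representation and the function name are assumptions for illustration, not the paper's code.

```python
def group_similarity(links_a, links_b):
    """Jaccard index of two collections of links: |A ∩ B| / |A ∪ B|."""
    links_a, links_b = set(links_a), set(links_b)
    union = links_a | links_b
    if not union:
        # Two groups with no links at all: define the similarity as 0.
        return 0.0
    return len(links_a & links_b) / len(union)


# Two hypothetical groups sharing two of four distinct connections.
example_similarity = group_similarity({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

The value ranges from 0 (no shared connections, distant groups) to 1 (identical connections), which is what allows groups to be ordered from "too far" to "too close".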
  • 33. www.bgoncalves.com@bgoncalves Twitter follower distance [Figure: [...] of physical distances between egos and alters. The graph shows the number of ties by distance, in 200 km bins (for example, New [...] towards the 5400 km bin). The total number of ties in each of the two simulations is the same as in the observed data. Based on [...]] Y. Takhteyev et al., Social Networks 34, 73–81 (2012)
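Counting ties by distance in 200 km bins, as in the figure described above, needs a great-circle distance between the ego and alter coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. Both choices are assumptions, since the slide does not say how distances were computed, and the function names are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def ties_by_distance(pairs, bin_km=200):
    """Count ego-alter pairs per distance bin; keys are the lower bin edges in km."""
    counts = Counter()
    for (lat1, lon1), (lat2, lon2) in pairs:
        distance = haversine_km(lat1, lon1, lat2, lon2)
        counts[int(distance // bin_km) * bin_km] += 1
    return dict(counts)


# Two hypothetical ties: one between co-located users, one about 222 km long.
example_bins = ties_by_distance([((0, 0), (0, 0)), ((0, 0), (0, 2))])
```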
  • 34. www.bgoncalves.com@bgoncalves Locality Y. Takhteyev et al., Social Networks 34, 73–81 (2012)

Table 5. Top countries.
Columns: (1) share of egos (%) [a]; (2) share of egos (%) for egos in dyads [b]; (3) share of alters (%) [c]; (4) percentage of domestic ties [d]; (5) percentage of domestic ties among non-local ties [d]; (6) following foreign alters / being followed from abroad; (7) country named explicitly (% of egos).
Country       (1)   (2)   (3)   (4)   (5)   (6)   (7)
USA          48.5  45.7  54.5  91.6  89.3   0.3   8.1
Brazil       10.6  12.1  10.5  83.5  72.5   4.9  55.4
UK            7.6   8.3   7.6  50.6  33.3   1.2  45.3
Japan         5.5   6.5   6.3  92.1  86.0   1.4  25.0
Canada        3.7   3.8   2.9  33.3  23.1   1.6  58.5
Australia     2.7   2.7   1.9  50.0  32.0   2.2  69.7
Indonesia     2.6   1.8   1.2  60.0  25.0   7.0  83.3
Germany       2.1   1.8   1.3  62.9  58.8   3.2  58.6
Netherlands   1.4   1.4   1.2  66.7  22.2   1.5  54.3
Mexico        1.2   1.3   0.7  44.0   8.3   7.0  56.7
[a] Out of the 2852 egos located at the level of country or better.
[b] Out of the egos included in 1953 dyads with both parties located at the level of country or better.
[c] Out of the 1953 alters located at the level of country or better.
[d] The number of ties with the ego and the alter in the given country as a share of all ties for egos in that country.

Table 6. The most common languages. Based on 2852 egos.
Language     % of egos
English      72.5
Portuguese   10.1
Japanese      5.4
Spanish       3.1
Indonesian    1.8
German        1.7
Dutch         1.0
Chinese       0.9
Korean        0.4
Swedish       0.4

Table 3. Top clusters.
Columns: (1) share of egos (%) [b]; (2) share of egos (%) for egos in dyads [c]; (3) share of alters (%) [d]; (4) locality [e].
Rank  Cluster [a]               (1)   (2)   (3)   (4)
 1    “New York”                8.5   8.3  10.2  54.3
 2    “Los Angeles, CA”         5.1   5.6  10.4  53.3
 3    “ ” (Tokyo)               4.1   4.8   5.0  62.9
 4    “London”                  3.6   3.3   4.9  48.8
 5    “São Paulo”               3.5   3.0   3.6  78.4
 6    “San Francisco”           2.8   2.7   4.1  41.2
 7    “New Jersey” [f]          2.5   2.8   2.1  20.0
 8    “Chicago”                 2.2   2.0   1.7  32.0
 9    “Washington, DC”          2.1   2.8   2.6  34.3
10    “Manchester, UK”          1.9   2.0   1.1  30.8
11    “Atlanta”                 1.7   2.1   2.1  46.2
12    “San Diego”               1.5   1.5   1.1  26.3
13    “Toronto, Canada”         1.3   1.1   1.5  42.9
14    “Seattle”                 1.3   1.4   1.2  58.8
15    “Houston”                 1.2   1.2   1.0  40.0
16    “Dallas, Texas”           1.2   1.0   1.4  61.5
17    “Rio de Janeiro”          1.2   1.0   1.1  30.8
18    “Boston, MA”              1.2   1.2   1.1  20.0
19    “Amsterdam”               1.1   1.1   0.9  50.0
20    “Jakarta, Indonesia”      1.1   0.6   0.3  42.9
21    “Austin, TX”              1.0   1.0   1.3  50.0
22    “Sydney”                  0.9   1.0   0.8  38.5
23    “Orlando, Florida”        0.9   1.0   0.6  16.7
24    “Phoenix, AZ”             0.8   0.7   0.6  11.1
25    “ ” (Hyōgo) [g]           0.8   1.0   1.0  25.0
[a] Each cluster is labeled with the name most frequently used for locations assigned to the cluster.
[b] Out of the 2167 egos located with precision of <25,000 km².
[c] Out of the 1259 egos included in dyads with both parties located with precision of <25,000 km².
[d] Out of the 1259 alters included in dyads with both parties located with precision of <25,000 km².
[e] Defined as the share of local ties among all ties for egos in a cluster.
[f] Centered between Philadelphia and Trenton, NJ and includes all locations identified as just “New Jersey”.
[g] Centered near the boundary between Hyōgo and Osaka prefectures, in the Kansai region of Japan.

[Background paper excerpts]
[...] accounts, by randomly drawing an account from among those “followed” by each of those egos. We then coded the locations of the alters using the same procedure as we did for the egos, removing those pairs where the alter could not be assigned to a country. In the end, we obtained a sample of 1953 ego-alter pairs with both the ego and the alter assigned to a country, including 1259 pairs with “specific” locations for both parties (Table 1).
4.4. Aggregating nearby locations
Since specific locations vary substantially in precision and since users can often choose between a range of specific names for the same place (e.g., “Palo Alto” vs. “Silicon Valley” vs. “SF Bay”), we aggregated nearby locations within each country, by assigning a set of coordinates (obtained from Google Maps) to each location smaller than 25,000 km² and then merging nearby locations within each country by replacing their coordinates with a weighted average of the coordinates of the merged locations. This reduced our location descriptions to a set of 386 regional clusters, which are comparable in size to metropolitan areas. We labeled each cluster with the most common name associated with it in our sample. For example, the cluster centered on Manhattan is referred to as “New York.”
5. Analysis
In this section we analyze the factors affecting the formation of Twitter ties. We first look at the effect of each variable identified earlier based on theoretical considerations: the actual physical distance, the frequency of air travel, national boundaries, and language differences. In addition to presenting the descriptive statistics demonstrating the effects of each variable and investigating the nature of such effects, we correlated the effects using the Quadratic Assignment Procedure (QAP, Krackhardt, 1987; Butts, 2007). In the last subsection we also examined the relationship between the variables using QAP regression (Double Dekker Semi-partialling MRQAP). All statistical calculations were done using UCINet 6.277 (Borgatti et al., 2002). For correlation and regression analysis we used networks with nodes representing the 25 largest regional clusters of users (see [...]
[...] between those two interpretations. We also note that top Twitter clusters intersect only to an extent with Alderson and Beckfield's (2004) ranking of world cities based on multinational corporations' branch headquarters. (Of Alderson and Beckfield's top 25 cities by in-degree or “prestige,” 13 appear in the top 25 Twitter clusters ranked by in-degree centrality, with another 6 appearing in top 100.)
5.3. National borders
Of the ties that were matched to countries, 75 percent connect users in the same country. This prevalence of domestic ties is [...]
[...] over half of the egos are in other countries, as are 4 of the 10 largest clusters: Tokyo, São Paulo, and two clusters in the United [...]
  • 35. www.bgoncalves.com@bgoncalves Mobility and Social Networks Coupling Mobility and Interactions in Social Media, PLoS One 9, E92196 (2014)
[Figure: Geography and Social Networks. Panel labels: Geography, Follower, Reply, ReTweet.]
[Background paper excerpts]
[...] and for their dependence on the distance. The error Err of this null model is between 0.66 and 0.76 for the three countries, around twice the error of the TF model (see Figure 6).
The linking model (L model) is a simplified version of the TF model, without random mobility and with the box size d → 0. Agents move to visit their contacts with probability p_v, whereas with probability 1 − p_v they do not perform any action. In this version of the model, users can connect only by random connections or when two of them coincide while visiting a common friend, which leads to triadic closure. These two processes do not depend on the distances between the users. A thorough description can be obtained with a mean-field approach (see the corresponding section). The results of the L model are shown in Figure 2. Due to the triangle closing mechanism, this null model creates networks with a considerable level of clustering. However, it does not [...]
[...] (e.g., for the US the TF model has Err lower by 0.5 and 1.5 than the TF-normal and the TF-uniform models, respectively, as shown in Figure 6). Simplified models that neglect either geography or network structure perform considerably worse than the TF model in reproducing the properties of real networks. Likewise, non-realistic assumptions on the human mobility mechanism yield worse results than the default TF model. To conclude, the coupling of geography and structure through a realistic mobility mechanism produces networks with significantly more realistic geographic and structural properties.
Sensitivity of the TF Model to the Parameters and its Modifications
The results presented so far have been obtained at the optimal [...]
Figure 4. Simulation results: mobility and social networks. Mobility (upper row) and ego networks (lower row) of 20 random users (different colors) for the instances of the TF model yielding the lowest error Err (see Figure 3). The mobility network shows mobility patterns of individual users throughout the entire simulation. The ego network shows the social connections at the end of the simulation. doi:10.1371/journal.pone.0092196.g004
  • 36. www.bgoncalves.com@bgoncalves Geo-Social Properties Coupling Mobility and Interactions in Social Media, PLoS One 9, E92196 (2014)
[Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity]
Figure 2. Network geo-social properties. Various statistical network properties are plotted for the data obtained from Twitter (red squares), Gowalla (blue diamonds), Brightkite (green triangles) and the null models (dashed lines), for the US (for the UK and Germany, see Figures S1 and S2). The spatial model (magenta), based on geography, matches well the data in P_l(d), but yields near-zero values for R(d), J_f(d) and C(d). The linking model (cyan), based on triadic closure, produces enough clustering, but it does not reproduce the distance dependencies of P_l(d), R(d), J_f(d) and C(d). doi:10.1371/journal.pone.0092196.g002
[Background paper excerpt] [...] that has also an edge between i and k, forming a triangle. Note that a triangle consists of 3 triads centered on different nodes. The effect of the distance on the clustering coefficient can be incorporated by measuring the distances from each central node j to two neighbors i and k forming a triad, $d = d_{ij} + d_{jk}$, and calculating the network clustering restricted to triads with distance d. This new function C(d) is the probability of closing a triad given the distance d in a triad:
$$C(d) = \frac{\Delta(d)}{\Lambda(d)},$$
where $\Lambda(d)$ and $\Delta(d)$ are the numbers of triads and closed triangles for the distance d, respectively. The value of the global clustering coefficient C can be recovered by averaging C(d) over d. In the datasets, we observe a drop in C(d) followed by a plateau, which is best visible for the US networks (Figure 2E).
Given a triangle, several configurations are possible if there is diversity in the edge lengths. The triangle can be equilateral if the edges have the same length, isosceles if two have the same length and the other is smaller, etc. We estimate the dominant shapes of the triangles in the network by measuring the disparity D, defined as:
$$D = 6\left[\frac{d_1^2 + d_2^2 + d_3^2}{(d_1 + d_2 + d_3)^2} - \frac{1}{3}\right],$$
where $d_1$, $d_2$ and $d_3$ are the geographical distances between the locations of the users forming the triangle. The disparity takes values between 0 and 1 as the shape of the triangle passes from equilateral to isosceles, where one edge is much smaller than the other two. D shows a distribution with two maxima in the online social networks (Figure 2F), for low and high values. The two m[...]
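The two quantities defined in the excerpt above, the triangle disparity D and the distance-dependent clustering C(d), are simple enough to check numerically. The sketch below is a minimal illustration rather than the authors' code: the function names are hypothetical, and C(d) is assumed to be computed from pre-binned counts of triads and closed triangles.

```python
def triangle_disparity(d1, d2, d3):
    """Disparity D = 6 * [(d1^2 + d2^2 + d3^2) / (d1 + d2 + d3)^2 - 1/3]."""
    total = d1 + d2 + d3
    if total == 0:
        raise ValueError("at least one edge must have non-zero length")
    return 6.0 * ((d1 * d1 + d2 * d2 + d3 * d3) / total ** 2 - 1.0 / 3.0)


def clustering_by_distance(triads, closed):
    """C(d) = closed triangles / triads, per distance bin.

    `triads` and `closed` map a distance bin to a count; bins without any
    triads are skipped because the ratio is undefined there.
    """
    return {d: closed.get(d, 0) / n for d, n in triads.items() if n > 0}


equilateral = triangle_disparity(1.0, 1.0, 1.0)   # all edges equal
degenerate = triangle_disparity(1.0, 1.0, 0.0)    # one edge vanishingly short
```

The two limiting cases reproduce the range stated in the text: an equilateral triangle gives D = 0, and an isosceles triangle whose third edge shrinks to zero gives D = 1.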
  • 37. www.bgoncalves.com@bgoncalves Geo-Social Model [Diagram labels] • Starting position of user u • Visit a random neighbour • Jump to a new location • New position of u • Detect all encounters e in the box of u • Created new social links PLoS One 9, E92196 (2014)
  • 38. www.bgoncalves.com@bgoncalves Model Fitting PLoS One 9, E92196 (2014)
Figure 3. Fitting the TF model. Values of the error Err when p_v and p_c are changed. The minimum error for each of the plots is marked with a red rectangle. Axis labels: Prob. to Make a New Friend; Prob. to Visit an Old Friend. doi:10.1371/journal.pone.0092196.g003
[Background paper excerpts]
[...] 0.39 for Germany. For simplicity, we focus on the Twitter networks only, although similar results are obtained for the other datasets.
Results. Simulations for the Optimal Parameters
An example with the displacements between the consecutive locations and the ego networks for a sample of individuals, as generated by the TF model, is displayed in Figure 4. The parameters of the model are set to the ones that correspond to the minimum of the error Err. As shown, the agents tend to stay close to their original positions. Occasional long jumps occur due to visits to friends who live far away. In this range of parameters and simulation times, the main mechanism for generating long distance [...]
[...] second null model, the linking model (L model), in contrast, is based only on random linking and triadic closure, and it is equivalent to the TF model without the mobility. We consider the two uncoupled null models and compare their results with those of the TF model. In this way, we demonstrate the importance of the coupling through a realistic mobility mechanism to reproduce the empirical networks.
The spatial model (S model) consists of randomly connecting pairs of users with a probability that decays as a power law of the distance between them (suggested in [41]). The exponent of the power law is fixed at −0.7 following Figure 2A. The results of the S model are shown in the panels of Figure 2. While it is set to match P_l(d), other properties such as P(k), R(d), J_f(d), C(d) or P(D) are not well reproduced. The S model fails to account for the high level of clustering and reciprocity in the empirical networks [...]
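The spatial (S) null model described on this slide connects pairs of users at random, with a probability that decays as a power law of their distance with exponent −0.7. A minimal sketch follows. The prefactor `scale`, the planar Euclidean distance, the clamping of distances below one unit, and the use of undirected links are all simplifying assumptions made for illustration; the slide specifies none of them.

```python
import random


def spatial_model_links(positions, exponent=-0.7, scale=1.0, seed=None):
    """Link each pair of users with probability min(1, scale * distance**exponent).

    `positions` maps a user id to planar (x, y) coordinates. Distances below
    one unit are treated as one unit so that the power law stays finite.
    """
    rng = random.Random(seed)
    users = list(positions)
    links = []
    for index, u in enumerate(users):
        for v in users[index + 1:]:
            (x1, y1), (x2, y2) = positions[u], positions[v]
            distance = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
            probability = min(1.0, scale * max(distance, 1.0) ** exponent)
            if rng.random() < probability:
                links.append((u, v))
    return links


# Hypothetical users on a line; nearby pairs are linked more often than distant ones.
example_links = spatial_model_links(
    {"a": (0, 0), "b": (0, 1), "c": (0, 500)}, scale=0.5, seed=42
)
```

Because every pair is drawn independently, a model of this kind has no mechanism for closing triangles, which is consistent with the slide's remark that it fails to reproduce clustering and reciprocity.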
  • 39. www.bgoncalves.com@bgoncalves Model Results Coupling Mobility and Interactions in Social Media, PLoS One 9, E92196 (2014)
[Figure panels: Prob of a Link, Reciprocity, Clustering, Triangle Disparity]
Figure 5. Geo-social properties of the model networks. Various statistical properties are plotted for the networks obtained from Twitter data (red squares) and from simulation of the TF model (black line) for the US. Corresponding results for the UK and Germany can be found in Figures S3 and S4. doi:10.1371/journal.pone.0092196.g005
[Background paper fragment] [...] random connections, and so the distribution of triangles disparity prevents the model from producing networks with characteristics [...]
  • 40. www.bgoncalves.com@bgoncalves Human Diffusion (a) Starting from Paris (b) Starting from New York J. R. Soc. Interface 12, 20150473 (2015)
  • 41. www.bgoncalves.com@bgoncalves Human Diffusion (b) Starting from New York J. R. Soc. Interface 12, 20150473 (2015)
  • 42. www.bgoncalves.com@bgoncalves Residents and Tourists J. R. Soc. Interface 12, 20150473 (2015)
  • 43. www.bgoncalves.com@bgoncalves Residents and Tourists [Figure, panels a–d. (a) Coverage for Local and Non-Local users. (b) Coverage versus Proportion of Non-Local Users. (c) Cities ranked by Coverage: New York, Chicago, San Francisco, Shanghai, Dallas, Berlin, Paris, Saint Petersburg, Beijing, Moscow. (d) Cities ranked by Coverage: Houston, Barcelona, Brussels, Detroit, Lima, Istanbul, Rome, Moscow, Paris, Lisbon.] J. R. Soc. Interface 12, 20150473 (2015)
  • 44. www.bgoncalves.com@bgoncalves City Communities [Figure: Weighted Betweenness (×10²) and Weighted degree for Los Angeles, San Francisco, Miami, Singapore, Tokyo, Paris, London, New York.] J. R. Soc. Interface 12, 20150473 (2015)
  • 46. www.bgoncalves.com@bgoncalves #tags • Metadata added to a Tweet for topic marking • Originally proposed by Chris Messina in 2007 • Quickly adopted informally by the Twitter community • Native support added by Twitter after it became popular
  • 47. www.bgoncalves.com@bgoncalves Hashtag Statistics [Figures: number of users per tag and number of tweets per tag, on logarithmic axes from 10¹ to 10⁵, with a threshold marked at 500 users. Labelled tags: swsx, swineflu, gfail, peace, watchmen, nsotu, winnenden, masters.] WWW’12, 251 (2012)
  • 48. www.bgoncalves.com@bgoncalves Activity Peak Detection • Peak: activity relative to the baseline has to be 10 times larger • A minimal level of activity is expected • Selection of isolated popularity bursts (no other peaks one week before or after) • We detected 115 peaks [Figure labels: continuous, periodic, peak; example tags #video, #ff, #w2e] WWW’12, 251 (2012)
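The three selection criteria on this slide (ten times the baseline, a minimal level of activity, and isolation from other peaks within a week) can be sketched as a filter over a daily tweet-count series. This is only an illustration under stated assumptions: the slide does not define the baseline, so the median of the series is used here, and the `min_activity` threshold of 20 tweets is a hypothetical value.

```python
from statistics import median


def detect_isolated_peaks(daily_counts, ratio=10.0, min_activity=20, isolation_days=7):
    """Return indices of days that qualify as isolated activity peaks.

    Assumptions (not from the slides): the baseline is the median daily count,
    floored at 1 to avoid a zero baseline, and `min_activity` is arbitrary.
    """
    baseline = max(median(daily_counts), 1)
    # Criteria 1 and 2: large relative to the baseline and large in absolute terms.
    candidates = [
        day for day, count in enumerate(daily_counts)
        if count >= min_activity and count >= ratio * baseline
    ]
    # Criterion 3: no other candidate peak within `isolation_days` before or after.
    return [
        day for day in candidates
        if not any(other != day and abs(other - day) <= isolation_days for other in candidates)
    ]


# Hypothetical 30-day series with a single burst on day 15.
series = [5] * 30
series[15] = 100
example_peaks = detect_isolated_peaks(series)
```

Tags with continuous or periodic activity (the figure's other two labels) would be rejected by such a filter, either because no day stands out against the baseline or because the bursts recur within a week of each other.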
  • 49. www.bgoncalves.com@bgoncalves Peak Characterization [Figure: daily number of tweets versus days relative to the peak, with the peak and baseline levels marked and the activity split into Before Peak, Peak and After Peak. Example panels: #winnenden (school shooting, Mar 3, 2009) and #watchmen (movie release date, Mar 6, 2009).] WWW’12, 251 (2012)
  • 50. www.bgoncalves.com@bgoncalves Some Examples [Figure: number of tweets and user ID versus days after peak for four tags: #masters (master cup finale, Apr 9, 2009), #winnenden (school shooting, Mar 3, 2009), #watchmen (movie release date, Mar 6, 2009), #nsotu (Obama's first state of the union, Feb 25, 2009). Class labels: Anticipation, Reaction, “Instantaneous”, “Anticipation + Reaction”.] WWW’12, 251 (2012)
  • 51. www.bgoncalves.com@bgoncalves Classes of Peaks • Anticipatory behavior • Increasing amount of tweets until the event • Sharp drop of attention after the event • Unexpected events • Driven by exogenous sources • Activity concentrated on the peak day • Events that are only discussed while they are happening • Collective attention is built up to a peak intensity, then attention shifts away
[Figure: triangle with corners 100% before, 100% peak, 100% after and opposite edges 0% before (f_b = 0), 0% peak (f_p = 0), 0% after (f_a = 0). Tags shown: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, nsotu, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools.] WWW’12, 251 (2012)
  • 52. www.bgoncalves.com@bgoncalves Barycentric Coordinates [Figure: 2D-Simplex with labelled points (1,0,0), (0,1,0), (0,0,1), (1/2,1/2,0), (1/2,0,1/2), (0,1/2,1/2), (1/3,1/3,1/3), (1/2,1/4,1/4), (1/4,1/2,1/4), (1/4,1/4,1/2), shown next to the example panels for #masters, #winnenden, #watchmen and #nsotu.] WWW’12, 251 (2012)
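Each peak can be summarised by the fractions of its tweets posted before, during and after the peak day. These three fractions sum to one, so they are barycentric coordinates of a point in a 2D simplex (a triangle). The sketch below is an illustration under stated assumptions: `daily_counts` is taken to be the already-selected time window around the peak, and the placement of the three corners in the plane is arbitrary, not necessarily the orientation used in the slides.

```python
import math


def peak_fractions(daily_counts, peak_day):
    """Return (f_before, f_peak, f_after) for a window of daily tweet counts."""
    before = sum(daily_counts[:peak_day])
    peak = daily_counts[peak_day]
    after = sum(daily_counts[peak_day + 1:])
    total = before + peak + after
    if total == 0:
        raise ValueError("the window contains no tweets")
    return before / total, peak / total, after / total


# Corner positions of the triangle (an arbitrary choice for illustration).
VERTICES = {
    "before": (0.0, 0.0),
    "peak": (0.5, math.sqrt(3) / 2),
    "after": (1.0, 0.0),
}


def to_simplex_xy(f_before, f_peak, f_after):
    """Map barycentric coordinates to planar (x, y) as a weighted mean of the corners."""
    weights = {"before": f_before, "peak": f_peak, "after": f_after}
    x = sum(weights[name] * VERTICES[name][0] for name in VERTICES)
    y = sum(weights[name] * VERTICES[name][1] for name in VERTICES)
    return x, y


# Hypothetical four-day window peaking on day 2.
example_fractions = peak_fractions([1, 2, 4, 1], peak_day=2)
example_point = to_simplex_xy(*example_fractions)
```

A peak with all its activity on the peak day lands on the "peak" corner, while the point (1/3, 1/3, 1/3) lands at the centre of the triangle.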
  • 53. www.bgoncalves.com@bgoncalves Barycentric Coordinates [Figure: 2D-Simplex with corners 100% before, 100% peak, 100% after and opposite edges 0% before, 0% peak, 0% after, shown next to the example panels for #masters, #winnenden, #watchmen and #nsotu. Tags placed in the simplex: swineflu, h1n1, sxswi, easter, teaparty, advertising, masters, nfl, earthhour, twestival, plurk, firstfollow, mrtweet, cebit, bsg, cricket, google, hadopi, inaug09, drupalcon, coalition, geek, w2e, humor, davos, watchmen, job, house, mikeyy, superbowl, gfail, blackout, oscar, snowmageddon, grammys, zombies, rp09, brand, skittles, phish, ces09, socialmedia, winnenden, peace, macheist, earthday, amazonfail, fridayfollow, aprilfools.] WWW’12, 251 (2012)
  • 54. www.bgoncalves.com@bgoncalves Barycentric Coordinates [Figure: the same simplex and tags as on the previous slide, with the example panels for #masters (master cup finale, Apr 9, 2009), #winnenden (school shooting, Mar 3, 2009), #watchmen (movie release date, Mar 6, 2009) and #nsotu (Obama's first state of the union, Feb 25, 2009); the #watchmen and #nsotu panels are highlighted.] WWW’12, 251 (2012)
  • 64. www.bgoncalves.com@bgoncalves Superdialects [Figure with panels A), B) and C), repeated across animation steps. Axes: f(K) and silhouette versus number of clusters; Population (×10⁵) per cluster, with clusters labelled α and β (N = 179 and N = 956). Cities labelled on the map: Mexico City, Guatemala, San Salvador, Caracas, San Jose, Panama, Bogota, Quito, Lima, Asuncion, Cordoba, Santiago, Buenos Aires, Santiago De Compostela, Palma De Mallorca, Santander, Oviedo, Bilbao, Zaragoza, Valladolid, Barcelona, Madrid, Seville, San Diego, Miami, New York, San Juan, Santo Domingo.] PLoS One 9, E112074 (2014)
  • 69. Global Language Network [Figure: language networks built from Twitter, Wikipedia, and book translations. Nodes are languages, colored by language family (Afro-Asiatic, Altaic, Amerindian, Austronesian, Caucasian, Creoles and pidgins, Dravidian, Indo-European, Niger-Congo, Other, Sino-Tibetan, Tai, Uralic) and sized by population (1 million to 1 billion). Link weight and color encode co-occurrences (users, editors, translations) and the t-statistic (legend value 102.59). Co-occurrence legend, in the order Twitter, Wikipedia, book translations: min 6, 6, 6; max 994,682, 49,637, 183,329.] PNAS 111, E5616 (2014)
  • 70. Global Language Network [Figure: Wikipedia and Twitter language networks. Nodes are languages, colored by language family and sized by population; link weight and color encode the t-statistic.] PNAS 111, E5616 (2014)
  • 71. Global Language Network [Figure: book translations language network. Nodes are languages, including historical ones such as Old English (ca. 450-1100), Middle English (1100-1500), Old French (842-ca. 1400), and Ancient Greek (to 1453).] PNAS 111, E5616 (2014)
  • 72. Global Language Network [...] numbers are 41% and 63%. In contrast, the correlation between the representation of languages in Twitter and Book Translations is 0.63 (R² = 40%), and the correlation between the strength of links is only 0.48 (R² = 23%). Finally, we note that, with respect to the book translation dataset, the two digital datasets (Twitter and Wikipedia) are overexpressed in languages associated with developing countries, like Malay, Filipino and Swahili. This indicates that these digital media are more inclusive of the populations of developing countries than written books. PNAS 111, E5616 (2014)
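    The percentages in parentheses are the squares of the quoted correlations. A quick check in Python, using only the two values given on this slide (the helper name `r_squared_percent` is ours, for illustration):

    ```python
    # Coefficient of determination from a Pearson correlation: R^2 = r^2.
    # The two r values are the ones quoted on the slide.
    correlations = {
        "language representation (Twitter vs. book translations)": 0.63,
        "link strength (Twitter vs. book translations)": 0.48,
    }


    def r_squared_percent(r: float) -> int:
        """Return r squared as a whole-number percentage."""
        return round(100 * r * r)


    for label, r in correlations.items():
        print(f"{label}: r = {r:.2f}, R^2 = {r_squared_percent(r)}%")
    ```

    A correlation of 0.63 therefore explains about 40% of the variance, and 0.48 about 23%.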
  • 73. Language and Fame [Figure, panels A to F: log10 of the number of famous people per language versus log10 of the language's eigenvector centrality; points are labeled with language codes, colored by GDP per capita ($0k to $50k), and sized by number of speakers (400 M, 800 M, 1200 M). R² values as extracted, each with p-value 0.001: A 0.447, B 0.755, C 0.693, D 0.399, E 0.758, F 0.858.] Fig. 3. The position of a language in the GLN and the global impact of its speakers. Top row shows the number of people per language (born 1800–1950) with articles in at least 26 Wikipedia language editions as a function of their language’s eigenvector centrality in the (A) Twitter GLN, (B) Wikipedia GLN, and (C) book translation GLN. The bottom row shows the number of people per language (born 1800–1950) listed in Human Accomplishment as a function of their language’s eigenvector centrality in (D) Twitter GLN, (E) Wikipedia GLN, and (F) book translation GLN. Size represents the number of speakers for each [...] PNAS 111, E5616 (2014)
  • 75. Collective Attention “Prediction is very difficult, especially about the future.” (Niels Bohr)
  • 76. Even more so in Political Elections http://truthy.indiana.edu [Figure, panels A to H; examples shown: #ampat, @PeaceKaren_25, gopleader.gov, “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”]
  • 77. Even more so in Political Elections http://truthy.indiana.edu [Figure, panels A to H; examples shown: #ampat, @PeaceKaren_25, gopleader.gov, “How Chris Coons budget works- uses tax $ 2 attend dinners and fashion shows”]
    Table 1: Features used in truthy classification.
      nodes             Number of nodes
      edges             Number of edges
      mean k            Mean degree
      mean s            Mean strength
      mean w            Mean edge weight in largest connected component
      max k(i,o)        Maximum (in,out)-degree
      max k(i,o) user   User with max. (in,out)-degree
      max s(i,o)        Maximum (in,out)-strength
      max s(i,o) user   User with max. (in,out)-strength
      std k(i,o)        Std. dev. of (in,out)-degree
      std s(i,o)        Std. dev. of (in,out)-strength
      skew k(i,o)       Skew of (in,out)-degree distribution
      skew s(i,o)       Skew of (in,out)-strength distribution
      mean cc           Mean size of connected components
      max cc            Size of largest connected component
      entry nodes       Number of unique injections
      num truthy        Number of times ‘truthy’ button was clicked
      sentiment scores  Six GPOMS sentiment dimensions
    [...] graph. These include the number of nodes and edges in the graph, the mean degree and strength of nodes in the graph, mean edge weight, mean clustering coefficient across nodes in the largest connected component, and the standard deviation and skew of each network’s in-degree, out-degree and strength distributions (see Fig. 2). Additionally we track the out-degree and out-strength of the most prolific broadcaster, as well as the in-degree and in-strength of the most focused-upon user. We also monitor the number of unique injection points of the meme, reasoning that organic memes (such as those relating to news events) will be associated with larger number of originating users.
    4.4 Sentiment Analysis
    We also utilize a modified version of the Google-based Profile of Mood States (GPOMS) sentiment analysis method (Bollen, Mao, and Pepe 2010) in the analysis of meme-specific sentiment on Twitter. The GPOMS tool as[...]
    Table 2: Performance of two classifiers with and without resampling training data to equalize class sizes. All results are averaged based on 10-fold cross-validation.
      Classifier  Resampling?  Accuracy  AUC
      AdaBoost    No           92.6%     0.91
      AdaBoost    Yes          96.4%     0.99
      SVM         No           88.3%     0.77
      SVM         Yes          95.6%     0.95
    Table 3: Confusion matrices for a boosted decision stump classifier with and without resampling. The labels on the rows refer to true class assignments; the labels on the columns are those predicted.
                  No resampling            With resampling
                  Truthy     Legitimate    Truthy      Legitimate
      T           45 (12%)   16 (4%)       165 (45%)   6 (1%)
      L           11 (3%)    294 (80%)     7 (2%)      188 (51%)
    [...] additional volunteers), and asking them to place each meme in one of the three categories. A meme was to be classified as ‘truthy’ if a significant portion of the users involved in that meme appeared to be spreading it in misleading ways, e.g., if a number of the accounts tweeting about the meme appeared to be robots or sock puppets, the accounts appeared to follow only other propagators of the meme (clique behavior), or the users engaged in repeated reply/retweet exclusively with other users who had tweeted the meme. ‘Legitimate’ memes were described as memes representing normal use of Twitter: several non-automated users conversing about a topic. The final category, ‘remove,’ was used for memes in a non-English language or otherwise unrelated to U.S. politics (#youth, for example). These memes were not used in the training or evaluation of classifiers. Upon gathering 252 annotated memes, we observed an imbalance in our labeled data (231 legitimate and only 21 truthy). Rather than simply resampling from the smaller class, as is common practice in the case of class imbal[...]
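    Several of the Table 1 features are plain degree and strength statistics of a weighted, directed diffusion network. The sketch below shows one way they might be computed with the standard library only. The edge list, the function name `network_features`, the direction convention, the use of population standard deviation, and the reading of "entry nodes" as nodes with no incoming edge are all assumptions made for illustration, not the paper's implementation.

    ```python
    from collections import defaultdict
    from statistics import mean, pstdev

    # Hypothetical weighted, directed edges (source, target, weight).
    # Assumed convention: information flows from source to target.
    EDGES = [
        ("alice", "bob", 3),
        ("alice", "carol", 1),
        ("bob", "carol", 2),
        ("dave", "alice", 1),
    ]


    def network_features(edges):
        """Compute a handful of Table 1-style features from an edge list."""
        k_in, k_out = defaultdict(int), defaultdict(int)  # degrees
        s_in, s_out = defaultdict(int), defaultdict(int)  # strengths
        for src, dst, weight in edges:
            k_out[src] += 1
            k_in[dst] += 1
            s_out[src] += weight
            s_in[dst] += weight
        nodes = sorted(set(k_in) | set(k_out))
        return {
            "nodes": len(nodes),
            "edges": len(edges),
            # Mean total (in + out) degree and strength per node.
            "mean_k": mean([k_in[n] + k_out[n] for n in nodes]),
            "mean_s": mean([s_in[n] + s_out[n] for n in nodes]),
            # Mean edge weight over all edges (not restricted to a component).
            "mean_w": mean([w for _, _, w in edges]),
            "max_k_in_user": max(nodes, key=lambda n: k_in[n]),
            "max_k_out_user": max(nodes, key=lambda n: k_out[n]),
            "std_k_in": pstdev([k_in[n] for n in nodes]),
            "std_k_out": pstdev([k_out[n] for n in nodes]),
            # Assumed proxy for "unique injections": nodes nobody points to.
            "entry_nodes": sum(1 for n in nodes if k_in[n] == 0),
        }


    features = network_features(EDGES)
    for name, value in features.items():
        print(f"{name}: {value}")
    ```

    In this toy network the most focused-upon user is `carol` and the most prolific broadcaster is `alice`; skewness and connected-component sizes are left out to keep the sketch short.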
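    The counts in Table 3 are enough to recompute the headline metrics. The sketch below uses only the numbers from the table; the function name `summarize` and the choice to treat ‘truthy’ as the positive class are ours.

    ```python
    # Confusion matrices from Table 3.
    # Rows: true class (truthy, legitimate); columns: predicted class.
    NO_RESAMPLING = [[45, 16], [11, 294]]
    WITH_RESAMPLING = [[165, 6], [7, 188]]


    def summarize(cm):
        """Accuracy, precision and recall with 'truthy' as the positive class."""
        (tp, fn), (fp, tn) = cm
        total = tp + fn + fp + tn
        return {
            "accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
        }


    for name, cm in [("no resampling", NO_RESAMPLING),
                     ("with resampling", WITH_RESAMPLING)]:
        s = summarize(cm)
        print(f"{name}: accuracy={s['accuracy']:.1%}, "
              f"precision={s['precision']:.1%}, recall={s['recall']:.1%}")
    ```

    The two accuracies come out at 92.6% and 96.4%, matching the AdaBoost rows of Table 2. Recall on the truthy class is where resampling helps most, since without it the minority class is easy to miss.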
  • 79. Even more so in Political Elections http://truthy.indiana.edu Why not start with something a bit simpler?
  • 80. American Idol • Popularity contest • Well-defined audience, across the entire US • Similar demographics voting and tweeting • Weekly “votes”, involving the same population • Immediate results • (Almost) No incentives for organized campaigns
  • 81. Calibration: Top 5 [Figure: % of tweets per contestant, axis 0 to 70; labels: Skylar, Jessica, Phillip, Joshua, Hollie] EPJ Data Science 1, 8 (2012)
  • 82. Calibration: Top 4 [Figure: % of tweets per contestant, axis 0 to 70; labels: Skylar, Jessica, Phillip, Joshua, Hollie] EPJ Data Science 1, 8 (2012)
  • 83. Calibration: Top 3 [Figure: % of tweets per contestant, axis 0 to 70; labels: Skylar, Jessica, Phillip, Joshua] EPJ Data Science 1, 8 (2012)
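    The quantity plotted on the calibration slides is each contestant's share of tweets. A minimal sketch of how such a share could be computed is below; the tweet texts are invented, and simple case-insensitive substring matching is an assumption, not necessarily the matching rule used in the paper.

    ```python
    from collections import Counter

    CONTESTANTS = ["Jessica", "Phillip", "Joshua", "Hollie", "Skylar"]

    # Hypothetical tweet texts, for illustration only.
    TWEETS = [
        "Voting for Phillip tonight!",
        "Jessica was amazing #idol",
        "phillip phillips all the way",
        "Joshua killed it, but Jessica too",
        "no idol talk here",
    ]


    def tweet_share(tweets, names):
        """Percentage of contestant mentions going to each contestant.

        A tweet counts at most once per contestant it mentions.
        """
        counts = Counter()
        for text in tweets:
            lowered = text.lower()
            for name in names:
                if name.lower() in lowered:
                    counts[name] += 1
        total = sum(counts.values())
        return {n: (100 * counts[n] / total if total else 0.0) for n in names}


    for name, share in tweet_share(TWEETS, CONTESTANTS).items():
        print(f"{name}: {share:.0f}% of mentions")
    ```

    With real data the same counting would be repeated for each voting window, so that shares can be compared with the weekly elimination results.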
  • 84. Geographic Locations [Figure: geographic locations of tweets by contestant (Jessica, Phillip, Joshua, Hollie, Skylar), panels (A) to (C), for Top 5, Top 4, and Top 3] EPJ Data Science 1, 8 (2012)
  • 86. And the winner is... [Figure: % of tweets for Jessica and Phillip, World and U.S., axis 0 to 80] EPJ Data Science 1, 8 (2012)
  • 87. And the winner is... [Figure: % of tweets for Jessica and Phillip, U.S., axis 0 to 80] EPJ Data Science 1, 8 (2012)
  • 88. And the winner is... EPJ Data Science 1, 8 (2012)
  • 89. Stock Market [Figure: methodology diagram. Twitter feed → text analysis → daily mood indicators from (1) OpinionFinder and (2) G-POMS (6 dim.); DJIA → daily stock market values (3); normalization; Granger causality (lag, F-statistic, p-value); SOFNN (predicted value, MAPE, Direction %). Data sets and timeline: Feb 28 2008 to Dec 20 2008, covering (1) OF ~ GPOMS, (2) Granger causality analysis, (3) SOFNN training and test.] Fig. 1. Diagram outlining 3 phases of methodology and corresponding data sets: (1) creation and validation of OpinionFinder and GPOMS public mood [...]
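    The diagram names two evaluation measures for the predicted DJIA values, MAPE and direction accuracy. A small sketch of both is below. The price series are invented, and the definition of direction (predicted move relative to the previous actual value) is our assumption; the slide does not spell it out.

    ```python
    def mape(actual, predicted):
        """Mean absolute percentage error, in percent."""
        errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
        return 100 * sum(errors) / len(errors)


    def direction_accuracy(actual, predicted):
        """Percent of steps where the predicted up/down move matches reality.

        Both moves are measured against the previous actual value (assumed).
        """
        hits = 0
        for t in range(1, len(actual)):
            actual_up = actual[t] > actual[t - 1]
            predicted_up = predicted[t] > actual[t - 1]
            hits += actual_up == predicted_up
        return 100 * hits / (len(actual) - 1)


    # Hypothetical closing values and predictions, for illustration only.
    ACTUAL = [100.0, 102.0, 101.0, 104.0]
    PREDICTED = [100.0, 101.0, 103.0, 103.0]

    print(f"MAPE: {mape(ACTUAL, PREDICTED):.2f}%")
    print(f"Direction: {direction_accuracy(ACTUAL, PREDICTED):.1f}%")
    ```

    The two measures answer different questions: MAPE rewards being close in value, while direction accuracy only asks whether the sign of the daily change was right.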
  • 90. POMS • Simple questionnaire that classifies a person’s mood along 6 dimensions: • tension-anxiety • depression-dejection • anger-hostility • fatigue-inertia • vigor-activity • confusion-bewilderment • How to administer it to Twitter users? • Expand vocabulary using Google n-grams • Search Twitter for matching words
Profile of Mood States (questionnaire form; fields: Subject's Initials, Birth date, Date, Subject Code No.)
Directions: Describe HOW YOU FEEL RIGHT NOW by circling the most appropriate number after each of the words listed below.
Scale: 1 = Not at all, 2 = A little, 3 = Moderate, 4 = Quite a bit, 5 = Extremely
Items: 1. Friendly; 2. Tense; 3. Angry; 4. Worn Out; 5. Unhappy; 6. Clear-headed; 7. Lively; 8. Confused; 9. Sorry for things done; 10. Shaky; 11. Listless; 12. Peeved; 13. Considerate; 14. Sad; 15. Active; 16. On edge; 17. Grouchy; 18. Blue; 19. Energetic; 20. Panicky; 21. Hopeless; 22. Relaxed; 23. Unworthy
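The "search Twitter for matching words" step can be sketched as a simple lexicon lookup. Everything below is an assumption made for illustration: the grouping of questionnaire adjectives into the six dimensions is a guess (the slide does not give it), the n-gram vocabulary expansion is left out, and the names `MOOD_LEXICON` and `mood_counts` are hypothetical.

```python
import re

# Hypothetical seed lexicon: single-word adjectives from the questionnaire,
# grouped by dimension. The grouping is an assumption, not taken from the slide.
MOOD_LEXICON = {
    "tension-anxiety": {"tense", "shaky", "panicky"},
    "depression-dejection": {"unhappy", "sad", "blue", "hopeless", "unworthy"},
    "anger-hostility": {"angry", "peeved", "grouchy"},
    "fatigue-inertia": {"listless"},
    "vigor-activity": {"lively", "active", "energetic"},
    "confusion-bewilderment": {"confused"},
}


def mood_counts(tweets, lexicon=MOOD_LEXICON):
    """Count, per mood dimension, how many lexicon words occur in the tweets."""
    counts = {dimension: 0 for dimension in lexicon}
    for text in tweets:
        # Crude tokenisation: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z]+", text.lower())
        for dimension, words in lexicon.items():
            counts[dimension] += sum(1 for token in tokens if token in words)
    return counts


# Made-up tweets, for illustration only.
sample_tweets = [
    "Feeling tense and shaky before the exam",
    "So sad and hopeless today",
    "Energetic and lively morning run!",
    "I am confused",
]

daily_mood = mood_counts(sample_tweets)
print(daily_mood)
```

Running `mood_counts` once per day of tweets would give one time series per dimension. Ambiguous words such as "blue" would need disambiguation in any serious use.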
  • 91. Timelines along each mood dimension
[…] counterpart to the differentiated response to the Presidential election. On Thanksgiving day we find a spike in Happy values, indicating high levels of public happiness. However, no other mood dimensions are elevated on November 27. Furthermore, the spike in Happy values is limited to the one day, i.e. we find no significant mood response the day before or after Thanksgiving.
[Figure: z-score timelines, Oct 22 to Nov 26, for OpinionFinder and the GPOMS dimensions Calm, Alert, Sure, Vital, Kind, and Happy; annotations: pre-election anxiety, pre-election energy, election results, day after election, Thanksgiving happiness]
Fig. 2. Tracking public mood states from tweets posted between October 2008 to December 2008 shows public responses to presidential election and Thanksgiving.
[…] partially overlap with the mood values provided by OpinionFinder, but not necessarily all mood dimensions that […] important in describing the various components of […], e.g. the varied mood response to the Presidential […]. GPOMS thus provides a unique perspective on […] states not captured by uni-dimensional tools such […] OpinionFinder.
Granger Causality Analysis of Mood vs. DJIA
[…] establishing that our mood time series responds to […] socio-cultural events such as the Presidential election […] Thanksgiving, we are concerned with the question […] variations of the public’s mood state correlate […] in the stock market, in particular DJIA closing […]. […] answer this question, we apply the econometric […] Granger causality analysis to the daily time […] by GPOMS and OpinionFinder vs. the DJIA. […] causality analysis rests on the assumption that if a […] causes Y then changes in X will systematically […] changes in Y. We will thus find that the lagged […] will exhibit a statistically significant correlation […]. Correlation however does not prove causation. We […] Granger causality analysis in a similar fashion […] are not testing actual causation but whether one […] has predictive information about the other or not.
[…] time series, denoted D_t, is defined to reflect daily […] stock market value, i.e. its values are the delta […]
[…] high level of confidence. However, this result only applies to 1 GPOMS mood dimension. We observe that X_1 (i.e. Calm) has the highest Granger causality relation with DJIA for lags ranging from 2 to 6 days (p-values < 0.05). The other four mood dimensions of GPOMS do not have significant causal relations with changes in the stock market, and neither does the OpinionFinder time series. To visualize the correlation between X_1 and the DJIA in more detail, we plot both time series in Fig. 3. To maintain the same scale, we convert the DJIA delta values D_t and mood index value X_t to z-scores as shown in Eq. 1.
[Figure: z-score plots of DJIA and Calm, Aug 09 to Oct 28; annotation: bank bail-out]
Fig. 3. A panel of three graphs. The top graph shows the overlap of the day-to-day difference of DJIA values (blue: Z_Dt) with the GPOMS’ Calm […]
Look for correlations between dimensions and DJIA
Twitter mood predicts the stock market. Johan Bollen*, Huina Mao*, Xiao-Jun Zeng. *: authors made equal contributions.
Abstract: Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e. can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy).
We cross-validate the resulting mood time series by comparing their ability to detect the public’s response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
Index Terms: stock market prediction, twitter, mood analysis.
I. INTRODUCTION
Stock market prediction has attracted much attention from academia as well as business. But can the stock market really be predicted? Early research on stock market prediction [1], [2], [3] was based on random walk theory and the Efficient Market Hypothesis (EMH) [4]. According to the EMH stock market prices are largely driven by new information, i.e. news, rather than present and past prices. Since news is unpredictable, stock market prices will follow a random walk pattern and cannot be predicted with more than 50 percent accuracy [5].
There are two problems with EMH. First, numerous studies show that stock market prices do not follow a random walk and can indeed to some degree be predicted [5], [6], [7], [8] thereby calling into question EMH’s basic assumptions. Second, recent research suggests that news may be unpredictable but that very early indicators can be extracted from online social media (blogs, Twitter feeds, etc) to predict changes in various economic and commercial indicators. This may conceivably also be the case for the stock market. For example, [11] shows how online chat activity predicts book sales.
[12] uses assessments of blog sentiment to predict movie sales. […] sentiment from blogs. In addition, Google search queries have been shown to provide early indicators of disease infection rates and consumer spending [14]. [9] investigates the relations between breaking financial news and stock price changes. Most recently [13] provide a ground-breaking demonstration of how public sentiment related to movies, as expressed on Twitter, can actually predict box office receipts.
Although news most certainly influences stock market prices, public mood states or sentiment may play an equally important role. We know from psychological research that emotions, in addition to information, play a significant role in human decision-making [16], [18], [39]. Behavioral finance has provided further proof that financial decisions are significantly driven by emotion and mood [19]. It is therefore reasonable to assume that the public mood and sentiment can drive stock market values as much as news. This is supported by recent research by [10] who extract an indicator of public anxiety from LiveJournal posts and investigate whether its variations can predict SP500 values.
However, if it is our goal to study how public mood influences the stock markets, we need reliable, scalable and early assessments of the public mood at a time-scale and resolution appropriate for practical stock market prediction. Large surveys of public mood over representative samples of the population are generally expensive and time-consuming to conduct, cf. Gallup’s opinion polls and various consumer and well-being indices. Some have therefore proposed indirect assessment of public mood or sentiment from the results of soccer games [20] and from weather conditions [21]. The accuracy of these methods is however limited by the low degree to which the chosen indicators are expected to be correlated with public mood.
Over the past 5 years significant progress has been made in sentiment tracking techniques that extract indicators of public mood directly from social media content such as blog content [10], [12], [15], [17] and in particular large-scale Twitter feeds [22]. Although each so-called tweet, i.e. an individual user post, is limited to only 140 characters, the aggregate of millions of tweets submitted to Twitter at any given time may provide an accurate representation of public mood and sentiment. This has led to the development of real-time sentiment-tracking indicators such as [17] and “Pulse of Nation”.
In this paper we investigate whether public sentiment, as expressed in large-scale collections of daily Twitter posts, can be used to predict the stock market. We use two tools to measure variations in the public mood from tweets submitted
arXiv:1010.3003v1 [cs.CE] 14 Oct 2010
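The idea of putting a mood series and the DJIA changes on a common scale, then asking whether earlier mood values line up with later market moves, can be sketched as follows. This is not the Granger causality test or the SOFNN used in the paper: it is a plain lagged Pearson correlation, swapped in because it needs only the standard library. The global z-score used here is an assumption (the excerpt refers to Eq. 1 without reproducing it), and the toy series and function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        raise ValueError("constant series has no z-score")
    return [(x - mu) / sigma for x in series]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    if len(xs) != len(ys):
        raise ValueError("series must have equal length")
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def lagged_correlation(mood, djia_delta, lag):
    """Correlate mood on day t - lag with the DJIA change on day t."""
    if lag < 0 or lag >= len(mood) - 1:
        raise ValueError("lag out of range")
    if lag == 0:
        return pearson(mood, djia_delta)
    return pearson(mood[:-lag], djia_delta[lag:])


# Toy data: the "market" series is the mood series delayed by two days.
mood = [0.1, 0.5, -0.3, 0.8, -0.6, 0.2, 0.9, -0.4, 0.3, -0.1]
djia_delta = [0.0, 0.0] + mood[:-2]

calm_z = z_scores(mood)
corr_lag2 = lagged_correlation(mood, djia_delta, 2)
print(corr_lag2)
```

A proper Granger test additionally controls for the market's own past values, which this sketch does not; a high lagged correlation on its own is weaker evidence.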
  • 95. Coming Soon! • CompleNet 2016 • Dijon, France, March 23-25