1. The document proposes a new method for crawling social networks using a Chrome extension called Twaater, which attaches to a Twitter web app to automatically log user interactions.
2. Twaater is used to collect tweet metrics over time to analyze how factors like retweets, favorites and hashtag popularity correlate with a tweet's visibility in hashtag streams.
3. Preliminary analysis found that favorites in particular contribute to tweets appearing higher in hashtag streams, while account metrics like followers showed no effect, and more sophisticated models are needed to fully understand Twitter's hashtag algorithm.
Reverse Engineering Twitter Hashtag Algorithm
2.
Contributions
1. a brand new method for crawling social networks
2. a framework that can be used by social media to evaluate impact
◦ impact = probability that a tweet shows up in hashtag streams
3. example analysis based on the above
The goal is to reverse engineer the hashtag algorithm
M.Zhanikeev -- maratishe@gmail.com -- Reverse Engineering Twitter Hashtag Algorithm -- http://bit.do/marat140614 2/21
4.
Hashtag Streams
Hashtag streams are streams of tweets that show up when people search Twitter
• hashtag is the best way to search
• note: Twitter tries to phase out hashtags (and mentions), so search may find tweets even without hashtags
Hashtags are important because they are used by social media to promote events, products, etc.
5.
Twitter Infographics
• Twitter promotes hashtags by releasing infographics
• the content is very confusing for social media
• hard to translate into numbers, concrete actions, etc.
7.
Twitter Infographics (3) : Cleanup
[Figure: simplified decision flowchart. Nodes: Decide, New Tag?, Will you promote it?, Will you add value?, Add to hashtag stream, Out. Branches are labeled YES/NO.]
• with all the clutter cleaned out, the decision algorithm is much clearer
• it still does not clarify what value or promotion mean in practice
• since Twitter does not help, we need to reverse engineer the algorithm
9.
Crawling : Practice and Problems
• traditional crawling is done on the command line using wget or curl
• problem 1: Twitter and others try to avoid being crawled and have created fences (login, cookies, forwarding, JS post-loading, etc.)
• problem 2: official APIs are very restricted; the Twitter API does not cover search
• problem 3: hard to use other services while crawling, for example Twitter + YouTube
10.
Snowball Sampling
• the new way to look at sampling
• done in cycles:
1. sample something
2. select a wanted subset
3. sample the subset at a higher depth
4. repeat
• snowball sampling is directly applicable to crawling Twitter
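The cycle above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Twaater's actual code: the names `snowball_sample`, `expand` and `select` are hypothetical, and the toy graph stands in for whatever the crawler actually visits (tweets, accounts, hashtags).

```python
from typing import Callable, Hashable, Iterable, List, Set


def snowball_sample(
    seeds: Iterable[Hashable],
    expand: Callable[[Hashable], Iterable[Hashable]],
    select: Callable[[Hashable], bool],
    rounds: int,
) -> Set[Hashable]:
    """Run the sample / select / go-deeper cycle for a fixed number of rounds.

    expand(item) returns the items reachable from `item` (hypothetical hook).
    select(item) decides whether `item` belongs to the wanted subset.
    Returns every item that was sampled, selected or not.
    """
    seen: Set[Hashable] = set(seeds)
    frontier: List[Hashable] = list(seen)
    for _ in range(rounds):
        # 1. sample something: look one level deeper from the current frontier
        sampled: List[Hashable] = []
        for item in frontier:
            for neighbour in expand(item):
                if neighbour not in seen:
                    seen.add(neighbour)
                    sampled.append(neighbour)
        # 2. select a wanted subset of what was just sampled
        frontier = [item for item in sampled if select(item)]
        # 3. the selected subset is sampled at a higher depth in the next round
        if not frontier:
            break  # nothing left to go deeper on
    return seen


# Toy data (made up): each key links to the items found when visiting it.
TOY_GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["f"]}

result = snowball_sample(
    seeds=["a"],
    expand=lambda item: TOY_GRAPH.get(item, []),
    select=lambda item: item != "c",  # pretend "c" is not interesting
    rounds=3,
)
```

In this toy run "c" is sampled but never expanded, so "e" is never reached; that is the point of step 2, which keeps the crawl focused on the wanted subset.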
11.
Crawling : Two Approaches
• approach 1 (traditional) : use APIs (HTTP, OAuth, etc.) to get data
• approach 2 (proposed) : attach your robot to a working Twitter webapp in the browser
◦ interaction is via clicks, just like a human
◦ more natural
13.
Implementation : Twaater
• Chrome extension, auto-triggered by loading a Twitter page
• stores logs in one's own Dropbox drive
14.
Implementation : Twaater
• https://github.com/maratishe/twaater
• personalization
1. change the Dropbox auth tokens to point to one's own drive
2. log in to Twitter under one's own account and let Twaater pick up from there
• runs continuously; close the browser to stop
17.
Twaater : Tweet Timeline
• all metrics change over time
• the timeline of a single tweet is very important
• it aggregates the tweet's status and its position (if any) in hashtag streams
◦ for each hashtag contained in the tweet
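One possible way to hold such a timeline in memory is sketched below. The slides do not describe Twaater's log format, so everything here is an assumption: the class names `Observation` and `TweetTimeline`, the field names, and the choice of `None` to mean "not present in that hashtag stream" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Observation:
    """One snapshot of a tweet (assumed fields, not Twaater's real schema)."""
    time: float                            # when the snapshot was taken
    retweets: int
    favorites: int
    # hashtag -> position in that hashtag's stream, None if the tweet is absent
    positions: Dict[str, Optional[int]] = field(default_factory=dict)


@dataclass
class TweetTimeline:
    """All snapshots of one tweet, kept in time order."""
    tweet_id: str
    observations: List[Observation] = field(default_factory=list)

    def add(self, obs: Observation) -> None:
        # Snapshots may arrive out of order, so re-sort after every insert.
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.time)

    def series(self, metric: str) -> List[int]:
        """Time series of one tweet metric, e.g. 'favorites'."""
        return [getattr(o, metric) for o in self.observations]

    def position_series(self, hashtag: str) -> List[Optional[int]]:
        """Time series of the tweet's position in one hashtag stream."""
        return [o.positions.get(hashtag) for o in self.observations]
```

Keeping one position series per hashtag mirrors the last bullet: a tweet with several hashtags yields several position series that share the same metric series.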
18.
Analysis : Rules and CCF
• lists : time series of metrics versus time series of positions in hashtag streams
◦ ccf( metric values, hashtag positions)
◦ note that there are "all" and "top" hashtag streams
• selection : pick the max in a time series, and filter the lists by a threshold
◦ thresholds are different for each metric
◦ helps to filter out noise or focus only on large (important) values
• view showing up in hashtag streams as binary (yes/no) versus analog (list position) values
• extras (future work) : analysis along the timeline, much higher complexity
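The rules above can be illustrated with a small Python sketch. It rests on several assumptions that the slides do not confirm: the correlation is taken at zero lag only (where the normalized cross-correlation equals the Pearson coefficient), the threshold is a fraction of the metric's maximum, and absent tweets are marked as `None`. The function names `ccf0`, `filter_by_threshold` and `to_binary` are hypothetical, and the sample numbers are made up.

```python
import math
from typing import List, Optional, Sequence, Tuple


def ccf0(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Zero-lag normalized cross-correlation of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def filter_by_threshold(
    metric: Sequence[float], positions: Sequence, threshold: float
) -> Tuple[List[float], List]:
    """Keep only samples whose metric is at least threshold * max(metric)."""
    cutoff = threshold * max(metric)
    kept = [(m, p) for m, p in zip(metric, positions) if m >= cutoff]
    return [m for m, _ in kept], [p for _, p in kept]


def to_binary(positions: Sequence[Optional[int]]) -> List[int]:
    """Binary view: 1 if the tweet is in the stream at all, 0 otherwise."""
    return [0 if p is None else 1 for p in positions]


# Made-up timeline of one tweet: favorites count and stream position
# (1 = top of the stream, None = not in the stream).
favorites = [0, 1, 5, 20, 40]
positions = [None, 30, 20, 5, 1]

# Binary view over the full series.
binary_ccf = ccf0(favorites, to_binary(positions))

# Analog view after dropping samples below 10% of the maximum.
big_favorites, big_positions = filter_by_threshold(favorites, positions, 0.1)
analog_ccf = ccf0(big_favorites, big_positions)
```

With position 1 meaning the top of the stream, a negative `analog_ccf` means that larger metric values go together with positions closer to the top.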
19.
Analysis : Results
[Figure: four panels (all/binary, top/binary, all/actual, top/actual). x-axis: Threshold (% of max), 0 to 0.5. y-axis: ccf, -1.05 to 1.05. Curves: tags, links, mentions, retweets, favorites, tweets, following, followers, tagstatus.]
• binary: useless
• analog: filtering out very low values (most of them) helps reveal good correlation
◦ for example, favorites contribute to tweets showing up closer to the top in lists
• account metrics: show no effect
• among large values, tagstatus (topic popularity) becomes prominent
20.
Future Work
• Twaater is own-centric, which makes it possible to crowdsource/distribute crawling
◦ fits the description of snowball sampling
• 2nd-order statistics (CCF) did not reveal a simple hashtag algorithm
◦ more complicated models have to be tested
• alternatively, smarter filtering can also help
◦ for example, select a subset of important tweets to subject to analysis