CS 541: 2016 Presidential Candidate Tracker
Anwar Jameel (aj528)
Shahab Shekari (ss1817)
28 April 2016
1 Introduction
In this project, we analysed data from Twitter and Facebook about the USA presidential candidates. A large amount of data is parsed from the Facebook pages of 20 different news channels and from Twitter posts related to the current presidential race. We used the Python APIs of Facebook (Graph API) and Twitter (Tweepy) to retrieve the data, and DynamoDB to store various statistics (e.g. polarity scores) about each presidential candidate. We used the Natural Language Toolkit (NLTK) for sentiment analysis of Facebook posts and their comments, as well as of tweets. We set up cron jobs on an EC2 node to parse Facebook posts once a day at midnight, and tweets every morning at 6am (for the previous day's popular tweets) and every 6 hours, at 3, 9, 15, and 21 (for live tweets). Finally, we created a website to compare the polarity scores of all the candidates across different sources over a period of time. The website also shows a word cloud of adjectives and hashtags for each candidate, and engagement statistics such as likes, number of retweets, and number of favorites are shown on each candidate's page. The following sections discuss the implementation details.
2 Implementation
The sources of data in our project are Facebook and Twitter. In this section, we discuss the features implemented to analyse and integrate the data from both sources. We created a website to visualize the patterns in the analysed data stored in DynamoDB. The website offers an option to compare the polarity scores between any subset of the 20 news outlets and Twitter, which provides an unbiased polarity comparison across sources for every candidate. The home page displays the popular hashtags for every candidate, and each candidate's page shows a word cloud of adjectives and hashtags, along with the percentage of likes, retweets, and favorites for that candidate relative to the other candidates. Figure 1 shows the top 10 hashtags used
in conjunction with individual candidates for the past day. The hashtags are rotated (faded in and out) every 5 seconds.
Figure 1: Website Home Page
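The top-10 hashtag selection described above can be sketched with `collections.Counter` from the standard library; the function name and the sample hashtags are illustrative, not taken from our actual code.

```python
from collections import Counter

def top_hashtags(hashtags, n=10):
    """Count hashtag occurrences (case-insensitively) and return the
    n most common hashtags together with their counts."""
    return Counter(tag.lower() for tag in hashtags).most_common(n)

# Illustrative usage with made-up hashtags:
top = top_hashtags(["#MAGA", "#maga", "#FeelTheBern"], n=2)
```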
2.1 Facebook
Data related to the 6 USA presidential candidates is collected from the Facebook pages of 20 different news outlets (such as Fox News, CNN, MSNBC, etc.). We defined target words (such as "trump", "hillary", etc.) that we look for in a post's title while parsing the posts of each news channel; a post is considered relevant if its title contains one of these target words. For every relevant post, the top 20 comments by number of likes are retrieved and stored. Since Facebook does not provide a direct API for getting the top k comments, we maintain a priority queue over the earliest 500 comments of a post to find the top 20. Polarity scores are calculated with the NLTK sentiment analyzer for every relevant post and its top comments, and the aggregate polarity scores for each post and its top comments are also stored for every candidate. Adjectives are extracted (again, using NLTK) from the top comments of every post and aggregated for every candidate to form a word cloud. Both the aggregated polarity scores and the aggregated adjectives are stored in the aggregate_stats table of DynamoDB, while the actual Facebook post, along with other details such as comments, number of likes, candidate ids, and post id, is stored in the fb_posts table of DynamoDB.
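The priority-queue step above can be sketched as a bounded min-heap using Python's `heapq`; the comment structure and the `likes` field name are assumptions for illustration, not the actual Graph API response format.

```python
import heapq

def top_k_comments(comments, k=20, limit=500):
    """Return the k most-liked comments among the first `limit` comments.

    A min-heap of size k is maintained while streaming through the
    comments, so memory stays O(k) rather than O(limit)."""
    heap = []  # min-heap of (likes, index, comment); index breaks ties
    for i, comment in enumerate(comments[:limit]):
        entry = (comment["likes"], i, comment)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New comment beats the least-liked comment currently kept.
            heapq.heapreplace(heap, entry)
    # Return comments ordered from most- to least-liked.
    return [c for _, _, c in sorted(heap, reverse=True)]
```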
2.2 Twitter
Data related to the 6 USA presidential candidates is collected from popular and recent tweets on Twitter. We look for tweets containing the target words and store them with candidate ids assigned based on those target words. The popular tweets of the previous day are parsed and stored every day at 6am, while recent (live) tweets are parsed and stored every 6 hours (at 3, 9, 15, and 21). For every relevant tweet, we also store its timestamp, hashtags, number of favorites, number of retweets, and the tweet text itself in the twt_posts table of DynamoDB. Polarity scores are calculated for every tweet using the NLTK sentiment analyzer. We calculate the aggregate polarity scores for every candidate and store them, along with the aggregated hashtags for every candidate, in the aggregate_stats table of DynamoDB.
Figure 2: Candidate Page
Figure 2 shows a snapshot of a candidate page (Hillary Clinton) with a few sources selected, projecting the polarity graph and word cloud.
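The candidate-id tagging step can be sketched as a simple keyword match. The report restricts target words to candidate last names, but does not list the actual six-name set or the candidate ids, so the mapping below is illustrative.

```python
# Illustrative mapping from target word (candidate last name) to candidate id.
# The actual six-candidate list and ids are not given in the report.
TARGET_WORDS = {
    "trump": 1, "clinton": 2, "sanders": 3,
    "cruz": 4, "kasich": 5, "carson": 6,
}

def candidate_ids(tweet_text):
    """Return sorted ids of all candidates whose target word appears in the tweet."""
    text = tweet_text.lower()
    return sorted({cid for word, cid in TARGET_WORDS.items() if word in text})
```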
3 Data Statistics
Table              Partition Key    Sort Key    Read   Write   Storage    # Items
fb_posts           candidate_id     post_id       10      10   2.71 MB      6,589
twt_posts          candidate_id     tweet_id     200      10     81 MB    270,562
aggregate_stats    stat_source      timestamp      5       5   1.04 MB        222
For each table, we define the primary key as the combination of (partition key, sort key). The partition key determines the partition where an item is stored in DynamoDB; all items with the same partition key are stored together, sorted by their sort key value. The table above shows statistics for the Facebook, Twitter, and aggregated-statistics tables that we maintain in DynamoDB. The read and write columns show the provisioned read and write capacity units, the storage column gives the amount of data stored, and the items column gives the number of Facebook posts/comments, tweets, etc. present in each table over the last 3 weeks.
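The partition/sort-key behaviour described above can be modelled with a few lines of plain Python; this is a toy in-memory illustration of how DynamoDB groups and orders items, not the DynamoDB API itself.

```python
from collections import defaultdict

def put_item(table, item, pkey, skey):
    """Toy model of DynamoDB item placement: items sharing a partition-key
    value land in the same partition, kept sorted by sort-key value."""
    partition = table[item[pkey]]
    partition.append(item)
    partition.sort(key=lambda it: it[skey])

# All posts for candidate 1 end up in one partition, ordered by post_id.
fb_posts = defaultdict(list)
put_item(fb_posts, {"candidate_id": 1, "post_id": "b"}, "candidate_id", "post_id")
put_item(fb_posts, {"candidate_id": 1, "post_id": "a"}, "candidate_id", "post_id")
```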
4 Challenges
We faced several challenges during implementation as well as during the data-retrieval process. AWS limitations, due to a lack of credits, restrict how many items we can scan from DynamoDB. As the amount of data grows over time, more capacity units are needed to scan the data required to project the polarity graphs and word clouds. To cope with this, we implemented a caching solution: we scan the DynamoDB tables just once every 3 hours, store the data in RAM, and then filter and project the data from RAM as needed.
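The caching scheme above amounts to a time-to-live cache around the expensive scan; a minimal sketch follows, with the class and parameter names being illustrative rather than our actual code.

```python
import time

CACHE_TTL = 3 * 60 * 60  # refresh at most once every 3 hours

class ScanCache:
    """Hold the result of a full table scan in RAM and re-run the scan
    only after the TTL has expired."""

    def __init__(self, scan_fn, ttl=CACHE_TTL):
        self.scan_fn = scan_fn           # the expensive scan operation
        self.ttl = ttl
        self.data = None
        self.fetched_at = -float("inf")  # force a scan on first access

    def get(self):
        if time.monotonic() - self.fetched_at > self.ttl:
            self.data = self.scan_fn()
            self.fetched_at = time.monotonic()
        return self.data
```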
Another challenge was that Facebook does not provide a public API to search for or retrieve public posts. We overcame this by scraping the posts from each news channel's Facebook wall and then selecting the relevant posts based on the defined target words. A further design question was how to choose the target words; we restricted ourselves to the last names of the candidates. Since the amount of data was not an issue, we preferred avoiding false positives over maximizing recall, and therefore excluded posts that contained only a candidate's first name.
5 Conclusion and Future Work
We successfully completed all the proposed goals of our project. We are able to collect large amounts of data and to analyse, integrate, and project patterns in that data using current data-analysis techniques. Our projections are unbiased in the sense that we consider almost all the major attributes of a Facebook post or a tweet when calculating polarity scores. Our target words are just the last names of the candidates, and we assume that any post or story related to a candidate on either Facebook or Twitter will contain one of these target words; we believe this assumption holds to a great extent. We also aggregate adjectives and hashtags over time and form a word cloud based on their counts over a period of time. The engagement statistics are a good proxy for a candidate's popularity with the public, and we project them for every candidate. We also observe that a few news channels show high variance in their polarity curves compared with others; this variance can be used as a measure of how biased a particular news channel is.
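As a sketch of that bias measure, the variance of a channel's polarity scores over time can be computed with the standard library; the example score series below are made up purely for illustration.

```python
from statistics import pvariance

# Hypothetical daily polarity scores for two channels (illustrative only).
channel_a = [0.10, 0.12, 0.09, 0.11, 0.10]    # steady coverage
channel_b = [0.60, -0.40, 0.50, -0.30, 0.45]  # strong swings

def polarity_variance(scores):
    """Population variance of a polarity time series; higher values
    suggest more erratic (potentially biased) coverage."""
    return pvariance(scores)
```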
Although our results provide useful analytics about the current presidential candidates, there is still room for improvement. In the future, we would like
to include more sources (such as international media) to gather data from and to project the polarity graphs and word clouds. We would also like to extrapolate the polarity graph for future events using machine-learning techniques.
Acknowledgement
We would like to thank Professor Amélie Marian for providing us with constructive suggestions and guiding us throughout the project.