2. Introduction Extraction Processing Analyzing Reporting
Text mining of Twitter data could provide unprecedented utility for
businesses, political groups and curious Internet users alike
Introduction
Twitter is a “micro-blogging” social networking website that has a large and rapidly growing user base.
The website provides a rich bank of data in the form of “tweets,” which are short status updates and musings from Twitter's users
that must be written in 140 characters or fewer.
Because Twitter is an increasingly popular platform for conveying opinions and thoughts, it seems natural to mine it for potentially interesting
trends regarding prominent topics in the news or popular culture.
Problem Statement
How can one extract the rich text information available on Twitter, and how can it be used to draw meaningful insights?
Approach
To answer this, we would first need to build an accurate sentiment analyzer for tweets, which is what this solution aims to do.
For a user-generated status update (which cannot exceed 140 characters), our classification model would determine whether the
given tweet reflects a positive or a negative opinion on the user's behalf.
For instance, the tweet “I'm in forida with Jesse! i love vacations!” would be positive, whereas the tweet “Setting up an apartment is
lame.” would be negative.
Based on this, we would be able to translate text (tweets) into numbers for the business.
Impact
Businesses will be able to better understand the image of their brand.
Manufacturers can get an idea of which product features are, according to users, not up to the mark and start working on
improvements.
A political lobbyist can gauge the popular opinion of a politician by calculating the sentiment of all tweets containing the politician's
name.
This can also help businesses gauge the performance of their competitors.
3. Introduction Extraction Processing Analyzing Reporting
A combination of the Twitter API, MS Excel and SAS can be used to extract
information from Twitter and create an input dataset for the analysis
Via Twitter search websites:
Fetch Data: Search the keyword on websites like Searchtastic.com, topsi.com etc.
Export Data: Export the data to Excel.
Advantages: Returns the latest 10K tweets for a particular keyword. Metrics like time, date, location etc. can also be retrieved.
Disadvantages: Manual process, difficult to automate.
Via the Twitter API:
Run Macros: VBA Excel macros will access data from the APIs.
Access API: Data will be fetched from the APIs and exported to CSVs.
Export Data: This data will be imported into SAS.
Import Data into SAS: Appended SAS datasets will form a master dataset.
Data extraction from Twitter can be accomplished by following the process below.
Step 1: Schedule the run of the VBA macros (the process can also be initiated through SAS macros) so that the entire process is fully
automated. The process can then run periodically, and data can be retrieved on a regular basis.
Step 2: Run an Excel macro to access Twitter data. This macro creates a URL based on the user's input. The Excel file accesses the Twitter
API through XML and fetches the data into one of its sheets. This data is then exported to a CSV file.
Step 3: Import the exported data into SAS using SAS macros. This data is then appended to a master dataset.
Step 4: The result is a dataset that can be used for further processing and analysis.
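The append in Step 3 can be sketched outside SAS as well. The Python below is only an illustration, not the SAS macro itself: the column names (date, user, text) and the sample rows are hypothetical, and skipping exact duplicate rows is an added assumption, on the reasoning that periodic runs may return overlapping tweets.

```python
import csv
import io


def append_export(master, csv_text):
    """Append the rows of one CSV export to the master list, skipping exact duplicates."""
    seen = {tuple(sorted(row.items())) for row in master}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(sorted(row.items()))
        if key not in seen:  # assumption: repeated runs can return the same tweet
            seen.add(key)
            master.append(row)
    return master


# Hypothetical exports from two scheduled runs (column names are assumptions).
export_1 = "date,user,text\n2011-05-01,alice,I bought an ipad\n"
export_2 = (
    "date,user,text\n"
    "2011-05-01,alice,I bought an ipad\n"
    "2011-05-02,bob,Setting up an apartment is lame.\n"
)

master = []
append_export(master, export_1)
append_export(master, export_2)
```

Here the exports are in-memory strings so the sketch runs on its own; in practice each scheduled run would read the CSV file written by the Excel macro.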
4. Introduction Extraction Processing Analyzing Reporting
Retrieved tweets will go through a series of scrubbing steps; these will
simplify extraction of information from the tweets
Example tweet: I bought an ipad…. It has a good touch screen..Luv it :)
Language filter: Tweets written in languages other than English will be filtered out. This filter will be applied in the Excel macro itself. The API filters language based on the URL.
Removal of special characters: Removal of characters like !:@#$%^&*)(.,;" etc. A list of all special characters will be given in Excel itself, and using the 'Find and Replace' functionality we can replace all of these with blanks.
Ex: I bought an ipad It has a good touch screen luv it
Pronoun resolution: This will replace the pronouns in the tweet with the respective nouns.
Ex: I bought an ipad. ipad has a good touch screen luv ipad
Spell check: A custom dictionary can be created in Excel. We can add the service of an online dictionary provider by changing the research options. This will correct wrongly spelled words in Excel (Twittionary [1]).
Ex: I bought an ipad ipad has a good touch screen love ipad
Synonym thesaurus: This will replace all words that share a meaning with a single word of that meaning. The built-in Excel thesaurus can be used to accomplish this.
Ex: I bought an ipad ipad has a good touch screen love ipad (no change)
Removal of noise words: Words like a, an, is, the etc. are to be removed.
Ex: I bought ipad ipad good touch screen love ipad
Part of speech tagging: Markov Model for POS tagging [2]
[1] http://twittionary.com/
[2] http://en.wikipedia.org/wiki/Part-of-speech_tagging#Use_of_Hidden_Markov_Models
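Two of the scrubbing steps, removal of special characters and removal of noise words, are simple enough to sketch in a few lines. The Python below is a stand-in for the Excel 'Find and Replace' approach, not a reproduction of it. The noise-word list is an assumption (the slide only names "a, an, is, the etc."), and special characters are replaced with a space so that neighbouring words do not merge. Pronoun resolution, spell check, synonym replacement and POS tagging are not covered.

```python
import re

# Assumption: a small illustrative noise-word list, not the full list used in practice.
NOISE_WORDS = {"a", "an", "is", "the", "has"}


def remove_special_characters(tweet):
    """Replace every character that is not a letter, digit or whitespace with a blank."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)
    return " ".join(cleaned.split())  # collapse the runs of blanks left behind


def remove_noise_words(tweet):
    """Drop words found in NOISE_WORDS, comparing case-insensitively."""
    return " ".join(w for w in tweet.split() if w.lower() not in NOISE_WORDS)


example = "I bought an ipad…. It has a good touch screen..Luv it :)"
without_specials = remove_special_characters(example)
without_noise = remove_noise_words(without_specials)
```

Because the pronoun and spell-check steps are skipped here, "It" and "Luv" survive in the output where the slide's examples show "ipad" and "love".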
5. Introduction Extraction Processing Analyzing Reporting
A logistic regression model can be developed on a sample of data; this can
be used to classify the sentiment of each tweet
Manual classification will be done on a sample of tweets; the classes could be, let's say,
positive or negative (opinion lexicons can also be used for classification [1]).
The manually classified tweets will be divided into 2 parts: 80% of the data should be taken as the
model sample and 20% as the validating sample (almost the same number
of positive and negative tweets should be taken in both samples).
A logistic regression will be used to develop a model, taking the classification as the dependent
variable and binary variables for the words as independent variables. The dependent variable
will be 0 for negative feedback and 1 for positive feedback [2].
The accuracy of the model can be tested against the validating sample. The model equation obtained using
logistic regression will be used to classify the tweets in the validating sample.
The results obtained will be compared with the manually assigned classification. If accuracy is
too low, the logistic regression should be developed again on a different set of tweets.
The validated model can then be used to classify any number of tweets into 2 groups:
positive and negative.
[1] http://nmis.isti.cnr.it/sebastiani/Publications/LREC06.pdf
[2] http://nlp.stanford.edu/courses/cs224n/2009/fp/19.pdf
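The modelling steps above can be illustrated with a small, self-contained sketch. It is not the SAS model: the ten labelled tweets are hypothetical, the split simply takes the first 80% of each class, and the fitting routine is plain batch gradient descent with an arbitrarily chosen learning rate and epoch count. The word variables are binary (1 if the word occurs in the tweet), as described above.

```python
import math

# Hypothetical, already-scrubbed tweets with manual labels: 1 = positive, 0 = negative.
LABELLED = [
    ("love ipad", 1), ("good touch screen", 1), ("love vacations", 1),
    ("good speed love graphics", 1), ("love good screen", 1),
    ("lame apartment", 0), ("bad touch screen", 0), ("lame speed", 0),
    ("bad graphics lame", 0), ("bad lame screen", 0),
]


def split_by_class(rows, model_share=0.8):
    """Put the first 80% of each class in the model sample, the rest in the validating sample."""
    model, validating = [], []
    for label in (0, 1):
        group = [row for row in rows if row[1] == label]
        cut = int(len(group) * model_share)
        model += group[:cut]
        validating += group[cut:]
    return model, validating


def to_features(text, vocab):
    """One binary variable per vocabulary word: 1.0 if the word occurs in the tweet."""
    words = set(text.split())
    return [1.0 if word in words else 0.0 for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Predicted probability that a tweet with features x is positive."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit(X, y, lr=0.5, epochs=500):
    """Fit weights and intercept by batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [score(xi, w, b) - yi for xi, yi in zip(X, y)]
        w = [wj - lr * sum(e * xi[j] for e, xi in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errors) / n
    return w, b


model_sample, validating_sample = split_by_class(LABELLED)
vocab = sorted({word for text, _ in model_sample for word in text.split()})
X = [to_features(text, vocab) for text, _ in model_sample]
y = [label for _, label in model_sample]
w, b = fit(X, y)

# Compare model output with the manual labels on the validating sample.
hits = sum((score(to_features(text, vocab), w, b) >= 0.5) == bool(label)
           for text, label in validating_sample)
accuracy = hits / len(validating_sample)
```

With real data the validating sample would be far larger, and a cut-off other than 0.5 might be chosen; both are left as they are here only to keep the sketch short.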
6. Introduction Extraction Processing Analyzing Reporting
Text can be converted to numbers in the form of different metrics/reports to
better understand the sentiments of users
Heat Map showing Sentiments
[Figure: heat map; legend: Positive, Negative, Neutral]
Positive, negative & neutral feedbacks can be represented by different colors.
Metrics: % of positive and negative tweets on the subject by feature and by geography.
Insights: This will highlight the cultural & regional acceptance of the product.
Sentiments variation over time
[Figure: chart of sentiments over time; legend: Positive, Negative, Neutral]
Positive, negative & neutral feedbacks can be analyzed over time.
Metrics: Variation in the number of tweets can be seen over time.
Insights: Trends in the graph will give an idea of the popularity of the subject with time.
Sentiments by Features
[Figure: chart by feature (Touch Screen, Speed, Graphics); legend: Positive, Negative, Neutral]
Metrics: Positive, negative & neutral feedbacks for individual features can be shown.
Insights: This will indicate which features need improvement.
Effectiveness of a Marketing Activity
A business can gauge the effectiveness of a recent marketing campaign by aggregating user opinion on Twitter regarding their product.
Mixed Reactions report
Order of positive and negative feedbacks from members who have given mixed feedback. This order will indicate how the reaction changes over usage time.
Popularity Report
A political lobbyist can gauge the popular opinion of a politician by calculating the sentiment of all tweets containing the politician's name.
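As an illustration of how classified tweets become a report, the sketch below computes the share of positive, negative and neutral tweets per feature, the metric behind the "Sentiments by Features" view. The records are hypothetical, and it assumes each tweet has already been tagged with the feature it mentions, a step this deck does not specify.

```python
from collections import Counter, defaultdict

SENTIMENTS = ("positive", "negative", "neutral")


def sentiment_share_by_feature(classified):
    """Return, per feature, the percentage of tweets in each sentiment class."""
    counts = defaultdict(Counter)
    for feature, sentiment in classified:
        counts[feature][sentiment] += 1
    report = {}
    for feature, by_sentiment in counts.items():
        total = sum(by_sentiment.values())
        report[feature] = {
            s: round(100.0 * by_sentiment[s] / total, 1) for s in SENTIMENTS
        }
    return report


# Hypothetical (feature, sentiment) pairs, one per classified tweet.
classified = [
    ("Touch Screen", "positive"), ("Touch Screen", "positive"),
    ("Touch Screen", "negative"), ("Speed", "negative"),
    ("Speed", "neutral"), ("Graphics", "positive"),
]
report = sentiment_share_by_feature(classified)
```

The same grouping applied to a date or geography field instead of the feature would give the over-time and heat-map views.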