Text mining on Twitter information based on R platform
Qiaoyang ZHANG∗
Computer and Information Science System
Macau University of Science and Technology
3269046927@qq.com

Fayan TAO†
Computer and Information Science System
Macau University of Science and Technology
fytao2015@gmail.com

Junyi LU‡
Computer and Information Science System
Macau University of Science and Technology
448673862@qq.com
ABSTRACT
Twitter is one of the most popular social networks and plays a vital role in this new era. Exploring information diffusion on Twitter is both attractive and useful.
In this report, we apply R to text mining and analysis of the Twitter topic "#prayforparis". We first preprocess the data, including data cleaning and word stemming. Then we show tweet term frequencies and associations. We find that the word "prayforparis" has the highest frequency, and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole tweet set and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections among different topics. Since tweets indicate users' attitudes and emotions well, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. Besides, the majority hold positive attitudes in the face of this attack.
Keywords
text mining; Twitter; R; ”#prayforparis”; sentiment analysis
1. INTRODUCTION AND MOTIVATION
As data mining and big data become hot research topics in this new era, much more is demanded of data-analysis techniques as well. It is difficult to store and analyze large data sets using traditional database methodologies, so we employ the powerful statistics platform R for big data mining and analysis: R provides many kinds of statistical models and data-analysis methods, such as classic statistical tests, time-series analysis, classification and clustering.
∗We rank the authors' names by the inverse alphabetical order of the first letter of the authors' last names. Stu ID: 1509853G-II20-0033
†Stu ID: 1509853F-II20-0019
‡Stu ID: 1509853G-II20-0061
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
In this project, we try to analyze a large social network data set, mainly focused on Twitter users and their expressions about the latest news. The analysis is carried out to discover characteristics of those tweets. By analyzing a large amount of social network data, we can gain better knowledge of users' preferences and habits, which is helpful for anyone interested in such data. For example, business firms and companies can provide better services after analyzing similar social network data. That is why we chose this topic.
2. RELATED WORKS
2.1 Sentiment analysis by searching Twitter and Weibo
User-level sentiment evolution can be analyzed on Weibo. ZHANG Lumin, JIA Yan et al.[16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complicated sentiments.
Michael Mathioudakis and Nick Koudas[11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e., "trends") on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own descriptions for each trend.
Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers. Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions.[11]
Agarwal, Passonneau et al.[1] mainly introduced a model based on tree kernels to analyze the POS-specific prior polarity features of Twitter data, using a Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (see the example in figure 1). They divided the sentiment in tweets into 3 categories: positive, negative and neutral. They marked the sentiment expressed by emoticons using an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) using an acronym dictionary; those dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word dictionary based on WordNet to identify stop words, and a sentiment dictionary with many positive words, negative
1 This part of related works is provided by Qiaoyang Zhang.
Figure 1: A tree kernel for a synthesized tweet: ”@Fernando this isn’t a great day for playing the HARP! :)”
words and neutral words to map words in tweets to their
polarity.
The accuracy of their model is higher than that of the Unigram model by 4.02%, and its standard deviation is lower than that of the Unigram model by 0.52%.
2.2 Study information diffusion on Twitter
A number of recent papers have explored the information
diffusion on Twitter, which is one of the most popular social
networks.
In 2011, Shaomei Wu et al.[14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and they found strong homophily within categories, meaning that each category mainly follows itself. They also re-examined the classical "two-step flow" theory[10] of communications, finding considerable support for it on Twitter. Additionally, the lifespans of various URLs were demonstrated under different categories. Finally, they examined the attention paid by the different user categories to different news topics.
This paper sheds clear light on how media information is transmitted on Twitter. The presented approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, they focus on only one narrow cross-section of media information (URLs); it would be better if their methods were applied to other channels (TV, radio). Another weakness of this paper is that it does not link information flow on Twitter with other sources of outcome data (e.g., users' opinions and actions).
Daniel Ramage et al.[13] studied search behaviors on Twitter, especially the information that users prefer to search for. They also compared Twitter search with web search in terms of users' queries. They found that Twitter results contain more social events and content, while web results include more facts and navigation.
Eytan Bakshy et al.[3] used a regression model to analyze Twitter data. They explored word-of-mouth marketing to study users' influence on Twitter, not only on communication but also on URLs. They found that the largest
2 This part of related works is provided by Fayan Tao.
cascades tend to be generated by users who have been influ-
ential in the past and who have a large number of followers.
They also found that URLs that were rated more interesting
and/or elicited more positive feelings by workers on Mechan-
ical Turk were more likely to spread.
All three papers mentioned above focus on large numbers of tweets and employ different methods to analyze various characteristics of tweets from different aspects. But they are all limited to Twitter data, rather than extending to other social networks.
2.3 Semantic Analysis and Text Mining
Much research has been done to gain a better understanding of people's characteristics in specific fields by analyzing the semantics of social network content. This has many applications, especially for business marketing purposes.
Topic mining and sentiment analysis have been performed on followers' comments on a company's Facebook fan page; the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments) and the sentiment distributions throughout one year, along with their relation to "Likes", respectively [5]. This can help marketing staff stay aware of the sentiment trend as well as the main sentiment, so as to adjust their marketing techniques. A Support Vector Machine (SVM) classification model is used in their analysis. Before classification, word segmentation and feature extraction are performed; feature extraction is based on a semantic dictionary and some additional rules. They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".
Hsin-Ying Wu et al. [15] presented a method of analyzing Facebook posts that serves as a marketing tool to help young entrepreneurs identify existing competitors in the market, as well as their success factors and features, during the decision-making process. The overall mining process consists of three stages:
1 Extracting Facebook posts;
2 Text data preprocessing;
3 Key phrase and term filtering and extraction.
In detail, they performed word segmentation on the original comments based on lexicons and morphological rules for quantifier words and reduplicated words. The words and phrases
3 This part of related works is provided by Junyi Lu.
are extracted from text files and transformed into a key-phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase-frequency matrix and phrase similarity is used to identify the most important phrases (i.e., the features and factors of each shop). Various tools are utilized in their study: CKIP for Chinese word segmentation, PERL for extracting text files and WEKA for key-phrase clustering.
Social network mining has also been done in the educational field. Chen et al.[4] conducted initial research on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social monitoring tool, to acquire students' posts under the hashtag #engineeringProblems, collecting 19,799 unique tweets. Due to the ambiguity and complexity of natural language, they conducted inductive content analysis and categorized the tweets into 5 prominent themes and one group called "others". The main hashtag, non-letter symbols, repeated letters and stopwords are removed in the preprocessing stage. A multi-label naive Bayesian classifier is used because one tweet can reflect several problems. They then obtained another data set using the geocode of Purdue University with a radius of 1.3 miles to demonstrate the effectiveness of the classifier and to try to detect students' problems. They also demonstrated that the multi-label naive Bayesian classifier performs better than other state-of-the-art classifiers (SVM and M3L) according to 4 metrics (accuracy, precision, recall, F1). But there is a main defect in their method, since they assume the categories are independent when they transform the problem into single-label classification problems.
Most text mining processes are much the same. Generally, text preprocessing is conducted at the beginning (removal of stopwords, punctuation and strange symbols or characters, plus segmentation); some studies, such as sentiment analysis, also need part-of-speech tagging. Then a term-frequency matrix is built from the data set to calculate term frequencies. Finally, classification and clustering are most often used to analyze the data and generate knowledge.
3. TEXT MINING UNDER R PLATFORM
3.1 About R
R[18] is a language and environment for statistical com-
puting and graphics. It is a GNU project which is similar to
the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a
different implementation of S.
R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering) and graphical techniques, and is
highly extensible. R also provides an Open Source route to
participation in statistical research. R is available as Free
Software under the terms of the Free Software Foundation’s
GNU General Public License in source code form. It com-
piles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows
and MacOS.
3.2 The idea
Text mining[2][17] is the discovery of interesting knowledge in text documents. It is a challenging issue to extract accurate knowledge from unstructured text documents to help users find what they want. It can be defined as the art of extracting data from large amounts of text. It allows one to structure and categorize text contents that are initially unorganized and heterogeneous. Text mining is an important data mining technique and includes some of the most successful techniques for extracting effective patterns.
This report presents examples of text mining with R. Twitter text ("#prayforparis") is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a term-document matrix. After that, frequent words and associations are found in the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is performed, and a word cloud is used to present important words in the documents.
In this report, "tweet" and "document" will be used interchangeably, as will "word" and "term". Three important packages are used in the examples: twitteR, tm and wordcloud. The twitteR package[8] provides access to Twitter data, tm[6] provides functions for text mining, and wordcloud[7] visualizes the results with a word cloud.
4. IMPLEMENTATIONS
4.1 Data Preprocessing
We first mine 3200 tweets from Twitter by searching for the main topic "prayforparis" during the period from 13 Nov 2015 to 13 Dec 2015. Then we do some data preprocessing.
4.1.1 Data Cleaning
The tweets are first converted to a data frame and then
to a corpus, which is a collection of text documents. After
that, the corpus needs a couple of transformations, including
changing letters to lower case, adding ”pray” and ”for” as ex-
tra stop words and removing URLs, punctuations, numbers
extra whitespace and stop words.
Next, we keep a copy of corpus to use later as a dictionary
for stem completion
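The cleaning steps above can be sketched in base R on a single string (a minimal illustration of the same transformations; the actual pipeline applies them to the whole corpus with tm_map(), as shown in the appendix):

```r
# Minimal sketch of the cleaning steps: lowercasing, URL removal,
# removal of anything other than letters/spaces, stop-word removal,
# and whitespace normalization.
clean_tweet <- function(x, stopwords = c("pray", "for")) {
  x <- tolower(x)                              # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)        # remove URLs
  x <- gsub("[^[:alpha:][:space:]]*", "", x)   # keep only letters and spaces
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[words != "" & !(words %in% stopwords)]
  paste(words, collapse = " ")                 # strip extra whitespace
}

clean_tweet("Pray for Paris! http://t.co/abc #PrayForParis")
# -> "paris prayforparis"
```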
4.1.2 Stemming Words
Stemming[19] is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Word stemming makes variant word forms look uniform. This can be achieved with the function "stemDocument()" in R.
In the following steps, we use "stemCompletion()" to complete the stems, with the unstemmed corpus "myCorpusCopy" as a dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
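The completion rule can be sketched in base R as follows; this is a hypothetical illustration of the "most frequent match" behavior, not the tm implementation (the function name complete_stem and the toy dictionary are ours):

```r
# Sketch of stem completion: each stem is completed to the most
# frequent dictionary word that begins with that stem.
complete_stem <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(stem)               # no match: keep the stem
  names(sort(table(hits), decreasing = TRUE))[1]    # most frequent match
}

dict <- c("prayers", "prayers", "prayed", "paris", "paris", "paris")
complete_stem("pray", dict)   # -> "prayers" (appears twice, "prayed" once)
complete_stem("pari", dict)   # -> "paris"
```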
4.1.3 Building a Term-Document Matrix
A term-document matrix indicates the relationship be-
tween terms and documents, where each row stands for a
term and each column for a document, and an entry is
the number of occurrences of the term in the document.
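As a toy illustration of this structure, a small term-document matrix can be built in base R from three made-up tweets (the real matrix is built from the corpus with "TermDocumentMatrix()"):

```r
# Toy term-document matrix: rows are terms, columns are documents,
# and each entry counts occurrences of the term in the document.
docs   <- c("prayforparis paris", "paris parisattack", "prayforparis")
tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))
tdm    <- sapply(tokens, function(d) table(factor(d, levels = terms)))
dimnames(tdm) <- list(Terms = terms, Docs = seq_along(docs))
tdm["paris", 2]         # "paris" occurs once in document 2
tdm["parisattack", 1]   # and "parisattack" not at all in document 1
```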
4 All of our implementation codes are attached at the end of this report.
TermDocumentMatrix (terms: 3621, documents: 3200)
Non-/sparse entries: 27543/11559657
Sparsity: 100%
Maximal term length: 38
Weighting: term frequency (tf)
Table 1: TermDocumentMatrix
Figure 2: layout of whole tweets
Alternatively, one can build a document-term matrix by swapping the rows and columns. In this report, we build a term-document matrix from the processed corpus above with the function "TermDocumentMatrix()".
As table 1 shows, there are in total 3621 terms and 3200 documents in the term-document matrix. We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms are not contained in any given document.
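The sparsity figure in table 1 can be checked by direct arithmetic on the counts it reports:

```r
# 3621 terms x 3200 documents gives 11,587,200 cells in total,
# of which only 27,543 are non-zero.
terms     <- 3621
docs      <- 3200
nonsparse <- 27543
total  <- terms * docs          # 11,587,200 cells
sparse <- total - nonsparse     # 11,559,657 zero entries, matching Table 1
round(100 * sparse / total, 2)  # about 99.76%, i.e. "nearly 100%" sparsity
```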
We can also see the layout of the whole tweet set in figure 2; the tweets are mainly located in two parts. Because of the large amount of data, we cannot distinguish the individual words clearly. Therefore, we select some terms from the total data and show their distributions, as figures 3 and 4 show. We can see that most terms are connected within a bounded zone, which means that they are more or less associated.
5. FREQUENT TERMS AND ASSOCIATIONS
Based on the above data processing, we now show the frequent words. Note that there are 3200 tweets in total.
We first choose the words that appear more than 100 times; the results are shown in table 2. We can see, for example, that the counts of "parisattack", "pour" and "victim" are all more than 100, which means they have high frequency under the topic "#prayforparis".
In a further step, we show the counts of all words that appear at least 100 times; the result is shown in figure 5. Since figure 5 contains so many terms that we cannot read the count of each one, we select only 70 terms and show the counts of the words that appear at least 100
Figure 3: layout-1 of some parts selected from whole
tweets
Figure 4: layout-2 of some parts selected from whole
tweets
Figure 5: Total words that appear at least 100 times
Figure 6: Selecting some Words that appear at least
100 times
lose over papajackadvic struggl trust
0.56 0.56 0.56 0.56 0.56
worri prayfor think hope simoncowel
0.56 0.40 0.40 0.32 0.29
scare stay
0.28 0.25
Table 3: words associated with "pray" with correlation no less than 0.25
[1]  "à"              "attentat"     "aux"          "ça"
[5]  "de"             "déjà"         "et"           "everyon"
[9]  "fait"           "franc"        "go"           "il"
[13] "jamai"          "jour"         "la"           "les"
[17] "louistomlinson" "moi"          "ne"           "noubliera"
[21] "novembr"        "pari"         "parisattack"  "pas"
[25] "pensé"          "pour"         "prayforpari"  "que"
[29] "rt"             "simoncr"      "thought"      "un"
[33] "victim"         "vous"         "y"            "ytbclara"
Table 2: words that appear more than 100 times
times. As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000. The second is "pari", with "parisattack" following. This result indicates that most people care about the Paris attack and pray for Paris.
To find associations among words, we take "pray" as an example, to see which words are associated with "pray" with correlation no less than 0.25.
From table 3, we can see that there are 12 terms, including "lose", "struggl", "trust" and "hope", connected with "pray". Six terms such as "lose", "papajackadvic" and "trust" are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.
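Under the hood, "findAssocs()" reports the correlation between term-occurrence vectors across documents. A toy illustration with made-up binary vectors (the values below are ours, not from the report's data set):

```r
# Does each of 6 hypothetical documents contain the given term?
pray  <- c(1, 1, 0, 1, 0, 0)
hope  <- c(1, 1, 0, 0, 0, 0)
paris <- c(0, 0, 1, 0, 1, 1)

cor(pray, hope)    # positive: the two terms tend to co-occur
cor(pray, paris)   # negative: they tend to appear in different tweets
```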
6. CLUSTERING WORDS
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the plot of the clustering is not crowded with words. We cut the related data into 10 clusters. The agglomeration method is set to Ward's, which minimizes the increase in variance when two clusters are merged.
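The clustering step can be sketched on a toy term-frequency matrix (hypothetical counts; the report's actual run uses the real matrix and k = 10, as in the appendix):

```r
# Hierarchical clustering with Ward's criterion on a toy
# term-frequency matrix (rows = terms, columns = documents),
# then cutting the tree into k groups.
m <- rbind(paris       = c(5, 4, 0, 0),
           parisattack = c(4, 5, 0, 0),
           hope        = c(0, 0, 3, 4),
           pray        = c(0, 0, 4, 3))
fit    <- hclust(dist(scale(m)), method = "ward.D2")
groups <- cutree(fit, k = 2)
groups
# "paris"/"parisattack" fall in one group, "hope"/"pray" in the other
```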
In figure 7, we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some others are clustered into one group, because there are a couple of tweets on the Paris attack. Another group contains "everyone" and "thought", because everyone is focused on this event. We can also see that "moi", "déjà" and "prayforpari" each sit in a single group, which means they have few relationships with the other terms.
Figure 7: cluster (10 groups)
7. EXPERIMENTS ABOUT SENTIMENTS
Figure 8: Emotion categories of #prayforparis
Figure 9: Classification by polarity of #prayforparis
Figure 10: A wordcloud of #prayforparis
Stage 1: Literature survey; determine project topic
Stage 2: R programming and text mining learning
Stage 3: Implementations
Stage 4: Presentation and final report

Qiaoyang Zhang: mainly read references [1], [9], [11], [12] and [16]; sentiment analysis implementation.
Fayan Tao: mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis.
Junyi Lu: mainly read [4], [5] and [15]; data-association analysis and word clustering.
Remark: all of us read [6], [7], [8], [17], [18] and [19].
Table 4: Timetable and working plan
We also conducted an experiment on sentiment in R with the method mentioned in the related works. We loaded a package named "sentiment" in R and analyzed the sentiment of tweets under the hashtag "#prayforparis" on Twitter. We used the "sentiment" package to mine more than 6800 tweets and built a corpus[12] in R, mainly to analyze the related parts of speech, frequencies and correlations. Figure 8 shows the emotion categories of "#prayforparis" obtained with an emotion dictionary. In this figure, we can see that nearly 1000 people felt sad and angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), and that a small number of people felt afraid or surprised.
In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets. In addition, fewer than 500 people used words with no polarity under the hashtag "#prayforparis".
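As a rough sketch of how such dictionary-based polarity classification works (the word lists below are illustrative, not the "sentiment" package's real lexicons, and the function name polarity is ours):

```r
# Tiny illustrative dictionaries; a real lexicon is far larger.
positive <- c("hope", "love", "peace", "support")
negative <- c("sad", "angry", "attack", "terror")

# Score a tweet by counting dictionary hits and label it by the
# sign of (positive matches - negative matches).
polarity <- function(tweet) {
  words <- strsplit(tolower(tweet), "[^[:alpha:]]+")[[1]]
  score <- sum(words %in% positive) - sum(words %in% negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

polarity("Hope and love for Paris")             # -> "positive"
polarity("So sad and angry about the attack")   # -> "negative"
```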
From the word cloud[17] in figure 10, we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the polarized words were concentrated in the categories of sadness, anger and disgust.
From these experimental data, we can draw the conclusion that the general attitude of people around the world toward the terrorist attack is sadness and anger. Most people feel sorry for the victims and pray for the victims in Paris. They are also strongly against terrorism.
8. WORKING PLAN
To finish this project, we made a timetable and working
plan as table 4 shows.
9. CONCLUSION AND FUTURE WORKS
In this report, we apply R to text mining and analysis of "#prayforparis" on Twitter. We first preprocess the data, including data cleaning and word stemming. Then we show tweet term frequencies and associations; we find that "prayforparis" has the highest frequency, and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show the layout of the whole tweet set and of some extracted tweets. Additionally, we cluster the tweet topics into 10 groups to see the connections among terms. Since tweets indicate users' attitudes and emotions well, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority hold positive attitudes in the face of this attack, mainly because of hope for a good future for Paris and the whole world as well.
The data we mined is limited to one topic and is not very large, which may result in data incompleteness. Additionally, some problems remain in the data preprocessing; for example, the term-document matrix is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluations. In future work, we plan to develop a better model or algorithm that can be used to mine and analyze different kinds of social network data with R. We will also focus on improving the data preprocessing, so as to make the results more precise.
10. ACKNOWLEDGMENT
We wish to thank Dr. Hong-Ning DAI for his patient
guidance and vital suggestions on this report.
11. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. Proceedings of the Workshop on Languages in Social Media, 39(4):620-622, 2011.
[2] V.Aswini, S.K.Lavanya, Pattern Discovery for Text
Mining Computation of Power, Energy, Information
and Communication (ICCPEIC), 2014 International
Conference on IEEE. PP. 412-416. 2014.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, pp. 65-74. 2011. DOI=http://dx.doi.org/10.1145/1935826.1935845
[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining social media data for understanding students' learning experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246-259, 2014.
[5] Kuan-Cheng Lin et al., Mining the user clusters on
Facebook fan pages based on topic and sentiment
analysis. Information Reuse and Integration (IRI),
2014 IEEE 15th International Conference on , vol.,
no., pp.627-632, 13-15 Aug. 2014
[6] I.Feinerer, tm: Text Mining Package. R package
version 0.5-7.1. 2012.
[7] I.Fellows, wordcloud: Word Clouds. R package version
2.0. 2012.
[8] J. Gentry, twitteR: R based Twitter client. R package
version 0.99.19. 2012.
[9] I.Guellil and K.Boukhalfa. Social big data mining: A
survey focused on opinion mining and sentiments
analysis. In Programming and Systems (ISPS), 2015
12th International Symposium on, pp. 1–10, April
2015.
[10] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61-78, 1957.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor :
Trend Detection over the Twitter Stream. Proceeding:
SIGMOD ’10 Proceedings of the 2010 ACM SIGMOD
International Conference on Management of data.
ACM New York, NY. pp. 1155–1157. 2010.
[12] A. Pak and P. Paroubek. Twitter as a corpus for
sentiment analysis and opinion mining. In Seventh
Conference on International Language Resources
Evaluation, 2010.
[13] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, pp. 35-44. 2011. DOI=http://dx.doi.org/10.1145/1935826.1935842
[14] S. M. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th international conference on World Wide Web (WWW '11). ACM, New York, NY, USA, pp. 705-714. 2011. DOI=http://dx.doi.org/10.1145/1963405.1963504
[15] Hsin-Ying Wu; Kuan-Liang Liu; C. Trappey,
Understanding customers using Facebook Pages: Data
mining users feedback using text analysis. Computer
Supported Cooperative Work in Design (CSCWD),
Proceedings of the 2014 IEEE 18th International
Conference on , vol., no., pp.346-350, 21-23 May 2014
[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou and Y. Han. User-level sentiment evolution analysis in microblog. Communications, China, vol. 11, no. 12, pp. 152-163. 2011.
[17] Y.C. Zhao, R and Data Mining: Examples and Case
Studies. Published by Elsevier. 2012.
[18] More details about R:
https://www.r-project.org/about.html
[19] More information about stemming:
https://en.wikipedia.org/wiki/Stemming
APPENDIX
A. CODES FOR TEXTMINING
1 l i b r a r y (ROAuth)
2 l i b r a r y ( bitops )
3 l i b r a r y ( RCurl )
4 l i b r a r y ( twitteR )
5 l i b r a r y (NLP)
6 l i b r a r y (tm)
7 l i b r a r y ( RColorBrewer )
8 l i b r a r y ( wordcloud )
9 l i b r a r y (XML)
10 #Set t w i t t e r auth url
11 reqTokenURL <− ”https :// api . t w i t t e r . com/oauth/ request token ”
12 accessTokenURL <− ”https :// api . t w i t t e r . com/oauth/ access token ”
13 authURL <− ”https :// api . t w i t t e r . com/oauth/ authorize ”
14 #Set t w i t t e r key
15 consumerkey <− ”PXoumpl5ndvroikd1DPeGkcqE ”
16 consumerSecret <− ”raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv”
17 accessToken <− ”3954258018−HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq”
18 acce ss Secr e t <− ”K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B”
19 setup twitter oauth ( consumerkey , consumerSecret , accessToken ,
20 +acce ss Secr e t )
21 l i b r a r y ( twitteR )
22 tweets <− searchTwitter ( ”PrayforParis ” , s i nc e = ”2015−11−13” ,
23 + u n t i l = ”2015−12−14” , n = 3200)
24 ( nDocs <− length ( tweets ))
25 #[ 1 ] 3200
26 # convert tweets to a data frame
27 tweets . df <− twListToDF ( tweets )
28 dim( tweets . df )
29 # 3200 16
30 #Text cleaning
31 l i b r a r y (tm)
32 # build a corpus , and s p e c i f y the source to be character vectors
33 myCorpus <− Corpus ( VectorSource ( tweets . df$text ))
34 # convert to lower case
35 # tm v0 .6
36 myCorpus <− tm map(myCorpus , content transformer ( tolower ))
37 # tm v0.5−10
38 # myCorpus <− tm map(myCorpus , tolower )
39 # remove URLs
40 removeURL <− function (x) gsub ( ”http [ ˆ [ : space : ] ] ∗ ” , ”” , x)
41 # tm v0 .6
42 myCorpus <− tm map(myCorpus , content transformer (removeURL ))
43 # tm v0.5−10
44 # myCorpus <− tm map(myCorpus , removeURL)
45 # remove anything other than English l e t t e r s or space
46 removeNumPunct <− function (x) gsub ( ” [ ˆ [ : alpha : ] [ : space : ] ] ∗ ” , ”” , x)
47 myCorpus <− tm map(myCorpus , content transformer (removeNumPunct ))
48 # remove punctuation
49 # myCorpus <− tm map(myCorpus , removePunctuation )
50 # remove numbers
51 # myCorpus <− tm map(myCorpus , removeNumbers )
52 # add two extra stop words : ”pray ” and ”f o r ”
53 myStopwords <− c ( stopwords ( ’ e n g l i s h ’ ) , ”pray ” , ”f o r ”)
54 # remove ”ISIS ” and ”Paris ” from stopwords
55 myStopwords <− s e t d i f f ( myStopwords , c ( ”ISIS ” , ”Paris ”))
56 # remove stopwords from corpus
57 myCorpus <− tm map(myCorpus , removeWords , myStopwords )
58 # remove extra whitespace
59 myCorpus <− tm map(myCorpus , stripWhitespace )
60 # keep a copy of corpus to use l a t e r as a dictionary
61 #f o r stem completion
62 myCorpusCopy <− myCorpus
63 # stem words
64 myCorpus <− tm map(myCorpus , stemDocument )
65 # inspect the f i r s t 5 documents ( tweets )
66 # inspect (myCorpus [ 1 : 5 ] )
67 # The code below i s used f o r to make text f i t f o r paper width
68 f o r ( i in c ( 1 : 2 , 320)) {
69 cat ( paste0 ( ” [ ” , i , ” ] ”))
1
writeLines(strwrap(as.character(myCorpus[[i]]), 60))}
# [1] RT BahutConfess PrayForPari
# [2] FCBayern dontbombsyria isi PrayForUmmah israil spdbpt bbc
#     PrayforPari Merkel franc BVBPAOK saudi
# [320] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForParid
# tm v0.5-10
# myCorpus <- tm_map(myCorpus, stemCompletion)
# tm v0.6
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  # Unexpectedly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this issue.
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
# count frequency of "ISIS"
ISISCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<ISIS") })
sum(unlist(ISISCases))
## [1] 8
# count frequency of "pray"
prayCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<pray") })
sum(unlist(prayCases))
## [1] 1136
# replace "Islam" with "ISIS"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "Islam", replacement = "ISIS")
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
# <<TermDocumentMatrix (terms: 3621, documents: 3200)>>
# Non-/sparse entries: 27543/11559657
# Sparsity           : 100%
# Maximal term length: 38
# Weighting          : term frequency (tf)

# Frequent words and associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
# <<TermDocumentMatrix (terms: 6, documents: 7)>>
# Non-/sparse entries: 2/40
# Sparsity           : 95%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                 Docs
# Terms            10 11 12 13 14 15 16
#   pray            0  1  0  0  0  0  0
#   prayed          0  0  0  0  0  0  0
#   prayer          0  0  0  0  1  0  0
#   prayersburundi  0  0  0  0  0  0  0
#   prayersforfr    0  0  0  0  0  0  0
#   prayersforpari  0  0  0  0  0  0  0

# inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)
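The frequent-term step above reduces to simple arithmetic: sum each row of the term-document matrix, then keep the terms whose total meets the threshold. A minimal sketch of that logic in plain Python (toy counts, made-up terms):

```python
# Toy term-document matrix: rows = terms, columns = per-document counts.
tdm = {
    "prayforparis": [3, 1, 2, 4],
    "paris":        [1, 0, 2, 1],
    "peace":        [0, 1, 0, 0],
}

# Equivalent of rowSums(as.matrix(tdm)): total frequency per term.
term_freq = {term: sum(counts) for term, counts in tdm.items()}

# Equivalent of subset(term.freq, term.freq >= threshold).
threshold = 4
frequent = {t: f for t, f in term_freq.items() if f >= threshold}
print(frequent)  # {'prayforparis': 10, 'paris': 4}
```

The report uses a threshold of 100; the toy threshold here is only chosen to make the filtering visible.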
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# select some terms
ggplot(df[30:60, ], aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# which words are associated with "pray"?
findAssocs(tdm, "pray", 0.25)
# clustering words
# remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# other methods: complete, average, centroid
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
# > (groups <- cutree(fit, k = 10))
#      l’    attentat           à        déjà          et     everyon
#       1           2           2           3           1           4
#    fait          il       jamai         les         moi   noubliera
#       2           5           2           2           6           2
#    pari parisattack        pour prayforpari          rt     simoncr
#       7           2           1           8           9           2
# thought          un      victim           y    ytbclara
#       4          10           1           5           1

# change tdm to a Boolean matrix
termDocMatrix <- as.matrix(tdm)
# termDocMatrix <- as.matrix(tdm[40:240, 40:240])
# remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = TRUE, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.color <- NA
barplot(table(V(g)$degree))
tdm <- tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# load twitter text
# library(twitteR); load(file = "data/rdmTweets.RData")
# convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix <- as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
dim(termMatrix)
# [1] 3642 3200
# inspect terms numbered 5 to 10
termMatrix[5:10, 5:10]
# Terms             abrahammateomus abzzni accept account acontecem across
#   abrahammateomus               1      0      0       0         0      0
#   abzzni                        0      1      0       0         0      0
#   accept                        0      0      2       0         0      0
#   account                       0      0      0       1         0      0
#   acontecem                     0      0      0       0         2      0
#   across                        0      0      0       0         0      2
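The term-term adjacency step is plain matrix arithmetic: once the term-document matrix is binarized, multiplying it by its transpose gives, for each pair of terms, the number of documents containing both (the diagonal holds each term's document frequency). A small plain-Python sketch with toy data:

```python
# Binarized term-document matrix: rows = terms, columns = documents.
terms = ["pray", "paris", "attack"]
tdm = [
    [1, 1, 0, 1],  # "pray" appears in docs 1, 2, 4
    [1, 0, 1, 1],  # "paris" appears in docs 1, 3, 4
    [0, 0, 1, 1],  # "attack" appears in docs 3, 4
]

n_terms, n_docs = len(tdm), len(tdm[0])
# Equivalent of termMatrix <- termDocMatrix %*% t(termDocMatrix).
term_matrix = [[sum(tdm[i][d] * tdm[j][d] for d in range(n_docs))
                for j in range(n_terms)]
               for i in range(n_terms)]

print(term_matrix)  # [[3, 2, 1], [2, 3, 2], [1, 2, 2]]
```

Here "pray" and "paris" co-occur in two documents, so in a co-occurrence graph that edge would carry weight 2.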
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000)  # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)

termMatrix <- termMatrix[1500:2000, 1500:2000]
# create a graph
# g <- graph.incidence(termDocMatrix, mode = c("all"))
g <- graph.incidence(termMatrix, mode = c("all"))
# get index for term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# set colors and sizes for vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
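The `egam` expression used above maps raw edge weights to (0, 1] transparency values: weights are log-compressed, shifted by 0.2 so that weight-1 edges (log 1 = 0) stay faintly visible, and normalized by the maximum. A quick plain-Python sketch of that scaling with toy weights:

```python
import math

weights = [1, 2, 5, 20]  # toy co-occurrence edge weights

# Equivalent of egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
shifted = [math.log(w) + 0.2 for w in weights]
egam = [s / max(shifted) for s in shifted]

# The heaviest edge gets alpha 1.0; light edges fade but never vanish.
print([round(e, 3) for e in egam])  # [0.063, 0.279, 0.566, 1.0]
```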
############### sentiment analysis ###############
# harvest some tweets
some_tweets <- searchTwitter("#prayforparis", n = 10000, lang = "en")
# get the text
some_txt <- sapply(some_tweets, function(x) x$getText())
# remove retweet entities
some_txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove @people
some_txt <- gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt <- gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt <- gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt <- gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt <- gsub("[ \t]{2,}", "", some_txt)
some_txt <- gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.error <- function(x) {
  # create missing value
  y <- NA
  # tryCatch error
  try_error <- tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y <- tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt <- sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt <- some_txt[!is.na(some_txt)]
names(some_txt) <- NULL
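The cleaning passes above are a chain of regular-expression substitutions. A rough plain-Python equivalent of the same pipeline (the patterns approximate the R ones; link removal runs before punctuation removal so URLs are matched intact):

```python
import re

def clean_tweet(text):
    text = re.sub(r"(RT|via)((?:\b\W*@\w+)+)", "", text)  # retweet entities
    text = re.sub(r"@\w+", "", text)                      # @mentions
    text = re.sub(r"http\S+", "", text)                   # links
    text = re.sub(r"[^\w\s]", "", text)                   # punctuation
    text = re.sub(r"\d+", "", text)                       # numbers
    text = re.sub(r"\s{2,}", " ", text)                   # extra whitespace
    return text.strip().lower()

print(clean_tweet("RT @user: #PrayForParis 2015 http://t.co/x We stand together"))
# prints: prayforparis we stand together
```

Note that the `#` of a hashtag is stripped as punctuation, which is why hashtags survive in the corpus as bare words such as "prayforparis".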
# classify emotion
class_emo <- classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion <- class_emo[, 7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] <- "unknown"
# classify polarity
class_pol <- classify_polarity(some_txt, algorithm = "bayes")
# get polarity best fit
polarity <- class_pol[, 4]
# data frame with results
sent_df <- data.frame(text = some_txt, emotion = emotion,
                      polarity = polarity, stringsAsFactors = FALSE)
# sort data frame
sent_df <- within(sent_df,
                  emotion <- factor(emotion,
                                    levels = names(sort(table(emotion), decreasing = TRUE))))
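The `within(...)` call above only reorders the factor levels of `emotion` by descending frequency, so the bar chart that follows is drawn from the most to the least common emotion. The same reordering in plain Python:

```python
from collections import Counter

emotions = ["sadness", "joy", "sadness", "anger", "sadness", "anger"]

# Equivalent of names(sort(table(emotion), decreasing = TRUE)) in R.
levels = [emo for emo, _ in Counter(emotions).most_common()]
print(levels)  # ['sadness', 'anger', 'joy']
```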
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets") +
  labs(title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets") +
  labs(title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))
# separate text by emotion
emos <- levels(factor(sent_df$emotion))
nemo <- length(emos)
emo.docs <- rep("", nemo)
for (i in 1:nemo) {
  tmp <- some_txt[emotion == emos[i]]
  emo.docs[i] <- paste(tmp, collapse = " ")
}
# remove stopwords
emo.docs <- removeWords(emo.docs, stopwords("english"))
# create corpus
corpus <- Corpus(VectorSource(emo.docs))
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
colnames(tdm) <- emos
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)

Contenu connexe

Tendances

IRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET Journal
 
P036401020107
P036401020107P036401020107
P036401020107theijes
 
Paper id 24201441
Paper id 24201441Paper id 24201441
Paper id 24201441IJRAT
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerankJames Arnold
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisIRJET Journal
 
SEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSSEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSijnlc
 
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESM_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESMiklas Njor
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisTELKOMNIKA JOURNAL
 
Big data analysis of news and social media content
Big data analysis of news and social media contentBig data analysis of news and social media content
Big data analysis of news and social media contentFiras Husseini
 
Prediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksPrediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksMohamed El-Geish
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionFinalyear Projects
 
How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?George Sam
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questionsmoresmile
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEEFINALYEARSTUDENTPROJECTS
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionIJERA Editor
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An IntroductionAli Abbasi
 
Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks   Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks Bohdan Pavlyshenko
 
Hao lyu slides_sarcasm
Hao lyu slides_sarcasmHao lyu slides_sarcasm
Hao lyu slides_sarcasmHao Lyu
 
Social Network Analysis - full show
Social Network Analysis - full showSocial Network Analysis - full show
Social Network Analysis - full showScott Gomer
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Subhajit Sahu
 

Tendances (20)

IRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment Analysis
 
P036401020107
P036401020107P036401020107
P036401020107
 
Paper id 24201441
Paper id 24201441Paper id 24201441
Paper id 24201441
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerank
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and Analysis
 
SEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSSEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGS
 
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESM_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic Analysis
 
Big data analysis of news and social media content
Big data analysis of news and social media contentBig data analysis of news and social media content
Big data analysis of news and social media content
 
Prediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksPrediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social Networks
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detection
 
How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An Introduction
 
Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks   Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks
 
Hao lyu slides_sarcasm
Hao lyu slides_sarcasmHao lyu slides_sarcasm
Hao lyu slides_sarcasm
 
Social Network Analysis - full show
Social Network Analysis - full showSocial Network Analysis - full show
Social Network Analysis - full show
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)
 

En vedette

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Deolu Adeleye
 
Twitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RTwitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RNikhil Gadkar
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsSelectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsWagner Andreas
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Frank Oellien
 
MOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPMOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPAnkita Jadhao
 
Khoury ashg2014
Khoury ashg2014Khoury ashg2014
Khoury ashg2014muink
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Toolsaiaioo
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...NextMove Software
 
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0oriza steva andra
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining Bhawi247
 
Network biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningNetwork biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningLars Juhl Jensen
 
Text Analytics Past, Present & Future
Text Analytics Past, Present & FutureText Analytics Past, Present & Future
Text Analytics Past, Present & FutureSeth Grimes
 
Pingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsPingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsChris Riley ☁
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersSeth Grimes
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and VisualizationSeth Grimes
 
Large-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLarge-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLars Juhl Jensen
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewSeth Grimes
 

En vedette (20)

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
 
Twitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RTwitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using R
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsSelectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
 
MOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPMOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLP
 
Khoury ashg2014
Khoury ashg2014Khoury ashg2014
Khoury ashg2014
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Tools
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
 
Network biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningNetwork biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text mining
 
Text Analytics Past, Present & Future
Text Analytics Past, Present & FutureText Analytics Past, Present & Future
Text Analytics Past, Present & Future
 
Pingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsPingar - The Future of Text Analytics
Pingar - The Future of Text Analytics
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and Providers
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and Visualization
 
Large-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLarge-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effects
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry View
 
Applied text mining
Applied text miningApplied text mining
Applied text mining
 

Similaire à Text mining on Twitter information based on R platform

P11 goonetilleke
P11 goonetillekeP11 goonetilleke
P11 goonetillekeRahul Yadav
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATAanargha gangadharan
 
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAREAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAMary Lis Joseph
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATAParvathy Devaraj
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
A topology based approach twittersdlfkjsdlkfj
A topology based approach twittersdlfkjsdlkfjA topology based approach twittersdlfkjsdlkfj
A topology based approach twittersdlfkjsdlkfjKunal Mittal
 
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET- Identification of Prevalent News from Twitter and Traditional Media us...
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET Journal
 
What Your Tweets Tell Us About You, Speaker Notes
What Your Tweets Tell Us About You, Speaker NotesWhat Your Tweets Tell Us About You, Speaker Notes
What Your Tweets Tell Us About You, Speaker NotesKrisKasianovitz
 
ONLINE TOXIC COMMENTS.pptx
ONLINE TOXIC COMMENTS.pptxONLINE TOXIC COMMENTS.pptx
ONLINE TOXIC COMMENTS.pptxyegnajayasimha21
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxjasoninnes20
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxcurwenmichaela
 
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...IRJET Journal
 
Characterizing microblogs
Characterizing microblogsCharacterizing microblogs
Characterizing microblogsEtico Capital
 
A Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweetsijtsrd
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
Text mining on Twitter information based on R platform

Qiaoyang ZHANG* (Computer and Information Science System, Macau University of Science and Technology) 3269046927@qq.com
Fayan TAO† (Computer and Information Science System, Macau University of Science and Technology) fytao2015@gmail.com
Junyi LU‡ (Computer and Information Science System, Macau University of Science and Technology) 448673862@qq.com

ABSTRACT
Twitter is one of the most popular social networks and plays a vital role in this new era, so exploring how information diffuses on Twitter is both attractive and useful.

In this report, we apply R to text mining and analysis of the Twitter topic "#prayforparis". We first preprocess the data, including data cleaning and word stemming. We then examine term frequencies and associations: the word "prayforparis" has the highest frequency, and most of the terms we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole set of tweets and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections between different topics. Since tweets are a good indicator of users' attitudes and emotions, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris; moreover, the majority hold positive attitudes in response to the attack.

Keywords
text mining; Twitter; R; "#prayforparis"; sentiment analysis

1. INTRODUCTION AND MOTIVATION
As data mining and big data become hot research topics in this new era, the demands placed on data-analysis techniques rise as well. It is difficult to store and analyze large data sets with traditional database methodologies, so we employ the powerful statistics platform R for big data mining and analysis: R provides many statistical models and data-analysis methods, such as classical statistical tests, time-series analysis, classification and clustering.
*We rank the authors' names by the inverse alphabetical order of the first letter of the authors' last names. Stu ID: 1509853G-II20-0033
†Stu ID: 1509853F-II20-0019
‡Stu ID: 1509853G-II20-0061
ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

In this project we analyze a large social-network data set, focused on Twitter users and their reactions to recent news, and aim to discover characteristics of those tweets. By analyzing a large amount of social-network data, we can gain better knowledge of users' preferences and habits, which is helpful for anyone interested in such data. For example, business firms and companies can provide better services after analyzing similar social-network data. That is why we chose this topic.

2. RELATED WORKS
2.1 Sentiment analysis by searching Twitter and Weibo¹
User-level sentiment evolution has been analyzed on Weibo: ZHANG Lumin, JIA Yan et al.[16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complex sentiments.

Michael Mathioudakis and Nick Koudas[11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. "trends") on Twitter in real time and provides meaningful analysis that synthesizes an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own descriptions for each trend.

Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers.
Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions.[11]

Agarwal, Passonneau et al.[1] introduced a model based on tree kernels to analyze the POS-specific prior-polarity features of Twitter data, using a Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (see the example in figure 1). They divided the sentiment of tweets into 3 categories: positive, negative and neutral. They marked the sentiment expressed by emoticons with an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) with an acronym dictionary; these dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word list drawn from WordNet to identify stop words, and a sentiment dictionary which has many positive words, negative

¹ This part of the related works is provided by Qiaoyang Zhang.
Figure 1: A tree kernel for a synthesized tweet: "@Fernando this isn't a great day for playing the HARP! :)"

words and neutral words to map words in tweets to their polarity. The accuracy of their model is higher than that of the Unigram model by 4.02%, and its standard deviation is lower than the Unigram model's by 0.52%.

2.2 Study of information diffusion on Twitter²
A number of recent papers have explored information diffusion on Twitter, one of the most popular social networks.

In 2011, Shaomei Wu et al.[14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and found strong homophily within categories, meaning that each category mainly follows itself. They also re-examined the classical "two-step flow" theory of communications[10], finding considerable support for it on Twitter. Additionally, the lifespans of various URLs were demonstrated under different categories. Finally, they examined the attention paid by the different user categories to different news topics.

This paper sheds clear light on how media information is transmitted on Twitter. The presented approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, they focus on only one narrow cross-section of media information (URLs); it would be better if their methods were also applied to other channels (TV, radio). Another weakness of the paper is the lack of linkage between information flow on Twitter and other sources of outcome data (e.g. users' opinions and actions).

Daniel Ramage et al.[13] studied search behaviors on Twitter, especially the information users prefer to search for. They also compared Twitter search with web search in terms of users' queries.
They found that Twitter results contain more social events and content, while web results include more facts and navigation.

Eytan Bakshy et al.[3] applied a regression model to Twitter data. Studying word-of-mouth marketing, they explored users' influence on Twitter both in communication and through URLs.² They found that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. They also found that URLs that were rated more interesting and/or elicited more positive feelings from workers on Mechanical Turk were more likely to spread.

² This part of the related works is provided by Fayan Tao.

As we can see, the three papers mentioned above all focus on large numbers of tweets and employ different methods to analyze various characteristics of tweets from different aspects. However, they are all limited to Twitter data and do not extend to other social networks.

2.3 Semantic Analysis and Text Mining³
Much research has been done to better understand people's characteristics in specific fields by analyzing the semantics of social-network content. This has many applications, especially for business-marketing purposes.

Topic mining and sentiment analysis have been performed on followers' comments on a company's Facebook fan page; the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments), the sentiment distributions throughout one year, and their relation to "Likes" [5]. This can help marketing staff track the sentiment trend as well as the main sentiment, so that the marketing technique can be adjusted. A Support Vector Machine (SVM) classification model is used in their analysis. Before classification, word segmentation and feature extraction are performed; feature extraction is based on a semantic dictionary and some additional rules.
They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".

Hsin-Ying Wu et al.[15] presented a method of analyzing Facebook posts that serves as a marketing tool, helping young entrepreneurs identify existing competitors in the market, together with their success factors and features, during the decision-making process. The overall mining process consists of three stages: (1) extracting Facebook posts; (2) text data preprocessing; (3) key-phrase and term filtering and extraction.

In detail, they segmented the original comments into words based on lexicons and on morphological rules for quantifier words and reduplicated words. The words and phrases

³ This part of the related works is provided by Junyi Lu.
are extracted from the text files and transformed into a key-phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase-frequency matrix and phrase similarity is used to identify the most important phrases (i.e. the features and factors of each shop). Various tools are utilized in their study: CKIP for Chinese word segmentation, PERL for extracting the text files and WEKA for key-phrase clustering.

Social-network mining has also been done in the educational field. Chen et al.[4] conducted an initial study on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social-monitoring tool, to acquire students' posts carrying the hashtag #engineeringProblems, collecting 19,799 unique tweets. Because of the ambiguity and complexity of natural language, they conducted an inductive content analysis and categorized the tweets into 5 prominent themes plus one group called "others". The main hashtag, non-letter symbols, repeated letters and stopwords are removed in the preprocessing stage. A multi-label naive Bayesian classifier is used because one tweet can reflect several problems. They then obtained another data set using the geocode of Purdue University with a radius of 1.3 miles to demonstrate the effectiveness of the classifier and to try to detect students' problems. They also demonstrated that the multi-label naive Bayesian classifier performs better than other state-of-the-art classifiers (SVM and M3L) on 4 measures (accuracy, precision, recall, F1). However, there is a main defect in their method, since they assume the categories are independent when transforming the problem into single-label classification problems.

Most text-mining processes are much alike. Generally, text preprocessing is conducted (removal of stopwords, punctuation and odd symbols and characters, plus segmentation) at the beginning.
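The key-phrase clustering step summarized above (a k-means pass over a phrase-frequency matrix, as in Wu et al.'s pipeline) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sample phrases, the naive "first k rows" initialization and the helper names are our own assumptions, and the paper's own implementation used WEKA.

```python
def phrase_frequency_matrix(docs, phrases):
    # One row per phrase, one column per document; entries are raw counts.
    return [[doc.count(p) for doc in docs] for p in phrases]

def kmeans_rows(rows, k, iters=20):
    # A tiny k-means over the frequency rows (squared Euclidean distance).
    # Naive initialization: the first k rows serve as the initial centroids
    # (assumes len(rows) >= k; a real implementation would sample).
    centroids = [list(r) for r in rows[:k]]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assign each phrase row to its nearest centroid.
        for i, row in enumerate(rows):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Phrases with similar frequency profiles across documents end up in the same cluster, which is how the "important phrases" (shop features and factors) get grouped.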
(Some studies, such as sentiment analysis, need part-of-speech tagging.) Then a term-frequency matrix is built from the data set to calculate term frequencies. Finally, classification and clustering are most often used to analyze the data and generate knowledge.

3. TEXT MINING UNDER THE R PLATFORM
3.1 About R
R[18] is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues; R can be considered a different implementation of S.

R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R also provides an open-source route to participation in statistical research. R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source-code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.

3.2 The idea
Text mining[2][17] is the discovery of interesting knowledge in text documents. Finding accurate knowledge in unstructured text documents, to help users find what they want, is a challenging issue. Text mining can be defined as the art of extracting data from large amounts of text; it allows one to structure and categorize text content that is initially non-organized and heterogeneous. It is an important data-mining technique and includes some of the most successful techniques for extracting effective patterns.

This report presents examples of text mining with R, using Twitter text on "prayforparis" as the data to analyze. It starts by extracting text from Twitter. The extracted text is then transformed to build a document-term matrix.
After that, frequent words and associations are found in the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is explored, and a word cloud is used to present the important words in the documents.

In this report, "tweet" and "document" are used interchangeably, as are "word" and "term". Three important packages are used in the examples: twitteR, tm and wordcloud. Package twitteR[8] provides access to Twitter data, tm[6] provides functions for text mining, and wordcloud[7] visualizes the result as a word cloud.

4. IMPLEMENTATIONS⁴
4.1 Data Preprocessing
We first mine 3200 tweets from Twitter by searching for the main topic "prayforparis" over the period 13 Nov 2015 to 13 Dec 2015. Then we perform some data preprocessing.

4.1.1 Data Cleaning
The tweets are first converted to a data frame and then to a corpus, which is a collection of text documents. After that, the corpus undergoes a couple of transformations, including changing letters to lower case, adding "pray" and "for" as extra stop words, and removing URLs, punctuation, numbers, extra whitespace and stop words. Next, we keep a copy of the corpus to use later as a dictionary for stem completion.

4.1.2 Stemming Words
Stemming[19] is the term used in linguistic morphology and information retrieval for the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form). A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Stemming makes the words look uniform; in R, stemming can be done with stemDocument(), and the stems can then be completed with stemCompletion().

In the following steps, we use stemCompletion() to complete the stems, with the unstemmed corpus "myCorpusCopy" as the dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
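The cleaning and stemming steps of section 4.1 map onto a short language-neutral sketch. The report's actual pipeline uses tm's tm_map/stemDocument in R (see the appendix); below is a hedged Python illustration, with a deliberately tiny stand-in stop-word list and a crude suffix-stripper standing in for a real stemmer, both our own assumptions.

```python
import re

EXTRA_STOPWORDS = {"pray", "for"}  # the report adds these two extra stop words
BASE_STOPWORDS = {"the", "a", "an", "rt", "is", "are"}  # tiny stand-in list

def clean_tweet(text, stopwords=BASE_STOPWORDS | EXTRA_STOPWORDS):
    # Mirror the report's cleaning order: lower-case, drop URLs,
    # keep only letters/whitespace, then remove stop words.
    text = text.lower()
    text = re.sub(r"http[^\s]*", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)     # remove anything but letters/space
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)

def stem(word):
    # Crude suffix-stripping stand-in for a real stemmer such as Porter's.
    for suf in ("ming", "ing", "ed", "er", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```

For example, clean_tweet("RT Pray for Paris! http://t.co/xyz #PrayForParis") yields "paris prayforparis", and stem maps both "stems" and "stemming" to "stem", matching the stemmer behavior described above.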
4.1.3 Building a Term-Document Matrix
A term-document matrix shows the relationship between terms and documents: each row stands for a term, each column for a document, and each entry is the number of occurrences of the term in the document.

⁴ All of our implementation code is attached at the end of this report.
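The term-document matrix just described is easy to build directly: the sparsity figure later quoted in table 1 is simply the fraction of zero entries. A minimal sketch, with our own (hypothetical) helper names; the report itself uses tm's TermDocumentMatrix().

```python
from collections import Counter

def term_document_matrix(docs):
    # Rows are terms, columns are documents; entries are raw counts (tf).
    counts = [Counter(d.split()) for d in docs]
    terms = sorted(set().union(*[c.keys() for c in counts]))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def sparsity(matrix):
    # Fraction of zero entries, as reported by tm for a TermDocumentMatrix.
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)
```

On a large corpus most terms never occur in most documents, which is why the report's 3621-term by 3200-document matrix is nearly 100% sparse.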
Table 1: TermDocumentMatrix
  terms: 3621, documents: 3200
  Non-/sparse entries: 27543/11559657
  Sparsity: 100%
  Maximal term length: 38
  Weighting: term frequency (tf)

Figure 2: layout of whole tweets

Alternatively, one can build a document-term matrix by swapping rows and columns. In this report, we build a term-document matrix from the processed corpus above with the function TermDocumentMatrix(). As table 1 shows, there are in total 3621 terms and 3200 documents in the term-document matrix. We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms are not contained in any given document.

We can also see the layout of the whole set of tweets in figure 2; they are mainly located in two parts. Because of the large amount of data, we cannot read the words clearly. Therefore, we select some terms from the total data and show their distributions in figures 3 and 4. Most of the terms are connected within a bounded zone, which means they are more or less associated.

5. FREQUENT TERMS AND ASSOCIATIONS
Based on the above processing, we now show the frequent words. Note that there are 3200 tweets in total. We first choose the words that appear more than 100 times; the results are shown in table 2. We can see, for example, that the counts of "parisattack", "pour" and "victim" are all above 100, which means they appear with high frequency under the topic "prayforparis". Going further, we plot the counts of all words that appear at least 100 times; the result is shown in figure 5. Figure 5 contains so many terms that we cannot read the count of each one.
So we choose only 70 terms and show the counts of all words that appear at least 100

Figure 3: layout-1 of some parts selected from whole tweets
Figure 4: layout-2 of some parts selected from whole tweets
Figure 5: Total words that appear at least 100 times
Figure 6: Selecting some words that appear at least 100 times

Table 3: words associated with "pray" with correlation no less than 0.25
  lose 0.56, over 0.56, papajackadvic 0.56, struggl 0.56, trust 0.56,
  worri 0.56, prayfor 0.40, think 0.40, hope 0.32, simoncowel 0.29,
  scare 0.28, stay 0.25
times.

Table 2: words that appear more than 100 times
  [1]  "´ld'"           "attentat" "aux"          "ça"
  [5]  "de"             "déjà"     "et"           "everyon"
  [9]  "fait"           "franc"    "go"           "il"
  [13] "jamai"          "jour"     "la"           "les"
  [17] "louistomlinson" "moi"      "ne"           "noubliera"
  [21] "novembr"        "pari"     "parisattack"  "pas"
  [25] "pensée"         "pour"     "prayforpari"  "que"
  [29] "rt"             "simoncr"  "thought"      "un"
  [33] "victim"         "vous"     "y"            "ytbclara"

As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000; "pari" is second, followed by "parisattack". This result indicates that most people care about the Paris attack and pray for Paris.

To find associations among words, we take "pray" as an example and look for the words associated with it with correlation no less than 0.25. From table 3 we can see that 12 terms, including "lose", "struggl", "trust" and "hope", are connected with "pray"; six terms such as "lose", "papajackadvic" and "trust" are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.

6. CLUSTERING WORDS
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed, so that the plot of the clustering is not crowded with words, and we cut the tree into 10 clusters. The agglomeration method is set to Ward's, which at each step merges the two clusters whose merger gives the smallest increase in variance.

In figure 7 we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some others are clustered into one group, because there are a couple of tweets on the Paris attack. Another group contains "everyone" and "thought", because everyone's thoughts are focused on this event. We can also see that "moi", "déjà" and "prayforpari" each sit in a group of their own, which means they have few relationships with the other terms.

Figure 7: cluster (10 groups)

7.
EXPERIMENTS ABOUT SENTIMENTS

Figure 8: Emotion categories of #prayforparis
Figure 9: Classification by polarity of #prayforparis
Figure 10: A word cloud of #prayforparis
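The association scores used in section 5 (tm's findAssocs() with a 0.25 threshold, table 3) are, in essence, correlations between rows of the term-document matrix. A small sketch of that idea using Pearson correlation; the helper names are hypothetical and this is an approximation of findAssocs, not its exact implementation.

```python
import math

def pearson(x, y):
    # Pearson correlation of two equal-length count vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def find_assocs(terms, matrix, target, corlimit):
    # Terms whose per-document count vectors correlate with `target`
    # at or above `corlimit` (roughly what tm's findAssocs reports).
    t_row = matrix[terms.index(target)]
    out = {}
    for term, row in zip(terms, matrix):
        if term == target:
            continue
        r = pearson(t_row, row)
        if r >= corlimit:
            out[term] = round(r, 2)
    return dict(sorted(out.items(), key=lambda kv: -kv[1]))
```

Two terms that tend to appear in the same tweets get a correlation near 1, which is why "prayfor" and "hope" show up as associates of "pray" in table 3.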
Table 4: Timetable and working plan
  Stage 1: literature survey; determine the project topic
  Stage 2: R programming and text-mining learning
  Stage 3: implementations
  Stage 4: presentation and final report
  Qiaoyang Zhang: mainly read references [1], [9], [11], [12] and [16]; sentiment-analysis implementation.
  Fayan Tao: mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis.
  Junyi Lu: mainly read [4], [5] and [15]; analyzed term associations and clustered words.
  Remark: all of us read [6], [7], [8], [17], [18] and [19].

We also ran a sentiment experiment in R with the method mentioned in the related works. We loaded the package named "sentiment" in R and analyzed the sentiment of tweets under the hashtag "#prayforparis" on Twitter. Using the "sentiment" package, we mined more than 6800 tweets and established a corpus[12] in R, mainly analyzing the related parts of speech, frequencies and correlations of words.

Figure 8 shows the emotion categories of "#prayforparis" according to the emotion dictionary. In this figure, we can see that nearly 1000 people felt sad and angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), while a small number of people felt afraid or surprised. In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets; in addition, fewer than 500 people used words with no polarity under the hashtag "#prayforparis".

From the word cloud[17] in figure 10, we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the emotional words were concentrated in the categories of sadness, anger and disgust. From these experimental data, we can conclude that the general attitude of people around the world toward the terrorist attack is sadness and anger.
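The polarity classification behind figure 9 can be approximated by a simple lexicon count: positive hits minus negative hits per tweet. The "sentiment" R package the report uses is classifier-based and more sophisticated, so the toy lexicon below is purely illustrative and its word lists are our own assumptions.

```python
from collections import Counter

# Tiny stand-in lexicons; a real system would use a full sentiment dictionary.
POSITIVE = {"hope", "love", "peace", "support"}
NEGATIVE = {"sad", "angry", "attack", "terror"}

def polarity(tweet):
    # Score a tweet as positive hits minus negative hits.
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def polarity_distribution(tweets):
    # Counts per polarity class, like the bars in figure 9.
    return Counter(polarity(t) for t in tweets)
```

Aggregating polarity over the whole corpus gives the kind of positive/negative/neutral breakdown reported above.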
Most people feel sorry for the victims in Paris and pray for them; they are also strongly against terrorism.

8. WORKING PLAN
To finish this project, we made a timetable and working plan, shown in table 4.

9. CONCLUSION AND FUTURE WORKS
In this report, we apply R to text mining and analysis of "#prayforparis" on Twitter. We first perform data preprocessing, such as data cleaning and word stemming. We then show term frequencies and associations: "prayforparis" ranks highest in frequency, and most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show the layout of the whole set of tweets and of some extracted tweets. Additionally, we cluster the tweet topics into 10 groups to see the connections between terms. Since tweets are a good indicator of users' attitudes and emotions, we further perform sentiment analysis, finding that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority hold positive attitudes in response to the attack, mainly out of hope for a good future for Paris and for the whole world.

The data we mined are limited to one topic and are not very large, which may lead to data incompleteness. Additionally, some problems remain in the data preprocessing; for example, the term-document matrix is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluation. In future work, we plan to develop a better model or algorithm for mining and analyzing different kinds of social-network data with R. We will also focus on improving the data preprocessing, so that the results become more precise.

10. ACKNOWLEDGMENT
We wish to thank Dr. Hong-Ning DAI for his patient guidance and vital suggestions on this report.

11. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media,
39(4):620-622, 2011.
[2] V. Aswini and S. K. Lavanya. Pattern discovery for text mining. In Computation of Power, Energy, Information and Communication (ICCPEIC), 2014 International Conference on, IEEE, pp. 412-416, 2014.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 65-74, 2011. DOI=http://dx.doi.org/10.1145/1935826.1935845
[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining social media data for understanding students' learning experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246-259, 2014.
[5] Kuan-Cheng Lin et al. Mining the user clusters on Facebook fan pages based on topic and sentiment analysis. In Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on, pp. 627-632, 13-15 Aug. 2014.
[6] I. Feinerer. tm: Text Mining Package. R package version 0.5-7.1. 2012.
[7] I. Fellows. wordcloud: Word Clouds. R package version 2.0. 2012.
[8] J. Gentry. twitteR: R based Twitter client. R package version 0.99.19. 2012.
[9] I. Guellil and K. Boukhalfa. Social big data mining: A survey focused on opinion mining and sentiments analysis. In Programming and Systems (ISPS), 2015
12th International Symposium on, pp. 1-10, April 2015.
[10] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61-78, 1957.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), ACM, New York, NY, pp. 1155-1157, 2010.
[12] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Seventh Conference on International Language Resources and Evaluation, 2010.
[13] J. Teevan, D. Ramage, and M. R. Morris. TwitterSearch: a comparison of microblog search and web search. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 35-44, 2011. DOI=http://dx.doi.org/10.1145/1935826.1935842
[14] S. M. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, NY, USA, pp. 705-714, 2011. DOI=http://dx.doi.org/10.1145/1963405.1963504
[15] Hsin-Ying Wu, Kuan-Liang Liu, and C. Trappey. Understanding customers using Facebook pages: Data mining users feedback using text analysis. In Computer Supported Cooperative Work in Design (CSCWD), Proceedings of the 2014 IEEE 18th International Conference on, pp. 346-350, 21-23 May 2014.
[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou, and Y. Han. User-level sentiment evolution analysis in microblog. China Communications, 11(12):152-163, 2011.
[17] Y. C. Zhao. R and Data Mining: Examples and Case Studies. Elsevier, 2012.
[18] More details about R: https://www.r-project.org/about.html
[19] More information about stemming: https://en.wikipedia.org/wiki/Stemming

APPENDIX
A. CODES FOR TEXT MINING
  • 8. 1 l i b r a r y (ROAuth) 2 l i b r a r y ( bitops ) 3 l i b r a r y ( RCurl ) 4 l i b r a r y ( twitteR ) 5 l i b r a r y (NLP) 6 l i b r a r y (tm) 7 l i b r a r y ( RColorBrewer ) 8 l i b r a r y ( wordcloud ) 9 l i b r a r y (XML) 10 #Set t w i t t e r auth url 11 reqTokenURL <− ”https :// api . t w i t t e r . com/oauth/ request token ” 12 accessTokenURL <− ”https :// api . t w i t t e r . com/oauth/ access token ” 13 authURL <− ”https :// api . t w i t t e r . com/oauth/ authorize ” 14 #Set t w i t t e r key 15 consumerkey <− ”PXoumpl5ndvroikd1DPeGkcqE ” 16 consumerSecret <− ”raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv” 17 accessToken <− ”3954258018−HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq” 18 acce ss Secr e t <− ”K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B” 19 setup twitter oauth ( consumerkey , consumerSecret , accessToken , 20 +acce ss Secr e t ) 21 l i b r a r y ( twitteR ) 22 tweets <− searchTwitter ( ”PrayforParis ” , s i nc e = ”2015−11−13” , 23 + u n t i l = ”2015−12−14” , n = 3200) 24 ( nDocs <− length ( tweets )) 25 #[ 1 ] 3200 26 # convert tweets to a data frame 27 tweets . df <− twListToDF ( tweets ) 28 dim( tweets . df ) 29 # 3200 16 30 #Text cleaning 31 l i b r a r y (tm) 32 # build a corpus , and s p e c i f y the source to be character vectors 33 myCorpus <− Corpus ( VectorSource ( tweets . 
df$text )) 34 # convert to lower case 35 # tm v0 .6 36 myCorpus <− tm map(myCorpus , content transformer ( tolower )) 37 # tm v0.5−10 38 # myCorpus <− tm map(myCorpus , tolower ) 39 # remove URLs 40 removeURL <− function (x) gsub ( ”http [ ˆ [ : space : ] ] ∗ ” , ”” , x) 41 # tm v0 .6 42 myCorpus <− tm map(myCorpus , content transformer (removeURL )) 43 # tm v0.5−10 44 # myCorpus <− tm map(myCorpus , removeURL) 45 # remove anything other than English l e t t e r s or space 46 removeNumPunct <− function (x) gsub ( ” [ ˆ [ : alpha : ] [ : space : ] ] ∗ ” , ”” , x) 47 myCorpus <− tm map(myCorpus , content transformer (removeNumPunct )) 48 # remove punctuation 49 # myCorpus <− tm map(myCorpus , removePunctuation ) 50 # remove numbers 51 # myCorpus <− tm map(myCorpus , removeNumbers ) 52 # add two extra stop words : ”pray ” and ”f o r ” 53 myStopwords <− c ( stopwords ( ’ e n g l i s h ’ ) , ”pray ” , ”f o r ”) 54 # remove ”ISIS ” and ”Paris ” from stopwords 55 myStopwords <− s e t d i f f ( myStopwords , c ( ”ISIS ” , ”Paris ”)) 56 # remove stopwords from corpus 57 myCorpus <− tm map(myCorpus , removeWords , myStopwords ) 58 # remove extra whitespace 59 myCorpus <− tm map(myCorpus , stripWhitespace ) 60 # keep a copy of corpus to use l a t e r as a dictionary 61 #f o r stem completion 62 myCorpusCopy <− myCorpus 63 # stem words 64 myCorpus <− tm map(myCorpus , stemDocument ) 65 # inspect the f i r s t 5 documents ( tweets ) 66 # inspect (myCorpus [ 1 : 5 ] ) 67 # The code below i s used f o r to make text f i t f o r paper width 68 f o r ( i in c ( 1 : 2 , 320)) { 69 cat ( paste0 ( ” [ ” , i , ” ] ”))
  • 9. 1 2 writeLines ( strwrap ( as . character (myCorpus [ [ i ] ] ) , 60))} 3 #[ 1 ] RT BahutConfess PrayForPari 4 #[ 2 ] FCBayern dontbombsyria i s i PrayForUmmah i s r a i l spdbpt bbc 5 #PrayforPari Merkel franc BVBPAOK saudi 6 #[ 3 2 0 ] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForParid 7 # tm v0.5−10 8 # myCorpus <− tm map(myCorpus , stemCompletion ) 9 # tm v0 .6 10 stemCompletion2 <− function (x , dictionary ) { 11 x <− u n l i s t ( s t r s p l i t ( as . character (x ) , ” ”)) 12 # Unexpectedly , stemCompletion completes an empty s t r i n g to 13 # a word in dictionary . Remove empty s t r i n g to avoid above i s s u e . 14 x <− x [ x != ”” ] 15 x <− stemCompletion (x , dictionary=dictionary ) 16 x <− paste (x , sep=”” , c o l l a p s e=” ”) 17 PlainTextDocument ( stripWhitespace (x )) 18 } 19 myCorpus <− lapply (myCorpus , stemCompletion2 , 20 +dictionary=myCorpusCopy) 21 myCorpus <− Corpus ( VectorSource (myCorpus )) 22 # count frequency of ”ISIS ” 23 ISISCases <− lapply (myCorpusCopy , 24 function (x) { grep ( as . character (x ) , pattern = ” <ISIS ”) } ) 25 sum( u n l i s t ( ISISCases )) 26 ## [ 1 ] 8 27 # count frequency of ”pray ” 28 prayCases <− lapply (myCorpusCopy , 29 function (x) { grep ( as . 
character(x), pattern = "\\<pray") })
sum(unlist(prayCases))
## [1] 1136
# replace "Islam" with "ISIS"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "Islam", replacement = "ISIS")
tdm <- TermDocumentMatrix(myCorpus,
                          control = list(wordLengths = c(1, Inf)))
tdm
# <<TermDocumentMatrix (terms: 3621, documents: 3200)>>
# Non-/sparse entries: 27543/11559657
# Sparsity           : 100%
# Maximal term length: 38
# Weighting          : term frequency (tf)
# Frequent Words and Associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
#############
# <<TermDocumentMatrix (terms: 6, documents: 7)>>
# Non-/sparse entries: 2/40
# Sparsity           : 95%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                  Docs
# Terms            10 11 12 13 14 15 16
#   pray            0  1  0  0  0  0  0
#   prayed          0  0  0  0  0  0  0
#   prayer          0  0  0  0  1  0  0
#   prayersburundi  0  0  0  0  0  0  0
#   prayersforfr    0  0  0  0  0  0  0
#   prayersforpari  0  0  0  0  0  0  0
##########
# inspect frequent words (at least 100 occurrences)
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)
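The frequent terms collected above can also be visualised as a word cloud. This is a minimal sketch added for illustration, not part of the report's original listing; it assumes the `wordcloud` package is installed and reuses the `term.freq` vector of terms occurring at least 100 times.

```r
# word-cloud sketch (assumes the wordcloud package is installed);
# reuses the term.freq vector of terms with frequency >= 100
library(wordcloud)
set.seed(375)  # fix the seed so the layout is reproducible
wordcloud(words = names(term.freq), freq = term.freq,
          min.freq = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

Plotting frequent words first keeps the cloud readable; `random.order = FALSE` places the most frequent terms in the centre.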
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# select some terms
ggplot(df[30:60, ], aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# which words are associated with "pray"?
findAssocs(tdm, 'pray', 0.25)
# clustering words
# remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
#### cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# other methods: complete, average, centroid
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
##############################
# > (groups <- cutree(fit, k = 10))
#        l    attentat           a        deja          et     everyon
#        1           2           2           3           1           4
#     fait          il       jamai         les         moi   noubliera
#        2           5           2           2           6           2
#     pari parisattack        pour prayforpari          rt     simoncr
#        7           2           1           8           9           2
#  thought          un      victim           y    ytbclara
#        4          10           1           5           1
##################
# change tdm to a Boolean matrix
termDocMatrix = as.matrix(tdm)
# termDocMatrix = as.matrix(tdm[40:240, 40:240])
# remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = T, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.
color <- NA
barplot(table(V(g)$degree))
tdm = tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# load twitter text
# library(twitteR)  # load(file = "data/rdmTweets.RData")
# convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix = as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
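The 10 term groups earlier in this listing come from hierarchical clustering (`hclust` plus `cutree`); as a cross-check, the same matrix can be clustered with k-means. A sketch added for illustration, assuming the sparse-term-filtered matrix `m2` from the clustering step is still in scope; `k = 10` simply mirrors the cut of the dendrogram.

```r
# k-means clustering of documents as a cross-check of the hclust groups;
# m3 has one row per document and one column per term
m3 <- t(m2)
set.seed(122)  # k-means initialisation is random
kmeansResult <- kmeans(m3, centers = 10)
# the cluster centres show which terms dominate each group
round(kmeansResult$centers, digits = 3)
```

Because k-means partitions directly rather than cutting a dendrogram, the two methods need not produce identical groups; agreement between them is evidence the clusters are stable.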
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
# inspect terms numbered 5 to 10
dim(termMatrix)
# [1] 3642 3200
termMatrix[5:10, 5:10]
################
# Terms             abrahammateomus abzzni accept account acontecem across
#   abrahammateomus               1      0      0       0         0      0
#   abzzni                        0      1      0       0         0      0
#   accept                        0      0      2       0         0      0
#   account                       0      0      0       1         0      0
#   acontecem                     0      0      0       0         2      0
#   across                        0      0      0       0         0      2
##############
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = T, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000)  # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
#########################################
termMatrix <- termMatrix[1500:2000, 1500:2000]
# create a two-mode graph; graph.incidence() expects the bipartite
# term-document matrix, not the square term-term matrix
g <- graph.incidence(termDocMatrix, mode = c("all"))
# get index for term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# set colors and sizes for vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .
5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
############### sentiment analysis #############
# harvest some tweets
some_tweets = searchTwitter("#prayforparis", n = 10000, lang = "en")
# get the text
some_txt = sapply(some_tweets, function(x) x$getText())
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove at people
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.error = function(x)
{
  # create missing value
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt = some_txt[!is.
na(some_txt)]
names(some_txt) = NULL
# classify emotion
class_emo = classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion = class_emo[, 7]
# substitute NAs by "unknown"
emotion[is.na(emotion)] = "unknown"
# classify polarity
class_pol = classify_polarity(some_txt, algorithm = "bayes")
# get polarity best fit
polarity = class_pol[, 4]
# data frame with results
sent_df = data.frame(text = some_txt, emotion = emotion,
                     polarity = polarity, stringsAsFactors = FALSE)
# sort data frame
sent_df = within(sent_df,
                 emotion <- factor(emotion,
                   levels = names(sort(table(emotion), decreasing = TRUE))))
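Before plotting, the classification results can be checked numerically; a small sketch added for illustration that only tabulates the `sent_df` data frame built above.

```r
# quick numeric summary of the emotion / polarity classification
table(sent_df$emotion)               # tweets per emotion category
table(sent_df$polarity)              # tweets per polarity category
prop.table(table(sent_df$polarity))  # polarity shares
```

These counts should match the bar heights in the plots that follow, which is a cheap sanity check on the plotting code.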
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))
# separate text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = some_txt[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse = " ")
}
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)