Text mining on Twitter information based on R platform
Qiaoyang ZHANG∗
Computer and Information Science System
Macau University of Science and Technology
3269046927@qq.com

Fayan TAO†
Computer and Information Science System
Macau University of Science and Technology
fytao2015@gmail.com

Junyi LU‡
Computer and Information Science System
Macau University of Science and Technology
448673862@qq.com
ABSTRACT
Twitter is one of the most popular social networks and plays a vital role in this new era. Exploring information diffusion on Twitter is both attractive and useful.
In this report, we apply R to text mining and analysis of the Twitter topic "#prayforparis". We first preprocess the data, including data cleaning and word stemming. Then we show tweet term frequencies and associations. We find that the word "prayforparis" has the highest frequency, and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole tweet set and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections among different topics. Since tweets indicate users' attitudes and emotions well, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. Besides, the majority hold positive attitudes in the face of this attack.
Keywords
text mining; Twitter; R; ”#prayforparis”; sentiment analysis
1. INTRODUCTION AND MOTIVATION
As data mining and big data become hot research topics in this new era, much more is demanded of data-analysis techniques as well. It is difficult to store and analyze large data sets using traditional database methodologies, so we employ the powerful statistics platform R for big data mining and analysis: R provides many kinds of statistical models and data-analysis methods, such as classic statistical tests, time-series analysis, classification and clustering.
∗We rank the authors' names by the inverse alphabetical order of the first letter of the authors' last names. Stu ID: 1509853G-II20-0033
†Stu ID: 1509853F-II20-0019
‡Stu ID: 1509853G-II20-0061
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
In this project, we try to analyze a large social network data set, mainly focused on Twitter users and their expressions about the latest news. The analysis is carried out to discover characteristics of those tweets. By analyzing a large amount of social network data, we can gain better knowledge of users' preferences and habits, which is helpful for anyone interested in such data. For example, business firms and companies can provide better services after analyzing similar social network data. That is why we chose this topic.
2. RELATED WORKS
2.1 Sentiment analysis by searching Twitter and Weibo
User-level sentiment evolution can be analyzed on Weibo. ZHANG Lumin, JIA Yan et al.[16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complicated sentiments.
Michael Mathioudakis and Nick Koudas[11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e., "trends") on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own descriptions for each trend.
Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers. Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions.[11]
Agarwal, Passonneau et al.[1] mainly introduced a model based on tree kernels to analyze the POS-specific prior polarity features of Twitter data, using a Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (see the example in figure 1). They divided the sentiment in tweets into 3 categories: positive, negative and neutral. They marked the sentiment expressed by emoticons using an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) using an acronym dictionary; those dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word dictionary based on WordNet to identify stop words, and a sentiment dictionary with many positive words, negative
1 This part of related works is provided by Qiaoyang Zhang.
Figure 1: A tree kernel for a synthesized tweet: ”@Fernando this isn’t a great day for playing the HARP! :)”
words and neutral words to map words in tweets to their
polarity.
The accuracy of their model is higher than that of the Unigram model by 4.02%, and its standard deviation is lower than that of the Unigram model by 0.52%.
2.2 Study information diffusion on Twitter
A number of recent papers have explored the information
diffusion on Twitter, which is one of the most popular social
networks.
In 2011, Shaomei Wu et al.[14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and they found strong homophily within categories, meaning that each category mainly follows itself. They also re-examined the classical "two-step flow" theory[10] of communications, finding considerable support for it on Twitter. Additionally, the lifespans of various URLs were demonstrated under different categories. Finally, they examined the attention paid by the different user categories to different news topics.
This paper sheds clear light on how media information is transmitted on Twitter. The presented approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, they focus on only one narrow cross-section of media information (URLs); it would be better if their methods were applied to other channels (TV, radio). Another weakness of this paper is that it does not link information flow on Twitter with other sources of outcome data (e.g., users' opinions and actions).
Daniel Ramage et al.[13] studied search behaviors on Twitter, especially the information that users prefer to search for. They also compared Twitter search with web search in terms of users' queries. They found that Twitter results contain more social events and content, while web results include more facts and navigation.
Eytan Bakshy et al.[3] used a regression model to analyze Twitter data. They explored word-of-mouth marketing to study users' influence on Twitter, not only on communication but also on URLs. They found that the largest
2 This part of related works is provided by Fayan Tao.
cascades tend to be generated by users who have been influ-
ential in the past and who have a large number of followers.
They also found that URLs that were rated more interesting
and/or elicited more positive feelings by workers on Mechan-
ical Turk were more likely to spread.
All three papers mentioned above focus on large numbers of tweets and employ different methods to analyze various characteristics of tweets from different aspects. But they are all limited to Twitter data, rather than extending to other social networks.
2.3 Semantic Analysis and Text Mining
Much research has been done to gain a better understanding of people's characteristics in specific fields by analyzing the semantics of social network content. This has many applications, especially for business marketing purposes.
Topic mining and sentiment analysis have been performed on followers' comments on a company's Facebook fan page; the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments) and the sentiment distributions throughout one year, along with their relation to "Likes", respectively [5]. This can help marketing staff stay aware of the sentiment trend as well as the main sentiment, so as to adjust their marketing techniques. A Support Vector Machine (SVM) classification model is used in their analysis. Before classification, word segmentation and feature extraction are performed; feature extraction is based on a semantic dictionary and some additional rules. They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".
Hsin-Ying Wu et al. [15] presented a method of analyzing Facebook posts that serves as a marketing tool to help young entrepreneurs identify existing competitors in the market, as well as their success factors and features, during the decision-making process. The overall mining process consists of three stages:
1 Extracting Facebook posts;
2 Text data preprocessing;
3 Key phrase and term filtering and extraction.
In detail, they performed word segmentation on the original comments based on lexicons and morphological rules for quantifier words and reduplicated words. The words and phrases
3 This part of related works is provided by Junyi Lu.
are extracted from text files and transformed into a key-phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase-frequency matrix and phrase similarity is used to identify the most important phrases (i.e., the features and factors of each shop). Various tools are utilized in their study: CKIP for Chinese word segmentation, PERL for extracting text files and WEKA for key-phrase clustering.
Social network mining has also been done in the educational field. Chen et al.[4] conducted initial research on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social monitoring tool, to acquire students' posts under the hashtag #engineeringProblems, collecting 19,799 unique tweets. Due to the ambiguity and complexity of natural language, they conducted inductive content analysis and categorized the tweets into 5 prominent themes and one group called "others". The main hashtag, non-letter symbols, repeated letters and stopwords are removed in the preprocessing stage. A multi-label naive Bayesian classifier is used because one tweet can reflect several problems. They then obtained another data set using the geocode of Purdue University with a radius of 1.3 miles to demonstrate the effectiveness of the classifier and to try to detect students' problems. They also demonstrated that the multi-label naive Bayesian classifier performs better than other state-of-the-art classifiers (SVM and M3L) according to 4 metrics (accuracy, precision, recall, F1). But there is a main defect in their method, since they assume the categories are independent when they transform the problem into single-label classification problems.
Most text mining processes are much the same. Generally, text preprocessing is conducted at the beginning (removal of stopwords, punctuation and strange symbols or characters, plus segmentation); some studies, such as sentiment analysis, also need part-of-speech tagging. Then a term-frequency matrix is built from the data set to calculate term frequencies. Finally, classification and clustering are most often used to analyze the data and generate knowledge.
3. TEXT MINING UNDER R PLATFORM
3.1 About R
R[18] is a language and environment for statistical com-
puting and graphics. It is a GNU project which is similar to
the S language and environment which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by
John Chambers and colleagues. R can be considered as a
different implementation of S.
R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering) and graphical techniques, and is
highly extensible. R also provides an Open Source route to
participation in statistical research. R is available as Free
Software under the terms of the Free Software Foundation’s
GNU General Public License in source code form. It com-
piles and runs on a wide variety of UNIX platforms and
similar systems (including FreeBSD and Linux), Windows
and MacOS.
3.2 The idea
Text mining[2][17] is the discovery of interesting knowledge in text documents. It is a challenging issue to extract accurate knowledge from unstructured text documents to help users find what they want. It can be defined as the art of extracting data from large amounts of text. It allows one to structure and categorize text contents that are initially unorganized and heterogeneous. Text mining is an important data mining technique and includes some of the most successful techniques for extracting effective patterns.
This report presents examples of text mining with R. Twitter text ("#prayforparis") is used as the data to analyze. It starts with extracting text from Twitter. The extracted text is then transformed to build a term-document matrix. After that, frequent words and associations are found in the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is performed, and a word cloud is used to present important words in the documents.
In this report, "tweet" and "document" will be used interchangeably, as will "word" and "term". Three important packages are used in the examples: twitteR, tm and wordcloud. The twitteR package[8] provides access to Twitter data, tm[6] provides functions for text mining, and wordcloud[7] visualizes the results with a word cloud.
4. IMPLEMENTATIONS
4.1 Data Preprocessing
We first mine 3200 tweets from Twitter by searching for the main topic "prayforparis" during the period from 13 Nov 2015 to 13 Dec 2015. Then we do some data preprocessing.
4.1.1 Data Cleaning
The tweets are first converted to a data frame and then
to a corpus, which is a collection of text documents. After
that, the corpus needs a couple of transformations, including
changing letters to lower case, adding ”pray” and ”for” as ex-
tra stop words and removing URLs, punctuations, numbers
extra whitespace and stop words.
Next, we keep a copy of corpus to use later as a dictionary
for stem completion
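The cleaning steps above can be sketched in base R on a single string (a minimal illustration of the same transformations; the actual pipeline applies them to the whole corpus with tm_map(), as shown in the appendix):

```r
# Minimal sketch of the cleaning steps: lowercasing, URL removal,
# removal of anything other than letters/spaces, stop-word removal,
# and whitespace normalization.
clean_tweet <- function(x, stopwords = c("pray", "for")) {
  x <- tolower(x)                              # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)        # remove URLs
  x <- gsub("[^[:alpha:][:space:]]*", "", x)   # keep only letters and spaces
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[words != "" & !(words %in% stopwords)]
  paste(words, collapse = " ")                 # strip extra whitespace
}

clean_tweet("Pray for Paris! http://t.co/abc #PrayForParis")
# -> "paris prayforparis"
```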
4.1.2 Stemming Words
Stemming[19] is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form. A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Word stemming makes variant word forms look uniform. This can be achieved with the function "stemDocument()" in R.
In the following steps, we use "stemCompletion()" to complete the stems, with the unstemmed corpus "myCorpusCopy" as a dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
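The completion rule can be sketched in base R as follows; this is a hypothetical illustration of the "most frequent match" behavior, not the tm implementation (the function name complete_stem and the toy dictionary are ours):

```r
# Sketch of stem completion: each stem is completed to the most
# frequent dictionary word that begins with that stem.
complete_stem <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(stem)               # no match: keep the stem
  names(sort(table(hits), decreasing = TRUE))[1]    # most frequent match
}

dict <- c("prayers", "prayers", "prayed", "paris", "paris", "paris")
complete_stem("pray", dict)   # -> "prayers" (appears twice, "prayed" once)
complete_stem("pari", dict)   # -> "paris"
```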
4.1.3 Building a Term-Document Matrix
A term-document matrix indicates the relationship be-
tween terms and documents, where each row stands for a
term and each column for a document, and an entry is
the number of occurrences of the term in the document.
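As a toy illustration of this structure, a small term-document matrix can be built in base R from three made-up tweets (the real matrix is built from the corpus with "TermDocumentMatrix()"):

```r
# Toy term-document matrix: rows are terms, columns are documents,
# and each entry counts occurrences of the term in the document.
docs   <- c("prayforparis paris", "paris parisattack", "prayforparis")
tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))
tdm    <- sapply(tokens, function(d) table(factor(d, levels = terms)))
dimnames(tdm) <- list(Terms = terms, Docs = seq_along(docs))
tdm["paris", 2]         # "paris" occurs once in document 2
tdm["parisattack", 1]   # and "parisattack" not at all in document 1
```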
4 All of our implementation codes are attached at the end of this report.
TermDocumentMatrix (terms: 3621, documents: 3200)
Non-/sparse entries: 27543/11559657
Sparsity: 100%
Maximal term length: 38
Weighting: term frequency (tf)
Table 1: TermDocumentMatrix
Figure 2: layout of whole tweets
Alternatively, one can build a document-term matrix by swapping the rows and columns. In this report, we build a term-document matrix from the processed corpus above with the function "TermDocumentMatrix()".
As table 1 shows, there are in total 3621 terms and 3200 documents in the term-document matrix. We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms are not contained in any given document.
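The sparsity figure in table 1 can be checked by direct arithmetic on the counts it reports:

```r
# 3621 terms x 3200 documents gives 11,587,200 cells in total,
# of which only 27,543 are non-zero.
terms     <- 3621
docs      <- 3200
nonsparse <- 27543
total  <- terms * docs          # 11,587,200 cells
sparse <- total - nonsparse     # 11,559,657 zero entries, matching Table 1
round(100 * sparse / total, 2)  # about 99.76%, i.e. "nearly 100%" sparsity
```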
We can also see the layout of the whole tweet set in figure 2; the tweets are mainly located in two parts. Because of the large amount of data, we cannot distinguish the individual words clearly. Therefore, we select some terms from the total data and show their distributions, as figures 3 and 4 show. We can see that most terms are connected within a bounded zone, which means that they are more or less associated.
5. FREQUENT TERMS AND ASSOCIATIONS
Based on the above data processing, we now show the frequent words. Note that there are 3200 tweets in total.
We first choose the words that appear more than 100 times; the results are shown in table 2. We can see, for example, that the counts of "parisattack", "pour" and "victim" are all more than 100, which means they have high frequency under the topic "#prayforparis".
In a further step, we show the counts of all words that appear at least 100 times; the result is shown in figure 5. Since figure 5 contains so many terms that we cannot read the count of each one, we select only 70 terms and show the counts of the words that appear at least 100
Figure 3: layout-1 of some parts selected from whole
tweets
Figure 4: layout-2 of some parts selected from whole
tweets
Figure 5: Total words that appear at least 100 times
Figure 6: Selecting some Words that appear at least
100 times
lose over papajackadvic struggl trust
0.56 0.56 0.56 0.56 0.56
worri prayfor think hope simoncowel
0.56 0.40 0.40 0.32 0.29
scare stay
0.28 0.25
Table 3: words associated with "pray" with correlation no less than 0.25
[1]  "à"              "attentat"     "aux"          "ça"
[5]  "de"             "déjà"         "et"           "everyon"
[9]  "fait"           "franc"        "go"           "il"
[13] "jamai"          "jour"         "la"           "les"
[17] "louistomlinson" "moi"          "ne"           "noubliera"
[21] "novembr"        "pari"         "parisattack"  "pas"
[25] "pensé"          "pour"         "prayforpari"  "que"
[29] "rt"             "simoncr"      "thought"      "un"
[33] "victim"         "vous"         "y"            "ytbclara"
Table 2: words that appear more than 100 times
times. As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000. The second is "pari", with "parisattack" following. This result indicates that most people care about the Paris attack and pray for Paris.
To find associations among words, we take "pray" as an example, to see which words are associated with "pray" with correlation no less than 0.25.
From table 3, we can see that there are 12 terms, including "lose", "struggl", "trust" and "hope", connected with "pray". Six terms such as "lose", "papajackadvic" and "trust" are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.
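Under the hood, "findAssocs()" reports the correlation between term-occurrence vectors across documents. A toy illustration with made-up binary vectors (the values below are ours, not from the report's data set):

```r
# Does each of 6 hypothetical documents contain the given term?
pray  <- c(1, 1, 0, 1, 0, 0)
hope  <- c(1, 1, 0, 0, 0, 0)
paris <- c(0, 0, 1, 0, 1, 1)

cor(pray, hope)    # positive: the two terms tend to co-occur
cor(pray, paris)   # negative: they tend to appear in different tweets
```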
6. CLUSTERING WORDS
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the plot of the clustering is not crowded with words. We cut the related data into 10 clusters. The agglomeration method is set to Ward's, which minimizes the increase in variance when two clusters are merged.
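The clustering step can be sketched on a toy term-frequency matrix (hypothetical counts; the report's actual run uses the real matrix and k = 10, as in the appendix):

```r
# Hierarchical clustering with Ward's criterion on a toy
# term-frequency matrix (rows = terms, columns = documents),
# then cutting the tree into k groups.
m <- rbind(paris       = c(5, 4, 0, 0),
           parisattack = c(4, 5, 0, 0),
           hope        = c(0, 0, 3, 4),
           pray        = c(0, 0, 4, 3))
fit    <- hclust(dist(scale(m)), method = "ward.D2")
groups <- cutree(fit, k = 2)
groups
# "paris"/"parisattack" fall in one group, "hope"/"pray" in the other
```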
In figure 7, we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some others are clustered into one group, because there are a couple of tweets on the Paris attack. Another group contains "everyone" and "thought", because everyone is focused on this event. We can also see that "moi", "déjà" and "prayforpari" each sit in a single group, which means they have few relationships with the other terms.
Figure 7: cluster (10 groups)
7. EXPERIMENTS ABOUT SENTIMENTS
Figure 8: Emotion categories of #prayforparis
Figure 9: Classification by polarity of #prayforparis
Figure 10: A wordcloud of #prayforparis
Stage 1: Literature survey; determine project topic
Stage 2: R programming and text mining learning
Stage 3: Implementations
Stage 4: Presentation and final report

Qiaoyang Zhang: mainly read references [1], [9], [11], [12] and [16]; sentiment analysis implementation.
Fayan Tao: mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis.
Junyi Lu: mainly read [4], [5] and [15]; data-association analysis and word clustering.
Remark: all of us read [6], [7], [8], [17], [18] and [19].
Table 4: Timetable and working plan
We also conducted an experiment on sentiment in R with the method mentioned in the related works. We loaded a package named "sentiment" in R and analyzed the sentiment of tweets under the hashtag "#prayforparis" on Twitter. We used the "sentiment" package to mine more than 6800 tweets and built a corpus[12] in R, mainly to analyze the related parts of speech, frequencies and correlations. Figure 8 shows the emotion categories of "#prayforparis" obtained with an emotion dictionary. In this figure, we can see that nearly 1000 people felt sad and angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), and that a small number of people felt afraid or surprised.
In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets. In addition, fewer than 500 people used words with no polarity under the hashtag "#prayforparis".
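As a rough sketch of how such dictionary-based polarity classification works (the word lists below are illustrative, not the "sentiment" package's real lexicons, and the function name polarity is ours):

```r
# Tiny illustrative dictionaries; a real lexicon is far larger.
positive <- c("hope", "love", "peace", "support")
negative <- c("sad", "angry", "attack", "terror")

# Score a tweet by counting dictionary hits and label it by the
# sign of (positive matches - negative matches).
polarity <- function(tweet) {
  words <- strsplit(tolower(tweet), "[^[:alpha:]]+")[[1]]
  score <- sum(words %in% positive) - sum(words %in% negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

polarity("Hope and love for Paris")             # -> "positive"
polarity("So sad and angry about the attack")   # -> "negative"
```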
From the word cloud[17] in figure 10, we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the polarized words were concentrated in the categories of sadness, anger and disgust.
From these experimental data, we can draw the conclusion that the general attitude of people around the world toward the terrorist attack is sadness and anger. Most people feel sorry for the victims and pray for the victims in Paris. They are also strongly against terrorism.
8. WORKING PLAN
To finish this project, we made a timetable and working
plan as table 4 shows.
9. CONCLUSION AND FUTURE WORKS
In this report, we apply R to text mining and analysis of "#prayforparis" on Twitter. We first preprocess the data, including data cleaning and word stemming. Then we show tweet term frequencies and associations; we find that "prayforparis" has the highest frequency, and that most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show the layout of the whole tweet set and of some extracted tweets. Additionally, we cluster the tweet topics into 10 groups to see the connections among terms. Since tweets indicate users' attitudes and emotions well, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority hold positive attitudes in the face of this attack, mainly because of hope for a good future for Paris and the whole world as well.
The data we mined is limited to one topic and is not very large, which may result in data incompleteness. Additionally, some problems remain in the data preprocessing; for example, the term-document matrix is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluations. In future work, we plan to develop a better model or algorithm that can be used to mine and analyze different kinds of social network data with R. We will also focus on improving the data preprocessing, so as to make the results more precise.
10. ACKNOWLEDGMENT
We wish to thank Dr. Hong-Ning DAI for his patient
guidance and vital suggestions on this report.
11. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. Proceedings of the Workshop on Languages in Social Media, 39(4):620-622, 2011.
[2] V.Aswini, S.K.Lavanya, Pattern Discovery for Text
Mining Computation of Power, Energy, Information
and Communication (ICCPEIC), 2014 International
Conference on IEEE. PP. 412-416. 2014.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, pp. 65-74. 2011. DOI=http://dx.doi.org/10.1145/1935826.1935845
[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining social media data for understanding students' learning experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246-259, 2014.
[5] Kuan-Cheng Lin et al., Mining the user clusters on
Facebook fan pages based on topic and sentiment
analysis. Information Reuse and Integration (IRI),
2014 IEEE 15th International Conference on , vol.,
no., pp.627-632, 13-15 Aug. 2014
[6] I.Feinerer, tm: Text Mining Package. R package
version 0.5-7.1. 2012.
[7] I.Fellows, wordcloud: Word Clouds. R package version
2.0. 2012.
[8] J. Gentry, twitteR: R based Twitter client. R package
version 0.99.19. 2012.
[9] I.Guellil and K.Boukhalfa. Social big data mining: A
survey focused on opinion mining and sentiments
analysis. In Programming and Systems (ISPS), 2015
12th International Symposium on, pp. 1–10, April
2015.
[10] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61-78, 1957.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor :
Trend Detection over the Twitter Stream. Proceeding:
SIGMOD ’10 Proceedings of the 2010 ACM SIGMOD
International Conference on Management of data.
ACM New York, NY. pp. 1155–1157. 2010.
[12] A. Pak and P. Paroubek. Twitter as a corpus for
sentiment analysis and opinion mining. In Seventh
Conference on International Language Resources
Evaluation, 2010.
[13] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: a comparison of microblog search and web search. In Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, pp. 35-44. 2011. DOI=http://dx.doi.org/10.1145/1935826.1935842
[14] S. M. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th international conference on World Wide Web (WWW '11). ACM, New York, NY, USA, pp. 705-714. 2011. DOI=http://dx.doi.org/10.1145/1963405.1963504
[15] Hsin-Ying Wu; Kuan-Liang Liu; C. Trappey,
Understanding customers using Facebook Pages: Data
mining users feedback using text analysis. Computer
Supported Cooperative Work in Design (CSCWD),
Proceedings of the 2014 IEEE 18th International
Conference on , vol., no., pp.346-350, 21-23 May 2014
[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou and Y. Han. User-level sentiment evolution analysis in microblog. Communications, China, vol. 11, no. 12, pp. 152-163. 2011.
[17] Y.C. Zhao, R and Data Mining: Examples and Case
Studies. Published by Elsevier. 2012.
[18] More details about R:
https://www.r-project.org/about.html
[19] More information about stemming:
https://en.wikipedia.org/wiki/Stemming
APPENDIX
A. CODES FOR TEXTMINING
1 l i b r a r y (ROAuth)
2 l i b r a r y ( bitops )
3 l i b r a r y ( RCurl )
4 l i b r a r y ( twitteR )
5 l i b r a r y (NLP)
6 l i b r a r y (tm)
7 l i b r a r y ( RColorBrewer )
8 l i b r a r y ( wordcloud )
9 l i b r a r y (XML)
10 #Set t w i t t e r auth url
11 reqTokenURL <− ”https :// api . t w i t t e r . com/oauth/ request token ”
12 accessTokenURL <− ”https :// api . t w i t t e r . com/oauth/ access token ”
13 authURL <− ”https :// api . t w i t t e r . com/oauth/ authorize ”
14 #Set t w i t t e r key
15 consumerkey <− ”PXoumpl5ndvroikd1DPeGkcqE ”
16 consumerSecret <− ”raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv”
17 accessToken <− ”3954258018−HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq”
18 acce ss Secr e t <− ”K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B”
19 setup twitter oauth ( consumerkey , consumerSecret , accessToken ,
20 +acce ss Secr e t )
21 l i b r a r y ( twitteR )
22 tweets <− searchTwitter ( ”PrayforParis ” , s i nc e = ”2015−11−13” ,
23 + u n t i l = ”2015−12−14” , n = 3200)
24 ( nDocs <− length ( tweets ))
25 #[ 1 ] 3200
26 # convert tweets to a data frame
27 tweets . df <− twListToDF ( tweets )
28 dim( tweets . df )
29 # 3200 16
30 #Text cleaning
31 l i b r a r y (tm)
32 # build a corpus , and s p e c i f y the source to be character vectors
33 myCorpus <− Corpus ( VectorSource ( tweets . df$text ))
34 # convert to lower case
35 # tm v0 .6
36 myCorpus <− tm map(myCorpus , content transformer ( tolower ))
37 # tm v0.5−10
38 # myCorpus <− tm map(myCorpus , tolower )
39 # remove URLs
40 removeURL <− function (x) gsub ( ”http [ ˆ [ : space : ] ] ∗ ” , ”” , x)
41 # tm v0 .6
42 myCorpus <− tm map(myCorpus , content transformer (removeURL ))
43 # tm v0.5−10
44 # myCorpus <− tm map(myCorpus , removeURL)
45 # remove anything other than English l e t t e r s or space
46 removeNumPunct <− function (x) gsub ( ” [ ˆ [ : alpha : ] [ : space : ] ] ∗ ” , ”” , x)
47 myCorpus <− tm map(myCorpus , content transformer (removeNumPunct ))
48 # remove punctuation
49 # myCorpus <− tm map(myCorpus , removePunctuation )
50 # remove numbers
51 # myCorpus <− tm map(myCorpus , removeNumbers )
52 # add two extra stop words : ”pray ” and ”f o r ”
53 myStopwords <− c ( stopwords ( ’ e n g l i s h ’ ) , ”pray ” , ”f o r ”)
54 # remove ”ISIS ” and ”Paris ” from stopwords
55 myStopwords <− s e t d i f f ( myStopwords , c ( ”ISIS ” , ”Paris ”))
56 # remove stopwords from corpus
57 myCorpus <− tm map(myCorpus , removeWords , myStopwords )
58 # remove extra whitespace
59 myCorpus <− tm map(myCorpus , stripWhitespace )
60 # keep a copy of corpus to use l a t e r as a dictionary
61 #f o r stem completion
62 myCorpusCopy <− myCorpus
63 # stem words
64 myCorpus <− tm map(myCorpus , stemDocument )
65 # inspect the f i r s t 5 documents ( tweets )
66 # inspect (myCorpus [ 1 : 5 ] )
67 # The code below i s used f o r to make text f i t f o r paper width
68 f o r ( i in c ( 1 : 2 , 320)) {
69 cat ( paste0 ( ” [ ” , i , ” ] ”))
1
writeLines(strwrap(as.character(myCorpus[[i]]), 60))}
# [1] RT BahutConfess PrayForPari
# [2] FCBayern dontbombsyria isi PrayForUmmah israil spdbpt bbc
#     PrayforPari Merkel franc BVBPAOK saudi
# [320] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForParid
# tm v0.5-10
# myCorpus <- tm_map(myCorpus, stemCompletion)
# tm v0.6
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  # Unexpectedly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this issue.
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
myCorpus <- lapply(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
myCorpus <- Corpus(VectorSource(myCorpus))
# count frequency of "ISIS"
ISISCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<ISIS") })
sum(unlist(ISISCases))
## [1] 8
# count frequency of "pray"
prayCases <- lapply(myCorpusCopy,
                    function(x) { grep(as.character(x), pattern = "\\<pray") })
sum(unlist(prayCases))
## [1] 1136
# replace "Islam" with "ISIS"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "Islam", replacement = "ISIS")
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
tdm
# <<TermDocumentMatrix (terms: 3621, documents: 3200)>>
# Non-/sparse entries: 27543/11559657
# Sparsity           : 100%
# Maximal term length: 38
# Weighting          : term frequency (tf)

# Frequent words and associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
# <<TermDocumentMatrix (terms: 6, documents: 7)>>
# Non-/sparse entries: 2/40
# Sparsity           : 95%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                 Docs
# Terms            10 11 12 13 14 15 16
#   pray            0  1  0  0  0  0  0
#   prayed          0  0  0  0  0  0  0
#   prayer          0  0  0  0  1  0  0
#   prayersburundi  0  0  0  0  0  0  0
#   prayersforfr    0  0  0  0  0  0  0
#   prayersforpari  0  0  0  0  0  0  0

# inspect frequent words
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)
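The frequent-term step above reduces to simple arithmetic: sum each row of the term-document matrix, then keep the terms whose total meets the threshold. A minimal sketch of that logic in plain Python (toy counts, made-up terms):

```python
# Toy term-document matrix: rows = terms, columns = per-document counts.
tdm = {
    "prayforparis": [3, 1, 2, 4],
    "paris":        [1, 0, 2, 1],
    "peace":        [0, 1, 0, 0],
}

# Equivalent of rowSums(as.matrix(tdm)): total frequency per term.
term_freq = {term: sum(counts) for term, counts in tdm.items()}

# Equivalent of subset(term.freq, term.freq >= threshold).
threshold = 4
frequent = {t: f for t, f in term_freq.items() if f >= threshold}
print(frequent)  # {'prayforparis': 10, 'paris': 4}
```

The report uses a threshold of 100; the toy threshold here is only chosen to make the filtering visible.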
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# select some terms
ggplot(df[30:60, ], aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# which words are associated with "pray"?
findAssocs(tdm, "pray", 0.25)
# clustering words
# remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# other methods: complete, average, centroid
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
# > (groups <- cutree(fit, k = 10))
#      l’    attentat           à        déjà          et     everyon
#       1           2           2           3           1           4
#    fait          il       jamai         les         moi   noubliera
#       2           5           2           2           6           2
#    pari parisattack        pour prayforpari          rt     simoncr
#       7           2           1           8           9           2
# thought          un      victim           y    ytbclara
#       4          10           1           5           1

# change tdm to a Boolean matrix
termDocMatrix <- as.matrix(tdm)
# termDocMatrix <- as.matrix(tdm[40:240, 40:240])
# remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = TRUE, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.color <- NA
barplot(table(V(g)$degree))
tdm <- tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# load twitter text
# library(twitteR); load(file = "data/rdmTweets.RData")
# convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix <- as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
dim(termMatrix)
# [1] 3642 3200
# inspect terms numbered 5 to 10
termMatrix[5:10, 5:10]
# Terms             abrahammateomus abzzni accept account acontecem across
#   abrahammateomus               1      0      0       0         0      0
#   abzzni                        0      1      0       0         0      0
#   accept                        0      0      2       0         0      0
#   account                       0      0      0       1         0      0
#   acontecem                     0      0      0       0         2      0
#   across                        0      0      0       0         0      2
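The term-term adjacency step is plain matrix arithmetic: once the term-document matrix is binarized, multiplying it by its transpose gives, for each pair of terms, the number of documents containing both (the diagonal holds each term's document frequency). A small plain-Python sketch with toy data:

```python
# Binarized term-document matrix: rows = terms, columns = documents.
terms = ["pray", "paris", "attack"]
tdm = [
    [1, 1, 0, 1],  # "pray" appears in docs 1, 2, 4
    [1, 0, 1, 1],  # "paris" appears in docs 1, 3, 4
    [0, 0, 1, 1],  # "attack" appears in docs 3, 4
]

n_terms, n_docs = len(tdm), len(tdm[0])
# Equivalent of termMatrix <- termDocMatrix %*% t(termDocMatrix).
term_matrix = [[sum(tdm[i][d] * tdm[j][d] for d in range(n_docs))
                for j in range(n_terms)]
               for i in range(n_terms)]

print(term_matrix)  # [[3, 2, 1], [2, 3, 2], [1, 2, 2]]
```

Here "pray" and "paris" co-occur in two documents, so in a co-occurrence graph that edge would carry weight 2.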
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000)  # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)

termMatrix <- termMatrix[1500:2000, 1500:2000]
# create a graph
# g <- graph.incidence(termDocMatrix, mode = c("all"))
g <- graph.incidence(termMatrix, mode = c("all"))
# get index for term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# set colors and sizes for vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
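The `egam` expression used above maps raw edge weights to (0, 1] transparency values: weights are log-compressed, shifted by 0.2 so that weight-1 edges (log 1 = 0) stay faintly visible, and normalized by the maximum. A quick plain-Python sketch of that scaling with toy weights:

```python
import math

weights = [1, 2, 5, 20]  # toy co-occurrence edge weights

# Equivalent of egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
shifted = [math.log(w) + 0.2 for w in weights]
egam = [s / max(shifted) for s in shifted]

# The heaviest edge gets alpha 1.0; light edges fade but never vanish.
print([round(e, 3) for e in egam])  # [0.063, 0.279, 0.566, 1.0]
```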
############### sentiment analysis ###############
# harvest some tweets
some_tweets <- searchTwitter("#prayforparis", n = 10000, lang = "en")
# get the text
some_txt <- sapply(some_tweets, function(x) x$getText())
# remove retweet entities
some_txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove @people
some_txt <- gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt <- gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt <- gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt <- gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt <- gsub("[ \t]{2,}", "", some_txt)
some_txt <- gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.error <- function(x) {
  # create missing value
  y <- NA
  # tryCatch error
  try_error <- tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y <- tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt <- sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt <- some_txt[!is.na(some_txt)]
names(some_txt) <- NULL
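The cleaning passes above are a chain of regular-expression substitutions. A rough plain-Python equivalent of the same pipeline (the patterns approximate the R ones; link removal runs before punctuation removal so URLs are matched intact):

```python
import re

def clean_tweet(text):
    text = re.sub(r"(RT|via)((?:\b\W*@\w+)+)", "", text)  # retweet entities
    text = re.sub(r"@\w+", "", text)                      # @mentions
    text = re.sub(r"http\S+", "", text)                   # links
    text = re.sub(r"[^\w\s]", "", text)                   # punctuation
    text = re.sub(r"\d+", "", text)                       # numbers
    text = re.sub(r"\s{2,}", " ", text)                   # extra whitespace
    return text.strip().lower()

print(clean_tweet("RT @user: #PrayForParis 2015 http://t.co/x We stand together"))
# prints: prayforparis we stand together
```

Note that the `#` of a hashtag is stripped as punctuation, which is why hashtags survive in the corpus as bare words such as "prayforparis".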
# classify emotion
class_emo <- classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion <- class_emo[, 7]
# substitute NA's by "unknown"
emotion[is.na(emotion)] <- "unknown"
# classify polarity
class_pol <- classify_polarity(some_txt, algorithm = "bayes")
# get polarity best fit
polarity <- class_pol[, 4]
# data frame with results
sent_df <- data.frame(text = some_txt, emotion = emotion,
                      polarity = polarity, stringsAsFactors = FALSE)
# sort data frame
sent_df <- within(sent_df,
                  emotion <- factor(emotion,
                                    levels = names(sort(table(emotion), decreasing = TRUE))))
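The `within(...)` call above only reorders the factor levels of `emotion` by descending frequency, so the bar chart that follows is drawn from the most to the least common emotion. The same reordering in plain Python:

```python
from collections import Counter

emotions = ["sadness", "joy", "sadness", "anger", "sadness", "anger"]

# Equivalent of names(sort(table(emotion), decreasing = TRUE)) in R.
levels = [emo for emo, _ in Counter(emotions).most_common()]
print(levels)  # ['sadness', 'anger', 'joy']
```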
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets") +
  labs(title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets") +
  labs(title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))
# separate text by emotion
emos <- levels(factor(sent_df$emotion))
nemo <- length(emos)
emo.docs <- rep("", nemo)
for (i in 1:nemo) {
  tmp <- some_txt[emotion == emos[i]]
  emo.docs[i] <- paste(tmp, collapse = " ")
}
# remove stopwords
emo.docs <- removeWords(emo.docs, stopwords("english"))
# create corpus
corpus <- Corpus(VectorSource(emo.docs))
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
colnames(tdm) <- emos
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)

Contenu connexe

Tendances

IRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET Journal
 
P036401020107
P036401020107P036401020107
P036401020107theijes
 
Paper id 24201441
Paper id 24201441Paper id 24201441
Paper id 24201441IJRAT
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerankJames Arnold
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisIRJET Journal
 
SEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSSEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSijnlc
 
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESM_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESMiklas Njor
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisTELKOMNIKA JOURNAL
 
Big data analysis of news and social media content
Big data analysis of news and social media contentBig data analysis of news and social media content
Big data analysis of news and social media contentFiras Husseini
 
Prediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksPrediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksMohamed El-Geish
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionFinalyear Projects
 
How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?George Sam
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questionsmoresmile
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEEFINALYEARSTUDENTPROJECTS
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionIJERA Editor
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An IntroductionAli Abbasi
 
Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks   Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks Bohdan Pavlyshenko
 
Hao lyu slides_sarcasm
Hao lyu slides_sarcasmHao lyu slides_sarcasm
Hao lyu slides_sarcasmHao Lyu
 
Social Network Analysis - full show
Social Network Analysis - full showSocial Network Analysis - full show
Social Network Analysis - full showScott Gomer
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Subhajit Sahu
 

Tendances (20)

IRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment AnalysisIRJET - Election Result Prediction using Sentiment Analysis
IRJET - Election Result Prediction using Sentiment Analysis
 
P036401020107
P036401020107P036401020107
P036401020107
 
Paper id 24201441
Paper id 24201441Paper id 24201441
Paper id 24201441
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerank
 
Twitter Based Election Prediction and Analysis
Twitter Based Election Prediction and AnalysisTwitter Based Election Prediction and Analysis
Twitter Based Election Prediction and Analysis
 
SEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGSSEGMENTING TWITTER HASHTAGS
SEGMENTING TWITTER HASHTAGS
 
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRESM_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
M_NJOR_MasterThesis_2015_StackedNewsTriangles_FINAL_LOWRES
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic Analysis
 
Big data analysis of news and social media content
Big data analysis of news and social media contentBig data analysis of news and social media content
Big data analysis of news and social media content
 
Prediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social NetworksPrediction of Reaction towards Textual Posts in Social Networks
Prediction of Reaction towards Textual Posts in Social Networks
 
Discovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detectionDiscovering emerging topics in social streams via link anomaly detection
Discovering emerging topics in social streams via link anomaly detection
 
How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?How Anonymous Can Someone be on Twitter?
How Anonymous Can Someone be on Twitter?
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
 
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
IEEE 2014 JAVA DATA MINING PROJECTS Discovering emerging topics in social str...
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 
Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An Introduction
 
Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks   Data Mining of Informational Stream in Social Networks
Data Mining of Informational Stream in Social Networks
 
Hao lyu slides_sarcasm
Hao lyu slides_sarcasmHao lyu slides_sarcasm
Hao lyu slides_sarcasm
 
Social Network Analysis - full show
Social Network Analysis - full showSocial Network Analysis - full show
Social Network Analysis - full show
 
Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)Deeper Inside PageRank (NOTES)
Deeper Inside PageRank (NOTES)
 

En vedette

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Deolu Adeleye
 
Twitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RTwitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RNikhil Gadkar
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with RYanchang Zhao
 
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsSelectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsWagner Andreas
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Frank Oellien
 
MOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPMOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPAnkita Jadhao
 
Khoury ashg2014
Khoury ashg2014Khoury ashg2014
Khoury ashg2014muink
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Toolsaiaioo
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...NextMove Software
 
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0oriza steva andra
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining Bhawi247
 
Network biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningNetwork biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningLars Juhl Jensen
 
Text Analytics Past, Present & Future
Text Analytics Past, Present & FutureText Analytics Past, Present & Future
Text Analytics Past, Present & FutureSeth Grimes
 
Pingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsPingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsChris Riley ☁
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersSeth Grimes
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and VisualizationSeth Grimes
 
Large-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLarge-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLars Juhl Jensen
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewSeth Grimes
 

En vedette (20)

Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
Brief Lecture on Text Mining and Social Network Analysis with R, by Deolu Ade...
 
Twitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using RTwitter Hashtag #appleindia Text Mining using R
Twitter Hashtag #appleindia Text Mining using R
 
An Introduction to Data Mining with R
An Introduction to Data Mining with RAn Introduction to Data Mining with R
An Introduction to Data Mining with R
 
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsSelectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
 
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
Text Mining - Techniques & Limitations (A Pharmaceutical Industry Viewpoint)
 
MOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLPMOBILE DEVICE FORENSICS USING NLP
MOBILE DEVICE FORENSICS USING NLP
 
Khoury ashg2014
Khoury ashg2014Khoury ashg2014
Khoury ashg2014
 
Vaklipi Text Analytics Tools
Vaklipi Text Analytics ToolsVaklipi Text Analytics Tools
Vaklipi Text Analytics Tools
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
Export, Import Menggunakan Ms.Excel dan Join Data Attibute pada ArcGis 10.0
 
Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
 
Network biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text miningNetwork biology: Large-scale biomedical data and text mining
Network biology: Large-scale biomedical data and text mining
 
Text Analytics Past, Present & Future
Text Analytics Past, Present & FutureText Analytics Past, Present & Future
Text Analytics Past, Present & Future
 
Pingar - The Future of Text Analytics
Pingar - The Future of Text AnalyticsPingar - The Future of Text Analytics
Pingar - The Future of Text Analytics
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and Providers
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and Visualization
 
Large-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effectsLarge-scale data and text mining - Linking proteins, chemicals, and side effects
Large-scale data and text mining - Linking proteins, chemicals, and side effects
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry View
 
Applied text mining
Applied text miningApplied text mining
Applied text mining
 

Similaire à Text mining on Twitter information based on R platform

P11 goonetilleke
P11 goonetillekeP11 goonetilleke
P11 goonetillekeRahul Yadav
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATAanargha gangadharan
 
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAREAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAMary Lis Joseph
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATAParvathy Devaraj
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
A topology based approach twittersdlfkjsdlkfj
A topology based approach twittersdlfkjsdlkfjA topology based approach twittersdlfkjsdlkfj
A topology based approach twittersdlfkjsdlkfjKunal Mittal
 
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET- Identification of Prevalent News from Twitter and Traditional Media us...
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET Journal
 
What Your Tweets Tell Us About You, Speaker Notes
What Your Tweets Tell Us About You, Speaker NotesWhat Your Tweets Tell Us About You, Speaker Notes
What Your Tweets Tell Us About You, Speaker NotesKrisKasianovitz
 
ONLINE TOXIC COMMENTS.pptx
ONLINE TOXIC COMMENTS.pptxONLINE TOXIC COMMENTS.pptx
ONLINE TOXIC COMMENTS.pptxyegnajayasimha21
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxjasoninnes20
 
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxBUS 625 Week 4 Response to Discussion 2Guided Response Your.docx
BUS 625 Week 4 Response to Discussion 2Guided Response Your.docxcurwenmichaela
 
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...
IRJET - Socirank Identifying and Ranking Prevalent News Topics using Social M...IRJET Journal
 
Characterizing microblogs
Characterizing microblogsCharacterizing microblogs
Characterizing microblogsEtico Capital
 
A Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweetsijtsrd
 
A MODEL BASED ON SENTIMENTS ANALYSIS FOR STOCK EXCHANGE PREDICTION - CASE STU...
Text mining on Twitter information based on R platform

Qiaoyang ZHANG* (Computer and Information Science System, Macau University of Science and Technology) 3269046927@qq.com
Fayan TAO† (Computer and Information Science System, Macau University of Science and Technology) fytao2015@gmail.com
Junyi LU‡ (Computer and Information Science System, Macau University of Science and Technology) 448673862@qq.com

ABSTRACT
Twitter is one of the most popular social networks and plays a vital role in this new era, so exploring how information diffuses on Twitter is both attractive and useful.

In this report, we apply R to text mining and analysis of the Twitter topic "#prayforparis". We first preprocess the data, including data cleaning and word stemming. We then examine term frequencies and associations: the word "prayforparis" has the highest frequency, and most of the terms we mined are related to "prayforparis", "paris" and "parisattack". We also show layouts of the whole set of tweets and of some extracted tweets. Additionally, we cluster the tweets into 10 groups to see the connections between different topics. Since tweets are a good indicator of users' attitudes and emotions, we further perform sentiment analysis. We find that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris; moreover, the majority hold positive attitudes in response to the attack.

Keywords
text mining; Twitter; R; "#prayforparis"; sentiment analysis

1. INTRODUCTION AND MOTIVATION
As data mining and big data become hot research topics in this new era, the demands placed on data-analysis techniques rise as well. It is difficult to store and analyze large data sets with traditional database methodologies, so we employ the powerful statistics platform R for big data mining and analysis: R provides many statistical models and data-analysis methods, such as classical statistical tests, time-series analysis, classification and clustering.
*We rank the authors' names by the inverse alphabetical order of the first letter of the authors' last names. Stu ID: 1509853G-II20-0033
†Stu ID: 1509853F-II20-0019
‡Stu ID: 1509853G-II20-0061
ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

In this project we analyze a large social-network data set, focused on Twitter users and their reactions to recent news, and aim to discover characteristics of those tweets. By analyzing a large amount of social-network data, we can gain better knowledge of users' preferences and habits, which is helpful for anyone interested in such data. For example, business firms and companies can provide better services after analyzing similar social-network data. That is why we chose this topic.

2. RELATED WORKS
2.1 Sentiment analysis by searching Twitter and Weibo¹
User-level sentiment evolution has been analyzed on Weibo: ZHANG Lumin, JIA Yan et al.[16] first proposed a multidimensional sentiment model with a hierarchical structure to analyze users' complex sentiments.

Michael Mathioudakis and Nick Koudas[11] presented "TwitterMonitor", a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. "trends") on Twitter in real time and provides meaningful analysis that synthesizes an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own descriptions for each trend.

Twitter, in particular, is currently the major microblogging service, with more than 50 million subscribers.
Twitter users generate short text messages, the so-called "tweets", to report their current thoughts and actions, comment on breaking news and engage in discussions.[11]

Agarwal, Passonneau et al.[1] introduced a model based on tree kernels to analyze the POS-specific prior-polarity features of Twitter data, using a Partial Tree (PT) kernel, first proposed by Moschitti (2006), to calculate the similarity between two trees (see the example in figure 1). They divided the sentiment of tweets into 3 categories: positive, negative and neutral. They marked the sentiment expressed by emoticons with an emoticon dictionary and translated acronyms (e.g. gr8, gr8t = great; lol = laughing out loud) with an acronym dictionary; these dictionaries map emoticons or acronyms to their polarity. They also used an English stop-word list drawn from WordNet to identify stop words, and a sentiment dictionary which has many positive words, negative

¹ This part of the related works is provided by Qiaoyang Zhang.
Figure 1: A tree kernel for a synthesized tweet: "@Fernando this isn't a great day for playing the HARP! :)"

words and neutral words to map words in tweets to their polarity. The accuracy of their model is higher than that of the Unigram model by 4.02%, and its standard deviation is lower than the Unigram model's by 0.52%.

2.2 Study of information diffusion on Twitter²
A number of recent papers have explored information diffusion on Twitter, one of the most popular social networks.

In 2011, Shaomei Wu et al.[14] focused on the production, flow and consumption of information in the context of Twitter. They exploited Twitter "lists" to distinguish elite users (celebrities, media, organizations, bloggers) from ordinary users, and found strong homophily within categories, meaning that each category mainly follows itself. They also re-examined the classical "two-step flow" theory of communications[10], finding considerable support for it on Twitter. Additionally, the lifespans of various URLs were demonstrated under different categories. Finally, they examined the attention paid by the different user categories to different news topics.

This paper sheds clear light on how media information is transmitted on Twitter. The presented approach of defining a limited set of predetermined user categories could be extended to automatic classification schemes. However, they focus on only one narrow cross-section of media information (URLs); it would be better if their methods were also applied to other channels (TV, radio). Another weakness of the paper is the lack of linkage between information flow on Twitter and other sources of outcome data (e.g. users' opinions and actions).

Daniel Ramage et al.[13] studied search behaviors on Twitter, especially the information users prefer to search for. They also compared Twitter search with web search in terms of users' queries.
They found that Twitter results contain more social events and content, while web results include more facts and navigation.

Eytan Bakshy et al.[3] applied a regression model to Twitter data. Studying word-of-mouth marketing, they explored users' influence on Twitter both in communication and through URLs.² They found that the largest cascades tend to be generated by users who have been influential in the past and who have a large number of followers. They also found that URLs that were rated more interesting and/or elicited more positive feelings from workers on Mechanical Turk were more likely to spread.

² This part of the related works is provided by Fayan Tao.

As we can see, the three papers mentioned above all focus on large numbers of tweets and employ different methods to analyze various characteristics of tweets from different aspects. However, they are all limited to Twitter data and do not extend to other social networks.

2.3 Semantic Analysis and Text Mining³
Much research has been done to better understand people's characteristics in specific fields by analyzing the semantics of social-network content. This has many applications, especially for business-marketing purposes.

Topic mining and sentiment analysis have been performed on followers' comments on a company's Facebook fan page; the authors obtained the most frequent terms in each domain (TF, TF-IDF, three sentiments), the sentiment distributions throughout one year, and their relation to "Likes" [5]. This can help marketing staff track the sentiment trend as well as the main sentiment, so that the marketing technique can be adjusted. A Support Vector Machine (SVM) classification model is used in their analysis. Before classification, word segmentation and feature extraction are performed; feature extraction is based on a semantic dictionary and some additional rules.
They found that the sentiment distribution of the comments can be a contributing factor to the distribution of "Likes".

Hsin-Ying Wu et al.[15] presented a method of analyzing Facebook posts that serves as a marketing tool, helping young entrepreneurs identify existing competitors in the market, together with their success factors and features, during the decision-making process. The overall mining process consists of three stages: (1) extracting Facebook posts; (2) text data preprocessing; (3) key-phrase and term filtering and extraction.

In detail, they segmented the original comments into words based on lexicons and on morphological rules for quantifier words and reduplicated words. The words and phrases

³ This part of the related works is provided by Junyi Lu.
are extracted from the text files and transformed into a key-phrase matrix based on frequencies. Next, a k-means clustering algorithm based on the phrase-frequency matrix and phrase similarity is used to identify the most important phrases (i.e. the features and factors of each shop). Various tools are utilized in their study: CKIP for Chinese word segmentation, PERL for extracting the text files and WEKA for key-phrase clustering.

Social-network mining has also been done in the educational field. Chen et al.[4] conducted an initial study on mining tweets to understand students' learning experiences. They first used Radian6, a commercial social-monitoring tool, to acquire students' posts carrying the hashtag #engineeringProblems, collecting 19,799 unique tweets. Because of the ambiguity and complexity of natural language, they conducted an inductive content analysis and categorized the tweets into 5 prominent themes plus one group called "others". The main hashtag, non-letter symbols, repeated letters and stopwords are removed in the preprocessing stage. A multi-label naive Bayesian classifier is used because one tweet can reflect several problems. They then obtained another data set using the geocode of Purdue University with a radius of 1.3 miles to demonstrate the effectiveness of the classifier and to try to detect students' problems. They also demonstrated that the multi-label naive Bayesian classifier performs better than other state-of-the-art classifiers (SVM and M3L) on 4 measures (accuracy, precision, recall, F1). However, there is a main defect in their method, since they assume the categories are independent when transforming the problem into single-label classification problems.

Most text-mining processes are much alike. Generally, text preprocessing is conducted (removal of stopwords, punctuation and odd symbols and characters, plus segmentation) at the beginning.
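The key-phrase clustering step summarized above (a k-means pass over a phrase-frequency matrix, as in Wu et al.'s pipeline) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sample phrases, the naive "first k rows" initialization and the helper names are our own assumptions, and the paper's own implementation used WEKA.

```python
def phrase_frequency_matrix(docs, phrases):
    # One row per phrase, one column per document; entries are raw counts.
    return [[doc.count(p) for doc in docs] for p in phrases]

def kmeans_rows(rows, k, iters=20):
    # A tiny k-means over the frequency rows (squared Euclidean distance).
    # Naive initialization: the first k rows serve as the initial centroids
    # (assumes len(rows) >= k; a real implementation would sample).
    centroids = [list(r) for r in rows[:k]]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assign each phrase row to its nearest centroid.
        for i, row in enumerate(rows):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Phrases with similar frequency profiles across documents end up in the same cluster, which is how the "important phrases" (shop features and factors) get grouped.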
(Some studies, such as sentiment analysis, need part-of-speech tagging.) Then a term-frequency matrix is built from the data set to calculate term frequencies. Finally, classification and clustering are most often used to analyze the data and generate knowledge.

3. TEXT MINING UNDER THE R PLATFORM
3.1 About R
R[18] is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues; R can be considered a different implementation of S.

R provides a wide variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible. R also provides an open-source route to participation in statistical research. R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source-code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.

3.2 The idea
Text mining[2][17] is the discovery of interesting knowledge in text documents. Finding accurate knowledge in unstructured text documents, to help users find what they want, is a challenging issue. Text mining can be defined as the art of extracting data from large amounts of text; it allows one to structure and categorize text content that is initially non-organized and heterogeneous. It is an important data-mining technique and includes some of the most successful techniques for extracting effective patterns.

This report presents examples of text mining with R, using Twitter text on "prayforparis" as the data to analyze. It starts by extracting text from Twitter. The extracted text is then transformed to build a document-term matrix.
After that, frequent words and associations are found in the matrix. Next, words and tweets are clustered to find groups of words and topics of tweets. Finally, a sentiment analysis of the tweets is explored, and a word cloud is used to present the important words in the documents.

In this report, "tweet" and "document" are used interchangeably, as are "word" and "term". Three important packages are used in the examples: twitteR, tm and wordcloud. Package twitteR[8] provides access to Twitter data, tm[6] provides functions for text mining, and wordcloud[7] visualizes the result as a word cloud.

4. IMPLEMENTATIONS⁴
4.1 Data Preprocessing
We first mine 3200 tweets from Twitter by searching for the main topic "prayforparis" over the period 13 Nov 2015 to 13 Dec 2015. Then we perform some data preprocessing.

4.1.1 Data Cleaning
The tweets are first converted to a data frame and then to a corpus, which is a collection of text documents. After that, the corpus undergoes a couple of transformations, including changing letters to lower case, adding "pray" and "for" as extra stop words, and removing URLs, punctuation, numbers, extra whitespace and stop words. Next, we keep a copy of the corpus to use later as a dictionary for stem completion.

4.1.2 Stemming Words
Stemming[19] is the term used in linguistic morphology and information retrieval for the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form). A stemmer for English, for example, should identify the strings "stems", "stemmer", "stemming" and "stemmed" as based on "stem". Stemming makes the words look uniform; in R, stemming can be done with stemDocument(), and the stems can then be completed with stemCompletion().

In the following steps, we use stemCompletion() to complete the stems, with the unstemmed corpus "myCorpusCopy" as the dictionary. With the default setting, it takes the most frequent match in the dictionary as the completion.
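The cleaning and stemming steps of section 4.1 map onto a short language-neutral sketch. The report's actual pipeline uses tm's tm_map/stemDocument in R (see the appendix); below is a hedged Python illustration, with a deliberately tiny stand-in stop-word list and a crude suffix-stripper standing in for a real stemmer, both our own assumptions.

```python
import re

EXTRA_STOPWORDS = {"pray", "for"}  # the report adds these two extra stop words
BASE_STOPWORDS = {"the", "a", "an", "rt", "is", "are"}  # tiny stand-in list

def clean_tweet(text, stopwords=BASE_STOPWORDS | EXTRA_STOPWORDS):
    # Mirror the report's cleaning order: lower-case, drop URLs,
    # keep only letters/whitespace, then remove stop words.
    text = text.lower()
    text = re.sub(r"http[^\s]*", "", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)     # remove anything but letters/space
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)

def stem(word):
    # Crude suffix-stripping stand-in for a real stemmer such as Porter's.
    for suf in ("ming", "ing", "ed", "er", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```

For example, clean_tweet("RT Pray for Paris! http://t.co/xyz #PrayForParis") yields "paris prayforparis", and stem maps both "stems" and "stemming" to "stem", matching the stemmer behavior described above.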
4.1.3 Building a Term-Document Matrix
A term-document matrix shows the relationship between terms and documents: each row stands for a term, each column for a document, and each entry is the number of occurrences of the term in the document.

⁴ All of our implementation code is attached at the end of this report.
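The term-document matrix just described is easy to build directly: the sparsity figure later quoted in table 1 is simply the fraction of zero entries. A minimal sketch, with our own (hypothetical) helper names; the report itself uses tm's TermDocumentMatrix().

```python
from collections import Counter

def term_document_matrix(docs):
    # Rows are terms, columns are documents; entries are raw counts (tf).
    counts = [Counter(d.split()) for d in docs]
    terms = sorted(set().union(*[c.keys() for c in counts]))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def sparsity(matrix):
    # Fraction of zero entries, as reported by tm for a TermDocumentMatrix.
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)
```

On a large corpus most terms never occur in most documents, which is why the report's 3621-term by 3200-document matrix is nearly 100% sparse.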
Table 1: TermDocumentMatrix
  terms: 3621, documents: 3200
  Non-/sparse entries: 27543/11559657
  Sparsity: 100%
  Maximal term length: 38
  Weighting: term frequency (tf)

Figure 2: layout of whole tweets

Alternatively, one can build a document-term matrix by swapping rows and columns. In this report, we build a term-document matrix from the processed corpus above with the function TermDocumentMatrix(). As table 1 shows, there are in total 3621 terms and 3200 documents in the term-document matrix. We can see that it is very sparse, with nearly 100% of the entries being zero, which means that most terms are not contained in any given document.

We can also see the layout of the whole set of tweets in figure 2; they are mainly located in two parts. Because of the large amount of data, we cannot read the words clearly. Therefore, we select some terms from the total data and show their distributions in figures 3 and 4. Most of the terms are connected within a bounded zone, which means they are more or less associated.

5. FREQUENT TERMS AND ASSOCIATIONS
Based on the above processing, we now show the frequent words. Note that there are 3200 tweets in total. We first choose the words that appear more than 100 times; the results are shown in table 2. We can see, for example, that the counts of "parisattack", "pour" and "victim" are all above 100, which means they appear with high frequency under the topic "prayforparis". Going further, we plot the counts of all words that appear at least 100 times; the result is shown in figure 5. Figure 5 contains so many terms that we cannot read the count of each one.
So we choose only 70 terms and show the counts of all words that appear at least 100

Figure 3: layout-1 of some parts selected from whole tweets
Figure 4: layout-2 of some parts selected from whole tweets
Figure 5: Total words that appear at least 100 times
Figure 6: Selecting some words that appear at least 100 times

Table 3: words associated with "pray" with correlation no less than 0.25
  lose 0.56, over 0.56, papajackadvic 0.56, struggl 0.56, trust 0.56,
  worri 0.56, prayfor 0.40, think 0.40, hope 0.32, simoncowel 0.29,
  scare 0.28, stay 0.25
times.

Table 2: words that appear more than 100 times
  [1]  "´ld'"           "attentat" "aux"          "ça"
  [5]  "de"             "déjà"     "et"           "everyon"
  [9]  "fait"           "franc"    "go"           "il"
  [13] "jamai"          "jour"     "la"           "les"
  [17] "louistomlinson" "moi"      "ne"           "noubliera"
  [21] "novembr"        "pari"     "parisattack"  "pas"
  [25] "pensée"         "pour"     "prayforpari"  "que"
  [29] "rt"             "simoncr"  "thought"      "un"
  [33] "victim"         "vous"     "y"            "ytbclara"

As figure 6 shows, it is not surprising that the count of "prayforparis" is the highest, at more than 3000; "pari" is second, followed by "parisattack". This result indicates that most people care about the Paris attack and pray for Paris.

To find associations among words, we take "pray" as an example and look for the words associated with it with correlation no less than 0.25. From table 3 we can see that 12 terms, including "lose", "struggl", "trust" and "hope", are connected with "pray"; six terms such as "lose", "papajackadvic" and "trust" are associated with "pray" with a correlation of 0.56, while "prayfor" and "hope" have correlations of 0.40 and 0.32 with "pray", respectively.

6. CLUSTERING WORDS
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed, so that the plot of the clustering is not crowded with words, and we cut the tree into 10 clusters. The agglomeration method is set to Ward's, which at each step merges the two clusters whose merger gives the smallest increase in variance.

In figure 7 we can see different topics related to "prayforparis" in the tweets. The words "les", "parisattack", "fait" and some others are clustered into one group, because there are a couple of tweets on the Paris attack. Another group contains "everyone" and "thought", because everyone's thoughts are focused on this event. We can also see that "moi", "déjà" and "prayforpari" each sit in a group of their own, which means they have few relationships with the other terms.

Figure 7: cluster (10 groups)

7.
EXPERIMENTS ABOUT SENTIMENTS

Figure 8: Emotion categories of #prayforparis
Figure 9: Classification by polarity of #prayforparis
Figure 10: A word cloud of #prayforparis
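The association scores used in section 5 (tm's findAssocs() with a 0.25 threshold, table 3) are, in essence, correlations between rows of the term-document matrix. A small sketch of that idea using Pearson correlation; the helper names are hypothetical and this is an approximation of findAssocs, not its exact implementation.

```python
import math

def pearson(x, y):
    # Pearson correlation of two equal-length count vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def find_assocs(terms, matrix, target, corlimit):
    # Terms whose per-document count vectors correlate with `target`
    # at or above `corlimit` (roughly what tm's findAssocs reports).
    t_row = matrix[terms.index(target)]
    out = {}
    for term, row in zip(terms, matrix):
        if term == target:
            continue
        r = pearson(t_row, row)
        if r >= corlimit:
            out[term] = round(r, 2)
    return dict(sorted(out.items(), key=lambda kv: -kv[1]))
```

Two terms that tend to appear in the same tweets get a correlation near 1, which is why "prayfor" and "hope" show up as associates of "pray" in table 3.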
Table 4: Timetable and working plan
  Stage 1: literature survey; determine the project topic
  Stage 2: R programming and text-mining learning
  Stage 3: implementations
  Stage 4: presentation and final report
  Qiaoyang Zhang: mainly read references [1], [9], [11], [12] and [16]; sentiment-analysis implementation.
  Fayan Tao: mainly read [2], [3], [10], [13] and [14]; data preprocessing and data analysis.
  Junyi Lu: mainly read [4], [5] and [15]; analyzed term associations and clustered words.
  Remark: all of us read [6], [7], [8], [17], [18] and [19].

We also ran a sentiment experiment in R with the method mentioned in the related works. We loaded the package named "sentiment" in R and analyzed the sentiment of tweets under the hashtag "#prayforparis" on Twitter. Using the "sentiment" package, we mined more than 6800 tweets and established a corpus[12] in R, mainly analyzing the related parts of speech, frequencies and correlations of words.

Figure 8 shows the emotion categories of "#prayforparis" according to the emotion dictionary. In this figure, we can see that nearly 1000 people felt sad and angry about the terrorist attacks in Paris (angry about the terrorist attack by ISIS), while a small number of people felt afraid or surprised. In figure 9, we can see that nearly 5000 people used positive words and more than 1500 people used negative words in their tweets; in addition, fewer than 500 people used words with no polarity under the hashtag "#prayforparis".

From the word cloud[17] in figure 10, we can intuitively see the most frequently used words about "#prayforparis" on Twitter (the larger the font, the more often the word is used in tweets). Most of the emotional words were concentrated in the categories of sadness, anger and disgust. From these experimental data, we can conclude that the general attitude of people around the world toward the terrorist attack is sadness and anger.
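The polarity classification behind figure 9 can be approximated by a simple lexicon count: positive hits minus negative hits per tweet. The "sentiment" R package the report uses is classifier-based and more sophisticated, so the toy lexicon below is purely illustrative and its word lists are our own assumptions.

```python
from collections import Counter

# Tiny stand-in lexicons; a real system would use a full sentiment dictionary.
POSITIVE = {"hope", "love", "peace", "support"}
NEGATIVE = {"sad", "angry", "attack", "terror"}

def polarity(tweet):
    # Score a tweet as positive hits minus negative hits.
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def polarity_distribution(tweets):
    # Counts per polarity class, like the bars in figure 9.
    return Counter(polarity(t) for t in tweets)
```

Aggregating polarity over the whole corpus gives the kind of positive/negative/neutral breakdown reported above.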
Most people feel sorry for the victims in Paris and pray for them; they are also strongly against terrorism.

8. WORKING PLAN
To finish this project, we made a timetable and working plan, shown in table 4.

9. CONCLUSION AND FUTURE WORKS
In this report, we apply R to text mining and analysis of "#prayforparis" on Twitter. We first perform data preprocessing, such as data cleaning and word stemming. We then show term frequencies and associations: "prayforparis" ranks highest in frequency, and most of the words we mined are related to "prayforparis", "paris" and "parisattack". We also show the layout of the whole set of tweets and of some extracted tweets. Additionally, we cluster the tweet topics into 10 groups to see the connections between terms. Since tweets are a good indicator of users' attitudes and emotions, we further perform sentiment analysis, finding that most people expressed sadness and anger about the Paris attack by ISIS and prayed for Paris. As the results show, the majority hold positive attitudes in response to the attack, mainly out of hope for a good future for Paris and for the whole world.

The data we mined are limited to one topic and are not very large, which may lead to data incompleteness. Additionally, some problems remain in the data preprocessing; for example, the term-document matrix is very sparse, which is likely to have a bad influence on the subsequent analysis and evaluation. In future work, we plan to develop a better model or algorithm for mining and analyzing different kinds of social-network data with R. We will also focus on improving the data preprocessing, so that the results become more precise.

10. ACKNOWLEDGMENT
We wish to thank Dr. Hong-Ning DAI for his patient guidance and vital suggestions on this report.

11. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media,
39(4):620-622, 2011.
[2] V. Aswini and S. K. Lavanya. Pattern discovery for text mining. In Computation of Power, Energy, Information and Communication (ICCPEIC), 2014 International Conference on, IEEE, pp. 412-416, 2014.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 65-74, 2011. DOI=http://dx.doi.org/10.1145/1935826.1935845
[4] X. Chen, M. Vorvoreanu, and K. P. C. Madhavan. Mining social media data for understanding students' learning experiences. IEEE Trans. Learn. Technol., vol. 7, no. 3, pp. 246-259, 2014.
[5] Kuan-Cheng Lin et al. Mining the user clusters on Facebook fan pages based on topic and sentiment analysis. In Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on, pp. 627-632, 13-15 Aug. 2014.
[6] I. Feinerer. tm: Text Mining Package. R package version 0.5-7.1. 2012.
[7] I. Fellows. wordcloud: Word Clouds. R package version 2.0. 2012.
[8] J. Gentry. twitteR: R based Twitter client. R package version 0.99.19. 2012.
[9] I. Guellil and K. Boukhalfa. Social big data mining: A survey focused on opinion mining and sentiments analysis. In Programming and Systems (ISPS), 2015
12th International Symposium on, pp. 1-10, April 2015.
[10] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61-78, 1957.
[11] M. Mathioudakis and N. Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), ACM, New York, NY, pp. 1155-1157, 2010.
[12] A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Seventh Conference on International Language Resources and Evaluation, 2010.
[13] J. Teevan, D. Ramage, and M. R. Morris. TwitterSearch: a comparison of microblog search and web search. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, pp. 35-44, 2011. DOI=http://dx.doi.org/10.1145/1935826.1935842
[14] S. M. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web (WWW '11), ACM, New York, NY, USA, pp. 705-714, 2011. DOI=http://dx.doi.org/10.1145/1963405.1963504
[15] Hsin-Ying Wu, Kuan-Liang Liu, and C. Trappey. Understanding customers using Facebook pages: Data mining users feedback using text analysis. In Computer Supported Cooperative Work in Design (CSCWD), Proceedings of the 2014 IEEE 18th International Conference on, pp. 346-350, 21-23 May 2014.
[16] L. M. Zhang, Y. Jia, X. Zhu, B. Zhou, and Y. Han. User-level sentiment evolution analysis in microblog. China Communications, 11(12):152-163, 2011.
[17] Y. C. Zhao. R and Data Mining: Examples and Case Studies. Elsevier, 2012.
[18] More details about R: https://www.r-project.org/about.html
[19] More information about stemming: https://en.wikipedia.org/wiki/Stemming

APPENDIX
A. CODES FOR TEXT MINING
  • 8. 1 l i b r a r y (ROAuth) 2 l i b r a r y ( bitops ) 3 l i b r a r y ( RCurl ) 4 l i b r a r y ( twitteR ) 5 l i b r a r y (NLP) 6 l i b r a r y (tm) 7 l i b r a r y ( RColorBrewer ) 8 l i b r a r y ( wordcloud ) 9 l i b r a r y (XML) 10 #Set t w i t t e r auth url 11 reqTokenURL <− ”https :// api . t w i t t e r . com/oauth/ request token ” 12 accessTokenURL <− ”https :// api . t w i t t e r . com/oauth/ access token ” 13 authURL <− ”https :// api . t w i t t e r . com/oauth/ authorize ” 14 #Set t w i t t e r key 15 consumerkey <− ”PXoumpl5ndvroikd1DPeGkcqE ” 16 consumerSecret <− ”raDtyWXPYBS5zAH0WVjUGKoiObIAEpHroWJ8G6UjlVn5DBdzbv” 17 accessToken <− ”3954258018−HALNbJ0Jo0pPVK844ZvNBnz5yRCXcdyTPKNE4rq” 18 acce ss Secr e t <− ”K45pUUUpWjqwSM0VgQZWDzx7D7F7RN74fB7gDg1EAh05B” 19 setup twitter oauth ( consumerkey , consumerSecret , accessToken , 20 +acce ss Secr e t ) 21 l i b r a r y ( twitteR ) 22 tweets <− searchTwitter ( ”PrayforParis ” , s i nc e = ”2015−11−13” , 23 + u n t i l = ”2015−12−14” , n = 3200) 24 ( nDocs <− length ( tweets )) 25 #[ 1 ] 3200 26 # convert tweets to a data frame 27 tweets . df <− twListToDF ( tweets ) 28 dim( tweets . df ) 29 # 3200 16 30 #Text cleaning 31 l i b r a r y (tm) 32 # build a corpus , and s p e c i f y the source to be character vectors 33 myCorpus <− Corpus ( VectorSource ( tweets . 
df$text )) 34 # convert to lower case 35 # tm v0 .6 36 myCorpus <− tm map(myCorpus , content transformer ( tolower )) 37 # tm v0.5−10 38 # myCorpus <− tm map(myCorpus , tolower ) 39 # remove URLs 40 removeURL <− function (x) gsub ( ”http [ ˆ [ : space : ] ] ∗ ” , ”” , x) 41 # tm v0 .6 42 myCorpus <− tm map(myCorpus , content transformer (removeURL )) 43 # tm v0.5−10 44 # myCorpus <− tm map(myCorpus , removeURL) 45 # remove anything other than English l e t t e r s or space 46 removeNumPunct <− function (x) gsub ( ” [ ˆ [ : alpha : ] [ : space : ] ] ∗ ” , ”” , x) 47 myCorpus <− tm map(myCorpus , content transformer (removeNumPunct )) 48 # remove punctuation 49 # myCorpus <− tm map(myCorpus , removePunctuation ) 50 # remove numbers 51 # myCorpus <− tm map(myCorpus , removeNumbers ) 52 # add two extra stop words : ”pray ” and ”f o r ” 53 myStopwords <− c ( stopwords ( ’ e n g l i s h ’ ) , ”pray ” , ”f o r ”) 54 # remove ”ISIS ” and ”Paris ” from stopwords 55 myStopwords <− s e t d i f f ( myStopwords , c ( ”ISIS ” , ”Paris ”)) 56 # remove stopwords from corpus 57 myCorpus <− tm map(myCorpus , removeWords , myStopwords ) 58 # remove extra whitespace 59 myCorpus <− tm map(myCorpus , stripWhitespace ) 60 # keep a copy of corpus to use l a t e r as a dictionary 61 #f o r stem completion 62 myCorpusCopy <− myCorpus 63 # stem words 64 myCorpus <− tm map(myCorpus , stemDocument ) 65 # inspect the f i r s t 5 documents ( tweets ) 66 # inspect (myCorpus [ 1 : 5 ] ) 67 # The code below i s used f o r to make text f i t f o r paper width 68 f o r ( i in c ( 1 : 2 , 320)) { 69 cat ( paste0 ( ” [ ” , i , ” ] ”))
  • 9. 1 2 writeLines ( strwrap ( as . character (myCorpus [ [ i ] ] ) , 60))} 3 #[ 1 ] RT BahutConfess PrayForPari 4 #[ 2 ] FCBayern dontbombsyria i s i PrayForUmmah i s r a i l spdbpt bbc 5 #PrayforPari Merkel franc BVBPAOK saudi 6 #[ 3 2 0 ] RT RodrigueDLG Rip aux victim du bataclan AMAs PrayForParid 7 # tm v0.5−10 8 # myCorpus <− tm map(myCorpus , stemCompletion ) 9 # tm v0 .6 10 stemCompletion2 <− function (x , dictionary ) { 11 x <− u n l i s t ( s t r s p l i t ( as . character (x ) , ” ”)) 12 # Unexpectedly , stemCompletion completes an empty s t r i n g to 13 # a word in dictionary . Remove empty s t r i n g to avoid above i s s u e . 14 x <− x [ x != ”” ] 15 x <− stemCompletion (x , dictionary=dictionary ) 16 x <− paste (x , sep=”” , c o l l a p s e=” ”) 17 PlainTextDocument ( stripWhitespace (x )) 18 } 19 myCorpus <− lapply (myCorpus , stemCompletion2 , 20 +dictionary=myCorpusCopy) 21 myCorpus <− Corpus ( VectorSource (myCorpus )) 22 # count frequency of ”ISIS ” 23 ISISCases <− lapply (myCorpusCopy , 24 function (x) { grep ( as . character (x ) , pattern = ” <ISIS ”) } ) 25 sum( u n l i s t ( ISISCases )) 26 ## [ 1 ] 8 27 # count frequency of ”pray ” 28 prayCases <− lapply (myCorpusCopy , 29 function (x) { grep ( as . 
character(x), pattern = "\\<pray") })
sum(unlist(prayCases))
## [1] 1136
# replace "Islam" with "ISIS"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "Islam", replacement = "ISIS")
tdm <- TermDocumentMatrix(myCorpus,
                          control = list(wordLengths = c(1, Inf)))
tdm
# <<TermDocumentMatrix (terms: 3621, documents: 3200)>>
# Non-/sparse entries: 27543/11559657
# Sparsity           : 100%
# Maximal term length: 38
# Weighting          : term frequency (tf)
# Frequent Words and Associations
idx <- which(dimnames(tdm)$Terms == "pray")
inspect(tdm[idx + (0:5), 10:16])
#############
# <<TermDocumentMatrix (terms: 6, documents: 7)>>
# Non-/sparse entries: 2/40
# Sparsity           : 95%
# Maximal term length: 14
# Weighting          : term frequency (tf)
#                  Docs
# Terms            10 11 12 13 14 15 16
#   pray            0  1  0  0  0  0  0
#   prayed          0  0  0  0  0  0  0
#   prayer          0  0  0  0  1  0  0
#   prayersburundi  0  0  0  0  0  0  0
#   prayersforfr    0  0  0  0  0  0  0
#   prayersforpari  0  0  0  0  0  0  0
##########
# inspect frequent words (at least 100 occurrences)
(freq.terms <- findFreqTerms(tdm, lowfreq = 100))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 100)
df <- data.frame(term = names(term.freq), freq = term.freq)
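The frequent terms collected above can also be visualised as a word cloud. This is a minimal sketch added for illustration, not part of the report's original listing; it assumes the `wordcloud` package is installed and reuses the `term.freq` vector of terms occurring at least 100 times.

```r
# word-cloud sketch (assumes the wordcloud package is installed);
# reuses the term.freq vector of terms with frequency >= 100
library(wordcloud)
set.seed(375)  # fix the seed so the layout is reproducible
wordcloud(words = names(term.freq), freq = term.freq,
          min.freq = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

Plotting frequent words first keeps the cloud readable; `random.order = FALSE` places the most frequent terms in the centre.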
library(ggplot2)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# select some terms
ggplot(df[30:60, ], aes(x = term, y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
# which words are associated with "pray"?
findAssocs(tdm, 'pray', 0.25)
# clustering words
# remove sparse terms
tdm2 <- removeSparseTerms(tdm, sparse = 0.95)
m2 <- as.matrix(tdm2)
#### cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
# other methods: complete, average, centroid
plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k = 10)
(groups <- cutree(fit, k = 10))
##############################
# > (groups <- cutree(fit, k = 10))
#        l    attentat           a        deja          et     everyon
#        1           2           2           3           1           4
#     fait          il       jamai         les         moi   noubliera
#        2           5           2           2           6           2
#     pari parisattack        pour prayforpari          rt     simoncr
#        7           2           1           8           9           2
#  thought          un      victim           y    ytbclara
#        4          10           1           5           1
##################
# change tdm to a Boolean matrix
termDocMatrix = as.matrix(tdm)
# termDocMatrix = as.matrix(tdm[40:240, 40:240])
# remove "pray", "paris" and "shoot"
idx <- which(dimnames(termDocMatrix)$Terms %in% c("pray", "paris", "shoot"))
M <- termDocMatrix[-idx, ]
# build a tweet-tweet adjacency matrix
tweetMatrix <- t(M) %*% M
library(igraph)
g <- graph.adjacency(tweetMatrix, weighted = T, mode = "undirected")
V(g)$degree <- degree(g)
g <- simplify(g)
# set labels of vertices to tweet IDs
V(g)$label <- V(g)$name
V(g)$label.cex <- 1
V(g)$label.color <- rgb(.4, 0, 0, .7)
V(g)$size <- 2
V(g)$frame.
color <- NA
barplot(table(V(g)$degree))
tdm = tdm[1:200, 1:200]
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# load twitter text
# library(twitteR)  # load(file = "data/rdmTweets.RData")
# convert tweets to a data frame
df <- do.call("rbind", lapply(tdm, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
# termDocMatrix = as.matrix(tdm[40:100, 140:200])
dim(termDocMatrix)
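The 10 term groups earlier in this listing come from hierarchical clustering (`hclust` plus `cutree`); as a cross-check, the same matrix can be clustered with k-means. A sketch added for illustration, assuming the sparse-term-filtered matrix `m2` from the clustering step is still in scope; `k = 10` simply mirrors the cut of the dendrogram.

```r
# k-means clustering of documents as a cross-check of the hclust groups;
# m3 has one row per document and one column per term
m3 <- t(m2)
set.seed(122)  # k-means initialisation is random
kmeansResult <- kmeans(m3, centers = 10)
# the cluster centres show which terms dominate each group
round(kmeansResult$centers, digits = 3)
```

Because k-means partitions directly rather than cutting a dendrogram, the two methods need not produce identical groups; agreement between them is evidence the clusters are stable.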
termDocMatrix[termDocMatrix >= 1] <- 1
# transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
# inspect terms numbered 5 to 10
dim(termMatrix)
# [1] 3642 3200
termMatrix[5:10, 5:10]
################
# Terms             abrahammateomus abzzni accept account acontecem across
#   abrahammateomus               1      0      0       0         0      0
#   abzzni                        0      1      0       0         0      0
#   accept                        0      0      2       0         0      0
#   account                       0      0      0       1         0      0
#   acontecem                     0      0      0       0         2      0
#   across                        0      0      0       0         0      2
##############
library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix, weighted = T, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(30)
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout = layout1)
set.seed(3000)  # 3152
layout2 <- layout.fruchterman.reingold(g)
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
#########################################
termMatrix <- termMatrix[1500:2000, 1500:2000]
# create a two-mode graph; graph.incidence() expects the bipartite
# term-document matrix, not the square term-term matrix
g <- graph.incidence(termDocMatrix, mode = c("all"))
# get index for term vertices and tweet vertices
nTerms <- nrow(M)
nDocs <- ncol(M)
idx.terms <- 1:nTerms
idx.docs <- (nTerms + 1):(nTerms + nDocs)
# set colors and sizes for vertices
V(g)$degree <- degree(g)
V(g)$color[idx.terms] <- rgb(0, 1, 0, .
5)
V(g)$size[idx.terms] <- 6
V(g)$color[idx.docs] <- rgb(1, 0, 0, .4)
V(g)$size[idx.docs] <- 4
V(g)$frame.color <- NA
# set vertex labels and their colors and sizes
V(g)$label <- V(g)$name
V(g)$label.color <- rgb(0, 0, 0, 0.5)
V(g)$label.cex <- 1.4 * V(g)$degree / max(V(g)$degree) + 1
# set edge width and color
E(g)$width <- .3
E(g)$color <- rgb(.5, .5, 0, .3)
set.seed(1500)
plot(g, layout = layout.fruchterman.reingold)
idx <- V(g)$degree == 0
V(g)$label.color[idx] <- rgb(0, 0, .3, .7)
# convert tweets to a data frame
df <- do.call("rbind", lapply(termMatrix, as.data.frame))
# set labels to the IDs and the first 20 characters of tweets
V(g)$label[idx] <- paste(V(g)$name[idx],
                         substr(df$text[idx], 1, 20), sep = ": ")
egam <- (log(E(g)$weight) + .2) / max(log(E(g)$weight) + .2)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
set.seed(3152)
layout2 <- layout.fruchterman.reingold(g)
plot(g, layout = layout2)
############### sentiment analysis #############
# harvest some tweets
some_tweets = searchTwitter("#prayforparis", n = 10000, lang = "en")
# get the text
some_txt = sapply(some_tweets, function(x) x$getText())
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
# remove at people
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
# define "tolower error handling" function
try.error = function(x)
{
  # create missing value
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  # result
  return(y)
}
# lower case using try.error with sapply
some_txt = sapply(some_txt, try.error)
# remove NAs in some_txt
some_txt = some_txt[!is.
na(some_txt)]
names(some_txt) = NULL
# classify emotion
class_emo = classify_emotion(some_txt, algorithm = "bayes", prior = 1.0)
# get emotion best fit
emotion = class_emo[, 7]
# substitute NAs by "unknown"
emotion[is.na(emotion)] = "unknown"
# classify polarity
class_pol = classify_polarity(some_txt, algorithm = "bayes")
# get polarity best fit
polarity = class_pol[, 4]
# data frame with results
sent_df = data.frame(text = some_txt, emotion = emotion,
                     polarity = polarity, stringsAsFactors = FALSE)
# sort data frame
sent_df = within(sent_df,
                 emotion <- factor(emotion,
                   levels = names(sort(table(emotion), decreasing = TRUE))))
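Before plotting, the classification results can be checked numerically; a small sketch added for illustration that only tabulates the `sent_df` data frame built above.

```r
# quick numeric summary of the emotion / polarity classification
table(sent_df$emotion)               # tweets per emotion category
table(sent_df$polarity)              # tweets per polarity category
prop.table(table(sent_df$polarity))  # polarity shares
```

These counts should match the bar heights in the plots that follow, which is a cheap sanity check on the plotting code.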
# plot distribution of emotions
ggplot(sent_df, aes(x = emotion)) +
  geom_bar(aes(y = ..count.., fill = emotion)) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "emotion categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by emotion)") +
  theme(plot.title = element_text(size = 12))
# plot distribution of polarity
ggplot(sent_df, aes(x = polarity)) +
  geom_bar(aes(y = ..count.., fill = polarity)) +
  scale_fill_brewer(palette = "RdGy") +
  labs(x = "polarity categories", y = "number of tweets",
       title = "Sentiment Analysis of Tweets about #prayforparis\n(classification by polarity)") +
  theme(plot.title = element_text(size = 12))
# separate text by emotion
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
  tmp = some_txt[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse = " ")
}
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)