Machine Classification and Analysis of Suicide-Related Communication on Twitter
1. Machine Classification and Analysis of Suicide-Related
Communication on Twitter
Presentation @ ACM Hypertext 2015
Pete Burnap, Gualtiero (Walter) Colombo & Jonathan
Scourfield
Social Data Science Lab
School of Computer Science and Informatics & School of Social
Sciences
Cardiff University
@pbFeed @socdatalab
2. Social Data Science Lab - @socdatalab
• Formed in 2015 out of the Collaborative Online Social Media
Observatory (COSMOS) programme of work (cosmosproject.net)
• Mission is to continue the work of COSMOS in democratising access
to big social data (e.g. Twitter, Foursquare, Instagram) amongst the
academic, private, public and third sectors.
• A significant proportion of research funds has been awarded to
collect and analyse social media data in the contexts of Societal
Safety and Security, e.g. social tension, hate speech, crime
reporting and fear of crime, suicidal ideation
• Working with Metropolitan Police, Department of Health,
Food Standards Agency
3. The Problem
• Our previous research has studied online social
networks as “social machines” that enable spread of
malicious or potentially dangerous information (e.g.
rumour, hate speech, malware)
• Concern about suicide and the Internet has moved from
dedicated suicide websites to general social media
platforms
• Previous research has shown spikes in recorded suicide
rates due to increased risk factors (e.g. celebrity suicide)
4. The Problem
• Normalisation of suicidal language (Daine et al., 2013)
• To date, research has tended to rely either on human coding of
online content, which is difficult to scale to ‘volume’, or on
suicide notes (different state of mind?)
• Social media analysis has yet to distinguish between
different types of suicidal communication
5. Research Aims
• To explore the potential of natural language processing and
machine learning for automated identification and
differentiation of suicide-related communication in very large
social media data sets
• This would enable those responsible for supporting safety and
wellbeing (e.g. Samaritans) to establish a more realistic idea of
the volume of suicidal information online and possibly identify
emerging ‘clusters’
• While computation is essential, the work was driven from the
start by a strong understanding of suicidal
communication/language, developed with established suicide
researchers
6. Developing a classifier for suicide-related social media
content
• Anonymised data from suicide discussion fora
• Human annotated – ‘is this person suicidal?’
• Identify (TF.IDF) terms & phrases from ‘suicidal texts’
• Automated collection of data from Twitter & Tumblr using TF.IDF
terms
• Human annotated sample (n=2000: 1k Twitter + 1k Tumblr) –
coding frame
• c1: Evidence of possible suicidal intent
• c2: Campaigning (i.e. petitions etc.)
• c3: Flippant reference to suicide
• c4: Information or support
• c5: Memorial or condolence
• c6: Reporting news of someone’s suicide (not bombing)
• c7: None of the above
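The TF.IDF step above can be sketched roughly as follows. This is a minimal illustration rather than the authors' pipeline: the toy posts, the tokeniser and the function name `top_tfidf_terms` are all assumptions, and the weighting is the common tf × log(N/df) form, which the slides do not specify.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lower-case word tokens; a deliberately simple, assumed tokeniser.
    return re.findall(r"[a-z']+", text.lower())

def top_tfidf_terms(target_docs, background_docs, k=3):
    """Rank terms from target_docs by tf * log(N / df) over all documents."""
    docs = [tokenize(d) for d in target_docs + background_docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Term frequency counted over the target ("suicidal") texts only.
    tf = Counter()
    for tokens in docs[:len(target_docs)]:
        tf.update(tokens)
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy data, not taken from the study.
target = ["i want to end it all", "i just want to end it"]
background = ["i want to go home", "it is all good to me"]
top_terms = top_tfidf_terms(target, background)
```

Terms that occur in every document (here "to") score zero, so the ranking favours vocabulary that is distinctive of the target texts; the resulting terms could then serve as collection keywords.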
7. Features
(Set 1) Lexical characteristics of the sentences used, such as Parts of
Speech (POS), and other structural language features, such as
the most frequently used words and phrases. References to self
and others are also captured with POS – these terms have been
identified in previous research as being evident within suicidal
communication
(Set 2) Sentiment, affective and emotional features and levels of the
terms used within the text. Emotions such as fear, anger and
general aggressiveness are particularly prominent in suicidal
communication (WordNet Affect)
(Set 3) Language expressed in short, informal text such as social media
posts within a limited number of characters. These were
extracted from annotated Tumblr posts
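As a rough illustration of how a post might be turned into features of the kinds listed above, the sketch below counts self and other references, word-list hits and one regular-expression match. The phrases and the regex are adapted from the Table 5 excerpt later in this deck; the pronoun sets, the `length` feature and the function name `extract_features` are assumptions, and the WordNet Affect features of Set 2 are not covered.

```python
import re

# Pattern adapted from the "regEx class1" entry in the Table 5 excerpt.
REGEX_CLASS1 = re.compile(
    r".+((cutting|depres|sui)|these|bad|sad).+(thoughts|feel).+"
)
# Phrases that appear as "word list1" features in the Table 5 excerpt.
WORD_LIST1 = ["end it all", "want to be dead", "to commit suicide"]
# Assumed pronoun sets standing in for POS-based self/other references.
SELF_REFS = {"i", "me", "my", "myself"}
OTHER_REFS = {"you", "he", "she", "they", "them"}

def extract_features(post):
    """Return a small dictionary of illustrative features for one post."""
    text = post.lower()
    tokens = re.findall(r"[a-z']+", text)
    return {
        "self_refs": sum(t in SELF_REFS for t in tokens),
        "other_refs": sum(t in OTHER_REFS for t in tokens),
        "word_list1_hits": sum(phrase in text for phrase in WORD_LIST1),
        "regex_class1": int(bool(REGEX_CLASS1.search(text))),
        "length": len(post),  # assumed proxy for short, informal text
    }

# Hypothetical example post.
features = extract_features(
    "I keep having these bad thoughts and I want to end it all"
)
```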
8. Machine Classification
• Key question here is: what are the features of suicidal
ideation, and what are the features of the other classes?
• Accuracy important but explanatory value also crucial
• Methods used for the classifier
• Probabilistic (Naïve Bayes), non-probabilistic linear (linear
SVM) and rule-based (Decision Tree) machine classifiers
• Principal Components Analysis (1444 to 255 features)
• Improvement with ‘ensemble’ classifier designed to
incorporate diverse principal components (Rotation Forest)
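Of the baseline methods named above, Naïve Bayes is the simplest to sketch. The example below is a minimal multinomial Naïve Bayes with add-one smoothing over word tokens; it is an illustration only, since the study classified engineered feature vectors rather than raw tokens, and the class name, training posts and labels here are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(count / n_docs)  # class prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    logp += math.log(
                        (self.word_counts[label][t] + 1)
                        / (total + len(self.vocab))
                    )
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical training posts labelled with two of the coding-frame classes.
train_docs = [
    "want to end it all".split(),
    "this homework makes me want to die lol".split(),
    "cant go on want to be dead".split(),
    "lol that exam killed me".split(),
]
train_labels = ["c1", "c3", "c1", "c3"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict("end it all".split())` on this toy data favours c1, while a post containing only "lol" favours c3; real performance depends entirely on the features and training data used.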
11. Classifier accuracy
PCA (combined)
P  0.321  0.345  0.762  0.507
R  0.641  0.385  0.205  0.436
F  0.427  0.364  0.323  0.469
Table 3: Confusion matrix for the best performing classification model
classified as   c1   c2   c3   c4   c5   c6   c7
c1              57    0   16    0    0    0    5
c2               0   19    2    4    0    3    0
c3              13    1  142    0    0    5   16
c4               0    4    5   20    0    3    3
c5               1    1    1    0   31    1    1
c6               0    6    7    6    2   80    3
c7              18    0   20    1    2    4   98
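Per-class precision, recall and F-measure can be derived from a confusion matrix like the one above. The sketch below uses the counts from Table 3 and assumes that rows are the annotated class and columns the predicted class; the slide does not state the orientation, although the F-measure is the same either way. The function names are illustrative.

```python
LABELS = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
# Counts copied from Table 3 (assumed: rows = annotated, columns = predicted).
MATRIX = [
    [57, 0, 16, 0, 0, 0, 5],
    [0, 19, 2, 4, 0, 3, 0],
    [13, 1, 142, 0, 0, 5, 16],
    [0, 4, 5, 20, 0, 3, 3],
    [1, 1, 1, 0, 31, 1, 1],
    [0, 6, 7, 6, 2, 80, 3],
    [18, 0, 20, 1, 2, 4, 98],
]

def per_class_scores(matrix):
    """Return a list of (precision, recall, f_measure) tuples, one per class."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        row_total = sum(matrix[i])                  # all items of class i
        col_total = sum(row[i] for row in matrix)   # all items labelled i
        precision = tp / col_total if col_total else 0.0
        recall = tp / row_total if row_total else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f_measure))
    return scores

def accuracy(matrix):
    """Fraction of items on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

scores = dict(zip(LABELS, per_class_scores(MATRIX)))
overall_accuracy = accuracy(MATRIX)
```

On these counts the c1 F-measure comes out at roughly 0.68, close to but not identical to the 0.690 quoted on this slide; the slide does not say how the reported scores were averaged, which may account for the difference.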
[Excerpt from the paper, Section 6 (Discussion); the remainder of the
excerpt is truncated in this transcript]
In this section we analyse the main feature components produced by
running the PCA procedure on the combined set that resulted in the
best set of results, as shown in Tables 1
F-measure: c1 = 0.690, all classes: 0.728
12. Predictive Features
Table 5: Principal components per class
c1 - Evidence of possible suicidal intent
0.185 word list1 end it all 521 + 0.185 end it all + 0.179 it all now + 0.179 all now + 0.175 it all
0.149 word list1 want to be dead 554 - 0.133 - 0.129 i think + 0.125 word list1 to commit suicide 547 + 0.114 really
0.149 word list1 want to be dead 554 + 0.145 wn affect11 alarm 496 - 0.123 number of adverb superlative 211 - 0.121 word list7 relationship 780 + 0.118 regEx class6 +.+report.+ 701
0.153 thinking about killing + 0.153 about killing myself + 0.153 about killing + 0.147 so im + 0.147 wn affect11 misery 314
0.119 number of predeterminers 206 + 0.117 regEx class1 +.+((cutting|depres|sui)|these|bad|sad).+(thoughts|feel).+ 667 + 0.115 wn domain astrology 160 - 0.106 bombing
0.231 regEx class1 +.+(bdie).+(bmy).+bsleep.+0.177 word list want to be dead 554 - 0.155 wn domain dentistry 113 - 0.146 wn affect11 security 277 - 0.129 wn affect11 admiration
c2 - Campaigning (i.e. petitions etc.)
0.25 word list2 support 746 - 0.134 wn domain racing 84
13. Explanatory features
• Word-lists and regular expressions (regex) extracted from online
suicide-related discussion forums and other microblogging Web
sites provide ‘clues’ effective for the suicidal ideation class
• Lexical and grammar features such as POSs appear mostly
ineffective
• ‘Affective’ language is very relevant (such as that represented by the
WordNet library of ‘cognitive synonyms’) and represents well
the affective and emotional states associated with this particular type
of language.
• Sentiment scores generated by software tools for sentiment
analysis also appear ineffective and are scarcely, if at all,
included within the principal components predictive of each
class
14. Networks of Suicidal Ideation
“…shortest path of retweets of suicidal ideation
was higher than previous studies that reported
on general retweet path length. Our results
found an average of 5, while other research
reported metrics between 2 and 4.8.”
Colombo, G., Burnap, P., Hodorog, A. and Scourfield, J. (2015) ‘Analysing the connectivity and
communication of suicidal users on Twitter’, Computer Communications - available open
access http://tinyurl.com/suicidenetworks
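An average shortest path length of the kind quoted above can be computed with a breadth-first search over a retweet graph. The sketch below is an assumption-laden illustration: the toy graph, the edge direction (from a user to the users who retweeted them) and the function names are hypothetical, and the cited paper's exact metric definition is not given on this slide.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop counts from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def average_path_length(graph):
    """Mean shortest path length over all reachable ordered pairs of nodes."""
    total = pairs = 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical retweet chain: a is retweeted by b, b by c, c by d.
retweets = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
avg_length = average_path_length(retweets)
```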