SubTopic Detection of Tweets Related to an Entity

International Institute of Information Technology-Hyderabad
Mentor: Sandeep Pannem
By: P Yashaswi (201102111), Aayush Asawa (201305617), Kumari Ankita (201101161), Diksha J. Yadav (201125130)

Introduction
➢ Tweets are classified according to the “Topic” and then the “Subtopic” they
refer to.
○ “Topic” refers to any major event in the real world.
○ “Subtopics” are fine-grained aspects of such events.
➢ Mining the subtopics of entities/topics from tweets helps in trend
analysis, social monitoring, topic tracking and reputation
mining.
➢ Tweets related to a particular entity generally share similar keywords, so
keywords alone cannot tell subtopics apart; detecting subtopics requires
additional features.

Work Flow
[Diagram] Training Data → Extract Tweet features → Store features in Lucene
[Diagram] Input Tweet → Extract Tweet features → Classifier (Phase 1, 2, 3) → Detected Subtopic

Approach
Input: A training set of tweets that have subtopic names as class labels, and
test tweets that are to be classified into subtopics.
Output: A subtopic assigned to each of the test tweets.
The entire workflow can be broken into three phases:
1. Pre-processing
2. Feature Extraction and Representation
3. Classification

Feature Extraction
The following features are extracted from each tweet:
➢ Tweet concepts (using the TagMe API)
➢ Named entities and event phrases (using TwiCal)
➢ URL concepts (using the TagMe API on the content of the external links)
➢ Key phrases (noun phrases extracted after POS tagging)
➢ Hashtags
➢ Categories (extracted for the titles obtained through TagMe)
Similarity measures used:
➢ Wikipedia Miner (for comparing Wikipedia titles)
➢ WordNet similarity measure (for comparing key phrases)
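As a rough illustration of how the extracted features might be held together, here is a minimal Python sketch. The `TweetFeatures` record and its field names are hypothetical, chosen only to mirror the list above. Only hashtags are extracted locally; the other fields would be filled by the external tools named above, which are not called here.

```python
import re
from dataclasses import dataclass, field

HASHTAG_RE = re.compile(r"#(\w+)")


@dataclass
class TweetFeatures:
    """One set per feature type listed above (field names are illustrative)."""
    tweet_concepts: set = field(default_factory=set)
    entities_events: set = field(default_factory=set)
    url_concepts: set = field(default_factory=set)
    key_phrases: set = field(default_factory=set)
    hashtags: set = field(default_factory=set)
    categories: set = field(default_factory=set)


def extract_hashtags(tweet):
    """Return the lower-cased hashtags of a tweet, without the leading '#'."""
    return {tag.lower() for tag in HASHTAG_RE.findall(tweet)}


features = TweetFeatures(hashtags=extract_hashtags("New #Volvo #XC90 review #volvo"))
```

Lower-casing the hashtags is a design choice made here so that `#Volvo` and `#volvo` count as the same feature.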

Classification
➢ Subtopic detection is treated as a classification problem in which
subtopics are the class labels and tweets are the data points.
➢ The classifier learns which features the majority of the tweets
(data points) of a particular subtopic (class label) have.
➢ Based on these features, initial seed clusters are created for each topic, and
each cluster is represented as crisp information in an index.
➢ The features of a test tweet are extracted and compared with the clusters, and
the tweet is assigned to the cluster it matches best.
➢ This is done using a machine learning technique.

Pre-Processing
Pre-processing involves the following steps:
➢ Removing stopwords from the tweets and stemming the remaining words.
➢ Extracting URLs from the tweets.
This is done for both training and test tweets.
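The steps above can be sketched in Python as follows. The slides do not name the stopword list or the stemmer that was used, so the tiny `STOPWORDS` set and the crude `naive_stem` suffix stripper below are stand-ins, and the function name `preprocess` is hypothetical.

```python
import re

# Assumption: a tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "for", "and", "its"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z0-9#@']+")


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Return (stemmed tokens without stopwords, URLs found in the tweet)."""
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet).lower()
    tokens = [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens], urls


tokens, urls = preprocess("Volvo is recalling the trucks http://t.co/abc")
```

The URLs are returned separately so that their content can later be fetched for the URL concept features.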

Algorithm
Offline Process
1. All the tweets in the training data are grouped according to their
subtopic.
2. For every tweet in a subtopic, the features are extracted and pooled to
form the subtopic features.
3. The subtopic features of all the subtopics are stored in the Lucene index
under different fields.
4. Features that are common to two or more subtopics are removed, as are
features that are directly related to the entity name.
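The offline steps can be sketched in Python with plain dictionaries standing in for the Lucene index. This is only an illustration under stated assumptions: the function name `build_subtopic_index` is hypothetical, "common to two or more subtopics" is counted per field, and "directly related to the entity name" is approximated by an exact, case-insensitive match against a given set of entity terms.

```python
from collections import defaultdict


def build_subtopic_index(training, entity_terms):
    """training: iterable of (subtopic, features) pairs, where features maps a
    field name (e.g. "hashtags") to a set of strings for one tweet."""
    index = defaultdict(lambda: defaultdict(set))
    # Steps 1-3: group tweets by subtopic and pool their features per field.
    for subtopic, features in training:
        for field_name, values in features.items():
            index[subtopic][field_name] |= set(values)

    # Step 4a: count in how many subtopics each (field, value) pair occurs.
    seen_in = defaultdict(int)
    for fields in index.values():
        for field_name, values in fields.items():
            for value in values:
                seen_in[(field_name, value)] += 1

    # Step 4b: keep features unique to one subtopic and unrelated to the entity.
    entity_terms = {t.lower() for t in entity_terms}
    pruned = {}
    for subtopic, fields in index.items():
        pruned[subtopic] = {
            field_name: {
                v for v in values
                if seen_in[(field_name, v)] == 1 and v.lower() not in entity_terms
            }
            for field_name, values in fields.items()
        }
    return pruned


training = [
    ("recall", {"hashtags": {"volvo", "recall"}}),
    ("recall", {"hashtags": {"safety"}}),
    ("racing", {"hashtags": {"volvo", "race", "safety"}}),
]
index = build_subtopic_index(training, {"Volvo"})
```

In this toy data, `volvo` and `safety` appear under both subtopics and are dropped, leaving only the hashtags that discriminate between them.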

Algorithm
Online Procedure
1. Phase 1: The category features of the test tweet are searched in the Lucene
index and the top 10 subtopics are listed.
2. Phase 2: The tweet concepts and URL concepts of the test tweet are compared
with those of the top 10 subtopics from Phase 1, and the top 5 subtopics are
listed based on the Wikipedia Miner similarity measure.
3. Phase 3: Named entities, key phrases and event phrases are compared with
those of the top 5 subtopics from Phase 2 using WordNet similarity measures.
For hashtags, a direct intersection is computed. The best of the 5 subtopics
is then chosen.
All of these can also be combined to get the best subtopic.
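The three-phase narrowing can be sketched as a cascade of top-k selections. In this Python sketch, plain Jaccard set overlap stands in for the Lucene search, the Wikipedia Miner measure and the WordNet measure, none of which is called here. The function names and field names are hypothetical.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for the similarity measures above."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def top_k(candidates, score, k):
    """Return the k candidates with the highest score (ties keep input order)."""
    return sorted(candidates, key=score, reverse=True)[:k]


def field_score(tweet, index, subtopic, fields):
    """Sum the similarity between tweet and subtopic over the given fields."""
    return sum(jaccard(tweet.get(f, ()), index[subtopic].get(f, ())) for f in fields)


def classify(tweet, index, k1=10, k2=5):
    """tweet and index[subtopic] map field names to sets of strings."""
    # Phase 1: shortlist subtopics by category features.
    phase1 = top_k(list(index),
                   lambda s: field_score(tweet, index, s, ["categories"]), k1)
    # Phase 2: narrow the shortlist by tweet concepts and URL concepts.
    phase2 = top_k(phase1,
                   lambda s: field_score(tweet, index, s,
                                         ["tweet_concepts", "url_concepts"]), k2)
    # Phase 3: pick the best by entities, key/event phrases and hashtags.
    final = top_k(phase2,
                  lambda s: field_score(tweet, index, s,
                                        ["entities", "key_phrases",
                                         "event_phrases", "hashtags"]), 1)
    return final[0] if final else None


index = {
    "recall": {"categories": {"product recall"}, "tweet_concepts": {"airbag"},
               "hashtags": {"recall"}},
    "racing": {"categories": {"motorsport"}, "tweet_concepts": {"touring car"},
               "hashtags": {"race"}},
}
tweet = {"categories": {"product recall"}, "tweet_concepts": {"airbag"},
         "hashtags": {"recall"}}
best = classify(tweet, index)
```

Each phase only re-ranks the survivors of the previous one, so the more expensive comparisons run on at most 10 and then 5 candidates.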

Experiments
➢ The RepLab 2013 data set was used. The dataset contains tweets for 61 entities.
Each entity has about 700 tweets for training and 1500 tweets for testing.
➢ For evaluation we use Reliability, Sensitivity and F measure.
The results that we got for the entity “Volvo” are:
Sensitivity: 0.37, Reliability: 0.39, F measure: 0.38
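The reported F measure is consistent with taking the harmonic mean of Reliability and Sensitivity. The slides do not state the formula, so that reading is an assumption here, and the function name `f_measure` is hypothetical. A short check in Python:

```python
def f_measure(reliability, sensitivity):
    """Harmonic mean of reliability and sensitivity (0.0 if both are zero)."""
    total = reliability + sensitivity
    return 2 * reliability * sensitivity / total if total else 0.0


print(round(f_measure(0.39, 0.37), 2))  # 0.38
```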

Future Work
➢ We can build an SVM classifier that learns which feature should be given
preference while classifying the tweets.
➢ The input vectors would have the various features of the various subtopics
as dimensions, with the corresponding similarity measures as the coefficients,
and the labelled subtopic as the class label.
➢ In the testing phase we can create similar vectors for the test tweets to get
their corresponding subtopics.
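One possible reading of the proposed input vectors, sketched in Python: one dimension per (subtopic, feature field) pair, with a similarity score as the coefficient. The names `build_vector` and `overlap` are hypothetical, and Jaccard overlap again stands in for the actual similarity measures. The resulting vectors, paired with the labelled subtopics, could then be passed to any SVM implementation; none is invoked here.

```python
def overlap(a, b):
    """Jaccard overlap in [0, 1]; a stand-in for the real similarity measures."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_vector(tweet, index, fields, similarity=overlap):
    """One coefficient per (subtopic, field) pair, in a fixed sorted order."""
    return [
        similarity(tweet.get(f, ()), index[s].get(f, ()))
        for s in sorted(index)
        for f in fields
    ]


index = {"a": {"hashtags": {"x"}}, "b": {"hashtags": {"y"}}}
vector = build_vector({"hashtags": {"x"}}, index, ["hashtags", "categories"])
```

Sorting the subtopic names keeps the dimension order identical for training and test vectors.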

References
1. REINA at RepLab2013 Topic Detection Task: Community Detection
2. Entity Tracking in Real-Time using Sub-Topic Detection on Twitter
