Filtering out improper user accounts from Twitter user accounts for discovering individuals interested in a certain topic
1. Filtering out improper user accounts from Twitter user accounts for discovering individuals interested in a certain topic
Chao CAI, Shun SHIRAMATSU
Dept. of Computer Science, Graduate School of Engineering, Nagoya Institute of Technology
2. Background
• Continuously growing demand for participant scouting for online opinion collection (Web-based debate systems, online surveys, etc.)
• Twitter is an SNS with over 45 million monthly active users in Japan who can be latent participants
• Improper user accounts appear among the user accounts collected by certain keywords
• (e.g. official accounts, bots, etc.)
3. Collagree
• Web-based debate system
• Also used by the local government of Nagoya for opinion collection
• We aim to develop a participant invitation agent for it
4. Procedure of invitation agent
Keyword list extraction (or prepared in advance) → Gathering and filtering the initial user account set → More specific classification of the user group → Participant invitation
5. Definition of Improper user account
Official user: specific terms in the user's onscreen name or description
• (e.g. kousiki akkaunto "official account" or a company name)
Inactive user: retweets only campaign contents, usually without a description, with an onscreen name consisting of a random character combination
Robot user: specific terms in the user's onscreen name or description; description and tweet content containing ads or promotions
• (e.g. bot)
6. Approach
• Collecting data with the Twitter search API and streaming API based on keywords or hashtags
• To keep the data balanced (ratio of improper to individual accounts)
• MeCab for tokenization, TF-IDF for vectorization before constructing feature vectors
• Two ways to generate the feature vector: the mixed process and the separated process
• Mixed: processing tweet contents and user information (name and description) as one document
• Separated: processing the two parts as two documents in two different corpora
• Using an RBF-kernel SVM as the learning model
• Performs well in binary classification tasks
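The tokenize-then-vectorize step above can be sketched as follows. This is a minimal illustration only: it assumes pre-tokenized input (MeCab output would be plugged in for Japanese text) and the plain tf · log(n/df) weighting; a real pipeline such as scikit-learn's TfidfVectorizer adds smoothing and normalization.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a small pre-tokenized corpus.

    docs: list of token lists (MeCab output would be plugged in here).
    Returns one {term: weight} dict per document, using tf * log(n/df).
    """
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))              # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors
```

A term that occurs in every document (df = n) gets weight 0, so topic-neutral boilerplate contributes nothing to the feature vector.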
7. Related work
A Machine Learning Approach to Twitter User Classification (Marco Pennacchiotti, 2011)
• Proposed a general model for user profiling and ran a deep analysis of tweets' linguistic contents
• Designed the feature vectors with (1) user information, (2) tweet contents, (3) tweet behavior, and (4) user relationships
• We dealt with (1) and (2) in this research
• Did not consider the description to be good-quality information
• 48% of English users do not have a bio in their description
• Over 50% of avatars were irrelevant to their classification task
• Only aimed at English Twitter users
• Differences in usage habits between English and Japanese users
8. Feature vector design (figure)
Each tweet's data is split into user information (onscreen name & description) and tweet contents; after tokenization and vectorization, each dimension holds the TF-IDF value of one term.
• Mixed: the two parts are processed as one document, yielding a single feature vector
• Separated: the two parts are vectorized into an information vector and a text vector, which are then combined into the feature vector
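The mixed and separated processes can be sketched as below. The function names `mixed_vectors` and `separated_vectors` are hypothetical, chosen for illustration, and the toy `tfidf` helper uses plain tf · log(n/df) weighting on pre-tokenized input.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF dicts for a list of token lists, weighted tf * log(n/df)."""
    n, df = len(docs), Counter()
    for d in docs:
        df.update(set(d))
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def mixed_vectors(infos, tweets):
    """Mixed process: user info and tweet tokens form one document per user."""
    return tfidf([i + t for i, t in zip(infos, tweets)])

def separated_vectors(infos, tweets):
    """Separated process: two corpora, two vectors, then concatenated.

    Keys are prefixed so the same term keeps separate dimensions
    depending on whether it came from the info part or the tweet part.
    """
    iv, tv = tfidf(infos), tfidf(tweets)
    return [{**{("info", t): w for t, w in a.items()},
             **{("text", t): w for t, w in b.items()}}
            for a, b in zip(iv, tv)]
```

The design difference is visible in the keys: the separated process can weight a term like "bot" differently when it appears in the profile versus in a tweet, which is one way it "provides more features of the data".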
9. Training data
• We assumed a particular topic: "child care"
• First collected with the streaming API based on a keyword list (子育て, 育児, 待機児童, 育休, ホームスタート, マタニティ, 出産, 子どもの貧困, シングルマザー, 産後, 保育)
• 269 tweets collected: 210 improper accounts, 59 individual accounts
• Second collected with the Twitter search API based on a hashtag list (#あたしおかあさんだけど, #あたしおかあさんだから, #ぼくおとうさんだから, #おまえおとうさんなのに, #おまえおとうさんだろ) obtained from the Twitter trends
• 400 tweets collected: 37 improper accounts, 363 individual accounts
• We fortunately found hashtags suitable for collecting tweets by individual users
• The data consisted of 669 tweet texts with user information
• 422 accounts are individual and 247 are improper
(Pie charts: first collection — 78% improper / 22% individual; second collection — 9% improper / 91% individual)
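The percentages in the charts follow directly from the counts on this slide; a quick check:

```python
def improper_ratio(improper, total):
    """Share of improper accounts in a collection, as a percentage."""
    return 100 * improper / total

# First collection (streaming API, keyword list): 210 of 269 improper
first = improper_ratio(210, 269)
# Second collection (search API, trending hashtags): 37 of 400 improper
second = improper_ratio(37, 400)
```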
10. Main idea: binary classification based on the contents of individual information and tweets
(Figure: example user account groups — improper users vs. individual users)
11. SVM settings
The experiment ran on 5 different hyperparameter settings using an RBF-kernel SVM
C: the cost parameter
• The cost parameter trades off misclassification of training samples against the complexity of the prediction surface, together with gamma.

           Default                          Setting1   Setting2   Setting3   Setting4
C          1                                2×10^-5    2×10^15    2×10^-5    2×10^15
Gamma      1/n (n: number of dimensions)    2×10^-15   2×10^-15   2×10^3     2×10^3
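The role of gamma can be seen from the standard RBF kernel definition, k(x, y) = exp(-γ‖x − y‖²). The sketch below uses hypothetical points, not the experiment's data, to show why a tiny gamma (as in Setting2) makes all samples look similar to the kernel:

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# With a tiny gamma (Setting1/2), even distant points get a similarity
# near 1, so the SVM needs many support vectors to separate them.
near_one = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=2e-15)

# With a large gamma (Setting3/4), the same pair looks almost unrelated,
# and the decision surface becomes very local.
near_zero = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=2e3)
```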
12. Results of experiments
• Setting2 gave the best F-measure
• In setting2: separated 4 pt higher on recall; mixed 2 pt higher on precision; separated 1 pt higher on F-measure
13. Evaluation
• All settings performed well on recall:
  • due to the imbalance of the data
• Setting2 gave the best-balanced performance on both precision and recall:
  • the large C and small gamma provide more support vectors to deal with the similarity of the data
• Manual labeling influenced the result
  • mistaken labels
• The mixed and separated processes both performed well
  • the separated process provides more features of the data
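Why high recall alone is not enough on imbalanced data follows from the definitions of the metrics; the confusion counts below are hypothetical, not the experiment's results:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts.

    On imbalanced data (e.g. 91% individual accounts in one collection),
    a classifier biased toward the majority class can reach recall 1.0
    while precision reveals its false positives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical: 90 true positives, 10 false positives, 0 false negatives
p, r, f = precision_recall_f1(90, 10, 0)
```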
14. Conclusions
The contents of user information and tweets can be essential factors in the filtering task
• Still not enough when dealing with much more data
Some keywords or hashtags appearing in Twitter trends may help collect individual accounts
• Improper accounts require time to respond to a trend
The model is expected to lack reliability when dealing with enormous data
• The feature vector for each user is simple, considering only one tweet of the user
15. Future work
Propose a method to find hashtags or keywords that yield mostly individual accounts
• Helps collect training data
• May reveal some features of improper accounts
Include tweet behavior and user relationships [Marco Pennacchiotti 2011] in the feature vector design
• Deep learning will be considered if much more training data becomes available
Link with an existing platform (e.g. Collagree)
• Experiment with the system in practice
Editor's notes
Thanks for coming.
First, please let me introduce myself.
My name is xxxxx, from the department of xxxxxx.
Today I want to talk about our own research, the title of which is xxxxxx.
--------------
Let's begin with the background of our research.
Since we are living in an IT society, there is a definitely growing need for xxxxx.
As for where we can find latent participants, we considered social network services such as Twitter.
Twitter, as an SNS, has xxxx whom we want to invite to those events.
But during the collection of user data based on a certain keyword list, a lot of improper users appeared, such as xxxx, whom we want to get rid of.
As we mentioned before, there are a lot of web-based debate systems; in this research, we concentrated on Collagree.
Collagree aims at consensus building and is also used by the Nagoya government to collect regional residents' opinions.
For better use of this platform, we think it would be great if we could invite more people from different locations and with a diversity of backgrounds to offer their new ideas.
So we plan to develop a participant invitation agent for this system.
Here is the procedure of the whole agent.
First, the agent receives a keyword list, which can be prepared by a human or extracted from the introduction of the debate topic.
Then the agent collects the user set based on the list and filters out the improper users.
Before the agent actually sends the invitations, there is a more specific classification of the user group to find out which users can really attend the debate.
Then the agent sends the invitations to the users.
This research focuses on the second part: filtering out the improper users.
So which kind of account is improper?
Here is how we defined improper accounts.
There are three kinds of them.
An official user account is used by a company or public facility; they usually have specific terms in their onscreen name or description.
The second is the inactive user, who only retweets campaign contents for a gift; they usually don't have a description but have a random character combination as their onscreen name.
The last one is the robot user, who is also likely to have specific terms such as "bot" in their description or onscreen name, and there is often ad or promotion information in their tweets.
To filter out these kinds of accounts,
here is the approach of this research.
To begin with, we collect the data with the Twitter search API and streaming API based on keywords or hashtags, to keep the balance of positive and negative samples.
Since the tweets are mainly written in Japanese, we need MeCab for tokenization and use TF-IDF for vectorization.
We propose two ways to generate the feature vector, the mixed process and the separated process, which I would like to demonstrate later.
For the mixed process, we xxxxx.
And for the separated process, we xxxx.
We chose the RBF-kernel SVM as the learning model, since SVMs perform well in binary classification.
There is some related work.
One study was done by Marco Pennacchiotti in 2011.
Xxxxxx
They proposed a general model for user profiling and ran a deep analysis of tweets' linguistic contents.
They designed their feature vector with four parts: xxxxxx.
We dealt with xxxxx in our research.
However, they did not consider xxxxx,
since there are xxxxxxx and over 50% xxxx.
Their research focused on English Twitter users.
But there are definitely a lot of differences between English and Japanese users, such as usage habits and language.
As I mentioned, we utilized user information and tweet contents for the feature vector design.
Here I would like to give you a walkthrough of the design.
First, we get the initial data, which consists of -----------
For the mixed process, we process these two parts as one document to generate one vector, and each dimension is filled with the TF-IDF value of one term.
This is the feature vector of the mixed process.
For the separated process, we process the two parts separately to generate two vectors by TF-IDF.
Of course, each dimension is filled with the TF-IDF value of one term.
Then we combine the two vectors into one.
This is the feature vector of the separated process.
Then we tried this approach in practice.
Here is the training data for the experiment.
We first assumed a topic for debate in Collagree: child care.
Then we collected the data twice.
The first time was with the streaming API, based on the keyword list as you can see.
Among the 269 users, 78 percent are improper.
The second time, we used the Twitter search API based on a hashtag which we happened to find in the Twitter trends.
The hashtag is about a song related to child care.
In this search, 91 percent of all 400 tweets were tweeted by individual users.
So the whole data consisted of 669 tweet contents with user information.
Here are samples from each group in the data.
To do that, we ran an SVM on 5 different hyperparameter settings,
xxxxx with different combinations of C and gamma.
C, by the way, is the cost parameter, which trades off misclassification against the complexity of the prediction surface, together with gamma.
--------
Let's take a look at the results of the experiments.
You can see that setting2 gave the best performance on F-measure,
and the separated process is one point higher than the mixed process in setting2.
On recall, separated is 4 points higher, but on precision, mixed is 2 points higher.
And although all settings performed well on recall,
only setting2 gave a good performance on precision.
We consider that the imbalance and lack of data was the reason why all settings performed well on recall.
As for setting2 giving the best-balanced performance, we think that is because the large C and small gamma provide more support vectors to deal with the similarity of the data.
Since we labeled the data ourselves, the result could be affected by human labeling mistakes.
And although the mixed process and the separated process both performed well, we think the separated process can capture more features of the data.
Here are the conclusions.
We consider that the contents of user information and tweets can be important factors in the filtering task.
But they are still not enough if there is much more data.
Some keyword lists or hashtags in the Twitter trends can probably help collect individual accounts;
we consider that improper accounts may need time to react to those trends.
But the model is expected to be unreliable when dealing with a large amount of data,
since the feature vectors are similar to each other and we only consider one tweet per user.
For our future work,
we would like to find a way to detect hashtags or keywords which can help us find more individual accounts, to help us collect training data; maybe it can also reveal some features of improper users.
We are planning to include tweet behavior and user relationships in the feature vector design, and we are considering introducing deep learning into our research if we can get enough data.
Finally, we want to connect our filter system to the existing platform, in this case Collagree, to evaluate our system in practice.