HU, Xiao (University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_619.htm
Towards Automatic Analysis of Online Discussions among Hong Kong Students
1. Xiao Hu
University of Hong Kong
CITE Research Symposium 2013
May 12, 2013
2. Outline
Goals and Purposes
Data Mining and Applications to Online Discussions
Classification
Association Rule Mining
Findings
More questions to answer
Bridging research and teaching
3. Goals and Purposes
Online discussions are widely used in education
Effective for communication and collaboration
Need tools to monitor online discussions
Data mining may help (semi-)automatically identify
various patterns in online discussions, for example:
Threads that need interventions
Outcome predictions
Role identification (e.g., question raiser, answer provider, etc.)
Network analysis of student groups
Assessment of discussion quality
.....
4. This Study
How effective is it to mine online discussions of HK students?
A case study on
1,965 discussion posts
on the subject of global warming
collected from five primary or secondary schools in
Hong Kong from years 2006-2009
383 discussion threads involving 1 to 21
participants
Two commonly used Data Mining techniques
Classification
Association rule mining
5. What is Data Mining?
To identify patterns (or to prove no patterns) from a
dataset
DM is NOT querying databases
Where you know what you are looking for
E.g., total sales in the past three years
DM is NOT statistical testing
Where you know the hypotheses
E.g. H0: the means of two groups are equal
DM is discovery-based
Find out unknown patterns, generate hypotheses
DM is iterative
exhaustively explore very large data sets
6. Data Mining – Classification
Functionality: to assign one of a number of class
labels to each instance of your data
Examples of classification tasks:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or
fraudulent
Categorizing news stories as finance, weather,
entertainment, sports, etc
Categorizing library materials by catalogs
Predicting whether a post in an online forum will get
replies or not
7. How Does Classification Work?
Given a collection of data (training set)
Each instance contains a set of attributes; one of the attributes is the class label.
Find (calculate) a model for the class label as a
function of the values of other attributes
Goal: previously unseen data can then be fed to
the model and the model assigns a class label
as accurately as possible
Performance measure: accuracy
How many instances are correctly classified
8. An Illustrative Example (1)
Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Training data -> Classification algorithms -> Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
9. An Illustrative Example (2)
Classification algorithms -> Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Unseen data: (Jeff, Professor, 4)
Tenured?
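The two illustrative slides can be sketched in Python as below. The training rows and the rule are the ones shown on the slides; the names `TRAINING_DATA`, `classify`, and `accuracy` are hypothetical, and the rule is hard-coded here rather than learned by a classification algorithm.

```python
# Training data from the illustrative example: (name, rank, years, tenured)
TRAINING_DATA = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]


def classify(rank, years):
    """Apply the slide's rule: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(instances):
    """Fraction of instances whose predicted label matches the known label."""
    correct = sum(
        1 for _, rank, years, label in instances
        if classify(rank, years) == label
    )
    return correct / len(instances)


training_accuracy = accuracy(TRAINING_DATA)  # measured on the training rows only
jeff_prediction = classify("Professor", 4)   # unseen instance (Jeff, Professor, 4)
print(training_accuracy, jeff_prediction)
```

Under this rule the unseen instance is labeled from its rank alone, since 4 years does not satisfy `years > 6`. Accuracy on the training rows says little about unseen data, which is why performance is normally reported on held-out instances.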
10. Classifying Online Discussions (1)
Task1: threads with one vs. many participants
To predict whether a post belongs to a thread
involving only one participant or a thread involving
many (> 14) participants
Attributes used to build classification model
Words in the posts: individual words (unigrams) and pairs of consecutive words (bigrams)
Classification algorithm: Naive Bayes
Empirically effective in text categorization
Performance: 79.07% accuracy
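A minimal sketch of this kind of classifier follows, assuming a multinomial Naive Bayes model with add-one smoothing over unigram and bigram counts. The slides do not name the tool, the smoothing, or the preprocessing used in the study, so the class `NaiveBayes`, the helper `features`, and the toy posts are all hypothetical and only illustrate the idea.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Return the unigrams and bigrams of a post as a list of strings."""
    words = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(features(text))
        self.vocabulary = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for f in features(text):
                if f in self.vocabulary:  # ignore features never seen in training
                    score += math.log((counts[f] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy posts, NOT taken from the study's data set.
train_texts = [
    "carbon dioxide raises the temperature",
    "global warming is caused by carbon dioxide",
    "i agree with you",
    "yes i think so too",
]
train_labels = ["one", "one", "many", "many"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("i agree")
print(prediction)
```

With so few posts the model only memorizes vocabulary; a realistic evaluation would hold out part of the data and report accuracy on it.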
11. Classifying Online Discussions (2)
Task2: initial posts with vs. without replies
To predict whether an initial post is likely to get replies or not
Attributes used to build classification model
Words in the posts: individual words (unigrams) and pairs of consecutive words (bigrams)
Classification algorithm: Naive Bayes
Empirically effective in text categorization
Performance: 64% accuracy
Need to look deeper: mine patterns in each
category
12. Data Mining – Association Rules
Functionality: to find associative relations
between patterns frequently occurring in your
data
{Pattern A} => {Pattern B} with certain probability
Examples of association rule mining tasks:
Basket (shopping cart) analysis: customers buying
product A often also buy product B
Medical diagnosis: a patient with symptoms A is
likely to have disease B
Protein sequences: the appearance of amino acids A indicates a greater chance of also having amino acids C
Online discussions: a post with word or phrase A is
likely to be in class B
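The "certain probability" attached to a rule {A} => {B} is commonly reported as the rule's confidence, alongside its support. The sketch below uses the standard definitions of both measures; the function names and the shopping baskets are invented for illustration and do not come from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Invented shopping baskets for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
print(rule_support, rule_confidence)
```

Mining algorithms typically keep only rules whose support and confidence both exceed user-chosen thresholds.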
13. Mining Association Rules from Online Discussions (1)
Task 1: Words and phrases strongly associated
with threads with one or many participants
Rank | One participant | Many participants
1 | dioxide | i agree
2 | carbon dioxide | agree
3 | carbon | i
4 | temperature | greenhouse gases
5 | global warming | i think
6 | global | think
7 | warming | yes
8 | power | carbon dioxide
9 | air | global warming
10 | water | yeah
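A ranking like the one above could be produced by treating each word or phrase as the left-hand side of a rule {feature} => {class}. The slides do not state which measure or thresholds were used, so this sketch assumes ranking by rule confidence with a minimum occurrence count; `ngrams`, `rank_features`, `min_count`, and the toy posts are hypothetical.

```python
from collections import Counter


def ngrams(text):
    """Unigrams and bigrams of a post, each counted once per post."""
    words = text.lower().split()
    return set(words) | {f"{a} {b}" for a, b in zip(words, words[1:])}


def rank_features(posts, target, min_count=2):
    """Rank features by the confidence of the rule {feature} => {target}.

    `posts` is a list of (text, label) pairs. Features occurring in fewer
    than `min_count` posts are dropped. Ties are broken by how often the
    feature occurs with the target class, then alphabetically.
    """
    total = Counter()
    with_target = Counter()
    for text, label in posts:
        for feature in ngrams(text):
            total[feature] += 1
            if label == target:
                with_target[feature] += 1
    rules = [
        (f, with_target[f] / n, with_target[f])
        for f, n in total.items() if n >= min_count
    ]
    rules.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return [(f, conf) for f, conf, _ in rules]


# Invented toy posts, NOT taken from the study's data set.
posts = [
    ("i agree", "many"),
    ("yes i agree", "many"),
    ("i think carbon dioxide is bad", "many"),
    ("carbon dioxide traps heat", "one"),
    ("carbon dioxide warms the air", "one"),
]

top_many = rank_features(posts, "many")[:3]
top_one = rank_features(posts, "one")[:3]
print(top_many, top_one)
```

Other measures, such as lift, would reorder features that are frequent in both classes, so the choice of measure affects which words reach the top of each column.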
14. Mining Association Rules from Online Discussions (2)
Task 2: Words and phrases strongly associated
with initial posts with or without replies
Rank | Has no reply | Has replies
1 | global warming | protect
2 | earth’s | melt
3 | global | world
4 | warming | warming
5 | earth | sea
6 | s | i
7 | greenhouse | ice
8 | effect | rise
9 | gases | global warming
10 | greenhouse effect | global
15. Findings and future work
Data mining techniques were able to find patterns in online discussions among Hong Kong students
It was feasible to distinguish threads and posts in contrasting categories
The same techniques can be applied to distinguish
Shallow and deep discussions (depth of threads)
Confusion level of posts (need annotations on
training data)
Speech acts of posts (need annotations on training
data)
Emotions in the posts (need annotations on training
data)
16. Integrating Research and Teaching
Both data mining techniques are discussed and
practiced in the Data Mining course in the
Bachelor of Science in Information Management
(BSIM 0018)
The tool used in this project is also taught in the
course
Projects like this can be students’ course projects,