Aum workshop paper_presentation
1. Semantically Enriched Machine Learning Approach to
Filter YouTube Comments for Socially Augmented User
Models
Ahmad Ammari, Vania Dimitrova, Dimoklis
Despotakis
School of Computing, University of Leeds,
Leeds, UK
Presented By:
Ahmad Ammari
User and Community Modelling
School of Computing, University of Leeds,
UK
2. Outline
• The ImREAL Project
• Socially Augmented User Modelling
• Research Objective, Roadmap,
Challenges
• The Social Noise Filtering Approach
– Machine Learning – Based
– Methodology
– Comment Content Pre-Processing
– Semantic Enrichment
– Scoring and Labelling the Training Dataset
• Experimental Description / Results
• Evaluation
• Conclusions & Future Work
3. Immersive Reflective Experience-based Adaptive Learning
Specific Targeted Research Project STReP – FP7
Partners
University of Leeds, UK; Trinity College Dublin, Ireland;
Graz University of Technology, Austria; University of Erlangen-Nuremberg, Germany;
Delft University of Technology, NL; Imaginary SRL - IMA, Italy;
Empower The User, ETU, Ireland
Problem:
Experience in a simulated world is disconnected from the ‘real world’
[Figure: continuum from REALITY to VIRTUALITY; labels: ImREAL, Augmented Reality Approach, Augmented Virtuality]
4. Augmented Simulated Experiential Learning
[Figure: architecture diagram with two sides, Simulated Learning Environment and Real World Experience]
• Simulated Learning Environment side
– Interactive Adaptive Simulated Experiential Learning Environment
– User model
– coach
– Provide content
– Meta-cognitive scaffolding
• Real World Experience side
– Practice
– Records of Real Job-related Experiences
– Other participants (e.g. customers, managers)
• Connecting components
– Augmented user modelling
– Real world activity modelling
5. Augmented User Modelling
Socially Augmented User Modelling
[Figure: diagram relating Open Social Spaces and the Simulated Environment to user models]
• Open Social Spaces: Social Profiles with Weighted Social Interests (Sports, Psychology, Diseases, Politics)
• Simulated Environment: User Profiles
• Existing User Model: Limited Scope!!
• Socially Augmented User Model
6. Broad Research Objective
Mining Social Media Content
generated by Users having awareness
and/or Interest in an Activity Domain
to Derive Social Profiles
that Augment Existing User Models
7. Research Roadmap / Challenges
• Three-Phase Research Roadmap
towards achieving the Broad Objective
[Figure: roadmap of Phase One, Phase Two, and Phase Three; labelled step: Social Noise Filtration]
8. The Social Noise Filtering Approach
• Supervised Machine Learning Model
– Historic Content with known relevance states is
used for training
– Machine Learning Model learns the underlying
rules
– Model is used to predict unknown relevance
states for new content with certain prediction
confidence
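The train-then-predict loop above can be sketched with a tiny multinomial Naive Bayes classifier written from scratch. This is only an illustration under stated assumptions: the four training comments, the function names `train_nb` and `predict_nb`, and the use of the normalised posterior as the "prediction confidence" are all hypothetical and not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_comments):
    """Learn class and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_comments:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return (predicted label, posterior probability used as confidence)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the log scores into probabilities.
    top = max(log_scores.values())
    exp_scores = {l: math.exp(s - top) for l, s in log_scores.items()}
    norm = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / norm

# Hypothetical historic content with known relevance states.
training = [
    ("the interviewer asked about my job experience", "relevant"),
    ("prepare answers before the interview", "relevant"),
    ("lol first comment", "noisy"),
    ("nice music in this video", "noisy"),
]
model = train_nb(training)
label, confidence = predict_nb(model, "my interview experience was good")
print(label, round(confidence, 2))
```

The confidence value lets a downstream filter keep only predictions above a chosen cut-off; where to put that cut-off is a design choice the slides do not specify.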
9. The Social Noise Filtration Service: Methodology
CASE STUDY: Filtering YouTube Comments
Social Media Source: YouTube
Subject Content: Public Comments on Shared Videos
Activity Domain: Job Interview
[Figure: methodology pipeline]
• Public Comments on YouTube → Pre-Process → Term – Comment Matrix (Training Corpus)
• Experimentally Controlled Comments → Analyze → Semantically Enriched Job Interview Bag of Words (JIBoW)
• Both feed the SCORE step, which outputs SCORES
10. YouTube Video Selection
• Selected as part of a research study by
[Despotakis, Lau & Dimitrova, 2011]
• Four Job Interview-related categories are
manually identified from video content
– Guides / Best Practices
– Interviewee’s Stories
– Interviewer’s Stories
– Interview Mock Examples
• Videos from all categories are selected to
retrieve the comment set for ML training
11. Comment Content Pre-Processing
• Objective: Deriving dataset for Classification
• Pipeline steps
– 1: Stop Word Removal
– 2: Stemming
– 3: tfidf Term Weighting
– 4: Comment – Term Matrix (CTM)
• Example
– Input: “I think most Americans are like the first example”
– Output: think – Americans – like – first – example
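The four pre-processing steps can be sketched in plain Python as follows. The slides use the Lucene API for this stage; the code below is a stand-in, and the tiny stop-word list, the crude suffix-stripping stemmer, and the `tf * log(N / df)` weighting are illustrative assumptions rather than the settings actually used.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"i", "most", "are", "the", "a", "is", "of", "to", "and", "in"}

def crude_stem(token):
    """Very rough suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(comment):
    """Steps 1 and 2: lowercase, drop stop words, stem what is left."""
    tokens = [t.strip(".,!?\"'").lower() for t in comment.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

def comment_term_matrix(comments):
    """Steps 3 and 4: tf-idf weights arranged as a comment-by-term matrix."""
    docs = [Counter(preprocess(c)) for c in comments]
    terms = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    matrix = [[d[t] * math.log(n / df[t]) for t in terms] for d in docs]
    return terms, matrix

comments = [
    "I think most Americans are like the first example",
    "The first interview example",
]
tokens = preprocess(comments[0])
terms, matrix = comment_term_matrix(comments)
print(tokens)
print(terms)
```

With this weighting, a term that occurs in every comment gets weight zero, so only terms that distinguish comments from each other carry signal into the classifier.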
12. Semantically Enriched Job Interview
Bag of Words
• A Semantically Enriched Job Interview Bag of Words (JIBoW)
used as Novel Means to Score and Label Training YouTube
Comment Set
• Collection of Textual Comments on Job Interview Videos [*]
– Experimentally controlled
– Closed social space
• Text and Semantic Pre-Processing Phases
• Semantically Expanded by the WordNet Lexicon and DISCO
with Word Synonyms, Antonyms, Derivations, and
semantically similar words
[*] Despotakis, Lau, Dimitrova (2011): A Semantic
Approach to Extract Individual Viewpoints from User
Comments on An Activity, AUM Workshop, UMAP
2011, Girona, Spain
13. Scoring and Labelling Training Corpus
• A Novel Term Frequency – based Mathematical Model
• Computes a Relevance Score for each observation in the
training comment dataset
– Intersection Size between Comment BoW and JIBoW
– Score is Normalized by the Average Intersection Size
• A Threshold is used to classify the comments for
training a binary classifier
• Labels observation (noisy, relevant) accordingly
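A minimal sketch of the scoring and labelling idea, assuming the score is the size of the set intersection between a comment's bag of words and the JIBoW, divided by the average intersection size over the training set. The slides do not give the exact formula or the threshold value, so the threshold of 1.0, the function names, and the small `jibow` set below are hypothetical.

```python
def score_comments(comment_bows, jibow):
    """Relevance score per comment: intersection size / average intersection size."""
    sizes = [len(set(bow) & jibow) for bow in comment_bows]
    average = sum(sizes) / len(sizes) if sizes else 0.0
    return [size / average if average else 0.0 for size in sizes]

def label_comments(scores, threshold=1.0):
    """Binary labels for training: at or above the threshold counts as relevant."""
    return ["relevant" if s >= threshold else "noisy" for s in scores]

# Hypothetical miniature JIBoW and three pre-processed comments.
jibow = {"interviewee", "confident", "job", "experience", "work", "life"}
comment_bows = [
    ["interviewee", "look", "confident", "job", "experience", "work", "life"],
    ["lol", "first"],
    ["nice", "job"],
]
scores = score_comments(comment_bows, jibow)
labels = label_comments(scores)
print(list(zip(labels, [round(s, 2) for s in scores])))
```

Normalising by the average means a score of 1.0 corresponds to a comment with typical overlap, which makes the threshold easier to interpret across corpora of different sizes.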
14. Example Scoring & Labelling
C1: “The interviewee looks confident, he should
have some job experience in his work life”
• Comment BOW: interviewee, confident, job, experience, work, life
• JIBOW: w10, w21, w34, w4, w57, w113, wn
16. Datasets
• YouTube API for Retrieval, Lucene API for Pre-Processing
• Post-Analysis Data Description: YouTube Corpus, Experimentally Controlled Corpus
• Training Corpus: 1159 Instances
– Classified by the scoring model for Training C4.5 & Naïve
Bayes Multinomial (NBM) Classifiers
– {724 Noisy, 435 Relevant}
• Derived a Comment Term Matrix : 1159 Instances X 903
tfidf Term Weights + 1 Discrete Class Column
17. Experimental Results
• For each classifier, models have been trained &
tested with three variations of the Training-to-Testing
ratio
[Figure: ROC Area results]
• The Two Classifiers show good performance
in predicting relevant & noisy comments in the
testing data sets
• C4.5 is slightly better in predicting noisy
comments from within the total noise in the
data
• NBM shows less risk in misclassifying
relevant comments as noise
18. Evaluation
A Human-based Evaluation Experiment was
conducted to measure how well the service:
Goal1: Considers the comments that show
awareness of the application domain (Job
Interviews)
Goal2: Considers the comments whose authors
are likely interested in the application domain
19. Evaluation Results
Number of Evaluators: 2
Number of Evaluated Comments (15% of Whole Dataset): 180
Number of Comments Scored as Relevant Comments: 90
Number of Comments Scored as Noisy Comments: 90
[Figure: four pie charts (Evaluator 1 and Evaluator 2, each for Goal 1 and Goal 2) showing the share of comments judged Noisy, Relevant, or Doesn't know]
Metrics (two panels, one per evaluator, in the order shown on the slide):
                        First panel         Second panel
Metric                  Goal 2    Goal 1    Goal 2    Goal 1
Total Match Rate        51.1%     68.3%     32.2%     60.0%
Total Mismatch Rate     48.9%     31.7%     67.8%     40.0%
Precision (Noisy)       42.2%     76.7%     36.7%     90.6%
Precision (Relevant)    76.7%     63.3%     73.3%     44.4%
Recall (Noisy)          73.1%     67.6%     84.6%     68.2%
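As a rough illustration of how metrics of this kind can be computed, the sketch below compares the service's labels with one evaluator's judgements. The slides do not define the metrics, so the standard definitions are assumed here; the toy label lists and the choice to treat an "unknown" judgement as a mismatch are also assumptions.

```python
def match_rate(predicted, judged):
    """Fraction of comments where the service and the evaluator agree."""
    return sum(p == j for p, j in zip(predicted, judged)) / len(predicted)

def precision(predicted, judged, label):
    """Of the comments the service gave this label, the share the evaluator confirms."""
    flagged = [j for p, j in zip(predicted, judged) if p == label]
    return sum(j == label for j in flagged) / len(flagged) if flagged else 0.0

def recall(predicted, judged, label):
    """Of the comments the evaluator gave this label, the share the service found."""
    actual = [p for p, j in zip(predicted, judged) if j == label]
    return sum(p == label for p in actual) / len(actual) if actual else 0.0

# Hypothetical labels for six comments; "unknown" stands for "Doesn't know".
predicted = ["noisy", "noisy", "relevant", "relevant", "noisy", "relevant"]
judged = ["noisy", "relevant", "relevant", "unknown", "noisy", "noisy"]

total_match = match_rate(predicted, judged)
total_mismatch = 1 - total_match
precision_noisy = precision(predicted, judged, "noisy")
precision_relevant = precision(predicted, judged, "relevant")
recall_noisy = recall(predicted, judged, "noisy")
print(total_match, precision_noisy, precision_relevant, recall_noisy)
```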
20. Summary
• Conclusions
– A High Rate of YouTube Video comments is Noisy
– ML Models are good at Predicting and Filtering
out Comments that show neither author
awareness of nor interest in the Activity Domain
• Future Work
– Add more filters to improve the Scoring and
Labelling Mechanism based on Evaluation
Baseline
– Exploit Activity Modelling Ontology to Derive
JIBoW
– Evaluate Impact of Semantic Enrichment
21. YouTube-based Social Profiling Service: Methodology
[Figure: profiling pipeline]
• YouTube / SM Comments → Noise Filtration Service → Comments Predicted as Relevant (RC1 … RCn)
• Profiling Source (Social Profiling Corpus)
– Authors' YT User Profiles
– Uploaded YT Video meta data
– Favored YT Video meta data
– Comments on the YT Videos
• Associations of Frequent Characteristics
• Clusters of Social Profiles (Profile1, Profile2, ProfileN)
• ImREAL Simulators