Anticipating Discussion Activity on Community Forums
1. Anticipating Discussion Activity on Community Forums
Matthew Rowe, Sofia Angeletou and Harith Alani
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
The Third IEEE International Conference on Social Computing, MIT, Boston, USA, 2011
2. Community Content
Online communities are now used to:
  Ask questions
  Post opinions and ideas
  Discuss events and current issues
Content analysis in online communities is attractive for:
  Market analysis
  Brand consensus and product opinion
Social network analytics in the US is predicted to reach $1 billion by 2014 (Forrester 2009)
Masses of data are now being published in online communities:
  Facebook has more than 60 million status updates per day (Facebook statistics 2010)
4. The Need for Analysis
Analysts need to know which piece of content will generate the most activity, i.e. the most auspicious or influential
  Helps focus the attention of human and computerised analysts
  What to track?
Need to understand the effect that features (community and content) have on attention to content
  Enables content creators to shape their content in order to maximise impact
  E.g. promoters, government policy makers
RQ1: Which features are key to stimulating discussions?
RQ2: How do these features influence discussion length?
5. Outline
Anticipating Discussion Activity: Approach Overview
  Identifying Seed Posts
  Predicting Discussion Activity
Features
Dataset
  Community Message Board: Boards.ie
Experiments
  1. Identifying Seed Posts
  2. Predicting Discussion Activity
Findings
Conclusions
6. Approach Overview
Two-stage approach to predict discussion activity in online communities:
  1. Identify seed posts, i.e. thread starters that yield a reply
     Will a given post start a discussion?
     What are the properties that seed posts exhibit?
     What parameters tend to trigger a discussion?
  2. Predict discussion activity levels from the identified seed posts
     What is the level of discussion that a seed post will generate?
     What features correlate with heightened discussion activity?
7. Features
For each post, model: a) the author, b) the content and c) the topical concentration of the author
F1: User Features
  In-degree, out-degree: social network properties of the author
  Post count, age, post rate: participation information of the author
F2: Content Features
  Post length, referral count, time of day: surface features of the post
  Complexity: cumulative entropy of terms in the post
  Readability: Gunning Fog index of the post
  Informativeness: TF-IDF measure of terms within the post
  Polarity: average sentiment of terms in the post
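Two of the content features above can be sketched in a few lines. This is only an illustration: the slides do not give exact formulas, so "complexity" is read here as the Shannon entropy of the post's term distribution, the Gunning Fog index uses its standard definition, and the tokenizer and vowel-group syllable counter are crude assumptions. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased alphabetic tokens are good enough for a sketch.
    return re.findall(r"[a-z']+", text.lower())


def term_entropy(text):
    """Shannon entropy (in bits) of the term distribution within one post."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def count_syllables(word):
    # Naive heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def gunning_fog(text):
    """Gunning Fog index: 0.4 * (words per sentence + % of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))
```

A post that repeats a single term has entropy 0, while a post with a wide, evenly used vocabulary scores higher, which matches the "wide vocab" reading of complexity in the notes.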
8. Features (2)
F3: Focus Features
  Topic entropy: the concentration of the author across community forums
    Higher entropy indicates a wider spread of forum activity
    More random distribution, less concentrated
  Topic likelihood: the likelihood that a user posts in a specific forum given the user's post history
    Measures the affinity that a user has with a given forum
    Lower likelihood indicates a user posting on an unfamiliar topic
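The two focus features can be illustrated with a short sketch. It assumes a user's post history is available as a plain list of forum names, one entry per post, and uses maximum-likelihood estimates without smoothing; neither detail is stated in the slides, and the function names are hypothetical.

```python
import math
from collections import Counter


def topic_entropy(history):
    """Entropy (bits) of the user's posts across forums; higher means less focused."""
    counts = Counter(history)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def topic_likelihood(history, forum):
    """Fraction of the user's past posts made in `forum` (0.0 if never seen)."""
    if not history:
        return 0.0
    return Counter(history)[forum] / len(history)


# Hypothetical history: forum name of each post in the window before the new post.
history = ["Rugby", "Rugby", "World of Warcraft", "Japanese Culture"]
focus = topic_entropy(history)                 # spread across three forums
affinity = topic_likelihood(history, "Rugby")  # half of the posts are in Rugby
```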
9. Dataset: Boards.ie
Irish community message board that was established in 1998
Covers a wide array of topics and themes in forums
  E.g. World of Warcraft, Japanese Culture, Rugby
We were provided with the complete dataset spanning 1998-2008 of all posts and forum information
  Focussed on 2006 due to the scale of the entire dataset
No explicit social connections exist in the dataset
  Social network features were built from the reply-to graph
A 6-month window prior to the post date was used to build the user and focus features
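A minimal sketch of deriving in-degree and out-degree from a reply-to graph within a window before the post date. Several details are assumptions rather than the authors' procedure: replies are given as (replier, original author, timestamp) tuples, in-degree counts distinct users who replied to the author, out-degree counts distinct users the author replied to, and "6 months" is approximated as 183 days.

```python
from datetime import datetime, timedelta


def reply_degrees(replies, user, post_date, window_days=183):
    """Return (in_degree, out_degree) for `user` in the window before `post_date`.

    replies: iterable of (replier, original_author, timestamp) tuples
             (a hypothetical layout chosen for this sketch).
    """
    window_start = post_date - timedelta(days=window_days)
    repliers = set()    # distinct users who replied to `user`
    replied_to = set()  # distinct users `user` replied to
    for replier, author, timestamp in replies:
        if not (window_start <= timestamp < post_date):
            continue  # outside the feature window
        if replier == author:
            continue  # ignore self-replies
        if author == user:
            repliers.add(replier)
        if replier == user:
            replied_to.add(author)
    return len(repliers), len(replied_to)
```

Restricting the window to replies made strictly before the post date keeps the features free of information from the discussion being predicted.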
10. 1. Identifying Seed Posts
Will a given post start a discussion? What are the properties that seed posts exhibit?
Experiment Setup:
  Used all thread starter posts from Boards.ie in 2006
  Training/validation/testing sets using a 70/20/10% random split
  Binary classification task: is this a seed post or not?
  Measures: precision, recall, F-measure, area under ROC curve
Performed 2 experiments:
  a) Model Selection
     Tested individual feature sets (user, content, focus) and combinations
  b) Feature Assessment
     Dropped 1 feature at a time and recorded the reduction in F-measure
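The feature assessment step (drop one feature, record the fall in F-measure) can be sketched as follows. The classifier is abstracted behind a hypothetical `fit_predict` callable that trains on a given feature list and returns predictions for the test split; it stands in for whatever model is used and is not part of the original work.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = seed, 0 = non-seed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def feature_drop_assessment(features, fit_predict, y_true):
    """Map each feature to the F1 reduction observed when it is left out.

    fit_predict: hypothetical callable taking a list of feature names and
                 returning predicted labels for the test split.
    """
    baseline = precision_recall_f1(y_true, fit_predict(features))[2]
    reductions = {}
    for name in features:
        kept = [f for f in features if f != name]
        f1 = precision_recall_f1(y_true, fit_predict(kept))[2]
        reductions[name] = baseline - f1
    return reductions
```

Features with the largest reduction are the ones the model relies on most; a reduction near zero suggests the feature is redundant given the others.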
14. 2. Predicting Discussion Activity
What is the level of discussion that a seed post will generate? What features correlate with heightened discussion activity?
Experiment Setup:
  Train: seed posts in the 70% training split
  Test: seed posts in the 20% validation split
  Measure: Normalised Discounted Cumulative Gain (nDCG)
  Look at varying rank positions: nDCG@k, k=1,2,5,10,20,50,100
Performed 2 experiments:
  a) Model Selection
     Regression models: Linear, Isotonic, Support Vector Regression
     Tested individual feature sets (user, content, focus) and combinations
  b) Feature Contributions
     Assess the features in the best performing model from a)
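For reference, nDCG@k can be computed as in the sketch below. It assumes the relevance of a seed post is its true discussion length, with a linear gain and a log2 rank discount; the slides do not state which gain function was used, so treat that choice as an assumption.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(predicted_scores, true_lengths, k):
    """nDCG@k when posts are ranked by predicted score.

    predicted_scores: model output per seed post (higher = more discussion)
    true_lengths:     observed discussion length per seed post (assumed relevance)
    """
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_lengths[i] for i in order]
    ideal = sorted(true_lengths, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg else 0.0
```

A score of 1.0 at a given k means the top k predicted posts appear in the same order as the k posts with the longest actual discussions.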
16. 2.a) Model Selection
[Figure panels: Support Vector Regression, Isotonic, Linear]
17. 2.b) Feature Contributions
What features correlate with heightened discussion activity?
19. Negative sentiment posts generate more activity
20. Conclusions and Future Work
The two-stage approach is able to:
  Identify seed posts to a high degree of accuracy
    F-measure: 0.792
  Predict discussion activity levels
    nDCG@1: 0.89 (linear regression model)
Content and focus features yield the best performing model
  Average nDCG@k: 0.756
Findings inform:
  Market analysts, to track high activity posts from the outset
  Content creators, to shape content in order to maximise impact
Currently applying the approach over different platforms:
  How can we predict activity on a given social web system?
  How do social web systems differ in how they generate activity?
21. Questions?
Web: http://people.kmi.open.ac.uk/rowe
Email: m.c.rowe@open.ac.uk
Twitter: @mattroweshow
Editor's notes
The class distribution is skewed 80% to 20% towards seeds over non-seeds
Content features outperform user features
Content and focus outperform other feature combinations
All features together work best
Differs from the Twitter analysis, where user features were better predictors than content features
Trained J48 with all features using the training split
Tested it on the held-out 10%
Dropped 1 feature at a time from the model and classified the test split
Looking for the features that cause the greatest reduction in accuracy
Boxplots show:
  Higher referral counts correlate with non-seeds
    Spam
  Higher forum likelihood correlates with seeds
    Users who concentrate their discussions within select forums will start a discussion, as they are known to the community
  Higher informativeness correlates with non-seeds
Solitary features:
  User features perform best as the solitary feature set for Linear regression and SVR
  Focus features perform best for Isotonic regression
Combined:
  Content and focus perform best for Linear and Isotonic regression
Smallest SD for content and focus features
A user can expect increased discussion activity if he/she:
  Has low forum entropy
  Has high forum likelihood
  Is negative in his/her posts
  Uses complex language (wide vocabulary, i.e. articulate)