It’s all in the Content: State of the art Best Answer Prediction based on Discretisation of Shallow Linguistic Features
George Gkotsis, Karen Stepanyan, Carlos Pedrinaci, John Domingue, Maria Liakata*
Knowledge Media Institute, The Open University
*Department of Computer Science, University of Warwick
Outline
• Motivation
• Problem description
• Proposed solution
• Evaluation
• Discussion & Conclusion
23-26 June 2014 ACM Web Science Conference 2014 (WebSci14)
Motivation
Questions on social networking sites
Recommendations & opinions
Authoritative responses
Expert & Empirical knowledge
Queries on CQA
Why best answer prediction?
• Information overload
• Increase awareness in the community
• Answer questions more efficiently
• One way to study social media reception
• Plus:
• Finding experts in communities
• Study of language use
• Trend analysis
• …
• Visit 
Problem description
Best answer prediction in Social Q&A
• Binary classification problem
• Is it solved?
• Yes, partially
• Current solutions depend on:
Answer Ratings
• Score, #comments
Knowledge is Future & Unknown
User Ratings
• User Reputation
• UpVotes etc
• Preferential attachment
Knowledge is Past & Not always available
State of the art solutions
“…we observe significant assortativity in the reputations of
co-answerers, relationships between reputation and
answer speed, and that the probability of an answer
being chosen as the best one strongly depends on
temporal characteristics of answer arrivals.”
Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, Jure Leskovec
Discovering Value from Community Activity on Focused Question
Answering Sites: A Case Study of Stack Overflow.
KDD 2012
State of the art solutions (cont.)
“When available, scoring (or rating) features improve
prediction results significantly, which demonstrates the
value of community feedback and reputation for identifying
valuable answers.”
Grégoire Burel, Yulan He, Harith Alani.
Automatic Identification of Best Answers in Online Enquiry
Communities
ESWC 2012
State of the art solutions
Summary
[Bar chart: Average Precision (0% to 100%) by feature category: Linguistic, User Ratings, Answer ratings; annotation: "Our solution"]
StackExchange network (as of 20 June 2014)
SE “is all about getting answers, it’s not a
discussion forum, there’s no chit-chat”
• 123 Q&A sites
• 5,622,330 users
• 9.5 million questions
• 16.3 million answers
• 9.3 million visits per day
Training Dataset
September 2013 dump
StackOverflow & 20 of the most active SE websites
Questions with Accepted Answers
• 4,366,662 Non Accepted Answers
• 3,939,224 Accepted Answers
[Pie chart: Accepted Answers 47%; Non Accepted Answers…]
SE websites
[Bar chart: number of Non Accepted and Accepted answers per SE website; axis from 0 to 200,000]
[Pie chart: StackOverflow 91%; The Rest 9%]
[Bar chart: stackoverflow, Non Accepted Answers vs Accepted Answers; axis from 0 to 8,000,000; data labels 3,375,817 and 3,795,276]
Shallow Linguistic features
• Long history, coming from studies on readability
1. Average number of characters per word
2. Average number of words per sentence
3. Number of words in the longest sentence
4. Answer length
5. Log Likelihood (Pitler and Nenkova, 2008)
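The five features above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the regex tokenizer, the sentence splitter, measuring answer length in characters, and the add-one smoothing of the community vocabulary are all assumptions, and `shallow_features` and `vocab_counts` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simple regex tokenizer, assumed for illustration)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shallow_features(answer, vocab_counts):
    """Compute the five shallow features for one answer.

    vocab_counts: Counter of word frequencies over the community's text.
    """
    words = tokenize(answer)
    # Naive sentence split on ., ! and ? (assumption).
    sentences = [tokenize(s) for s in re.split(r"[.!?]+", answer)]
    sentences = [s for s in sentences if s]  # drop empty fragments
    total = sum(vocab_counts.values())
    vocab_size = len(vocab_counts)

    def prob(word):
        # Add-one smoothing (assumption) so unseen words keep a non-zero probability.
        return (vocab_counts.get(word, 0) + 1) / (total + vocab_size + 1)

    # Unigram log likelihood: sum over distinct words of count * log P(word).
    log_likelihood = sum(c * math.log(prob(w)) for w, c in Counter(words).items())
    return {
        "chars_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "longest_sentence": max((len(s) for s in sentences), default=0),
        "length": len(answer),  # length in characters (assumption)
        "log_likelihood": log_likelihood,
    }


# Toy community vocabulary and answer, invented for the example.
vocab = Counter(tokenize("use a list or a dict it is fast"))
features = shallow_features("Use a list. It is fast and simple.", vocab)
print(features)
```

A lower (more negative) log likelihood means the answer's wording is less typical of the community vocabulary.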
StackOverflow – Activity
StackOverflow – Length
StackOverflow – Log Likelihood
StackOverflow – Characters Per Word
StackOverflow – Longest Sentence
StackOverflow – Words Per Sentence
StackOverflow – Overview of shallow features’ evolution
Shallow features: Observations
• Accepted answers tend to:
• Be longer
• Differ more from the community vocabulary
• Contain shorter words
• Have longer longest sentences
• Have more words per sentence
But how good are shallow features?
• 58% macro precision (our baseline)
• Possible reasons
1. Evolution of language characteristics
• Language becomes more eloquent
2. Variance is huge
3. Universal classifier looks unreachable, e.g.:
• SuperUser average length is 577
• Skeptics average length is 2,154
Proposed solution
Objectives
• Build a classifier which is:
1. Based on linguistic features solely
2. Robust
• Performs as well as other classifiers that use user ratings (past
knowledge) or answer ratings (future knowledge)
3. Universal
• Same classifier applicable to as many SE websites as possible
(domain agnostic)
Feature discretisation
Example for Length
• Group by question
• Sort by Length in descending order
• Rank within each question (LengthD)
Question Id  Answer Id  Length  LengthD
1            4          250     1
1            2          200     2
1            3          150     3
5            6          150     1
5            7          100     2
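The Length example can be reproduced with a short Python sketch: group answers by question, sort each group by the feature in descending order, and use the position as the discretised value. The function name `discretise_by_rank` and the dictionary layout are hypothetical, and ties are broken by input order here, which is an assumption.

```python
from collections import defaultdict


def discretise_by_rank(answers, feature):
    """Return {answer_id: rank}; rank 1 is the largest value within a question."""
    by_question = defaultdict(list)
    for answer in answers:
        by_question[answer["question_id"]].append(answer)

    ranks = {}
    for group in by_question.values():
        # Sort the answers of one question by the feature, largest first.
        ordered = sorted(group, key=lambda a: a[feature], reverse=True)
        for rank, answer in enumerate(ordered, start=1):
            ranks[answer["answer_id"]] = rank
    return ranks


# The data from the slide's Length example.
answers = [
    {"question_id": 1, "answer_id": 2, "length": 200},
    {"question_id": 1, "answer_id": 3, "length": 150},
    {"question_id": 1, "answer_id": 4, "length": 250},
    {"question_id": 5, "answer_id": 6, "length": 150},
    {"question_id": 5, "answer_id": 7, "length": 100},
]
length_d = discretise_by_rank(answers, "length")
print(length_d)  # {4: 1, 2: 2, 3: 3, 6: 1, 7: 2}
```

Answers 3 and 6 share the same raw Length (150) but get different LengthD values (3 and 1), because the rank is relative to the competing answers of the same question.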
Feature discretisation
Information Gain from Discretisation
Category                   Name                Information Gain
Linguistic                 Length              0.0226
Linguistic                 LongestSentence     0.0121
Linguistic                 LL                  0.0053
Linguistic                 WordsPerSentence    0.0048
Linguistic                 CharactersPerWord   0.0052
Linguistic Discretisation  LengthD             0.2168
Linguistic Discretisation  LongestSentenceD    0.1750
Linguistic Discretisation  LLD                 0.1180
Linguistic Discretisation  WordsPerSentenceD   0.1404
Linguistic Discretisation  CharactersPerWordD  0.1162
About a 20x increase in Information Gain on average (from roughly 10x for Length to 29x for WordsPerSentence)
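For reference, the information gain of an already-discrete feature is the class entropy minus the class entropy conditioned on the feature value. The sketch below is a plain-Python illustration of that definition, not the tool used for the figures above. The toy labels are invented and assume the top-ranked answer is the accepted one.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(labels) - H(labels | values) for a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): LengthD ranks and whether each answer was accepted.
length_d = [1, 2, 3, 1, 2]
accepted = [1, 0, 0, 1, 0]
print(information_gain(length_d, accepted))

# A feature that never varies carries no information about the class.
print(information_gain([7, 7, 7, 7, 7], accepted))
```

In the toy data the rank separates the classes perfectly, so the gain equals the full class entropy. Real data is far noisier, as the values in the table show.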
User and answer rating features
Category       Name             Information Gain
Other          Age              0.0539
Other          CreationDateD    0.1575
Other          AnswerCount      0.3270
User Rating    UserReputation   0.0836
User Rating    UserUpVotes      0.0535
User Rating    UserDownVotes    0.0412
User Rating    UserViews        0.0528
User Rating    UserUpDownVotes  0.0508
Answer rating  Score            0.0792
Answer rating  CommentCount     0.0286
Answer rating  ScoreRatio       0.4539
Evaluation
What are we evaluating?
1. Prediction
2. How good is it compared with the SOTA?
3. Generality
1. Prediction – Features used
Feature categories: Linguistic, Linguistic Discretisation, Other, User Rating (Past Knowledge), Answer Rating (Future Knowledge)
1. Prediction
• Classifier was Alternating Decision Trees (ADT)
• Binary, boosting, numerical data
• Weka
• 10-fold cross-validation
• Features used: Linguistic, Linguistic Discretisation, Other
1. Prediction
SE Website                     P     R     FM    AUC
stackoverflow.com              0.82  0.66  0.73  0.85
apple.stackexchange.com        0.84  0.68  0.75  0.86
askubuntu.com                  0.84  0.74  0.79  0.88
drupal.stackexchange.com       0.87  0.79  0.83  0.89
electronics.stackexchange.com  0.79  0.65  0.71  0.84
english.stackexchange.com      0.77  0.52  0.62  0.83
gamedev.stackexchange.com      0.82  0.71  0.76  0.87
gaming.stackexchange.com       0.87  0.79  0.83  0.91
gis.stackexchange.com          0.85  0.73  0.78  0.87
math.stackexchange.com         0.85  0.74  0.79  0.87
mathoverflow.net               0.83  0.70  0.76  0.87
meta.stackoverflow.com         0.87  0.69  0.77  0.87
physics.stackexchange.com      0.86  0.71  0.78  0.88
programmers.stackexchange.com  0.76  0.40  0.52  0.84
serverfault.com                0.83  0.66  0.74  0.85
skeptics.stackexchange.com     0.87  0.83  0.85  0.91
stats.stackexchange.com        0.85  0.79  0.82  0.89
superuser.com                  0.84  0.65  0.73  0.85
tex.stackexchange.com          0.87  0.77  0.82  0.88
unix.stackexchange.com         0.81  0.68  0.74  0.85
wordpress.stackexchange.com    0.88  0.80  0.84  0.89
Macro Average                  0.84  0.70  0.76  0.87
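As a quick sanity check on the table, the sketch below recomputes two of its entries. It assumes FM is the balanced F-measure (harmonic mean of P and R) and that the average row is an unweighted mean over the 21 sites. Both readings agree with the reported numbers but are not stated on the slide.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean: every site counts equally, regardless of its size."""
    return sum(values) / len(values)


# P column of the table above, in row order (21 sites).
precision_per_site = [
    0.82, 0.84, 0.84, 0.87, 0.79, 0.77, 0.82, 0.87, 0.85, 0.85, 0.83,
    0.87, 0.86, 0.76, 0.83, 0.87, 0.85, 0.84, 0.87, 0.81, 0.88,
]

print(round(f_measure(0.82, 0.66), 2))              # stackoverflow.com -> 0.73
print(round(macro_average(precision_per_site), 2))  # -> 0.84
```

Because the average is unweighted, stackoverflow.com contributes no more to it than any of the smaller sites, despite holding most of the answers.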
2. Comparison with other solutions
Case  Features Used                                              P     R     FM    AUC
1     Linguistic                                                 0.58  0.60  0.56  0.60
2     Linguistic & Discretisation                                0.81  0.70  0.74  0.84
3     Linguistic & Discretisation & Other                        0.84  0.70  0.76  0.87
4     Linguistic & Other & User Rating (no discretisation)       0.82  0.69  0.75  0.86
5     Linguistic & Other & User Rating (with discretisation)     0.82  0.72  0.77  0.88
6     All features (Answer and User Rating with discretisation)  0.88  0.85  0.86  0.94
3. Generality
• Leave-one-out
• Trained a classifier for each SE website based on all other SE
websites
(Stackoverflow was evaluated but was excluded from training due to its size)
                                             P     R     FM    AUC
Macro average based on self-training
(results from the first part of evaluation)  0.84  0.70  0.76  0.87
Leave-one-out                                0.83  0.70  0.76  0.87
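The leave-one-out protocol can be sketched as a generator of (held-out site, training sites) pairs. This is a minimal illustration of the splitting logic only. The function name `leave_one_site_out` is hypothetical, and the three-site list is a shortened stand-in for the 21 sites.

```python
def leave_one_site_out(sites, exclude_from_training=()):
    """Yield (held_out, training_sites) for every site.

    Sites in exclude_from_training are still evaluated as held-out sites,
    but never contribute training data.
    """
    for held_out in sites:
        training = [
            site for site in sites
            if site != held_out and site not in exclude_from_training
        ]
        yield held_out, training


sites = ["stackoverflow.com", "askubuntu.com", "superuser.com"]
splits = dict(leave_one_site_out(sites, exclude_from_training={"stackoverflow.com"}))
for held_out, training in splits.items():
    print(held_out, "<-", training)
```

Each held-out site is scored by a classifier that never saw its data, so a leave-one-out row close to the self-training row supports the "universal classifier" objective.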
Discussion & Conclusion
Best Answer prediction
• Community feedback on the answers remains the best
way to determine the best answer, but
• Discretisation reveals a lot more information
• Content features, even shallow ones, CAN be very informative
• Independent from past (not always available) knowledge
• Independent from future knowledge
• Web application/service is under development
[Diagram labels: Best Answer Prediction; User & answer rating; Linguistic features; ?; Proposed solution]
Thank you
http://xkcd.com/386/

 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 

It’s all in the Content: State of the art Best Answer Prediction based on Discretisation of Shallow Linguistic Features

  • 1. It’s all in the Content: State of the art Best Answer Prediction based on Discretisation of Shallow Linguistic Features. George Gkotsis, Karen Stepanyan, Carlos Pedrinaci, John Domingue, Maria Liakata*. Knowledge Media Institute, The Open University; *Department of Computer Science, University of Warwick
  • 2. Outline • Motivation • Problem description • Proposed solution • Evaluation • Discussion & Conclusion 23-26 June 2014 ACM Web Science Conference 2014 (WebSci14)
  • 3. Motivation
  • 4. Questions on social networking sites: Recommendations & opinions; Authoritative responses; Expert & empirical knowledge
  • 5. Queries on CQA
  • 6. Why best answer prediction? • Information overload • Increase awareness in the community • Answer questions more efficiently • One way to study social media reception • Plus: • Finding experts in communities • Study of language use • Trend analysis • … • Visit
  • 7. Problem description
  • 8. Best answer prediction in Social Q&A • Binary classification problem • Is it solved? • Yes, partially • Current solutions depend on: Answer Ratings (Score, #comments), where knowledge is future & unknown; User Ratings (User Reputation, UpVotes etc, Preferential attachment), where knowledge is past & not always available
  • 9. State of the art solutions “…we observe significant assortativity in the reputations of co-answerers, relationships between reputation and answer speed, and that the probability of an answer being chosen as the best one strongly depends on temporal characteristics of answer arrivals.” Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, Jure Leskovec. Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow. KDD 2012
  • 10. State of the art solutions (cont.) “When available, scoring (or rating) features improve prediction results significantly, which demonstrates the value of community feedback and reputation for identifying valuable answers.” Grégoire Burel, Yulan He, Harith Alani. Automatic Identification of Best Answers in Online Enquiry Communities. ESWC 2012
  • 11. State of the art solutions: Summary [Bar chart: Average Precision (0% to 100%) by feature category: Linguistic, User Ratings, Answer ratings; annotated “Our solution”]
  • 12. StackExchange network: SE “is all about getting answers, it’s not a discussion forum, there’s no chit-chat”. As of 20 June 2014: • 123 Q&A sites • 5,622,330 users • 9.5 million questions • 16.3 million answers • 9.3 million visits per day
  • 13. Training Dataset: September 2013 dump; StackOverflow & 20 of the most active SE websites; Questions with Accepted Answers • 4,366,662 Non Accepted Answers • 3,939,224 Accepted Answers [Pie chart: Accepted Answers 47%, Non Accepted Answers…]
  • 14. SE websites [Bar chart: number of Non Accepted and Accepted answers per SE website, axis from 0 to 200,000]
  • 15. [Charts: StackOverflow 91%, The Rest 9%; stacked bar for stackoverflow showing Non Accepted Answers and Accepted Answers, data labels 3,375,817 and 3,795,276, axis from 0 to 8,000,000]
  • 16. Shallow Linguistic features • Long history, coming from studies on readability 1. Average number of characters per word 2. Average number of words per sentence 3. Number of words in the longest sentence 4. Answer length 5. Log Likelihood (Pitler and Nenkova, 2008)
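A minimal sketch of how the five shallow features listed on this slide could be computed for one answer. The slide's log-likelihood formula did not survive in the transcript, so the version here is an assumption: a unigram log-likelihood against a background vocabulary with add-one smoothing, in the spirit of readability work such as Pitler and Nenkova (2008). The function name `shallow_features`, the regular-expression tokeniser, the choice of characters as the unit of answer length, and the toy background counts are all hypothetical.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z0-9']+")


def shallow_features(text, background_counts):
    """Compute five shallow features for one answer (illustrative only).

    background_counts is a Counter of word frequencies in a reference
    corpus, a hypothetical stand-in for the community vocabulary.
    """
    # Naive sentence split on '.', '!' and '?'; drop sentences without words.
    sentences = [WORD.findall(s.lower()) for s in re.split(r"[.!?]+", text)]
    sentences = [s for s in sentences if s]
    words = [w for s in sentences for w in s]
    if not words:
        return None

    # Assumed unigram log-likelihood with add-one smoothing for unseen words.
    total = sum(background_counts.values())
    vocab = len(background_counts)
    log_likelihood = sum(
        count * math.log((background_counts.get(w, 0) + 1) / (total + vocab + 1))
        for w, count in Counter(words).items()
    )

    return {
        "Length": len(text),  # assumption: length measured in characters
        "CharactersPerWord": sum(len(w) for w in words) / len(words),
        "WordsPerSentence": len(words) / len(sentences),
        "LongestSentence": max(len(s) for s in sentences),
        "LL": log_likelihood,
    }


text = "Use a list. It is simple and fast!"
background = Counter("use a list it is fast".split())  # toy background corpus
features = shallow_features(text, background)
```

A lower (more negative) `LL` means the answer's words are rarer in the background corpus, which is one way to read "differs more from the community vocabulary".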
  • 17. StackOverflow – Activity
  • 18. StackOverflow – Length
  • 19. StackOverflow – Log Likelihood
  • 20. StackOverflow – Characters Per Word
  • 21. StackOverflow – Longest Sentence
  • 22. StackOverflow – Words Per Sentence
  • 23. StackOverflow Overview of shallow features’ evolution
  • 24. Shallow features: Observations • Accepted answers tend to: • Be longer • Differ more from the community vocabulary • Contain shorter words • Have a longer longest sentence • Have more words per sentence. But how good are shallow features?
  • 25. But how good are shallow features? • 58% macro precision (our baseline) • Possible reasons 1. Evolution of language characteristics • Language becomes more eloquent 2. Variance is huge 3. Universal classifier looks unreachable, e.g.: • SuperUser average length is 577 • Skeptics average length is 2,154
  • 26. Proposed solution
  • 27. Objectives • Build a classifier which is: 1. Based solely on linguistic features 2. Robust • Performs as well as other classifiers that use user ratings (past knowledge) or answer ratings (future knowledge) 3. Universal • Same classifier applicable to as many SE websites as possible (domain agnostic)
  • 28. Feature discretisation: example for Length. Group answers by question, sort by Length in descending order, and take the resulting rank as the discretised feature LengthD:
      Question Id  Answer Id  Length  LengthD (rank)
      1            4          250     1
      1            2          200     2
      1            3          150     3
      5            6          150     1
      5            7          100     2
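The grouping and ranking step on this slide can be sketched as follows, using the question ids, answer ids and lengths as read from the slide's example. This is an illustration, not the authors' code: the function name `discretise_by_rank` and the dictionary keys are hypothetical, and ties are broken by input order, which is an assumption the slides do not address.

```python
from collections import defaultdict


def discretise_by_rank(answers, feature):
    """Map each answer id to its rank within its own question.

    Rank 1 is the largest value of `feature` among the answers to the
    same question. Ties keep their input order (an assumption).
    """
    by_question = defaultdict(list)
    for answer in answers:
        by_question[answer["question_id"]].append(answer)

    ranks = {}
    for group in by_question.values():
        ordered = sorted(group, key=lambda a: a[feature], reverse=True)
        for rank, answer in enumerate(ordered, start=1):
            ranks[answer["answer_id"]] = rank
    return ranks


# Values from the slide's Length example.
answers = [
    {"question_id": 1, "answer_id": 2, "Length": 200},
    {"question_id": 1, "answer_id": 3, "Length": 150},
    {"question_id": 1, "answer_id": 4, "Length": 250},
    {"question_id": 5, "answer_id": 6, "Length": 150},
    {"question_id": 5, "answer_id": 7, "Length": 100},
]
length_d = discretise_by_rank(answers, "Length")
```

Because the rank is relative to the other answers to the same question, a 150-character answer can be rank 3 under one question and rank 1 under another, which is what makes the feature comparable across sites with very different typical lengths.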
  • 29. Information Gain from Discretisation
  • 30. Feature discretisation: information gain per feature (20x increase after discretisation)
      Category                   Name                 Information Gain
      Linguistic                 Length               0.0226
      Linguistic                 LongestSentence      0.0121
      Linguistic                 LL                   0.0053
      Linguistic                 WordsPerSentence     0.0048
      Linguistic                 CharactersPerWord    0.0052
      Linguistic Discretisation  LengthD              0.2168
      Linguistic Discretisation  LongestSentenceD     0.1750
      Linguistic Discretisation  LLD                  0.1180
      Linguistic Discretisation  WordsPerSentenceD    0.1404
      Linguistic Discretisation  CharactersPerWordD   0.1162
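As a reminder of what the information gain column measures, the sketch below computes the standard entropy-based information gain of a discrete feature with respect to a binary label, H(label) minus H(label | feature). The slides do not say how their figures were produced, so this is not a reproduction of them; the function names and the toy data (two questions on sites with very different answer lengths) are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(labels) - H(labels | values) for a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: two questions with two answers each; the longer answer is accepted.
lengths = [100, 300, 1000, 3000]
accepted = [0, 1, 0, 1]

# A single global threshold on raw length cannot separate the classes ...
raw_binned = [length > 500 for length in lengths]
# ... while the per-question rank (1 = longest) separates them perfectly.
rank_in_question = [2, 1, 2, 1]

gain_raw = information_gain(raw_binned, accepted)
gain_rank = information_gain(rank_in_question, accepted)
```

In this toy case the raw, globally binned length carries no information about acceptance while the within-question rank carries one full bit, which mirrors the direction of the effect reported on the slide.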
  • 31. User and answer rating features
      Category       Name             Information Gain
      Other          Age              0.0539
      Other          CreationDateD    0.1575
      Other          AnswerCount      0.3270
      User Rating    UserReputation   0.0836
      User Rating    UserUpVotes      0.0535
      User Rating    UserDownVotes    0.0412
      User Rating    UserViews        0.0528
      User Rating    UserUpDownVotes  0.0508
      Answer rating  Score            0.0792
      Answer rating  CommentCount     0.0286
      Answer rating  ScoreRatio       0.4539
  • 32. Evaluation
  • 33. What are we evaluating? 1. Prediction 2. How good is it compared with the SOTA? 3. Generality
  • 34. 1. Prediction – Features used: Linguistic; Linguistic Discretisation; Other; User Rating (Past Knowledge); Answer Rating (Future Knowledge)
  • 35. 1. Prediction • Classifier was Alternating Decision Trees (ADT) • Binary, boosting, numerical data • Weka • 10-fold cross-validation • Features used: Linguistic, Linguistic Discretisation, Other
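For readers unfamiliar with the 10-fold protocol named on this slide, here is a minimal sketch of how the folds can be formed. It is not Weka's implementation (this version shuffles and splits without stratifying by class, which is a simplification), and the function name `k_fold_indices` and the seed are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every item is used for testing exactly once and for training in the
    other k - 1 rounds. Folds are not stratified (a simplification).
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

The reported precision, recall, F-measure and AUC would then be computed over the held-out predictions collected from the ten rounds.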
  • 36. 1. Prediction: results per SE website (P = precision, R = recall, FM = F-measure, AUC = area under the ROC curve)
      SE Website                     P     R     FM    AUC
      stackoverflow.com              0.82  0.66  0.73  0.85
      apple.stackexchange.com        0.84  0.68  0.75  0.86
      askubuntu.com                  0.84  0.74  0.79  0.88
      drupal.stackexchange.com       0.87  0.79  0.83  0.89
      electronics.stackexchange.com  0.79  0.65  0.71  0.84
      english.stackexchange.com      0.77  0.52  0.62  0.83
      gamedev.stackexchange.com      0.82  0.71  0.76  0.87
      gaming.stackexchange.com       0.87  0.79  0.83  0.91
      gis.stackexchange.com          0.85  0.73  0.78  0.87
      math.stackexchange.com         0.85  0.74  0.79  0.87
      mathoverflow.net               0.83  0.70  0.76  0.87
      meta.stackoverflow.com         0.87  0.69  0.77  0.87
      physics.stackexchange.com      0.86  0.71  0.78  0.88
      programmers.stackexchange.com  0.76  0.40  0.52  0.84
      serverfault.com                0.83  0.66  0.74  0.85
      skeptics.stackexchange.com     0.87  0.83  0.85  0.91
      stats.stackexchange.com        0.85  0.79  0.82  0.89
      superuser.com                  0.84  0.65  0.73  0.85
      tex.stackexchange.com          0.87  0.77  0.82  0.88
      unix.stackexchange.com         0.81  0.68  0.74  0.85
      wordpress.stackexchange.com    0.88  0.80  0.84  0.89
      Macro average                  0.84  0.70  0.76  0.87
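The macro average row can be checked directly from the per-site figures on this slide: it is the unweighted mean over the 21 sites, so StackOverflow counts no more than any smaller site. The sketch below also assumes that FM denotes the balanced F-measure (harmonic mean of precision and recall), which the slide does not state but which is consistent with the per-site rows. Variable and function names are hypothetical.

```python
# (site, precision, recall) as listed on the slide.
results = [
    ("stackoverflow.com", 0.82, 0.66), ("apple.stackexchange.com", 0.84, 0.68),
    ("askubuntu.com", 0.84, 0.74), ("drupal.stackexchange.com", 0.87, 0.79),
    ("electronics.stackexchange.com", 0.79, 0.65),
    ("english.stackexchange.com", 0.77, 0.52),
    ("gamedev.stackexchange.com", 0.82, 0.71),
    ("gaming.stackexchange.com", 0.87, 0.79), ("gis.stackexchange.com", 0.85, 0.73),
    ("math.stackexchange.com", 0.85, 0.74), ("mathoverflow.net", 0.83, 0.70),
    ("meta.stackoverflow.com", 0.87, 0.69), ("physics.stackexchange.com", 0.86, 0.71),
    ("programmers.stackexchange.com", 0.76, 0.40), ("serverfault.com", 0.83, 0.66),
    ("skeptics.stackexchange.com", 0.87, 0.83), ("stats.stackexchange.com", 0.85, 0.79),
    ("superuser.com", 0.84, 0.65), ("tex.stackexchange.com", 0.87, 0.77),
    ("unix.stackexchange.com", 0.81, 0.68), ("wordpress.stackexchange.com", 0.88, 0.80),
]


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro average: every site contributes equally, regardless of its size.
macro_precision = sum(p for _, p, _ in results) / len(results)
macro_recall = sum(r for _, _, r in results) / len(results)
stackoverflow_fm = f_measure(0.82, 0.66)
```

Note that a macro-averaged FM is the mean of the per-site F-measures, which need not equal the F-measure of the macro-averaged precision and recall.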
  • 37. 2. Comparison with other solutions. Feature groups: Linguistic; Linguistic Discretisation; Other; User Rating; Answer Rating
      Case  Features Used
      1     Linguistic
      2     Linguistic & Discretisation
      3     Linguistic & Discretisation & Other
      4     Linguistic & Other & User Rating (no discretisation)
      5     Linguistic & Other & User Rating (with discretisation)
      6     All features (Answer and User Rating with discretisation)
  • 38. Comparison
      Case  Features Used                                              P     R     FM    AUC
      1     Linguistic                                                 0.58  0.60  0.56  0.60
      2     Linguistic & Discretisation                                0.81  0.70  0.74  0.84
      3     Linguistic & Discretisation & Other                        0.84  0.70  0.76  0.87
      4     Linguistic & Other & User Rating (no discretisation)       0.82  0.69  0.75  0.86
      5     Linguistic & Other & User Rating (with discretisation)     0.82  0.72  0.77  0.88
      6     All features (Answer and User Rating with discretisation)  0.88  0.85  0.86  0.94
  • 39. 3. Generality • Leave-one-out • Trained a classifier for each SE website based on all other SE websites (StackOverflow was evaluated but was excluded from training due to its size)
      Setting                                                                      P     R     FM    AUC
      Macro average based on self-training (results from the first part of evaluation)  0.84  0.70  0.76  0.87
      Leave-one-out                                                                0.83  0.70  0.76  0.87
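The leave-one-out protocol described on this slide can be sketched as a generator of (held-out site, training sites) pairs. The function name `leave_one_site_out` and the three-site list are hypothetical; the only detail taken from the slide is that StackOverflow is evaluated but never used for training.

```python
def leave_one_site_out(sites, never_train_on=("stackoverflow.com",)):
    """Yield (held_out_site, training_sites) pairs.

    Each site is evaluated with a classifier trained on all other sites,
    except those in `never_train_on`, which are only ever evaluated.
    """
    for held_out in sites:
        training = [
            site for site in sites
            if site != held_out and site not in never_train_on
        ]
        yield held_out, training


sites = ["stackoverflow.com", "askubuntu.com", "superuser.com"]
splits = dict(leave_one_site_out(sites))
```

The near-identical scores for the two settings on the slide are what supports the claim that one classifier generalises across sites.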
  • 40. Discussion & Conclusion
  • 41. Best Answer prediction • Community feedback on the answers remains the best way to determine the best answer, but • Discretisation reveals a lot more information • Content features, even shallow ones, CAN be very informative • Independent of past (not always available) knowledge • Independent of future knowledge • Web application/service is under development
  • 42. [Diagram: Best Answer Prediction; User & answer rating; Linguistic features; ?; Proposed solution]
  • 43. Thank you http://xkcd.com/386/