1. Sustainable Questions
Determining the expiration date of answers
27 August 2012
Bart de Goede
Supervisors: Maarten de Rijke, Anne Schuth
Universiteit van Amsterdam
2. Outline
• Introduction to CQA
• Problem statement
• Approach
• Cluster similar questions
• Compare answers in clusters
• Classify sustainable clusters
• Discussion and conclusion
3. Community Question Answering
• Community of users asking and answering questions
• Natural language
• Formally, a service that involves:
1) A method for a person to present his/her information need in
natural language,
2) a place where other people can respond to that information
need and
3) a community built around such a service based on
participation. (Shah et al., 2009)
5. Community Question Answering
• CQA-services have many answered questions
• CQA-retrieval aims to find answered questions similar to the
question a user posts
• However, not all questions may be readily reused:
• Who designed the Eiffel Tower?
Alexander Gustave Eiffel.
• Who is the prime minister of the UK?
Now: David Cameron. Before: Gordon Brown.
6. Problem statement
• Some questions are sustainable and can readily be reused, others
are not
• A question is sustainable if the answer to that question is
independent of the point in time the question is asked
• So, if the answer to semantically similar questions over time does
not change, the questions are considered sustainable
7. Research questions
RQ1: What are the distinguishing properties of sustainable questions?
RQ2: Can we measure these properties of sustainability?
RQ3: Can we tell sustainable and non-sustainable questions apart
based on these properties?
8. Approach:
What makes a question sustainable?
1. Cluster semantically similar questions
2. Compare answers in each cluster
3. Classify clusters as sustainable
9. Cluster semantically similar questions
• Questions are semantically similar if they would be satisfied by
the same information when asked at the same time
• However, questions tend to be
• very short
• phrased in different ways
• noisy
• littered with function words
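As an illustration of the noise problem listed above, here is a minimal normalisation sketch. It lowercases a question, strips punctuation, and drops function words. The stop list `FUNCTION_WORDS` and the function name `normalise` are hypothetical and not taken from the slides; a real stop list would be far longer.

```python
import re

# Hypothetical, deliberately tiny stop list used only for illustration.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "of", "to", "in",
    "who", "what", "do", "does", "i", "my",
}


def normalise(question):
    """Lowercase a question, strip punctuation, and drop function words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [token for token in tokens if token not in FUNCTION_WORDS]


print(normalise("Who designed the Eiffel Tower?"))
print(normalise("Who is the prime minister of the UK?"))
```

Only the content-bearing tokens remain after this step, which is what a clustering method would then compare.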
10. Cluster semantically similar questions
• Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent
Dirichlet Allocation (LDA; Blei et al., 2003)
• topic modeling techniques
• cosine distance between topic vectors
• Locality Sensitive Hashing (LSH; Charikar, 2002)
• Used for near-duplicate detection
• Intuition: near-duplicates are very likely to be similar
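The two building blocks named above can be sketched as follows: the cosine distance between two vectors, and a random-hyperplane signature in the spirit of Charikar (2002), where each bit records on which side of a random hyperplane a vector falls. This is a sketch under assumptions rather than the setup used in the experiments: the function names, the fixed seed, and the choice to treat every distinct signature as one bucket are all illustrative.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes (as normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_signature(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector lies on its positive side."""
    signature = 0
    for plane in hyperplanes:
        side = sum(a * b for a, b in zip(vector, plane)) >= 0
        signature = (signature << 1) | int(side)
    return signature


def bucket(vectors, hyperplanes):
    """Group vector indices by signature; each bucket is a candidate cluster."""
    buckets = {}
    for index, vector in enumerate(vectors):
        buckets.setdefault(lsh_signature(vector, hyperplanes), []).append(index)
    return buckets
```

With more bits the buckets become smaller and stricter, which is the trade-off behind comparing 16, 24, 32 and 40 bits.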
11. Cluster semantically similar questions
• Manually labeled set of 559 question pairs
• Calculate accuracy on samples of Yahoo! Answers Comprehensive
Questions and Answers version 1.0
               sample size
algorithm      10K     100K    all
LDA            0.435   0.500   -
LSA            0.706   0.638   -
LSH 16 bits    0.472   0.484   0.500
LSH 24 bits    0.465   0.502   0.495
LSH 32 bits    0.512   0.514   0.509
LSH 40 bits    0.523   0.537   0.542
Table 2: Accuracy of several question clustering methods. Missing values represent experiments that never terminated.
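The slides do not spell out how accuracy is computed from the labeled question pairs. One plausible reading, sketched here as an assumption, is pairwise agreement: a pair counts as correct when the clustering puts the two questions in the same cluster exactly when the annotators labeled them as similar. The names `pair_accuracy`, `labeled_pairs` and `cluster_of` are hypothetical.

```python
def pair_accuracy(labeled_pairs, cluster_of):
    """Fraction of labeled pairs on which the clustering agrees with the label.

    labeled_pairs: iterable of (question_id_a, question_id_b, is_similar).
    cluster_of: dict mapping question id to cluster id; unclustered
    questions are simply absent and never count as "same cluster".
    """
    correct = 0
    total = 0
    for first, second, is_similar in labeled_pairs:
        cluster_first = cluster_of.get(first)
        same_cluster = (
            cluster_first is not None and cluster_first == cluster_of.get(second)
        )
        correct += int(same_cluster == bool(is_similar))
        total += 1
    return correct / total if total else 0.0
```

Under this reading, a method that never groups anything would still score the share of dissimilar pairs in the labeled set, so a score should be read against that baseline.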
12. Compare answers in each cluster
• Answers to similar questions that do not change over time
indicate sustainable questions
• Output of LSA contained 904 clusters:
• 9 clusters considered sustainable
• 143 clusters considered similar
• 756 clusters considered all
• Compute properties of question-answer pairs (change, time,
number of answers, etc.)
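A minimal sketch of the change properties of one cluster, assuming answers are already available as numeric vectors ordered by posting time. It computes the cosine distance between sequential answers, accumulates it, fits a straight line through the cumulative curve by ordinary least squares, and reports average, standard deviation, slope and sum of squared errors (SSE). The function names and the use of the answer index as the x-axis are assumptions; a time-aware variant would use timestamps as x-values instead.

```python
import math
import statistics


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_properties(answer_vectors):
    """Summarise how much the answers in one cluster change from one to the next."""
    if len(answer_vectors) < 2:
        raise ValueError("need at least two answers to measure change")
    steps = [
        cosine_distance(a, b)
        for a, b in zip(answer_vectors, answer_vectors[1:])
    ]
    # Cumulative change after each successive answer.
    cumulative = []
    running = 0.0
    for step in steps:
        running += step
        cumulative.append(running)
    # Ordinary least-squares line through (index, cumulative change).
    xs = list(range(len(cumulative)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(cumulative) / len(cumulative)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cumulative))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, cumulative))
    return {
        "average": statistics.mean(steps),
        "std_dev": statistics.pstdev(steps),
        "slope": slope,
        "sse": sse,
    }
```

A cluster whose answers never change yields a flat cumulative curve (slope 0), while steadily changing answers yield a steep one.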
13. Compare answers in each cluster
[Figure: cumulative cosine distance between answers in a single cluster, with linear fitted line. x-axis: time (Jan 2006 to Jun 2006); y-axis: cumulative cosine distance.]
14. Compare answers in each cluster
[Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community. Series: all, similar, sustainable. x-axis: average cosine distance; y-axis: density.]
[Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community. Series: all, similar, sustainable. x-axis: average cosine distance; y-axis: density.]
[Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. Series: all, similar, sustainable. x-axis: days between question posted and last answer; y-axis: density.]
• The clusters in the similar class are only required to have similar questions (questions asking for the same information), regardless of the answers; these clusters can thus be sustainable or unsustainable
• The clusters in the sustainable class are additionally required to have answers that do not change over time; the sustainable class is therefore a subset of the similar class
• The time between the posting of a question and receiving its last answer (Figure 8) is very indicative of sustainability: the longer a question solicits answers, the higher the probability that it is sustainable
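The density curves on this slide are kernel density estimates. As a minimal sketch of the technique, assuming a Gaussian kernel and a hand-picked bandwidth (neither is stated in the slides), each observation contributes a small bell curve and the estimate is their average. Because the kernel smooths mass to both sides of each observation, an estimate over non-negative durations can extend below zero, which may explain the negative day values on the plotted axes.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at any point x."""
    if not samples:
        raise ValueError("need at least one sample")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each sample.
        return scale * sum(
            math.exp(-0.5 * ((x - sample) / bandwidth) ** 2)
            for sample in samples
        )

    return density


# Invented example values: days until the last answer for four questions.
density = gaussian_kde([1.0, 2.0, 3.0, 30.0], bandwidth=2.0)
print(density(2.0), density(15.0))
```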
15. Classify clusters as sustainable
• Construct feature sets (change, change over time, time to answer)
• Train a classifier* on re-sampled data
• Accuracy in stratified 10-fold cross-validation:
feature set                      accuracy
change per question              66.9%
change over time                 86.0%
semanticized change over time    75.3%
time to answer                   89.3%
change/time combination          91.5%
*We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993)
In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between asking and answering of questions (both between asking and labeling of the best answer, and between asking and reception of the last answer). Also, we used a combination of the ‘change over time’ and ‘time to answer’ sets.
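The evaluation protocol can be sketched in a few lines. This is not C4.5 and does not reproduce the WEKA setup: as a stand-in, the sketch trains a one-feature threshold stump (a decision tree of depth one) and evaluates it with stratified k-fold cross-validation. All function names are hypothetical, the re-sampling step is left out, and any data passed in would have to come from the cluster properties described earlier.

```python
def stratified_folds(labels, k):
    """Split example indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin per class
    return folds


def predict(stump, value):
    """A stump is (threshold, positive_above); returns the predicted label."""
    threshold, positive_above = stump
    return (value >= threshold) == positive_above


def train_stump(values, labels):
    """Pick the midpoint threshold and direction with the fewest training errors."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])] or distinct
    best_stump, best_correct = None, -1
    for threshold in thresholds:
        for positive_above in (True, False):
            stump = (threshold, positive_above)
            correct = sum(predict(stump, v) == y for v, y in zip(values, labels))
            if correct > best_correct:
                best_stump, best_correct = stump, correct
    return best_stump


def cross_validated_accuracy(values, labels, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train = [i for i in range(len(values)) if i not in held]
        stump = train_stump([values[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(stump, values[i]) == labels[i] for i in held_out)
    return correct / len(values)
```

Here `values` would hold one numeric feature per cluster (for example the average number of days until the last answer) and `labels` would be `True` for sustainable clusters. A full decision tree such as C4.5 extends the same idea by splitting recursively on several features.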
16. Conclusions
• Explored a new problem concerning sustainability and reusability of questions
in a CQA setting
• Sustainability can be reasonably estimated by simple question properties,
where time is most descriptive (RQ1)
• These properties can be obtained easily, also from data from other CQA
services (RQ2)
• Using a simple classifier, these properties can be used to distinguish
sustainable from non-sustainable questions (RQ3)
17. Future work
• Scaling (the sample considered is 3% of the training set)
• Clustering:
• on answers (twice as long as questions)
• both (where do clusters of answers and questions ‘agree’?)
• retrieval approach
• Evaluation: does factoring in sustainability have a positive effect on precision?
19. References
• D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937
• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
• M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
• J. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
• C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science Research, 31(4):205–209, 2009.
20. Data descriptives
                                           sample size
Statistic                                  10K     100K    all
Number of questions                        10K     100K    3.2M
Average number of answers/question         7.1     7.1     7.1
Std. dev. number of answers/question       7.4     7.2     8.1
Average number of characters/question      175.0   176.7   177.3
Std. dev. of characters/question           204.2   200.0   201.7
Median of characters/question              103     104     105
Average number of characters/answer        332.8   336.5   336.0
Std. dev. of characters/answer             507.6   503.7   499.6
Median of characters/answer                168     175     177
Average number of sentences/question       2.8     2.8     2.9
Std. dev. number of sentences/question     2.7     2.6     2.6
Median number of sentences/question        2       2       2
Average number of sentences/answer         3.9     3.9     3.9
Std. dev. number of sentences/answer       6.3     5.2     5.1
Median number of sentences/answer          2       2       2
Question languages                         6       12      28
Main categories                            163     176     179
Categories                                 869     1744    2853
Sub categories                             677     1245    1539
Table 1: Descriptive statistics of the Yahoo! Answers data set. The average number of answers is per question.
21. Cluster properties
[Figure: kernel density estimation of the average cosine distance for all, similar and sustainable clusters. x-axis: average cosine distance; y-axis: density.]
22. Cluster properties
[Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. Series: all, similar, sustainable. x-axis: days between question posted and last answer; y-axis: density.]
23. Cluster properties
[Figure 7: Kernel density estimation of average time in days between posting of a question and that question being marked as resolved. Series: all, similar, sustainable. x-axis: days between question posted and best answer; y-axis: density.]
Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that require longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
24. Cluster properties
Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with linear fitted line for a single cluster. However, here the timing of the answers is taken into account.
[Figure 3: Cumulative cosine distance between semanticized ... Series: linear fitted line, cumulative cosine distance. x-axis: 0 to 8; y-axis: cumulative cosine distance.]
25. Cluster properties
[Figure: cumulative cosine distance between answers in a single cluster, with linear fitted line. x-axis: time (Jan 2006 to Jun 2006); y-axis: cumulative cosine distance.]