Sustainable Questions
Determining the expiration date of answers




Bart de Goede, Universiteit van Amsterdam
27 August 2012
Supervisors: Maarten de Rijke, Anne Schuth
Outline
• Introduction to CQA

• Problem statement

• Approach

  • Cluster similar questions

  • Compare answers in clusters

  • Classify sustainable clusters

• Discussion and conclusion
Community Question Answering

• Community of users asking and answering questions

• Natural language

• Formally, a service that involves:

  1) A method for a person to present his/her information need in
     natural language,

  2) a place where other people can respond to that information
     need and

  3) a community built around such a service based on
     participation. (Shah et al., 2009)
Community Question Answering

• CQA-services have many answered questions

• CQA-retrieval aims to find answered questions similar to the
  question a user posts

• However, not all questions may be readily reused:

  • Who designed the Eiffel Tower?
    Alexandre Gustave Eiffel.

  • Who is the prime minister of the UK?
    Now: David Cameron. Before: Gordon Brown.
Problem statement

• Some questions are sustainable and can readily be reused, others
  are not

• A question is sustainable if the answer to that question is
  independent of the point in time the question is asked

• So, if the answer to semantically similar questions over time does
  not change, the questions are considered sustainable
Research questions

 RQ1: What are the distinguishing properties of sustainable questions?

 RQ2: Can we measure these properties of sustainability?

 RQ3: Can we tell sustainable and non-sustainable questions apart
 based on these properties?
Approach:
What makes a question sustainable?


1. Cluster semantically similar questions

2. Compare answers in each cluster

3. Classify clusters as sustainable




Cluster semantically similar questions

• Questions are semantically similar if they would be satisfied by
  the same information when asked at the same time

• However, questions tend to be

  • very short

  • phrased in different ways

  • noisy

  • littered with function words
Cluster semantically similar questions

• Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent
  Dirichlet Allocation (LDA; Blei et al., 2003)

  • topic modeling techniques

  • cosine distance between topic vectors

• Locality Sensitive Hashing (LSH; Charikar, 2002)

  • Used for near-duplicate detection

  • Intuition: near-duplicates are very likely to be similar
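
The LSH bullet can be illustrated with a minimal sketch of random-hyperplane hashing, the rounding scheme for cosine similarity described by Charikar (2002). The deck only names the technique and the signature lengths (16 to 40 bits), so the function names, the dense list vectors and the fixed seed below are assumptions for illustration, not the configuration used in the experiments.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def bucket_by_signature(vectors, hyperplanes):
    """Group item ids whose vectors hash to the same signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(key)
    return buckets
```

Vectors pointing in the same direction always share a signature, and the expected Hamming distance between two signatures grows with the angle between the vectors. More bits give finer buckets, so fewer unrelated questions collide.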
Cluster semantically similar questions

• Manually labeled set of 559 question pairs

• Calculate accuracy on samples of Yahoo! Answers Comprehensive
  Questions and Answers version 1.0

                                  sample size
    algorithm        10K       100K      all
    LDA              0.435     0.500     -
    LSA              0.706     0.638     -
    LSH 16 bits      0.472     0.484     0.500
    LSH 24 bits      0.465     0.502     0.495
    LSH 32 bits      0.512     0.514     0.509
    LSH 40 bits      0.523     0.537     0.542

    Table 2: Accuracy of several question clustering methods.
    Missing values (-) represent experiments that never terminated.
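
One plausible way to score a clustering against the labeled question pairs is sketched below. The deck does not spell out the exact procedure, so this is an assumption: a pair counts as correct when the clustering agrees with the human label on whether the two questions belong together. The names `pair_accuracy`, `cluster_of` and `labeled_pairs` are hypothetical.

```python
def pair_accuracy(cluster_of, labeled_pairs):
    """Fraction of labeled pairs on which the clustering agrees with the label.

    cluster_of    -- dict mapping question id to cluster id
    labeled_pairs -- iterable of (question_a, question_b, is_similar) tuples
    """
    pairs = list(labeled_pairs)
    if not pairs:
        raise ValueError("need at least one labeled pair")
    correct = 0
    for q1, q2, is_similar in pairs:
        c1, c2 = cluster_of.get(q1), cluster_of.get(q2)
        # Assumption: a question missing from the clustering matches nothing.
        predicted = c1 is not None and c1 == c2
        if predicted == is_similar:
            correct += 1
    return correct / len(pairs)
```

Under this reading, a method that puts every question in its own cluster scores exactly the share of pairs labeled as not similar. That is one reason to compare the figures against such a baseline rather than against zero.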
Compare answers in each cluster

• Answers to similar questions that do not change over time
  indicate sustainable questions

• Output of LSA contained 904 clusters:

  • 9 clusters considered sustainable

  • 143 clusters considered similar

  • 756 clusters considered all

• Compute properties of question-answer pairs (change, time,
  number of answers, etc.)
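
The "change" property can be sketched as the cosine distance between the vectors of consecutive answers in a cluster, summarised by its average and standard deviation. The dense list vectors, the chronological ordering of the input and the function names are assumptions for illustration; how the vectors are built (term weights, topic vectors) is left open here.

```python
import math
import statistics


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def sequential_changes(answer_vectors):
    """Cosine distance between each answer and the next one in time order."""
    return [cosine_distance(a, b)
            for a, b in zip(answer_vectors, answer_vectors[1:])]


def change_summary(answer_vectors):
    """Average and (population) standard deviation of the sequential changes."""
    changes = sequential_changes(answer_vectors)
    return statistics.mean(changes), statistics.pstdev(changes)
```

A cluster whose answers stay the same over time yields changes close to 0, which is the pattern expected for sustainable questions.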
Compare answers in each cluster

[Figure: cumulative cosine distance between answers in a single cluster, with linear fitted line. x-axis: time (January to June 2006); y-axis: cumulative cosine distance.]
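
The plotted quantity can be sketched as follows: accumulate the cosine distances between consecutive answers, fit a straight line by ordinary least squares, and keep the slope and the sum of squared errors (SSE) as cluster properties. The x values may be answer indices or days since the first answer; which one is used, and the function names, are assumptions of this sketch.

```python
def cumulative(values):
    """Running total of a sequence of per-step changes."""
    total = 0.0
    out = []
    for value in values:
        total += value
        out.append(total)
    return out


def linear_fit(xs, ys):
    """Ordinary least squares line through (xs, ys).

    Returns (slope, intercept, sse), where sse is the sum of squared residuals.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equal lengths")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse
```

A small slope means the answers change little per step or per day. A large SSE means the change is irregular rather than steady.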
Compare answers in each cluster

[Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community. Classes: all, similar, sustainable. x-axis: average cosine distance; y-axis: density.]

[Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community. Classes: all, similar, sustainable. x-axis: average cosine distance; y-axis: density.]

[Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. Classes: all, similar, sustainable. x-axis: days between question posted and last answer; y-axis: density.]

• Clusters in the similar class are only required to have similar questions (questions asking for the same information), regardless of the answers; these clusters can thus be either sustainable or unsustainable

• Clusters in the sustainable class are additionally required to have answers that do not change over time; this definition implies that the sustainable class is a subset of the similar class

• The time between posting a question and receiving its last answer is very indicative of sustainability: the longer a question solicits answers, the higher the probability that the question is sustainable
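
The density curves referred to here are kernel density estimates. A minimal Gaussian version is sketched below; the kernel choice and the bandwidth are assumptions, since the deck does not state them.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using Gaussian kernels.

    Each sample contributes a normal bump of width `bandwidth`; the bumps are
    averaged so that the resulting curve integrates to 1.
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

Because every bump extends on both sides of its sample, the estimate can be positive at values that never occur in the data. That may be why the density plots extend below zero on the horizontal axis.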
Classify clusters as sustainable

• Construct feature sets (change, change over time, time to answer)

• Train a classifier* on re-sampled data

• Accuracy in stratified 10-fold cross-validation:

    feature set                          accuracy
    change per question                  66.9%
    change over time                     86.0%
    semanticized change over time        75.3%
    time to answer                       89.3%
    change/time combination              91.5%

In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between asking and answering of questions (both between asking and labeling of the best answer, and between asking and reception of the last answer). Also, we used a combination of the ‘change over time’ and ‘time to answer’ sets.

*We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993)
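
The evaluation protocol can be sketched in a few lines: split the clusters into 10 folds that each keep roughly the same class ratio, train on nine folds, test on the remaining one, and pool the results. The classifier below is a one-threshold decision stump on a single numeric feature (for instance, days until the last answer). It is a deliberately simple stand-in and not the C4.5 implementation from WEKA used in the study; the re-sampling step is not covered either. All function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds with similar class ratios per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:  # deal the indices round-robin over the folds
            folds[position % k].append(index)
            position += 1
    return folds


def train_threshold_stump(xs, ys):
    """Stand-in classifier (not C4.5): best single threshold on one feature."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for threshold in candidates:
        for above in (True, False):
            hits = sum(1 for x, y in zip(xs, ys)
                       if ((x >= threshold) == above) == y)
            if best is None or hits > best[0]:
                best = (hits, threshold, above)
    _, threshold, above = best
    return lambda x: (x >= threshold) == above


def cross_validated_accuracy(features, labels, train, k=10, seed=0):
    """Pooled accuracy over k stratified folds; labels are booleans."""
    correct = 0
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train(train_x, train_y)
        correct += sum(1 for i in held_out if predict(features[i]) == labels[i])
    return correct / len(labels)
```

Stratification matters here because sustainable clusters are rare; without it, some folds could contain no sustainable cluster at all.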
Conclusions

• Explored a new problem concerning sustainability and reusability of questions
  in a CQA setting


• Sustainability can be reasonably estimated by simple question properties,
  where time is most descriptive (RQ1)


• These properties can be obtained easily, also from data from other CQA
  services (RQ2)


• Using a simple classifier, these properties can be used to distinguish
  sustainable from non-sustainable questions (RQ3)
Future work

• Scaling (considered sample 3% of training set)


• Clustering:


   • on answers (twice as long as questions)


   • both (where do clusters of answers and questions ‘agree’?)


   • retrieval approach


• Evaluation; does factoring in sustainability have a positive effect on precision?
Questions?
References

• D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. The Journal of Machine
  Learning Research, 3:993–1022, March 2003. ISSN 1532-4435. URL
  http://dl.acm.org/citation.cfm?id=944919.944937
• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of
  the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM,
  2002.
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent
  semantic analysis. Journal of the American Society for Information Science, 41(6): 391–407,
  1990.
• M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data
  mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
• J. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
• C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science
  Research, 31(4):205–209, 2009.
Data descriptives

                                                  sample size
    Statistic                                   10K      100K     all
    Number of questions                         10K      100K     3.2M
    Average number of answers/question          7.1      7.1      7.1
    Std. dev. number of answers/question        7.4      7.2      8.1
    Average number of characters/question       175.0    176.7    177.3
    Std. dev. of characters/question            204.2    200.0    201.7
    Median of characters/question               103      104      105
    Average number of characters/answer         332.8    336.5    336.0
    Std. dev. of characters/answer              507.6    503.7    499.6
    Median of characters/answer                 168      175      177
    Average number of sentences/question        2.8      2.8      2.9
    Std. dev. number of sentences/question      2.7      2.6      2.6
    Median number of sentences/question         2        2        2
    Average number of sentences/answer          3.9      3.9      3.9
    Std. dev. number of sentences/answer        6.3      5.2      5.1
    Median number of sentences/answer           2        2        2
    Question languages                          6        12       28
    Main categories                             163      176      179
    Categories                                  869      1744     2853
    Sub categories                              677      1245     1539

    Table 1: Descriptive statistics of the Yahoo! Answers data set.
    The average number of answers is per question.
Cluster properties

[Figure: kernel density estimation of the average cosine distance for the classes all, similar and sustainable. x-axis: average cosine distance; y-axis: density.]
Cluster properties

[Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. Classes: all, similar, sustainable. x-axis: days between question posted and last answer; y-axis: density.]
Cluster properties

[Figure 7: Kernel density estimation of average time in days between posting of a question and that question being marked as resolved. Classes: all, similar, sustainable. x-axis: days between question posted and best answer; y-axis: density.]

Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that take longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
Cluster properties

Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with linear fitted line for a single cluster. However, here the timing of the answers is taken into account.

[Figure 3: Cumulative cosine distance between semanticized ... Legend: linear fitted line, cumulative cosine distance. x-axis: 0 to 8; y-axis: cumulative cosine distance.]
Cluster properties

[Figure: cumulative cosine distance with linear fitted line. x-axis: time (January to June 2006); y-axis: cumulative cosine distance.]

Sustainable Questions

  • 1. Sustainable Questions: Determining the expiration date of answers • Bart de Goede, Universiteit van Amsterdam • Supervisors: Maarten de Rijke, Anne Schuth • 27 August 2012
  • 2. Outline • Introduction to CQA • Problem statement • Approach • Cluster similar questions • Compare answers in clusters • Classify sustainable clusters • Discussion and conclusion
  • 3. Community Question Answering • Community of users asking and answering questions • Natural language • Formally, a service that involves: 1) A method for a person to present his/her information need in natural language, 2) a place where other people can respond to that information need and 3) a community built around such a service based on participation. (Shah et al., 2009)
  • 5. Community Question Answering • CQA-services have many answered questions • CQA-retrieval aims to find answered questions similar to the question a user posts • However, not all questions may be readily reused: • Who designed the Eiffel Tower? Alexander Gustave Eiffel. • Who is the prime minister of the UK? Now: David Cameron. Before: Gordon Brown.
  • 6. Problem statement • Some questions are sustainable and can readily be reused, others are not • A question is sustainable if the answer to that question is independent of the point in time the question is asked • So, if the answer to semantically similar questions over time does not change, the questions are considered sustainable
  • 7. Research questions • RQ1: What are the distinguishing properties of sustainable questions? • RQ2: Can we measure these properties of sustainability? • RQ3: Can we tell sustainable and non-sustainable questions apart based on these properties?
  • 8. Approach: What makes a question sustainable? • 1. Cluster semantically similar questions • 2. Compare answers in each cluster • 3. Classify clusters as sustainable
  • 9. Cluster semantically similar questions • Questions are semantically similar if they would be satisfied by the same information when asked at the same time • However, questions tend to be • very short • phrased in different ways • noisy • littered with function words
  • 10. Cluster semantically similar questions • Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent Dirichlet Allocation (LDA; Blei et al., 2003) • topic modeling techniques • cosine distance between topic vectors • Locality Sensitive Hashing (LSH; Charikar, 2002) • Used for near-duplicate detection • Intuition: near-duplicates are very likely to be similar
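The two similarity tools named on this slide can be sketched in a few lines. The block below is a minimal illustration, not the implementation used in the thesis: it shows cosine distance between two vectors and a random-hyperplane signature in the style of Charikar (2002), where each bit records on which side of a random hyperplane a vector falls. The function names, the Gaussian hyperplanes, the seed, and the choice to bucket by identical signature are all assumptions made for this example.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an all-zero vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes (normal vectors) of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def bucket_by_signature(vectors, hyperplanes):
    """Group ids whose vectors share an identical signature (hypothetical clustering rule)."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(key)
    return buckets
```

With this scheme two vectors agree on a single bit with probability 1 - θ/π, where θ is the angle between them, so more bits make an exact signature match stricter.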
  • 11. Cluster semantically similar questions • Manually labeled set of 559 question pairs • Calculate accuracy on samples of Yahoo! Answers Comprehensive Questions and Answers version 1.0
    Table 2: Accuracy of several question clustering methods. Missing values represent experiments that never terminated.
                   sample size
    algorithm      10K      100K     all
    LDA            0.435    0.500    -
    LSA            0.706    0.638    -
    LSH16bits      0.472    0.484    0.500
    LSH24bits      0.465    0.502    0.495
    LSH32bits      0.512    0.514    0.509
    LSH40bits      0.523    0.537    0.542
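The slide does not spell out how accuracy is computed from the labeled pairs. One plausible reading, shown here purely as an assumption, is that a pair counts as correct when the clustering puts both questions in the same cluster exactly when the annotators marked them as similar. The input formats and the function name are hypothetical.

```python
def pair_accuracy(labeled_pairs, cluster_of):
    """Fraction of labeled pairs on which the clustering agrees with the label.

    labeled_pairs: iterable of (question_id_a, question_id_b, is_similar)
    cluster_of:    dict mapping question id -> cluster id (hypothetical format)
    """
    correct = 0
    total = 0
    for a, b, is_similar in labeled_pairs:
        predicted_similar = cluster_of[a] == cluster_of[b]
        correct += predicted_similar == is_similar
        total += 1
    if total == 0:
        raise ValueError("no labeled pairs given")
    return correct / total
```

Under this reading, a score near 0.5 on a balanced set of pairs would mean the clustering does no better than guessing.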
  • 12. Compare answers in each cluster • Answers to similar questions that do not change over time indicate sustainable questions • Output of LSA contained 904 clusters: • 9 clusters considered sustainable • 143 clusters considered similar • 756 clusters considered all • Compute properties of question-answer pairs (change, time, number of answers, etc.)
  • 13. Compare answers in each cluster • [Figure: cumulative cosine distance with a linear fitted line; x-axis: time (Jan 2006 to Jun 2006); y-axis: cumulative cosine distance]
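The plotted quantity can be sketched as follows: take the answers of one cluster in chronological order, sum the cosine distance between each answer and the previous one, and fit a straight line through the running total. This is a minimal sketch under assumptions: the vectors stand in for whatever answer representation was used, the comparison of sequential answers is one reading of "change", and the returned property names are illustrative. The x-values can be answer indices or, when `times` is given, timestamps such as days since the first answer.

```python
import itertools
import math
import statistics


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def change_properties(vectors, times=None):
    """Summarise how much chronologically ordered answer vectors change."""
    steps = [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    cumulative = list(itertools.accumulate(steps))
    # Each cumulative value is aligned with the later answer of its pair.
    xs = list(times[1:]) if times is not None else list(range(1, len(steps) + 1))
    slope, _intercept, sse = linear_fit(xs, cumulative)
    return {
        "average": statistics.mean(steps),
        "std_dev": statistics.pstdev(steps),  # assumption: population std. dev.
        "slope": slope,
        "sse": sse,
    }
```

A flat fitted line (slope near zero) would indicate answers that barely change, while a large SSE would indicate change that is not steady over time.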
  • 14. Compare answers in each cluster • [Three kernel density plots, each comparing the classes all, similar and sustainable; y-axis: density]
    Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community. (x-axis: average cosine distance)
    Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community. (x-axis: average cosine distance)
    Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received. (x-axis: days between question posted and last answer)
    Text from the accompanying paper: The clusters in the similar class are only required to have similar questions (questions asking for the same information), regardless of the answers; these clusters can thus be sustainable and unsustainable. Additionally, the clusters in the sustainable class are required to have answers that do not change over time. Note that this definition implies that the sustainable class is a subset of the similar class. The time between the posting of a question and receiving its last answer (shown in Figure 8) is very indicative in describing sustainability.
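A kernel density estimate of this kind can be computed directly from the per-cluster values. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth; neither is stated on the slides, and the function name is illustrative.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    if not samples:
        raise ValueError("need at least one sample")
    if bandwidth <= 0.0:
        raise ValueError("bandwidth must be positive")
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each sample contributes one bell curve centred on itself.
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical per-cluster average cosine distances for one class.
kde = gaussian_kde([0.1, 0.2, 0.9], bandwidth=0.2)
print(round(kde(0.15), 3))
```

Because a Gaussian kernel places some mass outside the range of the data, a density curve for a non-negative quantity can extend slightly below zero.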
  • 15. Classify clusters as sustainable • Construct feature sets (change, change over time, time to answer) • Train a classifier* on re-sampled data • Accuracy in stratified 10-fold cross-validation:
    feature set                      accuracy
    change per question              66.9%
    change over time                 86.0%
    semanticized change over time    75.3%
    time to answer                   89.3%
    change/time combination          91.5%
    *We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993)
    Text from the accompanying paper: The longer a question solicits answers, the higher the probability of said question to be sustainable. In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between asking and answering of questions (both between asking and labeling of the best answer, and between asking and reception of the last answer). Also, we used a combination of the 'change over time' and 'time to answer' sets.
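The evaluation protocol on this slide, stratified k-fold cross-validation, can be sketched without any library. The block below is an illustration under stated assumptions, not the WEKA setup used in the thesis: the learner is a one-feature threshold rule standing in for C4.5, the re-sampling step is not shown, and the function names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign example indices to k folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)  # continue dealing where the last class stopped
    return folds


def train_stump(xs, ys):
    """Stand-in learner (not C4.5): best single threshold on one numeric feature."""
    classes = sorted(set(ys))
    values = sorted(set(xs))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])] or values
    best = None
    for threshold in candidates:
        for low, high in ((classes[0], classes[-1]), (classes[-1], classes[0])):
            hits = sum((high if x >= threshold else low) == y for x, y in zip(xs, ys))
            if best is None or hits > best[0]:
                best = (hits, threshold, low, high)
    return best[1:]


def predict_stump(model, x):
    threshold, low, high = model
    return high if x >= threshold else low


def cross_validate(features, labels, k, train, predict, seed=0):
    """Return accuracy pooled over the k held-out folds."""
    correct = 0
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train(train_x, train_y)
        correct += sum(predict(model, features[i]) == labels[i] for i in held_out)
    return correct / len(labels)


# Toy data: days until the last answer, and whether the cluster is sustainable.
days = list(range(1, 11)) + list(range(101, 111))
sustainable = [False] * 10 + [True] * 10
accuracy = cross_validate(days, sustainable, 10, train_stump, predict_stump)
```

Stratifying matters when one class is rare: dealing the examples out per class avoids folds that contain no example of the rare class at all.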
  • 16. Conclusions • Explored a new problem concerning sustainability and reusability of questions in a CQA setting • Sustainability can be reasonably estimated by simple question properties, where time is most descriptive (RQ1) • These properties can be obtained easily, also from data from other CQA services (RQ2) • Using a simple classifier, these properties can be used to distinguish sustainable from non-sustainable questions (RQ3)
  • 17. Future work • Scaling (considered sample 3% of training set) • Clustering: • on answers (twice as long as questions) • both (where do clusters of answers and questions ‘agree’?) • retrieval approach • Evaluation; does factoring in sustainability have a positive effect on precision?
  • 19. References
    • D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937
    • M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.
    • S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
    • M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
    • J. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
    • C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science Research, 31(4):205–209, 2009.
  • 20. Data descriptives
    Table 1: Descriptive statistics of the Yahoo! Answer data set. The average number of answers is per question, ...
                                               sample size
    Statistic                                  10K      100K     all
    Number of questions                        10K      100K     3.2M
    Average number of answers/question         7.1      7.1      7.1
    Std. dev. number of answers/question       7.4      7.2      8.1
    Average number of characters/question      175.0    176.7    177.3
    Std. dev. of characters/question           204.2    200.0    201.7
    Median of characters/question              103      104      105
    Average number of characters/answer        332.8    336.5    336.0
    Std. dev. of characters/answer             507.6    503.7    499.6
    Median of characters/answer                168      175      177
    Average number of sentences/question       2.8      2.8      2.9
    Std. dev. number of sentences/question     2.7      2.6      2.6
    Median number of sentences/question        2        2        2
    Average number of sentences/answer         3.9      3.9      3.9
    Std. dev. number of sentences/answer       6.3      5.2      5.1
    Median number of sentences/answer          2        2        2
    Question languages                         6        12       28
    Main categories                            163      176      179
    Categories                                 869      1744     2853
    Sub categories                             677      1245     1539
  • 21. Cluster properties • [Figure: kernel density estimation of the average cosine distance for the classes all, similar and sustainable; x-axis: average cosine distance; y-axis: density]
  • 22. Cluster properties • [Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received, for the classes all, similar and sustainable; x-axis: days between question posted and last answer; y-axis: density]
  • 23. Cluster properties • [Figure 7: Kernel density estimation of average time in days ..., for the classes all, similar and sustainable; x-axis: days between question posted and best answer; y-axis: density]
    Text from the accompanying paper: ... posting of a question and that question being marked as resolved. Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that require longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
  • 24. Cluster properties • [Figure: cumulative cosine distance with a linear fitted line; x-axis: 0 to 8; y-axis: cumulative cosine distance]
    Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with linear fitted line for a single cluster. However, here the timing of the answers is taken into account.
    Figure 3: Cumulative cosine distance between semanticized ...
  • 25. Cluster properties • [Figure: cumulative cosine distance with a linear fitted line; x-axis: time (Jan 2006 to Jun 2006); y-axis: cumulative cosine distance]