Improving the Generation
of Ground Truths based on
Partially Ordered Lists
Julián Urbano, Mónica Marrero,
Diego Martín and Juan Lloréns
http://julian-urbano.info
Twitter: @julian_urbano


ISMIR 2010
Utrecht, Netherlands, August 11th
2



Outline
• Introduction
• Current Methodology
• Inconsistencies
 ▫ Due to Arrangement
 ▫ Due to Aggregation
 ▫ Fully Consistent Lists
• Alternative Aggregation Functions
 ▫ Measure of List Consistency
• Results
 ▫ MIREX 2005 Results Revisited
• Conclusions and Future Work
• Some thoughts on Evaluation in MIR
3



Similarity Tasks
• Symbolic Melodic Similarity (SMS)
• Audio Music Similarity (AMS)
  ▫ Not covered here

• Given a piece of music (i.e. the query) retrieve
  others musically similar to it
• How do we measure the similarity of a
  document to a query (i.e. the relevance)?
  ▫ Traditionally with fixed level-based scales
     Similar, not similar
     Very similar, somewhat similar, not similar
4



Relevance Judgments
• For similarity tasks, they are very problematic

• Relevance is rather continuous
  [Selfridge-Field, 1998][Typke et al., 2005]
 ▫ Single melodic changes are not perceived to
   change the overall melody
      Move a note up or down in pitch
      Shorten or enlarge it
      Add or remove a note
 ▫ But the similarity becomes weaker as more changes are applied

• Where is the line between relevance levels?
5



Partially Ordered Lists
• The relevance of a document is implied by its
  position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale

• Ordered groups of equally relevant documents
  ▫ Have to keep the order of the groups
  ▫ Allow permutations within the same group
6



Partially Ordered Lists (II)




Relevance levels do show up, but they are not fixed beforehand
7



Partially Ordered Lists (III)
• Used in the first edition of MIREX in 2005
 [Downie et al., 2005]


• Widely accepted by the MIR community
  to report new developments
 [Urbano et al., 2010][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006]


• Four-step methodology
 1. Filter out non-similar documents in the collection
 2. Have the experts rank the candidates
 3. Arrange the candidates by their median/mean rank (sketched below)
 4. Aggregate candidates whose ranks are not
    significantly different (Mann-Whitney U) [Mann and Whitney, 1947]
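
A minimal sketch of step 3, assuming each candidate incipit comes with the list of ranks it received from the experts (the data structure and names below are illustrative, not the MIREX format):

```python
from statistics import mean, median

def arrange(expert_ranks):
    """Step 3: order candidates by median expert rank, breaking ties by mean rank."""
    return sorted(expert_ranks,
                  key=lambda c: (median(expert_ranks[c]), mean(expert_ranks[c])))

# Toy example: ranks assigned by four experts to three candidates
ranks = {"A": [1, 1, 2, 1], "B": [2, 3, 1, 2], "C": [3, 2, 3, 3]}
print(arrange(ranks))  # ['A', 'B', 'C']
```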
8



Partially Ordered Lists (and IV)
• MIREX was forced to move to traditional
  level-based relevance judgments in 2006 [Downie et al., 2010]
  ▫   Partially ordered lists are expensive (step 2)
  ▫   They have some odd results (step 2)
  ▫   They are hard to replicate (step 2)
  ▫   The filtering step may leave out relevant results (step 1)

• We have already explored alternatives to step 2
  (and by extension 3 and 4) [Urbano et al., SIGIR CSE 2010]
  ▫ 3-point preference judgments via crowdsourcing
• Here we focus on steps 3 and 4
  ▫ The lists have inconsistencies that lead to
    incorrect evaluation
9



Intra-group Inconsistencies
• Two incipits in the same group were ranked
  significantly different by the experts
• If a system returns them in reverse order it
  will be considered correct, even though the
  experts ranked them clearly differently
  Query 700.010.591-1.4.2
  (pairwise test results: = not significantly different, ≠ significantly different)

        2   3   4   5   6   7   8   9
    1   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠
    2       ≠   ≠   ≠   ≠   ≠   ≠   ≠
    3           =   =   =   ≠   ≠   ≠
    4               =   =   ≠   ≠   ≠
    5                   ≠   ≠   ≠   ≠
    6                       =   =   ≠
    7                           =   =
    8                               =

  11 of the 21 pairs are incorrectly aggregated
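
As a quick check of the figure above, assuming the aggregated group spans ranks 3 to 9 (seven incipits, hence the 21 pairs), the significantly different pairs can be read off the matrix and counted; the encoding below is just a transcription for illustration:

```python
from itertools import combinations

# Pairs within ranks 3-9 that the experts ranked significantly different
# (transcribed from the matrix above)
different = {(3, 7), (3, 8), (3, 9),
             (4, 7), (4, 8), (4, 9),
             (5, 6), (5, 7), (5, 8), (5, 9),
             (6, 9)}

group = range(3, 10)                        # the aggregated group: ranks 3-9
pairs = list(combinations(group, 2))        # 21 pairs in total
bad = [p for p in pairs if p in different]  # incorrectly aggregated pairs
print(f"{len(bad)} of {len(pairs)}")        # 11 of 21
```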
10



Inter-group Inconsistencies
• Two incipits in different groups were not
  ranked significantly different by the experts
• If a system returns them in reverse order it
  will not be considered correct, even though no
  difference could be found between their ranks
  Query 190.011.224-1.1.1 (pairwise test results, matrix truncated)

        2   3   4   5   6   7   8
    1   ≠   ≠   ≠   ≠   ≠   ≠   ≠
    2       ≠   ≠   ≠   ≠   ≠   ≠
    3           ≠   =   ≠   ≠   ≠
    4               =   =   =   =
    5                   =   =   =
    6                       =   =
    7                           =
    …
11



Due to Arrangement
• In step 3 incipits are ordered by median
 ▫ Mean to break ties
• But in step 4 the Mann-Whitney U test is used

• Central tendency measures (median and
  mean) might not be appropriate because
 ▫ They ignore the dispersion in the samples

• Incipits are incorrectly ordered in step 3
 ▫ Source of inter-group inconsistencies
12



Due to Aggregation
• Traverse the list from top to bottom
 ▫ Begin a new group if the pivot is significantly
   different from all incipits in the current group
   (the original rule is sketched below)

• This generates very large groups
 ▫ Incipits at the top are considered similar to the
   ones at the end just because they are both similar
   to the ones in the middle
 ▫ Source of intra-group inconsistencies
    178 of the 509 intra-pairs (35%) inconsistent

• The group-initiator has to be very different
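
A sketch of the rule as described above, assuming `expert_ranks` maps each incipit to the list of ranks given by the experts and using the α=0.25 mentioned later in the talk; this is illustrative code, not the original implementation:

```python
from scipy.stats import mannwhitneyu

ALPHA = 0.25  # significance level reported for these lists

def different(a, b, expert_ranks):
    """True if the expert ranks of a and b differ significantly
    (2-tailed Mann-Whitney U test)."""
    p = mannwhitneyu(expert_ranks[a], expert_ranks[b],
                     alternative="two-sided").pvalue
    return p < ALPHA

def aggregate_original(arranged, expert_ranks):
    """Step 4: open a new group only when the pivot is significantly
    different from *all* incipits in the current group."""
    groups = [[arranged[0]]]
    for pivot in arranged[1:]:
        current = groups[-1]
        if all(different(pivot, other, expert_ranks) for other in current):
            groups.append([pivot])   # the pivot starts a new group
        else:
            current.append(pivot)    # the pivot joins the current group
    return groups
```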
13



Due to Aggregation (and II)
• The aggregation function may place the pivot in
  a new group, but the next one is not different
  from the ones in the group just closed
 ▫ Source of inter-group inconsistencies
 ▫ The pivot was just sufficiently different
 ▫ Or it was incorrectly arranged in step 3
  Query 190.011.224-1.1.1 (pairwise test results)

        2   3   4   5   6   7   8
    1   ≠   ≠   ≠   ≠   ≠   ≠   ≠
    2       ≠   ≠   ≠   ≠   ≠   ≠
    3           ≠   =   ≠   ≠   ≠
    4               =   =   =   =
    5                   =   =   =
    6                       =   =
    7                           =
14



Fully Consistent Lists
• Two sources of inconsistency
 ▫ Arrangement (inter-)
 ▫ Aggregation (inter- and intra-)

• There is a more profound problem
 ▫ Hypothesis testing is not transitive
 ▫ Not rejecting H0 does not mean accepting it
• Mann-Whitney U may say something like this
 ▫ A < B, B < C and A ≥ C (1-tailed test)
 ▫ A = B, B = C and A ≠ C (2-tailed test)

• We cannot ensure fully consistent lists
15



Alternative Aggregation
• A function that is too permissive leads to large groups
  ▫ Higher likelihood of intra-group inconsistencies
• A function that is too restrictive leads to small groups
  ▫ Higher likelihood of inter-group inconsistencies

• We consider three rationales for starting a new group (sketched after this list)
  ▫ All: a group begins if all incipits are different from
    the pivot. This should lead to larger groups.
  ▫ Any: a group begins if any incipit is different from
    the pivot. This should lead to smaller groups.
  ▫ Prev: a group begins if the previous incipit is
    different from the pivot.
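
The three rationales differ only in which members of the open group the pivot must be significantly different from. A sketch with a pluggable pairwise predicate (`diff` plays the role of the Mann-Whitney comparison from the previous sketch; names are illustrative):

```python
def starts_new_group(rationale, pivot, current_group, diff):
    """diff(a, b) -> True if a and b were ranked significantly different."""
    if rationale == "all":    # different from all incipits -> larger groups
        return all(diff(pivot, o) for o in current_group)
    if rationale == "any":    # different from any incipit  -> smaller groups
        return any(diff(pivot, o) for o in current_group)
    if rationale == "prev":   # different from the previous incipit only
        return diff(pivot, current_group[-1])
    raise ValueError(rationale)
```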
16



Alternative Aggregation (and II)
• After the arrangement in step 3 we may assume
  that an incipit ranked higher has a true
  rank either higher or equal, but not lower
 ▫ 1-tailed tests are more powerful than 2-tailed tests
    They are more likely to find a difference when
     there really is one


• Combine the three rationales with the two tests

• All-2, Any-2, Prev-2, All-1, Any-1 and Prev-1
 ▫ All-2 is the function originally used by Typke et al.
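
In scipy's `mannwhitneyu` the tail is selected with the `alternative` argument, so the six functions follow from crossing the rationale with the test. A sketch, under the assumption that expert ranks are encoded so that incipits arranged later get the larger rank values (otherwise 'less' would be the one-tailed direction):

```python
from scipy.stats import mannwhitneyu

ALPHA = 0.25

def make_diff(expert_ranks, tails):
    """Pairwise predicate for the 1- or 2-tailed Mann-Whitney U test."""
    alternative = "two-sided" if tails == 2 else "greater"
    def diff(pivot, other):
        p = mannwhitneyu(expert_ranks[pivot], expert_ranks[other],
                         alternative=alternative).pvalue
        return p < ALPHA
    return diff

# The six aggregation functions: All-2, Any-2, Prev-2, All-1, Any-1, Prev-1
variants = [(rationale, tails)
            for tails in (2, 1)
            for rationale in ("all", "any", "prev")]
```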
17



Measure of List Consistency
• Follow the logic behind ADR [Typke et al., 2006]
• Traverse the list from top to bottom
  ▫ Calculate the expanded set of allowed incipits
     All previous ones and those in the same group
  ▫ Compute the percentage of correct expansions
     The pivot is not considered (it is always correct)
  ▫ Average over all ranks in the list
     Ignore the last rank (it always expands to all incipits)

• 1 = all expansions are correct
  ▫ Fully consistent list (not to be expected)
• 0 = no expansion is correct
  (a code sketch of this measure follows the worked example on the next slide)
18



Measure of List Consistency (II)
• Ground truth = 〈 (A, B), (C), (D, E, F) 〉, but
  ▫ A = C (inter-group inconsistency, false negative)
  ▫ D ≠ F (intra-group inconsistency, false positive)

   Position   Correct expansion   Actual expansion   % of correct expansions
      1       B,C                 B                  0.5
      2       A                   A                  1
      3       A,B                 A,B                1
      4       A,B,C,E             A,B,C,E,F          0.8
      5       A,B,C,D,F           A,B,C,D,F          1
                                  List consistency   0.86
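
A sketch of the measure under one plausible reading of "% of correct expansions", namely the overlap (intersection over union) between the correct and the actual expansion sets; with that assumption it reproduces the 0.86 of the example above:

```python
def list_consistency(groups, equal):
    """ADR-style consistency of a partially ordered list.

    groups: the ground-truth groups, e.g. [["A","B"], ["C"], ["D","E","F"]].
    equal:  equal(a, b) -> True if the experts did NOT rank a and b
            significantly different.
    Per-rank score assumed here: |correct & actual| / |correct | actual|.
    """
    flat = [x for g in groups for x in g]
    group_of = {x: i for i, g in enumerate(groups) for x in g}
    scores = []
    for pos, pivot in enumerate(flat[:-1]):          # ignore the last rank
        previous = set(flat[:pos])
        same_group = {x for x in groups[group_of[pivot]] if x != pivot}
        actual = previous | same_group
        correct = previous | {x for x in flat if x != pivot and equal(pivot, x)}
        scores.append(len(actual & correct) / len(actual | correct))
    return sum(scores) / len(scores)

# The example above: ground truth <(A,B),(C),(D,E,F)>, with the two
# inconsistencies A = C and D != F; every other pair follows the groups.
equal_pairs = {frozenset("AB"), frozenset("AC"), frozenset("DE"), frozenset("EF")}
equal = lambda a, b: frozenset((a, b)) in equal_pairs
print(round(list_consistency([["A", "B"], ["C"], ["D", "E", "F"]], equal), 2))  # 0.86
```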
19



Measure of List Consistency (and III)
• Again, it comes in two flavors

• ADR-1 consistency with 1-tailed tests
 ▫ Accounts for inconsistencies due to
   arrangement and aggregation

• ADR-2 consistency with 2-tailed tests
 ▫ Only accounts for inconsistencies due to
   aggregation
20



Results
• Re-generate the 11 lists used in MIREX 2005
  with the alternative aggregation functions

• Compare with the original All-2 in terms of
 ▫ ADR-1 consistency across the 11 queries
 ▫ Group size across the 11 queries
 ▫ Are they correlated?

• Re-evaluate the MIREX 2005 SMS task
 ▫ Would it have been different?
21



List Consistency vs Group Size




   Aggregation   ADR-1         Incipits    Pearson's r
   function      consistency   per group
   All-2         0.844         3.752       -0.892***
   Any-2         0.913**       2.539*      -0.862***
   Prev-2        0.857         3.683       -0.937***
   All-1         0.881         3.297       -0.954***
   Any-1         0.926**       1.981**     -0.749***
   Prev-1        0.916*        2.858       -0.939***
22



List Consistency vs Group Size (and II)
• The original function is outperformed by all
  five alternatives proposed
  ▫ ADR-1 consistency rises from 0.844 to 0.926
     Significant at the 0.05 level with just 11 data points

• The relative order is kept within test types
  ▫ All is worse than Prev, which is worse than Any
• All-x are also more variable across lists

• The smaller its groups, the more consistent the list
  ▫ This is why Any-x is better than All-x
23



Example: Query 600.053.481-1.1.1
                            All-2   Any-2   Prev-2   All-1   Any-1   Prev-1
                              1       1       1        1       1       1
                              2       2       2        2       2       2
                              3       3       3        3       3       3
                              3       3       3        3       3       3
                              3       3       3        3       3       3
                              3       4       4        3       4       4
                              3       4       4        4       5       5
                              3       4       4        4       5       5
                              3       5       4        4       5       5
   ADR-1 consistency        0.782   0.908   0.928    0.95    0.975   0.975
   % intra-inconsistencies  0.667   0.333   0.333    0.222   0       0
   % inter-inconsistencies  0       0.1     0.037    0       0.033   0.033
24



MIREX 2005 Revisited
• The lists could have been more consistent
 ▫ How would that have affected the evaluation?
• Re-evaluate the 7 systems with the five
  alternative functions and compare the results
   System   All-2   Any-2   Prev-2   All-1   Any-1   Prev-1
   GAM      0.66    0.59    0.66     0.624   0.583   0.605
   O        0.65    0.607   0.65     0.643   0.593   0.639
   US       0.642   0.604   0.642    0.639   0.594   0.628
   TWV      0.571   0.558   0.571    0.566   0.556   0.564
   L(P3)    0.558   0.52    0.558    0.54    0.515   0.534
   L(DP)    0.543   0.503   0.543    0.511   0.494   0.506
   FM       0.518   0.498   0.518    0.507   0.483   0.507
   Kendall's τ w.r.t. the All-2 ranking:
            -       0.81    1        0.81    0.714   0.714
25



MIREX 2005 Revisited (and II)
• All systems perform up to 12% worse
 ▫ The alternatives produce smaller groups, which
   lead to fewer false positives due to
   intra-group inconsistencies

• The ranking of systems would have changed
 ▫ Kendall’s τ = 0.714 to 0.81

• We overestimated system effectiveness
 ▫ And not just in MIREX, other papers did too
26



Conclusions
• Partially ordered lists make a better ground truth
  for similarity tasks, but they have problems

• We identified new (more fundamental) issues
  ▫ Intra- and inter-group inconsistencies
  ▫ We cannot expect fully consistent lists
     The evaluation will always be incorrect to some extent
     At least with this methodology
• We proposed several alternatives and
  a way to measure the consistency of a list
  ▫ All alternatives yield more consistent ground truths
  ▫ Showing that we have overestimated system performance
27



Future Work
• Evaluate other collections
• The significance level used was α=0.25
  ▫ Why? How does it affect the consistency?
• Other effectiveness measures can be proposed

• We believe that partially ordered lists should
  come back to the official evaluations
  ▫ First, make them cheaper and solve their problems
• We are working on it! [Urbano et al., SIGIR CSE 2010]
  ▫   Auto-organizing preference judgments
  ▫   Crowdsourcing
  ▫   Pooling
  ▫   Minimal and incremental test collections
28



Evaluation Experiments
• Essential for Information Retrieval
• But somewhat scarce in Music IR
 ▫ Private collections
      Royalties and Copyright do not exactly help…
 ▫   Non-standard methodologies
 ▫   Non-standard effectiveness measures
 ▫   Hard to replicate
 ▫   Threats to internal and external validity
• MIR community acknowledges the need for
  these formal evaluation experiments [Downie, 2004]
• MIREX came up in 2005 to help with this, but…
29



Meta-Evaluation Analysis
• … now we have to meta-evaluate
 ▫   How well are we doing?
 ▫   Are we really improving our systems?
 ▫   Are we fair with all systems?
 ▫   Should we try new methodologies?
 ▫   Are we really measuring what we want to?
 ▫   How far can we go?
 ▫   Are we covering all user needs?
 ▫   Are our assumptions reasonable?
• Can we improve the evaluation itself?
 ▫ It would make the field improve more rapidly
30



And That’s It!




                 Picture by 姒儿喵喵
