Audio Music Similarity and Retrieval:
                   Evaluation Power and Stability
                                   Julián Urbano @julian_urbano
                           Diego Martín, Mónica Marrero and Jorge Morato
                                         University Carlos III of Madrid




                                                                                       ISMIR 2011
Picture by Michael Shane                                                   Miami, USA · October 26th
AMS

retrieve audio clips
 musically similar
  to a query clip
grand results
                                  (MIREX 2009)
I won!
                      oh, come on! it's so close!
                            but the difference is not significant…
                                                       did you hear?
         yeah, it’s not
          significant!     shut up… we are!                    damn it!

                               don't worry
                                 about it
what does it mean?




Picture by Sara A. Beyer
proper interpretation of p-values

H0: mean score of system A = mean score of B
H1: mean scores are different

 a statistical test returns p<0.01, so we conclude A >> B

                                  B A
it means that if we assume H0 is true
  and repeat the experiment,
  there is a <0.01 probability
  of observing this result again*

                                        *or one even more extreme
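The reading of the p-value above can be made concrete with a small sketch. It uses a paired randomization (sign-flip) test rather than the t-test or Wilcoxon test named later in the slides, simply because it needs no external library; the per-query scores and the function name are made up for illustration.

```python
import random

def paired_randomization_p_value(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for H0: mean score of A = mean score of B.

    Under H0 the sign of each per-query difference is arbitrary, so we flip
    signs at random and count how often the mean difference is at least as
    extreme as the one actually observed.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical per-query scores for two systems (not MIREX data).
system_a = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.66, 0.73, 0.69]
system_b = [0.51, 0.60, 0.50, 0.65, 0.58, 0.61, 0.57, 0.52, 0.63, 0.60]
p = paired_randomization_p_value(system_a, system_b)
print(f"p = {p:.4f}")
```

A small p says that a difference this large would be rare if H0 held; it does not by itself say how large or how repeatable the difference is.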
conclusions about general behavior

   MIREX 2009: A > B
      system A is better than B, but it’s not statistically significant
      this evaluation is not powerful
   MIREX 2010: A ? B
      we can expect anything with a different collection

   MIREX 2009: A >> B
      A is better than B, and it’s statistically significant
      this one is powerful… and stable
   MIREX 2010: A >> B
      we expect the same: A is significantly better than B
      but these could also happen:
         A > B     lack of power in MIREX 2010
         A < B     minor stability conflict
         A << B    major stability conflict
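The four possible outcomes across two collections can be written down as a small classifier. This is only a sketch of the taxonomy on the slide: the function name, its arguments, and the treatment of exact ties are assumptions, not something taken from the paper.

```python
def classify(diff_1, significant_1, diff_2, significant_2):
    """Compare the outcome of A vs. B in two collections.

    diff_x is mean(A) - mean(B) in collection x, and significant_x tells
    whether that difference was statistically significant there.
    """
    if not significant_1 and not significant_2:
        return "no significance in either collection"
    # Put the significant result first so the rules below read like the slide.
    if not significant_1:
        diff_1, diff_2 = diff_2, diff_1
        significant_2 = False
    # Assumption: an exact tie in the other collection is not a sign swap.
    same_sign = diff_1 * diff_2 >= 0
    if significant_2:
        return "stable" if same_sign else "major conflict"
    return "lack of power" if same_sign else "minor conflict"

print(classify(0.10, True, 0.12, True))    # stable
print(classify(0.10, True, 0.03, False))   # lack of power
print(classify(0.10, True, -0.02, False))  # minor conflict
print(classify(0.10, True, -0.09, True))   # major conflict
```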
it's all about reliability
Isaac Newton
               on the shoulders of giants
Text REtrieval Conference
  [Buckley and Voorhees, 2000] (no significance testing)
      1% to 14% of comparisons show stability conflicts (depends on the measure used)
      ~25% differences to ensure <5% conflicts with 50 queries
  [Sanderson and Zobel, 2005] (sensitivity)
      improved reliability with pairwise t-tests (others were not as good)
      virtually no conflicts if >10% differences with significance
  [Voorhees, 2009] (effort)
      with many queries, even significance is unreliable
  [Sakai, 2007]
      major review: other collections and more recent measures
      some measures are much better than others
      (does not mean they should not be used!)
Music Similarity and Retrieval
  [Typke et al., 2005][Urbano et al., 2010]
      alternative forms of ground truth for SMS (no prefixed relevance scale)
      reliable and comprehensive but too expensive
  [Typke et al., 2006]
      specific measure for the task
  [Jones et al., 2007]
      agreement between judgments by different people
      propose to use more queries
      despite high agreement, evaluation does change…
  [Urbano et al., 2010][Lee, 2010]
      cheaper judgments via crowdsourcing seem reliable
  [Urbano, 2011]
      many other things (more about this in 30 mins)
it's actually about the
 effort-reliability tradeoff
   task        relevance judgments      # of systems
# of queries         measures         system similarity
                statistical methods
measures
                                    &
                                judgments
Picture by Wessex Archaeology
how much information does the user gain?

   results as a set
             AG@5: Average Gain in the top 5 documents
                   (measure used in MIREX, with different name)

   results as a list (more realistic user model)
             NDCG@5: Normalized Discounted Cumulated Gain
             ANDCG@5: Average NDCG across ranks
             ADR@5: Average Dynamic Recall
      best documents first, and the lower the rank the lower the gain*

*details in the paper
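To show the difference between treating results as a set and as a list, here is a minimal sketch of AG@k, NDCG@k and ANDCG@k. It assumes the common textbook formulation (log2 rank discount, normalization by the ideal ranking of the judged documents); the exact definitions used in the paper may differ, and ADR is left out because its definition is only given there. All function names are illustrative.

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain: mean gain of the top-k results, ignoring their order."""
    return sum(gains[:k]) / k

def dcg_at_k(gains, k):
    """Discounted Cumulated Gain with a log2 discount (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, judged_gains, k=5):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def andcg_at_k(gains, judged_gains, k=5):
    """Average of NDCG@1 ... NDCG@k."""
    return sum(ndcg_at_k(gains, judged_gains, i) for i in range(1, k + 1)) / k

# Hypothetical Broad gains (0, 1 or 2) for the same five documents in two orders.
best_first = [2, 2, 1, 1, 0]
worst_first = [0, 1, 1, 2, 2]
judged = best_first  # assume these five are all the judged documents

print(ag_at_k(best_first), ag_at_k(worst_first))          # same: the set is identical
print(ndcg_at_k(best_first, judged), ndcg_at_k(worst_first, judged))
print(andcg_at_k(best_first, judged), andcg_at_k(worst_first, judged))
```

AG cannot tell the two rankings apart, while the list-based measures reward the one that puts the best documents first.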
how much information does a result provide?


        BROAD relevance judgments
                not similar = 0
             somewhat similar = 1
               very similar = 2

         FINE relevance judgments
        real-valued, from 0 to 10 or 100
look at MIREX 2009

  largest evaluation until 2011
power




Picture by Roger Green
% of pairwise comparisons that are significant
            what's the effect of:
                 number of queries
               relevance judgments
              effectiveness measures

   all 100 queries set, balanced across 10 genres
      (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)
   random sample: 5 query subset
      stratified random sampling with equal priors
   evaluation with Broad judgments and with Fine judgments
   repeat 500 times for 5 query subsets to minimize random effects
      52,500 system comparisons
   repeat another 500 times for 10 query subsets
   … up to the all 100 query subset

   [Figure: % significant vs. # queries, one panel for Broad judgments and one for Fine judgments]
we simulate possible
evaluation scenarios
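The simulation loop can be sketched as follows. This is one possible reading of "stratified random sampling with equal priors" (spread each subset as evenly as possible over the genres); the query IDs, system names, and the `is_significant` callback are placeholders, and the actual statistical test is deliberately left to the caller.

```python
import random
from itertools import combinations

GENRES = ["baroque", "blues", "classical", "country", "edance",
          "jazz", "metal", "rap-hiphop", "rock&roll", "romantic"]

def stratified_sample(queries_by_genre, size, rng):
    """Draw `size` queries, spread as evenly as possible over the genres."""
    genres = list(queries_by_genre)
    rng.shuffle(genres)  # random genres receive the leftover queries
    base, extra = divmod(size, len(genres))
    sample = []
    for i, genre in enumerate(genres):
        n = base + (1 if i < extra else 0)
        sample.extend(rng.sample(queries_by_genre[genre], n))
    return sample

def percent_significant(systems, queries_by_genre, size, is_significant,
                        trials=500, seed=0):
    """% of pairwise comparisons found significant over `trials` subsets."""
    rng = random.Random(seed)
    pairs = list(combinations(systems, 2))
    hits = 0
    for _ in range(trials):
        subset = stratified_sample(queries_by_genre, size, rng)
        hits += sum(1 for a, b in pairs if is_significant(a, b, subset))
    return 100.0 * hits / (trials * len(pairs))

# Hypothetical collection: 10 query IDs per genre and 15 anonymous systems.
queries_by_genre = {g: [f"{g}-{i}" for i in range(10)] for g in GENRES}
systems = [f"S{i}" for i in range(15)]
n_pairs = len(list(combinations(systems, 2)))
print(n_pairs, n_pairs * 500)  # 105 52500
```

With 15 systems there are 105 pairs per trial, so 500 trials give the 52,500 system comparisons per subset size quoted on the slide, assuming that is how the figure was obtained.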
power results (larger is better)

   [Figure: % Significant comparisons vs. Query set size (40 to 100) for AG, NDCG, ANDCG and ADR,
    one panel for Broad judgments and one for Fine judgments; the power in MIREX 2009 is marked]

   similar logarithmic trend except for ADR with Fine judgments (expected)
   Broad judgments: only 2 significant pairs missed with 70% effort (probably unstable)
   Fine judgments: same power with 70% effort!
merely using more queries
does not pay off
when looking for power
stability




Picture by Dave Hunt
% of pairwise comparisons that are conflicting
            what's the effect of:
                 number of queries
               relevance judgments
              effectiveness measures

   all 100 queries set, balanced across 10 genres
   two 5 query subsets, drawn as independent samples
   evaluation of each subset with Broad judgments and with Fine judgments
   repeat 500 times to minimize random effects
      52,500 cross-collection system comparisons
   … up to two 50 query subsets
      with 100 total queries we can’t go beyond 50

   [Figure: % conflicting vs. # queries, one panel for Broad judgments and one for Fine judgments]
we simulate comparisons
across possible collections
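A minimal sketch of the cross-collection comparison, assuming two disjoint random query subsets stand in for two collections. Genre stratification is omitted here for brevity, and `outcome` is a hypothetical callback that returns +1 (A significantly better), -1 (B significantly better) or 0 (no significant difference) for a pair of systems on a subset.

```python
import random

def independent_subsets(queries, size, rng):
    """Two disjoint random subsets of `size` queries each."""
    if 2 * size > len(queries):
        raise ValueError("two disjoint subsets need at least 2 * size queries")
    drawn = rng.sample(queries, 2 * size)
    return drawn[:size], drawn[size:]

def percent_conflicting(pairs, queries, size, outcome, trials=500, seed=0):
    """% of system pairs whose outcome differs between the two subsets."""
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(trials):
        first, second = independent_subsets(queries, size, rng)
        conflicts += sum(1 for a, b in pairs
                         if outcome(a, b, first) != outcome(a, b, second))
    return 100.0 * conflicts / (trials * len(pairs))

queries = list(range(100))  # hypothetical query IDs
first, second = independent_subsets(queries, 50, random.Random(0))
print(len(first), len(second), len(set(first) & set(second)))  # 50 50 0
```

The size check is the reason the plots stop at 50 queries: two non-overlapping subsets cannot use more than half of a 100 query collection each.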
stability results (lower is better)

   [Figure: % Conflicting comparisons vs. Query subset size (5 to 50) for AG, NDCG, ANDCG and ADR,
    one panel for Broad judgments and one for Fine judgments; the stability in MIREX 2009 is marked]

   lack of power in one collection but not in the other
   ADR takes longer to converge
   converge to <5% for >40 queries (consistent with α=0.05)
merely using more queries
does not pay off
when looking for stability
type of conflicts (50 queries)

judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
Broad      AG          3.36%      100%          0%           0%
Broad      NDCG        3.77%     99.90%      0.10%           0%
Broad      ANDCG       4.73%     99.96%      0.04%           0%
Broad      ADR         9.03%     99.94%      0.06%           0%
Fine       AG          2.64%     99.86%      0.14%           0%
Fine       NDCG        2.94%     99.74%      0.26%           0%
Fine       ANDCG       4.03%     99.91%      0.09%           0%
Fine       ADR        19.08%     99.50%      0.50%           0%

no major conflict whatsoever
virtually all conflicts due to lack of power in one collection
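The three conflict columns can be read as a classification rule. The sketch below is one plausible reading, not the authors' code: it assumes one collection already shows a significant A > B, and that each collection is summarized by a hypothetical pair (mean difference of A minus B, whether the test was significant).

```python
def classify(first, second):
    """Classify the second collection's verdict given a significant A > B in the first.

    Each argument is a hypothetical (mean_difference, significant) pair for A minus B.
    """
    diff1, sig1 = first
    diff2, sig2 = second
    if not (sig1 and diff1 > 0):
        raise ValueError("first collection must show a significant A > B")
    if diff2 > 0:
        # same direction: either confirmed, or not enough power to confirm
        return "agreement" if sig2 else "lack of power"
    if diff2 < 0:
        # opposite direction: major only if it is also significant
        return "major conflict" if sig2 else "minor conflict"
    return "lack of power"  # exact tie, treated here as no evidence either way


print(classify((0.10, True), (0.02, False)))   # lack of power
print(classify((0.10, True), (-0.01, False)))  # minor conflict
print(classify((0.10, True), (-0.08, True)))   # major conflict
```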
if significance shows up
it most probably is correct
     are we being too conservative?
statistics
Milton Friedman      Frank Wilcoxon   John Tukey
compare two systems

                     is the difference significant?
                 t-test, Wilcoxon test, sign test, etc.
       they make
different assumptions                                     stability conflict
                         significance level α
                     probability of Type I error
       (finding a significant difference when there is none)


                    usually, α=0.05 or α=0.01
         5% or 1% of my significant results are just wrong
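As an illustration of one of the tests named above, here is a minimal sketch of an exact two-sided sign test on paired per-query scores. The function name and the example scores are made up for illustration; this is not the code used in MIREX.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired per-query scores.

    Returns the p-value for H0: A and B are equally likely to win a query.
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # tied queries carry no information and are dropped
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # probability of k or more wins in n fair coin flips (one tail)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# hypothetical scores: A beats B on 9 of 10 queries
a = [0.6] * 9 + [0.2]
b = [0.5] * 10
p = sign_test(a, b)
print(p)                    # 0.021484375
print(p < 0.05, p < 0.01)   # True False
```

With these made-up scores the difference is significant at α=0.05 but not at α=0.01, which shows how the chosen significance level changes the conclusion.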
MIREX 2009 compare several systems
           15 systems = 105 comparisons

       experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05)
    we can expect at least one significant comparison to be wrong



                 instead, compare all systems at once
               ANOVA, Friedman test, Kruskal-Wallis, etc.
      used in MIREX
(with different assumptions)
  correct p-values to keep experiment-wide significance level <0.05
   Tukey’s HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.
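The arithmetic on this slide can be checked with a few lines. This is a sketch under the usual simplifying assumption that the pairwise comparisons are independent; the function names are illustrative, and the Bonferroni threshold is shown only as the simplest of the corrections listed above.

```python
def n_pairwise(n_systems):
    """Number of distinct pairs among n_systems."""
    return n_systems * (n_systems - 1) // 2


def experiment_wide_alpha(alpha, n_comparisons):
    """Chance of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison level that keeps the experiment-wide level at or below alpha."""
    return alpha / n_comparisons


m = n_pairwise(15)
print(m)                                         # 105
print(round(experiment_wide_alpha(0.05, m), 3))  # 0.995
print(experiment_wide_alpha(bonferroni_alpha(0.05, m), m) < 0.05)  # True
```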
more stability
at the cost of
 less power
   is it worth it?
what a MIREX participant wants

          compare my system with the other 14
      comparisons between those 14 are uninteresting

   subexperiment: only 14 pairwise comparisons, not 105
   get back the power missed by considering the other 91
                         should throw out more conflicts too
number of comparisons grows linearly with number of systems
  subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05)

 compare all systems with 1-tailed Wilcoxon tests at α=0.01
   experiment-wide significance level = 1-(1-0.01)^105 = 0.652
 subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131
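For reference, a 1-tailed Wilcoxon signed-rank test on paired per-query scores can be sketched in pure Python as below. This is an illustrative exact implementation, not the code behind the slides: zero differences are dropped, tied absolute differences share their average rank, and the function name and example scores are hypothetical.

```python
def wilcoxon_one_tailed(scores_a, scores_b):
    """Exact 1-tailed Wilcoxon signed-rank p-value for H1: A scores higher than B."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # rank |differences|; ranks are stored doubled so tie averages stay integers
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # doubled average of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # counts[s] = number of sign assignments whose doubled positive-rank sum is s
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus2:]) / 2 ** n


# hypothetical scores: A beats B on all 5 queries
print(wilcoxon_one_tailed([3, 5, 7, 9, 11], [2, 3, 4, 5, 6]))  # 0.03125
```

In the setting described above, each pairwise p-value would then be compared against α=0.01.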
power results (larger is better)


[Figure: % significant comparisons vs. query set size (40 to 100), for Broad and Fine judgments; measures AG, NDCG, ANDCG, ADR; method: Friedman+Tukey (as in MIREX)]
comparing all pairs with 1-tailed Wilcoxon tests
is up to 20% more powerful than Friedman+Tukey
same power with 50% effort!
stability results (lower is better)

earlier convergence because of increased power

[Figure: % conflicting comparisons vs. query subset size (5 to 50), for Broad and Fine judgments; measures AG, NDCG, ANDCG, ADR]
AG converges again to 3-4%
(A)NDCG converge to 5-6%
type of conflicts (50 queries)

judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
Broad      AG          3.68%     96.32%      3.68%           0%
Broad      NDCG        5.05%     96.82%      3.18%           0%
Broad      ANDCG       6.08%     96.84%      3.13%        0.03%
Broad      ADR         5.93%     95.12%      4.88%           0%
Fine       AG          3.32%     98.34%      1.66%           0%
Fine       NDCG        6.58%     96.61%      3.39%           0%
Fine       ANDCG       6.44%     94.94%      5.06%           0%
Fine       ADR        12.48%     90.58%      9.37%        0.05%

within known Type III error rates
again, due to lack of power in one collection
no major conflicts
effort-reliability tradeoff (power - conflicts = stable)

                     Friedman+Tukey with 100 queries    1-tailed Wilcoxon with 50 queries
judgments  measure   power    conflicts   stable        power    conflicts   stable
Broad      AG        57.14%     3.64%     53.50%        55.10%     3.68%     51.42%
Broad      NDCG      57.14%     4.08%     53.06%        57.01%     5.05%     51.96%
Broad      ANDCG     57.14%     4.19%     52.95%        57.37%     6.08%     51.29%
Broad      ADR       56.19%     7.13%     49.06%        57.30%     5.93%     51.37%
Fine       AG        54.29%     3.20%     51.09%        54.31%     3.32%     50.99%
Fine       NDCG      56.19%     3.04%     53.15%        57.56%     6.58%     50.98%
Fine       ANDCG     56.19%     2.96%     53.23%        57.38%     6.44%     50.94%
Fine       ADR       56.19%    19.97%     36.22%        55.03%    12.48%     42.55%

virtually same reliability with half the effort!
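The "stable" figure is simply power minus conflicts. The short sketch below recomputes it for one row of the tradeoff table (Broad judgments, AG measure), with the numbers copied from the slide; the dictionary layout and function name are illustrative.

```python
# (power %, conflicts %) for Broad judgments, AG measure, as reported on the slide
setups = {
    "Friedman+Tukey, 100 queries": (57.14, 3.64),
    "1-tailed Wilcoxon, 50 queries": (55.10, 3.68),
}


def stable(power, conflicts):
    """Percentage of comparisons that are significant and not in conflict."""
    return round(power - conflicts, 2)


for name, (power, conflicts) in setups.items():
    print(f"{name}: {stable(power, conflicts):.2f}% stable")
```

For this row the two setups differ by about two percentage points of stable comparisons, while the second uses half as many queries.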
Friedman-Tukey requires
too much effort
my point?
Do not attempt to accomplish greater results
by a greater effort of your little understanding,
but by a greater understanding of your little effort.
                                           - Walter Russell
using more and more queries is pointless
     too much effort for the small gain in power and stability

  using different similarity scales has little effect
               using only one is probably just fine

some effectiveness measures are better than others
     they should still be used: they measure different things
            but bear in mind their power and stability

  some statistical methods are better than others
          virtually same reliability with half the effort

  if significance shows up it most probably is true
           at worst, conflicts are due to lack of power
Picture by Ronny Welter
forget about power and worry about effect-size
       eventually, significance becomes meaningless
             reduce the judging effort
          more queries in Symbolic Melodic Similarity
  reliable low-cost in-house evaluations and Crowdsourcing

             deeper evaluation cutoffs
   not just the top 5 documents: pay attention to ranking
    probably more reliable, and certainly more reusable

         effect of the number of systems
      especially if developed by the same research group

             other statistical methods
       Multiple Comparisons with a Control (baseline)
     other collections, tasks and measures
guide experimenters in
 the interpretation
 of the results and the
  tradeoff between
effort and reliability

Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Audio Music Similarity and Retrieval: Evaluation Power and Stability

  • 1. Audio Music Similarity and Retrieval: Evaluation Power and Stability. Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato, University Carlos III of Madrid. ISMIR 2011, Miami, USA · October 26th. Picture by Michael Shane.
  • 2. AMS (Audio Music Similarity): retrieve audio clips musically similar to a query clip
  • 3. grand results (MIREX 2009)
  • 6. grand results (MIREX 2009): "I won!" / "oh, come on! it's so close!" / "but the difference is not significant…" / "yeah, it's not significant!" / "did you hear?" / "shut up… we are!" / "damn it!" / "don't worry about it"
  • 7. what does it mean? Picture by Sara A. Beyer
  • 9. proper interpretation of p-values
      H0: mean score of system A = mean score of system B
      H1: mean scores are different
      a statistical test returns p<0.01, so we conclude A >> B
      it means that if we assume H0 and repeat the experiment, there is a <0.01 probability of obtaining this result again, or one even more extreme
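The reading of a p-value given on this slide can be illustrated with a small randomization test. This is only a minimal sketch and not the test used in MIREX or in this study: it assumes made-up per-query scores for two hypothetical systems, and the function name `paired_permutation_p_value` is invented for the example. Under H0 the system label of each per-query score is arbitrary, so the sketch enumerates every sign assignment of the per-query differences and counts how often the result is at least as extreme as the observed one.

```python
from itertools import product


def paired_permutation_p_value(scores_a, scores_b):
    """Exact two-sided p-value for H0: mean score of A = mean score of B.

    Under H0 the label (A or B) of each per-query score is arbitrary, so
    every sign assignment of the per-query differences is equally likely.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2^n sign assignments (only feasible for few queries).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Count outcomes at least as extreme as the one observed.
        if permuted >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical per-query scores for two systems over 8 queries.
system_a = [0.70, 0.55, 0.80, 0.65, 0.60, 0.50, 0.75, 0.45]
system_b = [0.50, 0.45, 0.50, 0.40, 0.45, 0.45, 0.55, 0.35]
p_value = paired_permutation_p_value(system_a, system_b)
print(f"p = {p_value:.4f}")
```

In this toy data A beats B on every query, so only 2 of the 256 sign assignments are as extreme as the observed one and the p-value falls below 0.01.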
  • 10. conclusions about general behavior
      MIREX 2009: A > B (system A is better than B, but it's not statistically significant): this evaluation is not powerful
        MIREX 2010: A ? B (we can expect anything with a different collection)
      MIREX 2009: A >> B (A is better than B, and it's statistically significant): this one is powerful… and stable
        MIREX 2010: A >> B (we expect the same: A is significantly better than B)
        but these could also happen: A > B (lack of power in MIREX 2010), A < B (minor stability conflict) or A << B (major stability conflict)
  • 11. it's all about reliability
  • 12. Isaac Newton on the shoulders of giants
  • 13. Text REtrieval Conference
      no significance testing [Buckley and Voorhees, 2000]: 1% to 14% of comparisons show stability conflicts (depends on the measure used); ~25% differences to ensure <5% conflicts with 50 queries
      sensitivity [Sanderson and Zobel, 2005]: improved reliability with pairwise t-tests (others were not as good); virtually no conflicts if >10% differences with significance
      effort [Voorhees, 2009]: with many queries, even significance is unreliable
      [Sakai, 2007] major review: other collections and more recent measures; some measures are much better than others (does not mean they should not be used!)
  • 14. Music Similarity and Retrieval
      [Typke et al., 2005][Urbano et al., 2010] alternative forms of ground truth for SMS (Symbolic Melodic Similarity): reliable and comprehensive but too expensive; no prefixed relevance scale
      [Typke et al., 2006] specific measure for the task
      [Jones et al., 2007] agreement between judgments by different people: despite high agreement, evaluation does change…; propose to use more queries
      [Urbano et al., 2010][Lee, 2010] cheaper judgments via crowdsourcing: seems reliable
      [Urbano, 2011] many other things: more about this in 30 mins
  • 16. it's actually about the effort-reliability tradeoff: task, relevance judgments, # of systems, # of queries, measures, system similarity, statistical methods
  • 17. measures & judgments Picture by Wessex Archaeology
  • 18. how much information does the user gain?
      results as a set: AG@5, Average Gain in the top 5 documents (measure used in MIREX, with a different name)
      results as a list (more realistic user model): NDCG@5, Normalized Discounted Cumulated Gain; ANDCG@5, Average NDCG across ranks; ADR@5, Average Dynamic Recall
      best documents first, and the lower the rank the lower the gain* (*details in the paper)
  • 19. how much information does a result provide?
      BROAD relevance judgments: not similar = 0, somewhat similar = 1, very similar = 2
      FINE relevance judgments: real-valued, from 0 to 10 or 100
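To make the set-based and list-based views concrete, here is a minimal sketch of AG@k and NDCG@k over BROAD gains. It uses the common textbook formulation of NDCG with a log2 rank discount, which is an assumption: the exact definitions used in the study are deferred to the paper and may differ. The function names and the judged list are hypothetical.

```python
import math


def average_gain(gains, k=5):
    """AG@k: mean gain of the top-k results, treated as a set."""
    return sum(gains[:k]) / k


def dcg(gains, k=5):
    """Discounted cumulated gain: lower ranks contribute less."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(gains, judged_gains, k=5):
    """NDCG@k: DCG of the list divided by the DCG of an ideal ordering."""
    ideal_dcg = dcg(sorted(judged_gains, reverse=True), k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical BROAD judgments (0, 1 or 2) for one system's top-5 list,
# and the same five documents returned in the best possible order.
ranked = [2, 0, 1, 2, 0]
reordered = [2, 2, 1, 0, 0]

print(average_gain(ranked), average_gain(reordered))
print(ndcg(ranked, ranked), ndcg(reordered, ranked))
```

AG@5 is identical for both lists because it ignores order, while NDCG@5 rewards the list that puts the best documents first.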
  • 20. look at MIREX 2009: largest evaluation until 2011
  • 22-29. % of pairwise comparisons that are significant
      what's the effect of: number of queries, relevance judgments, effectiveness measures
      start from the set of all 100 queries, balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
      draw a 5-query subset as a random sample (stratified random sampling with equal priors)
      evaluate it with Broad judgments and with Fine judgments, and record the % of significant comparisons
      repeat 500 times for 5-query subsets to minimize random effects (52,500 system comparisons)
      repeat another 500 times for 10-query subsets, and so on for larger subsets up to all 100 queries
      [Figure: % significant against # queries, one plot for Broad judgments and one for Fine judgments]
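The resampling procedure described in these slides can be sketched as follows. This is a hedged illustration under several assumptions: the way genres are balanced when the subset is smaller than the number of genres is a guess, a two-sided exact sign test is swapped in for the Friedman+Tukey procedure used in MIREX, and all names (`stratified_sample`, `sign_test_p`, `percent_significant`) and scores are hypothetical.

```python
import random
from itertools import combinations
from math import comb


def stratified_sample(queries_by_genre, subset_size, rng):
    """Pick queries genre by genre, visiting genres in random order."""
    pools = {g: rng.sample(qs, len(qs)) for g, qs in queries_by_genre.items()}
    if subset_size > sum(len(p) for p in pools.values()):
        raise ValueError("subset larger than the query set")
    genres = list(pools)
    sample = []
    while len(sample) < subset_size:
        rng.shuffle(genres)  # no genre is favoured in a partial round
        for genre in genres:
            if pools[genre] and len(sample) < subset_size:
                sample.append(pools[genre].pop())
    return sample


def sign_test_p(a, b):
    """Two-sided exact sign test on paired per-query scores."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def percent_significant(scores, queries_by_genre, subset_size,
                        trials=500, alpha=0.05, seed=0):
    """scores[system][query] is a score; returns % of significant pairs."""
    rng = random.Random(seed)
    significant = total = 0
    for _ in range(trials):
        subset = stratified_sample(queries_by_genre, subset_size, rng)
        for s1, s2 in combinations(sorted(scores), 2):
            a = [scores[s1][q] for q in subset]
            b = [scores[s2][q] for q in subset]
            significant += sign_test_p(a, b) < alpha
            total += 1
    return 100.0 * significant / total


# Toy collection: 10 genres x 10 queries, three hypothetical systems.
GENRES = ["baroque", "blues", "classical", "country", "edance",
          "jazz", "metal", "rap-hiphop", "rock&roll", "romantic"]
queries_by_genre = {g: [f"{g}-{i}" for i in range(10)] for g in GENRES}
all_queries = [q for qs in queries_by_genre.values() for q in qs]
scores = {
    "strong": {q: 0.9 for q in all_queries},
    "weak": {q: 0.1 for q in all_queries},
    "weak_copy": {q: 0.1 for q in all_queries},
}
power = percent_significant(scores, queries_by_genre, subset_size=10, trials=20)
print(f"{power:.2f}% of pairwise comparisons are significant")
```

With 15 systems each trial would yield 105 pairwise comparisons, so 500 trials give the 52,500 comparisons mentioned on the slides.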
  • 33. power results (larger is better): power in MIREX 2009
      [Figure: two panels, Broad judgments and Fine judgments; x-axis: query set size (40 to 100); y-axis: % significant comparisons (46 to 64); one curve per measure: AG, NDCG, ANDCG, ADR]
      similar logarithmic trend, except for ADR with Fine judgments (expected)
      same power with 70% effort!
      only 2 significant pairs missed with 70% effort (probably unstable)
  • 34. merely using more queries does not pay off when looking for power
  • 36-41. % of pairwise comparisons that are conflicting
      what's the effect of: number of queries, relevance judgments, effectiveness measures
      start from the set of all 100 queries, balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
      draw two 5-query subsets as independent samples
      evaluate each with Broad judgments and with Fine judgments, and record the % of conflicting comparisons between the two
      repeat 500 times to minimize random effects (52,500 cross-collection system comparisons)
      repeat for larger subsets up to 50-query subsets: with 100 total queries we can't go beyond 50
      [Figure: % conflicting against # queries, one plot for Broad judgments and one for Fine judgments]
  • 42. we simulate comparisons across possible collections
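The conflict types counted in this simulation can be sketched as a small classifier. It follows the labels of the earlier slide on general behavior (lack of power, minor conflict, major conflict); treating the two collections symmetrically and the extra labels "agreement" and "no claim" are assumptions made for this illustration, as are the function name and the toy outcomes.

```python
from collections import Counter


def classify(first, second):
    """Compare the outcome for one system pair in two collections.

    Each outcome is (difference, significant): the mean score of A minus
    that of B, and whether the test called the difference significant.
    """
    diff1, sig1 = first
    diff2, sig2 = second
    if not sig1 and not sig2:
        return "no claim"  # neither collection supports a conclusion
    if sig1 and sig2:
        # A >> B in both, or A >> B in one and A << B in the other.
        return "agreement" if diff1 * diff2 > 0 else "major conflict"
    # Exactly one collection found a significant difference.
    return "lack of power" if diff1 * diff2 >= 0 else "minor conflict"


# Hypothetical outcomes for four system pairs in two query subsets.
outcomes = [
    ((0.10, True), (0.08, True)),    # A >> B twice
    ((0.10, True), (0.03, False)),   # A >> B, then A > B
    ((0.10, True), (-0.02, False)),  # A >> B, then A < B
    ((0.10, True), (-0.09, True)),   # A >> B, then A << B
]
tally = Counter(classify(first, second) for first, second in outcomes)
print(tally)
```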
  • 46. stability results (lower is better): stability in MIREX 2009
      [Figure: two panels, Broad judgments and Fine judgments; x-axis: query subset size (5 to 50); y-axis: % conflicting comparisons (2 to 22); one curve per measure: AG, NDCG, ANDCG, ADR]
      lack of power in one collection but not in the other
      ADR takes longer to converge
      converge to <5% for >40 queries (consistent with α=0.05)
  • 47. merely using more queries does not pay off when looking for stability
  • 48. type of conflicts (50 queries)
      judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
      Broad      AG       3.36%      100%         0%           0%
      Broad      NDCG     3.77%      99.90%       0.10%        0%
      Broad      ANDCG    4.73%      99.96%       0.04%        0%
      Broad      ADR      9.03%      99.94%       0.06%        0%
      Fine       AG       2.64%      99.86%       0.14%        0%
      Fine       NDCG     2.94%      99.74%       0.26%        0%
      Fine       ANDCG    4.03%      99.91%       0.09%        0%
      Fine       ADR      19.08%     99.50%       0.50%        0%
      no major conflict whatsoever
      virtually all conflicts due to lack of power in one collection
  • 49. if significance shows up, it most probably is correct. are we being too conservative?
  • 50. statistics Milton Friedman Frank Wilcoxon John Tukey
  • 51. compare two systems: is the difference significant?
      t-test, Wilcoxon test, sign test, etc.: they make different assumptions
      significance level α: probability of Type I error (finding a significant difference when there is none), that is, a stability conflict
      usually, α=0.05 or α=0.01: 5% or 1% of my significant results are just wrong
  • 52. compare several systems
      MIREX 2009: 15 systems = 105 comparisons
      experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05): we can expect at least one significant comparison to be wrong
      instead, compare all systems at once: ANOVA, Friedman test, Kruskal-Wallis, etc. (with different assumptions)
      correct p-values to keep the experiment-wide significance level <0.05: Tukey's HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.
      Friedman test and Tukey's HSD are the ones used in MIREX
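The experiment-wide significance level on this slide can be reproduced in a few lines. The sketch below assumes, as the slide's formula does, that the comparisons are independent; the Bonferroni line shows one of the corrections the slide lists, and the function name `experiment_wide_level` is made up for the example.

```python
from math import comb


def experiment_wide_level(alpha, comparisons):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** comparisons


systems = 15
pairs = comb(systems, 2)  # every system against every other one
level = experiment_wide_level(0.05, pairs)
bonferroni_alpha = 0.05 / pairs  # per-comparison level under Bonferroni

print(pairs, round(level, 3))
print(round(experiment_wide_level(bonferroni_alpha, pairs), 3))
```

With 15 systems this gives 105 comparisons and an experiment-wide level of about 0.995, while the Bonferroni-corrected level stays just under 0.05.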
  • 53. more stability at the cost of less power: is it worth it?
  • 54. what a MIREX participant wants
      compare my system with the other 14: comparisons between those 14 are uninteresting
      subexperiment: only 14 pairwise comparisons, not 105
      get back the power missed by considering the other 91; should throw out more conflicts too
      number of comparisons grows linearly with number of systems
      subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05)
      compare all systems with 1-tailed Wilcoxon tests at α=0.01:
        experiment-wide significance level = 1-(1-0.01)^105 = 0.652
        subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131
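A one-tailed Wilcoxon signed-rank comparison of two systems can be sketched without external libraries. This version is an assumption-laden illustration, not the study's implementation: it uses the normal approximation with a continuity correction and a tie correction rather than the exact distribution, it drops zero differences, and the function name and scores are hypothetical.

```python
import math


def wilcoxon_one_tailed(scores_a, scores_b):
    """Approximate p-value for H1: system A scores higher than system B."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences, giving tied values their average rank.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        group = j - i + 1
        tie_term += group ** 3 - group
        i = j + 1
    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean - 0.5) / math.sqrt(variance)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical per-query scores for 20 queries: A is slightly better on all.
scores_b = [0.30 + 0.01 * i for i in range(20)]
scores_a = [b + 0.02 + 0.005 * i for i, b in enumerate(scores_b)]

alpha = 0.01
p_value = wilcoxon_one_tailed(scores_a, scores_b)
print("A >> B" if p_value < alpha else "not significant")
```

Because the test is one-tailed, swapping the two systems gives a p-value close to 1 instead of the same small value.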
  • 57. power results (larger is better)
      [Figure: two panels, Broad judgments and Fine judgments; x-axis: query set size (40 to 100); y-axis: % significant comparisons (46 to 64); one curve per measure: AG, NDCG, ANDCG, ADR; Friedman+Tukey (as in MIREX) shown for comparison]
      all 1-tailed Wilcoxon comparisons is up to 20% more powerful than Friedman+Tukey
      same power with 50% effort!
  • 59. stability results (lower is better)
      [Figure: two panels, Broad judgments and Fine judgments; x-axis: query subset size (5 to 50); y-axis: % conflicting comparisons (2 to 22); one curve per measure: AG, NDCG, ANDCG, ADR]
      earlier convergence because of increased power
      AG converges again to 3-4%
      (A)NDCG converge to 5-6%
  • 60. type of conflicts (50 queries)
      judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
      Broad      AG       3.68%      96.32%       3.68%        0%
      Broad      NDCG     5.05%      96.82%       3.18%        0%
      Broad      ANDCG    6.08%      96.84%       3.13%        0.03%
      Broad      ADR      5.93%      95.12%       4.88%        0%
      Fine       AG       3.32%      98.34%       1.66%        0%
      Fine       NDCG     6.58%      96.61%       3.39%        0%
      Fine       ANDCG    6.44%      94.94%       5.06%        0%
      Fine       ADR      12.48%     90.58%       9.37%        0.05%
      within known Type III error rates
      again, due to lack of power in one collection
      no major conflicts
  • 61. effort-reliability tradeoff (stable = power - conflicts)
                          Friedman+Tukey with 100 queries    1-tailed Wilcoxon with 50 queries
      judgments  measure  power    conflicts  stable         power    conflicts  stable
      Broad      AG       57.14%   3.64%      53.50%         55.10%   3.68%      51.42%
      Broad      NDCG     57.14%   4.08%      53.06%         57.01%   5.05%      51.96%
      Broad      ANDCG    57.14%   4.19%      52.95%         57.37%   6.08%      51.29%
      Broad      ADR      56.19%   7.13%      49.06%         57.30%   5.93%      51.37%
      Fine       AG       54.29%   3.20%      51.09%         54.31%   3.32%      50.99%
      Fine       NDCG     56.19%   3.04%      53.15%         57.56%   6.58%      50.98%
      Fine       ANDCG    56.19%   2.96%      53.23%         57.38%   6.44%      50.94%
      Fine       ADR      56.19%   19.97%     36.22%         55.03%   12.48%     42.55%
      virtually same reliability with half the effort!
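The "stable" column of this table is simply power minus conflicts, which a short sketch can recompute for a few rows. The figures are copied from the slide; the dictionary layout and the function name `stable` are assumptions made for the example.

```python
# (judgments, measure) -> ((power, conflicts) for Friedman+Tukey, 100 queries,
#                          (power, conflicts) for 1-tailed Wilcoxon, 50 queries)
results = {
    ("Broad", "AG"): ((57.14, 3.64), (55.10, 3.68)),
    ("Broad", "ADR"): ((56.19, 7.13), (57.30, 5.93)),
    ("Fine", "AG"): ((54.29, 3.20), (54.31, 3.32)),
    ("Fine", "ADR"): ((56.19, 19.97), (55.03, 12.48)),
}


def stable(power, conflicts):
    """Percentage of comparisons that are significant and not conflicting."""
    return round(power - conflicts, 2)


for (judgments, measure), (friedman, wilcoxon) in results.items():
    full = stable(*friedman)  # all 100 queries
    half = stable(*wilcoxon)  # half the queries
    print(f"{judgments:5} {measure:4} {full:6.2f}% {half:6.2f}% "
          f"difference {half - full:+.2f}")
```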
  • 64. "Do not attempt to accomplish greater results by a greater effort of your little understanding, but by a greater understanding of your little effort." (Walter Russell)
  • 65. using more and more queries is pointless: too much effort for the small gain in power and stability
      using different similarity scales has little effect: using only one is probably just fine
      some effectiveness measures are better than others: they should still be used, because they measure different things, but bear in mind their power and stability
      some statistical methods are better than others: virtually the same reliability with half the effort
      if significance shows up, it most probably is true: at worst, conflicts are due to lack of power
  • 67. forget about power and worry about effect size: eventually, significance becomes meaningless
      reduce the judging effort: more queries in Symbolic Melodic Similarity; reliable low-cost in-house evaluations and crowdsourcing
      deeper evaluation cutoffs: not just the top 5 documents, pay attention to ranking; probably more reliable, and certainly more reusable
      effect of the number of systems: especially if developed by the same research group
      other statistical methods: Multiple Comparisons with a Control (baseline)
      other collections, tasks and measures
  • 68. guide experimenters in the interpretation of the results and the tradeoff between effort and reliability