Efficient Top-k Algorithms for Fuzzy Search in String
                     Collections

                             Rares Vernica             Chen Li

                             Department of Computer Science
                              University of California, Irvine


  First International Workshop on Keyword Search on Structured
                            Data, 2009




 Rares Vernica (UC Irvine)         Top-k Fuzzy String Search     KEYS 2009   1 / 17
Outline



1   Motivation


2   Efficient Top-k Algorithms
      Problem Formulation
      Algorithms Overview
      Top-k Single-Pass Search Algorithm


3   Experimental Evaluation




Need for Approximate String Queries
               ID       FirstName        LastName                   # Movies
               10       Al               Swartzberg                        1
               11       Hanna            Wartenegg                         1
               12       Rik              Swartzwelder                     30
               13       Joey             Swartzentruber                    1
               14       Rene             Swartenbroekx                     4
               15       Arnold           Schwarzenegger                 283
               16       Luc              Swartenbroeckx                    1
               ...      ...              ...                             ...
                              Figure: Actor names and popularities

                 SELECT * FROM Actors
                 WHERE LastName = 'Shwartzenetrugger'
                 ORDER BY "# Movies" DESC LIMIT 1;

                 0 Results
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
         ...      ...          ...                               ...    ...
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.


     Which one result should the system return?
     Which value is more important, # Movies or similarity?
Top-k Similar Strings

    Given:
            Weighted string collection
                    e.g., actors’ LastName and # Movies
            Query string
                    e.g., “Shwartzenetrugger”
            Similarity function
                    e.g., edit distance
            Scoring function (score of a string in terms of similarity and weight)
                    e.g., linear combination of similarity and popularity
            Integer k
    Return: the k strings with the best overall score with respect to the query string.
    Advantages over Range Search:
            specify k instead of a similarity threshold
            guaranteed k results; a range search might have 0 results
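
The problem statement above can be made concrete with a brute-force baseline that scores every string. This is only a sketch under stated assumptions, not the method of the talk: the function names, the normalization of edit distance into a similarity in [0, 1], the weights being already scaled to [0, 1], and the mixing parameter `alpha` are all hypothetical choices made for illustration.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def score(query, string, weight, alpha=0.5):
    """Assumed linear score: alpha * similarity + (1 - alpha) * weight."""
    longest = max(len(query), len(string)) or 1
    similarity = 1.0 - edit_distance(query, string) / longest
    return alpha * similarity + (1.0 - alpha) * weight


def top_k_brute_force(collection, query, k, alpha=0.5):
    """Score every (string, weight) pair and keep the k best."""
    scored = ((score(query, s, w, alpha), s) for s, w in collection)
    return heapq.nlargest(k, scored)


collection = [("ab", 0.80), ("ccd", 0.70), ("cd", 0.60),
              ("abcd", 0.50), ("bcc", 0.40)]
best = [s for _, s in top_k_brute_force(collection, "bcd", k=2)]
```

With these assumed settings, `best` holds "ccd" and "cd". Unlike a range search, the call always returns k strings as long as the collection has at least k entries.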


Algorithms Overview

    Figure: overview of the three top-k algorithms: Iterative Range Search,
    Single-Pass Search, and Two-Phase Search. In the diagram, a Range Search
    box appears above Iterative Range Search and above Two-Phase Search.

q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica → {Ve,er,ro,on,ni,ic,ca}

    Similar strings share a certain number of grams
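
The decomposition above can be reproduced in a few lines. This is a minimal sketch; the helper names `qgrams` and `shared_grams` are assumptions made for illustration and do not come from the talk.

```python
from collections import Counter


def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def shared_grams(a, b, q=2):
    """Count the grams common to both strings (multiset intersection)."""
    common = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(common.values())


vernica = qgrams("Vernica")
veronica = qgrams("Veronica")
overlap = shared_grams("Vernica", "Veronica")
```

For the slide's example, "Vernica" yields 6 grams, "Veronica" yields 7, and the two strings share 5 of them (Ve, er, ni, ic, ca).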




q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80
        2      ccd             0.70
        3      cd              0.60
        4      abcd            0.50
        5      bcc             0.40
             Figure: Dataset

       ab      cc      cd      bc
        1       2       2       4
        4       5       3       5
                        4
             Figure: Gram inverted-list index

    Query string: “bcd”
    Identified strings are verified by computing the real similarity.
    Verification is usually an expensive process.

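
As a rough illustration of how such an index identifies candidates, the sketch below builds the gram lists for the five-string dataset and counts, for the query "bcd", how many query grams each string shares. The names `build_index` and `candidate_counts` are hypothetical, and the verification step (computing the real similarity) is deliberately left out.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(dataset, q=2):
    """Map each gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for string_id, string in sorted(dataset.items()):
        for gram in set(qgrams(string, q)):  # each ID appears once per list
            index[gram].append(string_id)
    return dict(index)


def candidate_counts(index, query, q=2):
    """Count, per string ID, how many distinct query grams it shares."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return counts


dataset = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}
index = build_index(dataset)
counts = candidate_counts(index, "bcd")
```

Here the query grams are bc and cd, so string 4 ("abcd") is seen on both lists, strings 2, 3, and 5 on one list each, and string 1 is never touched. Only the identified IDs would go on to the expensive verification step.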
Top-k Single-pass Search Algorithm

Setup
   Assign IDs s.t. ascending order of IDs ≡ descending order of weights
   Sort the IDs on each list in ascending order
   Scan the lists corresponding to the grams in the query, e.g., “bcd”

       ID    String    Weight
        1    ab         0.80
        2    ccd        0.70
        3    cd         0.60
        4    abcd       0.50
        5    bcc        0.40
            Figure: Dataset

       ab     cc     cd     bc
        1      2      2      4
        4      5      3      5
                      4
            Figure: Gram inverted-list index

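
The setup steps can be sketched as follows. The point of the ID assignment is that, while scanning a list in ascending ID order, every ID not yet seen belongs to a string whose weight is no larger than the current one. The function names below are assumptions made for illustration.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def assign_ids(weighted_strings):
    """Give ID 1 to the heaviest string, ID 2 to the next, and so on."""
    ranked = sorted(weighted_strings, key=lambda pair: pair[1], reverse=True)
    return {new_id: pair for new_id, pair in enumerate(ranked, start=1)}


def build_sorted_index(by_id, q=2):
    """Build gram lists whose IDs are in ascending order."""
    index = defaultdict(list)
    for string_id in sorted(by_id):
        string, _weight = by_id[string_id]
        for gram in set(qgrams(string, q)):
            index[gram].append(string_id)
    return dict(index)


def lists_for_query(index, query, q=2):
    """Select the lists that correspond to the grams of the query."""
    return {gram: index.get(gram, []) for gram in qgrams(query, q)}


unordered = [("cd", 0.60), ("bcc", 0.40), ("ab", 0.80),
             ("abcd", 0.50), ("ccd", 0.70)]
by_id = assign_ids(unordered)
to_scan = lists_for_query(build_sorted_index(by_id), "bcd")
```

For the query "bcd" this selects the bc list (4, 5) and the cd list (2, 3, 4), matching the figure.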
Top-k Single-pass Search Algorithm

Naïve approach: Round-Robin
    Scan all the lists at the same time
    Maintain a list of “open” IDs (IDs that might still appear on some of the lists)
    Store the best k “closed” IDs in a top-k buffer
    Stop when the top-k buffer cannot improve

       ID    String    Weight
        1    ab         0.80
        2    ccd        0.70
        3    cd         0.60
        4    abcd       0.50
        5    bcc        0.40
            Figure: Dataset

       ab     cc     cd     bc
        1      2      2      4
        4      5      3      5
                      4
            Figure: Gram inverted-list index

Top-k Single-pass Search Algorithm



                                                       ab      bc       cd
                                                       →1      →2       →2
Heap-based
                                                       20       3        4
     Traverse the lists in a sorted                    21       4        5
     order using a heap on the top                       .
                                                         .       .
                                                                 .        .
                                                                          .
     IDs of the lists                                    .       .        .
                                                                19       19
     Advantages:                                                                      1
                                                                20       20
         1   No need to maintain the list                        .        .           2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .           2
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                    1
                                                     ab      bc       cd
                                                                              Figure:
                                                     →1      →2       →2      Top-k
Heap-based
                                                     20       3        4      buffer,
     Traverse the lists in a sorted                  21       4        5      k =1
     order using a heap on the top                     .
                                                       .       .
                                                               .        .
                                                                        .
     IDs of the lists                                  .       .        .
                                                              19       19
     Advantages:                                                                    2
                                                              20       20
         1   No need to maintain the                           .        .           2
             list of “open” IDs                                .
                                                               .        .
                                                                        .
         2   Skip elements
                                                 Figure: Gram                 Figure:
                                                 inverted-lists for query     Min-
                                                 “abcd”                       heap


  Rares Vernica (UC Irvine)    Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      1
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   2
                                                                20       20
         1   No need to maintain the list                        .        .        2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      1
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   2
                                                                20       20
         1   No need to maintain the list                        .        .        2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                  20
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   3
                                                                20       20
         1   No need to maintain the list                        .        .        4
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   3
                                                                20       20
         1   No need to maintain the list                        .        .        4
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   4
                                                                20       20
         1   No need to maintain the list                        .        .       20
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   4
                                                                20       20
         1   No need to maintain the list                        .        .       20
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                  20
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab        bc       cd
                                                                                Figure:
                                                       1         2        2     Top-k
Heap-based
                                                     →20         3        4     buffer,
     Traverse the lists in a sorted                   21         4        5     k =1
     order using a heap on the top                      .
                                                        .         .
                                                                  .        .
                                                                           .
     IDs of the lists                                   .         .        .
                                                              19       19
     Advantages:                                                                  20
                                                             →20      →20
         1   No need to maintain the list                      .        .
             of “open” IDs                                     .
                                                               .        .
                                                                        .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


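The heap-based traversal on this slide can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it ranks IDs only by the number of query-gram lists they appear on, it does not skip elements, and the names `traverse_lists` and `top_k_by_count` are made up for the example. It assumes each list holds distinct IDs in increasing order.

```python
import heapq


def traverse_lists(lists):
    """Yield (string_id, count) in increasing ID order.

    count is the number of inverted lists that contain string_id.
    Assumes every list is sorted in increasing order without duplicates.
    """
    # Heap entries are (current ID, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every list whose top ID equals the smallest ID, then advance it.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        yield current, count


def top_k_by_count(lists, k):
    """Keep the k IDs found on the most lists; ties favor smaller IDs."""
    buffer = []  # min-heap of (count, -string_id), the weakest entry on top
    for string_id, count in traverse_lists(lists):
        item = (count, -string_id)
        if len(buffer) < k:
            heapq.heappush(buffer, item)
        elif item > buffer[0]:
            heapq.heapreplace(buffer, item)
    return [(-neg_id, count) for count, neg_id in sorted(buffer, reverse=True)]


# Truncated versions of the lists in the figure (the elided parts are dropped).
ab = [1, 20, 21]
bc = [2, 3, 4, 19, 20]
cd = [2, 4, 5, 19, 20]
print(top_k_by_count([ab, bc, cd], k=1))
```

Because every list is consumed through the heap, no separate set of partially seen ("open") IDs is needed: all occurrences of an ID surface at the top of the heap together.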
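Algorithms Overview

      Iterative Range Search
      Single-Pass Search
      Two-Phase Search

      Figure: the three top-k algorithms. A Range Search box is drawn as a component of Iterative Range Search and of Two-Phase Search.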
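One way to read the Iterative Range Search box is as a loop that calls a range search with a progressively lower similarity threshold until at least k answers come back. The sketch below assumes exactly that; the starting threshold, the step, the ranking by similarity alone, and the brute-force `make_range_search` helper (which uses Python's `difflib` ratio as a stand-in similarity) are all assumptions made for illustration, not details taken from the slides.

```python
from difflib import SequenceMatcher


def make_range_search(strings, sim):
    """Build a brute-force range search: all strings with sim >= threshold."""
    def range_search(query, threshold):
        scored = ((sim(query, s), s) for s in strings)
        return [(score, s) for score, s in scored if score >= threshold]
    return range_search


def iterative_range_search(range_search, query, k, start=0.9, step=0.1):
    """Treat range_search as a black box and lower the threshold until
    it returns at least k results (or the threshold reaches 0)."""
    threshold = start
    while True:
        results = range_search(query, threshold)
        if len(results) >= k or threshold <= 0.0:
            # Anything not returned scores below the threshold, so the
            # k best returned results are the overall k best by similarity.
            return sorted(results, key=lambda r: -r[0])[:k]
        threshold = max(0.0, threshold - step)


def stand_in_similarity(a, b):
    # Stand-in only; the talk uses Jaccard and normalized edit similarity.
    return SequenceMatcher(None, a, b).ratio()


search = make_range_search(["abcd", "abce", "xyz"], stand_in_similarity)
print(iterative_range_search(search, "abcd", k=2))
```

Each lowered threshold repeats the whole range search, which is consistent with this approach being the most expensive of the three.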
Experimental Setting

       Datasets:
            IMDB Actor Names (footnote 1)
                    actor names and the number of movies they played in
                    1.2 million actors, average name length 15
                    weight is the number of movies (log normalized)
            WEB Corpus Word Grams (footnote 2)
                    sequences of English words and their frequency on the Web
                    2.4 million sequences, average sequence length 20
                    weight is the frequency (log normalized)
       Jaccard similarity and normalized edit similarity, q = 3
       Index and data are stored in main memory at all times
       Implemented in C++ (GNU compiler) on Ubuntu Linux OS
       Intel 2.40GHz PC, 2GB RAM

  1 http://www.imdb.com/interfaces
  2 http://www.ldc.upenn.edu/Catalog LDC2006T13
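For readers unfamiliar with the two similarity functions named in the setting, here is a minimal sketch. It assumes Jaccard similarity is taken over sets of q-grams without padding, and that normalized edit similarity is 1 minus the edit distance divided by the longer string's length; the slides in this section do not spell out either definition, so treat both as common conventions rather than the authors' exact formulas.

```python
def qgrams(s, q=3):
    """Set of overlapping q-grams of s (no padding).

    A string shorter than q is treated as a single gram.
    """
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard_similarity(a, b, q=3):
    """Shared q-grams divided by all distinct q-grams of both strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_similarity(a, b):
    """1 - edit distance / length of the longer string (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


print(jaccard_similarity("abcd", "abce"))
print(normalized_edit_similarity("kitten", "sitting"))
```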
Benefits of Skipping Elements

     Figure: Time (ms, axis 0 to 50) versus Dataset Size (millions, axis 0.2 to 1.2); series SPS and SPS*.
     Average running time for top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass Search (SPS) algorithm and SPS without skipping (SPS*).
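The skipping that separates SPS from SPS* relies on each inverted list being sorted, so a list can jump straight to the first ID at or beyond a target instead of visiting every element. A minimal sketch of that jump using binary search follows; `skip_to` is a hypothetical helper, and how the algorithm chooses the target ID is not covered here.

```python
from bisect import bisect_left


def skip_to(ids, pos, target):
    """Return the first position >= pos whose ID is >= target.

    ids must be sorted in increasing order. The result is len(ids)
    when no remaining ID reaches the target.
    """
    return bisect_left(ids, target, lo=pos)


# Truncated versions of the bc and cd lists from the earlier figure.
bc = [2, 3, 4, 19, 20]
cd = [2, 4, 5, 19, 20]
print(skip_to(bc, 1, 20), skip_to(cd, 1, 20))
```

Every element between the old and the new position is never pushed onto the heap, which is where the saving over SPS* would come from.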
Potential of the Two-Phase Algorithm

     Figure: Time (ms, axis 0 to 40) for queries Q1, Q2, and Q3.
     Running time for 3 top-10 queries. WEB Corpus dataset with normalized edit similarity. Two-Phase algorithm with different initial thresholds.
Optimum Initial Threshold for the Two-Phase Algorithm

     Figure: Time (ms, axis 0 to 100) versus Dataset Size (millions, axis 0.4 to 2.4); series SPS, 2PH, and 2PH Opt.
     Average running time for top-10 queries. WEB Corpus dataset with normalized edit similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) algorithm, and 2PH with the optimum initial threshold (2PH Opt).
     The Iterative Range Search algorithm was too expensive to be plotted. Its average running time was around 5 s.
Summary



   Approximate ranking queries in string collections
   Useful when the query and the data do not match exactly
   We propose three approaches to the problem:
      1    Use existing approximate range search algorithms as a “black box”
           Proves to be the most expensive
      2    Exploit properties specific to the top-k problem
           Proves to be very efficient
      3    Combine (1) and (2) sequentially
           Proves to be slightly more efficient than (2)




The Flamingo Project




                                  This work is part of
                           The Flamingo Project at UC Irvine
                          http://flamingo.ics.uci.edu



A Quick Note on Related Work


                          Fagin et al. [1]                           Our Setting
Similarity on             multiple numerical attributes              one string attribute
Traversal                 traverse list of IDs                       traverse list of IDs
Lists                     one list per attribute                     one list per q-gram
List order                sorted on similarity to that attribute     sorted on global weight
Order of IDs              lists have different orders of IDs         lists have the same order of IDs
Coverage                  all IDs appear on all the lists            a subset of IDs appear on each list

[1] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
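The properties listed under "Our Setting" (one list per q-gram, lists sorted on global weight, the same order of IDs on every list, only a subset of IDs on each list) can be illustrated with a small index builder. This is a sketch under one assumption that the slide does not state: string IDs are assigned in decreasing weight order, so a smaller ID always means a larger weight. The function name `build_gram_index` and the toy weights are hypothetical, and q = 2 is used only to mirror the “abcd” illustration.

```python
from collections import defaultdict


def build_gram_index(weighted_strings, q=2):
    """Return (strings ordered by ID, gram -> increasing list of IDs).

    IDs start at 1 and follow decreasing weight, so every inverted list
    is ordered by the same global weight.
    """
    ranked = sorted(weighted_strings, key=lambda sw: -sw[1])
    index = defaultdict(list)
    for string_id, (s, _weight) in enumerate(ranked, start=1):
        # Use a set so a string is listed once per gram even if the
        # gram occurs several times in it.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(string_id)
    return [s for s, _ in ranked], dict(index)


data = [("abc", 1.0), ("bcd", 3.0), ("abd", 2.0)]  # toy (string, weight) pairs
strings, index = build_gram_index(data)
print(strings)
print(index["ab"], index["bc"])
```

Because IDs are appended in increasing order, each list comes out sorted without a separate sort step, and a string shows up only on the lists of the grams it contains.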

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 

Efficient Top-k Algorithms for Fuzzy String Search

  • 1. Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
  • 2. Outline: (1) Motivation; (2) Efficient Top-k Algorithms (Problem Formulation, Algorithms Overview, Top-k Single-Pass Search Algorithm); (3) Experimental Evaluation
  • 3–5. Need for Approximate String Queries

      ID   FirstName   LastName          # Movies
      10   Al          Swartzberg               1
      11   Hanna       Wartenegg                1
      12   Rik         Swartzwelder            30
      13   Joey        Swartzentruber           1
      14   Rene        Swartenbroekx            4
      15   Arnold      Schwarzenegger         283
      16   Luc         Swartenbroeckx           1
      ...
      Figure: Actor names and popularities

      SELECT * FROM Actors WHERE LastName = 'Shwartzenetrugger' ORDER BY "# Movies" DESC LIMIT 1;

      0 Results
  • 6–10. Need for Ranking

      ID   FirstName   LastName          # Movies   ED
      10   Al          Swartzberg               1    8
      11   Hanna       Wartenegg                1    8
      12   Rik         Swartzwelder            30    8
      13   Joey        Swartzentruber           1    4
      14   Rene        Swartenbroekx            4    9
      15   Arnold      Schwarzenegger         283    5
      16   Luc         Swartenbroeckx           1    9
      ...
      Figure: Actor names, popularities, and edit distances (ED) to a query string “Shwartzenetrugger”.

      Which one result should the system return?
      Which value is more important, # Movies or similarity?
  • 11–12. Top-k Similar Strings
      Given:
        - Weighted string collection, e.g., actors’ LastName and # Movies
        - Query string, e.g., “Shwartzenetrugger”
        - Similarity function, e.g., Edit Distance
        - Scoring function (score of a string in terms of similarity and weight), e.g., linear combination of similarity and popularity
        - Integer k
      Return: the k best strings in terms of overall score with respect to the query string.
      Advantages over Range Search:
        - specify k instead of a similarity threshold
        - guaranteed k results; a range search might have 0 results
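
The problem statement above can be illustrated with a brute-force baseline. The sketch below is not one of the algorithms from the talk: it scores every string and keeps the k best. The normalization of edit distance into a similarity in [0, 1], the mixing parameter `alpha`, and the function names are assumptions made for illustration; the slides only say that the score is, for example, a linear combination of similarity and popularity.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - ED / max length, so 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def top_k_brute_force(collection, query, k, alpha=0.5):
    """Return the k (score, string) pairs with the highest overall score.

    collection: iterable of (string, weight) pairs with weight in [0, 1].
    alpha: hypothetical mixing weight between similarity and popularity.
    """
    scored = ((alpha * edit_similarity(s, query) + (1 - alpha) * w, s)
              for s, w in collection)
    return heapq.nlargest(k, scored)
```

Because every string is verified with an edit-distance computation, this baseline is only practical for small collections; it is useful mainly as a correctness reference for the indexed algorithms.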
  • 13. Algorithms Overview [diagram with boxes: Range Search; Iterative Range Search; Single-Pass Search; Two-Phase Search]
  • 14–23. q-grams: Overlapping substrings of fixed length
      Find similar strings: e.g., “Vernica” and “Veronica”
      q-gram: substring of length q of a string: e.g., q = 2
        Vernica → {Ve, er, rn, ni, ic, ca}
        Veronica → {Ve, er, ro, on, ni, ic, ca}
      Similar strings share a certain number of grams
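
The gram decomposition on this slide can be reproduced in a few lines. This is a minimal sketch; the function name `qgrams` is chosen here for illustration and no padding characters are added at the string boundaries, matching the example on the slide.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


vernica = qgrams("Vernica")    # ['Ve', 'er', 'rn', 'ni', 'ic', 'ca']
veronica = qgrams("Veronica")  # ['Ve', 'er', 'ro', 'on', 'ni', 'ic', 'ca']

# Grams that both strings have in common.
shared = set(vernica) & set(veronica)  # {'Ve', 'er', 'ni', 'ic', 'ca'}
```

The two names differ by a single inserted character, yet they still share five of their 2-grams, which is the property the index-based algorithms rely on.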
  • 24–28. q-gram Inverted List Index (q = 2)

      ID   String   Weight
      1    ab       0.80
      2    ccd      0.70
      3    cd       0.60
      4    abcd     0.50
      5    bcc      0.40
      Figure: Dataset

      Gram   String IDs
      ab     1, 4
      cc     2, 5
      cd     2, 3, 4
      bc     4, 5
      Figure: Gram inverted-list index

      Query string: “bcd”
      Identified strings are verified by computing the real similarity.
      Verification is usually an expensive process.
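
The index on this slide can be rebuilt from the dataset as follows. This is a sketch under simple assumptions: the helper names `build_index` and `candidates` are hypothetical, each string ID is stored at most once per gram list, and candidate generation just counts how many query grams each string shares. Verification of the candidates with the real similarity function is left out.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int = 2) -> list[str]:
    """Overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings: dict[int, str], q: int = 2) -> dict[str, list[int]]:
    """Map each gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for gram in set(qgrams(strings[sid], q)):
            index[gram].append(sid)
    return index


def candidates(index: dict[str, list[int]], query: str, q: int = 2) -> Counter:
    """Count, per string ID, how many distinct query grams the string shares."""
    counts = Counter()
    for gram in set(qgrams(query, q)):
        counts.update(index.get(gram, []))
    return counts


dataset = {1: "ab", 2: "ccd", 3: "cd", 4: "abcd", 5: "bcc"}
index = build_index(dataset)
shared_counts = candidates(index, "bcd")  # string 4 shares 2 grams; 2, 3, 5 share 1
```

For the query “bcd” only the lists for “bc” and “cd” are read, so string 1 (“ab”) is never touched and only four strings remain to be verified.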
  • 29–33. Top-k Single-pass Search Algorithm

      ID   String   Weight
      1    ab       0.80
      2    ccd      0.70
      3    cd       0.60
      4    abcd     0.50
      5    bcc      0.40
      Figure: Dataset

      Gram   String IDs
      ab     1, 4
      cc     2, 5
      cd     2, 3, 4
      bc     4, 5
      Figure: Gram inverted-list index

      Setup:
        - Assign IDs s.t. ascending order of IDs ≡ descending order of weights
        - Sort the IDs on each list in ascending order
        - Scan the lists corresponding to the grams in the query, e.g., “bcd”
      Naïve approach: Round-Robin
        - Scan all the lists at the same time
        - Maintain a list of “open” IDs (might still appear on some of the lists)
        - Store the best k “closed” IDs in a top-k buffer
        - Stop when the top-k buffer cannot improve
  • 34–48. Top-k Single-pass Search Algorithm
      Heap-based: Traverse the lists in a sorted order using a heap on the top IDs of the lists
      Advantages:
        1. No need to maintain the list of “open” IDs
        2. Skip elements

      Gram   String IDs
      ab     1, 20, 21, ...
      bc     2, 3, 4, ..., 19, 20, ...
      cd     2, 4, 5, ..., 19, 20, ...
      Figure: Gram inverted-lists for query “abcd”

      Figure: Min-heap
      Figure: Top-k buffer, k = 1

      [Animation, as far as it can be recovered: the min-heap starts with the head IDs 1, 2, 2. ID 1 is popped into the top-k buffer and the ab cursor advances to 20. ID 2 then replaces ID 1 in the buffer, and the bc and cd cursors advance to 3 and 4. In the final step the bc and cd cursors jump ahead to 20, where all three lists meet.]
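
The heap-based traversal can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: similarity is taken to be Jaccard over 2-gram sets, the score is `alpha * similarity + (1 - alpha) * weight` with a hypothetical `alpha`, strings sharing no gram with the query are never scored, and the element-skipping optimization from the slides is omitted. It does use the two properties the slides describe: IDs are popped in ascending order (descending weight), and the scan stops once no remaining ID can beat the k-th best score.

```python
import heapq
from collections import defaultdict


def qgrams(s, q=2):
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(strings, q=2):
    """Gram -> ascending list of string IDs (IDs already ordered by weight)."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for gram in qgrams(strings[sid], q):
            index[gram].append(sid)
    return index


def single_pass_top_k(strings, weights, index, query, k, alpha=0.5, q=2):
    """Return up to k (score, id) pairs, best first.

    Assumes ascending ID order equals descending weight order.
    """
    query_grams = qgrams(query, q)
    lists = [index[g] for g in query_grams if g in index]
    # Heap entries: (current ID, list number, position in that list).
    heap = [(lst[0], li, 0) for li, lst in enumerate(lists)]
    heapq.heapify(heap)
    top = []  # min-heap of (score, id), size at most k
    while heap:
        sid = heap[0][0]
        # Every remaining ID has weight <= weights[sid] and similarity <= 1.
        bound = alpha + (1 - alpha) * weights[sid]
        if len(top) == k and bound <= top[0][0]:
            break
        shared = 0
        while heap and heap[0][0] == sid:  # pop sid from every list holding it
            _, li, pos = heapq.heappop(heap)
            shared += 1
            if pos + 1 < len(lists[li]):
                heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
        union = len(query_grams) + len(qgrams(strings[sid], q)) - shared
        score = alpha * shared / union + (1 - alpha) * weights[sid]
        if len(top) < k:
            heapq.heappush(top, (score, sid))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, sid))
    return sorted(top, reverse=True)
```

Because an ID is fully resolved the moment it leaves the heap, no separate bookkeeping of “open” IDs is needed, which is the first advantage listed on the slide.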
  • 49. Algorithms Overview [diagram with boxes: Range Search; Iterative Range Search; Single-Pass Search; Two-Phase Search]
  • 50. Experimental Setting
      Datasets:
        - IMDB Actor Names [1]: actor names and the number of movies they played in; 1.2 million actors, average name length 15; weight is the number of movies (log normalized)
        - WEB Corpus Word Grams [2]: sequences of English words and their frequency on the Web; 2.4 million sequences, average sequence length 20; weight is the frequency (log normalized)
      Jaccard similarity and normalized edit similarity, q = 3
      Index and data are stored in main memory at all times
      Implemented in C++ (GNU compiler) on Ubuntu Linux OS
      Intel 2.40GHz PC, 2GB RAM
      [1] http://www.imdb.com/interfaces
      [2] http://www.ldc.upenn.edu/Catalog LDC2006T13
  • 51. Benefits of Skipping Elements
      [Plot: Time (ms), 0 to 50, against Dataset Size (millions), 0.2 to 1.2; series SPS and SPS*]
      Average running time for top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass Search (SPS) algorithm and SPS without skipping (SPS*).
  • 52. Potential of the Two-Phase Algorithm
      [Plot: Time (ms), 0 to 40, for queries Q1, Q2, Q3]
      Running time for 3 top-10 queries. WEB Corpus dataset with normalized edit similarity. Two-Phase algorithm with different initial thresholds.
  • 53. Optimum Initial Threshold for the Two-Phase Algorithm
      [Plot: Time (ms), 0 to 100, against Dataset Size (millions), 0.4 to 2.4; series SPS, 2PH, and 2PH Opt]
      Average running time for top-10 queries. WEB Corpus dataset with normalized edit similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) algorithm, 2PH with the optimum initial threshold (2PH Opt).
      The Iterative Range Search algorithm was too expensive to be plotted. The average running time was around 5 s.
  • 54. Summary
      Approximate ranking queries in string collections
        - Useful when there is a mismatch between query and data
      Propose three approaches to solve the problem:
        1. Use existing approximate range search algorithms as a “black box”: proves to be the most expensive
        2. Use particularities of the top-k problem: proves to be very efficient
        3. Combine (1) and (2) sequentially: proves to be slightly more efficient
  • 55. The Flamingo Project: This work is part of The Flamingo Project at UC Irvine, http://flamingo.ics.uci.edu
  • 56. A Quick Note on Related Work

      Fagin et al. [1]                                Our Setting
      similarity on multiple numerical attributes     similarity on one string attribute
      traverse list of IDs                            traverse list of IDs
      one list per attribute                          one list per q-gram
      lists sorted on similarity to that attribute    lists sorted on global weight
      lists have different orders of IDs              lists have the same order of IDs
      all IDs appear on all the lists                 a subset of IDs appear on each list

      [1] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.