A High-Performance Input-Aware Multiple String-Match Algorithm
Erez Buchnik
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work


                        Page 2
The Multiple String-Match Problem
• Goal: Given a set of strings and input
 text, find all occurrences of any of the
 strings in the text
• Input: Set of strings L and input text M
• Output: Offsets 1 ≤ i ≤ |M| where a
 substring of M matches any of the
 strings in L
• Uses: antivirus (AV), intrusion prevention systems (IPS), deep packet inspection (DPI), DNA search, etc.
                             Page 4
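As a baseline for the problem statement above, a minimal brute-force sketch. The function name `find_all` is hypothetical, and offsets are 0-based here, whereas the slide counts from 1.

```python
def find_all(strings, text):
    """Return sorted (offset, string) pairs for every occurrence of any string in text.

    Offsets are 0-based, unlike the 1-based offsets on the slide.
    """
    matches = []
    for s in strings:
        start = text.find(s)
        while start != -1:
            matches.append((start, s))
            start = text.find(s, start + 1)  # step by 1 so overlapping hits are kept
    return sorted(matches)


result = find_all(["bits", "cooks", "boat"], "rabbits hate cooks")
# result == [(3, "bits"), (13, "cooks")]
```

This rescans the text once per string, which is what the algorithms on the following slides avoid.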
The Multiple String-Match Problem - References

• Aho-Corasick ’75
• Commentz-Walter ’79
• Rabin-Karp ’87
• Wu-Manber ’94
• Muth-Manber ’96
• Hopcroft-Motwani-Ullman ’00
• Dori-Landau ’06
                              Page 5
Stateful Approach (e.g. Aho-Corasick)


• One state transition per symbol
• Linear in the length of the input
• Large automatons cause cache-misses and degrade performance
                          Page 7
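For reference, a compact textbook-style sketch of the stateful approach. It is not taken from the talk; the names `build_automaton` and `search` are hypothetical. Each input symbol drives one goto transition, preceded by failure-link steps that are amortized over the input.

```python
from collections import deque


def build_automaton(strings):
    """Build goto/fail/output tables for a set of strings (textbook Aho-Corasick)."""
    goto = [{}]   # goto[state][symbol] -> next state
    fail = [0]    # fail[state] -> fallback state
    out = [[]]    # out[state] -> strings that end at this state
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches of the suffix state
            queue.append(nxt)
    return goto, fail, out


def search(strings, text):
    """Return sorted (0-based offset, string) pairs for all occurrences."""
    goto, fail, out = build_automaton(strings)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            matches.append((i - len(s) + 1, s))
    return sorted(matches)
```

The per-state dictionaries here stand in for the transition tables of a real implementation; with many strings those tables grow large, which is the cache problem the slide points at.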
Guidelines
• INTUITIVE: Search for ‘Hints’ of
 a Match Before the Full Match

• REALISTIC: Use Prior
 Knowledge of Expected Input

• SIMPLE: Trivial Match Process

                      Page 9
Bouma2: Motif-Based String Match
• Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
• Set of selected 2-symbol-long substrings: re, ek, bi, at, ok, or
• Preprocessing: Map every string to its own substring: Motif
Q1: How to select motifs?
                          Page 10
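A small sketch of the preprocessing idea, using the strings and motifs from the slide. The helper `candidate_motifs` is hypothetical; it only lists which selected motifs each string contains, since the slide does not spell out which candidate is finally chosen.

```python
STRINGS = ["bore", "core", "trek", "bits", "corridor", "boat", "book", "cooks"]
MOTIFS = ["re", "ek", "bi", "at", "ok", "or"]


def candidate_motifs(strings, motifs):
    """For each string, list every selected motif it contains as (motif, position)."""
    table = {}
    for s in strings:
        hits = [(m, s.find(m)) for m in motifs if m in s]
        if not hits:
            raise ValueError(f"no motif covers {s!r}")
        table[s] = hits
    return table


table = candidate_motifs(STRINGS, MOTIFS)
# table["trek"] == [("re", 1), ("ek", 2)]: more than one motif may qualify
# table["corridor"] == [("or", 1)]
```

Strings such as "bore" and "trek" have more than one candidate, which is exactly what Q1 asks about.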
Bouma2: Motif-Based String Match (cont.)
[Figure: scanning the input “rabbits hate cooks”. Full matches are attempted around motif occurrences: “boat” and “book” are rejected (No match); “bits” and “cooks” are found (Match).]
• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences
Q2: How to resolve collisions?
                                     Page 11
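The match step can be sketched as follows, before any of the upgrades on the next slides. The string-to-motif mapping below is an assumption chosen for illustration, and this version still examines every offset rather than striding.

```python
# Hypothetical string-to-motif mapping: string -> (motif, motif position in string).
MAPPING = {
    "bits": ("bi", 0),
    "boat": ("at", 2),
    "book": ("ok", 2),
    "cooks": ("ok", 2),
}


def build_index(mapping):
    """Invert the mapping: motif -> [(string, motif position)]."""
    index = {}
    for s, (motif, pos) in mapping.items():
        index.setdefault(motif, []).append((s, pos))
    return index


def motif_search(mapping, text):
    """Look at every 2-symbol window; attempt a full match around each motif hit."""
    index = build_index(mapping)
    matches, rejected = [], []
    for i in range(len(text) - 1):
        for s, pos in index.get(text[i:i + 2], ()):
            start = i - pos
            if start >= 0 and text.startswith(s, start):
                matches.append((start, s))
            else:
                rejected.append((i, s))  # motif seen, full string absent
    return matches, rejected


matches, rejected = motif_search(MAPPING, "rabbits hate cooks")
# matches  == [(3, "bits"), (13, "cooks")]
# rejected == [(9, "boat"), (15, "book")]
```

The motif "ok" is shared by "book" and "cooks", so one hit triggers two comparisons; that is the collision raised in Q2.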
Capturing all Occurrences

[Figure: the input “habits of rabbits”, with “bits” matched in both “habits” and “rabbits”.]

• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
                          Page 12
Upgrade #1: 2-Symbol Strides

[Figure: the input “habits of rabbits” scanned in 2-symbol strides, with “bits” matched in both “habits” and “rabbits”.]

• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
                             Page 13
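A minimal sketch of the double mapping, assuming every string has at least 3 symbols so that both an even-position and an odd-position pair exist. Taking the earliest pair of each parity is a hypothetical choice; the talk selects motifs differently (see Q1).

```python
def stride_index(strings):
    """Map each string twice: to its pair at position 0 (even) and at position 1 (odd)."""
    index = {}
    for s in strings:
        if len(s) < 3:
            raise ValueError("this sketch needs strings of length >= 3")
        for pos in (0, 1):  # hypothetical choice: earliest pair of each parity
            index.setdefault(s[pos:pos + 2], []).append((s, pos))
    return index


def stride_search(strings, text):
    """Scan the text in 2-symbol strides; verify the full string around each hit."""
    index = stride_index(strings)
    matches = []
    for i in range(0, len(text) - 1, 2):  # even offsets only
        for s, pos in index.get(text[i:i + 2], ()):
            start = i - pos
            if start >= 0 and text.startswith(s, start):
                matches.append((start, s))
    return matches


found = stride_search(["bits"], "habits of rabbits")
# found == [(2, "bits"), (13, "bits")]
```

An occurrence starting at an even offset is caught through its even-position pair, and one starting at an odd offset through its odd-position pair, so a single pass over half the offsets is enough.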
Upgrade #2: Fast-Path / Slow-Path
[Figure: the input “habits of rabbits”, with offsets 4 and 14 marked.]

 • Fast-Path:
 - Stateless
 - “Monolithic” (zero branches)
 - Cache-Aware (small direct-table)
 - SIMPLE…
                                Page 14
Upgrade #2: Fast-Path / Slow-Path
[Figure: offsets 4 and 14 in “habits of rabbits” lead to matches of “bits” in “habits” and in “rabbits”.]
• Slow-Path:
 - Memory-Efficient (pointers to original strings for comparison)
 - “Localized” (separate structure for every motif)
                                   Page 15
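The fast-path / slow-path split can be sketched like this. The table layout, the function names, and the motif positions chosen for "bits" are all assumptions made for illustration. Python cannot show a branch-free loop, so the fast-path is written as plain table lookups.

```python
# Hypothetical motif choice: string -> (even position, odd position) of its two motifs.
MOTIF_POSITIONS = {b"bits": (2, 1)}  # "ts" at position 2, "it" at position 1


def build_tables(motif_positions):
    """Fast-path: 64K-entry direct table; slow-path: one candidate list per motif."""
    fast = bytearray(65536)  # one flag per possible 2-byte sequence
    slow = {}                # motif key -> [(original string, motif position)]
    for s, positions in motif_positions.items():
        for pos in positions:
            key = (s[pos] << 8) | s[pos + 1]
            fast[key] = 1
            slow.setdefault(key, []).append((s, pos))
    return fast, slow


def fast_path(fast, data):
    """Return the even offsets whose 2-byte window is a motif (lookups only, no state)."""
    return [i for i in range(0, len(data) - 1, 2)
            if fast[(data[i] << 8) | data[i + 1]]]


def slow_path(slow, data, hits):
    """Compare the original strings around each offset reported by the fast-path."""
    matches = []
    for i in hits:
        key = (data[i] << 8) | data[i + 1]
        for s, pos in slow[key]:
            start = i - pos
            if start >= 0 and data.startswith(s, start):
                matches.append((start, s))
    return matches


fast, slow = build_tables(MOTIF_POSITIONS)
data = b"habits of rabbits"
hits = fast_path(fast, data)          # [4, 14]
matches = slow_path(slow, data, hits)  # [(2, b"bits"), (13, b"bits")]
```

With this particular motif choice the fast-path reports offsets 4 and 14, which agrees with the values marked in the figure; that the talk used these exact motifs is an inference, not something the slides state.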
Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• m – no. of motif-matches in n
• l – length of the longest string
• Match Complexities:
- Aho-Corasick: $O(n + S)$
- Bouma2: $O(\frac{n}{2} + m \cdot l)$
                           Page 17
Bouma2 vs. Aho-Corasick (Speed)
[Figure: speed comparison chart: Bouma2 Fast-Path and Bouma2 Slow-Path (Sub-Optimal) vs. Aho-Corasick.]
• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick
• Fast-path alone is 10 times faster
Q3: How to optimize slow-path?
Page 18
Bouma2 vs. Aho-Corasick (Cache)
[Figure: cache-miss comparison chart: Bouma2 vs. Aho-Corasick.]
• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)
                           Page 19
Bouma2 vs. Aho-Corasick (Memory)
[Figure: memory footprint comparison chart: Bouma2 (Fast-Path, Slow-Path, Original Strings) vs. Aho-Corasick.]
• Bouma2 footprint is less than 70% of Aho-Corasick for textual search (down to 35% in other cases)
                             Page 20
Q1: How to select motifs?
Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         •                   •
Even    co re             •               •
Even    co rr id or       •       •   •           •
Odd     b or e                        •
Odd     c or e                        •
Odd     c or ri do r          •       •       •

• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
                                    Page 22
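For an instance this small, the minimum covering subset can be found by trying subsets in order of size. This exhaustive search is only an illustration and is not the method used in the talk; the helper names are hypothetical, and strings are assumed to have at least 3 symbols.

```python
from itertools import combinations


def pairs(s, parity):
    """2-symbol substrings of s that start at positions of the given parity."""
    return {s[i:i + 2] for i in range(parity, len(s) - 1, 2)}


def minimum_cover(strings):
    """Smallest motif set giving every string an even-offset and an odd-offset motif."""
    constraints = [pairs(s, p) for s in strings for p in (0, 1)]
    universe = sorted(set().union(*constraints))
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            if all(chosen & c for c in constraints):
                return chosen
    return None  # some string has no pair of one parity


cover = minimum_cover(["bore", "core", "corridor"])
# cover == {"or", "re"}
```

For "bore", "core" and "corridor" the candidate columns are the same eight pairs as in the table, and the minimum subset is {or, re}.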
Q1: How to select motifs?
Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         Χ                   √
Even    co re             Χ               √
Even    co rr id or       Χ       Χ   √           Χ
Odd     b or e                        √
Odd     c or e                        √
Odd     c or ri do r          Χ       √       Χ

• But… maybe the minimum subset is not the optimal subset?

                                   Page 23
Q1: How to select motifs?
• Bad selection of motifs for English text searches: substrings of ‘the’ - the most common word in English

Offset  String        at  ea  er  he  te  th
Even    th ea te r        Χ           Χ   √
Odd     t he at er    Χ       Χ   √

[Figure: scanning the input: “The good, the bad and the ugly” in theaters nearby. Motif hits on substrings of ‘the’ trigger repeated full-match attempts for “theater” that end in No match; only the occurrence in “theaters” is a Match.]

                                                                 Page 24
Q1: How to select motifs?
2-Symbol Sequence   Occurrence Probability
bo                  0.0002
re                  0.001861
co                  0.001028
rr                  0.000031
id                  0.001756
or                  0.000444
ri                  0.000284
do                  0.000151
• Use input-specific occurrence
 statistics to optimize motif-sets
• REALISTIC…
                                     Page 25
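To see why the statistics matter, the probabilities above can be summed for the two motif sets marked in the neighbouring selection tables. Treating the sum as a rough measure of how often the slow-path is triggered is an assumption of this sketch; `motif_set_cost` is a hypothetical helper.

```python
# Occurrence probabilities copied from the table above.
PROBABILITY = {
    "bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
    "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151,
}


def motif_set_cost(motifs):
    """Sum of occurrence probabilities over a motif set."""
    return sum(PROBABILITY[m] for m in motifs)


minimum_set = motif_set_cost({"or", "re"})         # about 0.0023
weighted_set = motif_set_cost({"bo", "co", "or"})  # about 0.0017
```

The three-motif set costs less than the two-motif set because "re" alone is more frequent than "bo" and "co" together.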
Q1: How to select motifs?
Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         √                   Χ
Even    co re             √               Χ
Even    co rr id or       √       Χ   √           Χ
Odd     b or e                        √
Odd     c or e                        √
Odd     c or ri do r          Χ       √       Χ

• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
                                    Page 26
Statistics for Motif Selection
[Figure: two charts of occurrence counts per 2-symbol sequence (horizontal axis 0 to 70000, vertical axis: Occurrences). Top: sequences with more than 100,000 occurrences, labeled peaks 00 00, “rn” and FF FF. Bottom: sequences with more than 40,000 occurrences, labeled peaks 00 00, “??” and FF FF.]

• 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
                                                                         Page 27
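Statistics of this kind could be gathered from a sample of the expected input roughly as follows. The function name `pair_statistics` and the idea of sampling a byte buffer are assumptions of this sketch, not details given in the talk.

```python
from collections import Counter


def pair_statistics(sample: bytes):
    """Count every 2-byte sequence in sample and return occurrence probabilities."""
    counts = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


stats = pair_statistics(b"\x00\x00\x00\xff\xff")
# stats[b"\x00\x00"] == 0.5 (2 of the 4 windows)
```

The resulting probabilities can serve as the per-motif costs during motif selection.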
Motif Selection as an ILP Problem
• L: a given string-set
• $T_L$: all 2-symbol substrings of strings in L
• c(t): cost-function for every t in $T_L$

Minimize $\sum_{t \in T_L} c(t) \cdot x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$

Subject To: for every $w \in L$

$\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w, t) \ge 1$, and $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w, t) \ge 1$


                                     Page 28
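On a tiny instance the program above can be solved by enumerating every 0/1 assignment. This sketch assumes that assoc_0(w, t) and assoc_1(w, t) indicate whether t occurs in w at an even or an odd position respectively; the slide does not define them. A real solver would replace the enumeration, and `solve_motif_ilp` is a hypothetical name.

```python
from itertools import product


def solve_motif_ilp(strings, cost):
    """Exhaustively minimise sum(c(t) * x_t) subject to both covering constraints."""
    def pairs(s, parity):
        return {s[i:i + 2] for i in range(parity, len(s) - 1, 2)}

    # One constraint per string and parity: at least one of these t must have x_t = 1.
    constraints = [pairs(w, p) for w in strings for p in (0, 1)]
    candidates = sorted(set().union(*constraints))  # plays the role of T_L
    best, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(candidates)):
        chosen = {t for t, x_t in zip(candidates, x) if x_t}
        if all(chosen & c for c in constraints):
            total = sum(cost[t] for t in chosen)
            if total < best_cost:
                best, best_cost = chosen, total
    return best, best_cost


COST = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}
best, best_cost = solve_motif_ilp(["bore", "core", "corridor"], COST)
# best == {"bo", "co", "or"}
```

With the occurrence probabilities from the earlier slide as costs, the optimum is {bo, co, or} rather than the minimum-size set {or, re}.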
Q2: How to resolve collisions?
Offset:    -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6
bore                       b  o  r  e
core                       c  o  r  e
corridor                   c  o  r  r  i  d  o  r
corridor    c  o  r  r  i  d  o  r
• A2:
- Examine adjacent symbols at
 relative offsets to eliminate strings
- New structure: The Mangled-Trie
                               Page 29
The Mangled-Trie
‘or’ Motif at Offset 0
1. Resolve: Offset -1
   ‘b’: ‘e’ in Offset 2?
        YES: “bore” in Offset -1
        NO: NO MATCH
   ‘c’: 2. Resolve: Offset 2
        ‘e’: “core” in Offset -1
        ‘r’: 3. “idor” in Offset 3?
             YES: “corridor” in Offset -1
             NO: NO MATCH
        OTHER: NO MATCH
   ‘d’: “corri” in Offset -6?
        YES: “corridor” in Offset -6
        NO: NO MATCH
   OTHER: NO MATCH

[Figure: the alignments of bore, core and corridor around the ‘or’ motif (offsets -6 to 6), and the example input “...corricorridor...” with steps 1, 2 and 3 marked.]

                                                            Page 30
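The decision structure above can be written out as nested checks for the single motif ‘or’. This is a hand-coded sketch of one such structure; the function names are hypothetical, and an actual Mangled-Trie would be generated per motif during preprocessing.

```python
def symbol_at(text, i, rel):
    """Symbol at relative offset rel from motif position i, or None if out of range."""
    j = i + rel
    return text[j] if 0 <= j < len(text) else None


def region_is(text, i, rel, expected):
    """True if expected appears in text starting at relative offset rel from i."""
    j = i + rel
    return j >= 0 and text.startswith(expected, j)


def resolve_or(text, i):
    """Resolve an 'or' motif found at text[i:i+2]; return (start, string) or None."""
    before = symbol_at(text, i, -1)      # node 1: resolve offset -1
    if before == "b":
        return (i - 1, "bore") if symbol_at(text, i, 2) == "e" else None
    if before == "d":
        return (i - 6, "corridor") if region_is(text, i, -6, "corri") else None
    if before == "c":
        after = symbol_at(text, i, 2)    # node 2: resolve offset 2
        if after == "e":
            return (i - 1, "core")
        if after == "r":                 # node 3: compare the remainder
            return (i - 1, "corridor") if region_is(text, i, 3, "idor") else None
    return None                          # OTHER: no match


text = "corricorridor"
# resolve_or(text, 1)  -> None             ('c', then 'r', but "idor" is absent)
# resolve_or(text, 6)  -> (5, "corridor")  (first 'or' of corridor)
# resolve_or(text, 11) -> (5, "corridor")  (second 'or' of corridor)
```

Each branch compares only the symbols that have not been confirmed yet, so no string is re-read from its beginning.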
Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios:
 Apply statistics to Mangled-Trie
 construction
- Improve Motif-Set Quality: Avoid
 slow-path altogether when possible


                        Page 32
More Future Work…
• Adaptive System: Collect statistics
 “on-the-go” and improve motif-set
• Faster Preprocessing: Custom
 Branch-and-Cut (Margot ’10)
• Regular Expressions
• Hardware Implementation
• Bouma3?…

                         Page 33
“ Search has always been about
 people. It's not an abstract thing.
 It's not a formula. It's about getting
 people what they need... It depends
 on the type of search you do—and
 how to take all those signals and
 put them together.”
- Udi Manber, Google, 2008
                         Page 34
Thank You
