Bouma2
  Erez Buchnik
   February-2012
“If you can raed tihs,
tehn you are prbbolay not a sttae-mhciane.”
Agenda

•   Problem
•   Existing Solutions
•   Bouma2 – Model
•   Comparisons
•   Algorithm Design in Detail
•   Discussion
The Multiple Exact String-Match Problem

“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find all
occurrences of any of the strings in L that appear in WI”

Uses: antivirus (AV), intrusion prevention (IPS), deep packet inspection (DPI), DNA search, etc.
References

• Aho-Corasick ’75
• Commentz-Walter ’79
• Rabin-Karp ’87
• Wu-Manber ’94
• Muth-Manber ’96
• Hopcroft-Motwani-Ullman ’00
• Dori-Landau ’06
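Aho-Corasick

[Figure: Aho-Corasick automaton for the strings ffe, fov, lad, dan, ada]

State 0 loops on [^flda]. Goto transitions:
  0 -f-> 1 -f-> 2 -e-> 3       (ffe)
         1 -o-> 4 -v-> 5       (fov)
  0 -l-> 7 -a-> 8 -d-> 9       (lad)
  0 -d-> 10 -a-> 11 -n-> 12    (dan)
  0 -a-> 13 -d-> 14 -a-> 15    (ada)

Input: f f e   f o v   l a d   d a n   a d a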
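Wu-Manber

SKIP table, indexed by 2-symbol block (strings ending in a zero-skip block are listed next to it):

Block   SKIP   String
fe      0      ffe
ad      0      lad
an      0      dan
da      0      ada
ov      0      fov
ff      1
fo      1
la      1
..      2

Input: f f e   f o v   l a d   d a n   a d a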
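
The SKIP values above can be reproduced with a short Python sketch. It assumes the usual Wu-Manber construction with 2-symbol blocks: each block gets the smallest distance from its end to the end of any string, and any block that appears in no string gets the default skip. The function name `skip_table` and the `block` parameter are illustrative and do not come from the slides.

```python
def skip_table(patterns, block=2):
    """Return (skip, default): skip maps each block to its shift distance."""
    m = min(len(p) for p in patterns)      # only the first m symbols of each pattern count
    default = m - block + 1                # skip for blocks that occur in no pattern
    skip = {}
    for p in patterns:
        for end in range(block, m + 1):    # 'end' is the 1-based position where the block ends
            b = p[end - block:end]
            skip[b] = min(skip.get(b, default), m - end)
    return skip, default


skip, default = skip_table(["ffe", "fov", "lad", "dan", "ada"])
for b in sorted(skip, key=lambda b: (skip[b], b)):
    print(b, skip[b])
print("..", default)
```

For the five example strings this yields skip 0 for fe, ad, an, da, ov, skip 1 for ff, fo, la, and the default 2 for every other block, as on the slide.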
Rabin-Karp

[Figure: hash table with numbered buckets; bucket 6 holds lad, ffe, fov and bucket 8 holds dan, ada]

Input: f f e   f o v   l a d   d a n   a d a
Bouma2: Motif-Based String Match

Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
Set of selected 2-symbols long substrings: re, ek, bi, at, ok, or

Preprocessing: Map every string to its own substring: Motif

Q1: How to select motifs?
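Bouma2: Motif-Based String Match

Input: “ r a b b i t s   h a t e   c o o k s “

[Figure: motif occurrences in the input and the full-match attempts around them]
- Motif occurrences: bi (in "rabbits"), at (in "hate"), ok (in "cooks")
- Full match: bits and cooks
- No match: boat and book

Match: Examine symbols 2-by-2 (STATELESS, Consume-Order Agnostic);
attempt full match around motif occurrences

Q2: How to resolve collisions?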
Capturing all Occurrences

Input: “ h a b i t s   o f   r a b b i t s “

[Figure: bits matches twice, once in "habits" and once in "rabbits"]

Even-offset occurrences and odd-offset occurrences require separate passes, but
instead...
Upgrade #1: 2-Symbol Strides

Input: “ h a b i t s   o f   r a b b i t s “

[Figure: the input is examined in 2-symbol strides; bits matches at both occurrences]

• We map each string TWICE: once to an even-offset motif, and once to an
  odd-offset motif
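
A minimal Python sketch of why the double mapping lets a single pass with 2-symbol strides capture every occurrence. The motif picks here are an assumption made for illustration only (the first 2-symbol substring at internal offset 0 and the one at internal offset 1); how to select motifs well is Q1. The names `build_motif_table` and `scan` are hypothetical.

```python
from collections import defaultdict


def build_motif_table(strings):
    """Map each string twice: once by an even-offset motif, once by an odd-offset motif."""
    table = defaultdict(list)              # motif -> [(string, offset of motif inside string)]
    for s in strings:
        if len(s) < 3:
            raise ValueError("a string needs at least 3 symbols to have both motifs")
        table[s[0:2]].append((s, 0))       # even-offset motif (illustrative pick)
        table[s[1:3]].append((s, 1))       # odd-offset motif (illustrative pick)
    return table


def scan(text, table):
    """Look only at even input offsets; verify the full string around each motif hit."""
    matches = []
    for pos in range(0, len(text) - 1, 2):             # 2-symbol strides
        for s, off in table.get(text[pos:pos + 2], ()):
            start = pos - off
            if start >= 0 and text.startswith(s, start):
                matches.append((start, s))
    return matches


print(scan("habits of rabbits", build_motif_table(["bits"])))
```

An occurrence that starts at an even input offset is found through its even-offset motif, and one that starts at an odd input offset is found through its odd-offset motif, because in both cases the motif lands on an even input offset. In the example, bits is reported at offset 2 (in "habits") and at offset 13 (in "rabbits").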
Upgrade #2: Fast-Path / Slow-Path

Input: “ h a b i t s   o f   r a b b i t s “

[Figure: the fast-path reports motif occurrences at input offsets 4 and 14]

Fast-Path:
- Stateless (agnostic to consume-order)
- “Monolithic” (zero branches)
- Cache-Aware (small direct-table)
- SIMPLE...
Upgrade #2: Fast-Path / Slow-Path

Input: “ h a b i t s   o f   r a b b i t s “

[Figure: the slow-path takes offsets 4 and 14 and matches bits at both occurrences]

Slow-Path:
- Memory-Efficient (pointers to original strings for comparison)
- “Localized” (separate structure for every motif)
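
The split can be sketched in Python as follows, under stated assumptions: the fast-path is modelled as a 65,536-entry direct table with one flag per 2-byte value, and the slow-path as a per-motif list of (string, motif offset) pairs that point back at the original strings. The motifs "ts" (even offset) and "it" (odd offset) for bits are an assumption, chosen because they reproduce the offsets 4 and 14 shown on the slide. Python cannot express the "zero branches" property, so the sketch only shows the data flow; all names are hypothetical.

```python
def build_direct_table(motif_map):
    """Fast-path table: one flag per possible 2-byte value."""
    direct = bytearray(65536)
    for motif in motif_map:
        direct[(motif[0] << 8) | motif[1]] = 1
    return direct


def fast_path(data, direct):
    """Stateless pass in 2-byte strides; returns the input offsets of motif hits."""
    return [pos for pos in range(0, len(data) - 1, 2)
            if direct[(data[pos] << 8) | data[pos + 1]]]


def slow_path(data, hits, motif_map):
    """Compare the original strings around each hit reported by the fast-path."""
    found = []
    for pos in hits:
        for s, off in motif_map[data[pos:pos + 2]]:
            start = pos - off
            if start >= 0 and data.startswith(s, start):
                found.append((start, s))
    return found


# motif -> [(string, offset of the motif inside the string)]; illustrative picks
motif_map = {b"ts": [(b"bits", 2)], b"it": [(b"bits", 1)]}
data = b"habits of rabbits"
hits = fast_path(data, build_direct_table(motif_map))
print(hits)
print(slow_path(data, hits, motif_map))
```

With these motifs the fast-path reports hits at offsets 4 and 14, and the slow-path confirms bits starting at offsets 2 and 13.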
Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• P – Probability of motif-match
• l – length of the longest string

Match Complexities:
- Aho-Corasick: O(n + S)
- Bouma2: O(n (0.5 + P (l − 2)))
Benchmark

-   Performed against the Snort implementation of Aho-Corasick
-   Tested with 1GB of genuine IP traffic recorded at an ISP site
-   Database included 4,841 unique strings extracted from Snort rules, 3 bytes
    long or longer
-   Aggregate size of database strings: 98,546 bytes
-   Tested using Snort source-code merged with Bouma2 on an Intel Core2
    Duo 2.53GHz with 1.95GB RAM running Windows XP SP3
-   Profiled with Visual Studio 2010 Sampling Profiler
-   For Bouma2, three different motif-selection methods were compared:
    B2-M (Minimum): Minimum motifs
    B2-RS (Rare in Strings): Prefer motifs that occur fewer times within the
    database strings
    B2-RI (Rare in Input): Prefer motifs that are expected to occur fewer times in the
    input (based on statistics over one third of the input)
Benchmark – Bouma2 vs. Snort AC (Throughput)

[Chart: Throughput (Mbit/sec, 0 to 3,500) vs. Total String Size (bytes, 0 to 100,000); series: AC, B2-M, B2-RS, B2-RI]
Benchmark – Bouma2 vs. Snort AC (Memory)
- Snort creates several AC instances, which are pre-filtered by port
- The comparison was done against a single Bouma2 instance

[Chart: Memory Consumption (bytes, 0 to 50,000,000) vs. Total String Size (bytes, 0 to 100,000); series: AC, B2-M, B2-RS, B2-RI]
Q1: How to select motifs?

Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         •                   •
Even    co re             •               •
Even    co rr id or       •       •   •           •
Odd     b or e                        •
Odd     c or e                        •
Odd     c or ri do r          •       •       •

•   A1: Out of all 2-symbol substrings, find a
    minimum subset that covers all given strings
    (even & odd offsets)
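
The covering problem in A1 can be sketched in Python. This is a greedy set-cover heuristic, swapped in for illustration; it is not guaranteed to find a true minimum in general and is not claimed to be the method used by Bouma2. The helper names are hypothetical.

```python
def parity_motifs(word, parity):
    """2-symbol substrings of word that start at offsets of the given parity (0 or 1)."""
    return {word[i:i + 2] for i in range(parity, len(word) - 1, 2)}


def greedy_motif_cover(strings):
    # One constraint per (string, parity): at least one of its motifs must be selected.
    constraints = {(w, p): parity_motifs(w, p) for w in strings for p in (0, 1)}
    uncovered = {key for key, motifs in constraints.items() if motifs}
    selected = []
    while uncovered:
        candidates = sorted({m for key in uncovered for m in constraints[key]})
        # Pick the motif that covers the most still-uncovered constraints.
        best = max(candidates,
                   key=lambda m: sum(m in constraints[key] for key in uncovered))
        selected.append(best)
        uncovered = {key for key in uncovered if best not in constraints[key]}
    return selected


print(greedy_motif_cover(["bore", "core", "corridor"]))
```

For bore, core and corridor the heuristic first picks "or" (it covers four of the six rows) and then "re", so two motifs cover all even-offset and odd-offset rows of the table.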
Q1: How to select motifs?

Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         Χ                   √
Even    co re             Χ               √
Even    co rr id or       Χ       Χ   √           Χ
Odd     b or e                        √
Odd     c or e                        √
Odd     c or ri do r          Χ       √       Χ

• But... maybe the minimum subset is not
  the optimal subset?
Q1: How to select motifs?

Bad selection of motifs for English text searches:
substrings of ‘the’ - the most common word in English

Offset  String        at  ea  er  he  te  th
Even    th ea te r        Χ           Χ   √
Odd     t he at er    Χ       Χ   √

Input: “The good, the bad and the ugly“ in theaters nearby

[Figure: full-match attempts for theater around motif occurrences in the input; the attempts at "The"/"the" end in No match, and theater is found only in "theaters"]
Q1: How to select motifs?

2-Symbol Sequence   Occurrence Probability
bo                  0.0002
re                  0.001861
co                  0.001028
rr                  0.000031
id                  0.001756
or                  0.000444
ri                  0.000284
do                  0.000151

•   Use input-specific occurrence statistics to
    optimize motif-sets
•   REALISTIC...
Q1: How to select motifs?

Offset  String        bo  co  do  id  or  re  ri  rr
Even    bo re         √                   Χ
Even    co re             √               Χ
Even    co rr id or       √       Χ   √           Χ
Odd     b or e                        √
Odd     c or e                        √
Odd     c or ri do r          Χ       √       Χ

•   NOTE: After selecting the motif-set, remove
    redundant mappings from the final String-to-Motif
    mapping
Statistics for Motif Selection

[Figure: two plots of Occurrences per 2-symbol sequence (x-axis 0 to 70,000). Top: sequences with more than 100,000 occurrences, with labeled peaks 00 00, “rn” and FF FF. Bottom: sequences with more than 40,000 occurrences, with labeled peaks 00 00, “??” and FF FF.]

• 2-symbol sequence statistics: IP traffic (top) vs.
  OS files (bottom)
Motif Selection as an ILP Problem
• L: a given string-set
• TL: all 2-symbol substrings of strings in L
• c(t): cost-function for every t in TL

Minimize      Σ_{t ∈ TL} c(t)·xt
where         xt ∈ {0,1} for every t ∈ TL
Subject To:   for every w ∈ L
              Σ_{t ∈ TL} xt·assoc0(w, t) ≥ 1 , and  Σ_{t ∈ TL} xt·assoc1(w, t) ≥ 1
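
For a toy instance the ILP can be solved by exhaustive search. The Python sketch below makes two assumptions that the slides do not spell out: assoc0(w, t) and assoc1(w, t) are taken to be 1 when t occurs in w at an even or an odd offset respectively, and c(t) is taken to be the occurrence probability listed earlier for each 2-symbol sequence. Brute force is only practical for a handful of motifs; a real instance would need an ILP solver. All names are hypothetical.

```python
from itertools import combinations

# c(t): occurrence probabilities from the earlier 2-symbol sequence table
COST = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def parity_motifs(word, parity):
    """Motifs t with assoc_parity(word, t) = 1 (assumed meaning of assoc0/assoc1)."""
    return {word[i:i + 2] for i in range(parity, len(word) - 1, 2)}


def min_cost_motifs(strings, cost):
    constraints = [parity_motifs(w, p) for w in strings for p in (0, 1)]
    constraints = [c for c in constraints if c]
    universe = sorted(set().union(*constraints))
    best, best_cost = None, float("inf")
    for r in range(1, len(universe) + 1):
        for subset in combinations(universe, r):
            chosen = set(subset)                       # the motifs with x_t = 1
            if all(c & chosen for c in constraints):   # every constraint sum is >= 1
                total = sum(cost[t] for t in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best, best_cost


best, total = min_cost_motifs(["bore", "core", "corridor"], COST)
print(sorted(best), round(total, 6))
```

With these costs the cheapest cover is {bo, co, or} at a total of 0.001672, compared with 0.002305 for the two-motif cover {or, re}: the minimum subset is not the cheapest one.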
Q2: How to resolve collisions?

Strings that contain the motif "or", aligned with the motif at offsets 0 and 1:

Offset:   -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6
bore                      b  o  r  e
core                      c  o  r  e
corridor                  c  o  r  r  i  d  o  r
corridor   c  o  r  r  i  d  o  r

• A2:
- New structure: The Mangled-Trie
- Examine adjacent symbols at relative offsets to
  eliminate strings
- The Mangled-Trie itself dictates where to look next
  (instead of following a strict left-to-right sequence)
The Mangled-Trie

‘or’ Motif at Offset 0
1. Resolve: Offset -1
   ‘b’:   ‘e’ in Offset 2?
          YES: “bore” in Offset -1
          NO:  NO MATCH
   ‘c’:   2. Resolve: Offset 2
          ‘e’:   “core” in Offset -1
          ‘r’:   3. “idor” in Offset 3?
                 YES: “corridor” in Offset -1
                 NO:  NO MATCH
          OTHER: NO MATCH
   ‘d’:   “corri” in Offset -6?
          YES: “corridor” in Offset -6
          NO:  NO MATCH
   OTHER: NO MATCH

Example input: ...corricorridor... (resolved in steps 1, 2, 3)
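
The decision tree for the ‘or’ motif can be transcribed by hand into Python to show the order in which offsets are examined. This is only a hard-coded illustration for the four alignments of bore, core and corridor; an actual Mangled-Trie would presumably be a data structure built during preprocessing. The function name `resolve_or` is hypothetical.

```python
def resolve_or(text, pos):
    """Resolve a hit of the motif 'or' at text[pos:pos+2].

    Returns (string, start offset in text) or None for NO MATCH.
    """
    if text[pos:pos + 2] != "or":
        return None

    def at(rel, expected):
        # True if 'expected' appears at relative offset 'rel' from the motif.
        start = pos + rel
        return start >= 0 and text.startswith(expected, start)

    prev = text[pos - 1] if pos >= 1 else ""           # step 1: look at offset -1
    if prev == "b":
        return ("bore", pos - 1) if at(2, "e") else None
    if prev == "c":
        nxt = text[pos + 2:pos + 3]                    # step 2: look at offset 2
        if nxt == "e":
            return ("core", pos - 1)
        if nxt == "r":                                 # step 3: compare the tail
            return ("corridor", pos - 1) if at(3, "idor") else None
        return None
    if prev == "d":
        return ("corridor", pos - 6) if at(-6, "corri") else None
    return None


text = "corricorridor"
for pos in (1, 6, 11):                                 # the three 'or' hits in text
    print(pos, resolve_or(text, pos))
```

On "corricorridor" the hit at offset 1 fails at step 3, while the hits at offsets 6 and 11 both resolve to the same corridor occurrence starting at offset 5. The sketch reports it from either hit; with one even-offset and one odd-offset motif per string, only one of the two hits would be mapped to corridor.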
Q3: How to optimize slow-path?

• A3:
- Optimize Frequent Scenarios: Apply statistics
 to Mangled-Trie construction
- Improve Motif-Set Quality: Avoid slow-path
 altogether when possible
Bouma2: Hash-Functions Revisited
  Erez Buchnik
   March-2012
Hash Functions

    What is a Hash-Function?
    “A hash function is any algorithm or subroutine that maps large data sets of
    variable length, called keys, to smaller data sets of a fixed length. ...
    The values returned by a hash function are called hash values, hash
    codes, hash sums, checksums or simply hashes.”

    What is a GOOD (non-cryptographic) Hash-Function?
    “A good hash function should map the expected inputs as evenly as possible
    over its output range. That is, every hash value in the output range should be
    generated with roughly the same probability.”

    What input should we expect?

Bouma2 defines a hash-function:
-   A tailored, optimized mapping of
    strings to their own substrings.
-   Collision-resolving is also optimized,
    based on relative offset information
The Multiple Exact String-Match Problem

“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”




FACT: The definition of the problem DOES
NOT imply that we must scan the input from
left to right, or in any other order.
The Multiple Exact String-Match Problem

“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”



CLAIM: Algorithms that impose a
consume-order constraint are in general
less efficient than algorithms that are
free of this constraint.
The Multiple Exact String-Match Problem

“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”

Which dominant factor should we choose when designing an
efficient string-match algorithm?...

[Figure: Naïve Approach 5000, Aho-Corasick 1500, Bouma2 15]

Bouma2

  • 1. Bouma2 Erez Buchnik February-2012
  • 2. ”If you can raed tihs, tehn you are prbbolay not a sttae-mhciane.”
  • 3. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 4. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 5. The Multiple Exact String-Match Problem “Given a string-set L ⊆ Σ ∗ and an input stream WI ∈ Σ∗, find all occurrences of any of the strings in L that appear in WI” Uses: AV, IPS, DPI, DNA Search etc...
  • 6. References • Aho-Corasick ’75 • Commentz-Walter ’79 • Rabin-Karp ’87 • Wu-Manber ’94 • Muth-Manber ’96 • Hopcroft-Motwani-Ullman ’00 • Dori-Landau ’06
  • 7. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 8. Aho-Corasick [^flda] 0 l f d a 1 7 10 13 f o a a d 2 4 8 11 14 e v d n a 3 5 9 12 15 f f e f o v l a d d a n a d a
  • 9. Wu-Manber SKIP fe 0 ffe ad 0 lad an 0 dan da 0 ada ov 0 fov ff 1 fo 1 la 1 .. 2 f f e f o v l a d d a n a d a
  • 10. Rabin-Karp 0 1 2 3 0 4 5 6 lad ffe fov 7 8 dan ada 9 10 0 11 12 f f e f o v l a d d a n a d a
  • 11. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 12. Bouma2: Motif-Based String Match Set of Set of selected bore re strings 2-symbols long core ek substrings trek bits bi corridor at boat book ok cooks or Preprocessing: Map every string to its own substring: Motif Q1: How to select motifs?
  • 13. Bouma2: Motif-Based String Match “ r a b b i t s h a t e c o o k s “ No match No match b o a t b o o k Match Match Match b i t s c o o k s Match: Examine symbols 2-by-2 (STATELESS, Consume-Order Agnostic); attempt full match around motif occurrences Q2: How to resolve collisions?
  • 14. Capturing all Occurrences “ h a b i t s o f r a b b i t s “ Match Match b i t s b i t s Even-offset occurrences and odd-offset occurrences require separate passes, but instead...
  • 15. Upgrade #1: 2-Symbol Strides “ h a b i t s o f r a b b i t s “ Match Match Match b i t s b i t s • We map each string TWICE: once to an even-offset motif, and once to an odd- offset motif
  • 16. Upgrade #2: Fast-Path / Slow-Path 4 14 “ h a b i t s o f r a b b i t s “ 4 14 Fast-Path: - Stateless (agnostic to consume-order) - “Monolithic” (zero branches) - Cache-Aware (small direct-table) - SIMPLE...
  • 17. Upgrade #2: Fast-Path / Slow-Path 4 14 4 “ h a b i t s o f r a b b i t s “ 14 Match Match Match b i t s b i t s Slow-Path: - Memory-Efficient (pointers to original strings for comparison) - “Localized” (separate structure for every motif)
  • 18. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 19. Bouma2 vs. Aho-Corasick • n – length of input • S – no. of string-matches in n • P – Probability of motif-match • l – length of the longest string Match Complexities: - Aho-Corasick: O( n S ) - Bouma2: O(n (0.5 P (l 2)))
  • 20. Benchmark - Performed against the Snort implementation of Aho-Corasick - Tested with 1GB of genuine IP traffic recorded at an ISP site - Database included 4,841 unique strings extracted from Snort rules, 3 bytes long or longer - Aggregate size of database strings: 98,546 bytes - Tested using Snort source-code merged with Bouma2 over Intel Core2 Duo 2.53GHz with 1.95GB RAM running XP SP3 - Profiled with Visual Studio 2010 Sampling Profiler - For Bouma2, three different motif-selection methods were compared: B2-M (Minimum): Minimum motifs B2-RS (Rare in Strings): Prefer motifs that occur less times within the database strings B2-RI (Rare in Input): Prefer motifs that are expected to occur less times in the input (based on statistics over one third of the input)
  • 21. Benchmark – Bouma2 vs. Snort AC (Throughput) Throughput (Mbit/sec) 3,500.00 3,000.00 2,500.00 2,000.00 AC B2-M B2-RS 1,500.00 B2-RI 1,000.00 500.00 Total String Size 0.00 (bytes) 0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000 90,000 100,000
  • 22. Benchmark – Bouma2 vs. Snort AC (Memory) - Snort creates several AC instances, which are pre-filtered by port - The comparison was done against a single Bouma2 instance Memory Consumption (bytes) 50,000,000 40,000,000 30,000,000 AC B2-M B2-RS 20,000,000 B2-RI 10,000,000 Total 0 String Size 0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000 90,000 100,000 (bytes)
  • 23. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 24. Q1: How to select motifs? bo co do id or re ri rr bo re • • Even Offset co re • • co rr id or • • • • b or e • Odd Offset c or e • c or ri do r • • • • A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
  • 25. Q1: How to select motifs?
    Candidate motifs: bo co do id or re ri rr
    Even Offset:
      bo re         bo Χ, re √
      co re         co Χ, re √
      co rr id or   co Χ, id Χ, or √, rr Χ
    Odd Offset:
      b or e        or √
      c or e        or √
      c or ri do r  do Χ, or √, ri Χ
    Selected (√): or, re
    • But... maybe the minimum subset is not the optimal subset?
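The minimum-subset selection can be sketched as an exhaustive search over the candidate motifs of the three example strings. This is only an illustration under the assumption that "covers" means every string contains a selected motif at some even offset and at some odd offset; the function names are hypothetical, and brute force is practical only for toy inputs like this one.

```python
from itertools import combinations

def motifs_by_parity(w):
    """Return (even, odd): the 2-symbol substrings of w starting at even / odd offsets."""
    even = {w[i:i + 2] for i in range(0, len(w) - 1, 2)}
    odd = {w[i:i + 2] for i in range(1, len(w) - 1, 2)}
    return even, odd

def covers(selected, strings):
    """True if every string has a selected motif at an even offset and at an odd offset."""
    return all(selected & even and selected & odd
               for even, odd in map(motifs_by_parity, strings))

def minimum_motif_sets(strings):
    """Return all covering motif-sets of the smallest possible size."""
    candidates = sorted(set().union(*(e | o for e, o in map(motifs_by_parity, strings))))
    for size in range(1, len(candidates) + 1):
        found = [set(c) for c in combinations(candidates, size) if covers(set(c), strings)]
        if found:
            return found
    return []

print(minimum_motif_sets(["bore", "core", "corridor"]))
```

For these strings the only minimum cover is the pair 'or' and 're'. Note that a 2-symbol string has no odd-offset motif at all, which fits the benchmark database being limited to strings of 3 bytes or longer.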
  • 26. Q1: How to select motifs?
    Bad selection of motifs for English text searches: substrings of ‘the’ - the most common word in English
    Candidate motifs: at ea er he te th
    Even Offset:  th ea te r   ea Χ, te Χ, th √
    Odd Offset:   t he at er   at Χ, er Χ, he √
    Input: “The good, the bad and the ugly” in theaters nearby
    [Figure: the string “thea ter” aligned against each ‘the’ in the input, with per-position Match / No match labels]
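A small sketch of why motifs taken from ‘the’ are a poor choice for English input: it counts how many probed 2-symbol windows hit a selected motif. It assumes even-position probing, and it uses a lower-cased copy of the slide's input with the quotation marks removed; `count_motif_hits` is a hypothetical helper, and the alternative set {te, er} is just one other cover of "theater" picked for comparison.

```python
def count_motif_hits(text, motifs):
    """Count even-position 2-symbol windows of text that equal a selected motif."""
    return sum(text[pos:pos + 2] in motifs for pos in range(0, len(text) - 1, 2))

TEXT = "the good, the bad and the ugly in theaters nearby"

common = count_motif_hits(TEXT, {"th", "he"})  # motifs that are substrings of 'the'
rarer = count_motif_hits(TEXT, {"te", "er"})   # another even/odd cover of "theater"
print(common, rarer)
```

With this input, the {th, he} set is hit 4 times although "theater" occurs once, so 3 hits are wasted slow-path checks, while {te, er} is hit only at the real occurrence.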
  • 27. Q1: How to select motifs?
    2-Symbol Sequence   Occurrence Probability
    bo                  0.0002
    re                  0.001861
    co                  0.001028
    rr                  0.000031
    id                  0.001756
    or                  0.000444
    ri                  0.000284
    do                  0.000151
    • Use input-specific occurrence statistics to optimize motif-sets
    • REALISTIC...
  • 28. Q1: How to select motifs?
    Candidate motifs: bo co do id or re ri rr
    Even Offset:
      bo re         bo √, re Χ
      co re         co √, re Χ
      co rr id or   co √, id Χ, or √, rr Χ
    Odd Offset:
      b or e        or √
      c or e        or √
      c or ri do r  do Χ, or √, ri Χ
    Selected (√): bo, co, or
    • NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
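One possible reading of the NOTE, sketched below: once the motif-set is fixed, each string needs only one motif per offset parity, so any further selected motif at the same parity is a redundant mapping. This interpretation, the tie-break rule (keep the motif with the lowest occurrence probability) and the name `prune_mappings` are assumptions, not something the slide states. The probabilities are the ones listed on slide 27.

```python
PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}

def prune_mappings(strings, selected, prob):
    """Keep one (motif, offset) per string and parity: the rarest selected motif.

    Raises ValueError if a string has no selected motif at some parity.
    """
    mapping = {}
    for w in strings:
        kept = {}
        for parity in (0, 1):  # 0 = even offsets, 1 = odd offsets
            options = [(prob[w[i:i + 2]], i, w[i:i + 2])
                       for i in range(parity, len(w) - 1, 2)
                       if w[i:i + 2] in selected]
            _, offset, motif = min(options)  # lowest probability wins
            kept[parity] = (motif, offset)
        mapping[w] = kept
    return mapping

for w, kept in prune_mappings(["bore", "core", "corridor"], {"bo", "co", "or"}, PROB).items():
    print(w, kept)
```

Under these assumptions "corridor" keeps only 'or' (offset 6 for the even parity, offset 1 for the odd parity) and its mapping to 'co' is dropped, so a 'co' hit in the input only needs to be checked against "core".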
  • 29. Statistics for Motif Selection
    [Figure: two plots of Occurrences over the range 0 to 70,000. Top: labeled points 00 00 (more than 100,000), “rn” and FF FF. Bottom: labeled points 00 00 (more than 40,000), FF FF and “??”.]
    • 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
  • 30. Motif Selection as an ILP Problem
    • L: a given string-set
    • TL: all 2-symbol substrings of strings in L
    • c(t): cost-function for every t in TL
    Minimize:
      Σ_{t ∈ TL} c(t) · x_t,  where x_t ∈ {0,1} for every t ∈ TL
    Subject To:
      for every w ∈ L:  Σ_{t ∈ TL} x_t · assoc0(w, t) ≥ 1,  and  Σ_{t ∈ TL} x_t · assoc1(w, t) ≥ 1
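The 0/1 program can be checked on the three-string example by enumerating every assignment of the x_t variables. This is a sketch under two assumptions: that assoc0(w, t) and assoc1(w, t) indicate whether motif t occurs in w at an even or an odd offset, and that c(t) is the occurrence probability from slide 27. A real database would need an ILP solver or a heuristic; exhaustive search is used here only because there are 8 candidates.

```python
from itertools import product

STRINGS = ["bore", "core", "corridor"]
COST = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}

def assoc(parity, w, t):
    """1 if motif t occurs in w at an offset of the given parity (0 even, 1 odd), else 0."""
    return int(any(w[i:i + 2] == t for i in range(parity, len(w) - 1, 2)))

def solve_motif_ilp(strings, cost):
    """Return (motif-set, total cost) minimizing the sum of c(t) * x_t over feasible x."""
    candidates = sorted({w[i:i + 2] for w in strings for i in range(len(w) - 1)})
    best_set, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(candidates)):
        # Constraint: each string is covered at least once per parity.
        feasible = all(
            sum(xt * assoc(parity, w, t) for xt, t in zip(x, candidates)) >= 1
            for w in strings for parity in (0, 1)
        )
        total = sum(xt * cost[t] for xt, t in zip(x, candidates))
        if feasible and total < best_cost:
            best_set, best_cost = {t for xt, t in zip(x, candidates) if xt}, total
    return best_set, best_cost

print(solve_motif_ilp(STRINGS, COST))
```

With these costs the optimum is the three-motif set bo, co, or (total 0.001672), which is cheaper than the two-motif minimum cover or, re (total 0.002305).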
  • 31. Q2: How to resolve collisions?
    Strings aligned on the ‘or’ motif at offset 0:
      -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6
                      b  o  r  e
                      c  o  r  e
                      c  o  r  r  i  d  o  r
       c  o  r  r  i  d  o  r
    • A2:
      - New structure: The Mangled-Trie
      - Examine adjacent symbols at relative offsets to eliminate strings
      - The Mangled-Trie itself dictates where to look next (instead of following a strict left-to-right sequence)
  • 32. The Mangled-Trie
    ‘or’ motif at offset 0
    1. Resolve offset -1:
       ‘b’: ‘e’ in offset 2? YES: MATCH “bore” in offset -1. NO: NO MATCH
       ‘d’: “corri” in offset -6? YES: MATCH “corridor” in offset -6. NO: NO MATCH
       ‘c’: go to 2
       OTHER: NO MATCH
    2. Resolve offset 2:
       ‘e’: MATCH “core” in offset -1
       ‘r’: go to 3
       OTHER: NO MATCH
    3. “idor” in offset 3? YES: MATCH “corridor” in offset -1. NO: NO MATCH
    Example input: ...corricorridor... (offsets -6 to 6 around the motif, resolved in steps 1, 2, 3)
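The decision sequence for the ‘or’ motif can be written out as plain branching code. This is a hand-coded sketch of this one example trie, not the author's data structure or construction procedure; `resolve_or` and `at` are hypothetical names, and the caller is assumed to have already found ‘or’ at `text[m:m+2]`.

```python
def at(text, pos, expected):
    """True if `expected` appears in text starting at pos (False when out of range)."""
    return pos >= 0 and text.startswith(expected, pos)

def resolve_or(text, m):
    """Resolve an 'or' motif located at text[m:m+2]; return (string, start) or None."""
    prev = text[m - 1] if m >= 1 else ""       # step 1: look at offset -1
    if prev == "b":
        return ("bore", m - 1) if at(text, m + 2, "e") else None
    if prev == "d":
        return ("corridor", m - 6) if at(text, m - 6, "corri") else None
    if prev == "c":
        nxt = text[m + 2:m + 3]                # step 2: look at offset 2
        if nxt == "e":
            return ("core", m - 1)
        if nxt == "r":                         # step 3: check "idor" at offset 3
            return ("corridor", m - 1) if at(text, m + 3, "idor") else None
    return None

text = "corricorridor"
for m in (1, 6, 11):                           # the three positions of 'or' in text
    print(m, resolve_or(text, m))
```

On the slide's example input, the ‘or’ at position 1 is rejected in step 3, while the ones at positions 6 and 11 both resolve to the same occurrence of "corridor" starting at position 5. Note that the symbols are not examined left to right: the trie jumps to whichever relative offset discriminates best.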
  • 33. Q3: How to optimize slow-path? • A3: - Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction - Improve Motif-Set Quality: Avoid slow-path altogether when possible
  • 34. Agenda • Problem • Existing Solutions • Bouma2 – Model • Comparisons • Algorithm Design in Detail • Discussion
  • 35. Bouma2: Hash-Functions Revisited Erez Buchnik March-2012
  • 36. Hash Functions
    What is a Hash-Function?
    “A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length. ... The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.”
    What input should we expect?
    What is a GOOD (non-cryptographic) Hash-Function?
    “A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.”
  • 37. Bouma2 defines a hash-function: - A tailored, optimized mapping of strings to their own substrings. - Collision-resolving is also optimized, based on relative offset information
  • 38. The Multiple Exact String-Match Problem “Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find all occurrences of any of the strings in L that appear in WI” FACT: The definition of the problem DOES NOT imply that we must scan the input from left to right, or in any other order.
  • 39. The Multiple Exact String-Match Problem “Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find all occurrences of any of the strings in L that appear in WI” CLAIM: Algorithms that impose a consume-order constraint are in general less efficient than algorithms that are free of this constraint.
  • 40. The Multiple Exact String-Match Problem
    “Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find all occurrences of any of the strings in L that appear in WI”
    Which dominant factor should we choose when designing an efficient string-match algorithm?...
    [Figure: values 5000, 1500 and 15 shown alongside the labels Naïve Approach, Aho-Corasick and Bouma2]
