2. ”If you can raed tihs,
tehn you are prbbolay not a sttae-mhciane.”
3. Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Algorithm Design in Detail
• Discussion
5. The Multiple Exact String-Match Problem
“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”
Uses: antivirus (AV), intrusion prevention (IPS), deep packet inspection (DPI), DNA search, etc.
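As a point of reference for the problem statement above, here is a minimal brute-force sketch. The function name `find_all` and the sample strings are illustrative assumptions, not part of the talk; the sketch only fixes what "all occurrences" means, including overlapping ones.

```python
def find_all(strings, text):
    """Return sorted (offset, string) pairs for every occurrence of any string in text."""
    hits = []
    for s in strings:
        start = text.find(s)
        while start != -1:
            hits.append((start, s))
            # Advance by one symbol so that overlapping occurrences are kept.
            start = text.find(s, start + 1)
    return sorted(hits)


print(find_all({"bits", "cooks", "boat"}, "rabbits hate cooks"))
```

This baseline rescans the whole input once per string, which is the cost the algorithms on the following slides try to avoid.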
8. Aho-Corasick
[Figure: Aho-Corasick trie with states 0 to 15 for the strings ffe, fov, lad, dan and ada; the root loops on any symbol other than f, l, d, a]
Input: f f e f o v l a d d a n a d a
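The trie on this slide can be sketched in code as follows. This is a textbook Aho-Corasick construction under my own naming (`build_ac`, `ac_search`), not the Snort implementation benchmarked later; state numbers will not necessarily agree with the figure.

```python
from collections import deque


def build_ac(strings):
    """Build goto/fail/output tables; state 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def ac_search(strings, text):
    goto, fail, out = build_ac(strings)
    state, hits = 0, []
    for i, ch in enumerate(text):          # strictly left to right, one state carried along
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i - len(s) + 1, s) for s in out[state])
    return sorted(hits)


print(ac_search(["ffe", "fov", "lad", "dan", "ada"], "ffefovladdanada"))
```

Note that the scan carries a state from symbol to symbol, so the input has to be consumed in order.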
9. Wu-Manber
SKIP table (2-symbol block: skip value, candidate string):
fe: 0 (ffe)
ad: 0 (lad)
an: 0 (dan)
da: 0 (ada)
ov: 0 (fov)
ff: 1
fo: 1
la: 1
any other block: 2
Input: f f e f o v l a d d a n a d a
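The SKIP table above can be reproduced with a short sketch. This is a simplified Wu-Manber variant written for illustration: the names `wm_tables` and `wm_search` are mine, the block size is fixed at 2, and a plain dictionary of candidates stands in for the hash and prefix tables of the original algorithm. It assumes every string is at least 2 symbols long.

```python
def wm_tables(strings, B=2):
    """Return (m, shift, bucket) for block size B; m is the shortest string length."""
    m = min(len(s) for s in strings)
    shift, bucket = {}, {}
    for s in strings:
        for j in range(m - B + 1):
            block = s[j:j + B]
            # Distance from this block to the end of the m-symbol window.
            shift[block] = min(shift.get(block, m), m - B - j)
        bucket.setdefault(s[m - B:m], []).append(s)
    return m, shift, bucket


def wm_search(strings, text, B=2):
    m, shift, bucket = wm_tables(strings, B)
    default = m - B + 1                    # skip for blocks that occur in no string
    hits, pos = [], m - B                  # pos: start of the block that ends the window
    while pos + B <= len(text):
        block = text[pos:pos + B]
        step = shift.get(block, default)
        if step == 0:
            start = pos - (m - B)
            hits.extend((start, s) for s in bucket[block] if text.startswith(s, start))
            step = 1
        pos += step
    return sorted(hits)


strings = ["ffe", "lad", "dan", "ada", "fov"]
print(wm_tables(strings)[1])
print(wm_search(strings, "ffefovladdanada"))
```

For the five strings of the slide the computed skip values are 0 for fe, ad, an, da, ov and 1 for ff, fo, la, with 2 as the default.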
10. Rabin-Karp
[Figure: hash table with buckets 0 to 12; bucket 6 holds lad, ffe and fov; bucket 8 holds dan and ada]
Input: f f e f o v l a d d a n a d a
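A rolling-hash sketch of the multi-string Rabin-Karp idea follows. The hash function, the base of 256 and the function name `rk_search` are assumptions; only the bucket count of 13 is taken from the figure, and the bucket each string lands in will generally differ from the slide.

```python
def rk_search(strings, text, buckets=13, base=256):
    m = min(len(s) for s in strings)       # window length: the shortest string

    def h(window):
        value = 0
        for ch in window:
            value = (value * base + ord(ch)) % buckets
        return value

    # Hash table keyed by the hash of each string's first m symbols.
    table = {}
    for s in strings:
        table.setdefault(h(s[:m]), []).append(s)

    hits = []
    if len(text) < m:
        return hits
    top = pow(base, m - 1, buckets)        # weight of the symbol leaving the window
    value = h(text[:m])
    for start in range(len(text) - m + 1):
        # Strings sharing a bucket collide and must be compared explicitly.
        for s in table.get(value, []):
            if text.startswith(s, start):
                hits.append((start, s))
        if start + m < len(text):
            value = ((value - ord(text[start]) * top) * base + ord(text[start + m])) % buckets
    return sorted(hits)


print(rk_search(["ffe", "fov", "lad", "dan", "ada"], "ffefovladdanada"))
```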
12. Bouma2: Motif-Based String Match
Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
Set of selected 2-symbol substrings: re, ek, bi, at, ok, or
Preprocessing: Map every string to one of its own substrings, its Motif
Q1: How to select motifs?
13. Bouma2: Motif-Based String Match
Input: “rabbits hate cooks”
Full-match attempts around motif occurrences: boat (No match), book (No match), bits (Match), cooks (Match)
Match: Examine symbols 2-by-2 (STATELESS, Consume-Order Agnostic); attempt full match around motif occurrences
Q2: How to resolve collisions?
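The basic model of the last two slides can be sketched as follows. The string-to-motif assignment in `MOTIF_OF` is my reading of the flattened figure and should be treated as an assumption, as are the function names; the collision handling here is a plain comparison against each candidate, not the structure the talk introduces later.

```python
# Assumed assignment of each string to one of the selected motifs.
MOTIF_OF = {
    "bore": "re", "core": "re", "trek": "ek", "bits": "bi",
    "corridor": "or", "boat": "at", "book": "ok", "cooks": "ok",
}


def build_motif_map(motif_of):
    """Map each motif to the strings it stands for, with the motif's offset in the string."""
    table = {}
    for s, motif in motif_of.items():
        table.setdefault(motif, []).append((s, s.index(motif)))
    return table


def match(table, text):
    hits = []
    for i in range(len(text) - 1):                 # each 2-symbol window is looked up on its own
        for s, off in table.get(text[i:i + 2], []):
            start = i - off
            # A motif occurrence only proposes candidates; verify the full string around it.
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return sorted(hits)


print(match(build_motif_map(MOTIF_OF), "rabbits hate cooks"))
```

On this input the motif "at" proposes boat and the motif "ok" proposes book and cooks; only bits and cooks survive the full comparison.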
14. Capturing all Occurrences
Input: “habits of rabbits”
Two matches of “bits”
Even-offset occurrences and odd-offset occurrences require separate passes, but instead...
15. Upgrade #1: 2-Symbol Strides
Input: “habits of rabbits”
Both occurrences of “bits” are matched
• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
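A minimal sketch of the double mapping: each string is registered under one even-offset and one odd-offset 2-symbol substring, and the input is then examined only at even offsets. Choosing the substrings at offsets 0 and 1 is an arbitrary assumption made to keep the example short (it requires strings of at least 3 symbols); function names are illustrative.

```python
def build_stride_table(strings):
    table = {}
    for s in strings:
        for off in (0, 1):                         # one even-offset and one odd-offset motif
            table.setdefault(s[off:off + 2], []).append((s, off))
    return table


def stride_match(table, text):
    hits = []
    for i in range(0, len(text) - 1, 2):           # 2-symbol strides: even input offsets only
        for s, off in table.get(text[i:i + 2], []):
            start = i - off
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return sorted(hits)


print(stride_match(build_stride_table(["bits"]), "habits of rabbits"))
```

An occurrence that starts at an even input offset is caught through its even-offset motif, and one that starts at an odd offset through its odd-offset motif, so a single pass over half the positions is enough.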
16. Upgrade #2: Fast-Path / Slow-Path
[Figure: the input “habits of rabbits” with the values 4 and 14 marked]
Fast-Path:
- Stateless (agnostic to consume-order)
- “Monolithic” (zero branches)
- Cache-Aware (small direct-table)
- SIMPLE...
17. Upgrade #2: Fast-Path / Slow-Path
[Figure: the input “habits of rabbits” with the values 4 and 14 marked; both occurrences of “bits” are matched]
Slow-Path:
- Memory-Efficient (pointers to original strings for comparison)
- “Localized” (separate structure for every motif)
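The split can be sketched like this, with heavy simplification. The 64K-entry flag table, the per-motif candidate lists and all names are assumptions; Python cannot show the branch-free or cache-related properties, so only the division of work is illustrated. The motif offsets chosen for "bits" (2 for the even-offset motif "ts", 1 for the odd-offset motif "it") are also an assumption; with them the fast path reports offsets 4 and 14, which coincide with the two values on the slide.

```python
def build(motif_offsets):
    """motif_offsets: {string: (even_offset, odd_offset)} of its two motifs."""
    fast = bytearray(1 << 16)                      # direct table: one flag per 2-byte value
    slow = {}                                      # separate candidate list for every motif
    for s, offsets in motif_offsets.items():
        for off in offsets:
            key = (s[off] << 8) | s[off + 1]
            fast[key] = 1
            slow.setdefault(key, []).append((s, off))
    return fast, slow


def fast_path(fast, data):
    """Stateless table lookups at even offsets; returns offsets that need the slow path."""
    return [i for i in range(0, len(data) - 1, 2) if fast[(data[i] << 8) | data[i + 1]]]


def slow_path(slow, data, i):
    """Compare the original strings around a motif hit at offset i."""
    key = (data[i] << 8) | data[i + 1]
    return [(i - off, s) for s, off in slow[key]
            if i - off >= 0 and data.startswith(s, i - off)]


def scan(motif_offsets, data):
    fast, slow = build(motif_offsets)
    return sorted(hit for i in fast_path(fast, data) for hit in slow_path(slow, data, i))


motifs = {b"bits": (2, 1)}
print(fast_path(build(motifs)[0], b"habits of rabbits"))
print(scan(motifs, b"habits of rabbits"))
```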
19. Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• P – Probability of motif-match
• l – length of the longest string
Match Complexities:
- Aho-Corasick: O(n + S)
- Bouma2: O(n · (0.5 + P · (l − 2)))
20. Benchmark
- Performed against the Snort implementation of Aho-Corasick
- Tested with 1 GB of genuine IP traffic recorded at an ISP site
- Database included 4,841 unique strings extracted from Snort rules, each 3 bytes long or longer
- Aggregate size of database strings: 98,546 bytes
- Tested using Snort source code merged with Bouma2 on an Intel Core2 Duo 2.53 GHz with 1.95 GB RAM running Windows XP SP3
- Profiled with Visual Studio 2010 Sampling Profiler
- For Bouma2, three different motif-selection methods were compared:
B2-M (Minimum): Minimum number of motifs
B2-RS (Rare in Strings): Prefer motifs that occur fewer times within the database strings
B2-RI (Rare in Input): Prefer motifs that are expected to occur fewer times in the input (based on statistics over one third of the input)
21. Benchmark – Bouma2 vs. Snort AC (Throughput)
[Chart: throughput (Mbit/sec, axis 0 to 3,500) versus total string size (bytes, axis 0 to 100,000) for AC, B2-M, B2-RS and B2-RI]
22. Benchmark – Bouma2 vs. Snort AC (Memory)
- Snort creates several AC instances, which are pre-filtered by port
- The comparison was done against a single Bouma2 instance
[Chart: memory consumption (bytes, axis 0 to 50,000,000) versus total string size (bytes, axis 0 to 100,000) for AC, B2-M, B2-RS and B2-RI]
24. Q1: How to select motifs?
Candidate 2-symbol substrings: bo co do id or re ri rr
Even offset:
bo|re: bo, re
co|re: co, re
co|rr|id|or: co, rr, id, or
Odd offset:
b|or|e: or
c|or|e: or
c|or|ri|do|r: or, ri, do
• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
25. Q1: How to select motifs?
Candidate 2-symbol substrings: bo co do id or re ri rr
Even offset:
bo|re: bo Χ, re √
co|re: co Χ, re √
co|rr|id|or: co Χ, rr Χ, id Χ, or √
Odd offset:
b|or|e: or √
c|or|e: or √
c|or|ri|do|r: or √, ri Χ, do Χ
• But... maybe the minimum subset is not the optimal subset?
26. Q1: How to select motifs?
Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English
Candidate 2-symbol substrings: at ea er he te th
Even offset: th|ea|te|r: th √, ea Χ, te Χ
Odd offset: t|he|at|er: he √, at Χ, er Χ
Input: “The good, the bad and the ugly” in theaters nearby
Each occurrence of ‘the’ triggers a full-match attempt for “theater” (No match); only the one in “theaters” is a Match
27. Q1: How to select motifs?
2-Symbol Sequence Occurrence Probability
bo 0.0002
re 0.001861
co 0.001028
rr 0.000031
id 0.001756
or 0.000444
ri 0.000284
do 0.000151
• Use input-specific occurrence statistics to
optimize motif-sets
• REALISTIC...
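One way such statistics might be gathered is to count every 2-symbol sequence in a sample of the expected input. This is a hedged sketch; the function name and the tiny sample are illustrative, and the figures in the table above come from the talk, not from this code.

```python
from collections import Counter


def pair_probabilities(sample):
    """Estimate the occurrence probability of every 2-symbol sequence in sample."""
    counts = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


probs = pair_probabilities("the theater near the river")
for pair, p in sorted(probs.items(), key=lambda item: -item[1])[:3]:
    print(repr(pair), round(p, 3))
```

A motif with a low estimated probability triggers fewer full-match attempts on similar input, which is why the selection on the next slides prefers it.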
28. Q1: How to select motifs?
Candidate 2-symbol substrings: bo co do id or re ri rr
Even offset:
bo|re: bo √, re Χ
co|re: co √, re Χ
co|rr|id|or: co √, rr Χ, id Χ, or √
Odd offset:
b|or|e: or √
c|or|e: or √
c|or|ri|do|r: or √, ri Χ, do Χ
• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
30. Motif Selection as an ILP Problem
• $L$: a given string-set
• $T_L$: all 2-symbol substrings of strings in $L$
• $c(t)$: cost-function for every $t$ in $T_L$
Minimize $\sum_{t \in T_L} c(t)\, x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$
Subject to, for every $w \in L$:
$\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w,t) \ge 1$, and $\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w,t) \ge 1$
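For a set as small as the running example, the program above can be solved by exhaustive enumeration, which makes the effect of the cost function easy to see. The sketch below is not an ILP solver and would not scale; it assumes that assoc0 and assoc1 mean "t occurs in w at an even offset" and "at an odd offset", and all names are illustrative. The probabilities are the ones listed on the occurrence-probability slide.

```python
from itertools import combinations


def parity_motifs(s, parity):
    """2-symbol substrings of s that start at offsets of the given parity."""
    return {s[i:i + 2] for i in range(parity, len(s) - 1, 2)}


def select_motifs(strings, cost):
    # One covering constraint per string and per parity.
    constraints = [parity_motifs(w, p) for w in strings for p in (0, 1)]
    candidates = sorted(set().union(*constraints))
    best, best_cost = None, float("inf")
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            chosen = set(subset)
            if all(chosen & c for c in constraints):
                total = sum(cost(t) for t in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best


PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}
strings = ["bore", "core", "corridor"]
print(sorted(select_motifs(strings, lambda t: 1)))   # unit cost: minimum subset
print(sorted(select_motifs(strings, PROB.get)))      # cost = occurrence probability
```

With unit costs the result is the two-motif set {or, re}; with the probabilities as costs it becomes {bo, co, or}, which has one more motif but a lower total probability (0.001672 against 0.002305).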
31. Q2: How to resolve collisions?
[Figure: bore, core and corridor aligned on relative offsets -6 to 6 so that their ‘or’ motif sits at offset 0; corridor appears twice, once for each of its ‘or’ occurrences]
• A2:
- New structure: The Mangled-Trie
- Examine adjacent symbols at relative offsets to eliminate strings
- The Mangled-Trie itself dictates where to look next (instead of following a strict left-to-right sequence)
32. The Mangled-Trie
‘or’ Motif at Offset 0
Node 1, Resolve: Offset -1
- ‘b’: ‘e’ in Offset 2? YES: “bore” in Offset -1; NO: NO MATCH
- ‘c’: go to Node 2
- ‘d’: “corri” in Offset -6? YES: “corridor” in Offset -6; NO: NO MATCH
- OTHER: NO MATCH
Node 2, Resolve: Offset 2
- ‘e’: “core” in Offset -1
- ‘r’: go to Node 3
- OTHER: NO MATCH
Node 3: “idor” in Offset 3? YES: “corridor” in Offset -1; NO: NO MATCH
Example: the input “...corricorridor...” (relative offsets -6 to 6) is traced through Nodes 1, 2 and 3
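The decision structure on this slide can be written down as nested data plus a small resolver. The tuple encoding (`probe`, `check`, `match`) and the function names are my own assumptions; only the offsets, symbols and outcomes are transcribed from the slide.

```python
def probe(rel, branches):
    """Examine one symbol at a relative offset; symbols not listed mean NO MATCH."""
    return ("probe", rel, branches)


def check(rel, expect, string, start):
    """Compare a substring at a relative offset; on success report string at offset start."""
    return ("check", rel, expect, string, start)


# Structure for the 'or' motif found at relative offset 0.
OR_TRIE = probe(-1, {
    "b": check(2, "e", "bore", -1),
    "d": check(-6, "corri", "corridor", -6),
    "c": probe(2, {
        "e": ("match", "core", -1),
        "r": check(3, "idor", "corridor", -1),
    }),
})


def resolve(node, text, i):
    """Resolve a motif occurrence at input offset i; return (start, string) or None."""
    while node is not None:
        kind = node[0]
        if kind == "match":
            return (i + node[2], node[1])
        if kind == "check":
            _, rel, expect, string, start = node
            pos = i + rel
            ok = pos >= 0 and text.startswith(expect, pos)
            return (i + start, string) if ok else None
        _, rel, branches = node                    # the node itself says where to look next
        pos = i + rel
        symbol = text[pos] if 0 <= pos < len(text) else None
        node = branches.get(symbol)
    return None


text = "corricorridor"
print([resolve(OR_TRIE, text, i) for i in (1, 6, 11)])
```

In "corricorridor" the motif occurs at offsets 1, 6 and 11. The first occurrence is rejected at the "idor" comparison, while the second and third both resolve to the same occurrence of "corridor" at offset 5, reached through the 'c' and 'd' branches respectively.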
33. Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios: Apply statistics
to Mangled-Trie construction
- Improve Motif-Set Quality: Avoid slow-path
altogether when possible
36. Hash Functions
What is a Hash-Function?
“A hash function is any algorithm or subroutine that maps large data sets of
variable length, called keys, to smaller data sets of a fixed length. ...
The values returned by a hash function are called hash values, hash
codes, hash sums, checksums or simply hashes. ”
What input should we
expect?
What is a GOOD (non-cryptographic) Hash-Function?
“A good hash function should map the expected inputs as evenly as possible
over its output range. That is, every hash value in the output range should be
generated with roughly the same probability. ”
37. Bouma2 defines a hash-function:
- A tailored, optimized mapping of
strings to their own substrings.
- Collision-resolving is also optimized,
based on relative offset information
38. The Multiple Exact String-Match Problem
“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”
FACT: The definition of the problem DOES
NOT imply that we must scan the input from
left to right, or in any other order.
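A small experiment makes the point concrete: a stateless motif lookup returns the same matches whatever order the input positions are visited in. The sketch uses each string's 2-symbol prefix as its motif, which is an assumption made for brevity, and the names are illustrative.

```python
import random


def build_table(strings):
    table = {}
    for s in strings:
        table.setdefault(s[:2], []).append(s)      # assumption: motif = 2-symbol prefix
    return table


def match_in_order(table, text, order):
    hits = set()
    for i in order:                                # no state is carried between positions
        for s in table.get(text[i:i + 2], []):
            if text.startswith(s, i):
                hits.add((i, s))
    return hits


text = "rabbits hate cooks"
table = build_table(["bits", "cooks", "boat", "book"])
forward = list(range(len(text) - 1))
shuffled = forward[:]
random.Random(0).shuffle(shuffled)
print(sorted(match_in_order(table, text, shuffled)))
print(match_in_order(table, text, forward) == match_in_order(table, text, shuffled))
```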
39. The Multiple Exact String-Match Problem
“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”
CLAIM: Algorithms that impose a
consume-order constraint are in general
less efficient than algorithms that are
free of this constraint.
40. The Multiple Exact String-Match Problem
“Given a string-set L ⊆ Σ∗ and an input stream WI ∈ Σ∗, find
all occurrences of any of the strings in L that appear in WI”
Which dominant factor should we choose when designing an efficient string-match algorithm?...
[Figure: the labels Naïve Approach, Aho-Corasick and Bouma2 alongside the values 5000, 1500 and 15]