2. Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
4. The Multiple String-Match Problem
• Goal: Given a set of strings and an input text, find all occurrences of any of the strings in the text
• Input: Set of strings L and input text M
• Output: Offsets 1 ≤ i ≤ |M| where a substring of M matches any of the strings in L
• Uses: AV, IPS, DPI, DNA Search, etc.
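The problem statement above can be pinned down with a naive baseline. This is only a sketch for reference, not anything from the talk: the function name `find_all` is made up here, and offsets are reported 1-based to follow the "1 ≤ i ≤ |M|" convention of the slide.

```python
def find_all(strings, text):
    """Return sorted (offset, string) pairs; offsets are 1-based, as on the slide."""
    hits = []
    for s in strings:
        if not s:
            continue  # an empty string would match everywhere; skip it
        start = text.find(s)
        while start != -1:
            hits.append((start + 1, s))
            start = text.find(s, start + 1)  # also finds overlapping occurrences
    return sorted(hits)


L = ["bits", "cooks", "boat"]
M = "rabbitshatecooks"
print(find_all(L, M))  # [(4, 'bits'), (12, 'cooks')]
```

Each string triggers its own scan of the text, so the cost grows with the size of the string set; the approaches on the following slides aim to avoid exactly that.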
7. Stateful Approach (e.g. Aho-Corasick)
• One state transition per symbol
• Linear in the length of the input
• Large automatons cause cache-misses and degrade performance
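For orientation, here is a compact textbook-style Aho-Corasick sketch. It is not the implementation benchmarked in the talk, and the names `build_automaton` and `search` are illustrative. This variant follows failure links, so the work per symbol is constant only in the amortized sense; a fully expanded transition table gives exactly one transition per symbol, at the price of the large automaton mentioned above.

```python
from collections import deque


def build_automaton(strings):
    goto = [{}]   # goto[state][symbol] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # strings recognised when entering each state
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:                     # breadth-first: shallower states are final first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(strings, text):
    """Return (0-based start offset, string) for every occurrence."""
    goto, fail, out = build_automaton(strings)
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            hits.append((i - len(s) + 1, s))
    return hits


print(sorted(search(["bore", "core", "or"], "score")))  # [(1, 'core'), (2, 'or')]
```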
9. Guidelines
• INTUITIVE: Search for ‘Hints’ of a Match Before the Full Match
• REALISTIC: Use Prior Knowledge of Expected Input
• SIMPLE: Trivial Match Process
10. Bouma2: Motif-Based String Match
Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
Set of selected 2-symbol-long substrings: re, ek, bi, at, ok, or
• Preprocessing: Map every string to one of its own substrings: its Motif
Q1: How to select motifs?
11. Bouma2: Motif-Based String Match (cont.)
“r a b b i t s h a t e c o o k s”
No match: boat, book
Match: bits, cooks
• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences
Q2: How to resolve collisions?
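The match step described above can be sketched as follows. The string-to-motif mapping is an assumed reading of the slide-10 figure (bits to "bi", boat to "at", book and cooks to "ok"), the function names are illustrative, and this first version looks at every 2-symbol window rather than using any stride optimisation.

```python
def build_motif_map(string_to_motif):
    """motif -> list of (string, offset of the motif inside that string)."""
    table = {}
    for s, motif in string_to_motif.items():
        table.setdefault(motif, []).append((s, s.index(motif)))
    return table


def motif_search(string_to_motif, text):
    """Return (0-based start offset, string) for every occurrence."""
    table = build_motif_map(string_to_motif)
    hits = []
    for pos in range(len(text) - 1):
        # Stateless: each 2-symbol window is looked up on its own.
        for s, off in table.get(text[pos:pos + 2], ()):
            start = pos - off  # where the full string would have to begin
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return hits


# Assumed mapping, read off the figure; not confirmed by the slides.
MAPPING = {"bits": "bi", "boat": "at", "book": "ok", "cooks": "ok"}
print(motif_search(MAPPING, "rabbitshatecooks"))  # [(3, 'bits'), (11, 'cooks')]
```

In this run the "at" in "hate" and the "ok" in "cooks" each trigger a failed comparison (boat, book), which is the no-match case of the figure. Because "book" and "cooks" share the motif "ok", one motif occurrence has to be checked against several strings, which is the collision raised in Q2.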
12. Capturing all Occurrences
“h a b i t s o f r a b b i t s”
Match: bits (both occurrences)
• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
13. Upgrade #1: 2-Symbol Strides
“h a b i t s o f r a b b i t s”
Match: bits (both occurrences)
• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
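A minimal sketch of why the double mapping allows a single pass in 2-symbol strides. The policy of always taking the motifs at string offsets 0 and 1 is an assumption made to keep the example short; the talk selects motifs more carefully, as discussed under Q1.

```python
def build_table(strings):
    """Map each string to one even-offset and one odd-offset motif."""
    table = {}
    for s in strings:
        if len(s) < 3:
            raise ValueError("this sketch assumes strings of length >= 3")
        for off in (0, 1):  # hypothetical policy: offsets 0 (even) and 1 (odd)
            table.setdefault(s[off:off + 2], []).append((s, off))
    return table


def stride2_search(strings, text):
    """Return (0-based start offset, string), reading the text two symbols at a time."""
    table = build_table(strings)
    hits = []
    for pos in range(0, len(text) - 1, 2):  # even text offsets only
        for s, off in table.get(text[pos:pos + 2], ()):
            start = pos - off
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return hits


print(stride2_search(["bits"], "habitsofrabbits"))  # [(2, 'bits'), (11, 'bits')]
```

An occurrence that starts at an even text offset exposes its even-offset motif to the scan, and one that starts at an odd text offset exposes its odd-offset motif, so half of the window positions are enough to catch both.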
14. Upgrade #2: Fast-Path / Slow-Path
“h a b i t s o f r a b b i t s”
Offsets marked in the figure: 4, 14
• Fast-Path:
- Stateless
- “Monolithic” (zero branches)
- Cache-Aware (small direct-table)
- SIMPLE…
15. Upgrade #2: Fast-Path / Slow-Path
“h a b i t s o f r a b b i t s”
Offsets marked in the figure: 4, 14
Match: bits (both occurrences)
• Slow-Path:
- Memory-Efficient (pointers to original strings for comparison)
- “Localized” (separate structure for every motif)
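The fast-path / slow-path split could look roughly like this. It is a sketch under several assumptions: symbols are bytes, the direct table is a 65,536-entry byte array indexed by the 16-bit value of a symbol pair, and the per-motif structure is a plain list of (string, offset) entries rather than the Mangled-Trie introduced later. Python cannot demonstrate the branch-free or cache-aware properties; only the division of labour is shown.

```python
def build(entries):
    """entries: (string as bytes, offset of its motif) pairs; an assumed input format."""
    fast = bytearray(1 << 16)  # small direct table: 1 if the pair is a motif
    slow = {}                  # separate candidate list per motif
    for s, off in entries:
        key = (s[off] << 8) | s[off + 1]
        fast[key] = 1
        slow.setdefault(key, []).append((s, off))
    return fast, slow


def fast_path(fast, text):
    """One table lookup per 2-symbol stride; returns the offsets worth a closer look."""
    return [pos for pos in range(0, len(text) - 1, 2)
            if fast[(text[pos] << 8) | text[pos + 1]]]


def slow_path(slow, text, candidates):
    """Compare the original strings around each flagged offset."""
    hits = []
    for pos in candidates:
        key = (text[pos] << 8) | text[pos + 1]
        for s, off in slow[key]:
            start = pos - off
            if start >= 0 and text.startswith(s, start):
                hits.append((start, s))
    return hits


fast, slow = build([(b"bits", 0), (b"bits", 1)])
text = b"habitsofrabbits"
flagged = fast_path(fast, text)
print(flagged)                         # [2, 12]
print(slow_path(slow, text, flagged))  # [(2, b'bits'), (11, b'bits')]
```

The flagged pairs start at 0-based offsets 2 and 12 and therefore end at positions 4 and 14, which may be what the numbers in the figure mark.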
17. Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• m – no. of motif-matches in n
• l – length of the longest string
• Match Complexities:
- Aho-Corasick: O(n + S)
- Bouma2: O(n/2 + m·l)
18. Bouma2 vs. Aho-Corasick (Speed)
[Chart: Bouma2 Fast-Path and Bouma2 Slow-Path (Sub-Optimal) compared with Aho-Corasick]
• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick
• Fast-path alone is 10 times faster
Q3: How to optimize slow-path?
19. Bouma2 vs. Aho-Corasick (Cache)
[Chart: Bouma2 cache-misses vs. Aho-Corasick cache-misses]
• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)
20. Bouma2 vs. Aho-Corasick (Memory)
[Chart: memory footprint of Bouma2 (Fast-Path, Slow-Path, Original Strings) vs. Aho-Corasick]
• Bouma2 footprint is less than 70% of Aho-Corasick for textual search (down to 35% in other cases)
22. Q1: How to select motifs?
Offset  String         bo  co  do  id  or  re  ri  rr
Even    bo re          •                   •
Even    co re              •               •
Even    co rr id or        •       •   •           •
Odd     b or e                         •
Odd     c or e                         •
Odd     c or ri do r           •       •       •
• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
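For a string set as small as the one in the table, the minimum covering subset can be found by plain enumeration. The sketch below does that; it is exponential in the number of candidate substrings and is meant only to make the covering requirement concrete, not to suggest how the talk's preprocessing works. All names are illustrative.

```python
from itertools import combinations


def substrings_at(s, parity):
    """2-symbol substrings of s that start at offsets of the given parity (0 or 1)."""
    return {s[i:i + 2] for i in range(parity, len(s) - 1, 2)}


def minimum_motif_set(strings):
    """Smallest set of 2-symbol substrings covering every string at even AND odd offsets."""
    requirements = [substrings_at(s, p) for s in strings for p in (0, 1)]
    candidates = sorted(set().union(*requirements))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(req & chosen for req in requirements):
                return chosen
    return None  # e.g. a string shorter than 3 symbols has no odd-offset substring


print(sorted(minimum_motif_set(["bore", "core", "corridor"])))  # ['or', 're']
```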
23. Q1: How to select motifs?
Offset  String         bo  co  do  id  or  re  ri  rr
Even    bo re          Χ                   √
Even    co re              Χ               √
Even    co rr id or        Χ       Χ   √           Χ
Odd     b or e                         √
Odd     c or e                         √
Odd     c or ri do r           Χ       √       Χ
• But… maybe the minimum subset is not the optimal subset?
24. Q1: How to select motifs?
• Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English
Offset  String      at  ea  er  he  te  th
Even    th ea te r      Χ           Χ   √
Odd     t he at er  Χ       Χ   √
“The good, the bad and the ugly” in theaters nearby
[Figure: “theater” is tested against the input wherever a selected motif occurs; the attempts at the occurrences of “the” end in No match, the one at “theaters” in Match]
25. Q1: How to select motifs?
2-Symbol Sequence Occurrence Probability
bo 0.0002
re 0.001861
co 0.001028
rr 0.000031
id 0.001756
or 0.000444
ri 0.000284
do 0.000151
• Use input-specific occurrence statistics to optimize motif-sets
• REALISTIC…
26. Q1: How to select motifs?
Offset  String         bo  co  do  id  or  re  ri  rr
Even    bo re          √                   Χ
Even    co re              √               Χ
Even    co rr id or        √       Χ   √           Χ
Odd     b or e                         √
Odd     c or e                         √
Odd     c or ri do r           Χ       √       Χ
• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
28. Motif Selection as an ILP Problem
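• L: a given string-set
• T_L: all 2-symbol substrings of strings in L
• c(t): cost-function for every t in T_L
Minimize
\[ \sum_{t \in T_L} c(t)\, x_t , \quad \text{where } x_t \in \{0,1\} \text{ for every } t \in T_L \]
Subject to: for every w ∈ L
\[ \sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w,t) \ge 1 , \quad \text{and} \quad \sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w,t) \ge 1 \]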
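To make the formulation concrete, the sketch below solves it by enumerating every 0/1 assignment, which is feasible only for a handful of candidate substrings; a real preprocessing step would presumably hand the problem to an ILP solver. Two assumptions are made here: the 0/1 indicator in each constraint is taken to mean "t occurs in w at an even (0) or odd (1) offset", and c(t) is set to the occurrence probabilities listed on the earlier slide.

```python
from itertools import product


def assoc(w, t, parity):
    """1 if t occurs in w at an offset of the given parity (0 even, 1 odd), else 0."""
    return int(any(w[i:i + 2] == t for i in range(parity, len(w) - 1, 2)))


def solve_motif_ilp(strings, cost):
    """Exhaustively minimise sum(c(t) * x_t) subject to the two covering constraints."""
    motifs = sorted(cost)
    best, best_cost = None, float("inf")
    for x in product((0, 1), repeat=len(motifs)):
        feasible = all(
            sum(xt * assoc(w, t, p) for xt, t in zip(x, motifs)) >= 1
            for w in strings for p in (0, 1)
        )
        total = sum(xt * cost[t] for xt, t in zip(x, motifs))
        if feasible and total < best_cost:
            best = {t for xt, t in zip(x, motifs) if xt}
            best_cost = total
    return best, best_cost


# Assumed cost function: the occurrence probabilities from the statistics slide.
COST = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}
chosen, total = solve_motif_ilp(["bore", "core", "corridor"], COST)
print(sorted(chosen), round(total, 6))  # ['bo', 'co', 'or'] 0.001672
```

With these costs the optimum is {bo, co, or}, one motif more than the minimum subset {or, re} but with a lower total occurrence probability (0.001672 against 0.002305).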
29. Q2: How to resolve collisions?
Offset    -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6
bore                      b  o  r  e
core                      c  o  r  e
corridor                  c  o  r  r  i  d  o  r
corridor   c  o  r  r  i  d  o  r
• A2:
- Examine adjacent symbols at relative offsets to eliminate strings
- New structure: The Mangled-Trie
30. The Mangled-Trie
‘or’ Motif at Offset 0
(1) Resolve: Offset -1
    ‘b’: ‘e’ in Offset 2?
        YES: “bore” in Offset -1
        NO: NO MATCH
    ‘d’: “corri” in Offset -6?
        YES: “corridor” in Offset -6
        NO: NO MATCH
    ‘c’: (2) Resolve: Offset 2
        ‘e’: “core” in Offset -1
        ‘r’: (3) “idor” in Offset 3?
            YES: “corridor” in Offset -1
            NO: NO MATCH
        OTHER: NO MATCH
    OTHER: NO MATCH
Candidate strings: bore, core, corridor, corridor
Example input: ...corricorridor... (steps 1, 2, 3)
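The decision structure for the ‘or’ motif can be written down as nested data and walked with a few lines of code. The node encoding below ("resolve", "verify", "accept" tuples) and the function names are invented for this sketch; the slides do not describe how a Mangled-Trie is actually stored.

```python
# Hypothetical node encoding:
#   ("resolve", rel_offset, {symbol: node})         branch on the symbol at rel_offset
#   ("verify", fragment, rel_offset, string, start)  compare the remaining fragment
#   ("accept", string, start)                        every symbol is already checked
OR_TRIE = ("resolve", -1, {
    "b": ("verify", "e", 2, "bore", -1),
    "d": ("verify", "corri", -6, "corridor", -6),
    "c": ("resolve", 2, {
        "e": ("accept", "core", -1),
        "r": ("verify", "idor", 3, "corridor", -1),
    }),
})


def symbol_at(text, i):
    return text[i] if 0 <= i < len(text) else None


def resolve(node, text, pos):
    """pos is the text offset of the motif's first symbol (relative offset 0)."""
    while node is not None:
        kind = node[0]
        if kind == "resolve":
            _, rel, children = node
            node = children.get(symbol_at(text, pos + rel))  # OTHER -> None
        elif kind == "verify":
            _, fragment, rel, string, start = node
            begin = pos + rel
            if begin >= 0 and text.startswith(fragment, begin):
                return (pos + start, string)
            return None
        else:  # "accept"
            _, string, start = node
            return (pos + start, string)
    return None  # NO MATCH


text = "corricorridor"
for pos in range(len(text) - 1):
    if text[pos:pos + 2] == "or":
        print(pos, resolve(OR_TRIE, text, pos))
```

On this input the first ‘or’ (offset 1) fails at the “idor” check, while the ‘or’ at offsets 6 and 11 both resolve to “corridor” starting at offset 5, through the ‘c’ and ‘d’ branches respectively.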
32. Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction
- Improve Motif-Set Quality: Avoid slow-path altogether when possible
34. “Search has always been about
people. It's not an abstract thing.
It's not a formula. It's about getting
people what they need... It depends
on the type of search you do—and
how to take all those signals and
put them together.”
- Udi Manber, Google, 2008