5. REGEX ENGINE TYPES
Engine type       Programs
DFA               awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA   GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed (most versions), vi
POSIX NFA         mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested)
Hybrid NFA/DFA    GNU awk, GNU grep / egrep, Tcl
6. REGEX ENGINE TYPES IN PHP
PHP text-processing extensions:
◇ PCRE: Regular Expressions (Perl-Compatible)
◇ POSIX Regex: Regular Expression (POSIX Extended).
Deprecated as of PHP 5.3; removed in PHP 7.0
7. TESTING THE ENGINE TYPES
Traditional NFA or not?
Apply 「nfa|nfa not」 to “nfa not”:
◇ only “nfa” matches: Traditional NFA
◇ the whole “nfa not” matches: DFA or POSIX NFA
8. TESTING THE ENGINE TYPES
DFA or POSIX NFA?
Apply 「X(.+)+X」 to “=XX=====================”:
◇ the failure takes a long time: POSIX NFA
◇ the failure is immediate: DFA
Either way: no match!
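Both probes can be tried against the POSIX regex functions of the C library. The following is a minimal sketch, assuming a POSIX system that provides regex.h; the helper name posix_match_length is made up for this example. It only reports what matched: how long the second probe takes has to be timed separately.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Runs a POSIX extended regex against subject with regcomp()/regexec().
 * Returns the length of the overall match, or -1 if there is no match
 * or the pattern does not compile. */
int posix_match_length(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    int length = -1;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 1, m, 0) == 0)
        length = (int)(m[0].rm_eo - m[0].rm_so);   /* end minus start */
    regfree(&re);
    return length;
}
```

A POSIX-conforming implementation should report a length of 7 for 「nfa|nfa not」 against “nfa not” (the whole subject), and -1 for 「X(.+)+X」 against the string of equals signs.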
9. TWO ALL-ENCOMPASSING RULES
1. The match that begins earliest
(leftmost) wins
2. The standard quantifiers are
greedy (「*」,「+」,「?」,「{m,n}」)
10. THE MATCH THAT BEGINS EARLIEST WINS
◇ “a match” instead of “the match”
◇ attempt to match the beginning
of the string
◇ if all permutations are exhausted
without a match, retry from the next
character
11. THE MATCH THAT BEGINS EARLIEST WINS
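「cat」 “The dragging belly indicates
your cat is too fat”
→ matches the “cat” inside “indicates”, not the word “cat”
「fat|cat|belly|your」 “The dragging belly indicates
your cat is too fat”
→ matches “belly”, which starts earliest, even though 「fat」 is listed first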
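The "leftmost wins" rule comes from the driver loop around the matcher: every alternative is tried at position 0, then at position 1, and so on. Below is a minimal sketch for an alternation of plain literal strings; find_leftmost is a hypothetical helper written for this example, not part of any regex library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tries each alternative, in the order written, at every start position
 * from left to right. Returns the start offset of the first success and
 * stores the index of the winning alternative in *which; returns -1 when
 * no position yields a match. */
long find_leftmost(const char *subject, const char *const alts[],
                   size_t n_alts, size_t *which)
{
    size_t len, start, i;

    assert(subject != NULL && alts != NULL && which != NULL);
    len = strlen(subject);
    for (start = 0; start <= len; start++) {      /* bump along the subject */
        for (i = 0; i < n_alts; i++) {            /* ordered alternation */
            size_t alt_len = strlen(alts[i]);
            if (alt_len <= len - start &&
                strncmp(subject + start, alts[i], alt_len) == 0) {
                *which = i;
                return (long)start;
            }
        }
    }
    return -1;
}
```

With the four alternatives fat, cat, belly, your and the sentence from the slide, the function returns offset 13 and index 2 (“belly”): the order of the alternatives only matters among candidates that start at the same position.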
12. THE STANDARD QUANTIFIERS ARE GREEDY
◇ every quantifier has a minimum
number of matches that are required
before it can be considered successful
◇ and a maximum number that it will
ever attempt to match
13. THE STANDARD QUANTIFIERS ARE GREEDY
「.*(\d+)」 “Copyright - 05 March 2016” → group 1 is “6”
「.*(\d*)」 “Copyright - 05 March 2016” → group 1 is empty
「\d+(?!.*\d)」 “Copyright - 05 March 2016” → matches “2016”
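Why does 「.*(\d+)」 capture only the last digit? The sketch below is a hand simulation of that single pattern, not a regex engine. It assumes the subject has no newline (so 「.」 matches every character) and uses a made-up function name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hand simulation of 「.*(\d+)」 anchored at the start of subject.
 * Returns the offset of the text captured by group 1, or -1 if the
 * subject contains no digit. The capture is always one character long. */
long dot_star_digit_capture(const char *subject)
{
    size_t pos;

    assert(subject != NULL);
    /* 「.*」 is greedy: it first takes the whole subject, then gives back
     * one character at a time until 「\d+」 can match what was released. */
    for (pos = strlen(subject); pos-- > 0; ) {
        if (isdigit((unsigned char)subject[pos])) {
            /* Everything to the right of pos was already rejected, so
             * 「\d+」 gets exactly this one digit and the match ends. */
            return (long)pos;
        }
    }
    return -1;
}
```

For “Copyright - 05 March 2016” the function returns 24, the offset of the final “6”: 「.*」 gives back only as much as the rest of the regex strictly needs.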
17. NFA VS DFA
               DFA                      NFA
Time           Fast                     Slow
Space          Less                     More
Type           Deterministic            Non Deterministic
Result         Consistent               Unpredictable
Backtracking   ✗                        ✓
Construction   DFA ⊂ NFA                NFA ⊃ DFA
Pre-compile    Slower and more memory   Faster and less memory
Then?          Is boring                Is funny
18. BACKTRACKING
◇ Consider each subexpression or
component in turn
◇ If it has to decide between two (or more)
equally viable options:
○ it selects one
○ it remembers the others
◇ If the selected option is successful (and the
rest of the regex is also successful)
○ the match is finished
◇ Otherwise it backtracks to where it made the
choice and tries the next remembered option
19. TWO IMPORTANT POINTS ON BACKTRACKING
◇ When faced with multiple choices, which
should be tried first?
For greedy quantifiers the engine first makes
the attempt; for lazy ones it first skips it.
◇ When forced to backtrack, which saved
choice should the engine use?
The most recently saved option is the one
used (LIFO: Last In First Out)
20. SAVED STATES
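◇ A match without backtracking: 「ab?c」 against “abc”
○ 「a」 matches “a” (no saved states)
○ 「b?」 is greedy, so 「b」 is attempted first; the option of skipping it is saved (1 saved state)
○ 「b」 matches “b”
○ 「c」 matches “c”: the match is complete and the saved state is never used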
21. SAVED STATES
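◇ A match with backtracking: 「ab?c」 against “ac”
○ 「a」 matches “a” (no saved states)
○ 「b?」 is greedy, so 「b」 is attempted first; the option of skipping it is saved (1 saved state)
○ ✗ 「b」 fails against “c”
○ the engine backtracks to the saved state and skips 「b」 (no saved states left)
○ 「c」 matches “c”: the match is complete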
22. SAVED STATES
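◇ A lazy match with backtracking: 「ab??c」 against “abc”
○ 「a」 matches “a” (no saved states)
○ 「b??」 is lazy, so 「b」 is skipped first; the option of attempting it is saved (1 saved state)
○ ✗ 「c」 fails against “b”
○ the engine backtracks to the saved state and attempts 「b」 (no saved states left)
○ 「b」 matches “b”
○ 「c」 matches “c”: the match is complete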
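The three traces can be reproduced with a very small backtracking matcher. This is a sketch only: it supports literal characters optionally followed by 「?」 or 「??」, matches at the start of the text, and all names are hypothetical. The C call stack plays the role of the saved states, so the most recently saved one is always tried first (LIFO).

```c
#include <assert.h>
#include <stddef.h>

/* Matches pat at the start of text. Each literal may be followed by
 * ? (greedy) or ?? (lazy). *backtracks counts how often the engine had
 * to fall back on a saved state. Returns 1 on success, 0 on failure. */
int match_here(const char *pat, const char *text, int *backtracks)
{
    if (*pat == '\0')
        return 1;                          /* end of the regex: success */
    if (pat[1] == '?') {
        int lazy = (pat[2] == '?');
        const char *rest = pat + (lazy ? 3 : 2);
        int can_take = (*text != '\0' && *text == *pat);

        if (!lazy) {
            /* Greedy: attempt the item first; "skip it" is the saved state. */
            if (can_take && match_here(rest, text + 1, backtracks))
                return 1;
            (*backtracks)++;               /* fall back: skip the item */
            return match_here(rest, text, backtracks);
        }
        /* Lazy: skip the item first; "attempt it" is the saved state. */
        if (match_here(rest, text, backtracks))
            return 1;
        (*backtracks)++;                   /* fall back: attempt the item */
        return can_take && match_here(rest, text + 1, backtracks);
    }
    if (*text != '\0' && *text == *pat)
        return match_here(pat + 1, text + 1, backtracks);
    return 0;
}

/* Convenience wrapper that resets the counter before matching. */
int match_at_start(const char *pat, const char *text, int *backtracks)
{
    assert(pat != NULL && text != NULL && backtracks != NULL);
    *backtracks = 0;
    return match_here(pat, text, backtracks);
}
```

Calling match_at_start with 「ab?c」 on “abc” reports 0 backtracks, while 「ab?c」 on “ac” and 「ab??c」 on “abc” each report 1, matching the single ✗ in the corresponding trace.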
23. POSIX NFA
◇ A POSIX NFA does not stop at the
first match it finds, but continues to
try any saved states that remain
◇ Each time it reaches the end of the
regex, it has another plausible
match
◇ Eventually, all options are exhausted
and the longest match found is reported
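The difference between "stop at the first match" and "keep going and remember the longest" can be sketched for an alternation of literal strings. The function below is illustrative only; its name and its keep_longest switch are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the match at the start of subject for an alternation of
 * literal strings. With keep_longest == 0 the first alternative that
 * matches wins (Traditional NFA); otherwise every alternative is tried
 * and the longest match is kept (POSIX NFA). Returns -1 on no match. */
long alternation_match_length(const char *subject, const char *const alts[],
                              size_t n_alts, int keep_longest)
{
    long best = -1;
    size_t i;

    assert(subject != NULL && alts != NULL);
    for (i = 0; i < n_alts; i++) {
        size_t alt_len = strlen(alts[i]);
        if (strncmp(subject, alts[i], alt_len) != 0)
            continue;                      /* this alternative fails */
        if (!keep_longest)
            return (long)alt_len;          /* first success ends the search */
        if ((long)alt_len > best)
            best = (long)alt_len;          /* remember it and keep trying */
    }
    return best;
}
```

For the alternatives nfa and nfa not against “nfa not”, the first mode returns 3 and the second returns 7, which is exactly the difference the 「nfa|nfa not」 probe relies on.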
24. PCRE2: PERL COMPATIBLE REGULAR EXPRESSION
The PCRE library is a set of functions that
implement regular expression pattern
matching using the same syntax and
semantics as Perl 5.
25. PCRE2: PERL COMPATIBLE REGULAR EXPRESSION
Why a Perl regex clone?
Perl regex is the de facto standard of
the web age.
26. PCRE2 vs PCRE
◇ The new API has no user-visible
C structures
◇ Function calls are the only means
of interacting with the library
◇ JIT compilation has been moved into a
separate function
◇ It contains no static or global variables
◇ It introduces the idea of contexts in
which PCRE2 functions are called
27. BASE PROCESS IS EASY
pcre2_compile() → pcre2_match() → results...
28. PCRE2_COMPILE()
re = pcre2_compile(
pattern, /* the pattern */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&errornumber, /* for error number */
&erroroffset, /* for error offset */
NULL); /* use default compile context */
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL); /* use default match context */
ovector = pcre2_get_ovector_pointer(match_data);
29. PCRE2_COMPILE STRUCTURE
typedef struct pcre2_real_code {
pcre2_memctl memctl; /* Memory control fields */
const uint8_t *tables; /* The character tables */
void *executable_jit; /* Pointer to JIT code */
uint8_t start_bitmap[32]; /* Bitmap for starting code unit < 256 */
CODE_BLOCKSIZE_TYPE blocksize; /* Total (bytes) that was malloc-ed */
uint32_t magic_number; /* Paranoid and endianness check */
uint32_t compile_options; /* Options passed to pcre2_compile() */
uint32_t overall_options; /* Options after processing the pattern */
uint32_t flags; /* Various state flags */
uint32_t limit_heap; /* Limit set in the pattern */
uint32_t limit_match; /* Limit set in the pattern */
uint32_t limit_depth; /* Limit set in the pattern */
uint32_t first_codeunit; /* Starting code unit */
uint32_t last_codeunit; /* This codeunit must be seen */
uint16_t bsr_convention; /* What \R matches */
uint16_t newline_convention; /* What is a newline? */
uint16_t max_lookbehind; /* Longest lookbehind (characters) */
uint16_t minlength; /* Minimum length of match */
uint16_t top_bracket; /* Highest numbered group */
uint16_t top_backref; /* Highest numbered back reference */
uint16_t name_entry_size; /* Size (code units) of table entries */
uint16_t name_count; /* Number of name entries in the table */
} pcre2_real_code;
30. PCRE2_MATCH()
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL); /* use default match context */
ovector = pcre2_get_ovector_pointer(match_data);
31. MATCH RESULT STRUCTURE
typedef struct pcre2_real_match_data {
pcre2_memctl memctl;
const pcre2_real_code *code; /* The pattern used for the match */
PCRE2_SPTR subject; /* The subject that was matched */
PCRE2_SPTR mark; /* Pointer to last mark */
PCRE2_SIZE leftchar; /* Offset to leftmost code unit */
PCRE2_SIZE rightchar; /* Offset to rightmost code unit */
PCRE2_SIZE startchar; /* Offset to starting code unit */
uint16_t matchedby; /* Type of match (normal, JIT, DFA) */
uint16_t oveccount; /* Number of pairs */
int rc; /* The return code from the match */
PCRE2_SIZE ovector[10000];/* The first field */
} pcre2_real_match_data;
32. OVECTOR
ovector = pcre2_get_ovector_pointer(match_data);
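The ovector is an array of offset pairs: ovector[2*n] is the start of group n and ovector[2*n+1] is the offset just past its end, with pair 0 describing the whole match. A group that did not take part in the match has both offsets set to PCRE2_UNSET, which is the all-ones value of the offset type. The sketch below reads such pairs without linking PCRE2: it uses size_t in place of PCRE2_SIZE, and copy_group is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies capture group `group` into buf as a NUL-terminated string.
 * ovector is the array obtained from pcre2_get_ovector_pointer() and rc
 * is the positive value returned by pcre2_match(). Returns the group
 * length, or -1 if the group is unavailable or buf is too small. */
long copy_group(const char *subject, const size_t *ovector, int rc,
                int group, char *buf, size_t buf_size)
{
    size_t start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (group < 0 || group >= rc)
        return -1;                     /* this pair was not filled in */
    start = ovector[2 * group];
    end = ovector[2 * group + 1];
    if (start == (size_t)-1 || end < start)
        return -1;                     /* unset group, or start after end */
    len = end - start;
    if (len + 1 > buf_size)
        return -1;                     /* no room for the terminator */
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (long)len;
}
```

As a usage example, a hypothetical pattern 「(nfa) not」 matched against “nfa not” would yield rc = 2 and the pairs {0, 7} and {0, 3}; copy_group then produces “nfa not” for group 0 and “nfa” for group 1.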
33. TRY IT YOURSELF!
docker pull delda/pcre2
docker run -it delda/pcre2 bash
delda/pcre2 is a Docker image based on Debian Jessie
with a checkout of the PCRE2 source code. The library is
built with UTF-8, debug, and JIT support enabled.
36. NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -d
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/
-----------------------------------------------
0 9 Bra
3 nfa
9 17 Alt
12 not nfa
26 26 Ket
29 End
-----------------------------------------------
Capturing subpattern count = 0
First code unit = 'n'
Last code unit = 'a'
Subject length lower bound = 3
data>
38. DFA EXAMPLE
data> nfa not
--->nfa not
+0 ^ n
+4 ^ n
+1 ^^ f
+5 ^^ f
+2 ^ ^ a
+6 ^ ^ a
+3 ^ ^ |
+7 ^ ^
+8 ^ ^ n
+9 ^ ^ o
+10 ^ ^ t
+11 ^ ^
0: nfa not
1: nfa
data> ^C