/regexp?/
The mechanics of Expression Processing
with some PCRE2 references
Meetup PUG
25 July 2017
#AperiTech
I AM DAVIDE DELL’ERBA
Research & Development @
INDEX
◇ Regex engine types
◇ Two all-encompassing rules
◇ NFA vs DFA
◇ Backtracking
◇ PCRE2
REGEX ENGINE TYPES
◇ DFA
◇ Traditional NFA
◇ POSIX NFA
◇ Hybrid NFA/DFA
REGEX ENGINE TYPES
Engine type: Programs
DFA: awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA: GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed (most versions), vi
POSIX NFA: mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested)
Hybrid NFA/DFA: GNU awk, GNU grep / egrep, Tcl
REGEX ENGINE TYPES IN PHP
Text processing
◇ PCRE: Regular Expressions (Perl-Compatible)
◇ POSIX Regex: Regular Expression (POSIX Extended); deprecated since PHP 5.3, removed in PHP 7.0
TESTING THE ENGINE TYPES
Traditional NFA or not?
Apply 「nfa|nfa not」 to “nfa not”:
◇ Traditional NFA: matches “nfa” (the first alternative that succeeds)
◇ DFA, POSIX NFA: match “nfa not” (the longest alternative)
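One way to run this test from C is the POSIX `regcomp`/`regexec` pair from `<regex.h>`. The sketch below assumes a POSIX-conforming C library; the helper name `match_length` is made up for this example.

```c
#include <regex.h>

/* Hypothetical helper: length of the overall match of an extended
   regular expression, or -1 if the pattern is invalid or does not match. */
int match_length(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);   /* end offset minus start offset */
    regfree(&re);
    return len;
}
```

With POSIX leftmost-longest semantics, `match_length("nfa|nfa not", "nfa not")` should be 7 (“nfa not”); a traditional NFA such as PCRE2 would stop after 3 characters (“nfa”).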
TESTING THE ENGINE TYPES
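DFA or POSIX NFA?
Apply 「X(.+)+X」 to “=XX=====================”:
◇ POSIX NFA: takes a very long time, because the nested quantifiers force it to try every combination before failing
◇ DFA: reports “No match!” almost immediately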
TWO ALL-ENCOMPASSING RULES
1. The match that begins earliest
(leftmost) wins
2. The standard quantifiers are
greedy (「*」,「+」,「?」,「{m,n}」)
THE MATCH THAT BEGINS EARLIEST WINS
◇ The engine looks for “a match”, not “the match”
◇ The first attempt starts at the beginning of the string
◇ If all permutations are exhausted without a match, it retries from the next character
THE MATCH THAT BEGINS EARLIEST WINS
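「cat」 “The dragging belly indicates your cat is too fat”
→ matches the “cat” inside “indicates”
「fat|cat|belly|your」 “The dragging belly indicates your cat is too fat”
→ matches “belly”, the alternative that starts earliest in the text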
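The retry-from-the-next-character behaviour can be sketched in C for the simple case of an alternation of literal strings. This is only an illustration, not how a real engine is implemented; `leftmost_alternative` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: mimics the bump-along of a regex-directed engine
   for a pattern that is a plain alternation of literals. Returns the start
   offset of the match, or -1 if there is none. If which is not NULL it
   receives the index of the alternative that matched. */
long leftmost_alternative(const char *text, const char *const alts[],
                          size_t n_alts, size_t *which)
{
    size_t len = strlen(text);

    for (size_t start = 0; start <= len; start++) {   /* each start position in turn */
        for (size_t i = 0; i < n_alts; i++) {         /* alternatives in pattern order */
            if (strncmp(text + start, alts[i], strlen(alts[i])) == 0) {
                if (which != NULL)
                    *which = i;
                return (long)start;
            }
        }
    }
    return -1;
}
```

For the sentence above and the alternatives `fat`, `cat`, `belly`, `your`, the function returns the offset of “belly”: the order of the alternatives only matters among candidates that start at the same position.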
THE STANDARD QUANTIFIERS ARE GREEDY
◇ Each quantifier has a minimum number of matches that are required before it can be considered successful
◇ and a maximum number that it will ever attempt to match; a greedy quantifier always tries for the maximum first
THE STANDARD QUANTIFIERS ARE GREEDY
「.*(\d+)」 “Copyright - 05 March 2016” → group 1 captures only “6”
「.*(\d*)」 “Copyright - 05 March 2016” → group 1 captures the empty string
「\d+(?!.*\d)」 “Copyright - 05 March 2016” → matches “2016”
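Why the first pattern captures a single digit can be seen by simulating it by hand. The C sketch below is not a regex engine: it hard-codes the behaviour of 「.*(\d+)」 on a text without newlines, and `greedy_capture` is a hypothetical name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hand simulation of the pattern .*(\d+), assuming text has no newlines.
   Copies the text captured by group 1 into out and returns its length,
   or -1 if there is no match or out is too small. */
long greedy_capture(const char *text, char *out, size_t out_size)
{
    size_t len = strlen(text);

    /* The greedy .* first takes the whole text (split == len), then gives
       back one character at a time until \d+ can match what follows. */
    for (size_t split = len + 1; split-- > 0; ) {
        size_t end = split;
        while (isdigit((unsigned char)text[end]))   /* \d+ is greedy as well */
            end++;
        size_t n = end - split;
        if (n == 0)
            continue;                               /* \d+ needs at least one digit */
        if (n >= out_size)
            return -1;
        memcpy(out, text + split, n);
        out[n] = '\0';
        return (long)n;
    }
    return -1;
}
```

On “Copyright - 05 March 2016” the loop succeeds as soon as 「.*」 has given back a single character, so only the final “6” is left for the group.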
REGEX-DIRECTION VS TEXT-DIRECTION
◇ NFA engine is Regex-Directed
◇ DFA engine is Text-Directed
NFA ENGINE: REGEX-DIRECTED
「to(nite|knight|night)」 “tonight”
(animation: the engine walks through the regex one component at a time; after “to” it tries 「nite」, then 「knight」, then 「night」 against the text, backtracking after each failed alternative)
DFA ENGINE: TEXT-DIRECTED
「to(nite|knight|night)」 “tonight”
(animation: the engine scans the text one character at a time and keeps track of every alternative that could still match)
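A text-directed scan can be sketched in C by advancing all alternatives in parallel while reading each text character once. A real DFA is compiled into a state table; the hard-coded list of expanded alternatives and the name `text_directed_match` are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Text-directed sketch for to(nite|knight|night), anchored at the start
   of text. Each text character is examined once, and every alternative
   that is still alive is advanced together. */
bool text_directed_match(const char *text)
{
    static const char *const alts[] = { "tonite", "toknight", "tonight" };
    enum { N_ALTS = 3 };
    bool alive[N_ALTS] = { true, true, true };

    for (size_t pos = 0; text[pos] != '\0'; pos++) {
        bool any_alive = false;
        for (size_t i = 0; i < N_ALTS; i++) {
            if (!alive[i])
                continue;
            if (alts[i][pos] != text[pos]) {
                alive[i] = false;          /* this alternative can no longer match */
                continue;
            }
            if (alts[i][pos + 1] == '\0')
                return true;               /* an alternative has been fully matched */
            any_alive = true;
        }
        if (!any_alive)
            return false;                  /* nothing left to track: no match */
    }
    return false;                          /* text ended before any alternative did */
}
```

On “tonight” the 「knight」 alternative is dropped at the third character and 「nite」 at the fifth, without ever stepping back in the text.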
NFA VS DFA
               DFA                      NFA
Time           Fast                     Slow
Space          Less                     More
Type           Deterministic            Non-deterministic
Result         Consistent               Unpredictable
Backtracking   ✗                        ✓
Construction   DFA ⊂ NFA                NFA ⊃ DFA
Pre-compile    Slower and more memory   Faster and less memory
Then?          Is boring                Is fun
BACKTRACKING
◇ The engine considers each subexpression or component in turn
◇ When it must decide between two (or more) equally viable options, it:
○ selects one
○ remembers the others
◇ If the selected option is successful (and the rest of the regex is also successful), the match is finished
◇ Otherwise it backtracks to where it made the choice and tries another option
TWO IMPORTANT POINTS ON BACKTRACKING
◇ When faced with multiple choices, which should be tried first?
With a greedy quantifier the engine first makes the attempt; with a lazy one it first skips the attempt.
◇ When forced to backtrack, which saved choice should the engine use?
The most recently saved option is the one used (LIFO: Last In, First Out)
SAVED STATES
◇ A match without backtracking
「ab?c」 “abc”
(animation: at 「b?」 the engine saves the “skip b” state, then matches “b” and “c”; the saved state is never used)
SAVED STATES
◇ A match with backtracking
「ab?c」 “ac”
(animation: 「b」 fails against “c”, so the engine backtracks to the saved “skip b” state and then matches “c”)
SAVED STATES
◇ A lazy match with backtracking
「ab??c」 “abc”
(animation: the lazy 「b??」 first skips “b” and saves the “try b” state; 「c」 fails against “b”, so the engine backtracks, matches “b” and then “c”)
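The three cases above can be reproduced with a very small backtracking matcher. The C sketch below supports only literal characters, `x?` and `x??`, and requires the whole text to match; the C call stack plays the role of the LIFO stack of saved states. The function names and the counting convention (one backtrack each time a saved state is taken up) are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

static int backtracks;   /* saved states taken up during the current run */

/* Anchored match of p against t. Supported syntax: literal characters,
   x? (greedy optional) and x?? (lazy optional). */
static bool match_here(const char *p, const char *t)
{
    if (*p == '\0')
        return *t == '\0';

    if (p[1] == '?') {
        bool lazy = (p[2] == '?');
        const char *rest = p + (lazy ? 3 : 2);
        bool can_take = (*t == p[0]);

        if (lazy) {
            if (match_here(rest, t))       /* skip first; "take x" is the saved state */
                return true;
            backtracks++;
            return can_take && match_here(rest, t + 1);
        }
        if (can_take && match_here(rest, t + 1))   /* take first; "skip x" is saved */
            return true;
        backtracks++;
        return match_here(rest, t);
    }
    return *t == p[0] && match_here(p + 1, t + 1);
}

/* Runs the matcher, stores the result in *matched and returns how many
   times the engine had to fall back on a saved state. */
int count_backtracks(const char *pattern, const char *text, bool *matched)
{
    backtracks = 0;
    *matched = match_here(pattern, text);
    return backtracks;
}
```

Under these assumptions 「ab?c」 on “abc” needs no backtrack, while 「ab?c」 on “ac” and 「ab??c」 on “abc” each need one.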
POSIX NFA
◇ A POSIX NFA does not stop at the first match it finds, but continues to try any saved states that remain
◇ Each time it reaches the end of the regex, it has another plausible match
◇ Eventually, all options are exhausted and the longest match found is reported
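The difference between the two NFA flavours can be sketched in C for an alternation of literals tried at a single position. Both function names are hypothetical, and a return value of 0 is used for “no match”, so empty alternatives are out of scope here.

```c
#include <stddef.h>
#include <string.h>

/* Traditional NFA: stop at the first alternative that matches at text. */
size_t first_match_len(const char *text, const char *const alts[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);
        if (strncmp(text, alts[i], alen) == 0)
            return alen;
    }
    return 0;
}

/* POSIX NFA: keep trying the remaining alternatives and remember the
   longest match seen so far. */
size_t longest_match_len(const char *text, const char *const alts[], size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; i++) {
        size_t alen = strlen(alts[i]);
        if (strncmp(text, alts[i], alen) == 0 && alen > best)
            best = alen;
    }
    return best;
}
```

For the alternatives `nfa` and `nfa not` on “nfa not”, the first function stops at 3 characters and the second reports 7.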
PCRE2: PERL COMPATIBLE REGULAR EXPRESSIONS
The PCRE library is a set of functions that
implement regular expression pattern
matching using the same syntax and
semantics as Perl 5.
Why a Perl regex clone?
Perl regex syntax is a de facto standard of the
web age.
PCRE2 vs PCRE
◇ The new API does not have any user-visible C structures
◇ Function calls are used as the means of interacting with the library
◇ JIT compilation has been moved into a separate function
◇ It contains no static or global variables
◇ It introduces the idea of contexts in which PCRE2 functions are called
BASE PROCESS IS EASY
pcre2_compile() → pcre2_match() → results...
PCRE2_COMPILE()
re = pcre2_compile(
    pattern,               /* the pattern */
    PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
    0,                     /* default options */
    &errornumber,          /* for error number */
    &erroroffset,          /* for error offset */
    NULL);                 /* use default compile context */
PCRE2_COMPILE STRUCTURE
typedef struct pcre2_real_code {
pcre2_memctl memctl; /* Memory control fields */
const uint8_t *tables; /* The character tables */
void *executable_jit; /* Pointer to JIT code */
uint8_t start_bitmap[32]; /* Bitmap for starting code unit < 256 */
CODE_BLOCKSIZE_TYPE blocksize; /* Total (bytes) that was malloc-ed */
uint32_t magic_number; /* Paranoid and endianness check */
uint32_t compile_options; /* Options passed to pcre2_compile() */
uint32_t overall_options; /* Options after processing the pattern */
uint32_t flags; /* Various state flags */
uint32_t limit_heap; /* Limit set in the pattern */
uint32_t limit_match; /* Limit set in the pattern */
uint32_t limit_depth; /* Limit set in the pattern */
uint32_t first_codeunit; /* Starting code unit */
uint32_t last_codeunit; /* This codeunit must be seen */
uint16_t bsr_convention; /* What \R matches */
uint16_t newline_convention; /* What is a newline? */
uint16_t max_lookbehind; /* Longest lookbehind (characters) */
uint16_t minlength; /* Minimum length of match */
uint16_t top_bracket; /* Highest numbered group */
uint16_t top_backref; /* Highest numbered back reference */
uint16_t name_entry_size; /* Size (code units) of table entries */
uint16_t name_count; /* Number of name entries in the table */
} pcre2_real_code;
PCRE2_MATCH()
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
    re,             /* the compiled pattern */
    subject,        /* the subject string */
    subject_length, /* the length of the subject */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    match_data,     /* block for storing the result */
    NULL);          /* use default match context */
MATCH RESULT STRUCTURE
typedef struct pcre2_real_match_data {
pcre2_memctl memctl;
const pcre2_real_code *code; /* The pattern used for the match */
PCRE2_SPTR subject; /* The subject that was matched */
PCRE2_SPTR mark; /* Pointer to last mark */
PCRE2_SIZE leftchar; /* Offset to leftmost code unit */
PCRE2_SIZE rightchar; /* Offset to rightmost code unit */
PCRE2_SIZE startchar; /* Offset to starting code unit */
uint16_t matchedby; /* Type of match (normal, JIT, DFA) */
uint16_t oveccount; /* Number of pairs */
int rc; /* The return code from the match */
PCRE2_SIZE ovector[10000];/* The first field */
} pcre2_real_match_data;
OVECTOR
ovector = pcre2_get_ovector_pointer(match_data);
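The ovector holds one (start, end) pair of offsets per group, with pair 0 describing the whole match, and a positive return code from `pcre2_match()` gives the number of pairs that were set. The C sketch below reads such an array without linking against PCRE2: it uses plain `size_t` in place of `PCRE2_SIZE`, treats an all-ones offset as unset, and the helper name `copy_group` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies capture group `group` out of subject using
   an ovector-style array of (start, end) offset pairs. rc plays the role
   of a positive pcre2_match() return value (number of pairs set).
   Returns the length copied, or -1 if the group is not available. */
long copy_group(const char *subject, const size_t *ovector, int rc,
                int group, char *out, size_t out_size)
{
    if (group < 0 || group >= rc)
        return -1;

    size_t start = ovector[2 * group];       /* offset of the first code unit */
    size_t end = ovector[2 * group + 1];     /* offset just past the last one */

    if (start == (size_t)-1 || end < start)  /* unset group, or start past end */
        return -1;
    if (end - start >= out_size)
        return -1;

    memcpy(out, subject + start, end - start);
    out[end - start] = '\0';
    return (long)(end - start);
}
```

For 「nfa|nfa not」 on “nfa not” a traditional NFA would fill pair 0 with the offsets 0 and 3, so `copy_group` for group 0 yields “nfa”.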
TRY IT YOURSELF!
docker pull delda/pcre2
docker run -it delda/pcre2 bash
delda/pcre2 is a Docker image based on Debian Jessie
with a checkout of the PCRE2 source code. The library is
installed with the UTF-8, debug, and JIT options active.
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/
data> nfa not
0: nfa
data>
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/auto_callout
data> nfa not
--->nfa not
+0 ^ n
+1 ^^ f
+2 ^ ^ a
+3 ^ ^ |
0: nfa
data>
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -d
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/
-----------------------------------------------
0 9 Bra
3 nfa
9 17 Alt
12 nfa not
26 26 Ket
29 End
-----------------------------------------------
Capturing subpattern count = 0
First code unit = 'n'
Last code unit = 'a'
Subject length lower bound = 3
data>
DFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -dfa
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/auto_callout
data>
DFA EXAMPLE
data> nfa not
--->nfa not
+0 ^ n
+4 ^ n
+1 ^^ f
+5 ^^ f
+2 ^ ^ a
+6 ^ ^ a
+3 ^ ^ |
+7 ^ ^
+8 ^ ^ n
+9 ^ ^ o
+10 ^ ^ t
+11 ^ ^
0: nfa not
1: nfa
data> ^C

Regular Expression (RegExp)

  • 1. /regexp?/ The mechanics of Expression Processing with some PCRE2 referral Meetup PUG 25 Luglio 2017 #AperiTech
  • 2. I AM DAVIDE DELL’ERBA Research & Development @
  • 3. INDEX ◇ Regex engine types ◇ Two all-encompassing rules ◇ NFA vs DFA ◇ Backtracking ◇ PCRE2
  • 4. REGEX ENGINE TYPES ◇ DFA ◇ Traditional NFA ◇ POSIX NFA ◇ Hybrid NFA/DFA
  • 5. REGEX ENGINE TYPES (engine type: programs)
      DFA: awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
      Traditional NFA: GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed (most versions), vi
      POSIX NFA: mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested)
      Hybrid NFA/DFA: GNU awk, GNU grep / egrep, Tcl
  • 6. REGEX ENGINE TYPES IN PHP (text-processing extensions)
      PCRE: Regular Expressions (Perl-Compatible)
      POSIX Regex: Regular Expression (POSIX Extended); deprecated since PHP 5.3, removed in PHP 7.0
  • 7. TESTING THE ENGINE TYPES Traditional NFA or not? Apply 「nfa|nfa not」 to “nfa not”: a Traditional NFA matches only “nfa”; a DFA or POSIX NFA matches all of “nfa not”.
  • 8. TESTING THE ENGINE TYPES DFA or POSIX NFA? Apply 「X(.+)+X」 to “=XX=====================”: no match! A POSIX NFA takes a long time to report it; a DFA reports it quickly.
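
The first test can be tried from Python, whose `re` module appears in the deck's table among the Traditional NFA engines. The sketch below is not from the talk. The "longest alternative wins" step is only an assumed stand-in for what a DFA or POSIX NFA would report, not a real DFA.

```python
import re

ALTERNATIVES = ["nfa", "nfa not"]   # the two branches of nfa|nfa not
TEXT = "nfa not"

# Traditional NFA: branches are tried left to right and the first one
# that lets the overall match succeed wins.
traditional = re.match("|".join(ALTERNATIVES), TEXT).group()

# Emulated DFA / POSIX NFA result: of all branches that match at this
# position, the longest match is reported.
candidates = [re.match(alt, TEXT) for alt in ALTERNATIVES]
longest = max((m.group() for m in candidates if m), key=len)

print(repr(traditional))  # 'nfa'
print(repr(longest))      # 'nfa not'
```

The second test is deliberately left out: on a backtracking engine 「X(.+)+X」 can run for a very long time on that input.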
  • 9. TWO ALL-ENCOMPASSING RULES 1. The match that begins earliest (leftmost) wins 2. The standard quantifiers are greedy (「*」,「+」,「?」,「{m,n}」)
  • 10. THE MATCH THAT BEGINS EARLIEST WINS ◇ “a match” instead of “the match” ◇ the first attempt starts at the beginning of the string ◇ if all permutations are exhausted without a match, retry from the next character
  • 11. THE MATCH THAT BEGINS EARLIEST WINS 「cat」 in “The dragging belly indicates your cat is too fat” matches the “cat” inside “indicates” ◇ 「fat|cat|belly|your」 in “The dragging belly indicates your cat is too fat” matches “belly”
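
The "retry from the next character" loop can be written out by hand. This is a minimal sketch using Python's `re` as the engine; `leftmost_match` is a hypothetical helper, not something from the talk, and it simply anchors one attempt at each start position in turn.

```python
import re

TEXT = "The dragging belly indicates your cat is too fat"

def leftmost_match(pattern, text):
    """Try the regex at position 0, then 1, then 2... and stop at the first success."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        m = compiled.match(text, start)   # one attempt, anchored at `start`
        if m:
            return m
    return None

m1 = leftmost_match("cat", TEXT)
m2 = leftmost_match("fat|cat|belly|your", TEXT)
print(m1.span())   # (23, 26): the "cat" inside "indicates"
print(m2.group())  # belly: it starts earlier than your, cat or fat
```

The order of the alternatives does not matter here: 「fat」 is listed first, but “belly” begins earlier in the text, so it wins.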
  • 12. THE STANDARD QUANTIFIERS ARE GREEDY A quantified item has: ◇ a minimum number of matches that are required before it can be considered successful ◇ a maximum number of matches that it will ever attempt
  • 13. THE STANDARD QUANTIFIERS ARE GREEDY 「.*(\d+)」 “Copyright - 05 March 2016” 「.*(\d*)」 “Copyright - 05 March 2016” 「\d+(?!.*\d)」 “Copyright - 05 March 2016”
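
The three patterns on this slide can be checked with any backtracking engine. Here is a small sketch in Python's `re`, assuming the slide's digit class is `\d`; the variable names are illustrative.

```python
import re

TEXT = "Copyright - 05 March 2016"

# Greedy .* takes the whole line, then gives back just enough
# for \d+ to meet its minimum of one digit.
one_digit = re.search(r".*(\d+)", TEXT).group(1)

# \d* has a minimum of zero, so .* never has to give anything back.
nothing = re.search(r".*(\d*)", TEXT).group(1)

# No leading .* here: \d+ is greedy itself, and the lookahead rejects
# any run of digits that is followed by more digits later on.
last_number = re.search(r"\d+(?!.*\d)", TEXT).group()

print(repr(one_digit))    # '6'
print(repr(nothing))      # ''
print(repr(last_number))  # '2016'
```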
  • 14. REGEX-DIRECTION VS TEXT-DIRECTION ◇ NFA engine is Regex-Directed ◇ DFA engine is Text-Directed
  • 15. NFA ENGINE: REGEX-DIRECTED 「to(nite|knight|night)」 “tonight” (animation in six steps: the regex drives the match, and each alternative is tried in turn against the text)
  • 16. DFA ENGINE: TEXT-DIRECTED 「to(nite|knight|night)」 “tonight” (animation in five steps: the text drives the match, and every alternative still in play is tracked at once)
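
The text-directed idea can be shown with a toy that reads each character once and keeps a list of the alternatives still alive. This is only an illustration, not how a real DFA is built: the regex is expanded by hand into three literal strings, and `text_directed` is a hypothetical name.

```python
# Hand expansion of to(nite|knight|night) into its three literal paths.
ALTERNATIVES = ["tonite", "toknight", "tonight"]

def text_directed(text):
    """Read the text once, left to right, tracking every path still alive."""
    alive = list(ALTERNATIVES)
    trace = []
    for i, ch in enumerate(text):
        # A path survives only if its next character is the one just read.
        alive = [alt for alt in alive if i < len(alt) and alt[i] == ch]
        trace.append((ch, list(alive)))
    # Whole-string match: the text itself must be one of the surviving paths.
    return text in alive, trace

matched, trace = text_directed("tonight")
for ch, alive in trace:
    print(ch, alive)
print(matched)
```

No character of the text is examined twice and nothing is ever undone. A regex-directed engine would instead try `nite`, fail, go back, try `knight`, fail, go back, and only then try `night`.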
  • 17. NFA VS DFA (property: DFA / NFA)
      Time: fast / slow
      Space: less / more
      Type: deterministic / non-deterministic
      Result: consistent / unpredictable
      Backtracking: ✗ / ✓
      Construction: DFA ⊂ NFA / NFA ⊃ DFA
      Pre-compile: slower and more memory / faster and less memory
      Then?: is boring / is fun
  • 18. BACKTRACKING ◇ The engine considers each subexpression or component in turn ◇ When it has to decide between two (or more) equally viable options, it: ○ selects one ○ remembers the others ◇ If the chosen option is successful (and the rest of the regex is also successful) ○ the match is finished ◇ Otherwise it backtracks to where it made the choice and tries one of the remembered options
  • 19. TWO IMPORTANT POINTS ON BACKTRACKING ◇ When faced with multiple choices, which should be tried first? For greedy quantifiers the engine first attempts the item; for lazy ones it first skips it. ◇ When forced to backtrack, which saved choice should the engine use? The most recently saved option is the one used (LIFO: Last In, First Out)
  • 20. SAVED STATES ◇ A match without backtracking 「ab?c」 “abc” (animation in four steps, with the saved states shown at each step)
  • 21. SAVED STATES ◇ A match with backtracking 「ab?c」 “ac” (animation in five steps, with the saved states shown at each step; one attempt fails (✗) and the engine backtracks to a saved state)
  • 22. SAVED STATES ◇ A lazy match with backtracking 「ab??c」 “abc” (animation in six steps, with the saved states shown at each step; one attempt fails (✗) and the engine backtracks to a saved state)
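
The three saved-state walkthroughs can be reproduced with a toy backtracking matcher. This is a sketch under stated assumptions, not PCRE2's algorithm: it understands only literal characters optionally followed by `?` (greedy) or `??` (lazy), it anchors the attempt at the start of the text, and the names `parse` and `match` are hypothetical.

```python
def parse(pattern):
    """Split e.g. 'ab??c' into [('a','required'), ('b','lazy'), ('c','required')]."""
    items, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        i += 1
        if pattern[i:i + 2] == "??":
            items.append((ch, "lazy"))
            i += 2
        elif pattern[i:i + 1] == "?":
            items.append((ch, "greedy"))
            i += 1
        else:
            items.append((ch, "required"))
    return items

def match(pattern, text):
    """Return (matched, number_of_backtracks) for an attempt at position 0."""
    items = parse(pattern)
    saved = []                          # LIFO stack of saved states
    backtracks = 0
    pi, ti, must_take = 0, 0, False     # pattern index, text index, resume flag
    while pi < len(items):
        ch, kind = items[pi]
        if kind == "lazy" and not must_take:
            saved.append((pi, ti, True))        # remember "attempt it", skip first
            pi += 1
            continue
        if kind == "greedy":
            saved.append((pi + 1, ti, False))   # remember "skip it", attempt first
        if ti < len(text) and text[ti] == ch:
            pi, ti, must_take = pi + 1, ti + 1, False
        elif saved:
            pi, ti, must_take = saved.pop()     # most recently saved state wins
            backtracks += 1
        else:
            return False, backtracks
    return True, backtracks

print(match("ab?c", "abc"))    # (True, 0): greedy b? is attempted and succeeds
print(match("ab?c", "ac"))     # (True, 1): b fails, the saved "skip it" state is used
print(match("ab??c", "abc"))   # (True, 1): c fails after the skip, then b is attempted
```

The `saved.pop()` call is the LIFO rule from the previous slide: the engine always returns to the most recently saved state.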
  • 23. POSIX NFA ◇ A POSIX NFA does not stop at the first match it finds, but continues to try any saved states that remain ◇ Each time it reaches the end of the regex, it has another plausible match ◇ Eventually all options are exhausted, and the longest match found is reported
  • 24. PCRE2: PERL COMPATIBLE REGULAR EXPRESSION The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5.
  • 25. PCRE2: PERL COMPATIBLE REGULAR EXPRESSION Why a Perl regex clone? Perl regex is a de facto standard for the web age.
  • 26. PCRE2 vs PCRE ◇ The new API does not have any user-visible C structures ◇ Function calls are used as the means of interacting with the library ◇ JIT compilation has been moved into a separate function ◇ It contains no static or global variables ◇ It introduces the idea of contexts in which PCRE2 functions are called
  • 27. BASE PROCESS IS EASY pcre2_compile() → pcre2_match() → results...
  • 28. PCRE2_COMPILE()
      #define PCRE2_CODE_UNIT_WIDTH 8   /* must be defined before including pcre2.h */
      #include <pcre2.h>
      /* Fragment: pattern, subject and the other variables are declared elsewhere. */
      re = pcre2_compile(
        pattern,                /* the pattern */
        PCRE2_ZERO_TERMINATED,  /* indicates pattern is zero-terminated */
        0,                      /* default options */
        &errornumber,           /* for error number */
        &erroroffset,           /* for error offset */
        NULL);                  /* use default compile context */
      match_data = pcre2_match_data_create_from_pattern(re, NULL);
      rc = pcre2_match(
        re,                     /* the compiled pattern */
        subject,                /* the subject string */
        subject_length,         /* the length of the subject */
        0,                      /* start at offset 0 in the subject */
        0,                      /* default options */
        match_data,             /* block for storing the result */
        NULL);                  /* use default match context */
      ovector = pcre2_get_ovector_pointer(match_data);
  • 29. PCRE2_COMPILE STRUCTURE
      typedef struct pcre2_real_code {
        pcre2_memctl memctl;           /* Memory control fields */
        const uint8_t *tables;         /* The character tables */
        void *executable_jit;          /* Pointer to JIT code */
        uint8_t start_bitmap[32];      /* Bitmap for starting code unit < 256 */
        CODE_BLOCKSIZE_TYPE blocksize; /* Total (bytes) that was malloc-ed */
        uint32_t magic_number;         /* Paranoid and endianness check */
        uint32_t compile_options;      /* Options passed to pcre2_compile() */
        uint32_t overall_options;      /* Options after processing the pattern */
        uint32_t flags;                /* Various state flags */
        uint32_t limit_heap;           /* Limit set in the pattern */
        uint32_t limit_match;          /* Limit set in the pattern */
        uint32_t limit_depth;          /* Limit set in the pattern */
        uint32_t first_codeunit;       /* Starting code unit */
        uint32_t last_codeunit;        /* This codeunit must be seen */
        uint16_t bsr_convention;       /* What \R matches */
        uint16_t newline_convention;   /* What is a newline? */
        uint16_t max_lookbehind;       /* Longest lookbehind (characters) */
        uint16_t minlength;            /* Minimum length of match */
        uint16_t top_bracket;          /* Highest numbered group */
        uint16_t top_backref;          /* Highest numbered back reference */
        uint16_t name_entry_size;      /* Size (code units) of table entries */
        uint16_t name_count;           /* Number of name entries in the table */
      } pcre2_real_code;
  • 30. PCRE2_MATCH()
      match_data = pcre2_match_data_create_from_pattern(re, NULL);
      rc = pcre2_match(
        re,                     /* the compiled pattern */
        subject,                /* the subject string */
        subject_length,         /* the length of the subject */
        0,                      /* start at offset 0 in the subject */
        0,                      /* default options */
        match_data,             /* block for storing the result */
        NULL);                  /* use default match context */
  • 31. MATCH RESULT STRUCTURE
      typedef struct pcre2_real_match_data {
        pcre2_memctl memctl;
        const pcre2_real_code *code;   /* The pattern used for the match */
        PCRE2_SPTR subject;            /* The subject that was matched */
        PCRE2_SPTR mark;               /* Pointer to last mark */
        PCRE2_SIZE leftchar;           /* Offset to leftmost code unit */
        PCRE2_SIZE rightchar;          /* Offset to rightmost code unit */
        PCRE2_SIZE startchar;          /* Offset to starting code unit */
        uint16_t matchedby;            /* Type of match (normal, JIT, DFA) */
        uint16_t oveccount;            /* Number of pairs */
        int rc;                        /* The return code from the match */
        PCRE2_SIZE ovector[10000];     /* The first field */
      } pcre2_real_match_data;
  • 32. OVECTOR
      ovector = pcre2_get_ovector_pointer(match_data);
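
In PCRE2 the ovector is a flat array of offset pairs: entries 0 and 1 delimit the whole match, entries 2 and 3 the first capture group, and so on. The sketch below only imitates that layout with Python's `re` spans. It does not call PCRE2, and the pattern and the name `flat` are illustrative assumptions.

```python
import re

TEXT = "Copyright - 05 March 2016"
m = re.search(r"(\d+) (\w+) (\d+)", TEXT)

# Flatten the (start, end) spans into one list, ovector-style:
# [start0, end0, start1, end1, ...], where pair 0 is the whole match.
flat = [offset for group in range(m.re.groups + 1) for offset in m.span(group)]
print(flat)

# Group i is read back as the slice between entries 2*i and 2*i + 1.
for i in range(m.re.groups + 1):
    print(i, repr(TEXT[flat[2 * i]:flat[2 * i + 1]]))
```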
  • 33. TRY IT YOURSELF!
      docker pull delda/pcre2
      docker run -it delda/pcre2 bash
    delda/pcre2 is a Docker image based on Debian Jessie with a checkout of the PCRE2 source code. The library is built with UTF-8 support and with the debug and JIT options enabled.
  • 34. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
      PCRE2 version 10.30-DEV 2017-03-05
        re> /nfa|nfa not/
      data> nfa not
       0: nfa
      data>
  • 35. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
      PCRE2 version 10.30-DEV 2017-03-05
        re> /nfa|nfa not/auto_callout
      data> nfa not
      --->nfa not
       +0 ^           n
       +1 ^^          f
       +2 ^ ^         a
       +3 ^  ^        |
       0: nfa
      data>
  • 36. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -d
      PCRE2 version 10.30-DEV 2017-03-05
        re> /nfa|nfa not/
      -----------------------------------------------
        0   9 Bra
        3     nfa
        9  17 Alt
       12     nfa not
       26  26 Ket
       29     End
      -----------------------------------------------
      Capturing subpattern count = 0
      First code unit = 'n'
      Last code unit = 'a'
      Subject length lower bound = 3
      data>
  • 37. DFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -dfa
      PCRE2 version 10.30-DEV 2017-03-05
        re> /nfa|nfa not/auto_callout
      data>
  • 38. DFA EXAMPLE
      data> nfa not
      --->nfa not
       +0 ^           n
       +4 ^           n
       +1 ^^          f
       +5 ^^          f
       +2 ^ ^         a
       +6 ^ ^         a
       +3 ^  ^        |
       +7 ^  ^
       +8 ^   ^       n
       +9 ^    ^      o
      +10 ^     ^     t
      +11 ^      ^
       0: nfa not
       1: nfa
      data> ^C
  • 39. ANY QUESTIONS? You can find me at @delda80 github.com/delda info@davidedellerba.it