Regular Expression Best Practices
          Tony Stubblebine
          tony@tonystubblebine.com
          www.stubbleblog.com
          @tonystubblebine
Tabbed indentation is a sin but this isn't?
$string =~ s<
 (?:http://(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?)\.
)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?:\d+)(?:\.(?:\d+)
){3}))(?::(?:\d+))?)(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F
\d]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:@&=])*))*)(?:\?(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-fA-F\d]{
2}))|[;:@&=])*))?)?)|(?:ftp://(?:(?:(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?
:%[a-fA-F\d]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Z\d$\-_.+!*'(),]|(?:%[a-
fA-F\d]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Z\d](?:(?:[a-zA-Z\d]|-
)*[a-zA-Z\d])?)\.)*(?:[a-zA-Z](?:(?:[a-zA-Z\d]|-)*[a-zA-Z\d])?))|(?:(?
:\d+)(?:\.(?:\d+)){3}))(?::(?:\d+))?))(?:/(?:(?:(?:(?:[a-zA-Z\d$\-_.+!
*'(),]|(?:%[a-fA-F\d]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Z\d$\-_.+!*'()
,]|(?:%[a-fA-F\d]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news:(?:
...................
Abigail, comp.lang.perl.misc,
 http://aspn.activestate.com/ASPN/Cookbook/Rx/Recipe/59864
Best Practices for Any Programming
There are programming fundamentals that are
 routinely ignored by regular expression writers.
   
         Put a line break after statements and space
       between expressions.
   
       Throw in a comment or two.
   
       Use subroutines and modules to show structure
       and avoid duplication.
   
       Test.
Good Code
# Given a URL/URI, fetches it.
# Returns an HTTP::Response object.
sub get {
    my $self = shift;
    my $uri  = shift;
    $uri = $self->base
      ? URI->new_abs( $uri, $self->base )
      : URI->new( $uri );
    return $self->SUPER::get( $uri->as_string, @_ );
}
What if we didn't include
   documentation or whitespace?


sub get{my$self=shift;my$uri=shift;
 $uri=$self->base?URI->new_abs($uri,
 $self->base):URI->new($uri);
 return$self->SUPER::get($uri->as_string,@_);}
What if we were also as terse as
               possible?
So:

    No documentation

    No whitespace

    One character variable and method names
We'd have a regular expression.
sub g{my($s,$u)=@_;$u=$s->b?U->n($u,
 $s->b):U->q($u);
 return$s->SUPER::g($u->a,@_);}
What do we want from best
                   practices?

    Practices that maximize desired goals in certain
    applications.

    Goals of regex best practices:
    
        Maintainability
    
        Correctness
    
        Development Speed
#1: Use Extended Whitespace
   Add indentation, newlines, and comments to regular
    expressions
   Usage: add the /x modifier, as in m/regex/x
# Look for green or red foxes
$text =~ /(green | red)
             \s
             fox (es)?
             # Allow more than one
             /x;
Extended Whitespace Gotchas
•   Must explicitly ask to match a space, with \s
    or an escaped space (a backslash followed by a space)
•   Must escape pound signs: \#
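
A minimal sketch of both gotchas. The sample strings and variable names are made up for illustration and are not from the slides:

```perl
use strict;
use warnings;

my $text = "red fox";

# Under /x the bare space is ignored, so this is effectively /redfox/.
my $bare_space    = ( $text =~ m/red fox/x )    ? 1 : 0;
# \s or a backslash-escaped space matches the space explicitly.
my $class_space   = ( $text =~ m/red \s fox/x ) ? 1 : 0;
my $escaped_space = ( $text =~ m/red \  fox/x ) ? 1 : 0;

my $tag = "item #7";
# \# is a literal pound sign; a bare # would start a comment.
my ($number) = $tag =~ m/item \s \# (\d+)/x;

print "bare=$bare_space class=$class_space escaped=$escaped_space number=$number\n";
# prints: bare=0 class=1 escaped=1 number=7
```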
Before

    What does this match?
$text =~ m/^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$/;
After
$text =~ m/
 # Match IP addresses like 169.146.10.45
 ^                                # Start of string
 ([01]?\d\d?|2[0-4]\d|25[0-5])    # Number, 0-255
 \.([01]?\d\d?|2[0-4]\d|25[0-5])  # 0-255
 \.([01]?\d\d?|2[0-4]\d|25[0-5])  # 0-255
 \.([01]?\d\d?|2[0-4]\d|25[0-5])  # 0-255
 $/x;
#2: Test


    You don't know your data.

    And you have a typo in your regex.

    Guaranteed surprises on both fronts.
Fun Gotcha
What file does this code open?
$file = "/etc/passwd\0/var/www/index.html";
if ( $file =~ m/^ .* \.html/x ) {
    open (FILE, "$file");
}
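
The loose pattern above accepts the string even though it contains an embedded NUL byte and is not anchored at the end. One way to tighten it, sketched under the assumption that a valid name here is any NUL-free string ending in .html (the helper name is_html_path is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper: true only for NUL-free names that end in ".html".
sub is_html_path {
    my ($file) = @_;
    return $file =~ m/ \A [^\0]+ \.html \z /x ? 1 : 0;
}

my $tricky = "/etc/passwd\0/var/www/index.html";
my $plain  = "/var/www/index.html";

# The original, loose check lets the tricky name through.
my $loose_accepts = ( $tricky =~ m/^ .* \.html/x ) ? 1 : 0;

print "loose check on tricky name:  $loose_accepts\n";
print "strict check on tricky name: ", is_html_path($tricky), "\n";
print "strict check on plain name:  ", is_html_path($plain),  "\n";
```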
Typical Gotcha
This matches foo.gif
But also... foojpg and jpg.doc
# match image files
m/ \. gif | jpg | jpeg | png $/x
Test framework

    Write your regular expressions in a place where
    you can test them.

    Build up a list of positive and negative matches

    Include the list in your documentation, e.g.:
# matches 800-555-1212 but not
# 800.555.1212 or 800-BETS-OFF
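
A sketch of keeping that documented list next to the regex and checking it. The slides do not show the phone regex itself, so the NNN-NNN-NNNN pattern and the variable names below are assumptions for illustration:

```perl
use strict;
use warnings;

# Assumed format for illustration: NNN-NNN-NNNN, digits only.
# matches 800-555-1212 but not
# 800.555.1212 or 800-BETS-OFF
my $phone_re = qr/ \A \d{3} - \d{3} - \d{4} \z /x;

my @should_match = ("800-555-1212");
my @should_fail  = ("800.555.1212", "800-BETS-OFF");

# Collect every example that contradicts the documentation.
my @unexpected;
push @unexpected, grep { $_ !~ $phone_re } @should_match;
push @unexpected, grep { $_ =~ $phone_re } @should_fail;

print @unexpected
    ? "Surprises: @unexpected\n"
    : "All examples behave as documented\n";
```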
Hackers Test Framework
   Your “framework” could be this simple:
foreach my $test (@tests) {
    # looks like an image file?
    if ( $test =~ m/ \. gif | jpg | jpeg | png $/x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}
Real Tests Are Better
use Test::More;

my @match = ("foo.gif", "foo.bar.jpg", "bar_foo.gif.jpg.png");
my @fail  = ("gif.foo", "foo.gif.", "foopng", "foo.jpeg.bar");

# Still the ungrouped regex from the earlier slides, so the @fail tests expose its bug
sub match {
    return $_[0] =~ m/ \. gif | jpg | jpeg | png $/x;
}

foreach my $test (@match) {
    ok( match($test), "$test matches");
}
foreach my $test (@fail) {
    ok( !match($test), "$test fails to match");
}

done_testing();
#3: Use Structure

... as a slow-witted human being I have a very
   small head and I had better learn to live with it
   and to respect my limitations and give them full
   credit, rather than to try to ignore them, for the
   latter vain effort will be punished by failure.
~ Edsger Dijkstra
Breaking up an email regex
We can write an email regex that looks like this:
m/$user\@$domain/

Build your regexes from smaller regexes like this:
$user = qr/\w+/;
$domain = qr/\w+\.(\w+\.)*\w\w\w?/i;
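
Put together as a runnable sketch. The pieces follow the slide and are deliberately simplified (real email addresses allow far more); the $email variable and the sample addresses are made up for illustration:

```perl
use strict;
use warnings;

# Small, named pieces, each compiled with qr// so it keeps its own flags.
my $user   = qr/\w+/;
my $domain = qr/\w+\.(?:\w+\.)*\w\w\w?/i;

# The composed regex reads almost like the sentence "user at domain".
my $email = qr/\A $user \@ $domain \z/x;

for my $candidate ('tony@example.com', 'tony@mail.example.co', 'not-an-email') {
    printf "%-22s %s\n", $candidate, ( $candidate =~ $email ? "ok" : "no" );
}
```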
Use Post Processing

    It's easier to say a number is <= 255 in code than it is as
    a regular expression.
# IP Address check
my $failure = 0;
if ( $ip =~ m/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ ) {
    foreach my $num ($1, $2, $3, $4) {
        $failure++ unless $num < 256;
    }
}
else {
    $failure++;    # not even the right shape
}
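
The same idea wrapped in a subroutine, as a sketch: the regex only checks the shape, and ordinary code checks the range. The name is_ipv4 is hypothetical, not from the slides:

```perl
use strict;
use warnings;

# Hypothetical helper: regex for the shape, plain code for the 0-255 range.
sub is_ipv4 {
    my ($ip) = @_;
    my @octets = $ip =~ m/\A (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \. (\d{1,3}) \z/x
        or return 0;    # wrong shape, nothing to post-process
    return ( grep { $_ > 255 } @octets ) ? 0 : 1;
}

for my $candidate ("169.146.10.45", "256.1.1.1", "1.2.3") {
    printf "%-15s %s\n", $candidate, ( is_ipv4($candidate) ? "valid" : "invalid" );
}
```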
#4: Good habits

    Regexes are hard to debug, so avoid errors.

    Error avoidance habits:
    
        Group alternations with parentheses
    
        Use lazy quantifiers
    
        Don't use regular expressions
Group Alternations
Group your alternations. In this regex, the dot and
 end of string ($) are not part of your alternation.
m/ \. (gif | jpg | jpeg | png) $/x
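
A side-by-side sketch of the ungrouped and grouped forms on a few made-up file names (the variable names are illustrative):

```perl
use strict;
use warnings;

# Ungrouped: alternatives are "\.gif", "jpg", "jpeg", and "png$".
my $ungrouped = qr/ \. gif | jpg | jpeg | png $/x;
# Grouped: the dot and the end anchor apply to every extension.
my $grouped   = qr/ \. ( gif | jpg | jpeg | png ) $/x;

for my $name (qw(foo.gif foojpg jpg.doc foo.png)) {
    printf "%-8s ungrouped=%d grouped=%d\n",
        $name,
        ( $name =~ $ungrouped ? 1 : 0 ),
        ( $name =~ $grouped   ? 1 : 0 );
}
```

Only the grouped form rejects foojpg and jpg.doc.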
Use Lazy Quantifiers

    Use lazy quantifiers. It's easier to say when to
    stop.
<td>.*?</td>
Lazy Quantifiers...
Compare that to
#Matches too much
$text = "<td>foo</td><td>bar</td>";
$text =~ m!<td>.*</td>!;


#Matches too little
$text = "<td>foo <b>bar</b> </td>";
$text =~ m/<td>[^<]*/;
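
The three variants can be compared by capturing what each one actually grabs. A sketch using the slide's sample strings, with capture groups and variable names added for illustration:

```perl
use strict;
use warnings;

my $row = "<td>foo</td><td>bar</td>";

# Greedy: runs to the last </td>, swallowing both cells.
my ($greedy) = $row =~ m!<td>(.*)</td>!;
# Lazy: stops at the first </td>.
my ($lazy)   = $row =~ m!<td>(.*?)</td>!;

my $cell = "<td>foo <b>bar</b> </td>";

# Negated class: stops at the first "<", which is the nested <b>.
my ($negated)   = $cell =~ m!<td>([^<]*)!;
# Lazy again: reaches the closing </td> and keeps the nested markup.
my ($lazy_cell) = $cell =~ m!<td>(.*?)</td>!;

print "greedy:  [$greedy]\n";
print "lazy:    [$lazy]\n";
print "negated: [$negated]\n";
print "lazy:    [$lazy_cell]\n";
```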
Don't use regular expressions

    Regular expressions don't deal well with
    nesting
$text = "<td> foo <table><tr><td>bar</td>...";
$text =~ m!<td> .*? </td>!xs;    # stops at the inner </td>

    Use something better: an HTML or XML parsing library.
#5: Optimize Last

    It's more common for regular expressions to be
    broken than to be slow

    Optimize last.

    Start with the quantifiers
Optimizing Quantifiers
# This is slow because the match backtracks from the end
# of the file
$text = "M1 text i'm looking for M2 thousand more characters to come...";
$text =~ m/M1 (.*) M2/s;


# This is slow because the match looks for </body> at
# (nearly) every position.
$html =~ m!<body> (.*?) </body>!xs;
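
One possible adjustment, sketched with made-up data: choose the quantifier according to where the closing marker is expected to sit. The speed claims are the slide's; this sketch only checks that the swapped quantifiers still capture the intended text, and real input would need measuring (for example with the core Benchmark module).

```perl
use strict;
use warnings;

# Closing marker near the start: a lazy quantifier stops soon after M1.
my $text = "M1 text i'm looking for M2 " . ( "x" x 1000 );
my ($near_start) = $text =~ m/M1 (.*?) M2/s;

# Closing marker near the end: a greedy quantifier jumps to the end
# and backtracks only a short distance.
my $html = "<body>" . ( "word " x 1000 ) . "</body></html>";
my ($near_end) = $html =~ m!<body> (.*) </body>!xs;

print "near start: [$near_start]\n";
print "near end:   ", length($near_end), " characters captured\n";
```

Swapping greedy for lazy changes which match is found when the closing marker occurs more than once, so this only applies when the marker is known to be unique.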
Buy The Book!
Available from Amazon for $9.95
http://bit.ly/regexpr


Thank you for reading!
I'm tony@tonystubblebine.com
