Regular expressions

An old, but still very relevant, short course on regular expressions, with examples of how to use them and references for further reading.

Published in: Technology

  1. Regular Expressions
     Powerful string validation and extraction
     Ignaz Wanders – Architect @ Archimiddle
     @ignazw
  2. Topics
     • What are regular expressions?
     • Patterns
     • Character classes
     • Quantifiers
     • Capturing groups
     • Boundaries
     • Internationalization
     • Regular expressions in Java
     • Quiz
     • References
  3. What are regular expressions?
     • A regex is a string pattern used to search and manipulate text
     • A regex has its own special syntax
     • Very powerful for any type of string manipulation, ranging from simple to very complex structures:
       – Input validation
       – S(ubs)tring replacement
       – ...
     • Example:
       [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
  4. History
     • Originates from the automata and formal-language theories of computer science
     • Stephen Kleene, 1950s: Kleene algebra
     • Kenneth Thompson, 1969: Unix: qed, ed
     • 1970s to 1990s: Unix: grep, awk, sed, emacs
     • Programming languages:
       – C, Perl
       – JavaScript, Java
  5. Patterns
     • Regex is based on pattern matching: strings are searched for certain patterns
     • The simplest regex is a string-literal pattern
     • Metacharacters: ([{\^$|)?*+.
       – A period means “any character”
       – To search for a period as a string literal, escape it with “\”
     REGEX: fox   TEXT: The quick brown fox   RESULT: fox
     REGEX: fo.   TEXT: The quick brown fox   RESULT: fox
     REGEX: .o.   TEXT: The quick brown fox   RESULT: row, fox
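The three pattern examples can be reproduced with a small helper built on `java.util.regex`. This is a minimal sketch, assuming Java 8 or later; the class name `PatternBasics` and the helper `findAll` are illustrative and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternBasics {
    /** Returns every non-overlapping match of regex in text, in order of appearance. */
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));        // string-literal pattern
        System.out.println(findAll(".o.", text));        // period matches any character
        System.out.println(findAll("fo\\.", "fo. fox")); // escaped period matches only a literal '.'
    }
}
```

Note that the backslash itself has to be doubled inside a Java string literal, so the regex `fo\.` is written `"fo\\."`.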
  6. Character classes (1/3)
     • Syntax: any characters between [ and ]
     • A character class matches exactly one character
     • Negation: ^
     REGEX: [rcb]at    TEXT: bat   RESULT: bat
     REGEX: [rcb]at    TEXT: rat   RESULT: rat
     REGEX: [rcb]at    TEXT: cat   RESULT: cat
     REGEX: [rcb]at    TEXT: hat   RESULT: -
     REGEX: [^rcb]at   TEXT: rat   RESULT: -
     REGEX: [^rcb]at   TEXT: hat   RESULT: hat
  7. Character classes (2/3)
     • Ranges: [a-z], [0-9], [i-n], [a-zA-Z], ...
     • Unions: [0-4[6-8]], [a-p[r-w]], ...
     • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
     • Subtractions: [a-f&&[^efg]], ...
     REGEX: [rcb]at[1-5]          TEXT: bat4   RESULT: bat4
     REGEX: [rcb]at[1-5[7-8]]     TEXT: hat7   RESULT: -
     REGEX: [rcb]at[1-7&&[78]]    TEXT: rat7   RESULT: rat7
     REGEX: [rcb]at[1-5&&[^34]]   TEXT: bat4   RESULT: -
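Unions, intersections and subtractions inside a character class are a Java-specific syntax, so they are easiest to check directly against `java.util.regex.Pattern`. The following is a small sketch; the class name `CharClassDemo` and its `matches` wrapper are illustrative only.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    /** True if the whole text matches the regex (Pattern.matches anchors at both ends). */
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Union: the digit must be in 1-5 or in 7-8.
        System.out.println(matches("[rcb]at[1-5[7-8]]", "bat7"));
        // Intersection: the digit must be in 1-7 and also be 7 or 8, so only 7 qualifies.
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));
        // Subtraction: the digit must be in 1-5 but must not be 3 or 4.
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4"));
    }
}
```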
  8. Character classes (3/3)
     Predefined character class              Equivalence
     .    any character
     \d   any digit                          [0-9]
     \D   any non-digit                      [^0-9], [^\d]
     \s   any white-space character          [ \t\n\x0B\f\r]
     \S   any non-white-space character      [^\s]
     \w   any word character                 [a-zA-Z_0-9]
     \W   any non-word character             [^\w]
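In Java source code the predefined classes need a doubled backslash, because the string literal consumes one level of escaping before the regex engine sees the pattern. A brief sketch follows; the class `PredefinedClasses` and its two helpers are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class PredefinedClasses {
    /** True if s consists of one or more ASCII digits: the regex \d+ written as a Java literal. */
    static boolean isAllDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    /** Splits s on runs of white space: the regex \s+ written as a Java literal. */
    static List<String> words(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2024"));
        System.out.println(isAllDigits("20x4"));
        System.out.println(words("a b\tc\n d"));
    }
}
```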
  9. Quantifiers (1/5)
     • Quantifiers allow character classes to match more than one character at a time.
     Quantifiers for a character class X
     X?       zero or one time
     X*       zero or more times
     X+       one or more times
     X{n}     exactly n times
     X{n,}    at least n times
     X{n,m}   at least n and at most m times
  10. Quantifiers (2/5)
      • Examples of X?, X*, X+
      REGEX: “a?”   TEXT: “”      RESULT: “”
      REGEX: “a*”   TEXT: “”      RESULT: “”
      REGEX: “a+”   TEXT: “”      RESULT: -
      REGEX: “a?”   TEXT: “a”     RESULT: “a”
      REGEX: “a*”   TEXT: “a”     RESULT: “a”
      REGEX: “a+”   TEXT: “a”     RESULT: “a”
      REGEX: “a?”   TEXT: “aaa”   RESULT: “a”, “a”, “a”
      REGEX: “a*”   TEXT: “aaa”   RESULT: “aaa”
      REGEX: “a+”   TEXT: “aaa”   RESULT: “aaa”
      • With Matcher.find(), a? and a* also report a zero-length match at the end of a non-empty input; those empty matches are omitted here.
  11. Quantifiers (3/5)
      REGEX: “[abc]{3}”   TEXT: “abccabaaaccbbbc”      RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
      REGEX: “abc{3}”     TEXT: “abccabaaaccbbbc”      RESULT: -
      REGEX: “(dog){3}”   TEXT: “dogdogdogdogdogdog”   RESULT: “dogdogdog”, “dogdogdog”
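The quantifier examples differ mainly in what the quantifier binds to: a character class, the single preceding character, or a group. The sketch below runs them through `Matcher.find()`; `QuantifierDemo` and `findAll` are illustrative names, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    /** Collects all matches reported by find(), including zero-length ones. */
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abccabaaaccbbbc";
        System.out.println(findAll("[abc]{3}", text)); // {3} applies to the whole class
        System.out.println(findAll("abc{3}", text));   // {3} applies to 'c' only: needs "abccc"
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog")); // {3} applies to the group
        System.out.println(findAll("a?", "aaa"));      // last entry is the empty match at the end
    }
}
```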
  12. Quantifiers (4/5)
      • Greedy quantifiers:
        – read the complete string
        – work backwards until a match is found
        – syntax: X?, X*, X+, ...
      • Reluctant quantifiers:
        – read one character at a time
        – work forward until a match is found
        – syntax: X??, X*?, X+?, ...
      • Possessive quantifiers:
        – read the complete string
        – try to match only once
        – syntax: X?+, X*+, X++, ...
  13. Quantifiers (5/5)
      Greedy:       REGEX: “.*foo”    TEXT: “xfooxxxxxxfoo”   RESULT: “xfooxxxxxxfoo”
      Reluctant:    REGEX: “.*?foo”   TEXT: “xfooxxxxxxfoo”   RESULT: “xfoo”, “xxxxxxfoo”
      Possessive:   REGEX: “.*+foo”   TEXT: “xfooxxxxxxfoo”   RESULT: -
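The three quantifier modes can be compared side by side on the same input. This is a minimal sketch with illustrative names (`QuantifierModes`, `findAll`); only the regexes and the input text are taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierModes {
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* first consumes everything, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", text));
        // Reluctant: .*? consumes as little as possible, so each "foo" ends a match.
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ consumes everything and never gives anything back.
        System.out.println(findAll(".*+foo", text));
    }
}
```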
  14. Capturing groups (1/2)
      • Capturing groups treat multiple characters as a single unit
      • Syntax: between parentheses ( and )
      • Example: (dog){3}
      • Groups are numbered by their opening parenthesis, from left to right
        – Example: ((A)(B(C)))
          • Group 1: ((A)(B(C)))
          • Group 2: (A)
          • Group 3: (B(C))
          • Group 4: (C)
  15. Capturing groups (2/2)
      • Backreferences to capturing groups are denoted by \i, with i an integer group number
      REGEX: “(\d\d)\1”   TEXT: “1212”   RESULT: “1212”
      REGEX: “(\d\d)\1”   TEXT: “1234”   RESULT: -
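Group numbering and backreferences can both be inspected through `Matcher`. The sketch below lists group 0 (the whole match) followed by the numbered groups; `GroupDemo`, `groups` and `isRepeatedPair` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    /** Returns group 0..groupCount() if the whole text matches, otherwise an empty list. */
    static List<String> groups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    /** True for four digits in which the second pair repeats the first, e.g. "1212". */
    static boolean isRepeatedPair(String text) {
        return Pattern.matches("(\\d\\d)\\1", text);
    }

    public static void main(String[] args) {
        System.out.println(groups("((A)(B(C)))", "ABC"));
        System.out.println(isRepeatedPair("1212"));
        System.out.println(isRepeatedPair("1234"));
    }
}
```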
  16. Boundaries (1/2)
      Boundary characters
      ^    beginning of line
      $    end of line
      \b   a word boundary
      \B   a non-word boundary
      \A   beginning of input
      \G   end of previous match
      \z   end of input
      \Z   end of input, but before the final terminator, if any
  17. Boundaries (2/2)
      • Be aware:
      • The end-of-line marker is $
        – Unix EOL is \n
        – Windows EOL is \r\n
        – The JDK accepts any of the following as EOL:
          • \n, \r\n, \r, \u0085, \u2028, \u2029
      • Always test your regular expressions on the target OS
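The effect of line terminators on ^ and $ can be checked with the `Pattern.MULTILINE` and `Pattern.UNIX_LINES` flags. Neither flag is mentioned on the slides, so treat this as a sketch under that assumption: in Java, ^ and $ anchor at line boundaries only in MULTILINE mode, and UNIX_LINES restricts the recognised terminator to \n. The names `BoundaryDemo` and `findAll` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    static List<String> findAll(String regex, int flags, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \b keeps "cat" from matching inside "concatenate".
        System.out.println(findAll("\\bcat\\b", 0, "cat concatenate cat."));

        String mixed = "one\r\ntwo\nthree"; // first line ends Windows-style, second Unix-style
        // Default terminators: both \r\n and \n end a line.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, mixed));
        // UNIX_LINES: only \n ends a line, so "one" (followed by \r) no longer matches.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE | Pattern.UNIX_LINES, mixed));
    }
}
```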
  18. Internationalization (1/2)
      • Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
        – Thus “België” is not matched by ^\w+$
      • Extension to Unicode character sets is denoted by \p{...}
      • Character set: [\p{InCharacterSet}]
        – Create character classes from symbols in character sets.
        – “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
  19. Internationalization (2/2)
      • Note that there are non-letters in character sets as well:
        – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
      • Categories:
        – Letters: \p{L}
        – Uppercase letters: \p{Lu}
        – “België” is matched by ^\p{L}+$
      • Other (POSIX) categories:
        – Unicode currency symbols: \p{Sc}
        – ASCII punctuation characters: \p{Punct}
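The difference between the ASCII-only \w and the Unicode categories is easy to verify. The sketch below uses a Unicode escape for the final letter of "België" so that the source file stays pure ASCII. The `Pattern.UNICODE_CHARACTER_CLASS` flag (available since Java 7) is an addition that does not appear on the slides, and the class and method names are illustrative.

```java
import java.util.regex.Pattern;

final class UnicodeDemo {
    /** "Belgie" with a diaeresis on the final e, written with a Unicode escape. */
    static final String BELGIE = "Belgi\u00eb";

    /** \w is ASCII-only by default, so accented letters are rejected. */
    static boolean asciiWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    /** \p{L} accepts any Unicode letter. */
    static boolean unicodeLetters(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    /** With UNICODE_CHARACTER_CLASS, \w itself becomes Unicode-aware. */
    static boolean unicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiWord(BELGIE));
        System.out.println(unicodeLetters(BELGIE));
        System.out.println(unicodeWord(BELGIE));
    }
}
```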
  20. Regular expressions in Java
      • Available since JDK 1.4
      • Package java.util.regex
        – Pattern class
        – Matcher class
      • Convenience methods in java.lang.String
      • Alternative for JDK 1.3
        – Jakarta ORO project
  21. java.util.regex.Pattern
      • Wrapper class for compiled regular expressions
      • Useful methods:
        – compile(String regex): Pattern
        – matches(String regex, CharSequence text): boolean
        – split(CharSequence text): String[]
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
  22. java.util.regex.Matcher
      • Useful methods:
        – matches(): boolean
        – find(): boolean
        – find(int start): boolean
        – group(): String
        – replaceFirst(String replace): String
        – replaceAll(String replace): String
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
      String text = "1212";
      Matcher m = p.matcher(text);
      boolean matches = m.matches();
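A complete, compilable version of the Pattern and Matcher snippets might look as follows. It compiles the pattern once and reuses it, which is the usual reason to prefer `Pattern.compile` over the one-shot static `Pattern.matches`. The class name `MatcherDemo` and its helpers are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherDemo {
    // Two digits followed by the same two digits; compiled once and reused.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    /** matches(): the entire text must fit the pattern. */
    static boolean matchesWhole(String text) {
        return REPEATED_PAIR.matcher(text).matches();
    }

    /** find() + group(): scan for occurrences anywhere in the text. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = REPEATED_PAIR.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    /** replaceAll(): substitute every occurrence. */
    static String mask(String text) {
        return REPEATED_PAIR.matcher(text).replaceAll("####");
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("1212"));
        System.out.println(findAll("x1212 and 3434 and 1234"));
        System.out.println(mask("id 1212!"));
    }
}
```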
  23. java.lang.String
      • Pattern and Matcher methods in String:
        – matches(String regex): boolean
        – split(String regex): String[]
        – replaceFirst(String regex, String replace): String
        – replaceAll(String regex, String replace): String
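The String convenience methods cover the common one-off cases without an explicit Pattern or Matcher. A short sketch, with hypothetical helper names and sample inputs:

```java
import java.util.Arrays;
import java.util.List;

final class StringRegexDemo {
    /** String.matches: the whole string must match, as with Matcher.matches(). */
    static boolean isIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    /** String.split: the regex describes the separators. */
    static List<String> splitOnDigits(String s) {
        return Arrays.asList(s.split("\\d+"));
    }

    /** String.replaceAll: every run of white space becomes a single space. */
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2024-01-15"));
        System.out.println(splitOnDigits("a1b22c333d"));
        System.out.println(collapseSpaces("a  b   c"));
    }
}
```

Each of these calls compiles the regex again, so for patterns used in a loop a precompiled `Pattern` is usually the better choice.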
  24. Examples
      • Validation
      • Searching text
      • Filtering
      • Parsing
      • Removing duplicate lines
      • On-the-fly editing
  25. Examples: validation
      • Validate an e-mail address:
        [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
      • Validate a URL:
        (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
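The e-mail pattern can be wrapped in a small validator. Two assumptions are made here: the period before the final part is escaped as `\.`, and the pattern is compiled with `Pattern.CASE_INSENSITIVE`, because the character classes list only upper-case letters. `EmailValidator` is an illustrative name, and the pattern is the slide's simple example rather than a complete e-mail address grammar.

```java
import java.util.regex.Pattern;

final class EmailValidator {
    // Slide pattern with the period escaped; case-insensitive so lower-case input is accepted.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    /** True if the whole input has the shape local@domain.tld with a 2-4 character last part. */
    static boolean isValid(String candidate) {
        return candidate != null && EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("john.doe@example.com"));
        System.out.println(isValid("john.doe@example"));
    }
}
```

Because the last part is limited to four characters, longer top-level domains are rejected by this pattern.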
  26. Examples: searching text
      • Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
        9[0-9]{6}-[0-9]{6}
      Pattern p = Pattern.compile(regexp);
      Matcher m = p.matcher(text);
      boolean ok = m.find();
      // group() throws IllegalStateException when find() did not succeed
      String nr = ok ? m.group() : null;
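A self-contained variant of the search is sketched below. It leaves HttpUnit out and assumes the response body is already available as a String; `FormNumberFinder` and the sample HTML are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormNumberFinder {
    // A 9, six digits, a hyphen, six digits: 9xxxxxx-xxxxxx
    private static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    /** Returns the first form number found in the response body, if any. */
    static Optional<String> find(String responseBody) {
        Matcher m = FORM_NUMBER.matcher(responseBody);
        // Only call group() after a successful find().
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(find("<p>Your form number is 9123456-654321.</p>"));
        System.out.println(find("<p>Something went wrong.</p>"));
    }
}
```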
  27. Examples: filtering
      • Filter e-mails whose subjects are in capitals only, allowing for a leading “Re:”:
        (R[eE]:)*[^a-z]*$
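Applied with `Matcher.matches()`, which anchors the pattern at both ends, the filter expression accepts subjects that contain no lower-case ASCII letters after any number of "Re:" or "RE:" prefixes. A sketch with an illustrative class name and made-up subjects:

```java
import java.util.regex.Pattern;

final class SubjectFilter {
    private static final Pattern CAPITALS_ONLY = Pattern.compile("(R[eE]:)*[^a-z]*$");

    /** True if the subject, after optional reply prefixes, contains no lower-case ASCII letters. */
    static boolean isCapitalsOnly(String subject) {
        return CAPITALS_ONLY.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCapitalsOnly("Re: BUY NOW!!!"));
        System.out.println(isCapitalsOnly("Re: lunch tomorrow?"));
    }
}
```

Note that the prefix group is optional, so a subject without "Re:" is accepted as well, and so is an empty subject.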
  28. Examples: parsing
      • Matches any XML element with its opening and closing tags:
        – Note the use of the backreference
        <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
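The backreference in the tag pattern ensures that the closing tag has the same name as the opening one. The sketch below extracts the tag name (group 1) and the content (group 2). It assumes `Pattern.CASE_INSENSITIVE`, since the pattern lists upper-case letters only, and `TagParser` is an illustrative name. A regex like this handles simple, non-nested markup only and is no substitute for an XML parser.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagParser {
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    /** Returns [tagName, content] for the first element found, or an empty list. */
    static List<String> firstElement(String markup) {
        Matcher m = ELEMENT.matcher(markup);
        if (!m.find()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(firstElement("<b class=\"x\">bold</b>"));
        System.out.println(firstElement("<a>text</b>")); // closing tag does not match group 1
    }
}
```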
  29. Examples: duplicate lines
      • Suppose you want to remove duplicate lines from a text.
        – The requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent
        ^(.*)(\r?\n\1)+$
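One way to apply the duplicate-line pattern in Java is sketched below. It assumes `Pattern.MULTILINE`, so that ^ and $ anchor at every line, and it replaces each run of identical lines with group 1. The replacement string is an assumption about the intended use, since the slide shows only the pattern. `DuplicateLines` is an illustrative name.

```java
import java.util.regex.Pattern;

final class DuplicateLines {
    // A line, followed by one or more line breaks plus the very same line.
    private static final Pattern REPEATED_LINE =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    /** Collapses runs of identical adjacent lines into a single line. */
    static String collapse(String sortedText) {
        return REPEATED_LINE.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String sorted = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(collapse(sorted));
    }
}
```

The trailing $ matters: without it, a line such as "app" followed by "apple" would be treated as a partial duplicate.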
  30. Examples: on-the-fly editing
      • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
      • In Unix: use the sed command with a regex
      • In Java: use string.replaceAll(regex, "mystring")
      • In Ant: use the optional replaceregexp task to, e.g., edit deployment descriptors depending on the environment
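Two details of `replaceAll` are worth knowing when editing text in batch: the replacement string may refer to capturing groups with `$1`, `$2`, and so on, and a replacement that should be taken literally needs `Matcher.quoteReplacement`. A brief sketch; the class name, helper names and sample data are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BatchEdit {
    /** Rewrites dd-mm-yyyy dates as yyyy-mm-dd by reordering the captured groups. */
    static String toIsoDates(String text) {
        return text.replaceAll("(\\d{2})-(\\d{2})-(\\d{4})", "$3-$2-$1");
    }

    /** Replaces every regex match with the replacement taken literally, '$' and '\' included. */
    static String replaceLiteral(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(toIsoDates("Due 31-12-2024, paid 01-02-2023"));
        System.out.println(replaceLiteral("price: X", "X", "$5"));
    }
}
```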
  31. Quiz
      • What are the following regular expressions looking for?
        \d+                           at least one digit
        [-+]?\d+                      any integer
        ((\d*\.?)?\d+|\d+(\.?\d*))    any unsigned decimal number
        [\p{L}][-.\p{L} ]+            a place name
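The quiz answers can be checked by running each pattern, with its backslashes written out, against a few inputs. The sketch below uses `Matcher.matches()`, so the whole input must fit; the class name and the sample inputs are illustrative.

```java
import java.util.regex.Pattern;

final class QuizPatterns {
    static final Pattern INTEGER = Pattern.compile("[-+]?\\d+");
    static final Pattern DECIMAL = Pattern.compile("((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))");
    static final Pattern PLACE_NAME = Pattern.compile("[\\p{L}][-.\\p{L} ]+");

    static boolean is(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(is(INTEGER, "-42"));
        System.out.println(is(DECIMAL, ".5"));
        System.out.println(is(DECIMAL, "5."));
        System.out.println(is(PLACE_NAME, "Saint-Tropez"));
    }
}
```

The place-name pattern requires at least two characters and a leading letter, so a name that starts with a digit is rejected.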
  32. Conclusion
      • When doing one of the following:
        – validating strings
        – on-the-fly editing of strings
        – searching strings
        – filtering strings
      • think regex!
  33. References
      • http://www.regular-expressions.info/
      • http://www.regexlib.com/
      • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
      • http://java.sun.com/docs/books/tutorial/extra/regex/
      • http://www.wellho.net/regex/javare.html
      • JDK 1.4 API
      • Mastering Regular Expressions