SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
Hadley Wickham
Stat405Regular expressions
Tuesday, 9 November 2010
1. Recap
2. Regular expressions
3. Other string processing functions
Tuesday, 9 November 2010
Recap
library(stringr)
load("email.rdata")
str(contents) # second 100 are spam
# Extract headers and bodies
breaks <- str_locate(contents, "nn")
header <- str_sub(contents, end = breaks[, 1])
body <- str_sub(contents, start = breaks[, 2])
Tuesday, 9 November 2010
# To parse into fields
h <- header[1]
lines <- str_split(h, "n")[[1]]
continued <- str_sub(lines, 1, 1) %in% c(" ", "t")
# This is a neat trick!
groups <- cumsum(!continued)
fields <- rep(NA, max(groups))
for (i in seq_along(fields)) {
fields[i] <- str_c(lines[groups == i],
collapse = "n")
}
Tuesday, 9 November 2010
We split the content into header and
body. And split up the header into fields.
Both of these tasks used fixed strings
What if the pattern we need to match is
more complicated?
Recap
Tuesday, 9 November 2010
# What if we could just look for lines that
# don't start with whitespace?
break_pos <- str_locate_all(h, "n[^ t]")[[1]]
break_pos[, 2] <- break_pos[, 1]
field_pos <- invert_match(break_pos)
str_sub(h, field_pos[, 1], field_pos[, 2])
Tuesday, 9 November 2010
Matching challenges
• How could we match a phone number?
• How could we match a date?
• How could we match a time?
• How could we match an amount of
money?
• How could we match an email address?
Tuesday, 9 November 2010
Pattern matching
Each of these types of data have a fairly
regular pattern that we can easily pick out
by eye
Today we are going to learn about regular
expressions, which are a pattern
matching language
Almost every programming language has
regular expressions, often with a slightly
different dialect. Very very useful
Tuesday, 9 November 2010
First challenge
• Matching phone numbers
• How are phone numbers normally
written?
• How can we describe this format?
• How can we extract the phone
numbers?
Tuesday, 9 November 2010
Nothing. It is a generic folder that has Enron Global Markets on
the cover. It is the one that I sent you to match your insert to
when you were designing. With the dots. I am on my way over for a
meeting, I'll bring one.
Juli Salvagio
Manager, Marketing Communications
Public Relations
Enron-3641
1400 Smith Street
Houston, TX 77002
713-345-2908-Phone
713-542-0103-Mobile
713-646-5800-Fax
Tuesday, 9 November 2010
Mark,
Good speaking with you. I'll follow up when I get your email.
Thanks,
Rosanna
Rosanna Migliaccio
Vice President
Robert Walters Associates
(212) 704-9900
Fax: (212) 704-4312
mailto:rosanna@robertwalters.com
http://www.robertwalters.com
Tuesday, 9 November 2010
Write down a description, in words, of the
usual format of phone numbers.
Your turn
Tuesday, 9 November 2010
Phone numbers
• 10 digits, normally grouped 3-3-4
• Separated by space, - or ()
• How can we express that with a computer
program? We’ll use regular expressions
• [0-9]: matches any number between 0
and 9
• [- ()]: matches -, space, ( or )
Tuesday, 9 November 2010
phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()]
[0-9][0-9][0-9][0-9]"
phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}"
test <- body[10]
cat(test)
str_detect(test, phone2)
str_locate(test, phone2)
str_locate_all(test, phone2)
str_extract(test, phone2)
str_extract_all(test, phone2)
Tuesday, 9 November 2010
Qualifier ≥ ≤
? 0 1
+ 1 Inf
* 0 Inf
{m,n} m n
{,n} 0 n
{m,} m Inf
Tuesday, 9 November 2010
# What do these regular expression match?
mystery1 <- "[0-9]{5}(-[0-9]{4})?"
mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"
mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}"
mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+
([a-z0-9-]*[a-z0-9]+)?)+)*"
# Think about them first, then input to
# http://xenon.stanford.edu/~xusch/regexp/analyzer.html
# (select java for language - it's closest to R for regexps)
Tuesday, 9 November 2010
New features
• () group parts of a regular expression
• . matches any character
(. matches a .)
• d matches a digit, s matches a space
• Other characters that need to be
escaped: $, ^
Tuesday, 9 November 2010
Tuesday, 9 November 2010
Regular expression
escapes
Tricky, because we are writing strings, but the
regular expression is the contents of the
string.
"a.b" represents the regular expression a
.b, which matches a.b only
"a.b" is an error
"" represents the regular expression 
which matches 
Tuesday, 9 November 2010
Your turn
Improve our initial regular expression for
matching a phone number.
Hints: use the tools linked from the website
to help. You can use str_extract_all(body,
pattern) to extract matches from the emails
Tuesday, 9 November 2010
phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}"
str_extract_all(body, phone3)
phone4 <- "[0-9][0-9+() -]{6,}[0-9]"
str_extract_all(body, phone4)
Tuesday, 9 November 2010
Regexp String Matches
. "." Any single character
. "." .
d "d" Any digit
s "s" Any white space
" """ "
 "" 
b "b" Word border
Tuesday, 9 November 2010
Your turn
Create a regular expression to match a
date. Test it against the following cases:
c("10/14/1979", "20/20/1945",
"1/1/1905", "5/5/5")
Create a regular expression to match a
time. Test it against the following cases:
c("12:30 pm", "2:15 AM", "312:23 pm",
"1:00 american", "08:20 am")
Tuesday, 9 November 2010
dates <- c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5")
str_detect(dates, "[1]?[0-9]/[1-3]?[0-9]/[0-9]{2,4}")
times <- c("12:30 pm", "2:15 AM", "312:23 pm",
"1:00 american", "08:20 am")
str_detect(times, "[01]?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]")
str_detect(times, "b1?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]b")
Tuesday, 9 November 2010
Your turn
Create a regular expression that matches
dates like "12 April 2008". Make sure your
expression works when the date is in a
sentence: "On 12 April 2008, I went to
town."
Extract just the month (hint: use str_split)
Tuesday, 9 November 2010
test <- c("15 April 2005", "28 February 1982",
"113 July 1984")
dates <- str_extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}",
test)
date <- dates[[1]]
str_split(date, " ")[[1]][2]
Tuesday, 9 November 2010
Regexp String Matches
[abc] "[abc]" a, b, or c
[a-c] "[a-c]" a, b, or c
[ac-] "[ac-"] a, c, or -
[.] "[.]" .
[.] "[.]" . or 
[ae-g.] "[ae-g.] a, e, f, g, or .
Tuesday, 9 November 2010
Regexp String Matches
[^abc] "[^abc]" Not a, b, or c
[^a-c] "[^a-c]" Not a, b, or c
[ac^] "[ac^]" a, c, or ^
Tuesday, 9 November 2010
Regexp String Matches
^a "^a" a at start of string
a$ "a$" a at end of string
^a$ "^a$" complete string =
a
$a "$a" $a
Tuesday, 9 November 2010
Your turn
Write a regular expression to match dollar
amounts. Compare the frequency of
mentioning money in spam and non-
spam emails.
Tuesday, 9 November 2010
money1 <- "$[0-9, ]+[0-9]"
str_extract_all(body, money1)
money2 <- "$[0-9, ]+[0-9](.[0-9]+)?"
str_extract_all(body, money2)
mean(str_detect(body[1:100], money2), na.rm = T)
mean(str_detect(body[101:200], money2), na.rm = T)
Tuesday, 9 November 2010
Function Parameters Result
str_detect string, pattern logical vector
str_locate string, pattern numeric matrix
str_extract string, pattern character vector
str_replace
string, pattern,
replacement
character vector
str_split_fixed string, pattern character matrix
Tuesday, 9 November 2010
Single Multiple
(output usually a list)
str_detect
str_locate str_locate_all
str_extract str_extract_all
str_replace str_replace_all
str_split_fixed str_split
Tuesday, 9 November 2010

Contenu connexe

Tendances

Flat Filer Presentation
Flat Filer PresentationFlat Filer Presentation
Flat Filer Presentation
alibby45
 

Tendances (7)

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, M...
 
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham MoreheadSimple fuzzy Name Matching in Elasticsearch - Graham Morehead
Simple fuzzy Name Matching in Elasticsearch - Graham Morehead
 
Simple fuzzy name matching in elasticsearch paris meetup
Simple fuzzy name matching in elasticsearch   paris meetupSimple fuzzy name matching in elasticsearch   paris meetup
Simple fuzzy name matching in elasticsearch paris meetup
 
Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3Introduction to R for Data Science :: Session 3
Introduction to R for Data Science :: Session 3
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 
Flat Filer Presentation
Flat Filer PresentationFlat Filer Presentation
Flat Filer Presentation
 
SQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting StartedSQL with PostgreSQL - Getting Started
SQL with PostgreSQL - Getting Started
 

En vedette (7)

27 development
27 development27 development
27 development
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
24 modelling
24 modelling24 modelling
24 modelling
 
27 development
27 development27 development
27 development
 
03 Conditional
03 Conditional03 Conditional
03 Conditional
 
02 Ddply
02 Ddply02 Ddply
02 Ddply
 
Math Studies conditional probability
Math Studies conditional probabilityMath Studies conditional probability
Math Studies conditional probability
 

Similaire à 22 spam

INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
carliotwaycave
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
Software Guru
 

Similaire à 22 spam (20)

08 functions
08 functions08 functions
08 functions
 
06 data
06 data06 data
06 data
 
24 Spam
24 Spam24 Spam
24 Spam
 
22 Spam
22 Spam22 Spam
22 Spam
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Programming with R in Big Data Analytics
Programming with R in Big Data AnalyticsProgramming with R in Big Data Analytics
Programming with R in Big Data Analytics
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
P3 2017 python_regexes
P3 2017 python_regexesP3 2017 python_regexes
P3 2017 python_regexes
 
04 reports
04 reports04 reports
04 reports
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
Lec09-CS110 Computational Engineering
Lec09-CS110 Computational EngineeringLec09-CS110 Computational Engineering
Lec09-CS110 Computational Engineering
 
Introduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics ResearchersIntroduction to R for Learning Analytics Researchers
Introduction to R for Learning Analytics Researchers
 
Ggplot2 v3
Ggplot2 v3Ggplot2 v3
Ggplot2 v3
 
04 Reports
04 Reports04 Reports
04 Reports
 
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
Introduction to R for Data Science :: Session 5 [Data Structuring: Strings in R]
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 
13 case-study
13 case-study13 case-study
13 case-study
 
Unit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptxUnit I - 1R introduction to R program.pptx
Unit I - 1R introduction to R program.pptx
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 

Plus de Hadley Wickham (18)

20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
03 extensions
03 extensions03 extensions
03 extensions
 
02 large
02 large02 large
02 large
 
01 intro
01 intro01 intro
01 intro
 
25 fin
25 fin25 fin
25 fin
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

22 spam

  • 2. 1. Recap 2. Regular expressions 3. Other string processing functions Tuesday, 9 November 2010
  • 3. Recap library(stringr) load("email.rdata") str(contents) # second 100 are spam # Extract headers and bodies breaks <- str_locate(contents, "nn") header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Tuesday, 9 November 2010
  • 4. # To parse into fields h <- header[1] lines <- str_split(h, "n")[[1]] continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_c(lines[groups == i], collapse = "n") } Tuesday, 9 November 2010
  • 5. We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings What if the pattern we need to match is more complicated? Recap Tuesday, 9 November 2010
  • 6. # What if we could just look for lines that # don't start with whitespace? break_pos <- str_locate_all(h, "n[^ t]")[[1]] break_pos[, 2] <- break_pos[, 1] field_pos <- invert_match(break_pos) str_sub(h, field_pos[, 1], field_pos[, 2]) Tuesday, 9 November 2010
  • 7. Matching challenges • How could we match a phone number? • How could we match a date? • How could we match a time? • How could we match an amount of money? • How could we match an email address? Tuesday, 9 November 2010
  • 8. Pattern matching Each of these types of data have a fairly regular pattern that we can easily pick out by eye Today we are going to learn about regular expressions, which are a pattern matching language Almost every programming language has regular expressions, often with a slightly different dialect. Very very useful Tuesday, 9 November 2010
  • 9. First challenge • Matching phone numbers • How are phone numbers normally written? • How can we describe this format? • How can we extract the phone numbers? Tuesday, 9 November 2010
  • 10. Nothing. It is a generic folder that has Enron Global Markets on the cover. It is the one that I sent you to match your insert to when you were designing. With the dots. I am on my way over for a meeting, I'll bring one. Juli Salvagio Manager, Marketing Communications Public Relations Enron-3641 1400 Smith Street Houston, TX 77002 713-345-2908-Phone 713-542-0103-Mobile 713-646-5800-Fax Tuesday, 9 November 2010
  • 11. Mark, Good speaking with you. I'll follow up when I get your email. Thanks, Rosanna Rosanna Migliaccio Vice President Robert Walters Associates (212) 704-9900 Fax: (212) 704-4312 mailto:rosanna@robertwalters.com http://www.robertwalters.com Tuesday, 9 November 2010
  • 12. Write down a description, in words, of the usual format of phone numbers. Your turn Tuesday, 9 November 2010
  • 13. Phone numbers • 10 digits, normally grouped 3-3-4 • Separated by space, - or () • How can we express that with a computer program? We’ll use regular expressions • [0-9]: matches any number between 0 and 9 • [- ()]: matches -, space, ( or ) Tuesday, 9 November 2010
  • 14. phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()] [0-9][0-9][0-9][0-9]" phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}" test <- body[10] cat(test) str_detect(test, phone2) str_locate(test, phone2) str_locate_all(test, phone2) str_extract(test, phone2) str_extract_all(test, phone2) Tuesday, 9 November 2010
  • 15. Qualifier ≥ ≤ ? 0 1 + 1 Inf * 0 Inf {m,n} m n {,n} 0 n {m,} m Inf Tuesday, 9 November 2010
  • 16. # What do these regular expression match? mystery1 <- "[0-9]{5}(-[0-9]{4})?" mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}" mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}" mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+ ([a-z0-9-]*[a-z0-9]+)?)+)*" # Think about them first, then input to # http://xenon.stanford.edu/~xusch/regexp/analyzer.html # (select java for language - it's closest to R for regexps) Tuesday, 9 November 2010
  • 17. New features • () group parts of a regular expression • . matches any character (. matches a .) • d matches a digit, s matches a space • Other characters that need to be escaped: $, ^ Tuesday, 9 November 2010
  • 19. Regular expression escapes Tricky, because we are writing strings, but the regular expression is the contents of the string. "a.b" represents the regular expression a .b, which matches a.b only "a.b" is an error "" represents the regular expression which matches Tuesday, 9 November 2010
  • 20. Your turn Improve our initial regular expression for matching a phone number. Hints: use the tools linked from the website to help. You can use str_extract_all(body, pattern) to extract matches from the emails Tuesday, 9 November 2010
  • 21. phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}" str_extract_all(body, phone3) phone4 <- "[0-9][0-9+() -]{6,}[0-9]" str_extract_all(body, phone4) Tuesday, 9 November 2010
  • 22. Regexp String Matches . "." Any single character . "." . d "d" Any digit s "s" Any white space " """ " "" b "b" Word border Tuesday, 9 November 2010
  • 23. Your turn Create a regular expression to match a date. Test it against the following cases: c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5") Create a regular expression to match a time. Test it against the following cases: c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") Tuesday, 9 November 2010
  • 24. dates <- c("10/14/1979", "20/20/1945", "1/1/1905", "5/5/5") str_detect(dates, "[1]?[0-9]/[1-3]?[0-9]/[0-9]{2,4}") times <- c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") str_detect(times, "[01]?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]") str_detect(times, "b1?[0-9]:[0-5][0-9] ([Aa]|[Pp])[Mm]b") Tuesday, 9 November 2010
  • 25. Your turn Create a regular expression that matches dates like "12 April 2008". Make sure your expression works when the date is in a sentence: "On 12 April 2008, I went to town." Extract just the month (hint: use str_split) Tuesday, 9 November 2010
  • 26. test <- c("15 April 2005", "28 February 1982", "113 July 1984") dates <- str_extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test) date <- dates[[1]] str_split(date, " ")[[1]][2] Tuesday, 9 November 2010
  • 27. Regexp String Matches [abc] "[abc]" a, b, or c [a-c] "[a-c]" a, b, or c [ac-] "[ac-"] a, c, or - [.] "[.]" . [.] "[.]" . or [ae-g.] "[ae-g.] a, e, f, g, or . Tuesday, 9 November 2010
  • 28. Regexp String Matches [^abc] "[^abc]" Not a, b, or c [^a-c] "[^a-c]" Not a, b, or c [ac^] "[ac^]" a, c, or ^ Tuesday, 9 November 2010
  • 29. Regexp String Matches ^a "^a" a at start of string a$ "a$" a at end of string ^a$ "^a$" complete string = a $a "$a" $a Tuesday, 9 November 2010
  • 30. Your turn Write a regular expression to match dollar amounts. Compare the frequency of mentioning money in spam and non- spam emails. Tuesday, 9 November 2010
  • 31. money1 <- "$[0-9, ]+[0-9]" str_extract_all(body, money1) money2 <- "$[0-9, ]+[0-9](.[0-9]+)?" str_extract_all(body, money2) mean(str_detect(body[1:100], money2), na.rm = T) mean(str_detect(body[101:200], money2), na.rm = T) Tuesday, 9 November 2010
  • 32. Function Parameters Result str_detect string, pattern logical vector str_locate string, pattern numeric matrix str_extract string, pattern character vector str_replace string, pattern, replacement character vector str_split_fixed string, pattern character matrix Tuesday, 9 November 2010
  • 33. Single Multiple (output usually a list) str_detect str_locate str_locate_all str_extract str_extract_all str_replace str_replace_all str_split_fixed str_split Tuesday, 9 November 2010