SlideShare a Scribd company logo
1 of 26
Download to read offline
Stat405                String processing


                                Hadley Wickham
Wednesday, 11 November 2009
1. Intro to email
              2. String basics
              3. Finding & splitting
              4. Next time:
                 Regular expressions



Wednesday, 11 November 2009
Motivation
                  Want to try and classify spam vs. ham
                  (non-spam) email.
                  Need to process character vectors
                  containing complete contents of email.
                  Eventually, will create variables that will be
                  useful for classification.
                  Same techniques very useful for data
                  cleaning.


Wednesday, 11 November 2009
Getting started
              load("email.rdata")
              str(contents) # second 100 are spam

              install.packages("stringr")
              # all functions starting with str_
              # come from this package
              help(package = "stringr")



Wednesday, 11 November 2009
Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com
        with Microsoft SMTPSVC(5.0.2195.2966);
              Mon, 1 Oct 2001 14:39:38 -0500
        x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
        content-class: urn:content-classes:message
        MIME-Version: 1.0
        Content-Type: text/plain;
        Content-Transfer-Encoding: binary
        Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Date: Mon, 1 Oct 2001 14:39:38 -0500
        Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
        X-MS-Has-Attach:
        X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
        Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
        X-Priority: 1
        Priority: Urgent
        Importance: high
        From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
        To: "Bailey, Susan" <Susan.Bailey@ENRON.com>,
             "Boyd, Samantha" <Samantha.Boyd@ENRON.com>,
             "Heard, Marie" <Marie.Heard@ENRON.com>,
             "Jones, Tana" <Tana.Jones@ENRON.com>,
             "Panus, Stephanie" <Stephanie.Panus@ENRON.com>
        Return-Path: Georgi.Landau@ENRON.com

        Please check your records and let me know if you have any kind of documentation evidencing a merger
        indicating that Mercator Energy Incorporated merged with and into Energy West Incorporated?

        I am unable to find anything to substantiate this information.

        Thanks for your help.


        Georgi Landau
        Enron Net Works


Wednesday, 11 November 2009
...
        Date: Sun, 11 Nov 2001 11:55:05 -0500
        Message-Id: <200503101247.j2ACloAq014654@host.high-host.com>
        To: benrobinson13@shaw.ca
        Subject: LETS DO THIS TOGTHER
        From: ben1 <ben_1_wills@yahoo.com.au>
        X-Priority: 3 (Normal)
        CC:
        Mime-Version: 1.0
        Content-Type: text/plain; charset=us-ascii
        Content-Transfer-Encoding: 7bit
        X-Mailer: RLSP Mailer



        Dear Friend,

        Firstly, not to caurse you embarrassment, I am Barrister Benson Wills, a Solicitor at law and the
        personal attorney to late Mr. Mark Michelle a National of France, who used to be a private
        contractor with the Shell Petroleum Development Company in Saudi Arabia, herein after shall be
        referred to as my client. On the 21st of April 2001, him and his wife with their three children were
        involved in an auto crash, all occupants of the vehicle unfortunately lost their lives.

        Since then, I have made several enquiries with his country's embassies to locate any of my clients
        extended relatives, this has also proved unsuccessful. After these several unsuccessful attempts, I
        decided to contact you with this business partnership proposal. I have contacted
        you to assist in repatriating a huge amount of money left behind by my client before they get
        confiscated or declared unserviceable by the Finance/Security Company where these huge deposit was
        lodged.

        The deceased had a deposit valued presently at £30milion Pounds Sterling and the Company has issued
        me a notice to provide his next of kin or Beneficiary by Will otherwise have the account confiscated
        within the next thirty official working days.

        Since I have been unsuccessful in locating the relatives for over two years now, I seek your consent
        to present you as the next of kin/Will Beneficiary to the deceased so that the proceeds of this


Wednesday, 11 November 2009
Structure of an email
                  Headers give metadata
                  Empty line
                  Body of email


                  Other major complication is attachments
                  and alternative content, but we’ll ignore
                  those for class


Wednesday, 11 November 2009
Tasks

                  • Split into header and contents
                  • Split header into fields
                  • Generate useful variables for
                    distinguishing between spam and non-
                    spam



Wednesday, 11 November 2009
String basics



Wednesday, 11 November 2009
# Special characters
        a <- ""
        b <- """
        c <- "anbnc"


        a
        cat(a, "n")
        b
        cat(b, "n")
        c
        cat(c, "n")




Wednesday, 11 November 2009
Special characters
                  • Use  to “escape” special characters
                        • " = "
                        • n = new line
                        •  = 
                        • t = tab
                  • ?Quotes for more


Wednesday, 11 November 2009
Displaying strings
                  print() will display quoted form.
                  cat() will display actual contents (but
                  need to add a newline to the end).
                  Generally better to use message() if you
                  want to send a message to the user of
                  your function



Wednesday, 11 November 2009
Your turn
                  Create a string for each of the following
                  strings:
                     :-
                     (^_^")
                     @_'-'
                     m/
                  Create a multiline string.
                  Compare the output from print and cat
Wednesday, 11 November 2009
a <- ":-"
        b <- "(^_^")"
        c <- "@_'-'"
        d <- "m/"
        e <- "This stringngoes overnmultiple lines"


        a; b; c; d; e
        cat(str_join(a, b, c, d, e, "n", sep = "n"))




Wednesday, 11 November 2009
Back to the problem



Wednesday, 11 November 2009
Header vs. content
                  Need to split the string into two pieces,
                  based on the the location of double line
                  break:
                  str_locate(string, pattern)
                  Splitting = creating two substrings, one to
                  the right, one to the left:
                  str_sub(string, start, end)


Wednesday, 11 November 2009
str_locate("great")
     str_locate("fantastic", "a")
     str_locate("super", "a")

     superlatives <- c("great", "fantastic", "super")
     res <- str_locate(superlatives, "a")
     str(res)
     str(str_locate_all(superlatives, "a"))

     str_sub("testing", 1, 3)
     str_sub("testing", start = 4)
     str_sub("testing", end = 4)

     input <- c("abc", "defg")
     str_sub(input, c(2, 3))
Wednesday, 11 November 2009
Your turn

                  Use sub_locate() to identify the location
                  where the double break is (make sure to
                  check a few!)
                  Split the emails into header and content
                  with str_sub()



Wednesday, 11 November 2009
breaks <- str_locate(contents, "nn")

     # Remove invalid emails
     valid <- !is.na(breaks[, "start"])
     contents <- contents[valid]
     breaks <- breaks[valid, ]

     # Extract headers and bodies
     header <- str_sub(contents, end = breaks[, 1])
     body <- str_sub(contents, start = breaks[, 2])




Wednesday, 11 November 2009
Headers

                  • Each header starts at the beginning of
                    a new line
                  • Each header is composed of a name
                    and contents, separated by a colon




Wednesday, 11 November 2009
h <- header[2]

     # Does this work?
     str_split(h, "n")[[1]]

     # Why / why not?
     # How could you fix the problem?




Wednesday, 11 November 2009
# Split & patch up

     lines <- str_split(h, "n")

     continued <- str_sub(lines, 1, 1) %in% c(" ", "t")

     # This is a neat trick!
     groups <- cumsum(!continued)

     fields <- rep(NA, max(groups))
     for (i in seq_along(fields)) {
       fields[i] <- str_join(lines[groups == i],
         collapse = "n")
     }

Wednesday, 11 November 2009
Your turn
                  Write a small function that given a single
                  header field splits it into name and
                  contents.
                  Do you want to use str_split(), or
                  str_locate() & str_sub()?
                  Remember to get the algorithm working
                  before you write the function


Wednesday, 11 November 2009
test1 <- "Sender: <Lighthouse@independent.org>"
     test2 <- "Subject: Alice: Where is my coffee?"

     f1 <- function(input) {
       str_split(input, ": ")[[1]]
     }

     f2 <- function(input) {
       colon <- str_locate(input, ": ")
       c(
          str_sub(input, end = colon[, 1] - 1),
          str_sub(input, start = colon[, 2] + 1)
       )
     }

Wednesday, 11 November 2009
Your turn


                  Write a function that given an email
                  returns a list content the body of the
                  email, and a data frame of headers.




Wednesday, 11 November 2009
Next time

                  We split the content into header and
                  body. And split up the header into fields.
                  Both of these tasks used fixed strings.
                  What if the pattern we need to match is
                  more complicated?




Wednesday, 11 November 2009

More Related Content

Viewers also liked

VWL Wochenbericht (07.04- 14.04.2011)
VWL Wochenbericht (07.04- 14.04.2011)VWL Wochenbericht (07.04- 14.04.2011)
VWL Wochenbericht (07.04- 14.04.2011)Niklas Obermeier
 
Bd p1 eq6_anteproyecto
Bd p1 eq6_anteproyectoBd p1 eq6_anteproyecto
Bd p1 eq6_anteproyectoMartha
 
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...mkt_refrisat
 

Viewers also liked (12)

IPCC SRREN - Prof. Tony Lewis - EPA/SEAI Future Energy Workshop June 2011
IPCC SRREN - Prof. Tony Lewis - EPA/SEAI Future Energy Workshop June 2011IPCC SRREN - Prof. Tony Lewis - EPA/SEAI Future Energy Workshop June 2011
IPCC SRREN - Prof. Tony Lewis - EPA/SEAI Future Energy Workshop June 2011
 
VWL Wochenbericht (07.04- 14.04.2011)
VWL Wochenbericht (07.04- 14.04.2011)VWL Wochenbericht (07.04- 14.04.2011)
VWL Wochenbericht (07.04- 14.04.2011)
 
How to use the public folder
How to use the public folderHow to use the public folder
How to use the public folder
 
Espinoza hernan 6_b_t10
Espinoza hernan 6_b_t10Espinoza hernan 6_b_t10
Espinoza hernan 6_b_t10
 
Apresentação Aula 3
Apresentação Aula 3Apresentação Aula 3
Apresentação Aula 3
 
Empathie
EmpathieEmpathie
Empathie
 
Ofimática
OfimáticaOfimática
Ofimática
 
280 287 (1)
280 287 (1)280 287 (1)
280 287 (1)
 
Bd p1 eq6_anteproyecto
Bd p1 eq6_anteproyectoBd p1 eq6_anteproyecto
Bd p1 eq6_anteproyecto
 
Pressistema
PressistemaPressistema
Pressistema
 
Proto-kart à 5 roues
Proto-kart à 5 rouesProto-kart à 5 roues
Proto-kart à 5 roues
 
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...
REFRISAT NEWS 219 - Boteco SAT - REFRISAT e seus equipamentos refrescando o s...
 

Similar to 22 Spam

Similar to 22 Spam (20)

24 Spam
24 Spam24 Spam
24 Spam
 
22 spam
22 spam22 spam
22 spam
 
21 Polishing
21 Polishing21 Polishing
21 Polishing
 
DB2 Native XML
DB2 Native XMLDB2 Native XML
DB2 Native XML
 
06 data
06 data06 data
06 data
 
21 spam
21 spam21 spam
21 spam
 
23 priority queue
23 priority queue23 priority queue
23 priority queue
 
12 Ddply
12 Ddply12 Ddply
12 Ddply
 
04 reports
04 reports04 reports
04 reports
 
Basic linux commands
Basic linux commandsBasic linux commands
Basic linux commands
 
Lnx
LnxLnx
Lnx
 
Parser combinators
Parser combinatorsParser combinators
Parser combinators
 
Regular Expressions in JavaScript and Command Line
Regular Expressions in JavaScript and Command LineRegular Expressions in JavaScript and Command Line
Regular Expressions in JavaScript and Command Line
 
Ver. 32013CSCI 111 – Programming and Algorithms IProj.docx
Ver. 32013CSCI 111 – Programming and Algorithms IProj.docxVer. 32013CSCI 111 – Programming and Algorithms IProj.docx
Ver. 32013CSCI 111 – Programming and Algorithms IProj.docx
 
04 Reports
04 Reports04 Reports
04 Reports
 
Tip: Data Scoring: Convert data with XQuery
Tip: Data Scoring: Convert data with XQueryTip: Data Scoring: Convert data with XQuery
Tip: Data Scoring: Convert data with XQuery
 
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
Write better python code with these 10 tricks | by yong cui, ph.d. | aug, 202...
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
Python_book.pdf
Python_book.pdfPython_book.pdf
Python_book.pdf
 
Python
PythonPython
Python
 

More from Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 

Recently uploaded

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

22 Spam

  • 1. Stat405 String processing Hadley Wickham Wednesday, 11 November 2009
  • 2. 1. Intro to email 2. String basics 3. Finding & splitting 4. Next time: Regular expressions Wednesday, 11 November 2009
  • 3. Motivation Want to try and classify spam vs. ham (non-spam) email. Need to process character vectors containing complete contents of email. Eventually, will create variables that will be useful for classification. Same techniques very useful for data cleaning. Wednesday, 11 November 2009
  • 4. Getting started load("email.rdata") str(contents) # second 100 are spam install.packages("stringr") # all functions starting with str_ # come from this package help(package = "stringr") Wednesday, 11 November 2009
  • 5. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500 x-mimeole: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; Content-Transfer-Encoding: binary Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Date: Mon, 1 Oct 2001 14:39:38 -0500 Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q== X-Priority: 1 Priority: Urgent Importance: high From: "Landau, Georgi" <Georgi.Landau@ENRON.com> To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com> Return-Path: Georgi.Landau@ENRON.com Please check your records and let me know if you have any kind of documentation evidencing a merger indicating that Mercator Energy Incorporated merged with and into Energy West Incorporated? I am unable to find anything to substantiate this information. Thanks for your help. Georgi Landau Enron Net Works Wednesday, 11 November 2009
  • 6. ... Date: Sun, 11 Nov 2001 11:55:05 -0500 Message-Id: <200503101247.j2ACloAq014654@host.high-host.com> To: benrobinson13@shaw.ca Subject: LETS DO THIS TOGTHER From: ben1 <ben_1_wills@yahoo.com.au> X-Priority: 3 (Normal) CC: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: RLSP Mailer Dear Friend, Firstly, not to caurse you embarrassment, I am Barrister Benson Wills, a Solicitor at law and the personal attorney to late Mr. Mark Michelle a National of France, who used to be a private contractor with the Shell Petroleum Development Company in Saudi Arabia, herein after shall be referred to as my client. On the 21st of April 2001, him and his wife with their three children were involved in an auto crash, all occupants of the vehicle unfortunately lost their lives. Since then, I have made several enquiries with his country's embassies to locate any of my clients extended relatives, this has also proved unsuccessful. After these several unsuccessful attempts, I decided to contact you with this business partnership proposal. I have contacted you to assist in repatriating a huge amount of money left behind by my client before they get confiscated or declared unserviceable by the Finance/Security Company where these huge deposit was lodged. The deceased had a deposit valued presently at £30milion Pounds Sterling and the Company has issued me a notice to provide his next of kin or Beneficiary by Will otherwise have the account confiscated within the next thirty official working days. Since I have been unsuccessful in locating the relatives for over two years now, I seek your consent to present you as the next of kin/Will Beneficiary to the deceased so that the proceeds of this Wednesday, 11 November 2009
  • 7. Structure of an email Headers give metadata Empty line Body of email Other major complication is attachments and alternative content, but we’ll ignore those for class Wednesday, 11 November 2009
  • 8. Tasks • Split into header and contents • Split header into fields • Generate useful variables for distinguishing between spam and non- spam Wednesday, 11 November 2009
  • 10. # Special characters a <- "" b <- """ c <- "anbnc" a cat(a, "n") b cat(b, "n") c cat(c, "n") Wednesday, 11 November 2009
  • 11. Special characters • Use to “escape” special characters • " = " • n = new line • = • t = tab • ?Quotes for more Wednesday, 11 November 2009
  • 12. Displaying strings print() will display quoted form. cat() will display actual contents (but need to add a newline to the end). Generally better to use message() if you want to send a message to the user of your function Wednesday, 11 November 2009
  • 13. Your turn Create a string for each of the following strings: :- (^_^") @_'-' m/ Create a multiline string. Compare the output from print and cat Wednesday, 11 November 2009
  • 14. a <- ":-" b <- "(^_^")" c <- "@_'-'" d <- "m/" e <- "This stringngoes overnmultiple lines" a; b; c; d; e cat(str_join(a, b, c, d, e, "n", sep = "n")) Wednesday, 11 November 2009
  • 15. Back to the problem Wednesday, 11 November 2009
  • 16. Header vs. content Need to split the string into two pieces, based on the the location of double line break: str_locate(string, pattern) Splitting = creating two substrings, one to the right, one to the left: str_sub(string, start, end) Wednesday, 11 November 2009
  • 17. str_locate("great") str_locate("fantastic", "a") str_locate("super", "a") superlatives <- c("great", "fantastic", "super") res <- str_locate(superlatives, "a") str(res) str(str_locate_all(superlatives, "a")) str_sub("testing", 1, 3) str_sub("testing", start = 4) str_sub("testing", end = 4) input <- c("abc", "defg") str_sub(input, c(2, 3)) Wednesday, 11 November 2009
  • 18. Your turn Use sub_locate() to identify the location where the double break is (make sure to check a few!) Split the emails into header and content with str_sub() Wednesday, 11 November 2009
  • 19. breaks <- str_locate(contents, "nn") # Remove invalid emails valid <- !is.na(breaks[, "start"]) contents <- contents[valid] breaks <- breaks[valid, ] # Extract headers and bodies header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Wednesday, 11 November 2009
  • 20. Headers • Each header starts at the beginning of a new line • Each header is composed of a name and contents, separated by a colon Wednesday, 11 November 2009
  • 21. h <- header[2] # Does this work? str_split(h, "n")[[1]] # Why / why not? # How could you fix the problem? Wednesday, 11 November 2009
  • 22. # Split & patch up lines <- str_split(h, "n") continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_join(lines[groups == i], collapse = "n") } Wednesday, 11 November 2009
  • 23. Your turn Write a small function that given a single header field splits it into name and contents. Do you want to use str_split(), or str_locate() & str_sub()? Remember to get the algorithm working before you write the function Wednesday, 11 November 2009
  • 24. test1 <- "Sender: <Lighthouse@independent.org>" test2 <- "Subject: Alice: Where is my coffee?" f1 <- function(input) { str_split(input, ": ")[[1]] } f2 <- function(input) { colon <- str_locate(input, ": ") c( str_sub(input, end = colon[, 1] - 1), str_sub(input, start = colon[, 2] + 1) ) } Wednesday, 11 November 2009
  • 25. Your turn Write a function that given an email returns a list content the body of the email, and a data frame of headers. Wednesday, 11 November 2009
  • 26. Next time We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings. What if the pattern we need to match is more complicated? Wednesday, 11 November 2009