Parsing a File with Perl

                Regexp, substr and oneliners




Bioinformatics master course, ‘11/’12   Paolo Marcatili
Agenda
 Today we will see how to
 • Extract information from a file
 • Substr and regexp

 We already know how to use:
 • Scalar variables $ and arrays @
 • If, for, while, open, print, close…

Task Today




Protein Structures
 1st task:
 • Open a PDB file
 • Apply a symmetry transformation
 • Extract data from file header




Zinc Finger
 2nd task:
 • Open a fasta file
 • Find all occurrences of Zinc Fingers

     (homework?)




Parsing




Rationale
 Biological data -> human readable files

 If you can read it, Perl can read it as well
 *BUT*
 It can be tricky




Parsing flow-chart
 Open the file
 For each line{
     look for “grammar”
     and store data
 }
 Close file
 Use data
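
The flow chart above can be sketched in Perl roughly as follows. This is only an illustration, not code from the original slides: the in-memory text in `$text`, the choice of "grammar" (lines that start with `ATOM`) and the array `@atoms` are all assumptions made for the example. With a real file you would pass its name to `open` instead of a reference to a string.

```perl
use strict;
use warnings;

# Hypothetical input: an in-memory string stands in for a real file.
my $text = "HEADER  example\nATOM  first\nATOM  second\nEND\n";

# Open the "file" (a reference to a string gives an in-memory filehandle).
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my @atoms;
# For each line: look for the "grammar" and store the data.
while (my $line = <$fh>) {
    chomp $line;
    if (substr($line, 0, 4) eq 'ATOM') {
        push @atoms, $line;
    }
}

# Close the file, then use the data.
close $fh;
print scalar(@atoms), " ATOM lines found\n";
```

Storing the data first and using it only after the file is closed keeps the parsing step separate from the analysis step.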




Substr




Substr
 substr($data, start, length)
 returns a substring from the expression supplied as first
    argument.




Substr
 substr($data, start, length)
          ^      ^      ^
          |      |      you can omit this
          |      |      (you will extract up to the end of the string)
          |      start from 0
          your string




Substr
 substr($data, start, length)
 Examples:

 my $data = "il mattino ha l'oro in bocca";
 print substr($data,0) . "\n";   # prints the whole string
 print substr($data,3,5) . "\n"; # prints matti
 print substr($data,23) . "\n";  # prints bocca
 print substr($data,-5) . "\n";  # prints bocca




Pdb rotation




PDB
   ATOM     4   O   ASP L   1   43.716 -12.235   68.502   1.00 70.05        O
   ATOM     5   N   ILE L   2   44.679 -10.569   69.673   1.00 48.19        N
   …




   COLUMNS    DATA TYPE      FIELD      DEFINITION
   ---------------------------------------------------------------------------
    1 -  6    Record name    "ATOM  "
    7 - 11    Integer        serial     Atom serial number.
   13 - 16    Atom           name       Atom name.
   17         Character      altLoc     Alternate location indicator.
   18 - 20    Residue name   resName    Residue name.
   22         Character      chainID    Chain identifier.
   23 - 26    Integer        resSeq     Residue sequence number.
   27         AChar          iCode      Code for insertion of residues.
   31 - 38    Real(8.3)      x          Orthogonal coordinates for X in Angstroms
   39 - 46    Real(8.3)      y          Orthogonal coordinates for Y in Angstroms
   47 - 54    Real(8.3)      z          Orthogonal coordinates for Z in Angstroms
   55 - 80    Bla Bla Bla (not useful for our purposes)
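
As a rough sketch of how this column table translates into substr calls: the table counts columns from 1 while substr counts from 0, so column N sits at offset N-1. The record below is built with sprintf so that every field is guaranteed to land in its column. The format string and the hash `%atom` are assumptions made for this example, and the values are those of the first ATOM line shown above.

```perl
use strict;
use warnings;

# Build a column-accurate ATOM record (layout assumed from the column table).
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f",
    'ATOM', 4, ' O', ' ', 'ASP', 'L', 1, ' ', 43.716, -12.235, 68.502);

# substr offsets start from 0, so column N of the table is offset N-1.
my %atom = (
    serial  => substr($line,  6, 5),   # columns  7-11
    name    => substr($line, 12, 4),   # columns 13-16
    resName => substr($line, 17, 3),   # columns 18-20
    chainID => substr($line, 21, 1),   # column  22
    resSeq  => substr($line, 22, 4),   # columns 23-26
    x       => substr($line, 30, 8),   # columns 31-38
    y       => substr($line, 38, 8),   # columns 39-46
    z       => substr($line, 46, 8),   # columns 47-54
);

# Fixed-width fields are padded with spaces: strip them from every value.
s/^\s+|\s+$//g for values %atom;

print "$atom{resName} $atom{chainID}$atom{resSeq}: ($atom{x}, $atom{y}, $atom{z})\n";
```

Splitting on whitespace would look simpler, but it breaks as soon as two fields touch (for example, a coordinate of -100.000 directly after another number), which is why fixed columns are used here.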


Symmetry
   X->Z
   Y->X
   Z->Y

   [Figure: coordinate axes X and Y]

Rotation
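   #!/usr/bin/perl
   use strict;
   use warnings;

   # Swap the coordinate columns of every ATOM record: X<-Z, Y<-X, Z<-Y.
   open(my $in,  '<', 'IG.pdb')         or die "cannot open IG.pdb: $!";
   open(my $out, '>', 'IG_rotated.pdb') or die "cannot open IG_rotated.pdb: $!";
   while (my $line = <$in>) {
       if (substr($line, 0, 4) eq 'ATOM') {
           my $X = substr($line, 30, 8);   # columns 31-38
           my $Y = substr($line, 38, 8);   # columns 39-46
           my $Z = substr($line, 46, 8);   # columns 47-54
           # Keep columns 1-30, write Z, X, Y into the x, y, z fields, keep the rest.
           print $out substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
       }
       else {
           # Every other record is copied unchanged.
           print $out $line;
       }
   }
   close $in;
   close $out;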




RegExp




Regular Expressions
 PDB files have a “fixed” structure.

 What if we want to do something like
 “check for a valid email address”?
 1. There must be some letters or numbers
 2. There must be a @
 3. Other letters
 4. .something
 paolo.marcatili@gmail.com is good

 paolo.marcatili@.com is not good


Regular Expressions
 $line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/

 WHAAAT???

 This means:
 Check if $line has some letters, digits, dots or underscores at the beginning, then @, then
 some non-dots, then a dot, then at least two letters
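
A small sketch that tries this kind of pattern on the two addresses given earlier. The pattern is written with a plain character class, `[a-z0-9._]`, and with the dot and the @ escaped. The hash `%is_valid` is a name chosen for the example. This is a toy check for teaching purposes, not a complete validator for real email addresses.

```perl
use strict;
use warnings;

# The two example addresses from the slides.
my @candidates = ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com');

my %is_valid;
for my $line (@candidates) {
    # Some letters/digits/dots/underscores, then @ (escaped so Perl never
    # treats it as an array), then some non-dots, a literal dot,
    # and at least two letters up to the end of the string.
    $is_valid{$line} = ($line =~ m/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/) ? 1 : 0;
    print "$line is ", ($is_valid{$line} ? "good" : "not good"), "\n";
}
```

Without the backslash, `.` means "any character", so an unescaped dot would accept far more than intended.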

 ….
 Ok, let’s start from something simpler :)




Regular Expressions
 $line =~ m/^ATOM/
 Line starts with ATOM

 $line =~ m/^ATOM\s+/
 Line starts with ATOM, then there are some spaces

 $line =~ m/^ATOM\s+[-0-9]+/
 Line starts with ATOM, then there are some spaces, then there are some digits
       or -

 $line =~ m/^ATOM\s+-?[0-9]+/
 Line starts with ATOM, then there are some spaces, then there can be a
       minus, then some digits
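
To see the difference between the simplest and the strictest of these patterns, here is a minimal sketch that applies both to three test lines, using the whitespace class `\s`. The test lines and the array `@results` are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical test lines: a normal ATOM record, a non-ATOM record,
# and the bare word ATOM with nothing after it.
my @lines = ('ATOM      4  O   ASP L   1', 'HETATM    9 ZN    ZN A 100', 'ATOM');

my @results;
for my $line (@lines) {
    my $starts_with_atom = ($line =~ m/^ATOM/)            ? 1 : 0;
    my $atom_then_number = ($line =~ m/^ATOM\s+-?[0-9]+/) ? 1 : 0;
    push @results, "$starts_with_atom$atom_then_number";
    print "'$line': starts with ATOM=$starts_with_atom, ATOM + number=$atom_then_number\n";
}
```

The bare `ATOM` line passes the first test but fails the second, because `\s+` requires at least one space and `[0-9]+` at least one digit.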




PDB Header
    We want to find the %id for the L and H chains


    $pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
    $pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);

    ONELINER!!


    cat IG.pdb | perl -ne 'print "$1\n"
    if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
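
The same idea as a small script, as a sketch only: both chains are handled by one pattern with two capture groups, and the values are stored in a hash. The header lines below are hypothetical (the numbers are invented for illustration), and `%pid` is a name chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical header lines: the numbers are invented for illustration.
my @header = (
    'REMARK SUMMARY-ID_GLOB_L:93.5',
    'REMARK SUMMARY-ID_GLOB_H:87.2',
    'ATOM      4  O   ASP L   1',
);

my %pid;
for my $line (@header) {
    # $1 captures the chain letter (L or H), $2 the number after the colon.
    if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
        $pid{$1} = $2;
    }
}
print "L: $pid{L}  H: $pid{H}\n";
```

The `+` after the character class matters: without it the capture group would hold only the first character of the number.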




Zinc Finger




Zinc Finger
   Zinc fingers are a large superfamily of protein
      domains that can bind to DNA.

   A zinc finger consists of two antiparallel β
      strands and an α helix.
   The zinc ion is crucial for the stability of this
      domain type: in the absence of the metal
      ion the domain unfolds, as it is too small to
      have a hydrophobic core.
   The consensus sequence of a single finger is:

   C-X{2,4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H


Homework
   Find all occurrences of the ZF motif in zincfinger.fasta

   Put them in the file ZF_motif.fasta

   e.g.
   weofjpihouwefghoicalcvgnfglapglifhtylhyuiui

   calcvgnfglapglifhtylh
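
One possible starting point, checked only against the example sequence above: the consensus is turned into a regular expression by writing X (any residue) as `.` and the range of 2 to 4 residues as the quantifier `{2,4}`. The variable names are assumptions for the example, and reading zincfinger.fasta and writing ZF_motif.fasta are deliberately left out.

```perl
use strict;
use warnings;

# The example sequence from the homework slide.
my $seq = 'weofjpihouwefghoicalcvgnfglapglifhtylhyuiui';

# /i makes the match case-insensitive; /g in list context returns
# every non-overlapping match of the capture group.
my @motifs = $seq =~ m/(c.{2,4}c.{3}[livmfywc].{8}h.{3}h)/gi;

print "$_\n" for @motifs;
```

Note that `/g` skips occurrences that overlap a previous match. If overlapping fingers matter, the pattern would need a different approach, such as a lookahead.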



