Regexp master 2011
1. Parsing a File with Perl
Regexp, substr and oneliners
Bioinformatics master course, ‘11/’12 Paolo Marcatili
2. Agenda
Today we will see how to
• Extract information from a file
• Use substr and regular expressions (regexp)
We already know how to use:
• Scalar variables $ and arrays @
• If, for, while, open, print, close…
4. Protein Structures
1st task:
• Open a PDB file
• Apply a symmetry transformation
• Extract data from the file header
5. Zinc Finger
2nd task:
• Open a fasta file
• Find all occurrences of Zinc Fingers
(homework?)
7. Rationale
Biological data -> human readable files
If you can read it, Perl can read it as well
*BUT*
It can be tricky
8. Parsing flow-chart
Open the file
For each line{
look for “grammar”
and store data
}
Close file
Use data
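The flow-chart above can be sketched in Perl as follows. This is only a minimal sketch: the input text is a made-up example held in a string (read through an in-memory filehandle, so no file on disk is needed), and the "grammar" is assumed to be "the line starts with ATOM".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: one header line and two ATOM lines, kept in a string
my $text = "HEADER    TEST\n"
         . "ATOM      4  O   ASP L   1\n"
         . "ATOM      5  N   ILE L   2\n";

# Open the file (here: an in-memory filehandle reading from $text)
open(my $in, '<', \$text) or die "cannot open input: $!";

my @atom_lines;                            # the data we store
while (my $line = <$in>) {                 # for each line
    chomp $line;
    if (substr($line, 0, 4) eq "ATOM") {   # look for the "grammar"
        push @atom_lines, $line;           # and store data
    }
}

close $in;                                 # close the file

# Use the data
print scalar(@atom_lines), " ATOM lines found\n";
```

With a real file, only the open line changes, e.g. open(my $in, '<', "IG.pdb").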
10. Substr
substr($data, start, length)
returns a substring from the expression supplied as first
argument.
11. Substr
substr($data, start, length)
        ^      ^      ^
        |      |      you can omit this
        |      |      (you will extract up to the end of the string)
        |      start from 0
        your string
12. Substr
substr($data, start, length)
Examples:
my $data = "il mattino ha l'oro in bocca";
print substr($data,0) . "\n";    # prints the whole string
print substr($data,3,5) . "\n";  # prints matti
print substr($data,23) . "\n";   # prints bocca
print substr($data,-5) . "\n";   # prints bocca (a negative start counts from the end)
14. PDB
ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…
COLUMNS   DATA TYPE      FIELD     DEFINITION
-------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial    Atom serial number.
13 - 16   Atom           name      Atom name.
17        Character      altLoc    Alternate location indicator.
18 - 20   Residue name   resName   Residue name.
22        Character      chainID   Chain identifier.
23 - 26   Integer        resSeq    Residue sequence number.
27        AChar          iCode     Code for insertion of residues.
31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
55 - 80   Bla Bla Bla (not useful for our purposes)
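As an illustration of how the column table maps onto substr, here is a small sketch. The columns in the table are counted from 1, while substr counts from 0, so the field in columns a - b is substr($line, a-1, b-a+1). The helper name field is made up for this example; the ATOM line is the first one shown above, with its spacing assumed to follow the table.

```perl
use strict;
use warnings;

# First ATOM line from the slide (fixed-width columns)
my $line = "ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O";

# Hypothetical helper: extract columns $first..$last (1-based, inclusive)
sub field {
    my ($record, $first, $last) = @_;
    my $value = substr($record, $first - 1, $last - $first + 1);
    $value =~ s/^\s+//;    # remove padding spaces on the left
    $value =~ s/\s+$//;    # remove padding spaces on the right
    return $value;
}

my $serial  = field($line,  7, 11);
my $name    = field($line, 13, 16);
my $resName = field($line, 18, 20);
my $chainID = field($line, 22, 22);
my $resSeq  = field($line, 23, 26);
my $x       = field($line, 31, 38);
my $y       = field($line, 39, 46);
my $z       = field($line, 47, 54);

print "atom $serial ($name) of $resName $resSeq, chain $chainID: $x $y $z\n";
```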
15. Symmetry
X->Z
Y->X
Z->Y
(each coordinate is replaced by the one after the arrow:
new X = old Z, new Y = old X, new Z = old Y)
16. Rotation
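#!/usr/bin/perl -w
use strict;
# Read IG.pdb and write a copy with the coordinates permuted
open(IG, "<IG.pdb") || die "cannot open IG.pdb:$!";
open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb:$!";
while (my $line=<IG>){
    if (substr($line,0,4) eq "ATOM"){
        # x, y, z are in columns 31-38, 39-46, 47-54 (offsets 30, 38, 46)
        my $X= substr($line,30,8);
        my $Y= substr($line,38,8);
        my $Z= substr($line,46,8);
        # new x = old z, new y = old x, new z = old y; the rest of the line is unchanged
        print IGR substr($line,0,30).$Z.$X.$Y.substr($line,54);
    }
    else{
        # every other record is copied as it is
        print IGR $line;
    }
}
close IG;
close IGR;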
19. Regular Expressions
PDB files have a “fixed” structure.
What if we want to do something less rigid, like
“check for a valid email address”?
1. There must be some letters or numbers
2. There must be a @
3. Then some other letters
4. Then .something
paolo.marcatili@gmail.com is good
paolo.marcatili@.com is not good
20. Regular Expressions
$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/
WHAAAT???
This means:
Check if $line has some letters, digits, dots or underscores at the
beginning, then @, then some characters that are not dots, then a dot,
then at least two letters, and nothing else
….
Ok, let’s start from something simpler :)
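To see the pattern at work, here is a short sketch that tries it on the two addresses from the previous slide. The sub name is_valid_email is made up for this example, and the @ is written as \@ only as a precaution against array interpolation. This is the simplified check described in the slides, not a complete email validator.

```perl
use strict;
use warnings;

# The pattern from the slide, compiled once with qr//
my $email_re = qr/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/;

# Hypothetical helper: returns 1 if the address matches, 0 otherwise
sub is_valid_email {
    my ($address) = @_;
    return ($address =~ $email_re) ? 1 : 0;
}

for my $address ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com') {
    if (is_valid_email($address)) {
        print "$address is good\n";
    }
    else {
        print "$address is not good\n";
    }
}
```

The second address fails because [^.]+ needs at least one character that is not a dot right after the @.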
22. Regular Expressions
$line =~ m/^ATOM/
Line starts with ATOM
$line =~ m/^ATOM\s+/
Line starts with ATOM, then there are some spaces
$line =~ m/^ATOM\s+[-0-9]+/
Line starts with ATOM, then there are some spaces, then there are some digits
or -
$line =~ m/^ATOM\s+-?[0-9]+/
Line starts with ATOM, then there are some spaces, then there can be a
minus, then some digits
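The difference between the four patterns is easier to see on a few test lines. The sketch below prints, for each line, one flag per pattern (1 = match, 0 = no match). The test lines and the sub name match_flags are made up for this example.

```perl
use strict;
use warnings;

# The four patterns above, from the loosest to the strictest
my @patterns = (
    qr/^ATOM/,
    qr/^ATOM\s+/,
    qr/^ATOM\s+[-0-9]+/,
    qr/^ATOM\s+-?[0-9]+/,
);

# Hypothetical helper: one character per pattern, 1 if $line matches it
sub match_flags {
    my ($line) = @_;
    my $flags = "";
    for my $re (@patterns) {
        $flags .= ($line =~ $re) ? "1" : "0";
    }
    return $flags;
}

# Hypothetical test lines
my @lines = (
    "ATOM      4  O   ASP L   1",
    "ATOMS",
    "ATOM   N",
    "ATOM   --7",
    "REMARK ATOM",
);

for my $line (@lines) {
    printf "%-28s %s\n", $line, match_flags($line);
}
```

Note that "ATOM   --7" passes the third pattern but not the fourth: [-0-9]+ accepts any mix of minus signs and digits, while -?[0-9]+ allows at most one minus, and only before the digits.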
25. PDB Header
We want to find the %id for the L and H chains
$pidL= $1 if ($line=~m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
$pidH= $1 if ($line=~m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);
ONELINER!!
cat IG.pdb | perl -ne 'print "$1\n"
if ($_=~m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
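The two capture lines can be tried out without a PDB file, as in the sketch below. The header lines and their values (87.5 and 91.2) are invented for this example; only the "REMARK SUMMARY-ID_GLOB_L:" and "REMARK SUMMARY-ID_GLOB_H:" prefixes come from the patterns above.

```perl
use strict;
use warnings;

# Hypothetical header lines: the numeric values are made up
my @header = (
    "REMARK SUMMARY-ID_GLOB_L:87.5",
    "REMARK SUMMARY-ID_GLOB_H:91.2",
    "ATOM      4  O   ASP L   1",
);

my ($pidL, $pidH);
for my $line (@header) {
    # $1 holds whatever the parentheses matched: digits and dots
    $pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
    $pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);
}

print "L chain: $pidL\n";
print "H chain: $pidH\n";
```

The + after the character class matters: without it, only the first character of the number is captured.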
27. Zinc Finger
Zinc fingers are a large superfamily of protein
domains that can bind to DNA.
A zinc finger consists of two antiparallel β
strands and an α helix.
The zinc ion is crucial for the stability of this
domain type: in the absence of the metal
ion the domain unfolds, as it is too small to
have a hydrophobic core.
The consensus sequence of a single finger is:
C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
29. Homework
Find all occurrences of the ZF motif in zincfinger.fasta
Put them in the file ZF_motif.fasta
e.g.
weofjpihouwefghoicalcvgnfglapglifhtylhyuiui
calcvgnfglapglifhtylh
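As a starting point for the homework, the consensus sequence can be turned into a regular expression: X becomes . (any character), and X{2-4} becomes .{2,4}. The sketch below is one possible translation, not the only one. It uses /i because the example sequence is in lower case, and it works on the example string only; reading zincfinger.fasta and writing ZF_motif.fasta are left out.

```perl
use strict;
use warnings;

# C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H written as a Perl regexp
my $zf_re = qr/C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H/i;

# Example sequence from the slide
my $seq = "weofjpihouwefghoicalcvgnfglapglifhtylhyuiui";

# Collect every (non-overlapping) occurrence of the motif
my @hits;
while ($seq =~ m/($zf_re)/g) {
    push @hits, $1;
}

print "$_\n" for @hits;
```

With /g in a while loop, each new search starts where the previous match ended, so overlapping motifs would be missed; that is an assumption to revisit if the real sequences require it.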