1. BioPerl and
Comparative Genomics
• BioPerl - toolkit for Bioinformatics data
• Fungal comparative genomics
2. Overview of Toolkit
Bioperl is...
A Set of Perl modules for manipulating
genomic and other biological data
An Open Source Toolkit with many
contributors
A flexible and extensible system for doing
bioinformatics data manipulation
3. Some things you can do
Read in sequence data from a file in standard
formats (FASTA, GenBank, EMBL,
SwissProt,...)
Manipulate sequences, reverse complement,
translate coding DNA sequence to protein.
Parse a BLAST report, get access to every
bit of data in the report
4. Major Domains covered
in Bioperl
Sequences, Features, Annotations,
Pairwise alignment reports
Multiple Sequence Alignments
Bibliographic data
Graphical Rendering of sequence tracks
Database for features and sequences
5. Additional things
Gene prediction parsers
Trees, Parsing Phylogenetic and Molecular
Evolution software output
Population Genetic data and summary
statistics
Taxonomy
Protein Structure
6. Practical Examples
Manipulate a DNA or Protein Sequence
Read and write different Sequence formats
Extract sequence annotations and features
Parse a BLAST report
7. Orienting to the code
http://cvs.open-bio.org
bioperl-live - Core packages
bioperl-run - for running applications
bioperl-ext - C language extension
bioperl-db - bioperl BioSQL implementation
bioperl-pedigree, bioperl-microarray are
side-projects
9. Anatomy of a Bioperl
Module
perldoc Module -- perldoc Bio::SeqIO
SYNOPSIS -- runnable code
DESCRIPTION -- summary about the module
Each module will have annotated functions
11. Sequences and Features
TSS Feature
PolyA site
Exon Features Sequence
Genomic Sequence with
3 exons
1 Transcript Start Site (TSS)
1 Poly-A Site
12. Sequences, Features, and
Annotations
Sequence - DNA, RNA, AA
Feature container
Feature - Information with a Sequence
Location
Annotation - Information without explicit
Sequence location
14. Look at the Sequence
object
Common (Bio::PrimarySeq) methods
seq() - get the sequence as a string
length() - get the sequence length
subseq($s,$e) - get a subseqeunce
translate(...) - translate to protein [DNA]
revcom() - reverse complement [DNA]
display_id() - identifier string
description() - description string
15. Using a Sequence
my $str = “ATGAATGATGAA”;
my $seq = Bio::PrimarySeq->new(-seq => $str,
-display_id=”example”);
print “id is “, $seq-display_id,”
”;
print $seq-seq, “
”;
my $revcom = $seq-revcom;
print $revcom-seq, “
”;
print “frame1=”,$seq-translate-seq,“
”;
id is example
ATGAATGATGAA
TTCATCATTCAT
trans frame1=MNDE
16. Sequence Features
Bio::SeqFeatureI - interface - GFF derived
start(), end(), strand() for location information
location() - Bio::LocationI object (to represent
complex locations)
score,frame,primary_tag, source_tag - feature
information
spliced_seq() - for attached sequence, get the
sequence spliced.
19. Reading and Writing
Sequences
Bio::SeqIO
fasta, genbank, embl, swissprot,...
Takes care of writing out associated features
and annotations
Two functions
next_seq (reading sequences)
write_seq (writing sequences)
20. Writing a Sequence
use Bio::SeqIO;
# Let’s convert swissprot to fasta format
my $in = Bio::SeqIO-new(-format = ‘swiss’,
-file = ‘file.sp’);
my $out = Bio::SeqIO-new(-format = ‘fasta’,
-file = ‘file.fa’);`
while( my $seq = $in-next_seq ) {
$out-write_seq($seq);
}
22. A Detailed look at
BLAST parsing
3 Components
Result: Bio::Search::Result::ResultI
Hit: Bio::Search::Hit::HitI
HSP: Bio::Search::HSP::HSPI
23. use Bio::SearchIO;
my $cutoff = ’0.001’;
my $file = ‘BOSS_Ce.BLASTP’,
my $in = new Bio::SearchIO(-format = ‘blast’,
-file = $file);
while( my $r = $in-next_result ) {
print Query is: , $r-query_name, ,
$r-query_description, ,$r-query_length, aa
;
print Matrix was , $r-get_parameter(’matrix’),
;
while( my $h = $r-next_hit ) {
last if $h-significance $cutoff;
print Hit is , $h-name,
;
while( my $hsp = $h-next_hsp ) {
print HSP Len is , $hsp-length(’total’), ,
E-value is , $hsp-evalue, Bit score ,
$hsp-score,
,
Query loc: ,$hsp-query-start, ,
$hsp-query-end, ,
Sbject loc: ,$hsp-hit-start, ,
$hsp-hit-end,
;
}
}
}
24. BLAST Report
Copyright (C) 1996-2000 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (1996-2000) http://blast.wustl.edu
Query= BOSS_DROME Bride of sevenless protein precursor.
(896 letters)
Database: wormpep87
20,881 sequences; 9,238,759 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
Smallest
Sum
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
F35H10.10 CE24945 status:Partially_confirmed TR:Q20073... 182 4.9e-11 1
M02H5.2 CE25951 status:Predicted TR:Q966H5 protein_id:... 86 0.15 1
ZC506.4 CE01682 locus:mgl-1 metatrophic glutamate recept... 91 0.18 1
F23D12.2 CE05700 status:Partially_confirmed TR:Q19761 ... 73 0.45 3
F35H10.10 CE24945 status:Partially_confirmed TR:Q20073
protein_id:AAA81683.2
Length = 1404
Score = 182 (69.1 bits), Expect = 4.9e-11, P = 4.9e-11
Identities = 75/315 (23%), Positives = 149/315 (47%)
Query: 511 YPFLFDGESVMFWRIKMDTWVATGLTAAILGLIATLAILVFIVVRISLGDVFEGNPTTSI 570
Y +F+ + WR +V L ++ + +A+LV ++V++ L V +GN + I
Sbjct: 1006 YQSVFEHITTGHWRDHPHNYVLLALITVLV--VVAIAVLVLVLVKLYLR-VVKGNQSLGI 1062
25. BLAST Script Results
Query is: BOSS_DROME Bride of sevenless protein
precursor. 896 aa
Matrix was BLOSUM62
Hit is F35H10.10
HSP Len is 315 E-value is 4.9e-11 Bit score 182
Query loc: 511 813 Sbject loc: 1006 1298
HSP Len is 28 E-value is 1.4e-09 Bit score 39
Query loc: 508 535 Sbject loc: 427 454
26. FASTA Parsing Script
use Bio::SearchIO;
my $cutoff = ’0.001;
my $file = ‘BOSS_Ce.FASTP’,
my $in = new Bio::SearchIO(-format = ‘fasta’,
-file = $file);
while( my $r = $in-next_result ) {
print Query is: , $r-query_name, ,
$r-query_description, ,$r-query_length, aa
;
print Matrix was , $r-get_parameter(’matrix’),
;
while( my $h = $r-next_hit ) {
last if $h-significance $cutoff;
print Hit is , $h-name,
;
while( my $hsp = $h-next_hsp ) {
print HSP Len is , $hsp-length(’total’), ,
E-value is , $hsp-evalue, Bit score ,
$hsp-score,
,
Query loc: ,$hsp-query-start, ,
$hsp-query-end, ,
Sbject loc: ,$hsp-hit-start, ,
$hsp-hit-end,
;
}
}
}
27. FASTA Report
FASTA searches a protein or DNA sequence data bank
version 3.4t20 Sep 26, 2002
Please cite:
W.R. Pearson D.J. Lipman PNAS (1988) 85:2444-2448
Query library BOSS_DROME.aa vs /blast/wormpep87 library
searching /blast/wormpep87 library
1BOSS_DROME Bride of sevenless protein precursor. - 896 aa
vs /blast/wormpep87 library
9238759 residues in 20881 sequences
Expectation_n fit: rho(ln(x))= 5.6098+/-0.000519; mu= 12.8177+/- 0.030
mean_var=107.8223+/-22.869, 0's: 0 Z-trim: 2 B-trim: 0 in 0/62
Lambda= 0.123515
Kolmogorov-Smirnov statistic: 0.0333 (N=29) at 48
FASTA (3.45 Mar 2002) function [optimized, BL50 matrix (15:-5)] ktup: 2
join: 38, opt: 26, gap-pen: -12/-2, width: 16
Scan time: 9.680
The best scores are: opt bits E(20881)
F35H10.10 CE24945 status:Partially_confirmed T (1404) 207 48.5 6.8e-05
T06E4.11 CE06377 locus:pqn-63 status:Predicted ( 275) 122 32.6 0.8
C33B4.3 CE01508 ankyrin and proline rich domain (1110) 124 33.6 1.6
Y48C3A.8 CE22141 status:Predicted TR:Q9NAG3 pr ( 291) 110 30.5 3.7
Y34D9A.2 CE30217 status:Partially_confirmed TR ( 326) 108 30.2 5.1
K06H7.3 CE26941 Isopentenyl-diphosphate delta i ( 618) 107 30.3 8.9
F44B9.8 CE29044 ARPA status:Partially_confirmed ( 388) 104 29.5 9.4
F35H10.10 CE24945 status:Partially_confirmed TR:Q20 (1404 aa)
initn: 94 init1: 94 opt: 207 Z-score: 197.9 bits: 48.5 E(): 6.8e-05
Smith-Waterman score: 275; 22.527% identity (27.152% ungapped) in 728 aa
overlap (207-847:640-1330)
180 190 200 210 220 230
BOSS_D RAISIDNASLAENLLIQEVQFLQQCTTYSMGIFVDWELYKQLESVIKD---LEYNIWPIP
28. FASTA Script Results
Query is: BOSS_DROME Bride of sevenless protein
precursor. 896 aa
Matrix was BL50
Hit is F35H10.10
HSP Len is 728 E-value is 6.8e-05 Bit score 197.9
Query loc: 207 847 Sbject loc: 640 1330
29. Using the Search::Result
object
use Bio::SearchIO;
use strict;
my $parser = new Bio::SearchIO(-format = ‘blast’,
-file = ‘file.bls’);
while( my $result = $parser-next_result ){
print “query name=“, $result-query_name, “ desc=”,
$result-query_description, “, len=”,$result-
query_length,“
”;
print “algorithm=“, $result-algorithm, “
”;
print “db name=”, $result-database_name, “ #lets=”,
$result-database_letters, “ #seqs=”,$result-
database_entries, “
”;
print “available params “, join(’,’,
$result-available_parameters),”
”;
print “available stats “, join(’,’,
$result-available_statistics), “
”;
print “num of hits “, $result-num_hits, “
”;
}
30. Using the Search::Hit
Object
use Bio::SearchIO;
use strict;
my $parser = new Bio::SearchIO(-format = ‘blast’,
-file = ‘file.bls’);
while( my $result = $parser-next_result ){
while( my $hit = $result-next_hit ) {
print “hit name=”,$hit-name, “ desc=”, $hit-description,
“
len=”, $hit-length, “ acc=”, $hit-accession,
”
”;
print “raw score “, $hit-raw_score, “ bits “, $hit-bits,
“ significance/evalue=“, $hit-evalue, “
”;
}
}
31. Cool Hit Methods
start(), end() - get overall alignment start
and end for all HSPs
strand() - get best overall alignment strand
matches() - get total number of matches
across entire set of HSPs (can specify only
exact ‘id’ or conservative ‘cons’)
33. Cool HSP methods
rank() - order in the alignment (which you
could have requested, by score, size)
matches - overall number of matches
seq_inds - get a list of numbers
representing residue positions which are
conserved, identical, mismatches, gaps
34. SearchIO system
BLAST (WU-BLAST, NCBI, XML, PSIBLAST,
BL2SEQ, MEGABLAST, TABULAR (-m8/m9))
FASTA (m9 and m0)
HMMER (hmmpfam, hmmsearch)
UCSC formats (WABA, AXT, PSL)
Gene based alignments
Exonerate, SIM4, {Gene,Genome}wise
35. SearchIO reformatting
Supports output of Search reports as well
Bio::SearchIO::Writer
“BLAST flavor” HTML, Text
Tabular Report Format
36. Bioperl Reformatted HTML of BLASTP Search Report
for gi|6319512|ref|NP_009594.1|
BLASTP 2.0MP-WashU [04-Feb-2003] [linux24-i686-ILP32F64 2003-02-04T19:05:09]
Copyright (C) 1996-2000 Washington University, Saint Louis, Missouri USA.
All Rights Reserved.
Reference: Gish, W. (1996-2000) http://blast.wustl.edu
Hit Overview
Score: Red= (=200), Purple 200-80, Green 80-50, Blue 50-40, Black 40
Build custom
overview
(Bio::Graphics)
Query= gi|6319512|ref|NP_009594.1| chitin synthase 2; Chs2p [Saccharomyces cerevisiae]
(963 letters)
Database: cneoA_WI.aa
9,645 sequences; 2,832,832 total letters
Hyperlink to external
Score E
resources
Sequences producing significant alignments:
(bits) value
cneo_WIH99_157.Gene2 Start=295 End=4301 Strand=1 Length=912 ExonCt=24 1650 1.6e-173
cneo_WIH99_63.Gene181 Start=154896 End=151527 Strand=-1 Length=876 ExonCt=13 1441 3.9e-149
Hyperlink to
cneo_WIH99_133.Gene1 Start=15489 End=19943 Strand=1 Length=1017 ExonCt=23 1357 3e-142
alignment part
cneo_WIH99_45.Gene2 Start=84 End=3840 Strand=1 Length=839 ExonCt=25 1311 1.5e-138
of report
cneo_WIH99_112.Gene165 Start=122440 End=118921 Strand=-1 Length=1036 ExonCt=9 198 1.2e-15
cneo_WIH99_11.Gene7 Start=39355 End=42071 Strand=1 Length=761 ExonCt=9 172 6.4e-13
cneo_WIH99_60.Gene9 Start=36153 End=32819 Strand=-1 Length=1020 ExonCt=5 166 1.2e-12
cneo_WIH99_106.Gene88 Start=242538 End=238790 Strand=-1 Length=1224 ExonCt=3 157 6.3e-09
37.
38. Turning BLAST into
HTML
use Bio::SearchIO;
use Bio::SearchIO::Writer::HTMLResultWriter;
my $in = new Bio::SearchIO(-format = 'blast',
-file = shift @ARGV);
my $writer = new
Bio::SearchIO::Writer::HTMLResultWriter();
my $out = new Bio::SearchIO(-writer = $writer
-file = “file.html”);
$out-write_result($in-next_result);
39. Turning BLAST into HTML
# to filter your output
my $MinLength = 100; # need a variable with scope outside the method
sub hsp_filter {
my $hsp = shift;
return 1 if $hsp-length('total') $MinLength;
}
sub result_filter {
my $result = shift;
return $hsp-num_hits 0;
}
my $writer = new Bio::SearchIO::Writer::HTMLResultWriter
(-filters = { 'HSP' = hsp_filter} );
my $out = new Bio::SearchIO(-writer = $writer);
$out-write_result($in-next_result);
# can also set the filter via the writer object
$writer-filter('RESULT', result_filter);
40. Trees
• Support reading of Tree data
• newick, nexus, nhx formats
• Manipulation of Trees
• Trees are connections of Nodes which have
ancestor and children pointers
40
41. Simple Tree Routines
LCA
• Lowest Common Ancestor
• Tests of monophyly, paraphyly
outgroup
• Reroot tree
• Distances between two nodes
• Find a node by name
41
42. Constructing Trees
• For Purely Bioperl Solution (only in post
Bioperl 1.4 releases)
• Bio::Align::DNAStatistics and
ProteinStatistics can build pairwise
distance matricies
• Bio::Tree::DistanceFactory has UPGMA,
Neighbor-Joining implemented
42
44. Reading/Writing a Tree
#!/usr/bin/perl -w
use Bio::TreeIO;
use strict;
my $in = Bio::TreeIO-new(-format = ‘nexus’,
-file = ‘trees.nex’);
my $out = Bio::TreeIO-new(-format = ‘newick’,
-file = ‘trees.nh’);
while( my $tree = $in-next_tree ) {
$out-write_tree($tree);
}
44
45. Fetching subset of nodes
#!/usr/bin/perl -w
use strict;
use Bio::TreeIO;
my $in = Bio::TreeIO-new(-format = ‘newick’,
-fh = *DATA);
if( my $tree = $in-next_tree ) {
my @nodes = $tree-get_nodes;
my @tips = grep { $_-is_Leaf } @nodes;
print there are ,scalar @tips, tips
;
my @internal = grep { ! $_-is_Leaf } @nodes;
print there are ,scalar @internal, internal
nodes
;
my ($A_node) = $tree-find_node(-id = 'A');
print branch length of Ancestor of ,$A_node-id,
is , $A_node-ancestor-branch_length,
;
}
__DATA__
(((A:10,B:11):2,C:5),((D:7,F:6):17,G:8));
45
46. Walking up the tree
(tips to root)
if( my $tree = $in-next_tree ) {
my @tips = grep { $_-is_Leaf } $tree-get_nodes;
for my $node ( @tips ) {
my @path;
while( defined $node) {
push @path, $node-id;
$node = $node-ancestor;
}
print join(“,“, @path), “
”;
}
}
__DATA__
(((A:10,B:11)I1:2,C:5)I2,((D:7,F:6)I3:17,G:8)I4)Root;
A
I1
C,I2,Root B
I2
A,I1,I2,Root
C
B,I1,I2,Root
Root
G,I4,Root D
I3
D,I3,I4,Root F
I4
F,I3,I4,Root 46
G
47. From Alignments to Trees
• Bio::AlignIO to parse the alignment
• Bio::Align::ProteinStatistics to compute
pairwise distances
• Bio::Tree::DistanceFactory to build a tree
based on a matrix of distances using NJ or
UPGMA
• More sophisticated tree building should be
done with tools like phyml, PAUP,
MOLPHY, PHYLIP, MrBayes, or PUZZLE
47
48. Testing phylogenetic
hypotheses
• No sophisticated ML methods are currently
built in Bioperl for testing for phylogenetic
correlations, etc
• Can export trees and use tools like
Mesquite
48
49. Bioperl Population
Genetics
• Basic data
• Marker - polymorphic region of genome
• Individual - individual sampled
• Genotype - observed allele(s) for a
marker in an individual
• Population - collection of individuals
49
50. The Players
• Bio::PopGen::Population - container for
Individuals, calculate summary stats
• Bio::PopGen::Individual - container for
Genotypes
• Bio::PopGen::Marker - summary info
about a Marker (primers, genome loc,
allele freq)
• Bio::PopGen::Genotype - pairing of
Individual Marker; allele container 50
51. Population Genetics
Data Objects
Population
Has
Map
Individual
Has
Has
Marker
Genotype
Has
51
52. Some Data Formats
• PrettyBase format (Seattle SNP data)
MARKER SAMPLEID ALLELE1 ALLELE2
ProcR2973EA 01 C T
ProcR2973EA 02 N N
ProcR2973EA 03 C C
ProcR2973EA 04 C T
ProcR2973EA 05 C C
• Comma Delimited (CSV)
SAMPLE,MARKERNAME1,MARKERNAME2,...
SAMPLE,ProcR2973EA,Marker2
sample-01,C T,G T
sample-02,C C,G G
• Phase format
3
5
P 300 1313 1500 2023 5635
MSSSM
#1
12 1 0 1 3
52
11 0 1 0 3
53. Reading in Data
use Bio::PopGen::IO;
my $in = Bio::PopGen::IO-new(-format = ‘csv’
-file = ‘dat.csv’);
my @pop;
while( my $ind = $in-next_individual ) {
push @pop, $ind;
}
# OR
my $pop = $in-next_population;
53
54. Ready Built Stuff
• Bio::PopGen::PopStats - population level
statistics (only FST currently)
• Bio::PopGen::Statistics - suite of
Population Genetics statistical tests and
summary stats.
• Bio::PopGen::Simulation::Coalescent -
primitive Coalescent simulation
• Basic tree topology and branch length
assignment. 54
55. Using the Modules
use Bio::PopGen::Statistics;
my $stats = Bio::PopGen::Statistics-new();
my $pi = $stats-pi($population);
# or use an array reference of Individuals
my $pi = $stats-pi(@individuals);
# Tajima’s D
my $TajimaD = $stats-tajima_D($population);
# Fu and Li’s D
my $FuLiD = $stats-fu_and_li_D($ingroup_pop,
$outgroup_ind);
# Fu and Li’s D*
my $FLDstar = $stats-fu_and_li_D_star($population);
# pairwise composite LD
my %LDstats = $stats-composite_LD($population);
my $LDarray = $LDstats{’marker1’}-{’marker2’};
my ($ldval,$chisq) = @$LDarray;
55
56. Getting Data from
Alignments
• use Bio::AlignIO to read in Multiple
Sequence Alignment data
• Bio::PopGen::Utilities aln_to_population
will build Population from MSA
• Will make a “Marker” for every
polymorphic site (or if asked every site)
• Eventually will have ability to only get
silent/non-silent coding sites
56
58. Coalescent Simulation
use Bio::PopGen::Simulation::Coalescent;
my $factory = new
Bio::PopGen::Simulation::Coalescent(
-sample_size = 10,
-maxcount = 100); # max number trees
my $tree = $factory-next_tree;
$factory-add_Mutations($tree,20);
my @inds = $tree-get_leaf_nodes;
use Bio::PopGen::Statistics;
my $pi = Bio::PopGen::Statistics-pi(@inds);
my $D = Bio::PopGen::Statistics- tajima_D
(@inds);
58
60. Simple pairwise Ka, Ks
rates estimation
• Within Bioperl can calculate pairwise K ,K
a s
and Ka/Ks using Nei’s method.
• Bio::Align::DNAStatistics
• calc_KaKs_pair($alnobj,$seq1id,$seq2id)
calc_all_KaKs_pairs($alnobj)
60
61. Ortholog/Homolog
finding
• Doesn’t really require much of Bioperl to
do this
• Want to find best reciprocal BLAST/
FASTA/SSEARCH hits
• Build up a matrix, where each query
sequence has a list of its best hits
61
62. my $reportfile = ‘/usr/local/bio/data/reports/fly_worm.FASTA.m9.gz’;
open(IN,”zcat $reportfile |”) || die $!;
my %data;
while(IN) {
my @line = split;
my ($query,$subject,$pid,$alnlen) = @line;
my ($evalue) = $line[10];
next if $query eq $subject;
push @{$data{$query}}, [$subject, $evalue,$pid,$alnlen];
}
my %seen; my %orthologs; # some tracking and data vars
for my $q ( keys %data ) {
next if $seen{$q}++; # skip if we’ve processed it already
my ($s, $evalue,$pid,$alnlen) = @{$data{$q}-[0]};
my ($qsp) = split(/:/,$q); my ($ssp) = split(/:/,$s);
$seen{$s}++;
next if( $qsp eq $ssp || $pid 30 ||
(defined $data{$q}-[1] 1e-10
abs($data{$q}-[1]-[1] - $evalue) ) ); # second best hit
# is too good
if( $data{$s}-[0]-[0] eq $q ) {
my $id = join(”-”, sort ($qsp,$ssp));
push @{$orthologs{$id}}, [$q,$s,$evalue,$pid,$alnlen];
}
}
62
63. # print out the ortholog pairs, do some sorting to insure that
# order of names is consistent and order by evalue
for my $id ( keys %orthologs ) {
open(OUT, “$id.orthologs”) || die $!;
for my $pair ( sort {$a-[2] = $b-[2] }
@{$orthologs{$id}} ) {
my ($q,$s,$evalue,$pid,$alnlen) = @$pair;
($q,$s) = sort ( $q, $s ); # assume prefixed
print OUT join(” ”, ($q,$s,$evalue,$pid,$alnlen)), “
”;
}
}
63
64. Gene families
• Families can be constructed with single-
linkage - merging all homologs into a single
cluster
• Or using a tool like MCL (http://micans.org/
mcl) to build gene families from a matrix of
pairwise distances (BLAST or FASTA
scores)
64
65. Automating PAML
• PAML - phylogenetic analysis with
maximum likelihood
• Estimate synonymous and non-synonymous
substitution rates
• Along branches of a tree or in a pairwise
fashion
65
66. Preparing Data
• Multiple sequence alignments of protein
coding sequence
• No stop codons!
• Must be aligned on codon boundaries
• Easiest way is to align at protein level, then
project back into CDS alignment
66
67. Doing Protein
Alignments
• Bio::Tools::Run::Alignment::Clustalw or
Bio::Tools::Run::Alignment::MUSCLE or just
prepare the sequence files and run the
alignment programs via scripts
• Bio::AlignIO to parse the alignment data
• Bio::Align::Utilities to project back into
CDS space
67
68. Build tree or assume a
tree
• If doing analysis of genomes which have a
know species tree - use that tree
• Branch lengths are not part of PAML.
Multiple topologies can be provided to test
alternative hypotheses (by comparing
maximum likelihood values)
68
69. Running PAML
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::AlignIO;
my $factory = Bio::Tools::Run::Phylo::PAML::Codeml-new(
-params = { ‘runmode’ = -2,
‘seqtype’ = 1});
my $alnio = Bio::AlignIO-new(-format = ‘clustalw’,
-file = ‘cds.aln’);
my $aln = $alnio-next_aln; # get the alignment from file
$factory-alignment($aln); # set the alignment
my ($returncode,$parser) = $factory-run();
my $result = $parser-next_result;
my $MLmatrix = $result-get_MLMatrix;
print Ka = , $MLmatrix-[0]-[1]-{’dN’},
;
print Ks = , $MLmatrix-[0]-[1]-{’dS’},
;
print Ka/Ks = , $MLmatrix-[0]-[1]-{’omega’},
;
69
70. Parsing PAML
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Phylo::PAML;
my $parser = Bio::Tools::Phylo::PAML-new
(-file = ‘results/mlc’, -dir = ‘results’);
if( my $result = $parser-next_result ) {
my @otus = $result-get_seqs;
# get Nei Gojobori dN/dS matrix
my $NGmatrix = $result-get_NGmatrix;
printf “%s and %s dS=%.4f dN=%.4f Omega=%.4f
”,
$otus[0]-display_id, $otus[1]-display_id,
$NGMatrix-[0]-[1]-{dS}, $NGMatrix-[0]-[1]-{dN},
$NGMatrix-[0]-[1]-{omega};
}
70
71. Getting the Trees out
my @trees = $result-get_trees;
for my $tree ( @trees ) {
print “likelihood is “, $tree-score, “
”;
# do something else with the trees,
# for non runmode -2 results
# inspect the tree, branch specific rates
# the t (time) parameter is available via
# (omega, dN, etc.) are available via
# ($omega) = $node-get_tag_values('omega');
for my $node ( $tree-get_nodes ) {
print $node-id, “ t=”,$node-branch_length,
“ omega “, $node-get_tag_values(‘omega’);
}
}
71
72. Summary
Lots of modules to do lots of things
How to find out what exists?
Read HOWTOs, bptutorial, Browse the docs
website - http://doc.bioperl.org/
Ask on-list bioperl-l@bioperl.org
73. Fungal Genomics
• ~45 fungal genomes currently
sequenced
• 5-10 more in production at sequencing
centers
• Diverse sampling across the fungal tree
• Clusters of closely related species
• Cryptococcus, Aspergillus,
Saccharomyces, Candida
74. Genome Annotation
Pipeline
Rfam
tRNAscan
ZFF to GFF3 GFF to AA
SNAP FASTA
predicted
Proteins
proteins
Twinscan all-vs-all
GFF2 to GFF3
Bio::DB::GFF
Tools::Glimmer
Glimmer SearchIO
protein to
Genscan
Genome genome
Tools::Genscan
coordinates
homologs
HMMER MCL
SearchIO SearchIO and
BLASTZ
orthologs
BLASTN
Tools::GFF GLEAN
GFF2 to GFF3
exonerate (combiner)
Proteins Gene
protein2genome
families
exonerate
ESTs
est2genome
BOSC 2005
81. Consensus Splice Site
Exon Exon
AG
GT G
H.sapiens exon|intron H.sapiens intron|exon
2 2
C
bits
bits
1 1
A
A T
T
G
A G
G T
C A
C G
T C
T
T T
0 0
C
A A
T
G T
C C
A
C G A G
C
-2
-1
0
1
2
3
4
5
-5
-4
-3
-2
-1
0
1
5′ 3′ 5′
GTATGT AG
weblogo.berkeley.edu weblogo.berkele
S.cerevisiae exon|intron
2 S.cerevisiae intron|exon
2
bits
1
bits
1
T
A
C
C
A
T
0
G
T A
A
A
G
C
C
A
-2
-1
0
1
2
3
4
5
T
0 G
C
5′ 3′
A
C
-5
-4
-3
-2
-1
0
1
5′
GT G AG
weblogo.berkeley.edu
weblogo.berkele
F.graminearium exon|intron F.graminearium intron|exon
2 2
bits
bits
1 1
AA T C
T
GT A
T
G T
0 0
C
C A
C
T
C
A
A
C
T
A
G
C
A G
-2
-1
0
1
2
3
4
5
-5
-4
-3
-2
-1
0
1
5′ 3′ 5′
AG
GT G
weblogo.berkeley.edu weblogo.berkele
C.neoformans exon|intron C.neoformans intron|exon
2 2
bits
bits
1 1
T T
AA
C
G C
G C T
0 0
TT
T A A
A
C
A
G
G C
C
A
-2
-1
0
1
2
3
4
5
-5
-4
-3
-2
-1
0
1
5′ 3′ 5′
weblogo.berkeley.edu weblogo.berkele
83. Intron loss
• If loss predominates, how are introns lost?
• Fink (1987) model proposes mRNA
integration into genome.
• Boeke et al (1985) showed loss intron
through RNA intermediate which is
integrated into genome.
• Large scale comparisons are too far awy
to determine recent loss events.
84. Searching for Intron loss in
C. neoformans
• Average of 5 introns per
gene (Loftus et al, 2005)
• A
Average Ks between var.
grubii and var. gattii 0.22 D
(roughly mouse-rat B/C
divergence), C. gattii vs C.
neoformans Ks is 0.35
• 5133 4-way orthologs
85. Loss/gain
events observed
• 25 out of the 5133 loci had loss or gain
events.
• 2 had evidence 4 or more intron loss/gain
events.
• CNI01550, putative RNA helicase missing
10 introns in var grubii (serotype A).
• ~5-7 singleton introns in one of the
Cryptococcus spp indicative of intron gain.
86. 10 introns lost in
C. neoformans var grubii
2 3 5 6 78 9 10 11 12 13 1415 16 17 18 19 20 21 22
1 2 3 5 6 78 9-19 20 21 22
C.gattii var grubii
var neoformans
}
}
87. Happened at the locus and
was a perfect deletion
1k 2k 3k 4k 5k 6k 7k 8k 9k 10k 11k
CNI01560
CNI01540 CNI01550
C.neoformans var neoformans
89% nt identity 92% nt identity 86% nt identity
C.neoformans var grubii
85% nt identity 88% nt identity 85% nt identity
C.gattii
88. Other Research
Projects
• Multigene phylogenies of the fungi
• Gene family evolution
• Sugar transporter family expansion in
C.neoformans
• Gene family evolution mammals
(with M. Hahn)