Sequence Matrix: Gene concatenation made easy

Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)

1: NeatCo Asia, Singapore.
2: Department of Biological Sciences,
National University of Singapore, Singapore.

Our goals

✤ Many powerful tools exist for concatenating sequences.

✤ Adding new sequences to an existing dataset is tedious and time-consuming.

✤ Our initial goal: a simple, user-friendly program for concatenating sequences.

✤ We also added a few tools to help you look for lab contamination in your dataset.

Sequence Matrix

✤ Written in Java.

✤ Built with graphical user interface libraries.

✤ Works on different operating systems.

✤ Easy to install: download and run the batch file.

Importing sequences

✤ You can use the sequence names as entered in the input file.

✤ Or you can ask Sequence Matrix to try
to identify the species names.

Importing sequences

✤ Sequences mode: gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ Species name: Daubentonia madagascariensis

✤ Sequences mode: gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
✤ Species name: Macaca sylvanus
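
The species-name identification shown above can be sketched roughly as follows. This is only an illustration, not Sequence Matrix's actual algorithm: the helper `guess_species_name` and its regular expression are hypothetical, and they assume a GenBank-style name whose last `|`-separated field begins with a capitalised genus followed by a lowercase species epithet.

```python
import re

# Hypothetical pattern: capitalised genus, a space, lowercase species epithet.
BINOMIAL = re.compile(r"^([A-Z][a-z]+) ([a-z]+)\b")


def guess_species_name(sequence_name):
    """Return 'Genus species' guessed from a sequence name, or None."""
    # Keep only the free-text description after the last '|' separator.
    description = sequence_name.split("|")[-1].strip()
    match = BINOMIAL.match(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


name = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
        "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(name))  # Daubentonia madagascariensis
```

A name that does not start with a binomial (for example a lab code such as `sample_01`) returns `None`, in which case the name as entered in the input file is the safer choice.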

Importing sequences

✤ A common source of error is forgetting
to recode leading and trailing gaps as
missing information.

✤ Sequence Matrix can automatically
replace such gaps with question marks.
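
As a minimal sketch of what this recoding means, the hypothetical function below replaces only the leading and trailing gaps of an aligned sequence with question marks and leaves internal gaps alone. The gap character `-` and the function name are assumptions made for illustration; this is not Sequence Matrix's own code.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gap characters with the missing-data symbol."""
    core = sequence.strip(gap)
    if not core:
        # The sequence consists only of gaps: treat all of it as missing.
        return missing * len(sequence)
    leading = len(sequence) - len(sequence.lstrip(gap))
    trailing = len(sequence) - len(sequence.rstrip(gap))
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--AC-GT---"))  # ??AC-GT???
```

The internal gap in `AC-GT` is kept, since it may represent a real insertion or deletion rather than missing data.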

Importing sequences: Naming

✤ Sequences from one dataset are matched up to another dataset by sequence name.

✤ Errors in sequence naming need to be fixed.

✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
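
To illustrate why naming errors matter, here is a rough sketch of concatenation by sequence name. The function `concatenate` and its input layout (gene name mapped to sequence name mapped to aligned sequence) are assumptions for this example, not the program's internals. It assumes every gene has at least one sequence and that all sequences within a gene have the same length.

```python
def concatenate(genes):
    """Join per-gene alignments into one row per sequence name.

    genes: dict mapping gene name -> dict of sequence name -> aligned sequence.
    Names missing from a gene are padded with '?' for that gene's length.
    """
    names = sorted({name for seqs in genes.values() for name in seqs})
    matrix = {name: "" for name in names}
    for seqs in genes.values():  # dicts keep insertion order
        length = len(next(iter(seqs.values())))
        for name in names:
            matrix[name] += seqs.get(name, "?" * length)
    return matrix


genes = {
    "coi": {"Homo sapiens": "ACGT", "Gorilla gorilla": "ACGA"},
    "cytb": {"Homo sapiens": "GG", "Gorila gorilla": "GC"},  # misspelled name
}
for name, row in concatenate(genes).items():
    print(name, row)
```

Because 'Gorila gorilla' does not match 'Gorilla gorilla', the sketch produces two half-empty rows instead of one complete row, which is the kind of error that has to be fixed before export.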

Export: Taxonsets

✤ By default, we generate taxonsets on the
basis of:

✤ Combined length.

✤ Number of character sets.

✤ Information for a particular gene.
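
The three criteria above can be sketched as follows. All names here are hypothetical, and "combined length" is assumed to mean the number of characters that are neither gaps nor missing data; the program itself may define it differently.

```python
def summarise_taxa(genes):
    """genes: dict mapping gene name -> dict of taxon -> aligned sequence."""
    taxa = sorted({taxon for seqs in genes.values() for taxon in seqs})
    summary = {}
    for taxon in taxa:
        present = [gene for gene, seqs in genes.items() if taxon in seqs]
        # Assumption: count only characters that are not '-' or '?'.
        length = sum(
            sum(base not in "-?" for base in genes[gene][taxon])
            for gene in present
        )
        summary[taxon] = {"genes": present, "combined_length": length}
    return summary


def taxa_with_gene(summary, gene):
    """Taxa that have information for one particular gene."""
    return [taxon for taxon, info in summary.items() if gene in info["genes"]]


def taxa_with_min_charsets(summary, minimum):
    """Taxa represented in at least `minimum` character sets."""
    return [t for t, info in summary.items() if len(info["genes"]) >= minimum]


genes = {"coi": {"A": "ACGT", "B": "AC--"}, "28S": {"A": "GG"}}
summary = summarise_taxa(genes)
print(taxa_with_gene(summary, "28S"))      # ['A']
print(taxa_with_min_charsets(summary, 1))  # ['A', 'B']
```

A list such as the output of `taxa_with_gene` is what a downstream program would use to restrict an analysis to the taxa sequenced for one gene.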

Gene trees

✤ Two ways to build gene trees:

✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.

✤ Export the entire dataset with one file per column.

Export features

✤ You can also export the Sequence Matrix table as an Excel-readable text file.

✤ Supervisory mode.

✤ Keep track of a project as it grows.

Character sets

✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.

✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
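
A simplified sketch of "splitting" by character set is given below. It handles only Nexus CHARSET commands with a single contiguous range such as `CHARSET coi = 1-658;`; real Nexus files allow more complex definitions, and TNT xgroup commands are not covered. The function names are hypothetical.

```python
import re

# Matches only the simple form: CHARSET name = start-end;
CHARSET = re.compile(
    r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)


def parse_charsets(nexus_text):
    """Return {charset name: (start, end)} with 1-based inclusive positions."""
    return {
        name: (int(start), int(end))
        for name, start, end in CHARSET.findall(nexus_text)
    }


def split_by_charset(sequence, charsets):
    """Cut one aligned sequence into one piece per character set."""
    # Nexus positions are 1-based and inclusive; Python slices are 0-based.
    return {
        name: sequence[start - 1:end]
        for name, (start, end) in charsets.items()
    }


sets_block = "BEGIN SETS;\n  CHARSET coi = 1-4;\n  CHARSET cytb = 5-6;\nEND;"
charsets = parse_charsets(sets_block)
print(split_by_charset("ACGTGG", charsets))  # {'coi': 'ACGT', 'cytb': 'GG'}
```

Importing the file as a single column corresponds to skipping the split and keeping the whole sequence under the file's name.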

Excision

✤ Individual sequences can be excised
from the dataset.

✤ Excised sequences will not be exported.

✤ Sequence Matrix will warn you when excised sequences are left out of an export.

Contamination

✤ You thought you were sequencing Gorilla gorilla

✤ but you were really sequencing Homo sapiens.

✤ We have two tools you can use:

✤ One for when Homo sapiens is in your dataset.

✤ One for when Homo sapiens is not in your dataset (experimental!).

H. sapiens in dataset

✤ Looks for pairs of sequences whose
pairwise distance is very low.

✤ Expected difference depends on gene:

✤ 28S doesn’t change very much, but

✤ COI changes very quickly.

✤ Some interpretation is required.
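
The idea can be sketched with an uncorrected p-distance (the proportion of differing sites among those compared). Whether Sequence Matrix uses exactly this distance is an assumption, and the threshold below is a made-up value; as noted above, a sensible cut-off depends on the gene.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else None


def suspicious_pairs(sequences, threshold):
    """Return (name_a, name_b, distance) for pairs at or below the threshold."""
    hits = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(
        sorted(sequences.items()), 2
    ):
        distance = p_distance(seq_a, seq_b)
        if distance is not None and distance <= threshold:
            hits.append((name_a, name_b, distance))
    return hits


coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACGAACTT",
}
print(suspicious_pairs(coi, threshold=0.01))
# [('Gorilla gorilla', 'Homo sapiens', 0.0)]
```

A hit is only a prompt for a closer look: identical sequences are plausible for a slowly evolving gene such as 28S and far more suspicious for COI.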

H. sapiens not present

✤ Use “Pairwise Distance Mode” to look
for unusual pairwise distances.

✤ Ignore one charset, then sort taxa based
on their pairwise distance to a
“reference taxon”.

✤ Colour sequences by their individual
pairwise distances to the reference
taxon.
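
One way to picture this mode is sketched below: for each taxon, compute its distance to the reference taxon on all genes except the one under study, sort by that value, and list the distance on the gene under study alongside it. The function names, the data layout and the use of an uncorrected p-distance are all assumptions for illustration, not a description of the program's internals.

```python
def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and missing data."""
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else None


def distance_table(genes, reference, focal_gene):
    """Rows of (taxon, distance on other genes, distance on focal gene).

    genes: dict mapping gene name -> dict of taxon -> aligned sequence.
    """
    taxa = sorted({t for seqs in genes.values() for t in seqs} - {reference})
    focal = genes[focal_gene]
    rows = []
    for taxon in taxa:
        rest_taxon, rest_reference = "", ""
        for gene, seqs in genes.items():
            if gene != focal_gene and taxon in seqs and reference in seqs:
                rest_taxon += seqs[taxon]
                rest_reference += seqs[reference]
        focal_distance = None
        if taxon in focal and reference in focal:
            focal_distance = p_distance(focal[taxon], focal[reference])
        rows.append((taxon, p_distance(rest_taxon, rest_reference), focal_distance))
    # Sort by distance on the remaining genes; taxa without data go last.
    rows.sort(key=lambda r: (r[1] is None, r[1] if r[1] is not None else 0.0))
    return rows


genes = {
    "coi": {"Ref": "AAAA", "X": "AAAT", "Y": "AATT"},
    "cytb": {"Ref": "CCCC", "X": "CCCC", "Y": "CCGG"},
}
for row in distance_table(genes, reference="Ref", focal_gene="coi"):
    print(row)
```

In the sorted table, a taxon whose focal-gene distance is far out of line with its neighbours is the one worth checking; in a graphical tool the same comparison would be shown with colour.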

H. sapiens not present

✤ Colour pairwise distances on the gene
in question by their pairwise distance to
the reference taxon.

✤ Look for colour variation which is
unusual or out of place.

✤ We would expect sequences from different species to be correlated.

Pairwise distance mode

✤ You need to vary:

✤ The gene you are studying.

✤ The reference taxon being compared
against.

✤ Possibly helpful as an alert mechanism.

Summary

✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤ Taxonsets allow you to analyse subsets of your data in downstream programs.

✤ Excising sequences gives you greater control over which sequences to analyse.

✤ You can look for contamination in two ways:

✤ Looking for very low pairwise distances across your entire dataset.

✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Acknowledgements

✤ Rudolf Meier

✤ Zhang Guanyang

✤ Farhan Ali

✤ David Lohman

✤ Everybody at the NUS DBS
Evolutionary Biology lab.
