Creating large datasets by concatenating genes can be challenging. This tool aims to make that process much easier.
For more information, see http://code.google.com/p/sequencematrix/ or http://www3.interscience.wiley.com/journal/123577052/abstract
Sequence Matrix: Gene concatenation made easy
1. Sequence Matrix
Gene concatenation made easy
Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)
1: NeatCo Asia, Singapore.
2: Department of Biological Sciences, National University of Singapore, Singapore.
2. Our goals
✤ Many powerful tools exist for concatenating sequences.
✤ Adding new sequences to an existing dataset is tedious and time consuming.
✤ Our initial goal: simple, user-friendly program for concatenating sequences.
✤ We also added a few tools to help you look for lab contamination in your dataset.
3. Sequence Matrix
✤ Written in Java.
✤ Graphical user interface libraries.
✤ Works on different operating systems.
✤ Easy to install: download and run the batch file.
4. Importing sequences
✤ You can use the sequence names as entered in the input file.
✤ Or you can ask Sequence Matrix to try to identify the species names.
6. Importing sequences
✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.
✤ Sequence Matrix can automatically replace such gaps with question marks.
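The recoding described above can be sketched in a few lines of Python. This is only an illustration, not Sequence Matrix's own code: the function name `recode_terminal_gaps` is hypothetical, and it assumes `-` is the gap symbol and `?` the missing-data symbol. Internal gaps are left alone, since they may represent real indels.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gap characters with the missing-data symbol."""
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    # Internal gaps inside `core` are kept as they are.
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--ACG-T--"))  # ??ACG-T??
```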
7. Importing sequences: Naming
✤ Sequences from one dataset are matched up to another dataset by sequence name.
✤ Errors in sequence naming need to be fixed.
✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
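To show why consistent naming matters, here is a minimal Python sketch of concatenation by sequence name. It is an assumption-laden illustration rather than the program's actual algorithm: the `concatenate` function and the nested-dictionary layout (gene name, then sequence name, then aligned sequence) are hypothetical, and taxa missing from a gene are padded with `?`.

```python
def concatenate(datasets):
    """Join per-gene alignments into one row per sequence name.

    datasets: dict of gene name -> dict of sequence name -> aligned sequence.
    """
    taxa = sorted({name for seqs in datasets.values() for name in seqs})
    matrix = {name: "" for name in taxa}
    for gene, seqs in datasets.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {gene!r} is empty or not aligned")
        length = lengths.pop()
        for name in taxa:
            # A taxon without this gene is filled with missing data.
            matrix[name] += seqs.get(name, "?" * length)
    return matrix


datasets = {
    "coi": {"Homo_sapiens": "ACGT", "Gorilla_gorilla": "ACGA"},
    "cytb": {"Homo_sapiens": "GG"},
}
print(concatenate(datasets))
```

In this sketch a misspelt name such as `Homo_sapeins` in one file would simply become a separate row padded with question marks, which is the kind of naming error that has to be fixed before analysis.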
8. Export: Taxonsets
✤ By default, we generate taxonsets on the basis of:
✤ Combined length.
✤ Number of character sets.
✤ Information for a particular gene.
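The three criteria above could be computed roughly as follows. This is a sketch under stated assumptions: the function `taxonsets`, the set names it produces, and the thresholds are all hypothetical, and "combined length" is taken here to mean the number of characters that are neither gaps nor missing data. The names and cut-offs that Sequence Matrix itself writes out may differ.

```python
def informative(sequence):
    """Count characters that are neither gaps nor missing data."""
    return sum(1 for ch in sequence if ch not in "?-")


def taxonsets(datasets, min_length, min_charsets):
    """datasets: dict of gene name -> dict of sequence name -> sequence."""
    taxa = sorted({name for seqs in datasets.values() for name in seqs})
    length = {t: sum(informative(seqs.get(t, "")) for seqs in datasets.values())
              for t in taxa}
    count = {t: sum(1 for seqs in datasets.values() if informative(seqs.get(t, "")))
             for t in taxa}
    sets = {
        f"length_at_least_{min_length}": [t for t in taxa if length[t] >= min_length],
        f"charsets_at_least_{min_charsets}": [t for t in taxa if count[t] >= min_charsets],
    }
    for gene, seqs in datasets.items():
        # One taxonset per gene: the taxa that have data for it.
        sets[f"has_{gene}"] = [t for t in taxa if informative(seqs.get(t, ""))]
    return sets


example = {"coi": {"A": "ACGT", "B": "AC??"}, "cytb": {"A": "GG"}}
print(taxonsets(example, min_length=4, min_charsets=2))
```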
9. Gene trees
✤ Two ways to do them:
✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.
✤ Export the entire dataset with one file per column.
10. Export features
✤ You can also export the Sequence Matrix table as an Excel-readable text file.
✤ Supervisory mode.
✤ Keep track of a project as it grows.
11. Character sets
✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.
✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
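As an illustration of the "split" option, the Python sketch below reads simple Nexus CHARSET lines and cuts one sequence into per-gene pieces. It is deliberately limited and rests on assumptions: it only handles contiguous `start-end` ranges, it does not cover TNT xgroup commands, and the names `parse_charsets` and `split_by_charsets` are hypothetical.

```python
import re

# Matches lines such as: CHARSET coi = 1-658;
CHARSET_RE = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)


def parse_charsets(nexus_text):
    """Return a dict of charset name -> (start, end), 1-based and inclusive."""
    return {name: (int(start), int(end))
            for name, start, end in CHARSET_RE.findall(nexus_text)}


def split_by_charsets(sequence, charsets):
    """Cut one concatenated sequence into one piece per character set."""
    return {name: sequence[start - 1:end] for name, (start, end) in charsets.items()}


sets_block = """
BEGIN SETS;
    CHARSET coi = 1-4;
    CHARSET cytb = 5-6;
END;
"""
charsets = parse_charsets(sets_block)
print(split_by_charsets("ACGTGG", charsets))  # {'coi': 'ACGT', 'cytb': 'GG'}
```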
12. Excision
✤ Individual sequences can be excised from the dataset.
✤ Excised sequences will not be exported.
✤ Sequence Matrix will warn you about that.
13. Contamination
✤ You thought you were sequencing Gorilla gorilla
✤ but you were really sequencing Homo sapiens.
✤ We have two tools you can use:
✤ If Homo sapiens is in your dataset.
✤ If Homo sapiens is not in your dataset (experimental!).
14. H. sapiens in dataset
✤ Looks for pairs of sequences whose pairwise distance is very low.
✤ Expected difference depends on gene:
✤ 28S doesn’t change very much, but
✤ COI changes very quickly.
✤ Some interpretation is required.
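The search for suspiciously similar pairs can be sketched as below. This is an illustration only: it assumes an uncorrected p-distance (the proportion of differing sites among sites where both sequences have data), which may not be exactly what Sequence Matrix computes, and the function names and the threshold are hypothetical. Because the expected difference depends on the gene, the threshold would need to be chosen per gene.

```python
from itertools import combinations


def p_distance(a, b, ignore="?-"):
    """Uncorrected distance over sites where both sequences have data."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in ignore or y in ignore:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else None


def suspicious_pairs(sequences, threshold):
    """Return (name1, name2, distance) for pairs at or below the threshold."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(sorted(sequences.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d <= threshold:
            hits.append((n1, n2, d))
    return hits


coi = {
    "Gorilla_gorilla": "ACGTACGT",
    "Homo_sapiens": "ACGTACGT",
    "Pan_troglodytes": "ACGAACTT",
}
print(suspicious_pairs(coi, threshold=0.01))
```

Here the "Gorilla" sequence is identical to the human one on a fast-evolving gene, which is the pattern a contaminated sample would produce. Whether a given hit is real still needs interpretation.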
15. H. sapiens not present
✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.
✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.
✤ Colour sequences by their individual pairwise distances to the reference taxon.
16. H. sapiens not present
✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.
✤ Look for colour variation which is unusual or out of place.
✤ We would expect sequences from different species to be correlated.
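The idea behind Pairwise Distance Mode can be approximated in Python as follows. This is a rough sketch, not the program's implementation: it assumes uncorrected distances, and the function names and data layout (gene name, then sequence name, then sequence) are hypothetical. Taxa are sorted by their distance to the reference taxon with the gene in question ignored, and the distance on the gene in question is reported next to it, standing in for the colour.

```python
def count_differences(a, b, ignore="?-"):
    """Return (differing, compared) over sites where both sequences have data."""
    differing = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in ignore or y in ignore:
            continue
        compared += 1
        differing += x != y
    return differing, compared


def ratio(differing, compared):
    return differing / compared if compared else None


def rank_against_reference(datasets, reference, gene_in_question):
    """Return (taxon, background distance, distance on the gene in question)."""
    taxa = {name for seqs in datasets.values() for name in seqs} - {reference}
    rows = []
    for taxon in taxa:
        background = [0, 0]
        for gene, seqs in datasets.items():
            if gene == gene_in_question or taxon not in seqs or reference not in seqs:
                continue
            d, c = count_differences(seqs[taxon], seqs[reference])
            background[0] += d
            background[1] += c
        target = datasets[gene_in_question]
        gene_distance = None
        if taxon in target and reference in target:
            gene_distance = ratio(*count_differences(target[taxon], target[reference]))
        rows.append((taxon, ratio(*background), gene_distance))
    # Closest taxa first; taxa with no comparable background data go last.
    rows.sort(key=lambda r: (r[1] is None, r[1] or 0.0, r[0]))
    return rows


datasets = {
    "coi": {"Ref": "AAAA", "Near": "AAAT", "Far": "AAAA"},
    "28S": {"Ref": "CCCCCCCC", "Near": "CCCCCCCC", "Far": "CCGGTTCC"},
}
for row in rank_against_reference(datasets, "Ref", "coi"):
    print(row)
```

In this toy dataset "Far" is distant from the reference on the other gene but identical to it on coi. That mismatch is the kind of out-of-place value one would look for.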
17. Pairwise distance mode
✤ You need to vary:
✤ The gene you are studying.
✤ The reference taxon being compared against.
✤ Possibly helpful as an alert mechanism.
18. Summary
✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.
✤ Taxonsets allow you to analyse subsets of your data in downstream programs.
✤ Excising sequences gives you greater control over which sequences to analyse.
✤ You can look for contamination in two ways:
✤ Looking for very low pairwise distances across your entire dataset.
✤ Looking for unusual pairwise distances in Pairwise Distance Mode.
19. Acknowledgements
✤ Rudolf Meier
✤ Zhang Guanyang
✤ Farhan Ali
✤ David Lohman
✤ Everybody at the NUS DBS Evolutionary Biology lab.