Sequence Matrix
 Gene concatenation made easy
  Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)

                           1: NeatCo Asia, Singapore.
                           2: Department of Biological Sciences,
                              National University of Singapore, Singapore.
Our goals


 ✤   Many powerful tools exist for concatenating sequences.

 ✤   Adding new sequences to an existing dataset is tedious and time-consuming.

 ✤   Our initial goal: a simple, user-friendly program for concatenating sequences.

 ✤   We also added a few tools to help you look for lab contamination in your dataset.
Sequence Matrix


✤   Written in Java.

    ✤   Graphical user interface libraries.

    ✤   Works on different operating systems.

    ✤   Easy to install: download and run the batch file.
Importing sequences



✤   You can use the sequence names as
    entered in the input file.

✤   Or you can ask Sequence Matrix to try
    to identify the species names.
Importing sequences

✤   Sequences mode:

    ✤   gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S
        ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete
        sequence; and 28S ribosomal RNA gene, partial sequence

    ✤   gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA
        gene, partial sequence; internal transcribed spacer 2, complete sequence; and
        28S ribosomal RNA gene, partial sequence

✤   Species name:

    ✤   Daubentonia madagascariensis

    ✤   Macaca sylvanus
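
A minimal sketch of how a species name might be guessed from a GenBank-style sequence name like the two shown here. This is not Sequence Matrix's actual algorithm: the helper name `guess_species_name` and the rule it applies (take the text after the last `|`, then read a capitalised genus followed by a lowercase epithet) are assumptions made for illustration.

```python
import re

# Hypothetical helper, not Sequence Matrix's real algorithm.
# Assumption: the description follows the last '|' and begins with
# a capitalised genus followed by a lowercase species epithet.
BINOMIAL = re.compile(r"^([A-Z][a-z]+)\s+([a-z]+)\b")

def guess_species_name(sequence_name):
    """Return 'Genus species' if one can be guessed, else None."""
    description = sequence_name.rsplit("|", 1)[-1].strip()
    match = BINOMIAL.match(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"

name = ("gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
        "voucher WE94001 5.8S ribosomal RNA gene, partial sequence")
print(guess_species_name(name))  # Daubentonia madagascariensis
```

A rule this simple will misread names that do not start with a binomial, so the guessed names should still be checked by eye.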
Importing sequences



✤   A common source of error is forgetting
    to recode leading and trailing gaps as
    missing information.

✤   Sequence Matrix can automatically
    replace such gaps with question marks.
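
The recoding step can be sketched as follows. This is an illustrative sketch rather than Sequence Matrix's own code; the function name `recode_terminal_gaps` is hypothetical, and it assumes `-` marks a gap and `?` marks missing data.

```python
def recode_terminal_gaps(sequence, gap="-", missing="?"):
    """Replace leading and trailing gap characters with the missing-data symbol.

    Internal gaps are left alone, since they may be real indels.
    Assumes `gap` is a single character.
    """
    without_leading = sequence.lstrip(gap)
    leading = len(sequence) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    return missing * leading + core + missing * trailing

print(recode_terminal_gaps("---ACG-TTA--"))  # ???ACG-TTA??
```

A sequence made up entirely of gaps comes back as all question marks, which matches the idea that nothing was sequenced for that taxon.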
Importing sequences: Naming



✤   Sequences from one dataset are matched up to another dataset by sequence name.

    ✤   Errors in sequence naming need to be fixed.

✤   We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
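
Matching by sequence name can be sketched as a small concatenation routine. This is a sketch under stated assumptions, not the program's implementation: the function name `concatenate` and the data layout (a dictionary from gene name to a dictionary of aligned sequences) are hypothetical, and taxa missing from a gene are padded with `?`.

```python
def concatenate(datasets, missing="?"):
    """Concatenate genes per taxon, matching taxa by exact sequence name.

    datasets: {gene name: {sequence name: aligned sequence}} (assumed layout).
    Taxa absent from a gene are padded with the missing-data symbol.
    """
    taxa = sorted({name for seqs in datasets.values() for name in seqs})
    combined = {name: "" for name in taxa}
    for gene, seqs in datasets.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in '{gene}' are not aligned to one length")
        length = lengths.pop()
        for name in taxa:
            combined[name] += seqs.get(name, missing * length)
    return combined

# Toy data, made up for illustration.
toy = {
    "coi": {"Macaca sylvanus": "ACGT", "Homo sapiens": "ACGA"},
    "cytb": {"Macaca sylvanus": "TTG"},
}
for taxon, sequence in concatenate(toy).items():
    print(taxon, sequence)
```

Because matching is by exact name, a misspelt name such as "Macaca sylvanis" in one file would produce a separate row padded with question marks, which is why naming errors need to be fixed first.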
Export: Taxonsets


✤   By default, we generate taxonsets on the
    basis of:

    ✤   Combined length.

    ✤   Number of character sets.

    ✤   Information for a particular gene.
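
As an illustration of the third kind of taxonset, the sketch below lists the taxa that have information for one gene and formats them as a Nexus `TAXSET` command. The function names, the set name `has_coi`, and the data layout are hypothetical; the names Sequence Matrix actually gives its taxonsets are not shown in these slides.

```python
def taxa_with_gene(datasets, gene, missing="?", gap="-"):
    """Names of taxa with at least one informative character for `gene`."""
    return sorted(
        name for name, seq in datasets[gene].items()
        if any(ch not in (missing, gap) for ch in seq)
    )

def nexus_taxset(set_name, taxa):
    """Format one Nexus TAXSET command; spaces in names become underscores."""
    names = " ".join(t.replace(" ", "_") for t in taxa)
    return f"TAXSET {set_name} = {names};"

# Toy data, made up for illustration.
toy = {"coi": {"Homo sapiens": "ACGA", "Gorilla gorilla": "????"}}
print(nexus_taxset("has_coi", taxa_with_gene(toy, "coi")))
```

The resulting command would sit inside a Nexus `SETS` block, where a downstream program can use it to include or exclude taxa.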
Gene trees



✤   Two ways to build them:

    ✤   Use the taxonset of taxa having information for a particular gene to exclude other
        taxa.

    ✤   Export the entire dataset with one file per column.
Export features



✤   You can also export the Sequence Matrix table as an Excel-readable text file.

    ✤   Supervisory mode.

    ✤   Keep track of a project as it grows.
Character sets


✤   We can read character sets defined in
    Nexus CHARSET and TNT xgroup
    commands.

✤   These can be “split” into individual
    columns, or imported as a single
    column representing the entire file.
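
A sketch of what "splitting" by character set could look like for simple Nexus `CHARSET` commands. It is an assumption-laden illustration, not the program's parser: the function names are hypothetical, and only plain positions and `start-end` ranges are handled (no step notation such as `\3`, no `.` for the last character, and no TNT `xgroup` support).

```python
import re

# Matches commands such as: CHARSET coi = 1-658;
CHARSET = re.compile(r"charset\s+(\S+)\s*=\s*([^;]+);", re.IGNORECASE)

def parse_charsets(nexus_text):
    """Return {charset name: [(start, end), ...]}, 1-based and inclusive."""
    charsets = {}
    for name, spec in CHARSET.findall(nexus_text):
        ranges = []
        for part in spec.split():
            start, _, end = part.partition("-")
            ranges.append((int(start), int(end or start)))
        charsets[name] = ranges
    return charsets

def split_by_charset(sequence, charsets):
    """Cut one concatenated sequence into one piece per charset."""
    return {
        name: "".join(sequence[start - 1:end] for start, end in ranges)
        for name, ranges in charsets.items()
    }

# Toy SETS block, made up for illustration.
sets_block = "BEGIN SETS;\n  CHARSET coi = 1-4;\n  CHARSET cytb = 5-7;\nEND;"
print(split_by_charset("ACGTTTG", parse_charsets(sets_block)))
```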
Excision


✤   Individual sequences can be excised
    from the dataset.

✤   Excised sequences will not be exported.

    ✤   Sequence Matrix will warn you about
        that.
Contamination


✤   You thought you were sequencing Gorilla gorilla

    ✤   but you were really sequencing Homo sapiens.

✤   We have two tools you can use:

    ✤   If Homo sapiens is in your dataset.

    ✤   If Homo sapiens is not in your dataset (experimental!).
H. sapiens in dataset

✤   Looks for pairs of sequences whose
    pairwise distance is very low.

✤   Expected difference depends on gene:

    ✤   28S doesn’t change very much, but

    ✤   COI changes very quickly.

✤   Some interpretation is required.
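
The search for very similar pairs can be sketched with an uncorrected p-distance (the proportion of differing sites). Whether Sequence Matrix uses exactly this distance is not stated here, so treat the measure, the function names, and the threshold as assumptions; the threshold in particular would need to be chosen per gene.

```python
from itertools import combinations

def p_distance(a, b, ignore="?-N"):
    """Proportion of differing sites among sites informative in both sequences.

    Returns None when the two sequences share no comparable sites.
    """
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in ignore or y in ignore:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else None

def suspicious_pairs(sequences, threshold):
    """Pairs of names whose p-distance is at or below `threshold`."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(sorted(sequences.items()), 2):
        d = p_distance(s1, s2)
        if d is not None and d <= threshold:
            hits.append((n1, n2, d))
    return hits

# Toy COI alignment, made up for illustration.
coi = {
    "Gorilla gorilla": "ACGTACGTAC",
    "Homo sapiens": "ACGTACGTAC",
    "Macaca sylvanus": "ACGAACCTTC",
}
print(suspicious_pairs(coi, threshold=0.01))
```

In this toy alignment the Gorilla and Homo sequences are identical, so that pair is flagged. For a slowly evolving gene such as 28S the same result might be unremarkable, which is where the interpretation comes in.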
H. sapiens not present

✤   Use “Pairwise Distance Mode” to look
    for unusual pairwise distances.

✤   Ignore one charset, then sort taxa based
    on their pairwise distance to a
    “reference taxon”.

    ✤   Colour sequences by their individual
        pairwise distances to the reference
        taxon.
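
The idea behind this mode can be sketched as a table: rank taxa by their distance to the reference taxon on every gene except the one under study, then show each taxon's distance on that gene alongside. This is an illustrative sketch only. The function names and data layout are hypothetical, an uncorrected p-distance is assumed, and every taxon is assumed to have data for every gene.

```python
def p_distance(a, b, ignore="?-N"):
    """Proportion of differing sites among sites informative in both sequences."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in ignore and y not in ignore]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def rank_against_reference(datasets, reference, gene):
    """Rows of (taxon, distance on the other genes, distance on `gene`),
    sorted by distance to `reference` on the other genes."""
    others = [g for g in datasets if g != gene]
    ref_rest = "".join(datasets[g][reference] for g in others)
    rows = []
    for taxon in datasets[gene]:
        if taxon == reference:
            continue
        tax_rest = "".join(datasets[g][taxon] for g in others)
        rows.append((taxon,
                     p_distance(ref_rest, tax_rest),
                     p_distance(datasets[gene][reference], datasets[gene][taxon])))
    return sorted(rows, key=lambda row: row[1])

# Toy data, made up for illustration.
toy = {
    "28S": {"Gorilla gorilla": "AAAAAAAAAA",
            "Macaca sylvanus": "AAAAAAAAAT",
            "Daubentonia madagascariensis": "AAAAAAAATT"},
    "coi": {"Gorilla gorilla": "CCCCCCCCCC",
            "Macaca sylvanus": "CCCCCCCCGG",
            "Daubentonia madagascariensis": "CCCCCCCCCC"},
}
for row in rank_against_reference(toy, "Gorilla gorilla", "coi"):
    print(row)
```

In the toy output, Daubentonia is the more distant taxon on 28S yet identical to the reference on COI. A value that breaks the ranking like this is the kind of out-of-place colour the mode is meant to expose.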
H. sapiens not present

✤   Colour pairwise distances on the gene
    in question by their pairwise distance to
    the reference taxon.

✤   Look for colour variation which is
    unusual or out of place.

✤   We would expect sequences from
    different species to be correlated.
Pairwise distance mode

✤   You need to vary:

    ✤   The gene you are studying.

    ✤   The reference taxon being compared
        against.

✤   Possibly helpful as an alert mechanism.
Summary

✤   Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤   Taxonsets allow you to analyse subsets of your data in downstream programs.

✤   Excising sequences gives you greater control over which sequences to analyse.

✤   You can look for contamination in two ways:

    ✤   Looking for very low pairwise distances across your entire dataset.

    ✤   Looking for unusual pairwise distances in Pairwise Distance Mode.
Acknowledgements

✤   Rudolf Meier

✤   Zhang Guanyang

✤   Farhan Ali

✤   David Lohman

✤   Everybody at the NUS DBS
    Evolutionary Biology lab.
Question time!
