DNA sequence screening software that implements the best match method recommended by the federal government.
Publication: Adam L et al, Strengths and limitations of the federal guidance on synthetic DNA, Nature Biotechnology (2011) 29, 208–210 doi:10.1038/nbt.1802
US Department of Health and Human Services voluntary guidelines “Screening Framework Guidance for Synthetic Double-Stranded DNA Providers” November 2009.
Software: http://sourceforge.net/projects/genothreat/
8. Industry Response to Dual Use
• 5 members (all based in Germany)
• Undersigned by:
► 6 German or German/American
► 2 Chinese
• “Code of Conduct for Best Practices in Gene Synthesis”
• 5 companies (American)
• 80% of worldwide synthesis capacity
• “Harmonized Screening Protocol”
7/10/2014 8GenoTHREAT
10. Our Primary Objectives
1. Interpret the (draft) guidance as an algorithm
2. Implement it as software: GenoTHREAT
3. Characterize screening efficacy
11. Road Map
I. Current regulations
II. Sequence screening algorithm: interpreting the guidance
III. GenoTHREAT: implementation and characterization
IV. Conclusions
12. [Guidance] : Purpose
“[…] to minimize the risk that unauthorized individuals
or individuals with malicious intent will obtain “toxins
and agents of concern” through the use of nucleic
acid synthesis technologies, and to simultaneously
minimize any negative impacts on the conduct of
research and business operations.”
13. [Guidance] : Goals of sequence screening
• Agent of concern?
• Select Agents and Toxins
• Sequences of concern?
• “dsDNA sequences derived from or encoding Select Agents and Toxins”
• Sequences unique to a Select Agent
• No housekeeping genes
• Screen both DNA strands and the six-frame translation
• Detect any “sequence of concern”
• Embedded: as small as 200 bp
Use the Best Match approach (at least)
16. [Guidance] : Major Points
1. Perform Six Frame Translation
2. Divide the query sequences into subsequences of 200bp or 66aa
3. For each subsequence
i. BLAST
ii. Best Matches
iii. Flag if SAT
4. Automatic decision
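Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration, not GenoTHREAT's code: the function names are made up for this example, the standard genetic code is assumed, and the fixed, non-overlapping windows (with a shorter final window) are one possible reading of "divide into subsequences".

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', trailing bases are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three reading frames on each strand (keys '+1'..'+3', '-1'..'-3')."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def split_windows(seq, size):
    """Cut a sequence into consecutive windows of `size` (the last one may be shorter)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# 200 bp windows for the nucleotide query, 66 aa windows for each translated frame.
query = "GATTTGGACACTCATTTCACC" * 20  # 420 bp toy query
dna_windows = split_windows(query, 200)
protein_windows = {name: split_windows(protein, 66)
                   for name, protein in six_frame_translation(query).items()}
```

Each window in `dna_windows` and `protein_windows` would then be sent to BLAST in step 3.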
17. Road Map
I. Current regulations
II. Sequence screening algorithm: interpreting the guidance
1. Perform Six Frame Translation
2. Divide the query sequences into subsequences of 200bp or 66aa
3. For each subsequence
i. BLAST
ii. Best Matches
iii. Flag if SAT
4. Automatic decision
III. GenoTHREAT: implementation and characterization
IV. Conclusions
23. [Algorithm] : What should we do with subsequences?
26. Basic Local Alignment Search Tool (BLAST)
• Developed at the U.S. National Center for Biotechnology Information
• One of the most widely used bioinformatics tools
• Aligns query sequences against sequences in the GenBank sequence database
• Algorithm emphasizes speed over sensitivity
28. BLAST Output
Percent Identity
► The percentage of identical nucleotides (or amino acids) in the aligned region
Query Coverage
► The percentage of the query sequence length covered by the alignment
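As a rough illustration of how these two values relate to an alignment, the sketch below computes them from a gapped alignment and from query coordinates. The function names and argument conventions (1-based inclusive coordinates, `-` for gaps) are assumptions made for this example; BLAST reports both values directly.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical columns as a percentage of the alignment length (gaps count as mismatches)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_start, query_end, query_length):
    """Percentage of the query covered by the alignment (1-based, inclusive coordinates)."""
    return 100.0 * (query_end - query_start + 1) / query_length
```

For a 200 bp subsequence aligned from position 21 to 180, `query_coverage(21, 180, 200)` gives 80.0: the alignment can have a high percent identity and still leave part of the fragment unmatched.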
29. [Algorithm] : What should we do with all those BLAST results?
31. [Guidance] : The Best match approach
• Use a local sequence alignment tool
• Suggested: BLAST
• Best matches = greatest percent identity over the entire fragment
• 66 aa or 200 bp fragments
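One way to read the Best Match definition is sketched below. It assumes that "over the entire fragment" means 100% query coverage and that ties at the top percent identity are all kept. The `Hit` type and the `min_coverage` parameter are illustrative names, not part of the guidance or of GenoTHREAT. The sample values are the ones shown in the "Is this subsequence a hit?" example.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    description: str
    percent_identity: float
    query_coverage: float


def best_matches(hits, min_coverage=100.0):
    """Keep hits that cover the whole fragment, then return those with the top percent identity."""
    full_length = [h for h in hits if h.query_coverage >= min_coverage]
    if not full_length:
        return []
    top = max(h.percent_identity for h in full_length)
    return [h for h in full_length if h.percent_identity == top]


hits = [
    Hit("Bacillus anthracis", 100, 100),
    Hit("Bacillus anthracis str. Sterne", 100, 100),
    Hit("Danio rerio", 97, 100),
    Hit("Danio rerio", 43, 80),
]
best = best_matches(hits)
```

With these values, `best` holds the two Bacillus anthracis entries, which agrees with the Best Matches listed in that example.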
36. [Algorithm] : How can we know if a Best Match is to a Select Agent or Toxin?
Problem: no suggestion in guidance
Solution: keyword and anti-keyword list
37. [Example] : Is this subsequence a hit?
BLAST results                     PI (%)   QC (%)
Bacillus anthracis                100      100
Bacillus anthracis str. Sterne    100      100
Danio rerio                       97       100
Danio rerio                       43       80
Best matches
Bacillus anthracis
Bacillus anthracis str. Sterne
38. [Example] : Keyword vs. Anti-keyword
If a GenBank entry contains a keyword, then the sequence is flagged
SA
39. [Example] : Keyword vs. Anti-keyword
If a GenBank entry contains both a keyword and an anti-keyword, the sequence is not flagged
NSA
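The two rules can be sketched as a small classifier over the description of a GenBank entry. The word lists below are hypothetical placeholders chosen for the example, not the lists used by GenoTHREAT, and simple case-insensitive substring matching is an assumption.

```python
# Hypothetical placeholder lists, for illustration only.
KEYWORDS = ("bacillus anthracis",)
ANTI_KEYWORDS = ("sterne",)


def classify_entry(description, keywords=KEYWORDS, anti_keywords=ANTI_KEYWORDS):
    """Return 'SA' if the entry contains a keyword and no anti-keyword, else 'NSA'."""
    text = description.lower()
    has_keyword = any(word in text for word in keywords)
    has_anti_keyword = any(word in text for word in anti_keywords)
    return "SA" if has_keyword and not has_anti_keyword else "NSA"
```

Under these placeholder lists, `classify_entry("Bacillus anthracis")` returns `"SA"`, while `classify_entry("Bacillus anthracis str. Sterne")` returns `"NSA"` because the anti-keyword overrides the keyword.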
45. [Algorithm] : Points of the Guidance left to interpretation
How do you identify sequences of concern of 200 bp or greater that partially span two adjacent subsequences?
Problem: no suggestion in guidance
Solution: extension method
60. Road Map
I. Current regulations
II. Sequence screening algorithm: interpreting the guidance
III. GenoTHREAT: implementation and characterization
1. Software implementation
2. Software Characterization
IV. Conclusions
61. Using BLAST
Online BLAST
Performs BLAST via the NCBI website interface
► Faster per BLAST
► Computationally less expensive
► Only sequential, due to NCBI restrictions
► Lack of privacy
Local BLAST
Performs BLAST in parallel on the local machine
► User privacy
► Faster per sequence due to parallelization
► Computationally expensive (memory and CPU intensive)
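The parallel local option could look roughly like the sketch below, which fans subsequences out to a pool of workers. It assumes an NCBI BLAST+ installation providing the `blastn` executable. The helper names and the `run_blast` callable are hypothetical, and the source does not state how GenoTHREAT actually invokes BLAST.

```python
from concurrent.futures import ThreadPoolExecutor


def blastn_command(query_path, database, out_path):
    """Build a BLAST+ command line producing tabular output (-outfmt 6)."""
    return ["blastn", "-query", query_path, "-db", database,
            "-outfmt", "6", "-out", out_path]


def screen_in_parallel(subsequences, run_blast, max_workers=4):
    """Apply `run_blast` to every subsequence concurrently, keeping the input order.

    `run_blast` is any callable taking one subsequence, for example a function
    that writes it to a FASTA file and runs blastn_command(...) via subprocess.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_blast, subsequences))
```

Threads are enough here because each worker would mostly wait on an external BLAST process. The memory and CPU cost noted above comes from running several such processes at once.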
62. Screening time & hardware
Hardware configurations compared: Online, Desktop, Business Class Server
Sequence length (bp)    Screening time (min)*
2,000                   2
10,000                  12.5
*Screening performed using business class server
63. Road Map
I. Current regulations
II. Sequence screening algorithm: interpreting the guidance
III. GenoTHREAT: implementation and characterization
1. Software implementation
2. Software Characterization
i. Database of test sequences
ii. Keyword list variation
iii. Detection of Potentially dangerous sequences
iv. BLAST parameters
v. Real world gene orders simulation
IV. Conclusions
64. Database of Test Sequences
• Implementations must be compared to assess quality
• Standardized set of test sequences is needed
• Test Set contains 184 sequences:
• Select Agents
o Genes associated with toxins or pathogenicity
o Genes associated with normal function
• Model Organisms
65. Database of Test Sequences
Contribute to the development of a standard test set of sequences
67. Keyword and Anti-Keyword list
• Test with the unmodified sequences (184 sequences)
• Two lists of keywords
• Limited
• Extensive
• With or without the anti-keyword list
68. Keyword List Content Variation
[Bar chart: Correct SAT and Correct NSAT for Limited keywords vs. Extensive keywords; axis from 0 to 120]
Keyword list method not mentioned in guidance
Limited keyword list: composed only of words in the SAT List
Extensive keyword list: extension of the limited keyword list containing words uniquely related to SAT.
71. Modified Test Sequences
Modification performed on the initial unmodified
sequences
► Intervening sequences
► Degenerate sequences
► Mutated sequences (BLAST parameters)
72. Degenerate Sequences
Potential Danger: Codon optimized nucleotide sequences
Unmodified nucleotide: GATTTGGACACTCATTTCACC
Amino acid sequence: DLDTHFT
Degenerate nucleotide: GATACGTCAACCTTTTAA GC
Result: all codon optimized sequences detected due to screening of amino acid sequences
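To see why amino acid screening catches recoded sequences, the sketch below swaps every codon of the unmodified example for a synonymous one and compares the two translations. It is an illustration under assumptions (standard genetic code, A/C/G/T input only, a naive "first alternative codon" rule); `recode_synonymous` is a made-up helper, not a real codon optimization tool.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))


def translate(dna):
    """Translate codon by codon in frame 1; trailing bases are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def recode_synonymous(dna):
    """Replace each codon with a different codon for the same amino acid, if one exists."""
    dna = dna.upper()
    recoded = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        alternatives = [c for c, a in CODON_TABLE.items()
                        if a == amino_acid and c != codon]
        recoded.append(alternatives[0] if alternatives else codon)
    return "".join(recoded)


def nucleotide_identity(a, b):
    """Percentage of positions that are identical between two equal-length sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


original = "GATTTGGACACTCATTTCACC"
recoded = recode_synonymous(original)
```

The recoded DNA differs from the original at the nucleotide level, yet `translate(recoded)` and `translate(original)` are both `DLDTHFT`, so a protein-level comparison still finds the match.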
73. Intervening sequences
Potential Danger: SAT sequences hidden within larger, benign sequences
[Diagram, blocks in the order shown: 300 bp NSAT | 200 bp SAT | 300 bp NSAT | 300 bp NSAT | 300 bp NSAT | 250 bp SAT]
Result: All hidden sequences were detected
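The sketch below shows where hidden SAT blocks would fall relative to fixed 200 bp windows. It assumes the blocks are concatenated in the order listed on this slide and that windows are non-overlapping and start at position 0; both are assumptions of the example, and `window_overlaps` is a made-up helper.

```python
def window_overlaps(blocks, window=200):
    """Map window index -> number of SAT bases falling inside that window.

    `blocks` is a list of (length_bp, label) tuples laid end to end.
    """
    overlaps = {}
    position = 0
    for length, label in blocks:
        start, end = position, position + length
        if label == "SAT":
            for w in range(start // window, (end - 1) // window + 1):
                w_start, w_end = w * window, (w + 1) * window
                overlaps[w] = overlaps.get(w, 0) + min(end, w_end) - max(start, w_start)
        position = end
    return overlaps


layout = [(300, "NSAT"), (200, "SAT"), (300, "NSAT"),
          (300, "NSAT"), (300, "NSAT"), (250, "SAT")]
sat_per_window = window_overlaps(layout)
```

Under these assumptions the 200 bp SAT block is split evenly across windows 1 and 2 (100 bp each), which is the case of a sequence of concern partially spanning two adjacent subsequences. The 250 bp block fills window 7 and spills 50 bp into window 8.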
81. Real world gene orders simulation
Gene synthesis companies need a low number of false hits
1. iGEM registry
• Registry completed by iGEM teams each year
• Contains 10,000 sequences
2. GenoCAD database
• 1,258 sequences longer than 200 bp
82. iGEM Registry
First step: screen registry sequences 1 to 1,724
Hit rate: 6.5%
Major causes of hits:
• 100% query coverage for Best Match too restrictive
• Some results have 100% query coverage but very low Percent Identity
• Keyword list issues
Slide annotations, in order: 95%, 60%, solved
Resulting hit rate: 2.9%
85. Real world gene orders simulation
Hits left are due to:
• Very often: 1 subsequence of 1 protein frame leads to a correct hit
Is it worth flagging the entire sequence?
• Sometimes: many subsequences lead to correct hits
Probably worth flagging
87. GenoTHREAT
• “Best Match”
• Hardware and software parameters
• Keyword list
• BLAST parameters
• Certain types of sequence modifications
• High-resolution screen
88. Guidance conclusion
Government Guidance potentially usable by companies:
• Reasonable time
• Good detection of sequences of concern
• Number of false hits potentially low (manual review)
94. 7/10/2014 GenoTHREAT 94
A T A A C T C C C T G G G T C G T T A A A C C G G
C G G C T G C G G C A G T C T T A G C A T A A T A
A T C G G A T A G C A C T T T A T G A C C T G T C
G T C G G G G C A C T A A A T G A A C T A G T G G
C A G T A A C T G T C A G G C A G C A T A T A C A
A C G T T C A A A T A A C T G C A T A G A A C C C
A G A A T A A C T A C C A C C A C C G A A T C T T
T A T C C A G A C G A C T G C A T G A C T C G C T
T C T A C G A C G G T G A A T G A C G T T G G G T
T G C G T C G C A T G G T A C C T A C T T A A C T
T C G G T C G C T C A A T G A T C T G C A A A A G
A A T C G G C T A T T G G A C T C C T A G G C G C
G T C T T A T A T A T G C G G C G C T T T T A C G
A T C C G G A C A T A A T C T A A G G T A T C G T
A C G C G C G G G A A C A C G A G G T T G T A A C
A C C G T A G C T A T C T C A T G C A T T C C G A
C C A G C G G T T A T A T A A T A C T C G T T T T
T T C C G C G T G C C A T C A T A C G A C G C T G
G C C G C C G C G T T A G T G T C G T G T G T A C
A C A C C G A G T T A C C C T C C T T C G T T C G
C A C C A G C G T T A C T G C G T G T A G A G G A
A A T T G G C T T G A G A G C T T T G C C C C A C
C G C A C G A G G T A A C T A T T G A G A T C A G
T C T A C A G A G T G C A A T A C A C C A A C G C
http://sourceforge.net/projects/genothreat/
95. Acknowledgements
Dr. Jean Peccoud
Mandy L. Wilson
The VT-ENSIMAG iGEM team (2010):
Michael Kozar
Gaelle Letort
Olivier Mirat
Arunima Srivastava
Tyler Stewart
My PhD committee:
Dr. Bevan
Dr. Garner
Dr. Peccoud
Dr. Ramakrishnan
Dr. Setubal