Gene transcripts are the lens through which we understand variants that are identified by genome sequencing, reported in scientific literature, and communicated on clinical reports. An accurate, shared representation of transcripts is essential to communicating variants reliably. This talk presents observations of significant discrepancies between sources of transcripts that will lead to discrepancies in the clinical interpretation of variants, and tools that we have released to contend with these complexities.
The Clinical Significance of Transcript Alignment Discrepancies
1. Ⓒ 2014 Invitae
Reece Hart, Ph.D.
reece@invitae.com
Human Variome Project Meeting 2014, Paris
The Clinical Significance of Transcript Alignment Discrepancies
… and tools to help you deal with them.
2. The fidelity of transcript-genome mapping matters.
Variants are identified and computed on in genome coordinates.
Variants are analyzed and communicated using transcript coordinates.
genome to transcript (g. to c.)
transcript to genome (c. to g.)
3. Motivation 1: Discordant exon coordinates
NCBI and UCSC report different coordinates for NM_052813.3, exon 12
[Figure: UCSC (BLAT) and NCBI (Splign) alignments; exon 12 displaced 322 nt]
Consequences:
1. An assay that targets the wrong genomic region will generate uninformative sequence data.
2. A genomic variant will be interpreted as exonic when it is intronic, or vice versa.
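The second consequence can be illustrated with a minimal sketch. The coordinates below are hypothetical (they are not the real NM_052813.3 alignments), and `classify` is an illustrative helper, not part of any released tool. It only shows how the same genomic position changes class when one source places an exon 322 nt away from the other.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open genomic interval [start, end)


def classify(pos: int, exons: List[Exon]) -> str:
    """Classify a genomic position against one source's exon structure."""
    ordered = sorted(exons)
    for start, end in ordered:
        if start <= pos < end:
            return "exonic"
    # Between the first exon start and the last exon end, but in no exon.
    if ordered and ordered[0][0] <= pos < ordered[-1][1]:
        return "intronic"
    return "intergenic"


# Hypothetical exon sets: the last exon is placed 322 nt apart by two sources.
source_a = [(1000, 1100), (5000, 5150)]
source_b = [(1000, 1100), (5322, 5472)]

variant_pos = 5100
print(classify(variant_pos, source_a))  # exonic
print(classify(variant_pos, source_b))  # intronic
```

The same position is exonic under one alignment and intronic under the other, which is the kind of interpretive disagreement the slide describes.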
12. “RefAgree” Statistics by Protein Coding Transcript
Sequence concordance between RefSeq and GRCh37 primary assembly
c.f. Garla V, et al. Bioinformatics 27(3): 416–8 (2010).
34531 NM transcripts (Jan 2014)
760 0.2% with length discrepancies
3481 10% with substitutions
321 0.9% with deletions
255 0.7% with insertions
➊➋
13. NCBI (Splign) v. UCSC (BLAT) Alignment Statistics
Splign and BLAT provide significantly different exon structures for 886 transcripts
32358 transcripts w/exon structures
Are Splign and BLAT similar?
Y: 31472 (97.3%) transcripts
N: 886 (2.7%) transcripts ➌
“similar” means either
1) identical exon coordinates, or
2) coordinates that differ only by
short 3' terminal artifacts
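One way to read this definition of “similar” is sketched below. This is not the analysis code used for the talk: the `similar` function, the `strand` argument, and the `max_artifact` threshold of 10 nt are all assumptions made for illustration, since the slide does not say how short a 3' artifact must be.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # genomic interval (start, end), start < end


def similar(a: List[Exon], b: List[Exon], strand: int,
            max_artifact: int = 10) -> bool:
    """True if two exon structures are identical, or differ only at the
    3' boundary of the terminal exon by at most max_artifact nucleotides.

    strand is +1 or -1; max_artifact is a hypothetical threshold.
    """
    a, b = sorted(a), sorted(b)
    if a == b:
        return True
    if len(a) != len(b):
        return False
    # Index of the 3'-terminal exon when exons are in genomic order.
    last = len(a) - 1 if strand == 1 else 0
    for i, (ea, eb) in enumerate(zip(a, b)):
        if ea == eb:
            continue
        if i != last:
            return False
        if strand == 1:
            same_5p, diff_3p = ea[0] == eb[0], abs(ea[1] - eb[1])
        else:
            same_5p, diff_3p = ea[1] == eb[1], abs(ea[0] - eb[0])
        if not same_5p or diff_3p > max_artifact:
            return False
    return True


# Hypothetical exon structures on the plus strand.
splign = [(100, 200), (300, 400), (500, 650)]
blat_short_tail = [(100, 200), (300, 400), (500, 645)]
blat_displaced = [(100, 200), (300, 400), (822, 972)]

print(similar(splign, blat_short_tail, strand=1))  # True
print(similar(splign, blat_displaced, strand=1))   # False
```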
14. Characterization of transcript discrepancies
Whether alignments provided by NCBI and UCSC agree with GRCh37 primary sequence.
Splign / BLAT
     T    F
T   14   18
F  545  311
886 transcripts with significant discrepancies
15. Characterization of transcript discrepancies
Reference agreement (blue) and alignment “simplicity” (green)
Splign / BLAT
     T    F
T   14   18
F  545  311
[Figure: each cell of the reference-agreement table is subdivided by alignment “simplicity” into a Splign / BLAT T/F sub-table. Sub-table values as extracted: 200 (0), 4 (97), 90 (82), 16 (84); 6 (41), 12 (180); 434 (7), 110 (652); 14 (11).]
886 transcripts with significant discrepancies
16. Summary of Splign-BLAT gene-wise coordinate deltas.
delta     # genes   # ACMG must report
=0          15206   44
>=1           183    8
>=10          116    0
>=25            6    0
>=50            5    0
>=250          13    0
>=1000         94    2
ND                   3
delta ≝ minimum per gene of maximum per transcript of difference of exon coordinates between NCBI and UCSC.
MYH7, TNNI3
(all trivial diffs)
LDLR, MYL2, PRKAG2, SDHB, SDHC, TGFBR1, TGFBR2, WT1
APOB, MYBPC3, NTRK
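The delta definition on this slide can be sketched as follows. The function names, the input layout, and the accessions are hypothetical, and treating a transcript with mismatched exon counts as "not determined" is an assumption; the slide does not say how ND arises.

```python
from typing import Dict, List, Optional, Tuple

Exon = Tuple[int, int]
ExonSets = Dict[str, List[Exon]]  # transcript accession -> exon intervals


def transcript_delta(a: List[Exon], b: List[Exon]) -> Optional[int]:
    """Maximum absolute difference between corresponding exon boundaries.
    Returns None when exon counts differ (assumed to mean 'ND')."""
    if len(a) != len(b):
        return None
    return max(
        (max(abs(sa - sb), abs(ea - eb))
         for (sa, ea), (sb, eb) in zip(sorted(a), sorted(b))),
        default=0,
    )


def gene_deltas(gene_txs: Dict[str, List[str]], ncbi: ExonSets,
                ucsc: ExonSets) -> Dict[str, Optional[int]]:
    """Per gene: minimum over transcripts of the per-transcript maximum."""
    result: Dict[str, Optional[int]] = {}
    for gene, accessions in gene_txs.items():
        deltas = [transcript_delta(ncbi[ac], ucsc[ac])
                  for ac in accessions if ac in ncbi and ac in ucsc]
        known = [d for d in deltas if d is not None]
        result[gene] = min(known) if known else None
    return result


# Hypothetical genes, accessions, and coordinates.
ncbi = {"TX_A.1": [(100, 200), (300, 400)],
        "TX_A.2": [(100, 200), (300, 450)],
        "TX_B.1": [(10, 20)]}
ucsc = {"TX_A.1": [(100, 200), (300, 400)],
        "TX_A.2": [(100, 200), (622, 772)],
        "TX_B.1": [(10, 20), (30, 40)]}
gene_txs = {"GENE_A": ["TX_A.1", "TX_A.2"], "GENE_B": ["TX_B.1"]}

print(gene_deltas(gene_txs, ncbi, ucsc))  # {'GENE_A': 0, 'GENE_B': None}
```

Because the gene-level value is a minimum, a gene scores 0 as soon as one of its transcripts is placed identically by both sources, even if another transcript (here `TX_A.2`, off by 322 nt) is discordant.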
17. HGVS Python Package
http://bitbucket.org/invitae/hgvs/
➢ Parser
● HGVS → Python object
● Based on a Parsing Expression Grammar
➢ Formatter
● Python object → HGVS
➢ Validator
● intrinsic & extrinsic validation
➢ Mapping tools (indel-aware!)
● g. ↔ c. → p. (m, n, r also supported)
● transcript-to-transcript liftover
● uses UTA data
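The parse/format round trip can be pictured with a small standalone sketch. It deliberately does not use the hgvs package or its Parsing Expression Grammar: a regular expression stands in for the grammar, only simple substitutions are handled, and `SimpleVariant`, `parse_sub`, and `format_sub` are hypothetical names invented for this illustration.

```python
import re
from typing import NamedTuple


class SimpleVariant(NamedTuple):
    ac: str    # sequence accession, e.g. NM_182763.2
    type: str  # coordinate type: g, c, m, or n
    pos: str   # position as text; may carry an intronic offset
    ref: str
    alt: str


# Regex stand-in for a grammar; covers single-nucleotide substitutions only.
_SUB_RE = re.compile(
    r"^(?P<ac>[A-Z]+_?\d+\.\d+):(?P<type>[gcmn])\."
    r"(?P<pos>[-*]?\d+(?:[+-]\d+)?)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_sub(text: str) -> SimpleVariant:
    """HGVS string -> Python object (simple substitutions only)."""
    m = _SUB_RE.match(text)
    if m is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    return SimpleVariant(**m.groupdict())


def format_sub(v: SimpleVariant) -> str:
    """Python object -> HGVS string."""
    return f"{v.ac}:{v.type}.{v.pos}{v.ref}>{v.alt}"


v = parse_sub("NM_182763.2:c.688+403C>T")
print(v.ac, v.type, v.pos, v.ref, v.alt)
print(format_sub(v))
```

A string that omits the coordinate type, such as `NM_182763.2:688+403C>T`, is rejected with `ValueError`. That is a crude analogue of intrinsic validation; extrinsic validation (checking the reference allele against sequence data) needs a data source and is not attempted here.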
18. Example: Variant liftover between transcripts
Map
from ➀ NM_182763.2:c.688+403C>T
to ➁ NC_000001.10:g.150550916G>A
to ➂ NM_001197320.1:c.281C>T
with Splign alignments
[Figure: ➀ NM_182763.2 (NP_877495.1) and ➂ NM_001197320.1 (NP_001184249.1) aligned to ➁ NC_000001.10]
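In this example the alleles change from C>T in transcript coordinates to G>A in genome coordinates, which is what one expects when a transcript aligns to the minus strand of the genomic reference. The strand is inferred here from the alleles, not stated on the slide. A minimal sketch of that one step follows; `flip_alleles` is a hypothetical helper, and real mapping also has to translate positions through the exon alignment, which this does not attempt.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (uppercase ACGT only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_alleles(ref: str, alt: str) -> tuple:
    """Re-express ref/alt alleles on the opposite strand."""
    return reverse_complement(ref), reverse_complement(alt)


print(flip_alleles("C", "T"))  # ('G', 'A')
```

Applying the same flip again returns the original alleles, matching the C>T reported on the second transcript after the round trip through the genome.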
19. Developer Info
Testing
➢ 91% code coverage
➢ 25665 test variants
● ~200 hand curated, rest from dbSNP
● 23436 sub, 1254 del, 908 ins, 45 delins, 22 dup
● 44 distinct transcripts, many selected for difficulty
Upcoming issues (all issues are publicly readable)
➢ multi-variant alleles
➢ release LRG
➢ GRCh38
➢ API changes
20. Acknowledgements
➢ Vince Fusaro
➢ John Garcia
➢ Emily Hare
➢ Kevin Jacobs
➢ Geoff Nilsen
➢ Rudy Rico
➢ Jody Westbrook
http://bitbucket.com/invitae/
➢ Code (Python)
➢ Documentation & Examples
➢ Issues
➢ BED files
➢ Code testing is public
Or just:
pip install hgvs
22. UTA solves four issues with transcript management.
Transcript ≠ Genome Reference
➊ SNV
➋ InDel
➌ Exon coordinate differences between sources for same accession
➍ Historical transcript alignments no longer available
[Figure: RefSeq NM_01234.4, RefSeq NM_01234.5, and UCSC NM_01234.5 compared against the genome reference]
23.
24. ENSTs equivalent with NMs
=> select N.hgnc,N.es_fingerprint,N.tx_ac,E.tx_ac
from uta_20140210.tx_exon_set_summary_mv N
join uta_20140210.tx_exon_set_summary_mv E
on N.es_fingerprint=E.es_fingerprint
and N.tx_ac ~ '^NM_' and E.tx_ac ~ '^ENST'
and N.alt_aln_method='transcript'
and E.alt_aln_method='transcript';
┌─────────┬──────────────────────────────────┬────────────────┬─────────────────┐
│ hgnc    │ es_fingerprint                   │ tx_ac          │ tx_ac           │
├─────────┼──────────────────────────────────┼────────────────┼─────────────────┤
│ AFF2    │ db0e20be1a2bb687c33227d2e6bf9d53 │ NM_002025.3    │ ENST00000370460 │
│ UBE3A   │ d1eace7da295c45378fa5f898f2f03f6 │ NM_130838.1    │ ENST00000438097 │
│ ANXA8L1 │ 1f6fd4f3fe9854aa468489ec7f507512 │ NM_001098845.1 │ ENST00000359178 │
│ APOL5   │ 939a9e9e4a46ef9aef862cf9b369afe6 │ NM_030642.1    │ ENST00000249044 │
│ ARID4B  │ 524fc954d10b08a4014e86aee81d0358 │ NM_016374.5    │ ENST00000264183 │