Taking Advantage of GRCh38
Valerie Schneider
12 February 2014
Introducing GRCh38

GRCh38: Dec. 24, 2013
Time for change

GRCh37.p13
• 178 Regions: 3.15% of chromosome sequence
• 131 FIX patches: add 6.8 Mb novel sequence
• 73 NOVEL patches: add >800 kb novel sequence

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes
GRCh38: Assembly Stats

http://genomereference.org
GRCh38: Annotation Stats
GRCh38 Sequence Updates
• SNV, MAF=0: n=15,244
• Insertions, MAF=0: n=834
• Deletions, MAF=0: n=1541
• MAF<5% mismatch in pseudogene/processed transcript: n=1413
• Annotator and clinical requests: n= ~260
GRCh38 Sequence Updates
Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components

n=10489

79% of these bases are heterozygous in RP11 WGS
GRCh38: Sequence Updates

Coding Consequences
GRCh38 Model Centromeres
Until now, centromeres have been defined as multi-megabase gaps in the assembly
GRCh38 Model Centromeres
Karen Miga (Kent Lab, UCSC)
GRCh38 Model Centromeres

http://genomereference.org
GRCh38 Sequence Addition

1q32

1q21 1p21

Dennis et al., 2012
GRCh38 Path Updates
HYDIN: chr16 (16q22.2)

Doggett et al., 2006

HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36

Unlocalized in GRCh37

Placed in GRCh38

Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID
GRCh38: Novel Sequence
GRCh38 Alt Loci
Sequences from haplotype 1
Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes
GRCh38: Alt Loci
Part of chr22 assembly
Alternate locus for chr22
Kidd et al., PLoS Genet. (2007) PMID: 17447845

Black: deletion configuration
GRCh38: Alt Loci
reads

On-target alignment

alt/patch
Off-target alignments
chromosome

(n=122,922)
GRCh38: Alt Loci
Masks and alt aware aligners reduce the incidence of
ambiguous alignments observed when aligning reads to
the full assembly

Mask1: mask chr for fix patches, scaffold for novel/alts.

Mask2: mask only on scaffolds
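
The two masking schemes above can be sketched in a few lines. This is only an illustration, assuming that "masking" means hard-masking bases to N and that a region is duplicated between a chromosome and a patch or alt scaffold; the function names, the string labels, and the interval convention (0-based, half-open) are hypothetical and are not taken from the GRC's own tooling.

```python
def hard_mask(sequence, intervals):
    """Replace the bases in each 0-based, half-open interval with N."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


def side_to_mask(scaffold_type, mask_name):
    """Pick which copy of a duplicated region to mask (hypothetical labels).

    scaffold_type: "fix", "novel" or "alt"
    mask_name:     "mask1" or "mask2"
    """
    if mask_name == "mask1":
        # Mask1: mask the chromosome for fix patches, the scaffold otherwise.
        return "chromosome" if scaffold_type == "fix" else "scaffold"
    if mask_name == "mask2":
        # Mask2: mask only on scaffolds.
        return "scaffold"
    raise ValueError(f"unknown mask: {mask_name}")


print(hard_mask("ACGTACGT", [(2, 5)]))
print(side_to_mask("fix", "mask1"))
```

Either way, only one copy of the duplicated sequence stays visible to the aligner, which is what removes the ambiguous placements.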
http://www.ncbi.nlm.nih.gov/genome/tools/remap

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38
GRCh38 Credits

Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes

GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Matthew Hurles
• Richard Gibbs

Editor's notes

  1. The Genome Reference Consortium (GRC) released the latest human reference assembly, GRCh38, on Dec. 24. While this updated assembly has many improvements, and some groups have been eagerly awaiting its release, the GRC is well aware that many users may feel the same way about GRCh38 as we all feel about the gift of new socks. Today I’m going to tell you about some of the new features in the assembly and how these updates make GRCh38 a better substrate for analyses. In the end, I’d like to convince you that whether GRCh38 was on your wish list or not, like a new pair of socks, it’s in better shape than what’s sitting in your wardrobe and, ultimately, you’ll be able to put it to good use.
  2. GRCh37 was released in 2009 and used a new assembly model in which alternate loci scaffolds were included to provide additional sequence representations for variant genomic regions. GRCh37 had 3 such regions and 9 alternate loci scaffolds. Since then, the GRC has continued to update the assembly. Many of these updates were released as non-coordinate-changing patch scaffolds. The patches came in two flavors:
     • FIX patches corrected problems in existing assembly sequence
     • NOVEL patches added new alternate sequence representations
     As shown in the box, nearly 200 regions of GRCh37 were associated with a patch, and these updates added almost 8 Mb of novel sequence to the reference assembly. Furthermore, not every assembly update was released as a patch. As this pie chart shows, the GRC resolved just over 1000 issues for GRCh38. As a result, the GRC and members of its SAB agreed that it was time for a major assembly release. So today we have GRCh38, which now has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci.
  3. I’d now like to introduce GRCh38 with some basic assembly statistics. These and additional stats for the GRCh38 assembly are available on the GRC website. One measure of an assembly’s continuity is scaffold N50. You can see here that scaffold N50s increased for almost every chromosome in GRCh38, indicating the reference assembly is more contiguous than ever.
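
For readers unfamiliar with the statistic, scaffold N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch follows; the scaffold lengths are made-up illustrative numbers, not GRCh38 values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units) before and after a gap closure
# that joins the two longest scaffolds into one.
before = [80, 70, 50, 40, 30, 20, 10]
after = [150, 50, 40, 30, 20, 10]
print(n50(before), n50(after))
```

Joining scaffolds leaves the total unchanged but moves more bases into longer pieces, which is why closing gaps raises N50.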
  4. We can also compare GRCh38 to GRCh37 using a common annotation input set. There was a 5% increase in the number of aligned genes and a 3% increase in the number of aligned protein-coding transcripts. There was also a decrease in the numbers of both annotated partial CDS and split genes (genes that span gaps). An example of one such improvement is shown here. In blue, you can see the tiling path in GRCh37, where there is a gap. The TWIST2 gene spans this gap. In GRCh38, the gap has been closed by the addition of new sequence and there is complete representation for the gene. In this example, the added sequence was RP11 WGS, provided by Jim Knight, who has been working with Stephan Schuster and others on an RP11 WGS assembly (poster). The GRC used WGS sequence from this and several other WGS assemblies, including HuRef, CHM1_1.1 and the NA12878 ALLPATHS, to extend into or span gaps when clone-based sequence could not be found.
  5. One of the updates made in the assembly was the correction of erroneous bases. The human genome is approximately 2.85 billion bases, and the finished human reference assembly is accurate to an error rate of 1 per 100,000 bases. While this represents the highest quality mammalian genome assembly in existence today, it still means that approximately 28,000 bases are incorrect. The GRC made the correction of erroneous bases a priority for GRCh38. This slide shows the bases whose updates were considered by the GRC:
     • The largest set were ~15K SNVs with MAF=0 in the 1000G phase 1 analysis.
     • 1000G also identified ~2.5K indels with MAF=0.
     These two sets represented bases that were asserted to be incorrect in the reference assembly, as they were never seen in 1000G.
     • An additional 1413 bases with MAF<5% (but >0%) that overlap pseudogenes, processed transcripts or polymorphic pseudogenes were also considered.
     • As were ~200 base update requests from annotators and clinical labs.
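
The arithmetic behind the "approximately 28,000" figure, and the size of the candidate sets, can be checked directly. This is a small sketch using only the numbers quoted in the note and on the "GRCh38 Sequence Updates" slide; the variable and key names are illustrative.

```python
GENOME_SIZE_BASES = 2_850_000_000  # approximate genome size quoted in the note
BASES_PER_ERROR = 100_000          # quoted accuracy: 1 error per 100,000 bases


def expected_erroneous_bases(genome_size, bases_per_error):
    """Expected number of incorrect bases at the stated error rate."""
    return genome_size / bases_per_error


# Candidate counts as listed on the slide (annotator/clinical requests are
# only given approximately there, so they are left out of the sum).
CANDIDATES = {
    "snv_maf0": 15_244,
    "insertion_maf0": 834,
    "deletion_maf0": 1_541,
    "maf_lt5_pseudogene_or_processed_transcript": 1_413,
}

expected = expected_erroneous_bases(GENOME_SIZE_BASES, BASES_PER_ERROR)
indels = CANDIDATES["insertion_maf0"] + CANDIDATES["deletion_maf0"]
total_candidates = sum(CANDIDATES.values())

print(f"expected erroneous bases: {expected:,.0f}")
print(f"MAF=0 indels: {indels:,}")
print(f"candidate bases (excluding requests): {total_candidates:,}")
```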
  6. Before attempting any of the updates, the GRC did some analysis to determine whether the bases with MAF=0 were sequencing errors or unrecognized variants. To do this, we performed a read pile-up analysis for a subset of these bases for which we had WGS data from the same genome as the reference assembly sequence. These were bases in RP11 BAC clones, which make up 70% of the reference assembly. The RP11 WGS sequence used in this analysis was generated at WashU. The first graph shows the results of the pile-up analysis for the SNVs (the X axis is chromosomes):
     • Purple: proportion of “never seen” bases that are heterozygous in RP11 (hetalt: not errors)
     • Red: proportion of “never seen” bases that are not seen in RP11 (hmalt: genuine errors)
     Across all chromosomes, 79% of “never seen” SNVs are heterozygous in RP11 WGS, indicative of unrecognized variation rather than sequencing error. The GRC did not update the heterozygous RP11 bases.
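
The decision made at each site in such a pile-up analysis can be sketched as follows. This is an illustrative simplification, not the GRC's pipeline: the function name, the 20% allele-fraction threshold, and the "ref_only" and "no_coverage" labels are assumptions; only "hetalt" and "hmalt" come from the note.

```python
from collections import Counter


def classify_site(ref_base, pileup_bases, min_fraction=0.2):
    """Classify a reference base from the read bases piled up over it.

    ref_base:      the base present in the reference assembly
    pileup_bases:  string of read bases observed at that position
    min_fraction:  hypothetical threshold for calling an allele present
    """
    counts = Counter(b.upper() for b in pileup_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return "no_coverage"
    ref_fraction = counts[ref_base.upper()] / depth
    if ref_fraction < min_fraction:
        # Reference base essentially absent from the donor's own reads:
        # consistent with a genuine error in the assembly.
        return "hmalt"
    if ref_fraction > 1 - min_fraction:
        # Only the reference base is seen.
        return "ref_only"
    # Both the reference base and another allele are well supported:
    # consistent with a heterozygous site, i.e. unrecognized variation.
    return "hetalt"


print(classify_site("A", "AAAAGGGG"))
print(classify_site("A", "GGGGGGGG"))
```

A real analysis would also weigh base and mapping qualities and total depth before trusting a call.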
  7. Ultimately, the GRC attempted to update 9359 bases. Of these, we succeeded in updating 8128 sites (86.8%) with mini-contigs we built from WGS reads from 1000G samples or the RP11 genome. The reads were assembled into the mini-contigs with cortex_con and differ from the reference only at the selected base. These were all submitted to GenBank. The Ensembl VEP found 8188 variants associated with the sites updated by mini-contigs. Most updates are not in coding sequence. Among those variants with coding consequences, most are missense or synonymous, consistent with most of the updates being SNVs. Consequences of note include:
     • 15 genes that had an internal stop codon in GRCh37 are now coding
     • 78 genes had a frameshift relative to GRCh37 that restored gene function
     • 2 genes that were coding in GRCh37 are now non-coding, but do represent the more common allele (CASP12/PRM3)
  8. The first new feature of GRCh38 I want to mention is the centromeres. Until now, centromeres have been represented in the reference assembly by very large gaps. This is unfortunate, because centromeres play important roles in biology. Contrary to popular belief, centromeres aren’t difficult to sequence. In fact, there are large datasets of centromere sequence out there that are just waiting for a reference so that they can be analyzed. The challenge has been their assembly, which is complicated by their highly repetitive nature. As illustrated here, centromeres are composed largely of tandemly repeated alpha-satellite sequences that exhibit a wide range of variation. These short repeats are organized into longer higher order arrays that are highly identical. Because the centromeres are so long, they are difficult to assemble with even the longest read technology.
  9. Centromeric sequence assembly is further complicated by the fact that these higher order arrays can vary between individuals and vary between homologous chromosomes in the same individual.
  10. The GRC was fortunate to be contacted by Karen Miga, a postdoc in Jim Kent’s lab, who was developing an approach for generating modeled centromere sequences. All of the work I’m going to talk about was done by Karen and will soon be published in Genome Research. In short, Karen created a database of centromeric WGS reads from the HuRef genome. She determined the chromosome-specific higher order array structures and then built statistical linear models that could be used in the reference assembly, where they will serve as targets for read mapping. This next slide just shows a schematized version of graph-based representations for each of the chromosome-specific higher order arrays.
  11. In these graphs, the nodes represent identical monomers and the edges represent the likelihood of their adjacency in the array. Karen used a hidden Markov-based tool called LinearSat to build statistically based linear models from these graphs. It’s important to understand that each model represents the variants and monomer ordering in proportion to that observed in the initial read database, but the long-range ordering of the repeats represents only an inferred sequence. Karen further used mate pair mapping to identify euchromatic WGS sequences from the HuRef assembly that are associated with the arrays. Like the repeats, the long-range ordering of these euchromatic contigs in the models is also an inference. Users can find the coordinates of the centromere sequences in a table on the GRC website.
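To make the graph-to-linear-model idea concrete, here is a minimal sketch. It is not LinearSat and does not reproduce Karen's method: it assumes a plain first-order Markov walk over a toy graph, and the monomer names, edge weights, and the `sample_linear_model` helper are all hypothetical.

```python
import random

# Hypothetical monomer adjacency graph (an assumption for illustration):
# graph[a][b] is the relative frequency with which monomer b follows
# monomer a in the read database.
graph = {
    "m1": {"m2": 0.9, "m3": 0.1},
    "m2": {"m3": 1.0},
    "m3": {"m1": 0.8, "m2": 0.2},
}

def sample_linear_model(graph, start, length, seed=0):
    """Walk the graph, picking each next monomer in proportion to its edge weight."""
    rng = random.Random(seed)  # fixed seed so the walk is reproducible
    path = [start]
    while len(path) < length:
        successors = graph[path[-1]]
        names = list(successors)
        weights = [successors[name] for name in names]
        path.append(rng.choices(names, weights=weights, k=1)[0])
    return path

model = sample_linear_model(graph, "m1", 12)
print(" ".join(model))
```

The local adjacencies in the output follow the observed frequencies, but the long-range order is only one of many possible walks, which is the sense in which the ordering in the models is an inference.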
  12. In addition to adding centromere sequence, the GRC has focused on adding human-specific sequences to the reference assembly. An example of this is the SRGAP2 gene family, which is involved in cortical development. The ancestral 1q32 gene has been duplicated in humans to 1p21 and 1q21. Work from Evan Eichler’s lab found that not only were the 3 SRGAP2 human paralogs incompletely sequenced in GRCh37, but that allelic and paralogous sequences had been mixed in the assembly. 1q21 was the worst of these misassemblies, containing multiple haplotypes due to the highly duplicated nature of the region. Only by use of a single haplotype hydatidiform mole resource was it possible to disambiguate the correct paths at each locus. These updated paths were originally released as fix patches to GRCh37 and are now incorporated in the GRCh38 chromosomes. This panel shows the GRCh37.p13-GRCh38 assembly-assembly alignments in the 1q21 region. The alignment of the GRCh37 chromosome sequence is highly fragmented, indicative of the large changes that were made. Also aligning to this region of GRCh38 is a GRCh37 chr. 1 unlocalized scaffold. This scaffold contained the HYDIN2 gene.
  13. HYDIN2 represents another human-specific gene duplication, also involved in neuronal phenotypes. The human genome contains two HYDIN loci: HYDIN on chr. 16, and HYDIN2 on chr. 1. The HYDIN2 locus was absent from previous assembly versions, was an unlocalized scaffold in GRCh37, and is placed in GRCh38. This slide shows the alignment of the HYDIN2 and HYDIN genes from the CHM1 genome assembly (TINA POSTER) to the chr. 16 HYDIN locus in the GRCh37 assembly. The HYDIN2 alignment reflects paralogous sequence differences, while the HYDIN alignment reflects allelic differences. The alignments show that the 2 loci are highly similar, explaining why it was so difficult to disambiguate the two genes. In fact, the sequences are so similar that in NCBI34, sequences from the two genes were mixed at the same locus. The high degree of similarity has complicated variation analysis of these two paralogous genes. The absence of the chr. 1 paralog in previous assembly versions has likely led to erroneous variant calling at the chr. 16 locus. Zooming in, we see a paralogous sequence variant in HYDIN2 that occurs at the position of an annotated SNP in HYDIN. Now that HYDIN2 is present in GRCh38, we can begin to address issues such as this.
  14. Another set of sequences that the GRC was interested in capturing for GRCh38 was the 1000G decoy sequence. This was a 35 Mb collection of sequences that were not represented in the GRCh37 primary assembly. They were included in the 1000G phase 2 alignment target set as a read trap, as analyses showed they improved variation calling. The decoy sequences had an average repeat content of ~80%. In order to assess decoy capture in GRCh38, we looked at reads from two 1000G samples that previously aligned only to the decoy. Depending on the sample, we find that 70-75% of such reads now align to the GRCh38 primary assembly. An additional 1% of reads are captured when the full assembly is used as a substrate and the alt loci are present. Thus, while not fully representing the decoy, GRCh38 does include a significant portion of this important sequence and is therefore a better alignment target than GRCh37. We continue to pursue the capture of the remaining decoy, much of which is highly repetitive, in a meaningful way in the reference assembly.
  15. This brings me to the alternate loci, which are now present in greater number and locations than ever. In the original reference assembly model, there was no good way to handle variant genomic regions. Frequently, sequences from multiple haplotypes were inserted and confounded assembly, leading to artificial gaps. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: these are the alternate loci. They allow the reference assembly to contain alternate representations for regions where a single sequence path is considered insufficient, while retaining the linear chromosome models that most users are comfortable with. The corollary of this statement is that the reference assembly may represent >1 allele at a locus.
  16. So, why is it important to use the alternate loci? One simple reason is gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide. This image shows an alternate locus scaffold from chromosome 22. The grey bar is the assembly component, the green bars are genes, and the alignment is below. You can see several genes annotated in the region of the alt that has no alignment to the chromosome. Thus, if you’re not using the entire assembly in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
  17. Alternate loci also have implications for genome interpretation. In this example, we’re looking at structural variation in the APOBEC locus on chr. 22. There is a deletion variant that results in the fusion of the APOBEC3A and 3B genes. The deletion allele is prevalent in Asian and South American populations. GRCh38 contains the deletion allele on an alt locus scaffold. This is a common polymorphism for which the alt contains the predominant allele for certain populations. This image shows reads from two Asian 1000G samples that align in the APOBEC intergenic region in GRCh37, displayed in the NCBI 1000G browser. Because the samples are heterozygous but are aligned to the primary assembly, which has only the insertion variant, the alignments are complicated. You can see that different methods give different results. Use of the full assembly, an alignment substrate that includes both variants, would likely improve the interpretation of the data.
  18. We’ve been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll talk today about a study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to GRCh37.p9 patches or alternate loci. We asked what happened to them when aligned to GRCh37 primary assembly+MT, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters have an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets.
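The bookkeeping behind a graph like this can be sketched in a few lines. This is an illustration only, not the pipeline used in the study: the read records, sequence names, and the `classify` helper are hypothetical, and a real analysis would parse aligner output rather than use a hand-written list.

```python
from collections import Counter

def classify(read_origin, alignment):
    """Classify one simulated read.

    read_origin: name of the sequence the read was simulated from.
    alignment:   name of the sequence the aligner placed it on, or None.
    """
    if alignment is None:
        return "unaligned"
    return "on-target" if alignment == read_origin else "off-target"

# Hypothetical records: (read id, true source, aligned target or None).
# The true sources (alts/patches) are absent from the target set, so any
# alignment that is reported here is necessarily off-target.
reads = [
    ("r1", "alt_A", "chr22"),
    ("r2", "alt_A", None),
    ("r3", "patch_B", "chr1"),
    ("r4", "patch_B", "chr7"),
]

counts = Counter(classify(origin, target) for _, origin, target in reads)
fractions = {label: n / len(reads) for label, n in counts.items()}
print(fractions)
```

Adding the alts and patches to the target set is what makes the "on-target" outcome possible at all for these reads.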
  19. That being said, most commonly used short read aligners can’t currently handle the allelic duplication introduced into the assembly by non-unique sequences in alt loci. Mapping scores for reads aligning to both the alt and the corresponding chromosome region are depressed, and those reads are excluded from analysis. As a result, new alternate-aware tools that understand the relationship of the alt to the chromosome and don’t depress scores are needed in order for users to take advantage of the full reference assembly. Some aligners, such as iBWA and srprism, can now do this, but other aspects of variant calling tool chains still need to be updated to address this issue of allelic duplication. In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this NOVEL patch, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
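As a rough illustration of the masking idea, the sketch below hard-masks the parts of an alt that duplicate the chromosome and leaves only the novel insertion visible to an aligner. The format of the GRC's actual mask is not described here, so replacing bases with `N`, the `mask_alt` helper, the toy sequence, and the interval list are all assumptions.

```python
def mask_alt(alt_seq, identical_blocks):
    """Replace bases of alt_seq that duplicate the chromosome with 'N'.

    identical_blocks: list of (start, end) half-open, 0-based intervals on
    the alt that align identically to the chromosome (assumed to be known
    from the alt-to-chromosome alignment).
    """
    masked = list(alt_seq)
    for start, end in identical_blocks:
        masked[start:end] = "N" * (end - start)
    return "".join(masked)

# Toy alt scaffold: flanks identical to the chromosome, 6 bp insertion in the middle.
alt = "ACGTACGTTTGGCCAACGTA"
blocks = [(0, 8), (14, 20)]

masked = mask_alt(alt, blocks)
print(masked)
```

With the flanks hidden, a read from the duplicated sequence has only the chromosome copy to align to, so its mapping score is not depressed by a second identical target.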
  20. We have looked at the effect of masking on BWA alignments and compared results to those obtained with use of the alternate-aware aligner, srprism. In this analysis, simulated reads were aligned to GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt-aware aligner, does not need a mask to prevent ambiguous mappings. We’ll be following up this analysis on GRCh38, but I hope that even this preliminary data makes the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
  21. On that note, I’d like to wrap things up. I’d like to think I’ve convinced you that: it was time for an update; the reference has improved; and updates and new features will make the reference a better substrate for analysis. For those of you ready to make the switch, I’d like to plug the NCBI remapping service, which uses assembly-assembly alignments to remap features from one assembly to another. This tool can be used for mapping between GRCh37 and GRCh38. It is available as a web interface, as well as a Perl script API. While you may not be as excited by the new assembly as these folks are with their socks, it’s a far cry from a lump of coal.