SlideShare une entreprise Scribd logo
1  sur  22
Télécharger pour lire hors ligne
De novo assembly and
genotyping of variants using
colored de Bruijn graphs
Iqbal et al. 2012
Kolmogorov Mikhail 2013
Challenges
• Detecting genetic variants that are highly
divergent from a reference
• Detecting genetic variants between samples
when there is no available reference sequence
2 / 22
Mapping-based approaches limitations
● Sample may contain sequence absent or
divergent from the reference
● Reference sequences, particularly of higher
eukaryotes are incomplete, notably in telomeric
and pericentromeric regions
● Some samples may have no available reference
sequence
3 / 22
Cortex!
● De novo assmebler capable of assembling
multiple eukaryote genomes simultaneously
● Focus on detecting and characterising genetic
variation
● Alignment free
● Accuracy of variant calling may be improved by
adding reference sequence
4 / 22
Colored graph
● De Brujin graph with colored nodes and edges
● Different colors may reflect HTS data from
multiple samples, experiments, reference
sequences, known variant sequences
5 / 22
Variant calling aproaches
●Bubble calling
●Path divergence
●Multiple-sample analysis
●Genotyping
6 / 22
First approach: Bubble calling
●
Bubbles can be induced by variants, repeats or
sequencing errors
● Variants can be separated from repeats by
including reference sequence
● When multiple samples available, it`s possible to
estimate likelihood of bubble type
7 / 22
BC implementation
get_super_node(n, G):
path1 = traverse forward, until reach node with in/out degree != 1
path2 = traverse forward, until reach node with in/out degree != 1
return union(path1, path2)
bubble_caller(G):
for each node n in G:
if outdegree(n) == 2 and not_visited(n):
mark_as_visited(n)
(<n1,e1>,<n2,e2>) = get_outedges(n, G)
path1 = get_super_node(n1, G)
path2 = get_super_node(n2, G)
mark_as_visited(path1[0] .. path1[-1])
mark_as_visited(path2[0] .. path2[-1])
if path1[-1] == path2[-1] and
orientation(path1[-1] == orientation(path2[-1]):
bubble_found = true
8 / 22
Second approach: Path divergence
● Complex variants are unlikely to generate clean
bubbles
● In some cases (particularly deletions), path
complexity is restricted to reference allele
● Such cases may be identified by following reference
sequence and track places, where it diverges from
sample graph
9 / 22
PD implementation
path_devergence(L, M, ref):
for i in [0 .. len(ref)]:
s = get_supernode_in_sample_color(n) #may be null
b = get_first_position_differ(ref,s)
if s != null and b > 0 and
s[b–L–1+j] == n[i+j] for j = [0 .. L-1]:
for j = [i+L+1 .. i+M]:
if s[len(s) – L+k] == ref[j+k] for k = [0 .. L-1]:
variant_found = true
break
10 / 22
Third approach: Multiple-sample analysis
●The joint analysis of HTS data from multiple samples
can improve the accuracy and false discovery rate of
variant detection substantially
●Maintaining separate colors for each sample gives an
additional information about whether a bubble is likely to
be induced by repeats or errors
●Approach can be used when there is no suitable
reference
11 / 22
Classification of graph structures
● Given colored de Brujin graph, with each color
representing single diploid individual from single
population
● Need to classify pair of paths (for example,
bubble branches) as either alleles of variant,
repeats or errors
12 / 22
Information sources
● Total coverage of two sides
● Average allele-balance (the distribution of
coverages between the two branches)
● Variance of allele-balance across samples
13 / 22
Structure differences
●Repeat: normal coverage, intermediate balance,
no variance between samples.
●Variance: normal coverage, but each individual
has a 0, 1 or 2 copy of allele 1 (and 2, 1 or 0 of
allele 2).
●Error: low coverage on error branch
14 / 22
Models
●Coverage modelles as over-dispersed Poisson
distribution
●Variant: Binominal distribution for allele-balance
●Repeat / Error: Beta distribution for allele-balance
●A site is classified as Variant / Repeat / Error if the log-
likelihood of the data under that model is at least 10
greater than the other models (log10
Bayess Factor at
least 10)
15 / 22
Fourth approach: Genotyping
●Colored de Bruijn graphs can be used to genotype samples
at known loci even when coverage is insufficient to enable
variant assembly
●Graph is constructed of the reference sequence, known
allelic variants and data from the sample
●The likelihood of possible genotype is calculated accounting
for the graph structure
16 / 22
Genotyping algorithm
●We have a graph with one color for each known allele,
one color for reference genome (with site in question
named X) and one color for sample
●For any pair of alleles γ1
, γ2
we define a likelihood
function
17 / 22
Likelihood function
●Ignore all segments of alleles, which are present in reference
minus X
●Decompose both alleles into sections, that are shared (si
) and
uniqe (ui
) and count nuber of reads in each
●Count number of nodes n in both sample graph and joint graph of
all known alleles except γ1
and γ2
— these are sequencing errors
●For each section of length li probability of ri
reads arriwing within
the section is given by a Poisson distribution with rate θi
= λli
on
shared segments and θi
/ 2 on unique
18 / 22
Colored graph implementation
● Graph encoded impicitly, within a hash table
● Kmers and their reverse complements are stored in
a same node
● Hash table value is array of integers, representing
coverage for each color
● For each kmer, one binary flag is used for each
nucleotide to track, if corresponding edge is present
● Memory usage per node: 8(k / 32) + 5c + 1 bytes
19 / 22
Use cases
● Variant calling in a high coverage human genome
● Detection of novel sequence from population
graphs of low-coverage samples
● Using population information to classify bubble
structures
● Genotyping simple and complex variants
20 / 22
Limitations
● Not using paired reads information
● Greater need for error correction, as kmer size
increases
● Potential of a graph explosion, as more
individuals are included in the graph
21 / 22
Thanks for attention
22 / 22

Contenu connexe

Plus de BioinformaticsInstitute

Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
BioinformaticsInstitute
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
BioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
BioinformaticsInstitute
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
BioinformaticsInstitute
 

Plus de BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Dernier (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

Slides5

  • 1. De novo assembly and genotyping of variants using colored de Bruijn graphs Iqbal et al. 2012 Kolmogorov Mikhail 2013
  • 2. Challenges • Detecting genetic variants that are highly divergent from a reference • Detecting genetic variants between samples when there is no available reference sequence 2 / 22
  • 3. Mapping-based approaches limitations ● Sample may contain sequence absent or divergent from the reference ● Reference sequences, particularly of higher eukaryotes are incomplete, notably in telomeric and pericentromeric regions ● Some samples may have no available reference sequence 3 / 22
  • 4. Cortex! ● De novo assmebler capable of assembling multiple eukaryote genomes simultaneously ● Focus on detecting and characterising genetic variation ● Alignment free ● Accuracy of variant calling may be improved by adding reference sequence 4 / 22
  • 5. Colored graph ● De Brujin graph with colored nodes and edges ● Different colors may reflect HTS data from multiple samples, experiments, reference sequences, known variant sequences 5 / 22
  • 6. Variant calling aproaches ●Bubble calling ●Path divergence ●Multiple-sample analysis ●Genotyping 6 / 22
  • 7. First approach: Bubble calling ● Bubbles can be induced by variants, repeats or sequencing errors ● Variants can be separated from repeats by including reference sequence ● When multiple samples available, it`s possible to estimate likelihood of bubble type 7 / 22
  • 8. BC implementation get_super_node(n, G): path1 = traverse forward, until reach node with in/out degree != 1 path2 = traverse forward, until reach node with in/out degree != 1 return union(path1, path2) bubble_caller(G): for each node n in G: if outdegree(n) == 2 and not_visited(n): mark_as_visited(n) (<n1,e1>,<n2,e2>) = get_outedges(n, G) path1 = get_super_node(n1, G) path2 = get_super_node(n2, G) mark_as_visited(path1[0] .. path1[-1]) mark_as_visited(path2[0] .. path2[-1]) if path1[-1] == path2[-1] and orientation(path1[-1] == orientation(path2[-1]): bubble_found = true 8 / 22
  • 9. Second approach: Path divergence ● Complex variants are unlikely to generate clean bubbles ● In some cases (particularly deletions), path complexity is restricted to reference allele ● Such cases may be identified by following reference sequence and track places, where it diverges from sample graph 9 / 22
  • 10. PD implementation path_devergence(L, M, ref): for i in [0 .. len(ref)]: s = get_supernode_in_sample_color(n) #may be null b = get_first_position_differ(ref,s) if s != null and b > 0 and s[b–L–1+j] == n[i+j] for j = [0 .. L-1]: for j = [i+L+1 .. i+M]: if s[len(s) – L+k] == ref[j+k] for k = [0 .. L-1]: variant_found = true break 10 / 22
  • 11. Third approach: Multiple-sample analysis ●The joint analysis of HTS data from multiple samples can improve the accuracy and false discovery rate of variant detection substantially ●Maintaining separate colors for each sample gives an additional information about whether a bubble is likely to be induced by repeats or errors ●Approach can be used when there is no suitable reference 11 / 22
  • 12. Classification of graph structures ● Given colored de Brujin graph, with each color representing single diploid individual from single population ● Need to classify pair of paths (for example, bubble branches) as either alleles of variant, repeats or errors 12 / 22
  • 13. Information sources ● Total coverage of two sides ● Average allele-balance (the distribution of coverages between the two branches) ● Variance of allele-balance across samples 13 / 22
  • 14. Structure differences ●Repeat: normal coverage, intermediate balance, no variance between samples. ●Variance: normal coverage, but each individual has a 0, 1 or 2 copy of allele 1 (and 2, 1 or 0 of allele 2). ●Error: low coverage on error branch 14 / 22
  • 15. Models ●Coverage modelles as over-dispersed Poisson distribution ●Variant: Binominal distribution for allele-balance ●Repeat / Error: Beta distribution for allele-balance ●A site is classified as Variant / Repeat / Error if the log- likelihood of the data under that model is at least 10 greater than the other models (log10 Bayess Factor at least 10) 15 / 22
  • 16. Fourth approach: Genotyping ●Colored de Bruijn graphs can be used to genotype samples at known loci even when coverage is insufficient to enable variant assembly ●Graph is constructed of the reference sequence, known allelic variants and data from the sample ●The likelihood of possible genotype is calculated accounting for the graph structure 16 / 22
  • 17. Genotyping algorithm ●We have a graph with one color for each known allele, one color for reference genome (with site in question named X) and one color for sample ●For any pair of alleles γ1 , γ2 we define a likelihood function 17 / 22
  • 18. Likelihood function ●Ignore all segments of alleles, which are present in reference minus X ●Decompose both alleles into sections, that are shared (si ) and uniqe (ui ) and count nuber of reads in each ●Count number of nodes n in both sample graph and joint graph of all known alleles except γ1 and γ2 — these are sequencing errors ●For each section of length li probability of ri reads arriwing within the section is given by a Poisson distribution with rate θi = λli on shared segments and θi / 2 on unique 18 / 22
  • 19. Colored graph implementation ● Graph encoded impicitly, within a hash table ● Kmers and their reverse complements are stored in a same node ● Hash table value is array of integers, representing coverage for each color ● For each kmer, one binary flag is used for each nucleotide to track, if corresponding edge is present ● Memory usage per node: 8(k / 32) + 5c + 1 bytes 19 / 22
  • 20. Use cases ● Variant calling in a high coverage human genome ● Detection of novel sequence from population graphs of low-coverage samples ● Using population information to classify bubble structures ● Genotyping simple and complex variants 20 / 22
  • 21. Limitations ● Not using paired reads information ● Greater need for error correction, as kmer size increases ● Potential of a graph explosion, as more individuals are included in the graph 21 / 22