SlideShare une entreprise Scribd logo
1  sur  12
Télécharger pour lire hors ligne
A probabilistic approach to k-mer counting

                                                   Qingpeng Zhang

                                 Department of Computer Science and Engineering
                                           Michigan State University
                                         East Lansing, Michigan, USA
                                                    qingpeng@msu.edu

                                                     July 13, 2012




Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   1 / 12
What is k-mer counting?




Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   2 / 12
What is our k-mer counting approach?

                                                                                  The Bloom counting hash
                                                                                  consists of one or more
                                                                                  hash tables of different
                                                                                  size
                                                                                  Each entry in the hash
                                                                                  tables is a counter
                                                                                  representing the number
                                                                                  of k-mers that hash to
                                                                                  that location
                                                                                        Bloom filter(0/1) or
                                                                                        Count-min
                                                                                        Sketch(counting)
                                                                                  The hash function is to
                                                                                  take the modulus of a
                                                                                  number representing the
                                                                                  k-mer with the table size.

Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting              July 2012   3 / 12
What is our k-mer counting approach?




           With certain counting false positive rate1 as tradeoff because of collision
           Probabilistic properties well suited to next generation sequencing datasets
           Highly scalable: Counting accuracy is related to memory usage. However
           our approach will never break an imposed memory bound.




      1
        counting false positive rate: the possibility that the number of counts will
   be incorrect (off by 1 or more)
Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   4 / 12
How does our k-mer counting approach perform?
   How many k-mers have incorrect count? - counting error rate


                                                                      N: number of unique kmers; Z:
                                                                      number of hash tables; H: size
                                                                      of hash tables
                                                                      The probability that no collisions
                                                                      happened in a specific entry in
                                                                      one hash table is
                                                                      (1 − 1/H)N ,which is e −N/H .
                                                                      The individual collision rate in
                                                                      one hash table is 1 − e −N/H .
                  Example: N=915898,
                  Z=4, H=400000,                                      The counting error rate f , which
                                  −N/H Z                              is the probability that collision
                  f = (1 − e              ) =
                                                                      happened in all the locations
                  0.6523
                                                                      where a k-mer is hashed to in all
                  observed counting                                   Z hash tables, will be
                  error rate f : 0.6566                               (1 − e −N/H )Z

Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting           July 2012   5 / 12
How does our k-mer counting approach perform?
   Ok, some counts are incorrect. However, how ”incorrect”?




                                                                             factors to influence miscount:
                                                                                        number of total k-mers
                                                                                        hash table size




Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting                July 2012   6 / 12
How does our k-mer counting approach perform?
   Time Usage




                            Figure: Time usage of khmer counting approach

Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   7 / 12
How does our k-mer counting approach perform?
   Memory Usage




                       Figure: Memory usage of different k-mer counting tools

Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   8 / 12
How does our k-mer counting approach perform?
   Disk Storage Usage




                    Figure: disk storage usage of different k-mer counting tools

Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   9 / 12
What is the application of our approach?
   Filtering out reads with low-abundance k-mers for de novo assembly




                     Figure: Percentage of ”bad” reads in the remaining reads

   Iterating filtering out low-abundance reads(”bad” reads) that contain even a
   single unique k-mer with hash tables with different sizes(1e8 and 1e9) for a
   human gut microbiome metagenomic dataset(MH0001, 42,458,402 reads)
Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   10 / 12
Summary


           a simple probabilistic approach for fast and memory efficient counting of
           k-mers
                arbitrary-length k-mers
                arbitrary-size sequence data set
                with a tradeoff of counting error
           other possible applications
                digital normalization
                repeat detection
                diversity analysis of metagenomic sample.
                ...
           The khmer software package is written in C++ and Python, available at
           https://github.com/ged-lab/khmer




Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   11 / 12
Acknowledgement




           Jason Pell, Rose Canino-Koning, Adina Chuang Howe
           Dr. C. Titus Brown
           GED lab members@ Michigan State University
           Funding from USDA, DOE, MSU, BEACON, iCER
           Thanks!




Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting   July 2012   12 / 12

Contenu connexe

Plus de Jan Aerts

Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data AnalysisJan Aerts
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualizationJan Aerts
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsJan Aerts
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...Jan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumJan Aerts
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisJan Aerts
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...Jan Aerts
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...Jan Aerts
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...Jan Aerts
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...Jan Aerts
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsJan Aerts
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 
B Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoB Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoJan Aerts
 
D Baker - Galaxy Update
D Baker - Galaxy UpdateD Baker - Galaxy Update
D Baker - Galaxy UpdateJan Aerts
 
M Reich - GenomeSpace
M Reich - GenomeSpaceM Reich - GenomeSpace
M Reich - GenomeSpaceJan Aerts
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...Jan Aerts
 

Plus de Jan Aerts (20)

Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualization
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformatics
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing Consortium
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysis
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining components
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 
B Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoB Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUno
 
D Baker - Galaxy Update
D Baker - Galaxy UpdateD Baker - Galaxy Update
D Baker - Galaxy Update
 
M Reich - GenomeSpace
M Reich - GenomeSpaceM Reich - GenomeSpace
M Reich - GenomeSpace
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
 

Dernier

social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 

Dernier (20)

social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 

Zhang Q - A probabilistic approach to k-mer counting

  • 1. A probabilistic approach to k-mer counting Qingpeng Zhang Department of Computer Science and Engineering Michigan State University East Lansing, Michigan, USA qingpeng@msu.edu July 13, 2012 Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12
  • 2. What is k-mer counting? Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 2 / 12
  • 3. What is our k-mer counting approach? The Bloom counting hash consists of one or more hash tables of different size Each entry in the hash tables is a counter representing the number of k-mers that hash to that location Bloom filter(0/1) or Count-min Sketch(counting) The hash function is to take the modulus of a number representing the k-mer with the table size. Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 3 / 12
  • 4. What is our k-mer counting approach? With certain counting false positive rate1 as tradeoff because of collision Probabilistic properties well suited to next generation sequencing datasets Highly scalable: Counting accuracy is related to memory usage. However our approach will never break an imposed memory bound. 1 counting false positive rate: the possibility that the number of counts will be incorrect (off by 1 or more) Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 4 / 12
  • 5. How does our k-mer counting approach perform? How many k-mers have incorrect count? - counting error rate N: number of unique kmers; Z: number of hash tables; H: size of hash tables The probability that no collisions happened in a specific entry in one hash table is (1 − 1/H)N ,which is e −N/H . The individual collision rate in one hash table is 1 − e −N/H . Example: N=915898, Z=4, H=400000, The counting error rate f , which −N/H Z is the probability that collision f = (1 − e ) = happened in all the locations 0.6523 where a k-mer is hashed to in all observed counting Z hash tables, will be error rate f : 0.6566 (1 − e −N/H )Z Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 5 / 12
  • 6. How does our k-mer counting approach perform? Ok, some counts are incorrect. However, how ”incorrect”? factors to influence miscount: number of total k-mers hash table size Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 6 / 12
  • 7. How does our k-mer counting approach perform? Time Usage Figure: Time usage of khmer counting approach Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 7 / 12
  • 8. How does our k-mer counting approach perform? Memory Usage Figure: Memory usage of different k-mer counting tools Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 8 / 12
  • 9. How does our k-mer counting approach perform? Disk Storage Usage Figure: disk storage usage of different k-mer counting tools Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 9 / 12
  • 10. What is the application of our approach? Filtering out reads with low-abundance k-mers for de novo assembly Figure: Percentage of ”bad” reads in the remaining reads Iterating filtering out low-abundance reads(”bad” reads) that contain even a single unique k-mer with hash tables with different sizes(1e8 and 1e9) for a human gut microbiome metagenomic dataset(MH0001, 42,458,402 reads) Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 10 / 12
  • 11. Summary a simple probabilistic approach for fast and memory efficient counting of k-mers arbitrary-length k-mers arbitrary-size sequence data set with a tradeoff of counting error other possible applications digital normalization repeat detection diversity analysis of metagenomic sample. ... The khmer software package is written in C++ and Python, available at https://github.com/ged-lab/khmer Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 11 / 12
  • 12. Acknowledgement Jason Pell, Rose Canino-Koning, Adina Chuang Howe Dr. C. Titus Brown GED lab members@ Michigan State University Funding from USDA, DOE, MSU, BEACON, iCER Thanks! Qingpeng Zhang (MIchigan State University) A probabilistic approach to k-mer counting July 2012 12 / 12