Zhang Q - A probabilistic approach to k-mer counting
1. A probabilistic approach to k-mer counting
Qingpeng Zhang
Department of Computer Science and Engineering
Michigan State University
East Lansing, Michigan, USA
qingpeng@msu.edu
July 13, 2012
Qingpeng Zhang (Michigan State University) A probabilistic approach to k-mer counting July 2012 1 / 12
2. What is k-mer counting?
3. What is our k-mer counting approach?
The Bloom counting hash consists of one or more hash tables of different sizes.
Each entry in the hash tables is a counter representing the number of k-mers that hash to that location.
Bloom filter (0/1) or Count-min Sketch (counting)
The hash function takes the modulus of a number representing the k-mer with the table size.
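The structure described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the khmer implementation: the class name `CountingHash`, the helper `kmer_to_int`, the base-4 encoding of DNA letters, and the example table sizes are all assumptions made for the example. Reporting the minimum counter across tables follows the Count-min Sketch idea named above.

```python
# Hypothetical sketch of a counting hash; names and encoding are assumptions,
# not the khmer API.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Turn a DNA k-mer into a number by reading it as a base-4 integer."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODING[base]
    return value


class CountingHash:
    def __init__(self, table_sizes):
        # One counter array per table; the tables have different sizes.
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def add(self, kmer):
        value = kmer_to_int(kmer)
        for table, size in zip(self.tables, self.table_sizes):
            # Hash function: modulus of the k-mer's number with the table size.
            table[value % size] += 1

    def get(self, kmer):
        value = kmer_to_int(kmer)
        # Collisions can only inflate a counter, so the smallest counter
        # across tables is the best available estimate.
        return min(
            table[value % size]
            for table, size in zip(self.tables, self.table_sizes)
        )


def count_kmers(sequence, k, counter):
    """Add every overlapping k-mer of `sequence` to `counter`."""
    for i in range(len(sequence) - k + 1):
        counter.add(sequence[i:i + k])


counter = CountingHash([11, 13, 17])  # small illustrative table sizes
count_kmers("ACGTACGTAC", 4, counter)
print(counter.get("ACGT"), counter.get("TACG"))
```

Memory use is fixed by the table sizes chosen up front, whatever the number of k-mers added; only the accuracy of the counts changes as the tables fill.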
4. What is our k-mer counting approach?
Counts carry a certain counting false positive rate [1] as a tradeoff, because of collisions.
Probabilistic properties are well suited to next-generation sequencing datasets.
Highly scalable: counting accuracy depends on memory usage. However, our approach will never break an imposed memory bound.
[1] Counting false positive rate: the probability that a count will be incorrect (off by 1 or more).
5. How does our k-mer counting approach perform?
How many k-mers have an incorrect count? The counting error rate.
N: number of unique k-mers; Z: number of hash tables; H: size of hash tables
The probability that no collisions happened in a specific entry in one hash table is $(1 - 1/H)^N$, which is approximately $e^{-N/H}$.
The individual collision rate in one hash table is $1 - e^{-N/H}$.
The counting error rate $f$, which is the probability that a collision happened in all the locations where a k-mer is hashed to in all Z hash tables, will be $f = (1 - e^{-N/H})^Z$.
Example: N = 915898, Z = 4, H = 400000
Predicted counting error rate: $f = (1 - e^{-N/H})^Z = 0.6523$
Observed counting error rate: $f = 0.6566$
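As a quick check of the formula, the predicted counting error rate for the example on this slide can be evaluated directly. The function name `counting_error_rate` and its parameter names are illustrative choices, not part of khmer.

```python
import math


def counting_error_rate(n_unique_kmers, n_tables, table_size):
    """Evaluate f = (1 - e^(-N/H))^Z for N unique k-mers, Z tables of size H."""
    # Probability that a given entry in one table has been hit by some k-mer.
    collision_rate = 1.0 - math.exp(-n_unique_kmers / table_size)
    # A count is wrong only if a collision happened in every one of the Z tables.
    return collision_rate ** n_tables


f = counting_error_rate(n_unique_kmers=915898, n_tables=4, table_size=400000)
print(f"{f:.4f}")
```

With the slide's values this gives roughly 0.652, in line with the predicted rate quoted above and close to the observed 0.6566.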
6. How does our k-mer counting approach perform?
OK, some counts are incorrect. But how incorrect?
Factors that influence the miscount:
number of total k-mers
hash table size
7. How does our k-mer counting approach perform?
Time Usage
Figure: Time usage of khmer counting approach
8. How does our k-mer counting approach perform?
Memory Usage
Figure: Memory usage of different k-mer counting tools
9. How does our k-mer counting approach perform?
Disk Storage Usage
Figure: Disk storage usage of different k-mer counting tools
10. What is the application of our approach?
Filtering out reads with low-abundance k-mers for de novo assembly
Figure: Percentage of "bad" reads in the remaining reads
Iteratively filtering out low-abundance reads ("bad" reads), meaning reads that contain even a single unique k-mer, using hash tables of different sizes (1e8 and 1e9) on a human gut microbiome metagenomic dataset (MH0001, 42,458,402 reads)
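The filtering step can be illustrated with a small sketch. Two assumptions are made here and should be read as such: an exact `collections.Counter` stands in for the probabilistic counting hash so that the example stays short, and "iterating" is interpreted as repeating the pass until no further reads are dropped. The function names are hypothetical and are not khmer scripts.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k, min_count=2):
    """One pass: drop reads containing any k-mer seen fewer than min_count times."""
    counts = Counter()  # exact stand-in for the counting hash (assumption)
    for read in reads:
        counts.update(kmers(read, k))
    # Reads shorter than k have no k-mers and are kept unchanged.
    return [
        read for read in reads
        if all(counts[kmer] >= min_count for kmer in kmers(read, k))
    ]


def iterate_filtering(reads, k, min_count=2):
    """Repeat the filtering pass until the set of kept reads stops shrinking."""
    while True:
        kept = filter_reads(reads, k, min_count)
        if len(kept) == len(reads):
            return kept
        reads = kept


reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
print(iterate_filtering(reads, k=4))
```

With a probabilistic counter in place of `Counter`, collisions can only raise counts, so some "bad" reads may survive a pass; that is the effect the figure tracks for the two table sizes.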
11. Summary
A simple probabilistic approach for fast and memory-efficient counting of k-mers
arbitrary-length k-mers
arbitrary-size sequence datasets
with a tradeoff of counting error
Other possible applications
digital normalization
repeat detection
diversity analysis of metagenomic samples
...
The khmer software package is written in C++ and Python, available at
https://github.com/ged-lab/khmer
12. Acknowledgement
Jason Pell, Rose Canino-Koning, Adina Chuang Howe
Dr. C. Titus Brown
GED lab members at Michigan State University
Funding from USDA, DOE, MSU, BEACON, iCER
Thanks!