Handling ridiculous amounts of data with probabilistic data structures C. Titus Brown Michigan State University Computer Science / Microbiology
Resources http://www.slideshare.net/c.titus.brown/ Webinar: http://oreillynet.com/pub/e/1784 Source: github.com/ctb/ N-grams (this talk): khmer-ngram; DNA (the real cheese): khmer. khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)
Lincoln Stein Sequencing capacity is outscaling Moore’s Law.
Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.
Data generation vs data analysis It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.   (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)  …x1000 sequencers Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge? Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume. Note: cloud computing isn’t a solution to a sustained scaling problem!!  (See: Moore’s Law slide)
Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness
A brief intro to shotgun assembly
    It was the best of times, it was the wor
    , it was the worst of times, it was the
    isdom, it was the age of foolishness
    mes, it was the age of wisdom, it was th
    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
    …but for 2 bn fragments. Not subdivisible; not easy to distribute; memory intensive.
Define a hash function (word => num)
    def hash(word):
        assert len(word) <= MAX_K
        value = 0
        for n, ch in enumerate(word):
            value += ord(ch) * 128**n
        return value
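A small sketch of what this hash does, written for Python 3. The name `kmer_hash` is a stand-in chosen here to avoid shadowing the built-in `hash`, and the `MAX_K` assertion is left out because the slides do not give its value. The word is read as a base-128 number, so distinct words made of printable 7-bit ASCII characters get distinct integers; collisions only appear later, when the value is reduced modulo a table size.

```python
def kmer_hash(word):
    """Read the word as a base-128 number, least significant character first."""
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128 ** n
    return value


print(kmer_hash("a"))    # 97
print(kmer_hash("ab"))   # 97 + 98 * 128 = 12641
# Reducing modulo a tiny table size makes different words collide:
print(kmer_hash("a") % 2, kmer_hash("c") % 2)   # 1 1
```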
    class BloomFilter(object):
        def __init__(self, tablesizes, k=DEFAULT_K):
            self.tables = [ (size, [0] * size)
                            for size in tablesizes ]
            self.k = k

        def add(self, word):    # insert; ignore collisions
            val = hash(word)
            for size, ht in self.tables:
                ht[val % size] = 1

        def __contains__(self, word):
            val = hash(word)
            return all( ht[val % size]
                        for (size, ht) in self.tables )
Storing words in a Bloom filter
    >>> x = BloomFilter([1001, 1003, 1005])
    >>> 'oogaboog' in x
    False
    >>> x.add('oogaboog')
    >>> 'oogaboog' in x
    True
    >>> x = BloomFilter([2])    # …false positives
    >>> x.add('a')
    >>> 'a' in x
    True
    >>> 'b' in x
    False
    >>> 'c' in x
    True
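To get a feel for how often such false positives occur, here is a rough estimate in Python 3. It is a sketch under a simplifying assumption that is not in the slides: hash values are treated as landing uniformly and independently in each table. The function name `expected_fp_rate` is hypothetical. For this layout (one bit set per table per word), an absent word is reported as present only if its slot is already occupied in every table.

```python
def expected_fp_rate(n_inserted, tablesizes):
    """Approximate false positive rate after n_inserted distinct words.

    Assumes uniform, independent slot assignment in each table.
    """
    rate = 1.0
    for size in tablesizes:
        # Chance that one particular slot is occupied after n insertions.
        occupancy = 1.0 - (1.0 - 1.0 / size) ** n_inserted
        rate *= occupancy
    return rate


print(expected_fp_rate(1, [2]))               # 0.5
print(expected_fp_rate(37, [1001, 1003]))     # two tables
print(expected_fp_rate(37, [1001, 1003, 1005, 1007]))  # four tables: much lower
```

The first call matches the `BloomFilter([2])` session above: after adding `'a'`, half of all other words (here `'c'` but not `'b'`) test as present.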
Storing text in a Bloom filter
    class BloomFilter(object):
        …
        def insert_text(self, text):
            for i in range(len(text)-self.k+1):
                self.add(text[i:i+self.k])
    def next_words(bf, word):    # try all 1-ch extensions
        prefix = word[1:]
        for ch in bf.allchars:
            word = prefix + ch
            if word in bf:
                yield ch

    # descend into all successive 1-ch extensions
    def retrieve_all_sentences(bf, start):
        word = start[-bf.k:]
        n = -1
        for n, ch in enumerate(next_words(bf, word)):
            ss = retrieve_all_sentences(bf, start + ch)
            for sentence in ss:
                yield sentence
        if n < 0:
            yield start
Storing and retrieving text
    >>> x = BloomFilter([1001, 1003, 1005, 1007])
    >>> x.insert_text('foo bar bazbif zap!')
    >>> x.insert_text('the quick brown fox jumped over the lazy dog')
    >>> print retrieve_first_sentence(x, 'foo bar ')
    foo bar bazbif zap!
    >>> print retrieve_first_sentence(x, 'the quic')
    the quick brown fox jumped over the lazy dog
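The session above relies on pieces the slides do not show: `DEFAULT_K`, `bf.allchars`, and `retrieve_first_sentence`. The following Python 3 sketch fills them in so the whole pipeline can be run end to end. Everything not on the slides is an assumption made for illustration: `K = 7`, the `ALPHABET` of candidate characters, the name `kmer_hash`, the body of `retrieve_first_sentence` (taken here to mean "first path found by the depth-first walk"), and the larger table sizes, which were picked only to keep false positives unlikely in a demo.

```python
import string

K = 7  # assumed word length; the slides never state DEFAULT_K
ALPHABET = string.ascii_letters + " !,:"  # assumed candidate extension characters


def kmer_hash(word):
    """Positional hash from the slides: read the word as a base-128 number."""
    return sum(ord(ch) * 128 ** n for n, ch in enumerate(word))


class BloomFilter:
    def __init__(self, tablesizes, k=K, allchars=ALPHABET):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k
        self.allchars = allchars

    def add(self, word):
        val = kmer_hash(word)
        for size, table in self.tables:
            table[val % size] = 1

    def __contains__(self, word):
        val = kmer_hash(word)
        return all(table[val % size] for size, table in self.tables)

    def insert_text(self, text):
        # Store every k-character window of the text.
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])


def next_words(bf, word):
    """Yield each character whose extension the filter reports as present."""
    prefix = word[1:]
    for ch in bf.allchars:
        if prefix + ch in bf:
            yield ch


def retrieve_all_sentences(bf, start):
    """Depth-first walk over 1-character extensions; yield paths with no extension."""
    extended = False
    for ch in next_words(bf, start[-bf.k:]):
        extended = True
        yield from retrieve_all_sentences(bf, start + ch)
    if not extended:
        yield start


def retrieve_first_sentence(bf, start):
    """Hypothetical helper: the first complete path the walk produces."""
    return next(retrieve_all_sentences(bf, start))


bf = BloomFilter([100003, 100019, 100043])  # larger tables than on the slides
bf.insert_text("the quick brown fox jumped over the lazy dog")
print(retrieve_first_sentence(bf, "the quic"))
```

Note that the filter never stores the text itself, only one bit per table for each k-character window; the sentence is rebuilt purely by asking which one-character extensions are present.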
Sequence assembly
    >>> x = BloomFilter([1001, 1003, 1005, 1007])
    >>> x.insert_text('the quick brown fox jumped ')
    >>> x.insert_text('jumped over the lazy dog')
    >>> retrieve_first_sentence(x, 'the quic')
    the quick brown fox jumped over the lazy dog
(This is known as the de Bruijn graph approach to assembly; cf. Velvet, ABySS, SOAPdenovo)
Repetitive strings are the devil
    >>> x = BloomFilter([1001, 1003, 1005, 1007])
    >>> x.insert_text('nanana, batman!')
    >>> x.insert_text('my chemical romance: nanana')
    >>> retrieve_first_sentence(x, "my chemical")
    'my chemical romance: nanana, batman!'
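The join across the shared "nanana" is a property of the de Bruijn graph itself, not a Bloom filter false positive. A minimal Python 3 sketch with an exact (non-probabilistic) index shows the same behaviour. The names `build_successors` and `walk` and the choice `K = 7` are assumptions made for this illustration; unlike the Bloom filter, this index grows with the number of distinct k-mers.

```python
from collections import defaultdict


def build_successors(texts, k):
    """Exact index: (k-1)-character prefix -> set of characters seen after it."""
    successors = defaultdict(set)
    for text in texts:
        for i in range(len(text) - k + 1):
            kmer = text[i:i + k]
            successors[kmer[:-1]].add(kmer[-1])
    return successors


def walk(successors, start, k):
    """Extend start while the path is unambiguous.

    Stops at a branch, a dead end, or a prefix that was already visited.
    """
    path = start
    visited = set()
    while True:
        key = path[-(k - 1):]
        choices = successors.get(key, set())
        if len(choices) != 1 or key in visited:
            return path
        visited.add(key)
        path += next(iter(choices))


K = 7  # assumed; the slides do not state DEFAULT_K
graph = build_successors(["nanana, batman!", "my chemical romance: nanana"], K)
print(walk(graph, "my chemical", K))   # my chemical romance: nanana, batman!
```

Because the end of one text and the start of the other share the six characters "nanana", the graph cannot tell them apart at this word length, and the two unrelated strings are stitched together.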
Note, it’s a probabilistic data structure
Retrieval errors:
    >>> x = BloomFilter([1001, 1003])    # small Bloom filter…
    >>> x.insert_text('the quick brown fox jumped over the lazy dog')
    >>> retrieve_first_sentence(x, 'the quic'),
    ('the quick brY',)
Assembling DNA sequence
    Can’t directly assemble with Bloom filter approach (false connections, and also lacking many convenient graph properties)
    But we can use the data structure to grok graph properties and eliminate/break up data:
        Eliminate small graphs (no false negatives!)
        Disconnected partitions (parts -> map reduce)
        Local graph complexity reduction & error/artifact trimming
    …and then feed into other programs.
    This is a data reducing prefilter
Right, but does it work?? Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500). …compare with not at all on a 512 GB RAM machine. Error/repeat trimming on a tricky worm genome: reduction from 170 GB resident / 60 hrs to 54 GB resident / 13 hrs
How good is this graph representation?
    V. low false positive rates at ~2 bytes/k-mer; nearly exact human genome graph in ~5 GB.
    Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome)
    Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter)
    Did I mention it’s constant memory? And independent of word size?
    …only works for de Bruijn graphs
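The figures on this slide can be combined into a back-of-the-envelope memory estimate. The Python 3 sketch below assumes that the ~2 bytes/k-mer figure carries over unchanged to the 50 billion k-mer soil metagenome target and uses decimal gigabytes; the function name `filter_bytes` is hypothetical.

```python
GB = 10 ** 9  # decimal gigabyte, an assumption about the slide's units


def filter_bytes(n_kmers, bytes_per_kmer=2.0):
    """Memory needed if each stored k-mer costs about bytes_per_kmer."""
    return n_kmers * bytes_per_kmer


# 50 billion k-mers at ~2 bytes each:
print(filter_bytes(50 * 10 ** 9) / GB)   # 100.0

# Conversely, the number of k-mers that ~5 GB holds at the same rate:
print(5 * GB / 2.0)                      # 2500000000.0
```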
Thoughts for the future Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics. Synopsis data structures & algorithms (which incl. probabilistic data structures) are a neat approach to parsing problem structure. Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.
Groxel view of knot-like region / Arend Hintze
Acknowledgements: The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.

Pycon 2011 talk (may not be final, note)


Editor's notes

  1. Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.