BEACON 101: Sequencing tech

BEACON 101: Making use of new sequencing technologies C. Titus Brown ctb@msu.edu Comp Sci & Micro Michigan State University

Outline “Next”-generation sequencing. Dealing with the data – our research. What are they teaching kids these days, anyway?

But first, some background… The kinds of technology I’ll be talking about are being used by many BEACON groups, and will probably be used by many more within the next few years. Sequencing advances are (IMO) one of the most stunning technological breakthroughs in biology in the last 20 years. As a mid-level BEACON bureaucrat (TG leader! Course instructor!) I’m interested in: Enabling interesting science. Finding fun new problems to tackle. Developing a training & education plan so that we produce tech-savvy students and junior faculty.

In particular… At the last BEACON Congress, we had a “bioinformatics sandbox” session. Only MSU folk could attend (short notice!) About 8 labs, all using next-gen sequencing… …and 2 labs, working on methods for analyzing data. (Hi!) I know there are more people out there, on both sides of the equation. Who are you??

OK, Back to…I. Sequencing! Sequencing of DNA and RNA: single genomes; transcriptomes; natural populations (tags); environmental samples/microbial populations (metagenomics). Cheap and massively scalable sequencing of DNA and RNA.

Sequencing technology Major, dramatic changes in our ability to sequence DNA and RNA quickly and cheaply. Majority of deployed techniques depend on (variations of) a single trick: “polony” sequencing. No cloning. Single-molecule sequencing coming along fast, but not yet ready for prime time.

Two specific concepts: First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.) Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in linear density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
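The squared-scaling point can be checked with a few lines of Python. This is only a sketch: the function name, the square substrate, and the density values are illustrative assumptions, not figures for any real instrument.

```python
def features_on_substrate(features_per_mm: float, side_mm: float) -> float:
    """Count sequencing features on a square 2-D substrate.

    features_per_mm is the linear density along one axis (hypothetical value);
    the substrate is assumed to be side_mm x side_mm.
    """
    per_side = features_per_mm * side_mm
    return per_side ** 2


# Doubling the linear density (e.g. wells half as wide) on the same area...
base = features_on_substrate(100, 10)
doubled = features_on_substrate(200, 10)

# ...quadruples the number of features, and hence of sequences read.
print(doubled / base)
```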

Some numbers. For under $1,000 per sample, the Illumina HiSeq machine will generate 100,000,000 reads, each of length ~100, in under a week, x 16 samples/run. That’s 160 Gb of sequence, or just over 50x human genome…
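The arithmetic behind those numbers, as a short Python sketch. The read count, read length, and samples per run come from the slide; the human genome size of roughly 3.1 Gb is an assumption added here for the coverage estimate.

```python
READS_PER_SAMPLE = 100_000_000   # from the slide
READ_LENGTH = 100                # approximate read length in bases, from the slide
SAMPLES_PER_RUN = 16             # from the slide
HUMAN_GENOME_BP = 3_100_000_000  # assumption: ~3.1 Gb haploid human genome

bases_per_sample = READS_PER_SAMPLE * READ_LENGTH
bases_per_run = bases_per_sample * SAMPLES_PER_RUN
coverage = bases_per_run / HUMAN_GENOME_BP

print(f"{bases_per_sample / 1e9:.0f} Gb per sample")
print(f"{bases_per_run / 1e9:.0f} Gb per run")
print(f"~{coverage:.0f}x human genome per run")
```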

How do you choose a sequencing approach? Choose one: Long reads (low sampling, but easier to work with) Deep random sampling (quantitative sequencing, quite sensitive) The answer will depend on what exactly you want to do. Generally I prefer the shorter reads. Find someone who pays obsessive attention to this stuff. (Hi!)

Data analysis! In general, it now takes longer to analyze the data than it does to generate the data. That is, suppose you already know exactly what to do and simply want to run your analysis. By and large, you can generate a large enough amount of data in one week that you cannot run the analysis of it in the following week. …this is steadily shifting towards the “more data” side, too. (This is really a paradigm shift for many areas of biology.)

Your basic data file.
>895:5:1:1276:16683/1
GTCGCTTTGCGATGTTTGTCGGGTGCATCTTTTGGGAACAGCAAGTTTTGGAATGATCCCTGCACTTTCATCGGAACACC
>895:5:1:1558:16140/2
CCGTTCCAGAGATATGACCCGTTTTAATGAACGCTGCCAGTTGACAAATTATTTTCCAAAATTAGCAATTGCGTGGGTTCTTTTCCATCTAAACAGCTTCTGGGCTTTATGCTG
>895:5:1:1581:10052/1
TTACAGACGTCGTTCTAACTAATTTGTGACGAAAATTGCCCACAATTATGACTATATGTGGAATTTTG
>895:5:1:1824:4518/2
CCAAATTAGTTAGAATGACGTTTGTAACCGTATTCCGGTGCAACTTTGTGAATAATTTCTAACTGTAAAAATTTTTGGCAAAACCAAGTTTGCCGGCCGCAACCGCAAC
>895:5:1:1945:14960/1
CTGATTTTGCAATGTTACTGACATGGGTATGCCAGTTGTGATTATTGGCGACTGCAACTCCCAACAATGATACTGTTTACTTTTGTGTGAATGAACATTTATTCATCCTTGGGT
…
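A file like this (FASTA format: a ">" header line followed by sequence) is simple to read. Below is a minimal Python sketch of a parser; `parse_fasta` is a hypothetical helper written for illustration, not a function from any particular library, and it assumes well-formed input.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines.

    Sequence lines belonging to one record are concatenated, so records
    wrapped across several lines are handled too.
    """
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]  # drop the leading ">"
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# One record copied from the example above.
example = [
    ">895:5:1:1581:10052/1",
    "TTACAGACGTCGTTCTAACTAATTTGTGACGAAAATTGCCCACAATTATGACTATATGTGGAATTTTG",
]

for read_name, sequence in parse_fasta(example):
    print(read_name, len(sequence))
```

For a real file, passing an open file handle works the same way, since the function only iterates over lines.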

Mapping (figure: U. Colorado, http://genomics-course.jasondk.org/?p=395) Many fast & efficient computational solutions exist. You have to figure out how to choose parameters to maximize sensitivity/specificity, and when to validate.

Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)

Data analysis challenges Choosing a software suite/pipeline/analysis approach. Scaling chosen approach to volume of data (2-200x what it was designed for). Efficiently running software. Integrating analysis results and extracting desired information. Understanding what you’ve done in sufficient detail to design & perform requisite computational controls.

Data analysis challenges, cont’d The rate of change is itself accelerating: New tools, approaches every month. More data, data types, chemistries every month. Increasing commercialization (so getting an honest answer from the companies is basically impossible) But… opportunities are great! Jump on in!

What does the future hold? “Prediction is very difficult, especially about the future.” -- Niels Bohr More, cheaper sequencing: plan for a world where you can sequence anything you sample, to any depth you want, for arbitrarily small amounts of money. Seriously. Solutions to the majority of the scaling issues in data analysis (but not the scientific issues…)

II. Our research “Making sense of sequence” “Surfing the data tsunami” There are a number of fascinating challenges at the intersection of genomics and the rest of biology; they require appropriate (ab)use of computational techniques, applied to data sets from interesting critters and/or experimental setups. (Evolution turns out to be especially interesting in this regard.)

Frontiers in sequencing new stuff… There are many, many interesting critters for which we have essentially no genomic or transcriptomic information. Next-gen sequencing has now made these organisms accessible to investigation. But dealing with organisms for which there is no reference genome is … challenging.

A brief intro to shotgun assembly. Fragments: “It was the best of times, it was the wor” | “, it was the worst of times, it was the” | “isdom, it was the age of foolishness” | “mes, it was the age of wisdom, it was th”. Assembled: “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness”. …but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.

Assemble based on word overlaps: “the quick brown fox jumped” + “jumped over the lazy dog” => “the quick brown fox jumped over the lazy dog”. Repeats do cause problems: my chemical romance: nanana nanana, batman!
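The overlap idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (exact matches only, two fragments at a time, no sequencing errors), not how production assemblers work; the function names and the `min_overlap` parameter are made up for this example.

```python
from typing import List, Optional


def all_overlaps(left: str, right: str, min_overlap: int = 2) -> List[int]:
    """Every length at which a suffix of `left` equals a prefix of `right`,
    longest first."""
    max_len = min(len(left), len(right))
    return [
        size
        for size in range(max_len, min_overlap - 1, -1)
        if left.endswith(right[:size])
    ]


def merge(left: str, right: str, min_overlap: int = 2) -> Optional[str]:
    """Join two fragments on their longest overlap, or return None if none."""
    sizes = all_overlaps(left, right, min_overlap)
    if not sizes:
        return None
    return left + right[sizes[0]:]


# A unique overlap ("jumped") gives one unambiguous answer.
print(merge("the quick brown fox jumped", "jumped over the lazy dog"))

# A repeat gives several equally valid overlaps, so the true length of the
# repeated region cannot be recovered from these two fragments alone.
print(all_overlaps("nanana", "nanana"))
```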

Project I: metagenomics Wild microbes!

These microbes mediate important geobiological processes (e.g. nitrogen reduction)

Ecology & evolution of these habitats??

Sampling strategy per site. [Figure: layout of soil cores around a reference soil sample, with spacings labeled 1 cm, 1 m, and 10 m.] Soil cores: 1 inch diameter, 4 inches deep. Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing).

Great Prairie sequencing summary 200x human genome…! > 10x more challenging (total diversity)

Subdividing reads by connection “Partitioning” => assembly on multiple computers
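One way to picture partitioning is as connected components: reads that share sequence end up in the same group, and each group can then be assembled separately. The Python sketch below uses union-find over shared k-mers. It is a toy stand-in for illustration only, not the method actually used in this research; the reads are made up, `k=5` is arbitrary, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict
from typing import Dict, List


def partition_reads(reads: List[str], k: int = 5) -> List[List[int]]:
    """Group read indices into partitions connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Remember which read first contained each k-mer; any later read with
    # the same k-mer is connected to it.
    first_seen: Dict[str, int] = {}
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Hypothetical reads: 0 and 1 share "CGTAA"; 2 and 3 share "TGGGG".
toy_reads = ["ACGTACGTAA", "CGTAAGGCTT", "TTTTTGGGGG", "TGGGGGCCCC"]
print(partition_reads(toy_reads))
```

Each resulting partition could then be handed to a separate machine for assembly, which is the point of the slide.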

Project II: transcriptomics Developmental change in non-model ascidians, the Molgula

Molgula questions What happened to the downstream tail gene network in the tailless ascidian? What are the genomic adaptations that made the Molgulidae particularly susceptible to tail loss? (e.g. Manx/bobcat) How does tail loss actually work, functionally? Heterochrony of metamorphosis?

Preliminary round of sequencing (Illumina 76 bp x 2, ~250 bp insert size)

Molgula/emerging story Looks like notochord/tail cells are being specified, but cell movement isn’t happening. May be failure in convergence/extension? Computational leads => experimental validation.

Research goals “Better science through superior computation” Enable interesting biology downstream of sequence analysis. Also, provide tools to others.

III. (Graduate) education! Biology is fast becoming data-intensive. This requires expertise that is not traditionally part of many biologists’ training. More generally, “computational science” in biology is really at least three different things: Data analysis (data => hypothesis discovery/validation) Modeling/simulation (ecology models, protein structure, etc.) Instantiation of biological system (e.g. evolution). I’m avoiding theory and (non-digital) experiment, which are yet separate skills…

…and worse… Increasingly, biological understanding relies on computational analysis and inference. Computational intuition and informed skepticism (a.k.a. “scientific method”…) aren’t taught to biologists.

…and worst. All of this rests on a “bedrock” foundation of Badly written or inflexible software that’s difficult to run or install. Scripts written quickly and without reflection or testing. Ineffective computer use. …and a general lack of regard for reproducibility and replication.

Cultural problems? Physics, in particular, has a history of computation, and a robust computational culture… but not bio so much. “Many undergrads got into biology because they were interested in science, but didn’t like the math required for physics and chemistry. I have bad news for them…” -- me Bad news? Computation is increasingly important in bio. Good news? Computation != math. Better news: BEACON is enriched for grads, postdocs, and faculty that live at this interface. It’s a good crowd.

So what do we do? BEACON course: “Computational Science for (Evolutionary) Biologists”, v2.0 (alpha) 1. Teach programming for computational scientists. 2. Teach computational science strategies/thinking. 3. Touch on reproducibility, RCR, and data management. 4. Keep it interesting enough that people don’t “check out” 5. …try to figure out remote interaction: currently teaching across MSU (15), UT Austin (3), UW Seattle (2), and U Idaho (3).

What is class like? Tuesdays: programming HW due; discussion of computational stuff. Thursdays: reading HW due; group presentation; discussion. Groups split between MSU & (other); in-group teleconf (iPads and FaceTime), whiteboard (Jot!) (Yes, we bought 16 iPads for the course. BEACON now owns 16 iPads.)

The course is still a work in progress You can ask your local students, too, but – In-class interaction is possible, but still hard. Group dynamics! Now across 1000s of miles! Not everyone is great at technology multitasking (although kids these days…) That whole “mixed background” thing is extra challenging. BEACON students are so diverse that you can’t rely on all of them really knowing anything specific. But they all know so much individually that you risk boring them. Sigh.

Educational future Increasingly, BEACON graduate students cannot be placed into easy categories (Michelle Vogel, Tasneem Pierce). Can we really split these people into “bio” and “compu” folk? No… nor should we want to, necessarily; that’s the whole point of BEACON! Can we make the courses more distributed to take advantage of remote faculty expertise? Last year: no way in heck. This year, the tech is working better. Plus, iPads! Your opinions welcome, especially if they involve less work for me. Note: options for faculty & postdocs, too: summer course w/Dworkin.
