This document summarizes lessons learned from managing large genomic datasets at Monsanto. It discusses how Monsanto uses big data technologies like Hadoop, HBase, and Solr to store and query genomic data at scale. Key lessons include using HBase more like a hashmap than a relational database, denormalizing HBase schemas, and using distributed search technologies like SolrCloud rather than rebuilding Solr indexes. The document provides examples of genomic data formats and architectures used to store, index, and retrieve genomic feature data from petabytes of sequence data.
12. Tall Narrow
ColumnFamily: V
Rowkey (Ind:chr:pos)        Id      Ref  SNP
1a2b3c:chr1:1001       ->   rs1243  A    true
1a2b3c:chr1:1000       ->   rs321   C    true
456def:chr1:1000       ->   rs1243  A    true
13. Matrix Report Use Case
Pos VCF1 VCF2 VCF3 VCF4
chr1:100 A A . .
chr1:101 . C C C
chr1:102 . . . A
chr1:103 T T . T
chr1:104 . . . .
14. First Try
[Diagram: one Mapper per individual/region pair (Individual 1 / Region 1 ... Individual n / Region m), shuffled on Chrom:Pos to one Reduce per position (chr1:1000, chr1:1001, ..., chrN:M), each Reduce writing an intermediate result.]
16. Check For Gaps
[Diagram: one Mapper per individual/chromosome (Individual 1 / chr 1, Individual 2 / chr 1, ..., Individual n / chr m), one Reduce per individual:chromosome key (abc123:chr1, abc124:chr1, ..., n:m), each writing an intermediate result; gaps are then checked against the coverage table.]
17. This Takes Some Time
[Chart: Matrix Report Running Time; y-axis: time in minutes (0–35), x-axis: # of VCFs (1–11).]
26. Lessons Learned
• Use HBase like a HashMap, not like a relational database
• Denormalize HBase schemas, no foreign keys
• To scale Solr indexes past one server, use SolrCloud/Cloudera Search; don't rebuild it on your own
Good Afternoon
Rob Long
Today, talk about big data at Monsanto
And some of the lessons we’ve learned.
Started > year ago
Little knowledge
Something with plants
But then, so cool
Been doing big data since 2009
It wasn’t so big then, but…
Big data not size, but handling
Multiple machines
Denormalized nosql
Mapreduce
Also variety,
unstructured and semistructured
integration
Monsanto not prev. assoc. w/ big data
Sell seeds
Recipes for soil air water into food
We need to understand:
Genomics, agronomy, breeding, chemistry,
Data feedback for breeding
Year over year improvement
US average corn yields, 1863–2002
Axes: x: year, y: bushels per acre (0–160)
Up to the 1930s, yield was ~30 bushels/acre
Increasing yields due to breeding, treatments, automation, weather prediction, etc.
Now > 140 bushels/acre
5x increase
Goal of 300 bushels/acre by 2030
How?
Knowledge of the molecular machinery, answers “how?”
Highschool biology, applied
Squiggles are important – called organelles
Smaller scale
ACGT’s
Genes as recipes
How many people familiar w/ the Human Genome Project?
Some done on the east campus
Actual books on shelves
Map for a genome
Shred a book
Need a guide
A reference, just like the library.
Many references, one per species
Store individual genomes
Warehouse / data mart
Archipelago of sources
Bureaucratic friction
Solution: GenRe
<point out the parts>
App servers are WebLogic
Using Cloudera CDH 4.6
hadoop cluster has 30 data nodes, 3 edge nodes
Edge nodes host services
Compute farm
- BLAST to find sequences
Graph db for lineage
Header – standard/custom data
1 variant per line
Variant ind. != ref
Address, chrom/pos
Chrom = file, pos = offset
Count from 1
1 line per pos
Maize 2.3B bases
1/1000 variants
Keep variants only
1/1000th storage requirement
Pay a price, explain later
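The notes above describe the VCF layout: one tab-separated line per variant, addressed by chromosome and 1-based position, and since only about 1 in 1000 positions differs from the reference, keeping variants alone cuts storage roughly 1000x. A minimal sketch of parsing one data line (the record shown is hypothetical):

```python
# A VCF data line is tab-separated: CHROM, POS (1-based), ID, REF, ALT, ...
def parse_vcf_line(line):
    chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref, "alt": alt}

rec = parse_vcf_line("chr1\t1000\trs1243\tA\tG\t50\tPASS\t.")
print(rec["chrom"], rec["pos"], rec["alt"])  # chr1 1000 G
```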
Three main access patterns.
Dump ind.
Matrix, sites passing filters
Regions or whole genome
Aligned to same ref
Flanking regions
Not real-time
How many familiar w/ HBase?
watch: “Introduction to NoSQL” by Martin Fowler on YouTube
HBase, Bigtable-style
Distributed persistent hashmap
Keyspace partitioned into regions
Region servers host the regions
CAP, CP
Rows, cf
rowkeys
Column families, similar datas
Sparse columns, 1 c1 vs 2 c1
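The "distributed persistent hashmap" view above can be sketched as a toy model: a sorted map from rowkey to a sparse column dict (the sample rows and values are illustrative only):

```python
import bisect

# Toy model of HBase's data model: a sorted map from rowkey to a sparse
# {"family:qualifier": value} dict. Rows need not share columns, and the
# sorted keyspace is what gets partitioned into regions.
table = {
    "1a2b3c:chr1:1000": {"V:Id": "rs321",  "V:Ref": "C", "V:SNP": "true"},
    "456def:chr1:1000": {"V:Id": "rs1243", "V:Ref": "A", "V:SNP": "true"},
}
keys = sorted(table)

def get(rowkey):
    # Point lookup: the hashmap-style access HBase is built for.
    return table.get(rowkey)

def scan(start, stop):
    # Range scan over the sorted keyspace (start inclusive, stop exclusive).
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(k, table[k]) for k in keys[lo:hi]]

print(get("456def:chr1:1000")["V:Id"])  # rs1243
print(len(scan("1a2b3c", "1a2b3d")))    # 1
```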
Our approach
Tall/narrow schema
Fixed # cols
Many rows
1 row / variant / VCF
Id:chr:pos, groups indiv. data
Good for indiv. and flanking (explain)
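A sketch of why the Id:chr:pos rowkey suits the individual-dump and flanking queries: zero-padding the position makes lexicographic rowkey order match genomic order, so one individual's variants around a position are a contiguous scan. The padding width and flank size here are assumptions for illustration:

```python
def rowkey(ind, chrom, pos, width=9):
    # Zero-pad pos so lexicographic rowkey order matches genomic order
    # (HBase sorts rowkeys as bytes).
    return f"{ind}:{chrom}:{pos:0{width}d}"

def flank_range(ind, chrom, pos, flank=500):
    # Scan bounds for a flanking-region query: [pos - flank, pos + flank].
    return rowkey(ind, chrom, max(1, pos - flank)), rowkey(ind, chrom, pos + flank)

print(rowkey("1a2b3c", "chr1", 1000))  # 1a2b3c:chr1:000001000
```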
But…
Review matrix use case
Filter: Pos w/ => 1 variant
More joins in tall/narrow
M * N scanners, worst case
TableMapper, region/individual
Join this on chrom:pos, get individuals at a pos
Now we can filter
Emit if needed
Intermediate result, map files
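The first-try join described above can be sketched in miniature: mappers emit one record per variant keyed by (chrom, pos), and the reduce side gathers all VCFs seen at a position, filling "." where a VCF has no call. The sample data is made up:

```python
from collections import defaultdict

vcfs = {
    "VCF1": {("chr1", 100): "A", ("chr1", 103): "T"},
    "VCF2": {("chr1", 100): "A", ("chr1", 101): "C"},
}

def map_phase(vcfs):
    # One mapper per individual/region; emit keyed by chrom:pos.
    for vcf_id, variants in vcfs.items():
        for key, allele in variants.items():
            yield key, (vcf_id, allele)

def reduce_phase(pairs, vcf_ids):
    grouped = defaultdict(dict)
    for key, (vcf_id, allele) in pairs:
        grouped[key][vcf_id] = allele
    # One matrix row per position with at least one variant.
    return {key: [grouped[key].get(v, ".") for v in vcf_ids]
            for key in sorted(grouped)}

rows = reduce_phase(map_phase(vcfs), ["VCF1", "VCF2"])
print(rows[("chr1", 101)])  # ['.', 'C']
```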
Blue: ref
Green: data
vert.boxes, 8 variant, 12 ref, 24 no data
24 complicates, no row
Either ref or no data?
Solved: store gaps
Group by individual/chrom
Gaps addressed this way
Intersect gaps
Then
w/gaps, back to chr:pos orientation
Dump results to n files
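The gap logic above, sketched: a missing row means either "matches the reference" or "no data", and storing no-coverage gaps per individual:chrom lets the report tell them apart. The gap intervals here are hypothetical:

```python
# No-coverage gaps per individual:chrom, as half-open [start, end) intervals.
gaps = {"abc123:chr1": [(200, 300), (450, 500)]}

def call_at(ind_chrom, pos, variant):
    if variant is not None:
        return variant          # an explicit variant row exists
    for start, end in gaps.get(ind_chrom, []):
        if start <= pos < end:
            return "no-data"    # position falls in a coverage gap
    return "ref"                # covered but no variant row => reference

print(call_at("abc123:chr1", 250, None))  # no-data
```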
X: num individuals
Y: minutes
Exponential growth
Too many joins
Swap individual with genome build
One row per pos per genome
One get
HBase used correctly
Single pass for M individ.
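The fix above, sketched: with the rowkey swapped to genomeBuild:chrom:pos there is one row per position per genome build, with one sparse column per individual that has a variant there, so the matrix report becomes a single pass with one get per position and no joins. The build name "B73v2" and the individual ids are made-up examples:

```python
table = {
    "B73v2:chr1:000000100": {"V:1a2b3c": "A", "V:456def": "A"},
    "B73v2:chr1:000000101": {"V:456def": "C"},
}

def matrix_row(key, individuals):
    # One get per position; individuals are sparse columns on the row.
    cols = table.get(key, {})
    return [cols.get("V:" + ind, ".") for ind in individuals]

print(matrix_row("B73v2:chr1:000000101", ["1a2b3c", "456def"]))  # ['.', 'C']
```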
Variants have a context
Location, in a gene, known effects
Search on any field
JBrowse, open-source visualization
An established pattern
Features dumped from HBase
Embedded Solr in reducers
Zips in HDFS
Solr server runs cron
Pulls zipped indexes
Merges
Incremental updates skip MR phase
Problems:
Brittle, lots of code to maintain
Not distributed, a single solr server
Must move data off HDFS
Ramping up to billions of features
The old Solr server was choking.
Enter, SolrCloud.
Indexes in HDFS,
Coordinates through zookeeper
Collection concept
Same REST interface
Cloudera Search
Lily HBase Indexer service (real time)
MR indexer
Morphline file defines mappings between inputs and Solr docs
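A morphline is a HOCON config consumed by the indexer tools; a rough sketch of its shape, with hypothetical column and field names:

```
morphlines : [
  {
    id : variantFeatures
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        # Map HBase cells onto Solr document fields.
        extractHBaseCells {
          mappings : [
            { inputColumn : "V:Id",  outputField : "variant_id", type : string, source : value }
            { inputColumn : "V:Ref", outputField : "ref_allele", type : string, source : value }
          ]
        }
      }
      # Drop any fields the Solr schema doesn't know about.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```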
Foreign key problem
Add a step
Mapping from Avro to Solr
Can have multi-level records
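The extra mapping step above can be sketched as flattening a nested Avro-style record into a flat Solr doc, so there is nothing left to join at query time. The field names are hypothetical:

```python
feature = {
    "id": "rs1243",
    "location": {"chrom": "chr1", "pos": 1000},
    "effects": [{"gene": "zmGene1", "impact": "missense"}],
}

def to_solr_doc(rec):
    # Flatten nested fields; lists become multi-valued flat fields.
    doc = {"id": rec["id"],
           "chrom": rec["location"]["chrom"],
           "pos": rec["location"]["pos"]}
    doc["effect_gene"] = [e["gene"] for e in rec["effects"]]
    doc["effect_impact"] = [e["impact"] for e in rec["effects"]]
    return doc

print(to_solr_doc(feature)["effect_gene"])  # ['zmGene1']
```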
HBase like a HashMap
Denormalize
Scale with SolrCloud, don't rebuild