8. MP/MPI solution: k-mer counting
[Figure: raw data is split into slices; each node/core holds a data slice and a slice of the k-mer count table.]
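The slide's architecture can be sketched in plain Python: each node hashes k-mers to decide which slice of the count table it owns, counts only those, and the per-node slices merge into the full table. This is an illustrative sketch, not BioPig or the MPI code; the function names and the choice of `hash()` for partitioning are assumptions.

```python
from collections import Counter

def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def partition(kmer, n_nodes):
    """Assign a k-mer to a node by hashing, so each node owns a
    disjoint slice of the count table (hypothetical scheme)."""
    return hash(kmer) % n_nodes

def count_slice(reads, node_id, n_nodes, k):
    """Count only the k-mers that fall in this node's table slice."""
    table = Counter()
    for read in reads:
        for km in kmers(read, k):
            if partition(km, n_nodes) == node_id:
                table[km] += 1
    return table

reads = ["ACGTACGTAC", "CGTACGTACG"]
# Merge the slices from two simulated nodes into the full count table.
full = Counter()
for node in range(2):
    full.update(count_slice(reads, node, n_nodes=2, k=4))
```

Because the partitions are disjoint, each k-mer is counted exactly once across nodes, and no node needs the whole table in memory.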
9. MP/MPI performance
• MPI version: fast, scalable
– 412 GB, 4.5B reads
– 2.7 hours on 128x24 cores
– NERSC Hopper II
• MP threaded version
– 268 GB, 3B reads
– 5 days on 32 cores
– High-memory cluster
• Problems:
– Requires experienced software engineers
– Six months of development time
– If one node fails, all fail
10. Hadoop/MapReduce framework
• Google MapReduce
– Data-parallel programming model to process petabyte-scale data
– Generally has a map step and a reduce step
• Apache Hadoop
– Distributed file system (HDFS) and job handling for
scalability and robustness
– Data locality to bring compute to data, avoiding network
transfer bottleneck
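The map and reduce steps above can be sketched in Python for the k-mer counting problem: map emits (k-mer, 1) pairs, the framework shuffles pairs by key, and reduce sums each key's values. This is a minimal single-process illustration of the model, not Hadoop or BioPig code; the function names are assumptions.

```python
from collections import defaultdict

def map_phase(read, k=4):
    """Map step: emit a (k-mer, 1) pair for each k-mer of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle step: group values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: sum the counts for one k-mer."""
    return key, sum(values)

reads = ["ACGTACGT", "GTACGTAC"]
pairs = (p for r in reads for p in map_phase(r))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In Hadoop, the map and reduce functions run on many nodes in parallel, and HDFS data locality lets each mapper read the slice of input stored on its own node.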
12. BioPig: design goals
• Flexible
– every dataset is unique, data analysts have domain knowledge that is essential
to optimize the analysis,
– pluggable modules that analysts can use to build custom analytic pipelines,
• High-level
– a domain-specific language enables data analysts to create custom pipelines,
– hides the details of parallelism (too complex for most users),
• Scalability
– leverage data parallelism to speed up analytics,
– integrate external tools and applications where necessary,
– scale from 1 to hundreds of compute nodes with minimal effort and linear
scalability.
• Robustness
– Data and computation are replicated across nodes
to combat failures
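The "pluggable modules" goal can be sketched as stages that analysts compose freely: each stage maps records to records, and a pipeline is just an ordered list of stages. This is an illustrative Python sketch of the design idea under assumed names; real BioPig modules are Pig Latin scripts and UDFs running on Hadoop.

```python
def drop_short(min_len):
    """Stage factory: filter out reads shorter than min_len (hypothetical)."""
    def stage(reads):
        return [r for r in reads if len(r) >= min_len]
    return stage

def to_kmers(k):
    """Stage factory: expand each read into its k-mers (hypothetical)."""
    def stage(reads):
        return [r[i:i + k] for r in reads for i in range(len(r) - k + 1)]
    return stage

def run_pipeline(reads, stages):
    """Apply the stages in order; any stage can be swapped or reordered."""
    for stage in stages:
        reads = stage(reads)
    return reads

out = run_pipeline(["ACGT", "AC", "ACGTA"], [drop_short(4), to_kmers(3)])
```

Because stages share one interface, a custom analysis is built by choosing and ordering modules rather than writing new parallel code.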
13. Runs on any hardware supporting Hadoop
• JGI Titanium (commodity Hadoop cluster)
– Up to 20 nodes: 16 cores, 32 GB RAM, 1.799 GHz, 1 Gb Ethernet
• NERSC Magellan cloud testbed
– Up to 200 nodes: 8-core 2.67 GHz Nehalem processors, 24 GB RAM, 10 Gbit InfiniBand, GPFS
• Amazon AWS
– Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core "Nehalem", 1690 GB of instance storage, 10 Gb Ethernet)
17. Rumen metagenome gene discovery pipeline
• Read preprocess (remove artifacts)
• pigBlast (BLAST reads against known cellulases)
• pigAssembler (assemble reads into contigs)
• pigExtender (extend contigs into full-length enzymes)
18. Cloud solution to large data
• Pipeline modules: BioPig-Blaster, BioPig-Assembler, BioPig-Extender
• BioPig: 61 lines of code
• MPI-extender: ~12,000 lines (vs 31 in BioPig)
• Gains: flexibility, programmability, scalability
20. Challenges in application
• IO optimization, e.g., reduce data copying
• Some problems do not easily fit into
map/reduce framework, e.g., graph-based
algorithms
• Integration into existing frameworks, e.g., Galaxy
21. Acknowledgement
• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC