Enabling next-generation sequencing
applications

With IBM Storwize V7000 Unified and
SONAS Gateway solutions

Dr. Tzy-Hwa Tzeng, Dr. Ruzhu Chen,
Justin Morosi, Prashant Avashia
IBM Systems and Technology Group ISV Enablement
October 2012

© Copyright IBM Corporation, 2012
Table of Contents
Abstract
Introduction: DNA and RNA sequencing applications
    DNA, RNA, and next-generation sequencing (NGS) technologies
    Analysis tools
Introduction: IBM Storwize V7000 Unified and SONAS systems
    IBM Storwize V7000 Unified system overview
    IBM SONAS Gateway system overview
    Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
Architectural assumptions
IBM Storwize V7000 Unified: Configurations, tests, and results
IBM SONAS Gateway: NAS configurations, tests, and results
File systems layout: Best practice recommendations
Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system
Summary
Acknowledgments
Appendices
    Appendix A: Typical server and storage configuration sizing recommendations
    Appendix B: Resources
About the authors
Trademarks and special notices

Enabling next-generation sequencing applications
Abstract
Next-generation genomic sequencing technologies have been instrumental in accelerating
biological research and the discovery of genomes for humans, mice, snakes, plants,
bacteria, viruses, cancer cells, and so on. Researchers now process immense data sets, build
analytical deoxyribonucleic acid (DNA) models for large genomes, use reference-based analytic
methods, and further their understanding of genomic models for drug discovery, personalized
medicine, toxicology, forensics, agriculture, nanotechnology, and other emerging use cases.
IBM has partnered with CLC bio Inc. to bring validated and integrated smart computing
solutions that combine intelligent Assembly Cell software and optimized open systems software
from public domains with IBM Smarter Storage. This joint solution incorporates IBM
industry expertise, best practices, and IBM technologies to help research institutions and
pharmaceutical companies manage, query, analyze, and better understand integrated
genotypic and phenotypic data for medical research and patient treatment.
This paper validates that IBM Storwize V7000 Unified and Scale Out Network Attached Storage
(SONAS) Gateway based Smarter Storage solutions offer good application performance and
availability for de novo assembly and reference-based mapping algorithms, under the following
circumstances:
• Access to genomic data from DNA and ribonucleic acid (RNA) sequences is configured on
IBM Storwize V7000 Unified or SONAS Gateway solutions.
• The CLC Assembly Cell or open systems software applications are configured on Red Hat
Enterprise Linux (RHEL) servers.
• The Network File System (NFS) v3 services are configured and delivered over the
Internet Protocol (IP) network.

This paper offers recommendations and guidance to simplify configuration and installation of
the solution and to help achieve good performance.

Introduction: DNA and RNA sequencing applications
Genetic concepts and interesting facts
All humans, animals, plants, and other living organisms are composed of cells. Inside each cell
resides a nucleus. The nucleus is a self-contained unit that acts as a central entity, managing the functions
and activity inside and outside the cell. The nucleus contains most of the cell's genetic information,
organized as multiple long linear DNA molecules that are co-existent with a large variety of proteins, to
form chromosomes. The genes within these chromosomes make up the cell's genome. The function of the
nucleus is to maintain the integrity of these genes and control the cell activities. The nucleus is, therefore,
the control center of the cell.

Genes, DNA, and RNA
Genes are made up of various lengths of DNA, which contains four chemicals: adenine (A), guanine (G),
cytosine (C), and thymine (T). These chemicals line up similar to beads on a necklace to form strands of
code. They also pair up with each other to form the double strands that are characteristic of DNA. The
sequence of a nucleic acid is the composition of atoms that make up the nucleic acid and the chemical
bonds that bond those atoms.
DNA is a nucleic acid containing the genetic instructions used in the development and functioning of all
known living organisms (with the exception of RNA viruses). The DNA segments carrying this genetic
information are called genes. Likewise, other DNA sequences have structural purposes, or are involved in
regulating the use of this genetic information. Along with RNA and proteins, DNA is one of the three major
macromolecules that are essential for all known forms of life.
RNA is also a nucleic acid, and is one of the four major macromolecules (along with lipids, carbohydrates,
and proteins) essential for all known forms of life. Similar to DNA, RNA is made up of a long chain of
components called nucleotides. Each nucleotide consists of a nucleobase, a ribose sugar, and a
phosphate group. The sequence of nucleotides allows RNA to encode genetic information. In addition,
many viruses use RNA instead of DNA as their genetic material.
The chemical structure of RNA is very similar to that of DNA, with two differences: (a) RNA contains the
sugar ribose, while DNA contains the slightly different sugar deoxyribose (a type of ribose that lacks one
oxygen atom), and (b) RNA has the nucleobase uracil while DNA contains thymine. The other three
nucleobases namely, adenine (A), guanine (G), and cytosine (C) are the same in both DNA and RNA.
Unlike DNA, most RNA molecules are single-stranded and can adopt very complex three-dimensional
structures.
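
The two chemical differences above can be illustrated with a few lines of code. The following is a minimal Python sketch, not part of any NGS tool discussed in this paper; the function names are illustrative. It rewrites a DNA strand in the RNA alphabet by substituting uracil for thymine, and derives the opposite DNA strand from the standard base-pairing rules (A with T, G with C).

```python
# Standard DNA base pairing: adenine with thymine, guanine with cytosine.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand of a DNA sequence, read in the reverse direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def dna_to_rna(dna: str) -> str:
    """Rewrite a DNA sequence in the RNA alphabet (thymine becomes uracil)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("GATTACA"))  # TGTAATC
    print(dna_to_rna("GATTACA"))          # GAUUACA
```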
Human genome
The human genome is the complete set of human genetic information, stored as separate DNA
sequences in the 23 chromosome pairs of the human cell nucleus, plus a small amount of mitochondrial DNA.
(Mitochondria supply the chemical energy required for the cell to survive.) The human genome is
estimated to be about 3.2 billion base pairs long and it contains about 20,000 to 25,000 distinct genes.
The twenty-third chromosome pair determines sex. In humans, as in many animal species, a pair of X
chromosomes determines a female, and a combination of X and Y chromosomes determines a male. The X
chromosome has more than 153 million base pairs and represents about 2000 of the 20,000 to 25,000 genes
in the human genome (roughly 8% to 10% of the total gene population). The Y chromosome has about
58 million base pairs and represents about 200 to 500 of the 20,000 to 25,000 genes in the human genome.
The largest human chromosome is chromosome 1, which is approximately 249 million base pairs long. The
smallest DNA molecule in the genome is the mitochondrial DNA, which is approximately 16,000 base pairs long.
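
To put these figures into storage terms, the short Python sketch below works through the arithmetic. It assumes a hypothetical packed encoding of 2 bits per base (four possible bases), which is an illustration only and not a file format used by the sequencers or applications in this paper.

```python
GENOME_BASE_PAIRS = 3_200_000_000  # estimate quoted in the text
BITS_PER_BASE = 2                  # assumption: A, C, G, T packed into 2 bits each


def packed_size_mb(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> float:
    """Size in megabytes (10^6 bytes) of a sequence stored at a fixed bit width."""
    return base_pairs * bits_per_base / 8 / 1_000_000


def gene_share(chromosome_genes: int, total_genes: int) -> float:
    """Percentage of all genes that reside on one chromosome."""
    return 100.0 * chromosome_genes / total_genes


if __name__ == "__main__":
    print(packed_size_mb(GENOME_BASE_PAIRS))  # 800.0
    print(gene_share(2000, 25_000))           # 8.0
    print(gene_share(2000, 20_000))           # 10.0
```

Raw sequencer output is larger than this lower bound because each read also carries identifiers and quality values, and because each region of the genome is typically read many times.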

DNA, RNA, and next-generation sequencing (NGS) technologies
In genetics, the sequencing processes determine the primary structure of an unbranched biopolymer. The
sequencing process results in a symbolic linear depiction (known as a sequence), which clearly
summarizes much of the atomic-level structure of the sequenced molecule.
DNA sequencing is the process of reading the nucleotide bases in a DNA molecule. It includes any
method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine,
and thymine, (A,G,C,T)—in a strand of DNA.
RNA sequencing is the process of reading the nucleotide bases in an RNA molecule. It includes any
method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine,
and uracil, (A,G,C,U)—in a strand of RNA.
Next-generation sequencing technologies parallelize the sequencing process, producing thousands or
millions of sequences at a time. These technologies are intended to lower the cost of sequencing beyond
what is possible with standard dye-terminator methods. High-throughput sequencing technologies
generate millions of short reads from a library of nucleotide sequences; whether they come from DNA,
RNA, or a mixture, the sequencing mechanism of each platform does not vary.

Analysis tools
The next-generation sequencing technologies read the biological specimen or the tissue sample, and
produce hundreds of thousands (or even millions) of base pairs for analysis. As of 2012, a typical
sequencing run lasts between a single day (24 hours) and a single week (168 hours), and generates
between 100 MB and 3 GB of data. Over the next few years, sequencers are expected to deliver
significantly more precise results in less time than current processes, methods, and technologies
allow.
There are several open source, high-performance next-generation sequencing tools, such as
Burrows-Wheeler Aligner (BWA) and Trinity, that can analyze the genomic DNA and RNA data from the
sequencers. Under a commercial license, CLC bio offers a comprehensive, high-performance
computing solution for the Life Sciences industry. This section explains all three applications.

CLC Assembly Cell
CLC Assembly Cell is available on a commercial license. It is a high-performance computing solution for
read mapping and de novo assembling of next-generation sequencing data. It includes native color-space
support.
The command-line interface (CLI) of CLC Assembly Cell enables the functionalities to be easily included in
scripts and other next-generation sequencing workflows.
CLC Assembly Cell uses single-instruction, multiple-data (SIMD) compute instructions to parallelize and
accelerate the assembly algorithms, making the program the fastest next-generation sequencing
assembler in the market. For reference, visit the following URL:
http://www.clcbio.com/wp-content/uploads/2012/09/CLCAssemblyCell12.pdf
Burrows-Wheeler Aligner
Burrows-Wheeler Aligner (BWA) is an open-source, high-performance tool, and is available freely, with no
software licensing restrictions. It is an efficient program that aligns relatively short nucleotide sequences
against a long reference sequence, such as the human genome. It implements two algorithms, BWA-SHORT
and BWA-SW. The former works for query sequences shorter than 200 base pairs and the latter
for longer sequences up to around 100,000 base pairs. Both algorithms do gapped alignment. They are
usually more accurate and faster on queries with low error rates.
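
BWA takes its name from the Burrows-Wheeler transform (BWT), a reversible reordering of a text that makes it compact to index and fast to search. The following Python sketch shows only the transform and its inverse, using a naive sorted-rotations approach that is practical for short strings. It is not how BWA itself is implemented; the function names and the `$` terminator are illustrative choices.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("GATTACA")
    print(encoded)
    print(inverse_bwt(encoded))  # GATTACA
```

Production aligners build the transform of the reference genome once with far more efficient algorithms, and then reuse the resulting index for every read.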
Trinity
Trinity, developed at the Broad Institute (a collaboration of MIT and Harvard Universities), is also a widely
used open-source, high-performance tool. It represents a novel method for the efficient and robust
de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent
software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of
RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each
representing the transcriptional complexity at a given gene or locus, and then processes each graph
independently to extract full-length splicing isoforms and to tease apart transcripts derived from
paralogous genes.
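
The de Bruijn graph mentioned above can be sketched in a few lines. In this simplified Python illustration (not Trinity's implementation; the function name and the toy reads are hypothetical), every k-mer found in a read contributes an edge from its first k-1 bases to its last k-1 bases. Overlapping reads therefore share nodes, and walking the graph reconstructs longer sequences.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads; together they spell ACGTA.
    for node, successors in sorted(de_bruijn_graph(["ACGT", "CGTA"], k=3).items()):
        print(node, "->", sorted(successors))
```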
NGS solution benefits at a glance:
The NGS tools are enabled, tested, validated, and certified. They are then included in optimized solutions
by IBM®. IBM has combined technology, industry expertise, best practices, and leading analytical partner
applications into a tightly integrated solution. With this solution, research institutions and pharmaceutical
companies can easily manage, query, analyze, and better understand integrated genotypic and
phenotypic data for medical research and patient treatment. They can:
• Organize, integrate, and manage different kinds of data to enable focused clinical
research, including: diagnostic, clinical, demographic, genomic, phenotypic, imaging,
environmental, and more.
• Enable secure, cross-department collection and sharing of clinical and research data.
• Ensure flexibility and growth with open and industry-standards based architecture.

Introduction: IBM Storwize V7000 Unified and SONAS systems
This section provides introductory details and highlights of IBM Storwize® V7000 Unified and IBM SONAS
Gateway systems.

IBM Storwize V7000 Unified system overview
Many users have deployed storage area network (SAN) attached storage for their applications requiring
the highest levels of performance while separately deploying network-attached storage (NAS) for its ease
of use and lower cost networking. This divided approach adds complexity by introducing multiple
management points and also creates islands of storage that reduce efficiency.
The Storwize V7000 Unified system provides the ability to combine both block and file storage into a single
system. By consolidating storage systems, multiple management points can be eliminated and storage
capacity can be shared across both types of access, helping to improve overall storage utilization. The
Storwize V7000 Unified system also presents a single, easy-to-use management interface that supports
both block and file storage, helping to simplify administration further.
The Storwize V7000 Unified system builds on the functions and high-performance design of the Storwize
V7000 system and integrates proven IBM software capabilities to deliver new levels of efficiency.
The Storwize V7000 Unified system provides the same software capabilities as the IBM SONAS system, as
follows:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, allowing all users a single,
  logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes
  to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of
  files
  − Supports industry-standard protocols: Common Internet File System (CIFS),
  Network File System (NFS), File Transfer Protocol (FTP), Hypertext Transfer
  Protocol Secure (HTTPS), and Secure Copy Protocol (SCP)
• Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering
high bandwidth and additional connectivity in each SONAS interface node to manage
multiple data streams and functions (for example, backup, replication, antivirus).
• Data protection: File system and fileset-level snapshots (up to 256 per file system)
provide a way to partition the namespace into smaller, more manageable units.
• Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative
GUI provide icon-based navigation, informative graphics, and SONAS visualizations that
streamline storage tasks and display real-time capacity, performance, and system health.
• Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data
from malware and use the most commonly deployed ISV antivirus applications.
(Clarification for this paper: in Life Sciences, a virus is an ultramicroscopic (20 to 200 nm
in diameter) infectious agent that replicates within host cells and is composed of a DNA or
RNA core and a protein coat. The term antivirus in this paper does not refer to that Life
Sciences sense.)
• Cloud features: Self-managing, autonomic system enables capacity, provisioning, and
other IT service management decisions to be made dynamically, without human
intervention or increased administrative costs. IBM Active Cloud Engine™ enables
ubiquitous access to files from across the globe quickly and cost effectively.
• Operational savings and total cost of ownership (TCO):
  − Consolidates multiple individual filers and their management, thereby avoiding
  problems associated with administering an array of disparate NAS systems
  − Automates file placement by transparently moving files to another internal or
  external storage pool, optimizes your storage resources, and offers tremendous
  time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square
  meter), is highly scalable and can help reduce capital expenditure and enhance
  operational efficiency; its advanced architecture virtualizes and consolidates file
  space into a single, enterprise-wide file system, which can translate into reduced
  TCO

IBM SONAS Gateway system overview
The IBM SONAS Gateway system is designed to manage vast repositories of information in enterprise
environments requiring very large capacities, high levels of performance, and high availability.
SONAS Gateway uses a mature technology from the IBM high-performance computing (HPC) experience.
It is based upon the IBM General Parallel File System (IBM GPFS™), a highly scalable clustered file
system. SONAS Gateway is an easy-to-install, turnkey, modular, scale out NAS solution. It provides the
performance, clustered scalability, high availability, and functionality that are essential for meeting
strategic multi-petabyte age and cloud storage requirements.
SONAS Gateway currently offers the following features and capabilities:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, allowing all users a single,
  logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes
  to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of
  files
  − Supports industry-standard protocols: CIFS, NFS, FTP, HTTPS, and SCP
• Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering
high bandwidth and additional connectivity in each SONAS interface node to manage
multiple data streams and functions (for example, backup, replication, antivirus).
• Data protection: File system and fileset-level snapshots (up to 256 per file system)
provide a way to partition the namespace into smaller, more manageable units.
• Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative
GUI provide icon-based navigation, informative graphics, and SONAS visualizations that
streamline storage tasks and display real-time capacity, performance, and system health.
• Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data
from malware and use the most commonly deployed ISV antivirus applications.
(Clarification for this paper: in Life Sciences, a virus is an ultramicroscopic (20 to 200 nm
in diameter) infectious agent that replicates within host cells and is composed of a DNA or
RNA core and a protein coat. The term antivirus in this paper does not refer to that Life
Sciences sense.)
• Cloud features: Self-managing, autonomic system enables capacity, provisioning, and
other IT service management decisions to be made dynamically, without human
intervention or increased administrative costs. IBM Active Cloud Engine enables
ubiquitous access to files from across the globe quickly and cost effectively.
• Operational savings and TCO:
  − Consolidates multiple individual filers and their management, thereby avoiding
  problems associated with administering an array of disparate NAS systems
  − Automates file placement by transparently moving files to another internal or
  external storage pool, optimizes your storage resources, and offers tremendous
  time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square
  meter), is highly scalable and can help reduce capital expenditure and enhance
  operational efficiency; its advanced architecture virtualizes and consolidates file
  space into a single, enterprise-wide file system, which can translate into reduced
  TCO

Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
The difference between the IBM Storwize V7000 Unified and SONAS Gateway systems lies in the
workloads that each system can support. The Storwize V7000 Unified system can support smaller and
medium-size workloads, while the SONAS Gateway system has the scalability to deliver high performance
for extremely large application workloads and capacities, typically for the entire enterprise.

Table 1 offers the comparative product positioning between the Storwize V7000 Unified and SONAS
systems:

| No. | Attribute                                              | Storwize V7000 Unified                                                    | SONAS                                                |
|-----|--------------------------------------------------------|---------------------------------------------------------------------------|------------------------------------------------------|
| 1   | Maximum number of interface nodes                      | 2                                                                         | 30                                                   |
| 2   | Maximum number of storage nodes                        | N/A                                                                       | 60                                                   |
| 3   | Maximum raw capacity of file storage                   | 360 TB (3 TB drives x 12 drives per expansion unit x 10 expansion units) | 21.6 PB (3 TB drives x 240 drives x 30 controllers) |
| 4   | Maximum size of single shared file system (GPFS)       | 8 PB                                                                      | 8 PB                                                 |
| 5   | Maximum number of file systems within a single system  | 64                                                                        | 256                                                  |
| 6   | Maximum size of a single file                          | 8 PB                                                                      | 8 PB                                                 |
| 7   | Maximum number of files per storage system             | 4 billion                                                                 | 4 billion                                            |
| 8   | Maximum number of dependent file sets per file system  | 256                                                                       | 3000                                                 |
| 9   | Maximum number of independent file sets                | 256                                                                       | 1000                                                 |
| 10  | Maximum number of independent file sets                | 256                                                                       | 1000                                                 |

Table 1: Comparative product positioning of Storwize V7000 Unified and SONAS Gateway systems
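
The raw-capacity figures in row 3 of Table 1 follow directly from the drive counts quoted there. The short Python sketch below reproduces the arithmetic; the function name is illustrative, and decimal units (1 PB = 1000 TB) are assumed.

```python
def raw_capacity_tb(drive_tb: int, drives_per_unit: int, units: int) -> int:
    """Raw capacity in TB: drive size x drives per unit x number of units."""
    return drive_tb * drives_per_unit * units


# Storwize V7000 Unified: 3 TB drives, 12 drives per expansion unit, 10 units.
v7000_unified_tb = raw_capacity_tb(3, 12, 10)

# SONAS: 3 TB drives, 240 drives per controller, 30 controllers.
sonas_tb = raw_capacity_tb(3, 240, 30)
sonas_pb = sonas_tb / 1000  # assumes decimal petabytes

if __name__ == "__main__":
    print(v7000_unified_tb, "TB")  # 360 TB
    print(sonas_pb, "PB")          # 21.6 PB
```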

Architectural assumptions
Make a note of the following architectural assumptions and caveats in regard to the technical content of
this paper.
This paper does:
• Offer information and recommendations for tuning adjustments to achieve good
performance in normal NAS production environments.
• Allow a non-technical customer or user to quickly tune their NAS environment by using
recommendations, observations, tips, and best practices, as documented.
• Provide information from a non-technical user point of view for fast implementations.
This paper does not:
• Explain the various technologies and solutions to establish or publish any benchmarks.
• Guarantee a specific performance of any technical element.
• Provide or offer any information to overcome previously established benchmarks.
• Explain or explore newer technologies, standards, and concepts such as 40 GbE
connections, NFS V4, cloud multi-tenancy and so on.
• Offer any guidance on how to determine hardware sizing or capacity planning for your
installation.
Caveats:
• Use your own judgment when making decisions.
• Do not take any published numbers literally.
• For this paper, the tests were run on different IBM equipment located at different IBM
data centers. Note that the performance results might vary, depending on unique server and
client conditions, architectural configurations, network behaviors, application
dependencies, and operational environments. Your performance might differ
from the test results.
• Recommended best practices sometimes differ from the test configurations. The test
configurations were set up to observe certain behavior in specific test situations. The best
practices are recommended to run operations in production environments.

IBM Storwize V7000 Unified: Configurations, tests, and results
Configuration and tests
An IBM Storwize V7000 Unified system was tested with the three NGS applications: CLC Assembly Cell,
BWA, and Trinity. The connectivity between the Storwize V7000 Unified system and the single application
server was configured as NAS-attached configuration. This configuration was a typical use case for a
small research facility, with minimal compute resources, as shown in Figure 1.

Figure 1: NAS-attached Storwize V7000 Unified configuration with NGS applications for a small research facility

Test results with CLC Assembly Cell
The following tables summarize the results of successful testing of de novo assembly and reference
assembly with the CLC Assembly Cell software, BWA application software, and Trinity application
software using identical server and storage configurations, as demonstrated in Figure 1.

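| Input    | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)* |
|----------|-----------------|---------------------|----------------|---------------|-----------------------------------|
| gz-fastq | 16 (16)         | 571                 | 575            | 544           | 589                               |
| gz-fastq | 32 (32)         | 386                 | 384            | 376           | 385                               |
| gz-fastq | 32 (64)         | 323                 | 309            | 320           | 310                               |
| fasta    | 16 (16)         | 534                 | 520            | 525           | 560                               |
| fasta    | 32 (32)         | 351                 | 345            | 337           | 363                               |
| fasta    | 32 (64)         | 288                 | 286            | 267           | 276                               |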

Table 2: CLC Assembly Cell performance results with de novo assembly using non paired-end option

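| Input    | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)* |
|----------|-----------------|---------------------|----------------|---------------|-----------------------------------|
| gz-fastq | 16 (16)         | 582                 | 570            | 571           | 610                               |
| gz-fastq | 32 (32)         | 396                 | 374            | 380           | 387                               |
| gz-fastq | 32 (64)         | 333                 | 329            | 323           | 317                               |
| fasta    | 16 (16)         | 569                 | 566            | 535           | 605                               |
| fasta    | 32 (32)         | 380                 | 365            | 369           | 365                               |
| fasta    | 32 (64)         | 313                 | 315            | 310           | 298                               |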

Table 3: CLC Assembly Cell performance results with de novo assembly using paired-end information

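| Input                       | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)* |
|-----------------------------|-----------------|---------------------|----------------|---------------|-----------------------------------|
| CLC Assembly Cell Version 4 | 32 (64)         | 78.5                | 78             | 72            | 80                                |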

Table 4: CLC Assembly Cell performance results with paired-end reference mapping information

*Mount Options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
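
The footnote above lists NFS client mount options as a single comma-separated string. The Python sketch below shows one way to split such a string into individual options and to assemble an equivalent `mount -t nfs -o ...` command line. It rests on assumptions: the trailing `0 0` in the footnote is interpreted as the dump and pass fields of an /etc/fstab entry rather than as mount options, and the server export and mount point in the usage example are hypothetical placeholders, not values from the test environment.

```python
import shlex

FOOTNOTE = ("rw,bg,hard,rsize=1048576,wsize=1048576,"
            "proto=tcp,vers=3,noac,nocto,actimeo=0 0 0")


def split_fstab_tail(text):
    """Split 'options dump pass' into the option string and the remaining fields."""
    fields = text.split()
    return fields[0], fields[1:]


def parse_options(option_string):
    """Turn 'a,b=1' into {'a': True, 'b': '1'}."""
    options = {}
    for item in option_string.split(","):
        key, _, value = item.partition("=")
        options[key] = value if value else True
    return options


def mount_command(server_export, mount_point, option_string):
    """Build (but do not run) an NFS mount command line."""
    return shlex.join(["mount", "-t", "nfs", "-o", option_string,
                       server_export, mount_point])


if __name__ == "__main__":
    opts, extra = split_fstab_tail(FOOTNOTE)
    print(parse_options(opts))
    # Hypothetical export and mount point, for illustration only.
    print(mount_command("nas.example.com:/export/ngs", "/mnt/ngs", opts))
```

In this option set, `noac` and `actimeo=0` turn off client-side attribute caching, and `rsize` and `wsize` request 1 MiB transfers.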

Test results with the BWA application
When the BWA application was run on the same server with the same Storwize V7000 Unified system as
the storage back-end, the following test results were obtained, as shown in Table 5.
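| Input                                            | Threads | Storwize V7000 Unified [1] (no cache) | Storwize V7000 Unified [2] (with cache) | Local       |
|--------------------------------------------------|---------|---------------------------------------|-----------------------------------------|-------------|
| Comparing BWA reads_100m.fq with humangenome.fa  | 8       | 44 min 46 s                           | 44 min 57.483 s                         | 44 min 59 s |
|                                                  | 16      | 26 min 29 s                           | 25 min 38.118 s                         | 26 min 38 s |
|                                                  | 24      | 20 min 50 s                           | 20 min 9.843 s                          | 21 min 34 s |
|                                                  | 32      | 18 min 40 s                           | 19 min 0.600 s                          | 18 min 40 s |
|                                                  | 64      | 24 min 58 s                           | 26 min 47.676 s                         | 26 min 20 s |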

Table 5: BWA performance results with various file system options

[1] Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
[2] Mount options: rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
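
The run times in Table 5 can be converted into speedup factors to show how BWA scales with the thread count. The following Python sketch does this for the Storwize V7000 Unified runs without caching, using only the durations reported in the table; the helper names are illustrative.

```python
import re

# Storwize V7000 Unified (no cache) run times from Table 5, keyed by thread count.
NO_CACHE = {
    8: "44 min 46 s",
    16: "26 min 29 s",
    24: "20 min 50 s",
    32: "18 min 40 s",
    64: "24 min 58 s",
}


def to_seconds(duration: str) -> float:
    """Convert a duration such as '44 min 46 s' into seconds."""
    match = re.fullmatch(r"(\d+) min ([\d.]+) s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedups(runs, baseline_threads=8):
    """Speedup of every run relative to the run at baseline_threads."""
    base = to_seconds(runs[baseline_threads])
    return {threads: base / to_seconds(d) for threads, d in runs.items()}


def fastest_thread_count(runs):
    """Thread count with the shortest run time."""
    return min(runs, key=lambda threads: to_seconds(runs[threads]))


if __name__ == "__main__":
    for threads, factor in speedups(NO_CACHE).items():
        print(f"{threads:>2} threads: {factor:.2f}x")
    print("fastest:", fastest_thread_count(NO_CACHE), "threads")
```

On these figures the shortest run is at 32 threads, about 2.4 times faster than at 8 threads, and the 64-thread run is slower than the 32-thread run.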
Test results with the Trinity application
When the Trinity application was run on the same server with the same Storwize V7000 Unified system as
the storage back-end, the following test results were obtained, as shown in Table 6.

| Mount options | Duration |
|---------------|----------|
| fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,addr=9.11.82.103) | 869 min 4.618 s |
| fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103) | 866 min 47.335 s |
| rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days |
| /dev/sdb on /xfs type xfs (rw,nobarrier) | 787 min 54.544 s |
| 9.11.83.71:/ibm/gpfs1/Life_sciences_bak on /NGS type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.83.71) | 875 min 36.453 s |
| rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days |

Table 6: Trinity application performance results with various file system mount point options

Note: Run times are long because the Trinity application creates millions of files, ranging from 0 MB to
10 MB in size. This is typical behavior for the Trinity application.
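
Because Trinity creates so many small files, the cost of file creation and attribute lookups on a given mount has a large effect on run time. The following Python sketch is a hypothetical micro-benchmark, not the method used by the authors: it times the creation of many small files, each followed by a `stat` call, in a directory of your choice. Running it against directories on differently mounted file systems gives a rough, relative indication of metadata overhead.

```python
import os
import tempfile
import time


def small_file_benchmark(directory, count=1000, size_bytes=1024):
    """Create `count` small files, stat each one, clean up, and return elapsed seconds."""
    payload = b"x" * size_bytes
    paths = [os.path.join(directory, f"probe_{i:06d}.dat") for i in range(count)]
    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as handle:
            handle.write(payload)
        os.stat(path)  # forces an attribute lookup for every file
    elapsed = time.perf_counter() - start
    for path in paths:
        os.remove(path)
    return elapsed


if __name__ == "__main__":
    # A local temporary directory is used here; point `directory` at an NFS
    # mount to probe that mount instead.
    with tempfile.TemporaryDirectory() as directory:
        seconds = small_file_benchmark(directory, count=1000)
        print(f"1000 files in {seconds:.3f} s")
```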

IBM SONAS Gateway: NAS configurations, tests, and results
Configurations and tests
An IBM SONAS Gateway system was tested with the three NGS applications: CLC Assembly Cell, BWA,
and Trinity. The connectivity between the SONAS Gateway system and 14 IBM BladeCenter® blade
servers was configured as a NAS-attached configuration. The blade servers represented application
services. This configuration was a typical use case for a medium to large research facility, with adequate
compute and performance resources, as shown in Figure 2.

Figure 2: NAS-attached SONAS Gateway configuration with NGS applications for a medium- to large-research facility

Test results with CLC Assembly Cell
The following tables summarize the results of successful testing of de novo assembly and reference
assembly with the CLC Assembly Cell software, BWA application software, and Trinity application
software using identical server and storage configurations, as demonstrated in Figure 2.
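| Input    | Cores (threads) | SONAS Gateway (minutes)* |
|----------|-----------------|--------------------------|
| gz-fastq | 16 (16)         | 573                      |
| gz-fastq | 16 (32)         | 439                      |
| fasta    | 16 (16)         | 547                      |
| fasta    | 16 (32)         | 406                      |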

Table 7: CLC Assembly Cell performance results with de novo assembly using the non paired-end option

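| Input    | Cores (threads) | SONAS Gateway (minutes)* |
|----------|-----------------|--------------------------|
| gz-fastq | 16 (16)         | 591                      |
| gz-fastq | 16 (32)         | 449                      |
| fasta    | 16 (16)         | 588                      |
| fasta    | 16 (32)         | 437                      |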

Table 8: CLC Assembly Cell performance results with de novo assembly using paired-end information

| CLC Assembly Cell | Cores (threads) | SONAS Gateway (minutes)* |
|-------------------|-----------------|--------------------------|
| Version 4         | 16 (32)         | 148                      |

Table 9: CLC Assembly Cell performance results with paired-end reference mapping information

*Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
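
The de novo assembly results for the SONAS Gateway system (Tables 7 and 8) can be summarized as the percentage of run time saved when going from 16 to 32 threads on 16 cores. The Python sketch below computes that figure from the reported minutes. The assignment of rows to the gz-fastq and fasta inputs is an assumption based on the order in which the values appear in the tables, and the names are illustrative.

```python
# (minutes at 16 cores/16 threads, minutes at 16 cores/32 threads)
SONAS_DE_NOVO = {
    ("non paired-end", "gz-fastq"): (573, 439),
    ("non paired-end", "fasta"): (547, 406),
    ("paired-end", "gz-fastq"): (591, 449),
    ("paired-end", "fasta"): (588, 437),
}


def reduction_percent(before: float, after: float) -> float:
    """Percentage of run time saved when going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for (mode, input_format), (t16, t32) in SONAS_DE_NOVO.items():
        saved = reduction_percent(t16, t32)
        print(f"{mode:>14} {input_format:<8} {saved:.1f}% shorter with 32 threads")
```

On these numbers, doubling the thread count on the same 16 cores shortens each run by roughly a quarter.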

Test results with BWA application
Table 10 summarizes the results of successful testing with the BWA application software.

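| Input                                   | Threads | SONAS Gateway Hx9203** (no cache) | SONAS Gateway Hx9201** (no cache) | SONAS Gateway Hx9201*** (cache) |
|-----------------------------------------|---------|-----------------------------------|-----------------------------------|---------------------------------|
| BWA reads_100m.fq vs humangenome.fa     | 8       | 50 min 31 s                       | 43 min 58.889 s                   | 44 min 6.455 s                  |
|                                         | 16      | 30 min 7 s                        | 25 min 30.709 s                   | 26 min 30.983 s                 |
|                                         | 24      | 26 min 45 s                       | 22 min 49.733 s                   | 22 min 17.789 s                 |
|                                         | 32      | 25 min 46 s                       | 22 min 38.095 s                   | 23 min 14.785 s                 |

| Node   | Run time    |
|--------|-------------|
| Hx9202 | 26 min 34 s |
| Hx9205 | 30 min 14 s |
| Hx9206 | 30 min 50 s |
| Hx9207 | 30 min 1 s  |
| Hx9208 | 31 min 12 s |
| Hx9210 | 30 min 32 s |
| Hx9211 | 32 min 13 s |
| Hx9212 | 30 min 54 s |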

Table 10: Results of successful testing of BWA applications on 14 servers attached to the SONAS Gateway system
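To compare the BWA run times across thread counts, the durations can be converted to seconds and expressed as a speedup relative to the 8-thread run. The sketch below uses the Hx9203 (no cache) values reported for Table 10; the helper names are hypothetical and the parsing only covers the "N min S s" form used in this paper.

```python
import re

# Matches durations written as in this paper, for example "43 min 58.889 s".
_DURATION = re.compile(r"(\d+)\s*min\s*(\d+(?:\.\d+)?)\s*s")


def to_seconds(text):
    """Convert a duration such as '50 min 31 s' to seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Threads -> duration for server Hx9203 (no cache), as reported for Table 10.
HX9203 = {8: "50 min 31 s", 16: "30 min 7 s", 24: "26 min 45 s", 32: "25 min 46 s"}


def speedups(durations, baseline_threads=8):
    """Return the speedup of each run relative to the baseline thread count."""
    base = to_seconds(durations[baseline_threads])
    return {
        threads: round(base / to_seconds(text), 2)
        for threads, text in durations.items()
    }


if __name__ == "__main__":
    for threads, factor in speedups(HX9203).items():
        print(f"{threads} threads: {factor}x relative to 8 threads")
```

For these values, most of the gain comes from the step from 8 to 16 threads; the steps to 24 and 32 threads add comparatively little.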

The following mount options were used for the results marked ** and *** in Table 10.

** rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
*** rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
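The two option sets differ mainly in their caching behavior, which is easier to see when the strings are parsed and compared. The following is an illustrative sketch only. It assumes that the trailing "0 0" after actimeo=0 consists of the dump and pass fields of an /etc/fstab entry rather than mount options, so it is ignored; the function names are hypothetical.

```python
def parse_mount_options(option_string):
    """Parse a comma-separated NFS option string into a dictionary.

    Flags without a value (for example 'noac') map to True.
    Assumption: anything after the first whitespace (such as the '0 0'
    fstab dump/pass fields) is not part of the option list.
    """
    fields = option_string.split()
    if not fields:
        return {}
    options = {}
    for item in fields[0].split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def diff_options(first, second):
    """Return the option names that appear only in first and only in second."""
    a, b = parse_mount_options(first), parse_mount_options(second)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


# Option strings as documented for the ** (no cache) and *** (cache) runs.
NO_CACHE = "rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0"
CACHE = "rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103"

if __name__ == "__main__":
    only_no_cache, only_cache = diff_options(NO_CACHE, CACHE)
    print("Only in the no-cache set:", ", ".join(only_no_cache))
    print("Only in the cache set:", ", ".join(only_cache))
```

The options noac, nocto, and actimeo=0 disable client-side attribute caching and close-to-open consistency checks, which is why the first set is labeled "No cache" in Table 10.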

Test results with Trinity application
When the Trinity application was run on the same server, with the same Storwize V7000 Unified system as
the storage back end, the test results shown in Table 11 were obtained.
Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs
(rw,addr=172.26.39.180)
Duration: 804 min 2.373 s

Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs
(rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180)
Duration: 812 min 7.432 s

Mount options: 172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs
(rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180)
Duration: 760 min 50.722 s; 774 min 20.714 s

Table 11: Results of successful testing of Trinity Application on 14 servers attached to SONAS Gateway system

File systems layout: Best practice recommendations
To ensure good application performance with optimal run times when servers are connected over
10 GbE to IBM Storwize V7000 Unified or SONAS Gateway systems, note the following considerations:
• Proper sizing and stability of application nodes and servers is extremely important to drive
  the required levels of workload for the different types of algorithms, such as de novo
  assembly or reference-based mapping.
• The selection of mount options affects the performance and run-time characteristics of NGS
  applications. An incorrect selection of mount options results in long-running jobs, because
  these jobs create millions of files ranging from 0 MB to 10 MB in size.
• None of these applications was observed to saturate the network, the IBM Storwize V7000
  Unified system, or the IBM SONAS Gateway system.

For improved performance in a typical production environment, lay out the file systems for
NGS applications according to the following guidelines and best practice recommendations:
• Different NGS applications require different mount options for increased performance and
  optimal response time.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using
  the cluster method of creating the block allocation maps, to achieve uniform disk
  performance across all storage capacities.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using
  logfileplacement value = striped, to stripe the log file of the file system across all
  metadata disks.
• Use a block size of 256 K for both short-term and long-term storage.
• As a best practice, run all RHEL 6.2 servers with dual 10 GbE bonded network connections,
  with MTU=9000.
• To support various NGS application workloads, two interface nodes are recommended on the
  Storwize V7000 Unified system for increased availability.
• To support various NGS application workloads, at least two interface nodes are recommended
  on the SONAS Gateway system for increased availability.
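The recommendations above can be captured as a simple checklist and compared against the settings of a planned deployment. The following sketch is only an illustration of that idea. The dictionary keys are hypothetical names chosen for this example; they are not SONAS, Storwize, or GPFS command parameters, and the values must be collected by the administrator.

```python
# Recommended values taken from the best practice list in this paper.
# The key names are hypothetical labels for this sketch, not product parameters.
RECOMMENDED = {
    "block_allocation": "cluster",
    "log_file_placement": "striped",
    "block_size_kib": 256,
    "client_mtu": 9000,
}
MIN_INTERFACE_NODES = 2


def check_layout(settings):
    """Return a list of deviations from the recommendations (empty if none)."""
    findings = []
    for key, expected in RECOMMENDED.items():
        actual = settings.get(key)
        if actual != expected:
            findings.append(f"{key}: expected {expected!r}, found {actual!r}")
    nodes = settings.get("interface_nodes", 0)
    if nodes < MIN_INTERFACE_NODES:
        findings.append(
            f"interface_nodes: expected at least {MIN_INTERFACE_NODES}, found {nodes}"
        )
    return findings


if __name__ == "__main__":
    planned = {
        "block_allocation": "cluster",
        "log_file_placement": "striped",
        "block_size_kib": 256,
        "client_mtu": 1500,  # jumbo frames not yet enabled in this example
        "interface_nodes": 2,
    }
    for finding in check_layout(planned):
        print(finding)
```

A check of this kind is most useful before the file system is created, because settings such as the block size are chosen at creation time.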

Solution benefits: IBM Storwize V7000 Unified and SONAS
Gateway system
Both SONAS and Storwize V7000 Unified systems offer the following significant benefits for clients
running NGS applications for efficient analysis of genomic data from DNA and RNA sequences:
• Easily examine a large group of potential gene candidates by using typical applications,
  such as BLAST, linkage analysis, and Mascot, that can quickly search and rapidly screen
  targets in genomic databases, genomes, and assays.
• Efficiently create targeted drug treatments. Easily enable the scale-up development of
  new drug molecules developed through Drug Discovery (Research, Synthesis), Pre-Clinical
  Development (Preparation, Formulation, Pre-dosage design), and Pre-FDA (new drug
  formulation, standards).
• Deliver tight integration between ERP and pharma supply chains. SONAS supports
  pharmaceutical processes that scale up active pharmaceutical ingredients (APIs) from
  milligram to kilogram quantities for commercial manufacturing and distribution of drugs,
  with improved visibility into process optimization and consistent yield across batches.
• Lower TCO by reducing drug discovery costs through the use and reuse of databases,
  common analytical data, processes, and standards throughout the pharmaceutical
  operational chain.
• Deliver on-demand cloud computing models to rapidly address changing levels of
  analytical computational capacity and to facilitate self-service of analytical tools, pooling of
  analytic, research development, and pharmaceutical manufacturing resources, and common
  and scalable transactional processes and standards.

Summary
This paper validates that IBM Storwize V7000 Unified and SONAS Gateway based solutions offer good
application performance with excellent virtualization and availability under the following circumstances:
• Access to genomic data from DNA and RNA sequences is configured on the IBM Storwize
  V7000 Unified or SONAS Gateway system.
• The CLC Assembly Cell or open systems software applications are configured on RHEL
  servers.
• The NFS v3 services are configured and delivered over the IP network.

This paper also offers recommendations and guidance for configuring and installing the solution
efficiently and with good performance.

Acknowledgments
Special thanks to the teams from CLC bio in Denmark for loaning the software licenses of the CLC
Assembly Cell software, which enabled the IBM test team to create a representative operational test
environment in IBM data centers and run tests to document real-life results.
Many thanks to the IBM client executives, IBM Systems and Technology Group members, and other team
members who contributed their recommendations during the test run and review process, and who
enabled successful completion and validation so that CLC bio software applications can run successfully
in the various environments provided by IBM Storwize V7000 Unified and SONAS Gateway systems.
The IBM team also extends special thanks to Connie Borton, Michael Nelson, Cathy Drews,
Daniel Drinnon, and Larry Garibay for their invaluable help and assistance, without which the
validation of three independent software applications would not have been successful.

Appendices
Appendix A: Typical server and storage configuration sizing recommendations
This section provides typical recommendations and guidelines for server and storage configuration sizing
for small, medium, and large research facilities. Although this information is typical, organizations
differ in terms of the following criteria:
• The number and types of sequencers in the facility
• The types of genomes being worked on in the laboratory or organization
• The processes being pursued within the organization, such as reference mapping,
  assembly, transcription analysis, or downstream analytics
• The amount of data that must be kept active
• The amount of data that must be kept archived
• The response time required, in terms of the number of genomes per day, per week, or per
  month
• Many other factors

Tier 1: 1 to 2 human size genomes per week, for both de novo and reference-based mapping
Single server and internal storage configuration
• IBM System x3750 with 2.4 GHz E5 4640, ½ TB RAM, 16 TB internal disks
• 4 sockets, 32 cores, 2.4 GHz Intel® Xeon® processor E5 4640
• 32 x 16 GB 1600 MHz DDR3 DIMMs
• 16 x 2.5-inch 1 TB SAS drives

Tier 2: 3 to 10 human size genomes per week or need more than 15 TB online for both de novo and
reference-based mapping
Multiple server and external storage configuration
• IBM BladeCenter HS23 frame enclosed with 14 blade servers. Each blade server is
  configured with a 2.6 GHz Intel Xeon processor E5 2670, 128 GB RAM, and dual 10 GbE
  connection ports
• 2 sockets, 16 cores, 2.6 GHz Intel Xeon processor E5 2670
• 16 x 8 GB 1333 MHz DDR3 DIMMs
• 2 x 2.5-inch 300 GB SAS drives
• 96 x 2.5-inch 600 GB 10 K rpm drives in four enclosures of IBM Storwize V7000 Unified.
  The IBM Storwize V7000 Unified system can host up to 10 enclosures; therefore, if
  more capacity is needed in the future, more disks can be added in the remaining six
  enclosures.
• 1 external switch (or customer-supplied switch) to support 10 GbE connectivity
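The component counts in the Tier 1 and Tier 2 lists can be cross-checked with simple arithmetic. The sketch below does so under two stated assumptions that are not claims from the test results: capacities are raw decimal values before RAID, spare, and file system overhead, and the six additional enclosures would hold the same number of 600 GB drives as the first four.

```python
def sizing_summary():
    """Recompute the memory and raw disk totals from the listed component counts."""
    drives, enclosures, drive_gb = 96, 4, 600
    max_enclosures = 10
    drives_per_enclosure = drives // enclosures
    return {
        # Tier 1: 32 x 16 GB DIMMs and 16 x 1 TB drives
        "tier1_ram_gb": 32 * 16,
        "tier1_disk_tb": 16 * 1,
        # Tier 2: 16 x 8 GB DIMMs per blade server
        "tier2_ram_gb_per_blade": 16 * 8,
        # Tier 2 storage: raw capacity only (assumption: decimal TB, no RAID overhead)
        "tier2_drives_per_enclosure": drives_per_enclosure,
        "tier2_raw_tb": drives * drive_gb / 1000,
        # Assumption: all 10 enclosures populated like the first four
        "tier2_max_raw_tb": drives_per_enclosure * max_enclosures * drive_gb / 1000,
    }


if __name__ == "__main__":
    for name, value in sizing_summary().items():
        print(f"{name}: {value}")
```

The totals agree with the headline figures in the lists: 512 GB (½ TB) of RAM and 16 TB of internal disk for Tier 1, and 128 GB of RAM per blade server for Tier 2. Usable capacity will be lower than the raw figures once RAID and file system overhead are taken into account.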

Tier 3: More than 10 human size genomes per week or need more than 100 TB online or need for
downstream analysis.

This is a custom configuration. You can contact IBM.
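The three tiers can be read as a simple decision rule. The following sketch encodes the thresholds stated in this appendix; the function name and parameters are hypothetical, and real sizing also depends on the criteria listed at the start of the appendix.

```python
def recommend_tier(genomes_per_week, online_tb, downstream_analysis=False):
    """Return the sizing tier (1, 2, or 3) suggested by the appendix thresholds.

    Tier 3: more than 10 genomes per week, more than 100 TB online,
            or a need for downstream analysis.
    Tier 2: 3 to 10 genomes per week, or more than 15 TB online.
    Tier 1: otherwise (1 to 2 genomes per week).
    """
    if genomes_per_week > 10 or online_tb > 100 or downstream_analysis:
        return 3
    if genomes_per_week >= 3 or online_tb > 15:
        return 2
    return 1


if __name__ == "__main__":
    examples = [(2, 10, False), (2, 20, False), (8, 40, False), (4, 30, True)]
    for genomes, tb, downstream in examples:
        tier = recommend_tier(genomes, tb, downstream)
        print(f"{genomes} genomes/week, {tb} TB online, "
              f"downstream analysis={downstream}: Tier {tier}")
```

The Tier 3 conditions are checked first so that any single one of them, such as a need for downstream analysis, is enough to indicate a custom configuration.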

Appendix B: Resources
The following websites provide useful references to supplement the information contained in this paper:
• Introduction to Genetics
  en.wikipedia.org/wiki/Introduction_to_genetics
• DNA
  en.wikipedia.org/wiki/DNA
• DNA Sequencing
  en.wikipedia.org/wiki/DNA_sequencing
• Cell Nucleus
  en.wikipedia.org/wiki/Cell_nucleus
• Human Genome
  en.wikipedia.org/wiki/Human_genome_map
• RNA
  en.wikipedia.org/wiki/RNA
• RNA-Seq
  en.wikipedia.org/wiki/RNA-Seq
• X Chromosome
  en.wikipedia.org/wiki/X_chromosome
• Y Chromosome
  en.wikipedia.org/wiki/Y_chromosome
• Trinity
  www.broadinstitute.org/scientific-community/software/trinity
• Burrows-Wheeler Aligner
  bio-bwa.sourceforge.net/
• CLC bio Applications
  www.clcbio.com/
• IBM Redbooks®
  ibm.com/redbooks
• IBM Publications Center
  www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
• IBM Scale Out Network Attached Storage Architecture, Planning and Implementation
  Basics [SG24-7875-00]
  ibm.com/redbooks/abstracts/sg247875.html?Open
• IBM Scale Out Network Attached Storage Concepts [SG24-7874-00]
  ibm.com/redbooks/abstracts/sg247874.html?Open
• IBM Storwize V7000 Introduction and Implementation Guide [SG24-7938]
  ibm.com/redbooks/redpieces/abstracts/sg247938.html?Open

About the authors
Dr. Tzy-Hwa K. (Kathy) Tzeng is a Senior Technical Staff Member (STSM) in the IBM Systems and
Technology Group ISV Strategy and Enablement Organization. She received her Ph.D. in Genetics and
Plant Pathology from Iowa State University. Prior to IBM, she led drug discovery projects in bioinformatics,
proteomics, and genomics. At IBM, she is responsible for the strategy and content of IBM Life Sciences
application plans, portfolio, and product positioning. You can reach Dr. Kathy Tzeng at tzy@us.ibm.com.

Dr. Ruzhu Chen is an IBM Certified Expert IT Specialist in the IBM Systems and Technology Group,
focusing on computational chemistry and NGS applications. Over the last ten years, he has successfully
tuned, benchmarked, and optimized solutions for IBM worldwide partners and customers. Ruzhu earned
his master's degree in Biochemistry from the University of Science and Technology of China, and a second
master's degree in Computer Science and a Ph.D. in Molecular Biology, both from the University of
Oklahoma. You can reach Dr. Ruzhu Chen at ruzhuchen@us.ibm.com.

Justin Morosi is a Consulting IT Specialist working for the IBM Systems and Technology Group as a
Worldwide Technical Architect focusing on HPC solutions. He has worked for IBM for over 14 years and
has more than 20 years of consulting and solution design experience. He holds numerous
industry-recognized certifications from Cisco, Microsoft®, VMware, Red Hat, and IBM. His areas of expertise
include high-performance computing and storage, high availability, disaster recovery, and virtualization. You
can reach Justin Morosi at jmorosi@us.ibm.com.

Prashant Avashia is a software engineer in the IBM Systems and Technology Group ISV Strategy and
Enablement Organization. With more than 15 years of experience, he has successfully architected,
engineered, and implemented enterprise infrastructure solutions for key global clients in the healthcare,
financial, and software industries. He earned his master's degree in Industrial Engineering from Kansas
State University, and a bachelor's degree in Mechanical Engineering from Osmania University, India. You
can reach Prashant Avashia at pavashia@us.ibm.com.

Trademarks and special notices
© Copyright IBM Corporation 2012.
References in this document to IBM products or services do not imply that IBM intends to make them
available in every country.
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked
terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these
symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information
was published. Such trademarks may also be registered or common law trademarks in other countries. A
current list of IBM trademarks is available on the Web at "Copyright and trademark information" at
www.ibm.com/legal/copytrade.shtml.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or
its affiliates.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the
United States, other countries, or both.
Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States,
other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Information is provided "AS IS" without warranty of any kind.
All customer examples described are presented as illustrations of how those customers have used IBM
products and the results they may have achieved. Actual environmental costs and performance
characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published
announcement material, or other publicly available sources and does not constitute an endorsement of
such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly
available information, including vendor announcements and vendor worldwide homepages. IBM has not
tested these products and cannot confirm the accuracy of performance, capability, or any other claims
related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the
supplier of those products.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice,
and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the
full text of the specific Statement of Direction.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive
statement of a commitment to specific levels of performance, function or delivery schedules with respect to
any future products. Such commitments are only made in IBM product announcements. The information is
presented here to communicate IBM's current investment and development activities as a good faith effort
to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be given that an
individual user will achieve throughput or performance improvements equivalent to the ratios stated here.
Photographs shown are of engineering prototypes. Changes may be incorporated in production models.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.

 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Dernier (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Enabling next-generation sequencing applications with IBM Storwize V7000 Unified and SONAS Gateway solutions

  • 1. Enabling next-generation sequencing applications With IBM Storwize V7000 Unified and SONAS Gateway solutions Dr. Tzy-Hwa Tzeng, Dr. Ruzhu Chen, Justin Morosi, Prashant Avashia IBM Systems and Technology Group ISV Enablement October 2012 © Copyright IBM Corporation, 2012
  • 2. Table of Contents
Abstract
Introduction: DNA and RNA sequencing applications
    DNA, RNA, and next-generation sequencing (NGS) technologies
    Analysis tools
Introduction: IBM Storwize V7000 Unified and SONAS systems
    IBM Storwize V7000 Unified system overview
    IBM SONAS Gateway system overview
    Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems
Architectural assumptions
IBM Storwize V7000 Unified: Configurations, tests, and results
IBM SONAS Gateway: NAS configurations, tests, and results
File systems layout: Best practice recommendations
Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system
Summary
Acknowledgments
Appendices
    Appendix A: Typical server and storage configuration sizing recommendations
    Appendix B: Resources
About the authors
Trademarks and special notices
  • 3. Abstract
Next-generation genomic sequencing technologies have significantly accelerated biological research and the discovery of genomes for humans, mice, snakes, plants, bacteria, viruses, cancer cells, and so on. Researchers now process immense data sets, build analytical deoxyribonucleic acid (DNA) models for large genomes, use reference-based analytic methods, and further their understanding of genomic models for drug discovery, personalized medicine, toxicology, forensics, agriculture, nanotechnology, and other emerging use cases.
IBM has partnered with CLC bio Inc. to bring validated and integrated smart computing solutions that combine intelligent Assembly Cell software and optimized open systems software from the public domain with IBM Smarter Storage. This joint solution incorporates IBM industry expertise, best practices, and IBM technologies to help research institutions and pharmaceutical companies manage, query, analyze, and better understand integrated genotypic and phenotypic data for medical research and patient treatment.
This paper validates that Smarter Storage solutions based on IBM Storwize V7000 Unified and Scale Out Network Attached Storage (SONAS) Gateway offer good application performance and availability for de novo assembly and reference-based mapping algorithms under the following circumstances:
• Access to genomic data from DNA and ribonucleic acid (RNA) sequences is configured on IBM Storwize V7000 Unified or SONAS Gateway solutions.
• The CLC Assembly Cell or open systems software applications are configured on Red Hat Enterprise Linux (RHEL) servers.
• The Network File System (NFS) v3 services are configured and delivered over the Internet Protocol (IP) network.
This paper offers recommendations and guidance for configuring and installing the solution efficiently and with good performance.
  • 4. Introduction: DNA and RNA sequencing applications
Genetic concepts and interesting facts
All humans, animals, plants, and other living organisms are composed of cells. In humans, animals, and plants, almost every cell contains a nucleus. The nucleus is a self-contained unit that acts as a central entity, managing the functions and activity of the cell. The nucleus contains most of the cell's genetic information, organized as multiple long linear DNA molecules that combine with a large variety of proteins to form chromosomes. The genes within these chromosomes make up the cell's genome. The function of the nucleus is to maintain the integrity of these genes and control the cell's activities. The nucleus is, therefore, the control center of the cell.
Genes, DNA, and RNA
Genes are made up of various lengths of DNA, which contains four chemicals: adenine (A), guanine (G), cytosine (C), and thymine (T). These chemicals line up like beads on a necklace to form strands of code. They also pair up with each other to form the double strands that are characteristic of DNA. The sequence of a nucleic acid is the order of these chemicals along the strand; it summarizes the atoms that make up the nucleic acid and the chemical bonds between them.
DNA is a nucleic acid containing the genetic instructions used in the development and functioning of all known living organisms (with the exception of RNA viruses). The DNA segments carrying this genetic information are called genes. Other DNA sequences have structural purposes or are involved in regulating the use of this genetic information. Along with RNA and proteins, DNA is one of the major macromolecules that are essential for all known forms of life.
RNA is also a nucleic acid. Nucleic acids, together with lipids, carbohydrates, and proteins, are the four major classes of macromolecules essential for all known forms of life. Similar to DNA, RNA is made up of a long chain of components called nucleotides.
Each nucleotide consists of a nucleobase, a ribose sugar, and a phosphate group. The sequence of nucleotides allows RNA to encode genetic information. In addition, many viruses use RNA instead of DNA as their genetic material. The chemical structure of RNA is very similar to that of DNA, with two differences: (a) RNA contains the sugar ribose, while DNA contains the slightly different sugar deoxyribose (a type of ribose that lacks one oxygen atom), and (b) RNA has the nucleobase uracil while DNA contains thymine. The other three nucleobases, namely adenine (A), guanine (G), and cytosine (C), are the same in both DNA and RNA. Unlike DNA, most RNA molecules are single-stranded and can adopt very complex three-dimensional structures.
Human genome
The human genome is the complete set of human genetic information, stored as separate DNA sequences in the 23 chromosome pairs of the cell nucleus and in a small DNA molecule in the mitochondria, the structures that supply the chemical energy the cell needs to survive. The human genome is estimated to be about 3.2 billion base pairs long and to contain about 20,000 to 25,000 distinct genes. The twenty-third chromosome pair determines sex. In humans, as in many animal species, a pair of X chromosomes determines a female, and a combination of X and Y chromosomes determines a male.
  • 5. The X chromosome has more than 153 million base pairs and represents about 2000 of the 20,000 to 25,000 genes in the human genome (or about 10% of the total gene population). The Y chromosome has about 58 million base pairs and represents about 200 to 500 of the 20,000 to 25,000 genes in the human genome. The largest human chromosome is chromosome 1, which is nearly 250 million base pairs long. The smallest is the mitochondrial DNA, which is approximately 16,000 base pairs long.
DNA, RNA, and next-generation sequencing (NGS) technologies
In genetics, sequencing determines the primary structure of an unbranched biopolymer. The sequencing process results in a symbolic linear depiction (known as a sequence), which summarizes much of the atomic-level structure of the sequenced molecule.
DNA sequencing is the process of reading the nucleotide bases in a DNA molecule. It includes any method or technology that is used to determine the order of the four bases, adenine, guanine, cytosine, and thymine (A, G, C, T), in a strand of DNA.
RNA sequencing is the process of reading the nucleotide bases in an RNA molecule. It includes any method or technology that is used to determine the order of the four bases, adenine, guanine, cytosine, and uracil (A, G, C, U), in a strand of RNA.
Next-generation sequencing technologies parallelize the sequencing process, producing thousands or millions of sequences at a time. These technologies are intended to lower the cost of sequencing beyond what is possible with standard dye-terminator methods. High-throughput sequencing technologies generate millions of short reads from a library of nucleotide sequences; whether they come from DNA, RNA, or a mixture, the sequencing mechanism of each platform does not vary.
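The base alphabets described above (A, G, C, T for DNA and A, G, C, U for RNA) can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from any tool discussed in this paper. The pairing rule used here (A with T, G with C) is standard biology rather than something this paper states.

```python
# Illustrative helpers only; the names below are hypothetical.
# Standard base-pairing rule for double-stranded DNA: A-T and G-C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite DNA strand, read in its own 5'-to-3' direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(dna: str) -> str:
    """Return the RNA spelling of a coding DNA strand: thymine (T) becomes uracil (U)."""
    return dna.upper().replace("T", "U")


def base_counts(sequence: str) -> dict:
    """Count how often each base letter occurs in a read."""
    counts = {}
    for base in sequence.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts


if __name__ == "__main__":
    read = "GATTACA"
    print(reverse_complement(read))  # TGTAATC
    print(transcribe(read))          # GAUUACA
    print(base_counts(read))
```

A sequencer's output is, in essence, millions of such short strings, which is why the downstream analysis tools are dominated by string processing and file I/O.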
Analysis tools
Next-generation sequencing technologies read the biological specimen or tissue sample and create hundreds of thousands (or even millions) of base pairs for analysis. As of 2012, a typical sequencing run takes from one day (24 hours) to one week (168 hours) and generates between 100 MB and 3 GB of data. Over the next few years, these technologies are expected to deliver significantly more precise results in less time than current processes, methods, and technologies allow.
There are several open source, high-performance next-generation sequencing tools, such as Burrows-Wheeler Aligner (BWA) and Trinity, that can analyze the genomic DNA and RNA data from the sequencers. Under a commercial license, CLC bio offers a comprehensive, high-performance computing solution for the life sciences industry. This section describes all three applications.
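As background for the Burrows-Wheeler Aligner mentioned above, the transform it is named after can be sketched in a few lines of Python. This is a naive, quadratic-memory illustration under stated assumptions: the function names and the `$` sentinel are choices made for this example, and BWA itself relies on far more efficient index structures.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    padded = text + sentinel
    # Build every rotation of the padded text and sort them lexicographically.
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending the transform and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The row that ends with the sentinel is the original padded text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("ACAACG")
    print(encoded)               # GC$AAAC
    print(inverse_bwt(encoded))  # ACAACG
```

The transform groups identical characters that share a following context, which is what makes compact, searchable indexes of a long reference sequence practical.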
  • 6. CLC Assembly Cell
CLC Assembly Cell is available under a commercial license. It is a high-performance computing solution for read mapping and de novo assembly of next-generation sequencing data. It includes native color-space support. The command-line interface (CLI) of CLC Assembly Cell enables its functions to be easily included in scripts and other next-generation sequencing workflows. CLC Assembly Cell uses single-instruction, multiple-data (SIMD) compute instructions to parallelize and accelerate the assembly algorithms, making the program the fastest next-generation sequencing assembler on the market. For reference, visit the following URL: http://www.clcbio.com/wp-content/uploads/2012/09/CLCAssemblyCell12.pdf
Burrows-Wheeler Aligner
Burrows-Wheeler Aligner (BWA) is an open-source, high-performance tool that is freely available with no software licensing restrictions. It is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence, such as the human genome. It implements two algorithms, BWA-short and BWA-SW. The former works for query sequences shorter than 200 base pairs and the latter for longer sequences of up to around 100,000 base pairs. Both algorithms perform gapped alignment. They are usually more accurate and faster on queries with low error rates.
Trinity
Trinity, developed at the Broad Institute (a collaboration of MIT and Harvard), is also a widely used open-source, high-performance tool. It represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules, Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
NGS solution benefits at a glance
The NGS tools are enabled, tested, validated, and certified. They are then included in optimized solutions by IBM®. IBM has combined technology, industry expertise, best practices, and leading analytical partner applications into a tightly integrated solution. With this solution, research institutions and pharmaceutical companies can easily manage, query, analyze, and better understand integrated genotypic and phenotypic data for medical research and patient treatment. They can:
• Organize, integrate, and manage different kinds of data to enable focused clinical research, including diagnostic, clinical, demographic, genomic, phenotypic, imaging, environmental, and more.
• Enable secure, cross-department collection and sharing of clinical and research data.
• Ensure flexibility and growth with an open, industry-standards-based architecture.
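The de Bruijn graphs mentioned in the Trinity description earlier in this section can be illustrated with a minimal Python sketch. It only shows how overlapping k-mers from short reads become graph edges. The function name and the toy reads are hypothetical, and this is not how Inchworm, Chrysalis, or Butterfly are implemented.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it in some read."""
    successors = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            successors[kmer[:-1]].add(kmer[1:])
    return {node: sorted(targets) for node, targets in successors.items()}


if __name__ == "__main__":
    toy_reads = ["ACGT", "CGTA"]  # hypothetical reads that overlap on "CGT"
    for node, targets in de_bruijn_graph(toy_reads, k=3).items():
        print(node, "->", targets)
```

Because the two toy reads overlap, their k-mers chain into a single path (AC, CG, GT, TA). An assembler walks such paths to reconstruct longer sequences, and branching nodes mark the places where alternative transcripts may diverge.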
  • 7. Introduction: IBM Storwize V7000 Unified and SONAS systems
This section provides introductory details and highlights of the IBM Storwize® V7000 Unified and IBM SONAS Gateway systems.
IBM Storwize V7000 Unified system overview
Many users have deployed storage area network (SAN) attached storage for their applications requiring the highest levels of performance, while separately deploying network-attached storage (NAS) for its ease of use and lower-cost networking. This divided approach adds complexity by introducing multiple management points and also creates islands of storage that reduce efficiency.
The Storwize V7000 Unified system combines both block and file storage in a single system. By consolidating storage systems, multiple management points can be eliminated and storage capacity can be shared across both types of access, helping to improve overall storage utilization. The Storwize V7000 Unified system also presents a single, easy-to-use management interface that supports both block and file storage, helping to simplify administration further.
The Storwize V7000 Unified system builds on the functions and high-performance design of the Storwize V7000 system and integrates proven IBM software capabilities to deliver new levels of efficiency. The Storwize V7000 Unified system provides the same software capabilities as the IBM SONAS system, as follows:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, giving all users a single, logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of files
  − Supports industry-standard protocols: Common Internet File System (CIFS), Network File System (NFS), File Transfer Protocol (FTP), Hypertext Transfer Protocol Secure (HTTPS), and Secure Copy Protocol (SCP)
• Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering high bandwidth and additional connectivity in each SONAS interface node to manage multiple data streams and functions (for example, backup, replication, antivirus).
• Data protection: File system and fileset-level snapshots (up to 256 per file system) provide a way to partition the namespace into smaller, more manageable units.
• Management: A CLI and a simple, intuitive, browser-based administrative GUI provide icon-based navigation, informative graphics, and SONAS visualizations that streamline storage tasks and display real-time capacity, performance, and system health.
• Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data from malware and use the most commonly deployed ISV antivirus applications.
  • 8. (Clarification for the purposes of this paper: in the life sciences, a virus is an ultramicroscopic infectious agent (20 to 200 nm in diameter) that replicates within host cells and is composed of a DNA or RNA core and a protein coat. The term antivirus in this paper refers only to software that protects data from malware, not to the life sciences sense.)
• Cloud features: A self-managing, autonomic system enables capacity, provisioning, and other IT service management decisions to be made dynamically, without human intervention or increased administrative costs. IBM Active Cloud Engine™ enables ubiquitous access to files from across the globe quickly and cost effectively.
• Operational savings and total cost of ownership (TCO):
  − Consolidates multiple individual filers and their management, thereby avoiding problems associated with administering an array of disparate NAS systems
  − Automates file placement by transparently moving files to another internal or external storage pool, optimizes your storage resources, and offers tremendous time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square meter), is highly scalable, and can help reduce capital expenditure and enhance operational efficiency; its advanced architecture virtualizes and consolidates file space into a single, enterprise-wide file system, which can translate into reduced TCO
IBM SONAS Gateway system overview
The IBM SONAS Gateway system is designed to manage vast repositories of information in enterprise environments requiring very large capacities, high levels of performance, and high availability. SONAS Gateway uses mature technology from IBM's high-performance computing (HPC) experience. It is based on the IBM General Parallel File System (IBM GPFS™), a highly scalable clustered file system. SONAS Gateway is an easy-to-install, turnkey, modular, scale-out NAS solution. It provides the performance, clustered scalability, high availability, and functionality that are essential for meeting strategic multi-petabyte and cloud storage requirements.
SONAS Gateway currently offers the following features and capabilities:
• Massive scalability:
  − Supports billions of files (up to 21 petabytes of storage) in a single file system
  − Supports up to 256 file systems per single SONAS platform
• Flexibility:
  − Allows access to data in a single global namespace, giving all users a single, logical view of files through a single drive letter such as a Z drive
  − Provides efficient distribution of files, images, and application updates and fixes to multiple locations quickly and cost effectively
  − Provides multiple storage tiers for flexible, efficient management of petabytes of files
  − Supports industry-standard protocols: CIFS, NFS, FTP, HTTPS, and SCP
  • 9. • • • • • • Performance: Leverages two dual-port (all ports active) 10 GbE interface cards offering high bandwidth and additional connectivity in each SONAS interface node to manage multiple data streams and functions (for example, backup, replication, antivirus). Data protection: File system and fileset-level snapshots (up to 256 per file system) provide a way to partition the namespace into smaller, more manageable units. Management: CLI and browser-based, simple, intuitive, and state-of-the-art administrative GUI provide icon-based navigation, informative graphics and SONAS visualizations that streamline storage tasks and display real-time capacity, performance, and system health. Antivirus: Integrates with McAfee and Symantec Antivirus, enabling users to secure data from malware and uses the most commonly deployed ISV antivirus applications. (Clarification, for purposes of this particular paper: In Life Sciences, there is a separate definition for antivirus – An ultramicroscopic (20 to 200 nm in diameter), infectious agent that replicates within host cells. It is composed of a DNA, RNA core, and a protein coat. The authors do not refer to the Life Sciences definition, in this paper. Cloud features: Self-managing, autonomic system enables capacity, provisioning and other IT service management decisions to be made dynamically, without human intervention or increased administrative costs. IBM Active Cloud Engine enables ubiquitous access to files from across the globe quickly and cost effectively. Operational savings and TCO: − Consolidates multiple individual filers and their management, thereby avoiding problems associated with administering an array of disparate NAS systems. 
  − Automates file placement by transparently moving files to another internal or external storage pool, optimizes your storage resources, and offers tremendous time and cost savings in administering petabytes of files
  − Helps conserve floor space (up to a petabyte of data in less than a square meter), is highly scalable, and can help reduce capital expenditure and enhance operational efficiency; its advanced architecture virtualizes and consolidates file space into a single, enterprise-wide file system, which can translate into reduced TCO

Differences: IBM Storwize V7000 Unified and SONAS Gateway as NAS systems

The difference between the IBM Storwize V7000 Unified and SONAS Gateway systems lies in the workloads that each system can support. The Storwize V7000 Unified system can support small and medium-size workloads, while the SONAS Gateway system has the scalability to deliver high performance for extremely large application workloads and capacities, typically for the entire enterprise.
Table 1 offers the comparative product positioning between the Storwize V7000 Unified and SONAS systems:

No. | Attribute | Storwize V7000 Unified | SONAS
1 | Maximum number of interface nodes | 2 | 30
2 | Maximum number of storage nodes | N/A | 60
3 | Maximum raw capacity of file storage | 360 TB (3 TB drives x 12 drives per expansion unit x 10 expansion units) | 21.6 PB (3 TB drives x 240 drives x 30 controllers)
4 | Maximum size of single shared file system (GPFS) | 8 PB | 8 PB
5 | Maximum number of file systems within a single system | 64 | 256
6 | Maximum size of a single file | 8 PB | 8 PB
7 | Maximum number of files per storage system | 4 billion | 4 billion
8 | Maximum number of dependent file sets per file system | 256 | 3000
9 | Maximum number of independent file sets | 256 | 1000
10 | Maximum number of independent file sets | 256 | 1000

Table 1: Comparative product positioning of Storwize V7000 Unified and SONAS Gateway systems
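The raw-capacity figures in row 3 of Table 1 follow from simple multiplication. The short sketch below reproduces that arithmetic; the function name is illustrative only, and decimal units (1 PB = 1000 TB) are assumed because the table does not state otherwise.

```python
# Raw file-storage capacity as stated in row 3 of Table 1.
# Assumption: decimal units, 1 PB = 1000 TB.
DRIVE_TB = 3


def raw_capacity_tb(drive_tb, drives_per_unit, units):
    """Raw capacity in TB = drive size x drives per unit x number of units."""
    return drive_tb * drives_per_unit * units


# Storwize V7000 Unified: 12 drives per expansion unit, 10 expansion units.
storwize_tb = raw_capacity_tb(DRIVE_TB, 12, 10)
# SONAS: 240 drives per controller, 30 controllers.
sonas_tb = raw_capacity_tb(DRIVE_TB, 240, 30)

print(f"Storwize V7000 Unified: {storwize_tb} TB")
print(f"SONAS: {sonas_tb / 1000} PB")
```

These are raw figures; usable capacity after RAID and file system overhead would be lower and is not covered by the table.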
Architectural assumptions

Note the following architectural assumptions and caveats regarding the technical content of this paper.

This paper does:
• Offer information and recommendations for tuning adjustments to achieve good performance in normal NAS production environments.
• Allow a non-technical customer or user to quickly tune their NAS environment by using the documented recommendations, observations, tips, and best practices.
• Provide information from a non-technical user point of view for fast implementations.

This paper does not:
• Explain the various technologies and solutions to establish or publish any benchmarks.
• Guarantee a specific performance of any technical element.
• Provide or offer any information to overcome previously established benchmarks.
• Explain or explore newer technologies, standards, and concepts such as 40 GbE connections, NFS V4, cloud multi-tenancy, and so on.
• Offer any guidance on how to determine hardware sizing or capacity planning for your installation.

Caveats:
• Use judgment when making your decisions. Do not take any published numbers literally.
• For this paper, the tests were run on different IBM equipment located at different IBM data centers.
• Performance results might vary, depending on unique server and client conditions, architectural configurations, network behaviors, application dependencies, and operational environments. Your results might differ from the test results.
• Recommended best practices sometimes differ from the test configurations. The test configurations were set up to observe certain behavior in specific test situations. The best practices are recommended for running operations in production environments.
IBM Storwize V7000 Unified: Configurations, tests, and results

Configuration and tests

An IBM Storwize V7000 Unified system was tested with three NGS applications: CLC Assembly Cell, BWA, and Trinity. The Storwize V7000 Unified system was connected to a single application server in a NAS-attached configuration. This configuration represents a typical use case for a small research facility with minimal compute resources, as shown in Figure 1.

Figure 1: NAS-attached Storwize V7000 Unified configuration with NGS applications for a small research facility

Test results with CLC Assembly Cell

The following tables summarize the results of successful testing of de novo assembly and reference assembly with the CLC Assembly Cell software, BWA application software, and Trinity application software using identical server and storage configurations, as shown in Figure 1.
Input | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
gz-fastq | 16 (16) | 571 | 575 | 544 | 589
gz-fastq | 32 (32) | 386 | 384 | 376 | 385
gz-fastq | 32 (64) | 323 | 309 | 320 | 310
fasta | 16 (16) | 534 | 520 | 525 | 560
fasta | 32 (32) | 351 | 345 | 337 | 363
fasta | 32 (64) | 288 | 286 | 267 | 276

Table 2: CLC Assembly Cell performance results with de novo assembly using non paired-end option

Input | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
gz-fastq | 16 (16) | 582 | 570 | 571 | 610
gz-fastq | 32 (32) | 396 | 374 | 380 | 387
gz-fastq | 32 (64) | 333 | 329 | 323 | 317
fasta | 16 (16) | 569 | 566 | 535 | 605
fasta | 32 (32) | 380 | 365 | 369 | 365
fasta | 32 (64) | 313 | 315 | 310 | 298

Table 3: CLC Assembly Cell performance results with de novo assembly using paired-end information

CLC Assembly Cell | Cores (threads) | XFS-local (minutes) | GPFS (minutes) | SSD (minutes) | Storwize V7000 Unified (minutes)*
Version 4 | 32 (64) | 78.5 | 78 | 72 | 80

Table 4: CLC Assembly Cell performance results with paired-end reference mapping information

*Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
Test results with the BWA application

When the BWA application was run on the same server with the same Storwize V7000 Unified system as the storage back-end, the test results shown in Table 5 were obtained.

Input: BWA, comparing reads_100m.fq with humangenome.fa

Threads | Storwize V7000 Unified (no cache) [1] | Storwize V7000 Unified (with cache) [2] | Local
8 | 44 min 46 s | 44 min 57.483 s | 44 min 59 s
16 | 26 min 29 s | 25 min 38.118 s | 26 min 38 s
24 | 20 min 50 s | 20 min 9.843 s | 21 min 34 s
32 | 18 min 40 s | 19 min 0.600 s | 18 min 40 s
64 | 24 min 58 s | 26 min 47.676 s | 26 min 20 s

Table 5: BWA performance results with various file system options

[1] Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
[2] Mount options: rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
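To compare the rows of Table 5 more directly, the run times can be converted to seconds and expressed relative to the 8-thread run. The following is a minimal sketch that uses only the values from the no-cache Storwize V7000 Unified column; the helper name to_seconds is illustrative and not part of any tool used in the tests.

```python
import re


def to_seconds(text):
    """Convert a duration such as '44 min 57.483 s' to seconds."""
    match = re.fullmatch(r"(\d+) min ([\d.]+) s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Table 5, Storwize V7000 Unified (no cache) column: threads -> run time.
no_cache = {
    8: "44 min 46 s",
    16: "26 min 29 s",
    24: "20 min 50 s",
    32: "18 min 40 s",
    64: "24 min 58 s",
}

baseline = to_seconds(no_cache[8])
# Speedup of each run relative to the 8-thread run.
speedup = {t: baseline / to_seconds(d) for t, d in no_cache.items()}
# Thread count with the shortest measured run time.
best_threads = min(no_cache, key=lambda t: to_seconds(no_cache[t]))

for threads, factor in speedup.items():
    print(f"{threads:>2} threads: {factor:.2f}x relative to 8 threads")
print(f"Shortest run time at {best_threads} threads")
```

In these measurements the shortest run time occurs at 32 threads, and the 64-thread run is slower than the 32-thread run in all three columns of the table.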
Test results with the Trinity application

When the Trinity application was run on the same server with the same Storwize V7000 Unified system as the storage back-end, the test results shown in Table 6 were obtained.

Mount options | Duration
fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,addr=9.11.82.103) | 869 min 4.618 s
fm1p1:/ibm/gpfs_15k/ngsfs on /gpfs0 type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103) | 866 min 47.335 s
rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days
/dev/sdb on /xfs type xfs (rw,nobarrier) | 787 min 54.544 s
9.11.83.71:/ibm/gpfs1/Life_sciences_bak on /NGS type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.83.71) | 875 min 36.453 s
rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0 | More than 4 days

Table 6: Trinity application performance results with various file system mount point options

Note: Run times are long because the Trinity application creates millions of files, ranging from 0 MB to 10 MB in size. This is typical behavior for the Trinity application.

IBM SONAS Gateway: NAS configurations, tests, and results

Configurations and tests

An IBM SONAS Gateway system was tested with three NGS applications: CLC Assembly Cell, BWA, and Trinity. The SONAS Gateway system was connected to 14 IBM BladeCenter® blade servers in a NAS-attached configuration. The blade servers represented application services. This configuration represents a typical use case for a medium-to-large research facility with adequate compute and performance resources, as shown in Figure 2.
Figure 2: NAS-attached SONAS Gateway configuration with NGS applications for a medium-to-large research facility
Test results with CLC Assembly Cell

The following tables summarize the results of successful testing of de novo assembly and reference assembly with the CLC Assembly Cell software, BWA application software, and Trinity application software using identical server and storage configurations, as shown in Figure 2.

Input | Cores (threads) | SONAS Gateway (minutes)*
gz-fastq | 16 (16) | 573
gz-fastq | 16 (32) | 439
fasta | 16 (16) | 547
fasta | 16 (32) | 406

Table 7: CLC Assembly Cell performance results with de novo assembly using the non paired-end option

Input | Cores (threads) | SONAS Gateway (minutes)*
gz-fastq | 16 (16) | 591
gz-fastq | 16 (32) | 449
fasta | 16 (16) | 588
fasta | 16 (32) | 437

Table 8: CLC Assembly Cell performance results with de novo assembly using paired-end information

CLC Assembly Cell | Cores (threads) | SONAS Gateway (minutes)*
Version 4 | 16 (32) | 148

Table 9: CLC Assembly Cell performance results with paired-end reference mapping information

*Mount options: rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
Test results with BWA application

Table 10 summarizes the results of successful testing with the BWA application software.

Input: BWA, reads_100m.fq vs humangenome.fa

Threads | SONAS Gateway, Hx9203, no cache** | SONAS Gateway, Hx9201, no cache** | SONAS Gateway, Hx9201, cache***
8 | 50 min 31 s | 43 min 58.889 s | 44 min 6.455 s
16 | 30 min 7 s | 25 min 30.709 s | 26 min 30.983 s
24 | 26 min 45 s | 22 min 49.733 s | 22 min 17.789 s
32 | 25 min 46 s | 22 min 38.095 s | 23 min 14.785 s

Server | Duration
Hx9202 | 26 min 34 s
Hx9205 | 30 min 14 s
Hx9206 | 30 min 50 s
Hx9207 | 30 min 1 s
Hx9208 | 31 min 12 s
Hx9210 | 30 min 32 s
Hx9211 | 32 min 13 s
Hx9212 | 30 min 54 s

Table 10: Results of successful testing of BWA applications on 14 servers attached to the SONAS Gateway system

The following mount options were used for the results listed in Table 10:

** rw,bg,hard,rsize=1048576,wsize=1048576,proto=tcp,vers=3,noac,nocto,actimeo=0 0 0
*** rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=9.11.82.103
Test results with Trinity application

When the Trinity application was run on the same servers, with the SONAS Gateway system as the storage back-end, the test results shown in Table 11 were obtained.

Mount options | Duration
172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,addr=172.26.39.180) | 804 min 2.373 s
172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180) | 812 min 7.432 s
172.26.39.180:/ibm/gpfs136gb_15k on /lifesci type nfs (rw,noatime,nodiratime,rsize=1048576,wsize=1048576,proto=tcp,vers=3,timeo=600,addr=172.26.39.180) | 760 min 50.722 s 774 min 20.714 s

Table 11: Results of successful testing of Trinity Application on 14 servers attached to SONAS Gateway system

File systems layout: Best practice recommendations

To ensure good application performance with optimal run times when servers are connected over 10 GbE to IBM Storwize V7000 Unified or SONAS Gateway systems, note the following considerations:

• Proper sizing and stability of application nodes and servers is extremely important to drive the required levels of workload for different types of algorithms, such as de novo assembly or reference-based mapping.
• Proper selection of valid mount options affects the performance and run-time characteristics of NGS applications. Incorrect selection of mount options results in long-running jobs, because these jobs create millions of files ranging from 0 MB to 10 MB in size.
• It was observed that none of these applications saturated the network, the IBM Storwize V7000 Unified system, or the IBM SONAS Gateway system.
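Because the choice of mount options has such a large effect (in Table 6, the option set containing noac and actimeo=0 corresponds to the Trinity runs that took more than 4 days), it can help to check an option string before mounting. The sketch below is only an illustration: the function names are hypothetical, the option strings are copied from the tables in this paper with the trailing "0 0" fields and the addr= entry omitted, and the check covers only the two attribute-caching options named here.

```python
def parse_mount_options(option_string):
    """Split a comma-separated NFS option string into a dict.

    Flags without a value (for example 'noac') map to True.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def disables_attribute_caching(options):
    """Return True if the options turn off NFS client attribute caching."""
    return bool(options.get("noac")) or options.get("actimeo") == "0"


# Option strings as listed with the test results in this paper.
NO_CACHE = ("rw,bg,hard,rsize=1048576,wsize=1048576,"
            "proto=tcp,vers=3,noac,nocto,actimeo=0")
WITH_CACHE = ("rw,noatime,nodiratime,rsize=1048576,wsize=1048576,"
              "proto=tcp,vers=3,timeo=600")

for label, text in (("no cache", NO_CACHE), ("with cache", WITH_CACHE)):
    opts = parse_mount_options(text)
    print(label, "-> attribute caching disabled:",
          disables_attribute_caching(opts))
```

A check of this kind could be run against /proc/mounts or a job submission script before starting a workload that creates many small files; that usage is a suggestion and was not part of the documented tests.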
For improved performance in a typical production environment, lay out the file systems for NGS applications according to the following guidelines and best practice recommendations:

• Different NGS applications require different mount options for increased performance and optimal response time.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using the cluster method of creating the block allocation maps to achieve uniform disk performance across all storage capacities.
• Create the GPFS file system on the SONAS Gateway or Storwize V7000 Unified system by using logfileplacement value = striped to stripe the log file of the file system across all metadata disks.
• Use a block size of 256 KB for both short-term and long-term storage.
• As a best practice, run all RHEL 6.2 servers with dual 10 GbE bonded network channel connections, with MTU=9000.
• To support various NGS application workloads, two interface nodes are recommended on the Storwize V7000 Unified system for increased availability.
• To support various NGS application workloads, at least two interface nodes are recommended on the SONAS Gateway system for increased availability.

Solution benefits: IBM Storwize V7000 Unified and SONAS Gateway system

Both SONAS and Storwize V7000 Unified systems offer the following significant benefits for clients running NGS applications for efficient analysis of genomic data from DNA and RNA sequences:

• Easily examine a large group of potential gene candidates by using typical applications such as BLAST, linkage analysis, and Mascot, which can quickly search and rapidly screen targets in genomic databases, genomes, and assays.
• Efficiently create targeted drug treatments. Easily enable the scale-up development of new drug molecules developed through drug discovery (research, synthesis), preclinical development (preparation, formulation, pre-dosage design), and pre-FDA stages (new drug formulation, standards).
• Deliver tight integration between ERP and pharma supply chains. SONAS easily supports pharmaceutical processes that scale up APIs (active pharmaceutical ingredients) from milligram to kilogram quantities for commercial manufacturing and distribution of drugs, with improved visibility into process optimization and yield variability across batches.
• Lower TCO by efficiently reducing drug discovery costs through use and reuse of databases, common analytical data, processes, and standards throughout the pharmaceutical operational chain.
• Deliver on-demand cloud computing models to rapidly address changing levels of analytical computational capacities and facilitate self service of analytical tools, pooling of analytic, research development, and pharmaceutical manufacturing resources, and common and scalable transactional processes and standards.
Summary

This paper validates that IBM Storwize V7000 Unified and SONAS Gateway based solutions offer good application performance with excellent virtualization and availability under the following circumstances:

• Access to genomic data from DNA and RNA sequences is configured on the IBM Storwize V7000 Unified or SONAS Gateway system.
• The CLC Assembly Cell or open systems software applications are configured on RHEL servers.
• The NFS v3 services are configured and delivered over the IP network.

This paper offers recommendations and guidance to facilitate easy configuration and installation of the solution and to ensure an efficient installation with good performance.

Acknowledgments

Special thanks to the teams from CLC bio in Denmark for loaning the software licenses of the CLC Assembly Cell software, which enabled the IBM test team to create a representative operational test environment in IBM data centers and run tests to document real-life results.

Many thanks to the IBM client executives, IBM Systems and Technology Group members, and other team members who contributed their recommendations during the test run and review process, and enabled successful completion and validation so that CLC bio software applications can run successfully over various environments facilitated by IBM Storwize V7000 Unified and SONAS Gateway systems.

The IBM team also extends special thanks to Connie Borton, Michael Nelson, Cathy Drews, Daniel Drinnon, and Larry Garibay for their invaluable help and assistance, without which the software validation of three independent software applications would not have been successful.
Appendices

Appendix A: Typical server and storage configuration sizing recommendations

This section provides typical recommendations and guidelines for server and storage configuration sizing for small, medium, and large research facilities. While this information is typical, the authors understand that organizations differ in terms of the following criteria:

• The number and types of sequencers in the facility
• The types of genomes being worked on in the laboratory or organization
• The processes being pursued within the organization, be it reference mapping, assembly or transcriptions, or downstream analytics
• The amount of data that is required to be kept active
• The amount of data that is required to be kept archived
• The response time required in terms of the number of genomes per day, per week, or per month
• Many other factors

Tier 1: 1 to 2 human-size genomes per week, for both de novo and reference-based mapping

Single server and internal storage configuration
• IBM System x3750 with 2.4 GHz E5 4640, ½ TB RAM, 16 TB internal disks
• 4 sockets, 32 cores, 2.4 GHz Intel® Xeon® processor E5 4640
• 32 x 16 GB 1600 MHz DDR3 DIMMs
• 16 x 2.5-inch 1 TB SAS drives

Tier 2: 3 to 10 human-size genomes per week, or a need for more than 15 TB online, for both de novo and reference-based mapping

Multiple server and external storage configuration
• IBM BladeCenter HS23 frame with 14 blade servers. Each blade server is configured with a 2.6 GHz Intel Xeon processor E5 2670, 128 GB RAM, and dual 10 GbE connection ports
• 2 sockets, 16 cores, 2.6 GHz Intel Xeon processor E5 2670
• 16 x 8 GB 1333 MHz DDR3 DIMMs
• 2 x 2.5-inch 300 GB SAS drives
• 96 x 2.5-inch 600 GB 10 K rpm drives in four enclosures of IBM Storwize V7000 Unified. The IBM Storwize V7000 Unified system can host up to 10 enclosures; therefore, if more capacity is needed in the future, more disks can be added in the remaining six enclosures.
• 1 external switch (or customer-supplied switch) to support 10 GbE connectivity

Tier 3: More than 10 human-size genomes per week, a need for more than 100 TB online, or a need for downstream analysis
This is a custom configuration. Contact IBM for details.

Appendix B: Resources

The following websites provide useful references to supplement the information contained in this paper:

• Introduction to Genetics
  en.wikipedia.org/wiki/Introduction_to_genetics
• DNA
  en.wikipedia.org/wiki/DNA
• DNA Sequencing
  en.wikipedia.org/wiki/DNA_sequencing
• Cell Nucleus
  en.wikipedia.org/wiki/Cell_nucleus
• Human Genome
  en.wikipedia.org/wiki/Human_genome_map
• RNA
  en.wikipedia.org/wiki/RNA
• RNA-Seq
  en.wikipedia.org/wiki/RNA-Seq
• X Chromosome
  en.wikipedia.org/wiki/X_chromosome
• Y Chromosome
  en.wikipedia.org/wiki/Y_chromosome
• Trinity
  www.broadinstitute.org/scientific-community/software/trinity
• Burrows-Wheeler Aligner
  bio-bwa.sourceforge.net/
• CLC bio Applications
  www.clcbio.com/
• IBM Redbooks®
  ibm.com/redbooks
• IBM Publications Center
  www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
• IBM Scale Out Network Attached Storage Architecture, Planning and Implementation Basics [SG24-7875-00]
  ibm.com/redbooks/abstracts/sg247875.html?Open
• IBM Scale Out Network Attached Storage Concepts [SG24-7874-00]
  ibm.com/redbooks/abstracts/sg247874.html?Open
• IBM Storwize V7000 Introduction and Implementation Guide [SG24-7938]
  ibm.com/redbooks/redpieces/abstracts/sg247938.html?Open

About the authors

Dr. Tzy-Hwa K. (Kathy) Tzeng is a Senior Technical Staff Member (STSM) for IBM Systems and Technology Group ISV Strategy and Enablement Organization. She received her Ph.D. in Genetics and Plant Pathology from Iowa State University. Prior to IBM, she led drug discovery projects in bioinformatics, proteomics, and genomics. At IBM, she is responsible for the strategy and content of IBM Life Sciences application plans, portfolio, and product positioning. You can reach Dr. Kathy Tzeng at tzy@us.ibm.com.

Dr. Ruzhu Chen is an IBM Certified Expert IT Specialist for IBM Systems and Technology Group, focusing on computational chemistry and NGS applications. Over the last ten years, he has successfully tuned, benchmarked, and optimized solutions for IBM worldwide partners and customers. Ruzhu earned his Master's degree in Biochemistry from University of Sciences and Technology of China, and a second Master's degree in Computer Science and a Ph.D. in Molecular Biology, both from the University of Oklahoma. You can reach Dr. Ruzhu Chen at ruzhuchen@us.ibm.com.

Justin Morosi is a Consulting IT Specialist working for IBM Systems and Technology Group as a Worldwide Technical Architect focusing on HPC solutions. He has worked for IBM for over 14 years and has more than 20 years of consulting and solution design experience. He holds numerous industry-recognized certifications from Cisco, Microsoft®, VMware, Red Hat, and IBM.
His areas of expertise include high-performance computing/storage, high availability, disaster recovery, and virtualization. You can reach Justin Morosi at jmorosi@us.ibm.com.

Prashant Avashia is a software engineer in IBM Systems and Technology Group ISV Strategy and Enablement Organization. With more than 15 years of experience, he has successfully architected, engineered, and implemented enterprise infrastructure solutions for key global clients in healthcare, financial, and software industries. He earned his master's degree in Industrial Engineering from Kansas State University, and a bachelor's degree in Mechanical Engineering from Osmania University, India. You can reach Prashant Avashia at pavashia@us.ibm.com.
Trademarks and special notices

© Copyright IBM Corporation 2012.

References in this document to IBM products or services do not imply that IBM intends to make them available in every country.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.