1. Biomanycores, a repository of interoperable
open-source code for many-cores bioinformatics
Jean-Stéphane Varré, Stéphane Janot, Mathieu Giraud
contact@biomanycores.org
Sequoia Bioinformatics
LIFL – UMR CNRS 8022 – Université Lille 1, France
INRIA Lille Nord-Europe, France
June 2009
J.-S. Varré, S. Janot, M. Giraud (LIFL)
Biomanycores June 2009 1 / 20
2. Outline
High-performance computing
Graphical Processing Units and bioinformatics
biomanycores.org
aim of the project
what has been done ?
future developments
3. High Performance Bioinformatics – Manycores
1970 – 2002:
Moore’s law =
increasing frequencies
problems:
power consumption,
heat dissipation
from now on: Moore’s law continues with multiple cores
from multicores: dual-cores, quad-cores, octo-cores...
to manycores:
Graphics processing units (GPUs)
Nvidia GTX 285 ⇒ 30 × 8 cores, 1.2 GHz, 40 (×8) GFlops
convergence CPU-GPU: Intel Larrabee
4. High Performance Bioinformatics – Manycores
GPGPU = General-Purpose computation on GPU
until 2007: tweaking graphics primitives
2007: Nvidia CUDA
2009: OpenCL (Khronos Group)
Dec. 2008: 1.0 specification
May 2009: beta release of an Nvidia compiler
AMD/ATI compiler coming soon
⇒ portable manycore applications?
With GPGPU...
10× / 100× peak speed-up, low costs ($50–$500)
even with loss due to parallelism, 10× speed-up is possible
(relatively) easy with CUDA / OpenCL, requires some learning
5. GPU + Bioinformatics
Methods
“Graphical” GPGPU (2005/06):
  Tool               Speed-up     Reference
  RAxML              up to 2×     Charalambous et al. 2005
  ClustalW           up to 7×     Liu et al. 2006
CUDA (since 2007):
  Tool               Speed-up     Reference
  MUMmerGPU          up to 10×    Schatz et al. 2007
  Smith-Waterman     up to 15×    Manavski and Valle 2008
  Neighbor-Joining   up to 26×    Liu et al. 2009
  RNAfold            up to 17×    Rizk and Lavenier 2009
About 10 papers between 2007 and 2009
6. GPU + Bioinformatics
Specific Bioinformatics HPC Events
HiComb (IEEE Workshop on High Performance Computational Biology)
since 2002
in conjunction with IPDPS [May 2009, Rome]
PBC (Parallel Bio-Computing Workshop)
since 2005, every two years
in conjunction with PPAM [Sept. 2009, Wrocław]
HiBi (Workshop on High Performance Computational Systems Biology)
[Oct. 2009, Trento]
7. Sequoia Bioinformatics
LIFL, INRIA, Université Lille 1, France
H. Touzet’s group, 14 people (including 5 PhD students)
Large-scale sequence analysis
Sequence comparisons, seed-based heuristics
RNA, transcription factors, NRPS
High-Performance Bioinformatics
SIMD flexible read mapper (L. Noé, M. Gîrdea)
GPU PWM scan / P-value (22× – 77× on a GTX 280)
GPU ADP (6.1× – 22.8× on a GTX 280, with U. Bielefeld)
GPU & bit-parallelism pattern matching (ongoing)
Supported by NVIDIA (Professor Partnership, 2009)
8. GPU + Position-Weight Matrices (PWM)
Parallel Position Weight Matrices Algorithms. M. Giraud and J.-S. Varré. ISPDC'09
PWMs are used for modeling transcription factor binding sites, transcription start sites, protein domains, ...
score threshold or P-value computation: requires enumerating words
occurrences: requires quickly scanning a very long sequence
[Figure: sequence logo of a PWM (bits per position, WebLogo 3.0)]
[Figure: two plots of speedup against matrix length, comparing CPU (one thread), GeForce 8800, GTX 280 and GTX 280 (+ atomic)]
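The two PWM tasks named on this slide can be illustrated with a minimal sequential sketch. This is not the cudaPWM implementation: it is a plain Python reference version, assuming a PWM stored as a list of per-position score dictionaries and a uniform background over A, C, G, T. The names `score`, `scan`, `p_value` and `toy_pwm` are hypothetical, and the matrix values are made up for illustration.

```python
from itertools import product

ALPHABET = "ACGT"

def score(pwm, word):
    """Sum the per-position scores of `word` under `pwm`."""
    return sum(column[base] for column, base in zip(pwm, word))

def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        s = score(pwm, sequence[i:i + m])
        if s >= threshold:
            hits.append((i, s))
    return hits

def p_value(pwm, threshold):
    """Fraction of all 4^m words scoring >= threshold (uniform background)."""
    m = len(pwm)
    count = sum(1 for word in product(ALPHABET, repeat=m)
                if score(pwm, word) >= threshold)
    return count / 4 ** m

# Toy 3-column matrix with integer scores (illustrative values only).
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]

print(scan(toy_pwm, "TTACGACTA", 4))
print(p_value(toy_pwm, 4))
```

Both loops are independent across windows (scan) and across words (P-value), which is what makes these tasks natural candidates for a manycore implementation. The brute-force enumeration is exponential in the matrix length and is shown only to make the definition concrete.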
9. HPC Bioinformatics for human beings ?
Research in High-Performance Computing
nice ideas, nice papers
but not always exploited
A few HPC bioinformatics framework projects...
⇒ far from everyday usage of bioinformaticians and biologists
10. www.biomanycores.org
1. Share OpenCL code
= public repository, open-source
2. Make it easy
= Bio∗ integration
3. Benchmark
algorithms, implementations, hardware
11. www.biomanycores.org
1. Share OpenCL code (currently CUDA)
= public repository, open-source
2. Make it easy
= Bio∗ integration
3. Benchmark
algorithms, implementations, hardware
12. Already included projects
SWcuda – Smith-Waterman protein alignment
CRIBI Genomics, University of Padova, Italy
S. A. Manavski, G. Valle, CUDA compatible GPU cards as efficient hardware
accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics 2008,
9(S2):S10
pknotsRG – pseudoknots of an RNA sequence
Universität Bielefeld, Germany
J. Reeder, P. Steffen, R. Giegerich, pknotsRG: RNA pseudoknot folding including
near-optimal structures and sliding windows, Nucl. Acids. Res., 2007
cudaPWM – scan a PWM against a DNA sequence
Sequoia, LIFL, INRIA, Université Lille 1
M. Giraud, J.-S. Varré, Parallel Position Weight Matrices Algorithms, ISPDC'09
Interfaces to BioJava 1.6, BioPerl 1.52, and Biopython 1.50b
14. Biopython + CRIBI SW
from Bio import SeqIO
from Biomanycores import PadovaSW

# Bank of protein sequences to search against
bank = SeqIO.parse(open("uniprot-start.fa"), "fasta")
# Align each query sequence against the bank
for query in SeqIO.parse(open("prot64.fa"), "fasta"):
    handle = PadovaSW.run(query, bank)
    result = PadovaSW.SWParser().parse()
    print(result)
15. Biopython + CRIBI SW
Tests on a GeForce 8800
biopython$ time python sw-demo.py cuda
** cd ../bin/ ; ./swcuda config.gpu ../tmp/swcuda.fa ../tmp/swcuda.bank
** 1.846s
12098 results...
[(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0,
real 2.81 user 1.79 sys 0.27
biopython$ time python sw-demo.py cpu
** cd ../bin/ ; ./swcuda config.cpu ../tmp/swcuda.fa ../tmp/swcuda.bank
** 16.604s
12098 results...
[(84.0, 0, 0, 'sp|P30350|ADH1_ANAPL'), (81.0, 0, 0, 'sp|P23991|ADH1_CHICK'), (81.0,
real 17.57 user 16.42 sys 0.14
10× – 15× paper speedup (BMC Bioinformatics 2008, 9S2)
8.7× application speedup
6.2× final speedup (including Biopython/Biomanycores)
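The speedups can be checked against the two runs shown above. The sketch below only recomputes ratios from the printed timings; the variable names are illustrative. The wall-clock ratio comes out at about 6.25, in line with the 6.2× final speedup. The ratio of the two application timings printed here is about 9.0, slightly above the 8.7× quoted, which may have been measured on a different run (an assumption, since the slide does not say).

```python
# Timings copied from the two runs shown above (seconds).
gpu = {"application": 1.846, "real": 2.81}
cpu = {"application": 16.604, "real": 17.57}

def speedup(cpu_seconds, gpu_seconds):
    """Ratio of CPU time to GPU time."""
    return cpu_seconds / gpu_seconds

application_speedup = speedup(cpu["application"], gpu["application"])
final_speedup = speedup(cpu["real"], gpu["real"])

# Time spent outside the swcuda binary (Python start-up, parsing, file I/O).
overhead_gpu = gpu["real"] - gpu["application"]
overhead_cpu = cpu["real"] - cpu["application"]

print("application speedup: %.2f" % application_speedup)
print("final speedup:       %.2f" % final_speedup)
print("overhead (s): gpu %.3f, cpu %.3f" % (overhead_gpu, overhead_cpu))
```

The overhead outside the binary is close to one second in both runs. Because it is roughly constant, it weighs much more on the short GPU run than on the CPU run, which is why the end-to-end speedup is lower than the application speedup.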
16. BioPerl + CRIBI SW
BioPerl tutorial
use Bio::Tools::pSW;
$factory = new Bio::Tools::pSW('-matrix' => 'blosum62.bla', '-gap' => 12, '-ext' => 2);
$factory->align_and_show($seq1, $seq2, STDOUT);
$aln = $factory->pairwise_alignment($seq1, $seq2);
With biomanycores
use Bio::SeqIO;
use Biomanycores::PadovaSW;
$factory = PadovaSW->new();
$factory->swcuda($inputseq, $bank);
@r = $factory->parse_result();
17. BioJava + PWM
import org.biojavax.bio.seq.RichSequence;
import org.biojava.bio.dp.SimpleWeightMatrix;
...
import org.biomanycores.bio.pwm.*;
...
{
    LillePWMScan scanner = new LillePWMScan(launcher);
    // read the sequence
    RichSequenceIterator it = null;
    BufferedReader in1 = new BufferedReader(new FileReader(args[1]));
    it = RichSequence.IOTools.readFastaDNA(in1, null);
    RichSequence query = it.nextRichSequence();
    // read a weight matrix
    SimpleWeightMatrix pwm = PFMParser.PARSER.get(args[2], alph, "ACGT");
    // scan the sequence
    List<PWMHit> al = scanner.scan(query, pwm, threshold);
}
18. Challenges
Different APIs, different philosophies
BioJava: no external program execution?
Object representation (alignments)
Object existence (PWM)
Minimal modifications to the source code of applications
CribiSW : command-line arguments
Real-world pipelines ?
Bio∗ are not HPC frameworks
Succession of several programs
Usage: requires CUDA / OpenCL SDK
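The "command-line arguments" approach can be pictured as a thin wrapper that launches the unmodified program and captures its output. The sketch below is an assumption about how such a wrapper might look, not the Biomanycores code: `build_swcuda_command` and `run_tool` are hypothetical names, and the argument order simply mirrors the `./swcuda config.gpu ../tmp/swcuda.fa ../tmp/swcuda.bank` line from the demo log.

```python
import subprocess
import sys

def build_swcuda_command(config, query_fasta, bank_fasta, binary="./swcuda"):
    """Assemble an argument list in the order seen in the demo log."""
    return [binary, config, query_fasta, bank_fasta]

def run_tool(argv, cwd=None):
    """Run an external program and return its standard output as text.

    Raises subprocess.CalledProcessError if the program exits with an error.
    """
    completed = subprocess.run(
        argv, cwd=cwd, capture_output=True, text=True, check=True
    )
    return completed.stdout

# Harmless stand-in for the real binary, so the sketch runs anywhere.
print(run_tool([sys.executable, "-c", "print('ok')"]).strip())
```

With the real binary installed, the call would look like `run_tool(build_swcuda_command("config.gpu", "../tmp/swcuda.fa", "../tmp/swcuda.bank"), cwd="../bin")`. Keeping the application untouched and passing everything on the command line is what allows the same binary to sit behind several Bio* interfaces, at the price of process start-up and file I/O on every call.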
19. Licenses
Projects must have an open-source license
Bio∗ interfaces: same license as the parent API
BioJava: LGPL 2.1
BioPerl: Perl Artistic License
Biopython: Biopython license
20. www.biomanycores.org
1. Share OpenCL code (currently CUDA)
= public repository, open-source
⇒ bring new projects
2. Make it easy
= Bio∗ integration
⇒ integrate new projects
⇒ improve current interfaces
3. Benchmark
algorithms, implementations, hardware
⇒ think !