1. BioLib Development Report (BOSC 2009)
C and C++ libraries for BioPerl, BioJAVA, BioPython, BioRuby...
Pjotr Prins (pjotr.prins at wur.nl)
Wageningen University, Dept. of Nematology; Groningen Bioinformatics Center
BioLib Development Report (BOSC 2009) – p.
2. The stated problem
Many high-level languages used in Biology (Perl, R, Java...)
Duplication of effort in all Bio* efforts - BioPerl, BioConductor, BioJAVA...
in particular for data IO/parsing/interpretation (Alan's keynote)
3. What if?
If you need some functionality (e.g. linear regression) in Perl, you can:
Roll your own in Perl (performance?)
Bind against an existing C library using Perl-XS (ugh)
Bind using SWIG (better, but one-off like Perl::GSL)
Bind using SWIG with BioLib (all languages)
In fact, it may already be there (GSL or Rlib)
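To make the "roll your own" option concrete, here is a minimal sketch of an ordinary least-squares fit. It is written in Python rather than Perl, and the function name linear_regression is hypothetical, not part of BioLib. With bindings in place, a call into an existing library such as GSL would take its place.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

Every Bio* project that writes and tests its own version of this repeats the same work, which is the duplication the slide refers to.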
4. DRY-DRO
Do not repeat yourself (DRY)
Do not repeat ourselves (DRO)
Bio*: BioPerl, BioPython, BioRuby, BioJAVA, BioConductor, BioHaskell, BioCPP, ...
Limited pool of programmers in bioinformatics
Usually 2 or 3 competing implementations
Use existing implementations
9. BioLib project
Objectives:
Utilize existing C/C++ libraries
Create mappings to all Bio* languages
Focus on correctness and performance
A central place (plumbing)
An OBF-affiliated project
10. Power Trio
Plumbing power trio:
Git - modular version control
CMake - makefile generator
SWIG - Simplified Wrapper and Interface Generator
11. Power trio (1)
Git
Version control on steroids
What source control should be
Easy branching of development
Submodules
12. Power trio (2)
CMake
Generator for make files
Very modular approach
Resolves complex dependencies
Looks like a simple programming language
Easy on the eyes and mind
13. Power trio (3)
SWIG
Code generator for mappings done right:
Rules for generating code
Macros (DRY)
Pattern matching
Flexible
Supports many languages
16. Adding a C lib
Unpack C/C++ library in ./src/clibs/modulename
Add CMake file - compiles into .so shared library
Create Perl mapping in ./src/mapping/swig/perl/module
Add SWIG .i file
Add CMake file - compiles into .pm and .so shared library
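The steps above touch two directories per module. As a rough sketch, the helper below lists which directory each step lands in. The function name module_layout is hypothetical, and the example module name staden_io_lib is borrowed from the CMake slide. The sketch only builds path strings and does not create or check any files.

```python
import posixpath


def module_layout(module_name, root="."):
    """Pair each step from the slide with the directory it touches."""
    clib_dir = posixpath.join(root, "src", "clibs", module_name)
    perl_dir = posixpath.join(root, "src", "mapping", "swig", "perl", module_name)
    return [
        ("unpack C/C++ library", clib_dir),
        ("add CMake file for the shared library", clib_dir),
        ("add SWIG .i file", perl_dir),
        ("add CMake file for the Perl mapping", perl_dir),
    ]


steps = module_layout("staden_io_lib")
for description, directory in steps:
    print("%-40s %s" % (description, directory))
```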
17. CMake goodies
# Defining a C library build in Biolib:
SET (M_NAME staden_io_lib)
SET (M_VERSION 1.11.6)
FIND_PACKAGE(ZLIB REQUIRED)
BUILD_CLIB()
ADD_LIBRARY(${LIBNAME} SHARED
  array.c
  compress.c
  compression.c
  ctfCompress.c
  # (...)
)
INSTALL_CLIB()
18. CMake for Perl
# Defining a C library mapping for Perl
SET (USE_ZLIB TRUE)
SET (USE_INCLUDEPATH io_lib)
FIND_PACKAGE(MapPerl)
POST_BUILD_PERL_BINDINGS()
TEST_PERL_BINDINGS()
INSTALL_PERL_BINDINGS()
19. SWIG Map
%include <Read.h>
#define TT_ANY 0
#define TT_ZTR 7
typedef struct
{
  int format;
  char *trace_name;
  int NPoints;
  int NBases;
  /* (...) */
} Read;
Read *read_reading(char *fn, int format);
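The effect of such a map is that the struct members become attributes on the scripting side. The sketch below imitates this with Python's ctypes rather than SWIG, so it is not what SWIG generates. It declares only the four fields visible on the slide; the real struct has more members (elided above), so this layout must not be used against the actual library. The trace name example.ztr is a made-up value.

```python
import ctypes

# Format constants as shown on the slide.
TT_ANY = 0
TT_ZTR = 7


class Read(ctypes.Structure):
    # Partial mirror of the C struct: only the fields shown on the slide.
    _fields_ = [
        ("format", ctypes.c_int),
        ("trace_name", ctypes.c_char_p),
        ("NPoints", ctypes.c_int),
        ("NBases", ctypes.c_int),
    ]


# Build an instance by hand to show attribute-style access to the members.
read = Read(format=TT_ZTR, trace_name=b"example.ztr", NPoints=0, NBases=766)
print(read.format, read.NBases, read.trace_name)
```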
20. Perl
use biolib::staden_io_lib;
$result = staden_io_lib::read_reading($fn, $staden_io_lib::TT_ANY);
print("format=", staden_io_libc::Read_format_get($result), "\n");
print("NBases=", $result->{NBases}, "\n");
print("base=", staden_io_libc::Read_base_get($result), "\n");
Outputs:
format=7
NBases=766
base=NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCTCGGTCCCAACTTAATTGTACA...
21. Python
import biolib.staden_io_lib as io_lib
result = io_lib.read_reading(procsrffn, io_lib.TT_ANY)
print(result.format)
print(result.NBases)
print(result.base)
7
766
NCTTGGGAAAGCATAAACCATGTATTATCGAATTCGAGCTCGGTCCCAACTTAATTGTACA...
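Because the mapped object exposes plain attributes, ordinary Python code can work with it directly. The sketch below assumes only that the object has format, NBases and base attributes, as in the example above. The helper name summarize_read is hypothetical, and a SimpleNamespace stands in for the real result so the snippet runs without BioLib installed.

```python
from types import SimpleNamespace


def summarize_read(read, preview=10):
    """Return a one-line summary of an object exposing format, NBases and base."""
    bases = read.base
    # Show only the first few bases; mark truncation with an ellipsis.
    shown = bases[:preview] + ("..." if len(bases) > preview else "")
    return "format=%d NBases=%d base=%s" % (read.format, read.NBases, shown)


# Stand-in for the object returned by io_lib.read_reading(...).
fake_read = SimpleNamespace(format=7, NBases=766, base="NCTTGGGAAAGCATAAACCATG")
summary = summarize_read(fake_read)
print(summary)
```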
22. For the Perl coder
Adding functionality in language of choice
Easier deployment - ’install biolib-perl’
Shared correctness testing
Generated API documentation
23. For the authors
Independent source trees
Increased exposure (Ruby, Perl...)
Added unit/integration testing environment
Deployment, multi-platform support (Linux, OSX, Windows)
No autoconf pain (./configure and friends)
Implicit access to other libraries (GSL, Rlib)
Online generated API documentation
24. Future work
Automated API documentation (with doctests)
More libraries (EMBOSS, NCBI, ...)
New code (HPC)
More languages (Java, R, OCaml, ...)
Bio* integration (CPAN, Ruby gems, Python eggs)
Debian/Fedora/OSX/Windows packages
More platforms (Windows without Cygwin)
25. Credits
Ben Bolstad (Affyio), James Bonfield (Staden), Karl Broman (R/qtl)
Jonathan Leto (GSL SWIG)
Xin Shuai (Google SoC libsequence)
Adam Smith (Google SoC Bio++)
Oswaldo Trelles, José Manuel Mateos-Duran and Andrés Rodríguez (UMA)
Chris Fields (BioPerl), Mark Jensen (BioPerl), Hilmar Lapp (NESCent, OBF)
Jaap Bakker (WU), Geert Smant (WU), Ritsert Jansen (GBIC)
26. BoF
BioLib: Birds of a Feather Session (BoF) at 16:50 hours