Big Data Analysis
Christoph Bernau and Ferdinand Jamitzky
jamitzky@lrz.de
http://goo.gl/kS31X
Contents
1. A short introduction to big data
2. Parallel programming is hard
3. Hardware @LRZ
4. Functional Programming
5. Available packages for R
6. Parallel Programming Tools
7. SMP Programming
8. Cluster Programming
9. Job Scheduler
10. Calling external binary code
big data
a short introduction
What is Big Data?
In information technology, big data is a loosely defined term used to describe data sets so
large and complex that they become awkward to work with using on-hand database
management tools. (from Wikipedia)
● Buzz Word
● High dimensional data
● Memory intensive data and/or algorithms
Who does Big Data?
● Bioinformatics
● Genomics and other "Omics"
● Astronomy
● Meteorology
● Environmental Research
● Multiscale physics simulations
● Economic and financial simulations
● Social Networks
● Text Mining
● Large Hadron Collider
Hardware for Big Data
● Large Arrays of Harddisks
● Solid State Disks as temp storage
● Large RAM
● Manycore
● Multicore
● Accelerators
● Tape Archives
Software Middleware for Big Data
● MapReduce
● Distributed File Systems
● Parallel File Systems
● Distributed Databases
● Task Queues
● Memory Attached Files
Supercomputer for Big Data
(Flash) Gordon: Data-Intensive Supercomputing at
the San Diego Supercomputing Centre
● 1,024 dual-socket Intel Sandy Bridge nodes,
each with 64 GB DDR3 1333 memory
● Over 300 TB of high performance Intel flash
memory SSDs via 64 dual-socket Intel
Westmere I/O nodes
● Large memory supernodes capable of
presenting over 2 TB of cache coherent
memory
● Dual rail QDR InfiniBand network
http://www.sdsc.edu/supercomputing/gordon/
SuperMUC as Big Data System
SuperMUC
● 9,216 dual-socket Intel Sandy Bridge nodes,
each with 32 GB DDR3 1333 memory
● Parallel File System GPFS
● FDR10 InfiniBand network
● Bandwidth to GPFS 200 GByte/s
● No Flash :-(
parallel programming is hard
Why parallel programming?
End of the free lunch
Moore's law means
no longer faster
processors, only more
of them. But beware!
2 x 3 GHz < 6 GHz
(cache consistency,
multi-threading, etc)
The future is parallel
●Moore's law is still valid
●Number of transistors doubles every 2 years
●Clock speed saturates at 3 to 4 GHz
●multi-core processors vs many-core processors
●grid/cloud computing
●clusters
●GPGPUs
(intel 2000)
The future is massively parallel
Connection Machine
CM-1 (1983)
12-D Hypercube
65536 1-bit cores
(AND, OR, NOT)
Rmax: 20 GFLOP/s
The future is massively parallel
JUGENE
Blue Gene/P (2007)
3-D Torus or Tree
65536 64-bit cores
(PowerPC 450)
Rmax: 222 TFLOP/s
now: 1 PFLOP/s
294912 cores
Supercomputer: SMP
SMP Machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect
in R:
library(multicore)
and inlined code
Example: gvs1
128 GB RAM
16 cores
Example: uv3.cos.lrz.de
2000 GB RAM
1120 cores
Supercomputer: MPI
Cluster of machines:
distributed memory
typically 100s of cores
message passing interface
infiniband interconnect
in R:
library(Rmpi)
and inlined code
Example: coolMUC
4700 GB RAM
2030 cores
Example: superMUC
320.000 GB RAM
160.000 cores
Levels of Parallelism
●Node Level (e.g. SuperMUC has approx. 10000 nodes)
each node has 2 sockets
●Socket Level
each socket contains 8 cores
●Core Level
each core has 16 vector registers
●Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
●Pipeline Level (how many simultaneous pipelines)
hyperthreading
●Instruction Level (instructions per cycle)
out of order execution, branch prediction
Problems: Access Times
Getting data from:
CPU register 1ns
L2 cache 10ns
memory 80 ns
network(IB) 200 ns
GPU(PCIe) 50.000 ns
harddisk 500.000 ns
Getting some food from:
fridge 10s
microwave 100s ~ 2min
pizza service 800s ~ 15min
city mall 2000s ~ 0.5h
mum sends cake 500.000 s~1 week
grown in own garden 5Ms ~ 2months
Computing MFlop/s
mflops.internal <- function(np) {
  a <- matrix(runif(np**2), np, np)
  b <- matrix(runif(np**2), np, np)
  nflops <- np**2 * (2*np - 1)
  time <- system.time(a %*% b)[[3]]
  nflops/time/1000000
}
This function computes a matrix-matrix multiplication using np x np random matrices.
The number of floating point operations is:
● np x np matrix elements
● np multiplications and (np-1) additions per element
resulting in
np x np x (np + np - 1) = np**2*(2*np-1) FLOPs
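As a quick check, one could call the function for a few matrix sizes and compare the resulting
rates (a minimal sketch; the sizes are arbitrary, and very small matrices may run too fast to be
timed reliably):

for (np in c(250, 500, 1000, 2000)) {
  cat("np =", np, ":", round(mflops.internal(np)), "MFlop/s\n")
}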
Amdahl's law
Computing time for N processors
T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
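The saturation point follows directly from the model: the speedup is largest where the derivative
of T(N) vanishes, i.e. at N* = sqrt(T(1)/Tcomm). A minimal sketch in R, using the same assumed
overheads as on the next slides (Tserial/T(1)=0.01, Tcomm/T(1)=0.001):

ts <- 0.01     # Tserial / T(1)
tc <- 0.001    # Tcomm  / T(1)
speedup <- function(N) N/(1 + ts*N + tc*N**2)
N <- 1:1000
N[which.max(speedup(N))]   # numerically found saturation point
sqrt(1/tc)                 # analytic optimum: N* = sqrt(T(1)/Tcomm) ~ 32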
Amdahl's Law II
Acceleration factor for
Tserial/T(1)=0.01
Amdahl's law III
> N <- 1:1000    # number of processors
> plot(N, type="l")
> lines(N/(1+0.01*N), col="red")
> lines(N/(1+0.01*N+0.001*N**2), col="green")
R on the HLRB-II
Strong scaling up to 120 cores;
beyond that, the computing time
per core becomes too small and
overhead dominates.
Leibniz Supercomputing Centre
Hardware @ LRZ
● Computer Centre (~175 employees) for all Munich Universities with
o more than 80,000 students and
o more than 26,000 employees
o including 8,500 scientists
● Regional Computer Centre for all Bavarian Universities
o Capacity computing
o Special equipment
o Backup and Archiving Centre (10 petabyte, more than 6 billion files)
o Distributed File Systems
o Competence centre (Networks, HPC, IT Management)
● National Supercomputing Centre
o Gauss Centre for Supercomputing
o Integrated in European HPC and Grid projects
The Leibniz Supercomputing Centre is…
Hardware @ LRZ
http://www.lrz.de/services/compute/linux-cluster/overview/
The LRZ Linux Cluster:
Heterogeneous Cluster of Intel-compatible systems
●lx64ia, lx64ia2, lx64ia3 (login nodes)
●gvs1, gvs2, gvs3, gvs4 (remote visualisation nodes 8 GPUs)
●uv2, uv3 (SMP nodes 1.040 cores)
●ice1-login (cluster)
●lxa1 (coolMUC, MPP cluster)
The SuperMUC
●superMIG (migration system and fat island, 8.200 cores)
●superMUC (cluster of thin islands, 147.456 cores available in Sept 2012)
SuperMUC Linux Cluster
Hardware@LRZ (new Sept 2012)
(Diagram: systems and core counts)
● SuperMUC: 147456 cores (login node supermuc: 16 cores)
● SuperMIG: 8200 cores
● CoolMUC: 4300 cores
● SGI UV: 2080 cores
● SGI ICE: 512 cores
● gvs1...4: 64 cores (GPU)
● supzero: 80 cores (ia64 login node)
● lx64ia2, lx64ia3: 8 cores each (x86_64 login nodes)
File space @ LRZ
http://www.lrz.de/services/compute/backup/
$HOME
25 GB per group, with backup and snapshots
cd $HOME/.snapshot
$OPT_TMP
temporary scratch space (beware!)
High Watermark Deletion
When the filling of the file system exceeds some limit (typically between 80% and 90%), files will be deleted starting with the
oldest and largest files until a filling of between 60% and 75% is reached. The precise values may vary.
$PROJECT
project space (max 1TB), no automatic backup, use dsmc
module system@LRZ
http://www.lrz.de/services/software/utilities/modules/
module avail
module list
module load <name>
e.g. module load matlab
module unload <name>
module show <name>
insert module system into qsub job:
. /etc/profile
or
. /etc/profile.d/modules.sh
What our users do: Usage 2010 by Research Area
Performance per core by Research area
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
simple slurm script:
#!/bin/bash
#SBATCH -J myjob
#SBATCH --mail-user=me@my_domain
#SBATCH --time=00:05:00
. /etc/profile
cd mydir
./myprog.exe
echo $JOB_ID
ls -al
pwd

(annotations, line by line:)
#!/bin/bash: this is ignored by SGE, but could be used if executed normally
-J: (placeholder) name of job
--mail-user: (placeholder) e-mail address (don't forget!)
--time: maximum run time; this may be increased up to the queue limit
. /etc/profile: load the standard environment (see below)
cd mydir: change to working directory
./myprog.exe: start executable
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
sbatch jobfile.sh submit job to SLURM
squeue -u <userid> get status of my job
scancel <jobid> delete my job
Start interactive shell:
srun --ntasks=32 --partition=uv2_batch xterm
R makes life easier
functional programming matters
How are High-Performance Codes constructed?
●“Traditional” Construction of High-Performance Codes:
 o C/C++/Fortran
 o Libraries
●“Alternative” Construction of High-Performance Codes:
 o Scripting for ‘brains’
 o GPUs/multicore for ‘inner loops’
●Play to the strengths of each programming environment.
●Hybrid programming:
 o use cluster and task parallelism at the same time
 o cluster parallelism: separated memory
 o task parallelism: shared memory
Why scripting?
A scripting language. . .
●is discoverable and interactive.
●has comprehensive built-in functionality.
●manages resources automatically.
●is dynamically typed.
●works well for “glueing” lower-level blocks together.
●examples: tcl/tk, perl, python, ruby, R, MATLAB
Why functional matters...
●for parallel programming:
 o no side effects
 o code as data
●for structured programming:
 o late binding
 o recursion
 o lazy evaluation
 o very high abstraction
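A small illustration of "code as data" in R (a sketch with made-up data): functions are ordinary
objects that can be stored in lists, passed around and applied without touching any global state:

# functions passed around as data, applied without side effects
summaries <- list(mean=mean, sd=sd, max=max)
x <- runif(100)
lapply(summaries, function(f) f(x))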
R functions
●R can define named and anonymous functions
●Define a (named or anonymous) function:
todB <- function(X) {10*log10(X)}
●Functions can even return (anonymous) functions
●The last value evaluated is the return value
●Variables from the calling namespace are visible
●All other variables are local unless specified
●Variable number of inputs:
myfunc <- function(...) list(...)
●Variable names and predefined values
myfunc <- function(a,b=1,c=a*b) c+1
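For example (a minimal sketch), a function can return an anonymous function (a closure over
its arguments), and default values can be combined with a variable argument list:

# a function that returns an (anonymous) function
make.todB <- function(ref=1) function(X) 10*log10(X/ref)
todB100 <- make.todB(100)
todB100(1000)               # 10

# default values together with a variable argument list
myfunc2 <- function(a, b=1, c=a*b, ...) c + length(list(...))
myfunc2(2, 3)               # c = a*b = 6
myfunc2(2, 3, 10, "x", "y") # c = 10 plus two extra arguments = 12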
Available packages for R
How to use multiple cores with R
●R provides modularization
●R provides high level abstractions
●R provides mixing of programming paradigms
●R provides dynamic libraries
●R provides vector expressions
Use It!
You can write multi-machine, multi-core, GPGPU accelerated, client-
server based, web-enabled applications using R
Parallel R Packages
●foreach          parallel abstraction
●pnmath/MKL       parallel intrinsic functions
●multicore        SMP programming
●snow             Simple Network of Workstations
●Rmpi             Message Passing Interface
●rgpu, gputools   GPGPU programming
●R webservices    client/server webservices
●sqldf            SQL server for R
●rredis           noSQL server for R
●mapReduce        large scale parallelization
Parallel programming with R
●Parallel APIs:
oSMP - multicore
oMPP/MPI - mpi
ossh/sockets - snow
●Abstraction:
oforeach package
 doMC
 doMPI
 doSNOW
 doREDIS
Example:
library(doMC)
registerDoMC(cores=5)
foreach(i=1:10) %dopar% sqrt(i)
roots <- foreach(i=1:10) %dopar% sqrt(i)
SMP programming
library(multicore)
● send tasks into the background with parallel
● wait for completion and gather results with collect
library(multicore)
# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))
# gather results blocking
collect(list(p1,p2))
# gather results non-blocking
collect(list(p1,p2),wait=F)
library(multicore)
● Extension of the apply function family in R
● function-function or functional
● utilizes SMP:
library(multicore)
doit <- function(x,np)sum(sort(runif(np)))
# single call
system.time( doit(0,10000000) )
# serial loop
system.time( lapply(1:16,doit,10000000))
# parallel loop
system.time( mclapply(1:16,doit,10000000,mc.cores=4 ))
doMC
# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
7.228 7.216 3.296
multithreading with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doMC)
registerDoMC()
foreach(i=1:N) %dopar%
{
mmult.f()
}
# thread execution
Cluster Programming
doSNOW
# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
SNOW with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doSNOW)
registerDoSNOW(makeSOCKcluster(4))
foreach(i=1:N) %dopar%
{
mmult.f()
}
# cluster execution
Job Scheduler
noSQL databases
Redis is an open source, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes,
lists, sets and sorted sets.
http://www.redis.io
Clients are available for C, C++, C#, Objective-C, Clojure, Common
Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala,
Smalltalk, Tcl
doRedis / workers
start redis worker:
> echo "require('doRedis');redisWorker('jobs')" | R
The workers can be distributed over the internet
> startRedisWorkers(100)
doRedis
# R
> library(doRedis)
> registerDoRedis("jobs")
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
doMC
# R
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
7.228 7.216 3.296
doSNOW
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
redis and R: rredis, doREDIS
redisConnect()           # connect to redis store
redisSet('x',runif(5))   # store a value
redisGet('x')            # retrieve value from store
redisClose()             # close connection

redisAuth(pwd)           # simple authentication
redisConnect()
redisLPush('x',1)        # push numbers into list
redisLPush('x',2)
redisLPush('x',3)
redisLRange('x',0,2)     # retrieve list
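Because the values live in the redis server, two independent R sessions can exchange arbitrary
R objects this way (a sketch using only the calls shown above; the key name 'params' is made up,
and redisConnect() also accepts a host argument for remote servers):

# session 1: store an R object under a key
redisConnect()
redisSet('params', list(n=1000, seed=42))

# session 2, possibly on another machine: pick it up
redisConnect()
p <- redisGet('params')
p$n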
Calling external binary code
One R to rule them all
●C/C++/objectiveC
●Fortran
●java
●Mpi
●Threads
●opengl
●ssh
●web server/client
●linux mac mswin
●R shell
●R gui
●math notebook
●automatic latex/pdf
●vtk
One R to bind them
●C/C++/objectiveC   .C("funcname", args...)
●Fortran            .Fortran("test", args...)
●java               .jcall("class", args...)
●R objects          .Call
●R objects          .External
Use R as scripting language
R can dynamically load shared objects:
dyn.load("lib.so")
these functions can then be called via
.C("fname", args)
.Fortran("fname", args)
C integration
●shared object libraries can be
used in R out of the box
●R arrays are mapped to C pointers:
  R          C
  integer    int*
  numeric    double*
  character  char*
Example:
R CMD SHLIB -o test.so test.c
use in R:
> dyn.load("test.so")
> .C("test", args)
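Assuming (hypothetically) that test.c defines a function with the signature
void test(int *n, double *x), the mapping above determines how the arguments have to be
coerced on the R side (a sketch):

# hypothetical wrapper for a C function  void test(int *n, double *x)
dyn.load("test.so")
test.r <- function(x)
  .C("test", n=as.integer(length(x)), x=as.double(x))$x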
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program
Fortran Compiler
use Intel fortran compiler
$ ifort -o myprog.exe myprog.f90
$ time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: -fast, -O3)
R subroutine
subroutine mysub(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
return
end subroutine
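The subroutine above can be compiled and called from R in the same way as shown for mmult on
the following slides (a sketch; the vector lengths must match the Fortran parameter np=1000000):

system("ifort -shared -fPIC -o mysub.so mysub.f90")
dyn.load("mysub.so")
out <- .Fortran("mysub",
                x=as.double(numeric(1000000)),
                v=as.double(numeric(1000000)),
                nstep=as.integer(1000))
sum(out$x**2 + out$v**2)   # total energy after nstep steps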
Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
  integer np
  real*8 a(np,np), b(np,np), c(np,np)
  integer i,j,k
  do k=1, np
    forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
  end do
  return
end subroutine
Call FORTRAN from R
# compile f90 to shared object library
system("ifort -shared -fPIC -o mmult.so mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
  .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))
Call FORTRAN binary
np=100
system.time(
mmult.f(
a = matrix(numeric(np*np),np,np),
b = matrix(numeric(np*np)+1.,np,np),
c = matrix(numeric(np*np)+1.,np,np)
)
)
Exercise: make a plot system-time vs matrix-dimension
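One possible approach to the exercise (a sketch; the chosen dimensions are arbitrary):

dims <- c(100, 200, 400, 800)
times <- sapply(dims, function(np)
  system.time(mmult.f(a = matrix(0, np, np),
                      b = matrix(1, np, np),
                      c = matrix(1, np, np)))[[3]])
plot(dims, times, type="b", xlab="matrix dimension np", ylab="elapsed time [s]")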
Big Memory
(Diagrams: logical setup of a node)
● without shared memory: two R processes, each with its own memory region, local disk
● with shared memory: two R processes sharing one memory region
● with file-backed memory: a shared memory region backed by files on local disk
● with network-attached file-backed memory: a shared memory region backed by files on a
  network file system
library(bigmemory)
● shared memory regions for several processes in SMP
● file-backed arrays for several nodes over network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])
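Several R processes on the same node can then work on a single copy of the data: one process
creates the matrix and a descriptor, the others attach to it (a sketch using describe() and
attach.big.matrix(), which reappear in Part II):

# process 1: create a shared big.matrix and a descriptor for it
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
desc <- describe(x)
save(desc, file="desc_x.RData")

# process 2 (same node): attach to the shared matrix instead of copying it
library(bigmemory)
load("desc_x.RData")
y <- attach.big.matrix(desc)
sum(y[1, 1:1000])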
Part II
Applications
Potential Problems on Big Data Sets
1. many small tasks have to be performed for
each of many thousands of variables (long
run time)
2. analysis/ processing needs more main
memory than available
3. several R processes on a node need to
process the same big data set and each
process creates its own big R-object
4. data set cannot be loaded into R because
the R-object representing it would be too big
for the main memory available (worst case)
Approaches for Big Data Problems
1. C-function (shared library)
2. Accelerators (gpgpu, MICs)
3. SMP parallelisation
4. Cluster parallelisation
5. distributed data
6. in memory data files (arrays as big as
available memory)
7. parallel file systems (file backed arrays, no
size limit)
8. hierarchical and heterogeneous file systems
Problem 1: Example (Microarray Data)
● gene expressions for approximately 20000 genes
● influence of each variable on a Survival response shall be tested
Compute a Cox survival model for each variable:
S(t|x) = S0(t)^exp(bx)
● In R: function coxph() in package survival (shipped with R)
● even more challenging problem: test all second order interactions
(all pairs, 20000 choose 2)
Problem 1: Example (Microarray Data)
First approach: for-loop in R using function coxph() [which actually calls a C-function using dyn.load to
compute the Cox-Model ]:
library(survHD)
data(beer.survival)
data(beer.exprs)
set.seed(123)
X<-t(as.matrix(beer.exprs))
y<-Surv(beer.survival[,2],beer.survival[,1])
coefs<-c()
system.time(
for(j in 1:ncol(X)){
fit <- coxph( y ~ X[,j])
coefs<-rbind(coefs,summary(fit)$coefficients[ 1 , c(1, 3, 5) ])})
Second Approach: using apply
system.time(output <- apply(t(X),1,function(xrow){
fit <- coxph( y ~ xrow )
summary(fit)$coefficients[ 1 , c(1, 3, 5) ]
}))
User System elapsed
34.635 0.002 34.686
User System elapsed
26.531 0.020 26.676
Problem 1: Example (Microarray Data)
2nd Approach:
● Passing a matrix to C and perform the for-loop inside C
● only coefficients and corresponding p-values are returned for each variable
● function rowCoxTests in R-package survHD
time <- y[,1]
status <- y[,2]
sorted <- order(time)
time <- time[sorted]
status <- status[sorted]
X <- X[sorted,]
##compute columnwise coxmodels
#dynload not necessary, because 'coxmat.so' is integrated into survHD
system.time(out <- .C('coxmat', regmat=as.double(X), ncolmat=as.integer(ncol(X)),
    nrowmat=as.integer(nrow(X)), reg=as.double(X[,1]),
    zscores=as.double(numeric(ncol(X))), coefs=as.double(numeric(ncol(X))),
    maxiter=as.integer(20), ...))
● performing computations in C/Fortran, i.e. optimizing sequential code, often yields significant speed-up
● but in principle difficult to program and quite error prone
● C-functions for single variables are usually available and wrappers are usually easy to program
User System elapsed
0.229 0.000 0.229
max(abs(out$coefs-coefs[,1]))
[1] 1.004459e-07
Comparison to parallel programming:
Parallelization of for-loop using snow:
#create cluster
library(snow)
cl<-makeSOCKcluster(10)
#broadcast X
Z<-X
clusterExport(cl=cl,list=list('Z'))
#function to be applied in parallel
parcoxph<-function(ind,y){
require(survHD)
zcol<-Z[,ind]
fit<-coxph( y ~ zcol )
summary(fit)$coefficients[ 1 , c(1, 3, 5) ]}
#run function on 10 cores
system.time(result <- parLapply(cl=cl,x=1:ncol(Z),fun=parcoxph,y=y))
● parallelization of very small and short tasks usually not efficient
● possible improvement: rewrite code such that bunches of tests are performed
User System elapsed
0.031 0.003 3.474
Combining both approaches:
For really big data sets (>100000 variables) one can combine both approaches:
X2<-X
for(i in 1:30){
 X2<-cbind(X2,X)}
colnames(X2) <- 1:ncol(X2)
system.time(tt<-rowCoxTests(t(X2),y,option='fast'))
system.time(rowCoxTests(t(X),y,option='fast'))
##using snow
#create cluster
library(snow)
cl<-makeSOCKcluster(10)
#function to be applied in parallel
parfun<-function(ind,Z,y){
require(survHD)
rowCoxTests(X=t(Z),y=y,option='fast')}
#run function on 10 cores
system.time(result<-parLapply(cl=cl,x=1:30,fun=parfun,Z=X,y=y))
X2<-cbind(X,X,X)
system.time(result<-parLapply(cl=cl,x=1:10,fun=parfun,Z=X,y=y))
User System elapsed
0.593 0.010 0.606
User System elapsed
0.303 0.000 0.303
User System elapsed
1.825 0.291 7.215
User System elapsed
2.255 0.206 3.436
Combining both approaches: Exercise
In the current example, however, parallel computing is less effective anyway
Exercise:
1. Create a large data set by concatenating the gene-expression matrix 20 times
(use cbind)
2. apply the function rowCoxTests() and measure the runtime.
3. use snow in order to send the expression matrix to 20 cores and let each core
perform rowCoxTests() on its own matrix.
4. Measure the runtime.
Problem 2: Example
Normalization of Gene-Expression-Microarrays:
● approximately 500k measurements per array
● background correction has to be performed
● ca. 50 measurements have to be summarized to a single value representing one gene expression
(summarization step)
● R functions: rma() or vsn() in Bioconductor package affy
● high memory requirements as soon as number of observations exceeds 100 arrays (>10GB RAM)
Distributed Data Approach (Bioconductor Package affyPara)
Problem 2: Example
source: Markus Schmidberger (): Parallel Computing for Biological Data, Dissertation
Distributed Data Approach for background correction
AffyPara: Code Example
#load packages and initialize snow-cluster (for affyPara)
library(snow) #parallelization
library(affyPara) #parallel preprocessing
library(affy) #for reading in affy batches
ncpusaffy<-7 #number of cpus
cl<-makeSOCKcluster(ncpusaffy) #create cluster
#reading AffyBatch from cel-files
setwd('~/dataCEL/wang05/cel') #directory containing cel files
aboall<-ReadAffy() #reading
#create subcluster of length ncores
ncores<-7
cll<-cl[1:ncores]
#perform preprocessing using subcluster cll
res<-system.time(arrs.out<-preproPara(aboall, bgcorrect=T, bgcorrect.method='rma',
    normalize=T, normalize.method='quantiles', pmcorrect.method='pmonly',
    summary.method='avgdiff', cluster=cll))
###stop cluster/ finalize MPI
stopCluster(cl)
single core: RAM > 6GB; 7 cores: ca. 1.5GB/core, minor speedup
Problem 2: Exercise
Exercise for you:
1. Perform a microarray background correction using serial code (ReadAffy() ,bg.correct() in package
affy)
2. use top to observe the memory consumption of the process.
3. Additionally, measure its runtime.
4. Perform the background correction as a distributed data approach using snow
(you can pass a character-vector of filenames in ReadAffy() in order to load specific cel-files)
5. Compare memory consumption and runtime to the sequential code
Problem 3/4: Data set too large for
RAM
● R cannot handle data indices which are larger than 2 Billion (16GB double, 4GB in Windows XP)
● modern biological data can have several dozen GB (e.g. Next Generation Sequencing)
● If the R-object representing the data set grows larger than the available RAM, R stops with an
error reading "Cannot allocate vector of xx bytes".
Possible solution: R package bigmemory (based on C++-libraries for big data objects)
2 areas of usage:
● if several processes operate on the same big matrix
● file-backed-matrices if data sets are larger than available main memory
and the combination of both situations
R-Package bigmemory
Essential functions:
● big.matrix(): for creating a big matrix (useful if RAM is large enough but several processes have to
access the matrix)
● filebacked.big.matrix: for creating a file backed matrix (necessary if main memory is too small)
● describe(): creates a descriptor file for an existing (filebacked)bigmatrix-object
● bigmatrix[i1,i2]: the bigmatrix objects can be handled in R code as normal matrix objects, i.e. their
elements can be accessed using brackets
bigmemory: code example
###write
data(golub)
library(bigmemory)
setwd('~/tmp/bigmem')
X<-as.matrix(golub[,-1])
#create filebacked.bigmatrix and write data into its elements
z <- filebacked.big.matrix(nrow=30*5000, ncol=ncol(X), type='double',
    backingfile="magolub.bin", descriptorfile="magolub.desc")
k<-0
for(i in 1:5000){
inds<-sample(1:nrow(X),30)
z[(1:30)+(k*30),]<-X[inds,]
k<-k+1}
#create and save descriptorfile for later usage
desc<-describe(z)
save(desc,file='desc_z.RData')
bigmemory: code example
###read
library(bigmemory)
setwd('~/tmp/bigmem')
#load descriptorfile
load('desc_z.RData')
#attach bigmatrix object using the descriptor file
y<-attach.big.matrix(desc)
#access elements
y[1:10,7]
#read element 7 in the 5th row
b<-y[5,7]
#compute sum of a submatrix
(sum1<-sum(y[1:10,5:20]))
bigmemory: exercise
Exercise for you:
1. create a bigmatrix object using big.matrix()
2. create a descriptor and save it
3. start another R-session on the same node
4. load the descriptor file and attach the bigmatrix
5. use the bigmatrix object for communication between both R processes
Gaining Flexibility: doRedis
● separates job administration and execution
● subtasks are stored in a redis data base
o master process sends subtasks of a computation to the server
o worker can log in and request the tasks
o all necessary R objects are stored in the redis server, too
● necessary software:
o R-packages: rredis, doRedis
o data base: redis-server (debian-package)
doRedis: essential functionality
● Master process:
o registerDoRedis(jobqueue,host): connects to the redis-server at 'host' and specifies a jobqueue
for the tasks to come
o foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to redis data base
o redisFlushAll(): clears the data base
o removeQueue(): removes a queue from the data base
● Worker process:
o registerDoRedis(jobqueue,host): registers a jobqueue whose tasks shall be processed
o startLocalWorkers(n,jobqueue,host): starts n local worker processes which process the tasks
specified in jobqueue (uses multicore)
o redisWorker(jobqueue,host): useful in mpi-environments
usually users do not request or set the data base values directly
typical parallelization as known from other "Do-packages"
Worker processes can run on any R-compatible hardware and can connect at any time
(Diagram: the doRedis master sends jobs and R objects to the redis-server, which distributes
them to worker processes on nodes 1-4 and eventually returns the results.)
● robust
● flexible
● dynamic
doRedis: code example
Master (sending subtasks to redis-server and wait for results):
#redis-server ~/redis/redis-2.2.14/redis.conf   (in linux shell, starts the redis-server)
#cross-validation of classification on microarray data
library(CMA)
X <- as.matrix(golub[,-1])
y <- golub[,1]
ls <- GenerateLearningsets(y=y,method='CV',
fold=10,niter=10000)
#function to be applied on each node
cl2 <- function(j){
require(CMA)
ttt<-system.time(cl<-svmCMA(y=y,X=X,learnind=ls@learnmatrix[j,],cost=10))
list(cl,ttt,Sys.info())}
#connect to redis-server, sent subtasks and wait for results
library(doRedis)
redisFlushAll()
registerDoRedis('jobscmanew')
numtodo<-nrow(ls@learnmatrix)
lll3<-foreach(j=1:numtodo) %dopar% {cl2(j)}
doRedis: code example
Worker processes (connect to server, receive subtasks and objects, return results):
###using multicore (just two lines)
#register jobqueue from redis-server
registerDoRedis('jobscmanew',host='bernau1.ibe.med.uni-muenchen.de')
#start 10 local workers
startLocalWorkers(n=10, queue='jobscmanew')
###using MPI
#function to be run by each mpi-process
startdr<-function(ll){
library(doRedis)
redisWorker('jobscmanew',host='bernau1.ibe.med.uni-muenchen.de')
}
#start rmpi
library(Rmpi)
numworker<-mpi.universe.size()
mpi.spawn.Rslaves()
#let each mpi-process connect to redis-server and perform subtasks
mpi.apply(1:numworker,startdr)
doRedis: exercise
1. connect to the redis server in R
2. submit a job queue
3. start workers to perform the subtasks
4. set a value for variable xnewinteger (use)
5. request the value of variable xnewinteger (use)
● redis and doRedis provide high flexibility for performing independent subtasks
o worker processes can connect at any time
o errors in individual processes do not stop the entire computation (robustness)
o worker processes can run on totally different architectures
o worker processes can run all around the world
● disadvantage: database can become a bottleneck if large R objects have to be stored/sent
solution: separation of large data objects (bigmemory) and job tasks (redis)
Combining doRedis and bigmemory
Separate task and data channel:
Combining doRedis and bigmemory
doredis/bigmemory: Code Example
worker process:
redisbigreadwrite<-function(procind){
require(CMA)
require(bigmemory)
j<-procind
setwd('~/tmp/bigmemlrz')
load('desc_z.RData') #big data object containing many gene expression sets
load('desc_out.RData') #big data file for misclassification rates
z<-attach.big.matrix(desc)
out<-attach.big.matrix(descout)
load('descresmat.RData')
resmat<-attach.big.matrix(descresmat) #big data object for simulating a large writing operation
for(iter in 1:10){
start<-(j-1)*30*10*10+(iter-1)*30*10+1
X<-z[start:(start+299),] #read gene expression matrix
cl<-svmCMA(y=sample(c(1,2),nrow(X),replace=T),X=X,learnind=1:25,cost=10)
#construct classifier
out[(j-1)*10+iter]<-mean(abs(cl@y-cl@yhat)) #compute misclassification rate
resmat[start:(start+299)]<-X #write X
}
#flush
flush(resmat);flush(out)}
doredis/bigmemory: Code Example
master process:
###create bigmatrix (gene expressions)
library(bigmemory)
setwd('~/tmp/bigmemlrz')
X<-as.matrix(golub[,-1])
z <- filebacked.big.matrix(nrow=30*1500, ncol=ncol(X), type='double',
    backingfile="magolub.bin", descriptorfile="magolub.desc")
for(i in 1:1500){
 inds<-sample(1:nrow(X),30)
 z[(1:30)+((i-1)*30),]<-X[inds,]}
#create descriptor file and save it for other processes
desc<-describe(z)
save(desc,file='desc_z.RData')
###doredis part
library(doRedis)
registerDoRedis('rwbigmem')
lll3<-foreach(j=1:1500) %dopar% redisbigreadwrite(j)
results are returned in a file-backed object so master could quit
doredis/bigmemory: code example
main difference: underlying network and network file system
IBE (NFS) vs. LRZ (NAS)
Comparison to standard MPI-IO approach
Difference: MPI is less flexible
● not robust
● collective open/close calls
(Figure: Fortran90 MPI-IO implementation vs. R bigmemory implementation)
Exercise:
1. run the previous example using only two doredis-workers which perform only a single task
2. rewrite the previous example such that the proportion of class 1 predictions is returned
3. try to rewrite the previous example such that each worker process reads 10 subdatasets at a time
and then constructs a classifier for each of the ten read in subdatasets
4. create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension
200x10000 ) using random numbers and run the previous example using that input 'bigmatrix'
doRedis/bigmemory: Exercise
Thanks for your attention.
Further questions?
The End
Worker processes can run on any R-compatible hardware and can connect at any time
redis-server
master:
doRedis
sends
jobs +objects
NODE 1
worker 1a
...
worker 1z
NODE 2
worker 2a
...
worker 2z
NODE 3
worker 3a
...
worker 3z
NODE 4
worker 4a
...
worker 4z
distributes
jobs and objects
eventually returns results
● robust
● flexible
● dynamic

Contenu connexe

Tendances

Japan Lustre User Group 2014
Japan Lustre User Group 2014Japan Lustre User Group 2014
Japan Lustre User Group 2014Hitoshi Sato
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)Jyh-Miin Lin
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...Rakuten Group, Inc.
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015Kohei KaiGai
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 

Tendances (18)

Japan Lustre User Group 2014
Japan Lustre User Group 2014Japan Lustre User Group 2014
Japan Lustre User Group 2014
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
CUDA
CUDACUDA
CUDA
 

En vedette

En vedette (8)

Lrz kurse: r visualisation
Lrz kurse: r visualisationLrz kurse: r visualisation
Lrz kurse: r visualisation
 
Lrz kurse: r as superglue
Lrz kurse: r as superglueLrz kurse: r as superglue
Lrz kurse: r as superglue
 
Big Data Analysis with Signal Processing on Graphs
Big Data Analysis with Signal Processing on GraphsBig Data Analysis with Signal Processing on Graphs
Big Data Analysis with Signal Processing on Graphs
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data
Big DataBig Data
Big Data
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similaire à Lrz kurs: big data analysis

Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-GeneOpenStack Korea Community
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
Advanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupAdvanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupMongoDB
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIDataWorks Summit
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit SupercomputerVigneshwarRamaswamy
 

Similaire à Lrz kurs: big data analysis (20)

GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
Advanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupAdvanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and Backup
 
uCluster
uClusteruCluster
uCluster
 
Available HPC resources at CSUC
Available HPC resources at CSUCAvailable HPC resources at CSUC
Available HPC resources at CSUC
 
Hortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AIHortonworks on IBM POWER Analytics / AI
Hortonworks on IBM POWER Analytics / AI
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit Supercomputer
 

Dernier

Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsManeerUddin
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 

Dernier (20)

Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 

Lrz kurs: big data analysis

  • 1. Big Data Analysis Christoph Bernau and Ferdinand Jamitzky jamitzky@lrz.de http://goo.gl/kS31X
  • 2. Big Data Analysis Christoph Bernau and Ferdinand Jamitzky jamitzky@lrz.de http://goo.gl/kS31X
  • 3. Big Data Analysis Christoph Bernau and Ferdinand Jamitzky jamitzky@lrz.de http://goo.gl/kS31X
  • 4. Contents 1. A short introduction to big data 2. Parallel programming is hard 3. Hardware @LRZ 4. Functional Programming 5. Available packages for R 6. Parallel Programming Tools 7. SMP Programming 8. Cluster Programming 9. Job Scheduler 10.Calling external binary code
  • 5. big data a short introduction
  • 6. What is Big Data? In information technology, big data is a loosely- defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools (from wikipedia) ● Buzz Word ● High dimensional data ● Memory intensive data and/or algorithms
  • 7. Who does Big Data? ● Bioinformatics ● Genomics and other "Omics" ● Astronomy ● Meteorology ● Environmental Research ● Multiscale physics simulations ● Economic and financial simulations ● Social Networks ● Text Mining ● Large Hadron Collider
  • 8. Hardware for Big Data ● Large Arrays of Harddisks ● Solid State Disks as temp storage ● Large RAM ● Manycore ● Multicore ● Accelerators ● Tape Archives
  • 9. Software Middleware for Big Data ● MapReduce ● Distributed File Systems ● Parallel File Systems ● Distributed Databases ● Task Queues ● Memory Attached Files
  • 10. Supercomputer for Big Data (Flash) Gordon: Data-Intensive Supercomputing at the San Diego Supercomputing Centre ● 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3 1333 memory ● Over 300 TB of high performance Intel flash memory SSDs via 64 dual-socket Intel Westmere I/O nodes ● Large memory supernodes capable of presenting over 2 TB of cache coherent memory ● Dual rail QDR InfiniBand network http://www.sdsc.edu/supercomputing/gordon/
  • 11. SuperMUC as Big Data System SuperMUC ● 9,216 dual-socket Intel Sandy Bridge nodes, each with 32 GB DDR3 1333 memory ● Parallel File System GPFS ● FDR10 InfiniBand network ● Bandwith to GPFS 200 GByte/s ● No Flash :-(
  • 13. Why parallel programming? End of the free lunch Moore's law means no longer faster processors, only more of them. But beware! 2 x 3 GHz < 6 GHz (cache consistency, multi-threading, etc)
  • 14. The future is parallel ●Moore's law is still valid ●Number of transistors doubles every 2 years ●Clock speed saturates at 3 to 4 GHz ●multi-core processors vs many-core processors ●grid/cloud computing ●clusters ●GPGPUs (intel 2000)
  • 15. The future is massively parallel Connection Machine CM-1 (1983) 12-D Hypercube 65536 1-bit cores (AND, OR, NOT) Rmax: 20 GFLOP/s
  • 16. The future is massively parallel JUGENE Blue Gene/P (2007) 3-D Torus or Tree 65536 64-bit cores (PowerPC 450) Rmax: 222 TFLOP/s now: 1 PFLOP/s 294912 cores
  • 17. Supercomputer: SMP SMP Machine: shared memory typically 10s of cores threaded programs bus interconnect in R: library(multicore) and inlined code Example: gvs1 128 GB RAM 16 cores Example: uv3.cos.lrz.de 2000 GB RAM 1120 cores
  • 18. Supercomputer: MPI Cluster of machines: distributed memory typically 100s of cores message passing interface infiniband interconnect in R: library(Rmpi) and inlined code Example: coolMUC 4700 GB RAM 2030 cores Example: superMUC 320.000 GB RAM 160.000 cores
  • 19. Levels of Parallelism ●Node Level (e.g. SuperMUC has approx. 10000 nodes) each node has 2 sockets ●Socket Level each socket contains 8 cores ●Core Level each core has 16 vector registers ●Vector Level (e.g. lxgp1 GPGPU has 480 vector registers) ●Pipeline Level (how many simultaneous pipelines) hyperthreading ●Instruction Level (instructions per cycle) out of order execution, branch prediction
  • 20. Problems: Access Times Getting data from: CPU register 1ns L2 cache 10ns memory 80 ns network(IB) 200 ns GPU(PCIe) 50.000 ns harddisk 500.000 ns Getting some food from: fridge 10s microwave 100s ~ 2min pizza service 800s ~ 15min city mall 2000s ~ 0.5h mum sends cake 500.000 s~1 week grown in own garden 5Ms ~ 2months
  • 21. Computing MFlop/s mflops.internal <- function(np) { a=matrix(runif(np**2),np,np) b=matrix(runif(np**2),np,np) nflops=np**2*(2*np-1) time=system.time(a %*% b)[[3]] nflops/time/1000000} This function computes a matrix-matrix multiplication using np x np random matrices. The number of floating point operations is: ●np x np matrix elements ●np multiplications and (np-1) additions resulting in np x np x (np+np-1) = np**2*(2*np-1) FLOPS
  • 22. Amdahl's law Computing time for N processors: T(N) = T(1)/N + Tserial + Tcomm * N Acceleration factor: T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2) small N: T(1)/T(N) ~ N large N: T(1)/T(N) ~ 1/N saturation point!
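As a worked illustration (the fractions 0.01 and 0.001 are assumed example values matching the plot on the following slides, not measured numbers): the speedup curve peaks where its derivative vanishes, i.e. at N = sqrt(T(1)/Tcomm).
# hedged sketch: locate the saturation point of the Amdahl speedup curve
# assuming Tserial/T(1) = 0.01 and Tcomm/T(1) = 0.001 (example values)
N <- 1:200
speedup <- N / (1 + 0.01*N + 0.001*N**2)
N[which.max(speedup)]   # ~ sqrt(1/0.001) = 31.6; beyond this, adding processors slows the run down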
  • 23. Amdahl's Law II Acceleration factor for Tserial/T(1)=0.01
  • 24. Amdahl's law III
> N <- 1:1000   # assumed range of processor counts for the plot
> plot(N, type="l")
> lines(N/(1+0.01*N), col="red")
> lines(N/(1+0.01*N+0.001*N**2), col="green")
  • 25. R on the HLRB-II Strong scaling up to about 120 cores; beyond that the per-core computing time becomes too low for further speedup.
  • 27. The Leibniz Supercomputing Centre is… ● Computer Centre (~175 employees) for all Munich Universities with o more than 80,000 students and o more than 26,000 employees o including 8,500 scientists ● Regional Computer Centre for all Bavarian Universities o Capacity computing o Special equipment o Backup and Archiving Centre (10 petabyte, more than 6 billion files) o Distributed File Systems o Competence centre (Networks, HPC, IT Management) ● National Supercomputing Centre o Gauss Centre for Supercomputing o Integrated in European HPC and Grid projects
  • 28. Hardware @ LRZ http://www.lrz.de/services/compute/linux-cluster/overview/ The LRZ Linux Cluster: Heterogeneous Cluster of Intel-compatible systems ●lx64ia, lx64ia2, lx64ia3 (login nodes) ●gvs1, gvs2, gvs3, gvs4 (remote visualisation nodes 8 GPUs) ●uv2, uv3 (SMP nodes 1.040 cores) ●ice1-login (cluster) ●lxa1 (coolMUC, MPP cluster) The SuperMUC ●superMIG (migration system and fat island, 8.200 cores) ●superMUC (cluster of thin islands, 147.456 cores available in Sept 2012)
  • 29. Hardware@LRZ (new Sept 2012): overview diagram of the Linux Cluster and SuperMUC systems with their core counts: SuperMIG (8,200 cores), CoolMUC (4,300 cores), SGI UV (2,080 cores), gvs1...4 (64 cores), SGI ICE (512 cores), login nodes lx64ia2 and lx64ia3 (8 cores each), supzero login (80 cores), supermuc login (16 cores), SuperMUC (147,456 cores); architectures: ia64, x86_64, GPU.
  • 30. File space @ LRZ http://www.lrz.de/services/compute/backup/ $HOME 25 GB per group, with backup and snapshots cd $HOME/.snapshot $OPT_TMP temporary scratch space (beware!) High Watermark Deletion When the filling of the file system exceeds some limit (typically between 80% and 90%), files will be deleted starting with the oldest and largest files until a filling of between 60% and 75% is reached. The precise values may vary. $PROJECT project space (max 1TB), no automatic backup, use dsmc
  • 31. module system@LRZ http://www.lrz.de/services/software/utilities/modules/ module avail module list module load <name> e.g. module load matlab module unload <name> module show <name> insert module system into qsub job: . /etc/profile or . /etc/profile.d/modules.sh
  • 32. What our users do: Usage 2010 by Research Area
  • 33. Performance per core by Research area
  • 34. batch system@LRZ http://www.lrz.de/services/compute/linux-cluster/batch-parallel simple slurm script:
#!/bin/bash                        # this is ignored by SGE, but could be used if executed normally
#SBATCH -J myjob                   # (Placeholder) name of job
#SBATCH --mail-user=me@my_domain   # (Placeholder) e-Mail address (don't forget!)
#SBATCH --time=00:05:00            # maximum run time; this may be increased up to the queue limit
. /etc/profile                     # load the standard environment (see below)
cd mydir                           # change to working directory
./myprog.exe                       # start executable
echo $JOB_ID
ls -al
pwd
  • 35. batch system@LRZ http://www.lrz.de/services/compute/linux-cluster/batch-parallel sbatch jobfile.sh submit job to SLURM squeue -u <userid> get status of my job scancel <jobid> delete my job Start interactive shell: srun --ntasks=32 --partition=uv2_batch xterm
  • 36. R makes life easier functional programming matters
  • 37. How are High-Performance Codes constructed? ●“Traditional” Construction of High-Performance Codes: o C/C++/Fortran o Libraries ●“Alternative” Construction of High-Performance Codes: o Scripting for ‘brains’ o GPUs/multicore for ‘inner loops’ ●Play to the strengths of each programming environment. ●Hybrid programming: o use cluster and task parallelism at the same time o cluster parallelism: separated memory o task parallelism: shared memory
  • 38. Why scripting? A scripting language. . . ●is discoverable and interactive. ●has comprehensive built-in functionality. ●manages resources automatically. ●is dynamically typed. ●works well for “glueing” lower-level blocks together. ●examples: tcl/tk, perl, python, ruby, R, MATLAB
  • 39. Why functional matters... ●for parallel programming: o no side effects o code as data ●for structured programming: o late binding o recursion o lazy evaluation o very high abstraction
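A small illustration of this functional style in R (a hedged sketch, not from the slides): functions are ordinary values, so they can be passed around and combined without loops or side effects.
# functions as values: pass an anonymous function to a higher-order function
squares <- Map(function(x) x^2, 1:5)   # "code as data": the function itself is an argument
total   <- Reduce(`+`, squares)        # fold the list without an explicit loop or side effects
print(total)                           # 55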
  • 40. R functions ●R can define named and anonymous functions ●Define a (named or anonymous) function: todB <- function(X) {10*log10(X)} ●Functions can even return (anonymous) functions ●The last value evaluated is the return value ●Variables from the calling namespace are visible ●All other variables are local unless specified ●Variable number of inputs: myfunc <- function(...) list(...) ●Variable names and predefined values myfunc <- function(a,b=1,c=a*b) c+1
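Since the slide notes that functions can return (anonymous) functions, here is a minimal sketch of such a closure (the name make.todB and the reference value are made up for illustration):
# a function that returns an anonymous function; 'ref' is captured from the enclosing environment
make.todB <- function(ref) function(X) 10*log10(X/ref)
todB1mW <- make.todB(0.001)   # dB relative to 1 mW
todB1mW(1)                    # 30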
  • 42. How to use multiple cores with R ●R provides modularization ●R provides high level abstractions ●R provides mixing of programming paradigms ●R provides dynamic libraries ●R provides vector expressions Use It! You can write multi-machine, multi-core, GPGPU accelerated, client-server based, web-enabled applications using R
  • 43. Parallel R Packages ●foreach: parallel abstraction ●pnmath/MKL: parallel intrinsic functions ●multicore: SMP programming ●snow: Simple Network of Workstations ●Rmpi: Message Passing Interface ●rgpu, gputools: GPGPU programming ●R webservices: client/server webservices ●sqldf: SQL server for R ●rredis: noSQL server for R ●mapReduce: large scale parallelization
  • 44. Parallel programming with R ●Parallel APIs: o SMP - multicore o MPP/MPI - mpi o ssh/sockets - snow ●Abstraction: o foreach package  doMC  doMPI  doSNOW  doREDIS Example: library(doMC) registerDoMC(cores=5) foreach(i=1:10) %dopar% sqrt(i) roots <- foreach(i=1:10) %dopar% sqrt(i)
  • 46. library(multicore) ● send tasks into the background with parallel ● wait for completion and gather results with collect
library(multicore)
# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))
# gather results blocking
collect(list(p1,p2))
# gather results non-blocking
collect(list(p1,p2), wait=F)
  • 47. library(multicore) ● Extension of the apply function family in R ● function-function or functional ● utilizes SMP:
library(multicore)
doit <- function(x,np) sum(sort(runif(np)))
# single call
system.time( doit(0,10000000) )
# serial loop
system.time( lapply(1:16, doit, 10000000) )
# parallel loop
system.time( mclapply(1:16, doit, 10000000, mc.cores=4) )
  • 48. doMC
# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)
> foreach(i=1:10) %do% sum(runif(10000000))
user system elapsed 9.352 2.652 12.002
> foreach(i=1:10) %dopar% sum(runif(10000000))
user system elapsed 7.228 7.216 3.296
  • 49. multithreading with R
library(foreach)
foreach(i=1:N) %do% { mmult.f() }      # serial execution
library(foreach)
library(doMC)
registerDoMC()
foreach(i=1:N) %dopar% { mmult.f() }   # thread execution
  • 51. doSNOW
# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))
> foreach(i=1:10) %do% sum(runif(10000000))
user system elapsed 15.377 0.928 16.303
> foreach(i=1:10) %dopar% sum(runif(10000000))
user system elapsed 4.864 0.000 4.865
  • 52. SNOW with R
library(foreach)
foreach(i=1:N) %do% { mmult.f() }      # serial execution
library(foreach)
library(doSNOW)
registerDoSNOW(makeSOCKcluster(4))
foreach(i=1:N) %dopar% { mmult.f() }   # cluster execution
  • 54. noSQL databases Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. http://www.redis.io Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl
  • 55. doRedis / workers start redis worker: > echo "require('doRedis');redisWorker('jobs')" | R The workers can be distributed over the internet > startRedisWorkers(100)
  • 56. doRedis
# R
> library(doRedis)
> registerDoRedis("jobs")
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed 15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed 4.864 0.000 4.865
  • 57. doMC
# R
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed 9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed 7.228 7.216 3.296
  • 58. doSNOW
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed 15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed 4.864 0.000 4.865
  • 59. redis and R: rredis, doREDIS
redisConnect()            # connect to redis store
redisSet('x', runif(5))   # store a value
redisGet('x')             # retrieve value from store
redisClose()              # close connection
redisAuth(pwd)            # simple authentication
redisConnect()
redisLPush('x', 1)        # push numbers into list
redisLPush('x', 2)
redisLPush('x', 3)
redisLRange('x', 0, 2)    # retrieve list
  • 61. One R to rule them all ●C/C++/objectiveC ●Fortran ●java ●Mpi ●Threads ●opengl ●ssh ●web server/client ●linux mac mswin ●R shell ●R gui ●math notebook ●automatic latex/pdf ●vtk
  • 62. One R to bind them ●C/C++/objectiveC: .C("funcname", args...) ●Fortran: .Fortran("test", args...) ●java: .jcall("class", args...) ●R objects: .Call ●R objects: .External
  • 63. Use R as scripting language R can dynamically load shared objects: dyn.load("lib.so") these functions can then be called via .C("fname", args) .Fortran("fname", args)
  • 64. C integration ●shared object libraries can be used in R out of the box ●R arrays are mapped to C pointers: R integer = C int*, R numeric = C double*, R character = C char* Example: R CMD SHLIB -o test.so test.c use in R: > dyn.load("test.so") > .C("test", args)
  • 65. Fortran 90 Example
program myprog
  ! simulate harmonic oscillator
  integer, parameter :: np=1000, nstep=1000
  real :: x(np), v(np), dx(np), dv(np), dt=0.01
  integer :: i,j
  forall(i=1:np) x(i)=i
  forall(i=1:np) v(i)=i
  do j=1,nstep
    dx=v*dt; dv=-x*dt
    x=x+dx; v=v+dv
  end do
  print*, " total energy: ", sum(x**2+v**2)
end program
  • 66. Fortran Compiler use Intel fortran compiler $ ifort -o myprog.exe myprog.f90 $ time ./myprog.exe exercise for you: ●compute MFlop/s (Floating Point Operations: 4 * np * nstep) ●optimize (hint: -fast, -O3)
  • 67. R subroutine
subroutine mysub(x,v,nstep)
  ! simulate harmonic oscillator
  integer, parameter :: np=1000000
  real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
  integer :: i,j, nstep
  forall(i=1:np) x(i)=real(i)/np
  forall(i=1:np) v(i)=real(i)/np
  do j=1,nstep
    dx=v*dt; dv=-x*dt
    x=x+dx; v=v+dv
  end do
  return
end subroutine
  • 68. Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
  integer np
  real*8 a(np,np), b(np,np), c(np,np)
  integer i,j,k
  do k=1, np
    forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
  end do
  return
end subroutine
  • 69. Call FORTRAN from R
# compile f90 to shared object library
system("ifort -shared -fPIC -o mmult.so mmult.f90")
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c) .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))
  • 70. Call FORTRAN binary
np = 100
system.time( mmult.f(
  a = matrix(numeric(np*np), np, np),
  b = matrix(numeric(np*np)+1., np, np),
  c = matrix(numeric(np*np)+1., np, np) ) )
Exercise: make a plot of system time vs matrix dimension
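One possible sketch for this exercise (the chosen matrix dimensions are arbitrary example values): time mmult.f for a range of dimensions and plot the elapsed times.
# hedged sketch: elapsed time of the Fortran matrix multiplication vs matrix dimension
dims  <- c(100, 200, 400, 800)                      # example sizes, adjust as needed
times <- sapply(dims, function(np) {
  system.time( mmult.f(
    a = matrix(numeric(np*np), np, np),
    b = matrix(numeric(np*np)+1., np, np),
    c = matrix(numeric(np*np)+1., np, np) ) )[[3]]  # elapsed time in seconds
})
plot(dims, times, type="b", xlab="matrix dimension", ylab="elapsed time [s]")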
  • 71. Big Memory: diagrams of the logical setup of a node with two R processes in four variants: without shared memory (one MEM region per R process), with shared memory (one MEM region shared by both), with file-backed memory (MEM backed by local disk), and with network-attached file-backed memory (MEM backed by a disk reached over the network).
  • 72. library(bigmemory) ● shared memory regions for several processes in SMP ● file backed arrays for several nodes over network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])
  • 74. Potential Problems on Big Data Sets 1. many small tasks have to be performed for each of many thousands of variables (long run time) 2. analysis/ processing needs more main memory than available 3. several R processes on a node need to process the same big data set and each process creates its own big R-object 4. data set cannot be loaded into R because the R-object representing it would be too big for the main memory available (worst case)
  • 75. Approaches for Big Data Problems 1. C-function (shared library) 2. Accelerators (gpgpu, MICs) 3. SMP parallelisation 4. Cluster parallelisation 5. distributed data 6. in memory data files (arrays as big as available memory) 7. parallel file systems (file backed arrays, no size limit) 8. hierarchical and heterogeneous file systems
  • 76. Problem 1: Example (Microarray Data) ● gene expressions for approximately 20000 genes ● influence of each variable on a survival response shall be tested ● compute a Cox survival model for each variable: S(t|x) = S0(t)^exp(bx) ● In R: function coxph() in package survival (a recommended package shipped with R) ● even more challenging problem: test all second order interactions (all pairs, 20000 choose 2)
  • 77. Problem 1: Example (Microarray Data) First approach: for-loop in R using function coxph() [which actually calls a C-function using dyn.load to compute the Cox-Model]:
library(survHD)
data(beer.survival)
data(beer.exprs)
set.seed(123)
X <- t(as.matrix(beer.exprs))
y <- Surv(beer.survival[,2], beer.survival[,1])
coefs <- c()
system.time(
  for(j in 1:ncol(X)){
    fit <- coxph( y ~ X[,j])
    coefs <- rbind(coefs, summary(fit)$coefficients[1, c(1, 3, 5)])})
User System elapsed 34.635 0.002 34.686
Second Approach: using apply
system.time(output <- apply(t(X), 1, function(xrow){
  fit <- coxph( y ~ xrow )
  summary(fit)$coefficients[1, c(1, 3, 5)] }))
User System elapsed 26.531 0.020 26.676
  • 78. Problem 1: Example (Microarray Data) 2nd Approach: ● Passing a matrix to C and performing the for-loop inside C ● only coefficients and corresponding p-values are returned for each variable ● function rowCoxTests in R-package survHD
time <- y[,1]
status <- y[,2]
sorted <- order(time)
time <- time[sorted]
status <- status[sorted]
X <- X[sorted,]
## compute columnwise cox models
# dyn.load not necessary, because 'coxmat.so' is integrated into survHD
system.time(out <- .C('coxmat', regmat=as.double(X), ncolmat=as.integer(ncol(X)), nrowmat=as.integer(nrow(X)), reg=as.double(X[,1]), zscores=as.double(numeric(ncol(X))), coefs=as.double(numeric(ncol(X))), maxiter=as.integer(20), ...))
User System elapsed 0.229 0.000 0.229
max(abs(out$coefs-coefs[,1]))
[1] 1.004459e-07
● performing computations in C/Fortran, i.e. optimizing sequential code, often yields significant speed-up ● principally difficult to program and quite error prone ● C-functions for single variables are usually available and wrappers are usually easy to program
  • 79. Comparison to parallel programming: Parallelization of for-loop using snow:
# create cluster
library(snow)
cl <- makeSOCKcluster(10)
# broadcast X
Z <- X
clusterExport(cl=cl, list=list('Z'))
# function to be applied in parallel
parcoxph <- function(ind, y){
  require(survHD)
  zcol <- Z[,ind]
  fit <- coxph( y ~ zcol )
  summary(fit)$coefficients[1, c(1, 3, 5)]}
# run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:ncol(Z), fun=parcoxph, y=y))
User System elapsed 0.031 0.003 3.474
● parallelization of very small and short tasks usually not efficient ● possible improvement: rewrite code such that bunches of tests are performed
  • 80. Combining both approaches: For really big data sets (>100000 variables) one can combine both approaches:
X2 <- X
for(i in 1:30){
  X2 <- cbind(X2,X)}
colnames(X2) <- 1:ncol(X2)
system.time(tt <- rowCoxTests(t(X2), y, option='fast'))
User System elapsed 0.593 0.010 0.606
system.time(rowCoxTests(t(X), y, option='fast'))
User System elapsed 0.303 0.000 0.303
## using snow
# create cluster
library(snow)
cl <- makeSOCKcluster(10)
# function to be applied in parallel
parfun <- function(ind, Z, y){
  require(survHD)
  rowCoxTests(X=t(Z), y=y, option='fast')}
# run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:30, fun=parfun, Z=X, y=y))
User System elapsed 1.825 0.291 7.215
X2 <- cbind(X,X,X)
system.time(result <- parLapply(cl=cl, x=1:10, fun=parfun, Z=X, y=y))
User System elapsed 2.255 0.206 3.436
  • 81. Combining both approaches: Exercise In the current example, however, parallel computing is less effective anyway. Exercise: 1. Create a large data set by concatenating the gene-expression matrix 20 times (use cbind) 2. apply the function rowCoxTests() and measure the runtime. 3. use snow in order to send the expression matrix to 20 cores and let each core perform rowCoxTests() on its own matrix. 4. Measure the runtime.
  • 82. Problem 2: Example Normalization of Gene-Expression-Microarrays: ● approximately 500k measurements per array ● background correction has to be performed ● ca. 50 measurements have to be summarized to a single value representing one gene expression (summarization step) ● R functions: rma() or vsn() in Bioconductor package affy ● high memory requirements as soon as number of observations exceeds 100 arrays (>10GB RAM) Distributed Data Approach (Bioconductor Package affyPara)
  • 83. Problem 2: Example source: Markus Schmidberger (): Parallel Computing for Biological Data, Dissertation. Distributed Data Approach for background correction
  • 84. AffyPara: Code Example
# load packages and initialize snow-cluster (for affyPara)
library(snow)      # parallelization
library(affyPara)  # parallel preprocessing
library(affy)      # for reading in affy batches
ncpusaffy <- 7                     # number of cpus
cl <- makeSOCKcluster(ncpusaffy)   # create cluster
# reading AffyBatch from cel-files
setwd('~/dataCEL/wang05/cel')      # directory containing cel files
aboall <- ReadAffy()               # reading
# create subcluster of length ncores
ncores <- 7
cll <- cl[1:ncores]
# perform preprocessing using subcluster cll
res <- system.time(arrs.out <- preproPara(aboall, bgcorrect=T, bgcorrect.method='rma', normalize=T, normalize.method='quantiles', pmcorrect.method='pmonly', summary.method='avgdiff', cluster=cll))
### stop cluster / finalize MPI
stopCluster(cl)
single core: RAM > 6GB; 7 cores: ca. 1.5GB/core, minor speedup
  • 85. Problem 2: Exercise Exercise for you: 1. Perform a microarray background correction using serial code (ReadAffy(), bg.correct() in package affy) 2. use top to observe the memory consumption of the process. 3. Additionally, measure its runtime. 4. Perform the background correction as a distributed data approach using snow (you can pass a character-vector of filenames to ReadAffy() in order to load specific cel-files) 5. Compare memory consumption and runtime to the sequential code
  • 86. Problem 3/4: Data set too large for RAM ● R cannot handle data indices which are larger than 2 billion (16GB double, 4GB in Windows XP) ● modern biological data can have several dozen GB (e.g. Next Generation Sequencing) ● If the R-object representing the data set grows larger than the available RAM, R stops with an error reading "Cannot allocate vector of xx byte". Possible solution: R package bigmemory (based on C++ libraries for big data objects) 2 areas of usage: ● if several processes operate on the same big matrix ● file-backed matrices if data sets are larger than available main memory, and the combination of both situations
  • 87. R-Package bigmemory Essential functions: ● big.matrix(): for creating a big matrix (useful if RAM is large enough but several processes have to access the matrix) ● filebacked.big.matrix(): for creating a file backed matrix (necessary if main memory is too small) ● describe(): creates a descriptor file for an existing (filebacked) bigmatrix-object ● bigmatrix[i1,i2]: the bigmatrix objects can be handled in R code as normal matrix objects, i.e. their elements can be accessed using brackets
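A minimal sketch of the shared-memory workflow behind these functions (a hedged example, not from the slides; the file name 'desc_shared.RData' is made up): one process creates and describes the matrix, a second process attaches it.
library(bigmemory)
# process 1: create a shared big.matrix and save its descriptor
x <- big.matrix(nrow=1000, ncol=1000, type='double', init=0)
desc <- describe(x)
save(desc, file='desc_shared.RData')   # hypothetical file name
# process 2 (a second R session on the same node): attach and use the matrix
load('desc_shared.RData')
y <- attach.big.matrix(desc)
y[1, 1] <- 42   # visible to process 1 as x[1, 1]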
  • 88. bigmemory: code example
### write
data(golub)
library(bigmemory)
setwd('~/tmp/bigmem')
X <- as.matrix(golub[,-1])
# create filebacked.big.matrix and write data into its elements
z <- filebacked.big.matrix(nrow=30*5000, ncol=ncol(X), type='double', backingfile="magolub.bin", descriptorfile="magolub.desc")
k <- 0
for(i in 1:5000){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+(k*30),] <- X[inds,]
  k <- k+1}
# create and save descriptorfile for later usage
desc <- describe(z)
save(desc, file='desc_z.RData')
  • 89. bigmemory: code example
### read
library(bigmemory)
setwd('tmp/bigmem')
# load descriptorfile
load('desc_z.RData')
# attach bigmatrix object using the descriptor file
y <- attach.big.matrix(desc)
# access elements
y[1:10,7]
# read element 7 in the 5th row
b <- y[5,7]
# compute sum of a submatrix
(sum1 <- sum(y[1:10,5:20]))
  • 90. bigmemory: exercise Exercise for you: 1. create a bigmatrix object using big.matrix() 2. create a descriptor and save it 3. start another R-session on the same node 4. load the descriptor file and attach the bigmatrix 5. use the bigmatrix object for communication between both R processes
  • 91. Gaining Flexibility: doRedis ● separates job administration and execution ● subtasks are stored in a redis data base o master process sends subtasks of a computation to the server o worker can log in and request the tasks o all necessary R objects are stored in the redis server, too ● necessary software: o R-packages: rredis, doRedis o data base: redis-server (debian-package)
  • 92. doRedis: essential functionality ● Master process: o registerDoRedis(jobqueue,host): connects to the redis-server at 'host' and specifies a jobqueue for the tasks to come o foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to redis data base o redisFlushAll(): clears the data base o removeQueue(): removes a queue from the data base ● Worker process: o registerDoRedis(jobqueue,host): registers a jobqueue whose tasks shall be processed o startLocalWorkers(n,jobqueue,host): starts n local worker processes which process the tasks specified in jobqueue (uses multicore) o redisWorker(jobqueue,host): useful in mpi-environments usually users do not request or set the data base values directly typical parallelization as known from other "Do-packages"
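A minimal sketch combining the calls listed above (a hedged example; it assumes a redis-server is already running on localhost, and the queue name 'jobs' and worker count are arbitrary):
library(doRedis)
# master process: register a queue, spawn workers on this node, submit ten subtasks
registerDoRedis('jobs')
startLocalWorkers(n=2, queue='jobs')     # two local worker processes pulling from 'jobs'
res <- foreach(i=1:10) %dopar% sqrt(i)   # subtasks are distributed via the redis queue
removeQueue('jobs')                      # clean up the queue afterwards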
  • 93. Worker processes can run on any R-compatible hardware and can connect at any time: diagram of a redis-server to which the doRedis master sends jobs and objects, and which distributes jobs and objects to workers on NODE 1 to NODE 4 (worker 1a ... worker 4z) that eventually return results. ● robust ● flexible ● dynamic
  • 94. doRedis: code example Master (sending subtasks to redis-server and waiting for results):
# redis-server ~/redis/redis-2.2.14/redis.conf (in linux shell, starts the redis-server)
# cross-validation of classification on microarray data
library(CMA)
X <- as.matrix(golub[,-1])
y <- golub[,1]
ls <- GenerateLearningsets(y=y, method='CV', fold=10, niter=10000)
# function to be applied on each node
cl2 <- function(j){
  require(CMA)
  ttt <- system.time(cl <- svmCMA(y=y, X=X, learnind=ls@learnmatrix[j,], cost=10))
  list(cl, ttt, Sys.info())}
# connect to redis-server, send subtasks and wait for results
library(doRedis)
redisFlushAll()
registerDoRedis('jobscmanew')
numtodo <- nrow(ls@learnmatrix)
lll3 <- foreach(j=1:numtodo) %dopar% {cl2(j)}
  • 95. doRedis: code example Worker processes (connect to server, receive subtasks and objects, return results):
### using multicore (just two lines)
# register jobqueue from redis-server
registerDoRedis('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
# start 10 local workers
startLocalWorkers(n=10, queue='jobscmanew')
### using MPI
# function to be run by each mpi-process
startdr <- function(ll){
  library(doRedis)
  redisWorker('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
}
# start rmpi
library(Rmpi)
numworker <- mpi.universe.size()
mpi.spawn.Rslaves()
# let each mpi-process connect to redis-server and perform subtasks
mpi.apply(1:numworker, startdr)
  • 96. doRedis: exercise 1. connect to the redis server in R 2. submit a job queue 3. start workers to perform the subtasks 4. set a value for variable xnewinteger (use redisSet()) 5. request the value of variable xnewinteger (use redisGet())
  • 97. ● redis and doRedis provide high flexibility for performing independent subtasks o worker processes can connect at any time o errors in individual processes do not stop the entire computation (robustness) o worker processes can run on totally different architectures o worker processes can run all around the world ● disadvantage: database can become a bottleneck if large R objects have to be stored/sent solution: separation of large data objects (bigmemory) and job tasks (redis) Combining doRedis and bigmemory
  • 98. Separate task and data channel: Combining doRedis and bigmemory
  • 99. doredis/bigmemory: Code Example worker process:
redisbigreadwrite <- function(procind){
  require(CMA)
  require(bigmemory)
  j <- procind
  setwd('~/tmp/bigmemlrz')
  load('desc_z.RData')     # big data object containing many gene expression sets
  load('desc_out.RData')   # big data file for misclassification rates
  z <- attach.big.matrix(desc)
  out <- attach.big.matrix(descout)
  load('descresmat.RData')
  resmat <- attach.big.matrix(descresmat)   # big data object for simulating large writing operation
  for(iter in 1:10){
    start <- (j-1)*30*10*10 + (iter-1)*30*10 + 1
    X <- z[start:(start+299),]   # read gene expression matrix
    cl <- svmCMA(y=sample(c(1,2), nrow(X), replace=T), X=X, learnind=1:25, cost=10)   # construct classifier
    out[(j-1)*10+iter] <- mean(abs(cl@y-cl@yhat))   # compute misclassification rate
    resmat[start:(start+299),] <- X   # write X
  }
  # flush
  flush(resmat); flush(out)}
  • 100. doredis/bigmemory: Code Example master process:
### create bigmatrix (gene expressions)
library(bigmemory)
setwd('~/tmp/bigmemlrz')
X <- as.matrix(golub[,-1])
z <- filebacked.big.matrix(nrow=30*1500, ncol=ncol(X), type='double', backingfile="magolub.bin", descriptorfile="magolub.desc")
for(i in 1:1500){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+((i-1)*30),] <- X[inds,]}
# create descriptor file and save it for other processes
desc <- describe(z)
save(desc, file='desc_z.RData')
### doredis part
library(doRedis)
registerDoRedis('rwbigmem')
lll3 <- foreach(j=1:1500) %dopar% redisbigreadwrite(j)
results are returned in a file-backed object, so the master could quit
  • 101. doredis/bigmemory: code example main difference: underlying network and network file system: IBE (NFS) vs. LRZ (NAS)
  • 102. comparison to standard MPI-IO approach Difference: MPI less flexible ● not robust ● collective open/close calls (code panels: Fortran90 MPI-IO implementation vs. R bigmemory implementation)
  • 103. doRedis/bigmemory: Exercise 1. run the previous example using only two doredis-workers which perform only a single task 2. rewrite the previous example such that the proportion of class 1 predictions is returned 3. try to rewrite the previous example such that each worker process reads 10 subdatasets at a time and then constructs a classifier for each of the ten subdatasets read in 4. create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension 200x10000) using random numbers and run the previous example using that input 'bigmatrix'
  • 104. Thanks for your attention. Further questions? The End