Slides meyer116

The era of genomic evaluation has brought the need to perform computations involving large, dense matrices. Particular tasks are the computation and inversion of the genomic relationship matrix. This paper investigates the suitability of Graphics Processing Units together with highly optimised software libraries for these computations, using blocked algorithms. It is shown that calculations are readily sped up by parallel processing, using freely available library routines, and that reductions in time by factors of 4 to 5 are achievable even for 'consumer' grade graphics cards.

  1. Utility of Graphics Processing Units for dense matrix calculations in computing and inverting genomic relationship matrices. Karin Meyer and Bruce Tier, Animal Genetics and Breeding Unit, University of New England, Armidale, Australia. AAABG 2013.
  2. GPUs for GRMs | Introduction: Computing in the genomics age
     Requirements for mixed model analyses have changed!
     Pre-genomics:
       – mixed model equations sparse (few non-zero elements)
       – inverse of pedigree-based relationship matrix (NRM)
       – concentrated on exploiting sparsity in computations
     Now:
       – genomic relationship matrix (GRM) dense (large proportion of elements non-zero)
       – require manipulation of large, dense matrices to compute the GRM, invert the GRM, and predict breeding values in the single-step approach
       – computationally demanding
       – exploit new hardware and software libraries: computing in parallel
     K. M. | 2 / 16
  3. Introduction: Objectives
     A first look to examine the scope for speeding up computations to calculate and invert the GRM
       – using parallel computations
       – exploiting a Graphics Processing Unit (GPU)
     “An excursion into the land of computer gaming, modern compilers and software libraries”
  4. Introduction: What is a GPU?
     Graphics Processing Unit
     Initially: co-processor for computer gaming
       – graphics card
       – fast recalculation of pixel values
       – remember: i387 math processor for i386
     Now: GP-GPU, adapted to General Purpose computing
       – hundreds to thousands of cores
       – capable of double precision calculations
       – turns a desktop PC into a personal super-computer
       – but: limited memory (currently 6 GB max.)
     Specialised interface for application programming
       – CUDA (proprietary to NVIDIA) → Fortran interface
       – OpenCL (C++ like)
  5. Introduction: Tesla K20X
       – 2688 cores
       – 6 GB RAM
       – Peak: 3.95/1.31 Tflops (single/double precision)
  6. Calculating the GRM: Memory needed for $G = ZZ'$
     Gigabytes, single precision matrices

          n     G (n×n)    Z (n×s)     Z (n×s)     Z (n×s)
                           s = 100K    s = 500K    s = 1000K
      5 000       0.1         1.9         9.3        18.6
     10 000       0.4         3.7        18.6        37.3
     20 000       1.5         7.5        37.3        74.5
     30 000       3.4        11.2        55.9       111.8
     50 000       9.3        18.6        93.1       186.3

     → break calculations into blocks
     → use out-of-core storage
     → parallel computing literature
  7. Calculating the GRM: Blockwise multiplication
     Z: n animals × s SNPs; G: symmetric, $s\,n(n+1)/2$ flops
     All-in-one: $G = ZZ'$
     Column-wise division (Z split into column blocks $Z_{\cdot 1}, \ldots, Z_{\cdot 4}$): $G = Z_{\cdot 1} Z_{\cdot 1}' + \cdots + Z_{\cdot 4} Z_{\cdot 4}' = \sum_k Z_{\cdot k} Z_{\cdot k}'$
  8. Calculating the GRM: Blockwise multiplication (cont.)
     Row-wise division: $G_{ij} = Z_{i\cdot} Z_{j\cdot}'$
     Row- and column-wise division: $G_{ij} = \sum_k Z_{ik} Z_{jk}'$
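  9. Inverting the GRM: Blockwise inversion
     $$G = \begin{pmatrix} G_{PP} & G_{PC} & G_{PT} \\ G_{CP} & G_{CC} & G_{CT} \\ G_{TP} & G_{TC} & G_{TT} \end{pmatrix}$$
     Blocks: C current (→ chosen size), P previous, T trailing
     1. Subdivide matrix into blocks
     2. Factor & invert chol($G_{CC}$)
     3. Adjust $G_{PP}$ & $G_{PC}$
     4. ‘Absorb’ into $G_{TT}$ & $G_{CT}$
     5. Calculate inverse
     6. Redefine C, P & T; repeat
     Gauss-Jordan elimination (the extracted text carries no transpose marks; they are placed here assuming chol returns the upper-triangular factor):
       $G_{CC} := \mathrm{chol}(G_{CC})$
       $G_{CC} := G_{CC}^{-1}$
       $G_{PC} := G_{PC} G_{CC}$
       $G_{PP} := G_{PP} + G_{PC} G_{PC}'$
       $G_{CT} := G_{CC}' G_{CT}$
       $G_{TT} := G_{TT} - G_{CT}' G_{CT}$
       $G_{PT} := G_{PT} - G_{PC} G_{CT}$
       $G_{CT} := -(G_{CC} G_{CT})$
       $G_{PC} := G_{PC} G_{CC}'$
       $G_{CC} := G_{CC} G_{CC}'$
     → break matrix products into blocks as needed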
  10. Computing environment: Hardware
      ‘Standard’ multi-core CPU
        – Intel i7-3930K
        – 3.2 GHz, 12 MB cache
        – 64 GB RAM
        – 6 cores
      GPU
        – Nvidia GTX 580
        – 512 cores
        – 3 GB RAM
        – capable of double precision calculations
      → high-end gaming card, not GP-GPU
  11. Computing environment: Software
      Linux (CentOS 6)
      gfortran, gcc 4.4.7; Intel Composer XE 13.0.1
      Intel MKL library 11.0
        – BLAS: SSYRK, SGEMM, STRMM
        – LAPACK: SPOTRF, SPOTRI, STRTRI
      CUDA 5.0
        – CUBLAS: CUBLAS_SSYRK, CUBLAS_SGEMM, CUBLAS_STRTRI, CUBLAS_STRMM
      CULA dense R16a
        – CULA_SPOTRF, CULA_SPOTRI, CULA_STRTRI
      MAGMA 1.3
        – MAGMAF_SLAUUM
  12. Results: Time to ‘make’ GRM (single precision calculations, s = 500 000)
      [Figure: system time (min, log scale) against number of animals (0 to 80 000) for GPU, CPU6 and CPU1, with reference lines at 1 hour and 5 hours; second panel: ratios GPU / CPU1 and CPU6 / CPU1.]
  13. Results: Time to invert GRM (single precision calculations)
      [Figure: system time (min, log scale) against number of animals (0 to 80 000) for GPU, CPU6 and CPU1, with a reference line at 1 hour; second panel: ratios GPU / CPU1 and CPU6 / CPU1.]
  14. Results: Time to invert GRM (double precision calculations)
      [Figure: system time (min, log scale) against number of animals (0 to 80 000) for GPU, CPU6 and CPU1, with a reference line at 1 hour; second panel: ratios GPU / CPU1 and CPU6 / CPU1.]
  15. Conclusions
      Build & invert GRM → dense matrix multiplications → highly parallelisable
      BLAS and LAPACK library routines awesome
        – highly optimised
        – exploit memory architecture of modern processors
        – multi-threaded versions
      GPU: powerful new hardware for parallel computing
        – thousands of processors
        – can use multiple GPUs simultaneously
        – BLAS/LAPACK etc. libraries available
        – but: limited memory → blocked algorithms
        – but: specialised interface, not much Fortran support yet
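
The memory figures on slide 6 follow directly from the matrix dimensions. The sketch below recomputes them, assuming 4 bytes per single precision value and a gigabyte of 2^30 bytes; the function name `matrix_gib` is illustrative and not taken from the slides.

```python
GIB = 2**30  # assumption: "gigabyte" taken as 2^30 bytes


def matrix_gib(rows, cols, bytes_per_value=4):
    """Storage in gigabytes for a dense rows x cols matrix.

    bytes_per_value=4 corresponds to single precision.
    """
    return rows * cols * bytes_per_value / GIB


# Recompute the table: G is n x n, Z is n x s.
for n in (5_000, 10_000, 20_000, 30_000, 50_000):
    g = matrix_gib(n, n)
    z = [matrix_gib(n, s) for s in (100_000, 500_000, 1_000_000)]
    print(f"{n:>6d}  G: {g:6.1f}  Z: " + "  ".join(f"{v:6.1f}" for v in z))
```

Even at n = 5 000, Z with 500 000 SNPs already exceeds the 6 GB of the largest card mentioned in the slides, which is why the calculations have to be split into blocks.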
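
The blockwise multiplications on slides 7 and 8 can be sketched in NumPy as follows. This is a minimal CPU illustration of the partitioning only, not the authors' Fortran/CUDA code; the function names and block sizes are hypothetical, and each `@` product marks where a library routine such as SSYRK or SGEMM would be called.

```python
import numpy as np


def grm_by_column_blocks(Z, block_cols):
    """Column-wise division: G = sum_k Z.k Z.k' over column blocks of Z."""
    n, s = Z.shape
    G = np.zeros((n, n), dtype=Z.dtype)
    for start in range(0, s, block_cols):
        Zk = Z[:, start:start + block_cols]  # n x (up to block_cols) slab
        G += Zk @ Zk.T                       # symmetric rank-k update
    return G


def grm_by_tiles(Z, block_rows, block_cols):
    """Row- and column-wise division: G_ij = sum_k Z_ik Z_jk'.

    Only tiles with j >= i are computed; the rest follow by symmetry.
    """
    n, s = Z.shape
    G = np.zeros((n, n), dtype=Z.dtype)
    for i in range(0, n, block_rows):
        for j in range(i, n, block_rows):
            rows_i = min(block_rows, n - i)
            rows_j = min(block_rows, n - j)
            tile = np.zeros((rows_i, rows_j), dtype=Z.dtype)
            for k in range(0, s, block_cols):
                Zik = Z[i:i + block_rows, k:k + block_cols]
                Zjk = Z[j:j + block_rows, k:k + block_cols]
                tile += Zik @ Zjk.T
            G[i:i + block_rows, j:j + block_rows] = tile
            if j > i:
                G[j:j + block_rows, i:i + block_rows] = tile.T
    return G
```

Column blocks keep all of G in memory and stream Z through it; tiling both ways bounds the working set by the block sizes, at the cost of reading each block of Z more than once.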
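
The blockwise inversion steps on slide 9 can be checked with a small NumPy sketch. It rests on two assumptions that the slides do not state: `chol` is taken to return the upper-triangular factor U with G_CC = U'U, and `np.linalg.inv` stands in for a triangular inversion routine such as STRTRI. The function name `blocked_spd_inverse` is hypothetical.

```python
import numpy as np


def blocked_spd_inverse(G, block):
    """Invert a symmetric positive definite matrix block by block.

    Only the upper block triangle is kept up to date while sweeping;
    the result is symmetrised from it at the end.
    """
    G = np.array(G, dtype=float)  # work on a copy
    n = G.shape[0]
    for start in range(0, n, block):
        end = min(start + block, n)
        P, C, T = slice(0, start), slice(start, end), slice(end, n)

        # Factor and invert: G_CC = U'U, keep U^{-1}
        U = np.linalg.cholesky(G[C, C]).T
        Uinv = np.linalg.inv(U)

        # Adjust the previous blocks
        G[P, C] = G[P, C] @ Uinv
        G[P, P] += G[P, C] @ G[P, C].T

        # 'Absorb' the current block into the trailing blocks
        G[C, T] = Uinv.T @ G[C, T]
        G[T, T] -= G[C, T].T @ G[C, T]
        G[P, T] -= G[P, C] @ G[C, T]

        # Complete the inverse for the current block row and column
        G[C, T] = -(Uinv @ G[C, T])
        G[P, C] = G[P, C] @ Uinv.T
        G[C, C] = Uinv @ Uinv.T

    # Mirror the upper triangle into the lower triangle.
    return np.triu(G) + np.triu(G, 1).T
```

Every step in the loop is a triangular or general matrix product on blocks of chosen size, so each can itself be split further when a block does not fit in GPU memory.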
