A fast implementation of matrix-matrix product in double-double precision
on NVIDIA C2050 and application to semidefinite programming

Nakata Maho*† (maho@riken.jp), Yasuyoshi Takao††, Noda Shigeho†, Himeno Ryutaro†

† RIKEN, Advanced Center for Computing and Communication
†† JFE Tech

International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15
Overview

- Introduction of this research in a slide.
- The importance of high-precision arithmetic.
- The double-double precision: a cheap and easy route to quadruple precision, and its details.
- Matrix-matrix multiplication (Rgemm) in MPACK (a high-precision version of BLAS and LAPACK).
- Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than the CPU.
- Application: acceleration of the semidefinite programming solver "SDPA-DD": 10 times faster than the CPU.
- Summary.
Introduction of this research in a slide.

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU x 150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. matrix dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE; the fastest combination reaches about 26 GFLOPS.]

+ Application: semidefinite programming, GPU = CPU x 10.
More accuracy is needed towards peta- and exa-scale computing

Exa-scale computing means about 10^23 FLOP for just one week of calculation.
Scientific computing may suffer from accuracy loss at this scale.
More accuracy is needed towards peta- and exa-scale computing

Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].
More accuracy is needed towards peta- and exa-scale computing

Semidefinite programming (SDP): the condition number diverges at the optimum. Therefore, it can be very hard to obtain an accurate solution [Nakata et al. 2008], [Nakata 2009], [Waki-Nakata-Muramatsu].

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90); the condition number climbs to about 1e20.]
Accelerating high-precision operations on a GPU is a good idea

Double-double precision is a cheap and fast route to high precision:
- accurate enough for many purposes: almost as accurate as quadruple precision.
- fast: each operation takes only 8-24 double precision operations.
- operation intensive: demands far more arithmetic than memory bandwidth.
Implementing it on a GPU is a good idea:
- fast: 515 GFLOPS on an NVIDIA C2050 vs. 100-200 GFLOPS on CPUs.
- cheap: an NVIDIA C2050 costs about $2000, a workstation $5000-$10000.
- no complex operations are required: well suited to GPUs.
The double-double precision: handy and easy quadruple precision

"754-2008 IEEE Standard for Floating-Point Arithmetic": the binary64 (a.k.a. double precision) format has 16 significant decimal digits.

Widely used and very fast: Core i7 920 ~40 GFLOPS; RADEON HD7970 ~1000 GFLOPS; K computer over 10 PFLOPS.

A rounding error may occur in every arithmetic operation.
The double-double precision: handy and easy quadruple precision

A double-double precision number a is expressed by two double precision numbers a_hi and a_lo:

    a = (a_hi, a_lo).
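As a concrete picture, here is a minimal C++ sketch of this representation. The QD library's dd_real class (introduced later in this deck) is the real implementation; this struct is only illustrative.

```cpp
// A minimal sketch of the (hi, lo) representation: the value is
// a_hi + a_lo, where a_lo holds the rounding error that a_hi alone
// cannot represent (|a_lo| <= half an ulp of a_hi).
struct dd_real {
    double hi;  // leading part: the double closest to the full value
    double lo;  // trailing part: the leftover rounding error
};
```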
The double-double precision: handy and easy quadruple precision

Knuth's Theorem
Error-free transformation of two floating point numbers a, b:

    a + b = (a ⊕ b) + e,

where ⊕ is floating-point addition including rounding error, + is exact addition, and e is a floating point number.

We can evaluate the rounding error of an addition exactly!
The double-double precision: handy and easy quadruple precision

Dekker's Theorem
Error-free transformation of two floating point numbers a, b:

    a × b = (a ⊗ b) + e,

where ⊗ is floating-point multiplication including rounding error, × is exact multiplication, and e is a floating point number.

We can evaluate the rounding error of a multiplication exactly!
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operations including rounding error. When |a| ≥ |b|, we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in three operations:

Quick-Two-Sum(a, b):
  1. s ← a ⊕ b
  2. e ← b ⊖ (s ⊖ a)
  3. return (s, e)
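A direct C++ transcription might look as follows (a sketch; compile without value-unsafe optimizations such as -ffast-math so the compiler preserves the exact operation order).

```cpp
#include <utility>

// Quick-Two-Sum: requires |a| >= |b|. Returns (s, e) such that
// a + b = s + e exactly, in three floating-point operations.
inline std::pair<double, double> quick_two_sum(double a, double b) {
    double s = a + b;        // 1. s <- a (+) b
    double e = b - (s - a);  // 2. e <- b (-) (s (-) a)
    return {s, e};           // 3.
}
```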
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem we can also define "Two-Sum(a, b)", which needs no assumption on the magnitudes of a, b and calculates s = a ⊕ b and e = a + b − (a ⊕ b) exactly in six operations:

Two-Sum(a, b):
  1. s ← a ⊕ b
  2. v ← s ⊖ a
  3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
  4. return (s, e)
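The corresponding C++ sketch:

```cpp
// Two-Sum: no ordering assumption on |a| and |b|. Returns (s, e)
// such that a + b = s + e exactly, in six floating-point operations.
inline std::pair<double, double> two_sum(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}
```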
The double-double precision: handy and easy quadruple precision

Basics: Dekker's Theorem
There exists an algorithm which calculates s = a ⊗ b and e = a × b − (a ⊗ b), where ⊗ is floating-point multiplication including rounding error, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations.

Split(a):
  1. t ← (2^27 + 1) ⊗ a
  2. a_hi ← t ⊖ (t ⊖ a)
  3. a_lo ← a ⊖ a_hi
  4. return (a_hi, a_lo)

Two-Prod(a, b):
  1. p ← a ⊗ b
  2. (a_hi, a_lo) ← Split(a)
  3. (b_hi, b_lo) ← Split(b)
  4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
  5. return (p, e)
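C++ sketches of both routines, using the pair-returning style of the helpers above:

```cpp
// Split: cuts a double into two 26-bit halves, a = a_hi + a_lo, so
// that products of the halves are exact. 2^27 + 1 = 134217729.
inline std::pair<double, double> split(double a) {
    double t  = 134217729.0 * a;
    double hi = t - (t - a);
    double lo = a - hi;
    return {hi, lo};
}

// Two-Prod: returns (p, e) such that a * b = p + e exactly,
// in 17 floating-point operations (C++17 structured bindings).
inline std::pair<double, double> two_prod(double a, double b) {
    double p = a * b;
    auto [ahi, alo] = split(a);
    auto [bhi, blo] = split(b);
    double e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
    return {p, e};
}
```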
The double-double precision: handy and easy quadruple precision

Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE":

QuadAdd-IEEE(a, b):
  1. (s_hi, e_hi) = Two-Sum(a_hi, b_hi)
  2. (s_lo, e_lo) = Two-Sum(a_lo, b_lo)
  3. e_hi = e_hi ⊕ s_lo
  4. (s_hi, e_hi) = Quick-Two-Sum(s_hi, e_hi)
  5. e_hi = e_hi ⊕ e_lo
  6. (c_hi, c_lo) = Quick-Two-Sum(s_hi, e_hi)
  7. return c = (c_hi, c_lo)
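Using the dd_real struct and helpers sketched above, a C++ sketch of QuadAdd-IEEE (following the QD library's accurate addition):

```cpp
// QuadAdd-IEEE: accurate double-double addition, 20 operations.
inline dd_real quad_add_ieee(dd_real a, dd_real b) {
    auto [s1, e1] = two_sum(a.hi, b.hi);        // 1. add high parts
    auto [s2, e2] = two_sum(a.lo, b.lo);        // 2. add low parts
    e1 += s2;                                   // 3.
    auto [t1, t2] = quick_two_sum(s1, e1);      // 4. renormalize
    t2 += e2;                                   // 5.
    auto [c_hi, c_lo] = quick_two_sum(t1, t2);  // 6. renormalize again
    return {c_hi, c_lo};                        // 7.
}
```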
The double-double precision: handy and easy quadruple precision

Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul":

QuadMul(a, b):
  1. (p_hi, p_lo) = Two-Prod(a_hi, b_hi)
  2. p_lo = p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
  3. (c_hi, c_lo) = Quick-Two-Sum(p_hi, p_lo)
  4. return c = (c_hi, c_lo)
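And the matching C++ sketch:

```cpp
// QuadMul: double-double multiplication, 24 operations without FMA.
inline dd_real quad_mul(dd_real a, dd_real b) {
    auto [p_hi, p_lo] = two_prod(a.hi, b.hi);       // 1. exact product
    p_lo += a.hi * b.lo + a.lo * b.hi;              // 2. cross terms
    auto [c_hi, c_lo] = quick_two_sum(p_hi, p_lo);  // 3. renormalize
    return {c_hi, c_lo};                            // 4.
}
```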
The double-double precision: handy and easy quadruple precision

The FMA (fused multiply-add) instruction calculates

    a × b + c

in one operation: it computes a × b + c exactly, then rounds once to double precision.
The double-double precision: handy and easy quadruple precision

Faster: using the FMA instruction, Two-Prod shrinks to 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA):

Two-Prod-FMA(a, b):
  1. p ← a ⊗ b
  2. e ← FMA(a × b − p)
  3. return (p, e)
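In C++, std::fma exposes the hardware FMA, so Two-Prod-FMA is a two-liner (a sketch):

```cpp
#include <cmath>

// Two-Prod-FMA: std::fma(a, b, -p) computes a*b - p with a single
// rounding, which is exactly the error of the rounded product p.
inline std::pair<double, double> two_prod_fma(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);
    return {p, e};
}
// QuadMul-FMA is QuadMul with two_prod replaced by two_prod_fma.
```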
The double-double precision: handy and easy quadruple precision

Faster: lower-accuracy operations.

QuadAdd-Cray(a, b):
  1. (c_hi, c_lo) = Two-Sum(a_hi, b_hi)
  2. c_lo = c_lo ⊕ (a_lo ⊕ b_lo)
  3. (c_hi, c_lo) = Quick-Two-Sum(c_hi, c_lo)
  4. return c

QuadMul-Sloppy(a, b):
  1. p = a_hi ⊗ b_lo
  2. q = a_lo ⊗ b_hi
  3. t = p ⊕ q
  4. c_hi = FMA(a_hi × b_hi + t)
  5. e = FMA(a_hi × b_hi − c_hi)
  6. c_lo = e ⊕ t
  7. return c
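C++ sketches of the two variants, reusing the helpers above; both trade a little accuracy for fewer operations.

```cpp
// QuadAdd-Cray: skips the second renormalization pass (11 operations).
inline dd_real quad_add_cray(dd_real a, dd_real b) {
    auto [c_hi, c_lo] = two_sum(a.hi, b.hi);        // 1.
    c_lo += a.lo + b.lo;                            // 2.
    auto [r_hi, r_lo] = quick_two_sum(c_hi, c_lo);  // 3.
    return {r_hi, r_lo};                            // 4.
}

// QuadMul-Sloppy: drops the a_lo*b_lo term (8 operations with FMA).
inline dd_real quad_mul_sloppy(dd_real a, dd_real b) {
    double t    = a.hi * b.lo + a.lo * b.hi;        // 1.-3. cross terms
    double c_hi = std::fma(a.hi, b.hi, t);          // 4.
    double e    = std::fma(a.hi, b.hi, -c_hi);      // 5.
    double c_lo = e + t;                            // 6.
    return {c_hi, c_lo};                            // 7.
}
```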
The double-double precision: handy and easy quadruple precision

Summary: operation count of each double-double arithmetic routine.

    Algorithm             # of operations
    Quick-Two-Sum               3
    Two-Sum                     6
    Split                       4
    Two-Prod                   17
    Two-Prod-FMA                3*
    QuadAdd-IEEE               20
    QuadAdd-Cray               11
    QuadMul                    24
    QuadMul-FMA                10*
    QuadMul-FMA-Sloppy          8*

* FMA counted as 2 FLOPs.
We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.
The double-double precision: handy and easy quadruple precision

QD library
Features: a C++ class library; the double-double precision type is "dd_real". Free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey.
Download: http://crd.lbl.gov/~dhbailey/mpdist/
Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf
Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.
Implementation on GPU and performance evaluation

We accelerated the matrix-matrix multiplication routine "Rgemm". Prototype of Rgemm:

    void Rgemm(const char *transa, const char *transb,
               mpackint m, mpackint n, mpackint k, dd_real alpha,
               dd_real * A, mpackint lda, dd_real * B, mpackint ldb,
               dd_real beta, dd_real * C, mpackint ldc)

"MPACK" by M. Nakata is a multiple-precision version of BLAS and LAPACK (the de facto standard linear algebra packages).

http://mplapack.sourceforge.net/

("Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.)
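A usage sketch: computing C ← αAB + βC for column-major square matrices. The header names are our assumptions about an MPACK/QD installation, not taken from the slides; check your own build.

```cpp
#include <qd/dd_real.h>   // dd_real (QD library) -- assumed install path
#include <mblas_dd.h>     // Rgemm, mpackint      -- assumed MPACK header

int main() {
    mpackint n = 1000;
    dd_real *A = new dd_real[n * n];
    dd_real *B = new dd_real[n * n];
    dd_real *C = new dd_real[n * n];
    for (mpackint i = 0; i < n * n; i++) A[i] = B[i] = C[i] = 1.0;
    // C <- 1.0 * A * B + 0.0 * C, column-major, no transposes
    Rgemm("n", "n", n, n, n, dd_real(1.0), A, n, B, n, dd_real(0.0), C, n);
    delete[] A; delete[] B; delete[] C;
    return 0;
}
```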
Implementation on GPU and performance evaluation

Related work
- D. Mukunoki and D. Takahashi: "Implementation of double-double matrix-matrix multiplication on GPU", HPCS, pp. 148-156 (2011). → The matrix size must be a multiple of 64; slower than our implementation.
- N. Nakasato: "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, Louisiana, USA, 2010. → The matrix size must be a multiple of 64; faster than our implementation.
Neither implementation is practical → we implemented Rgemm for general use.
Implementation on GPU and evaluation

[Figure: NVIDIA C2050 GPU architecture.]
Implementation on GPU and evaluation

Block algorithm: we divide the matrices into small blocks of sizes b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.
Implementation on GPU and evaluation

Basic algorithm:
1. Transfer the A, B, C matrices from CPU memory to GPU global memory.
2. Blocking: Ab is 16 × 16 and Bb is 16 × 64 (the most efficient choice).
3. Apply a 16 × 16 = 256-thread block to each pair of blocks: the (i, j)-th thread of a thread block works on the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb (four columns at the same time).
Implementation on GPU and evaluation

Operation of each thread in detail (a kernel sketch follows this list):
1. Multiply beta into c0, c1, c2, c3, the elements of the C matrix corresponding to the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb.
2. Read the first blocks Ab and Bb from global memory into shared memory; each thread of the block reads its own elements.
3. Calculate the inner products of the i-th row vector of Ab with the columns b_j, b_{j+16}, b_{j+32}, b_{j+48} of Bb, giving p0, p1, p2, p3.
4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0.
5. Read the next blocks Ab, Bb and repeat steps 3-4 until no further blocks are available.
6. Update the C matrix with c0, c1, c2, c3.
7. Finally, transfer the C matrix from GPU global memory back to the CPU.
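A minimal CUDA sketch of this scheme, assuming column-major storage, dimensions that are multiples of the block sizes (the pointer redirecting shown later removes this restriction), and device versions dd_add/dd_mul of QuadAdd/QuadMul; all names are ours, not the authors' actual kernel. For brevity it accumulates the full inner product first and applies alpha and beta at write-back, which is mathematically equivalent to steps 1 and 4.

```cpp
struct dd_real { double hi, lo; };                // as sketched earlier
__device__ dd_real dd_add(dd_real a, dd_real b);  // device QuadAdd (assumed)
__device__ dd_real dd_mul(dd_real a, dd_real b);  // device QuadMul (assumed)

#define BM 16
#define BK 16
#define BN 64

// launch: dim3 grid(n / BN, m / BM), block(16, 16)
__global__ void rgemm_nn_sketch(int m, int n, int k,
                                dd_real alpha, const dd_real *A, int lda,
                                const dd_real *B, int ldb,
                                dd_real beta, dd_real *C, int ldc)
{
    __shared__ dd_real Ab[BK][BM];  // 16x16 block of A
    __shared__ dd_real Bb[BK][BN];  // 16x64 block of B

    const int i  = blockIdx.y * BM + threadIdx.y;  // row of C
    const int j0 = blockIdx.x * BN + threadIdx.x;  // first of 4 columns

    dd_real p[4] = {};  // inner-product accumulators p0..p3

    for (int kb = 0; kb < k; kb += BK) {
        // each thread loads its own elements into shared memory
        Ab[threadIdx.x][threadIdx.y] = A[i + (kb + threadIdx.x) * lda];
        for (int q = 0; q < 4; ++q)
            Bb[threadIdx.y][threadIdx.x + 16 * q] =
                B[(kb + threadIdx.y) + (j0 + 16 * q) * ldb];
        __syncthreads();

        for (int t = 0; t < BK; ++t)   // row of Ab times 4 columns of Bb
            for (int q = 0; q < 4; ++q)
                p[q] = dd_add(p[q], dd_mul(Ab[t][threadIdx.y],
                                           Bb[t][threadIdx.x + 16 * q]));
        __syncthreads();
    }
    for (int q = 0; q < 4; ++q) {      // C <- alpha*p + beta*C
        const int idx = i + (j0 + 16 * q) * ldc;
        C[idx] = dd_add(dd_mul(alpha, p[q]), dd_mul(beta, C[idx]));
    }
}
```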
Implementation on GPU and evaluation

Performance of the matrix-matrix operation in double-double precision, square matrices (m = n = k), varying m. The maximum kernel performance was 16.4 GFLOPS, or 16.1 GFLOPS with the CPU-GPU transfer included.

[Figure: GFLOPS vs. dimension (0-6000), NN-Kernel and NN-Total curves rising to about 16 GFLOPS.]
Implementation on GPU and evaluation

Performance of the matrix-matrix operation in double-double precision with matrix transposes, square matrices (m = n = k), varying m. No performance loss from matrix transposes is observed.

[Figure: GFLOPS vs. dimension (0-6000) for NN, NT, TN, TT, kernel and total; all curves nearly coincide.]
Implementation on GPU and evaluation

We observed no performance loss with matrix transposes; the reason is that we use texture memory instead.
- Global memory and texture memory are essentially the same, but with texture memory the performance loss from non-coalesced memory accesses is small.
- Moreover, memory-transfer latency is relatively easy to hide in double-double precision because the arithmetic is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra.
- A large performance loss (~35%) is observed when the matrix size is not a multiple of 64.
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra.
- Simple algorithm: if a pointer falls outside the block, return the value at the nearest edge.
- Very simple to program.
- Small performance loss.
Breakthrough!!
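A minimal sketch of the idea, assuming column-major storage; the helper name is ours, not from the paper.

```cpp
struct dd_real { double hi, lo; };  // as sketched earlier

// Pointer redirecting: threads whose row/column index falls outside
// the matrix are redirected to the nearest valid edge element, so the
// kernel can run as if the size were padded to a multiple of the
// block size, with no out-of-bounds reads.
__device__ inline const dd_real *redirect(const dd_real *M, int ld,
                                          int i, int rows,
                                          int j, int cols)
{
    if (i >= rows) i = rows - 1;  // clamp row to the last valid row
    if (j >= cols) j = cols - 1;  // clamp column to the last valid column
    return &M[i + j * ld];
}
```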
Implementation on GPU and evaluation

The performance loss was reduced from 35% to 6%!

[Figure: kernel and total GFLOPS vs. dimension (2050-2250); performance stays between about 14.6 and 16.4 GFLOPS.]
Implementation on GPU and evaluation

Performance varied by only 0.1% over repeated calculations.

[Figure: total GFLOPS over 100 repeated measurements, all within 15.5535-15.5575 GFLOPS.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurements on a Xeon 3470 with DDR3-1066.

    Algorithm                               Performance
    QuadAdd-Cray, QuadMul-Sloppy kernel     26.4 GFLOPS
    QuadAdd-Cray, QuadMul-Sloppy total      25.7 GFLOPS
    QuadAdd-Cray, QuadMul kernel            23.0 GFLOPS
    QuadAdd-Cray, QuadMul total             22.4 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy kernel     18.1 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy total      17.8 GFLOPS
    QuadAdd-IEEE, QuadMul kernel            16.4 GFLOPS
    QuadAdd-IEEE, QuadMul total             16.1 GFLOPS
    QuadAdd-IEEE, QuadMul CPU                100 MFLOPS
    QuadAdd-IEEE, QuadMul OpenMP CPU         400 MFLOPS
Implementation on GPU and evaluation

How close is 16.1 GFLOPS (QuadAdd-IEEE, QuadMul-FMA) to the peak performance?
Average operation count: QuadAdd-IEEE takes 20 ops, QuadMul-FMA 10 ops, and in Rgemm the same number of multiplications and additions appear:

    (20 + 10 − 1)/2 = 14.5

The approximate theoretical peak should then be

    515 GFLOPS / 14.5 = 35.5 GFLOPS,

of which 16.1 GFLOPS is about 45%. However, the C2050's peak is calculated assuming full use of FMA, which our computation does not achieve, thus

    515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS,

of which 16.1 GFLOPS is about 90%.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Application
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Semidefinite programming:

    Primal   min:  A_0 • X
             s.t.: A_i • X = b_i   (i = 1, 2, ..., m)
                   X ⪰ 0

    Dual     max:  Σ_{i=1}^m b_i z_i
             s.t.: Σ_{i=1}^m A_i z_i + Y = A_0
                   Y ⪰ 0

A_i: n × n symmetric matrices; X: n × n symmetric variable matrix; b_i: components of an m-dimensional vector; Y: n × n symmetric variable matrix; X • Y := Σ_ij X_ij Y_ij. X ⪰ 0 means X is positive semidefinite: all eigenvalues are greater than or equal to 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

The nature of optimality.

Theorem (complementary slackness)
Let (X*, Y*, z*) be a feasible interior point satisfying the primal and dual conditions of the SDP. Then the necessary and sufficient condition for optimality of (X*, Y*, z*) is

    X* • Y* = 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

When X*, Y* are optimal,

    X* • Y* = 0.

Then

    rank X* + rank Y* ≤ n                                    (1)

also follows.
At least one of X*, Y* is singular.
Usually both X* and Y* are singular → unstable and/or less accurate at the optimum.
How to solve SDP: the interior-point primal-dual path-following method

The world's best implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.
    Step 0: Set the initial point x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0. Let h = 0 and choose a parameter γ ∈ (0, 1).
    Step 1: Calculate the Schur complement matrix B ∈ S^m:
                B_ij = ((X^h)^{-1} F_i Y^h) • F_j
    Step 2: Solve the linear equation B dx = r, and calculate dX, dY from the solution dx; this gives the next step (dx, dX, dY).
    Step 3: Determine the step size α keeping the matrices positive semidefinite: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
    Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα(dx, dX, dY).
    Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends. Otherwise, go back to Step 1 with h = h + 1.
The Schur complement matrix becomes singular

B is called the "Schur complement matrix". We solve the linear equation B dx = r to determine the next step, and this linear system becomes singular!
Multiple-precision arithmetic is needed for accurate solutions!

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90); the condition number climbs to about 1e20.]
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Benchmark results on the larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.

    Problem     CPU (sec)    GPU (sec)    Acceleration
    equalG51      6531.9        573.2         11.4
    gpp500-1       902.0         72.2         12.5
    gpp500-4       638.0         74.8          8.5
    maxG32       36284.4       4373.1          8.3
    maxG55      521575.4      53413.1          9.8
    mcp500-4       539.1         65.2          8.3
    qpG11        16114.7       1408.0         11.4
    qpG51        39678.9       3299.2         12.0
    ss30           310.7        138.6          2.2
    theta5        3250.0        239.8         13.6
    theta6        9028.2        623.6         14.5
    thetaG51     49161.5       4870.4         10.1
Summary

http://mplapack.sourceforge.net/

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: CPU x150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]

HOKUSAIのベンチマーク 理研シンポジウム 中田分
HOKUSAIのベンチマーク 理研シンポジウム 中田分HOKUSAIのベンチマーク 理研シンポジウム 中田分
HOKUSAIのベンチマーク 理研シンポジウム 中田分
 
為替取引(FX)でのtickdataの加工とMySQLで管理
為替取引(FX)でのtickdataの加工とMySQLで管理為替取引(FX)でのtickdataの加工とMySQLで管理
為替取引(FX)でのtickdataの加工とMySQLで管理
 
為替のTickdataをDukascopyからダウンロードする
為替のTickdataをDukascopyからダウンロードする為替のTickdataをDukascopyからダウンロードする
為替のTickdataをDukascopyからダウンロードする
 
HPCS2015 pythonを用いた量子化学プログラムの開発と応用
HPCS2015 pythonを用いた量子化学プログラムの開発と応用HPCS2015 pythonを用いた量子化学プログラムの開発と応用
HPCS2015 pythonを用いた量子化学プログラムの開発と応用
 
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
3Dプリンタ導入記 タンパク質の模型をプリントする
3Dプリンタ導入記 タンパク質の模型をプリントする3Dプリンタ導入記 タンパク質の模型をプリントする
3Dプリンタ導入記 タンパク質の模型をプリントする
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming

  • 1. A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming. Nakata Maho (maho@riken.jp), Yasuyoshi Takao, Noda Shigeho, Himeno Ryutaro. RIKEN, Advanced Center for Computing and Communication; JFE Tech. International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15.
  • 2. Overview. Introduction of this research in a slide. Importance of high precision arithmetic. The double-double precision: a cheap and easy solution for quadruple precision, and its details. Matrix-matrix multiplication (Rgemm) in MPACK (high precision version of BLAS and LAPACK). Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than CPU. Application: acceleration of the semidefinite programming solver "SDPA-DD": 10 times faster than CPU. Summary.
  • 3. Introduction of this research in a slide. Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU x 150, peak performance 26 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.] Plus application: semidefinite programming, GPU = CPU x 10.
  • 5. More accuracy is needed towards peta- and exascale computing. Exascale computing means about 10^23 FLOP for just one week of calculation. Scientific computing may suffer from accuracy loss at this scale.
  • 8. More accuracy is needed towards peta- and exascale computing. Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].
  • 10. More accuracy is needed towards peta- and exascale computing. Semidefinite programming (SDP): the condition number diverges at the optimum, so it may be very hard to obtain an accurate solution [Nakata et al 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count.]
  • 14. Accelerating high precision operations on GPU is a good idea. Double-double precision is a cheap and fast solution for high precision: accurate enough for many purposes (almost as accurate as quadruple precision); fast (each operation takes only 8-24 double precision operations); operation intensive (it demands FLOPS rather than memory bandwidth). Implementing on GPU is a good idea: fast (515 GFLOPS on NVIDIA C2050 vs. 100-200 GFLOPS for a CPU); cheap (NVIDIA C2050 about $2000, a workstation $5000-$10000); double-double does not require complex operations, so it is suitable for GPU.
  • 22. The double-double precision: handy and easy quadruple precision. "754-2008 IEEE Standard for Floating-Point Arithmetic": the binary64 (aka double precision) format has 16 significant decimal digits. It is widely used and very fast (Core i7 920: ~40 GFLOPS; RADEON HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS). However, a rounding error may occur at every arithmetic operation.
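  To make the last point concrete, here is a minimal demo of our own (not from the slides; plain C++ that also compiles as CUDA host code) showing binary64 rounding on ordinary decimal inputs:

      // Minimal demo: binary64 rounds even trivial decimal arithmetic.
      #include <cstdio>

      int main() {
          double x = 0.1 + 0.2;          // neither 0.1 nor 0.2 is exact in binary64
          std::printf("%.17g\n", x);     // prints 0.30000000000000004
          std::printf("%d\n", x == 0.3); // prints 0: the sum rounded away from 0.3
          return 0;
      }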
  • 23. The double-double precision: handy and easy quadruple precision. A double-double number a is expressed by two double precision numbers a_hi, a_lo: a = (a_hi, a_lo), representing the value a_hi + a_lo.
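  A minimal sketch of this representation (field names ours; the QD library's actual dd_real class is richer):

      // Double-double (sketch): the represented value is hi + lo, where
      // hi is the double nearest the true value and lo carries its
      // rounding error, giving roughly 32 significant decimal digits.
      struct dd_real {
          double hi;
          double lo;
      };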
  • 24. The double-double precision: handy and easy quadruple precision. Knuth's theorem (error-free transformation of the sum of two floating point numbers a, b): a + b = (a ⊕ b) + e, where ⊕ is addition including rounding error, + is exact addition, and e is a floating point number. We can evaluate the rounding error of an addition exactly!
  • 25. The double-double precision: handy and easy quadruple precision. Dekker's theorem (error-free transformation of the product of two floating point numbers a, b): a × b = (a ⊗ b) + e, where ⊗ is multiplication including rounding error, × is exact multiplication, and e is a floating point number. We can evaluate the rounding error of a multiplication exactly!
  • 26. The double-double precision: handy and easy quadruple precision. Based on Knuth's theorem we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operations including rounding error. When |a| ≥ |b| we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in three operations:
      Quick-Two-Sum(a, b):
      1. s ← a ⊕ b
      2. e ← b ⊖ (s ⊖ a)
      3. return (s, e)
  (s, e) = Quick-Two-Sum(a, b)
  • 27. The double-double precision: handy and easy quadruple precision. For arbitrary a, b we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in six operations:
      Two-Sum(a, b):
      1. s ← a ⊕ b
      2. v ← s ⊖ a
      3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
      4. return (s, e)
  (s, e) = Two-Sum(a, b)
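  A direct transcription of the two algorithms above into C++ (a sketch: compile with nvcc, or drop the __host__ __device__ qualifiers for CPU-only use; these rely on strict IEEE semantics, so value-unsafe optimizations such as -ffast-math must be disabled):

      // Quick-Two-Sum: requires |a| >= |b|; 3 operations.
      __host__ __device__ inline void quick_two_sum(double a, double b,
                                                    double &s, double &e) {
          s = a + b;
          e = b - (s - a);   // exact rounding error of a + b
      }

      // Two-Sum (Knuth): any a, b; 6 operations.
      __host__ __device__ inline void two_sum(double a, double b,
                                              double &s, double &e) {
          s = a + b;
          double v = s - a;
          e = (a - (s - v)) + (b - v);
      }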
  • 28. The double-double precision: handy and easy quadruple precision. Basics: Dekker's theorem. There exists an algorithm that calculates s = (a ⊗ b) and e = a × b − (a ⊗ b), where ⊗ is multiplication including rounding error, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations:
      Split(a):
      1. t ← (2^27 + 1) ⊗ a
      2. a_hi ← t ⊖ (t ⊖ a)
      3. a_lo ← a ⊖ a_hi
      4. return (a_hi, a_lo)

      Two-Prod(a, b):
      1. p ← a ⊗ b
      2. (a_hi, a_lo) ← Split(a)
      3. (b_hi, b_lo) ← Split(b)
      4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
      5. return (p, e)
  (s, e) = Two-Prod(a, b)
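  The same two algorithms in C++ (a sketch under the same assumptions as above):

      // Split (Dekker): a = a_hi + a_lo, each part fitting in 26 bits of
      // significand; 4 operations. 134217729.0 is 2^27 + 1 for binary64.
      __host__ __device__ inline void split(double a, double &a_hi, double &a_lo) {
          double t = 134217729.0 * a;
          a_hi = t - (t - a);
          a_lo = a - a_hi;
      }

      // Two-Prod (Dekker): exact product without FMA; 17 operations.
      __host__ __device__ inline void two_prod(double a, double b,
                                               double &p, double &e) {
          p = a * b;
          double a_hi, a_lo, b_hi, b_lo;
          split(a, a_hi, a_lo);
          split(b, b_hi, b_lo);
          e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo;
      }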
  • 29. The double-double precision: handy and easy quadruple precision. Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE":
      QuadAdd-IEEE(a, b):
      1. (s_hi, e_hi) ← Two-Sum(a_hi, b_hi)
      2. (s_lo, e_lo) ← Two-Sum(a_lo, b_lo)
      3. e_hi ← e_hi ⊕ s_lo
      4. (s_hi, e_hi) ← Quick-Two-Sum(s_hi, e_hi)
      5. e_hi ← e_hi ⊕ e_lo
      6. (c_hi, c_lo) ← Quick-Two-Sum(s_hi, e_hi)
      7. return c
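  In C++, reusing the dd_real, two_sum and quick_two_sum sketches above (our reading of the algorithm; intermediate variable names assumed):

      // QuadAdd-IEEE: full-accuracy double-double addition, 20 FLOPS.
      __host__ __device__ inline dd_real quad_add_ieee(dd_real a, dd_real b) {
          double s_hi, e_hi, s_lo, e_lo;
          two_sum(a.hi, b.hi, s_hi, e_hi);       // high parts, exactly
          two_sum(a.lo, b.lo, s_lo, e_lo);       // low parts, exactly
          e_hi += s_lo;
          quick_two_sum(s_hi, e_hi, s_hi, e_hi); // renormalize
          e_hi += e_lo;
          dd_real c;
          quick_two_sum(s_hi, e_hi, c.hi, c.lo); // final renormalization
          return c;
      }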
  • 30. The double-double precision: handy and easy quadruple precision. Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul":
      QuadMul(a, b):
      1. (p_hi, p_lo) ← Two-Prod(a_hi, b_hi)
      2. p_lo ← p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
      3. (c_hi, c_lo) ← Quick-Two-Sum(p_hi, p_lo)
      4. return c
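  In C++, reusing the sketches above:

      // QuadMul: full-accuracy double-double multiplication without FMA,
      // 24 FLOPS.
      __host__ __device__ inline dd_real quad_mul(dd_real a, dd_real b) {
          double p_hi, p_lo;
          two_prod(a.hi, b.hi, p_hi, p_lo);      // exact a_hi * b_hi
          p_lo += a.hi * b.lo + a.lo * b.hi;     // cross terms, rounded
          dd_real c;
          quick_two_sum(p_hi, p_lo, c.hi, c.lo);
          return c;
      }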
  • 31. The double-double precision: handy and easy quadruple precision. The FMA (fused multiply-add) instruction calculates a × b + c in one instruction: a × b + c is computed exactly and then rounded once to double precision.
  • 32. The double-double precision: handy and easy quadruple precision. Faster, using the FMA instruction: Two-Prod becomes 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA):
      Two-Prod-FMA(a, b):
      1. p ← a ⊗ b
      2. e ← FMA(a × b − p)
      3. return (p, e)
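  In C++ (fma is the standard C/CUDA fused multiply-add; on the C2050 it maps to a hardware instruction):

      #include <math.h>

      // Two-Prod-FMA: exact product in 3 operations (FMA counted as 2).
      // fma(a, b, -p) evaluates a*b - p with a single rounding, which is
      // exact here because the error of a*b fits in one double.
      __host__ __device__ inline void two_prod_fma(double a, double b,
                                                   double &p, double &e) {
          p = a * b;
          e = fma(a, b, -p);
      }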
  • 33. The double-double precision: handy and easy quadruple precision. Faster still: lower accuracy operations.
      QuadAdd-Cray(a, b):
      1. (c_hi, c_lo) ← Two-Sum(a_hi, b_hi)
      2. c_lo ← c_lo ⊕ (a_lo ⊕ b_lo)
      3. (c_hi, c_lo) ← Quick-Two-Sum(c_hi, c_lo)
      4. return c

      QuadMul-Sloppy(a, b):
      1. p ← a_hi ⊗ b_lo
      2. q ← a_lo ⊗ b_hi
      3. t ← p ⊕ q
      4. c_hi ← FMA(a_hi × b_hi + t)
      5. e ← FMA(a_hi × b_hi − c_hi)
      6. c_lo ← e ⊕ t
      7. return c
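  In C++, reusing the sketches above (our transcription of the two variants):

      // QuadAdd-Cray: skips one renormalization pass; 11 FLOPS.
      __host__ __device__ inline dd_real quad_add_cray(dd_real a, dd_real b) {
          dd_real c;
          two_sum(a.hi, b.hi, c.hi, c.lo);
          c.lo += a.lo + b.lo;                   // low parts added with rounding
          quick_two_sum(c.hi, c.lo, c.hi, c.lo);
          return c;
      }

      // QuadMul-Sloppy: folds the high product into two FMAs; 8 FLOPS
      // (FMA counted as 2).
      __host__ __device__ inline dd_real quad_mul_sloppy(dd_real a, dd_real b) {
          double t = a.hi * b.lo + a.lo * b.hi;  // cross terms, rounded
          dd_real c;
          c.hi = fma(a.hi, b.hi, t);
          double e = fma(a.hi, b.hi, -c.hi);
          c.lo = e + t;
          return c;
      }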
  • 34. The double-double precision: handy and easy quadruple precision. Summary: operation counts of the double-double arithmetic building blocks.
      Algorithm            # of operations
      Quick-Two-Sum        3
      Two-Sum              6
      Split                4
      Two-Prod             17
      Two-Prod-FMA         3*
      QuadAdd-IEEE         20
      QuadAdd-Cray         11
      QuadMul              24
      QuadMul-FMA          10*
      QuadMul-FMA-Sloppy   8*
  (* FMA counted as 2 FLOPS.) We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.
  • 35. The double-double precision: handy and easy quadruple precision. The QD library. Features: C++ classes; the double-double precision type is "dd_real"; free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey. Download: http://crd.lbl.gov/~dhbailey/mpdist/ Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf (Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.)
  • 36. Implementation on GPU and performance evaluation. We accelerated the matrix-matrix multiplication routine called "Rgemm". Prototype of Rgemm:
      void Rgemm(const char *transa, const char *transb, mpackint m,
                 mpackint n, mpackint k, dd_real alpha, dd_real *A,
                 mpackint lda, dd_real *B, mpackint ldb, dd_real beta,
                 dd_real *C, mpackint ldc)
  "MPACK" by M. Nakata is a multiple precision version of BLAS and LAPACK (the de facto standard linear algebra packages): http://mplapack.sourceforge.net/ ("Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.)
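  A hedged usage sketch (the header name is an assumption; see the MPACK distribution for the actual one). Like dgemm, Rgemm computes C ← alpha·op(A)·op(B) + beta·C on column-major dd_real arrays:

      #include <mblas_dd.h>   // MPACK double-double BLAS header (name assumed)

      void multiply_square(mpackint n, dd_real *A, dd_real *B, dd_real *C) {
          dd_real alpha = 1.0, beta = 0.0;
          // C <- 1.0 * A * B + 0.0 * C, no transposes, leading dimension n
          Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);
      }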
  • 37. Implementation on GPU and performance evaluation. Related studies: D. Mukunoki and D. Takahashi, "Implementation of double-double matrix matrix multiplication on GPU", HPCS, pp. 148-156 (2011) → matrix sizes must be multiples of 64; slower than our implementation. N. Nakasato, "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010 → matrix sizes must be multiples of 64; faster than our implementation. Neither implementation is practical for general matrix sizes → we implemented for general use.
  • 38. Implementation on GPU and evaluation. [Figure: NVIDIA C2050 architecture.]
  • 39. Implementation on GPU and evaluation. Block algorithm: we divide the matrices into small blocks of sizes b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.
  • 40. Implementation on GPU and evaluation. Basic algorithm: 1. Transfer the A, B, C matrices from CPU memory to GPU global memory. 2. Blocking: A_b: 16 × 16 and B_b: 16 × 64 (the most efficient choice). 3. Assign a 16 × 16 = 256-thread block to each block of elements; the (i, j)-th thread in a thread block works on the i-th row of A_b and the j, j+16, j+32, j+48-th columns (four columns at the same time) of B_b.
  • 41. Implementation on GPU and evaluation. Operation of each thread in detail: 1. Multiply beta into c0, c1, c2, c3 of the C matrix, which correspond to the i-th row of A_b and the j, j+16, j+32, j+48-th columns of B_b. 2. Read the first blocks A_b and B_b from global memory into shared memory; each thread of the block reads its own elements. 3. Calculate the partial inner products of row vector a_i of A_b with columns b_j, b_{j+16}, b_{j+32}, b_{j+48} of B_b as p0, p1, p2, p3. 4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0. 5. Read the next blocks A_b, B_b and repeat steps 3 and 4 until no further blocks are available. 6. Update the C matrix with c0, c1, c2, c3. 7. Finally transfer the C matrix from GPU global memory back to the CPU. A structural sketch of the kernel follows.
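  A structural sketch of the kernel in CUDA (our simplification, not the shipped code: column-major storage, no transposes, and m, n, k assumed to be multiples of the block sizes; the pointer redirecting of slide 46 removes that restriction. Plain double stands in for dd_real, so every multiply-add below stands for a QuadMul followed by a QuadAdd):

      // Launch: dim3 grid(n / 64, m / 16), dim3 block(16, 16).
      #define BM 16
      #define BK 16
      #define BN 64

      __global__ void rgemm_nn_sketch(int m, int n, int k, double alpha,
                                      const double *A, int lda,
                                      const double *B, int ldb,
                                      double beta, double *C, int ldc) {
          __shared__ double Ab[BK][BM];                 // 16x16 tile of A
          __shared__ double Bb[BN][BK];                 // 16x64 tile of B
          const int tx = threadIdx.x, ty = threadIdx.y; // 16x16 threads
          const int i = blockIdx.y * BM + ty;           // row of C
          const int j = blockIdx.x * BN + tx;           // first of 4 columns
          double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
          for (int p = 0; p < k; p += BK) {
              // Each thread loads one element of the A tile and four of B's.
              Ab[tx][ty] = A[i + (p + tx) * lda];
              for (int q = 0; q < 4; ++q)
                  Bb[tx + 16 * q][ty] = B[(p + ty) + (j + 16 * q) * ldb];
              __syncthreads();
              for (int s = 0; s < BK; ++s) {            // partial inner products
                  const double a = Ab[s][ty];
                  c0 += a * Bb[tx][s];
                  c1 += a * Bb[tx + 16][s];
                  c2 += a * Bb[tx + 32][s];
                  c3 += a * Bb[tx + 48][s];
              }
              __syncthreads();
          }
          // beta is applied at the end here; the slides apply it first --
          // the result is the same.
          C[i + j * ldc]        = alpha * c0 + beta * C[i + j * ldc];
          C[i + (j + 16) * ldc] = alpha * c1 + beta * C[i + (j + 16) * ldc];
          C[i + (j + 32) * ldc] = alpha * c2 + beta * C[i + (j + 32) * ldc];
          C[i + (j + 48) * ldc] = alpha * c3 + beta * C[i + (j + 48) * ldc];
      }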
  • 42. Implementation on GPU and evaluation. The performance of matrix-matrix multiplication in double-double precision, for square matrices (m = n = k), varying m. Max kernel performance was 16.4 GFLOPS; 16.1 GFLOPS with CPU-GPU transfer included. [Figure: GFLOPS vs. dimension for the NN kernel and NN total.]
  • 43. Implementation on GPU and evaluation. The performance of matrix-matrix multiplication in double-double precision with matrix transposes, for square matrices (m = n = k), varying m. No performance loss with matrix transposes was observed. [Figure: GFLOPS vs. dimension for the NN, NT, TN, TT kernel and total variants.]
  • 44. Implementation on GPU and evaluation. We observed no performance loss with matrix transposes; the reason is that we use texture memory instead of plain global memory accesses. Global memory and texture memory are physically the same, but with texture memory the performance loss without coalesced memory access is small. Also, it is relatively easy to hide memory-transfer latency in double-double precision since it is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
  • 45. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. A large performance loss (~35%) is observed for matrix sizes that are not multiples of 64.
  • 46. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. Simple algorithm: if a pointer falls outside the block, return the value at the nearest edge. A very simple program with a small performance loss. Breakthrough!
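  A minimal sketch of the idea (ours):

      // Pointer redirecting (sketch): clamp an out-of-range index to the
      // nearest edge so every thread reads valid memory; threads whose C
      // element lies outside the matrix skip the final store.
      __device__ inline int redirect(int i, int n) {
          return i < n ? i : n - 1;
      }

      // e.g. in the tile loader, instead of A[i + (p + tx) * lda]:
      //   A[redirect(i, m) + redirect(p + tx, k) * lda]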
  • 47. Implementation on GPU and evaluation. The performance loss was reduced from 35% to 6%! [Figure: kernel and total GFLOPS vs. dimension between 2050 and 2250.]
  • 48. Implementation on GPU and evaluation. Performance varied by only 0.1% over repeated measurements. [Figure: total GFLOPS over 100 repeated measurements, all near 15.556 GFLOPS.]
  • 49. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.]
  • 50. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurements on a Xeon 3470 with DDR3-1066.
      Algorithm                              Performance
      QuadAdd-Cray, QuadMul-Sloppy kernel    26.4 GFLOPS
      QuadAdd-Cray, QuadMul-Sloppy total     25.7 GFLOPS
      QuadAdd-Cray, QuadMul kernel           23.0 GFLOPS
      QuadAdd-Cray, QuadMul total            22.4 GFLOPS
      QuadAdd-IEEE, QuadMul-Sloppy kernel    18.1 GFLOPS
      QuadAdd-IEEE, QuadMul-Sloppy total     17.8 GFLOPS
      QuadAdd-IEEE, QuadMul kernel           16.4 GFLOPS
      QuadAdd-IEEE, QuadMul total            16.1 GFLOPS
      QuadAdd-IEEE, QuadMul CPU              100 MFLOPS
      QuadAdd-IEEE, QuadMul OpenMP CPU       400 MFLOPS
  • 51. Implementation on GPU and evaluation. 16.1 GFLOPS is about 92.4% (or 46.2%) of the estimated peak performance (QuadAdd-IEEE, QuadMul-FMA). Average double precision operations per dd flop: QuadAdd-IEEE takes 20 operations and QuadMul-FMA takes 10, and in Rgemm the same numbers of multiplications and additions appear, so (20 + 10 − 1)/2 = 14.5. The approximate theoretical peak is therefore 515 GFLOPS / 14.5 = 35.5 GFLOPS. However, the C2050's peak assumes full use of FMA, which our operation mix does not achieve, so the practical peak is 515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS.
  • 52. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD".
  • 53. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Semidefinite programming:
      Primal  min:  A_0 • X
              s.t.: A_i • X = b_i (i = 1, 2, ..., m), X ⪰ 0
      Dual    max:  Σ_{i=1}^m b_i z_i
              s.t.: Σ_{i=1}^m A_i z_i + Y = A_0, Y ⪰ 0
  Here the A_i are n × n symmetric matrices, X and Y are n × n symmetric variable matrices, b is an m-dimensional vector, X • Y := Σ_{ij} X_ij Y_ij, and X ⪰ 0 means X is positive semidefinite (all eigenvalues are larger than or equal to 0).
  • 54. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Nature of optimality. Theorem (complementary slackness): if (X*, Y*, z*) is a feasible interior point satisfying the primal and dual SDP conditions, then the necessary and sufficient condition for optimality of (X*, Y*, z*) is X* • Y* = 0.
  • 55. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". When X*, Y* are optimal, X* • Y* = 0, and rank X* + rank Y* ≤ n (1) also follows: at least one of X*, Y* is singular. Usually both X* and Y* are singular → unstable and/or less accurate behavior at the optimum.
  • 56. How to solve SDP: the interior point primal-dual path-following method. World-class implementations SDPA and SDPARA are available from the SDPA group led by Prof. Fujisawa.
      Step 0: Set the initial point x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0; let h = 0 and choose a parameter γ ∈ (0, 1).
      Step 1: Compute the Schur complement matrix B ∈ S^n: B_ij = ((X^h)^{-1} F_i Y^h) • F_j.
      Step 2: Solve the linear equation B dx = r and compute dX, dY from the solution dx, obtaining the next step (dx, dX, dY).
      Step 3: Determine the step size α keeping the matrices positive semidefinite: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
      Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα (dx, dX, dY).
      Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends; otherwise increment h and go back to Step 1.
  • 57. The Schur complement matrix becomes singular. B is called the "Schur complement matrix"; we solve the linear equation B dx = r to determine the next step, and this linear system becomes singular! Multiple precision arithmetic is needed for accurate solutions. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count.]
  • 58. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Benchmark results on the larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.
      Problem     CPU(sec)    GPU(sec)   acceleration
      equalG51      6531.9       573.2   11.4
      gpp500-1       902.0        72.2   12.5
      gpp500-4       638.0        74.8    8.5
      maxG32       36284.4      4373.1    8.3
      maxG55      521575.4     53413.1    9.8
      mcp500-4       539.1        65.2    8.3
      qpG11        16114.7      1408.0   11.4
      qpG51        39678.9      3299.2   12.0
      ss30           310.7       138.6    2.2
      theta5        3250.0       239.8   13.6
      theta6        9028.2       623.6   14.5
      thetaG51     49161.5      4870.4   10.1
  • 59. Summary. http://mplapack.sourceforge.net/ Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: CPU x 150, peak performance 26 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.]