A fast implementation of matrix-matrix product in double-double precision
on NVIDIA C2050 and application to semidefinite programming

Nakata Maho*† (maho@riken.jp), Yasuyoshi Takao††, Noda Shigeho†, Himeno Ryutaro†

† RIKEN, Advanced Center for Computing and Communication
†† JFE Tech

International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15
Overview

- Introduction of this research in a slide.
- The importance of high-precision arithmetic.
- The double-double precision: a cheap and easy route to quadruple precision, and its details.
- Matrix-matrix multiplication (Rgemm) in MPACK (a high-precision version of BLAS and LAPACK).
- Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than the CPU.
- Application: acceleration of the semidefinite programming solver "SDPA-DD": 10 times faster than the CPU.
- Summary.
Introduction of this research in a slide.

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU x 150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. matrix dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE; the fastest combination reaches about 26 GFLOPS.]

+ Application: semidefinite programming, GPU = CPU x 10.
More accuracy is needed towards peta- and exa-scale computing

Exa-scale computing means about 10^23 FLOP for just one week of calculation.
Scientific computing may suffer from accuracy loss at this scale.
More accuracy is needed towards peta- and exa-scale computing

Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].
More accuracy is needed towards peta- and exa-scale computing

Semidefinite programming (SDP): the condition number diverges at the optimum. Therefore, it can be very hard to obtain an accurate solution [Nakata et al. 2008], [Nakata 2009], [Waki-Nakata-Muramatsu].

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90); the condition number climbs to about 1e20.]
Accelerating high-precision operations on a GPU is a good idea

Double-double precision is a cheap and fast route to high precision:
- accurate enough for many purposes: almost as accurate as quadruple precision.
- fast: each operation takes only 8-24 double precision operations.
- operation intensive: demands far more arithmetic than memory bandwidth.
Implementing it on a GPU is a good idea:
- fast: 515 GFLOPS on an NVIDIA C2050 vs. 100-200 GFLOPS on CPUs.
- cheap: an NVIDIA C2050 costs about $2000, a workstation $5000-$10000.
- no complex operations are required: well suited to GPUs.
The double-double precision: handy and easy quadruple precision

"754-2008 IEEE Standard for Floating-Point Arithmetic": the binary64 (a.k.a. double precision) format has 16 significant decimal digits.

Widely used and very fast: Core i7 920 ~40 GFLOPS; RADEON HD7970 ~1000 GFLOPS; K computer over 10 PFLOPS.

A rounding error may occur in every arithmetic operation.
The double-double precision: handy and easy quadruple precision

A double-double precision number a is expressed by two double precision numbers a_hi and a_lo:

    a = (a_hi, a_lo).
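As a concrete picture, here is a minimal C++ sketch of this representation. The QD library's dd_real class (introduced later in this deck) is the real implementation; this struct is only illustrative.

```cpp
// A minimal sketch of the (hi, lo) representation: the value is
// a_hi + a_lo, where a_lo holds the rounding error that a_hi alone
// cannot represent (|a_lo| <= half an ulp of a_hi).
struct dd_real {
    double hi;  // leading part: the double closest to the full value
    double lo;  // trailing part: the leftover rounding error
};
```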
The double-double precision: handy and easy quadruple precision

Knuth's Theorem
Error-free transformation of two floating point numbers a, b:

    a + b = (a ⊕ b) + e,

where ⊕ is floating-point addition including rounding error, + is exact addition, and e is a floating point number.

We can evaluate the rounding error of an addition exactly!
The double-double precision: handy and easy quadruple precision

Dekker's Theorem
Error-free transformation of two floating point numbers a, b:

    a × b = (a ⊗ b) + e,

where ⊗ is floating-point multiplication including rounding error, × is exact multiplication, and e is a floating point number.

We can evaluate the rounding error of a multiplication exactly!
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operations including rounding error. When |a| ≥ |b|, we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in three operations:

Quick-Two-Sum(a, b):
  1. s ← a ⊕ b
  2. e ← b ⊖ (s ⊖ a)
  3. return (s, e)
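A direct C++ transcription might look as follows (a sketch; compile without value-unsafe optimizations such as -ffast-math so the compiler preserves the exact operation order).

```cpp
#include <utility>

// Quick-Two-Sum: requires |a| >= |b|. Returns (s, e) such that
// a + b = s + e exactly, in three floating-point operations.
inline std::pair<double, double> quick_two_sum(double a, double b) {
    double s = a + b;        // 1. s <- a (+) b
    double e = b - (s - a);  // 2. e <- b (-) (s (-) a)
    return {s, e};           // 3.
}
```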
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem we can also define "Two-Sum(a, b)", which needs no assumption on the magnitudes of a, b and calculates s = a ⊕ b and e = a + b − (a ⊕ b) exactly in six operations:

Two-Sum(a, b):
  1. s ← a ⊕ b
  2. v ← s ⊖ a
  3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
  4. return (s, e)
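The corresponding C++ sketch:

```cpp
// Two-Sum: no ordering assumption on |a| and |b|. Returns (s, e)
// such that a + b = s + e exactly, in six floating-point operations.
inline std::pair<double, double> two_sum(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}
```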
The double-double precision: handy and easy quadruple precision

Basics: Dekker's Theorem
There exists an algorithm which calculates s = a ⊗ b and e = a × b − (a ⊗ b), where ⊗ is floating-point multiplication including rounding error, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations.

Split(a):
  1. t ← (2^27 + 1) ⊗ a
  2. a_hi ← t ⊖ (t ⊖ a)
  3. a_lo ← a ⊖ a_hi
  4. return (a_hi, a_lo)

Two-Prod(a, b):
  1. p ← a ⊗ b
  2. (a_hi, a_lo) ← Split(a)
  3. (b_hi, b_lo) ← Split(b)
  4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
  5. return (p, e)
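C++ sketches of both routines, using the pair-returning style of the helpers above:

```cpp
// Split: cuts a double into two 26-bit halves, a = a_hi + a_lo, so
// that products of the halves are exact. 2^27 + 1 = 134217729.
inline std::pair<double, double> split(double a) {
    double t  = 134217729.0 * a;
    double hi = t - (t - a);
    double lo = a - hi;
    return {hi, lo};
}

// Two-Prod: returns (p, e) such that a * b = p + e exactly,
// in 17 floating-point operations (C++17 structured bindings).
inline std::pair<double, double> two_prod(double a, double b) {
    double p = a * b;
    auto [ahi, alo] = split(a);
    auto [bhi, blo] = split(b);
    double e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
    return {p, e};
}
```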
The double-double precision: handy and easy quadruple precision

Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE":

QuadAdd-IEEE(a, b):
  1. (s_hi, e_hi) = Two-Sum(a_hi, b_hi)
  2. (s_lo, e_lo) = Two-Sum(a_lo, b_lo)
  3. e_hi = e_hi ⊕ s_lo
  4. (s_hi, e_hi) = Quick-Two-Sum(s_hi, e_hi)
  5. e_hi = e_hi ⊕ e_lo
  6. (c_hi, c_lo) = Quick-Two-Sum(s_hi, e_hi)
  7. return c = (c_hi, c_lo)
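Using the dd_real struct and helpers sketched above, a C++ sketch of QuadAdd-IEEE (following the QD library's accurate addition):

```cpp
// QuadAdd-IEEE: accurate double-double addition, 20 operations.
inline dd_real quad_add_ieee(dd_real a, dd_real b) {
    auto [s1, e1] = two_sum(a.hi, b.hi);        // 1. add high parts
    auto [s2, e2] = two_sum(a.lo, b.lo);        // 2. add low parts
    e1 += s2;                                   // 3.
    auto [t1, t2] = quick_two_sum(s1, e1);      // 4. renormalize
    t2 += e2;                                   // 5.
    auto [c_hi, c_lo] = quick_two_sum(t1, t2);  // 6. renormalize again
    return {c_hi, c_lo};                        // 7.
}
```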
The double-double precision: handy and easy quadruple precision

Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul":

QuadMul(a, b):
  1. (p_hi, p_lo) = Two-Prod(a_hi, b_hi)
  2. p_lo = p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
  3. (c_hi, c_lo) = Quick-Two-Sum(p_hi, p_lo)
  4. return c = (c_hi, c_lo)
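And the matching C++ sketch:

```cpp
// QuadMul: double-double multiplication, 24 operations without FMA.
inline dd_real quad_mul(dd_real a, dd_real b) {
    auto [p_hi, p_lo] = two_prod(a.hi, b.hi);       // 1. exact product
    p_lo += a.hi * b.lo + a.lo * b.hi;              // 2. cross terms
    auto [c_hi, c_lo] = quick_two_sum(p_hi, p_lo);  // 3. renormalize
    return {c_hi, c_lo};                            // 4.
}
```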
The double-double precision: handy and easy quadruple precision

The FMA (fused multiply-add) instruction calculates

    a × b + c

in one operation: it computes a × b + c exactly, then rounds once to double precision.
The double-double precision: handy and easy quadruple precision

Faster: using the FMA instruction, Two-Prod shrinks to 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA):

Two-Prod-FMA(a, b):
  1. p ← a ⊗ b
  2. e ← FMA(a × b − p)
  3. return (p, e)
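In C++, std::fma exposes the hardware FMA, so Two-Prod-FMA is a two-liner (a sketch):

```cpp
#include <cmath>

// Two-Prod-FMA: std::fma(a, b, -p) computes a*b - p with a single
// rounding, which is exactly the error of the rounded product p.
inline std::pair<double, double> two_prod_fma(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);
    return {p, e};
}
// QuadMul-FMA is QuadMul with two_prod replaced by two_prod_fma.
```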
The double-double precision: handy and easy quadruple precision

Faster: lower-accuracy operations.

QuadAdd-Cray(a, b):
  1. (c_hi, c_lo) = Two-Sum(a_hi, b_hi)
  2. c_lo = c_lo ⊕ (a_lo ⊕ b_lo)
  3. (c_hi, c_lo) = Quick-Two-Sum(c_hi, c_lo)
  4. return c

QuadMul-Sloppy(a, b):
  1. p = a_hi ⊗ b_lo
  2. q = a_lo ⊗ b_hi
  3. t = p ⊕ q
  4. c_hi = FMA(a_hi × b_hi + t)
  5. e = FMA(a_hi × b_hi − c_hi)
  6. c_lo = e ⊕ t
  7. return c
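C++ sketches of the two variants, reusing the helpers above; both trade a little accuracy for fewer operations.

```cpp
// QuadAdd-Cray: skips the second renormalization pass (11 operations).
inline dd_real quad_add_cray(dd_real a, dd_real b) {
    auto [c_hi, c_lo] = two_sum(a.hi, b.hi);        // 1.
    c_lo += a.lo + b.lo;                            // 2.
    auto [r_hi, r_lo] = quick_two_sum(c_hi, c_lo);  // 3.
    return {r_hi, r_lo};                            // 4.
}

// QuadMul-Sloppy: drops the a_lo*b_lo term (8 operations with FMA).
inline dd_real quad_mul_sloppy(dd_real a, dd_real b) {
    double t    = a.hi * b.lo + a.lo * b.hi;        // 1.-3. cross terms
    double c_hi = std::fma(a.hi, b.hi, t);          // 4.
    double e    = std::fma(a.hi, b.hi, -c_hi);      // 5.
    double c_lo = e + t;                            // 6.
    return {c_hi, c_lo};                            // 7.
}
```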
The double-double precision: handy and easy quadruple precision

Summary: operation count of each double-double arithmetic routine.

    Algorithm             # of operations
    Quick-Two-Sum               3
    Two-Sum                     6
    Split                       4
    Two-Prod                   17
    Two-Prod-FMA                3*
    QuadAdd-IEEE               20
    QuadAdd-Cray               11
    QuadMul                    24
    QuadMul-FMA                10*
    QuadMul-FMA-Sloppy          8*

* FMA counted as 2 FLOPs.
We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.
The double-double precision: handy and easy quadruple precision

QD library
Features: a C++ class library; the double-double precision type is "dd_real". Free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey.
Download: http://crd.lbl.gov/~dhbailey/mpdist/
Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf
Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.
Implementation on GPU and performance evaluation

We accelerated the matrix-matrix multiplication routine "Rgemm". Prototype of Rgemm:

    void Rgemm(const char *transa, const char *transb,
               mpackint m, mpackint n, mpackint k, dd_real alpha,
               dd_real * A, mpackint lda, dd_real * B, mpackint ldb,
               dd_real beta, dd_real * C, mpackint ldc)

"MPACK" by M. Nakata is a multiple-precision version of BLAS and LAPACK (the de facto standard linear algebra packages).

http://mplapack.sourceforge.net/

("Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.)
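A usage sketch: computing C ← αAB + βC for column-major square matrices. The header names are our assumptions about an MPACK/QD installation, not taken from the slides; check your own build.

```cpp
#include <qd/dd_real.h>   // dd_real (QD library) -- assumed install path
#include <mblas_dd.h>     // Rgemm, mpackint      -- assumed MPACK header

int main() {
    mpackint n = 1000;
    dd_real *A = new dd_real[n * n];
    dd_real *B = new dd_real[n * n];
    dd_real *C = new dd_real[n * n];
    for (mpackint i = 0; i < n * n; i++) A[i] = B[i] = C[i] = 1.0;
    // C <- 1.0 * A * B + 0.0 * C, column-major, no transposes
    Rgemm("n", "n", n, n, n, dd_real(1.0), A, n, B, n, dd_real(0.0), C, n);
    delete[] A; delete[] B; delete[] C;
    return 0;
}
```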
Implementation on GPU and performance evaluation

Related work
- D. Mukunoki and D. Takahashi: "Implementation of double-double matrix-matrix multiplication on GPU", HPCS, pp. 148-156 (2011). → The matrix size must be a multiple of 64; slower than our implementation.
- N. Nakasato: "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, Louisiana, USA, 2010. → The matrix size must be a multiple of 64; faster than our implementation.
Neither implementation is practical → we implemented Rgemm for general use.
Implementation on GPU and evaluation

[Figure: NVIDIA C2050 GPU architecture.]
Implementation on GPU and evaluation

Block algorithm: we divide the matrices into small blocks of sizes b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.
Implementation on GPU and evaluation

Basic algorithm:
1. Transfer the A, B, C matrices from CPU memory to GPU global memory.
2. Blocking: Ab is 16 × 16 and Bb is 16 × 64 (the most efficient choice).
3. Apply a 16 × 16 = 256-thread block to each pair of blocks: the (i, j)-th thread of a thread block works on the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb (four columns at the same time).
Implementation on GPU and evaluation

Operation of each thread in detail (a kernel sketch follows this list):
1. Multiply beta into c0, c1, c2, c3, the elements of the C matrix corresponding to the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb.
2. Read the first blocks Ab and Bb from global memory into shared memory; each thread of the block reads its own elements.
3. Calculate the inner products of the i-th row vector of Ab with the columns b_j, b_{j+16}, b_{j+32}, b_{j+48} of Bb, giving p0, p1, p2, p3.
4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0.
5. Read the next blocks Ab, Bb and repeat steps 3-4 until no further blocks are available.
6. Update the C matrix with c0, c1, c2, c3.
7. Finally, transfer the C matrix from GPU global memory back to the CPU.
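A minimal CUDA sketch of this scheme, assuming column-major storage, dimensions that are multiples of the block sizes (the pointer redirecting shown later removes this restriction), and device versions dd_add/dd_mul of QuadAdd/QuadMul; all names are ours, not the authors' actual kernel. For brevity it accumulates the full inner product first and applies alpha and beta at write-back, which is mathematically equivalent to steps 1 and 4.

```cpp
struct dd_real { double hi, lo; };                // as sketched earlier
__device__ dd_real dd_add(dd_real a, dd_real b);  // device QuadAdd (assumed)
__device__ dd_real dd_mul(dd_real a, dd_real b);  // device QuadMul (assumed)

#define BM 16
#define BK 16
#define BN 64

// launch: dim3 grid(n / BN, m / BM), block(16, 16)
__global__ void rgemm_nn_sketch(int m, int n, int k,
                                dd_real alpha, const dd_real *A, int lda,
                                const dd_real *B, int ldb,
                                dd_real beta, dd_real *C, int ldc)
{
    __shared__ dd_real Ab[BK][BM];  // 16x16 block of A
    __shared__ dd_real Bb[BK][BN];  // 16x64 block of B

    const int i  = blockIdx.y * BM + threadIdx.y;  // row of C
    const int j0 = blockIdx.x * BN + threadIdx.x;  // first of 4 columns

    dd_real p[4] = {};  // inner-product accumulators p0..p3

    for (int kb = 0; kb < k; kb += BK) {
        // each thread loads its own elements into shared memory
        Ab[threadIdx.x][threadIdx.y] = A[i + (kb + threadIdx.x) * lda];
        for (int q = 0; q < 4; ++q)
            Bb[threadIdx.y][threadIdx.x + 16 * q] =
                B[(kb + threadIdx.y) + (j0 + 16 * q) * ldb];
        __syncthreads();

        for (int t = 0; t < BK; ++t)   // row of Ab times 4 columns of Bb
            for (int q = 0; q < 4; ++q)
                p[q] = dd_add(p[q], dd_mul(Ab[t][threadIdx.y],
                                           Bb[t][threadIdx.x + 16 * q]));
        __syncthreads();
    }
    for (int q = 0; q < 4; ++q) {      // C <- alpha*p + beta*C
        const int idx = i + (j0 + 16 * q) * ldc;
        C[idx] = dd_add(dd_mul(alpha, p[q]), dd_mul(beta, C[idx]));
    }
}
```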
Implementation on GPU and evaluation

Performance of the matrix-matrix operation in double-double precision, square matrices (m = n = k), varying m. The maximum kernel performance was 16.4 GFLOPS, or 16.1 GFLOPS with the CPU-GPU transfer included.

[Figure: GFLOPS vs. dimension (0-6000), NN-Kernel and NN-Total curves rising to about 16 GFLOPS.]
Implementation on GPU and evaluation

Performance of the matrix-matrix operation in double-double precision with matrix transposes, square matrices (m = n = k), varying m. No performance loss from matrix transposes is observed.

[Figure: GFLOPS vs. dimension (0-6000) for NN, NT, TN, TT, kernel and total; all curves nearly coincide.]
Implementation on GPU and evaluation

We observed no performance loss with matrix transposes; the reason is that we use texture memory instead.
- Global memory and texture memory are essentially the same, but with texture memory the performance loss from non-coalesced memory accesses is small.
- Moreover, memory-transfer latency is relatively easy to hide in double-double precision because the arithmetic is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra.
- A large performance loss (~35%) is observed when the matrix size is not a multiple of 64.
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra.
- Simple algorithm: if a pointer falls outside the block, return the value at the nearest edge.
- Very simple to program.
- Small performance loss.
Breakthrough!!
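A minimal sketch of the idea, assuming column-major storage; the helper name is ours, not from the paper.

```cpp
struct dd_real { double hi, lo; };  // as sketched earlier

// Pointer redirecting: threads whose row/column index falls outside
// the matrix are redirected to the nearest valid edge element, so the
// kernel can run as if the size were padded to a multiple of the
// block size, with no out-of-bounds reads.
__device__ inline const dd_real *redirect(const dd_real *M, int ld,
                                          int i, int rows,
                                          int j, int cols)
{
    if (i >= rows) i = rows - 1;  // clamp row to the last valid row
    if (j >= cols) j = cols - 1;  // clamp column to the last valid column
    return &M[i + j * ld];
}
```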
Implementation on GPU and evaluation

The performance loss was reduced from 35% to 6%!

[Figure: kernel and total GFLOPS vs. dimension (2050-2250); performance stays between about 14.6 and 16.4 GFLOPS.]
Implementation on GPU and evaluation

Performance varied by only 0.1% over repeated calculations.

[Figure: total GFLOPS over 100 repeated measurements, all within 15.5535-15.5575 GFLOPS.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurements on a Xeon 3470 with DDR3-1066.

    Algorithm                               Performance
    QuadAdd-Cray, QuadMul-Sloppy kernel     26.4 GFLOPS
    QuadAdd-Cray, QuadMul-Sloppy total      25.7 GFLOPS
    QuadAdd-Cray, QuadMul kernel            23.0 GFLOPS
    QuadAdd-Cray, QuadMul total             22.4 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy kernel     18.1 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy total      17.8 GFLOPS
    QuadAdd-IEEE, QuadMul kernel            16.4 GFLOPS
    QuadAdd-IEEE, QuadMul total             16.1 GFLOPS
    QuadAdd-IEEE, QuadMul CPU                100 MFLOPS
    QuadAdd-IEEE, QuadMul OpenMP CPU         400 MFLOPS
Implementation on GPU and evaluation

How close is 16.1 GFLOPS (QuadAdd-IEEE, QuadMul-FMA) to the peak performance?
Average operation count: QuadAdd-IEEE takes 20 ops, QuadMul-FMA 10 ops, and in Rgemm the same number of multiplications and additions appear:

    (20 + 10 − 1)/2 = 14.5

The approximate theoretical peak should then be

    515 GFLOPS / 14.5 = 35.5 GFLOPS,

of which 16.1 GFLOPS is about 45%. However, the C2050's peak is calculated assuming full use of FMA, which our computation does not achieve, thus

    515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS,

of which 16.1 GFLOPS is about 90%.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Application
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Semidefinite programming:

    Primal   min:  A_0 • X
             s.t.: A_i • X = b_i   (i = 1, 2, ..., m)
                   X ⪰ 0

    Dual     max:  Σ_{i=1}^m b_i z_i
             s.t.: Σ_{i=1}^m A_i z_i + Y = A_0
                   Y ⪰ 0

A_i: n × n symmetric matrices; X: n × n symmetric variable matrix; b_i: components of an m-dimensional vector; Y: n × n symmetric variable matrix; X • Y := Σ_ij X_ij Y_ij. X ⪰ 0 means X is positive semidefinite: all eigenvalues are greater than or equal to 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

The nature of optimality.

Theorem (complementary slackness)
Let (X*, Y*, z*) be a feasible interior point satisfying the primal and dual conditions of the SDP. Then the necessary and sufficient condition for optimality of (X*, Y*, z*) is

    X* • Y* = 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

When X*, Y* are optimal,

    X* • Y* = 0.

Then

    rank X* + rank Y* ≤ n                                    (1)

also follows.
At least one of X*, Y* is singular.
Usually both X* and Y* are singular → unstable and/or less accurate at the optimum.
How to solve SDP: the interior-point primal-dual path-following method

The world's best implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.
    Step 0: Set the initial point x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0. Let h = 0 and choose a parameter γ ∈ (0, 1).
    Step 1: Calculate the Schur complement matrix B ∈ S^m:
                B_ij = ((X^h)^{-1} F_i Y^h) • F_j
    Step 2: Solve the linear equation B dx = r, and calculate dX, dY from the solution dx; this gives the next step (dx, dX, dY).
    Step 3: Determine the step size α keeping the matrices positive semidefinite: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
    Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα(dx, dX, dY).
    Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends. Otherwise, go back to Step 1 with h = h + 1.
The Schur complement matrix becomes singular

B is called the "Schur complement matrix". We solve the linear equation B dx = r to determine the next step, and this linear system becomes singular!
Multiple-precision arithmetic is needed for accurate solutions!

[Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count (0-90); the condition number climbs to about 1e20.]
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Benchmark results on the larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.

    Problem     CPU (sec)    GPU (sec)    Acceleration
    equalG51      6531.9        573.2         11.4
    gpp500-1       902.0         72.2         12.5
    gpp500-4       638.0         74.8          8.5
    maxG32       36284.4       4373.1          8.3
    maxG55      521575.4      53413.1          9.8
    mcp500-4       539.1         65.2          8.3
    qpG11        16114.7       1408.0         11.4
    qpG51        39678.9       3299.2         12.0
    ss30           310.7        138.6          2.2
    theta5        3250.0        239.8         13.6
    theta6        9028.2        623.6         14.5
    thetaG51     49161.5       4870.4         10.1
Summary

http://mplapack.sourceforge.net/

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: CPU x150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000) for the eight kernel/total combinations of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]

HOKUSAIのベンチマーク 理研シンポジウム 中田分
HOKUSAIのベンチマーク 理研シンポジウム 中田分HOKUSAIのベンチマーク 理研シンポジウム 中田分
HOKUSAIのベンチマーク 理研シンポジウム 中田分
 
為替取引(FX)でのtickdataの加工とMySQLで管理
為替取引(FX)でのtickdataの加工とMySQLで管理為替取引(FX)でのtickdataの加工とMySQLで管理
為替取引(FX)でのtickdataの加工とMySQLで管理
 
為替のTickdataをDukascopyからダウンロードする
為替のTickdataをDukascopyからダウンロードする為替のTickdataをDukascopyからダウンロードする
為替のTickdataをDukascopyからダウンロードする
 
HPCS2015 pythonを用いた量子化学プログラムの開発と応用
HPCS2015 pythonを用いた量子化学プログラムの開発と応用HPCS2015 pythonを用いた量子化学プログラムの開発と応用
HPCS2015 pythonを用いた量子化学プログラムの開発と応用
 
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
HPCS2015 大規模量子化学計算プログラムSMASHの開発と公開(石村)
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
3Dプリンタ導入記 タンパク質の模型をプリントする
3Dプリンタ導入記 タンパク質の模型をプリントする3Dプリンタ導入記 タンパク質の模型をプリントする
3Dプリンタ導入記 タンパク質の模型をプリントする
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming

  • 1. A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming. Nakata Maho (maho@riken.jp), Yasuyoshi Takao, Noda Shigeho, Himeno Ryutaro. RIKEN, Advanced Center for Computing and Communication; JFE Tech. International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15.
  • 2. Overview. Introduction of this research in a slide. Importance of high precision arithmetic. The double-double precision: a cheap and easy solution for quadruple precision, and its details. Matrix-matrix multiplication (Rgemm) in MPACK (high precision version of BLAS and LAPACK). Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than CPU. Application: acceleration of the semidefinite programming solver "SDPA-DD": 10 times faster than CPU. Summary.
  • 3. Introduction of this research in a slide. Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU x 150, peak performance 26 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.] Plus application: semidefinite programming, GPU = CPU x 10.
  • 5. More accuracy is needed towards peta- and exascale computing. Exascale computing means about 10^23 FLOP for just one week of calculation. Scientific computing may suffer from accuracy loss at this scale.
  • 8. More accuracy is needed towards peta- and exascale computing. Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].
  • 10. More accuracy is needed towards peta- and exascale computing. Semidefinite programming (SDP): the condition number diverges at the optimum, so it may be very hard to obtain an accurate solution [Nakata et al 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count.]
  • 14. Accelerating high precision operations on GPU is a good idea. Double-double precision is a cheap and fast solution for high precision: accurate enough for many purposes (almost as accurate as quadruple precision); fast (each operation takes only 8-24 double precision operations); operation intensive (it demands FLOPS rather than memory bandwidth). Implementing on GPU is a good idea: fast (515 GFLOPS on NVIDIA C2050 vs. 100-200 GFLOPS for a CPU); cheap (NVIDIA C2050 about $2000, a workstation $5000-$10000); double-double does not require complex operations, so it is suitable for GPU.
  • 22. The double-double precision: handy and easy quadruple precision. "754-2008 IEEE Standard for Floating-Point Arithmetic": the binary64 (aka double precision) format has 16 significant decimal digits. It is widely used and very fast (Core i7 920: ~40 GFLOPS; RADEON HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS). However, a rounding error may occur at every arithmetic operation.
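  To make the last point concrete, here is a minimal demo of our own (not from the slides; plain C++ that also compiles as CUDA host code) showing binary64 rounding on ordinary decimal inputs:

      // Minimal demo: binary64 rounds even trivial decimal arithmetic.
      #include <cstdio>

      int main() {
          double x = 0.1 + 0.2;          // neither 0.1 nor 0.2 is exact in binary64
          std::printf("%.17g\n", x);     // prints 0.30000000000000004
          std::printf("%d\n", x == 0.3); // prints 0: the sum rounded away from 0.3
          return 0;
      }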
  • 23. The double-double precision: handy and easy quadruple precision. A double-double number a is expressed by two double precision numbers a_hi, a_lo: a = (a_hi, a_lo), representing the value a_hi + a_lo.
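  A minimal sketch of this representation (field names ours; the QD library's actual dd_real class is richer):

      // Double-double (sketch): the represented value is hi + lo, where
      // hi is the double nearest the true value and lo carries its
      // rounding error, giving roughly 32 significant decimal digits.
      struct dd_real {
          double hi;
          double lo;
      };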
  • 24. The double-double precision: handy and easy quadruple precision. Knuth's theorem (error-free transformation of the sum of two floating point numbers a, b): a + b = (a ⊕ b) + e, where ⊕ is addition including rounding error, + is exact addition, and e is a floating point number. We can evaluate the rounding error of an addition exactly!
  • 25. The double-double precision: handy and easy quadruple precision. Dekker's theorem (error-free transformation of the product of two floating point numbers a, b): a × b = (a ⊗ b) + e, where ⊗ is multiplication including rounding error, × is exact multiplication, and e is a floating point number. We can evaluate the rounding error of a multiplication exactly!
  • 26. The double-double precision: handy and easy quadruple precision. Based on Knuth's theorem we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operations including rounding error. When |a| ≥ |b| we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in three operations:
      Quick-Two-Sum(a, b):
      1. s ← a ⊕ b
      2. e ← b ⊖ (s ⊖ a)
      3. return (s, e)
  (s, e) = Quick-Two-Sum(a, b)
  • 27. The double-double precision: handy and easy quadruple precision. For arbitrary a, b we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in six operations:
      Two-Sum(a, b):
      1. s ← a ⊕ b
      2. v ← s ⊖ a
      3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
      4. return (s, e)
  (s, e) = Two-Sum(a, b)
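  A direct transcription of the two algorithms above into C++ (a sketch: compile with nvcc, or drop the __host__ __device__ qualifiers for CPU-only use; these rely on strict IEEE semantics, so value-unsafe optimizations such as -ffast-math must be disabled):

      // Quick-Two-Sum: requires |a| >= |b|; 3 operations.
      __host__ __device__ inline void quick_two_sum(double a, double b,
                                                    double &s, double &e) {
          s = a + b;
          e = b - (s - a);   // exact rounding error of a + b
      }

      // Two-Sum (Knuth): any a, b; 6 operations.
      __host__ __device__ inline void two_sum(double a, double b,
                                              double &s, double &e) {
          s = a + b;
          double v = s - a;
          e = (a - (s - v)) + (b - v);
      }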
  • 28. The double-double precision: handy and easy quadruple precision. Basics: Dekker's theorem. There exists an algorithm that calculates s = (a ⊗ b) and e = a × b − (a ⊗ b), where ⊗ is multiplication including rounding error, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations:
      Split(a):
      1. t ← (2^27 + 1) ⊗ a
      2. a_hi ← t ⊖ (t ⊖ a)
      3. a_lo ← a ⊖ a_hi
      4. return (a_hi, a_lo)

      Two-Prod(a, b):
      1. p ← a ⊗ b
      2. (a_hi, a_lo) ← Split(a)
      3. (b_hi, b_lo) ← Split(b)
      4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
      5. return (p, e)
  (s, e) = Two-Prod(a, b)
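  The same two algorithms in C++ (a sketch under the same assumptions as above):

      // Split (Dekker): a = a_hi + a_lo, each part fitting in 26 bits of
      // significand; 4 operations. 134217729.0 is 2^27 + 1 for binary64.
      __host__ __device__ inline void split(double a, double &a_hi, double &a_lo) {
          double t = 134217729.0 * a;
          a_hi = t - (t - a);
          a_lo = a - a_hi;
      }

      // Two-Prod (Dekker): exact product without FMA; 17 operations.
      __host__ __device__ inline void two_prod(double a, double b,
                                               double &p, double &e) {
          p = a * b;
          double a_hi, a_lo, b_hi, b_lo;
          split(a, a_hi, a_lo);
          split(b, b_hi, b_lo);
          e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo;
      }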
  • 29. The double-double precision: handy and easy quadruple precision. Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE":
      QuadAdd-IEEE(a, b):
      1. (s_hi, e_hi) ← Two-Sum(a_hi, b_hi)
      2. (s_lo, e_lo) ← Two-Sum(a_lo, b_lo)
      3. e_hi ← e_hi ⊕ s_lo
      4. (s_hi, e_hi) ← Quick-Two-Sum(s_hi, e_hi)
      5. e_hi ← e_hi ⊕ e_lo
      6. (c_hi, c_lo) ← Quick-Two-Sum(s_hi, e_hi)
      7. return c
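  In C++, reusing the dd_real, two_sum and quick_two_sum sketches above (our reading of the algorithm; intermediate variable names assumed):

      // QuadAdd-IEEE: full-accuracy double-double addition, 20 FLOPS.
      __host__ __device__ inline dd_real quad_add_ieee(dd_real a, dd_real b) {
          double s_hi, e_hi, s_lo, e_lo;
          two_sum(a.hi, b.hi, s_hi, e_hi);       // high parts, exactly
          two_sum(a.lo, b.lo, s_lo, e_lo);       // low parts, exactly
          e_hi += s_lo;
          quick_two_sum(s_hi, e_hi, s_hi, e_hi); // renormalize
          e_hi += e_lo;
          dd_real c;
          quick_two_sum(s_hi, e_hi, c.hi, c.lo); // final renormalization
          return c;
      }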
  • 30. The double-double precision: handy and easy quadruple precision. Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul":
      QuadMul(a, b):
      1. (p_hi, p_lo) ← Two-Prod(a_hi, b_hi)
      2. p_lo ← p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
      3. (c_hi, c_lo) ← Quick-Two-Sum(p_hi, p_lo)
      4. return c
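  In C++, reusing the sketches above:

      // QuadMul: full-accuracy double-double multiplication without FMA,
      // 24 FLOPS.
      __host__ __device__ inline dd_real quad_mul(dd_real a, dd_real b) {
          double p_hi, p_lo;
          two_prod(a.hi, b.hi, p_hi, p_lo);      // exact a_hi * b_hi
          p_lo += a.hi * b.lo + a.lo * b.hi;     // cross terms, rounded
          dd_real c;
          quick_two_sum(p_hi, p_lo, c.hi, c.lo);
          return c;
      }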
  • 31. The double-double precision: handy and easy quadruple precision. The FMA (fused multiply-add) instruction calculates a × b + c in one instruction: a × b + c is computed exactly and then rounded once to double precision.
  • 32. The double-double precision: handy and easy quadruple precision. Faster, using the FMA instruction: Two-Prod becomes 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA):
      Two-Prod-FMA(a, b):
      1. p ← a ⊗ b
      2. e ← FMA(a × b − p)
      3. return (p, e)
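  In C++ (fma is the standard C/CUDA fused multiply-add; on the C2050 it maps to a hardware instruction):

      #include <math.h>

      // Two-Prod-FMA: exact product in 3 operations (FMA counted as 2).
      // fma(a, b, -p) evaluates a*b - p with a single rounding, which is
      // exact here because the error of a*b fits in one double.
      __host__ __device__ inline void two_prod_fma(double a, double b,
                                                   double &p, double &e) {
          p = a * b;
          e = fma(a, b, -p);
      }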
  • 33. The double-double precision: handy and easy quadruple precision. Faster still: lower accuracy operations.
      QuadAdd-Cray(a, b):
      1. (c_hi, c_lo) ← Two-Sum(a_hi, b_hi)
      2. c_lo ← c_lo ⊕ (a_lo ⊕ b_lo)
      3. (c_hi, c_lo) ← Quick-Two-Sum(c_hi, c_lo)
      4. return c

      QuadMul-Sloppy(a, b):
      1. p ← a_hi ⊗ b_lo
      2. q ← a_lo ⊗ b_hi
      3. t ← p ⊕ q
      4. c_hi ← FMA(a_hi × b_hi + t)
      5. e ← FMA(a_hi × b_hi − c_hi)
      6. c_lo ← e ⊕ t
      7. return c
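  In C++, reusing the sketches above (our transcription of the two variants):

      // QuadAdd-Cray: skips one renormalization pass; 11 FLOPS.
      __host__ __device__ inline dd_real quad_add_cray(dd_real a, dd_real b) {
          dd_real c;
          two_sum(a.hi, b.hi, c.hi, c.lo);
          c.lo += a.lo + b.lo;                   // low parts added with rounding
          quick_two_sum(c.hi, c.lo, c.hi, c.lo);
          return c;
      }

      // QuadMul-Sloppy: folds the high product into two FMAs; 8 FLOPS
      // (FMA counted as 2).
      __host__ __device__ inline dd_real quad_mul_sloppy(dd_real a, dd_real b) {
          double t = a.hi * b.lo + a.lo * b.hi;  // cross terms, rounded
          dd_real c;
          c.hi = fma(a.hi, b.hi, t);
          double e = fma(a.hi, b.hi, -c.hi);
          c.lo = e + t;
          return c;
      }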
  • 34. The double-double precision: handy and easy quadruple precision. Summary: operation counts of the double-double arithmetic building blocks.
      Algorithm            # of operations
      Quick-Two-Sum        3
      Two-Sum              6
      Split                4
      Two-Prod             17
      Two-Prod-FMA         3*
      QuadAdd-IEEE         20
      QuadAdd-Cray         11
      QuadMul              24
      QuadMul-FMA          10*
      QuadMul-FMA-Sloppy   8*
  (* FMA counted as 2 FLOPS.) We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.
  • 35. The double-double precision: handy and easy quadruple precision. The QD library. Features: C++ classes; the double-double precision type is "dd_real"; free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey. Download: http://crd.lbl.gov/~dhbailey/mpdist/ Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf (Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.)
  • 36. Implementation on GPU and performance evaluation. We accelerated the matrix-matrix multiplication routine called "Rgemm". Prototype of Rgemm:
      void Rgemm(const char *transa, const char *transb, mpackint m,
                 mpackint n, mpackint k, dd_real alpha, dd_real *A,
                 mpackint lda, dd_real *B, mpackint ldb, dd_real beta,
                 dd_real *C, mpackint ldc)
  "MPACK" by M. Nakata is a multiple precision version of BLAS and LAPACK (the de facto standard linear algebra packages): http://mplapack.sourceforge.net/ ("Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.)
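  A hedged usage sketch (the header name is an assumption; see the MPACK distribution for the actual one). Like dgemm, Rgemm computes C ← alpha·op(A)·op(B) + beta·C on column-major dd_real arrays:

      #include <mblas_dd.h>   // MPACK double-double BLAS header (name assumed)

      void multiply_square(mpackint n, dd_real *A, dd_real *B, dd_real *C) {
          dd_real alpha = 1.0, beta = 0.0;
          // C <- 1.0 * A * B + 0.0 * C, no transposes, leading dimension n
          Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);
      }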
  • 37. Implementation on GPU and performance evaluation. Related studies: D. Mukunoki and D. Takahashi, "Implementation of double-double matrix matrix multiplication on GPU", HPCS, pp. 148-156 (2011) → matrix sizes must be multiples of 64; slower than our implementation. N. Nakasato, "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010 → matrix sizes must be multiples of 64; faster than our implementation. Neither implementation is practical for general matrix sizes → we implemented for general use.
  • 38. Implementation on GPU and evaluation. [Figure: NVIDIA C2050 architecture.]
  • 39. Implementation on GPU and evaluation. Block algorithm: we divide the matrices into small blocks of sizes b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.
  • 40. Implementation on GPU and evaluation. Basic algorithm: 1. Transfer the A, B, C matrices from CPU memory to GPU global memory. 2. Blocking: A_b: 16 × 16 and B_b: 16 × 64 (the most efficient choice). 3. Assign a 16 × 16 = 256-thread block to each block of elements; the (i, j)-th thread in a thread block works on the i-th row of A_b and the j, j+16, j+32, j+48-th columns (four columns at the same time) of B_b.
  • 41. Implementation on GPU and evaluation. Operation of each thread in detail: 1. Multiply beta into c0, c1, c2, c3 of the C matrix, which correspond to the i-th row of A_b and the j, j+16, j+32, j+48-th columns of B_b. 2. Read the first blocks A_b and B_b from global memory into shared memory; each thread of the block reads its own elements. 3. Calculate the partial inner products of row vector a_i of A_b with columns b_j, b_{j+16}, b_{j+32}, b_{j+48} of B_b as p0, p1, p2, p3. 4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0. 5. Read the next blocks A_b, B_b and repeat steps 3 and 4 until no further blocks are available. 6. Update the C matrix with c0, c1, c2, c3. 7. Finally transfer the C matrix from GPU global memory back to the CPU. A structural sketch of the kernel follows.
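  A structural sketch of the kernel in CUDA (our simplification, not the shipped code: column-major storage, no transposes, and m, n, k assumed to be multiples of the block sizes; the pointer redirecting of slide 46 removes that restriction. Plain double stands in for dd_real, so every multiply-add below stands for a QuadMul followed by a QuadAdd):

      // Launch: dim3 grid(n / 64, m / 16), dim3 block(16, 16).
      #define BM 16
      #define BK 16
      #define BN 64

      __global__ void rgemm_nn_sketch(int m, int n, int k, double alpha,
                                      const double *A, int lda,
                                      const double *B, int ldb,
                                      double beta, double *C, int ldc) {
          __shared__ double Ab[BK][BM];                 // 16x16 tile of A
          __shared__ double Bb[BN][BK];                 // 16x64 tile of B
          const int tx = threadIdx.x, ty = threadIdx.y; // 16x16 threads
          const int i = blockIdx.y * BM + ty;           // row of C
          const int j = blockIdx.x * BN + tx;           // first of 4 columns
          double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
          for (int p = 0; p < k; p += BK) {
              // Each thread loads one element of the A tile and four of B's.
              Ab[tx][ty] = A[i + (p + tx) * lda];
              for (int q = 0; q < 4; ++q)
                  Bb[tx + 16 * q][ty] = B[(p + ty) + (j + 16 * q) * ldb];
              __syncthreads();
              for (int s = 0; s < BK; ++s) {            // partial inner products
                  const double a = Ab[s][ty];
                  c0 += a * Bb[tx][s];
                  c1 += a * Bb[tx + 16][s];
                  c2 += a * Bb[tx + 32][s];
                  c3 += a * Bb[tx + 48][s];
              }
              __syncthreads();
          }
          // beta is applied at the end here; the slides apply it first --
          // the result is the same.
          C[i + j * ldc]        = alpha * c0 + beta * C[i + j * ldc];
          C[i + (j + 16) * ldc] = alpha * c1 + beta * C[i + (j + 16) * ldc];
          C[i + (j + 32) * ldc] = alpha * c2 + beta * C[i + (j + 32) * ldc];
          C[i + (j + 48) * ldc] = alpha * c3 + beta * C[i + (j + 48) * ldc];
      }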
  • 42. Implementation on GPU and evaluation. The performance of matrix-matrix multiplication in double-double precision, for square matrices (m = n = k), varying m. Max kernel performance was 16.4 GFLOPS; 16.1 GFLOPS with CPU-GPU transfer included. [Figure: GFLOPS vs. dimension for the NN kernel and NN total.]
  • 43. Implementation on GPU and evaluation. The performance of matrix-matrix multiplication in double-double precision with matrix transposes, for square matrices (m = n = k), varying m. No performance loss with matrix transposes was observed. [Figure: GFLOPS vs. dimension for the NN, NT, TN, TT kernel and total variants.]
  • 44. Implementation on GPU and evaluation. We observed no performance loss with matrix transposes; the reason is that we use texture memory instead of plain global memory accesses. Global memory and texture memory are physically the same, but with texture memory the performance loss without coalesced memory access is small. Also, it is relatively easy to hide memory-transfer latency in double-double precision since it is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
  • 45. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. A large performance loss (~35%) is observed for matrix sizes that are not multiples of 64.
  • 46. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. Simple algorithm: if a pointer falls outside the block, return the value at the nearest edge. A very simple program with a small performance loss. Breakthrough!
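  A minimal sketch of the idea (ours):

      // Pointer redirecting (sketch): clamp an out-of-range index to the
      // nearest edge so every thread reads valid memory; threads whose C
      // element lies outside the matrix skip the final store.
      __device__ inline int redirect(int i, int n) {
          return i < n ? i : n - 1;
      }

      // e.g. in the tile loader, instead of A[i + (p + tx) * lda]:
      //   A[redirect(i, m) + redirect(p + tx, k) * lda]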
  • 47. Implementation on GPU and evaluation. The performance loss was reduced from 35% to 6%! [Figure: kernel and total GFLOPS vs. dimension between 2050 and 2250.]
  • 48. Implementation on GPU and evaluation. Performance varied by only 0.1% over repeated measurements. [Figure: total GFLOPS over 100 repeated measurements, all near 15.556 GFLOPS.]
  • 49. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.]
  • 50. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurements on a Xeon 3470 with DDR3-1066.
      Algorithm                              Performance
      QuadAdd-Cray, QuadMul-Sloppy kernel    26.4 GFLOPS
      QuadAdd-Cray, QuadMul-Sloppy total     25.7 GFLOPS
      QuadAdd-Cray, QuadMul kernel           23.0 GFLOPS
      QuadAdd-Cray, QuadMul total            22.4 GFLOPS
      QuadAdd-IEEE, QuadMul-Sloppy kernel    18.1 GFLOPS
      QuadAdd-IEEE, QuadMul-Sloppy total     17.8 GFLOPS
      QuadAdd-IEEE, QuadMul kernel           16.4 GFLOPS
      QuadAdd-IEEE, QuadMul total            16.1 GFLOPS
      QuadAdd-IEEE, QuadMul CPU              100 MFLOPS
      QuadAdd-IEEE, QuadMul OpenMP CPU       400 MFLOPS
  • 51. Implementation on GPU and evaluation. 16.1 GFLOPS is about 92.4% (or 46.2%) of the estimated peak performance (QuadAdd-IEEE, QuadMul-FMA). Average double precision operations per dd flop: QuadAdd-IEEE takes 20 operations and QuadMul-FMA takes 10, and in Rgemm the same numbers of multiplications and additions appear, so (20 + 10 − 1)/2 = 14.5. The approximate theoretical peak is therefore 515 GFLOPS / 14.5 = 35.5 GFLOPS. However, the C2050's peak assumes full use of FMA, which our operation mix does not achieve, so the practical peak is 515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS.
  • 52. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD".
  • 53. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Semidefinite programming:
      Primal  min:  A_0 • X
              s.t.: A_i • X = b_i (i = 1, 2, ..., m), X ⪰ 0
      Dual    max:  Σ_{i=1}^m b_i z_i
              s.t.: Σ_{i=1}^m A_i z_i + Y = A_0, Y ⪰ 0
  Here the A_i are n × n symmetric matrices, X and Y are n × n symmetric variable matrices, b is an m-dimensional vector, X • Y := Σ_{ij} X_ij Y_ij, and X ⪰ 0 means X is positive semidefinite (all eigenvalues are larger than or equal to 0).
  • 54. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Nature of optimality. Theorem (complementary slackness): if (X*, Y*, z*) is a feasible interior point satisfying the primal and dual SDP conditions, then the necessary and sufficient condition for optimality of (X*, Y*, z*) is X* • Y* = 0.
  • 55. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". When X*, Y* are optimal, X* • Y* = 0, and rank X* + rank Y* ≤ n (1) also follows: at least one of X*, Y* is singular. Usually both X* and Y* are singular → unstable and/or less accurate behavior at the optimum.
  • 56. How to solve SDP: the interior point primal-dual path-following method. World-class implementations SDPA and SDPARA are available from the SDPA group led by Prof. Fujisawa.
      Step 0: Set the initial point x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0; let h = 0 and choose a parameter γ ∈ (0, 1).
      Step 1: Compute the Schur complement matrix B ∈ S^n: B_ij = ((X^h)^{-1} F_i Y^h) • F_j.
      Step 2: Solve the linear equation B dx = r and compute dX, dY from the solution dx, obtaining the next step (dx, dX, dY).
      Step 3: Determine the step size α keeping the matrices positive semidefinite: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
      Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα (dx, dX, dY).
      Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends; otherwise increment h and go back to Step 1.
  • 57. The Schur complement matrix becomes singular. B is called the "Schur complement matrix"; we solve the linear equation B dx = r to determine the next step, and this linear system becomes singular! Multiple precision arithmetic is needed for accurate solutions. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. iteration count.]
  • 58. Application: x10 acceleration of the semidefinite programming solver "SDPA-DD". Benchmark results on the larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.
      Problem     CPU(sec)    GPU(sec)   acceleration
      equalG51      6531.9       573.2   11.4
      gpp500-1       902.0        72.2   12.5
      gpp500-4       638.0        74.8    8.5
      maxG32       36284.4      4373.1    8.3
      maxG55      521575.4     53413.1    9.8
      mcp500-4       539.1        65.2    8.3
      qpG11        16114.7      1408.0   11.4
      qpG51        39678.9      3299.2   12.0
      ss30           310.7       138.6    2.2
      theta5        3250.0       239.8   13.6
      theta6        9028.2       623.6   14.5
      thetaG51     49161.5      4870.4   10.1
  • 59. Summary. http://mplapack.sourceforge.net/ Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: CPU x 150, peak performance 26 GFLOPS. [Figure: GFLOPS vs. dimension for the QuadMul-Sloppy/QuadMul-FMA x QuadAdd-Cray/QuadAdd-IEEE kernel and total variants.]