SlideShare une entreprise Scribd logo
1  sur  8
Télécharger pour lire hors ligne
Noname manuscript No.
(will be inserted by the editor)




Efficient Evaluation Methods of Elementary Functions
Suitable for SIMD Computation
Naoki Shibata




Received: date / Accepted: date



Abstract Data-parallel architectures like SIMD (Sin-         SIMT (Single Instruction Multiple Thread), rather than
gle Instruction Multiple Data) or SIMT (Single Instruc-      extracting more instruction-level parallelism. Intel’s x86
tion Multiple Thread) have been adopted in many re-          architecture employs 128-bit-wide SIMD instructions,
cent CPU and GPU architectures. Although some SIMD           but it will be extended to 256 bit in the Sandy Bridge
and SIMT instruction sets include double-precision arith-    processors and 512 bit in the Larrabee processors[1].
metic and bitwise operations, there are no instructions      Many supercomputers utilize AMD Opteron Proces-
dedicated to evaluating elementary functions like trigono-   sors as computing cores. IBM Roadrunner[2] also uti-
metric functions in double precision. Thus, these func-      lizes Cell Broadband Engines (Cell B.E.)[3], and both of
tions have to be evaluated one by one using an FPU           these architectures have powerful SIMD engines. nVidia’s
or using a software library. However, traditional algo-      CUDA-enabled GPGPUs use SIMT architecture, and
rithms for evaluating these elementary functions in-         can achieve 600 double-precision GFLOPS with a Tesla
volve heavy use of conditional branches and/or table         C2050 processor[4].
look-ups, which are not suitable for SIMD computa-               Utilizing SIMD or SIMT processing is becoming more
tion. In this paper, efficient methods are proposed for        and more important for many kinds of applications, and
evaluating the sine, cosine, arc tangent, exponential and    double-precision calculation capabilities are especially
logarithmic functions in double precision without table      important for many scientific applications. However,
look-ups, scattering from, or gathering into SIMD reg-       those instruction sets only include basic arithmetic and
isters, or conditional branches. We implemented these        bitwise operations. Instructions for evaluating elemen-
methods using the Intel SSE2 instruction set to evaluate     tary functions like trigonometric functions in double-
their accuracy and speed. The results showed that the        precision are not provided, and we have to evaluate
average error was less than 0.67 ulp, and the maximum        these functions with a conventional FPU or a software
error was 6 ulps. The computation speed was faster           library. An FPU calculation is fast, but FPU instruc-
than the FPUs on Intel Core 2 and Core i7 processors.        tions for evaluating the elementary functions are only
Keywords SIMD, elementary functions                          available in a few architectures.
                                                                 We need to take special care for efficient computa-
                                                             tion utilizing SIMD instructions. Gathering and scat-
1 Introduction                                               tering data from and into SIMD registers are expensive
                                                             operations. Looking up different table entries according
Recently, computer architects have been trying to achieve    to the value of each element in a SIMD register can be
high performance by adopting data-parallel architec-         especially expensive, since each table look-up may cause
tures like SIMD (Single Instruction Multiple Data) or        a cache miss. Also, the SSE instructions[5] in the x86
                                                             architecture have restrictions on data alignment, and
Naoki Shibata
Department of Information Processing & Management,
                                                             an access to unaligned data is always expensive. Ran-
Shiga University                                             dom conditional branches are expensive for processors
1-1-1 Bamba, Hikone 522-8522, Japan                          with a long instruction pipeline. On the other hand,
E-mail: shibata@biwako.shiga-u.ac.jp                         many SIMD architectures provide instructions for di-
2


vision and taking square root, and these calculations         ditional irregular algorithms are sometimes difficult to
are now inexpensive. Thus, sometimes traditional algo-        parallelize on massively multi-threaded hardware, and
rithms are not suitable, and we need new algorithms           many researchers are now working on new algorithms
for them.                                                     for such hardware.
    In this paper, we propose new methods for evaluat-
ing the sine, cosine, arc tangent, exponential and loga-
rithmic functions in double precision which are suitable      3 Proposed Method
for processors with a long instruction pipeline and an
SIMD instruction set. Evaluations of these functions are      In this section, the details of the proposed methods are
carried out by a series of double-precision arithmetic        explained. The working C codes are provided which can
and bitwise operations without table look-ups, scatter-       be easily converted into a series of SIMD instructions.
ing/gathering operations, or conditional branches. The        Having converted the codes, a target function can be
total code size is very small, and thus they are also suit-   evaluated with multiple inputs in a SIMD register si-
able for Cell B.E. which has only 256K bytes of directly      multaneously.
accessible scratch pad memory in each SPE. The pro-
posed methods were implemented using the Intel SSE2
instruction set for evaluating accuracy and speed. The        3.1 Trigonometric Functions
results showed that the proposed methods were faster
                                                              The proposed method evaluates the functions of sine
than FPU calculations on Intel Core 2 and Core i7 pro-
                                                              and cosine on the argument of d at the same time. The
cessors, and the average and maximum errors were less
                                                              proposed method consists of two steps. First, the ar-
than 0.67 ulp (unit in the last place) and 6 ulps, respec-
                                                              gument d is reduced to within 0 and π/4 utilizing the
tively. The total code size was less than 1400 bytes.
                                                              symmetry and periodicity of the sine and cosine func-
    The remainder of this paper is organized as fol-
                                                              tions, shown as (1) and (2). Second, an evaluation of
lows. Section 2 introduces some related works. Section
                                                              the sine and cosine function on the reduced argument s
3 presents the proposed methods with working C codes.
                                                              is performed assuming that the argument is within the
Section 4 shows the results of the evaluation of accu-
                                                              range.
racy, speed and code size. Section 5 presents our con-
clusions.
                                                                          π               π
                                                              cos x = sin(  − x) = sin(x + )                       (1)
                                                                          2               2
2 Related Works                                                           π                  π
                                                              sin x = cos( − x) = − cos(x + )                      (2)
                                                                          2                  2
Some software libraries have capabilities for evaluating
                                                                  For the first step, we first find s and an integer q in
elementary functions. A few libraries with capabilities
                                                              (3) so that 0 ≤ s < π/4.
for evaluating elementary functions using SIMD com-
putation in single precision are available[6,7,8]. How-
                                                                       π
ever, adding capabilities of double-precision evaluations     d = s+     q                                         (3)
to these libraries is not a straightforward task. Software             4
modules like the x87 FPU emulator code in Linux oper-         q can be found by dividing d by π/4 and applying the
ating system kernel[9] or the libm in GNU C Library[10]       floor function. Then s can be found by multiplying q by
have capabilities for evaluating elementary functions in      π/4 and subtracting it from d. We need to take special
double precision. These modules do not utilize SIMD           care here, because a cancellation error can make the
computation, and they involve heavy use of conditional        derived value of s inaccurate when d is a large num-
branches. Some multiple-precision floating-point com-          ber close to a multiple of π/4. Some implementations
putation libraries have capabilities for evaluating ele-      like FPUs on x86 architectures exhibit this problem. In
mentary functions in multiple-precision[11, 12]. The pro-     this paper, a way for efficiently preventing this prob-
gram structures of these libraries are very different from     lem is proposed. The problem is in the last multipli-
the proposed method.                                          cation and subtraction, and the proposed method pre-
    There are many researches on evaluating elementary        vents the problem by calculating this part with extra
functions on hardware[13,14,15, 16]. The restrictions of      precision utilizing properties of the IEEE 754 standard
calculation are very different from the SIMD computa-          for floating-point arithmetic[19]. Here, q is assumed to
tions, and many of these methods use table look-ups.          be expressed in 26 bits, which is a half of the length of
    There are many researches on accelerating applica-        the mantissa in an IEEE 754 double-precision FP num-
tions using SIMD or SIMT computation[17,18]. Tra-             ber. The value of π/4 is split into three parts so that
3




Fig. 1 Splitting π/4 into three parts




all of the bits in the lower half of the mantissa are zero,
as shown in Fig. 1. Then, multiplying these numbers
                                                              cos 2x = 2 cos2 x − 1                                       (6)
with q does not produce an error at all, and subtract-                 √
ing these numbers in descending order from d does not          sin x = 1 − cos2 x                                         (7)
produce a cancellation error either.
                                                              Now, we have another problem. If x is close to 0, find-
    For the second step, we evaluate the sine and cosine
                                                              ing the value of the sine function using (7) produces a
functions on the argument s. In many existing imple-
                                                              cancellation error, since cos2 x is close to 1. In this pa-
mentations, these evaluations are performed by simply
                                                              per, instead of cos s, the value of 1 − cos s is found. Let
summing up the terms of the corresponding Taylor se-
                                                              f (x) be 1 − cos x, and we get the following equations.
ries, shown as (4).

                                                              f (2x) = 4f (x) − 2f (x)2                                   (8)
         ∞
        ∑ (−1)n                   x3   x5                              √
sin x =               x2n+1 = x −    +    − ···        (4)     sin x = 2f (x) − f (x)2                                    (9)
        n=0
            (2n + 1)!             3!   5!
                                                              Since s is between 0 and π/4, the application of (8) or
However, if these terms are summed up using double-           (9) does not produce cancellation or rounding errors. By
precision calculations, we cannot get an accurate result      evaluating the first five terms of the Taylor series and
because of the accumulation of errors. In order to im-        applying (8) three times, we can evaluate the functions
prove the accuracy while speeding up the evaluation,          of sine and cosine on the argument of s with sufficient
many implementations use another way of argument              accuracy. In order to improve the computational per-
reduction [11]. The argument s is divided by a certain        formance, we use the following technique. Let g(x) be
number, the terms of the Taylor series are evaluated          2f (x) and we get the following equation:
for this value, and then the triple-angle formula (5) is
used to obtain the final value. For example, in order          g(2x) = (4 − g(x)) · g(x)                                  (10)
to evaluate sin 0.27, we first divide 0.27 by 9, evaluate
the terms of the Taylor series for sin 0.03, and then ap-     By using (10) instead of (8), we can omit one multi-
ply the triple-angle formula two times to find the value       plication each time we apply the double-angle formula.
of sin 0.27. Since the argument for the Taylor series is      This also improves the accuracy by a little. To incorpo-
reduced, its convergence is accelerated, and we do not        rate the result from the first step, we use the symmetry
need to evaluate many terms to get an accurate value.         and periodicity of the trigonometric functions. This is
In this way, we can reduce the accumulation of errors.        performed by exchanging the derived values of sin s and
                                                              cos s, and multiplying those values by −1 according to
                                                              the remainder q divided by 4. In order to evaluate the
sin 3x = 3 sin x − 4 sin3 x                            (5)    tangent function, we can simply use (11).
                                                                  A working C code for the proposed method is shown
However, this way does not yield sufficient accuracy            in Fig. 21 .
with double-precision calculations, because the differ-
ence between 3 sin x and 4 sin3 x becomes large if sin x                sin x
is small, and applying the triple-angle formula produces      tan x =                                                    (11)
                                                                        cos x
a rounding error. To prevent this error, we can use the         1 The code contains conditional branches for ease of reading.
double-angle formula (6) of cosine. In this case, the         All of these conditional branches can be eliminated by unrolling
value of the sine function can be evaluated from the          loops or replacing them by combinations of comparisons and bit-
value of the cosine function using (7).                       wise operations.
4



             #i n c l u d e <math . h>

             typedef         struct     sincos t         { double s ,           c;    }   sincos t ;

             #d e f i n e N 3

             sincos t         x s i n c o s 0 ( double     s) {      / / 0 ≤ s < π/4
               int i ;

                 / / Argument reduction
                 s = s ∗ pow ( 2 , −N ) ;

                 s = s ∗ s ; / / Evaluating Taylor series
                 s = ( ( ( ( s /1814400 − 1.0/20160)∗ s + 1 . 0 / 3 6 0 )∗ s − 1 . 0 / 1 2 ) ∗ s + 1)∗ s ;

                 f o r ( i =0; i < ; i ++) { s = (4− s ) ∗ s ;
                                  N                                             } / / Applying double angle formula
                 s = s / 2;

                 s i n c o s t sc ;
                 s c . s = s q r t ((2 − s ) ∗ s ) ;       sc . c = 1 − s ;
                 return s c ;
             }

             #d e f i n e    PI4 A    . 7 8 5 3 9 8 1 5 5 4 5 0 8 2 0 9 2 2 8 5 1 5 6 2 5 / / π/4 split into three parts
             #d e f i n e    PI4 B    . 7 9 4 6 6 2 7 3 5 6 1 4 7 9 2 8 3 6 7 1 3 6 0 4 6 2 9 0 3 9 7 6 4 4 0 4 2 9 6 8 7 5 e−8
             #d e f i n e    PI4 C    . 3 0 6 1 6 1 6 9 9 7 8 6 8 3 8 2 9 4 3 0 6 5 1 6 4 8 3 0 6 8 7 5 0 2 6 4 5 5 2 4 3 7 3 6 1 4 8 0 7 6 9 e −16

             / / 4/π
             #d e f i n e    M 4 PI    1.273239544735162542821171882678754627704620361328125

             s i n c o s t x s i n c o s ( double d ) {
                 double s = f a b s ( d ) ;
                 i n t q = ( i n t ) ( s ∗ M 4 PI ) , r = q + ( q & 1 ) ;

                 s −= r ∗ PI4 A ;            s −= r ∗ PI4 B ;                s −= r ∗ PI4 C ;

                 sincos t        sc = xsincos0 ( s ) ;

                 if   ( ( ( q + 1 ) & 2 ) != 0 ) { s = s c . c ; s c . c = s c . s ;                          sc . s = s ;       }
                 if   ( ( ( q & 4 ) != 0 ) != ( d < 0 ) ) s c . s = −s c . s ;
                 if   ( ( ( q + 2 ) & 4 ) != 0 ) s c . c = −s c . c ;

                 return       sc ;
             }

             double         x s i n ( double d ) {       sincos t          sc = xsincos (d ) ;              return       sc . s ; }
             double         x c o s ( double d ) {       sincos t          sc = xsincos (d ) ;              return       sc . c ; }
             double         x t a n ( double d ) {       sincos t          sc = xsincos (d ) ;              return       sc . s / sc . c ;            }




Fig. 2 Working C code for evaluating trigonometric functions



3.2 Inverse Trigonometric Functions                                                                          Thus, if d ≤ 1, the argument can be reduced by evalu-
                                                                                                             ating the arc tangent function on the argument of 1/d
We cannot get an accurate result by simply summing                                                           instead of d, and subtracting the result from π/2. Thus,
up many terms of the Taylor series (12) using double-                                                        we can now assume d > 1 and π/4 < θ < π/2. In or-
precision calculations because of an accumulation of er-                                                     der to reduce the argument d = tan θ, we can use (14)
rors. As far as we surveyed, no good way is known for                                                        to enlarge cot θ = 1/ tan θ. Please note that cot x/2 is
reducing the argument for evaluating inverse trigono-                                                        larger than cot x if 0 < x < π/2. For example, assume
metric functions. In this paper, a new way of argument                                                       that we are evaluating arctan 2, and φ is a number such
reduction for evaluating the arc tangent function utiliz-                                                    that 2 = tan φ. We first apply (13) to get√ φ = 1/2.
                                                                                                                                                          cot
ing the half-angle formula (14) of the cotangent function                                                    Next, we apply (14) to get cot φ/2 = 1/2+ √1 + (1/2)2 ,
is proposed.                                                                                                 and apply (13) √ get tan φ/2 = 1/(1/2 + 5/4). Note
                                                                                                                             to
                                                                                                             that 1/(1/2 + 5/4) is less than 2, thus we reduced
             ∞
            ∑ (−1)n                                                                                          the argument. Then, we evaluate√   several terms of (12)
arctan x =             x2n+1                     (12)                                                        for the arc tangent of 1/(1/2 + 5/4), which is φ/2.
            n=0
                2n + 1
                        (π      )                                                                            Finally, we multiply 2 to the result to get φ = arctan 2.
              1
    cot x =       = tan      −x                  (13)                                                            By applying (14) two times and summing the first
            tan x √       2
       x                                                                                                     eleven terms of Taylor series, we can evaluate the arc
   cot = cot x ± 1 + cot2 x                      (14)                                                        tangent function with sufficient accuracy.
       2
   Let d be the argument and θ = arctan d. Thus, d =                                                             The arc sine and the arc cosine functions can be
tan θ holds. For simplicity, we now assume d ≥ 0 and                                                         evaluated using the following equation.
0 ≤ θ < π/2. From (13), the following equation holds.

      (π   )  1                                                                                                                                             x
tan      −θ =                                                                               (15)             arcsin x = arctan √                                   (16)
       2      d                                                                                                                                           1 − x2
5

                            √
                                1 − x2                                           from d, where we have a cancellation error when d is
arccos x = arctan                                                         (17)
                                  x                                              close to a multiple of loge 2. In order to prevent this,
However, these equations produce cancellation errors if                          the value of loge 2 is split into two parts in a similar
|x| is close to 1. These errors can be avoided by modify-                        way as in the case of the trigonometric functions, and
ing the above equations as follows. Note that no cancel-                         each part is multiplied by q and subtracted from d in
lation error is produced by subtracting x from 1 in these                        turn.
equations. A working C code for the proposed method                                  For the second step, we can use (21) for reducing the
is shown in Fig. 3.                                                              argument, and then evaluate the terms of the Taylor
                                                                                 series (22).
                                          x
arcsin x = arctan √                                                       (18)
                                (1 + x)(1 − x)                                   exp 2x = (exp x)2                                   (21)
                            √
                                (1 + x)(1 − x)                                             ∞
                                                                                          ∑ xn          x2   x3
arccos x = arctan                                                         (19)    exp x =         =1+x+    +    + ···                (22)
                                     x                                                        n!        2!   3!
                                                                                          n=0

                                                                                 However, we have a problem if we simply use these two
                                                                                 equations. If s is close to 0, the difference between the
       #i n c l u d e <math . h>
                                                                                 first term, which is 1, and the other terms in (22) is too
       #d e f i n e M PI 3 . 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6
       #d e f i n e N 2                                                          large and produces rounding errors. In order to avoid
        double x a t a n ( double d ) {                                          this problem, the proposed method finds the value of
          double x , y , z = f a b s ( d ) ;
          i f ( z < 1.0) x = 1.0/ z ; else                   x = z;              exp s − 1 instead of exp s. Let h(x) be exp x − 1, and
            i n t i ; / / Applying cotangent half angle formula                  we derive the following equation from (21) and (22):
            f o r ( i = 0 ; i < N ; i ++) { x = x + s q r t (1+ x∗x ) ;   }

            x = 1.0     / x;

            y = 0 ; / / Evaluating Taylor series
            f o r ( i =10; i >=0; i −−) {
                                                                                 h(2x) = (2 + h(x)) · h(x)                           (23)
                y = y∗x∗x + pow( −1 , i ) / ( 2 ∗ i + 1 ) ;                               ∞
                                                                                         ∑ xn
            }
            y ∗= x ∗ pow ( 2 , N ) ;                                              h(x) =                                             (24)
            i f ( z > 1 . 0 ) y = M PI /2 − y ;                                          n=1
                                                                                             n!
            return d < 0 ? −y : y ;
        }

        double x a s i n ( double d ) {
                                                                                     By summing the first eight terms of (24), applying
          return x a t a n ( d / s q r t ((1+ d )∗(1 − d ) ) ) ;
        }
                                                                                 (23) four times, and then adding 1 to the result, we can
        double x a c o s ( double d ) {
                                                                                 evaluate the exponential function on the argument of
          return x a t a n ( s q r t ((1+ d )∗(1 − d ) ) / d ) +
            ( d < 0 ? M PI : 0 ) ;                                               s with sufficient accuracy. By incorporating the result
        }
                                                                                 from the first step, we get the final result by multiplying
                                                                                 the result by 2q as shown in (25). Calculation of 2q from
Fig. 3 Working C code for evaluating inverse trigonometric func-                 q can be performed by substituting 1023 + q for the
tions                                                                            exponent of 1, utilizing the properties of the IEEE 754
                                                                                 standard.
                                                                                     A working C code for the proposed method is shown
                                                                                 in Fig. 4.
3.3 Exponential Function

The proposed method for evaluating the exponential                               exp(s + q · loge 2) = (exp s) · exp(q · loge 2)
function consists of two steps. As the first step, the
                                                                                                     = (exp s) · 2q                  (25)
argument d is reduced to within 0 and loge 2. Then, as
the second step, the exponential function is evaluated
on the reduced argument s.
                                                                                 3.4 Logarithmic Function
   For the first step, in order to evaluate exp d, we first
find s and an integer q in (20) so that 0 ≤ s < loge 2.
                                                                                 In order to evaluate the logarithmic function, we can
                                                                                 use the series shown as (26) which converges faster than
d = s + q · loge 2                                                        (20)   the Taylor series[20].

Here, we have the same problem as we had in the case of                                     ∞         (      )2n+1
the trigonometric functions. In order to find the value                                      ∑     1     x−1
                                                                                 log2 x = 2                                          (26)
of s, we have to multiply q by loge 2 and subtract it                                       n=0
                                                                                                2n + 1 x + 1
6



             #i n c l u d e <math . h>

             #d e f i n e M LN2 0 . 6 9 3 1 4 7 1 8 0 5 5 9 9 4 5 3 0 9 4 1 7 2 3 2 1 2 1 4 5 8 1 7 6 6    / / loge 2

             / / loge 2 split into two parts
             #d e f i n e L2U . 6 9 3 1 4 7 1 8 0 5 5 9 6 6 2 9 5 6 5 1 1 6 0 1 8 0 5 6 8 6 9 5 0 6 8 3 5 9 3 7 5
             #d e f i n e L2L . 2 8 2 3 5 2 9 0 5 6 3 0 3 1 5 7 7 1 2 2 5 8 8 4 4 8 1 7 5 0 1 3 4 3 6 0 2 5 5 2 5 4 1 2 0 6 8 e −12

             #d e f i n e N 4

             double xexp ( double d ) {
               i n t q = ( i n t ) ( d / M LN2 ) , i ;
               double s = d − q ∗ L2U − q ∗ L2L ,                              t;

                 s = s ∗ pow ( 2 , −N ) ;             / / Argument reduction

                 / / Evaluating Taylor series
                 t = ( ( ( s /40320 + 1 . 0 / 5 0 4 0 )∗ s + 1 . 0 / 7 2 0 ) ∗ s + 1 . 0 / 1 2 0 ) ∗ s ;
                 t = ( ( ( ( t + 1 . 0 / 2 4 )∗ s + 1 . 0 / 6 ) ∗ s + 1 . 0 / 2 ) ∗ s + 1)∗ s ;

                 f o r ( i =0; i < ; i ++) { t = (2+ t ) ∗ t ;
                                  N                                        }

                 return       ( t + 1 ) ∗ pow ( 2 ,          q);
             }




Fig. 4 Working C code for evaluating exponential function



The proposed method consists of two steps. The first
step is reducing argument by splitting the original argu-                                                               #i n c l u d e <math . h>

ment into its mantissa and exponent. The second step is                                                                 double x l o g ( double d ) {
                                                                                                                          int e , i ;
to evaluate (26) on the mantissa part as the argument.                                                                      double m = f r e x p ( d , &e ) ;
                                                                                                                            i f (m < 0 . 7 0 7 1 ) { m ∗= 2 ;    e −−; }
    As the first step, we split the original argument d
                                                                                                                            double x = (m−1) /          (m+ 1 ) , y = 0 ;
into its mantissa m and an integer exponent e where
0.5 ≤ m < 1, as shown in (27), utilizing the property of                                                                    f o r ( i =19; i >=1; i −=2) { y = y∗x∗x + 2 . 0 / i ;   }

                                                                                                                            return    e ∗ l o g ( 2 ) + x∗y ;
the IEEE 754 standard. Since (26) converges fast if its                                                                 }

argument is close to 1, we multiply 2 to m and subtract
                               √
1 from e √ m is smaller than 2/2. Then m is in the
           if          √                                                                                  Fig. 5 Working C code for evaluating logarithmic function
range of 2/2 ≤ m < 2.


x = m · 2e                                                                                (27)            We used gcc 4.3.3 with -msse2 -mfpmath=sse -O op-
    As the second step, we simply evaluate the first 10                                                    tions to compile our programs.
terms of the series of (26) to evaluate log m. We can get
the value of log d by adding e log 2 to the value of log m
to get the final result, as shown in (28).                                                                 4.1 Accuracy
    A C code for this method is shown in Fig. 5
                                                                                                          The accuracy of the proposed methods was tested by
                                                                                                          comparing the results produced by the proposed meth-
log(m · 2e ) = log m + log 2e                                                                             ods to the accurate results produced by 256 bit pre-
             = log m + e log 2                                                            (28)            cision calculations using the MPFR Library[12]. We
                                                                                                          checked the accuracy by evaluating the functions on
                                                                                                          the argument of each value in a range at regular inter-
4 Evaluation                                                                                              vals, and we measured the average and maximum error
                                                                                                          in ulp. We measured the evaluation accuracy within a
The proposed methods were tested to evaluate accu-                                                        few ranges for each function. The ranges and the mea-
racy, speed, and code size. The tests were performed on                                                   surement results are shown in Tables 1, 2, 3, 4 and 5.
a PC with Core i7 920 2.66GHz. The system is equipped                                                     Please note that because of the accuracy requirements
with 3 GB of RAM and runs Linux (Ubuntu Jaunty                                                            of IEEE 754 standard, these results do not change be-
with 2.6.28 kernel, IA32). The proposed methods were                                                      tween platforms comform to the standard.
implemented in C language with Intel SSE2 intrinsics2 .                                                       The evaluation error did not exceed 6 ulps in every
  2 An intrinsic is a function known by the compiler that directly                                        test. The average error of 0.28 ulp to 0.67 ulp can be
maps to a sequence of one or more assembly language instruc-                                              considered to be good, since even if accurate numbers
tions.                                                                                                    are correctly rounded, we have 0.25 ulp of average error.
7


Table 1 Average and maximum error for trigonometric functions (ulp)
                        Range          interval     sin avg        sin max       cos avg       cos max      tan avg        tan max
                    0 ≤ x ≤ π/4          1e-7        0.402           2.97           0.278        1.78        0.580          4.74
                   −10 ≤ x ≤ 10          1e-6        0.378           3.80           0.370        3.94        0.636          5.14
                   0 ≤ x ≤ 20000         1e-3        0.373           3.53           0.373        3.52        0.636          5.61
                 10000 ≤ x ≤ 10020       1e-6        0.374           3.79           0.373        3.35        0.639          5.86



Table 2 Average and maximum error for arc tangent functions (ulp)
                                          Range           interval        arctan avg        arctan max
                                        −1 ≤ x ≤ 1          1e-7            0.628              4.91
                                        0 ≤ x ≤ 20          1e-6            0.361              4.68



Table 3 Average and maximum error for arc sine and arc cosine functions (ulp)
                             Range       interval       arcsin avg        arcsin max        arccos avg      arccos max
                          −1 ≤ x ≤ 1      1e-7            0.665              5.98              0.434           5.15



Table 4 Average and maximum error for exponential function (ulp)
                                                Range         interval         exp avg       exp max
                                          0≤x≤1                   1e-7          0.295          1.97
                                         −10 ≤ x ≤ 10             1e-6          0.299          1.93



Table 5 Average and maximum error for logarithmic function (ulp)
                                                Range          interval        log avg      log max
                                            0<x≤1                  1e-7         0.430          2.54
                                          0 < x ≤ 1000             1e-4         0.286          2.65



Table 6 Time taken for pair of evaluations (sec.)
                                         Core i7 2.66GHz (IA32)                       Core 2 2.53GHz (x86 64)
                                       Proposed     FPU    MPFR                     Proposed     FPU    MPFR
                            sin+cos     3.29e-8         7.92e-8      3.53e-5        3.51e-8       8.87e-8    2.80e-5
                               sin      3.29e-8         7.04e-8      1.33e-5        3.51e-8       8.83e-8    1.08e-5
                               cos      3.29e-8         7.04e-8      2.14e-5        3.51e-8       8.92e-8    1.68e-5
                               tan      3.68e-8         1.82e-7      1.68e-5        4.12e-8       1.65e-7    1.33e-5
                             arctan     4.72e-8         1.82e-7      2.11e-5        4.98e-8       1.61e-7    1.33e-5
                             arcsin     6.64e-8          N/A         3.81e-5        7.35e-8        N/A       2.69e-5
                             arccos     6.73e-8          N/A         3.92e-5        7.68e-8        N/A       2.73e-5
                              exp       2.45e-8          N/A         1.59e-5        2.85e-8        N/A       1.33e-5
                               log      2.14e-8         4.59e-8      1.19e-5        2.69e-8       4.88e-8    8.14e-6



Table 7 Number of instructions and size of subroutines
                                                        sin+cos+tan          arctan+arcsin+arccos           exp      log
                           Number of instructions             94                          94                 52   49
                           Subroutine size (byte)             426                        416                253   254
8


4.2 Speed                                                    References

In this test, we compared the speed of the proposed           1. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash,
                                                                 P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Es-
methods for evaluating the functions to those of an FPU          pasa, E. Grochowski, T. Juan, and P. Hanrahan : “ Larrabee:
and the MPFR Library[12]. Besides the Core i7 system             A many-core x86 architecture for visual computing,” Proc.
mentioned above, we conducted the speed test on a PC             of ACM SIGGRAPH 2008, pp. 1–15, 2008.
with Core 2 Duo E7200 2.53GHz with 2GB of RAM                 2. K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S.
                                                                 Pakin and J. Sancho : “Entering the petaflop era: the archi-
running Linux x86 64 (Debian Squeeze) with a 2.6.26              tecture and performance of Roadrunner,” Proc. of the 2008
kernel. There was no register spill in our methods. In           ACM/IEEE conference on Supercomputing, pp. 1–11, 2008.
these tests, we used the gettimeofday function to mea-        3. M. Gschwind, H. Hofstee, B. Flachs, M. Hopkins, Y. Watan-
sure the time for each subroutine to execute 100 million         abe and T. Yamazaki : “Synergistic Processing in Cell’s Mul-
                                                                 ticore Architecture,” IEEE Micro, Vol. 26, No. 2, pp. 10–24,
times in a single thread. In order to measure the time           2006.
for an FPU to evaluate each function, we wrote subrou-        4. “Tesla C2050 and Tesla C2070 Computing Processor Board,”
tines with an inline assembly code for executing the cor-        http://www.nvidia.com/docs/IO/43395/BD-04983-001_
responding FPU instruction. These FPU subroutines                v01.pdf
                                                              5. S. Thakkar and T. Huff : “Internet Streaming SIMD Exten-
execute two corresponding FPU instructions per call,             sions,” Computer, Vol. 32, No. 12, pp. 26–34, 1999.
since the subroutines for the proposed methods evalu-         6. Approximate Math Library 2.0, http://www.intel.com/
ate the target function on the argument of each element          design/pentiumiii/devtools/AMaths.zip.
in a SIMD register, which contains two double precision       7. Simple SSE and SSE2 optimized sin, cos, log and exp, http:
                                                                 //gruntthepeon.free.fr/ssemath/.
numbers. For sin+cos measurement, we measured the             8. L. Nyland and M. Snyder: “Fast Trigonometric Functions
time for an FPU to execute the sincos instruction to             Using Intel’s SSE2 Instructions,” Tech. Report.
evaluate the sine and cosine functions simultaneously.        9. Linux Kernel Version 2.6.30.5, http://www.kernel.org/.
For the MPFR Library, we set 53 bit as its precision,        10. GNU C Library Version 2.7, http://www.gnu.org/software/
                                                                 libc/.
and measured the time for each subroutine to execute         11. R. Brent : “Fast Algorithms for High-Precision Computation
1 million times. The measured times in second for each           of Elementary Functions,” Proc. of 7th Conference on Real
subroutine to perform one pair of evaluations on the             Numbers and Computers (RNC 7), pp. 7-8, 2006.
two systems are shown in Table 6.                            12. The MPFR Library, http://www.mpfr.org/.
                                                             13. J. Detrey, F. Dinechin and X. Pujul : “Return of the hard-
    The execution speed of the subroutines of the pro-           ware floating-point elementary function,” Proceedings of the
posed methods was more than twice as fast as the equiv-          18th IEEE Symposium on Computer Arithmetic, pp. 161–
alent FPU instructions.                                          168, 2007.
                                                             14. I. Koren and O. Zinaty : “Evaluating Elementary Functions
                                                                 in a Numerical Coprocessor Based on Rational Approxima-
                                                                 tions,” IEEE Transactions on Computers, Vol. 39, No. 8,
4.3 Code Size                                                    pp.1030-1037, 1990.
                                                             15. M. Ercegovac, T. Lang, J. Muller and A. Tisserand : “Recip-
We measured the number of instructions and code size             rocation, Square Root, Inverse Square Root, and Some El-
for each subroutines of our methods by disassembling             ementary Functions Using Small Multipliers,” IEEE Trans-
                                                                 actions on Computers, Vol. 49, No. 7, pp. 628–637, 2000.
the executable files for an IA32 system using gdb and
                                                             16. E. Goto, W.f. Wong, “Fast Evaluation of the Elementary
counting these numbers in the subroutines. Please note           Functions in Single Precision,” IEEE Transactions on Com-
that each instruction in the subroutines is executed ex-         puters, Vol. 44, No. 3, pp. 453-457, 1995.
actly once per call, since they do not contain conditional   17. D. Scarpazza and G. Russell : “High-performance regular
                                                                 expression scanning on the Cell/B.E. processor,” Proc. of the
branches. The results are shown in Table 7.
                                                                 23rd international conference on Supercomputing, pp. 14–25,
                                                                 2009.
                                                             18. M. Rehman, K. Kothapalli and P. Narayanan : “Fast and
5 Conclusion                                                     scalable list ranking on the GPU,” Proc. of the 23rd inter-
                                                                 national conference on Supercomputing, pp. 235–243, 2009.
                                                             19. D. Goldberg : “What every computer scientist should know
We proposed efficient methods for evaluating elemen-
                                                                 about floating-point arithmetic,” ACM Computing Surveys,
tary functions in double precision without table look-           Vol. 23, No. 1, pp. 5–48, 1991.
ups, scattering from, or gathering into SIMD registers,      20. M. Abramowitz and I. Stegun : “Handbook of Mathemat-
or conditional branches. The methods were implemented            ical Functions: with Formulas, Graphs, and Mathematical
                                                                 Tables,” Dover Publications, 1965.
using the Intel SSE2 instruction set to check the accu-
racy and speed. The average and maximum errors were
less than 0.67 ulp and 6 ulps, respectively. The speed
was faster than the FPUs on Intel Core 2 and Core i7
processors.

Contenu connexe

Tendances

DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORMDESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORMsipij
 
9.design of high speed area efficient low power vedic multiplier using revers...
9.design of high speed area efficient low power vedic multiplier using revers...9.design of high speed area efficient low power vedic multiplier using revers...
9.design of high speed area efficient low power vedic multiplier using revers...nareshbk
 
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd Iaetsd
 
32-bit unsigned multiplier by using CSLA & CLAA
32-bit unsigned multiplier by using CSLA &  CLAA32-bit unsigned multiplier by using CSLA &  CLAA
32-bit unsigned multiplier by using CSLA & CLAAGanesh Sambasivarao
 
implementation and design of 32-bit adder
implementation and design of 32-bit adderimplementation and design of 32-bit adder
implementation and design of 32-bit adderveereshwararao
 
Iaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd Iaetsd
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 
Design of High Speed 128 bit Parallel Prefix Adders
Design of High Speed 128 bit Parallel Prefix AddersDesign of High Speed 128 bit Parallel Prefix Adders
Design of High Speed 128 bit Parallel Prefix AddersIJERA Editor
 
Design and Implementation of Variable Radius Sphere Decoding Algorithm
Design and Implementation of Variable Radius Sphere Decoding AlgorithmDesign and Implementation of Variable Radius Sphere Decoding Algorithm
Design and Implementation of Variable Radius Sphere Decoding Algorithmcsandit
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filteridescitation
 
Iaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fftIaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fftIaetsd Iaetsd
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsIOSR Journals
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check IJECEIAES
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPERRaj kumar
 

Tendances (20)

Computational Assignment Help
Computational Assignment HelpComputational Assignment Help
Computational Assignment Help
 
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORMDESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM
 
9.design of high speed area efficient low power vedic multiplier using revers...
9.design of high speed area efficient low power vedic multiplier using revers...9.design of high speed area efficient low power vedic multiplier using revers...
9.design of high speed area efficient low power vedic multiplier using revers...
 
Matopt
MatoptMatopt
Matopt
 
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization andIaetsd fpga implementation of cordic algorithm for pipelined fft realization and
Iaetsd fpga implementation of cordic algorithm for pipelined fft realization and
 
32-bit unsigned multiplier by using CSLA & CLAA
32-bit unsigned multiplier by using CSLA &  CLAA32-bit unsigned multiplier by using CSLA &  CLAA
32-bit unsigned multiplier by using CSLA & CLAA
 
implementation and design of 32-bit adder
implementation and design of 32-bit adderimplementation and design of 32-bit adder
implementation and design of 32-bit adder
 
Iaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applications
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 
Design of High Speed 128 bit Parallel Prefix Adders
Design of High Speed 128 bit Parallel Prefix AddersDesign of High Speed 128 bit Parallel Prefix Adders
Design of High Speed 128 bit Parallel Prefix Adders
 
Design and Implementation of Variable Radius Sphere Decoding Algorithm
Design and Implementation of Variable Radius Sphere Decoding AlgorithmDesign and Implementation of Variable Radius Sphere Decoding Algorithm
Design and Implementation of Variable Radius Sphere Decoding Algorithm
 
Addressing modes
Addressing modesAddressing modes
Addressing modes
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
 
Iaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fftIaetsd finger print recognition by cordic algorithm and pipelined fft
Iaetsd finger print recognition by cordic algorithm and pipelined fft
 
Programming with matlab session 3 notes
Programming with matlab session 3 notesProgramming with matlab session 3 notes
Programming with matlab session 3 notes
 
Pointer
PointerPointer
Pointer
 
High Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing ApplicationsHigh Speed Signed multiplier for Digital Signal Processing Applications
High Speed Signed multiplier for Digital Signal Processing Applications
 
Gf3511031106
Gf3511031106Gf3511031106
Gf3511031106
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPER
 

Similaire à (Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET Journal
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...IRJET Journal
 
Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaVLSICS Design
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORMDUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORMVLSICS Design
 
Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...VLSICS Design
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksMumtaz Hannah Vauhkonen
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS An efficient-parallel-approach-fo...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS  An efficient-parallel-approach-fo...IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS  An efficient-parallel-approach-fo...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS An efficient-parallel-approach-fo...IEEEBEBTECHSTUDENTPROJECTS
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
The use of reversible logic gates in the design of residue number systems
The use of reversible logic gates in the design of residue number systems The use of reversible logic gates in the design of residue number systems
The use of reversible logic gates in the design of residue number systems IJECEIAES
 

Similaire à (Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation (20)

Bs25412419
Bs25412419Bs25412419
Bs25412419
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
High Speed and Area Efficient Matrix Multiplication using Radix-4 Booth Multi...
 
Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpga
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORMDUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
DUAL FIELD DUAL CORE SECURE CRYPTOPROCESSOR ON FPGA PLATFORM
 
Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
 
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS An efficient-parallel-approach-fo...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS  An efficient-parallel-approach-fo...IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS  An efficient-parallel-approach-fo...
IEEE 2014 MATLAB IMAGE PROCESSING PROJECTS An efficient-parallel-approach-fo...
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
F017423643
F017423643F017423643
F017423643
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
The use of reversible logic gates in the design of residue number systems
The use of reversible logic gates in the design of residue number systems The use of reversible logic gates in the design of residue number systems
The use of reversible logic gates in the design of residue number systems
 

Plus de Naoki Shibata

Circular barcode design resistant to linear motion blur (preliminary slides)
Circular barcode design resistant to linear motion blur (preliminary slides)Circular barcode design resistant to linear motion blur (preliminary slides)
Circular barcode design resistant to linear motion blur (preliminary slides)Naoki Shibata
 
(Paper) An Endorsement Based Mobile Payment System for a Disaster Area
(Paper) An Endorsement Based Mobile Payment System for a Disaster Area(Paper) An Endorsement Based Mobile Payment System for a Disaster Area
(Paper) An Endorsement Based Mobile Payment System for a Disaster AreaNaoki Shibata
 
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...Naoki Shibata
 
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...Naoki Shibata
 
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...Naoki Shibata
 
An Endorsement Based Mobile Payment System for A Disaster Area
An Endorsement Based Mobile Payment System for A Disaster AreaAn Endorsement Based Mobile Payment System for A Disaster Area
An Endorsement Based Mobile Payment System for A Disaster AreaNaoki Shibata
 
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...Naoki Shibata
 
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...Naoki Shibata
 
GPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
GPGPU-Assisted Subpixel Tracking Method for Fiducial MarkersGPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
GPGPU-Assisted Subpixel Tracking Method for Fiducial MarkersNaoki Shibata
 
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...Naoki Shibata
 
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...Naoki Shibata
 
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...Naoki Shibata
 
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...Naoki Shibata
 
(Paper) Self adaptive island GA
(Paper) Self adaptive island GA(Paper) Self adaptive island GA
(Paper) Self adaptive island GANaoki Shibata
 
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNsNaoki Shibata
 
(Paper) Task scheduling algorithm for multicore processor system for minimiz...
 (Paper) Task scheduling algorithm for multicore processor system for minimiz... (Paper) Task scheduling algorithm for multicore processor system for minimiz...
(Paper) Task scheduling algorithm for multicore processor system for minimiz...Naoki Shibata
 
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...Naoki Shibata
 
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...Naoki Shibata
 
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...(Slides) A Personal Navigation System with a Schedule Planning Facility Based...
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...Naoki Shibata
 
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...Naoki Shibata
 

Plus de Naoki Shibata (20)

Circular barcode design resistant to linear motion blur (preliminary slides)
Circular barcode design resistant to linear motion blur (preliminary slides)Circular barcode design resistant to linear motion blur (preliminary slides)
Circular barcode design resistant to linear motion blur (preliminary slides)
 
(Paper) An Endorsement Based Mobile Payment System for a Disaster Area
(Paper) An Endorsement Based Mobile Payment System for a Disaster Area(Paper) An Endorsement Based Mobile Payment System for a Disaster Area
(Paper) An Endorsement Based Mobile Payment System for a Disaster Area
 
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
BalloonNet: A Deploying Method for a Three-Dimensional Wireless Network Surro...
 
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...
Congestion Alleviation Scheduling Technique for Car Drivers Based on Predicti...
 
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...
(Paper) MTcast: Robust and Efficient P2P-based Video Delivery for Heterogeneo...
 
An Endorsement Based Mobile Payment System for A Disaster Area
An Endorsement Based Mobile Payment System for A Disaster AreaAn Endorsement Based Mobile Payment System for A Disaster Area
An Endorsement Based Mobile Payment System for A Disaster Area
 
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...
GreenSwirl: Combining Traffic Signal Control and Route Guidance for Reducing ...
 
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...
Task Scheduling Algorithm for Multicore Processor Systems with Turbo Boost an...
 
GPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
GPGPU-Assisted Subpixel Tracking Method for Fiducial MarkersGPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
GPGPU-Assisted Subpixel Tracking Method for Fiducial Markers
 
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...
(Paper) BalloonNet: A Deploying Method for a Three-Dimensional Wireless Netwo...
 
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...
(Paper) Emergency Medical Support System for Visualizing Locations and Vital ...
 
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...
(Paper) A Method for Overlay Network Latency Estimation from Previous Observa...
 
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...
(Paper) Parking Navigation for Alleviating Congestion in Multilevel Parking F...
 
(Paper) Self adaptive island GA
(Paper) Self adaptive island GA(Paper) Self adaptive island GA
(Paper) Self adaptive island GA
 
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs
(Slides) A Decentralized Method for Maximizing k-coverage Lifetime in WSNs
 
(Paper) Task scheduling algorithm for multicore processor system for minimiz...
 (Paper) Task scheduling algorithm for multicore processor system for minimiz... (Paper) Task scheduling algorithm for multicore processor system for minimiz...
(Paper) Task scheduling algorithm for multicore processor system for minimiz...
 
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...(Slides) Task scheduling algorithm for multicore processor system for minimiz...
(Slides) Task scheduling algorithm for multicore processor system for minimiz...
 
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...
(Slides) A Technique for Information Sharing using Inter-Vehicle Communicatio...
 
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...(Slides) A Personal Navigation System with a Schedule Planning Facility Based...
(Slides) A Personal Navigation System with a Schedule Planning Facility Based...
 
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...
(Slides) A Method for Distributed Computaion of Semi-Optimal Multicast Tree i...
 

Dernier

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Dernier (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation

  • 1. Noname manuscript No. (will be inserted by the editor) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation Naoki Shibata Received: date / Accepted: date Abstract Data-parallel architectures like SIMD (Sin- SIMT (Single Instruction Multiple Thread), rather than gle Instruction Multiple Data) or SIMT (Single Instruc- extracting more instruction-level parallelism. Intel’s x86 tion Multiple Thread) have been adopted in many re- architecture employs 128-bit-wide SIMD instructions, cent CPU and GPU architectures. Although some SIMD but it will be extended to 256 bit in the Sandy Bridge and SIMT instruction sets include double-precision arith- processors and 512 bit in the Larrabee processors[1]. metic and bitwise operations, there are no instructions Many supercomputers utilize AMD Opteron Proces- dedicated to evaluating elementary functions like trigono- sors as computing cores. IBM Roadrunner[2] also uti- metric functions in double precision. Thus, these func- lizes Cell Broadband Engines (Cell B.E.)[3], and both of tions have to be evaluated one by one using an FPU these architectures have powerful SIMD engines. nVidia’s or using a software library. However, traditional algo- CUDA-enabled GPGPUs use SIMT architecture, and rithms for evaluating these elementary functions in- can achieve 600 double-precision GFLOPS with a Tesla volve heavy use of conditional branches and/or table C2050 processor[4]. look-ups, which are not suitable for SIMD computa- Utilizing SIMD or SIMT processing is becoming more tion. In this paper, efficient methods are proposed for and more important for many kinds of applications, and evaluating the sine, cosine, arc tangent, exponential and double-precision calculation capabilities are especially logarithmic functions in double precision without table important for many scientific applications. However, look-ups, scattering from, or gathering into SIMD reg- those instruction sets only include basic arithmetic and isters, or conditional branches. We implemented these bitwise operations. Instructions for evaluating elemen- methods using the Intel SSE2 instruction set to evaluate tary functions like trigonometric functions in double- their accuracy and speed. The results showed that the precision are not provided, and we have to evaluate average error was less than 0.67 ulp, and the maximum these functions with a conventional FPU or a software error was 6 ulps. The computation speed was faster library. An FPU calculation is fast, but FPU instruc- than the FPUs on Intel Core 2 and Core i7 processors. tions for evaluating the elementary functions are only Keywords SIMD, elementary functions available in a few architectures. We need to take special care for efficient computa- tion utilizing SIMD instructions. Gathering and scat- 1 Introduction tering data from and into SIMD registers are expensive operations. Looking up different table entries according Recently, computer architects have been trying to achieve to the value of each element in a SIMD register can be high performance by adopting data-parallel architec- especially expensive, since each table look-up may cause tures like SIMD (Single Instruction Multiple Data) or a cache miss. Also, the SSE instructions[5] in the x86 architecture have restrictions on data alignment, and Naoki Shibata Department of Information Processing & Management, an access to unaligned data is always expensive. Ran- Shiga University dom conditional branches are expensive for processors 1-1-1 Bamba, Hikone 522-8522, Japan with a long instruction pipeline. On the other hand, E-mail: shibata@biwako.shiga-u.ac.jp many SIMD architectures provide instructions for di-
  • 2. 2 vision and taking square root, and these calculations ditional irregular algorithms are sometimes difficult to are now inexpensive. Thus, sometimes traditional algo- parallelize on massively multi-threaded hardware, and rithms are not suitable, and we need new algorithms many researchers are now working on new algorithms for them. for such hardware. In this paper, we propose new methods for evaluat- ing the sine, cosine, arc tangent, exponential and loga- rithmic functions in double precision which are suitable 3 Proposed Method for processors with a long instruction pipeline and an SIMD instruction set. Evaluations of these functions are In this section, the details of the proposed methods are carried out by a series of double-precision arithmetic explained. The working C codes are provided which can and bitwise operations without table look-ups, scatter- be easily converted into a series of SIMD instructions. ing/gathering operations, or conditional branches. The Having converted the codes, a target function can be total code size is very small, and thus they are also suit- evaluated with multiple inputs in a SIMD register si- able for Cell B.E. which has only 256K bytes of directly multaneously. accessible scratch pad memory in each SPE. The pro- posed methods were implemented using the Intel SSE2 instruction set for evaluating accuracy and speed. The 3.1 Trigonometric Functions results showed that the proposed methods were faster The proposed method evaluates the functions of sine than FPU calculations on Intel Core 2 and Core i7 pro- and cosine on the argument of d at the same time. The cessors, and the average and maximum errors were less proposed method consists of two steps. First, the ar- than 0.67 ulp (unit in the last place) and 6 ulps, respec- gument d is reduced to within 0 and π/4 utilizing the tively. The total code size was less than 1400 bytes. symmetry and periodicity of the sine and cosine func- The remainder of this paper is organized as fol- tions, shown as (1) and (2). Second, an evaluation of lows. Section 2 introduces some related works. Section the sine and cosine function on the reduced argument s 3 presents the proposed methods with working C codes. is performed assuming that the argument is within the Section 4 shows the results of the evaluation of accu- range. racy, speed and code size. Section 5 presents our con- clusions. π π cos x = sin( − x) = sin(x + ) (1) 2 2 2 Related Works π π sin x = cos( − x) = − cos(x + ) (2) 2 2 Some software libraries have capabilities for evaluating For the first step, we first find s and an integer q in elementary functions. A few libraries with capabilities (3) so that 0 ≤ s < π/4. for evaluating elementary functions using SIMD com- putation in single precision are available[6,7,8]. How- π ever, adding capabilities of double-precision evaluations d = s+ q (3) to these libraries is not a straightforward task. Software 4 modules like the x87 FPU emulator code in Linux oper- q can be found by dividing d by π/4 and applying the ating system kernel[9] or the libm in GNU C Library[10] floor function. Then s can be found by multiplying q by have capabilities for evaluating elementary functions in π/4 and subtracting it from d. We need to take special double precision. These modules do not utilize SIMD care here, because a cancellation error can make the computation, and they involve heavy use of conditional derived value of s inaccurate when d is a large num- branches. Some multiple-precision floating-point com- ber close to a multiple of π/4. Some implementations putation libraries have capabilities for evaluating ele- like FPUs on x86 architectures exhibit this problem. In mentary functions in multiple-precision[11, 12]. The pro- this paper, a way for efficiently preventing this prob- gram structures of these libraries are very different from lem is proposed. The problem is in the last multipli- the proposed method. cation and subtraction, and the proposed method pre- There are many researches on evaluating elementary vents the problem by calculating this part with extra functions on hardware[13,14,15, 16]. The restrictions of precision utilizing properties of the IEEE 754 standard calculation are very different from the SIMD computa- for floating-point arithmetic[19]. Here, q is assumed to tions, and many of these methods use table look-ups. be expressed in 26 bits, which is a half of the length of There are many researches on accelerating applica- the mantissa in an IEEE 754 double-precision FP num- tions using SIMD or SIMT computation[17,18]. Tra- ber. The value of π/4 is split into three parts so that
  • 3. 3 Fig. 1 Splitting π/4 into three parts all of the bits in the lower half of the mantissa are zero, as shown in Fig. 1. Then, multiplying these numbers cos 2x = 2 cos2 x − 1 (6) with q does not produce an error at all, and subtract- √ ing these numbers in descending order from d does not sin x = 1 − cos2 x (7) produce a cancellation error either. Now, we have another problem. If x is close to 0, find- For the second step, we evaluate the sine and cosine ing the value of the sine function using (7) produces a functions on the argument s. In many existing imple- cancellation error, since cos2 x is close to 1. In this pa- mentations, these evaluations are performed by simply per, instead of cos s, the value of 1 − cos s is found. Let summing up the terms of the corresponding Taylor se- f (x) be 1 − cos x, and we get the following equations. ries, shown as (4). f (2x) = 4f (x) − 2f (x)2 (8) ∞ ∑ (−1)n x3 x5 √ sin x = x2n+1 = x − + − ··· (4) sin x = 2f (x) − f (x)2 (9) n=0 (2n + 1)! 3! 5! Since s is between 0 and π/4, the application of (8) or However, if these terms are summed up using double- (9) does not produce cancellation or rounding errors. By precision calculations, we cannot get an accurate result evaluating the first five terms of the Taylor series and because of the accumulation of errors. In order to im- applying (8) three times, we can evaluate the functions prove the accuracy while speeding up the evaluation, of sine and cosine on the argument of s with sufficient many implementations use another way of argument accuracy. In order to improve the computational per- reduction [11]. The argument s is divided by a certain formance, we use the following technique. Let g(x) be number, the terms of the Taylor series are evaluated 2f (x) and we get the following equation: for this value, and then the triple-angle formula (5) is used to obtain the final value. For example, in order g(2x) = (4 − g(x)) · g(x) (10) to evaluate sin 0.27, we first divide 0.27 by 9, evaluate the terms of the Taylor series for sin 0.03, and then ap- By using (10) instead of (8), we can omit one multi- ply the triple-angle formula two times to find the value plication each time we apply the double-angle formula. of sin 0.27. Since the argument for the Taylor series is This also improves the accuracy by a little. To incorpo- reduced, its convergence is accelerated, and we do not rate the result from the first step, we use the symmetry need to evaluate many terms to get an accurate value. and periodicity of the trigonometric functions. This is In this way, we can reduce the accumulation of errors. performed by exchanging the derived values of sin s and cos s, and multiplying those values by −1 according to the remainder q divided by 4. In order to evaluate the sin 3x = 3 sin x − 4 sin3 x (5) tangent function, we can simply use (11). A working C code for the proposed method is shown However, this way does not yield sufficient accuracy in Fig. 21 . with double-precision calculations, because the differ- ence between 3 sin x and 4 sin3 x becomes large if sin x sin x is small, and applying the triple-angle formula produces tan x = (11) cos x a rounding error. To prevent this error, we can use the 1 The code contains conditional branches for ease of reading. double-angle formula (6) of cosine. In this case, the All of these conditional branches can be eliminated by unrolling value of the sine function can be evaluated from the loops or replacing them by combinations of comparisons and bit- value of the cosine function using (7). wise operations.
  • 4. 4 #i n c l u d e <math . h> typedef struct sincos t { double s , c; } sincos t ; #d e f i n e N 3 sincos t x s i n c o s 0 ( double s) { / / 0 ≤ s < π/4 int i ; / / Argument reduction s = s ∗ pow ( 2 , −N ) ; s = s ∗ s ; / / Evaluating Taylor series s = ( ( ( ( s /1814400 − 1.0/20160)∗ s + 1 . 0 / 3 6 0 )∗ s − 1 . 0 / 1 2 ) ∗ s + 1)∗ s ; f o r ( i =0; i < ; i ++) { s = (4− s ) ∗ s ; N } / / Applying double angle formula s = s / 2; s i n c o s t sc ; s c . s = s q r t ((2 − s ) ∗ s ) ; sc . c = 1 − s ; return s c ; } #d e f i n e PI4 A . 7 8 5 3 9 8 1 5 5 4 5 0 8 2 0 9 2 2 8 5 1 5 6 2 5 / / π/4 split into three parts #d e f i n e PI4 B . 7 9 4 6 6 2 7 3 5 6 1 4 7 9 2 8 3 6 7 1 3 6 0 4 6 2 9 0 3 9 7 6 4 4 0 4 2 9 6 8 7 5 e−8 #d e f i n e PI4 C . 3 0 6 1 6 1 6 9 9 7 8 6 8 3 8 2 9 4 3 0 6 5 1 6 4 8 3 0 6 8 7 5 0 2 6 4 5 5 2 4 3 7 3 6 1 4 8 0 7 6 9 e −16 / / 4/π #d e f i n e M 4 PI 1.273239544735162542821171882678754627704620361328125 s i n c o s t x s i n c o s ( double d ) { double s = f a b s ( d ) ; i n t q = ( i n t ) ( s ∗ M 4 PI ) , r = q + ( q & 1 ) ; s −= r ∗ PI4 A ; s −= r ∗ PI4 B ; s −= r ∗ PI4 C ; sincos t sc = xsincos0 ( s ) ; if ( ( ( q + 1 ) & 2 ) != 0 ) { s = s c . c ; s c . c = s c . s ; sc . s = s ; } if ( ( ( q & 4 ) != 0 ) != ( d < 0 ) ) s c . s = −s c . s ; if ( ( ( q + 2 ) & 4 ) != 0 ) s c . c = −s c . c ; return sc ; } double x s i n ( double d ) { sincos t sc = xsincos (d ) ; return sc . s ; } double x c o s ( double d ) { sincos t sc = xsincos (d ) ; return sc . c ; } double x t a n ( double d ) { sincos t sc = xsincos (d ) ; return sc . s / sc . c ; } Fig. 2 Working C code for evaluating trigonometric functions 3.2 Inverse Trigonometric Functions Thus, if d ≤ 1, the argument can be reduced by evalu- ating the arc tangent function on the argument of 1/d We cannot get an accurate result by simply summing instead of d, and subtracting the result from π/2. Thus, up many terms of the Taylor series (12) using double- we can now assume d > 1 and π/4 < θ < π/2. In or- precision calculations because of an accumulation of er- der to reduce the argument d = tan θ, we can use (14) rors. As far as we surveyed, no good way is known for to enlarge cot θ = 1/ tan θ. Please note that cot x/2 is reducing the argument for evaluating inverse trigono- larger than cot x if 0 < x < π/2. For example, assume metric functions. In this paper, a new way of argument that we are evaluating arctan 2, and φ is a number such reduction for evaluating the arc tangent function utiliz- that 2 = tan φ. We first apply (13) to get√ φ = 1/2. cot ing the half-angle formula (14) of the cotangent function Next, we apply (14) to get cot φ/2 = 1/2+ √1 + (1/2)2 , is proposed. and apply (13) √ get tan φ/2 = 1/(1/2 + 5/4). Note to that 1/(1/2 + 5/4) is less than 2, thus we reduced ∞ ∑ (−1)n the argument. Then, we evaluate√ several terms of (12) arctan x = x2n+1 (12) for the arc tangent of 1/(1/2 + 5/4), which is φ/2. n=0 2n + 1 (π ) Finally, we multiply 2 to the result to get φ = arctan 2. 1 cot x = = tan −x (13) By applying (14) two times and summing the first tan x √ 2 x eleven terms of Taylor series, we can evaluate the arc cot = cot x ± 1 + cot2 x (14) tangent function with sufficient accuracy. 2 Let d be the argument and θ = arctan d. Thus, d = The arc sine and the arc cosine functions can be tan θ holds. For simplicity, we now assume d ≥ 0 and evaluated using the following equation. 0 ≤ θ < π/2. From (13), the following equation holds. (π ) 1 x tan −θ = (15) arcsin x = arctan √ (16) 2 d 1 − x2
  • 5. 5 √ 1 − x2 from d, where we have a cancellation error when d is arccos x = arctan (17) x close to a multiple of loge 2. In order to prevent this, However, these equations produce cancellation errors if the value of loge 2 is split into two parts in a similar |x| is close to 1. These errors can be avoided by modify- way as in the case of the trigonometric functions, and ing the above equations as follows. Note that no cancel- each part is multiplied by q and subtracted from d in lation error is produced by subtracting x from 1 in these turn. equations. A working C code for the proposed method For the second step, we can use (21) for reducing the is shown in Fig. 3. argument, and then evaluate the terms of the Taylor series (22). x arcsin x = arctan √ (18) (1 + x)(1 − x) exp 2x = (exp x)2 (21) √ (1 + x)(1 − x) ∞ ∑ xn x2 x3 arccos x = arctan (19) exp x = =1+x+ + + ··· (22) x n! 2! 3! n=0 However, we have a problem if we simply use these two equations. If s is close to 0, the difference between the #i n c l u d e <math . h> first term, which is 1, and the other terms in (22) is too #d e f i n e M PI 3 . 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 #d e f i n e N 2 large and produces rounding errors. In order to avoid double x a t a n ( double d ) { this problem, the proposed method finds the value of double x , y , z = f a b s ( d ) ; i f ( z < 1.0) x = 1.0/ z ; else x = z; exp s − 1 instead of exp s. Let h(x) be exp x − 1, and i n t i ; / / Applying cotangent half angle formula we derive the following equation from (21) and (22): f o r ( i = 0 ; i < N ; i ++) { x = x + s q r t (1+ x∗x ) ; } x = 1.0 / x; y = 0 ; / / Evaluating Taylor series f o r ( i =10; i >=0; i −−) { h(2x) = (2 + h(x)) · h(x) (23) y = y∗x∗x + pow( −1 , i ) / ( 2 ∗ i + 1 ) ; ∞ ∑ xn } y ∗= x ∗ pow ( 2 , N ) ; h(x) = (24) i f ( z > 1 . 0 ) y = M PI /2 − y ; n=1 n! return d < 0 ? −y : y ; } double x a s i n ( double d ) { By summing the first eight terms of (24), applying return x a t a n ( d / s q r t ((1+ d )∗(1 − d ) ) ) ; } (23) four times, and then adding 1 to the result, we can double x a c o s ( double d ) { evaluate the exponential function on the argument of return x a t a n ( s q r t ((1+ d )∗(1 − d ) ) / d ) + ( d < 0 ? M PI : 0 ) ; s with sufficient accuracy. By incorporating the result } from the first step, we get the final result by multiplying the result by 2q as shown in (25). Calculation of 2q from Fig. 3 Working C code for evaluating inverse trigonometric func- q can be performed by substituting 1023 + q for the tions exponent of 1, utilizing the properties of the IEEE 754 standard. A working C code for the proposed method is shown in Fig. 4. 3.3 Exponential Function The proposed method for evaluating the exponential exp(s + q · loge 2) = (exp s) · exp(q · loge 2) function consists of two steps. As the first step, the = (exp s) · 2q (25) argument d is reduced to within 0 and loge 2. Then, as the second step, the exponential function is evaluated on the reduced argument s. 3.4 Logarithmic Function For the first step, in order to evaluate exp d, we first find s and an integer q in (20) so that 0 ≤ s < loge 2. In order to evaluate the logarithmic function, we can use the series shown as (26) which converges faster than d = s + q · loge 2 (20) the Taylor series[20]. Here, we have the same problem as we had in the case of ∞ ( )2n+1 the trigonometric functions. In order to find the value ∑ 1 x−1 log2 x = 2 (26) of s, we have to multiply q by loge 2 and subtract it n=0 2n + 1 x + 1
  • 6. 6 #i n c l u d e <math . h> #d e f i n e M LN2 0 . 6 9 3 1 4 7 1 8 0 5 5 9 9 4 5 3 0 9 4 1 7 2 3 2 1 2 1 4 5 8 1 7 6 6 / / loge 2 / / loge 2 split into two parts #d e f i n e L2U . 6 9 3 1 4 7 1 8 0 5 5 9 6 6 2 9 5 6 5 1 1 6 0 1 8 0 5 6 8 6 9 5 0 6 8 3 5 9 3 7 5 #d e f i n e L2L . 2 8 2 3 5 2 9 0 5 6 3 0 3 1 5 7 7 1 2 2 5 8 8 4 4 8 1 7 5 0 1 3 4 3 6 0 2 5 5 2 5 4 1 2 0 6 8 e −12 #d e f i n e N 4 double xexp ( double d ) { i n t q = ( i n t ) ( d / M LN2 ) , i ; double s = d − q ∗ L2U − q ∗ L2L , t; s = s ∗ pow ( 2 , −N ) ; / / Argument reduction / / Evaluating Taylor series t = ( ( ( s /40320 + 1 . 0 / 5 0 4 0 )∗ s + 1 . 0 / 7 2 0 ) ∗ s + 1 . 0 / 1 2 0 ) ∗ s ; t = ( ( ( ( t + 1 . 0 / 2 4 )∗ s + 1 . 0 / 6 ) ∗ s + 1 . 0 / 2 ) ∗ s + 1)∗ s ; f o r ( i =0; i < ; i ++) { t = (2+ t ) ∗ t ; N } return ( t + 1 ) ∗ pow ( 2 , q); } Fig. 4 Working C code for evaluating exponential function The proposed method consists of two steps. The first step is reducing argument by splitting the original argu- #i n c l u d e <math . h> ment into its mantissa and exponent. The second step is double x l o g ( double d ) { int e , i ; to evaluate (26) on the mantissa part as the argument. double m = f r e x p ( d , &e ) ; i f (m < 0 . 7 0 7 1 ) { m ∗= 2 ; e −−; } As the first step, we split the original argument d double x = (m−1) / (m+ 1 ) , y = 0 ; into its mantissa m and an integer exponent e where 0.5 ≤ m < 1, as shown in (27), utilizing the property of f o r ( i =19; i >=1; i −=2) { y = y∗x∗x + 2 . 0 / i ; } return e ∗ l o g ( 2 ) + x∗y ; the IEEE 754 standard. Since (26) converges fast if its } argument is close to 1, we multiply 2 to m and subtract √ 1 from e √ m is smaller than 2/2. Then m is in the if √ Fig. 5 Working C code for evaluating logarithmic function range of 2/2 ≤ m < 2. x = m · 2e (27) We used gcc 4.3.3 with -msse2 -mfpmath=sse -O op- As the second step, we simply evaluate the first 10 tions to compile our programs. terms of the series of (26) to evaluate log m. We can get the value of log d by adding e log 2 to the value of log m to get the final result, as shown in (28). 4.1 Accuracy A C code for this method is shown in Fig. 5 The accuracy of the proposed methods was tested by comparing the results produced by the proposed meth- log(m · 2e ) = log m + log 2e ods to the accurate results produced by 256 bit pre- = log m + e log 2 (28) cision calculations using the MPFR Library[12]. We checked the accuracy by evaluating the functions on the argument of each value in a range at regular inter- 4 Evaluation vals, and we measured the average and maximum error in ulp. We measured the evaluation accuracy within a The proposed methods were tested to evaluate accu- few ranges for each function. The ranges and the mea- racy, speed, and code size. The tests were performed on surement results are shown in Tables 1, 2, 3, 4 and 5. a PC with Core i7 920 2.66GHz. The system is equipped Please note that because of the accuracy requirements with 3 GB of RAM and runs Linux (Ubuntu Jaunty of IEEE 754 standard, these results do not change be- with 2.6.28 kernel, IA32). The proposed methods were tween platforms comform to the standard. implemented in C language with Intel SSE2 intrinsics2 . The evaluation error did not exceed 6 ulps in every 2 An intrinsic is a function known by the compiler that directly test. The average error of 0.28 ulp to 0.67 ulp can be maps to a sequence of one or more assembly language instruc- considered to be good, since even if accurate numbers tions. are correctly rounded, we have 0.25 ulp of average error.
  • 7. 7 Table 1 Average and maximum error for trigonometric functions (ulp) Range interval sin avg sin max cos avg cos max tan avg tan max 0 ≤ x ≤ π/4 1e-7 0.402 2.97 0.278 1.78 0.580 4.74 −10 ≤ x ≤ 10 1e-6 0.378 3.80 0.370 3.94 0.636 5.14 0 ≤ x ≤ 20000 1e-3 0.373 3.53 0.373 3.52 0.636 5.61 10000 ≤ x ≤ 10020 1e-6 0.374 3.79 0.373 3.35 0.639 5.86 Table 2 Average and maximum error for arc tangent functions (ulp) Range interval arctan avg arctan max −1 ≤ x ≤ 1 1e-7 0.628 4.91 0 ≤ x ≤ 20 1e-6 0.361 4.68 Table 3 Average and maximum error for arc sine and arc cosine functions (ulp) Range interval arcsin avg arcsin max arccos avg arccos max −1 ≤ x ≤ 1 1e-7 0.665 5.98 0.434 5.15 Table 4 Average and maximum error for exponential function (ulp) Range interval exp avg exp max 0≤x≤1 1e-7 0.295 1.97 −10 ≤ x ≤ 10 1e-6 0.299 1.93 Table 5 Average and maximum error for logarithmic function (ulp) Range interval log avg log max 0<x≤1 1e-7 0.430 2.54 0 < x ≤ 1000 1e-4 0.286 2.65 Table 6 Time taken for pair of evaluations (sec.) Core i7 2.66GHz (IA32) Core 2 2.53GHz (x86 64) Proposed FPU MPFR Proposed FPU MPFR sin+cos 3.29e-8 7.92e-8 3.53e-5 3.51e-8 8.87e-8 2.80e-5 sin 3.29e-8 7.04e-8 1.33e-5 3.51e-8 8.83e-8 1.08e-5 cos 3.29e-8 7.04e-8 2.14e-5 3.51e-8 8.92e-8 1.68e-5 tan 3.68e-8 1.82e-7 1.68e-5 4.12e-8 1.65e-7 1.33e-5 arctan 4.72e-8 1.82e-7 2.11e-5 4.98e-8 1.61e-7 1.33e-5 arcsin 6.64e-8 N/A 3.81e-5 7.35e-8 N/A 2.69e-5 arccos 6.73e-8 N/A 3.92e-5 7.68e-8 N/A 2.73e-5 exp 2.45e-8 N/A 1.59e-5 2.85e-8 N/A 1.33e-5 log 2.14e-8 4.59e-8 1.19e-5 2.69e-8 4.88e-8 8.14e-6 Table 7 Number of instructions and size of subroutines sin+cos+tan arctan+arcsin+arccos exp log Number of instructions 94 94 52 49 Subroutine size (byte) 426 416 253 254
  • 8. 8 4.2 Speed References In this test, we compared the speed of the proposed 1. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Es- methods for evaluating the functions to those of an FPU pasa, E. Grochowski, T. Juan, and P. Hanrahan : “ Larrabee: and the MPFR Library[12]. Besides the Core i7 system A many-core x86 architecture for visual computing,” Proc. mentioned above, we conducted the speed test on a PC of ACM SIGGRAPH 2008, pp. 1–15, 2008. with Core 2 Duo E7200 2.53GHz with 2GB of RAM 2. K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin and J. Sancho : “Entering the petaflop era: the archi- running Linux x86 64 (Debian Squeeze) with a 2.6.26 tecture and performance of Roadrunner,” Proc. of the 2008 kernel. There was no register spill in our methods. In ACM/IEEE conference on Supercomputing, pp. 1–11, 2008. these tests, we used the gettimeofday function to mea- 3. M. Gschwind, H. Hofstee, B. Flachs, M. Hopkins, Y. Watan- sure the time for each subroutine to execute 100 million abe and T. Yamazaki : “Synergistic Processing in Cell’s Mul- ticore Architecture,” IEEE Micro, Vol. 26, No. 2, pp. 10–24, times in a single thread. In order to measure the time 2006. for an FPU to evaluate each function, we wrote subrou- 4. “Tesla C2050 and Tesla C2070 Computing Processor Board,” tines with an inline assembly code for executing the cor- http://www.nvidia.com/docs/IO/43395/BD-04983-001_ responding FPU instruction. These FPU subroutines v01.pdf 5. S. Thakkar and T. Huff : “Internet Streaming SIMD Exten- execute two corresponding FPU instructions per call, sions,” Computer, Vol. 32, No. 12, pp. 26–34, 1999. since the subroutines for the proposed methods evalu- 6. Approximate Math Library 2.0, http://www.intel.com/ ate the target function on the argument of each element design/pentiumiii/devtools/AMaths.zip. in a SIMD register, which contains two double precision 7. Simple SSE and SSE2 optimized sin, cos, log and exp, http: //gruntthepeon.free.fr/ssemath/. numbers. For sin+cos measurement, we measured the 8. L. Nyland and M. Snyder: “Fast Trigonometric Functions time for an FPU to execute the sincos instruction to Using Intel’s SSE2 Instructions,” Tech. Report. evaluate the sine and cosine functions simultaneously. 9. Linux Kernel Version 2.6.30.5, http://www.kernel.org/. For the MPFR Library, we set 53 bit as its precision, 10. GNU C Library Version 2.7, http://www.gnu.org/software/ libc/. and measured the time for each subroutine to execute 11. R. Brent : “Fast Algorithms for High-Precision Computation 1 million times. The measured times in second for each of Elementary Functions,” Proc. of 7th Conference on Real subroutine to perform one pair of evaluations on the Numbers and Computers (RNC 7), pp. 7-8, 2006. two systems are shown in Table 6. 12. The MPFR Library, http://www.mpfr.org/. 13. J. Detrey, F. Dinechin and X. Pujul : “Return of the hard- The execution speed of the subroutines of the pro- ware floating-point elementary function,” Proceedings of the posed methods was more than twice as fast as the equiv- 18th IEEE Symposium on Computer Arithmetic, pp. 161– alent FPU instructions. 168, 2007. 14. I. Koren and O. Zinaty : “Evaluating Elementary Functions in a Numerical Coprocessor Based on Rational Approxima- tions,” IEEE Transactions on Computers, Vol. 39, No. 8, 4.3 Code Size pp.1030-1037, 1990. 15. M. Ercegovac, T. Lang, J. Muller and A. Tisserand : “Recip- We measured the number of instructions and code size rocation, Square Root, Inverse Square Root, and Some El- for each subroutines of our methods by disassembling ementary Functions Using Small Multipliers,” IEEE Trans- actions on Computers, Vol. 49, No. 7, pp. 628–637, 2000. the executable files for an IA32 system using gdb and 16. E. Goto, W.f. Wong, “Fast Evaluation of the Elementary counting these numbers in the subroutines. Please note Functions in Single Precision,” IEEE Transactions on Com- that each instruction in the subroutines is executed ex- puters, Vol. 44, No. 3, pp. 453-457, 1995. actly once per call, since they do not contain conditional 17. D. Scarpazza and G. Russell : “High-performance regular expression scanning on the Cell/B.E. processor,” Proc. of the branches. The results are shown in Table 7. 23rd international conference on Supercomputing, pp. 14–25, 2009. 18. M. Rehman, K. Kothapalli and P. Narayanan : “Fast and 5 Conclusion scalable list ranking on the GPU,” Proc. of the 23rd inter- national conference on Supercomputing, pp. 235–243, 2009. 19. D. Goldberg : “What every computer scientist should know We proposed efficient methods for evaluating elemen- about floating-point arithmetic,” ACM Computing Surveys, tary functions in double precision without table look- Vol. 23, No. 1, pp. 5–48, 1991. ups, scattering from, or gathering into SIMD registers, 20. M. Abramowitz and I. Stegun : “Handbook of Mathemat- or conditional branches. The methods were implemented ical Functions: with Formulas, Graphs, and Mathematical Tables,” Dover Publications, 1965. using the Intel SSE2 instruction set to check the accu- racy and speed. The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively. The speed was faster than the FPUs on Intel Core 2 and Core i7 processors.