SlideShare une entreprise Scribd logo
1  sur  4
Télécharger pour lire hors ligne
Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228
www.ijera.com 225 | P a g e
Performing DCT8x8 Computation on GPU Using NVIDIA CUDA
Technology
Jagdamb Behari Srivastava, R. B. Singh, Jitendra Jain
Jawaharlal Nehru Krrishi Vishwavidyalaya Jabalpur
Abstract— In this paper, we have proposed sequential and parallel Discrete Cosine Transform (DCT) in
compute unified device architecture (CUDA) libraries. The introduction of programmable pipeline in the
graphics processing units (GPU) has enabled configurability. GPU which is available in every computer has a
tremendous feat of highly parallel SIMD processing, but its capability is often under-utilized. The two-
dimensional variation of the transform that operates on 8x8 blocks (DCT8x8) is widely used in image and video
coding because it exhibits high signal de-correlation rates and can be easily implemented on the majority of
contemporary computing architectures. Performing DCT8x8 computation on GPU using NVIDIA CUDA
technology gives significant performance boost even compared to a modern CPU.
Keywords: - Discrete Cosine Transform, Graphics Processor unit (GPU), Computed unified device architecture
(CUDA).
I. INTRODUCTION
The Discrete Cosine Transform (DCT) is a
Fourier-like transform, which was first proposed by
Ahmed et al. (1974). While the Fourier Transform
represents a signal as the mixture of sines and
cosines, the Cosine Transform performs only the
cosine-series expansion. The purpose of DCT is to
perform de-correlation of the input signal and to
present the output in the frequency domain.
There are several types of DCT [2]. The
most popular is two-dimensional symmetric variation
of the transform that operates on 8x8 blocks
(DCT8x8) and its inverse. The DCT8x8 is utilized in
JPEG compression routines and has become a de-
facto standard in image and video coding algorithms
and other DSP-related areas. Most of the CPU
implementations of DCT8x8 are well-optimized,
which includes the transform reparability utilization
on high-level and fixed point arithmetic, cache-
targeted optimizations on low-level.
GPU acceleration of DCT8x8 computation
has been possible since appearance of shader
languages. Nevertheless, this required a specific setup
to utilize common graphics API such as OpenGL or
Direct3D. CUDA, on the other hand, provides a
natural extension of C language that allows a
transparent implementation of GPU accelerated
algorithms. Also, DCT8x8 greatly benefits from
CUDA-specific features, such as shared memory and
explicit synchronization points.
The rest of the paper is organized as follows.
In section 2, block diagram of graphic processing unit
(GPU) is given. In section 3, explain the 1-D and 2-D
discrete cosine transform (DCT). In section 4,
Computation of discrete cosine transform on GPU
using NVIDIA CUDA technology is introduced.
Section 5 & 6 provides the simulation result and
conclusion.
II. GRAPHIC PROCESSING UNIT (GPU)
The block-diagram of a modern
programmable GPU is shown in Figure 1 [3]. The
architecture of GPU offers a large degree of
parallelism at a relatively low cost though the well-
known vector processing model known as Single
Instruction Multiple Data (SIMD). There are two type
of process used in GPU system i.e. vertex and pixel
processors. The vertex processor performs
mathematical operations that transform a vertex into a
screen position and the pixel processor performs the
texturing operations.
In this sense, GPUs are stream processors. A
stream is a set of records that require similar
computation and provide data parallelism. The
vertices and fragments are the elements in a stream.
For each element one can only read from the input,
perform operations on it, and write to the output.
However, recent improvements in graphics cards
have permitted to have multiple inputs, multiple
Input Data
Vertex
Buffer
Vertex
Processor
Rasterization
Processor
Fragment
Processor
Frame
Buffer
Texture
Buffer
Output Data
Figure 1: The Programmable Graphics Pipeline
RESEARCH ARTICLE OPEN ACCESS
Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228
www.ijera.com 226 | P a g e
outputs, but never a piece of memory that is both
readable and writable. It is important for general-
purpose applications to have high arithmetic
intensity; otherwise the memory access latency will
limit computation speed. Recent developments in
data transfer rate through PCI Express interface to an
extent addressed this issue [1], [4]. Programming the
GPU in a high level language can be done using Cg
from NVidia, OpenGL, etc. [6].
III. DISCRETE COSINE TRANSFORM
Formally, the discrete cosine transform is an
invertible function F: RR
NN
 or equivalently an
invertible square NN  matrix [1]. The formal
definition for the DCT of one-dimensional sequence
of length N is given by the following formula:





1
0
)1]......(
2
)12(
cos[)()()(
N
x N
ux
xfuuC


The inverse transform is defined as:





1
0
)2]......(
2
)12(
cos[)()()(
N
u N
ux
uCuxf


The coefficients at the beginnings of
formulae make the transform matrix orthogonal. For
both equations (1) and (2) the coefficients are given
by the following notation:









 )3......(..........
0
2
0
1
)(
u
N
u
N
u
To perform the DCT of length N effectively
the cosine values are usually pre-computed offline. A
1D DCT of size N will require N vectors of N
elements to store cosine values (matrix A). 1D cosine
transform can be then represented as a sequence of
dot products between the signal sample (vector x) and
cosine coefficient ( A
T
), resulting in transformed
vector ( xA
T
).
A 2D approach performs DCT on input
sample X by subsequently applying DCT to rows and
columns of the input signal, utilizing the separability
property of the transform. In matrix notation this can
be expressed using the following formula:
)4....(..........),( XAAvuC
T

IV. COMPUTATION OF DISCRETE
COSINE TRANSFORMATION (DCT)
With advent of CUDA technology it has
become possible to perform SIMD (single instruction,
multiple data) high-level program parallelization.
Generally, DCT8x8 is a high-level parallelizable
algorithm and thus can be easily computation by the
CUDA.
In applications, such as JPEG and MPEG,
DCT is a resource intensive task. In this section two
different approaches to implementing DCT 8×8 block
using CUDA. In the 1st
approaches, demonstrate
CUDA programming model benefits, a number of
source textures, the associated vertex streams, the
render target, the vertex shader and the pixel shader
are specified. The source textures hold the input data.
The vertex streams consist of vertices that contain the
target position and the associated texture address
information. The render target is a texture that holds
the resulting DCT coefficients. The vertex shader
calculates the target position. The 2nd
approaches,
which is triggered issuing the Draw Primitives call
and creating really fast highly optimized kernels.
Then the target texture is rasterized and the pixel
shader is subsequently executed to perform pixel-
wise calculations.
We used a gray scale image data (image size
64 × 64) to compute the DCT using OpenGL API
(Application Programming Interface).
 Discrete Cosine Transform for 8x8 Blocks
with CUDA in three projects are made
o Project name 1.cpp (C++ code)
o Project name2.cu (CUDA code)
o Project name3.cpp (C++ code)
In the first project name1.cpp include the
CImg.h header file; CImg.h defines the classes and
methods to manage images in your own C++ code.
CImg.h is self-contained and thus highly portable. It
fully works on different operating system and is
compatible with various C++ compilers. In this
project, firstly save the image in .bmp format in
project name1.cpp file. After than write the code
(C++ code) and calculated the pixel value of the
image.
In second project name 2, Copy the image pixel value
in project name1.cpp and paste project name2.cu file.
There are four kernels are used in this project.
In Equation (4) XAAvuC
T
),( used two kernel,
the first kernel used in multiplier the transform of
cosine coefficient and image pixel value 1kXA
T
 .
The second kernel used in multiplier the k1 and
cosine coefficient.
XAAvuC
T
),(
Show the pixel value of the input image in Equation
(5) and cosine coefficient of discrete cosine transform
in Equation (6).
In the same process applied to the inverse discrete
cosine transform.
Kernel 1
Kernel 2
Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228
www.ijera.com 227 | P a g e
One block of the input image pixel value is
X =
29 38 52 53 33 29 33 33
24 21 38 49 35 31 36 33
18 16 20 25 31 27 27 32
21 30 30 24 40 34 25 26
15 22 27 22 32 33 33 27
25 29 30 26 31 30 31 33
10 19 22 29 37 34 30 35
16 16 23 33 27 28 26 19
……… (5)
Cosine coefficient of the discrete cosine coefficient is
A=
.35 .35 .35 .35 .35 .35 .35 .35
.49 .41 .27 .09 .09 -.27 .41 -.49
.46 .19 -.19 -.46 -.46 -.19 .19 .46
.41 -.09 -.49 -.27 .27 .49 .09 -.41
.35 -.35 -.35 .35 .35 -.35 .35 .35
.27 -.49 .09 .41 -.41 -.09 .49 -.27
.19 -.46 .46 -.19 .19 .46 -.46 .19
.09 -.27 .41 -.49 .49 -.41 .27 -.09
..(6)
A kernel is the unit of work that the main
program running on the host computer offloads to the
GPU for computation on that device. In CUDA,
launching a kernel requires specifying three things:
 The dimensions of the grid
 The dimensions of the blocks
 The kernel functions to run on the device.
The implementation of DCT8x8 by
definition is performed using (7). To convert input
8x8 samples into the transform domain, two matrix
multiplications need to be performed.
This solution is never used in practice when
calculating DCT8x8 on CPU because it exhibits high
computational complexity relatively to some
separable methods. Things are different with CUDA;
the described approach maps nicely to CUDA
programming model and architecture specificity.
Image is split into a set of blocks as shown on Figure
2, Each CUDA-block runs 64 threads that perform
DCT for a single block. Every thread in a CUDA-
block computes a single DCT coefficient. All
waveforms are pre-computed beforehand and stored
in the array located in constant memory.
 Two-dimensional DCT is performed in four
steps (considering thread-level):
In first step, a thread with coordinates (ThreadIdx.x,
ThreadIdx.y) loads one pixel from a texture to shared
memory. In order to make sure the whole block is
loaded to the moment, all threads pass
synchronization point. In second step, the thread
computes a dot product between two vectors:
ThreadIdx.y column of cosine coefficients (which is
actually the row of XAA
T
with the same number)
and ThreadIdx.x column of the input block. To ensure
all coefficients of XAA
T
are calculated, the
synchronization must be passed. In third step, the
thread computes XAA
T
in the same manner as in
step second. In four steps, the whole block is copied
from shared memory to the output in global memory.
Each thread works with the single pixel.
V. SIMULATION RESULT
There were two versions of algorithm
sequential and parallel. Both of them was coded in C
with using CUDA libraries and run on NVIDIA
GeForce GT 630M with 2 GB dedicated VRAM
(Video Random Access Memory) graphics card
installed on Acer V3-571G, Intel Core i5-3210M
2.5GHz with Turbo Boost up to 3.1 GHz.
The computation of discrete cosine
transform (DCT) GeForce GT 630M graphics card
solved the day to day increasing demand for the
massive parallel general purpose computing. The
accelerated version on GPU was astonishingly fast,
because it took less time compare to sequential
implementation. However, this value is strictly
limited to the execution of computation kernel itself
and doesn't include any overhead cost.
VI. CONCLUSION
In this paper, we implemented and studied
the performance of computing the discrete cosine
transform for8×8 blocks with NVIDIA CUDA
technology. Discrete cosine transform approaches
were implemented for CPU and GPU. We have
shown that GPU is an excellent candidate for
performing data intensive discrete cosine transform.
A direct application of this work is to perform
discrete cosine transform in real time digital signal
processing and image processing. The performance
testing was held for both approaches and they
exhibited good speedup rates while keeping objective
result quality constant.
Figure 2: Input image (64 × 64). The input image
divided into horizontal and vertical eight equal parts,
each part have the dimension of 8 × 8.
Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228
www.ijera.com 228 | P a g e
REFERENCES
[1] Syed Ali Khayam. “The Discrete Cosine
Transform (DCT): Theory and Application”.
ECE 802 – 602: Information Theory and
Coding, March 10th 2003.
[2] R. Kresch and N. Merhav, “Fast DCT
domain filtering using the DCT and the
DST”. HPL Technical Report #HPL-95-140,
December 1995.
[3] Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang
Yu, Hsi-Chin Hsin,“High-Efficiency and
Low-Power Architectures for 2-D DCT and
IDCT Based on CORDIC Rotation”,
Proceedings of the 7th ICPDC, pp. 191-196,
2006.
[4] Simon Green. Discrete Cosine Transform
GPU
implementation.http://developer.download.n
vidia.com/SDK/9.5/Samples/vidimaging_sa
mples.html#gpgpu_dct.
[5] K. Fatahalian and M. Houston, “GPUs: A
Closer Look”, ACM Queue, Vol. 6, No. 2,
March/April 2008, pp. 18–28.
[6] S. P. Mohanty, N. Pati, and E. Kougianos,
“A Watermarking Co-Processor for New
Generation Graphics Processing Units”, in
Proc. 25th IEEE International Conference
on Consumer Electronics, pp. 303-304, 2007.
[7] V. Galiano, O.Lopez and H. Migallon,
“parallel Strategies for 2D Discrete Wavelet
Transform in Shared Memory Systems and
GPUs,” Published Springer Science+
Business Media, LLC 2012.
[8] R. Qu and Chunhong Zhang ,” High
Performance Finite Impulse Response Filter
on Graphics Processors,” 3nd Int. Conf. on
Intelligent Control and information
Processing 2012.
[9] Wladimir J. van der Laan, Andrei C. Jalba,
and Jos B.T.M. Roerdink ,“Accelerating
Wavelet Lifting on Graphics Hardware
Using CUDA,” IEEE Transactions on
Parallel and Distributed Systems, Vol. 22,
No. 1, January 2011.
[10] Saraju P. mohanty, “GPU-CPU MULTI-
Core for Real time Signal processing,” 978-
1-4244-2559-4/09 2009 IEEE.
[11] Ing. Vaclav Simek , “GPU Acceleration of
2-D DWT Image Compression in MATLAB
with CUDA ,” second UKSIM European
Symposium on Computer Modeling and
Simulation 2008.

Contenu connexe

Tendances

IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB)
IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB) IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB)
IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB) cscpconf
 
Image coding through ztransform with low energy and bandwidth (izeb)
Image coding through ztransform with low energy and bandwidth (izeb)Image coding through ztransform with low energy and bandwidth (izeb)
Image coding through ztransform with low energy and bandwidth (izeb)csandit
 
Mapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping Parallel Programs into Hierarchical Distributed Computer SystemsMapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping Parallel Programs into Hierarchical Distributed Computer SystemsMikhail Kurnosov
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1NVIDIA
 
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...Artem Lutov
 
Graphical Model Selection for Big Data
Graphical Model Selection for Big DataGraphical Model Selection for Big Data
Graphical Model Selection for Big DataAlexander Jung
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentationjesujoseph
 
Implementation of the Binary Multiplier on CPLD Using Reversible Logic Gates
Implementation of the Binary Multiplier on CPLD Using Reversible Logic GatesImplementation of the Binary Multiplier on CPLD Using Reversible Logic Gates
Implementation of the Binary Multiplier on CPLD Using Reversible Logic GatesIOSRJECE
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check IJECEIAES
 
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPU
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPUAcceleration of the Longwave Rapid Radiative Transfer Module using GPGPU
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPUMahesh Khadatare
 
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
FPGA Implementation of SubByte & Inverse SubByte for AES AlgorithmFPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithmijsrd.com
 
Image proceesing with matlab
Image proceesing with matlabImage proceesing with matlab
Image proceesing with matlabAshutosh Shahi
 

Tendances (15)

IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB)
IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB) IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB)
IMAGE CODING THROUGH ZTRANSFORM WITH LOW ENERGY AND BANDWIDTH (IZEB)
 
Image coding through ztransform with low energy and bandwidth (izeb)
Image coding through ztransform with low energy and bandwidth (izeb)Image coding through ztransform with low energy and bandwidth (izeb)
Image coding through ztransform with low energy and bandwidth (izeb)
 
Hz2514321439
Hz2514321439Hz2514321439
Hz2514321439
 
Mapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping Parallel Programs into Hierarchical Distributed Computer SystemsMapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping Parallel Programs into Hierarchical Distributed Computer Systems
 
High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1High Performance Pedestrian Detection On TEGRA X1
High Performance Pedestrian Detection On TEGRA X1
 
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...
DAOR - Bridging the Gap between Community and Node Representations: Graph Emb...
 
Graphical Model Selection for Big Data
Graphical Model Selection for Big DataGraphical Model Selection for Big Data
Graphical Model Selection for Big Data
 
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
 
Implementation of the Binary Multiplier on CPLD Using Reversible Logic Gates
Implementation of the Binary Multiplier on CPLD Using Reversible Logic GatesImplementation of the Binary Multiplier on CPLD Using Reversible Logic Gates
Implementation of the Binary Multiplier on CPLD Using Reversible Logic Gates
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check
 
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPU
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPUAcceleration of the Longwave Rapid Radiative Transfer Module using GPGPU
Acceleration of the Longwave Rapid Radiative Transfer Module using GPGPU
 
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
FPGA Implementation of SubByte & Inverse SubByte for AES AlgorithmFPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
FPGA Implementation of SubByte & Inverse SubByte for AES Algorithm
 
50120130406040
5012013040604050120130406040
50120130406040
 
SPAA11
SPAA11SPAA11
SPAA11
 
Image proceesing with matlab
Image proceesing with matlabImage proceesing with matlab
Image proceesing with matlab
 

En vedette (20)

Ap35235240
Ap35235240Ap35235240
Ap35235240
 
As35252256
As35252256As35252256
As35252256
 
Ak35206217
Ak35206217Ak35206217
Ak35206217
 
Ai35194197
Ai35194197Ai35194197
Ai35194197
 
Cv35547551
Cv35547551Cv35547551
Cv35547551
 
Ax35277281
Ax35277281Ax35277281
Ax35277281
 
Am35221224
Am35221224Am35221224
Am35221224
 
Ar35247251
Ar35247251Ar35247251
Ar35247251
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
 
At35257260
At35257260At35257260
At35257260
 
Az35288290
Az35288290Az35288290
Az35288290
 
Bb35295305
Bb35295305Bb35295305
Bb35295305
 
Ao35229234
Ao35229234Ao35229234
Ao35229234
 
Ba35291294
Ba35291294Ba35291294
Ba35291294
 
Aw35271276
Aw35271276Aw35271276
Aw35271276
 
Al35218220
Al35218220Al35218220
Al35218220
 
20150326145412840
2015032614541284020150326145412840
20150326145412840
 
Hoja de Vida de Docentes del Diplomado Liderazgo para la Trasnformacion 2015 ...
Hoja de Vida de Docentes del Diplomado Liderazgo para la Trasnformacion 2015 ...Hoja de Vida de Docentes del Diplomado Liderazgo para la Trasnformacion 2015 ...
Hoja de Vida de Docentes del Diplomado Liderazgo para la Trasnformacion 2015 ...
 
Chuan muc ke toan dot 1
Chuan muc ke toan dot 1Chuan muc ke toan dot 1
Chuan muc ke toan dot 1
 
годовой план 14 15
годовой план 14 15годовой план 14 15
годовой план 14 15
 

Similaire à An35225228

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...IAEME Publication
 
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...inventionjournals
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsQuEST Global (erstwhile NeST Software)
 
FPGA based JPEG Encoder
FPGA based JPEG EncoderFPGA based JPEG Encoder
FPGA based JPEG EncoderIJERA Editor
 
VisionizeBeforeVisulaize_IEVC_Final
VisionizeBeforeVisulaize_IEVC_FinalVisionizeBeforeVisulaize_IEVC_Final
VisionizeBeforeVisulaize_IEVC_FinalMasatsugu HASHIMOTO
 
A Novel Image Compression Approach Inexact Computing
A Novel Image Compression Approach Inexact ComputingA Novel Image Compression Approach Inexact Computing
A Novel Image Compression Approach Inexact Computingijtsrd
 
Computer graphics
Computer graphicsComputer graphics
Computer graphicsamitsarda3
 
Bivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm forBivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm foreSAT Publishing House
 
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...VLSICS Design
 
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...VLSICS Design
 
A Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAA Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAIJERD Editor
 
IIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIRJET Journal
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 

Similaire à An35225228 (20)

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...
 
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming Paradigms
 
Squashed JPEG Image Compression via Sparse Matrix
Squashed JPEG Image Compression via Sparse MatrixSquashed JPEG Image Compression via Sparse Matrix
Squashed JPEG Image Compression via Sparse Matrix
 
Squashed JPEG Image Compression via Sparse Matrix
Squashed JPEG Image Compression via Sparse MatrixSquashed JPEG Image Compression via Sparse Matrix
Squashed JPEG Image Compression via Sparse Matrix
 
FPGA based JPEG Encoder
FPGA based JPEG EncoderFPGA based JPEG Encoder
FPGA based JPEG Encoder
 
VisionizeBeforeVisulaize_IEVC_Final
VisionizeBeforeVisulaize_IEVC_FinalVisionizeBeforeVisulaize_IEVC_Final
VisionizeBeforeVisulaize_IEVC_Final
 
Mod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdfMod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdf
 
Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
A Novel Image Compression Approach Inexact Computing
A Novel Image Compression Approach Inexact ComputingA Novel Image Compression Approach Inexact Computing
A Novel Image Compression Approach Inexact Computing
 
Computer graphics
Computer graphicsComputer graphics
Computer graphics
 
FIR filter on GPU
FIR filter on GPUFIR filter on GPU
FIR filter on GPU
 
Introduction to 2D/3D Graphics
Introduction to 2D/3D GraphicsIntroduction to 2D/3D Graphics
Introduction to 2D/3D Graphics
 
Bivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm forBivariatealgebraic integerencoded arai algorithm for
Bivariatealgebraic integerencoded arai algorithm for
 
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...
Pipelined Architecture of 2D-DCT, Quantization and ZigZag Process for JPEG Im...
 
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
PIPELINED ARCHITECTURE OF 2D-DCT, QUANTIZATION AND ZIGZAG PROCESS FOR JPEG IM...
 
A Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDAA Review on Image Compression in Parallel using CUDA
A Review on Image Compression in Parallel using CUDA
 
IIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGAIIIRJET-Implementation of Image Compression Algorithm on FPGA
IIIRJET-Implementation of Image Compression Algorithm on FPGA
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
 

Dernier

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Dernier (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

An35225228

  • 1. Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228 www.ijera.com 225 | P a g e Performing DCT8x8 Computation on GPU Using NVIDIA CUDA Technology Jagdamb Behari Srivastava, R. B. Singh, Jitendra Jain Jawaharlal Nehru Krrishi Vishwavidyalaya Jabalpur Abstract— In this paper, we have proposed sequential and parallel Discrete Cosine Transform (DCT) in compute unified device architecture (CUDA) libraries. The introduction of programmable pipeline in the graphics processing units (GPU) has enabled configurability. GPU which is available in every computer has a tremendous feat of highly parallel SIMD processing, but its capability is often under-utilized. The two- dimensional variation of the transform that operates on 8x8 blocks (DCT8x8) is widely used in image and video coding because it exhibits high signal de-correlation rates and can be easily implemented on the majority of contemporary computing architectures. Performing DCT8x8 computation on GPU using NVIDIA CUDA technology gives significant performance boost even compared to a modern CPU. Keywords: - Discrete Cosine Transform, Graphics Processor unit (GPU), Computed unified device architecture (CUDA). I. INTRODUCTION The Discrete Cosine Transform (DCT) is a Fourier-like transform, which was first proposed by Ahmed et al. (1974). While the Fourier Transform represents a signal as the mixture of sines and cosines, the Cosine Transform performs only the cosine-series expansion. The purpose of DCT is to perform de-correlation of the input signal and to present the output in the frequency domain. There are several types of DCT [2]. The most popular is two-dimensional symmetric variation of the transform that operates on 8x8 blocks (DCT8x8) and its inverse. The DCT8x8 is utilized in JPEG compression routines and has become a de- facto standard in image and video coding algorithms and other DSP-related areas. Most of the CPU implementations of DCT8x8 are well-optimized, which includes the transform reparability utilization on high-level and fixed point arithmetic, cache- targeted optimizations on low-level. GPU acceleration of DCT8x8 computation has been possible since appearance of shader languages. Nevertheless, this required a specific setup to utilize common graphics API such as OpenGL or Direct3D. CUDA, on the other hand, provides a natural extension of C language that allows a transparent implementation of GPU accelerated algorithms. Also, DCT8x8 greatly benefits from CUDA-specific features, such as shared memory and explicit synchronization points. The rest of the paper is organized as follows. In section 2, block diagram of graphic processing unit (GPU) is given. In section 3, explain the 1-D and 2-D discrete cosine transform (DCT). In section 4, Computation of discrete cosine transform on GPU using NVIDIA CUDA technology is introduced. Section 5 & 6 provides the simulation result and conclusion. II. GRAPHIC PROCESSING UNIT (GPU) The block-diagram of a modern programmable GPU is shown in Figure 1 [3]. The architecture of GPU offers a large degree of parallelism at a relatively low cost though the well- known vector processing model known as Single Instruction Multiple Data (SIMD). There are two type of process used in GPU system i.e. vertex and pixel processors. The vertex processor performs mathematical operations that transform a vertex into a screen position and the pixel processor performs the texturing operations. In this sense, GPUs are stream processors. A stream is a set of records that require similar computation and provide data parallelism. The vertices and fragments are the elements in a stream. For each element one can only read from the input, perform operations on it, and write to the output. However, recent improvements in graphics cards have permitted to have multiple inputs, multiple Input Data Vertex Buffer Vertex Processor Rasterization Processor Fragment Processor Frame Buffer Texture Buffer Output Data Figure 1: The Programmable Graphics Pipeline RESEARCH ARTICLE OPEN ACCESS
  • 2. Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228 www.ijera.com 226 | P a g e outputs, but never a piece of memory that is both readable and writable. It is important for general- purpose applications to have high arithmetic intensity; otherwise the memory access latency will limit computation speed. Recent developments in data transfer rate through PCI Express interface to an extent addressed this issue [1], [4]. Programming the GPU in a high level language can be done using Cg from NVidia, OpenGL, etc. [6]. III. DISCRETE COSINE TRANSFORM Formally, the discrete cosine transform is an invertible function F: RR NN  or equivalently an invertible square NN  matrix [1]. The formal definition for the DCT of one-dimensional sequence of length N is given by the following formula:      1 0 )1]......( 2 )12( cos[)()()( N x N ux xfuuC   The inverse transform is defined as:      1 0 )2]......( 2 )12( cos[)()()( N u N ux uCuxf   The coefficients at the beginnings of formulae make the transform matrix orthogonal. For both equations (1) and (2) the coefficients are given by the following notation:           )3......(.......... 0 2 0 1 )( u N u N u To perform the DCT of length N effectively the cosine values are usually pre-computed offline. A 1D DCT of size N will require N vectors of N elements to store cosine values (matrix A). 1D cosine transform can be then represented as a sequence of dot products between the signal sample (vector x) and cosine coefficient ( A T ), resulting in transformed vector ( xA T ). A 2D approach performs DCT on input sample X by subsequently applying DCT to rows and columns of the input signal, utilizing the separability property of the transform. In matrix notation this can be expressed using the following formula: )4....(..........),( XAAvuC T  IV. COMPUTATION OF DISCRETE COSINE TRANSFORMATION (DCT) With advent of CUDA technology it has become possible to perform SIMD (single instruction, multiple data) high-level program parallelization. Generally, DCT8x8 is a high-level parallelizable algorithm and thus can be easily computation by the CUDA. In applications, such as JPEG and MPEG, DCT is a resource intensive task. In this section two different approaches to implementing DCT 8×8 block using CUDA. In the 1st approaches, demonstrate CUDA programming model benefits, a number of source textures, the associated vertex streams, the render target, the vertex shader and the pixel shader are specified. The source textures hold the input data. The vertex streams consist of vertices that contain the target position and the associated texture address information. The render target is a texture that holds the resulting DCT coefficients. The vertex shader calculates the target position. The 2nd approaches, which is triggered issuing the Draw Primitives call and creating really fast highly optimized kernels. Then the target texture is rasterized and the pixel shader is subsequently executed to perform pixel- wise calculations. We used a gray scale image data (image size 64 × 64) to compute the DCT using OpenGL API (Application Programming Interface).  Discrete Cosine Transform for 8x8 Blocks with CUDA in three projects are made o Project name 1.cpp (C++ code) o Project name2.cu (CUDA code) o Project name3.cpp (C++ code) In the first project name1.cpp include the CImg.h header file; CImg.h defines the classes and methods to manage images in your own C++ code. CImg.h is self-contained and thus highly portable. It fully works on different operating system and is compatible with various C++ compilers. In this project, firstly save the image in .bmp format in project name1.cpp file. After than write the code (C++ code) and calculated the pixel value of the image. In second project name 2, Copy the image pixel value in project name1.cpp and paste project name2.cu file. There are four kernels are used in this project. In Equation (4) XAAvuC T ),( used two kernel, the first kernel used in multiplier the transform of cosine coefficient and image pixel value 1kXA T  . The second kernel used in multiplier the k1 and cosine coefficient. XAAvuC T ),( Show the pixel value of the input image in Equation (5) and cosine coefficient of discrete cosine transform in Equation (6). In the same process applied to the inverse discrete cosine transform. Kernel 1 Kernel 2
  • 3. Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228 www.ijera.com 227 | P a g e One block of the input image pixel value is X = 29 38 52 53 33 29 33 33 24 21 38 49 35 31 36 33 18 16 20 25 31 27 27 32 21 30 30 24 40 34 25 26 15 22 27 22 32 33 33 27 25 29 30 26 31 30 31 33 10 19 22 29 37 34 30 35 16 16 23 33 27 28 26 19 ……… (5) Cosine coefficient of the discrete cosine coefficient is A= .35 .35 .35 .35 .35 .35 .35 .35 .49 .41 .27 .09 .09 -.27 .41 -.49 .46 .19 -.19 -.46 -.46 -.19 .19 .46 .41 -.09 -.49 -.27 .27 .49 .09 -.41 .35 -.35 -.35 .35 .35 -.35 .35 .35 .27 -.49 .09 .41 -.41 -.09 .49 -.27 .19 -.46 .46 -.19 .19 .46 -.46 .19 .09 -.27 .41 -.49 .49 -.41 .27 -.09 ..(6) A kernel is the unit of work that the main program running on the host computer offloads to the GPU for computation on that device. In CUDA, launching a kernel requires specifying three things:  The dimensions of the grid  The dimensions of the blocks  The kernel functions to run on the device. The implementation of DCT8x8 by definition is performed using (7). To convert input 8x8 samples into the transform domain, two matrix multiplications need to be performed. This solution is never used in practice when calculating DCT8x8 on CPU because it exhibits high computational complexity relatively to some separable methods. Things are different with CUDA; the described approach maps nicely to CUDA programming model and architecture specificity. Image is split into a set of blocks as shown on Figure 2, Each CUDA-block runs 64 threads that perform DCT for a single block. Every thread in a CUDA- block computes a single DCT coefficient. All waveforms are pre-computed beforehand and stored in the array located in constant memory.  Two-dimensional DCT is performed in four steps (considering thread-level): In first step, a thread with coordinates (ThreadIdx.x, ThreadIdx.y) loads one pixel from a texture to shared memory. In order to make sure the whole block is loaded to the moment, all threads pass synchronization point. In second step, the thread computes a dot product between two vectors: ThreadIdx.y column of cosine coefficients (which is actually the row of XAA T with the same number) and ThreadIdx.x column of the input block. To ensure all coefficients of XAA T are calculated, the synchronization must be passed. In third step, the thread computes XAA T in the same manner as in step second. In four steps, the whole block is copied from shared memory to the output in global memory. Each thread works with the single pixel. V. SIMULATION RESULT There were two versions of algorithm sequential and parallel. Both of them was coded in C with using CUDA libraries and run on NVIDIA GeForce GT 630M with 2 GB dedicated VRAM (Video Random Access Memory) graphics card installed on Acer V3-571G, Intel Core i5-3210M 2.5GHz with Turbo Boost up to 3.1 GHz. The computation of discrete cosine transform (DCT) GeForce GT 630M graphics card solved the day to day increasing demand for the massive parallel general purpose computing. The accelerated version on GPU was astonishingly fast, because it took less time compare to sequential implementation. However, this value is strictly limited to the execution of computation kernel itself and doesn't include any overhead cost. VI. CONCLUSION In this paper, we implemented and studied the performance of computing the discrete cosine transform for8×8 blocks with NVIDIA CUDA technology. Discrete cosine transform approaches were implemented for CPU and GPU. We have shown that GPU is an excellent candidate for performing data intensive discrete cosine transform. A direct application of this work is to perform discrete cosine transform in real time digital signal processing and image processing. The performance testing was held for both approaches and they exhibited good speedup rates while keeping objective result quality constant. Figure 2: Input image (64 × 64). The input image divided into horizontal and vertical eight equal parts, each part have the dimension of 8 × 8.
  • 4. Jitendra Jain et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.225-228 www.ijera.com 228 | P a g e REFERENCES [1] Syed Ali Khayam. “The Discrete Cosine Transform (DCT): Theory and Application”. ECE 802 – 602: Information Theory and Coding, March 10th 2003. [2] R. Kresch and N. Merhav, “Fast DCT domain filtering using the DCT and the DST”. HPL Technical Report #HPL-95-140, December 1995. [3] Tze-Yun Sung, Yaw-Shih Shieh, Chun-Wang Yu, Hsi-Chin Hsin,“High-Efficiency and Low-Power Architectures for 2-D DCT and IDCT Based on CORDIC Rotation”, Proceedings of the 7th ICPDC, pp. 191-196, 2006. [4] Simon Green. Discrete Cosine Transform GPU implementation.http://developer.download.n vidia.com/SDK/9.5/Samples/vidimaging_sa mples.html#gpgpu_dct. [5] K. Fatahalian and M. Houston, “GPUs: A Closer Look”, ACM Queue, Vol. 6, No. 2, March/April 2008, pp. 18–28. [6] S. P. Mohanty, N. Pati, and E. Kougianos, “A Watermarking Co-Processor for New Generation Graphics Processing Units”, in Proc. 25th IEEE International Conference on Consumer Electronics, pp. 303-304, 2007. [7] V. Galiano, O.Lopez and H. Migallon, “parallel Strategies for 2D Discrete Wavelet Transform in Shared Memory Systems and GPUs,” Published Springer Science+ Business Media, LLC 2012. [8] R. Qu and Chunhong Zhang ,” High Performance Finite Impulse Response Filter on Graphics Processors,” 3nd Int. Conf. on Intelligent Control and information Processing 2012. [9] Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink ,“Accelerating Wavelet Lifting on Graphics Hardware Using CUDA,” IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 1, January 2011. [10] Saraju P. mohanty, “GPU-CPU MULTI- Core for Real time Signal processing,” 978- 1-4244-2559-4/09 2009 IEEE. [11] Ing. Vaclav Simek , “GPU Acceleration of 2-D DWT Image Compression in MATLAB with CUDA ,” second UKSIM European Symposium on Computer Modeling and Simulation 2008.