CUDA Speed Up for Side Information Generation in Distributed Video Coding

Ping-Shang Wang, National Taiwan University, R98944043@ntu.edu.tw
Kan-Han Lu, National Taiwan Normal University, 698470271@ntnu.edu.tw
Cong-Min Huang, National Taiwan University, R98548012@ntu.edu.tw

ABSTRACT
Distributed Video Coding (DVC) has recently become increasingly popular among video coding researchers because of its attractive and promising features. In contrast to conventional video codecs, DVC shifts the complexity balance between the encoder and the decoder. However, most reported DVC schemes suffer from a high decoding delay, which hinders their practical application in real-time systems. In this work we focus on speeding up the Side Information (SI) generation module, a major function of the DVC decoding algorithm and one of its most time-consuming components. By implementing it with the Compute Unified Device Architecture (CUDA) on a general-purpose graphics processing unit (GPGPU), we show experimentally that a considerable speedup can be obtained with the proposed parallelized SI generation algorithm.

Keywords
Distributed video coding, Wyner-Ziv video coding, side information, frame interpolation, CUDA.

1. INTRODUCTION
Distributed video coding (DVC) has recently attracted a great deal of attention from the video coding community. The new coding paradigm, also known as Wyner-Ziv (WZ) video coding, is based on the Slepian-Wolf [1] and Wyner-Ziv [2] theorems. These theorems state that separate encoding with joint decoding of two correlated sources X and Y can achieve the same minimum rate as the joint encoding and decoding used in conventional video coding. A DVC codec thus departs from the traditional prediction-based standard video coding scheme by exploiting the source statistics at the decoder, which allows much simpler encoders. In other words, the complexities of the encoder and the decoder are reversed: the encoder becomes fairly simple and leaves all the computationally expensive processing to the decoder. This is done by shifting the complex motion estimation/compensation procedure from the encoder to the decoder, so that, in contrast to conventional coders, motion estimation is performed only at the decoder side. It is used to generate a motion-compensated prediction Y of the original frame X, called side information (SI). The SI may be seen as a "corrupted" version of the original information, and error-correcting codes (LDPC or Turbo codes) are typically used to improve its quality until a target quality for the final decoded frame is achieved. These features are useful in several application domains, e.g. video conferencing with mobile devices, wireless video cameras and wireless low-power surveillance.

However, most reported DVC architectures face a common problem: high decoding complexity, which prevents their use in real-time video applications. The complexity arises mainly from two factors: the iterative LDPC (or Turbo) decoding process with a feedback channel, and the motion estimation procedure in Side Information (SI) generation. To obtain a solution more suitable for practical applications, new ideas have been proposed to amend or optimize the structure of the decoder. However, SI generation plays a key role in determining the performance of the codec, and the reconstructed video quality is sensitive to the quality of the side information.
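For reference, the bounds established by these theorems can be written compactly (a standard formulation, included here only for the reader's convenience; R denotes encoding rates and H entropies):
\[
R_X \ge H(X \mid Y), \qquad R_Y \ge H(Y \mid X), \qquad R_X + R_Y \ge H(X, Y),
\]
and, for lossy coding of X with side information Y available only at the decoder,
\[
R^{\mathrm{WZ}}_{X|Y}(D) \ge R_{X|Y}(D),
\]
with equality in important special cases (e.g. jointly Gaussian sources under a mean-squared-error distortion), which is what makes the DVC complexity shift possible without a large rate penalty.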
It is common that reducing the cost of the motion search to obtain faster SI generation causes a sharp decrease in PSNR. Likewise, a sufficient number of channel decoding iterations is needed to guarantee decoded bit accuracy, which is also critical to video quality. For these reasons, instead of cutting computing steps, we adopt parallel approaches to achieve a faster implementation with the computation left intact. With graphics processing units (GPUs) becoming more powerful and widespread, they are finding broader application in scientific and general-purpose computation. Our proposed parallel approach uses only a low-end NVIDIA GPU to significantly reduce the time needed for SI generation while keeping the same SI quality.

The paper is organized as follows. Section 2 introduces the DVC codec we use [3] and its SI generation module. Section 3 presents the proposed parallel approaches to SI generation. The experimental results are reported in Section 4, and Section 5 concludes the paper.

2. DISCOVER Codec

Figure 1 - DISCOVER codec architecture
2.1 Introduction
The DISCOVER WZ video codec, developed by a European project funded under the European Commission IST FP6 programme, is based on the early Stanford WZ video coding architecture proposed in [4, 5]; further information may be obtained from [3, 6]. Its architecture is illustrated in Figure 1. The DISCOVER codec is probably the most efficient WZ video codec currently available. Its performance is reported in detail, together with the corresponding test conditions, in [6]; moreover, executable code can be downloaded, allowing researchers to compare performance for other sequences and conditions as well. In this work we parallelize only the SI generation module using the GPGPU; the remaining modules are unchanged. Given the scope of this work, we describe only the SI generation module; for details of the other modules, see [5].

2.2 SI Generation Module in the DISCOVER Codec
The following techniques [7][8] are used to obtain high-quality side information. Figure 2 shows the architecture of the frame interpolation scheme. First, forward motion estimation from Xb to Xf is performed. Block matching based on a modified MAD (mean absolute difference) criterion is used in order to regularize the motion vector field; the criterion favors motion vectors closer to the origin. Then, bidirectional motion estimation is performed to find symmetric motion vectors from the current WZ frame to Xb and Xf. Spatial motion smoothing based on a weighted vector median filter is applied afterwards to the obtained motion field to remove outliers. Finally, motion compensation is performed between Xb and Xf along the obtained motion field to generate the side information. A hierarchical coarse-to-fine approach is used in the bidirectional motion estimation: the first iteration uses a large block size (16x16) and tracks fast motion reliably, while the second iteration achieves higher precision using a smaller block size (8x8). The motion search is performed with the half-pixel precision method described in [9].

Among the steps mentioned above, the most time-consuming are the forward motion estimation and the FIR filter (for half-pixel motion estimation), which account for about 70% and 25% of the entire procedure, respectively. Consequently, we focus on parallelizing these two parts on the GPGPU to reduce the processing time.

3. PARALLEL APPROACH TO SI GENERATION ON THE GPGPU PLATFORM

3.1 Parallelized Forward Motion Estimation
The proposed parallel algorithm for forward motion estimation is implemented on the GPGPU platform using CUDA. We parallelize this part at the block level to incur the least thread overhead. To guarantee a balanced workload on each core, we also use the indexes of the WZ blocks gathered before decoding and allocate almost the same number of blocks to each core. For simplicity, we illustrate the proposed approach with an example: the input sequence is QCIF (176x144) and the block size is 16x16, giving 99 blocks in the target frame. The parallel processing of the forward motion estimation is shown in Figure 4.

Figure 4 - Parallel approach for forward motion estimation

First, we launch a CUDA kernel to compute the motion field between the past frame and the future frame, in which each block of the future frame is mapped to a respective CUDA block and has 1024-4096 candidate blocks within the search range in the reference frame. We use 512 threads per CUDA block to compute the cost (the modified MAD) of the candidate blocks in parallel. Each thread then keeps its local minimum cost (among all candidate blocks it processed) and the corresponding motion vector in shared memory.
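To make this mapping concrete, the following minimal CUDA sketch mirrors the description above: one CUDA block per 16x16 block of the future frame, 512 threads evaluating candidate positions, and per-thread local minima kept in shared memory. The kernel name, the 64x64 full-search window (4096 candidates), the exact form of the motion-vector penalty (weight LAMBDA) and the use of plain global memory for the reference frame are illustrative assumptions, not the authors' implementation.

// Sketch of the per-block cost-evaluation kernel described above (assumptions noted in the text).
#include <cuda_runtime.h>
#include <float.h>

#define BLK      16      // block size used in forward ME
#define THREADS  512     // threads per CUDA block
#define SEARCH   64      // assumed search window side: 64*64 = 4096 candidates
#define LAMBDA   0.05f   // assumed weight favoring short motion vectors

__global__ void forwardMECost(const unsigned char *past,   // reference (past) frame
                              const unsigned char *future, // frame being matched
                              int width, int height,
                              float *bestCost, short2 *bestMV) // one entry per block
{
    __shared__ unsigned char cur[BLK * BLK];   // current 16x16 block of the future frame
    __shared__ float  sCost[THREADS];          // per-thread local minimum cost
    __shared__ short2 sMV[THREADS];            // per-thread best motion vector

    const int bx  = (blockIdx.x % (width / BLK)) * BLK;  // top-left corner of this block
    const int by  = (blockIdx.x / (width / BLK)) * BLK;
    const int tid = threadIdx.x;

    // Stage the current block in shared memory (256 pixels, first 256 threads).
    if (tid < BLK * BLK)
        cur[tid] = future[(by + tid / BLK) * width + bx + tid % BLK];
    __syncthreads();

    float  myCost = FLT_MAX;
    short2 myMV   = make_short2(0, 0);

    // Each thread handles every THREADS-th candidate of the search window
    // (8 candidates per thread for a 4096-candidate window).
    for (int c = tid; c < SEARCH * SEARCH; c += THREADS) {
        int dx = c % SEARCH - SEARCH / 2;      // candidate displacement
        int dy = c / SEARCH - SEARCH / 2;
        float sad = 0.0f;
        for (int p = 0; p < BLK * BLK; ++p) {
            int x = min(max(bx + p % BLK + dx, 0), width  - 1);  // clamp at borders;
            int y = min(max(by + p / BLK + dy, 0), height - 1);  // the paper keeps the
            sad += fabsf((float)cur[p] - (float)past[y * width + x]); // reference in texture memory
        }
        // Modified MAD: penalize long vectors so the motion field stays regular
        // (the penalty form is an assumption).
        float cost = (sad / (BLK * BLK)) * (1.0f + LAMBDA * sqrtf((float)(dx * dx + dy * dy)));
        if (cost < myCost) { myCost = cost; myMV = make_short2((short)dx, (short)dy); }
    }
    sCost[tid] = myCost;
    sMV[tid]   = myMV;
    __syncthreads();

    // Naive selection by thread 0, kept here for clarity only; the tree
    // reduction actually used for this step is sketched below (cf. Figure 3).
    if (tid == 0) {
        float best = sCost[0]; short2 mv = sMV[0];
        for (int i = 1; i < THREADS; ++i)
            if (sCost[i] < best) { best = sCost[i]; mv = sMV[i]; }
        bestCost[blockIdx.x] = best;
        bestMV[blockIdx.x]   = mv;
    }
}

// Launch: one CUDA block per 16x16 block, e.g. 99 blocks for QCIF:
//   forwardMECost<<<(width/BLK)*(height/BLK), THREADS>>>(d_past, d_future, width, height, d_cost, d_mv);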
Hence, when all threads of a CUDA block have finished, we have 512 local minimum costs, and the remaining task is to pick the global minimum among them with a reduction algorithm adapted from [10]. When this kernel completes we obtain the motion field and keep it in device memory; a second CUDA kernel is then launched to find, for each block of the interpolated frame, the corresponding motion vector closest to the block's origin, and the result is copied back to the host.

The reference frame is transferred to the device and stored in the GPU's texture memory, which is faster to access when the same positions are read multiple times, and each 16x16 block of the future frame is stored in shared memory so that every thread of the CUDA block can access it quickly. The local minimum cost and its corresponding motion vector for each thread are also kept in shared memory for the same reason. In addition, we apply loop unrolling, avoid bank conflicts in shared memory, and minimize the number of global memory accesses.

Figure 3 - Reduction Algorithm
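As a concrete illustration of the reduction in Figure 3, the following sketch shows a shared-memory tree reduction in the style of [10], operating on the sCost/sMV arrays of the previous sketch. It assumes a power-of-two thread count and must be called by all threads of the block; it is a drop-in replacement for the naive thread-0 scan shown earlier.

// Shared-memory min-reduction over per-thread costs and motion vectors (cf. [10], Figure 3).
__device__ void reduceMin(float *sCost, short2 *sMV, int tid, int nThreads)
{
    // Halve the number of active threads at each step; every active thread
    // keeps the smaller of its own cost and that of its partner at distance s.
    for (int s = nThreads / 2; s > 0; s >>= 1) {
        if (tid < s && sCost[tid + s] < sCost[tid]) {
            sCost[tid] = sCost[tid + s];
            sMV[tid]   = sMV[tid + s];
        }
        __syncthreads();   // all partners must finish before the next step
    }
    // After the loop, sCost[0] / sMV[0] hold the block-wide minimum cost
    // and its motion vector.
}

// Usage inside the kernel, replacing the thread-0 scan:
//   reduceMin(sCost, sMV, tid, THREADS);
//   if (tid == 0) { bestCost[blockIdx.x] = sCost[0]; bestMV[blockIdx.x] = sMV[0]; }

The version described in [10] additionally unrolls the last warp to remove part of the synchronization, which fits the loop-unrolling optimization mentioned above.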
3.2 Parallelized FIR Filter
The FIR filter [9] used for half-pixel interpolation references several neighboring pixel positions to interpolate each resulting pixel and therefore has a relatively high complexity. We improve its performance by parallelizing the upsampling process at the pixel level.
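The actual interpolation filter is the one specified in [9]; the minimal sketch below only illustrates the pixel-level parallelization, assuming, purely for illustration, the familiar 6-tap half-pel filter (1, -5, 20, 20, -5, 1)/32 and a two-pass (horizontal, then vertical) upsampling. The kernel and buffer names are hypothetical.

// One thread per output position of the horizontally upsampled row.
#include <cuda_runtime.h>

__global__ void halfPelHorizontal(const unsigned char *src, unsigned char *dst,
                                  int width, int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;   // integer-pel column
    const int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x >= width || y >= height) return;

    const unsigned char *row = src + y * width;

    // Copy the integer-pel sample and compute the half-pel sample to its right.
    dst[y * 2 * width + 2 * x] = row[x];

    const int taps[6] = { 1, -5, 20, 20, -5, 1 };           // assumed 6-tap filter
    int acc = 0;
    for (int k = 0; k < 6; ++k) {
        int xs = min(max(x + k - 2, 0), width - 1);          // clamp at the borders
        acc += taps[k] * row[xs];
    }
    int val = (acc + 16) >> 5;                               // round and divide by 32
    dst[y * 2 * width + 2 * x + 1] = (unsigned char)min(max(val, 0), 255);
}

// Launch example: 16x16 threads per block over the integer-pel grid.
//   dim3 threads(16, 16);
//   dim3 blocks((width + 15) / 16, (height + 15) / 16);
//   halfPelHorizontal<<<blocks, threads>>>(d_src, d_halfH, width, height);
// A second, analogous pass would upsample vertically to obtain the full
// half-pel grid used by the motion search.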
4. EXPERIMENTAL RESULTS
All evaluations are run on a PC with an AMD Athlon 64 X2 5600+ CPU at 2.91 GHz (1 MB cache) and an NVIDIA GeForce GT220 graphics card. The test sequences are Foreman, Soccer, Coastguard and Hall Monitor at QCIF resolution and a 15 Hz frame rate, using the whole sequences. The GOP size is 2 and the 8th quantization table (Q=8) is used. The times reported here cover only the components of SI generation, so that the evaluation focuses on the SI generation module; all times in Table 1 are given in milliseconds.

The SI generation processing time for all test sequences is given in Table 1. We achieve an average speedup of 14.15x for the forward motion estimation and 6.87x for the FIR filter; for the entire SI generation procedure, the average speedup is 9.46x.

Table 1. SI generation time for the test sequences (ms)

Foreman (74 WZ frames, 76 key frames)
                           CPU             GPGPU           CPU/GPGPU
  FIR filter               1228 (11.8%)     178 (16.5%)      6.90
  Forward ME               8850 (85.2%)     594 (54.8%)     14.90
  Others                    306 (3.0%)      311 (28.7%)      -
  Total                   10384 (100%)     1083 (100%)       9.59
  Average (per WZ frame)    140.32           14.64           9.59

Soccer (74 WZ frames, 76 key frames)
                           CPU             GPGPU           CPU/GPGPU
  FIR filter               1173 (11.3%)     173 (16.8%)      6.78
  Forward ME               8911 (86.0%)     593 (57.5%)     15.03
  Others                    266 (2.7%)      265 (25.7%)      -
  Total                   10350 (100%)     1031 (100%)      10.04
  Average (per WZ frame)    139.86           13.93          10.04

Coastguard (74 WZ frames, 76 key frames)
                           CPU             GPGPU           CPU/GPGPU
  FIR filter               1267 (12.3%)     181 (15.3%)      7.00
  Forward ME               8769 (84.9%)     705 (59.7%)     12.44
  Others                    294 (2.8%)      294 (25.0%)      -
  Total                   10330 (100%)     1180 (100%)       8.75
  Average (per WZ frame)    139.59           15.95           8.75

Hall Monitor (81 WZ frames, 83 key frames)
                           CPU             GPGPU           CPU/GPGPU
  FIR filter               1386 (12.2%)     204 (16.9%)      6.79
  Forward ME               9702 (85.0%)     682 (56.6%)     14.23
  Others                    322 (2.8%)      319 (26.5%)      -
  Total                   11410 (100%)     1205 (100%)       9.47
  Average (per WZ frame)    140.86           14.88           9.47

We have also tested the algorithm on a PC with a less powerful CPU and the same grade of NVIDIA graphics card, which resulted in an even larger increase in processing speed (20-24x), not reported here. The experimental result therefore depends strongly on the relative power of the CPU and the GPU.

5. CONCLUSIONS
In this paper, a parallel algorithm for SI generation in distributed video coding, based on the GPGPU using NVIDIA CUDA, was proposed. To achieve load balancing and an optimal runtime, we presented a dynamic distribution scheme based on a task tool model and a threshold searching method. Experimental results demonstrate that our algorithm processes the side information module about 9.5 times faster on average (up to 10 times) than the sequential implementation.

6. REFERENCES
[1] D. Slepian and J. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Inf. Theory, vol. 19, no. 4, pp. 471-480, 1973.
[2] A. D. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. Inf. Theory, vol. 22, pp. 1-10, 1976.
[3] X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov, and M. Ouaret, "The DISCOVER codec: architecture, techniques and evaluation," Nov. 2007.
[4] A. Aaron, S. Rane, E. Setton, and B. Girod, "Transform-domain Wyner-Ziv codec for video," Visual Communications and Image Processing, San Jose, CA, January 2004.
[5] B. Girod, A. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proceedings of the IEEE, vol. 93, no. 1, pp. 71-83, January 2005.
[6] DISCOVER project page, http://www.img.lx.it.pt/~discover/home.html
[7] J. Ascenso, C. Brites, and F. Pereira, "Content adaptive Wyner-Ziv video coding driven by motion activity," IEEE International Conference on Image Processing, Atlanta, USA, October 2006.
[8] J. Ascenso, C. Brites, and F. Pereira, "Improving frame interpolation with spatial motion smoothing for pixel domain distributed video coding," 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic, July 2005.
[9] S. Klomp, Y. Vatis, and J. Ostermann, "Side information interpolation with sub-pel motion compensation for Wyner-Ziv decoder," Int. Conf. on Signal Processing and Multimedia Applications, Setúbal, Portugal, August 2006.
[10] M. Harris, "Optimizing parallel reduction in CUDA," NVIDIA Developer Technology, 2007.