Accelerate Reed-Solomon Erasure Codes on GPUs
Student: Shuai Yuan
Advisor: Prof. Jerry Chi-Yuan Chou
LSA Lab, NTHU
Outline
Introduction
Overview of Our Acceleration Targets
Accelerating Operations in Galois Field
Accelerating Matrix Multiplication
Reducing Data Transfer Overhead
Experiment
Conclusion
Introduction
Erasure Codes
A redundancy solution for fault-tolerance, widely used in:
signals from deep space satellites
reliable multimedia multicasting
reliable storage systems
Cloud Storage Systems: Why Redundancy?
Cloud storage vendors claim to provide highly available and reliable services in their SLAs (Service-Level Agreements) with customers. Both availability and reliability imply strict fault-tolerance requirements for cloud storage systems.
However, as the scale of a storage system grows larger and larger, the probability of failure becomes significant → SLA violation.
Therefore, redundancy must be introduced into the cloud storage system.
Cloud Storage Systems: Why Redundancy?
The simplest and most straightforward redundancy solution is replication of the data on multiple storage nodes.
Triple replication has been favored in cloud storage systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System).
Cloud Storage Systems: Why Erasure Codes?
Problem of replication: large storage overhead.
Erasure codes can reduce the storage overhead significantly while maintaining the same level of fault tolerance as replication → a better redundancy solution.
Cloud Storage Systems: Why Erasure Codes?
(n, k) MDS (Maximum Distance Separable) codes, where n and k are integers and n > k:
File → k equal-size native data chunks.
k equal-size native chunks → (n − k) code chunks.
The n native and code chunks are distributed on different storage nodes.
Tolerates the failure of any (n − k) storage nodes.
Example: Reed-Solomon Code (n = 4, k = 2) vs. Triple Replication
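To make the tradeoff concrete (a worked example using the parameters above): an (n = 4, k = 2) Reed-Solomon code stores n/k = 2× the original data and tolerates any n − k = 2 node failures, while triple replication stores 3× and also tolerates only 2 failures. With RS(10, 4), i.e. n = 14 and k = 10, the overhead drops to 1.4× while any 4 failures are tolerated.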
Cloud Storage Systems: Why Reed-Solomon Codes?
Reed-Solomon codes, written RS(k, n − k), are one of the most popular MDS erasure codes.
RS(10, 4) is used in HDFS-RAID at Facebook and RS(6, 3) is used in GFS II at Google.
Shortcomings of Reed-Solomon Codes
High extra computation cost compared to replication: encoding and decoding.
Our contribution: use the GPU to accelerate Reed-Solomon encoding and decoding.
Overview of Our Acceleration Targets
Reed-Solomon Code Overview
Reed-Solomon Encoding Reed-Solomon Decoding
Acceleration Targets - 1
Computation bottleneck: matrix multiplication.
C = A ⋅ B  ≡  ( c_j = ∑_{i=1}^{k} a_{i,j} × b_i )
Addition and multiplication are defined as arithmetic over the Galois Field GF(2^8).
Acceleration targets for computation:
Arithmetic over the Galois Field.
Matrix multiplication.
GPU-Accelerated Reed-Solomon Code Overview
Reed-Solomon Encoding in a GPU
Reed-Solomon Decoding in a GPU
Acceleration Targets - 2
Extra overhead in GPU implementation: data transfers between
CPU and GPU.
Another acceleration target: reducing data transfer overhead.
Accelerating Arithmetic over Galois Field
Brief Introduction of Galois Field
GF(2^w) contains 2^w polynomials. For each polynomial, its degree is at most w − 1 and its coefficients are in {0, 1}.
For GF(2^8), every element can be one-to-one mapped to a byte, and polynomial operations in GF(2^8) are isomorphic to operations on bytes:
Addition: isomorphic to bitwise XOR → inexpensive.
Multiplication: isomorphic to a sequence of bitwise operations → still time-consuming.
Therefore, how to accelerate multiplication over GF(2^8) will be our focus.
GPU Acceleration Options for Multiplication
loop-based method
a set of table-based methods
Loop-based Method
compute the product directly: costs a loop of at most eight iterations to multiply two elements in GF(2^8).
computation-bound
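As a concrete illustration, a minimal sketch of the loop-based multiply (assuming the common irreducible polynomial x^8 + x^4 + x^3 + x^2 + 1, i.e. 0x11D; the polynomial actually used may differ):

#include <stdint.h>

uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    for (int i = 0; i < 8; i++) {     /* at most eight iterations */
        if (b & 1)
            product ^= a;             /* addition in GF(2^8) is XOR */
        uint8_t carry = a & 0x80;     /* would shifting overflow degree 7? */
        a <<= 1;                      /* multiply a by x */
        if (carry)
            a ^= 0x1D;                /* reduce modulo the polynomial */
        b >>= 1;
    }
    return product;
}

Pure shifts, XORs, and branches: no table lookups, hence no extra memory traffic, which is why this variant is computation-bound.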
Table-based Methods
precompute and store the results in tables
e.g. Full Multiplication Table
       …    38    39    40   …
  ⋮          ⋮     ⋮     ⋮
 20    …   194   214    26   …
 21    …   228   241    50   …
 22    …   142   152    74   …
  ⋮          ⋮     ⋮     ⋮
Table-based Methods
                      Full Multiplication   "Double Table"/       Log&Exp
                      Table                 "Left-Right Table"    Table
Space Complexity      O(2^(2w))             O(2^(3w/2+1))         O(2^(w+1))
for GF(2^w)
Computation           1 table lookup        2 table lookups,      3 table lookups,
Complexity                                  2 AND, 1 XOR,         1 MOD, 1 ADD,
                                            and 1 SHIFT           and 2 branches
Memory space          64 KB                 8 KB                  512 Bytes
for GF(2^8)
Use log&exp table-based method in our GPU implementation.
GPU Implementation: Loop-based or Table-based?
The loop-based method reaches its maximum bandwidth even when the chunk size is small.
The bandwidth of the table-based method can still scale up as the chunk size grows larger.
The maximum bandwidth of the table-based method exceeds that of the loop-based method.
[1] Plank J S, Xu L. Optimizing Cauchy Reed-Solomon codes for fault-tolerant network storage applications[C]//NCA 2006.
[2] Shojania H, Li B, Wang X. Nuclei: GPU-accelerated many-core network coding[C]//INFOCOM 2009.
[3] Kalcher S, Lindenstruth V. Accelerating Galois Field arithmetic for Reed-Solomon erasure codes in storage applications[C]//Cluster Computing (CLUSTER) 2011.
[4] Chu X, Zhao K. Practical random linear network coding on GPUs[M]//GPU Solutions to Multi-scale Problems in Science and Engineering. Springer Berlin Heidelberg, 2013: 115-130.
Further Improvement of the Log&exp Table-based Method
baseline:
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    if (a == 0 || b == 0) {    /* log(0) is undefined */
        return 0;
    }
    /* NW = 256; the multiplicative group of GF(2^8) has order NW-1 = 255 */
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW-1);
    return gf256_exp_table[result];
}
Further Improvement of the Log&exp Table-based Method
Improvement Approach 1: replace the slow modular operation with more efficient operations.

if (a == 0 || b == 0) {
    return 0;
}
/* x % 255 folded as (x & 255) + (x >> 8), valid for x < 510 since
   256 ≡ 1 (mod 255); requires exp[255] = exp[0]. NW = 256, width = 8.
   Parentheses added: in C, '+' binds tighter than '&' and '>>'. */
result = ((gf256_log_table[a] + gf256_log_table[b]) & (NW-1))
       + ((gf256_log_table[a] + gf256_log_table[b]) >> width);
return gf256_exp_table[result];

Improvement Approach 2: remove the slow modular operation by augmenting one table (the exp table is doubled so that the sum of two logs is always a valid index).

if (a == 0 || b == 0) {
    return 0;
}
result = gf256_log_table[a] + gf256_log_table[b];
return gf256_exp_table[result];

Improvement Approach 3: further eliminate the conditional branches by augmenting both tables (the log of zero becomes a sentinel that always lands in a zero-filled region of the exp table).

result = gf256_log_table[a] + gf256_log_table[b];
return gf256_exp_table[result];
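For reference, a minimal sketch of one standard way to build such augmented tables (host-side; the layout, the sentinel value, and the generator polynomial 0x11D are assumptions, not necessarily the thesis's exact tables):

#include <stdint.h>
#define NW 256

/* exp is doubled so log[a]+log[b] (< 510) needs no MOD (Approach 2), and
   zero-padded so a sentinel log[0] absorbs zero operands (Approach 3). */
static int     gf256_log_table[NW];   /* int: must hold the 512 sentinel */
static uint8_t gf256_exp_table[1025];

void gf256_init_tables(void)
{
    uint8_t x = 1;
    for (int i = 0; i < NW - 1; i++) {
        gf256_exp_table[i]       = x;    /* exp[i] = g^i               */
        gf256_exp_table[i + 255] = x;    /* duplicate: removes the MOD */
        gf256_log_table[x] = i;          /* log[g^i] = i               */
        /* x *= g, with g = 2 over x^8+x^4+x^3+x^2+1 (0x11D, assumed)  */
        uint8_t carry = x & 0x80;
        x <<= 1;
        if (carry) x ^= 0x1D;
    }
    for (int i = 510; i <= 1024; i++)
        gf256_exp_table[i] = 0;          /* zero region for Approach 3 */
    gf256_log_table[0] = 512;            /* 512 + log[b] always lands in it */
}

With these tables, the two-line body of Approach 3 returns the correct product even when a or b is zero, with no branches at all.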
Further Improvement of the Log&exp Table-based Method
In a GPU implementation, where to store the tables and how to initialize them affects performance.
Further Improvement of the Log&exp Table-based Method
Appropriate GPU memory:
constant memory: off-chip memory whose accesses are usually cached in the constant cache.
shared memory: on-chip memory with the smallest access latency except for the register file.
Further Improvement of the Log&exp Table-based Method
What we have implemented:
Store the two tables in constant memory and initialize them at compile time.
Store the two tables in shared memory and initialize them serially at run time at the beginning of each kernel function.
Store the two tables in off-chip global memory and then load them into shared memory in parallel at the beginning of each kernel function.
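Minimal sketches of the first and third variants (kernel and variable names are illustrative assumptions; table contents are elided):

/* Variant 1 (sketch): tables in constant memory, initialized at compile
   time. The 256 precomputed byte values are elided here. */
__constant__ uint8_t c_log[256] = { /* ... precomputed ... */ };
__constant__ uint8_t c_exp[256] = { /* ... precomputed ... */ };

/* Variant 3 (sketch): tables live in off-chip global memory; every block
   copies them into shared memory cooperatively at kernel start. */
__global__ void rs_encode_kernel(const uint8_t *g_log, const uint8_t *g_exp
                                 /* , matrix arguments ... */)
{
    __shared__ uint8_t s_log[256];
    __shared__ uint8_t s_exp[256];

    int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;
    for (int i = tid; i < 256; i += nthreads) {   /* parallel, coalesced copy */
        s_log[i] = g_log[i];
        s_exp[i] = g_exp[i];
    }
    __syncthreads();   /* tables must be complete before any lookup */

    /* ... GF(2^8) matrix multiplication using s_log / s_exp ... */
}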
Further Improvement of the Log&exp Table-based Method
encoding a 1 GB file with k = 4 and n = 6.
The elimination of conditional branches improves the
performance.
Accessing the tables in the constant memory is more time-
consuming.
Accelerating Matrix
Multiplication
Naive Implementation
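The naive-implementation figures are not reproduced here; as a minimal sketch (names assumed, shapes taken from the matrix dimensions given later), each thread computes one element of C and reads every operand directly from global memory:

__device__ uint8_t gf256_mul(uint8_t a, uint8_t b);   /* as defined above */

__global__ void gf_matmul_naive(const uint8_t *A,  /* (n-k) x k coding matrix */
                                const uint8_t *B,  /* k x CS native chunks    */
                                uint8_t *C,        /* (n-k) x CS code chunks  */
                                int rows, int k, size_t cs)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    size_t col = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows || col >= cs) return;

    uint8_t acc = 0;
    for (int i = 0; i < k; i++)
        /* every operand here is an off-chip global-memory access */
        acc ^= gf256_mul(A[row * k + i], B[(size_t)i * cs + col]);
    C[(size_t)row * cs + col] = acc;
}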
Problems of Naive Implementation
a lot of global memory (off-chip) transactions: each takes 400
- 800 clock cycles.
memory access in column-major order → poor locality and a high cache miss rate.
Square-Tiling Algorithm
Problems of Square-Tiling Algorithm
not suitable for the general input cases of Reed-Solomon codes: a small matrix multiplied by a huge matrix
encoding: A is (n − k) × k, B is k × CS.
decoding: A is k × k, B is k × CS.
where
n: total chunk number (1-100)
k: native chunk number (1-100)
CS: chunk size (more than 10,000,000)
Generalized Tiling Algorithm
How to Determine the Parameters of Tiles
tileDepth
tileWidthRow
tileWidthCol
How to Determine the Parameters of Tiles
Set tileDepth to k (the width of matrix A) → this removes the loop over depth tiles when accessing matrices A and B.
How to Determine the Parameters of Tiles
tileWidthRow × tileWidthCol is equal to the CUDA block size (the number of threads in a CUDA block) → find the best CUDA block size by tuning occupancy.
Finally, we need to further determine tileWidthRow and tileWidthCol.
How to Determine the Parameters of Tiles
Define the aspect ratio AR of the tile in the result matrix as:
AR = tileWidthRow / tileWidthCol
three strategies:
Min AR: minimize tileWidthRow, maximize tileWidthCol
Max AR: maximize tileWidthRow, minimize tileWidthCol
AR = 1: tileWidthRow = tileWidthCol
How to Determine the Parameters of Tiles
use the following testcases for encoding:

TESTCASE   k   n    CHUNK SIZE (MB)   FILE SIZE   CODE CHUNK NUMBER
1          2   200    1               small       large
2          4   150    1               small       large
3          4   200    1               small       large
4          2   200   10               large       large
5          4   150   10               large       large
6          4   200   10               large       large
7          4     6  256               large       small
8          8    16  128               large       small
How to Determine the Parameters of Tiles
The first six testcases, especially the 4th-6th ones, represent the worst cases for the "Max AR" strategy.
The last two testcases represent the worst cases for the "AR = 1" strategy.
Overall, the "Min AR" strategy is the best.
How to Determine the Parameters of Tiles
Generally, the "Min AR" strategy has the fewest conditional branches, while "Max AR" has the most. The overhead of branches is significant when encoding a large file into a large number of code chunks.
How to Determine the Parameters of Tiles
The "AR = 1" strategy shows significant performance degradation when the number of code chunks is smaller than tileWidthRow (wasted cache space, extra boundary checking).
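Pulling the three parameters together, a minimal sketch of the generalized tiling kernel with a "Min AR"-shaped tile (all names, the tile sizes, and the shared-memory bound on k are assumptions):

#define TILE_WIDTH_ROW 4     /* few rows ...                */
#define TILE_WIDTH_COL 64    /* ... many columns ("Min AR") */
#define MAX_K 128            /* k <= 100 in our input cases */

__global__ void gf_matmul_tiled(const uint8_t *A, const uint8_t *B, uint8_t *C,
                                int rows, int k, size_t cs)
{
    /* tileDepth = k: the block's rows of A fit in shared memory in one
       piece, so there is no loop over depth tiles of A and B. */
    __shared__ uint8_t sA[TILE_WIDTH_ROW][MAX_K];

    int    row = blockIdx.y * TILE_WIDTH_ROW + threadIdx.y;
    size_t col = (size_t)blockIdx.x * TILE_WIDTH_COL + threadIdx.x;

    if (row < rows)
        for (int i = threadIdx.x; i < k; i += TILE_WIDTH_COL)
            sA[threadIdx.y][i] = A[row * k + i];
    __syncthreads();

    if (row < rows && col < cs) {
        uint8_t acc = 0;
        for (int i = 0; i < k; i++)   /* single pass over the full depth k */
            acc ^= gf256_mul(sA[threadIdx.y][i], B[(size_t)i * cs + col]);
        C[(size_t)row * cs + col] = acc;
    }
}

Launched with blockDim = (TILE_WIDTH_COL, TILE_WIDTH_ROW), so the CUDA block size tileWidthRow × tileWidthCol is 256 threads, the value that occupancy tuning would select in this sketch.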
Further Improvement of Tiling Algorithm
When each chunk can be word-aligned, we can further improve
the tiling algorithm.
Observation 1: Byte-length Matrix Multiplication
Multiplication: table look-ups → memory accesses
Addition: 8-bit XORs → ALU operations
Each ALU operation:
operand1 (8-bit) XOR operand2 (8-bit) = result (8-bit)
GPU Execution Units: 32-bit
Improvement Technique 1: Byte-by-word Matrix Multiplication
Observation 2: Global Memory Transactions
a lot of 8-bit global memory transactions!
Improvement Technique 2: Packing Memory Transactions
pack 8-bit global memory transactions into 32-bit transactions
→ reduce global memory load and store.
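A minimal sketch combining both techniques (helper names are assumptions; chunks are assumed word-aligned, with the chunk size divisible by 4): four bytes of B and C travel as one 32-bit word, the XOR accumulation runs on whole words in the 32-bit ALUs, and only the GF(2^8) multiplications remain per-byte table lookups:

__device__ uint32_t gf256_mul_word(uint8_t a, uint32_t w)
{
    /* multiply each packed byte of w by the scalar a; lanes are independent */
    uint32_t r;
    r  = (uint32_t)gf256_mul(a, (uint8_t) w);
    r |= (uint32_t)gf256_mul(a, (uint8_t)(w >> 8))  << 8;
    r |= (uint32_t)gf256_mul(a, (uint8_t)(w >> 16)) << 16;
    r |= (uint32_t)gf256_mul(a, (uint8_t)(w >> 24)) << 24;
    return r;
}

/* word-packed inner loop for one output row (a_row) and one word column */
__device__ void gf_row_times_chunks(const uint8_t *a_row,   /* one row of A */
                                    const uint32_t *B,      /* k x words    */
                                    uint32_t *C, int k, size_t words, size_t col)
{
    uint32_t acc = 0;
    for (int i = 0; i < k; i++)
        acc ^= gf256_mul_word(a_row[i], B[i * words + col]); /* one 32-bit load  */
    C[col] = acc;                                            /* one 32-bit store */
}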
Further Improvement of Tiling Algorithm
e.g. k = 4, n = 200, chunk size = 200 MB
                                   Original       Word-alignment   Improvement
Effective bandwidth (MB/s)         1119.419693    1591.051924      +42.13%
ALU Operations                     12331000000    3116531712       −74.73%
Global Memory Load Transactions    516963328      131611648        −74.54%
Global Memory Store Transactions   64225280       16056320         −75%
Number of Branches                 6685685440     2703981504       −59.56%
Reducing Data Transfer Overhead
Using CUDA Streams
CUDA streams are used to further overlap data transfers with computation.
Sequential vs. Concurrent copy and execute
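A minimal sketch of the concurrent pattern (the stream count, kernel name, launch configuration, and the same-size in/out mapping are simplifying assumptions; host buffers must be pinned, e.g. via cudaHostAlloc, for the asynchronous copies to overlap): the input is split into slices, and each slice's host-to-device copy, kernel, and device-to-host copy are issued on their own stream:

#define NUM_STREAMS 4

__global__ void encode_kernel(const uint8_t *in, uint8_t *out, size_t n); /* assumed */

void encode_with_streams(const uint8_t *h_in, uint8_t *h_out, size_t bytes)
{
    cudaStream_t streams[NUM_STREAMS];
    size_t slice = bytes / NUM_STREAMS;              /* assume divisible */
    uint8_t *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamCreate(&streams[s]);

    dim3 block(256);
    dim3 grid((unsigned)((slice + block.x - 1) / block.x));

    for (int s = 0; s < NUM_STREAMS; s++) {
        size_t off = (size_t)s * slice;
        cudaMemcpyAsync(d_in + off, h_in + off, slice,
                        cudaMemcpyHostToDevice, streams[s]);
        encode_kernel<<<grid, block, 0, streams[s]>>>(d_in + off, d_out + off, slice);
        cudaMemcpyAsync(h_out + off, d_out + off, slice,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamSynchronize(streams[s]);
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_in); cudaFree(d_out);
}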
Using CUDA Streams
encoding under k = 4, n = 6 settings; the input file size is scaled from 1000 MB to 1800 MB.
Using CUDA streaming can improve the performance by
more than 29%.
As the file size increases, the data transfer overhead
increases, and the improvement of CUDA streaming is more
significant.
Using CUDA Streams
encoding a 2000 MB file under k = 4, n = 6 settings.
The kernel execution time increases with the number of CUDA streams. → The overhead of kernel execution will exceed the time saved on data transfers at some point.
Experiment
Experiment Setup
CentOS-6 with 2.6.32 Linux kernel.
Intel Xeon Processor E5-2670 v2 x 2
10 cores
2.5 GHz
NVIDIA Tesla K20X GPU x 2
2688 CUDA cores
peak performance: 1.31 Tflops (double precision floating
point calculation) and 3.95 Tflops (single precision floating
point)
maximum size of GPU GDDR5 memory: 6 GB
theoretical memory bandwidth 243 GB/s
two copy engines → supports concurrent data copy and
kernel execution
maximum bidirectional bandwidth of the PCI-Express bus: 8
GB/s
Experiment Setup
Input files are randomly generated.
Most of our experimental results reflect the average of 100
runs.
Due to the similarity between the performance results of encoding and decoding in most experiments, the decoding results are omitted.
Overall Performance Evaluation
We evaluate the overall performance by encoding a 1600 MB file with k = 4, n = 6.
Overall Performance Evaluation
Step-by-step Improvement

                                      Speedup                     Cumulative Speedup
Step                                  Kernel     Non-overlapped   Kernel     Non-overlapped
                                      Execution  Data Transfer    Execution  Data Transfer
optimize tiling algorithm             1.469 x    0                1.469 x    0
optimize log&exp table-based method   1.502 x    0                2.206 x    0
use CUDA streams                      0.862 x    18.452 x         1.902 x    18.452 x
parallelize across 2 GPUs             2.030 x    2.857 x          3.861 x    52.718 x

Total Cumulative Speedup: 7.145 x
Overall Performance Evaluation
GPU vs. CPU
best CPU implementation (Jerasure library, compiled by
clang with the -O3 compiler optimization flag): 4309.08 ms.
optimized GPU implementation: 292.977 ms (14.71× speedup).
Conclusion
We have studied several techniques to improve the performance of Reed-Solomon codes according to their coding mechanism, and identified the best choices on the basis of the GPU architecture.
We have illustrated methods to reduce the data transfer overhead introduced by the GPU implementation.
We present an optimized GPU implementation of Reed-Solomon codes, which achieves a 14.71× speedup over the current best CPU implementation.
Future Work
Better cooperation between CPU and GPU.
Heuristic strategies for deciding the best number of CUDA streams.
THE END
Q & A
Technical details:
Repository released under the GPL v3 license:
http://galoisplusplus.gitcafe.com/blog/2014/06/23/optimizing-
special-matrix-multiplication-in-cuda/
https://github.com/yszheda/GPU-RSCode