2. Who am I?
• Kazuaki Ishizaki
• Research staff member at IBM Research - Tokyo
  - http://ibm.co/kiszk
• Research interests
  - compiler optimizations, language runtime, and parallel processing
• Worked on the Java virtual machine and just-in-time compiler for over 20 years
  - From JDK 1.0 to Java SE 8
• Twitter: @kiszk
• Slideshare: http://www.slideshare.net/ishizaki
• GitHub: https://github.com/kiszk
Exploiting GPUs in Spark - Kazuaki Ishizaki
3. Agenda
• Motivation & Goal
• Introduction of GPUs
• Design & New Components
  - Binary columnar
  - GPU enabler
• Current Implementation
• Performance Experiment
  - Achieved 3.15x performance of a naïve logistic regression by using a GPU
• Future Direction in Spark 2.0 and beyond
  - with Dataset (introduced in Spark 1.6)
• Conclusion
4. Want to Accelerate Computation-heavy Applications
• Motivation
  - Want to shorten the execution time of a long-running Spark application
    • Computation-heavy
    • Shuffle-heavy
    • I/O-heavy
• Goal
  - Accelerate a computation-heavy Spark application
• According to Reynold's talk (p. 21), the CPU will become the bottleneck in Spark
5. Accelerate a Spark Application by GPUs
• Approach
  - Accelerate a Spark application by using GPUs effectively and transparently
    • Exploit high performance of GPUs
    • Do not ask users to change their Spark programs
• New components
  - Binary columnar
  - GPU enabler
7. GPU Programming Model
• Five steps
  1. Allocate GPU device memory
  2. Copy data from CPU main memory to GPU device memory
  3. Launch a GPU kernel to be executed in parallel on cores
  4. Copy data back from GPU device memory to CPU main memory
  5. Free GPU device memory
• Usually, a programmer has to write these steps in CUDA or OpenCL
[Figure: a CPU (a dozen cores per socket, main memory up to 1 TB per socket) and a GPU (thousands of cores, device memory up to 12 GB), with data copied between them over PCIe.]
8. How We Can Run a Program Faster on a GPU
• Assign a lot of parallel computations to cores
• Make memory accesses coalesced
  - An example
  - Column-oriented layout achieves better performance
    • This paper reports about 3x performance improvement of GPU kernel execution of kmeans over row-oriented layout
[Figure: memory accesses to GPU device memory for Pt(x: Int, y: Int), assuming 4 consecutive data elements can be coalesced by GPU hardware. Column-oriented layout (x1 x2 x3 x4 y1 y2 y3 y4): load four Pt.x, then load four Pt.y, 2 memory accesses. Row-oriented layout (x1 y1 x2 y2 x3 y3 x4 y4): load Pt.x, load Pt.y, load Pt.x, load Pt.y, 4 memory accesses.]
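The 2 versus 4 access count for the two layouts can be reproduced with a toy model. The following is a minimal sketch, not code from the talk: it assumes four points, a segment of 4 consecutive elements that the hardware can fetch in one access, and the hypothetical names `CoalescingSketch`, `rowIndex`, and `colIndex`. Real GPUs coalesce by byte-aligned segments, so this only illustrates the counting argument.

```scala
// Toy model of coalescing: one access fetches one segment of 4 consecutive elements.
// All names here are illustrative assumptions, not part of any Spark or CUDA API.
object CoalescingSketch {
  val SegmentSize = 4 // assumption taken from the slide
  val NumPoints = 4   // four Pt(x, y) values, one per GPU core

  // Element index of a field (0 = x, 1 = y) of point i in each layout.
  def rowIndex(i: Int, field: Int): Int = i * 2 + field          // x1 y1 x2 y2 ...
  def colIndex(i: Int, field: Int): Int = field * NumPoints + i  // x1 x2 ... y1 y2 ...

  // Accesses needed when all cores load the same field at the same time:
  // one access per distinct segment touched.
  def accesses(index: (Int, Int) => Int, field: Int): Int =
    (0 until NumPoints).map(i => index(i, field) / SegmentSize).distinct.size

  // Total accesses for loading Pt.x and then Pt.y.
  def total(index: (Int, Int) => Int): Int = accesses(index, 0) + accesses(index, 1)

  def main(args: Array[String]): Unit = {
    println(s"row-oriented:    ${total(rowIndex)} accesses")
    println(s"column-oriented: ${total(colIndex)} accesses")
  }
}
```

In the row-oriented case each field load straddles both segments, which is why the count doubles.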
10. Design of GPU Exploitation
• Efficient
  - Reduce data copy overhead between CPU and GPU
  - Make memory accesses efficient on GPU
• Transparent
  - Map parallelism in a program into GPU native code
• GPU can exploit parallelism both among blocks in RDD and within a block of RDD
User's Spark Program (Scala)
case class Pt(x: Int, y: Int)
rdd1 = sc.parallelize(Array(
  Pt(1, 4), Pt(2, 5),
  Pt(3, 6), Pt(4, 7),
  Pt(5, 8), Pt(6, 9)), 3)
rdd2 = rdd1.map(p => Pt(p.x*2, p.y-1))
cnt = rdd2.map(p => p.x).reduce(
  (x1, x2) => x1 + x2) // sum of x; reduce on Pt values must return a Pt, so project x first
[Figure: rdd1 is kept off-heap on the CPU in binary columnar form (x = 1 2 3 4 5 6, y = 4 5 6 7 8 9), split into blocks. The GPU enabler translates the program to GPU native code and transfers the data to the GPU, where a GPU kernel computes x*2 and y-1 to produce rdd2 (x = 2 4 6 8 10 12, y = 3 4 5 6 7 8).]
11. What Does Binary Columnar Do?
• Keep data in binary representation (not Java object representation)
• Keep data in column-oriented layout
• Keep data on off-heap or GPU device memory
Example
case class Pt(x: Int, y: Int)
Array(Pt(1, 4),
      Pt(2, 5))
[Figure: the example data off-heap in two layouts. Columnar (column-oriented): 1 2 4 5. Row-oriented: 1 4 2 5.]
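The difference between the two off-heap layouts can be made concrete with plain `java.nio` buffers. This is only a sketch under assumptions: the talk does not show how binary columnar is implemented, and `BinaryColumnarSketch`, `toRowOriented`, `toColumnar`, and `ints` are hypothetical helper names. `ByteBuffer.allocateDirect` is used here as one standard way to get memory outside the Java heap.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Sketch only: stores the example points off-heap in both layouts.
object BinaryColumnarSketch {
  case class Pt(x: Int, y: Int)

  // Off-heap (direct) buffer with room for two 4-byte Int fields per point.
  private def offHeap(n: Int): ByteBuffer =
    ByteBuffer.allocateDirect(n * 2 * 4).order(ByteOrder.nativeOrder())

  // Row-oriented: x1 y1 x2 y2 ...
  def toRowOriented(pts: Array[Pt]): ByteBuffer = {
    val buf = offHeap(pts.length)
    pts.foreach { p => buf.putInt(p.x); buf.putInt(p.y) }
    buf.flip()
    buf
  }

  // Column-oriented: x1 x2 ... y1 y2 ...
  def toColumnar(pts: Array[Pt]): ByteBuffer = {
    val buf = offHeap(pts.length)
    pts.foreach(p => buf.putInt(p.x))
    pts.foreach(p => buf.putInt(p.y))
    buf.flip()
    buf
  }

  // Read the readable part of the buffer as Ints without changing its position.
  def ints(buf: ByteBuffer): Seq[Int] = {
    val view = buf.asIntBuffer()
    Seq.tabulate(view.remaining())(i => view.get(i))
  }

  def main(args: Array[String]): Unit = {
    val pts = Array(Pt(1, 4), Pt(2, 5))
    println(ints(toRowOriented(pts)).mkString(" ")) // prints "1 4 2 5"
    println(ints(toColumnar(pts)).mkString(" "))    // prints "1 2 4 5"
  }
}
```

Neither buffer contains object headers or references, which is what "binary representation" refers to in the first bullet.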
12. Current RDD as Java objects on Java heap
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4),
                           Pt(2, 5)))
[Figure: the current RDD on the Java heap. Two Pt objects, each with an object header for the Java virtual machine followed by its fields (1 4 and 2 5). Current RDD: row-oriented layout, Java object representation, on Java heap.]
14. Long Path from Current RDD to GPU
• Three steps to send data from RDD to GPU
  1. Java objects to column-oriented binary representation on Java heap
     • From a Java object to binary representation
     • From a row-oriented format to columnar
  2. Binary representation on Java heap to binary columnar on off-heap
     • Garbage collection may move objects on the Java heap during GPU-related operations
  3. Off-heap to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4),Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: the data path. Pt objects on the Java heap (1 4 2 5), step 1 to a ByteBuffer on the Java heap (1 2 4 5), step 2 to a ByteBuffer off-heap (1 2 4 5), step 3 to GPU device memory (1 2 4 5).]
This thread on the dev mailing list also discusses the overhead of copying data between RDD and GPU
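Steps 1 and 2 of the long path can be sketched with standard `java.nio` calls. This is an illustrative assumption about how such a conversion could look, not the talk's implementation; `LongPathSketch`, `toHeapColumnar`, and `toOffHeap` are hypothetical names. Step 3 requires a CUDA binding and is deliberately left out.

```scala
import java.nio.ByteBuffer

// Sketch of steps 1 and 2 only; every name here is hypothetical.
object LongPathSketch {
  case class Pt(x: Int, y: Int)

  // Step 1: Java objects -> column-oriented binary representation on the Java heap.
  def toHeapColumnar(pts: Array[Pt]): ByteBuffer = {
    val heap = ByteBuffer.allocate(pts.length * 2 * 4) // backed by a byte array on the heap
    pts.foreach(p => heap.putInt(p.x))
    pts.foreach(p => heap.putInt(p.y))
    heap.flip()
    heap
  }

  // Step 2: heap buffer -> off-heap (direct) buffer, which the garbage collector does not move.
  def toOffHeap(heap: ByteBuffer): ByteBuffer = {
    val direct = ByteBuffer.allocateDirect(heap.remaining())
    direct.put(heap.duplicate()) // duplicate() leaves the caller's position untouched
    direct.flip()
    direct
  }

  // Step 3 (off-heap -> GPU device memory) is not shown.

  def main(args: Array[String]): Unit = {
    val offHeap = toOffHeap(toHeapColumnar(Array(Pt(1, 4), Pt(2, 5))))
    val view = offHeap.asIntBuffer()
    println((0 until view.remaining()).map(i => view.get(i)).mkString(" "))
    println(s"direct = ${offHeap.isDirect()}")
  }
}
```

Each step is a full pass over the data, which is the overhead the slide refers to.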
15. Short Path from Binary Columnar RDD to GPU
• RDD with binary columnar can be simply copied to GPU device memory
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4),Pt(2, 5)))
rdd.map(…).reduce(…) // execute on GPU
[Figure: with a binary columnar RDD, the data (1 2 4 5) is already off-heap and is copied directly to GPU device memory. The two conversion steps on the Java heap are eliminated.]
16. Can Execute map() in Parallel Using Binary Columnar
• Adjacent elements in binary columnar RDD can be accessed in parallel
• The same type of operation (* or -) can be executed in parallel on data loaded in parallel
case class Pt(x: Int, y: Int)
rdd = sc.parallelize(Array(Pt(1, 4),
                           Pt(2, 5)))
rdd1 = rdd.map(p => Pt(p.x*2, p.y-1))
[Figure: memory access order. Current RDD on the Java heap (1 4 2 5): order 1 2 3 4. Binary columnar RDD off-heap (1 2 4 5): order 1 1 2 2.]
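The map() above turns into one uniform operation per column once the data is columnar. The sketch below is an assumption for illustration, using plain Scala arrays on the CPU instead of a GPU; `PtColumns` and `mapBlock` are hypothetical names.

```scala
// Sketch: the per-point map expressed as two independent column-wise operations.
object ColumnarMapSketch {
  // Hypothetical columnar block: one array per field instead of one object per point.
  final case class PtColumns(xs: Array[Int], ys: Array[Int])

  // Equivalent of rdd.map(p => Pt(p.x * 2, p.y - 1)) for one block.
  // Every element of a column gets the same operation and does not depend on its
  // neighbours, so a GPU could assign one thread per element.
  def mapBlock(in: PtColumns): PtColumns =
    PtColumns(in.xs.map(_ * 2), in.ys.map(_ - 1))

  def main(args: Array[String]): Unit = {
    val out = mapBlock(PtColumns(Array(1, 2), Array(4, 5)))
    println(out.xs.mkString(" ")) // prints "2 4"
    println(out.ys.mkString(" ")) // prints "3 4"
  }
}
```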
17. Advantages of Binary Columnar
• Can exploit high performance of GPUs
• Can reduce overhead of data copy between CPU and GPU
• Has a smaller memory footprint
• Can directly compute on data stored in columnar format from Apache Parquet
• Can exploit SIMD instructions on CPU
18. What Does GPU Enabler Do?
• Copy data in binary columnar RDD between CPU main memory and GPU device memory
• Launch GPU kernels
• Cache GPU native code for kernels
• Generate GPU native code from transformations and actions in a program
  - We already productized the IBM Java just-in-time compiler that generates GPU native code from a lambda expression in Java 8
20. Software Stack in Current Spark 2.0-SNAPSHOT
• RDD keeps data on Java heap
[Figure: software stack. User's Spark program, RDD API, RDD data on the Java heap.]
21. Software Stack of GPU Exploitation
• Current RDD and binary columnar RDD co-exist
[Figure: software stack. User's Spark program, RDD API, RDD data on the Java heap, Columnar data off-heap, GPU enabler, Columnar data in GPU device memory.]
22. Current Implementation of Binary Columnar
• Work with RDD
• Convert from current RDD to binary columnar RDD and vice versa
  - Our current implementation eliminates conversion overhead between CPU and GPU in a task
23. Current Implementation of GPU Enabler
• Execute user-provided GPU kernels from map()/reduce() functions
  - GPU memory management and data copy are handled automatically
• Generate GPU native code for simple map()/reduce() methods
  - "spark.gpu.codegen=true" in spark-defaults.conf
rdd1 = sc.parallelize(1 to n, 2).convert(ColumnFormat) // rdd1 uses binary columnar RDD
sum  = rdd1.map(i => i * 2)
           .reduce((x, y) => (x + y))
// CUDA
__global__ void sample_map(int *inX, int *inY, int *outX, int *outY, long size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x; // global index of this thread
  if (size <= ix) return;                          // threads beyond the data do nothing
  outX[ix] = inX[ix] * 2;
  outY[ix] = inY[ix] - 1;
}
// Spark
mapFunction = new CUDAFunction("sample_map", // CUDA function name
  Array("this.x", "this.y"), // input object has two fields
  Array("this.x", "this.y"), // output object has two fields
  this.getClass.getResource("/sample.ptx")) // ptx is generated by the CUDA compiler
rdd1 = sc.parallelize(…).convert(ColumnFormat) // rdd1 uses binary columnar RDD
rdd2 = rdd1.mapExtFunc(p => Pt(p.x*2, p.y-1), mapFunction)
24. How to Use the GPU Exploitation Version
• Easy to install with a one-liner and to run with a one-liner
  - on x86_64, mac, and ppc64le with CUDA 7.0 or later with any JVM such as IBM JDK or OpenJDK
• A run script for AWS EC2 is available, which supports spot instances
$ wget https://s3.amazonaws.com/spark-gpu-public/spark-gpu-latest-bin-hadoop2.4.tgz &&
  tar xf spark-gpu-latest-bin-hadoop2.4.tgz && cd spark-gpu
$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 MASTER='local[2]' ./bin/run-example SparkGPULR 8 3200 32 5
…
numSlices=8, N=3200, D=32, ITERATIONS=5
On iteration 1
On iteration 2
On iteration 3
On iteration 4
On iteration 5
Elapsed time: 431 ms
$
Available at http://kiszk.github.io/spark-gpu/
• 3 contributors
• Private communications with other developers
25. Achieved 3.15x Performance Improvement by GPU
• Ran a naïve implementation of logistic regression
• Achieved 3.15x performance improvement of logistic regression over the run without a GPU on a 16-core Ivy Bridge box with an NVIDIA K40 GPU card
  - We have room to improve performance
Details are available at https://github.com/kiszk/spark-gpu/wiki/Benchmark
Program parameters
  N=1,000,000 (# of points), D=400 (# of features), ITERATIONS=5
  Slices=128 (without GPU), 16 (with GPU)
  MASTER=local[8] (without and with GPU)
Hardware and software
  Machine: nx360 M4, 2 sockets 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, one NVIDIA K40m card
  OS: RedHat 6.6, CUDA: 7.0
27. Comparisons among DataFrame, Dataset, and RDD
• DataFrame (with relational operations) and Dataset (with lambda functions) use Catalyst and row-oriented data representation on off-heap
case class Pt(x: Int, y: Int)
d = Array(Pt(1, 4), Pt(2, 5))
Frontend API
  DataFrame (v1.3-):  df = d.toDF(…); df.filter("x>1").count()
  Dataset (v1.6-):    ds = d.toDS(); ds.filter(p => p.x>1).count()
  RDD (v0.5-):        rdd = sc.parallelize(d); rdd.filter(p => p.x>1).count()
Backend computation
  DataFrame, Dataset: Catalyst, generated Java bytecode
  RDD:                Java bytecode in Spark program and runtime
Data
  DataFrame, Dataset: row-oriented, off-heap
  RDD:                row-oriented, Java heap
28. Design Concepts of Dataset and GPU Exploitation
• Keep data in binary representation (binary columnar also does)
• Keep data on off-heap (binary columnar also does)
• Take advantage of the Catalyst optimizer (GPU enabler could use)
Comparison of data representations
  Dataset: row-oriented, off-heap
    case class Pt(x: Int, y: Int)
    ds = Seq(Pt(1, 4),Pt(2, 5)).toDS()
  Binary columnar RDD: columnar, off-heap
    case class Pt(x: Int, y: Int)
    sc.parallelize(Array(Pt(1, 4),Pt(2, 5)))
How can we apply binary columnar and GPU enabler to Dataset?
29. Components in GPU Exploitation
• Binary columnar
  - Columnar
    • In-memory storage keeps data in binary representation on off-heap or GPU memory
    • BinaryEncoder converts a data representation between a Java object and binary format
    • ColumnEncoder puts a set of data elements into column-oriented layout
  - Memory Manager
    • Manage off-heap and GPU memory
    • Columnar cache manages persistency of in-memory storage
• GPU enabler
  - GPU kernel launcher
    • Launch kernels with data copy
    • Cache GPU binary for kernels
  - GPU code generator
    • Generate GPU code from Spark program
[Figure: component diagram. GPU enabler: GPU kernel launcher, GPU code generator, pre-compiled libraries for GPU. Columnar: Column Encoder, Binary Encoder, in-memory storage. Memory Manager: Columnar cache, off-heap memory, GPU memory.]
30. Software Stack in Spark 2.0 and Beyond
• Dataset will become a primary data structure for computation
• Dataset keeps data in UnsafeRow on off-heap
[Figure: software stack. Components shown: User's Spark program, DataFrame, Dataset, Catalyst, logical optimizer, CPU code generator, Tungsten, UnsafeRow data off-heap.]
31. Columnar with Dataset
• Keep data in UnsafeRow or Columnar on off-heap, or Columnar on GPU device memory
[Figure: software stack. Components shown: User's Spark program, DataFrame, Dataset, Catalyst, logical optimizer, memory manager, CPU code generator, Tungsten, UnsafeRow and Columnar data off-heap, Columnar data in GPU device memory.]
32. Two Approaches for Binary Columnar with Dataset
• Binary Columnar as a first-class citizen
  - Better end-to-end performance in a job without conversion
  - Needs more changes to the existing source code
• Binary Columnar as a cache in a task
  - Incurs overhead of representation conversions between two tasks at shuffle
  - Needs fewer changes to the existing source code
ds1 = d.toDS()
ds2 = ds1.map(…)
ds3 = ds2.map(…)
ds11 = ds3.groupBy(…)
ds12 = ds11.map(…)
[Figure: the pipeline is split into task1 and task2 by a shuffle; the figure compares the two approaches (as a first-class citizen, as a cache) across the two tasks.]
33. GPU Support in Tungsten
• According to Reynold's talk (p. 25), the Tungsten backend has a plan to enable GPU exploitation
34. GPU Enabler in Catalyst
• Place GPU kernel launcher and GPU code generator into Catalyst
[Figure: software stack. Components shown: User's Spark program, DataFrame, Dataset, Catalyst, logical optimizer, memory manager, CPU code generator, GPU code generator, GPU kernel launcher, Tungsten, UnsafeRow and Columnar data off-heap, Columnar data in GPU device memory.]
35. Future Direction
• Refactor to make the current implementation decomposable
  - Some components exist in one Scala file
• Make pull requests for each component
  - to support columnar Dataset
  - to exploit GPUs
[Figure: roadmap for pull requests. Items shown: binary encoder, column encoder, in-memory storage, memory manager, cache manager, as a cache in task, as a first-class citizen, multiple backend support, CPU code generator for Columnar, GPU kernel launcher, GPU code generator, Catalyst, off-heap memory, GPU memory.]
36. Takeaway
• Accelerate a Spark application by using GPUs effectively and transparently
• Devised two new components
  - Binary columnar to alleviate overhead for GPU exploitation
  - GPU enabler to manage GPU kernel execution from a Spark program
    • Call pre-compiled libraries for GPU
    • Generate GPU native code at runtime
• Available at http://kiszk.github.io/spark-gpu/

Component        | Initial design (Spark 1.3-1.5)               | Current status (Spark 2.0-Snapshot)          | Future (Spark 2.x)
Binary columnar  | with RDD                                     | with RDD                                     | with Dataset
GPU enabler      | launch GPU kernels, generate GPU native code | launch GPU kernels, generate GPU native code | in Catalyst

We appreciate any feedback and contributions