9. Accelerated computing
• many-core GPGPU? NO!
• CPU vs. GPU
• "GPGPU was dead!!" "GPU will be dead soon!!"
10. Why GPU -> GPGPU is BAD
• Higher latency: host <-> GPU over PCI-Express
• Internal architecture is a black box
  • Only the GPU maker knows it
• Higher cost of branching
• Debugger?
• Programs run only on a specific GPU maker's GPU
  • Not portable
11. Why CPU -> Accelerated computing is GOOD
• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & versatile
17. No unified way to describe SIMD ops
• SSE: _mm_add_ps()
• AltiVec: vec_add
• SPE: spu_add
18. CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and coming new CPU designs(?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and not productive: a waste of your time.
19. MUDA compiler flow
• Input: MUDA — a portable, CPU-independent description
• The MUDA compiler emits CPU- or architecture-dependent code:
  SSE2 C code, SSE4 C code, VMX C code, or LLVM IR
20. Status
• SSE2 backend: 75%
• SSE4 backend: 0%
• VMX backend: 20%
• LLVM IR backend: 30%
• SIMD math functions for MUDA: 5%
• Automatic optimizer: TODO
(highlighted = what I'm currently working on)
21. Future direction
• Cache-miss analysis and memory access optimization
  • Valgrind, Cache Miss Equations (CME)
• Automatic optimization
  • As FFTW, ATLAS, and Spiral do
• Automatic error measurement for floating-point computation
  • Interval arithmetic, affine arithmetic, Gappa
22. Performance gap
[Bar chart, higher is better: Scalar:SIMD = 1:4, cache miss:cache hit = 1:100]
23. Performance gap
• Optimizing memory access is much more important than SIMDization
[Same chart: Scalar:SIMD = 1:4, cache miss:cache hit = 1:100]