This document contains slides from a PLDI 2017 tutorial on vectorization with LMS using SIMD intrinsics. It discusses Single Instruction Multiple Data (SIMD) parallelism and how to implement SIMD intrinsics in LMS. Challenges include Scala's limits on large classes, deriving read/write effects automatically, type mappings for unsigned values and pointers, and supporting arrays rather than general containers in the DSL. The slides give scalar and SIMD-vectorized addition examples and survey SIMD instruction sets such as AVX, noting the large number of intrinsics to support. They propose generating the intrinsics automatically from Intel's metadata rather than porting each one to LMS by hand.
10. Challenge #1
Scala chokes on big classes: the JVM imposes a ~64 kB bytecode limit per method
• Split the implementation into multiple classes
• Make one trait inherit all the split classes
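The split-and-recombine layout above can be sketched as follows (the trait names are illustrative, not the tutorial's actual ones):

```scala
// Sketch: the generated intrinsic definitions are spread over several
// traits so that no single class body blows past the JVM's ~64 kB
// per-method bytecode limit (a class initializer counts as one method).
trait AVX_Arith { /* _mm256_add_pd, _mm256_sub_pd, ... */ }
trait AVX_Load  { /* _mm256_loadu_pd, _mm256_load_pd, ... */ }
trait AVX_Store { /* _mm256_storeu_pd, ... */ }

// One trait inherits all the pieces, so DSL users still see a single API.
trait AVX extends AVX_Arith with AVX_Load with AVX_Store
```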
11. Challenge #2
LMS tracks read / write effects
• Produce the effects automatically from the category data in the Intel Intrinsics Guide
<intrinsic tech='AVX' rettype='__m256d' name='_mm256_loadu_pd'>
<type>Floating Point</type>
<CPUID>AVX</CPUID>
<category>Load</category>
<parameter varname='mem_addr' type='double const *' />
<description>
Load 256-bits (composed of 4 packed
double-precision (64-bit) floating-point elements)
from memory into "dst". "mem_addr" does not need
to be aligned on any particular boundary.
</description>
<operation>
dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
</operation>
<instruction name='vmovupd' form='ymm, m256' />
<header>immintrin.h</header>
</intrinsic>
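One way the `<category>` metadata could drive effect inference, as a hedged Scala sketch (the names and the category-to-effect table are assumptions, not the generator's actual code):

```scala
// Map the <category> element of each intrinsic's metadata to an LMS effect.
sealed trait Effect
case object ReadEffect  extends Effect // e.g. "Load": _mm256_loadu_pd reads memory
case object WriteEffect extends Effect // e.g. "Store": _mm256_storeu_pd writes memory
case object PureOp      extends Effect // arithmetic, shuffle, compare, ...

def effectOf(category: String): Effect = category match {
  case "Load"  => ReadEffect
  case "Store" => WriteEffect
  case _       => PureOp
}
```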
12. Challenge #3
Type Mappings – unsigned?
• Use Scala Unsigned for unsigned operations
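Since the JVM has no primitive unsigned integers, unsigned operands can be modeled by a value class that reinterprets the bits of a signed Int; a minimal sketch (not the Scala Unsigned library's actual API):

```scala
// Minimal unsigned-int wrapper: same bits as Int, unsigned semantics.
class UInt(val signed: Int) extends AnyVal {
  def +(that: UInt): UInt = new UInt(signed + that.signed) // wrap-around add
  def <(that: UInt): Boolean =                             // unsigned compare
    (signed ^ Int.MinValue) < (that.signed ^ Int.MinValue)
  override def toString: String = (signed & 0xFFFFFFFFL).toString
}
```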
Challenge #4
Pointers?
• Disallow raw pointers and use memory offsets instead
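A hedged sketch of what the pointer-free interface might look like: staged loads take an array plus an element offset, and the code generator emits the pointer arithmetic on the C side (abstract members stand in for the LMS machinery; signatures are illustrative):

```scala
trait PointerFreeLoads {
  type Rep[T]   // LMS staged-value type (abstract here)
  type __m256d  // 256-bit vector of four doubles
  // DSL-side signature: array plus element offset instead of a raw pointer.
  def _mm256_loadu_pd(mem: Rep[Array[Double]], offset: Rep[Int]): Rep[__m256d]
  // The C code generator then emits: _mm256_loadu_pd(&mem[offset])
}
```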
Challenge #5
Implement Arrays only?
• Abstract only over the containers the DSL actually needs
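A sketch of an array-only container interface (again with abstract members standing in for the LMS machinery; the names are illustrative):

```scala
trait DSLArrays {
  type Rep[T]  // LMS staged-value type (abstract here)
  // Flat arrays map directly to C arrays in the generated code;
  // richer Scala containers (List, Vector, ...) have no C counterpart.
  def arrayApply (a: Rep[Array[Double]], i: Rep[Int]): Rep[Double]
  def arrayUpdate(a: Rep[Array[Double]], i: Rep[Int], v: Rep[Double]): Rep[Unit]
}
```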
Challenge #6, #7, ...
Try to think of everything?
• Checked.