This document contains slides from a PLDI 2017 tutorial on vectorization with LMS using SIMD intrinsics. It discusses Single Instruction Multiple Data (SIMD) parallelism and how to implement SIMD intrinsics in LMS. Challenges include Scala's limits on large classes, deriving read/write effects automatically, type mappings for unsigned values and pointers, and supporting arrays rather than general containers in the DSL. The slides give scalar and SIMD-vectorized addition examples and survey SIMD instruction sets such as AVX, noting the large number of intrinsics to support. They propose generating the intrinsics automatically from Intel's metadata rather than porting each one to LMS by hand.
10. Challenge #1
Scala chokes on big classes: the JVM imposes a ~64 kB bytecode limit per method
• Split the implementation into multiple classes
• Make one trait inherit all the split classes
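The split-and-recombine layout above can be sketched as follows (the trait names are illustrative, not the tutorial's actual ones):

```scala
// Sketch: the generated intrinsic definitions are spread over several
// traits so that no single class body blows past the JVM's ~64 kB
// per-method bytecode limit (a class initializer counts as one method).
trait AVX_Arith { /* _mm256_add_pd, _mm256_sub_pd, ... */ }
trait AVX_Load  { /* _mm256_loadu_pd, _mm256_load_pd, ... */ }
trait AVX_Store { /* _mm256_storeu_pd, ... */ }

// One trait inherits all the pieces, so DSL users still see a single API.
trait AVX extends AVX_Arith with AVX_Load with AVX_Store
```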
11. Challenge #2
LMS tracks read / write effects
• Produce the effects automatically from the category data in the Intel Intrinsics Guide
<intrinsic tech='AVX' rettype='__m256d' name='_mm256_loadu_pd'>
<type>Floating Point</type>
<CPUID>AVX</CPUID>
<category>Load</category>
<parameter varname='mem_addr' type='double const *' />
<description>
Load 256-bits (composed of 4 packed
double-precision (64-bit) floating-point elements)
from memory into "dst". "mem_addr" does not need
to be aligned on any particular boundary.
</description>
<operation>
dst[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
</operation>
<instruction name='vmovupd' form='ymm, m256' />
<header>immintrin.h</header>
</intrinsic>
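One way the `<category>` metadata could drive effect inference, as a hedged Scala sketch (the names and the category-to-effect table are assumptions, not the generator's actual code):

```scala
// Map the <category> element of each intrinsic's metadata to an LMS effect.
sealed trait Effect
case object ReadEffect  extends Effect // e.g. "Load": _mm256_loadu_pd reads memory
case object WriteEffect extends Effect // e.g. "Store": _mm256_storeu_pd writes memory
case object PureOp      extends Effect // arithmetic, shuffle, compare, ...

def effectOf(category: String): Effect = category match {
  case "Load"  => ReadEffect
  case "Store" => WriteEffect
  case _       => PureOp
}
```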
12. Challenge #3
Type Mappings – unsigned?
• Use Scala Unsigned for unsigned operations
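Since the JVM has no primitive unsigned integers, unsigned operands can be modeled by a value class that reinterprets the bits of a signed Int; a minimal sketch (not the Scala Unsigned library's actual API):

```scala
// Minimal unsigned-int wrapper: same bits as Int, unsigned semantics.
class UInt(val signed: Int) extends AnyVal {
  def +(that: UInt): UInt = new UInt(signed + that.signed) // wrap-around add
  def <(that: UInt): Boolean =                             // unsigned compare
    (signed ^ Int.MinValue) < (that.signed ^ Int.MinValue)
  override def toString: String = (signed & 0xFFFFFFFFL).toString
}
```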
Challenge #4
Pointers?
• Disallow raw pointers and use memory offsets instead
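A hedged sketch of what the pointer-free interface might look like: staged loads take an array plus an element offset, and the code generator emits the pointer arithmetic on the C side (abstract members stand in for the LMS machinery; signatures are illustrative):

```scala
trait PointerFreeLoads {
  type Rep[T]   // LMS staged-value type (abstract here)
  type __m256d  // 256-bit vector of four doubles
  // DSL-side signature: array plus element offset instead of a raw pointer.
  def _mm256_loadu_pd(mem: Rep[Array[Double]], offset: Rep[Int]): Rep[__m256d]
  // The C code generator then emits: _mm256_loadu_pd(&mem[offset])
}
```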
Challenge #5
Implement Arrays only?
• Abstract only over the containers the DSL actually needs
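A sketch of an array-only container interface (again with abstract members standing in for the LMS machinery; the names are illustrative):

```scala
trait DSLArrays {
  type Rep[T]  // LMS staged-value type (abstract here)
  // Flat arrays map directly to C arrays in the generated code;
  // richer Scala containers (List, Vector, ...) have no C counterpart.
  def arrayApply (a: Rep[Array[Double]], i: Rep[Int]): Rep[Double]
  def arrayUpdate(a: Rep[Array[Double]], i: Rep[Int], v: Rep[Double]): Rep[Unit]
}
```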
Challenge #6, #7, ...
Try to think of everything?
• Checked.