9. Accelerated computing
• many-core GPGPU? NO!
• CPU vs. GPU
• "GPGPU was dead!!" "GPU will be dead soon!!"
10. Why GPU -> GPGPU is BAD
• Higher latency: host <-> GPU over PCI-Express
• Internal architecture is a black box
  • Only the GPU maker knows it
• Higher cost of branching
• Debugger?
• Programs run only on a specific GPU maker's GPU
  • Not portable
11. Why CPU -> Accelerated computing is GOOD
• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & versatile
17. No unified way to describe SIMD ops
• SSE: _mm_add_ps()
• AltiVec: vec_add
• SPE: spu_add
18. CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and coming new CPU designs(?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and not productive: a waste of your time.
19. MUDA compiler flow
• Input: MUDA — a portable, CPU-independent description
• The MUDA compiler emits CPU- or architecture-dependent code:
  SSE2 C code, SSE4 C code, VMX C code, or LLVM IR
20. Status
• SSE2 backend: 75%
• SSE4 backend: 0%
• VMX backend: 20%
• LLVM IR backend: 30%
• SIMD math functions for MUDA: 5%
• Automatic optimizer: TODO
(highlighted = what I'm currently working on)
21. Future direction
• Cache-miss analysis and memory access optimization
  • Valgrind, Cache Miss Equations (CME)
• Automatic optimization
  • As FFTW, ATLAS, and Spiral do
• Automatic error measurement for floating-point computation
  • Interval arithmetic, affine arithmetic, Gappa
22. Performance gap
[Bar chart, higher is better: Scalar:SIMD = 1:4, cache miss:cache hit = 1:100]
23. Performance gap
• Optimizing memory access is much more important than SIMDization
[Same chart: Scalar:SIMD = 1:4, cache miss:cache hit = 1:100]