Muda Proposal

Syoyo Fujita

GPU slumps, CPU soars
• GPU: GeForce 9800 GX2 rumor
• 1 TFlops? (3x of G80)
• 500 GFlops? (+50% of G80)
• No update!
• CPU: PS3 179.2 GFlops, Mac Pro octa 204 GFlops
• +800% (2007 to Feb/2008)

Future of GPU trend
[Figure: Nikkei 225 index chart, annotated "Subprime shock!", "Credit boom ends!", "US economy declines!", "Green IT!"]

Accelerated computing
[Diagram: two routes to accelerated computing, CPU via many-core and GPU via GPGPU. The GPU/GPGPU route is marked "NO!"]

GPGPU was dead!!
GPU will be dead soon!!

Why GPU -> GPGPU is BAD
• Higher latency: host <-> PCIe
• Internal architecture is a black box
• Only the GPU maker knows it
• Higher cost of branching
• Debugger?
• Programs run only on a specific GPU maker's GPUs
• Not portable.

Why CPU -> Accelerated computing is GOOD
• Easy to program
• CPU makers provide good internal spec documentation
• Fast execution of branching
• gdb :-)
• Portable & versatile

Accelerated computing
[Diagram, top to bottom: Accelerated computing, many-core, MUDA, CPU]

MUDA's goal
• Extract the CPU's maximum floating-point performance for large data
• SIMD
• Cache-optimized computation

MUDA example
MUDA code
vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0 = rsqrt(x);
    y0x = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}

x86/SSE output
__m128 sqrtmu(const __m128 *x)
{
    __m128 y0;
    __m128 y0x;
    __m128 y0xhalf;
    const __m128 t_vec4 = (__m128)_mm_set1_epi32(1065353217);
    __m128 oneish = t_vec4;
    const __m128 t_vec6 = (*x);
    const __m128 t_vec5 = _mm_rsqrt_ps(t_vec6);
    y0 = t_vec5;
    const __m128 t_vec8 = y0;
    const __m128 t_vec9 = (*x);
    const __m128 t_vec7 = _mm_mul_ps(t_vec8, t_vec9);
    y0x = t_vec7;
    const float t_float13 = 0.5;
    const float t_float12 = t_float13;
    const __m128 t_vec10 = _mm_set_ps1(t_float12);
    const __m128 t_vec14 = y0x;
    const __m128 t_vec11 = _mm_mul_ps(t_vec10, t_vec14);
    y0xhalf = t_vec11;
    const __m128 t_vec19 = oneish;
    const __m128 t_vec20 = y0;
    const __m128 t_vec21 = y0x;
    const __m128 t_vec15 = _mm_mul_ps(t_vec20, t_vec21);
    const __m128 t_vec16 = _mm_sub_ps(t_vec19, t_vec15);
    const __m128 t_vec22 = y0xhalf;
    const __m128 t_vec17 = _mm_mul_ps(t_vec16, t_vec22);
    const __m128 t_vec23 = y0x;
    const __m128 t_vec18 = _mm_add_ps(t_vec17, t_vec23);
    return t_vec18;
}
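
The slides do not explain why sqrtmu approximates a square root, so here is a short derivation as a reading aid. It assumes that rsqrt returns 1/sqrt(x) with a small relative error ε, that oneish is treated as exactly 1 (0x3f800001 is the single-precision value just above 1.0), and that rounding in the intermediate steps is ignored.

```latex
\begin{aligned}
y_0 &= \frac{1+\varepsilon}{\sqrt{x}}, \qquad
y_0 x = \sqrt{x}\,(1+\varepsilon),\\[4pt]
\text{sqrtmu}(x) &= y_0 x + \tfrac{1}{2}\,y_0 x\,\bigl(1 - y_0 \cdot y_0 x\bigr)\\
&= \sqrt{x}\,(1+\varepsilon)\Bigl(1 - \varepsilon - \tfrac{1}{2}\varepsilon^{2}\Bigr)\\
&= \sqrt{x}\,\Bigl(1 - \tfrac{3}{2}\varepsilon^{2} - \tfrac{1}{2}\varepsilon^{3}\Bigr).
\end{aligned}
```

Under these assumptions the code is one Newton-Raphson refinement step: the first-order error cancels and about 1.5ε² remains. With the 1.5 × 2^-12 bound that Intel documents for _mm_rsqrt_ps, that is roughly 2 × 10^-7 relative error before rounding effects.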

No unified way to describe SIMD ops

• SSE: _mm_add_ps()
• AltiVec: vec_add()
• SPE: spu_add()
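
To make the portability problem concrete, here is a minimal C sketch of the hand-written wrapper a programmer needs without a tool like MUDA. The names vec4, VEC4_ADD, and add4 are hypothetical and not taken from MUDA. Only SSE and AltiVec are covered, with a scalar fallback so the sketch builds on any target.

```c
#include <assert.h>
#include <string.h>

/* One 4-float vector type and one add operation per target.
   vec4 and VEC4_ADD are illustrative names, not part of MUDA. */
#if defined(__SSE__)
#include <xmmintrin.h>
typedef __m128 vec4;
#define VEC4_ADD(a, b) _mm_add_ps((a), (b))
#elif defined(__ALTIVEC__)
#include <altivec.h>
typedef __vector float vec4;
#define VEC4_ADD(a, b) vec_add((a), (b))
#else
/* Scalar fallback for targets without either instruction set. */
typedef struct { float f[4]; } vec4;
static vec4 vec4_add_scalar(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}
#define VEC4_ADD(a, b) vec4_add_scalar((a), (b))
#endif

/* out[i] = a[i] + b[i] for i = 0..3, written once for every target. */
void add4(const float *a, const float *b, float *out)
{
    vec4 va, vb, vr;
    assert(a != NULL && b != NULL && out != NULL);
    memcpy(&va, a, sizeof va);    /* load without alignment requirements */
    memcpy(&vb, b, sizeof vb);
    vr = VEC4_ADD(va, vb);
    memcpy(out, &vr, sizeof vr);  /* store without alignment requirements */
}
```

Every additional operation and every additional instruction set adds another branch to this kind of header, which is the maintenance cost the slide points at.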

CPU ISA changes frequently
• SSE2 (2000), SSE3 (2004), SSE4 (2006)
• SSE5 and upcoming new CPU designs (?)
• 8-element SIMD? No SIMD in future CPUs?
• Keeping up with them is hard and unproductive. A waste of your time.

[Diagram: MUDA source (portable, CPU-independent description) -> MUDA compiler -> SSE2 C code, SSE4 C code, VMX C code, LLVM IR (CPU- or architecture-dependent code)]
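
The one-source, many-backends idea can be sketched in a few lines of C. This is only an illustration of source-to-source emission, not MUDA's actual compiler: backend_t, emit_add, and the output formats are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical backend identifiers, loosely following the targets
   named on the slide. MUDA's real internals are not shown there. */
typedef enum { BACKEND_SSE2, BACKEND_VMX, BACKEND_LLVM_IR } backend_t;

/* Emit target code for the 4-float vector statement "dst = lhs + rhs".
   Returns the number of characters written, or -1 if buf is too small. */
int emit_add(backend_t b, const char *dst, const char *lhs, const char *rhs,
             char *buf, size_t size)
{
    int n = -1;
    assert(buf != NULL && size > 0);
    switch (b) {
    case BACKEND_SSE2:
        n = snprintf(buf, size, "%s = _mm_add_ps(%s, %s);", dst, lhs, rhs);
        break;
    case BACKEND_VMX:
        n = snprintf(buf, size, "%s = vec_add(%s, %s);", dst, lhs, rhs);
        break;
    case BACKEND_LLVM_IR:
        n = snprintf(buf, size, "%%%s = fadd <4 x float> %%%s, %%%s",
                     dst, lhs, rhs);
        break;
    }
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling emit_add with the same operands and a different backend_t value yields a different target spelling of the same operation, which is the role the diagram assigns to the MUDA compiler.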

Status
• SSE2 backend : 75 %
• SSE4 backend : 0 %
• VMX backend : 20 %
• LLVM IR backend : 30 %
• SIMD math function for MUDA : 5 %
• Automatic optimizer : TODO
= I’m currently working on

Future direction
• Cache miss analysis and memory access optimization
• Valgrind, Cache Miss Equations (CME)
• Automatic optimization
• As FFTW, ATLAS, and Spiral do
• Automatic error measurement for floating-point computation
• Interval arithmetic, affine arithmetic, Gappa

Performance gap
Optimizing memory access is much more important than SIMDization.
[Bar chart, scale 0 to 100, higher is better. SIMD bar: scalar:SIMD = 1:4. Memory bar: cache miss:cache hit = 1:100.]
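
A minimal C sketch of what memory access order means in practice. The function names are hypothetical and the example is not from the slides. Both functions compute the same sum over a row-major matrix and differ only in traversal order.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored in row-major order, visiting
   elements in the order they are laid out in memory (stride 1). */
double sum_row_order(const float *m, size_t rows, size_t cols)
{
    double s = 0.0;
    assert(m != NULL);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Same sum, but walking down each column: consecutive accesses are
   cols floats apart, so each one may touch a different cache line. */
double sum_column_order(const float *m, size_t rows, size_t cols)
{
    double s = 0.0;
    assert(m != NULL);
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

On matrices much larger than the cache, the column-order version typically runs slower even though it performs the same arithmetic. How much slower depends on the hardware, and the 1:100 ratio is the slide's figure rather than something this sketch measures.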


1. MUDA: MUltiple Data Accelerator language. Project Overview, Feb 24, 2008, Syoyo FUJITA


16. Why MUDA?

