Brief Overview of a Parallel Nbody Code

This is a brief overview of an nbody code, covering the sequential and parallel versions (OpenMP and CUDA) and their computational complexity.

Brief overview of a
parallel nbody code
Implementation and analysis

Filipo Novo Mór
Graduate Program in Computer Science UFRGS
Prof. Nicollas Maillard
2013, December

Overview
• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion

About the nbody problem
Features:
• Forces are calculated between all pairs of particles.
• Complexity is O(N²).
• Total energy should remain constant.
• The brute-force algorithm demands huge computational power.

The Serial Implementation
NAIVE!

• Clearly O(N²).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end.

The Serial Implementation

• It is still O(N²), but:
• Each pair is evaluated only once.
• Acceleration is already correct at the end!

The OpenMP Implementation

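• MUST be based on the "naive" version: each iteration i writes only to body i, so threads never update the same body.
• We lose the "/2" (each pair is evaluated twice again), but we gain the "/p" (the work is divided among p threads)!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.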

Analysis

*****
*****
*****
*****
*****

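#include <stdio.h>

#define N 6   /* illustrative value, not from the slide */

int main(void)
{
    /* One star per evaluated pair (i, j) with j > i. */
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++)
            printf("*");
        printf("\n");
    }
    return 0;
}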
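
To make the iteration counts concrete, the sketch below counts inner-loop iterations for the triangular loop (j starting at i + 1) and for a full loop that skips only j == i. The function names are hypothetical. The closed form N(N-1)/2 for the triangular loop is standard arithmetic rather than something stated on the slide.

```c
#include <assert.h>

/* Inner-loop iterations of the triangular loop (each pair once). */
unsigned long count_triangular(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = i + 1; j < n; j++)
            count++;
    return count;
}

/* Inner-loop iterations of the full loop (each pair twice). */
unsigned long count_full(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = 0; j < n; j++)
            if (j != i)
                count++;
    return count;
}

/* Checks the counts against the closed forms N(N-1)/2 and N(N-1). */
void check_closed_form(unsigned long n)
{
    assert(count_triangular(n) == n * (n - 1) / 2);
    assert(count_full(n) == n * (n - 1));
}
```

For N = 5 the triangular loop performs 10 iterations and the full loop 20. Both grow quadratically, and the triangular loop does half the work.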

The CUDA Implementation

Basic CUDA GPU architecture

[Figure sequence: data movement for N = 15 bodies with K = 3. The labels are Global Memory, Shared Memory Bank, Active Tasks, Active Transfers, and BARRIER. Bodies 0 to 14 reside in global memory and are staged into the shared memory bank three at a time (0-2, 3-5, 6-8, 9-11, 12-14), with a BARRIER step between consecutive groups.]

Analysis

C: cost of the CalculateForce function.
M: transfer cost between global and shared memory.
T: transfer cost between CPU (host) and device memory.

• Access to shared memory is around 100× faster than access to global memory.

Experimental Results

How much would it cost?
Testing environment:
• Dell PowerEdge R610
• 2 × Intel Xeon E5520 quad-core, 2.27 GHz, Hyper-Threading
• 8 physical cores, 16 threads
• 16 GB RAM
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS
• GCC 4.4.3
• CUDA 5.0

Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05

• Amazon EC2:
  • General Purpose - m1.large plan
  • GPU Instances - g2.2xlarge plan

Conclusions
• The PRAM model is adequate for the sequential and OpenMP versions.
• But for CUDA, we need a better model, one that considers thread blocks, warps, and latency.

Thanks!

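Additional Slides

About the nbody problem
• Calculations

Force (acceleration)

$$ f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}} $$

Energy (kinetic and potential)

$$ E = E_k + E_p $$

$$ E_p = - \sum_{\substack{1 \le j \le N \\ i \ne j}} \frac{G m_i m_j}{r_{ij}} $$

$$ E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2} $$

Softening factor ε
• collisionless system
• virtual particles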

