6. Parallel Processing (3/3)
Main goal: solve larger problems faster
Reduce a program's wall-clock time
Increase the size of problems that can be solved
Computing resources for parallel computing
A single computer with multiple processors (CPUs)
Multiple computers connected by a network
7. Why Parallel?
Limits on building ever-faster single-processor systems
Signal transmission speed limit (copper wire: 9 cm/nanosec)
Limits of miniaturization
Economic constraints
Faster networks, distributed systems, and multiprocessor architectures have made parallel computing environments practical
Expected performance gain from combining many relatively inexpensive processors and using them simultaneously
8. Programs and Processes
A process is an executable program, stored as a file on secondary storage, that has been loaded and placed under the execution control of the operating system (kernel)
Program: stored on secondary storage
Process: a program being executed by the computer system
Task = process
9. Processes
The unit of resource allocation for program execution; one program can run as multiple processes
A single-processor system supporting multiple processes
Wasteful resource allocation and overhead from context switching
Context switching
• Only one process runs on a given processor at any instant
• The current process state is saved and another process state is loaded
Unit of work assignment in the distributed-memory parallel programming model
10. Threads
A thread isolates only the execution aspect of a process
Process = execution units (threads) + execution environment (shared resources)
A single process can contain multiple threads
Threads share the execution environment with the other threads of the same process
A single-processor system supporting multiple threads
More efficient resource allocation than multiple processes
More efficient context switching than multiple processes
Unit of work assignment in the shared-memory parallel programming model
12. Types of Parallelism
Data parallelism
Domain decomposition
Each task performs the same series of computations on different data
Task parallelism
Functional decomposition
Each task performs different computations on the same or different data
13. Data Parallelism (1/3)
Data parallelism: domain decomposition
[Figure: the problem data set is divided among Task 1 through Task 4]
14. Data Parallelism (2/3)
Code example: matrix multiplication (OpenMP)
Serial Code
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
Parallel Code
!$OMP PARALLEL DO
DO K=1,N
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
15. Data Parallelism (3/3)
Data decomposition (4 processors, K=1,20)
Process | Iterations of K | Data Elements
Proc0   | K = 1:5         | A(I,1:5),   B(1:5,J)
Proc1   | K = 6:10        | A(I,6:10),  B(6:10,J)
Proc2   | K = 11:15       | A(I,11:15), B(11:15,J)
Proc3   | K = 16:20       | A(I,16:20), B(16:20,J)
16. Task Parallelism (1/3)
Task parallelism: functional decomposition
[Figure: the problem instruction set is divided among Task 1 through Task 4]
17. Task Parallelism (2/3)
Code example (OpenMP)
Serial Code
PROGRAM MAIN
…
CALL interpolate()
CALL compute_stats()
CALL gen_random_params()
…
END
Parallel Code
PROGRAM MAIN
…
!$OMP PARALLEL
!$OMP SECTIONS
CALL interpolate()
!$OMP SECTION
CALL compute_stats()
!$OMP SECTION
CALL gen_random_params()
!$OMP END SECTIONS
!$OMP END PARALLEL
…
END
19. Parallel Architectures (1/2)
Processor Organizations (Flynn's taxonomy)
Single Instruction, Single Data Stream (SISD): Uniprocessor
Single Instruction, Multiple Data Stream (SIMD): Vector Processor, Array Processor
Multiple Instruction, Single Data Stream (MISD)
Multiple Instruction, Multiple Data Stream (MIMD)
• Shared memory (tightly coupled): Symmetric multiprocessor (SMP), Non-uniform Memory Access (NUMA)
• Distributed memory (loosely coupled): Clusters
20. Parallel Architectures (2/2)
Recent high-performance systems: distributed-shared memory support
Software DSM (Distributed Shared Memory) implementations
• Message passing support on shared memory systems
• Shared-variable support on distributed memory systems
Hardware DSM implementations: distributed-shared memory architectures
• Each node of a distributed memory system is itself a shared memory system
• NUMA: appears to users as a single shared memory architecture
  ex) Superdome (HP), Origin 3000 (SGI)
• SMP cluster: appears as a distributed system built from SMP nodes
  ex) SP (IBM), Beowulf Clusters
21. Parallel Programming Models
Shared-memory parallel programming model
Suited to shared memory architectures
Multithreaded programs
OpenMP, Pthreads
Message-passing parallel programming model
Suited to distributed memory architectures
MPI, PVM
Hybrid parallel programming model
Distributed-shared memory architectures
OpenMP + MPI (see the sketch below)
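A minimal hybrid sketch, assuming one MPI process per node with OpenMP threads inside each process; the loop, the sum, and the bounds are illustrative placeholders, not from the slides (compile with something like mpicc -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP threads share the work assigned to this MPI process */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (1.0 + (double)i);

    /* MPI combines the per-process partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}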
22. Shared-Memory Parallel Programming Model
[Figure: a single-threaded process executes S1, P1-P4, S2 in sequence; in the multi-threaded version the master thread forks threads that execute P1-P4 concurrently in a shared address space, then joins before S2]
23. Message-Passing Parallel Programming Model
[Figure: the serial program S1, P1-P4, S2 is split across Process 0-3 on Node 1-4; each process runs S1, one of P1-P4, and S2, with data transmission over the interconnect]
24. Hybrid Parallel Programming Model
[Figure: Process 0 on Node 1 and Process 1 on Node 2 exchange messages; within each process a fork creates threads that execute P1-P2 and P3-P4 in a shared address space, then join before S2]
25. Message Passing on a DSM System
[Figure: Process 0-3 each run S1, one of P1-P4, and S2; Processes 0-1 reside on Node 1 and Processes 2-3 on Node 2, communicating by message passing]
26. SPMD and MPMD (1/4)
SPMD (Single Program Multiple Data)
One program is executed simultaneously by multiple processes
At any instant the processes execute instructions from the same program, and those instructions may be the same or different
MPMD (Multiple Program Multiple Data)
An MPMD application consists of multiple executable programs
When the application runs in parallel, each process may execute the same program as, or a different program from, the other processes
31. Measuring Program Execution Time (1/2)
time
Usage (bash, ksh): $ time [executable]
$ time mpirun -np 4 -machinefile machines ./exmpi.x
real 0m3.59s
user 0m3.16s
sys  0m0.04s
real = wall-clock time
user = CPU time spent executing the program itself and the libraries it calls
sys = CPU time spent in system calls made by the program
user + sys = CPU time
32. Measuring Program Execution Time (2/2)
Usage (csh): $ time [executable]
$ time testprog
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w
 ①     ②     ③       ④     ⑤        ⑥       ⑦   ⑧
① user CPU time (1.15 s)
② system CPU time (0.02 s)
③ real time (0 min 1.76 s)
④ fraction of real time accounted for by CPU time (66.4%)
⑤ memory use: shared (15 Kbytes) + unshared (3981 Kbytes)
⑥ input (24 blocks) + output (10 blocks)
⑦ no page faults
⑧ no swaps
34. Speed-up (1/7)
Speed-up: S(n)
S(n) = (execution time of the serial program) / (execution time of the parallel program on n processors) = ts / tp
The performance gain of the parallel program relative to the serial program
Execution time = wall-clock time
If a serial program that takes 100 seconds is parallelized and runs in 50 seconds on 10 processors, then
S(10) = 100 / 50 = 2
35. Speed-up (2/7)
Ideal speed-up: Amdahl's Law
f : serial fraction of the code (0 ≤ f ≤ 1)
tp = f·ts + (1-f)·ts/n
(serial-part execution time + parallel-part execution time)
37. Speed-up (4/7)
S(n) = ts / tp = ts / (f·ts + (1-f)·ts/n) = 1 / (f + (1-f)/n)
Maximum speed-up (n → ∞):
S(n) → 1 / f
As the number of processors increases, the speed-up converges to the reciprocal of the serial fraction
38. Speed-up (5/7)
f = 0.2, n = 4
[Figure: serial run = 20 (cannot be parallelized) + 80 (can be parallelized); in the parallel run the serial 20 remains and each of processes 1-4 executes 20]
S(4) = 1 / (0.2 + (1-0.2)/4) = 2.5
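The value above and the curves on the next slide can be reproduced with a small helper; a minimal sketch, assuming nothing beyond Amdahl's formula (the function and variable names are mine, not from the slides):

#include <stdio.h>

/* Ideal speed-up from Amdahl's law: f is the serial fraction,
 * n the number of processors. */
static double amdahl_speedup(double f, int n) {
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void) {
    printf("f=0.2,  n=4  -> S = %.2f\n", amdahl_speedup(0.2, 4));    /* 2.50 */
    printf("f=0.05, n=24 -> S = %.2f\n", amdahl_speedup(0.05, 24));  /* ~11.2 */
    return 0;
}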
39. Speed-up (6/7)
Speed-up versus number of processors
[Figure: speed-up as a function of the number of processors n (0-24) for f = 0, 0.05, 0.1, 0.2; only f = 0 scales linearly, and larger serial fractions flatten out quickly]
41. Efficiency
Efficiency: E(n)
E(n) = ts / (tp × n) = S(n) / n
Indicates how efficiently the parallel program uses its processors
• 2× speed-up on 10 processors: S(10) = 2, E(10) = 20 %
• 10× speed-up on 100 processors: S(100) = 10, E(100) = 10 %
42. Cost
Cost = execution time × number of processors
Serial program: Cost = ts
Parallel program: Cost = tp × n = ts·n / S(n) = ts / E(n)
Example: 2× speed-up on 10 processors, 10× speed-up on 100 processors
ts  | tp | n   | S(n) | E(n) | Cost
100 | 50 | 10  | 2    | 0.2  | 500
100 | 10 | 100 | 10   | 0.1  | 1000
43. Considerations for Real Speed-up
Actual speed-up: communication overhead and load-balancing problems
[Figure: serial run = 20 (cannot be parallelized) + 80 (can be parallelized); in the parallel run each of processes 1-4 executes 20 plus communication overhead, and load imbalance leaves some processes waiting]
44. Ways to Increase Performance
1. Increase the fraction of the program that can be parallelized (coverage)
   Improve the algorithm
2. Distribute the workload evenly: load balancing
3. Reduce the time spent in communication (communication overhead)
45. Factors Affecting Performance
Coverage: Amdahl's Law
Load balancing
Synchronization
Communication overhead
Granularity
Input/output
46. Load Balancing
Distribute work so that all processes have working times that are as equal as possible, minimizing waiting time
Choose the data distribution scheme (block, cyclic, block-cyclic) carefully
Especially important when heterogeneous systems are connected
Can also be achieved through dynamic work assignment
[Figure: task0-task3 timelines showing WORK and WAIT periods when the load is unbalanced]
47. Synchronization
Coordination that brings parallel tasks to a consistent state or consistent information
A major source of parallel overhead: harmful to performance
Implemented with barriers, locks, semaphores, synchronous communication operations, etc.
Parallel overhead
Overhead from starting, terminating, and coordinating parallel tasks
• Start: task identification, processor assignment, task loading, data loading, etc.
• Termination: collecting and forwarding results, releasing operating system resources, etc.
• Coordination: synchronization, communication, etc.
48. Communication Overhead (1/4)
Overhead caused by data communication
The network has inherent latency and bandwidth
Especially important for message passing
Factors affecting communication overhead
Synchronous or asynchronous communication?
Blocking or non-blocking?
Point-to-point or collective communication?
Number of data transfers and size of the transferred data
49. Communication Overhead (2/4)
Communication time = latency + (message size / bandwidth)
Latency: time for the first bit of the message to be delivered
• send latency + receive latency + propagation latency
Bandwidth: amount of data that can be communicated per unit time (MB/sec)
Effective bandwidth = message size / communication time = bandwidth / (1 + latency × bandwidth / message size)
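A small sketch of this cost model; the latency and bandwidth values are assumed for illustration, not measurements from the slides:

#include <stdio.h>

/* Simple linear communication-cost model:
 * time = latency + size / bandwidth. */
static double comm_time(double latency_s, double bandwidth_Bps, double msg_bytes) {
    return latency_s + msg_bytes / bandwidth_Bps;
}

int main(void) {
    double latency = 5e-6;            /* 5 microseconds, assumed */
    double bandwidth = 1e9;           /* 1 GB/s, assumed */
    double sizes[] = {1e3, 1e6, 1e9}; /* 1 KB, 1 MB, 1 GB messages */
    for (int i = 0; i < 3; i++) {
        double t = comm_time(latency, bandwidth, sizes[i]);
        printf("size %.0e B: time %.6f s, effective bw %.2f MB/s\n",
               sizes[i], t, sizes[i] / t / 1e6);
    }
    return 0;
}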
52. Granularity (1/2)
The ratio of computation time to communication time in a parallel program
Fine-grained parallelism
• Relatively little computation between communication or synchronization events
• Favorable for load balancing
Coarse-grained parallelism
• Relatively much computation between communication or synchronization events
• Unfavorable for load balancing
In general, coarse-grained parallelism is better for performance
(performance suffers when computation time < communication or synchronization time)
May vary with the algorithm and hardware environment
54. Input/Output
Generally inhibits parallelism
Writes: overwrite problems when the same file space is used
Reads: performance problems for the file server handling multiple read requests
Bottlenecks for I/O that traverses the network (NFS, non-local)
Reduce I/O wherever possible
Confine I/O to specific serial regions
Perform I/O on local file space
Parallel file systems are being developed (GPFS, PVFS, PPFS, …)
Parallel I/O programming interfaces are being developed (MPI-2: MPI I/O)
55. Scalability (1/2)
The ability to benefit from an enlarged environment
Hardware scalability
Algorithmic scalability
Main hardware factors affecting scalability
CPU-memory bus bandwidth
Network bandwidth
Memory capacity
Processor clock speed
57. Dependency and Deadlock
Data dependency: the execution order of a program affects its results
DO k = 1, 100
  F(k+2) = F(k+1) + F(k)
ENDDO
Deadlock: two or more processes each wait for an event that the other must generate
Process 1
X = 4
SOURCE = TASK2
RECEIVE (SOURCE, Y)
DEST = TASK2
SEND (DEST, X)
Z = X + Y
Process 2
Y = 8
SOURCE = TASK1
RECEIVE (SOURCE, X)
DEST = TASK1
SEND (DEST, Y)
Z = X + Y
(Both processes block in RECEIVE, each waiting for a SEND the other never reaches.)
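A hedged C/MPI sketch of the same exchange with the deadlock avoided by MPI_Sendrecv, which pairs the send and the receive internally; it assumes exactly two ranks, and the variable names mirror the pseudo-code above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double mine, other, z;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    mine = (rank == 0) ? 4.0 : 8.0;   /* X on rank 0, Y on rank 1 */
    int peer = 1 - rank;              /* assumes exactly 2 ranks */

    /* Combined send/receive avoids the RECEIVE-then-SEND deadlock */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, peer, 0,
                 &other, 1, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    z = mine + other;                 /* Z = X + Y on both ranks */
    printf("rank %d: Z = %g\n", rank, z);
    MPI_Finalize();
    return 0;
}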
59. Steps in Writing a Parallel Program
① Write, analyze (profile), and optimize the serial code
② Identify hotspots, bottlenecks, data dependencies, etc.
   Data parallelism or task parallelism?
   Develop the parallel code
   MPI / OpenMP / … ?
   Add task assignment and control, communication, and synchronization code
③ Compile, run, and debug
④ Optimize the parallel code
   Improve performance through measurement and analysis
60. Debugging and Performance Analysis
Debugging
Take a modular approach when writing code
Watch for communication, synchronization, data dependency, and deadlock issues
Debugger: TotalView
Performance measurement and analysis
Use timer functions
Profilers: prof, gprof, pgprof, TAU
64. History of OpenMP
1990s:
Advances in high-performance shared memory systems
Vendors used their own directive sets, creating a need for standardization
1994: ANSI X3H5; 1996: openmp.org established
1997: OpenMP API announced
Release History
OpenMP Fortran API version 1.0 : October 1997
C/C++ API version 1.0 : October 1998
Fortran API version 1.1 : November 1999
Fortran API version 2.0 : November 2000
C/C++ API version 2.0 : March 2002
Combined C/C++ and Fortran API version 2.5 : May 2005
API version 3.0 : May 2008
67. Components of OpenMP (2/2)
Compiler directives
Handle work sharing, communication, and synchronization among threads
OpenMP in the narrow sense
ex) C$OMP PARALLEL DO
Runtime library
Set and query parallel parameters (number of participating threads, thread IDs, etc.)
ex) CALL omp_set_num_threads(128)
Environment variables
Define parallel parameters of the executing system (number of threads, etc.)
ex) export OMP_NUM_THREADS=8
68. OpenMP Programming Model (1/4)
Based on compiler directives
Insert compiler directives at appropriate points in the serial code
The compiler uses the directives to generate multithreaded code
Requires a compiler that supports OpenMP
Synchronization and dependency removal remain the programmer's work
69. OpenMP Programming Model (2/4)
Fork-Join
Multiple threads are created where parallelism is needed
When the parallel computation finishes, execution continues sequentially
[Figure: the master thread forks a team of threads at each parallel region and joins them when the region ends]
70. OpenMP Programming Model (3/4)
Inserting compiler directives
Serial Code
PROGRAM exam
…
ialpha = 2
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
PRINT *, a
END
Parallel Code
PROGRAM exam
…
ialpha = 2
!$OMP PARALLEL DO
DO i = 1, 100
a(i) = a(i) + ialpha*b(i)
ENDDO
!$OMP END PARALLEL DO
PRINT *, a
END
71. OpenMP Programming Model (4/4)
Fork-Join
※ export OMP_NUM_THREADS=4
[Figure: the master thread executes ialpha = 2, forks four threads (one master, three slaves) that run DO i=1,25 / 26,50 / 51,75 / 76,100 in parallel, joins, and the master thread executes PRINT *, a]
72. Advantages and Disadvantages of OpenMP
Advantages
Easier to code and debug than MPI
Data distribution is straightforward
Incremental parallelization is possible
One source can be compiled as either parallel or serial code
Relatively small code size
Disadvantages
• Can only be implemented on shared-memory multiprocessor architectures
• Requires a compiler that supports OpenMP
• Heavily loop-oriented, which can mean low parallelization efficiency
• Limited by the scalability of shared memory architectures (number of processors, memory, etc.)
73. Typical Use of OpenMP
Parallelizing loops using data parallelism
1. Find the time-consuming loops (profiling)
2. Examine dependencies and data scope
3. Parallelize by inserting directives
Parallelization using task parallelism is also possible
75. Directives (2/5)
Parallel region directive
PARALLEL/END PARALLEL
Designates a code section as a parallel region
The designated region is executed by multiple threads simultaneously
Work-sharing directives
DO/FOR
Used inside a parallel region
Assigns loop work to threads based on the loop index
Combined parallel work-sharing directive
PARALLEL DO/FOR
Performs the roles of PARALLEL + DO/FOR
76. Directives (3/5)
Designating a parallel region
Fortran
!$OMP PARALLEL
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
!$OMP END PARALLEL
C
#pragma omp parallel
for(i=1; i<=10; i++)
  printf("Hello World %d\n", i);
77. Directives (4/5)
Parallel region and work sharing
Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, 10
  PRINT *, 'Hello World', i
ENDDO
[!$OMP END DO]
!$OMP END PARALLEL
C
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=10; i++)
    printf("Hello World %d\n", i);
}
78. Directives (5/5)
Parallel region and work sharing
Fortran
!$OMP PARALLEL
!$OMP DO
DO i = 1, n
  a(i) = b(i) + c(i)
ENDDO
[!$OMP END DO]   (optional)
!$OMP DO
…
[!$OMP END DO]
!$OMP END PARALLEL
C
#pragma omp parallel
{
  #pragma omp for
  for (i=1; i<=n; i++) {
    a[i] = b[i] + c[i];
  }
  #pragma omp for
  for(…){
    …
  }
}
79. Runtime Library and Environment Variables (1/3)
Runtime library
omp_set_num_threads(integer) : set the number of threads
omp_get_num_threads() : return the number of threads
omp_get_thread_num() : return the thread ID
Environment variables
OMP_NUM_THREADS : maximum number of threads available
• export OMP_NUM_THREADS=16 (ksh)
• setenv OMP_NUM_THREADS 16 (csh)
C : #include <omp.h>
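A minimal C sketch that exercises the three library routines and the header above; the thread count and output are illustrative (compile with an OpenMP flag such as -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);                    /* request 4 threads */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* this thread's ID */
        int nthreads = omp_get_num_threads();  /* team size inside the region */
        printf("thread %d of %d\n", tid, nthreads);
    }
    return 0;
}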
82. clause : reduction (1/4)
reduction(operator|intrinsic : var1, var2, …)
Reduction variables are shared
• Arrays are allowed (Fortran only): deferred-shape and assumed-shape arrays cannot be used
• In C only scalar variables are allowed
Each thread gets a private copy, initialized according to the operator (see the table), and performs its part of the computation in parallel
The partial results computed in parallel by the threads are combined into the final result, which is delivered to the master thread
83. clause : reduction (2/4)
!$OMP DO reduction(+:sum)
DO i = 1, 100
  sum = sum + x(i)
ENDDO

Thread 0                    Thread 1
sum0 = 0                    sum1 = 0
DO i = 1, 50                DO i = 51, 100
  sum0 = sum0 + x(i)          sum1 = sum1 + x(i)
ENDDO                       ENDDO

sum = sum0 + sum1
84. clause : reduction (3/4)
Reduction Operators : Fortran
Operator | Data Types                                | Initial Value
+        | integer, floating point (complex or real) | 0
*        | integer, floating point (complex or real) | 1
-        | integer, floating point (complex or real) | 0
.AND.    | logical                                   | .TRUE.
.OR.     | logical                                   | .FALSE.
.EQV.    | logical                                   | .TRUE.
.NEQV.   | logical                                   | .FALSE.
MAX      | integer, floating point (real only)       | smallest representable value
MIN      | integer, floating point (real only)       | largest representable value
IAND     | integer                                   | all bits on
IOR      | integer                                   | 0
IEOR     | integer                                   | 0
85. clause : reduction (4/4)
Reduction Operators : C
Operator | Data Types              | Initial Value
+        | integer, floating point | 0
*        | integer, floating point | 1
-        | integer, floating point | 0
&        | integer                 | all bits on
|        | integer                 | 0
^        | integer                 | 0
&&       | integer                 | 1
||       | integer                 | 0
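For comparison with the Fortran example on slide 83, a minimal C sketch of reduction(+:sum); the array contents are illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double x[100], sum = 0.0;
    for (int i = 0; i < 100; i++) x[i] = i + 1;   /* 1..100 */

    /* Each thread accumulates into a private copy of sum (initialized to 0
     * for the + operator); the copies are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100; i++)
        sum += x[i];

    printf("sum = %.0f\n", sum);   /* 5050 */
    return 0;
}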
90. What is MPI?
MPI = Message Passing Interface
MPI is a specification for the developers and users of message passing libraries. By itself, it
is NOT a library – but rather the specification of what such a library should be.
MPI primarily addresses the message-passing parallel programming model : data is moved
from the address space of one process to that of another process through cooperative
operations on each process.
Simply stated, the goal of the Message Passing Interface is to provide a widely used standard
for writing message passing programs. The interface attempts to be :
Portable
Efficient
Practical
Flexible
91. What is MPI?
The MPI standard has gone through a number of revisions, with the most recent version
being MPI-3.
Interface specifications have been defined for C and Fortran90 language bindings :
C++ bindings from MPI-1 are removed in MPI-3
MPI-3 also provides support for Fortran 2003 and 2008 features
Actual MPI library implementations differ in which version and features of the MPI standard
they support. Developers/users will need to be aware of this.
92. Programming Model
Originally, MPI was designed for distributed memory architectures, which were becoming
increasingly popular at the time (1980s – early 1990s).
As architecture trends changed, shared memory SMPs were combined over networks
creating hybrid distributed memory/shared memory systems.
93. Programming Model
MPI implementers adapted their libraries to handle both types of underlying memory
architectures seamlessly. They also adapted/developed ways of handling different
interconnects and protocols.
Today, MPI runs on virtually any hardware platform :
Distributed Memory
Shared Memory
Hybrid
The programming model clearly remains a distributed memory model however, regardless of
the underlying physical architecture of the machine.
94. Reasons for Using MPI
Standardization
MPI is the only message passing library which can be considered a standard. It is
supported on virtually all HPC platforms. Practically, it has replaced all previous
message passing libraries.
Portability
There is little or no need to modify your source code when you port your application to a
different platform that supports (and is compliant with) the MPI standard.
Performance Opportunities
Vendor implementations should be able to exploit native hardware features to optimize
performance.
Functionality
There are over 440 routines defined in MPI-3, which includes the majority of those in
MPI-2 and MPI-1.
Availability
A variety of implementations are available, both vendor and public domain.
95. History and Evolution
MPI has resulted from the efforts of numerous individuals and groups that began in 1992.
1980s – early 1990s : Distributed memory, parallel computing develops, as do a number of
incompatible software tools for writing such programs – usually with tradeoffs between
portability, performance, functionality and price. Recognition of the need for a standard arose.
Apr 1992 : Workshop on Standards for Message Passing in a Distributed Memory
Environment, sponsored by the Center for Research on Parallel Computing, Williamsburg,
Virginia. The basic features essential to a standard message passing interface were
discussed, and a working group established to continue the standardization process.
A preliminary draft proposal was developed subsequently.
96. History and Evolution
Nov 1992 : Working group meets in Minneapolis. MPI draft proposal (MPI1) from ORNL
presented. Group adopts procedures and organization to form the MPI Forum. It eventually
comprised about 175 individuals from 40 organizations including parallel computer
vendors, software writers, academia and application scientists.
Nov 1993 : Supercomputing 93 conference – draft MPI standard presented.
May 1994 : Final version of MPI-1.0 released.
MPI-1.0 was followed by versions MPI-1.1 (Jun 1995), MPI-1.2 (Jul 1997) and MPI-1.3 (May 2008).
MPI-2 picked up where the first MPI specification left off, and addressed topics which went far
beyond the MPI-1 specification. It was finalized in 1996.
MPI-2.1 (Sep 2008) and MPI-2.2 (Sep 2009) followed.
Sep 2012 : The MPI-3.0 standard was approved.
99. A Header File for MPI Routines
Required for all programs that make MPI library calls.
C include file
#include "mpi.h"
Fortran include file
include 'mpif.h'
With MPI-3 Fortran, the USE mpi_f08 module is preferred over using the include file shown
above.
100. The Format of MPI Calls
C names are case sensitive; Fortran names are not.
Programs must not declare variables or functions with names beginning with the prefix MPI_
or PMPI_ (profiling interface).
C Binding
Format: rc = MPI_Xxxxx(parameter, …)
Example: rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
Error code: returned as "rc"; MPI_SUCCESS if successful.
Fortran Binding
Format: CALL MPI_XXXXX(parameter, …, ierr)
        call mpi_xxxxx(parameter, …, ierr)
Example: call MPI_BSEND(buf, count, type, dest, tag, comm, ierr)
Error code: returned as the "ierr" parameter; MPI_SUCCESS if successful.
101. Communicators and Groups
MPI uses objects called communicators and groups to define which collection of processes
may communicate with each other.
Most MPI routines require you to specify a communicator as an argument.
Communicators and groups will be covered in more detail later. For now, simply use
MPI_COMM_WORLD whenever a communicator is required - it is the predefined
communicator that includes all of your MPI processes.
102. Rank
Within a communicator, every process has its own unique, integer identifier assigned by the
system when the process initializes. A rank is sometimes also called a “task ID”. Ranks are
contiguous and begin at zero.
Used by the programmer to specify the source and destination of messages. Often used
conditionally by the application to control program execution (if rank = 0 do this / if rank = 1
do that).
103. Error Handling
Most MPI routines include a return/error code parameter, as described in “Format of MPI
Calls” section above.
However, according to the MPI standard, the default behavior of an MPI call is to abort if there
is an error. This means you will probably not be able to capture a return/error code other than
MPI_SUCCESS (zero).
The standard does provide a means to override this default error handler. You can also
consult the error handling section of the MPI Standard located at
http://www.mpi-forum.org/docs/mpi-11-html/node148.html .
The types of errors displayed to the user are implementation dependent.
104. Environment Management Routines
MPI_Init
Initializes the MPI execution environment. This function must be called in every MPI
program, must be called before any other MPI functions and must be called only once in
an MPI program. For C programs, MPI_Init may be used to pass the command line
arguments to all processes, although this is not required by the standard and is
implementation dependent.
C: MPI_Init(&argc, &argv)
Fortran: MPI_INIT(ierr)
Input parameters
• argc : pointer to the number of arguments
• argv : pointer to the argument vector
ierr : the error return argument
105. Environment Management Routines
MPI_Comm_size
Returns the total number of MPI processes in the specified communicator, such as
MPI_COMM_WORLD. If the communicator is MPI_COMM_WORLD, then it represents the
number of MPI tasks available to your application.
C: MPI_Comm_size(comm, &size)
Fortran: MPI_COMM_SIZE(comm, size, ierr)
Input parameters
• comm : communicator (handle)
Output parameters
• size : number of processes in the group of comm (integer)
ierr : the error return argument
106. Environment Management Routines
MPI_Comm_rank
Returns the rank of the calling MPI process within the specified communicator. Initially,
each process will be assigned a unique integer rank between 0 and (number of tasks - 1)
within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID.
If a process becomes associated with other communicators, it will have a unique rank
within each of these as well.
C: MPI_Comm_rank(comm, &rank)
Fortran: MPI_COMM_RANK(comm, rank, ierr)
Input parameters
• comm : communicator (handle)
Output parameters
• rank : rank of the calling process in the group of comm (integer)
ierr : the error return argument
107. Environment Management Routines
MPI_Finalize
Terminates the MPI execution environment. This function should be the last MPI routine
called in every MPI program – no other MPI routines may be called after it.
C: MPI_Finalize()
Fortran: MPI_FINALIZE(ierr)
ierr : the error return argument
108. Environment Management Routines
MPI_Abort
Terminates all MPI processes associated with the communicator. In most MPI
implementations it terminates ALL processes regardless of the communicator specified.
C: MPI_Abort(comm, errorcode)
Fortran: MPI_ABORT(comm, errorcode, ierr)
Input parameters
• comm : communicator (handle)
• errorcode : error code to return to the invoking environment
ierr : the error return argument
109. Environment Management Routines
MPI_Get_processor_name
Returns the processor name. Also returns the length of the name. The buffer for "name"
must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into
"name" is implementation dependent – may not be the same as the output of the
"hostname" or "host" shell commands.
C: MPI_Get_processor_name(&name, &resultlength)
Fortran: MPI_GET_PROCESSOR_NAME(name, resultlength, ierr)
Output parameters
• name : a unique specifier for the actual (as opposed to virtual) node. This must be
an array of size at least MPI_MAX_PROCESSOR_NAME.
• resultlength : length (in characters) of the name
ierr : the error return argument
110. Environment Management Routines
MPI_Get_version
Returns the version (either 1 or 2) and subversion of MPI.
C: MPI_Get_version(&version, &subversion)
Fortran: MPI_GET_VERSION(version, subversion, ierr)
Output parameters
• version : major version of MPI (1 or 2)
• subversion : minor version of MPI
ierr : the error return argument
111. Environment Management Routines
MPI_Initialized
Indicates whether MPI_Init has been called – returns flag as either logical true (1) or
false (0).
C: MPI_Initialized(&flag)
Fortran: MPI_INITIALIZED(flag, ierr)
Output parameters
• flag : true if MPI_Init has been called and false otherwise
ierr : the error return argument
112. Environment Management Routines
MPI_Wtime
Returns an elapsed wall clock time in seconds (double precision) on the calling processor.
C: MPI_Wtime()
Fortran: MPI_WTIME()
Return value
• Time in seconds since an arbitrary time in the past.
MPI_Wtick
Returns the resolution in seconds (double precision) of MPI_Wtime.
C: MPI_Wtick()
Fortran: MPI_WTICK()
Return value
• Resolution of MPI_Wtime, in seconds.
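A minimal timing sketch built on the two routines above; the loop being timed is a placeholder, not part of the course material:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();                 /* start timestamp */
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)    /* placeholder work */
        s += 1.0 / (double)i;
    double t1 = MPI_Wtime();                 /* end timestamp */

    printf("elapsed = %f s (timer resolution %g s), s = %f\n",
           t1 - t0, MPI_Wtick(), s);

    MPI_Finalize();
    return 0;
}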
114. Example: Hello World
Compile and execute an MPI program:
$ module load [compiler] [mpi]
$ mpicc hello.c
$ mpirun -np 4 -hostfile [hostfile] ./a.out
Create a hostfile:
ibs0001 slots=2
ibs0002 slots=2
ibs0003 slots=2
ibs0003 slots=2
…
115. Example : Environment Management Routines
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numtasks, rank, len, rc;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(hostname, &len);
    printf("Number of tasks= %d My rank= %d Running on %s\n", numtasks, rank, hostname);

    /******* do some work *******/

    rc = MPI_Finalize();
    return 0;
}
116. Types of Point-to-Point Operations
MPI point-to-point operations typically involve message passing between two, and only two,
different MPI tasks. One task is performing a send operation and the other task is performing
a matching receive operation.
There are different types of send and receive routines used for different purposes.
Synchronous send
Blocking send/blocking receive
Non-blocking send/non-blocking receive
Buffered send
Combined send/receive
“Ready” send
Any type of send routine can be paired with any type of receive routine.
MPI also provides several routines associated with send – receive operations, such as those used to wait for
a message's arrival or probe to find out if a message has arrived.
117. Buffering
In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.
Consider the following two cases:
A send operation occurs 5 seconds before the receive is ready – where is the message while the receive is pending?
Multiple sends arrive at the same receiving task which can only accept one send at a time – what happens to the messages that are "backing up"?
118. Buffering
The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer area is reserved to hold data in transit.
119. Buffering
System buffer space is:
Opaque to the programmer and managed entirely by the MPI library
A finite resource that can be easy to exhaust
Often mysterious and not well documented
Able to exist on the sending side, the receiving side, or both
Something that may improve program performance because it allows send – receive operations to be asynchronous.
120. Blocking vs. Non-blocking
Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode.
Blocking
A blocking send routine will only "return" after it is safe to modify the application buffer (your send data) for reuse. Safe means that modifications will not affect the data intended for the receive task. Safe does not imply that the data was actually received – it may very well be sitting in a system buffer.
A blocking send can be synchronous, which means there is handshaking occurring with the receive task to confirm a safe send.
A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receive.
A blocking receive only "returns" after the data has arrived and is ready for use by the program.
Non-blocking
Non-blocking send and receive routines behave similarly – they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of the message.
Non-blocking operations simply "request" that the MPI library perform the operation when it is able. The user cannot predict when that will happen.
It is unsafe to modify the application buffer (your variable space) until you know for a fact that the requested non-blocking operation was actually performed by the library. There are "wait" routines used to do this.
Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
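A hedged sketch of the non-blocking pattern described above: post the operations, do unrelated work, then wait before touching the buffers. It assumes a two-rank exchange with placeholder payloads:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, peer, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                     /* assumes exactly 2 ranks */
    sendval = rank * 100;                /* placeholder payload */

    /* Post the operations; both return almost immediately */
    MPI_Irecv(&recvval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendval/recvval could go here ... */

    /* Only after the wait is it safe to reuse sendval and read recvval */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}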
121. MPI Message Passing Routine Arguments
MPI point-to-point communication routines generally have an argument list that takes one of the following formats:
Blocking send
MPI_Send(buffer, count, type, dest, tag, comm)
Non-blocking send
MPI_Isend(buffer, count, type, dest, tag, comm, request)
Blocking receive
MPI_Recv(buffer, count, type, source, tag, comm, status)
Non-blocking receive
MPI_Irecv(buffer, count, type, source, tag, comm, request)
Buffer
Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an ampersand: &var1
Data count
Indicates the number of data elements of a particular type to be sent.
122. MPI Message Passing Routine Arguments
Data type
For reasons of portability, MPI predefines its elementary data types. The table below lists
those required by the standard.
C Data Types
MPI_CHAR           | signed char
MPI_SHORT          | signed short int
MPI_INT            | signed int
MPI_LONG           | signed long int
MPI_SIGNED_CHAR    | signed char
MPI_UNSIGNED_CHAR  | unsigned char
MPI_UNSIGNED_SHORT | unsigned short int
MPI_UNSIGNED       | unsigned int
MPI_UNSIGNED_LONG  | unsigned long int
MPI_FLOAT          | float
MPI_DOUBLE         | double
MPI_LONG_DOUBLE    | long double
123. MPI Message Passing Routine Arguments
Destination
An argument to send routines that indicates the process where a message should be delivered. Specified as the rank of the receiving process.
Tag
Arbitrary non-negative integer assigned by the programmer to uniquely identify a message. Send and receive operations should match message tags. For a receive operation, the wild card MPI_ANY_TAG can be used to receive any message regardless of its tag. The MPI standard guarantees that integers 0 – 32767 can be used as tags, but most implementations allow a much larger range than this.
Communicator
Indicates the communication context, or set of processes for which the source or destination fields are valid. Unless the programmer is explicitly creating new communicators, the predefined communicator MPI_COMM_WORLD is usually used.
124. MPI Message Passing Routine Arguments
Status
For a receive operation, indicates the source of the message and the tag of the message.
In C, this argument is a pointer to the predefined structure MPI_Status (ex. stat.MPI_SOURCE, stat.MPI_TAG).
In Fortran, it is an integer array of size MPI_STATUS_SIZE (ex. stat(MPI_SOURCE), stat(MPI_TAG)).
Additionally, the actual number of received elements is obtainable from Status via the MPI_Get_count routine.
Request
Used by non-blocking send and receive operations.
Since non-blocking operations may return before the requested system buffer space is obtained, the system issues a unique "request number".
The programmer uses this system-assigned "handle" later (in a WAIT type routine) to determine completion of the non-blocking operation.
In C, this argument is a pointer to the predefined structure MPI_Request.
In Fortran, it is an integer.
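A minimal blocking send/receive sketch using the argument list above; the rank numbers, tag, and payload are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value, count;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* placeholder payload */
        MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &stat);
        MPI_Get_count(&stat, MPI_INT, &count);        /* elements actually received */
        printf("got %d (%d element) from rank %d, tag %d\n",
               value, count, stat.MPI_SOURCE, stat.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}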
130. Advanced Example : Monte-Carlo Simulation
<Problem>
Monte Carlo simulation using random numbers
PI = 4 × Ac/As, where Ac is the area inside the circle arc of radius r and As is the area of the enclosing square, estimated from the fraction of random points that fall inside the arc
<Requirement>
Use N processes (ranks)
Use point-to-point communication
[Figure: a circle arc of radius r inscribed in a square]
131. Advanced Example : Monte-Carlo Simulation for PI
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step = 100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i = 0; i < num_step; i++) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        r = sqrt(x*x + y*y);
        if (r <= 1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
132. Advanced Example : Numerical integration for PI
<Problem>
Compute PI using numerical integration (midpoint rule):
∫₀¹ 4/(1+x²) dx = π ≈ (1/n) Σ_{i=1..n} 4 / (1 + ((i-0.5)/n)²)
[Figure: the integrand evaluated at the midpoints x₁, x₂, …, xₙ of n subintervals of width 1/n]
<Requirement>
Use point-to-point communication
133. Advanced Example : Numerical integration for PI
#include <stdio.h>
#include <math.h>

int main() {
    const long num_step = 100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0 / (double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i = 0; i < num_step; i++) {
        x = ((double)i - 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
    printf("PI = %.5lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
134. Types of Collective Operations
Synchronization
Processes wait until all members of the group have reached the synchronization point.
Data Movement
Broadcast, scatter/gather, all to all.
Collective Computation (reductions)
One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
135. Programming Considerations and Restrictions
With MPI-3, collective operations can be blocking or non-blocking. Only blocking operations are covered in this tutorial.
Collective communication routines do not take message tag arguments.
Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators.
Can only be used with MPI predefined datatypes – not with MPI Derived Data Types.
MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
136. Collective Communication Routines
MPI_Barrier
Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
C: MPI_Barrier(comm)
Fortran: MPI_BARRIER(comm, ierr)
137. Collective Communication Routines
MPI_Bcast
Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
C: MPI_Bcast(&buffer, count, datatype, root, comm)
Fortran: MPI_BCAST(buffer, count, datatype, root, comm, ierr)
138. Collective Communication Routines
MPI_Scatter
Data movement operation. Distributes distinct messages from a single source task to each task in the group.
C: MPI_Scatter(&sendbuf, sendcnt, sendtype, &recvbuf, recvcnt, recvtype, root, comm)
Fortran: MPI_SCATTER(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm, ierr)
139. Collective Communication Routines
MPI_Gather
Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
C: MPI_Gather(&sendbuf, sendcnt, sendtype, &recvbuf, recvcount, recvtype, root, comm)
Fortran: MPI_GATHER(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
140. Collective Communication Routines
MPI_Allgather
Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
C: MPI_Allgather(&sendbuf, sendcount, sendtype, &recvbuf, recvcount, recvtype, comm)
Fortran: MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
141. Collective Communication Routines
MPI_Reduce
Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
C: MPI_Reduce(&sendbuf, &recvbuf, count, datatype, op, root, comm)
Fortran: MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
142. Collective Communication Routines
The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the MPI_Op_create routine.
MPI Reduction Operation | Meaning                | C Data Types
MPI_MAX                 | maximum                | integer, float
MPI_MIN                 | minimum                | integer, float
MPI_SUM                 | sum                    | integer, float
MPI_PROD                | product                | integer, float
MPI_LAND                | logical AND            | integer
MPI_BAND                | bit-wise AND           | integer, MPI_BYTE
MPI_LOR                 | logical OR             | integer
MPI_BOR                 | bit-wise OR            | integer, MPI_BYTE
MPI_LXOR                | logical XOR            | integer
MPI_BXOR                | bit-wise XOR           | integer, MPI_BYTE
MPI_MAXLOC              | max value and location | float, double and long double
MPI_MINLOC              | min value and location | float, double and long double
143. Collective Communication Routines
MPI_Allreduce
Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
C: MPI_Allreduce(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm, ierr)
144. Collective Communication Routines
MPI_Reduce_scatter
Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
C: MPI_Reduce_scatter(&sendbuf, &recvbuf, recvcount, datatype, op, comm)
Fortran: MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcount, datatype, op, comm, ierr)
145. Collective Communication Routines
MPI_Alltoall
Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
C: MPI_Alltoall(&sendbuf, sendcount, sendtype, &recvbuf, recvcnt, recvtype, comm)
Fortran: MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm, ierr)
146. Collective Communication Routines
MPI_Scan
Performs a scan operation with respect to a reduction operation across a task group.
C: MPI_Scan(&sendbuf, &recvbuf, count, datatype, op, comm)
Fortran: MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm, ierr)
147. Collective Communication Routines
[Figure: data movement patterns for the collective operations, with * denoting some operator.
broadcast: P0's A is copied to P0-P3;
scatter: P0's A,B,C,D are distributed one element per process; gather is the reverse;
reduce: A*B*C*D is collected on one process; allreduce: A*B*C*D is placed on every process;
allgather: every process ends up with A,B,C,D;
scan: P0 gets A, P1 gets A*B, P2 gets A*B*C, P3 gets A*B*C*D;
alltoall: process Pi sends its j-th element to Pj, so Pj gathers A_j, B_j, C_j, D_j;
reduce_scatter: the element-wise reductions A_i*B_i*C_i*D_i are scattered, one element per process]
148. Example : Collective Communication (1/2)
Perform a scatter operation on the rows of an array
#include "mpi.h"
#include <stdio.h>
#define SIZE 4

int main(int argc, char *argv[]) {
    int numtasks, rank, sendcount, recvcount, source;
    float sendbuf[SIZE][SIZE] = {
        {1.0, 2.0, 3.0, 4.0},
        {5.0, 6.0, 7.0, 8.0},
        {9.0, 10.0, 11.0, 12.0},
        {13.0, 14.0, 15.0, 16.0} };
    float recvbuf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
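    /* Hedged completion of this example: the second half (original slide 149)
     * is missing from the extracted text. This is the standard continuation
     * for scattering one row of sendbuf to each of SIZE tasks; the choice of
     * root rank is illustrative. */
    if (numtasks == SIZE) {
        source = 1;
        sendcount = SIZE;
        recvcount = SIZE;
        MPI_Scatter(sendbuf, sendcount, MPI_FLOAT,
                    recvbuf, recvcount, MPI_FLOAT, source, MPI_COMM_WORLD);
        printf("rank= %d  Results: %f %f %f %f\n",
               rank, recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);
    } else {
        printf("Must specify %d processors. Terminating.\n", SIZE);
    }
    MPI_Finalize();
    return 0;
}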
150. Advanced Example : Monte-Carlo Simulation for PI
Use the collective communication routines!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main() {
    const long num_step = 100000000;
    long i, cnt;
    double pi, x, y, r;

    printf("-----------------------------------------------------------\n");
    pi = 0.0;
    cnt = 0;
    r = 0.0;
    for (i = 0; i < num_step; i++) {
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        r = sqrt(x*x + y*y);
        if (r <= 1) cnt += 1;
    }
    pi = 4.0 * (double)(cnt) / (double)(num_step);
    printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
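One possible parallelization sketch using MPI_Reduce, not the course's reference solution; the per-rank seeding and cyclic work distribution are assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    const long num_step = 100000000;
    long i, cnt = 0, total = 0;
    double x, y, pi;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand(rank + 1);                          /* a different random stream per rank */
    for (i = rank; i < num_step; i += size) { /* cyclic distribution of the samples */
        x = rand() / (RAND_MAX + 1.0);
        y = rand() / (RAND_MAX + 1.0);
        if (x*x + y*y <= 1.0) cnt++;
    }

    /* Collective computation: sum the per-rank hit counts on rank 0 */
    MPI_Reduce(&cnt, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total / (double)num_step;
        printf("PI = %17.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}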
151. Advanced Example : Numerical integration for PI
Use the collective communication routines!
#include <stdio.h>
#include <math.h>

int main() {
    const long num_step = 100000000;
    long i;
    double sum, step, pi, x;

    step = (1.0 / (double)num_step);
    sum = 0.0;
    printf("-----------------------------------------------------------\n");
    for (i = 0; i < num_step; i++) {
        x = ((double)i - 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }
    pi = step * sum;
    printf("PI = %.5lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    printf("-----------------------------------------------------------\n");
    return 0;
}
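Likewise, one possible MPI_Reduce-based sketch for this exercise, not the course's reference solution; the cyclic work distribution is an assumption:

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    const long num_step = 100000000;
    long i;
    double sum = 0.0, total = 0.0, step, pi, x;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    step = 1.0 / (double)num_step;
    for (i = rank; i < num_step; i += size) {  /* cyclic distribution of the steps */
        x = ((double)i - 0.5) * step;
        sum += 4.0 / (1.0 + x*x);
    }

    /* Combine the partial sums on rank 0 */
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = step * total;
        printf("PI = %.15lf (Error = %e)\n", pi, fabs(acos(-1.0) - pi));
    }
    MPI_Finalize();
    return 0;
}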